Tag Archives: Amazon EC2

Optimizing EC2 Workloads with Amazon CloudWatch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/optimizing-ec2-workloads-with-amazon-cloudwatch/

This post is written by David (Dudu) Twizer, Principal Solutions Architect, and Andy Ward, Senior AWS Solutions Architect – Microsoft Tech.

In December 2020, AWS announced the availability of gp3, the next-generation General Purpose SSD volumes for Amazon Elastic Block Store (Amazon EBS), which allow customers to provision performance independent of storage capacity and provide up to a 20% lower price-point per GB than existing volumes.

This new release provides an excellent opportunity to right-size your storage layer by leveraging AWS’ built-in monitoring capabilities. This is especially important with SQL workloads as there are many instance types and storage configurations you can select for your SQL Server on AWS.

Many customers ask for our advice on choosing the ‘best’ or the ‘right’ storage and instance configuration, but there is no one solution that fits all circumstances. This blog post covers the critical techniques to right-size your workloads. We focus on right-sizing a SQL Server as our example workload, but the techniques we will demonstrate apply equally to any Amazon EC2 instance running any operating system or workload.

We create and use an Amazon CloudWatch dashboard to highlight any limits and bottlenecks within our example instance. Using our dashboard, we can ensure that we are using the right instance type and size, and the right storage volume configuration. The dimensions we look into are EC2 Network throughput, Amazon EBS throughput and IOPS, and the relationship between instance size and Amazon EBS performance.

 

The Dashboard

It can be challenging to locate every relevant resource limit and configure appropriate monitoring. To simplify this task, we wrote a simple Python script that creates a CloudWatch Dashboard with the relevant metrics pre-selected.

The script takes an instance-id list as input, and it creates a dashboard with all of the relevant metrics. The script also creates horizontal annotations on each graph to indicate the maximums for the configured metric. For example, for an Amazon EBS IOPS metric, the annotation shows the Maximum IOPS. This helps us identify bottlenecks.
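
To illustrate what the script does behind the scenes, here is a minimal boto3 sketch that creates a single-widget dashboard with a horizontal annotation. The instance ID, dashboard name, Region, and the 750 Mbps limit are placeholder values used only for illustration; the actual script discovers the correct limits for each instance automatically.

import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Placeholder values -- the real script derives these from the instance type
INSTANCE_ID = "i-example1"
NETWORK_LIMIT_MBPS = 750  # assumed instance network maximum, used for the annotation

widget = {
    "type": "metric",
    "x": 0, "y": 0, "width": 12, "height": 6,
    "properties": {
        "title": "EC2 Network throughput (Mbps)",
        "region": "eu-west-1",
        "stat": "Sum",
        "period": 300,
        "metrics": [
            # The raw metric is in bytes per period; hide it and plot the Mbps expression
            ["AWS/EC2", "NetworkOut", "InstanceId", INSTANCE_ID, {"id": "m1", "visible": False}],
            [{"expression": "m1 * 8 / PERIOD(m1) / 1000000", "label": "NetworkOut (Mbps)", "id": "e1"}],
        ],
        "annotations": {
            # Red horizontal line marking the instance maximum
            "horizontal": [{"label": "Instance maximum", "value": NETWORK_LIMIT_MBPS, "color": "#d62728"}]
        },
    },
}

cloudwatch.put_dashboard(
    DashboardName="ec2-rightsizing-example",
    DashboardBody=json.dumps({"widgets": [widget]}),
)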

Please take a moment now to run the script using either of the following methods described. Then, we run through the created dashboard and each widget, and guide you through the optimization steps that will allow you to increase performance and decrease cost for your workload.

 

Creating the Dashboard with CloudShell

First, we log in to the AWS Management Console and load AWS CloudShell.

Once we have logged in to CloudShell, we must set up our environment using the following command:

# Download the script locally
wget -L https://raw.githubusercontent.com/aws-samples/amazon-ec2-mssql-workshop/master/resources/code/Monitoring/create-cw-dashboard.py

# Prerequisites (venv and boto3)
python3 -m venv env # Optional
source env/bin/activate  # Optional
pip3 install boto3 # Required

The preceding commands download the script and configure the CloudShell environment with the correct Python settings to run it. Run the following command to create the CloudWatch Dashboard.

# Execute
python3 create-cw-dashboard.py --InstanceList i-example1 i-example2 --region eu-west-1

At its most basic, you just need to specify the list of instances you are interested in (i-example1 and i-example2 in the preceding example), and the Region within which those instances are running (eu-west-1 in the preceding example). For detailed usage instructions see the README file here. A link to the CloudWatch Dashboard is provided in the output from the command.

 

Creating the Dashboard Directly from your Local Machine

If you’re familiar with running the AWS CLI locally, and have Python and the other pre-requisites installed, then you can run the same commands as in the preceding CloudShell example, but from your local environment. For detailed usage instructions see the README file here. If you run into any issues, we recommend running the script from CloudShell as described prior.

 

Examining Our Metrics

 

Once the script has run, navigate to the CloudWatch Dashboard that has been created. A direct link to the CloudWatch Dashboard is provided as an output of the script. Alternatively, you can navigate to CloudWatch within the AWS Management Console, and select the Dashboards menu item to access the newly created CloudWatch Dashboard.

The Network Layer

The first widget of the CloudWatch Dashboard is the EC2 Network throughput:

The automatic annotation creates a red line that indicates the maximum throughput your Instance can provide in Mbps (Megabits per second). This metric is important when running workloads with high network throughput requirements. For our SQL Server example, this has additional relevance when considering adding replica Instances for SQL Server, which place an additional burden on the Instance’s network.

 

In general, if your Instance is frequently reaching 80% of this maximum, you should consider choosing an Instance with greater network throughput. For our SQL example, we could consider changing our architecture to minimize network usage. For example, if we were using an “Always On Availability Group” spread across multiple Availability Zones and/or Regions, then we could consider using an “Always On Distributed Availability Group” to reduce the amount of replication traffic. Before making a change of this nature, take some time to consider any SQL licensing implications.
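
If you want to check the 80% guideline programmatically rather than visually, the following boto3 sketch pulls a few days of NetworkOut data and converts it to Mbps. The instance ID and the assumed 10,000 Mbps instance limit are placeholders; substitute the advertised bandwidth of your instance type.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

INSTANCE_ID = "i-example1"        # placeholder
INSTANCE_LIMIT_MBPS = 10_000      # assumed 10 Gbps instance; check your instance type

end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=end - timedelta(days=3),
    EndTime=end,
    Period=300,
    Statistics=["Sum"],           # bytes sent per 5-minute window
)

# Convert each 5-minute sample to an average rate in Mbps
samples_mbps = [dp["Sum"] * 8 / 300 / 1_000_000 for dp in resp["Datapoints"]]
peak = max(samples_mbps, default=0)
busy = sum(1 for s in samples_mbps if s > 0.8 * INSTANCE_LIMIT_MBPS)

print(f"Peak 5-minute average: {peak:.0f} Mbps")
print(f"Samples above 80% of the instance limit: {busy}")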

 

If your Instance generally doesn’t pass 10% network utilization, the metric is indicating that networking is not a bottleneck. For SQL, if you have low network utilization coupled with high Amazon EBS throughput utilization, you should consider optimizing the Instance’s storage usage by offloading some Amazon EBS usage onto networking – for example by implementing SQL as a Failover Cluster Instance with shared storage on Amazon FSx for Windows File Server, or by moving SQL backup storage on to Amazon FSx.

The Storage Layer

The second widget of the CloudWatch Dashboard is the overall EC2 to Amazon EBS throughput, which means the sum of all the attached EBS volumes’ throughput.

Each instance type and size has a different Amazon EBS throughput maximum, and the script automatically annotates the graph based on the specs of your instance. You will often see this metric heavily utilized when analyzing SQL workloads, which are usually considered to be storage-heavy.
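
The per-instance Amazon EBS maximums that these annotations are based on can be retrieved from the EC2 API. As a small sketch (the instance type below is just an example), the EbsOptimizedInfo structure contains the baseline and maximum throughput and IOPS for a given type:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Example instance type -- replace with the type of the instance you are analyzing
resp = ec2.describe_instance_types(InstanceTypes=["m5.2xlarge"])
ebs = resp["InstanceTypes"][0]["EbsInfo"]["EbsOptimizedInfo"]

print("Baseline throughput (MB/s):", ebs["BaselineThroughputInMBps"])
print("Maximum throughput (MB/s):", ebs["MaximumThroughputInMBps"])
print("Baseline IOPS:", ebs["BaselineIops"])
print("Maximum IOPS:", ebs["MaximumIops"])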

If you find data points that reach the maximum, such as in the preceding screenshot, this indicates that your workload has a bottleneck in the storage layer. Let’s see if we can find the EBS volume that is using all this throughput in our next series of widgets, which focus on individual EBS volumes.

And here is the culprit. From the widget, we can see the volume ID and type, and the performance maximum for this volume. Each graph represents one of the two dimensions of the EBS volume: throughput and IOPS. The automatic annotation gives you visibility into the limits of the specific volume in use. In this case, we are using a gp3 volume, configured with a 750-MBps throughput maximum and 3000 IOPS.

Looking at the widget, we can see that the throughput reaches certain peaks, but they are less than the configured maximum. Considering the preceding screenshot, which shows that the overall instance Amazon EBS throughput is reaching maximum, we can conclude that the gp3 volume here is unable to reach its maximum performance. This is because the Instance we are using does not have sufficient overall throughput.

Let’s change the Instance size so that we can see if that fixes our issue. When changing Instance or volume types and sizes, remember to re-run the dashboard creation script to update the thresholds. We recommend using the same script parameters, as re-running the script with the same parameters overwrites the initial dashboard and updates the threshold annotations – the metrics data will be preserved.  Running the script with a different dashboard name parameter creates a new dashboard and leaves the original dashboard in place. However, the thresholds in the original dashboard won’t be updated, which can lead to confusion.

Here is the widget for our EBS volume after we increased the size of the Instance:

We can see that the EBS volume is now able to reach its configured maximums without issue. Let’s look at the overall Amazon EBS throughput for our larger Instance as well:

We can see that the Instance now has sufficient Amazon EBS throughput to support our gp3 volume’s configured performance, and we have some headroom.

Now, let’s swap our Instance back to its original size, and swap our gp3 volume for a Provisioned IOPS io2 volume with 45,000 IOPS, and re-run our script to update the dashboard. Running an IOPS intensive task on the volume results in the following:

As you can see, despite having 45,000 IOPS configured, it seems to be capping at about 15,000 IOPS. Looking at the instance level statistics, we can see the answer:

Much like with our throughput testing earlier, we can see that our io2 volume performance is being restricted by the Instance size. Let’s increase the size of our Instance again, and see how the volume performs when the Instance has been correctly sized to support it:

We are now reaching the configured limits of our io2 volume, which is exactly what we wanted and expected to see. The instance level IOPS limit is no longer restricting the performance of the io2 volume:

Using the preceding steps, we can identify where storage bottlenecks are, and we can identify if we are using the right type of EBS volume for the workload. In our examples, we sought bottlenecks and scaled upwards to resolve them. This process should be used to identify where resources have been over-provisioned and under-provisioned.

If we see a volume that never reaches the maximums that it has been configured for, and that is not subject to any other bottlenecks, we usually conclude that the volume in question can be right-sized to a more appropriate volume that costs less, and better fits the workload.

We can, for example, change an Amazon EBS gp2 volume to an EBS gp3 volume with the correct IOPS and throughput. EBS gp3 provides up to 1,000 MBps of throughput per volume and costs $0.08/GB-month (versus $0.10/GB-month for gp2). Additionally, unlike gp2, gp3 volumes allow you to provision IOPS and throughput independently of volume size. By using the process described above, we could identify that a gp2, io1, or io2 volume could be swapped out for a more cost-effective gp3 volume.
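
Converting an existing gp2 volume to gp3 is an online operation that does not require detaching the volume. Here is a minimal boto3 sketch; the volume ID, IOPS, and throughput values are placeholders that you would set based on the utilization observed in the dashboard.

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder volume ID

# Change the volume type online; IOPS and throughput are provisioned independently of size
ec2.modify_volume(
    VolumeId=VOLUME_ID,
    VolumeType="gp3",
    Iops=6000,         # placeholder, based on observed IOPS usage
    Throughput=500,    # MB/s, placeholder, based on observed throughput usage
)

# Optionally track the modification until it completes
state = ec2.describe_volumes_modifications(VolumeIds=[VOLUME_ID])
print(state["VolumesModifications"][0]["ModificationState"])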

If during our analysis we observe an SSD-based volume with relatively high throughput usage, but low IOPS usage, we should investigate further. A lower-cost HDD-based volume, such as an st1 or sc1 volume, might be more cost-effective while maintaining the required level of performance. Amazon EBS st1 volumes provide up to 500 MBps throughput and cost $0.045 per GB-month, and are often an ideal volume-type to use for SQL backups, for example.

Additional storage optimization you can implement

Move the TempDB to Instance Store NVMe storage – The data on an SSD instance store volume persists only for the life of its associated instance. This is a good fit for TempDB, because SQL Server re-creates TempDB every time the service starts, so nothing needs to persist across an instance stop and start. Placing TempDB on the local instance store frees up Amazon EBS throughput for other volumes while providing better performance, as the storage is physically attached to the instance.

Consider Amazon FSx for Windows File Server as a shared storage solution – As described here, Amazon FSx can be used to store a SQL database on a shared location, enabling the use of a SQL Server Failover Cluster Instance.

 

The Compute Layer

After you finish optimizing your storage layer, wait a few days and re-examine the metrics for both Amazon EBS and networking. Use these metrics in conjunction with CPU metrics and Memory metrics to select the right Instance type to meet your workload requirements.

AWS offers nearly 400 instance types in different sizes. From a SQL perspective, it’s essential to choose instances with high single-thread performance, such as the z1d instance, due to SQL’s license-per-core model. z1d instances also provide instance store storage for the TempDB.

You might also want to check out the AWS Compute Optimizer, which helps you by automatically recommending instance types by using machine learning to analyze historical utilization metrics. More details can be found here.
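
Compute Optimizer can also be queried programmatically once it has been opted in for your account. The following sketch uses a placeholder instance ARN; the finding indicates whether the instance is over-provisioned, under-provisioned, or optimized, along with candidate instance types.

import boto3

co = boto3.client("compute-optimizer", region_name="eu-west-1")

# Placeholder instance ARN -- replace with your account ID and instance ID
arn = "arn:aws:ec2:eu-west-1:111122223333:instance/i-example1"

resp = co.get_ec2_instance_recommendations(instanceArns=[arn])
for rec in resp["instanceRecommendations"]:
    print("Finding:", rec["finding"])
    for option in rec["recommendationOptions"]:
        print("  Candidate instance type:", option["instanceType"])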

We strongly advise you to thoroughly test your applications after making any configuration changes.

 

Conclusion

This blog post covers some simple and useful techniques to gain visibility into important instance metrics, and provides a script that greatly simplifies the process. Any workload running on EC2 can benefit from these techniques. We have found them especially effective at identifying actionable optimizations for SQL Servers, where small changes can have beneficial cost, licensing and performance implications.

 

 

Easily Manage Security Group Rules with the New Security Group Rule ID

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/easily-manage-security-group-rules-with-the-new-security-group-rule-id/

At AWS, we tirelessly innovate to allow you to focus on your business, not its underlying IT infrastructure. Sometimes we launch a new service or a major capability. Sometimes we focus on details that make your professional life easier.

Today, I’m happy to announce one of these small details that makes a difference: VPC security group rule IDs.

A security group acts as a virtual firewall for your cloud resources, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance or an Amazon Relational Database Service (RDS) database. It controls ingress and egress network traffic. Security groups are made up of security group rules, a combination of protocol, source or destination IP address and port number, and an optional description.

When you use the AWS Command Line Interface (CLI) or API to modify a security group rule, you must specify all these elements to identify the rule. This produces long CLI commands that are cumbersome to type or read, and error-prone. For example:

aws ec2 revoke-security-group-egress \
         --group-id sg-0xxx6          \
         --ip-permissions IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges='[{CidrIp=192.168.0.0/0},{CidrIp=84.156.0.0/0}]'

What’s New?
A security group rule ID is a unique identifier for a security group rule. When you add a rule to a security group, these identifiers are created and added to security group rules automatically. Security group rule IDs are unique within an AWS Region. Here is the Edit inbound rules page of the Amazon VPC console:

Security Group Rules Ids

As mentioned already, when you create a rule, the identifier is added automatically. For example, when I’m using the CLI:

aws ec2 authorize-security-group-egress                                  \
        --group-id sg-0xxx6                                              \
        --ip-permissions IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges='[{CidrIp=1.2.3.4/32}]' \
        --tag-specifications 'ResourceType=security-group-rule,Tags=[{Key=usage,Value=bastion}]'

The updated AuthorizeSecurityGroupEgress API action now returns details about the security group rule, including the security group rule ID:

"SecurityGroupRules": [
    {
        "SecurityGroupRuleId": "sgr-abcdefghi01234561",
        "GroupId": "sg-0xxx6",
        "GroupOwnerId": "6800000000003",
        "IsEgress": false,
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "CidrIpv4": "1.2.3.4/32",
        "Tags": [
            {
                "Key": "usage",
                "Value": "bastion"
            }
        ]
    }
]

We’re also adding two API actions: DescribeSecurityGroupRules and ModifySecurityGroupRules to the VPC APIs. You can use these to list or modify security group rules respectively.

What are the benefits?
The first benefit of a security group rule ID is simplifying your CLI commands. For example, the RevokeSecurityGroupEgress command used earlier can now be expressed as:

aws ec2 revoke-security-group-egress \
         --group-id sg-0xxx6         \
         --security-group-rule-ids "sgr-abcdefghi01234561"

Shorter and easier, isn’t it?

The second benefit is that security group rules can now be tagged, just like many other AWS resources. You can use tags to quickly list or identify a set of security group rules, across multiple security groups.

In the previous example, I used the tag-on-create technique to add tags with --tag-specifications at the time I created the security group rule. I can also add tags at a later stage, on an existing security group rule, using its ID:

aws ec2 create-tags                         \
        --resources sgr-abcdefghi01234561   \
        --tags "Key=usage,Value=bastion"

Let’s say my company authorizes access to a set of EC2 instances, but only when the network connection is initiated from an on-premises bastion host. The security group rule would be IpProtocol=tcp, FromPort=22, ToPort=22, IpRanges='[{CidrIp=1.2.3.4/32}]' where 1.2.3.4 is the IP address of the on-premises bastion host. This rule can be replicated in many security groups.

What if the on-premises bastion host IP address changes? I need to change the IpRanges parameter in all the affected rules. By tagging the security group rules with usage : bastion, I can now use the DescribeSecurityGroupRules API action to list the security group rules used in my AWS account’s security groups, and then filter the results on the usage : bastion tag. By doing so, I can quickly identify the security group rules I need to update.

aws ec2 describe-security-group-rules \
        --max-results 100 \
        --filters "Name=tag:usage,Values=bastion"

This gives me the following output:

{
    "SecurityGroupRules": [
        {
            "SecurityGroupRuleId": "sgr-abcdefghi01234561",
            "GroupId": "sg-0xxx6",
            "GroupOwnerId": "40000000003",
            "IsEgress": false,
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "CidrIpv4": "1.2.3.4/32",
            "Tags": [
                {
                    "Key": "usage",
                    "Value": "bastion"
                }
            ]
        }
    ],
    "NextToken": "ey...J9"
}

As usual, you can manage results pagination by issuing the same API call again passing the value of NextToken with --next-token.
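
Putting these pieces together, here is a hedged boto3 sketch of the bastion scenario: page through every rule tagged usage = bastion and point it at a new on-premises address using ModifySecurityGroupRules. The new CIDR is an assumption used only for illustration.

import boto3

ec2 = boto3.client("ec2")
NEW_BASTION_CIDR = "5.6.7.8/32"   # assumed new bastion host address

kwargs = {"Filters": [{"Name": "tag:usage", "Values": ["bastion"]}], "MaxResults": 100}
while True:
    page = ec2.describe_security_group_rules(**kwargs)
    for rule in page["SecurityGroupRules"]:
        # A rule is updated in place, identified by its security group rule ID
        ec2.modify_security_group_rules(
            GroupId=rule["GroupId"],
            SecurityGroupRules=[{
                "SecurityGroupRuleId": rule["SecurityGroupRuleId"],
                "SecurityGroupRule": {
                    "IpProtocol": rule["IpProtocol"],
                    "FromPort": rule["FromPort"],
                    "ToPort": rule["ToPort"],
                    "CidrIpv4": NEW_BASTION_CIDR,
                },
            }],
        )
        print("Updated", rule["SecurityGroupRuleId"], "in", rule["GroupId"])
    token = page.get("NextToken")
    if not token:
        break
    kwargs["NextToken"] = token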

Availability
Security group rule IDs are available for VPC security group rules, in all commercial AWS Regions, at no cost.

It might look like a small, incremental change, but this actually creates the foundation for future additional capabilities to manage security groups and security group rules. Stay tuned!

Overview of Data Transfer Costs for Common Architectures

Post Syndicated from Birender Pal original https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/

Data transfer charges are often overlooked while architecting a solution in AWS. Considering data transfer charges while making architectural decisions can help save costs. This blog post will help identify potential data transfer charges you may encounter while operating your workload on AWS. Service charges are out of scope for this blog, but should be carefully considered when designing any architecture.

Data transfer between AWS and internet

There is no charge for inbound data transfer across all services in all Regions. Data transfer from AWS to the internet is charged per service, with rates specific to the originating Region. Refer to the pricing pages for each service—for example, the pricing page for Amazon Elastic Compute Cloud (Amazon EC2)—for more details.

Data transfer within AWS

Data transfer within AWS could be from your workload to other AWS services, or it could be between different components of your workload.

Data transfer between your workload and other AWS services

When your workload accesses AWS services, you may incur data transfer charges.

Accessing services within the same AWS Region

If the internet gateway is used to access the public endpoint of the AWS services in the same Region (Figure 1 – Pattern 1), there are no data transfer charges. If a NAT gateway is used to access the same services (Figure 1 – Pattern 2), there is a data processing charge (per gigabyte (GB)) for data that passes through the gateway.

Accessing AWS services in same Region

Figure 1. Accessing AWS services in same Region

Accessing services across AWS Regions

If your workload accesses services in different Regions (Figure 2), there is a charge for data transfer across Regions. The charge depends on the source and destination Region (as described on the Amazon EC2 Data Transfer pricing page).

Accessing AWS services in different Region

Figure 2. Accessing AWS services in different Region

Data transfer within different components of your workload

Charges may apply if there is data transfer between different components of your workload. These charges vary depending on where the components are deployed.

Workload components in same AWS Region

Data transfer within the same Availability Zone is free. One way to achieve high availability for a workload is to deploy in multiple Availability Zones.

Consider a workload with two application servers running on Amazon EC2 and a database running on Amazon Relational Database Service (Amazon RDS) for MySQL (Figure 3). For high availability, each application server is deployed into a separate Availability Zone. Here, data transfer charges apply for cross-Availability Zone communication between the EC2 instances. Data transfer charges also apply between Amazon EC2 and Amazon RDS. Consult the Amazon RDS for MySQL pricing guide for more information.

Workload components across Availability Zones

Figure 3. Workload components across Availability Zones

To minimize impact of a database instance failure, enable a multi-Availability Zone configuration within Amazon RDS to deploy a standby instance in a different Availability Zone. Replication between the primary and standby instances does not incur additional data transfer charges. However, data transfer charges will apply from any consumers outside the current primary instance Availability Zone. Refer to the Amazon RDS pricing page for more detail.

A common pattern is to deploy workloads across multiple VPCs in your AWS network. Two approaches to enabling VPC-to-VPC communication are VPC peering connections and AWS Transit Gateway. Data transfer over a VPC peering connection that stays within an Availability Zone is free. Data transfer over a VPC peering connection that crosses Availability Zones will incur a data transfer charge for ingress/egress traffic (Figure 4).

VPC peering connection

Figure 4. VPC peering connection

Transit Gateway can interconnect hundreds or thousands of VPCs (Figure 5). Cost elements for Transit Gateway include an hourly charge for each attached VPC, AWS Direct Connect, or AWS Site-to-Site VPN. Data processing charges apply for each GB sent from a VPC, Direct Connect, or VPN to Transit Gateway.

VPC peering using Transit Gateway in same Region

Figure 5. VPC peering using Transit Gateway in same Region

Workload components in different AWS Regions

If workload components communicate across multiple Regions using VPC peering connections or Transit Gateway, additional data transfer charges apply. If the VPCs are peered across Regions, standard inter-Region data transfer charges will apply (Figure 6).

VPC peering across Regions

Figure 6. VPC peering across Regions

For peered Transit Gateways, you will incur data transfer charges on only one side of the peer. Data transfer charges do not apply for data sent from a peering attachment to a Transit Gateway. The data transfer for this cross-Region peering connection is in addition to the data transfer charges for the other attachments (Figure 7).

Transit Gateway peering across Regions

Figure 7. Transit Gateway peering across Regions

Data transfer between AWS and on-premises data centers

Data transfer will occur when your workload needs to access resources in your on-premises data center. There are two common options to help achieve this connectivity: Site-to-Site VPN and Direct Connect.

Data transfer over AWS Site-to-Site VPN

One option to connect workloads to an on-premises network is to use one or more Site-to-Site VPN connections (Figure 8 – Pattern 1). These charges include an hourly charge for the connection and a charge for data transferred from AWS. Refer to Site-to-Site VPN pricing for more details. Another option to connect multiple VPCs to an on-premises network is to use a Site-to-Site VPN connection to a Transit Gateway (Figure 8 – Pattern 2). The Site-to-Site VPN will be considered another attachment on the Transit Gateway. Standard Transit Gateway pricing applies.

Site-to-Site VPN patterns

Figure 8. Site-to-Site VPN patterns

Data transfer over AWS Direct Connect

Direct Connect can be used to connect workloads in AWS to on-premises networks. Direct Connect incurs a fee for each hour the connection port is used and data transfer charges for data flowing out of AWS. Data transfer into AWS is $0.00 per GB in all locations. The data transfer charges depend on the source Region and the Direct Connect provider location. Direct Connect can also connect to the Transit Gateway if multiple VPCs need to be connected (Figure 9). Direct Connect is considered another attachment on the Transit Gateway and standard Transit Gateway pricing applies. Refer to the Direct Connect pricing page for more details.


Figure 9. Direct Connect patterns

A Direct Connect gateway can be used to share a Direct Connect across multiple Regions. When using a Direct Connect gateway, there will be outbound data charges based on the source Region and Direct Connect location (Figure 10).

Direct Connect gateway

Figure 10. Direct Connect gateway

General tips

Data transfer charges apply based on the source, destination, and amount of traffic. Here are some general tips for when you start planning your architecture:

  • Avoid routing traffic over the internet when connecting to AWS services from within AWS by using VPC endpoints:
    • VPC gateway endpoints allow communication to Amazon S3 and Amazon DynamoDB without incurring data transfer charges (see the sketch after this list).
    • VPC interface endpoints are available for some AWS services. This type of endpoint incurs hourly service charges and data transfer charges.
  • Use Direct Connect instead of the internet for sending data to on-premises networks.
  • Traffic that crosses an Availability Zone boundary typically incurs a data transfer charge. Use resources from the local Availability Zone whenever possible.
  • Traffic that crosses a Regional boundary will typically incur a data transfer charge. Avoid cross-Region data transfer unless your business case requires it.
  • Use the AWS Free Tier. Under certain circumstances, you may be able to test your workload free of charge.
  • Use the AWS Pricing Calculator to help estimate the data transfer costs for your solution.
  • Use a dashboard to better visualize data transfer charges – this workshop will show how.
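
As referenced in the VPC endpoint tip above, a gateway endpoint for Amazon S3 keeps that traffic off a NAT gateway and avoids its per-GB data processing charge. The following boto3 sketch is illustrative only; the VPC ID, route table ID, and Region in the service name are placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Placeholder VPC and route table IDs
resp = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print("Created endpoint:", resp["VpcEndpoint"]["VpcEndpointId"])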

Conclusion

AWS provides the ability to deploy across multiple Availability Zones and Regions. With a few clicks, you can create a distributed workload. As you increase your footprint across AWS, it helps to understand various data transfer charges that may apply. This blog post provided information to help you make an informed decision and explore different architectural patterns to save on data transfer costs.

Prime Day 2021 – Two Chart-Topping Days

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/prime-day-2021-two-chart-topping-days/

In what has now become an annual tradition (check out my 2016, 2017, 2019, and 2020 posts for a look back), I am happy to share some of the metrics from this year’s Prime Day and to tell you how AWS helped to make it happen.

This year I bought all sorts of useful goodies including a Toshiba 43 Inch Smart TV that I plan to use as a MagicMirror, some watering cans, and a Dremel Rotary Tool Kit for my workshop.

Powered by AWS
As in years past, AWS played a critical role in making Prime Day a success. A multitude of two-pizza teams worked together to make sure that every part of our infrastructure was scaled, tested, and ready to serve our customers. Here are a few examples:

Amazon EC2 – Our internal measure of compute power is an NI, or a normalized instance. We use this unit to allow us to make meaningful comparisons across different types and sizes of EC2 instances. For Prime Day 2021, we increased our number of NIs by 12.5%. Interestingly enough, due to increased efficiency (more on that in a moment), we actually used about 6,000 fewer physical servers than we did in Cyber Monday 2020.

Graviton2 Instances – Graviton2-powered EC2 instances supported 12 core retail services. This was our first peak event that was supported at scale by the AWS Graviton2 instances, and is a strong indicator that the Arm architecture is well-suited to the data center.

An internal service called Datapath is a key part of the Amazon site. It is highly optimized for our peculiar needs, and supports lookups, queries, and joins across structured blobs of data. After an in-depth evaluation and consideration of all of the alternatives, the team decided to port Datapath to Graviton and to run it on a three-Region cluster composed of over 53,200 C6g instances.

At this scale, the price-performance advantage of the Graviton2 of up to 40% versus the comparable fifth-generation x86-based instances, along with the 20% lower cost, turns into a big win for us and for our customers. As a bonus, the power efficiency of the Graviton2 helps us to achieve our goals for addressing climate change. If you are thinking about moving your workloads to Graviton2, be sure to study our very detailed Getting Started with AWS Graviton2 document, and also consider entering the Graviton Challenge! You can also use Graviton2 database instances on Amazon RDS and Amazon Aurora; read about Key Considerations in Moving to Graviton2 for Amazon RDS and Amazon Aurora Databases to learn more.

Amazon CloudFront – Fast, efficient content delivery is essential when millions of customers are shopping and making purchases. Amazon CloudFront handled a peak load of over 290 million HTTP requests per minute, for a total of over 600 billion HTTP requests during Prime Day.

Amazon Simple Queue Service – The fulfillment process for every order depends on Amazon Simple Queue Service (SQS). This year, traffic set a new record, processing 47.7 million messages per second at the peak.

Amazon Elastic Block Store – In preparation for Prime Day, the team added 159 petabytes of EBS storage. The resulting fleet handled 11.1 trillion requests per day and transferred 614 petabytes per day.

Amazon Aurora – Amazon Fulfillment Technologies (AFT) powers physical fulfillment for purchases made on Amazon. On Prime Day, 3,715 instances of AFT’s PostgreSQL-compatible edition of Amazon Aurora processed 233 billion transactions, stored 1,595 terabytes of data, and transferred 615 terabytes of data.

Amazon DynamoDB – DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the Amazon.com sites, and all Amazon fulfillment centers. Over the course of the 66-hour Prime Day, these sources made trillions of API calls while maintaining high availability with single-digit millisecond performance, and peaking at 89.2 million requests per second.

Prepare to Scale
As I have detailed in my previous posts, rigorous preparation is key to the success of Prime Day and our other large-scale events. If you are preparing for a similar event of your own, I can heartily recommend AWS Infrastructure Event Management. As part of an IEM engagement, AWS experts will provide you with architectural and operational guidance that will help you to execute your event with confidence.

Jeff;

Introducing Native Support for Predictive Scaling with Amazon EC2 Auto Scaling

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/introducing-native-support-for-predictive-scaling-with-amazon-ec2-auto-scaling/

This post is written by Scott Horsfield, Principal Solutions Architect, EC2 Scalability and Ankur Sethi, Sr. Product Manager, EC2

Amazon EC2 Auto Scaling allows customers to realize the elasticity benefits of AWS by automatically launching and shutting down instances to match application demand. Today, we are excited to tell you about predictive scaling. It is a new EC2 Auto Scaling policy that predicts demand surges, and proactively increases capacity ahead of time, resulting in higher availability. With predictive scaling, you can avoid the need to overprovision capacity, resulting in lower Amazon EC2 costs. Predictive scaling has been available through AWS Auto Scaling plans since 2018 but you can now use it directly as an EC2 Auto Scaling group configuration alongside your other scaling policies. In this blog post, we give you an overview of predictive scaling and illustrate a scenario that this feature helps you with. We also walk you through the steps to configure a predictive scaling policy for an EC2 Auto Scaling group.

Product Overview

EC2 Auto Scaling offers a suite of dynamic scaling policies, including target tracking, simple scaling, and step scaling. Scaling policies are customer-defined guidelines for when to add or remove instances in an Auto Scaling group, based on the value of an Amazon CloudWatch metric that represents the application’s load. EC2 Auto Scaling constantly monitors the metric and reacts according to the customer-defined policies, triggering the launch of additional instances when needed.

Given the inherently reactive nature of dynamic scaling policies, you may find it useful to use predictive scaling in addition to dynamic scaling when:

  • Your application demand changes rapidly but with a recurring pattern. For example, weekly increases in capacity requirement as business resumes after weekends.
  • Your application instances require a long time to initialize.

Now, you can easily configure predictive scaling alongside your existing dynamic scaling policies to increase capacity in advance of a predicted demand increase. You no longer have to overprovision your Auto Scaling group or spend time manually configuring scheduled scaling for routine demand patterns. Predictive scaling uses machine learning to predict capacity requirements based on historical usage and continuously learns on new data to make forecasts more accurate.

A primer on EC2 Auto Scaling capacity parameters

When you launch an Auto Scaling group, you define the minimum, maximum, and desired capacity, expressed as number of EC2 instances. Minimum and maximum capacity are the customer-defined lower and upper boundaries of the Auto Scaling group. Desired capacity is the actual capacity of an Auto Scaling group and is constantly calibrated by EC2 Auto Scaling. With predictive scaling, AWS is introducing a new parameter called predicted capacity.

Every day, predictive scaling forecasts the hourly capacity needed for each of the next 48 hours. Then, at the beginning of each hour, the predicted capacity value is set to the forecasted capacity needed for that hour. At any point of time, three scenarios play out for your Auto Scaling group when using predictive scaling:

  • If actual capacity is lower than predicted capacity, EC2 Auto Scaling scales out your Auto Scaling group so that its desired capacity is equal to the predicted capacity.
  • If actual capacity is already higher than predicted capacity, EC2 Auto Scaling does not scale-in your Auto Scaling group.
  • If the predicted capacity is outside the range of minimum and maximum capacity that you defined, EC2 Auto Scaling does not violate those limits.

Note that a predictive scaling policy is not designed for use on its own, because it does not trigger scale-in events; it only triggers scale-out events in anticipation of predicted demand. Therefore, you should use predictive scaling together with a dynamic scaling policy, either provided by AWS or your own custom scaling automation, so that capacity is scaled in when it’s no longer needed. Each policy determines its capacity value independently, and the desired capacity is set to the higher of the two. This ensures that your application scales out when real-time demand is higher than predicted demand.

Predictive scaling policies operate in two modes: Forecast Only or Forecast And Scale. Forecast Only mode allows you to validate that predictive scaling accurately anticipates your routine hourly demand. This is a great way to get started with predictive scaling without impacting your current scaling behavior. Also, you can create multiple policies in Forecast Only mode to compare different configurations, such as forecasting on different metrics. Once you verify the predictions, a simple update is required to switch to Forecast And Scale mode for the policy configuration that is best-suited for your Auto Scaling group. Now that you have an understanding of this new feature, let’s walk through the steps to set it up.

Getting started with Predictive Scaling

In this section, we walk you through steps to add a predictive scaling policy to an Auto Scaling group. But first, let’s look at how dynamic scaling reacts when the demand increases rapidly. To illustrate, we created a load simulation that you can use to follow along by deploying this example AWS CloudFormation Stack in your account. This example deploys two Auto Scaling groups. The first Auto Scaling group is used to run a sample application and is configured with an Application Load Balancer (ALB). The second Auto Scaling group is for generating recurring requests to the application running on the first Auto Scaling group through the ALB. For this example, we have applied a target tracking policy to maintain CPU utilization at 25% to automatically scale the first Auto Scaling group running the application.
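
For reference, a target tracking policy equivalent to the one used by the example stack can be attached with a few lines of boto3. The Auto Scaling group name matches the example used later in this post, and the 25% CPU target mirrors the stack’s configuration; treat this as a sketch rather than the stack’s exact definition.

import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="Example Application Auto Scaling Group",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 25.0,   # keep average CPU utilization at 25%
    },
)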

The following graph illustrates how dynamic scaling adjusts capacity (blue line) with changing load (red line). We are interested in the ALB Response Time metric (green line).  It represents the time an application takes to process and respond to the incoming requests from the ALB. It is a good representation of the latency observed by the end users of the application. Therefore, any spike observed in this metric (green line) results in bad user experience.

Huge spike in response time when demand changes rapidly

As you can see, there are recurring periods of increased requests (red line) with different ramp-up velocities. For example, from 16:00 to 18:00 UTC, before stabilizing, the load increase is relatively more gradual than what is observed for the 08:00 to 10:00 UTC time range. The ALB Response Time metric (green line) remains low for the former period of gradual ramp-up. However, for the latter, steeper ramp-up, while auto scaling is adding the required number of instances (blue line), we observe a spike in the response time. Let’s zoom in to have a better look at the response time metric.

ALB request count vs request time

In the preceding graph, we see the response time spikes to as high as 35 seconds for the first 5 minutes of the hour before dropping down to subsecond level. Because dynamic scaling is reactive in nature, it failed to keep up with the steep demand change observed here. This may be acceptable for applications that are not sensitive to these latencies. But for others, predictive scaling helps you better manage such scenarios, by setting the baseline capacity proactively at the beginning of the hour.

We’ll now walk you through the steps to configure a predictive scaling policy. Note that predictive scaling requires at least 24 hours of historical load data to generate forecasts. If you are using the preceding example, allow it to run for 24 hours for the load data to be generated.

Configure Predictive Scaling policy in Forecast Only mode

First, configure your Auto Scaling group with a predictive scaling policy in Forecast Only mode so that you can review the results of the forecast and adjust any parameters to more accurately reflect the behavior you desire.

To do so, create a scaling configuration file where you define the metrics, target value, and the predictive scaling mode for your policy. The following example produces forecasts based on CPU Utilization, with each instance handling 25% of the average hourly CPU utilization for the Auto Scaling group. You can further customize these policies based on the needs of your workload.


cat <<EoF > predictive-scaling-policy-cpu.json
{
    "MetricSpecifications": [
        {
            "TargetValue": 25,
            "PredefinedMetricPairSpecification": {
                "PredefinedMetricType": "ASGCPUUtilization"
            }
        }
    ],
    "Mode": "ForecastAndScale"
}
EoF

Once you have created the configuration file, you can run the following command to add the predictive scaling policy to your Auto Scaling group.

aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "Example Application Auto Scaling Group" \
    --policy-name "CPUUtilizationpolicy" \
    --policy-type "PredictiveScaling" \
    --predictive-scaling-configuration file://predictive-scaling-policy-cpu.json


Reviewing Predictive Scaling forecasts

With the scaling policy in place, and 24 hours of historical load data, you can now use the predictive scaling forecasts API to review the forecasted load and forecasted capacity for the Auto Scaling group. You can also review forecasts in the console by navigating to the Amazon EC2 console, choosing Auto Scaling Groups, selecting the Auto Scaling group that you configured with predictive scaling, and viewing the predictive scaling policy located under the Automatic Scaling section of the Auto Scaling group details view. In the policy details, a chart represents the LoadForecast and CapacityForecast, showing what is forecasted for the next 48 hours, in addition to previous forecasts and actual average instance counts. The following screenshot demonstrates the forecasts for the policy just applied to the Auto Scaling group. The orange line represents the actual values, the blue line represents the historic forecast, and the green line represents the forecast for the next two days.
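
If you prefer to pull the forecasts programmatically instead of using the console, the GetPredictiveScalingForecast API returns both the load and the capacity forecasts. The sketch below assumes the Auto Scaling group and policy names used earlier in this post.

from datetime import datetime, timedelta, timezone
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

now = datetime.now(timezone.utc)
resp = autoscaling.get_predictive_scaling_forecast(
    AutoScalingGroupName="Example Application Auto Scaling Group",
    PolicyName="CPUUtilizationpolicy",
    StartTime=now,
    EndTime=now + timedelta(hours=48),   # forecasts cover up to the next 48 hours
)

# CapacityForecast holds the predicted hourly instance counts
capacity = resp["CapacityForecast"]
for ts, value in list(zip(capacity["Timestamps"], capacity["Values"]))[:6]:
    print(ts, "->", value, "instances")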

historic forecast and future forecasts

The upper graph shows the load forecast against the actual load observed. Since the scaling policy based its forecasts on Auto Scaling group CPU utilization, the load forecast reflects the total forecasted CPU load your Auto Scaling group must handle hourly. The lower graph shows the corresponding capacity forecast against the actual capacity. As you can see, the forecast gets more accurate with time. Predictive scaling constantly learns about the pattern and improves the forecast accuracy as it gets more data points to forecast on.

For this example, the predictive scaling policy calculates capacity such that instances in an Auto Scaling group consume 25% of the CPU load on average for each hour. Predictive scaling also provides three other predefined metric configurations to help you quickly set up forecasts on metrics other than CPU. You can create multiple predictive scaling policies in Forecast Only mode based on different metrics and target value to determine which scaling policy is the best match for your workload. This helps you compare the behavior of the predictive scaling policy for existing workloads without impacting your current configuration. The current forecasts seem fairly accurate, so we will stick with the same configurations.

Configure scaling policies in forecast and scale mode

When you are ready to allow predictive scaling to automatically adjust your Auto Scaling group’s hourly capacity, you can update one of the scaling policies to Forecast And Scale directly in the console. Otherwise, to switch modes from the command line, create a new predictive scaling policy configuration file with the “Mode” set to “ForecastAndScale”. You can do this with the following command:


cat <<EoF > predictive-scaling-policy-cpu.json
{
    "MetricSpecifications": [
        {
            "TargetValue": 25,
            "PredefinedMetricPairSpecification": {
                "PredefinedMetricType": "ASGCPUUtilization"
            }
        }
    ],
    "Mode": "ForecastAndScale"
}
EoF

Using the configuration file generated, run the following command to update the CPU Predictive Scaling policy.

aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "Example Application Auto Scaling Group" \
    --policy-name "CPUUtilizationpolicy" \
    --policy-type "PredictiveScaling" \
    --predictive-scaling-configuration file://predictive-scaling-policy-cpu.json

With this updated scaling policy in place, the Auto Scaling group’s predicted capacity will now change hourly based on the predictive scaling forecasts. The predicted capacity, which acts as the baseline for an hour, is launched at the beginning of the hour itself. You can also configure the policy to launch capacity further in advance, based on how long an instance takes to be provisioned and warmed up.
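
The launch-ahead behavior mentioned above is controlled by the scheduling buffer time in the policy configuration. The following is a sketch of the same policy with an assumed 10-minute buffer, so instances are launched and warmed up before the forecasted hour begins; set the value based on how long your instances take to provision.

import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="Example Application Auto Scaling Group",
    PolicyName="CPUUtilizationpolicy",
    PolicyType="PredictiveScaling",
    PredictiveScalingConfiguration={
        "MetricSpecifications": [{
            "TargetValue": 25,
            "PredefinedMetricPairSpecification": {
                "PredefinedMetricType": "ASGCPUUtilization"
            },
        }],
        "Mode": "ForecastAndScale",
        "SchedulingBufferTime": 600,   # assumed: launch 600 seconds ahead of each hour
    },
)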

Impact of Switching-On Predictive Scaling

Now that we have switched to ForecastAndScale mode and predictive scaling is actively scaling the Auto Scaling group, let’s revisit the ALB Response Time metric for the Auto Scaling group.

no latency spikes after applying predictive scaling

As you can see in the preceding screenshot, prior to the steep demand (8:00 – 10:00 UTC), 40 instances (blue line) have been added in a single step by predictive scaling. The dynamic scaling policy continues to add the remaining 9 instances required for the increasing demand. Because of the combined effect of both scaling policies, we no longer observe the spike in the response time metric (green line). Let’s zoom into the specific time frame to get a better look.

applying predictive scaling in forecast and scale mode

Throughout, the response time remains less than 0.02 seconds, compared to reaching as high as 35 seconds earlier when we were only using dynamic scaling. By launching the instances ahead of the steep demand change, predictive scaling has improved the end users’ experience. You do not need to resort to overprovisioning or manual intervention to scale out your Auto Scaling groups ahead of such demand patterns. As long as there is a predictable pattern, auto scaling enhanced with predictive scaling maintains high availability for your applications.

If you are using the example stack, do not forget to clean up after you are done testing the feature by deleting the stack.

Conclusion

Predictive scaling, when combined with dynamic scaling, helps you ensure that your EC2 Auto Scaling group workloads have the required capacity to handle predicted and real-time load. You can enable predictive scaling on existing Auto Scaling groups in Forecast Only mode to gain visibility into the predicted capacity without actually taking any scaling actions. You can refine and tune your predictive scaling policies by choosing one of the four predefined metrics and adjusting its target value as necessary. Once completed, you can switch to Forecast And Scale mode to proactively scale your Auto Scaling group capacity based on predicted demand. By using predictive scaling and dynamic scaling together, your Auto Scaling group will have the capacity it needs to meet demand, which can improve your application’s responsiveness and reduce your EC2 costs. To learn more about the feature, refer to the EC2 Auto Scaling User Guide.

Field Notes: SQL Server Deployment Options on AWS Using Amazon EC2

Post Syndicated from Saqlain Tahir original https://aws.amazon.com/blogs/architecture/field-notes-sql-server-deployment-options-on-aws-using-amazon-ec2/

Many enterprise applications run Microsoft SQL Server as their backend relational database. There are various options for customers to benefit from deploying SQL Server on AWS. This blog will help you choose the right architecture for your SQL Server deployment on Amazon EC2, with high availability options for mission-critical applications.

SQL Server on Amazon EC2 gives you full control over your deployment options and enables you to fine-tune the performance of your Microsoft workloads. Most importantly, you can bring your own licenses (BYOL) to AWS. You can re-host, aka “lift and shift”, your SQL Server to Amazon Elastic Compute Cloud (Amazon EC2) for large-scale enterprise applications, and if you are re-hosting, you can continue to use your existing SQL Server licenses on AWS. Lifting and shifting your on-premises MS SQL Server environment to AWS using Amazon EC2 is the recommended way to migrate your SQL Server workloads to the cloud.

First, it is important to understand the considerations for deploying a SQL Server using Amazon EC2. For example, when would you want to use Failover Cluster over Availability Groups?

The following table will help you choose the right SQL Server architecture, based on the type of workload and its availability requirements:


Self-managed MS SQL Server on EC2 usually means hosting MS SQL on EC2 backed by Amazon Elastic Block Store (EBS) or Amazon FSx for Windows File Server. Persistent storage from Amazon EBS and Amazon FSx delivers speed, security, and durability for your business-critical relational databases such as Microsoft SQL Server.

  • Amazon EBS delivers highly available and performant block storage for your most demanding SQL Server deployments to achieve maximum performance.
  • Amazon FSx delivers fully managed Windows native shared file storage (SMB) with a multi-Availability Zone (AZ) design for highly available (HA) SQL environments.

Previously, if you wanted to migrate your Failover Cluster SQL databases to AWS, there was no native shared storage option. You would need to implement third party solutions that added a cost and complexity to install, set up, and maintain the storage configuration.

Amazon FSx for Windows File Server provides shared storage that multiple SQL databases can connect to across multiple AZs for a DR and HA solution. It is also helpful to achieve throughput and certain IOPS without scaling up the instance types to get the same IOPS from EBS volumes.
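
As a rough illustration of the shared storage piece, a Multi-AZ Amazon FSx for Windows File Server file system can be created with a call like the one below. The subnet IDs, Active Directory ID, capacity, and throughput are placeholder values, and an FCI deployment also needs continuously available (CA) file shares, which the Quick Start referenced later in this post configures for you.

import boto3

fsx = boto3.client("fsx", region_name="eu-west-1")

# Placeholder network, directory, and sizing values
fsx.create_file_system(
    FileSystemType="WINDOWS",
    StorageCapacity=1024,              # GiB
    StorageType="SSD",
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],
    WindowsConfiguration={
        "DeploymentType": "MULTI_AZ_1",        # standby file server in a second AZ
        "PreferredSubnetId": "subnet-aaaa1111",
        "ThroughputCapacity": 512,             # MB/s
        "ActiveDirectoryId": "d-1234567890",   # assumed AWS Managed Microsoft AD
    },
)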

Overview of solution

Most customers need High Availability (HA) for their SQL Server production environment to ensure uptime and availability. This is important to minimize changes to the SQL Server applications while migrating. Customers may want to protect their investment in Microsoft SQL Server licenses by taking a Bring your own license (BYOL) approach to cloud migration.

There are some scenarios where applications running on Microsoft SQL Server need full control of the infrastructure and software. If customers require it, they can deploy their SQL Server to AWS on Amazon EC2. Currently, there are three ways to deploy SQL Server workloads on AWS as shown in the following diagram:


Walkthrough

Now the question comes, how do you deploy the preceding SQL Server architectures?

First, let’s discuss the high-level breakdown of deployment options including the two types of SQL HA modes:

  • Standalone
    • Single SQL Server Node without HA
    • Provision Amazon EC2 instance with EBS volume
    • Single Availability Zone deployment
  • Always On Failover Cluster Instance (FCI): EC2 and FSx for Windows File Server
    • Protects the whole instance, including system databases
    • Fails over at the instance level
    • Requires Shared Storage, Amazon FSx for Windows File Server is a great option
    • Can be used in conjunction with Availability Groups to provide read-replicas and DR copies (dependent upon SQL Server Edition)
    • Can be implemented at the Enterprise or Standard Edition level (with limitations)
    • Multi Availability Zone Deployment
  • Always On Availability Groups (AG): EC2 and EBS
    • Protects one or more user databases (Standard Edition is limited to a single user database per AG)
    • Failover is at the Availability Group level, meaning potentially only a subset of user databases can fail over versus the whole instance
    • System databases are not replicated, meaning users, jobs, etc. will not automatically appear on passive nodes; manual creation is needed on all nodes
    • Natively provides access to read-replicas and DR copies (dependent upon Edition)
    • Can be implemented at the Enterprise or Standard Edition level
    • Multi Availability Zone Deployment

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • SQL Server Licenses in case of BYOL Deployment
  • Identify Software and Hardware requirements for SQL Server Environment
  • Identify SQL Server application requirements based on best practices in this deployment guide

Deployment options on AWS

Here are some tools and services provided by AWS to deploy the SQL Server production ready environment by following best practices.

SQL Server on the AWS Cloud: Quick Start Reference Deployment

Use Case:

You want to deploy SQL Server on AWS for a Proof of Concept (PoC) or Pilot deployment using CloudFormation templates within hours by following these best practices.

Overview:

The Quick Start deployment guide provides step-by-step instructions for deploying SQL Server on the Amazon Web Services (AWS) Cloud, using AWS CloudFormation templates and AWS Systems Manager Automation documents that automate the deployment.

SQL Server on the AWS Cloud: Quick Start Reference Deployment

Implementation:

Quick Start Link: SQL Server with WSFC Quick Start

Source Code: GitHub

SQL Server Always On Deployments with AWS Launch Wizard

Use Case:

You intend to deploy SQL Server on AWS for your production workloads to benefit from automation, time and cost savings, and most importantly by leveraging proven deployment best practices from AWS.

Overview:

AWS Launch Wizard is a service that guides you through the sizing, configuration, and deployment of Microsoft SQL Server applications on AWS, following the AWS Well-Architected Framework. AWS Launch Wizard supports both single instance and high availability (HA) application deployments.

AWS Launch Wizard reduces the time it takes to deploy SQL Server solutions to the cloud. You input your application requirements, including performance, number of nodes, and connectivity, on the service console. AWS Launch Wizard identifies the right AWS resources to deploy and run your SQL Server application. You can also receive an estimated cost of deployment, modify your resources and instantly view the updated cost assessment.

When you approve, AWS Launch Wizard provisions and configures the selected resources in a few hours to create a fully-functioning production-ready SQL Server application. It also creates custom AWS CloudFormation templates, which can be reused and customized for subsequent deployments.

Once deployed, your SQL Server application is ready to use and can be accessed from the EC2 console. You can manage your SQL Server application with AWS Systems Manager.

SQL Server Always On Deployments with AWS Launch Wizard

Implementation:

AWS Launch Wizard Link: AWS Launch Wizard for SQL Server

Simplify your Microsoft SQL Server high availability deployments using Amazon FSx for Windows File Server

Use Case:

You need SQL Server Enterprise Edition to run an Always On Availability Group (AG), whereas you only need Standard Edition to run a Failover Cluster Instance (FCI). You want to use Standard Edition licensing to save costs but still achieve HA. SQL Server Standard is typically 40–50% less expensive than the Enterprise Edition.

Overview:

An Always On Failover Cluster Instance (FCI) uses block-level replication rather than database-level transactional replication, so you can migrate to AWS without re-architecting. Because the shared storage handles replication, you don’t need to use the SQL nodes for it, which frees up CPU and memory for primary compute jobs. With FCI, the entire instance is protected: if the primary node becomes unavailable, the entire instance is moved to the standby node. This takes care of the SQL Server logins, SQL Server Agent jobs, and certificates that are stored in the system databases, which are physically stored in the shared storage.

Simplify your Microsoft SQL Server high availability deployments using Amazon FSx for Windows File Server

Implementation:

FCI implementation: SQL Server Deployment using FCI, FSx QuickStart.

Clustering for SQL Server High Availability using SIOS Data Keeper

Use Case:

Windows Server Failover Clustering (WSFC) is a requirement whether you are using SQL Server Enterprise or Standard edition, and it might appear to be the perfect HA solution for applications running on Windows Server. But like FCIs, it requires the use of shared storage. If you want to use a software SAN across multiple instances, then SIOS DataKeeper can be an option.

Overview:

WSFC has a potential role to play in many HA configurations, including SQL Server FCIs, but its use requires separate data replication provisions in a SANless environment, whether in an enterprise datacenter or in the cloud. SIOS DataKeeper is a partner solution that provides a software SAN across multiple instances. Instead of FSx, you deploy another cluster for SIOS DataKeeper to host the shared volumes, or use a hyper-converged model to deploy SQL Server on the same servers as SIOS DataKeeper. You can also use SIOS DataKeeper Cluster Edition, a highly optimized, host-based replication solution.

Clustering for SQL Server High Availability using SIOS Data Keeper

Implementation:

QuickStart: SIOS DataKeeper Cluster Edition on the AWS Cloud

Conclusion

In this blog post, we covered the different options for deploying SQL Server on AWS using EC2. The options presented showed how you can have exactly the same experience from an administration point of view, as well as full control over your EC2 environment, including sysadmin and root-level access.

We also showed various ways to achieve high availability by deploying SQL Server on AWS as a new environment using AWS Quick Start and AWS Launch Wizard. We also showed how you can deploy SQL Server using AWS managed Windows storage, Amazon FSx for Windows File Server, to handle shared storage constraints, cost, and IOPS requirements. If you need shared storage in the cloud outside the Amazon FSx option, AWS supports a partner solution using SIOS DataKeeper Cluster Edition.

We hope you found this blog post useful and welcome your feedback in the comments!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Field Notes: Benchmarking Performance of the New M5zn, D3, and R5b Instance Types with Datadog

Post Syndicated from Ray Zaman original https://aws.amazon.com/blogs/architecture/field-notes-benchmarking-performance-of-the-new-m5zn-d3-and-r5b-instance-types-with-datadog/

This post was co-written with Danton Rodriguez, Product Manager at Datadog. 

At re:Invent 2020, AWS announced the new Amazon Elastic Compute Cloud (Amazon EC2) M5zn, D3, and R5b instance types. These instances are built on top of the AWS Nitro System, a collection of AWS-designed hardware and software innovations that enable the delivery of private networking, and efficient, flexible, and secure cloud services with isolated multi-tenancy.

If you’re thinking about deploying your workloads to any of these new instances, Datadog helps you monitor your deployment and gain insight into your entire AWS infrastructure. The Datadog Agent—open-source software available on GitHub—collects metrics, distributed traces, logs, profiles, and more from Amazon EC2 instances and the rest of your infrastructure.

How to deploy Datadog to analyze performance data

The Datadog Agent is compatible with the new instance types. You can use a configuration management tool such as AWS CloudFormation to deploy it automatically across all your instances. You can also deploy it with a single command directly to any Amazon EC2 instance. For example, you can use the following command to deploy the Agent to an instance running Amazon Linux 2:

DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=[Your API Key] bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"

The Datadog Agent uses an API key to send monitoring data to Datadog. You can find your API key in the Datadog account settings page. Once you deploy the Agent, you can instantly access system-level metrics and visualize your infrastructure. The Agent automatically tags EC2 metrics with metadata, including Availability Zone and instance type, so you can filter, search, and group data in Datadog. For example, the Host Map helps you visualize how I/O load on d3.2xlarge instances is distributed across Availability Zones and individual instances (as shown in Figure 1 ).

Figure 1 – Visualizing read I/O operations on d3.2xlarge instances across three Availability Zones.

Enabling trace collection for better visibility

Installing the Agent allows you to use Datadog APM to collect traces from the services running on your Amazon EC2 instances, and monitor their performance with Datadog dashboards and alerts.

Datadog APM includes support for auto-instrumenting applications built on a wide range of languages and frameworks, such as Java, Python, Django, and Ruby on Rails. To start collecting traces, you add the relevant Datadog tracing library to your code. For more information on setting up tracing for a specific language, Datadog has language-specific guides to help get you started.

Visualizing M5zn performance in Datadog

The new M5zn instances are a high-frequency, high-speed, and low-latency networking variant of Amazon EC2 M5 instances. M5zn instances deliver the highest all-core turbo CPU performance from Intel Xeon Scalable processors in the cloud, with a frequency up to 4.5 GHz, making them ideal for gaming, simulation modeling, and other high performance computing applications across a broad range of industries.

To demonstrate how to visualize M5zn’s performance improvements in Datadog, we ran a benchmark test for a simple Python application deployed behind Apache servers on two instance types:

  • M5 instance (hellobench-m5-x86_64 service in Figure 2)
  • M5zn instance (hellobench-m5zn-x86_64 service in Figure 2).

Our benchmark application used the aiohttp library and was instrumented with Datadog’s Python tracing library. To run the test, we used Apache’s HTTP server benchmarking tool to generate a constant stream of requests across the two instances.
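
For readers who want a concrete picture, the following is a minimal sketch of a comparable benchmark application, not the exact code we ran; the service name, route, and port are illustrative assumptions. Decorating the handler with ddtrace's tracer.wrap is one way to instrument it with Datadog's Python tracing library.

# Minimal aiohttp app instrumented with Datadog's Python tracer (illustrative sketch).
# The service name, route, and port are assumptions, not the exact benchmark code.
from aiohttp import web
from ddtrace import tracer

@tracer.wrap(service="hellobench", resource="GET /")
async def handle(request):
    # Keep the handler trivial so the benchmark measures request overhead, not app logic
    return web.Response(text="hello")

app = web.Application()
app.add_routes([web.get("/", handle)])

if __name__ == "__main__":
    web.run_app(app, port=8080)

With an application like this listening on each instance, Apache's HTTP server benchmarking tool (ab) can then be pointed at both endpoints to generate the constant request stream described above.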

The hellobench-m5zn-x86_64 service running on the M5zn instance reported a 95th percentile latency that was about 48 percent lower than the value reported by the hellobench-m5-x86_64 service running on the M5 instance (4.73 ms vs. 9.16 ms) over the course of our testing. A summary of the results appears in Datadog APM (Figure 2):

Figure 2 – Performance benchmarks for a Python application running on two instance types: M5 and M5zn.

To analyze this performance data, we visualize the complete distribution of the benchmark response time results in a Datadog dashboard. Viewing the full latency distribution gives us a more complete picture when selecting the right instance type, so we can better adhere to Service Level Objective (SLO) targets.

Figure 3 shows that the M5zn was able to outperform the M5 across the entire latency distribution, both for the median request and for the long tail end of the distribution. The median request, or 50th percentile, was 36 percent faster (299.65 µs vs. 465.28 µs), while the tail end of the distribution was 48 percent faster (4.73 ms vs. 9.16 ms), as mentioned in the preceding paragraph.

 

Screenshot of Latency Distribution : M5

Screenshot of Latency Distribution M5zn

Figure 3 – Using a Datadog dashboard to show how the M5zn instance type performed faster across the entire latency distribution during the benchmark test.

We can also create timeseries graphs of our test results to show that the M5zn was able to sustain faster performance throughout the duration of the test, despite processing a higher number of requests. Figure 4 illustrates the difference by displaying the 95th percentile response time and the request rate of both instances across 30-second intervals.

Figure 4 – The M5zn’s p95 latency was nearly half of the M5’s despite higher throughput during the benchmark test.

We can dig even deeper with Datadog Continuous Profiler, an always-on production code profiler used to analyze code-level performance across your entire environment, with minimal overhead. Profiles reveal which functions (or lines of code) consume the most resources, such as CPU and memory.

Even though M5zn is already designed to deliver excellent CPU performance, Continuous Profiler can help you optimize your applications to leverage the M5zn high-frequency processor to its maximum potential. As shown in Figure 5, the Continuous Profiler highlights lines of code where CPU utilization is exceptionally high, so you can optimize those methods or threads to make better use of the available compute power.

After you migrate your workloads to M5zn, Continuous Profiler can help you quantify CPU-time improvements on a per-method basis and pinpoint code-level bottlenecks, even as you add new features and functionality to your application.

Figure 5 – Using Datadog Continuous Profiler to identify functions with the most CPU time.

Comparing D3, D3en, and D2 performance in Datadog

  • The new D3 and D3en instances leverage 2nd-generation Intel Xeon Scalable Processors (Cascade Lake) and provide a sustained all core frequency up to 3.1 GHz.
  • Compared to D2 instances, D3 instances provide up to 2.5x higher networking speed and 45 percent higher disk throughput.
  • D3en instances provide up to 7.5x higher networking speed, 100 percent higher disk throughput, 7x more storage capacity, and 80 percent lower cost-per-TB of storage.

These instances are ideal for HDD storage workloads, such as distributed/clustered file systems, big data and analytics, and high capacity data lakes. D3en instances are the densest local storage instances in the cloud. For our testing, we deployed the Datadog Agent to three instances: the D2, D3, and D3en. We then used two benchmark applications to gauge performance under demanding workloads.

Our first benchmark test used TestDFSIO, an open-source benchmark test included with Hadoop that is used to analyze the I/O performance of an HDFS cluster.

We ran TestDFSIO with the following command:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 48 -size 10GB

The Datadog Agent automatically collects system metrics that can help you visualize how the instances performed during the benchmark test. The D3en instance led the field and hit a maximum write speed of 259,000 Kbps.

Figure 6 – Using Datadog to visualize and compare write speed of an HDFS cluster on D2, D3, and D3en instance types during the TestDFSIO benchmark test.

The D3en instance completed the TestDFSIO benchmark test 39 percent faster than the D2 (204.55 seconds vs. 336.84 seconds). The D3 instance completed the benchmark test 27 percent faster at 244.62 seconds.

Datadog helps you realize additional benefits of the D3en and D3 instances: notably, they exhibited lower CPU utilization than the D2 during the benchmark (as shown in figure 7).

Figure 7 – Using Datadog to compare CPU usage on D2, D3, and D3en instances during the TestDFSIO benchmark test.

For our second benchmark test, we again deployed the Datadog Agent to the same three instances: D2, D3, and D3en. In this test, we used TPC-DS, a high CPU and I/O load test that is designed to simulate real-world database performance.

TPC-DS is a set of tools that generates a set of data that can be loaded into your database of choice, in this case PostgreSQL 12 on Ubuntu 20.04. It then generates a set of SQL statements that are used to exercise the database. For this benchmark, 8 simultaneous threads were used on each instance.

The D3en instance completed the TPC-DS benchmark test 59 percent faster than the D2 (220.31 seconds vs. 542.44 seconds). The D3 instance completed the benchmark test 53 percent faster at 253.78 seconds.

Using the metrics collected from Datadog’s PostgreSQL integration, we learn that the D3en not only finished the test faster, but had lower system load during the benchmark test run. This is further validation of the overall performance improvements you can expect when migrating to the D3en.

Figure 8 – Using Datadog to compare system load on D2, D3, and D3en instances during the TPC-DS benchmark test.

The performance improvements are also visible when comparing rows returned per second. While all three instances had similar peak burst performance, the D3en and D3 sustained a higher rate of rows returned throughout the duration of the TPC-DS test.

Figure 9 – Using Datadog to compare Rows returned per second on D2, D3, and D3en instances during the TPC-DS benchmark test.

From these results, we learn that not only do the new D3en and D3 instances have faster disk throughput, but they also offer improved CPU performance, which translates into superior performance to power your most critical workloads.

Comparing R5b and R5 performance

The new R5b instances provide 3x higher EBS-optimized performance compared to R5 instances, and are frequently used to power large workloads that rely on Amazon EBS. Customers that operate applications with stringent storage performance requirements can consolidate their existing R5 workloads into fewer or smaller R5b instances to reduce costs.

To compare I/O performance across these two instance types, we installed the FIO benchmark application and the Datadog Agent on an R5 instance and an R5b instance. We then added EBS io1 storage volumes to each with a Provisioned IOPS setting of 25,000.
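
As a rough illustration of that setup, the boto3 sketch below provisions and attaches a comparable io1 volume. The Availability Zone, volume size, instance ID, and device name are placeholder assumptions, not values from our test environment.

import boto3

ec2 = boto3.client("ec2")

# Placeholder values: use the AZ of your instance and a size that supports 25,000 IOPS
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,              # GiB; io1 allows up to 50 IOPS per GiB
    VolumeType="io1",
    Iops=25000,            # matches the Provisioned IOPS setting used in this test
)

ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # example instance ID
    Device="/dev/sdf",
)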

We ran FIO with a 75 percent read, 25 percent write workload using the following command:

sudo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/disk-io1/test --bs=4k --iodepth=64 --size=16G --readwrite=randrw --rwmixread=75

Using the metrics collected from the Datadog Agent, we were able to visualize the benchmark performance results. In approximately one minute, FIO ramped up and reached the maximum I/O operations per second.

The left side of Figure 10 shows the R5b instance reaching the provisioned maximum of 25,000 IOPS, with write operations at 25 percent of the total, as expected. The right side shows the R5 reaching its EBS IOPS limit of 18,750, again with its relative 25 percent write operations.

It should be noted that R5b instances have far higher performance ceilings than what is being shown here, which you can find in the User Guide: Amazon EBS–optimized instances.

Figure 10 – Comparing IOPS on R5b and R5 instances during the FIO benchmark test.

Also, note that the R5b finished the benchmark test approximately one minute faster than the R5 (166 seconds vs. 223 seconds). The shorter test duration is driven by the R5b’s faster read throughput, which reached a maximum of 75,000 Kbps.

Figure 11 – The R5b instance’s faster read time enabled it to complete the benchmark test more quickly than the R5 instance.

From these results, we have learned that the R5b delivers superior I/O capacity with higher throughput, making it a great choice for large relational databases and other IOPS-intensive workloads.

Conclusion

If you are thinking about shifting your workloads to one of the new Amazon EC2 instance types, you can use the Datadog Agent to immediately begin collecting and analyzing performance data. With Datadog’s other AWS integrations, you can monitor even more of your AWS infrastructure and correlate that data with the data collected by the Agent. For example, if you’re running EBS-optimized R5b instances, you can monitor them alongside performance data from your EBS volumes with Datadog’s Amazon EBS integration.

About Datadog

Datadog is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in DevOps, Migration, Containers, and Microsoft Workloads.

Read more about the M5zn, D3en, D3, and R5b instances, and sign up for a free Datadog trial if you don’t already have an account.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Danton Rodriguez

Danton is a Product Manager at Datadog focused on distributed systems observability and efficiency.

Automate Amazon EC2 instance isolation by using tags

Post Syndicated from Jose Obando original https://aws.amazon.com/blogs/security/automate-amazon-ec2-instance-isolation-by-using-tags/

Containment is a crucial part of an overall Incident Response Strategy, as this practice allows time for responders to perform forensics, eradication and recovery during an Incident. There are many different approaches to containment. In this post, we will be focusing on isolation—the ability to keep multiple targets separated so that each target only sees and affects itself—as a containment strategy.

I’ll show you how to automate isolation of an Amazon Elastic Compute Cloud (Amazon EC2) instance by using an AWS Lambda function that’s triggered by tag changes on the instance, as reported by Amazon CloudWatch Events.

CloudWatch Events rules deliver a near real-time stream of system events that describe changes in Amazon Web Services (AWS) resources. See also Amazon EventBridge.

Preparing for an incident is important as outlined in the Security Pillar of the AWS Well-Architected Framework.

Of the seven design principles for security in the cloud described in the Well-Architected Framework, this solution covers the following:

  • Enable traceability: Monitor, alert, and audit actions and changes to your environment in real time. Integrate log and metric collection with systems to automatically investigate and take action.
  • Automate security best practices: Automated software-based security mechanisms can improve your ability to securely scale more rapidly and cost-effectively. Create secure architectures, including through the implementation of controls that can be defined and managed by AWS as code in version-controlled templates.
  • Prepare for security events: Prepare for an incident by implementing incident management and investigation policy and processes that align to your organizational requirements. Run incident response simulations and use tools with automation to help increase your speed for detection, investigation, and recovery.

After detecting an event in the Detection phase and analyzing in the Analysis phase, you can automate the process of logically isolating an instance from a Virtual Private Cloud (VPC) in Amazon Web Services (AWS).

In this blog post, I describe how to automate EC2 instance isolation by using the tagging feature that a responder can use to identify instances that need to be isolated. A Lambda function then uses AWS API calls to isolate the instances by performing the actions described in the following sections.

Use cases

Your organization can use automated EC2 instance isolation for scenarios like these:

  • A security analyst wants to automate EC2 instance isolation in order to respond to security events in a timely manner.
  • A security manager wants to provide their first responders with a way to quickly react to security incidents without providing too much access to higher security features.

High-level architecture and design

The example solution in this blog post uses a CloudWatch Events rule to trigger a Lambda function. The CloudWatch Events rule is triggered when a tag is applied to an EC2 instance. The Lambda code triggers further actions based on the contents of the event. Note that the CloudFormation template includes the appropriate permissions to run the function.

The event flow is shown in Figure 1 and works as follows:

  1. The EC2 instance is tagged.
  2. The CloudWatch Events rule filters the event.
  3. The Lambda function is invoked.
  4. If the criteria are met, the isolation workflow begins.
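
To make step 2 more concrete, the sketch below shows how a similar rule could be defined with boto3. The rule name is a placeholder and the pattern is illustrative; the rule deployed by the CloudFormation template may use a narrower event pattern.

import json
import boto3

events = boto3.client("events")

# Illustrative pattern that matches EC2 instance tag-change events
pattern = {
    "source": ["aws.tag"],
    "detail-type": ["Tag Change on Resource"],
    "detail": {"service": ["ec2"], "resource-type": ["instance"]},
}

events.put_rule(
    Name="ec2-tag-change-example",   # placeholder name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)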

When the Lambda function is invoked and the criteria are met, these actions are performed:

  1. Checks for IAM instance profile associations.
  2. If the instance is associated to a role, the Lambda function disassociates that role.
  3. Applies the isolation role that you defined during CloudFormation stack creation.
  4. Checks the VPC where the EC2 instance resides.
    • If there is no isolation security group in the VPC (if the VPC is new, for example), the function creates one.
  5. Applies the isolation security group.

Note: If you had a security group with an open (0.0.0.0/0) outbound rule, and you apply this Isolation security group, your existing SSH connections to the instance are immediately dropped. On the other hand, if you have a narrower inbound rule that initially allows the SSH connection, the existing connection will not be broken by changing the group. This is known as Connection Tracking.
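
The following is a simplified boto3 sketch of what those isolation actions can look like. The function signature and variable names are our own illustrative assumptions, and the sketch presumes the isolation instance profile and isolation security group already exist; it is not the exact code in the provided Lambda function.

import boto3

ec2 = boto3.client("ec2")

def isolate_instance(instance_id, isolation_profile_arn, isolation_sg_id):
    # Steps 1-3: replace any existing IAM instance profile with the isolation role
    associations = ec2.describe_iam_instance_profile_associations(
        Filters=[{"Name": "instance-id", "Values": [instance_id]}]
    )["IamInstanceProfileAssociations"]
    for association in associations:
        ec2.disassociate_iam_instance_profile(AssociationId=association["AssociationId"])
    ec2.associate_iam_instance_profile(
        IamInstanceProfile={"Arn": isolation_profile_arn}, InstanceId=instance_id
    )

    # Steps 4-5: apply the isolation security group to the instance
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[isolation_sg_id])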

Figure 1: High-level diagram showing event flow

For the deployment method, we will be using an AWS CloudFormation Template. AWS CloudFormation gives you an easy way to model a collection of related AWS and third-party resources, provision them quickly and consistently, and manage them throughout their lifecycles, by treating infrastructure as code.

The AWS CloudFormation template that I provide here deploys the following resources:

  • An EC2 instance role for isolation – this is attached to the EC2 instance to prevent further communication with other AWS services, thus limiting the attack surface of your overall infrastructure.
  • An Amazon CloudWatch Events rule – this is used to detect changes to an EC2 resource, in this case a tag change event. We use this as the trigger for our Lambda function.
  • An AWS Identity and Access Management (IAM) role for Lambda functions – this is what the Lambda function uses to execute the workflow.
  • A Lambda function for automation – this function is where all the decision logic sits; once triggered, it follows the set of steps described in the following section.
  • Lambda function permissions – these allow the CloudWatch Events rule to invoke the Lambda function.
  • An IAM instance profile – this is a container for an IAM role that you can use to pass role information to an EC2 instance.

Supporting functions within the Lambda function

Let’s dive deeper into each supporting function inside the Lambda code.

The following function identifies the virtual private cloud (VPC) ID for a given instance. This is needed to identify which security groups are present in the VPC.

import boto3

ec2Client = boto3.client('ec2')  # client shared by the helper functions below

def identifyInstanceVpcId(instanceId):
    # Describe the instance and return the ID of the VPC it runs in
    instanceReservations = ec2Client.describe_instances(InstanceIds=[instanceId])['Reservations']
    for instanceReservation in instanceReservations:
        instancesDescription = instanceReservation['Instances']
        for instance in instancesDescription:
            return instance['VpcId']

The following function modifies the security group of an EC2 instance.

def modifyInstanceAttribute(instanceId,securityGroupId):
    response = ec2Client.modify_instance_attribute(
        Groups=[securityGroupId],
        InstanceId=instanceId)

The following function creates a security group in the VPC and removes its default egress rule, so that the group blocks all outbound traffic.

def createSecurityGroup(groupName, descriptionString, vpcId):
    resource = boto3.resource('ec2')
    securityGroupId = resource.create_security_group(GroupName=groupName, Description=descriptionString, VpcId=vpcId)
    # Remove the default allow-all egress rule so the new group blocks all outbound traffic
    securityGroupId.revoke_egress(IpPermissions= [{'IpProtocol': '-1','IpRanges': [{'CidrIp': '0.0.0.0/0'}],'Ipv6Ranges': [],'PrefixListIds': [],'UserIdGroupPairs': []}])
    return securityGroupId

Deploy the solution

To deploy the solution provided in this blog post, first download the CloudFormation template, and then set up a CloudFormation stack that specifies the tags that are used to trigger the automated process.

Download the CloudFormation template

To get started, download the CloudFormation template from Amazon S3. Alternatively, you can launch the CloudFormation template by selecting the following Launch Stack button:

Select the Launch Stack button to launch the template

Deploy the CloudFormation stack

Start by uploading the CloudFormation template to your AWS account.

To upload the template

  1. From the AWS Management Console, open the CloudFormation console.
  2. Choose Create Stack.
  3. Choose With new resources (standard).
  4. Choose Upload a template file.
  5. Select Choose File, and then select the YAML file that you just downloaded.
Figure 2: CloudFormation stack creation

Specify stack details

You can leave the default values for the stack as long as there aren’t any resources provisioned already with the same name, such as an IAM role. For example, if left with default values an IAM role named “SecurityIsolation-IAMRole” will be created. Otherwise, the naming convention is fully customizable from this screen and you can enter your choice of name for the CloudFormation stack, and modify the parameters as you see fit. Figure 3 shows the parameters that you can set.

The Evaluation Parameters section defines the tag key and value that will initiate the automated response. Keep in mind that these values are case-sensitive.

Figure 3: CloudFormation stack parameters

Choose Next until you reach the final screen, shown in Figure 4, where you acknowledge that an IAM role is created and you trust the source of this template. Select the check box next to the statement I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create Stack.

Figure 4: CloudFormation IAM notification

After you complete these steps, the following resources will be provisioned, as shown in Figure 5:

  • EC2IsolationRole
  • EC2TagChangeEvent
  • IAMRoleForLambdaFunction
  • IsolationLambdaFunction
  • IsolationLambdaFunctionInvokePermissions
  • RootInstanceProfile
Figure 5: CloudFormation created resources

Testing

To start your automation, tag an EC2 instance using the tag defined during the CloudFormation setup. If you’re using the Amazon EC2 console, you can apply tags to resources by using the Tags tab on the relevant resource screen, or you can use the Tags screen, the AWS CLI or an AWS SDK. A detailed walkthrough for each approach can be found in the Amazon EC2 Documentation page.
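
You can also apply the tag programmatically. The boto3 sketch below assumes a tag key of SecurityIsolation with the value true and an example instance ID; substitute the case-sensitive tag key and value you defined in the Evaluation Parameters during stack creation.

import boto3

ec2 = boto3.client("ec2")

# Placeholder tag key/value and instance ID; match your Evaluation Parameters exactly
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=[{"Key": "SecurityIsolation", "Value": "true"}],
)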

Reverting Changes

If you need to remove the restrictions applied by this workflow, complete the following steps.

  1. From the EC2 dashboard, in the Instances section, check the box next to the instance you want to modify.

    Figure 6: Select the instance to modify

  2. In the top right, select Actions, choose Instance settings, and then choose Modify IAM role.

    Figure 7: Choose Actions > Instance settings > Modify IAM role

  3. Under IAM role, choose the IAM role to attach to your instance, and then select Save.

    Figure 8: Choose the IAM role to attach

  4. Select Actions, choose Networking, and then choose Change security groups.

    Figure 9: Choose Actions > Networking > Change security groups

  5. Under Associated security groups, select Remove and add a different security group with the access you want to grant to this instance.

Summary

Using the CloudFormation template provided in this blog post, a Security Operations Center analyst could have only tagging privileges and still isolate an EC2 instance based on this tag. Alternatively, a security service such as Amazon GuardDuty could trigger a Lambda function to apply the tag as part of a workflow. This means the Security Operations Center analyst wouldn’t need administrative privileges over the EC2 service.

This solution creates an isolation security group for any new VPCs that don’t have one already. The security group would still follow the naming convention defined during the CloudFormation stack launch, but won’t be part of the provisioned resources. If you decide to delete the stack, manual cleanup would be required to remove these security groups.

This solution terminates established inbound Secure Shell (SSH) sessions that are associated to the instance, and isolates the instance from new connections either inbound or outbound. For outbound connections that are already established (for example, reverse shell), you either need to shut down the network interface card (NIC) at the operating system (OS) level, restart the instance network stack at the OS level, terminate the established connections, or apply a network access control list (network ACL).

For more information, see the following documentation:

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jose Obando

Jose is a Security Consultant on the Global Financial Services team. He helps the world’s top financial institutions improve their security posture in the cloud. He has a background in network security and cloud architecture. In his free time, you can find him playing guitar or training in Muay Thai.

Field Notes: Streaming VR to Wireless Headsets Using NVIDIA CloudXR

Post Syndicated from William Cannady original https://aws.amazon.com/blogs/architecture/field-notes-streaming-vr-to-wireless-headsets-using-nvidia-cloudxr/

It’s exciting to see many consumer-grade virtual reality (VR) hardware options, but setting up hardware can be cumbersome, expensive and complicated. Wired headsets require high-powered graphics workstations, and a solution to prevent you from tripping over the wires. Many room-scale headsets require two external peripherals (or ‘light towers’) to be installed so the headset can position itself in a room. These setups can take days to tune, and need resetting if the light towers are moved.

With the release of the Oculus Quest, users of virtual reality were delighted with a wireless, room-scale headset with dual hand tracking. They could enjoy VR without worrying about light towers or a high-powered graphics workstation. However, because the Quest was battery powered, it used an inherently low-powered central processing unit (CPU) and graphics processing unit (GPU). As a result, VR content had to be simplified to run on the Quest. This prevented customers from using the Quest for the most demanding graphics experiences, such as Computer Aided Design (CAD) review, or playing games such as Half-Life: Alyx.

Customers were faced with a difficult choice: expensive, complicated setups, or reduced-fidelity experiences.

In this blog post, we show you how to stream a full-fidelity VR experience from a computer on AWS to a wireless headset such as the Quest.

Overview of architecture

NVIDIA CloudXR takes NVIDIA’s experience in GPU encoding and decoding, and streams pixels to a remotely connected VR headset. By doing this, the rendering and compute requirements of visually intensive applications take place on a remote server instead of a local headset. This makes mobile headsets work with any application, regardless of their visual complexity and density.

 

Figure 1: architecture for streaming VR experiences from the AWS Cloud to a VR headset using NVIDIA’s CloudXR server running on EC2.

To provide global scalability,  NVIDIA announced the CloudXR platform will be available on G4 and P3 EC2 instances. It provides the following benefits:

  • At a global scale, customers can stream remote AR/VR experiences from Regions that are close to them.
  • It enables centrally managed and deployed software experiences on Amazon Elastic Compute Cloud (Amazon EC2) instances. Previously these required physical transportation and implementation of devices and server hardware.
  • Lastly, IT administrators can now centrally manage content that may be sensitive or require frequent changes.

Walkthrough

Using CloudXR on AWS requires EC2 instances with NVIDIA GPUs (that is, the P3 or G4 instance types) running within your virtual private cloud (VPC). The instance must be network accessible to a remote CloudXR client running on a VR headset. Connections are 1:1, meaning that each CloudXR client is connected to a dedicated EC2 instance. If needs require multiple CloudXR clients, you can deploy multiple EC2 instances.

Note that the process outlined is accurate as of January 2021. CloudXR, and X Reality (XR) overall, is rapidly changing. Consult the latest information about CloudXR from NVIDIA. Using CloudXR within your AWS account requires you to set up P3 or G4 EC2 instances, as you would within an Amazon VPC. You must also add a security group that allows the ports required for CloudXR communication. These specific ports can be found in the CloudXR documentation, available from NVIDIA.
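
For illustration only, the boto3 sketch below shows one way to open such ports in a security group. The port numbers and client CIDR are placeholders and are not taken from the CloudXR documentation; replace them with the actual TCP and UDP ports NVIDIA lists, and restrict the source range to your CloudXR clients.

import boto3

ec2 = boto3.client("ec2")

sg_id = "sg-0123456789abcdef0"     # security group attached to the CloudXR instance (placeholder)
client_cidr = "203.0.113.0/24"     # example client range; narrow this to your own clients
tcp_ports = [48010]                # placeholder; use the ports from the CloudXR documentation
udp_ports = [47998, 47999]         # placeholder; use the ports from the CloudXR documentation

permissions = (
    [{"IpProtocol": "tcp", "FromPort": p, "ToPort": p,
      "IpRanges": [{"CidrIp": client_cidr}]} for p in tcp_ports]
    + [{"IpProtocol": "udp", "FromPort": p, "ToPort": p,
        "IpRanges": [{"CidrIp": client_cidr}]} for p in udp_ports]
)
ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=permissions)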

We have created a CloudFormation template that deploys an EC2 instance with CloudXR configured for reference, linked in the prerequisites. Because it makes reference to a private AMI, it must be shared with your account in order to deploy successfully. If you’re interested in using this template, contact your AWS Account team.

Prerequisites

The following steps describe how to configure the EC2 instance manually. CloudXR streaming requires using a connection other than Windows RDP to connect to the remote EC2 instance. We use NICE DCV, which is provided at no cost to EC2 instances for remote connectivity.

For this walkthrough, you should have the following prerequisites:

Deploy CloudXR Server onto Amazon EC2

It’s important to note the steps outlined are for configuring a G4 instance. If you’d prefer to use a P3 instance, manually deploy your P3 instance and install NICE DCV as described in the documentation.

  1. Log into your AWS Account and navigate to the AWS Marketplace to install an EC2 instance with NICE DCV configured.
  2. Create a new security group during deployment that matches the CloudXR port settings and apply it to your instance. Consult the CloudXR documentation for the latest port settings.
  3. Wait 5 minutes for everything to initialize properly. Make note of the instance’s public IP address (or attach an Elastic IP address to the instance).
  4. Navigate to https://<IP-OF-INSTANCE>:8443 to connect to the NICE DCV web browser client, and use the credentials created during EC2 initialization to log in.

NICE DCV login screen on Web Browser Client

5. Once logged into your EC2 instance, install SteamVR and CloudXR onto the remote EC2 instance. SteamVR is used as an OpenVR/XR proxy between your VR application and CloudXR. CloudXR is used to stream the SteamVR experience to a remote CloudXR Client.

6. Verify installation of the CloudXR plugin into SteamVR by navigating to the Manage Add-ons page within the Advanced Settings option. Make sure it lists CloudXRRemoteHMD as an addon and is set to ON.

Verification of CloudXR Installation

7. Add an allow entry in the Windows Firewall for vrserver.exe. This allows SteamVR to use CloudXR to stream properly. By default, this file is located at %ProgramFiles(x86)%\Steam\steamapps\common\SteamVR\bin\win64\vrserver.exe

Enabling the VRSERVER.EXE application through the Windows Application Firewall.

 

8. Install a CloudXR client onto your VR headset. If you are using an Android-powered headset (such as the Oculus Quest), you can use the sample APK within the CloudXR SDK.

9. Select Finish.

Connect to your CloudXR Server and start streaming

1. Launch SteamVR on your remote EC2 instance by logging into your Steam account or configuring a no-login link following the Installation/use of SteamVR in an environment without internet access instructions.

2. When loaded, it will report that a headset cannot be detected. This is OK.

SteamVR will display Headset not Detected—this is OK.

3. Within your Client headset, load the CloudXR Client application you recently installed.

4. Once connected, the headset will start displaying the SteamVR “void”. You should also see a view of your headset if SteamVR mirroring is enabled. The status box on the SteamVR server application will show a headset and two controllers attached as well.

SteamVR “Void” and Headset Connected icons

5. Congratulations. You’re now connected to an AWS EC2 instance using NVIDIA CloudXR! Any VR application you now run on the EC2 server that uses OpenVR will be streamed to your VR headset!

Cleaning up

EC2 instances are billed only when they’re being used. You’ll want to make sure to stop your instance or shut it down when you are finished with your session. Terminating your instance is not necessary.

Conclusion

In this blog post, we showed how to stream a full-fidelity VR experience from a computer on AWS to a wireless headset. Having the ability to remotely connect to GPU-powered servers to run graphic workloads is not necessarily new, but connecting to a remote server with a VR headset and having full interactivity certainly is. With this architecture, you realize the benefits of CloudXR combined with the agility and scalability available on AWS. It becomes less challenging to manage content played on VR headsets because content doesn’t reside on the VR headset—it lives on the EC2 server.

Deploying to any AWS region where GPU instances are available allows you to offer CloudXR to your users at global scale.  As networks get faster and closer through services like AWS Outposts and AWS Wavelength, remote VR work will become possible for more customers. We’re excited to see what new workloads come next as this way of working grows.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Leo Chan

Leo Chan is the Worldwide Tech Lead for Spatial Computing at AWS. He loves working at the intersection of Art and Technology and lives with his family in a rain forest just off the coast of Vancouver, Canada.

Secure and automated domain membership management for EC2 instances with no internet access

Post Syndicated from Rakesh Singh original https://aws.amazon.com/blogs/security/secure-and-automated-domain-membership-management-for-ec2-instances-with-no-internet-access/

In this blog post, I show you how to deploy an automated solution that helps you fully automate the Active Directory join and unjoin process for Amazon Elastic Compute Cloud (Amazon EC2) instances that don’t have internet access.

Managing Active Directory domain membership for EC2 instances in the Amazon Web Services (AWS) Cloud is a typical use case for many organizations. In a dynamic environment that can grow and shrink multiple times in a day, adding and removing computer objects from an Active Directory domain is a critical task and is difficult to manage without automation.

AWS seamless domain join provides a secure and reliable option to join an EC2 instance to your AWS Directory Service for Microsoft Active Directory. It’s a recommended approach for automating joining a Windows or Linux EC2 instance to the AWS Managed Microsoft AD or to an existing on-premises Active Directory using AD Connector, or a standalone Simple AD directory running in the AWS Cloud. This method requires your EC2 instances to have connectivity to the public AWS Directory Service endpoints. At the time of writing, Directory Service doesn’t have PrivateLink endpoint support. This means you must allow traffic from your instances to the public Directory Service endpoints via an internet gateway, network address translation (NAT) device, virtual private network (VPN) connection, or AWS Direct Connect connection.

At times, your organization might require that any traffic between your VPC and Directory Service—or any other AWS service—not leave the Amazon network. That means launching EC2 instances in an Amazon Virtual Private Cloud (Amazon VPC) with no internet access and still needing to join and unjoin the instances from the Active Directory domain. Provided your instances have network connectivity to the directory DNS addresses, the simplest solution in this scenario is to run the domain join commands manually on the EC2 instances and enter the domain credentials directly. Though this process can be secure—as you don’t need to store or hardcode the credentials—it’s time consuming and becomes difficult to manage in a dynamic environment where EC2 instances are launched and terminated frequently.

VPC endpoints enable private connections between your VPC and supported AWS services. Private connections enable you to privately access services by using private IP addresses. Traffic between your VPC and other AWS services doesn’t leave the Amazon network. Instances in your VPC don’t need public IP addresses to communicate with resources in the service.

The solution in this blog post uses AWS Secrets Manager to store the domain credentials and VPC endpoints to enable private connection between your VPC and other AWS services. The solution described here can be used in the following scenarios:

  1. Manage domain join and unjoin for EC2 instances that don’t have internet access.
  2. Manage only domain unjoin if you’re already using seamless domain join provided by AWS, or any other method for domain joining.
  3. Manage only domain join for EC2 instances that don’t have internet access.

This solution uses AWS CloudFormation to deploy the required resources in your AWS account based on your choice from the preceding scenarios.

Note: If your EC2 instances can access the internet, then we recommend using the seamless domain join feature and using scenario 2 to remove computers from the Active Directory domain upon instance termination.

The solution described in this blog post is designed to provide a secure, automated method for joining and unjoining EC2 instances to an on-premises or AWS Managed Microsoft AD domain. The solution is best suited for use cases where the EC2 instances don’t have internet connectivity and the seamless domain join option cannot be used.

How this solution works

This blog post includes a CloudFormation template that you can use to deploy this solution. The CloudFormation stack provisions an EC2 Windows instance running in an Amazon EC2 Auto Scaling group that acts as a worker and is responsible for joining and unjoining other EC2 instances from the Active Directory domain. The worker instance communicates with other required AWS services such as Amazon Simple Storage Service (Amazon S3), Secrets Manager, and Amazon Simple Queue Service (Amazon SQS) using VPC endpoints. The stack also creates all of the other resources needed for this solution to work.
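
For reference, the sketch below shows what creating one such interface endpoint (here, for Secrets Manager) looks like with boto3. This is illustrative only; the CloudFormation stack creates the endpoints the solution needs, and the VPC, subnet, security group, and Region values shown are placeholders.

import boto3

ec2 = boto3.client("ec2")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                         # placeholder VPC
    ServiceName="com.amazonaws.us-east-1.secretsmanager",  # use your Region
    SubnetIds=["subnet-0123456789abcdef0"],                # placeholder subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],             # placeholder security group
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])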

Figure 1 shows the domain join and unjoin workflow for EC2 instances in an AWS account.

Figure 1: Workflow for joining and unjoining an EC2 instance from a domain with full protection of Active Directory credentials

The event flow in Figure 1 is as follows:

  1. An EC2 instance is launched or terminated in an account.
  2. An Amazon CloudWatch Events rule detects if the EC2 instance is in running or terminated state.
  3. The CloudWatch event triggers an AWS Lambda function that looks for the tag JoinAD: true to check if the instance needs to join or unjoin the Active Directory domain.
  4. If the tag value is true, the Lambda function writes the instance details to an Amazon Simple Queue Service (Amazon SQS) queue.
  5. A standalone, highly secured EC2 instance acts as a worker and polls the Amazon SQS queue for new messages.
  6. Whenever there’s a new message in the queue, the worker EC2 instance invokes scripts on the remote EC2 instance to add or remove the instance from the domain based on the instance operating system and state.
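
To make steps 3 and 4 more concrete, here is a hedged sketch of the kind of logic the Lambda function might use. The structure, queue URL, and function name are illustrative assumptions rather than the actual code from the GitHub repo.

import json
import os
import boto3

ec2 = boto3.client("ec2")
sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]   # assumed to be passed in as an environment variable

def lambda_handler(event, context):
    # The EC2 state-change event carries the instance ID and its new state
    instance_id = event["detail"]["instance-id"]
    state = event["detail"]["state"]          # "running" or "terminated"

    # Look for the JoinAD tag on the instance
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]},
                 {"Name": "key", "Values": ["JoinAD"]}]
    )["Tags"]

    if any(tag["Value"].lower() == "true" for tag in tags):
        # Hand the work off to the worker instance through SQS
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"instance_id": instance_id, "state": state}),
        )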

In this solution, the security of the Active Directory credentials is enhanced by storing them in Secrets Manager. To secure the stored credentials, the solution uses resource-based policies to restrict the access to only intended users and roles.

The credentials can only be fetched dynamically from the EC2 instance that’s performing the domain join and unjoin operations. Any access to that instance is further restricted by a custom AWS Identity and Access Management (IAM) policy created by the CloudFormation stack. The following policies are created by the stack to enhance security of the solution components.

  1. Resource-based policies for Secrets Manager to restrict all access to the stored secret to only specific IAM entities (such as the EC2 IAM role).
  2. An S3 bucket policy to prevent unauthorized access to the Active Directory join and remove scripts that are stored in the S3 bucket.
  3. The IAM role that’s used to fetch the credentials from Secrets Manager is restricted by a custom IAM policy and can only be assumed by the worker EC2 instance. This prevents every entity other than the worker instance from using that IAM role.
  4. All API and console access to the worker EC2 instance is restricted by a custom IAM policy with an explicit deny.
  5. A policy to deny all but the worker EC2 instance access to the credentials in Secrets Manager. With the worker EC2 instance doing the work, the EC2 instances that need to join the domain don’t need access to the credentials in Secrets Manager or to scripts in the S3 bucket.
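
As an illustration of the first item, a resource-based policy of this kind can be attached with a call similar to the one below. The role ARN is a placeholder, and the actual policy is created for you by the CloudFormation stack, so you don't need to run this yourself.

import json
import boto3

secrets = boto3.client("secretsmanager")

worker_role_arn = "arn:aws:iam::123456789012:role/worker-ec2-role"   # placeholder ARN

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "*",
        # Deny retrieval to every principal except the worker instance role
        "Condition": {"StringNotEquals": {"aws:PrincipalArn": worker_role_arn}},
    }],
}

secrets.put_resource_policy(SecretId="myadcredV1", ResourcePolicy=json.dumps(policy))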

Prerequisites and setup

Before you deploy the solution, you must complete the following in the AWS account and Region where you want to deploy the CloudFormation stack.

  1. AWS Managed Microsoft AD with an appropriate DNS name (for example, test.com). You can also use your on premises Active Directory, provided it’s reachable from the Amazon VPC over Direct Connect or AWS VPN.
  2. Create a DHCP option set with on-premises DNS servers or with the DNS servers pointing to the IP addresses of directories provided by AWS.
  3. Associate the DHCP option set with the Amazon VPC that you’re going to use with this solution.
  4. Any other Amazon VPCs that are hosting EC2 instances to be domain joined must be peered with the VPC that hosts the relevant AWS Managed Microsoft AD. Alternatively, AWS Transit Gateway can be used to establish this connectivity.
  5. Make sure to have the latest AWS Command Line Interface (AWS CLI) installed and configured on your local machine.
  6. Create a new SSH key pair and store it in Secrets Manager using the following commands. Replace <Region> with the Region of your deployment. Replace <MyKeyPair> with any custom name or leave it default.

Bash:

aws ec2 create-key-pair --region <Region> --key-name <MyKeyPair> --query 'KeyMaterial' --output text > adsshkey
aws secretsmanager create-secret --region <Region> --name "adsshkey" --description "my ssh key pair" --secret-string file://adsshkey

PowerShell:

aws ec2 create-key-pair --region <Region> --key-name <MyKeyPair>  --query 'KeyMaterial' --output text | out-file -encoding ascii -filepath adsshkey
aws secretsmanager create-secret --region <Region> --name "adsshkey" --description "my ssh key pair" --secret-string file://adsshkey

Note: Don’t change the name of the secret, as other scripts in the solution reference it. The worker EC2 instance will fetch the SSH key using GetSecretValue API to SSH or RDP into other EC2 instances during domain join process.
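
A minimal sketch of that lookup, assuming the adsshkey secret name created above, could look like the following; the Region is a placeholder.

import boto3

secrets = boto3.client("secretsmanager", region_name="eu-west-1")   # placeholder Region

# The worker retrieves the private key at runtime instead of storing it on disk
ssh_private_key = secrets.get_secret_value(SecretId="adsshkey")["SecretString"]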

Deploy the solution

With the prerequisites in place, your next step is to download or clone the GitHub repo and store the files on your local machine. Go to the location where you cloned or downloaded the repo and review the contents of the config/OS_User_Mapping.json file to validate the instance user name and operating system mapping. Update the file if you’re using a user name other than the one used to log in to the EC2 instances. The default user name used in this solution is ec2-user for Linux instances and Administrator for Windows.

The solution requires installation of some software on the worker EC2 instance. Because the EC2 instance doesn’t have internet access, you must download the latest Windows 64-bit version of the following software to your local machine and upload it into the solution deployment S3 bucket in subsequent steps.

Note: This step isn’t required if your EC2 instances have internet access.

Once done, use the following steps to deploy the solution in your AWS account:

Steps to deploy the solution:

  1. Create a private Amazon Simple Storage Service (Amazon S3) bucket using this documentation to store the Lambda functions and the domain join and unjoin scripts.
  2. Once created, enable versioning on this bucket using the following documentation. Versioning lets you keep multiple versions of your objects in one bucket and helps you easily retrieve and restore previous versions of your scripts.
  3. Upload the software you downloaded to the S3 bucket. This is only required if your instance doesn’t have internet access.
  4. Upload the cloned or downloaded GitHub repo files to the S3 bucket.
  5. Go to the S3 bucket and select the template name secret-active-dir-solution.json, and copy the object URL.
  6. Open the CloudFormation console. Choose the appropriate AWS Region, and then choose Create Stack. Select With new resources.
  7. Select Amazon S3 URL as the template source, paste the object URL that you copied in Step 5, and then choose Next.
  8. On the Specify stack details page, enter a name for the stack and provide the following input parameters. You can modify the default values to customize the solution for your environment.
    • ADUSECASE – From the dropdown menu, select your required use case. There is no default value.
    • AdminUserId – The canonical user ID of the IAM user who manages the Active Directory credentials stored in Secrets Manager. To learn how to find the canonical user ID for your IAM user, scroll down to Finding the canonical user ID for your AWS account in AWS account identifiers.
    • DenyPolicyName – The name of the IAM policy that restricts access to the worker EC2 instance and the IAM role used by the worker to fetch credentials from Secrets Manager. You can keep the default value or provide another name.
    • InstanceType – Instance type to be used when launching the worker EC2 instance. You can keep the default value or use another instance type if necessary.
    • Placeholder – This is a dummy parameter that’s used as a placeholder in IAM policies for the EC2 instance ID. Keep the default value.
    • S3Bucket – The name of the S3 bucket that you created in the first step of the solution deployment. Replace the default value with your S3 bucket name.
    • S3prefix – Amazon S3 object key where the source scripts are stored. Leave the default value as long as the cloned GitHub directory structure hasn’t been changed.
    • SSHKeyRequired – Select true or false based on whether an SSH key pair is required to RDP into the EC2 worker instance. If you select false, the worker EC2 instance will not have an SSH key pair.
    • SecurityGroupId – Security group IDs to be associated with the worker instance to control traffic to and from the instance.
    • Subnet – Select the VPC subnet where you want to launch the worker EC2 instance.
    • VPC – Select the VPC where you want to launch the worker EC2 instance. Use the VPC where you have created the AWS Managed Microsoft AD.
    • WorkerSSHKeyName – An existing SSH key pair name that can be used to get the password for RDP access into the EC2 worker instance. This isn’t mandatory if you’re using user name and password based login or AWS Systems Manager Session Manager. This is required only if you have selected true for the SSHKeyRequired parameter.
    Figure 2: Defining the stack name and input parameters for the CloudFormation stack

  9. Enter values for all of the input parameters, and then choose Next.
  10. On the Options page, keep the default values and then choose Next.
  11. On the Review page, confirm the details, acknowledge that CloudFormation might create IAM resources with custom names, and choose Create Stack.
  12. Once the stack creation is marked as CREATE_COMPLETE, the following resources are created:
    • An EC2 instance that acts as a worker and runs Active Directory join scripts on the remote EC2 instances. It also unjoins instances from the domain upon instance termination.
    • A secret with a default Active Directory domain name, user name, and a dummy password. The name of the default secret is myadcredV1.
    • A Secrets Manager resource-based policy to deny all access to the secret except to the intended IAM users and roles.
    • An EC2 IAM profile and IAM role to be used only by the worker EC2 instance.
    • A managed IAM policy called DENYPOLICY that can be assigned to an IAM user, group, or role to restrict access to the solution resources such as the worker EC2 instance.
    • A CloudWatch Events rule to detect running and terminated states for EC2 instances and trigger a Lambda function that posts instance details to an SQS queue.
    • A Lambda function that reads instance tags and writes to an SQS queue based on the instance tag value, which can be true or false.
    • An SQS queue for storing the EC2 instance state—running or terminated.
    • A dead-letter queue for storing unprocessed messages.
    • An S3 bucket policy to restrict access to the source S3 bucket from unauthorized users or roles.
    • A CloudWatch log group to stream the logs of the worker EC2 instance.

Test the solution

Now that the solution is deployed, you can test it to check if it’s working as expected. Before you test the solution, you must navigate to the secret created in Secrets Manager by CloudFormation and update the Active Directory credentials—domain name, user name, and password.

To test the solution

  1. In the CloudFormation console, choose Services, and then CloudFormation. Select your stack name. On the stack Outputs tab, look for the ADSecret entry.
  2. Choose the ADSecret link to go to the configuration for the secret in the Secrets Manager console. Scroll down to the section titled Secret value, and then choose Retrieve secret value to display the default Secret Key and Secret Value as shown in Figure 3.

    Figure 3: Retrieve value in Secrets Manager

  3. Choose the Edit button and update the default dummy credentials with your Active Directory domain credentials. (Optional) Directory_ou is used to store the organizational unit (OU) and directory components (DC) for the directory; for example, OU=test,DC=example,DC=com.

Note: instance_password is an optional secret key and is used only when you’re using user name and password based login to access the EC2 instances in your account.
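
You can also update the secret from the command line. The following is a hedged sketch using the AWS CLI; the JSON key names shown here are illustrative placeholders, so use the exact keys displayed in your secret's current value, and replace the domain, user, and password values with your own.

aws secretsmanager put-secret-value \
  --secret-id myadcredV1 \
  --secret-string '{"domain_name":"corp.example.com","username":"admin_user","password":"<your-password>","directory_ou":"OU=test,DC=example,DC=com"}'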

Now that the secret is updated with the correct credentials, you can launch a test EC2 instance and determine if the instance has successfully joined the Active Directory domain.

Create an Amazon Machine Image

Note: This is only required for Linux-based operating systems other than Amazon Linux. You can skip these steps if your instances have internet access.

As your VPC doesn’t have internet access, for Linux-based systems other than Amazon Linux 1 or Amazon Linux 2, the required packages must be available on the instances that need to join the Active Directory domain. For that, you must create a custom Amazon Machine Image (AMI) from an EC2 instance with the required packages. If you already have a process to build your own AMIs, you can add these packages as part of that existing process.

To install the package into your AMI

  1. Create a new EC2 Linux instance for the required operating system in a public subnet or a private subnet with access to the internet via a NAT gateway.
  2. Connect to the instance using any SSH client.
  3. Install the required software by running the following command that is appropriate for the operating system:
    • For CentOS:
      yum -y install realmd adcli oddjob-mkhomedir oddjob samba-winbind-clients samba-winbind samba-common-tools samba-winbind-krb5-locator krb5-workstation unzip
      

    • For RHEL:
      yum -y  install realmd adcli oddjob-mkhomedir oddjob samba-winbind-clients samba-winbind samba-common-tools samba-winbind-krb5-locator krb5-workstation python3 vim unzip
      

    • For Ubuntu:
      apt-get -yq install realmd adcli winbind samba libnss-winbind libpam-winbind libpam-krb5 krb5-config krb5-locales krb5-user packagekit  ntp unzip python
      

    • For SUSE:
      sudo zypper -n install realmd adcli sssd sssd-tools sssd-ad samba-client krb5-client samba-winbind krb5-client python
      

    • For Debian:
      apt-get -yq install realmd adcli winbind samba libnss-winbind libpam-winbind libpam-krb5 krb5-config krb5-locales krb5-user packagekit  ntp unzip
      

  4. Follow Manually join a Linux instance to install the AWS CLI on Linux.
  5. Create a new AMI based on this instance by following the instructions in Create a Linux AMI from an instance.

You now have a new AMI that can be used in the next steps and in future to launch similar instances.

For Amazon Linux-based EC2 instances, the solution will use the mechanism described in How can I update yum or install packages without internet access on my EC2 instances to install the required packages and you don’t need to create a custom AMI. No additional packages are required if you are using Windows-based EC2 instances.

To launch a test EC2 instance

  1. Navigate to the Amazon EC2 console and launch an Amazon Linux or Windows EC2 instance in the same Region and VPC that you used when creating the CloudFormation stack. For any other operating system, make sure you are using the custom AMI created before.
  2. In the Add Tags section, add a tag named JoinAD and set the value to true. Add another tag named Operating_System and set the appropriate operating system value from the following list (a CLI example follows this procedure):
    • AMAZON_LINUX
    • FEDORA
    • RHEL
    • CENTOS
    • UBUNTU
    • DEBIAN
    • SUSE
    • WINDOWS
  3. Make sure that the security group associated with this instance is set to allow all inbound traffic from the security group of the worker EC2 instance.
  4. Use the SSH key pair name from the prerequisites (Step 6) when launching the instance.
  5. Wait for the instance to launch and join the Active Directory domain. You can now navigate to the CloudWatch log group named /ad-domain-join-solution/ created by the CloudFormation stack to determine whether the instance has joined the domain. After a successful join, you can connect to the instance using an RDP or SSH client and your login credentials.
  6. To test the domain unjoin workflow, you can terminate the EC2 instance launched in Step 1 and log in to the Active Directory tools instance to validate that the Active Directory computer object that represents the instance is deleted.
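
If you prefer the CLI, a hedged sketch of launching a tagged test instance follows; the AMI ID, instance type, subnet, security group, and key pair name are placeholders that you must replace with your own values.

aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.micro \
  --subnet-id subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --key-name <your-key-pair-name> \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=JoinAD,Value=true},{Key=Operating_System,Value=AMAZON_LINUX}]'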

Solution review

Let’s review the details of the solution components and what happens during the domain join and unjoin process:

1) The worker EC2 instance:

The worker EC2 instance used in this solution is a Windows instance with all configurations required to add and remove machines to and from an Active Directory domain. It can also be used as an Active Directory administration tools instance. This instance continuously runs a bash script that polls the SQS queue for new messages. Upon arrival of a new message, the script performs the following tasks:

  1. Check if the instance is in running or terminated state to determine if it needs to be added or removed from the Active Directory domain.
  2. If the message is from a newly launched EC2 instance, then this means that this instance needs to join the Active Directory domain.
  3. The script identifies the instance operating system and runs the appropriate PowerShell or bash script on the remote EC2.
  4. Similarly, if the instance is in terminated state, then the worker will run the domain unjoin command locally to remove the computer object from the Active Directory domain.
  5. If the worker fails to process a message in the SQS queue, it sends the unprocessed message to a backup queue for debugging.
  6. The worker writes logs related to the success or failure of the domain join to a CloudWatch log group. Filter on /ad-domain-join-solution in CloudWatch to find all of the other logs created by the worker instance.

2) The worker bash script running on the instance:

This script polls the SQS queue every 5 seconds for new messages and is responsible for the following activities (a minimal sketch of the polling loop follows the list):

  • Fetching Active Directory join credentials (user name and password) from Secrets Manager.
  • If the remote EC2 instance is running Windows, running the Invoke-Command PowerShell cmdlet on the instance to perform the Active Directory join operation.
  • If the remote EC2 instance is running Linux, running realm join command on the instance to perform the Active Directory join operation.
  • Running the Remove-ADComputer command to remove the computer object from the Active Directory domain for terminated EC2 instances.
  • Storing domain-joined EC2 instance details—computer name and IP address—in an Amazon DynamoDB table. These details are used to check if an instance is already part of the domain and when removing the instance from the Active Directory domain.
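
The actual script ships with the solution's source code; the following minimal bash sketch only illustrates the polling pattern. The queue URL is a placeholder, the secret name assumes the default myadcredV1, and the remote join/unjoin logic is elided.

# Minimal polling sketch (placeholders only; not the solution's actual script)
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/111122223333/ad-join-queue"   # placeholder

while true; do
  # Long-poll the queue for a single message
  MSG=$(aws sqs receive-message --queue-url "$QUEUE_URL" \
        --max-number-of-messages 1 --wait-time-seconds 5)
  if [ -n "$MSG" ]; then
    # Fetch the AD credentials only when there is work to do
    CREDS=$(aws secretsmanager get-secret-value --secret-id myadcredV1 \
            --query SecretString --output text)
    # ...decide between a domain join (Invoke-Command or realm join on the remote
    # instance) and a domain unjoin (Remove-ADComputer run locally), then delete
    # the message so it is not processed again...
    RECEIPT=$(echo "$MSG" | python3 -c \
      'import sys, json; print(json.load(sys.stdin)["Messages"][0]["ReceiptHandle"])')
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
  fi
  sleep 5
done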

More information

Now that you have tested the solution, here are some additional points to be noted:

  • The Active Directory join and unjoin scripts provided with this solution can be replaced with your existing custom scripts.
  • To update the scripts on the worker instance, you must upload the modified scripts to the S3 bucket and the changes will automatically synchronize on the instance.
  • This solution works with a single account, Region, and VPC combination. It can be modified for use across multiple Region and VPC combinations.
  • For VPCs in a different account or Region, you must share your AWS Managed Microsoft AD with the other AWS account after the networking prerequisites have been completed.
  • The instance user name and operating system mapping used in the solution is based on the default user name used by AWS.
  • You can use AWS Systems Manager with VPC endpoints to log in to EC2 instances that don’t have internet access.
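
For example, with the Session Manager plugin installed on your workstation, starting a session to a private instance is a single command; the instance ID below is a placeholder.

aws ssm start-session --target i-0123456789abcdef0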

The solution protects your Active Directory credentials and makes sure that:

  • Active Directory credentials can be accessed only from the worker EC2 instance.
  • The IAM role used by the worker EC2 instance to fetch the secret cannot be assumed by other IAM entities.
  • Only authorized users can read the credentials from the Secrets Manager console, through the AWS CLI, or by using any other AWS tool, such as an AWS SDK.

The focus of this solution is to demonstrate a method you can use to secure Active Directory credentials and automate the process of EC2 instances joining and unjoining from an Active Directory domain.

  • You can associate the IAM policy named DENYPOLICY with any IAM group or user in the account to block that user or group from accessing or modifying the worker EC2 instance and the IAM role used by the worker.
  • If your account belongs to an organization, you can use an organization-level service control policy instead of an IAM-managed policy—such as DENYPOLICY—to protect the underlying resources from unauthorized users.

Conclusion

In this blog post, you learned how to deploy an automated and secure solution through CloudFormation to help secure the Active Directory credentials and also manage adding and removing Amazon EC2 instances to and from an Active Directory domain. When using this solution, you incur Amazon EC2 charges along with charges associated with Secrets Manager pricing and AWS PrivateLink.

You can use the following references to help diagnose or troubleshoot common errors during the domain join or unjoin process.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Rakesh Singh

Rakesh is a Technical Account Manager with AWS. He loves automation and enjoys working directly with customers to solve complex technical issues and provide architectural guidance. Outside of work, he enjoys playing soccer, singing karaoke, and watching thriller movies.

Running cost optimized Spark workloads on Kubernetes using EC2 Spot Instances

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/running-cost-optimized-spark-workloads-on-kubernetes-using-ec2-spot-instances/

This post is written by Kinnar Sen, Senior Solutions Architect, EC2 Spot 

Apache Spark is an open-source, distributed processing system used for big data workloads. It provides API operations to perform multiple tasks such as streaming, extract transform load (ETL), query, machine learning (ML), and graph processing. Spark supports four different types of cluster managers (Spark standalone, Apache Mesos, Hadoop YARN, and Kubernetes), which are responsible for scheduling and allocation of resources in the cluster. Spark has offered native Kubernetes support since 2018 (Spark 2.3). AWS customers that have already chosen Kubernetes as their container orchestration tool can also choose to run Spark applications in Kubernetes, increasing the effectiveness of their operations and compute resources.

In this post, I illustrate the deployment of scalable, resilient, and cost optimized Spark application using Kubernetes via Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EC2 Spot Instances. Learn how to save money on big data workloads by implementing this solution.

Overview

Amazon EC2 Spot Instances

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90% discount compared to On-Demand Instance prices. A capacity pool is a group of EC2 instances that belong to a particular instance family, size, and Availability Zone (AZ). If EC2 needs capacity back for On-Demand Instance usage, Spot Instances can be interrupted by EC2 with a two-minute notification. There are many graceful ways to handle the interruption to ensure that the application is well architected for resilience and fault tolerance. This can be automated via the application and/or infrastructure deployments. Spot Instances are ideal for stateless, fault tolerant, loosely coupled and flexible workloads that can handle interruptions.

Amazon Elastic Kubernetes Service

Amazon EKS is a fully managed Kubernetes service that makes it easy for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane. It provides a highly available and scalable managed control plane. It also provides managed worker nodes, which let you create, update, or terminate (shut down) worker nodes for your cluster with a single command. It is a great choice for deploying flexible and fault tolerant containerized applications. Amazon EKS supports creating and managing Amazon EC2 Spot Instances using Amazon EKS-managed node groups following Spot best practices. This enables you to take advantage of the steep savings and scale that Spot Instances provide for interruptible workloads running in your Kubernetes cluster. Using EKS-managed node groups with Spot Instances requires less operational effort compared to using self-managed nodes. In addition to launching Spot Instances in managed node groups, it is possible to specify multiple instance types in EKS managed node groups. You can find more in this blog.

Apache Spark and Kubernetes

When a Spark application is submitted to the Kubernetes cluster, the following happens:

  • A Spark driver is created.
  • The driver runs within a Kubernetes pod.
  • The Spark driver then requests executors, which are scheduled to run within pods. The executors are managed by the driver.
  • The application is launched and once it completes, the executor pods are cleaned up. The driver pod persists the logs and remains in a completed state until the pod is cleared by garbage collection or manually removed. The driver in a completed state does not consume any memory or compute resources.

Spark Deployment on Kubernetes Cluster

When a Spark application runs on clusters managed by Kubernetes, the native Kubernetes scheduler is used. It is possible to schedule the driver/executor pods on a subset of available nodes. The applications can be launched either by a vanilla ‘spark-submit’, a workflow orchestrator like Apache Airflow, or the Spark operator. I use vanilla ‘spark-submit’ in this blog. Amazon EMR on EKS is also able to schedule Spark applications on EKS clusters as described in this launch blog, but Amazon EMR on EKS is out of scope for this post.

Cost optimization

For any organization running big data workloads there are three key requirements: scalability, performance, and low cost. As the size of data increases, there is demand for more compute capacity and the total cost of ownership increases. It is critical to optimize the cost of big data applications. Big Data frameworks (in this case, Spark) are distributed to manage and process high volumes of data. These frameworks are designed for failure, can run on machines with different configurations, and are inherently resilient and flexible.

When Spark is deployed on Kubernetes, the executor pods can be scheduled on EC2 Spot Instances and the driver pods on On-Demand Instances. This reduces the overall cost of deployment – Spot Instances can save up to 90% over On-Demand Instance prices. This also enables faster results by scaling out executors running on Spot Instances. Spot Instances, by design, can be interrupted when EC2 needs the capacity back. If a driver pod is running on a Spot Instance that is interrupted, the application fails and must be re-submitted. To avoid this situation, the driver pod can be scheduled on On-Demand Instances only. This adds a layer of resiliency to the Spark application running on Kubernetes. To cost optimize the deployment, all the executor pods are scheduled on Spot Instances as that’s where the bulk of compute happens. Spark’s inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions.

There are a couple of key points to note here.

  • The idea is to start with the minimum number of nodes for both On-Demand and Spot Instances (one each) and then auto-scale using Cluster Autoscaler and EC2 Auto Scaling. Cluster Autoscaler for AWS provides integration with Auto Scaling groups. If there are not sufficient resources, the driver and executor pods go into a pending state. The Cluster Autoscaler detects pods in a pending state and scales worker nodes within the identified Auto Scaling group in the cluster using EC2 Auto Scaling.
  • The scaling for On-Demand and Spot nodes is exclusive of one another. So, if multiple applications are launched the driver and executor pods can be scheduled in different node groups independently per the resource requirements. This helps reduce job failures due to lack of resources for the driver, thus adding to the overall resiliency of the system.
  • Using EKS Managed node groups
    • This requires significantly less operational effort compared to using self-managed node groups and enables:
      • Automatic enforcement of Spot best practices like the capacity-optimized allocation strategy, Capacity Rebalancing, and the use of multiple instance types.
      • Proactive replacement of Spot nodes using rebalance notifications.
      • Managed draining of Spot nodes via re-balance recommendations.
    • The nodes are auto-labeled so that the pods can be scheduled with NodeAffinity; you can verify the labels with the kubectl sketch after this list:
      • eks.amazonaws.com/capacityType: SPOT
      • eks.amazonaws.com/capacityType: ON_DEMAND
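
For example, once the node groups are up, you can confirm these labels from kubectl (using the label key documented above):

# Show the capacityType label for every node
kubectl get nodes --label-columns=eks.amazonaws.com/capacityType

# List only the Spot-backed nodes
kubectl get nodes -l eks.amazonaws.com/capacityType=SPOT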

Now that you understand the products and best practices used in this tutorial, let’s get started.

Tutorial: running Spark in EKS managed node groups with Spot Instances

In this tutorial, I review the steps that help you launch cost optimized and resilient Spark jobs inside Kubernetes clusters running on EKS. I launch a word-count application counting the words from an Amazon Customer Review dataset and write the output to an Amazon S3 folder. To run the Spark workload on Kubernetes, make sure you have eksctl and kubectl installed on your computer or on an AWS Cloud9 environment. You can run this by using an AWS IAM user or role that has the AdministratorAccess policy attached to it, or check the minimum required permissions for using eksctl. The Spot node groups in the Amazon EKS cluster can be launched either in a managed or a self-managed way; in this post, I use the former. The config files for this tutorial can be found here. The job is finally launched in cluster mode.

Create Amazon S3 Access Policy

First, I must create an Amazon S3 access policy to allow the Spark application to read/write from Amazon S3. Amazon S3 Access is provisioned by attaching the policy by ARN to the node groups. This associates Amazon S3 access to the NodeInstanceRole and, hence, the node groups then have access to Amazon S3. Download the Amazon S3 policy file from here and modify the <<output folder>> to an Amazon S3 bucket you created. Run the following to create the policy. Note the ARN.

aws iam create-policy --policy-name spark-s3-policy --policy-document file://spark-s3.json

Cluster and node groups deployment

Create an EKS cluster using the following command:

eksctl create cluster --name=sparkonk8 --node-private-networking --without-nodegroup --asg-access --region=<<AWS Region>>

The cluster takes approximately 15 minutes to launch.

Create the nodegroup using the nodeGroup config file. Replace the <<Policy ARN>> string using the ARN string from the previous step.

eksctl create nodegroup -f managedNodeGroups.yml

Scheduling driver/executor pods

The driver and executor pods can be assigned to nodes using affinity. Pod templates can be used to configure details that are not supported by the Spark launch configuration by default. This feature is available from Spark 3.0.0. requiredDuringScheduling node affinity is used to schedule the driver and executor pods. Sample podTemplates have been uploaded here.
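
As an illustration, a minimal driver pod template could pin the driver to On-Demand capacity using the EKS capacityType label shown earlier. This is only a sketch; the sample templates linked above are more complete, and an executor template would target SPOT in the same way.

cat > driver_pod_template.yml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/capacityType
                operator: In
                values:
                  - ON_DEMAND
EOF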

Launching a Spark application

Create a service account. The Spark driver pod uses the service account to create and watch executor pods using the Kubernetes API server.

kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole='edit'  --serviceaccount=default:spark --namespace=default

Download the Cluster Autoscaler and edit it to add the cluster-name. 

curl -LO https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
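
A quick way to set the cluster name is a sed replacement; this assumes the upstream manifest still uses the <YOUR CLUSTER NAME> placeholder, so verify the file before applying it.

sed -i.bak 's/<YOUR CLUSTER NAME>/sparkonk8/' cluster-autoscaler-autodiscover.yaml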

Install the Cluster AutoScaler using the following command:

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Get the details of Kubernetes master to get the head URL.

kubectl cluster-info 

command output

Use the following instructions to build the docker image.

Download the application file (script.py) from here and upload it into the Amazon S3 bucket you created.

Download the pod template files from here. Submit the application.

bin/spark-submit \
--master k8s://<<MASTER URL>> \
--deploy-mode cluster \
--name 'Job Name' \
--conf spark.eventLog.dir=s3a://<<S3 BUCKET>>/logs \
--conf spark.eventLog.enabled=true \
--conf spark.history.fs.inProgressOptimization.enabled=true \
--conf spark.history.fs.update.interval=5s \
--conf spark.kubernetes.container.image=<<ECR Spark Docker Image>> \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.podTemplateFile='../driver_pod_template.yml' \
--conf spark.kubernetes.executor.podTemplateFile='../executor_pod_template.yml' \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=100 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.33 \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=30 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.driver.memory=8g \
--conf spark.kubernetes.driver.request.cores=2 \
--conf spark.kubernetes.driver.limit.cores=4 \
--conf spark.executor.memory=8g \
--conf spark.kubernetes.executor.request.cores=2 \
--conf spark.kubernetes.executor.limit.cores=4 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.fast.upload=true \
s3a://<<S3 BUCKET>>/script.py \
s3a://<<S3 BUCKET>>/output 

A couple of key points to note here

  • podTemplateFile is used here, which enables scheduling of the driver pods to On-Demand Instances and executor pods to Spot Instances.
  • Spark provides a mechanism to allocate resources dynamically based on workloads. In the latest release of Spark (3.0.0), dynamicAllocation can be used with the Kubernetes cluster manager. Executors that no longer store active shuffle files can be removed to free up resources. dynamicAllocation works well in tandem with Cluster Autoscaler for resource allocation and optimizes resources for jobs. We are using dynamicAllocation here to enable optimized resource sharing.
  • The application file and output are both in Amazon S3.

Output Files in S3

  • Spark Event logs are redirected to Amazon S3. Spark on Kubernetes creates local temporary files for logs and removes them once the application completes. The logs are redirected to Amazon S3 and Spark History Server can be used to analyze the logs. Note, you can create more instrumentation using tools like Prometheus and Grafana to monitor and manage the cluster.
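
To browse those logs, a minimal sketch of running the Spark History Server against the same bucket follows; it assumes a local Spark 3.x installation with the same S3A settings used at submit time, and the bucket name is a placeholder.

export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=s3a://<<S3 BUCKET>>/logs"
$SPARK_HOME/sbin/start-history-server.sh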

Spark History Server + Dynamic Allocation

Observations

EC2 Spot Interruptions

The following diagram and log screenshot details from Spark History server showcases the behavior of a Spark application in case of an EC2 Spot interruption.

Four Spark applications were launched in parallel in a cluster and one of the Spot nodes was interrupted. A couple of executor pods were shut down in three of the four applications, but due to the resilient nature of Spark, new executors were launched and the applications finished at almost the same time.
The Spark Driver identified the shut down executors, which handled the shuffle files, and relaunched the tasks running on those executors.
Spark jobs


Dynamic Allocation

Dynamic Allocation works with the caveat that it is an experimental feature.

dynamic allocation

Cost Optimization

Cost optimization is achieved in several different ways in this tutorial:

  • Use of 100% Spot Instances for the Spark executors
  • Use of dynamicAllocation along with Cluster Autoscaler makes optimized use of resources and hence saves cost
  • Deploying one driver node and one executor node to begin with, and then scaling up on demand, reduces the waste of a continuously running cluster

Cluster Autoscaling

Cluster Autoscaling is triggered, as designed, when there are pending (Spark executor) pods.

The Cluster Autoscaler logs can be fetched by:

kubectl logs -f deployment/cluster-autoscaler -n kube-system --tail=10

Cluster Autoscaler Logs 

Cleanup

If you are trying out the tutorial, run the following steps to make sure that you don’t encounter unwanted costs.

Delete the EKS cluster and the nodegroups with the following command:

eksctl delete cluster --name sparkonk8

Delete the Amazon S3 Access Policy with the following command:

aws iam delete-policy --policy-arn <<POLICY ARN>>

Delete the Amazon S3 Output Bucket with the following command:

aws s3 rb --force s3://<<S3_BUCKET>>

Conclusion

In this blog, I demonstrated how you can run Spark workloads on a Kubernetes Cluster using Spot Instances, achieving scalability, resilience, and cost optimization. To cost optimize your Spark based big data workloads, consider running spark application using Kubernetes and EC2 Spot Instances.


How to monitor Windows and Linux servers and get internal performance metrics

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-monitor-windows-and-linux-servers-and-get-internal-performance-metrics/

This post was written by Dean Suzuki, Senior Solutions Architect.

Customers who run Windows or Linux instances on AWS frequently ask, “How do I know if my disks are almost full?” or “How do I know if my application is using all the available memory and is paging to disk?” This blog helps answer these questions by walking you through how to set up monitoring to capture these internal performance metrics.

Solution overview

If you open the Amazon EC2 console, select a running Amazon EC2 instance, and select the Monitoring tab, you can see Amazon CloudWatch metrics for that instance. Amazon CloudWatch is an AWS monitoring service. The Monitoring tab (shown in the following image) shows the metrics that can be measured external to the instance (for example, CPU utilization, network bytes in/out). However, to understand what percentage of the disk is being used or what percentage of the memory is being used, these metrics require an internal operating system view of the instance. AWS places an extra safeguard on gathering data inside a customer’s instance, so this capability is not enabled by default.

EC2 console showing Monitoring tab

To capture the server’s internal performance metrics, a CloudWatch agent must be installed on the instance. For Windows, the CloudWatch agent can capture any of the Windows performance monitor counters. For Linux, the CloudWatch agent can capture system-level metrics. For more details, please see Metrics Collected by the CloudWatch Agent. The agent can also capture logs from the server. The agent then sends this information to Amazon CloudWatch, where rules can be created to alert on certain conditions (for example, low free disk space) and automated responses can be set up (for example, perform backup to clear transaction logs). Also, dashboards can be created to view the health of your Windows servers.

There are four steps to implement internal monitoring:

  1. Install the CloudWatch agent onto your servers. AWS provides a service called AWS Systems Manager Run Command, which enables you to do this agent installation across all your servers.
  2. Run the CloudWatch agent configuration wizard, which captures what you want to monitor. These items could be performance counters and logs on the server. This configuration is then stored in AWS Systems Manager Parameter Store.
  3. Configure CloudWatch agents to use agent configuration stored in Parameter Store using the Run Command.
  4. Validate that the CloudWatch agents are sending their monitoring data to CloudWatch.

The following image shows the flow of these four steps.

Process to install and configure the CloudWatch agent

In this blog, I walk through these steps so that you can follow along. Note that you are responsible for the cost of running the environment outlined in this blog. So, once you are finished with the steps in the blog, I recommend deleting the resources if you no longer need them. For the cost of running these servers, see Amazon EC2 On-Demand Pricing. For CloudWatch pricing, see Amazon CloudWatch pricing.

If you want a video overview of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.

Deploy the CloudWatch agent

The first step is to deploy the Amazon CloudWatch agent. There are multiple ways to deploy the CloudWatch agent (see this documentation on Installing the CloudWatch Agent). In this blog, I walk through how to use the AWS Systems Manager Run Command to deploy the agent. AWS Systems Manager uses the Systems Manager agent, which is installed by default on each AWS instance. This AWS Systems Manager agent must be given the appropriate permissions to connect to AWS Systems Manager, and to write the configuration data to the AWS Systems Manager Parameter Store. These access rights are controlled through the use of IAM roles.

Create two IAM roles

IAM roles are identity objects to which you attach IAM policies. IAM policies define what access is allowed to AWS services. You can have users, services, or applications assume the IAM roles and get the assigned rights defined in the permissions policies.

To use System Manager, you typically create two IAM roles. The first role has permissions to write the CloudWatch agent configuration information to System Manager Parameter Store. This role is called CloudWatchAgentAdminRole.

The second role only has permissions to read the CloudWatch agent configuration from the System Manager Parameter Store. This role is called CloudWatchAgentServerRole.

For more details on creating these roles, please see the documentation on Create IAM Roles and Users for Use with the CloudWatch Agent.
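
As a hedged sketch, the server role can also be created from the CLI using the AWS managed policy CloudWatchAgentServerPolicy; the admin role is created the same way with CloudWatchAgentAdminPolicy. Attaching AmazonSSMManagedInstanceCore as well (shown below) is a common choice so that Systems Manager can manage the instance, but follow the linked documentation for the authoritative steps.

# Trust policy allowing EC2 to assume the role (local file you create)
cat > ec2-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "ec2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role --role-name CloudWatchAgentServerRole \
  --assume-role-policy-document file://ec2-trust.json
aws iam attach-role-policy --role-name CloudWatchAgentServerRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam attach-role-policy --role-name CloudWatchAgentServerRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore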

Attach the IAM roles to the EC2 instances

Once you create the roles, you attach them to your Amazon EC2 instances. By attaching the IAM roles to the EC2 instances, you provide the processes running on the EC2 instance the permissions defined in the IAM role. In this blog, you create two Amazon EC2 instances. Attach the CloudWatchAgentAdminRole to the first instance that is used to create the CloudWatch agent configuration. Attach CloudWatchAgentServerRole to the second instance and any other instances that you want to monitor. For details on how to attach or assign roles to EC2 instances, please see the documentation on How do I assign an existing IAM role to an EC2 instance?.
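
From the CLI, attaching the role looks similar to the following sketch; the instance ID is a placeholder, and the command assumes an instance profile with the same name as the role (the console creates one automatically when you create the role for EC2).

aws ec2 associate-iam-instance-profile \
  --instance-id i-0123456789abcdef0 \
  --iam-instance-profile Name=CloudWatchAgentServerRole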

Install the CloudWatch agent

Now that you have set up the permissions, you can install the CloudWatch agent onto the servers that you want to monitor. For details on installing the CloudWatch agent using Systems Manager, please see the documentation on Download and Configure the CloudWatch Agent.
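
As a hedged example, the following Run Command invocation installs the agent on a single managed instance; the instance ID is a placeholder and the instance must already be registered with Systems Manager.

aws ssm send-command \
  --document-name "AWS-ConfigureAWSPackage" \
  --instance-ids i-0123456789abcdef0 \
  --parameters '{"action":["Install"],"name":["AmazonCloudWatchAgent"]}'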

Create the CloudWatch agent configuration

Now that you have installed the CloudWatch agent on your server, run the CloudWatch agent configuration wizard to create the agent configuration. For instructions on how to run the CloudWatch agent configuration wizard, please see this documentation on Create the CloudWatch Agent Configuration File with the Wizard. To establish a command shell on the server, you can use AWS Systems Manager Session Manager to establish a session to the server and then run the CloudWatch agent configuration wizard. If you want to monitor both Linux and Windows servers, you must run the CloudWatch agent configuration on a Linux instance and on a Windows instance to create a configuration file per OS type. The configuration is unique to the OS type.

To run the Agent configuration wizard on Linux instances, run the following command:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

To run the Agent configuration wizard on Windows instances, run the following commands:

cd "C:\Program Files\Amazon\AmazonCloudWatchAgent"

amazon-cloudwatch-agent-config-wizard.exe

Note for Linux instances: do not select to collect the collectd metrics in the agent configuration wizard unless you have collectd installed on your Linux servers. Otherwise, you may encounter an error.

Review the Agent configuration

The CloudWatch agent configuration generated from the wizard is stored in Systems Manager Parameter Store. You can review and modify this configuration if you need to capture extra metrics. To review the agent configuration, perform the following steps:

  1. Go to the console for the System Manager service.
  2. Click Parameter store on the left hand navigation.
  3. You should see the parameter that was created by the CloudWatch agent configuration program. For Linux servers, the configuration is stored in AmazonCloudWatch-linux, and for Windows servers, the configuration is stored in AmazonCloudWatch-windows.

System Manager Parameter Store: Parameters created by CloudWatch agent configuration wizard

  4. Click on the parameter’s hyperlink (for example, AmazonCloudWatch-linux) to see all the configuration parameters that you specified in the configuration program.

In the following steps, I walk through an example of modifying the Windows configuration parameter (AmazonCloudWatch-windows) to add an additional metric (“Available Mbytes”) to monitor.

  1. Click the AmazonCloudWatch-windows parameter.
  2. In the parameter overview, scroll down to the “metrics” section and under “metrics_collected”, you can see the Windows performance monitor counters that will be gathered by the CloudWatch agent. If you want to add an additional perfmon counter, then you can edit and add the counter here.
  3. Press Edit at the top right of the AmazonCloudWatch-windows Parameter Store page.
  4. Scroll down in the Value section and look for “Memory.”
  5. After the “% Committed Bytes In Use” entry, put a comma “,” and then press Enter to add a blank line. Then, put “Available Mbytes” on that line. The following screenshot demonstrates what this configuration should look like.

AmazonCloudWatch-windows parameter contents and how to add a new metric to monitor

  6. Press Save Changes.

To modify the Linux configuration parameter (AmazonCloudWatch-linux), you perform similar steps except you click on the AmazonCloudWatch-linux parameter. Here is additional documentation on creating the CloudWatch agent configuration and modifying the configuration file.

Start the CloudWatch agent and use the configuration

In this step, start the CloudWatch agent and instruct it to use your agent configuration stored in System Manager Parameter Store.

  1. Open another tab in your web browser and go to System Manager console.
  2. Specify Run Command in the left hand navigation of the System Manager console.
  3. Press Run Command
  4. In the search bar,
    • Select Document name prefix
    • Select Equal
    • Specify AmazonCloudWatch (Note the field is case sensitive)
    • Press enter

System Manager Run Command's command document entry field

  5. Select AmazonCloudWatch-ManageAgent. This is the command that configures the CloudWatch agent.
  6. In the command parameters section,
    • For Action, select Configure
    • For Mode, select ec2
    • For Optional Configuration Source, select ssm
    • For optional configuration location, specify the Parameter Store name. For Windows instances, you would specify AmazonCloudWatch-windows for Windows instances or AmazonCloudWatch-linux for Linux instances. Note the field is case sensitive. This tells the command to read the Parameter Store for the parameter specified here.
    • For optional restart, leave yes
  7. For Targets, choose the target servers that you wish to monitor.
  8. Scroll down and press Run. The Run Command may take a couple of minutes to complete. Press the refresh button. The Run Command configures the CloudWatch agent by reading the configuration from the Parameter Store and configuring the agent with those settings.

For more details on installing the CloudWatch agent using your agent configuration, please see this Installing the CloudWatch Agent on EC2 Instances Using Your Agent Configuration.
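
On a single Linux instance you can achieve the same result locally with the agent control script, assuming the configuration was stored under the AmazonCloudWatch-linux parameter name used earlier:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -c ssm:AmazonCloudWatch-linux -s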

Review the data collected by the CloudWatch agents

In this step, I walk through how to review the data collected by the CloudWatch agents.

  1. In the AWS Management console, go to CloudWatch.
  2. Click Metrics on the left-hand navigation.
  3. You should see a custom namespace for CWAgent. Click on the CWAgent namespace. Please note that this might take a couple of minutes to appear. Refresh the page periodically until it appears.
  4. Then click the ImageId, Instanceid hyperlinks to see the counters under that section.

CloudWatch Metrics: Showing counters under CWAgent

  5. Review the metrics captured by the CloudWatch agent. Notice the metrics that are only observable from inside the instance (for example, LogicalDisk % Free Space). These types of metrics would not be observable without installing the CloudWatch agent on the instance. From these metrics, you could create a CloudWatch Alarm to alert you if they go beyond a certain threshold (see the example after this step). You can also add them to a CloudWatch Dashboard to review. To learn more about the metrics collected by the CloudWatch agent, see the documentation Metrics Collected by the CloudWatch Agent.
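
For example, a hedged sketch of an alarm on low free disk space follows. The metric and dimension names depend on what your agent configuration collects (agent metrics often carry additional dimensions such as the disk or object name), so check the CWAgent namespace in the CloudWatch console for the exact names before creating the alarm; the instance ID is a placeholder.

aws cloudwatch put-metric-alarm \
  --alarm-name low-disk-space-example \
  --namespace CWAgent \
  --metric-name "LogicalDisk % Free Space" \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator LessThanThreshold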

Conclusion

In this blog, you learned how to deploy and configure the CloudWatch agent to capture the metrics on either Linux or Windows instances. If you are done with this blog, we recommend deleting the Systems Manager Parameter Store entry, the CloudWatch data, and then the EC2 instances to avoid further charges. If you would like a video tutorial of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.


Deploying CIS Level 1 hardened AMIs with Amazon EC2 Image Builder

Post Syndicated from Joseph Keating original https://aws.amazon.com/blogs/devops/deploying-cis-level-1-hardened-amis-with-amazon-ec2-image-builder/

The NFL, an AWS Professional Services partner, is collaborating with NFL’s Player Health and Safety team to build the Digital Athlete Program. The Digital Athlete Program is working to drive progress in the prevention, diagnosis, and treatment of injuries; enhance medical protocols; and further improve the way football is taught and played. The NFL, in conjunction with AWS Professional Services, delivered an Amazon EC2 Image Builder pipeline for automating the production of Amazon Machine Images (AMIs). Following similar practices from the Digital Athlete Program, this post demonstrates how to deploy an automated Image Builder pipeline.

“AWS Professional Services faced unique environment constraints, but was able to deliver a modular pipeline solution leveraging EC2 Image Builder. The framework serves as a foundation to create hardened images for future use cases. The team also provided documentation and knowledge transfer sessions to ensure our team was set up to successfully manage the solution.”

—Joseph Steinke, Director, Data Solutions Architect, National Football League

A common scenario AWS customers face is how to build processes that configure secure AWS resources that can be leveraged throughout the organization. You need to move fast in the cloud without compromising security best practices. Amazon Elastic Compute Cloud (Amazon EC2) allows you to deploy virtual machines in the AWS Cloud. EC2 AMIs provide the configuration utilized to launch an EC2 instance. You can use AMIs for several use cases, such as configuring applications, applying security policies, and configuring development environments. Developers and system administrators can deploy configuration AMIs to bring up EC2 resources that require little-to-no setup. Often times, multiple patterns are adopted for building and deploying AMIs. Because of this, you need the ability to create a centralized, automated pattern that can output secure, customizable AMIs.

In this post, we demonstrate how to create an automated process that builds and deploys Center for Internet Security (CIS) Level 1 hardened AMIs. The pattern that we deploy includes Image Builder, a CIS Level 1 hardened AMI, an application running on EC2 instances, and Amazon Inspector for security analysis. You deploy the AMI configured with the Image Builder pipeline to an application stack. The application stack consists of EC2 instances running Nginx. Lastly, we show you how to re-hydrate your application stack with a new AMI utilizing AWS CloudFormation and Amazon EC2 launch templates. You use Amazon Inspector to scan the EC2 instances launched from the Image Builder-generated AMI against the CIS Level 1 Benchmark.

After going through this exercise, you should understand how to build, manage, and deploy AMIs to an application stack. The infrastructure deployed with this pipeline includes a basic web application, but you can use this pattern to fit many needs. After running through this post, you should feel comfortable using this pattern to configure an AMI pipeline for your organization.

The project we create in this post addresses the following use case: you need a process for building and deploying CIS Level 1 hardened AMIs to an application stack running on Amazon EC2. In addition to demonstrating how to deploy the AMI pipeline, we also illustrate how to refresh a running application stack with a new AMI. You learn how to deploy this configuration with the AWS Command Line Interface (AWS CLI) and AWS CloudFormation.

AWS services used
Image Builder allows you to develop an automated workflow for creating AMIs to fit your organization’s needs. You can streamline the creation and distribution of secure images, automate your patching process, and define security and application configuration into custom AWS AMIs. In this post, you use the following AWS services to implement this solution:

  • AWS CloudFormation – AWS CloudFormation allows you to use domain-specific languages or simple text files to model and provision, in an automated and secure manner, all the resources needed for your applications across all Regions and accounts. You can deploy AWS resources in a safe, repeatable manner, and automate the provisioning of infrastructure.
  • AWS KMS – AWS Key Management Service (AWS KMS) is a fully managed service for creating and managing cryptographic keys. These keys are natively integrated with most AWS services. You use a KMS key in this post to encrypt resources.
  • Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service utilized for storing and encrypting data. We use Amazon S3 to store our configuration files.
  • AWS Auto Scaling – AWS Auto Scaling allows you to build scaling plans that automate how groups of different resources respond to changes in demand. You can optimize availability, costs, or a balance of both. We use Auto Scaling to manage Nginx on Amazon EC2.
  • Launch templates – Launch templates contain configurations such as AMI ID, instance type, and security group. Launch templates enable you to store launch parameters so that they don’t have to be specified every time instances are launched.
  • Amazon Inspector – This automated security assessment service improves the security and compliance of applications deployed on AWS. Amazon Inspector automatically assesses applications for exposures, vulnerabilities, and deviations from best practices.

Architecture overview
We use Ansible as a configuration management component alongside Image Builder. The CIS Ansible Playbook applies a Level 1 set of rules to the local host on which the AMI is provisioned. For more information about the Ansible Playbook, see the GitHub repo. Image Builder offers AMIs with Security Technical Implementation Guide (STIG) levels low through high as part of its pipeline build.

The following diagram depicts the phases of the Image Builder pipeline for building a Nginx web server. The numbers 1–6 represent the order of when each phase runs in the build process:

  1. Source
  2. Build components
  3. Validate
  4. Test
  5. Distribute
  6. AMI

Figure: Shows the EC2 Image Builder steps

The workflow includes the following steps:

  1. Deploy the CloudFormation templates.
  2. The template creates an Image Builder pipeline.
  3. AWS Systems Manager completes the AMI build process.
  4. Amazon EC2 starts an instance to build the AMI.
  5. Systems Manager starts a test instance build after the first build is successful.
  6. The AMI starts provisioning.
  7. The Amazon Inspector CIS benchmark starts.

CloudFormation templates
You deploy the following CloudFormation templates. These CloudFormation templates have a great deal of configurations. They deploy the following resources:

  • vpc.yml – Contains all the core networking configuration. It deploys the VPC, two private subnets, two public subnets, and the route tables. The private subnets utilize a NAT gateway to communicate to the internet. The public subnets have full outbound access via the internet gateway.
  • kms.yml – Contains the AWS KMS configuration that we use for encrypting resources. The KMS key policy is also configured in this template.
  • s3-iam-config.yml – Contains the launch configuration and autoscaling groups for the initial Nginx launch. For updates and patching to Nginx, we use Image Builder to build those changes.
  • infrastructure-ssm-params.yml – Contains the Systems Manager parameter store configuration. The parameters are populated by using outputs from other CloudFormation templates.
  • nginx-config.yml – Contains the configuration for Nginx. Additionally, this template contains the network load balancer, target groups, security groups, and EC2 instance AWS Identity and Access Management (IAM) roles.
  • nginx-image-builder.yml – Contains the configuration for the Image Builder pipeline that we use to build AMIs.

Prerequisites
To follow the steps to provision the pipeline deployment, you must have the following prerequisites:

Deploying the CloudFormation templates
To deploy your templates, complete the following steps:

1. Clone the source code repository found in the following location:

git clone https://github.com/aws-samples/deploy-cis-level-1-hardened-ami-with-ec2-image-builder-pipeline.git

You now use the AWS CLI to deploy the CloudFormation templates. Make sure to leave the CloudFormation template names as we have written in this post.

2. Deploy the VPC CloudFormation template:

aws cloudformation create-stack \
--stack-name vpc-config \
--template-body file://Templates/vpc.yml \
--parameters file://Parameters/vpc-params.json  \
--capabilities CAPABILITY_IAM \
--region us-east-1

The output should look like the following code:

{

    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/vpc-config/7faaab30-247f-11eb-8712-0e65b6fb18f9"
}

 

3. Open the Parameters/kms-params.json file and update the UserARN parameter with your account ID:

[
  {
      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
  },
  {
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::<input_your_account_id>:root"
  }
]

 

4. Deploy the KMS key CloudFormation template:

aws cloudformation create-stack \
--stack-name kms-config \
--template-body file://Templates/kms.yml \
--parameters file://Parameters/kms-params.json \
--capabilities CAPABILITY_IAM \
--region us-east-1

The output should look like the following:

{
"StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/kms-config/f65aca80-08ff-11eb-8795-12275bc6e1ef"
}

 

5. Open the Parameters/s3-iam-config.json file and update the DemoConfigS3BucketName parameter to a unique name of your choosing:

[
  {
    "ParameterKey" : "Environment",
    "ParameterValue" : "dev"
  },
  {
    "ParameterKey": "NetworkStackName",
    "ParameterValue" : "vpc-config"
  },
  {
    "ParameterKey" : "KMSStackName",
    "ParameterValue" : "kms-config"
  },
  {
    "ParameterKey": "DemoConfigS3BucketName",
    "ParameterValue" : "<input_your_unique_bucket_name>"
  },
  {
    "ParameterKey" : "EC2InstanceRoleName",
    "ParameterValue" : "EC2InstanceRole"
  }
]

 

6. Deploy the IAM role configuration template:

aws cloudformation create-stack \
--stack-name s3-iam-config \
--template-body file://Templates/s3-iam-config.yml \
--parameters file://Parameters/s3-iam-config.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The output should look like the following:

{
"StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/s3-iam-config/9be9f990-0909-11eb-811c-0a78092beb51"
}

 

Configuring IAM roles and policies

This solution uses a couple of service-linked roles. Let’s generate these roles using the AWS CLI.

 

1. Run the following commands:

aws iam create-service-linked-role --aws-service-name autoscaling.amazonaws.com
aws iam create-service-linked-role --aws-service-name imagebuilder.amazonaws.com

If you see a message similar to following code, it means that you already have the service-linked role created in your account and you can move on to the next step:

An error occurred (InvalidInput) when calling the CreateServiceLinkedRole operation: Service role name AWSServiceRoleForImageBuilder has been taken in this account, please try a different suffix.

Now that you have generated the IAM roles used in this post, you add them to the KMS key policy. This allows the roles to encrypt and decrypt the KMS key.

 

2. Open the Parameters/kms-params.json file:

[
  {
      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
  },
  {
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::12345678910:root"
  }
]

 

3. Add the following values as a comma-separated list to the UserARN parameter key:

arn:aws:iam::<input_your_aws_account_id>:role/EC2InstanceRole
arn:aws:iam::<input_your_aws_account_id>:role/EC2ImageBuilderRole
arn:aws:iam::<input_your_aws_account_id>:role/NginxS3PutLambdaRole
arn:aws:iam::<input_your_aws_account_id>:role/aws-service-role/imagebuilder.amazonaws.com/AWSServiceRoleForImageBuilder
arn:aws:iam::<input_your_aws_account_id>:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling

 

When finished, the file should look similar to the following:

[
  {
      "ParameterKey": "KeyName",
      "ParameterValue": "DemoKey"
  },
  {
    "ParameterKey": "UserARN",
    "ParameterValue": "arn:aws:iam::123456789012:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling,arn:aws:iam::<input_your_aws_account_id>:role/NginxS3PutLambdaRole,arn:aws:iam::123456789012:role/aws-service-role/imagebuilder.amazonaws.com/AWSServiceRoleForImageBuilder,arn:aws:iam::12345678910:role/EC2InstanceRole,arn:aws:iam::12345678910:role/EC2ImageBuilderRole,arn:aws:iam::12345678910:root"
  }
]

Updating the CloudFormation stack

Now that the AWS KMS parameter file has been updated, you update the AWS KMS CloudFormation stack.

1. Run the following command to update the kms-config stack:

aws cloudformation update-stack \
--stack-name kms-config \
--template-body file://Templates/kms.yml \
--parameters file://Parameters/kms-params.json \
--capabilities CAPABILITY_IAM \
--region us-east-1

 

The output should look like the following:

{
"StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/kms-config/6e84b750-0905-11eb-b543-0e4dccb471bf"
}

 

2. Open the AnsibleConfig/component-nginx.yml file and update the <input_s3_bucket_name> value with the bucket name you generated from the s3-iam-config stack:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
constants:
  - s3bucket:
      type: string
      value: <input_s3_bucket_name>
phases:
  - name: build
    steps:
      - name: InstallAnsible
        action: ExecuteBash
        inputs:
          commands:
           - sudo amazon-linux-extras install -y ansible2
      - name: CreateDirectory
        action: ExecuteBash
        inputs:
          commands:
            - sudo mkdir -p /ansibleloc/roles
      - name: DownloadLinuxCis
        action: S3Download
        inputs:
          - source: 's3://{{ s3bucket }}/components/linux-cis.zip'
            destination: '/ansibleloc/linux-cis.zip'
      - name: UzipLinuxCis
        action: ExecuteBash
        inputs:
          commands:
            - unzip /ansibleloc/linux-cis.zip -d /ansibleloc/roles
            - echo "unzip linux-cis file"
      - name: DownloadCisPlaybook
        action: S3Download
        inputs:
          - source: 's3://{{ s3bucket }}/components/cis_playbook.yml'
            destination: '/ansibleloc/cis_playbook.yml'
      - name: InvokeCisAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{ build.DownloadCisPlaybook.inputs[0].destination }}'
            - '--tags=level1'
      - name: DeleteCisPlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{ build.DownloadCisPlaybook.inputs[0].destination }}'
      - name: DownloadNginx
        action: S3Download
        inputs:
          - source: 's3://{{ s3bucket }}/components/nginx.zip'
            destination: '/ansibleloc/nginx.zip'
      - name: UzipNginx
        action: ExecuteBash
        inputs:
          commands:
            - unzip /ansibleloc/nginx.zip -d /ansibleloc/roles
            - echo "unzip Nginx file"
      - name: DownloadNginxPlaybook
        action: S3Download
        inputs:
          - source: 's3://{{ s3bucket }}/components/nginx_playbook.yml'
            destination: '/ansibleloc/nginx_playbook.yml'
      - name: InvokeNginxAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{ build.DownloadNginxPlaybook.inputs[0].destination }}'
      - name: DeleteNginxPlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{ build.DownloadNginxPlaybook.inputs[0].destination }}'

  - name: validate
    steps:
      - name: ValidateDebug
        action: ExecuteBash
        inputs:
          commands:
            - sudo echo "ValidateDebug section"

  - name: test
    steps:
      - name: TestDebug
        action: ExecuteBash
        inputs:
          commands:
            - sudo echo "TestDebug section"
      - name: Download_Inspector_Test
        action: S3Download
        inputs:
          - source: 's3://ec2imagebuilder-managed-resources-us-east-1-prod/components/inspector-test-linux/1.0.1/InspectorTest'
            destination: '/workdir/InspectorTest'
      - name: Set_Executable_Permissions
        action: ExecuteBash
        inputs:
          commands:
            - sudo chmod +x /workdir/InspectorTest
      - name: ExecuteInspectorAssessment
        action: ExecuteBinary
        inputs:
          path: '/workdir/InspectorTest'

 

Adding files to your S3 buckets

Now you assume a role you generated from one of the previous CloudFormation stacks. This allows you to add files to the encrypted S3 bucket.

1. Run the following command and make sure to update the role to use your AWS account ID number:

aws sts assume-role --role-arn "arn:aws:iam::<input_your_aws_account_id>:role/EC2ImageBuilderRole" --role-session-name AWSCLI-Session

You see an output similar to the following:

{
    "Credentials": {
        "AccessKeyId": "<AWS_ACCESS_KEY_ID>",
        "SecretAccessKey": "<AWS_SECRET_ACCESS_KEY_ID>",
        "SessionToken": "<AWS_SESSION_TOKEN>",
        "Expiration": "2020-11-20T02:54:17Z"
    },
    "AssumedRoleUser": {
        "AssumedRoleId": "ACPATGCCLSNJCNSJCEWZ:AWSCLI-Session",
        "Arn": "arn:aws:sts::123456789012:assumed-role/EC2ImageBuilderRole/AWSCLI-Session"
    }
}

You have now assumed the EC2ImageBuilderRole IAM role from the command line. This role allows you to create objects in the S3 bucket generated from the s3-iam-config stack. Because this bucket is encrypted with AWS KMS, any user or IAM role requires specific permissions to decrypt the key. You have already accounted for this in a previous step by adding the EC2ImageBuilderRole IAM role to the KMS key policy.

 

2. Create the following environment variables to use the EC2ImageBuilderRole role. Update the values with the corresponding values from the output of the previous step:

export AWS_ACCESS_KEY_ID=AccessKeyId
export AWS_SECRET_ACCESS_KEY=SecretAccessKey
export AWS_SESSION_TOKEN=SessionToken
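
If you prefer not to copy and paste the values, you can also populate these variables directly from the assume-role output. The following is a minimal sketch that assumes a POSIX shell and the same role ARN used above:

CREDS=$(aws sts assume-role \
--role-arn "arn:aws:iam::<input_your_aws_account_id>:role/EC2ImageBuilderRole" \
--role-session-name AWSCLI-Session \
--query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
--output text)

# The query returns the three values tab-separated on one line; split them into the variables
export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | awk '{print $1}')
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | awk '{print $2}')
export AWS_SESSION_TOKEN=$(echo "$CREDS" | awk '{print $3}')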

 

3. Check to make sure that you have actually assumed the role EC2ImageBuilderRole:

aws sts get-caller-identity

You should see an output similar to the following:

{
    "UserId": "AROATG5CKLSWENUYOF6A4:AWSCLI-Session",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/EC2ImageBuilderRole/AWSCLI-Session"
}

 

4. Create a folder inside of the encrypted S3 bucket generated in the s3-iam-config stack:

aws s3api put-object --bucket <input_your_bucket_name> --key components/

 

5. Zip the configuration files that you use in the Image Builder pipeline process:

zip -r linux-cis.zip LinuxCis/
zip -r nginx.zip Nginx/

 

6. Upload the configuration files to the S3 bucket. Update the bucket name with the S3 bucket name you generated in the s3-iam-config stack:

aws s3 cp linux-cis.zip s3://<input_your_bucket_name>/components/

aws s3 cp nginx.zip s3://<input_your_bucket_name>/components/

aws s3 cp AnsibleConfig/cis_playbook.yml s3://<input_your_bucket_name>/components/

aws s3 cp AnsibleConfig/nginx_playbook.yml s3://<input_your_bucket_name>/components/

aws s3 cp AnsibleConfig/component-nginx.yml s3://<input_your_bucket_name>/components/

Deploying your pipeline

You’re now ready to deploy your pipeline.

1. Switch back to the original IAM user profile you used before assuming the EC2ImageBuilderRole. For instructions, see How do I assume an IAM role using the AWS CLI?
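
If you exported the temporary credentials as environment variables in the previous section, one simple way to switch back is to clear those variables so the AWS CLI falls back to your original profile. A minimal sketch:

unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# Confirm that you are no longer using the EC2ImageBuilderRole
aws sts get-caller-identity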

 

2. Open the Parameters/nginx-image-builder-params.json file and update the ImageBuilderBucketName parameter with the S3 bucket name generated in the s3-iam-config stack:

[
  {
    "ParameterKey": "Environment",
    "ParameterValue": "dev"
  },
  {
    "ParameterKey": "ImageBuilderBucketName",
    "ParameterValue": "<input_your_bucket_name>"
  },
  {
    "ParameterKey": "NetworkStackName",
    "ParameterValue": "vpc-config"
  },
  {
    "ParameterKey": "KMSStackName",
    "ParameterValue": "kms-config"
  },
  {
    "ParameterKey": "S3ConfigStackName",
    "ParameterValue": "s3-iam-config"
  }
]

 

3. Deploy the nginx-image-builder.yml template:

aws cloudformation create-stack \
--stack-name cis-image-builder \
--template-body file://Templates/nginx-image-builder.yml \
--parameters file://Parameters/nginx-image-builder-params.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The template takes around 35 minutes to complete. Deploying this template starts the Image Builder pipeline.
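
If you would rather watch the deployment from the command line than from the console, you can block until the stack finishes. The following sketch uses the AWS CLI wait command with the same stack name and Region:

aws cloudformation wait stack-create-complete \
--stack-name cis-image-builder \
--region us-east-1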

 

Monitoring the pipeline

You can get more details about the pipeline on the AWS Management Console.

1. On the Image Builder console, choose Image pipelines to see the status of the pipeline.

Figure: Shows the EC2 Image Builder Pipeline status

 

2. Choose the pipeline (for this post, cis-image-builder-LinuxCis-Pipeline).

On the pipeline details page, you can view more information and make updates to its configuration.

Figure: Shows the Image Builder Pipeline metadata

At this point, the Image Builder pipeline has started running the automation document in Systems Manager. Here you can monitor the progress of the AMI build.

 

3. On the Systems Manager console, choose Automation.

 

4. Choose the execution ID of the arn:aws:ssm:us-east-1:123456789012:document/ImageBuilderBuildImageDocument document.

Figure: Shows the Image Builder Pipeline Systems Manager Automation steps

 

5. Choose the step ID to see what is happening in each step.

At this point, the Image Builder pipeline is bringing up an Amazon Linux 2 EC2 instance. From there, we run Ansible playbooks that configure the security and application settings. The automation is pulling its configuration from the S3 bucket you deployed in a previous step. When the Ansible run is complete, the instance stops and an AMI is generated from this instance. When this is complete, a cleanup is initiated that ends the EC2 instance. The final result is a CIS Level 1 hardened Amazon Linux 2 AMI running Nginx.

 

Updating parameters

When the stack is complete, you retrieve some new parameter values.

1. On the Systems Manager console, choose Automation.

 

2. Choose the execution ID of the arn:aws:ssm:us-east-1:123456789012:document/ImageBuilderBuildImageDocument document.

 

3. Choose step 21.

The following screenshot shows the output of this step.

Figure: Shows step of EC2 Image Builder Pipeline

 

4. Open the Parameters/nginx-config.json file and update the AmiId parameter with the AMI ID generated from the previous step:

[
  {
    "ParameterKey": "Environment",
    "ParameterValue": "dev"
  },
  {
    "ParameterKey": "NetworkStackName",
    "ParameterValue": "vpc-config"
  },
  {
    "ParameterKey": "S3ConfigStackName",
    "ParameterValue": "s3-iam-config"
  },
  {
    "ParameterKey": "AmiId",
    "ParameterValue": "<input_the_cis_hardened_ami_id>"
  },
  {
    "ParameterKey": "ApplicationName",
    "ParameterValue": "Nginx"
  },
  {
    "ParameterKey": "NLBName",
    "ParameterValue": "DemoALB"
  },
  {
    "ParameterKey": "TargetGroupName",
    "ParameterValue": "DemoTG"
  }
]

 

5. Deploy the nginx-config.yml template:

aws cloudformation create-stack \
--stack-name nginx-config \
--template-body file://Templates/nginx-config.yml \
--parameters file://Parameters/nginx-config.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

The output should look like the following:

{
    "StackId": "arn:aws:cloudformation:us-east-1:123456789012:stack/nginx-config/fb2b0f30-24f6-11eb-ad7c-0a3238f55eb3"
}

 

6. Deploy the infrastructure-ssm-params.yml template:

aws cloudformation create-stack \
--stack-name ssm-params-config \
--template-body file://Templates/infrastructure-ssm-params.yml \
--parameters file://Parameters/infrastructure-ssm-parameters.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

 

Verifying Nginx is running

Let’s verify that our Nginx service is up and running properly. You use Session Manager to connect to a testing instance.

1. On the Amazon EC2 console, choose Instances.

You should see three instances, as in the following screenshot.

Figure: Shows the Nginx EC2 instances

You can connect to either one of the Nginx instances.

 

2. Select the testing instance.

 

3. On the Actions menu, choose Connect.

 

4. Choose Session Manager.

 

5. Choose Connect.

A terminal on the EC2 instance opens, similar to the following screenshot.

Figure: Shows the Session Manager terminal

6. Run the following command to ensure that Nginx is running properly:

curl localhost:8080

You should see an output similar to the following screenshot.

Figure: Shows Nginx output from terminal
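
You can also confirm the service state directly from the same Session Manager terminal. The following sketch assumes the Ansible role installed Nginx as a systemd service listening on port 8080 (both the service name and the port are assumptions based on the curl test above):

# Check the service state and confirm the listening port
sudo systemctl status nginx
sudo ss -tlnp | grep 8080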

Reviewing resources and configurations

Now that you have deployed the core services for the solution, take some time to review the services that you have just deployed.

 

IAM roles

This project creates several IAM roles that are used to manage AWS resources. For example, EC2ImageBuilderRole is used to configure new AMIs with the Image Builder pipeline. This role contains only the permissions required to manage the Image Builder process. Adopting this pattern enforces the practice of least privilege. Additionally, many of the IAM policies attached to the IAM roles are restricted to specific AWS resources. Let’s look at a couple of examples of managing IAM permissions with this project.

 

The following policy restricts Amazon S3 access to a specific S3 bucket. This makes sure that the role this policy is attached to can only access this specific S3 bucket. If this role needs to access any additional S3 buckets, the resource has to be explicitly added.

Policies:
  - PolicyName: GrantS3Read
    PolicyDocument:
      Statement:
        - Sid: GrantS3Read
          Effect: Allow
          Action:
            - s3:List*
            - s3:Get*
            - s3:Put*
          Resource: !Sub 'arn:aws:s3:::${S3Bucket}*'

Let’s look at the EC2ImageBuilderRole. A common scenario that occurs is when you need to assume a role locally in order to perform an action on a resource. In this case, because you’re using AWS KMS to encrypt the S3 bucket, you need to assume a role that has access to decrypt the KMS key so that artifacts can be uploaded to the S3 bucket. In the following AssumeRolePolicyDocument, we allow the Amazon EC2, Systems Manager, and EC2 Image Builder services to assume this role. Additionally, we allow IAM users in the account to assume this role as well.

AssumeRolePolicyDocument:
  Version: 2012-10-17
  Statement:
    - Effect: Allow
      Principal:
        Service:
          - ec2.amazonaws.com
          - ssm.amazonaws.com
          - imagebuilder.amazonaws.com
        AWS: !Sub 'arn:aws:iam::${AWS::AccountId}:root'
      Action:
        - sts:AssumeRole

The principal !Sub 'arn:aws:iam::${AWS::AccountId}:root' allows any IAM user in this account to assume this role locally. Normally, this role should be scoped down to specific IAM users or roles. For the purposes of this post, we grant permissions to all users of the account.

 

Nginx configuration

The AMI built from the Image Builder pipeline contains all of the application and security configurations required to run Nginx as a web application. When an instance is launched from this AMI, no additional configuration is required.

We use Amazon EC2 launch templates to configure the application stack. The launch templates contain information such as the AMI ID, instance type, and security group. When a new AMI is provisioned, you simply update the launch template CloudFormation parameter with the new AMI and update the CloudFormation stack. From here, you can start an Auto Scaling group instance refresh to update the application stack to use the new AMI. The Auto Scaling group is updated with instances running on the updated AMI by bringing down one instance at a time and replacing it.
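
For example, after the stack update has pushed the new AMI ID into the launch template, you could trigger the rolling replacement from the AWS CLI. This is a sketch only; the Auto Scaling group name and the preference values are placeholders, not values defined by this solution:

aws autoscaling start-instance-refresh \
--auto-scaling-group-name <your_auto_scaling_group_name> \
--preferences MinHealthyPercentage=90,InstanceWarmup=120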

 

Amazon Inspector configuration

Amazon Inspector is an automated security assessment service that helps improve the security and compliance of applications deployed on AWS. With Amazon Inspector, assessments are generated for exposure, vulnerabilities, and deviations from best practices.

After performing an assessment, Amazon Inspector produces a detailed list of security findings prioritized by level of severity. These findings can be reviewed directly or as part of detailed assessment reports that are available via the Amazon Inspector console or API. We can use Amazon Inspector to assess our security posture against the CIS Level 1 standard that we use our Image Builder pipeline to provision. Let’s look at how we configure Amazon Inspector.

A resource group defines a set of tags that, when queried, identify the AWS resources that make up the assessment target. Any EC2 instance that is launched with the tag specified in the resource group is in scope for Amazon Inspector assessment runs. The following code shows our configuration:

ResourceGroup:
  Type: "AWS::Inspector::ResourceGroup"
  Properties:
    ResourceGroupTags:
      - Key: "ResourceGroup"
        Value: "Nginx"

AssessmentTarget:
  Type: AWS::Inspector::AssessmentTarget
  Properties:
    AssessmentTargetName : "NginxAssessmentTarget"
    ResourceGroupArn : !Ref ResourceGroup

In the following code, we specify the tag set in the resource group, which makes sure that when an instance is launched from this AMI, it’s under the scope of Amazon Inspector:

IBImage:
  Type: AWS::ImageBuilder::Image
  Properties:
    ImageRecipeArn: !Ref Recipe
    InfrastructureConfigurationArn: !Ref Infrastructure
    DistributionConfigurationArn: !Ref Distribution
    ImageTestsConfiguration:
      ImageTestsEnabled: false
      TimeoutMinutes: 60
    Tags:
      ResourceGroup: 'Nginx'

 

Building and deploying a new image with Amazon Inspector tests enabled

For this final portion of this post, we build and deploy a new AMI with an Amazon Inspector evaluation.

1. In your text editor, open Templates/nginx-image-builder.yml and update the pipeline and IBImage resource property ImageTestsEnabled to true.

The updated configuration should look like the following:

IBImage:
  Type: AWS::ImageBuilder::Image
  Properties:
    ImageRecipeArn: !Ref Recipe
    InfrastructureConfigurationArn: !Ref Infrastructure
    DistributionConfigurationArn: !Ref Distribution
    ImageTestsConfiguration:
      ImageTestsEnabled: true
      TimeoutMinutes: 60
    Tags:
      ResourceGroup: 'Nginx'

 

2. Update the stack with the new configuration:

aws cloudformation update-stack \
--stack-name cis-image-builder \
--template-body file://Templates/nginx-image-builder.yml \
--parameters file://Parameters/nginx-image-builder-params.json \
--capabilities CAPABILITY_NAMED_IAM \
--region us-east-1

This starts a new AMI build with an Amazon Inspector evaluation. The process can take up to 2 hours to complete.

3. On the Amazon Inspector console, choose Assessment Runs.

Figure: Shows Amazon Inspector Assessment Run

4. Under Reports, choose Download report.

5. For Select report type, select Findings report.

6. For Select report format, select PDF.

7. Choose Generate report.

The following screenshot shows the findings report from the Amazon Inspector run.

This report generates an assessment against the CIS Level 1 standard. Any policies that don’t comply with the CIS Level 1 standard are explicitly called out in this report.

Section 3.1 lists any failed policies.

 

Figure: Shows Inspector findings

These failures are detailed later in the report, along with suggestions for remediation.

In section 4.1, locate the entry 1.3.2 Ensure filesystem integrity is regularly checked. This section shows the details of a failure from the Amazon Inspector findings report. You can also see suggestions on how to remediate the issue. Under Recommendation, the findings report suggests a specific command that you can use to remediate the issue.

 

Figure: Shows Inspector findings issue
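
For reference, remediation for this particular CIS item typically schedules a periodic file integrity check with AIDE. The following is a hypothetical cron entry of the kind the findings report suggests; verify the exact command against your own report before adding it to the Ansible playbook:

# Run an AIDE integrity check every day at 05:00 (hypothetical remediation)
0 5 * * * /usr/sbin/aide --check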

To remediate, you can simply update the Ansible playbooks with this setting, run the Image Builder pipeline again to build a new AMI, deploy the new AMI to an EC2 instance, and run the Amazon Inspector assessment again to confirm that the issue has been resolved. Finally, the report lists the specific instances that were assessed and that have this issue.

Organizations often customize security settings based on a given use case. Your organization may choose CIS Level 1 as a standard but elect not to apply all the recommendations. For example, you might choose not to use the FirewallD service on your Linux instances, because you feel that Amazon EC2 security groups provide enough network security that you don’t need an additional firewall. Disabling FirewallD causes a high severity failure in the Amazon Inspector report. This is expected and can be ignored when evaluating the report.

 

Conclusion
In this post, we showed you how to use Image Builder to automate the creation of AMIs. We also showed you how to use the AWS CLI to deploy CloudFormation stacks. Finally, we walked through how to evaluate resources against the CIS Level 1 standard using Amazon Inspector.

 

About the Authors

 

Joe Keating is a Modernization Architect in Professional Services at Amazon Web Services. He works with AWS customers to design and implement a variety of solutions in the AWS Cloud. Joe enjoys cooking with a glass or two of wine and achieving mediocrity on the golf course.

 

 

 

Virginia Chu is a Sr. Cloud Infrastructure Architect in Professional Services at Amazon Web Services. She works with enterprise-scale customers around the globe to design and implement a variety of solutions in the AWS Cloud.

 

Creating a cross-region Active Directory domain with AWS Launch Wizard for Microsoft Active Directory

Post Syndicated from AWS Admin original https://aws.amazon.com/blogs/compute/creating-a-cross-region-active-directory-domain-with-aws-launch-wizard-for-microsoft-active-directory/

AWS Launch Wizard is a console-based service to quickly and easily size, configure, and deploy third party applications, such as Microsoft SQL Server Always On and HANA based SAP systems, on AWS without the need to identify and provision individual AWS resources. AWS Launch Wizard offers an easy way to deploy enterprise applications and optimize costs. Instead of selecting and configuring separate infrastructure services, you go through a few steps in the AWS Launch Wizard and it deploys a ready-to-use application on your behalf. It reduces the time you need to spend investigating how to provision, estimate the cost of, and configure your application on AWS.

You can now use AWS Launch Wizard to deploy and configure self-managed Microsoft Windows Server Active Directory Domain Services running on Amazon Elastic Compute Cloud (EC2) instances. With Launch Wizard, you can have fully-functioning, production-ready domain controllers within a few hours—all without having to manually deploy and configure your resources.

You can use AWS Directory Service to run Microsoft Active Directory (AD) as a managed service, without the hassle of managing your own infrastructure. If you need to run your own AD infrastructure, you can use AWS Launch Wizard to simplify the deployment and configuration process.

In this post, I walk through creation of a cross-region Active Directory domain using Launch Wizard. First, I deploy a single Active Directory domain spanning two regions. Then, I configure Active Directory Sites and Services to match the network topology. Finally, I create a user account to verify replication of the Active Directory domain.

Diagram of Resources deployed in this post

Figure 1: Diagram of resources deployed in this post

Prerequisites

  1. You must have a VPC in your home Region and in each remote Region, and their CIDRs must not overlap with each other. If you need to create VPCs and subnets that do not overlap, please refer here.
  2. Each subnet used must have outbound internet connectivity. You can use either a NAT gateway or an internet gateway.
  3. The VPCs must be peered in order to complete the steps in this post. For information on creating a VPC Peering connection between regions, please refer here.
  4. If you choose to deploy your Domain Controllers to a private subnet, you must have an RDP jump / bastion instance setup to allow you to RDP to your instance.

Deploy Your Domain Controllers in the Home Region using Launch Wizard

In this section, I deploy the first set of domain controllers into us-east-1, the home region, using Launch Wizard. I refer to us-east-1 as the home region and us-west-2 as the remote region.

  1. In the AWS Launch Wizard Console, select Active Directory in the navigation pane on the left.
  2. Select Create deployment.
  3. In the Review Permissions page, select Next.
  4. In the Configure application settings page set the following:
    • General:
      • Deployment name: UsEast1AD
    • Active Directory (AD) installation
      • Installation type: Active Directory on EC2
    • Domain Settings:
      • Number of domain controllers: 2
      • AMI installation type: License-included AMI
    • License-included AMI: ami-################# | Windows_Server-2019-English-Full-Base-202#-##-##
    • Connection type: Create new Active Directory
    • Domain DNS name: corp.example.com
    • Domain NetBIOS Name: CORP
    • Connectivity:
      • Key Pair Name: Create a new key pair or select an existing one.
      • Virtual Private Cloud (VPC): Select Virtual Private Cloud (VPC)
    • VPC: Select your home region VPC
    • Availability Zone (AZ) and private subnets:
      • Select 2 Availability Zones
      • Choose the proper subnet in each Availability Zone
      • Assign a Controller IP address for each domain controller
    • Remote Desktop Gateway preferences: Disregard for now, this is set up later.
    • Check the I confirm that a public subnet has been set up. Each of the selected private subnets have outbound connectivity enabled check box.
  1. Select Next.
  2. In the Define infrastructure requirements page, set the following inputs.
    • Storage and compute: Based on infrastructure requirements
    • Number of AD users: Up to 5000 users
  3. Select Next.
  4. In the Review and deploy page, review your selections. Then, select Deploy.

Note that it may take up to 2 hours for your domain to be deployed. Once the status has changed to Completed, you can proceed to the next section, where I prepare Active Directory Sites and Services for the second set of domain controllers in my other region.

Configure Active Directory Sites and Services

In this section, I configure the Active Directory Sites and Services topology to match my network topology. This step ensures proper Active Directory replication routing so that domain clients can find the closest domain controller. For more information on Active Directory Sites and Services, please refer here.

Retrieve your Administrator Credentials from Secrets Manager

  1. From the AWS Secrets Manager Console in us-east-1, select the Secret that begins with LaunchWizard-UsEast1AD.
  2. In the middle of the Secret page, select Retrieve secret value.
    1. This will display the username and password key with their values.
    2. You need these credentials when you RDP into one of the domain controllers in the next steps.

Rename the Default First Site

  1. Log in to one of the domain controllers in us-east-1.
  2. Select Start, type dssite, and press Enter on your keyboard.
  3. The Active Directory Sites and Services MMC should appear.
    1. Expand Sites. There is a site named Default-First-Site-Name.
    2. Right click on Default-First-Site-Name select Rename.
    3. Enter us-east-1 as the name.
  4. Leave the Active Directory Sites and Services MMC open for the next set of steps.

Create a New Site and Subnet Definition for US-West-2

  1. Using the Active Directory Sites and Services MMC from the previous steps, right click on Sites.
  2. Select New Site… and enter the following inputs:
    • Name: us-west-2
    • Select DEFAULTIPSITELINK.
  3.  Select OK.
  4. A pop up will appear telling you there will need to be some additional configuration. Select OK.
  5. Expand Sites and right click on Subnets and select New Subnet.
  6. Enter the following information:
    • Prefix: the CIDR of your us-west-2 VPC, for example 10.1.0.0/24
    • Site: select us-west-2
  7. Select OK.
  8. Leave the Active Directory Sites and Services MMC open for the following set of steps.

Configure Site Replication Settings

Using the Active Directory Sites and Services MMC from the previous steps, expand Sites, Inter-Site Transports, and select IP. You should see an object named DEFAULTIPSITELINK.

  1. Right click on DEFAULTIPSITELINK.
  2. Select Properties. Set or verify the following inputs on the General tab:
  3. Select Apply.
  4. In the DEFAULTIPSITELINK Properties, select the Attribute Editor tab and modify the following:
    • Scroll down and double-click the options attribute, enter 1 for the Value, and then select OK twice.
      • For more information on these settings, please refer here.
  5. Close the Active Directory Sites and Services MMC, as it is no longer needed.

Prepare Your Home Region Domain Controllers Security Group

In this section, I modify the Domain Controllers Security Group in us-east-1. This allows the domain controllers that will be deployed in us-west-2 to communicate with the domain controllers in us-east-1.

  1. From the Amazon Elastic Compute Cloud (Amazon EC2) console, select Security Groups under the Network & Security navigation section.
  2. Select the Domain Controllers Security Group that was created with Launch Wizard Active Directory. The Security Group name should start with LaunchWizard-UsEast1AD-.
  3. Select Edit inbound rules.
  4. Choose Add rule and enter the following:
    • Type: Select All traffic
    • Protocol: All
    • Port range: All
    • Source: Select Custom
    • Enter the CIDR of your remote (us-west-2) VPC, for example 10.1.0.0/24
  5. Select Save rules.
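
If you prefer the AWS CLI to the console for this step, an equivalent rule can be added as follows. This is a sketch; the security group ID and the CIDR are placeholders:

aws ec2 authorize-security-group-ingress \
--group-id <home_region_dc_security_group_id> \
--protocol all \
--cidr <remote_vpc_cidr> \
--region us-east-1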

Create a Copy of Your Administrator Secret in Your Remote Region

In this section, I create a Secret in Secrets Manager in the remote region that contains the Administrator credentials generated when I deployed the home region.

  1. Find the Secret that begins with LaunchWizard-UsEast1AD in the AWS Secrets Manager Console in us-east-1.
  2. In the middle of the Secret page, select Retrieve secret value.
    • This displays the username and password key with their values. Make note of these keys and values, as we need them for the next steps.
  3. From the AWS Secrets Manager Console, change the region to us-west-2.
  4. Select Store a new secret. Then, enter the following inputs:
    • Select secret type: Other type of secrets
    • Add your first key-value pair using the username key and value you noted earlier
    • Select Add row to add a second key-value pair using the password key and value
  5. Select Next, then enter the following inputs.
    • Secret name: UsWest2AD
    • Select Next twice
    • Select Store
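
As an alternative to the console, the same secret can be created with the AWS CLI. This is a sketch; the username and password values are placeholders for the values you retrieved from the home region secret:

aws secretsmanager create-secret \
--name UsWest2AD \
--secret-string '{"username":"<username_value>","password":"<password_value>"}' \
--region us-west-2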

Deploy Your Domain Controllers in the Remote Region using Launch Wizard

In this section, I deploy the second set of domain controllers into the us-west-2 region using Launch Wizard.

  1. In the AWS Launch Wizard Console, select Active Directory in the navigation pane on the left.
  2. Select Create deployment.
  3. In the Review Permissions page, select Next.
  4. In the Configure application settings page, set the following inputs.
    • General
      • Deployment name: UsWest2AD
    • Active Directory (AD) installation
      • Installation type: Active Directory on EC2
    • Domain Settings:
      • Number of domain controllers: 2
      • AMI installation type: License-included AMI
      • License-included AMI: ami-################# | Windows_Server-2019-English-Full-Base-202#-##-##
    • Connection type: Add domain controllers to existing Active Directory
    • Domain DNS name: corp.example.com
    • Domain NetBIOS Name: CORP
    • Domain Administrator secret name: Select the secret you created above.
    • Add permission to secret
      • After you verify that the Secret you created above has the policy listed, select the check box confirming the secret has the required policy.
    • Domain DNS IP address for resolution: The private IP of either domain controller in your home region
    • Connectivity:
      • Key Pair Name: Choose an existing Key pair
      • Virtual Private Cloud (VPC): Select Virtual Private Cloud (VPC)
    • VPC: Select your remote region VPC
    • Availability Zone (AZ) and private subnets:
      • Select 2 Availability Zones
      • Choose the proper subnet in each Availability Zone
      • Assign a Controller IP address for each domain controller
    • Remote Desktop Gateway preferences: disregard for now, as I set this later.
    • Check the I confirm that a public subnet has been set up. Each of the selected private subnets have outbound connectivity enabled check box
  1. In the Define infrastructure requirements page set the following:
    • Storage and compute: Based on infrastructure requirements
    • Number of AD users: Up to 5000 users
  2. In the Review and deploy page, review your selections. Then, select Deploy.

Note that it may take up to 2 hours to deploy the domain controllers. Once the status has changed to Completed, proceed to the next section, where I update the security group for the domain controllers in the remote region.

Prepare Your Remote Region Domain Controllers Security Group

In this section, I modify the Domain Controllers Security Group in us-west-2. This allows the domain controllers deployed in us-east-1 to communicate with the domain controllers in us-west-2.

  1. From the Amazon Elastic Compute Cloud (Amazon EC2) console, select Security Groups under the Network & Security navigation section.
  2. Select the Domain Controllers Security Group that was created by your Launch Wizard Active Directory deployment. The Security Group name should start with LaunchWizard-UsWest2AD-EC2ADStackExistingVPC-.
  3. Select Edit inbound rules.
  4. Choose Add rule and enter the following:
    • Type: Select All traffic
    • Protocol: All
    • Port range: All
    • Source: Select Custom
    • Enter the CIDR of your home region (us-east-1) VPC, for example 10.0.0.0/24
  5. Choose Save rules.

Create an AD User and Verify Replication

In this section, I create a user in one region and verify that it replicated to the other region. I also use AD replication diagnostics tools to verify that replication is working properly.

Create a Test User Account

  1. Log in to one of the domain controllers in us-east-1.
  2. Select Start, type dsa and press Enter on your keyboard. The Active Directory Users and Computers MMC should appear.
  3. Right click on the Users container and select New > User.
  4. Enter the following inputs:
    • First name: John
    • Last name: Doe
    • User logon name: jdoe and select Next
    • Password and Confirm password: Your choice of complex password
    • Uncheck User must change password at next logon
  5. Select Next.
  6. Select Finish.

Verify Test User Account Has Replicated

  1. Log in to the one of the domain controllers in us-west-2.
  2. Select Start and type dsa.
  3. Then, press Enter on your keyboard. The Active Directory Users and Computers MMC should appear.
  4. Select Users. You should see a user object named John Doe.

Note that if the user is not present, it may not have been replicated yet. Replication should not take longer than 60 seconds from when the item was created.

Summary

Congratulations, you have created a cross-region Active Directory! In this post you:

  1. Launched a new Active Directory forest in us-east-1 using AWS Launch Wizard.
  2. Configured Active Directory Sites and Service for a multi-region configuration.
  3. Launched a set of new domain controllers in the us-west-2 region using AWS Launch Wizard.
  4. Created a test user and verified replication.

This post only touches on a couple of features that are available in the AWS Launch Wizard Active Directory deployment. AWS Launch Wizard also automates the creation of a Single Tier PKI infrastructure or trust creation. One of the prime benefits of this solution is the simplicity in deploying a fully functional Active Directory environment in just a few clicks. You no longer need to do the undifferentiated heavy lifting required to deploy Active Directory.  For more information, please refer to AWS Launch Wizard documentation.

Automate domain join for Amazon EC2 instances from multiple AWS accounts and Regions

Post Syndicated from Sanjay Patel original https://aws.amazon.com/blogs/security/automate-domain-join-for-amazon-ec2-instances-multiple-aws-accounts-regions/

As organizations scale up their Amazon Web Services (AWS) presence, they are faced with the challenge of administering user identities and controlling access across multiple accounts and Regions. As this presence grows, managing user access to cloud resources such as Amazon Elastic Compute Cloud (Amazon EC2) becomes increasingly complex. AWS Directory Service for Microsoft Active Directory (also known as an AWS Managed Microsoft AD) makes it easier and more cost-effective for you to manage this complexity. AWS Managed Microsoft AD is built on highly available, AWS managed infrastructure. Each directory is deployed across multiple Availability Zones, and monitoring automatically detects and replaces domain controllers that fail. In addition, data replication and automated daily snapshots are configured for you. You don’t have to install software, and AWS handles all patching and software updates. AWS Managed Microsoft AD enables you to leverage your existing on-premises user credentials to access cloud resources such as the AWS Management Console and EC2 instances.

This blog post describes how EC2 resources launched across multiple AWS accounts and Regions can automatically domain-join a centralized AWS Managed Microsoft AD. The solution we describe in this post is implemented for both Windows and Linux instances. Removal of Computer objects from Active Directory upon instance termination is also implemented. The solution uses Amazon DynamoDB to centrally store account and directory information in a central security account. We also provide AWS CloudFormation templates and platform-specific domain join scripts for you to use with AWS Lambda as a quick start solution.

Architecture

The following diagram shows the domain-join process for EC2 instances across multiple accounts and Regions using AWS Managed Microsoft AD.

Figure 1: EC2 domain join architecture

The event flow works as follows:

  1. An EC2 instance is launched in a peered virtual private cloud (VPC) of a workload or security account. VPCs that are hosting EC2 instances need to be peered with the VPC that contains AWS Managed Microsoft AD to enable network connectivity with Active Directory.
  2. An Amazon CloudWatch Events rule detects an EC2 instance in the “running” state.
  3. The CloudWatch event is forwarded to a regional CloudWatch event bus in the security account.
  4. If the CloudWatch event bus is in the same Region as AWS Managed Microsoft AD, it delivers the event to an Amazon Simple Queue Service (Amazon SQS) queue, referred to as the domain-join queue in this post.
  5. If the CloudWatch event bus is in a different Region from AWS Managed Microsoft AD, it delivers the event to an Amazon Simple Notification Service (Amazon SNS) topic. The event is then delivered to the domain-join queue described in step 4, through the Amazon SNS topic subscription.
  6. Messages in the domain-join queue are held for five minutes to allow for EC2 instances to stabilize after they reach the “running” state. This delay allows time for installation of additional software components and agents through the use of EC2 user data and AWS Systems Manager Distributor.
  7. After the holding period is over, messages in the domain-join queue invoke the AWS AD Join/Leave Lambda function. The Lambda function does the following:
    1. Retrieves the AWS account ID that originated the event from the message and retrieves account-specific configurations from a DynamoDB table. This configuration identifies AWS Managed Microsoft AD domain controller IPs, credentials required to perform EC2 domain join, and an AWS Identity and Access Management (IAM) role that can be assumed by the Lambda function to invoke AWS Systems Manager Run Command.
    2. If needed, uses AWS Security Token Service (AWS STS) and prepares a cross-account access session.
    3. Retrieves EC2 instance information, such as the instance state, platform, and tags, and validates the instance state.
    4. Retrieves platform-specific domain-join scripts that are deployed with the Lambda function’s code bundle, and configures invocation of those scripts by using data read from the DynamoDB table (bash script for Linux instances and PowerShell script for Windows instances).
    5. Uses AWS Systems Manager Run Command to invoke the domain-join script on the instance. Run Command enables you to remotely and securely manage the configuration of your managed instances.
    6. The domain-join script runs on the instance. It uses script parameters and instance attributes to configure the instance and perform the domain join. The adGroupName tag value is used to configure the Active Directory user group that will have permissions to log in to the instance. The instance is rebooted to complete the domain join process. Various software components are installed on the instance when the script runs. For the Linux instance, sssd, realmd, krb5, samba-common, adcli, unzip, and packageit are installed. For the Windows instance, the RDS-RD-Server feature is installed.

Removal of EC2 instances from AWS Managed Microsoft AD upon instance termination follows a similar sequence of steps. Each instance that is domain joined creates an Active Directory domain object under the “Computer” hierarchy. This domain object needs to be removed upon instance termination so that a new instance that uses the same private IP address in the subnet (at a future time) can successfully domain join and enable instance access with Active Directory credentials. Removal of the Active Directory Computer object is done by running the leaveDomaini.ps1 script (included with this blog) through Run Command on the Active Directory Tools instance identified in Figure 1.

Prerequisites and setup

To build the solution outlined in this post, you need:

  • AWS Managed Microsoft AD with an appropriate DNS name (for example, example.com). For more information about getting started with AWS Managed Microsoft AD, see Create Your AWS Managed Microsoft AD directory.
  • AD Tools. To install AD Tools and use it to create the required users:
    1. Launch a Windows EC2 instance in the same account and Region, and domain-join it with the directory you created in the previous step. Log in to the instance through Remote Desktop Protocol (RDP) and install AD Tools as described in Installing the Active Directory Administration Tools.
    2. After the AD Tools are installed, launch the AD Users & Computers application to create domain users, and assign those users to an Active Directory security group (for example, my_UserGroup) that has permission to access domain-joined instances.
    3. Create a least-privileged user for performing domain joins as described in Delegate Directory Join Privileges for AWS Managed Microsoft AD. The identity of this user is stored in the DynamoDB table and read by the AD Join Lambda function to invoke Active Directory join scripts.
    4. Store the password for the least-privileged user in an encrypted Systems Manager parameter. The password is stored as a SecureString Systems Manager parameter and is read by the AD Join Lambda function at runtime while processing Amazon SQS messages.
    5. Assign a unique tag key and value to identify the AD Tools instance. This instance will be invoked by the Lambda function to delete Computer objects from Active Directory upon termination of domain-joined instances.
  • All VPCs that are hosting EC2 instances to be domain joined must be peered with the VPC that hosts the relevant AWS Managed Microsoft AD. Alternatively, AWS Transit Gateway could be used to establish this connectivity.
  • In addition to having network connectivity to the AWS Managed Microsoft AD domain controllers, domain join scripts that run on EC2 instances must be able to resolve relevant Active Directory resource records. In this solution, we leverage Amazon Route 53 Outbound Resolver to forward DNS queries to the AWS Managed Microsoft AD DNS servers, while still preserving the default DNS capabilities that are available to the VPC. Learn more about deploying Route 53 Outbound Resolver and resolver rules to resolve your directory DNS name to DNS IPs.
  • Each domain-join EC2 instance must have a Systems Manager Agent (SSM Agent) installed and an IAM role that provides equivalent permissions as provided by the AmazonEC2RoleforSSM built-in policy. The SSM Agent is used to allow domain-join scripts to run automatically. See Working with SSM Agent for more information on installing and configuring SSM Agents on EC2 instances.

Solution deployment

The steps in this section deploy AD Join solution components by using the AWS CloudFormation service.

The CloudFormation template provided with this solution (mad_auto_join_leave.json) deploys resources that are identified in the security account’s AWS Region that hosts AWS Managed Microsoft AD (the top left quadrant highlighted in Figure 1). The template deploys a DynamoDB resource with 5 read and 5 write capacity units. This should be adjusted to match your usage. DynamoDB also provides the ability to auto-scale these capacities. You will need to create and deploy additional CloudFormation stacks for cross-account, cross-Region scenarios.

To deploy the solution

  1. Create a versioned Amazon Simple Storage Service (Amazon S3) bucket to store a zip file (for example, adJoinCode.zip) that contains Python Lambda code and domain join/leave bash and PowerShell scripts. Upload the source code zip file to an S3 bucket and find the version associated with the object.
  2. Navigate to the AWS CloudFormation console. Choose the appropriate AWS Region, and then choose Create Stack. Select With new resources.
  3. Choose Upload a template file (for this solution, mad_auto_join_leave.json), select the CloudFormation stack file, and then choose Next.
  4. Enter the stack name and values for the other parameters, and then choose Next.
    Figure 2: Defining the stack name and parameters

    The parameters are defined as follows:

  • S3CodeBucket: The name of the S3 bucket that holds the Lambda code zip file object.
  • adJoinLambdaCodeFileName: The name of the Lambda code zip file that includes Lambda Python code, bash, and Powershell scripts.
  • adJoinLambdaCodeVersion: The S3 Version ID of the uploaded Lambda code zip file.
  • DynamoDBTableName: The name of the DynamoDB table that will hold account configuration information.
  • CreateDynamoDBTable: The flag that indicates whether to create a new DynamoDB table or use an existing table.
  • ADToolsHostTagKey: The tag key of the Windows EC2 instance that has AD Tools installed and that will be used for removal of Active Directory Computer objects upon instance termination.
  • ADToolsHostTagValue: The tag value for the key identified by the ADToolsHostTagKey parameter.
  • Acknowledge creation of AWS resources and choose to continue to deploy AWS resources through AWS CloudFormation. The CloudFormation stack creation process is initiated, and after a few minutes, upon completion, the stack status is marked as CREATE_COMPLETE. The following resources are created when the CloudFormation stack deploys successfully:
    • An AD Join Lambda function with associated scripts and IAM role.
    • A CloudWatch Events rule to detect the “running” and “terminated” states for EC2 instances.
    • An SQS event queue to hold the EC2 instance “running” and “terminated” events.
    • CloudWatch event mapping to the SQS event queue and further to the Lambda function.
    • A DynamoDB table to hold the account configuration (if you chose this option).

The DynamoDB table hosts account-level configurations. Account-specific configuration is required for an instance from a given account to join the Active Directory domain. Each DynamoDB item contains the account-specific configuration shown in the following table. Storing account-level information in the DynamoDB table provides the ability to use multiple AWS Managed Microsoft AD directories and group various accounts accordingly. Additional account configurations can also be stored in this table for implementation of various centralized security services (instance inspection, patch management, and so on).

Attribute Description
accountId AWS account number
adJoinUserName User ID with AD Join permissions
adJoinUserPwParam Encrypted Systems Manager parameter containing the AD Join user’s password
dnsIP1 Domain controller 1 IP address
dnsIP2 Domain controller 2 IP address
assumeRoleARN Amazon Resource Name (ARN) of the role assumed by the AD Join Lambda function

Following is an example of how you could insert an item (row) in a DynamoDB table for an account.

aws dynamodb put-item --table-name <DynamoDB-Table-Name> --item file://itemData.json

where itemData.json is as follows.

{
    "accountId": { "S": "123412341234" },
    " adJoinUserName": { "S": "ADJoinUser" },
    " adJoinUserPwParam": { "S": "ADJoinUser-PwParam" },
    "dnsName": { "S": "example.com" },
    "dnsIP1": { "S": "192.0.2.1" },
    "dnsIP2": { "S": "192.0.2.2" },
    "assumeRoleARN": { "S": "arn:aws:iam::111122223333:role/adJoinLambdaRole" }
}

(Update with your own values as appropriate for your environment.)

In the preceding example, adJoinLambdaRole is assumed by the AD Join Lambda function (if needed) to establish cross-account access using AWS Security Token Service (AWS STS). The role needs to provide sufficient privileges for the AD Join Lambda function to retrieve instance information and run cross-account Systems Manager commands.

adJoinUserName identifies a user with the minimum privileges to do the domain join; you created this user in the prerequisite steps.

adJoinUserPwParam identifies the name of the encrypted Systems Manager parameter that stores the password for the AD Join user. You created this parameter in the prerequisite steps.
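
For reference, the encrypted parameter referenced by adJoinUserPwParam can be created with the AWS CLI before you populate the DynamoDB table. This is a sketch; the parameter name matches the example item above, and the password value is a placeholder:

aws ssm put-parameter \
--name "ADJoinUser-PwParam" \
--type SecureString \
--value '<ad_join_user_password>'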

Solution test

After you successfully deploy the solution using the steps in the previous section, the next step is to test the deployed solution.

To test the solution

  1. Navigate to the AWS EC2 console and launch a Linux instance. Launch the instance in a public subnet of the available VPC.
  2. Choose an IAM role that gives at least AmazonEC2RoleforSSM permissions to the instance.
  3. Add an adGroupName tag with the value that identifies the name of the Active Directory security group whose members should have access to the instance.
  4. Make sure that the security group associated with your instance has permissions for your IP address to log in to the instance by using the Secure Shell (SSH) protocol.
  5. Wait for the instance to launch and perform the Active Directory domain join. You can navigate to the AWS SQS console and observe a delayed message that represents the CloudWatch instance “running” event. This message is processed after five minutes; after that you can observe the Lambda function’s message processing log in CloudWatch logs.
  6. Log in to the instance with Active Directory user credentials. This user must be the member of the Active Directory security group identified by the adGroupName tag value. Following is an example login command.
    ssh '<ad_username>@example.com'@<public-dns-name|public-ip-address>
    

  7. Similarly, launch a Windows EC2 instance to validate the Active Directory domain join by using Remote Desktop Protocol (RDP).
  8. Terminate domain-joined instances. Log in to the AD Tools instance to validate that the Active Directory Computer object that represents the instance is deleted.

The AD Join Lambda function invokes Systems Manager commands to deliver and run domain join scripts on the EC2 instances. The AWS-RunPowerShellScript command is used for Microsoft Windows instances, and the AWS-RunShellScript command is used for Linux instances. Systems Manager command parameters and execution status can be observed in the Systems Manager Run Command console.

The AD user used to perform the domain join is a least-privileged user, as described in Delegate Directory Join Privileges for AWS Managed Microsoft AD. The password for this user is passed to instances by way of SSM Run Commands, as described above. The password is visible in the SSM Command history log and in the domain join scripts run on the instance. Alternatively, all script parameters can be read locally on the instance through the “adjoin” encrypted SSM parameter. Refer to the domain join scripts for details of the “adjoin” SSM parameter.

Additional information

Directory sharing

AWS Managed Microsoft AD can be shared with other AWS accounts in the same Region. Learn how to use this feature and seamlessly domain join Microsoft Windows EC2 instances and Linux instances.

autoadjoin tag

Launching EC2 instances with an autoadjoin tag key with a “false” value excludes the instance from the automated Active Directory join process. You might want to do this in scenarios where you want to install additional agent software before or after the Active Directory join process. You can invoke domain join scripts (bash or PowerShell) by using user data or other means. However, you’ll need to reboot the instance and re-run scripts to complete the domain join process.
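
For example, an instance that opts out of the automated join could be launched as follows. This is a sketch; the AMI ID, subnet ID, and instance type are placeholders:

aws ec2 run-instances \
--image-id <ami_id> \
--instance-type t3.micro \
--subnet-id <subnet_id> \
--tag-specifications 'ResourceType=instance,Tags=[{Key=autoadjoin,Value=false}]'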

Summary

In this blog post, we demonstrated how you could automate the Active Directory domain join process for EC2 instances to AWS Managed Microsoft AD across multiple accounts and Regions, and also centrally manage this configuration by using AWS DynamoDB. By adopting this model, administrators can centrally manage Active Directory–aware applications and resources across their accounts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Sanjay Patel

Sanjay is a Senior Cloud Application Architect with AWS Professional Services. He has a diverse background in software design, enterprise architecture, and API integrations. He has helped AWS customers automate infrastructure security. He enjoys working with AWS customers to identify and implement the best fit solution.

Author

Vaibhawa Kumar

Vaibhawa is a Senior Cloud Infrastructure Architect with AWS Professional Services. He helps customers with the architecture, design, and automation to build innovative, secured, and highly available solutions using various AWS services. In his free time, you can find him spending time with family, sports, and cooking.

Author

Kevin Higgins

Kevin is a Senior Cloud Infrastructure Architect with AWS Professional Services. He helps customers with the architecture, design, and development of cloud-optimized infrastructure solutions. As a member of the Microsoft Global Specialty Practice, he collaborates with AWS field sales, training, support, and consultants to help drive AWS product feature roadmap and go-to-market strategies.

Coming Soon – Amazon EC2 G4ad Instances Featuring AMD GPUs for Graphics Workloads

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/new-amazon-ec2-g4ad-instances-featuring-amd-gpus-for-graphics-workloads/

Customers with high performance graphics workloads, such as game streaming, animation, and video rendering, are always looking for higher performance at lower cost. Today, I’m happy to announce that new Amazon Elastic Compute Cloud (EC2) instances in the G4 instance family are in the works and will be available soon, to improve performance and reduce cost for graphics-intensive workloads. The new G4ad instances feature AMD’s latest Radeon Pro V520 GPUs and 2nd generation EPYC processors, and are the first in EC2 to feature AMD GPUs.

G4dn instances, released in 2019 and featuring NVIDIA T4 GPUs, were previously the most cost-effective GPU-based instances in EC2. G4dn instances are ideal for deploying machine learning models in production and also graphics-intensive applications. However, when compared to G4dn the new G4ad instances enable up to 45% better price performance for graphics-intensive workloads, including the aforementioned game streaming, remote graphics workstations, and rendering scenarios. Compared to an equally-sized G4dn instance, G4ad instances offer up to 40% improvement in performance.

G4dn instances will continue to be the best option for small-scale machine learning (ML) training and GPU-based ML inference due to included hardware optimizations like Tensor Cores. Additionally, G4dn instances are still best suited for graphics applications that need access to NVIDIA libraries such as CUDA, CuDNN, and NVENC. However, when there is no dependency on NVIDIA’s libraries, we recommend customers try the G4ad instances to benefit from the improved price and performance.

AMD Radeon Pro V520 GPUs in G4ad instances support DirectX 11/12, Vulkan 1.1, and OpenGL 4.5 APIs. For operating systems, customers can choose from Windows Server 2016/2019, Amazon Linux 2, Ubuntu 18.04.3, and CentOS 7.7. Instances using G4ad can be purchased as On-Demand, Savings Plan, Reserved Instances, or Spot Instances. Three instance sizes are available, from G4ad.4xlarge, with 1 GPU, to G4ad.16xlarge with 4 GPUs, as shown below.

Instance Size GPUs GPU Memory (GB) vCPUs Memory (GB) Storage (GB) EBS Bandwidth (Gbps) Network Bandwidth (Gbps)
g4ad.4xlarge 1 8 16 64 600 Up to 3 Up to 10
g4ad.8xlarge 2 16 32 128 1200 3 15
g4ad.16xlarge 4 32 64 256 2400 6 25

The new G4ad instances will be available soon in US East (N. Virginia), US West (Oregon), and Europe (Ireland).

Learn more about G4ad instances.

— Steve

New EC2 M5zn Instances – Fastest Intel Xeon Scalable CPU in the Cloud

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-ec2-m5zn-instances-fastest-intel-xeon-scalable-cpu-in-the-cloud/

We launched the compute-intensive z1d instances in mid-2018 for customers who asked us for extremely high per-core performance and a high memory-to-core ratio to power their front-end Electronic Design Automation (EDA), actuarial, and CPU-bound relational database workloads.

In order to address a complementary set of use cases, customers have asked us for an EC2 instance that will give them high per-core performance like z1d, with no local NVMe storage, higher networking throughput, and a reduced memory-to-vCPU ratio. They have indicated that if we built an instance with this set of attributes, it would be an excellent fit for workloads such as gaming, financial applications, simulation modeling applications such as those used in the automobile, aerospace, energy and telecommunication industries, and High Performance Computing (HPC).

Introducing M5zn
Building on the success of the z1d instances, we are launching M5zn instances in seven sizes today. These instances use 2nd generation custom Intel® Xeon® Scalable (Cascade Lake) processors with a sustained all-core turbo clock frequency of up to 4.5 GHz. M5zn instances feature high frequency processing, are a variant of the general-purpose M5 instances, and are built on the AWS Nitro System. These instances also feature low latency 100 Gbps networking and the Elastic Fabric Adapter (EFA), in order to improve performance for HPC and communication-intensive applications.

Here are the M5zn instances (all VPC-only, HVM-only, and EBS-Optimized, with support for Optimize vCPU). As you can see, the memory-to-vCPU ratio on these instances is half that of the existing z1d instances:

Instance Name vCPUs RAM Network Bandwidth EBS-Optimized Bandwidth
m5zn.large 2 8 GiB Up to 25 Gbps Up to 3.170 Gbps
m5zn.xlarge 4 16 GiB Up to 25 Gbps Up to 3.170 Gbps
m5zn.2xlarge 8 32 GiB Up to 25 Gbps 3.170 Gbps
m5zn.3xlarge 12 48 GiB Up to 25 Gbps 4.750 Gbps
m5zn.6xlarge 24 96 GiB 50 Gbps 9.500 Gbps
m5zn.12xlarge 48 192 GiB 100 Gbps 19 Gbps
m5zn.metal 48 192 GiB 100 Gbps 19 Gbps

The Nitro Hypervisor allows M5zn instances to deliver performance that is just about indistinguishable from bare metal. Other AWS Nitro System components such as the Nitro Security Chip and hardware-based processing for EBS increase performance, while VPC encryption provides greater security.

Things To Know
Here are a couple of “fun facts” about the M5zn instances:

Placement Groups – M5zn instances can be used in Cluster (for low latency and high network throughput), Spread (to keep critical instances separate from each other), and Partition (to reduce correlated failures) placement groups.
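
For example, you could create a cluster placement group and launch M5zn instances into it to get the lowest possible latency between nodes. This is a sketch; the placement group name, AMI ID, and count are placeholders:

aws ec2 create-placement-group \
--group-name m5zn-cluster-demo \
--strategy cluster

aws ec2 run-instances \
--image-id <ami_id> \
--instance-type m5zn.6xlarge \
--count 2 \
--placement GroupName=m5zn-cluster-demo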

Networking – M5zn instances support the Elastic Network Adapter (ENA) with dedicated 100 Gbps network connections and a dedicated 19 Gbps connection to EBS. If you are building distributed ML or HPC applications for use on a cluster of M5zn instances, be sure to take a look at the Elastic Fabric Adapter (EFA). Your HPC applications can use the Message Passing Interface (MPI) to communicate efficiently at high speed while scaling to thousands of nodes.

C-State Control – You can configure CPU Power Management on m5zn.6xlarge and m5zn.12xlarge instances. This is definitely an advanced feature, but one worth exploring in those situations where you need to squeeze every possible cycle of available performance from the instance.

NUMA – You can make use of Non-Uniform Memory Access on m5zn.12xlarge instances. This is also an advanced feature, but worth exploring in situations where you have an in-depth understanding of your application’s memory access patterns.

To learn more about these and other features, visit the EC2 M5 Instances page.

Available Now
As you can see, the M5zn instances are a great fit for gaming, HPC and simulation modeling workloads such as those used by the financial, automobile, aerospace, energy, and telecommunications industries.

You can launch M5zn instances today in the US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Ireland), Europe (Frankfurt), and Asia Pacific (Tokyo) Regions in On-Demand, Reserved Instance, Savings Plan, and Spot form. Dedicated Instances and Dedicated Hosts are also available.

Support is available in the EC2 Forum or via your usual AWS Support contact. The EC2 team is interested in your feedback and you can contact them at [email protected].

Jeff;

 

 

Coming Soon – EC2 C6gn Instances – 100 Gbps Networking with AWS Graviton2 Processors

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/coming-soon-ec2-c6gn-instances-100-gbps-networking-with-aws-graviton2-processors/

Based on the amazing feedback from customers such as Snap, NextRoll, Intuit, SmugMug, and Honeycomb who are running their workloads on Amazon Elastic Compute Cloud (EC2) instances powered by AWS Graviton2, today we are announcing an addition to our broad Arm-based Graviton2 portfolio: C6gn instances. They deliver up to 100 Gbps network bandwidth, up to 38 Gbps Amazon Elastic Block Store (EBS) bandwidth, up to 40% higher packet processing performance, and up to 40% better price/performance versus comparable current generation x86-based network optimized instances.

Compared to C6g instances, this new instance type provides 4x higher network bandwidth, 4x higher packet processing performance, and 2x higher EBS bandwidth. This means that customers with workloads that need high networking bandwidth, such as high performance computing (HPC), network appliances, real-time video communications, and data analytics, will be able to bring their biggest and most challenging applications to Arm and take advantage of the performance and cost benefits.

C6gn instances will be available in 8 sizes:

Name            vCPUs   Memory (GiB)   Network Bandwidth (Gbps)   EBS Throughput (Gbps)
c6gn.medium     1       2              Up to 25                   Up to 9.5
c6gn.large      2       4              Up to 25                   Up to 9.5
c6gn.xlarge     4       8              Up to 25                   Up to 9.5
c6gn.2xlarge    8       16             Up to 25                   Up to 9.5
c6gn.4xlarge    16      32             25                         9.5
c6gn.8xlarge    32      64             50                         19
c6gn.12xlarge   48      96             75                         28.5
c6gn.16xlarge   64      128            100                        38

The new instances are built on the AWS Nitro System, a collection of AWS-designed hardware and software innovations that maximize resource efficiency. C6gn instances support Elastic Fabric Adapter (EFA) on the c6gn.16xlarge size for workloads that can take advantage of lower network latency (such as HPC and video processing) and use Message Passing Interface (MPI) for highly scalable clusters. These new instances also fully support network frameworks like Data Plane Development Kit (DPDK), making it easier to migrate network appliance workloads.

Coming Soon
EC2 C6gn instances will be available later this month, making it easier to optimize costs for HPC and other workloads that require high network bandwidth and low latency. Let me know what you are going to build with them!
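
Once they launch, one way to check whether the C6gn sizes have reached a given Region is to query the instance type offerings. A small boto3 sketch (the Region name is a placeholder):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# List the C6gn sizes offered in this Region
response = ec2.describe_instance_type_offerings(
    LocationType="region",
    Filters=[{"Name": "instance-type", "Values": ["c6gn.*"]}],
)
for offering in response["InstanceTypeOfferings"]:
    print(offering["InstanceType"])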

To get practice with the AWS Graviton2 architecture, you can try t4g.micro instances for free for up to 750 hours per month until March 31st, 2021.

Learn more about EC2 C6gn instances today.

Danilo

New – Amazon EC2 R5b Instances Provide 3x Higher EBS Performance

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-ec2-r5b-instances-providing-3x-higher-ebs-performance/

In July 2018, we announced memory-optimized R5 instances for the Amazon Elastic Compute Cloud (Amazon EC2). R5 instances are designed for memory-intensive applications such as high-performance databases, distributed web scale in-memory caches, in-memory databases, real-time big data analytics, and other enterprise applications.

R5 instances offer two different block storage options. R5d instances offer up to 3.6 TB of NVMe instance storage for applications that need access to high-speed, low latency local storage. In addition, all R5 instances work with Amazon Elastic Block Store. Amazon EBS is an easy-to-use, high-performance and highly available block storage service designed for use with Amazon EC2 for both throughput- and transaction-intensive workloads at any scale. A broad range of workloads, such as relational and non-relational databases, enterprise applications, containerized applications, big data analytics engines, file systems, and media workflows are widely deployed on Amazon EBS.

Today, we are happy to announce the availability of R5b, a new addition to the R5 instance family. The new R5b instance is powered by the AWS Nitro System to provide the best network-attached storage performance available on EC2. This new instance offers up to 60 Gbps of EBS bandwidth and 260,000 I/O operations per second (IOPS).

Amazon EC2 R5b Instance
Many customers use R5 instances with EBS for large relational database workloads such as commerce platforms, ERP systems, and health record systems, and they rely on EBS to provide scalable, durable, and high availability block storage. These instances provide sufficient storage performance for many use cases, but some customers require higher EBS performance on EC2.

R5 instances provide up to 19 Gbps of EBS bandwidth and a maximum EBS performance of 80K IOPS, while the new R5b instances support up to 60 Gbps of EBS bandwidth and 260K IOPS, providing 3x higher EBS-Optimized performance than R5 instances and enabling customers to lift and shift large relational database applications to AWS. R5b instances have the same vCPU-to-memory ratio and network performance as R5 instances.

Instance Name   vCPUs   Memory    EBS-Optimized Bandwidth (Mbps)   EBS-Optimized IOPS @ 16 KiB (IO/s)
r5b.large       2       16 GiB    Up to 10,000                     Up to 43,333
r5b.xlarge      4       32 GiB    Up to 10,000                     Up to 43,333
r5b.2xlarge     8       64 GiB    Up to 10,000                     Up to 43,333
r5b.4xlarge     16      128 GiB   10,000                           43,333
r5b.8xlarge     32      256 GiB   20,000                           86,667
r5b.12xlarge    48      384 GiB   30,000                           130,000
r5b.16xlarge    64      512 GiB   40,000                           173,333
r5b.24xlarge    96      768 GiB   60,000                           260,000
r5b.metal       96      768 GiB   60,000                           260,000

Customers operating storage-performance-sensitive workloads can migrate from R5 to R5b to consolidate their existing workloads into fewer or smaller instances. This can reduce the cost of both infrastructure and licensed commercial software running on those instances. R5b instances are supported by Amazon RDS for Oracle and Amazon RDS for SQL Server, simplifying the migration path for large commercial database applications and improving storage performance for current RDS customers by up to 3x.
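
Because R5b keeps the R5 vCPU-to-memory ratio, moving an EBS-backed instance over is usually just a resize. A minimal boto3 sketch, with a placeholder instance ID (the instance must be stopped before its type can be changed):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder

# Stop the instance, switch it to an R5b size, then start it again
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": "r5b.4xlarge"},
)

ec2.start_instances(InstanceIds=[instance_id])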

All Nitro-compatible AMIs support R5b instances; the AMI must be EBS-backed HVM and have the NVMe 1.0e and ENA drivers installed before you launch an R5b instance. R5b supports io2 Block Express (in preview), io1, gp2, gp3, sc1, st1, and standard volumes. Support for io2 volumes, and for io1 volumes with Multi-Attach enabled, is coming soon.
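
You can verify the ENA part of that requirement from the API before launching (the NVMe driver itself has to be checked inside the image). A small boto3 sketch, with a placeholder AMI ID:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Check whether an AMI advertises ENA support (placeholder AMI ID)
image = ec2.describe_images(ImageIds=["ami-0123456789abcdef0"])["Images"][0]
print("ENA support:", image.get("EnaSupport", False))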

Available Today

R5b instances are available in the following Regions: US West (Oregon), Asia Pacific (Tokyo), US East (N. Virginia), US East (Ohio), Asia Pacific (Singapore), and Europe (Frankfurt). Amazon RDS on R5b instances is available in US East (Ohio), Asia Pacific (Singapore), and Europe (Frankfurt), and support in other Regions is coming soon.

Learn more about EC2 R5 instances and get started with Amazon EC2 today.

– Kame;

EC2 Update – D3 / D3en Dense Storage Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-update-d3-d3en-dense-storage-instances/

We have launched several generations of EC2 instances with dense storage including the HS1 in 2012 and the D2 in 2015. As you can guess from the name, our customers use these instances when they need massive amounts of very economical on-instance storage for their data warehouses, data lakes, network file systems, Hadoop clusters, and the like. These workloads demand plenty of I/O and network throughput, but work fine with a high ratio of storage to compute power.

New D3 and D3en Instances
Today we are launching the D3 and D3en instances. Like their predecessors, they give you access to massive amounts of low-cost on-instance HDD storage. The D3 instances are available in four sizes, with up to 32 vCPUs and 48 TB of storage. Here are the specs:

Instance Name   vCPUs   RAM       HDD Storage         Aggregate Disk Throughput (128 KiB Blocks)   Network Bandwidth   EBS-Optimized Bandwidth
d3.xlarge       4       32 GiB    6 TB (3 x 2 TB)     580 MiBps                                     Up to 15 Gbps       850 Mbps
d3.2xlarge      8       64 GiB    12 TB (6 x 2 TB)    1,100 MiBps                                   Up to 15 Gbps       1,700 Mbps
d3.4xlarge      16      128 GiB   24 TB (12 x 2 TB)   2,300 MiBps                                   Up to 15 Gbps       2,800 Mbps
d3.8xlarge      32      256 GiB   48 TB (24 x 2 TB)   4,600 MiBps                                   25 Gbps             5,000 Mbps

As you can see from the table above, the D3 instances are available in the same configurations as the D2 instances for easy migration. You’ll get 5% more memory per vCPU, a 30% boost in compute power, and 2.5x higher network performance if you migrate from D2 to D3. The instances provide low-cost dense storage that delivers high performance sequential access to large data sets. They are perfect for distributed file systems such as HDFS and MapR FS, big data analytical workloads, data warehouses, log processing, and data processing.

The D3en instances are available in six sizes, with up to 48 vCPUs and 336 TB of storage. Here are the specs:

Instance Name   vCPUs   RAM       HDD Storage          Aggregate Disk Throughput (128 KiB Blocks)   Network Bandwidth   EBS-Optimized Bandwidth
d3en.xlarge     4       16 GiB    28 TB (2 x 14 TB)    500 MiBps                                     Up to 25 Gbps       850 Mbps
d3en.2xlarge    8       32 GiB    56 TB (4 x 14 TB)    1,000 MiBps                                   Up to 25 Gbps       1,700 Mbps
d3en.4xlarge    16      64 GiB    112 TB (8 x 14 TB)   2,000 MiBps                                   25 Gbps             2,800 Mbps
d3en.6xlarge    24      96 GiB    168 TB (12 x 14 TB)  3,100 MiBps                                   40 Gbps             4,000 Mbps
d3en.8xlarge    32      128 GiB   224 TB (16 x 14 TB)  4,100 MiBps                                   50 Gbps             5,000 Mbps
d3en.12xlarge   48      192 GiB   336 TB (24 x 14 TB)  6,200 MiBps                                   75 Gbps             7,000 Mbps

The D3en instances have a high ratio of storage to vCPU, and are optimized for high throughput and high sequential I/O to very large data sets, with a cost-per-TB that is 80% lower than on D2 instances. D3en instances can host Lustre, BeeGFS, GPFS, and other distributed file systems, they can store your data lakes, and they can run your Amazon EMR, Spark, and Hadoop analytical workloads.

Both of the instance types are built on the AWS Nitro System and are powered by 2nd generation custom Intel® Xeon® Scalable (Cascade Lake) processors that can deliver all-core turbo performance of up to 3.1 GHz. The HDD storage is encrypted at rest using AES-256-XTS; traffic between D3 or D3en instances in the same VPC or within peered VPCs is encrypted using a 256-bit key.

Things to Know
Here are a couple of things that you should keep in mind regarding the D3 and D3en instances:

Regions – D3 instances are available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions; D3en instances are available in all of those Regions and also in the US East (Ohio) Region, with more Regions coming soon.

Purchase Options – You can purchase D3 and D3en instances in On-Demand, Savings Plan, Reserved Instance, Spot, and Dedicated Instance form.

AMIs – You must use AMIs that include the Elastic Network Adapter (ENA) and NVMe drivers.

Now Available
D3 and D3en instances are available now and you can start using them today!

Jeff;