Tag Archives: Amazon CloudWatch

Optimizing EC2 Workloads with Amazon CloudWatch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/optimizing-ec2-workloads-with-amazon-cloudwatch/

This post is written by David (Dudu) Twizer, Principal Solutions Architect, and Andy Ward, Senior AWS Solutions Architect – Microsoft Tech.

In December 2020, AWS announced the availability of gp3, the next-generation General Purpose SSD volumes for Amazon Elastic Block Store (Amazon EBS), which allow customers to provision performance independent of storage capacity and provide up to a 20% lower price-point per GB than existing volumes.

This new release provides an excellent opportunity to right-size your storage layer by leveraging AWS’ built-in monitoring capabilities. This is especially important with SQL workloads as there are many instance types and storage configurations you can select for your SQL Server on AWS.

Many customers ask for our advice on choosing the ‘best’ or the ‘right’ storage and instance configuration, but there is no one solution that fits all circumstances. This blog post covers the critical techniques to right-size your workloads. We focus on right-sizing a SQL Server as our example workload, but the techniques we will demonstrate apply equally to any Amazon EC2 instance running any operating system or workload.

We create and use an Amazon CloudWatch dashboard to highlight any limits and bottlenecks within our example instance. Using our dashboard, we can ensure that we are using the right instance type and size, and the right storage volume configuration. The dimensions we look into are EC2 Network throughput, Amazon EBS throughput and IOPS, and the relationship between instance size and Amazon EBS performance.

 

The Dashboard

It can be challenging to locate every relevant resource limit and configure appropriate monitoring. To simplify this task, we wrote a simple Python script that creates a CloudWatch Dashboard with the relevant metrics pre-selected.

The script takes an instance-id list as input, and it creates a dashboard with all of the relevant metrics. The script also creates horizontal annotations on each graph to indicate the maximums for the configured metric. For example, for an Amazon EBS IOPS metric, the annotation shows the Maximum IOPS. This helps us identify bottlenecks.
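If you are curious what the script does under the hood, the following is a minimal, illustrative sketch in Python (not the script itself) of publishing a one-widget dashboard with a horizontal annotation using boto3. The volume ID, Region, dashboard name, and IOPS limit are placeholders.

# Minimal illustrative sketch (not the actual script): publish a one-widget
# CloudWatch dashboard with a horizontal annotation marking an EBS volume's
# assumed IOPS limit. Volume ID, Region, and values are placeholders.
import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

widget = {
    "type": "metric",
    "x": 0, "y": 0, "width": 12, "height": 6,
    "properties": {
        "title": "EBS ops per 60s (vol-example)",
        "region": "eu-west-1",
        "stat": "Sum",
        "period": 60,
        "metrics": [
            ["AWS/EBS", "VolumeReadOps", "VolumeId", "vol-example"],
            ["AWS/EBS", "VolumeWriteOps", "VolumeId", "vol-example"],
        ],
        # The "red line": 3,000 IOPS expressed as operations per 60-second period
        "annotations": {"horizontal": [{"label": "Max IOPS (3000 x 60s)", "value": 180000}]},
    },
}

cloudwatch.put_dashboard(
    DashboardName="example-right-sizing",
    DashboardBody=json.dumps({"widgets": [widget]}),
)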

Please take a moment now to run the script using either of the methods described below. Then, we run through the created dashboard and each widget, and guide you through the optimization steps that allow you to increase performance and decrease cost for your workload.

 

Creating the Dashboard with CloudShell

First, we log in to the AWS Management Console and load AWS CloudShell.

Once we have logged in to CloudShell, we must set up our environment using the following command:

# Download the script locally
wget -L https://raw.githubusercontent.com/aws-samples/amazon-ec2-mssql-workshop/master/resources/code/Monitoring/create-cw-dashboard.py

# Prerequisites (venv and boto3)
python3 -m venv env # Optional
source env/bin/activate  # Optional
pip3 install boto3 # Required

The preceding commands download the script and configure the CloudShell environment with the correct Python settings to run it. Run the following command to create the CloudWatch Dashboard.

# Execute
python3 create-cw-dashboard.py --InstanceList i-example1 i-example2 --region eu-west-1

At its most basic, you only need to specify the list of instances you are interested in (i-example1 and i-example2 in the preceding example), and the Region within which those instances are running (eu-west-1 in the preceding example). For detailed usage instructions, see the README file here. A link to the CloudWatch Dashboard is provided in the output from the command.

 

Creating the Dashboard Directly from your Local Machine

If you’re familiar with running the AWS CLI locally, and have Python and the other prerequisites installed, then you can run the same commands as in the preceding CloudShell example, but from your local environment. For detailed usage instructions, see the README file here. If you run into any issues, we recommend running the script from CloudShell as described previously.

 

Examining Our Metrics

 

Once the script has run, navigate to the CloudWatch Dashboard that has been created. A direct link to the CloudWatch Dashboard is provided as an output of the script. Alternatively, you can navigate to CloudWatch within the AWS Management Console, and select the Dashboards menu item to access the newly created CloudWatch Dashboard.

The Network Layer

The first widget of the CloudWatch Dashboard is the EC2 Network throughput:

The automatic annotation creates a red line that indicates the maximum throughput your Instance can provide in Mbps (Megabits per second). This metric is important when running workloads with high network throughput requirements. For our SQL Server example, this has additional relevance when considering adding replica Instances for SQL Server, which place an additional burden on the Instance’s network.

 

In general, if your Instance is frequently reaching 80% of this maximum, you should consider choosing an Instance with greater network throughput. For our SQL example, we could consider changing our architecture to minimize network usage. For example, if we were using an “Always On Availability Group” spread across multiple Availability Zones and/or Regions, then we could consider using an “Always On Distributed Availability Group” to reduce the amount of replication traffic. Before making a change of this nature, take some time to consider any SQL licensing implications.

 

If your Instance generally doesn’t exceed 10% network utilization, networking is not a bottleneck. For SQL, if you have low network utilization coupled with high Amazon EBS throughput utilization, you should consider optimizing the Instance’s storage usage by offloading some Amazon EBS usage onto networking – for example, by implementing SQL as a Failover Cluster Instance with shared storage on Amazon FSx for Windows File Server, or by moving SQL backup storage onto Amazon FSx.

The Storage Layer

The second widget of the CloudWatch Dashboard is the overall EC2 to Amazon EBS throughput – that is, the sum of throughput across all attached EBS volumes.

Each Instance type and size has a different Amazon EBS throughput limit, and the script automatically annotates the graph based on the specs of your instance. This metric is often heavily utilized by SQL workloads, which are usually considered storage-heavy.

If you find data points that reach the maximum, such as in the preceding screenshot, this indicates that your workload has a bottleneck in the storage layer. Let’s see if we can find the EBS volume that is using all this throughput in our next series of widgets, which focus on individual EBS volumes.

And here is the culprit. From the widget, we can see the volume ID and type, and the performance maximum for this volume. Each graph represents one of the two dimensions of the EBS volume: throughput and IOPS. The automatic annotation gives you visibility into the limits of the specific volume in use. In this case, we are using a gp3 volume, configured with a 750-MBps throughput maximum and 3000 IOPS.

Looking at the widget, we can see that the throughput reaches certain peaks, but they are less than the configured maximum. Considering the preceding screenshot, which shows that the overall instance Amazon EBS throughput is reaching maximum, we can conclude that the gp3 volume here is unable to reach its maximum performance. This is because the Instance we are using does not have sufficient overall throughput.

Let’s change the Instance size so that we can see if that fixes our issue. When changing Instance or volume types and sizes, remember to re-run the dashboard creation script to update the thresholds. We recommend using the same script parameters, as re-running the script with the same parameters overwrites the initial dashboard and updates the threshold annotations – the metrics data will be preserved. Running the script with a different dashboard name parameter creates a new dashboard and leaves the original dashboard in place. However, the thresholds in the original dashboard won’t be updated, which can lead to confusion.

Here is the widget for our EBS volume after we increased the size of the Instance:

We can see that the EBS volume is now able to reach its configured maximums without issue. Let’s look at the overall Amazon EBS throughput for our larger Instance as well:

We can see that the Instance now has sufficient Amazon EBS throughput to support our gp3 volume’s configured performance, and we have some headroom.

Now, let’s swap our Instance back to its original size, and swap our gp3 volume for a Provisioned IOPS io2 volume with 45,000 IOPS, and re-run our script to update the dashboard. Running an IOPS intensive task on the volume results in the following:

As you can see, despite having 45,000 IOPS configured, it seems to be capping at about 15,000 IOPS. Looking at the instance level statistics, we can see the answer:

Much like with our throughput testing earlier, we can see that our io2 volume performance is being restricted by the Instance size. Let’s increase the size of our Instance again, and see how the volume performs when the Instance has been correctly sized to support it:

We are now reaching the configured limits of our io2 volume, which is exactly what we wanted and expected to see. The instance level IOPS limit is no longer restricting the performance of the io2 volume:

Using the preceding steps, we can identify where storage bottlenecks are, and whether we are using the right type of EBS volume for the workload. In our examples, we sought bottlenecks and scaled upwards to resolve them. The same process should also be used to identify where resources have been over-provisioned as well as under-provisioned.

If we see a volume that never reaches the maximums that it has been configured for, and that is not subject to any other bottlenecks, we usually conclude that the volume in question can be right-sized to a more appropriate volume that costs less, and better fits the workload.

We can, for example, change an Amazon EBS gp2 volume to an EBS gp3 volume with the correct IOPS and throughput. EBS gp3 provides up to 1000-MBps throughput per volume and costs $0.08/GB (versus $0.10/GB for gp2). Additionally, unlike with gp2, gp3 volumes allow you to specify provisioned IOPS independently of size and throughput. By using the process described above, we could identify that a gp2, io1, or io2 volume could be swapped out with a more cost-effective gp3 volume.
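As a rough sketch of how such a change could be made programmatically (the volume ID and performance values below are placeholders, and you should size them based on your own metrics), a gp2 volume can be converted in place with the EC2 ModifyVolume API:

# Hedged sketch: convert an existing gp2 volume to gp3 with explicit IOPS and
# throughput. The volume ID and performance values are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder
    VolumeType="gp3",
    Iops=3000,        # gp3 baseline; raise only if your metrics show you need it
    Throughput=125,   # MiB/s; gp3 baseline, configurable up to 1,000
)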

If during our analysis we observe an SSD-based volume with relatively high throughput usage, but low IOPS usage, we should investigate further. A lower-cost HDD-based volume, such as an st1 or sc1 volume, might be more cost-effective while maintaining the required level of performance. Amazon EBS st1 volumes provide up to 500 MBps throughput and cost $0.045 per GB-month, and are often an ideal volume-type to use for SQL backups, for example.

Additional storage optimization you can implement

Move the TempDB to Instance Store NVMe storage – The data on an SSD instance store volume persists only for the life of its associated instance. This is a good fit for TempDB, which SQL Server recreates each time the service starts, so no data needs to persist across a stop and start. Placing the TempDB on the local instance store frees up Amazon EBS throughput while providing better performance, as the storage is locally attached to the instance.

Consider Amazon FSx for Windows File Server as a shared storage solution – As described here, Amazon FSx can be used to store a SQL database on a shared location, enabling the use of a SQL Server Failover Cluster Instance.

 

The Compute Layer

After you finish optimizing your storage layer, wait a few days and re-examine the metrics for both Amazon EBS and networking. Use these metrics in conjunction with CPU metrics and Memory metrics to select the right Instance type to meet your workload requirements.

AWS offers nearly 400 instance types in different sizes. From a SQL perspective, it’s essential to choose instances with high single-thread performance, such as the z1d instance, due to SQL’s license-per-core model. z1d instances also provide instance store storage for the TempDB.

You might also want to check out the AWS Compute Optimizer, which helps you by automatically recommending instance types by using machine learning to analyze historical utilization metrics. More details can be found here.

We strongly advise you to thoroughly test your applications after making any configuration changes.

 

Conclusion

This blog post covers some simple and useful techniques to gain visibility into important instance metrics, and provides a script that greatly simplifies the process. Any workload running on EC2 can benefit from these techniques. We have found them especially effective at identifying actionable optimizations for SQL Servers, where small changes can have beneficial cost, licensing and performance implications.

 

 

Monitoring and troubleshooting serverless data analytics applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/monitoring-and-troubleshooting-serverless-data-analytics-applications/

This series is about building serverless solutions in streaming data workloads. The application example used in this series is Alleycat, which allows bike racers to compete with each other virtually on home exercise bikes.

The first four posts have explored the architecture behind the application, which is enabled by Amazon Kinesis, Amazon DynamoDB, and AWS Lambda. This post explains how to monitor and troubleshoot issues that are common in streaming applications.

To set up the example, visit the GitHub repo and follow the instructions in the README.md file. Note that this walkthrough uses services that are not covered by the AWS Free Tier and incur cost.

Monitoring the Alleycat application

The business requirements for Alleycat state that it must handle up to 1,000 simultaneous racers. With each racer emitting a message every second, each 5-minute race results in 300,000 messages.

Reference architecture

While the architecture can support this throughput, the settings for each service determine how the workload scales up. The deployment templates in the GitHub repo do not use sufficiently high settings to handle this amount of data. In this section, I show how this results in errors and what steps you can take to resolve the issues. To start, I run the simulator for several races with the maximum racers configuration set to 1,000.

Monitoring the Kinesis stream

The monitoring tab of the Kinesis stream provides visualizations of stream metrics. This immediately shows that there is a problem in the application when running at full capacity:


  1. The iterator age is growing, indicating that the data consumers are falling behind the data producers. The Get records graph also shows the number of records in the stream growing.
  2. The Incoming data (count) metric shows the number of separate records ingested by the stream. The red line indicates the maximum capacity of this single-shard stream. With 1,000 active racers, this is almost at full capacity.
  3. However, the Incoming data – sum (bytes) graph shows that the total amount of data ingested by the stream is currently well under the maximum level shown by the red line.

There are two solutions for improving the capacity of the stream. First, the data producer application (the Alleycat frontend) could combine messages before sending. The stream is currently reaching its maximum number of records per second, but its total byte throughput is significantly below the maximum. Combining messages improves packing but increases latency, since the frontend waits to group messages.

Alternatively, you can add capacity by resharding. This enables you to increase (or decrease) the number of shards in a stream to adapt to the rate of data flowing through the application. You can do this with the UpdateShardCount API action. The existing stream goes into an Updating status and the stream scales by splitting shards. This creates two new child shards that split the partition keyspace of the parent. It also results in another, separate Lambda consumer for the new shard.
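As an illustration, resharding can be triggered with a single API call. The stream name and target count below are placeholders for this example's single-shard stream:

# Sketch: double the shard count of the stream using uniform scaling.
# The stream name and target count are placeholders.
import boto3

kinesis = boto3.client("kinesis")

kinesis.update_shard_count(
    StreamName="alleycat-stream",     # placeholder name
    TargetShardCount=2,               # was 1 shard in this example
    ScalingType="UNIFORM_SCALING",
)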

Monitoring the Lambda function

The monitoring tab of the consuming Lambda function provides visualization of metrics that can highlight problems in the workload. At full capacity, the monitoring highlights issues to resolve:


  1. The Duration chart shows that the function is exceeding its 15-second timeout, when the function normally finishes in under a second. This typically indicates that there are too many records to process in a single batch or throttling is occurring downstream.
  2. The Error count metric is growing, which highlights either logical errors in the code or errors from API calls to downstream resources.
  3. The IteratorAge metric appears for Lambda functions that are consuming from streams. In this case, the growing metric confirms that data consumption is falling behind data production in the stream.
  4. Concurrent executions remain at 1 throughout. This is set by the parallelization factor in the event source mapping and can be increased up to 10.

Monitoring the DynamoDB table

The metric tab on the application’s table in the DynamoDB console provides visualizations for the performance of the service:


  1. The consumed Read usage is well within the provisioned maximum and there is no read throttling on the table.
  2. Consumed Write usage, shown in blue, is frequently bursting through the provisioned capacity.
  3. The number of Write throttled requests confirms that the DynamoDB service is throttling requests since the table is over capacity.

You can resolve this issue by increasing the provisioned throughput on the table and related global secondary indexes. Write capacity units (WCUs) provide 1 KB of write throughput per second. You can set this value manually, use automatic scaling to match varying throughput, or enable on-demand mode. Read more about the pricing models for each to determine the best approach for your workload.
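The following is a hedged sketch of both options using boto3. The table name and capacity values are placeholders, and tables with global secondary indexes also need GlobalSecondaryIndexUpdates when raising provisioned capacity:

# Sketch: two ways to add write capacity to the table. Names and values are
# placeholders; pick one approach for a real table.
import boto3

dynamodb = boto3.client("dynamodb")

# Option 1: raise provisioned throughput (GSIs must be updated separately
# via GlobalSecondaryIndexUpdates)
dynamodb.update_table(
    TableName="alleycat-results",  # placeholder
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 50},
)

# Option 2: switch the table to on-demand mode instead
# dynamodb.update_table(TableName="alleycat-results", BillingMode="PAY_PER_REQUEST")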

Monitoring Kinesis Data Streams

Kinesis Data Streams ingests data into shards, which are fixed capacity sequences of records, up to 1,000 records or 1 MB per second. There is no limit to the amount of data held within a stream but there is a configurable retention period. By default, Kinesis stores records for 24 hours but you can increase this up to 365 days as needed.

Kinesis is integrated with Amazon CloudWatch. Basic metrics are published every minute, and you can optionally enable enhanced metrics for an additional charge. In this section, I review the most commonly used metrics for monitoring the health of streams in your application.

Metrics for monitoring data producers

When data producers are throttled, they cannot put new records onto a Kinesis stream. Use the WriteProvisionedThroughputExceeded metric to detect if producers are throttled. If this is more than zero, you won’t be able to put records to the stream. Monitoring the Average for this statistic can help you determine if your producers are healthy.

When producers succeed in sending data to a stream, the PutRecord.Success and PutRecords.Success metrics are incremented. Monitoring for spikes or drops in these metrics can help you monitor the health of producers and catch problems early. There are separate metrics for each of the two API calls, so watch the Average statistic for whichever call your application uses.

Metrics for monitoring data consumers

When data consumers are throttled or start to generate errors, Kinesis continues to accept new records from producers. However, there is growing latency between when records are written and when they are consumed for processing.

Using the GetRecords.IteratorAgeMilliseconds metric, you can measure the difference between the age of the last record consumed and the latest record put to the stream. It is important to monitor the iterator age. If the age is high in relation to the stream’s retention period, you can lose data as records expire from the stream. This value should generally not exceed 50% of the stream’s retention period – when the value reaches 100% of the stream retention period, data is lost.
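One practical way to act on this guidance is a CloudWatch alarm on the metric. The following sketch assumes a 24-hour retention period and alarms at roughly half of it; the stream name, threshold, and alarm settings are placeholders to adjust for your workload:

# Sketch: alarm when the consumer falls too far behind. The threshold is 50%
# of an assumed 24-hour retention period, expressed in milliseconds.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="alleycat-iterator-age",  # placeholder
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "alleycat-stream"}],  # placeholder
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=12 * 60 * 60 * 1000,  # 12 hours in ms (half of 24-hour retention)
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)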

If the iterator age is growing, one temporary solution is to increase the retention time of the stream. This gives you more time to resolve the issue before losing data. A more permanent solution is to add more consumers to keep up with data production, or resolve any errors that are slowing consumers.
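For example, retention can be extended with one call while you scale the consumers; the stream name and retention value below are placeholders:

# Sketch: temporarily extend retention to 7 days while the consumer catches up.
import boto3

kinesis = boto3.client("kinesis")

kinesis.increase_stream_retention_period(
    StreamName="alleycat-stream",   # placeholder
    RetentionPeriodHours=168,       # 7 days; default is 24, maximum is 8,760 (365 days)
)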

When consumers exceed the read throughput of the stream, they are throttled and cannot read from the stream, and the ReadProvisionedThroughputExceeded metric increases. This results in a growth of records in the stream waiting for processing. Monitor the Average statistic for this metric and aim for values as close to 0 as possible.

The GetRecords.Success metric is the consumer-side equivalent of PutRecords.Success. Monitor this value for spikes or drops to ensure that your consumers are healthy. The Average is usually the most useful statistic for this purpose.

Increasing data processing throughput for Kinesis Data Streams

Adjusting the parallelization factor

Kinesis invokes Lambda consumers every second with a configurable batch size of messages. It’s important that the processing in the function keeps pace with the rate of traffic to avoid a growing iterator age. For compute intensive functions, you can increase the memory allocated in the function, which also increases the amount of virtual CPU available. This can help reduce the duration of a processing function.

If this is not possible or the function is falling behind data production in the stream, consider increasing the parallelization factor. By default, this is set to 1, meaning that each shard has a single instance of a Lambda function it invokes. You can increase this up to 10, which results in multiple instances of the consumer function processing additional batches of messages.
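The factor can be set in the event source mapping definition, or changed on an existing mapping as in the sketch below. The mapping UUID is a placeholder, which you can look up with list_event_source_mappings:

# Sketch: raise the parallelization factor on an existing Kinesis event source
# mapping. The UUID is a placeholder.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",  # placeholder mapping UUID
    ParallelizationFactor=10,                     # default is 1, maximum is 10
)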


Using enhanced fan-out to reduce iterator age

Standard consumers use a pull model over HTTP to fetch batches of records. Each consumer operates in serial. A stream with five consumers averages 200 ms of latency each, meaning it takes up to 1 second for all five to receive batches of records.

You can improve the overall latency by removing any unnecessary data consumers. If you use Kinesis Data Firehose and Kinesis Data Analytics on a stream, these count as consumers too. If you can remove subscribers, this helps with overall data consumption throughput.

If the workload needs all of the existing subscribers, use enhanced fan-out (EFO). EFO consumers use a push model over HTTP/2 and are independent of each other. With EFO, the same five consumers in the previous example would receive batches of messages in parallel, using dedicated throughput. Overall latency averages 70 ms and typically data delivery speed is improved by up to 65%. There is an additional charge for this feature.
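Registering an EFO consumer is a single API call; the stream ARN and consumer name below are placeholders, and a Lambda event source mapping would typically then reference the returned consumer ARN:

# Sketch: register an enhanced fan-out consumer for the stream. ARN and name
# are placeholders.
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/alleycat-stream",  # placeholder
    ConsumerName="alleycat-efo-consumer",
)
print(response["Consumer"]["ConsumerARN"])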


To learn more about processing streaming data with Lambda, see this AWS Online Tech Talk presentation.

Conclusion

In this post, I show how the existing settings in the Alleycat application are not sufficient for handling the expected amount of traffic. I walk through the metrics visualizations for Kinesis Data Streams, Lambda, and DynamoDB to find which quotas should be increased.

I explain which CloudWatch metrics can be used with Kinesis Data Stream to ensure that data producers and data consumers are healthy. Finally, I show how you can use the parallelization factor and enhanced fan-out features to increase the throughput of data consumers.

For more serverless learning resources, visit Serverless Land.

Building well-architected serverless applications: Managing application security boundaries – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-managing-application-security-boundaries-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Security question SEC2: How do you manage your serverless application’s security boundaries?

Defining and securing your serverless application’s boundaries ensures isolation for, within, and between components.

Required practice: Evaluate and define resource policies

Resource policies are AWS Identity and Access Management (IAM) statements. They are attached to resources such as an Amazon S3 bucket, or an Amazon API Gateway REST API resource or method. The policies define what identities have fine-grained access to the resource. To see which services support resource-based policies, see “AWS Services That Work with IAM”. For more information on how resource policies and identity policies are evaluated, see “Identity-Based Policies and Resource-Based Policies”.

Understand and determine which resource policies are necessary

Resource policies can protect a component by restricting inbound access to managed services. Use resource policies to restrict access to your component based on a number of identities, such as the source IP address/range, function event source, version, alias, or queues. Resource policies are evaluated and enforced at the IAM level before each AWS service applies its own authorization mechanisms, when available. For example, IAM resource policies for API Gateway REST APIs can deny access to an API before an AWS Lambda authorizer is called.

If you use multiple AWS accounts, you can use AWS Organizations to manage and govern individual member accounts centrally. Certain resource policies can be applied at the organization level, providing guardrails for what actions AWS accounts within the organization root or OU can do. For more information, see “Understanding how AWS Organizations Service Control Policies work”.

Review your existing policies and how they’re configured, paying close attention to how permissive individual policies are. Your resource policies should only permit necessary callers.

Implement resource policies to prevent unauthorized access

For Lambda, use resource-based policies to provide fine-grained access to what AWS IAM identities and event sources can invoke a specific version or alias of your function. Resource-based policies can also be used to control access to Lambda layers. You can combine resource policies with Lambda event sources. For example, if API Gateway invokes Lambda, you can restrict the policy to the API Gateway ID, HTTP method, and path of the request.
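As an illustration of such a policy statement (all identifiers below are placeholders), a permission can be added to a function alias with the Lambda AddPermission API:

# Sketch: a resource-based policy statement that lets one API Gateway method
# invoke a specific function alias. All identifiers are placeholders.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.add_permission(
    FunctionName="my-function",        # placeholder function name
    Qualifier="prod",                  # placeholder alias
    StatementId="apigw-get-orders",
    Action="lambda:InvokeFunction",
    Principal="apigateway.amazonaws.com",
    # Restrict to a single API, HTTP method, and resource path
    SourceArn="arn:aws:execute-api:us-east-1:123456789012:abcde12345/*/GET/orders",
)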

In the serverless airline example used in this series, the IngestLoyalty service uses a Lambda function that subscribes to an Amazon Simple Notification Service (Amazon SNS) topic. The Lambda function resource policy allows SNS to invoke the Lambda function.

Lambda resource policy document

API Gateway resource-based policies can restrict API access to specific Amazon Virtual Private Cloud (VPC), VPC endpoint, source IP address/range, AWS account, or AWS IAM users.

Amazon Simple Queue Service (SQS) resource-based policies provide fine-grained access to certain AWS services and AWS IAM identities (users, roles, accounts). Amazon SNS resource-based policies restrict authenticated and non-authenticated actions to topics.

Amazon DynamoDB resource-based policies provide fine-grained access to tables and indexes. Amazon EventBridge resource-based policies restrict AWS identities to send and receive events including to specific event buses.

For Amazon S3, use bucket policies to grant permission to your Amazon S3 resources.

The AWS re:Invent session Best practices for growing a serverless application includes further suggestions on enforcing security best practices.

Best practices for growing a serverless application

Good practice: Control network traffic at all layers

Apply controls for both inbound and outbound traffic, including data loss prevention. Define requirements that help you protect your networks and protect against exfiltration.

Use networking controls to enforce access patterns

API Gateway and AWS AppSync have support for AWS Web Application Firewall (AWS WAF), which helps protect web applications and APIs from attacks. AWS WAF enables you to configure a set of rules called a web access control list (web ACL). These rules allow you to block or count web requests based on customizable web security rules and conditions that you define. These can include specified IP address ranges, CIDR blocks, specific countries, or Regions. You can also block requests that contain malicious SQL code, or requests that contain malicious script. For more information, see How AWS WAF Works.

A private API endpoint is an API Gateway API that can only be accessed from your Amazon Virtual Private Cloud (Amazon VPC) using an interface VPC endpoint. This is an elastic network interface that you create in a VPC. Traffic to your private API uses secure connections and does not leave the Amazon network; it is isolated from the public internet. For more information, see “Creating a private API in Amazon API Gateway”.

To restrict access to your private API to specific VPCs and VPC endpoints, you must add conditions to your API’s resource policy. For example policies, see the documentation.

By default, Lambda runs your functions in a secure Lambda-owned VPC that is not connected to your account’s default VPC. Functions can access anything available on the public internet. This includes other AWS services, HTTPS endpoints for APIs, or services and endpoints outside AWS. The function cannot directly connect to your private resources inside of your VPC.

You can configure a Lambda function to connect to private subnets in a VPC in your account. When a Lambda function is configured to use a VPC, the Lambda function still runs inside the Lambda service VPC. The function then sends all network traffic through your VPC and abides by your VPC’s network controls. Functions deployed to virtual private networks must consider network access to restrict resource access.

AWS Lambda service VPC with VPC-to-VPC NAT to customer VPC

When you connect a function to a VPC in your account, the function cannot access the internet, unless the VPC provides access. To give your function access to the internet, route outbound traffic to a NAT gateway in a public subnet. The NAT gateway has a public IP address and can connect to the internet through the VPC’s internet gateway. For more information, see “How do I give internet access to my Lambda function in a VPC?”. Connecting a function to a public subnet doesn’t give it internet access or a public IP address.

You can control the VPC settings for your Lambda functions using AWS IAM condition keys. For example, you can require that all functions in your organization are connected to a VPC. You can also specify the subnets and security groups that the function’s users can and can’t use.

Unsolicited inbound traffic to a Lambda function isn’t permitted by default. There is no direct network access to the execution environment where your functions run. When connected to a VPC, function outbound traffic comes from your own network address space.

You can use security groups, which act as a virtual firewall to control outbound traffic for functions connected to a VPC. Use security groups to permit your Lambda function to communicate with other AWS resources. For example, a security group can allow the function to connect to an Amazon ElastiCache cluster.

To filter or block access to certain locations, use VPC routing tables to configure routing to different networking appliances. Use network ACLs to block access to CIDR IP ranges or ports, if necessary. For more information about the differences between security groups and network ACLs, see “Compare security groups and network ACLs.”

In addition to API Gateway private endpoints, several AWS services offer VPC endpoints, including Lambda. You can use VPC endpoints to connect to AWS services from within a VPC without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection.

Using tools to audit your traffic

When you configure a Lambda function to use a VPC, or use private API endpoints, you can use VPC Flow Logs to audit your traffic. VPC Flow Logs allow you to capture information about the IP traffic going to and from network interfaces in your VPC. Flow log data can be published to Amazon CloudWatch Logs or S3 to see where traffic is being sent to at a granular level. Here are some flow log record examples. For more information, see “Learn from your VPC Flow Logs”.
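As a sketch of enabling flow logs programmatically (the VPC ID, log group, and IAM role below are placeholders that must already exist in your account):

# Sketch: publish VPC Flow Logs to a CloudWatch Logs group. The VPC ID, log
# group, and IAM role are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],       # placeholder VPC
    ResourceType="VPC",
    TrafficType="ALL",                           # ACCEPT, REJECT, or ALL
    LogDestinationType="cloud-watch-logs",
    LogGroupName="/vpc/flow-logs/example",       # placeholder log group
    DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs-role",  # placeholder
)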

Block network access when required

In addition to security groups and network ACLs, third-party tools allow you to disable outgoing VPC internet traffic. These can also be configured to allow traffic to AWS services or allow-listed services.

Conclusion

Managing your serverless application’s security boundaries ensures isolation for, within, and between components. In this post, I cover how to evaluate and define resource policies, showing what policies are available for various serverless services. I show some of the features of AWS WAF to protect APIs. Then I review how to control network traffic at all layers. I explain how Lambda functions connect to VPCs, and how to use private APIs and VPC endpoints. I walk through how to audit your traffic.

This well-architected question will be continued in part 2, where I look at using temporary credentials between resources and components. I cover why smaller, single purpose functions are better from a security perspective, and how to audit permissions. I show how to use AWS Serverless Application Model (AWS SAM) to create per-function IAM roles.

For more serverless learning resources, visit https://serverlessland.com.

Using AWS Systems Manager in Hybrid Cloud Environments

Post Syndicated from Shivam Patel original https://aws.amazon.com/blogs/architecture/using-aws-systems-manager-in-hybrid-cloud-environments/

Customers operating in hybrid environments today face tremendous challenges with regard to operational management, security/compliance, and monitoring. Systems administrators have to connect, monitor, patch, and automate across multiple Operating Systems (OS), applications, cloud, and on-premises infrastructure. Each of these scenarios has its own unique vendor and console purpose-built for a specific use case.

Using Hybrid Activations, a capability within AWS Systems Manager, you can manage resources irrespective of where they are hosted. You can securely initiate remote shell connections, automate patch management, and monitor critical metrics. You’re able to gain visibility into networking information and application installations via a single console.

In this post, we’ll discuss how the Session Manager and Patch Manager capabilities of Systems Manager allow you to securely connect to instances and virtual machines (VMs). You can centrally log session activity for later auditing and automate patch management, across both cloud and on-premises environments, within a single interface.

Session Manager

Session Manager is a fully managed feature of AWS Systems Manager. Session Manager provides secure and auditable instance management without the need to open inbound ports, maintain bastion hosts, or manage SSH keys. The centralized session management capability of Session Manager gives administrators the ability to centrally manage access to all compute instances. Irrespective of where your VM is hosted, a Session Manager session can be initiated from the AWS Management Console or from the command line interface (CLI). When using the CLI, the Session Manager plugin must be installed. The following screenshot shows an example of this.

Figure 1. Initiating instance management via Session Manager

The session is launched using the default system generated ssm-user account. With this account, the system does not prompt for a password when initiating root level commands. To improve security, OS accounts can be used to launch sessions using the Run As feature of Session Manager.

A session initiated via Session Manager is secure. The data exchange between the client and a managed instance takes place over a secure channel using TLS 1.2. To further improve your security posture, AWS Key Management Service (KMS) encryption can be used to encrypt the session traffic between a client and a managed instance. Encrypting session data with a customer managed key enables sessions to handle confidential data interactions. For using KMS encryption, both the user who starts sessions and the managed instance that they connect to, must have permission to use the key. Step-by-step instructions on how to set this up can be found in the Session Manager documentation.

Session Manager integrates with AWS CloudTrail, and this enables security teams to track when a user starts and shuts down sessions. Session Manager can also centrally log all session activity in Amazon CloudWatch or Amazon Simple Storage Service (S3). This gives system administrators the ability to manage details, such as when the session started, what commands were typed during the session, and when it ended. To configure Session Manager to send logs to CloudWatch and Amazon S3, the instance profile attached to the instance must have permissions to write to CloudWatch and S3. For an Amazon EC2 instance, this is the IAM role attached to the instance. For VMs running on VMware Cloud on AWS, or on-premises, this is the IAM role from the “Hybrid Activations” page.

The following shows an example of a session run on an on-premises instance via Session Manager, and the corresponding logs in CloudWatch. The logs are streamed continuously into CloudWatch.

Figure 2. CloudWatch log events for session activity via Session Manager

The following screenshot displays the ipconfig /all command being run remotely within PowerShell of an instance running within VMware Cloud on AWS via Session Manager:

Figure 3. Remote PowerShell session for on-premises VM via Session Manager

Patch Manager

Patch management is vital in maintaining a secure and compliant environment. Patch Manager, a capability of AWS Systems Manager, helps you monitor, select, and deploy operating system and software patches automatically. This can happen across compute running on Amazon EC2, VMware on-premises, or VMware Cloud on AWS instances.

The Patch Manager dashboard shows details such as number of instances, high-level patch compliance summaries, compliance reporting age, and common causes of noncompliance. As Patch Manager performs patching operations, it updates the dashboard with a summary of recent patching operations and a list of recurring patching tasks. This provides the operations team a single unified view into environments and simplifies their monitoring efforts.

Figure 4. Patch Manager dashboard

Figure 5. List of all recurring patching tasks

A patch baseline in Patch Manager defines which patches are approved for installation on your instances. Patch Manager provides predefined patch baselines for each supported operating system and also lets you create your own custom patch baselines. These patch baselines let you maintain patch consistency across your deployments on Amazon EC2, VMware on-premises, and VMware Cloud on AWS.

Custom patch baselines give you greater control over which patches are approved and when they are automatically applied. By using multiple patch baselines with different auto-approval delays or cutoffs, you can test patches in your development environment. Custom patch baselines also let you assign compliance levels to indicate the severity of the compliance violation when a patch is reported as missing.
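The sketch below shows, with illustrative names and rules only, what a custom baseline with an auto-approval delay and a compliance level might look like via the Systems Manager API:

# Sketch: a custom Windows patch baseline that auto-approves critical and
# important security updates seven days after release, and reports missing
# patches as CRITICAL. Name and rules are illustrative only.
import boto3

ssm = boto3.client("ssm")

response = ssm.create_patch_baseline(
    Name="example-windows-baseline",
    OperatingSystem="WINDOWS",
    ApprovalRules={
        "PatchRules": [
            {
                "PatchFilterGroup": {
                    "PatchFilters": [
                        {"Key": "CLASSIFICATION", "Values": ["SecurityUpdates"]},
                        {"Key": "MSRC_SEVERITY", "Values": ["Critical", "Important"]},
                    ]
                },
                "ApproveAfterDays": 7,
                "ComplianceLevel": "CRITICAL",
            }
        ]
    },
)
print(response["BaselineId"])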

Figure 6. List of Patch baselines

You can use a patch group to associate a group of instances with a specific patch baseline in Patch Manager. This ensures that you are deploying the appropriate patches with associated patch baseline rules, to the correct set of instances. These instances can be EC2, VMware on-premises, or VMware Cloud on AWS. You can also use patch groups to schedule patching during a specific maintenance window.
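As an illustrative follow-on (the baseline ID and group name are placeholders), a patch group is associated with a baseline like this:

# Sketch: associate a patch group with a custom baseline. Instances opt in by
# carrying the "Patch Group" tag with a matching value.
import boto3

ssm = boto3.client("ssm")

ssm.register_patch_baseline_for_patch_group(
    BaselineId="pb-0123456789abcdef0",   # placeholder baseline ID
    PatchGroup="hybrid-production",      # placeholder patch group name
)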

Patch Manager also provides the ability to scan your instances and VMs running within VMware on-premises and/or VMware Cloud on AWS. It can report compliance adherence based on pre-defined schedules. Patch compliance reports can also be saved to an Amazon S3 bucket of your choice and generated as needed. For reports on a single instance/VM, detailed patch data will be included. For reports run on all instances, a summary of missing patch data will be provided.

The Patch Manager feature of AWS Systems Manager also integrates with AWS Security Hub, a service providing a comprehensive view of your security alerts. It additionally offers security check automation capabilities. In the following image, we show non-compliant instances and servers being reported within AWS Security Hub across EC2, VMware on-premises, and VMware Cloud on AWS:

Figure 7. Non-compliant instances and VMs being reported via AWS Security Hub

Installation and deployment

To ease installation and deployment efforts, the SSM agent is pre-installed on instances created from the following Amazon Machine Images (AMIs):

  • Amazon Linux
  • Amazon Linux 2
  • Amazon Linux 2 ECS-Optimized Base AMIs
  • macOS 10.14.x (Mojave) and 10.15.x (Catalina)
  • Ubuntu Server 16.04, 18.04, and 20.04
  • Windows Server 2008-2012 R2 AMIs published in November 2016 or later
  • Windows Server 2016 and 2019

For other AMIs, and for VMs within VMware on-premises and/or VMware Cloud on AWS, manual agent installation must be performed.

Below is an architecture diagram of our solution described in this post:

Figure 8. General example of Systems Manager process flow

  1. Configure Systems Manager: Use the Systems Manager console, SDK, AWS Command Line Interface (AWS CLI), or AWS Tools for Windows PowerShell to configure, schedule, automate, and run actions that you want to perform on your AWS resources.
  2. Verification and processing: Systems Manager verifies the configurations, including permissions, and sends requests to the AWS Systems Manager SSM Agent running on your instances or servers in your hybrid environment. SSM Agent performs the specified configuration changes.
  3. Reporting: SSM Agent reports the status of the configuration changes and actions to Systems Manager in the AWS Cloud. If configured, Systems Manager then sends the status to the user and various AWS services.

Conclusion

In this post, we showcase how AWS Systems Manager can yield a unified view within your hybrid environments. It spans native AWS, VMware on-premises, and VMware Cloud on AWS. The Session Manager and Patch Manager features simplify instance connectivity and patch management. Other native capabilities of AWS Systems Manager allow application and change management, software inventory, remote initiation, and monitoring. We encourage you to use the features discussed in this post to maintain your servers across your hybrid environment.

Additional links for consideration:

Exploring serverless patterns for Amazon DynamoDB

Post Syndicated from Talia Nassi original https://aws.amazon.com/blogs/compute/exploring-serverless-patterns-for-amazon-dynamodb/

Amazon DynamoDB is a fully managed, serverless NoSQL database. In this post, you learn about the different DynamoDB patterns used in serverless applications, and use the recently launched Serverless Patterns Collection to configure DynamoDB as an event source for AWS Lambda.

Benefits of using DynamoDB as a serverless developer

DynamoDB is a serverless service that automatically scales up and down to adjust for capacity and maintain performance. It also has built-in high availability and fault tolerance. DynamoDB provides both provisioned and on-demand capacity modes so that you can optimize costs by specifying capacity per table, or paying for only the resources you consume. You are not provisioning, patching, or maintaining any servers.

Serverless patterns with DynamoDB

The recently launched Serverless Patterns Collection is a repository of serverless architecture examples that demonstrate integrating two or more AWS services. Each pattern uses either the AWS Serverless Application Model (AWS SAM) or AWS Cloud Development Kit (AWS CDK). These simplify the creation and configuration of the services referenced.

There are currently four patterns that use DynamoDB:

Amazon API Gateway REST API to Amazon DynamoDB

This pattern creates an Amazon API Gateway REST API that integrates with an Amazon DynamoDB table named “Music”. The API includes an API key and usage plan. The DynamoDB table includes a global secondary index named “Artist-Index”. The API integrates directly with the DynamoDB API and supports PutItem and Query actions. The REST API uses an AWS Identity and Access Management (IAM) role to provide full access to the specific DynamoDB table and index created by the AWS CloudFormation template. Use this pattern to store items in a DynamoDB table that come from the specified API.

AWS Lambda to Amazon DynamoDB

This pattern deploys a Lambda function, a DynamoDB table, and the minimum IAM permissions required to run the application. A Lambda function uses the AWS SDK to persist an item to a DynamoDB table.
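The published pattern implements this in its own runtime; purely as an illustrative sketch in Python, the function body amounts to a single PutItem call. The table name environment variable and item shape below are assumptions:

# Illustrative Python sketch (not the pattern's actual code): persist a single
# item to a DynamoDB table named by an environment variable.
import os
import uuid
import boto3

table = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "example-table"))

def handler(event, context):
    item = {"id": str(uuid.uuid4()), "payload": str(event)}
    table.put_item(Item=item)
    return {"statusCode": 200, "body": item["id"]}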

AWS Step Functions to Amazon DynamoDB

This pattern deploys a Step Functions workflow that accepts a payload and puts the item in a DynamoDB table. Additionally, this workflow also shows how to read an item directly from the DynamoDB table, and contains the minimum IAM permissions required to run the application.

Amazon DynamoDB to AWS Lambda

This pattern deploys the following Lambda function, DynamoDB table, and the minimum IAM permissions required to run the application. Changes to items in the DynamoDB table are sent to a DynamoDB stream. The Lambda service polls the stream and invokes the function with a payload containing the contents of the table item that changed. We use this pattern in the following steps.

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: An Amazon DynamoDB trigger that logs the updates made to a table.
Resources:
  DynamoDBProcessStreamFunction:
    Type: 'AWS::Serverless::Function'
    Properties:
      Handler: app.handler
      Runtime: nodejs14.x
      CodeUri: src/
      Description: An Amazon DynamoDB trigger that logs the updates made to a table.
      MemorySize: 128
      Timeout: 3
      Events:
        MyDynamoDBtable:
          Type: DynamoDB
          Properties:
            Stream: !GetAtt MyDynamoDBtable.StreamArn
            StartingPosition: TRIM_HORIZON
            BatchSize: 100
  MyDynamoDBtable:
    Type: 'AWS::DynamoDB::Table'
    Properties:
      AttributeDefinitions:
        - AttributeName: id
          AttributeType: S
      KeySchema:
        - AttributeName: id
          KeyType: HASH
      ProvisionedThroughput:
        ReadCapacityUnits: 5
        WriteCapacityUnits: 5
      StreamSpecification:
        StreamViewType: NEW_IMAGE

Setting up the Amazon DynamoDB to AWS Lambda Pattern

Prerequisites

For this tutorial, you need:

Downloading and testing the pattern

  1. From the Serverless Patterns home page, choose Amazon DynamoDB from the Filters menu. Then choose the DynamoDB to Lambda pattern.
  2. Clone the repository and change directories into the pattern’s directory.
    git clone https://github.com/aws-samples/serverless-patterns/
    cd serverless-patterns/dynamodb-lambda
  3. Run sam deploy --guided. This deploys your application. Keeping the responses blank chooses the default options displayed in the brackets.
  4. You see a confirmation message once your stack is created.
  5. Navigate to the DynamoDB console and choose Tables. Select the newly created table.
  6. Choose the Items tab and choose Create Item.
  7. Add an item and choose Save.
  8. You see that item now in the DynamoDB table.
  9. Navigate to the Lambda console and choose your function.
  10. From the Monitor tab, choose View logs in CloudWatch.
  11. In CloudWatch Logs, you see the new image of the item that was inserted into the DynamoDB table.

Anytime a new item is added to the DynamoDB table, the invoked Lambda function logs the event in Amazon CloudWatch Logs.

Configuring the event source mapping for the DynamoDB table

An event source mapping defines how a particular service invokes a Lambda function. In this post, you use DynamoDB as the event source for Lambda. There are a few attributes specific to a DynamoDB trigger.

The batch size controls how many items can be sent for each Lambda invocation. This template sets the batch size to 100, as shown in the following deployed resource. The batch window indicates how long to wait until it invokes the Lambda function.

These configurations are beneficial because they give you control over how the DynamoDB stream invokes your function. In a traditional database trigger, the trigger is invoked once per row per trigger action. With this batching capability, you can control the size of each payload and how frequently the function is invoked.

Trigger screenshot
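If you want to adjust these settings on an already-deployed mapping rather than in the template, the following sketch shows the API call; the mapping UUID is a placeholder you can find with list_event_source_mappings:

# Sketch: tune batching on the DynamoDB stream trigger after deployment.
# The mapping UUID is a placeholder.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",  # placeholder
    BatchSize=100,                                # max items per invocation
    MaximumBatchingWindowInSeconds=5,             # wait up to 5s to fill a batch
)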

Using DynamoDB capacity modes

DynamoDB has two read/write capacity modes for processing reads and writes on your tables: provisioned and on-demand. The read/write capacity mode controls how you pay for read and write throughput and how you manage capacity.

With provisioned mode, you specify the number of reads and writes per second that you require for your application. You can use automatic scaling to adjust the table’s provisioned capacity automatically in response to traffic changes. This helps to govern your DynamoDB use to stay at or below a defined request rate to obtain cost predictability.

Provisioned mode is a good option if you have predictable application traffic, or you run applications whose traffic is consistent or ramps gradually. To use provisioned mode in a DynamoDB table, enter ProvisionedThroughput as a property, and then define the read and write capacity:

 MyDynamoDBtable:
    Type: 'AWS::DynamoDB::Table'
    Properties:
      AttributeDefinitions:
        - AttributeName: id
          AttributeType: S
      KeySchema:
        - AttributeName: id
          KeyType: HASH
      ProvisionedThroughput:
        ReadCapacityUnits: 5
        WriteCapacityUnits: 5
      StreamSpecification:
        StreamViewType: NEW_IMAGE

With on-demand mode, DynamoDB accommodates workloads as they ramp up or down. If a workload’s traffic level reaches a new peak, DynamoDB adapts rapidly to accommodate the workload.

On-demand mode is a good option if you create new tables with unknown workload, or you have unpredictable application traffic. Additionally, it can be a good option if you prefer paying for only what you use. To use on-demand mode for a DynamoDB table, in the properties section of the template.yaml file, enter BillingMode: PAY_PER_REQUEST.

ApplicationTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Ref ApplicationTableName
      BillingMode: PAY_PER_REQUEST    
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES  

Stream specification

When DynamoDB sends the payload to Lambda, you can decide the view type of the stream. There are three options: new images, old images, and new and old images. To view only the new updated changes to the table, choose NEW_IMAGES as the StreamViewType. To view only the old change to the table, choose OLD_IMAGES as the StreamViewType. To view both the old image and new image, choose NEW_AND_OLD_IMAGES as the StreamViewType.

ApplicationTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: !Ref ApplicationTableName
      BillingMode: PAY_PER_REQUEST    
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES  

Cleanup

Once you have completed this tutorial, be sure to delete the CloudFormation stack that was created so that you do not continue to incur charges.

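One way to do this programmatically is sketched below; the stack name is a placeholder for whatever name you chose during sam deploy --guided:

# Sketch: delete the deployed stack. Replace the stack name with the one you
# chose during deployment.
import boto3

cloudformation = boto3.client("cloudformation")

stack_name = "serverless-patterns-dynamodb-lambda"  # placeholder stack name
cloudformation.delete_stack(StackName=stack_name)
cloudformation.get_waiter("stack_delete_complete").wait(StackName=stack_name)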

Submit a pattern to the Serverless Land Patterns Collection

While there are many patterns available to use from the Serverless Land website, there is also the option to create your own pattern and submit it. From the Serverless Patterns Collection main page, choose Submit a Pattern.

There you see guidance on how to submit. We have added many patterns from the community and we are excited to see what you build!

Conclusion

In this post, I explain the benefits of using DynamoDB patterns, and the different configuration settings, including batch size and batch window, that you can use in your pattern. I explain the difference between the two capacity modes, and I also show you how to configure a DynamoDB stream as an event source for Lambda by using the existing serverless pattern.

For more serverless learning resources, visit Serverless Land.

Monitoring memory usage in Amazon Lightsail instance

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/monitoring-memory-usage-lightsail-instance/

This post is written by Sebastian Lee, Solution Architect, Startup Singapore.

Amazon Lightsail is a great starting point for those looking to get started on AWS. Lightsail is ideal for startups, SMBs, and hobbyist developers because it simplifies the deployment of instances, databases, load balancers, CDNs, and even containers. However, you cannot track metrics beyond CPU utilization, network utilization, and error messages. Many startups and small businesses need to review more metrics, like memory usage and disk usage.

In this blog, I walk through the steps to configure a Lightsail instance to send memory usage to Amazon CloudWatch for monitoring, alarming and notifications.

architecture overview

Product and Solution Overview

Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site-reliability engineers, and IT managers. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events. It provides a unified view of your AWS resources, applications, and services that run on AWS and on-premises servers. You can configure your Lightsail resources to work with Amazon CloudWatch to receive more metrics.

The following sections include steps to install the CloudWatch agent on your Amazon Lightsail instance and configure it with the necessary permissions to send memory usage metrics to Amazon CloudWatch.

Prerequisites

Before you begin the walkthrough, you must have an instance running in your Lightsail account. You can follow the steps here if you need help creating an instance.

Walkthrough

1. Create IAM user

First, you must create an IAM user to provide permission to send data to CloudWatch.

  1. Sign in to the AWS Management Console and open the IAM console.
  2. In the navigation pane, choose Users, and then choose Add user.
  3. Enter “lightsail-cloudwatch-agent” in the User name text box.
  4. For Access type, select Programmatic access, and then choose Next: Permissions.
  5. For Set permissions, choose Attach existing policies directly.
    1. In the list of policies, select the check box next to CloudWatchAgentServerPolicy. You can use the search text box to find the policy.
  6. Choose Next: Tags.
  7. Optionally, you can add one or more tag-key value pairs to organize, track, or control access for this role, and then choose Next: Review.
  8. Confirm that the correct policies are listed, and then choose Create user.
  9. In the row for the new user, choose Show. Copy the access key and secret key to a file so that you can use them when installing the agent.
    1. Important: You will not be able to copy the secret key after leaving this page. If you lose it, you will have to create a new one.
console screenshot
  10. Choose Close.

Now that you created an IAM user, you can SSH into your Lightsail instance.

2. SSH into Amazon Lightsail instance

You can connect to your instance using the browser-based SSH client available in the Lightsail console, or by using your own SSH client with the SSH key of your instance.

Complete the following steps to connect to your instance using the browser-based SSH client in the Lightsail console:

  1. Open the Lightsail console.
  2. Click the terminal icon next to the instance, as shown in the following screenshot.
amazon lightsail console

3. Installing the CloudWatch agent

Now that you have SSH’d into your instance, you are ready to install the CloudWatch agent. The CloudWatch agent is available as a package on Amazon Linux 2 instances. For other operating systems, see Download and configure the CloudWatch agent using the command line.

Enter the following command to install the CloudWatch agent on a Linux instance.

> sudo yum -y install amazon-cloudwatch-agent

========================================================================
Install 1 Package
…
Installed:
amazon-cloudwatch-agent.x86_64 0:1.247347.4-1.amzn2  

Complete!

4. Setup credentials

Now that you have installed the CloudWatch agent, you must allow it to access your AWS resources. First, set up the necessary credentials.

Enter the following command to create a credentials profile in the AWS Command Line Interface (AWS CLI).

> sudo aws configure --profile AmazonCloudWatchAgent

Follow the prompts to enter the access key ID and secret access key you copied earlier in this tutorial

AWS Access Key ID [None]: <Enter the access key from step 1>
AWS Secret Access Key [None]: <Enter the secret key from step 1>
Default region name [None]:
Default output format [None]:

5. Create CloudWatch configuration file to collect memory usage metrics

To tell the CloudWatch agent to collect memory usage metrics, you must create a CloudWatch agent configuration file.

Enter the following command to create a config file for the CloudWatch agent.

> sudo vim /opt/aws/amazon-cloudwatch-agent/bin/config.json

Press “I” to enter insert mode in Vim, and paste the following text into the file.

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "root"
    },
    "metrics": {
        "append_dimensions": {
            "ImageId": "${aws:ImageId}",
            "InstanceId": "${aws:InstanceId}",
            "InstanceType": "${aws:InstanceType}"
        },
        "metrics_collected": {
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ],
                "metrics_collection_interval": 60
            }
        }
    }
}

Press “ESC”, and then type “:wq!” to save the file and exit Vim.

6. Configure CloudWatch agent

In this section, you configure the CloudWatch agent to use the shared credential profile created earlier.

Enter the following command to create a common configuration file for the CloudWatch agent.

> sudo vim /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml

Press “I” to enter insert mode in Vim, and paste the following text into the file.

[credentials]
shared_credential_profile = "AmazonCloudWatchAgent"

Press “ESC”, and then type “:wq!” to save the file and exit Vim.

7. Start CloudWatch agent

Now the necessary configuration for the CloudWatch agent is set up. Let’s start the agent.

Enter the following command to start the CloudWatch agent.

> sudo amazon-cloudwatch-agent-ctl -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -a fetch-config -s 

****** processing cwagent-otel-collector ******
cwagent-otel-collector will not be started as it has not been configured yet.

****** processing amazon-cloudwatch-agent ******
…
Redirecting to /bin/systemctl restart amazon-cloudwatch-agent.service

Enter the following command to verify that the CloudWatch agent is running.

> sudo amazon-cloudwatch-agent-ctl -a status
{
  "status": "running",
  "starttime": "2021-04-16T10:34:27+0000",
  "configstatus": "configured",
  "cwoc_status": "stopped",
  "cwoc_starttime": "",
  "cwoc_configstatus": "not configured",
  "version": "1.247347.4"
}

8. Verify metrics in CloudWatch

At this point, you should be able to view your metrics in CloudWatch.

  1. Navigate to the CloudWatch console.
  2. On the left navigation panel, choose Metrics.
  3. Under “Custom Namespaces”, you should see a link for “CWAgent”.
  4. Choose CWAgent.
  5. Choose ImageId, InstanceId, InstanceType.
  6. Select the check box to display the metrics on the graph.

cloudwatch metrics

In addition, you can create a CloudWatch alarm to monitor the memory usage metrics to automatically send you a notification when the metric reaches a threshold you specify. To create an alarm in CloudWatch, you can follow this guide.
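
As a rough sketch of such an alarm (an illustration, not part of the original walkthrough), the following boto3 call raises an alarm when memory usage stays above 80 percent. The instance ID, AMI ID, instance type, threshold, and SNS topic ARN are placeholders, and the Dimensions list must exactly match the dimension set you see under the CWAgent namespace.

import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="lightsail-memory-high",
    Namespace="CWAgent",
    MetricName="mem_used_percent",
    # These dimensions must match the metric exactly as published by the agent.
    Dimensions=[
        {"Name": "ImageId", "Value": "ami-0123456789abcdef0"},
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
        {"Name": "InstanceType", "Value": "t2.micro"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:memory-alerts"],
)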

Conclusion

In this blog, I covered how you can install the CloudWatch agent on your Amazon Lightsail instance to send memory metrics to Amazon CloudWatch. For more information on additional metrics and logs supported by the CloudWatch agent, see the CloudWatch User Guide.

To get started with Amazon Lightsail, check out our getting started page for more tutorials and resources.

 

Choosing a CI/CD approach: AWS Services with BigHat Biosciences

Post Syndicated from Mike Apted original https://aws.amazon.com/blogs/devops/choosing-ci-cd-aws-services-bighat-biosciences/

Founded in 2019, BigHat Biosciences’ mission is to improve human health by reimagining antibody discovery and engineering to create better antibodies faster. Their integrated computational + experimental approach speeds up antibody design and discovery by combining high-speed molecular characterization with machine learning technologies to guide the search for better antibodies. They apply these design capabilities to develop new generations of safer and more effective treatments for patients suffering from today’s most challenging diseases. Their platform, from wet lab robots to cloud-based data and logistics plane, is woven together with rapidly changing BigHat-proprietary software. BigHat uses continuous integration and continuous deployment (CI/CD) throughout their data engineering workflows and when training and evaluating their machine learning (ML) models.

 

BigHat Biosciences Logo

 

In a previous post, we discussed the key considerations when choosing a CI/CD approach. In this post, we explore BigHat’s decisions and motivations in adopting managed AWS CI/CD services. You may find that your organization has commonalities with BigHat and some of their insights may apply to you. Throughout the post, considerations are informed and choices are guided by the best practices in the AWS Well-Architected Framework.

How did BigHat decide what they needed?

Making decisions on appropriate (CI/CD) solutions requires understanding the characteristics of your organization, the environment you operate in, and your current priorities and goals.

“As a company designing therapeutics for patients rather than software, the role of technology at BigHat is to enable a radically better approach to develop therapeutic molecules,” says Eddie Abrams, VP of Engineering at BigHat. “We need to automate as much as possible. We need the speed, agility, reliability and reproducibility of fully automated infrastructure to enable our company to solve complex problems with maximum scientific rigor while integrating best in class data analysis. Our engineering-first approach supports that.”

BigHat possesses a unique insight to an unsolved problem. As an early stage startup, their core focus is optimizing the fully integrated platform that they built from the ground-up to guide the design for better molecules. They respond to feedback from partners and learn from their own internal experimentation. With each iteration, the quality of what they’re creating improves, and they gain greater insight and improved models to support the next iteration. More than anything, they need to be able to iterate rapidly. They don’t need any additional complexity that would distract from their mission. They need uncomplicated and enabling solutions.

They also have to take into consideration the regulatory requirements that apply to them as a company, the data they work with and its security requirements, and the market segment they compete in. Although they don’t control these factors, they can control how they respond to them, and they want to be able to respond quickly. It’s not only speed that matters in designing for security and compliance, but also visibility and traceability. These often overlooked and critical considerations are instrumental in choosing a CI/CD strategy and platform.

“The ability to learn faster than your competitors may be the only sustainable competitive advantage,” says Cindy Alvarez in her book Lean Customer Development.

The tighter the feedback loop, the easier it is to make a change. Rapid iteration allows BigHat to easily build upon what works, and make adjustments as they identify avenues that won’t lead to success.

Feature set

CI/CD is applicable to more than just the traditional use case. It doesn’t have to be software delivered in a classic fashion. In the case of BigHat, they apply CI/CD in their data engineering workflows and in training their ML models. BigHat uses automated solutions in all aspects of their workflow. Automation further supports taking what they have created internally and enabling advances in antibody design and development for safer, more effective treatments of conditions.

“We see a broadening of the notion of what can come under CI/CD,” says Abrams. “We use automated solutions wherever possible including robotics to perform scaled assays. The goal in tightening the loop is to improve precision and speed, and reduce latency and lag time.”

BigHat reached the conclusion that they would adopt managed service offerings wherever possible, including in their CI/CD tooling and other automation initiatives.

“The phrase ‘undifferentiated heavy lifting’ has always resonated,” says Abrams. “Building, scaling, and operating core software and infrastructure are hard problems, but solving them isn’t itself a differentiating advantage for a therapeutics company. But whether we can automate that infrastructure, and how we can use that infrastructure at scale on a rock solid control plane to provide our custom solutions iteratively, reliably and efficiently absolutely does give us an edge. We need an end-to-end, complete infrastructure solution that doesn’t force us to integrate a patchwork of solutions ourselves. AWS provides exactly what we need in this regard.”

Reducing risk

Startups can be full of risk, with the upside being potential future reward. They face risk in finding the right problem, in finding a solution to that problem, and in finding a viable customer base to buy that solution.

A key priority for early stage startups is removing risk from as many areas of the business as possible. Any steps an early stage startup can take to remove risk without commensurately limiting reward makes them more viable. The more risk a startup can drive out of their hypothesis the more likely their success, in part because they’re more attractive to customers, employees, and investors alike. The more likely their product solves their problem, the more willing a customer is to give it a chance. Likewise, the more attractive they are to investors when compared to alternative startups with greater risk in reaching their next major milestone.

Adoption of managed services for CI/CD accomplishes this goal in several ways. The most important advantage remains speed. The core functionality required can be stood up very quickly, as it’s an existing service. Customers have a large body of reference examples and documentation available to demonstrate how to use that service. They also insulate teams from the need to configure and then operate the underlying infrastructure. The team remains focused on their differentiation and their core value proposition.

“We are automated right up to the organizational level and because of this, running those services ourselves represents operational risk,” says Abrams. “The largest day-to-day infrastructure risk to us is having the business stalled while something is not working. Do I want to operate these services, and focus my staff on that? There is no guarantee I can just throw more compute at a self-managed software service I’m running and make it scale effectively. There is no guarantee that if one datacenter is having a network or electrical problem that I can simply switch to another datacenter. I prefer AWS manages those scale and uptime problems.”

Embracing an opinionated model

BigHat is a startup with a singular focus on using ML to reduce the time and difficulty of designing antibodies and other therapeutic proteins. By adopting managed services, they have removed the burden of implementing and maintaining CI/CD systems.

Accepting the opinionated guardrails of the managed service approach allows, and to a degree reinforces, the focus on what makes a startup unique. Rather than being focused on performance tuning, making decisions on what OS version to use, or which of the myriad optional puzzle pieces to put together, they can use a well-integrated set of tools built to work with each other in a defined fashion.

The opinionated model means best practices are baked into the toolchain. Instead of hiring for specialized administration skills they’re hiring for specialized biotech skills.

“The only degrees of freedom I care about are the ones that improve our technologies and reduce the time, cost, and risk of bringing a therapeutic to market,” says Abrams. “We focus on exactly where we can gain operational advantages by simply adopting managed services that already embrace the Well-Architected Framework. If we had to tackle all of these engineering needs with limited resources, we would be spending into a solved problem. Before AWS, startups just didn’t do these sorts of things very well. Offloading this effort to a trusted partner is pretty liberating.”

Beyond the reduction in operational concerns, BigHat can also expect continuous improvement of that service over time to be delivered automatically by the provider. For their use case they will likely derive more benefit for less cost over time without any investment required.

Overview of solution

BigHat uses the following key services:

 

BigHat Reference Architecture

Security

Managed services are supported, owned, and operated by the provider. This allows BigHat to leave concerns like patching and security of the underlying infrastructure and services to the provider. BigHat continues to maintain ownership in the shared responsibility model, but their scope of concern is significantly narrowed. The surface area they’re responsible for is reduced, helping to minimize risk. Choosing a partner with best-in-class observability, tracking, compliance, and auditing tools is critical to any company that manages sensitive data.

Cost advantages

A startup must also make strategic decisions about where to deploy the capital they have raised from their investors. The vendor managed services bring a model focused on consumption, and allow the startup to make decisions about where they want to spend. This is often referred to as an operational expense (OpEx) model, in other words “pay as you go”, like a utility. This is in contrast to a large upfront investment in both time and capital to build these tools. The lack of need for extensive engineering efforts to stand up these tools, and continued investment to evolve them, acts as a form of capital expenditure (CapEx) avoidance. Startups can allocate their capital where it matters most for them.

“This is corporate-level changing stuff,” says Abrams. “We engage in a weekly leadership review of cost budgets. Operationally I can set the spending knob where I want it monthly, weekly or even daily, and avoid the risks involved in traditional capacity provisioning.”

The right tool for the right time

A key consideration for BigHat was the ability to extend the provider managed tools, where needed, to incorporate extended functionality from the ecosystem. This allows for additional functionality that isn’t covered by the core managed services, while maintaining a focus on their product development versus operating these tools.

Startups must also ask themselves what they need now, versus what they need in the future. As their needs change and grow, they can augment, extend, and replace the tools they have chosen to meet the new requirements. Starting with a vendor-managed service is not a one-way door; it’s an opportunity to defer investment in building and operating these capabilities yourself until that investment is justified. The time to value in using managed services initially doesn’t leave a startup with a sunk cost that limits future options.

“You have to think about the degree you want to adopt a hybrid model for the services you run. Today we aren’t running any software or services that require us to run our own compute instances. It’s very rare we run into something that is hard to do using just the services AWS already provides. Where our needs are not met, we can communicate them to AWS and we can choose to wait for them on their roadmap, which we have done in several cases, or we can elect to do it ourselves,” says Abrams. “This freedom to tweak and expand our service model at will is incomparably liberating.”

Conclusion

BigHat Biosciences was able to make an informed decision by considering the priorities of the business at this stage of its lifecycle. They adopted and embraced opinionated and service provider-managed tooling, which allowed them to inherit a largely best practice set of technology and practices, de-risk their operations, and focus on product velocity and customer feedback. This maintains future flexibility, which delivers significantly more value to the business in its current stage.

“We believe that the underlying engineering, the underlying automation story, is an advantage that applies to every aspect of what we do for our customers,” says Abrams. “By taking those advantages into every aspect of the business, we deliver on operations in a way that provides a competitive advantage a lot of other companies miss by not thinking about it this way.”

About the authors

Mike is a Principal Solutions Architect with the Startup Team at Amazon Web Services. He is a former founder, current mentor, and enjoys helping startups live their best cloud life.

 

 

 

Sean is a Senior Startup Solutions Architect at AWS. Before AWS, he was Director of Scientific Computing at the Howard Hughes Medical Institute.

File Access Auditing Is Now Available for Amazon FSx for Windows File Server

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/file-access-auditing-is-now-available-for-amazon-fsx-for-windows-file-server/

Amazon FSx for Windows File Server provides fully managed file storage that is accessible over the industry-standard Server Message Block (SMB) protocol. It is built on Windows Server and offers a rich set of enterprise storage capabilities with the scalability, reliability, and low cost that you have come to expect from AWS.

In addition to key features such as user quotas, end-user file restore, and Microsoft Active Directory integration, the team has now added support for the auditing of end-user access on files, folders, and file shares using Windows event logs.

Introducing File Access Auditing
File access auditing allows you to send logs to a rich set of other AWS services so that you can query, process, and store your logs. By using file access auditing, enterprise storage administrators and compliance auditors can meet security and compliance requirements while eliminating the need to manage storage as logs grow over time. File access auditing will be particularly important to regulated customers such as those in the financial services and healthcare industries.

You can choose a destination for publishing audit events in the Windows event log format. The destination options are logging to Amazon CloudWatch Logs or streaming to Amazon Kinesis Data Firehose. From there, you can view and query logs in CloudWatch Logs, archive logs to Amazon Simple Storage Service (Amazon S3), or use AWS Partner solutions, such as Splunk and Datadog, to monitor your logs.

You can also set up Lambda functions that are triggered by new audit events. For example, you can configure AWS Lambda and Amazon CloudWatch alarms to send a notification to data security personnel when unauthorized access occurs.
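
As one hedged illustration of that wiring, assuming the audit events are published to a CloudWatch Logs log group, a subscription filter can forward matching events to a Lambda function. The log group name, filter pattern, and function ARN below are placeholders, and the function also needs a resource-based policy that allows CloudWatch Logs to invoke it.

import boto3

logs = boto3.client("logs")
logs.put_subscription_filter(
    logGroupName="/fsx/windows-audit-logs",   # placeholder audit log group
    filterName="unauthorized-access",
    filterPattern="Failure",                  # placeholder pattern for failed access events
    destinationArn="arn:aws:lambda:us-east-1:111122223333:function:notify-security-team",
)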

Using File Access Auditing on a New File System
To enable file access auditing on a new file system, I head over to the Amazon FSx console and choose Create file system. On the Select file system type page, I choose Amazon FSx for Windows File Server, and then configure other settings for the file system. To use the auditing feature, Throughput capacity must be at least 32 MB/s, as shown here:

Screenshot of creating a file system

In Auditing, I see that File access auditing is turned on by default. In Advanced, for Choose an event log destination, I can change the destination for publishing user access events. I choose CloudWatch Logs and then choose a CloudWatch Logs log group in my account.

Screenshot of the Auditing options

After my file system has been created, I launch a new Amazon Elastic Compute Cloud (Amazon EC2) instance and join it to my Active Directory. When the instance is available, I connect to it using a remote desktop client. I open File Explorer and follow the documentation to map my new file system.

Screenshot of the file system once mapped

I open the file system in Windows Explorer and then right-click and select Properties. I choose Security, Advanced, and Auditing and then choose Add to add a new auditing entry. On the page for the auditing entry, in Principal, I click Select a principal. This is who I will be auditing. I choose Everyone. Next, for Type, I select the type of auditing I want (Success/Fail/All). Under Basic permissions, I select Full control for the permissions I want to audit for.

Screenshot of auditing options on a file share

Now that auditing is set up, I create some folders and create and modify some files. All this activity is now being audited, and the logs are being sent to CloudWatch Logs.

Screenshot of a file share, where some files and folders have been created

In the CloudWatch Logs Insights console, I can start to query the audit logs. Below you can see how I ran a simple query that finds all the logs associated with a specific file.

Screenshot of AWS CloudWatch Logs Insights
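
For readers who prefer the API, the following boto3 sketch runs a similar lookup with CloudWatch Logs Insights. The log group name and the file name in the filter are placeholders for your own audit log destination and file.

import time
import boto3

logs = boto3.client("logs")
query = logs.start_query(
    logGroupName="/fsx/windows-audit-logs",    # placeholder log group
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        "| filter @message like /report.docx/ "   # placeholder file name
        "| sort @timestamp desc"
    ),
)

# Poll until the query finishes, then print the matching log events.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
print(results["results"])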

Continued Momentum
File access auditing is one of many features the team has launched in recent years, including: Self-Managed Directories, Native Multi-AZ File Systems, Support for SQL Server, Fine-Grained File Restoration, On-Premises Access, a Remote Management CLI, Data Deduplication, Programmatic File Share Configuration, Enforcement of In-Transit Encryption, Storage Size and Throughput Capacity Scaling, and Storage Quotas.

Pricing
File access auditing is free on Amazon FSx for Windows File Server. Standard pricing applies for the use of Amazon CloudWatch Logs, Amazon Kinesis Data Firehose, any downstream AWS services such as Amazon Redshift, S3, or AWS Lambda, and any AWS Partner solutions like Splunk and Datadog.

Available Today
File access auditing is available today for all new file systems in all AWS Regions where Amazon FSx for Windows File Server is available. Check our documentation for more details.

— Martin

CDK Corner – April 2021

Post Syndicated from Christian Weber original https://aws.amazon.com/blogs/devops/cdk-corner-april-2021/

Social – Community Engagement

We’re getting closer and closer to CDK Day, with the event receiving 75 CFP submissions. The cdkday schedule is now available to plan out your conference day.

Updates to the CDK

Constructs promoted to General Availability

Promoting a module to stable/General Availability is always a cause for celebration. Great job to all the folks involved who helped move aws-acmpca from Experimental to Stable. PR#13778 gives a peek into the work involved. If you’re interested in helping promote a module to G.A., or would like to learn more about the process, read the AWS Construct Library Module Lifecycle document. A big thanks to the CDK Community and team for their work!

Dead Letter Queues

Dead Letter Queues (“DLQs”) are a service implementation pattern that can queue messages when a service cannot process them. For example, if an email message can’t be delivered to a client, an email server could implement a DLQ holding onto that undeliverable message until the client can process the message. DLQs are supported by many AWS services, and the community and CDK team have been working to support DLQs with CDK in various modules: aws-codebuild in PR#11228, aws-stepfunctions in PR#13450, and aws-lambda-targets in PR#11617.

Amazon API Gateway

Amazon API Gateway is a fully managed service to deploy APIs at scale. Here are the modules that have received updates to their support for API Gateway:

  • stepfunctions-tasks now supports API Gateway with PR#13033.

  • You can now specify regions when integrating Amazon API Gateway with other AWS services in PR#13251.

  • Support for WebSocket APIs in PR#13031 is now available in aws-apigatewayv2 as a Level 2 construct. To differentiate configuration between HTTP and WebSocket APIs, several of the HTTP API properties were renamed. More information about these changes can be found in the conversation section of PR#13031.

  • You can now set default authorizers in PR#13172. This lets you use API Gateway HTTP, REST, or WebSocket APIs with an authorizer and authorization scopes that cover all routes for a given API resource.

Notable new L2 constructs

AWS Global Accelerator is a networking service that lets users of your infrastructure hosted on AWS use the AWS global network infrastructure for traffic routing, improving speed and performance. Amazon Route 53 supports Global Accelerator and, thanks to PR#13407, you can now take advantage of this functionality in the aws-route-53-targets module as an L2 construct.

Amazon CloudWatch is an important part of monitoring AWS workloads. With PR#13281, the aws-cloudwatch-actions module now includes an Ec2Action construct, letting you programmatically set up observability of EC2-based workloads with CDK.

The aws-cognito module now supports Apple ID User Pools in PR#13160, allowing developers to define workloads that use Apple IDs for identity management.

aws-iam received a new L2 construct with PR#13393, bringing SAML implementation support to CDK. SAML has become a preferred framework when implementing Single Sign On, and has been supported with IAM for some time. Now, set it up with even more efficiency with the SamlProvider construct.

Amazon Neptune is a managed graph database service available as a construct in the aws-neptune module. PR#12763 adds L2 constructs to support Database Clusters and Database Instances.

Level ups to existing CDK constructs

Service discovery in AWS is provided by AWS Cloud Map. With PR#13192, users of aws-ecs can now register an ECS service with Cloud Map.

aws-lambda has received two notable additions related to Docker: PR#13318, and PR#12258 add functionality to package Lambda function code with the output of a Docker build, or from a Docker build asset, respectively.

The aws-ecr module now supports Tag Mutability. Tags can denote a specific release for a piece of software. Setting the enum in the construct to IMMUTABLE will prevent tags from being overwritten by a later image, if that image uses a tag already present in the container repository.

Last year, AWS announced support for deployment circuit breakers in Amazon Elastic Container Service, enabling customers to perform auto-rollbacks on unhealthy service deployments without manual intervention. PR#12719 includes this functionality as part of the aws-ecs-patterns module, via the DeploymentCircuitBreaker interface. This interface is now available and can be used in constructs such as ApplicationLoadBalancedFargateService.

The aws-ec2 module received some nice quality of life upgrades to it: Support for multi-part user-data in PR#11843, client vpn endpoints in PR#12234, and non-numeric security protocols for security groups in PR#13593 all help improve the experience of using EC2 with CDK.

Learning – Finds from across the internet

On the AWS DevOps Blog, Eric Beard and Rico Huijbers penned a post detailing Best Practices for Developing Cloud Applications with AWS CDK.

Users of AWS Elastic Beanstalk wanting to deploy with AWS CDK can read about deploying Elastic Beanstalk applications with the AWS CDK and the aws-elasticbeanstalk module.

Deploying infrastructure that is HIPAA and HiTrust compliant with AWS CDK can help customers move faster. This best practices guide for HIPAA and HiTrust environments goes into detail on deploying compliant architecture with the AWS CDK.

Community Acknowledgements

And finally, congratulations and rounds of applause for these folks who had their first Pull Request merged to the CDK Repository!*

*These users’ Pull Requests were merged between 2021-03-01 and 2021-03-31.

Thanks for reading this update of the CDK Corner. See you next time!

How ERGO implemented an event-driven security remediation architecture on AWS

Post Syndicated from Adam Sikora original https://aws.amazon.com/blogs/architecture/how-ergo-implemented-an-event-driven-security-remediation-architecture-on-aws/

ERGO is one of the major insurance groups in Germany and Europe. Within the ERGO Group, ERGO Technology & Services S.A. (ET&S), a part of ET&SM holding, has competencies in digital transformation, know-how in creating and implementing complex IT systems with focus on the quality of solutions and a portfolio aligned with the entire value chain of the insurance market.

Business Challenge and Solution

ERGO has a multi-account AWS environment where each project team subscribes to a set of AWS accounts that conforms to workload requirements and security best practices. As ERGO began its cloud journey, CIS Foundations Benchmark Standard was used as the key indicator for measuring compliance. The report showed significant room for security posture improvements. ERGO was looking for a solution that could enable the management of security events at scale. At the same time, they needed to centralize the event response and remediation in near-real time. The goal was to improve the CIS compliance metric and overall security posture.

Architecture

ERGO uses AWS Organizations to centrally govern the multi-account AWS environment. Integration of AWS Security Hub with AWS Organizations enables ERGO to designate ERGO’s Security Account as the Security Hub administrator/primary account. Other organization accounts are automatically registered as Security Hub member accounts to send events to the Security Account.

An important aspect of the workflow is to maintain segregation of duties and separation of environments. ERGO uses two separate AWS accounts to implement automatic finding remediation:

  • Security Account – this is the primary account with Security Hub where security alerts (findings) from all the AWS accounts of the project are gathered.
  • Service Account – this is the account that can take action on target project (member) AWS accounts. ERGO uses AWS Lambda functions to run remediation actions through AWS Identity and Access Management (IAM) permissions, VPC resources actions, and more.

Within the Security Account, AWS Security Hub serves as the event aggregation solution that gathers multi-account findings from AWS services such as Amazon GuardDuty. ERGO was able to centralize the security findings. But they still needed to develop a solution that routed the filtered, actionable events to the Service Account. The solution had to automate the response to these events based on ERGO’s security policy. ERGO built this solution with the help of Amazon CloudWatch, AWS Step Functions, and AWS Lambda.

ERGO used the integration of AWS Security Hub with Amazon CloudWatch to send all the security events to CloudWatch. The filtering logic of events was managed at two levels. At the first level, ERGO used CloudWatch Events rules that match event patterns to refine the types of events ERGO wanted to focus on.
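
A first-level rule along these lines might use an event pattern similar to the following boto3 sketch. The rule name, pattern details, and the Step Functions target are illustrative rather than ERGO’s actual configuration.

import json
import boto3

events = boto3.client("events")

# Match Security Hub findings imported into the Security Account.
events.put_rule(
    Name="securityhub-findings-to-stepfunctions",
    EventPattern=json.dumps({
        "source": ["aws.securityhub"],
        "detail-type": ["Security Hub Findings - Imported"],
    }),
    State="ENABLED",
)

# Route matching events to the Step Functions state machine.
events.put_targets(
    Rule="securityhub-findings-to-stepfunctions",
    Targets=[{
        "Id": "remediation-workflow",
        "Arn": "arn:aws:states:eu-central-1:111122223333:stateMachine:SecurityRemediation",
        "RoleArn": "arn:aws:iam::111122223333:role/EventsInvokeStepFunctions",
    }],
)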

The second level of filtering logic was more nuanced and related to the remediation action ERGO wanted to take on a detected event. ERGO chose AWS Step Functions to build a workflow that enabled them to further filter the events, in addition to matching them to the suitable remediation action.

Choosing AWS Step Functions enabled ERGO to orchestrate multiple steps. They could also respond to errors in the overall workflow. For example, one of the issues that ERGO encountered was the sporadic failure of the Archival Lambda function. This was due to the Security Hub API Rate Throttling.

ERGO evaluated several workarounds to deal with this situation. They considered using the automatic retries capability of the AWS SDK to make the API call in the Archival function. However, the built-in mechanism was not sufficient in this case. Another option for dealing with the rate limit was to throttle the Archival Lambda function by applying a low reserved concurrency. Another possibility was to batch the events to be SUPPRESSED and process them one batch at a time. The benefit would have been making a single API call that covered several findings at a time.

After much consideration, ERGO decided to use the “retry on error” mechanism of the Step Function to circumvent this problem. This allowed ERGO to manage the error handling directly in the workflow logic. It wasn’t necessary to change the remediation and archival logic of the Lambda functions. This was a huge advantage. Writing and maintaining error handling logic in each one of the Lambda functions would have been time-intensive and complicated.
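
The retry policy ERGO describes lives in the task state definition. The following Python dict is an illustrative sketch of such a Retry field in Amazon States Language; the error names and timings are assumptions, not ERGO’s values.

# Illustrative "Retry" field for the archival task state.
archival_task_retry = {
    "Retry": [
        {
            "ErrorEquals": ["Lambda.TooManyRequestsException", "ThrottlingException"],
            "IntervalSeconds": 5,
            "MaxAttempts": 5,
            "BackoffRate": 2.0,
        }
    ]
}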

Additionally, the remediation actions had to be configured and run from the Service Account. That means the Step Function in the Security Account had to trigger a cross-account resource. ERGO had to find a way to integrate the Remediation Lambda in the Service Account with the state machine of the Security Account. ERGO achieved this integration using a Proxy Lambda in the Security Account.

The Proxy Lambda resides in the Security Account and is initiated by the Step Function. It takes as its arguments the function name and function version, which it uses to start the Remediation function in the Service Account.

The Remediation functions in the Service Account have permission to take action on Project accounts. As the next step, the Remediation function is invoked on the impacted accounts. This is filtered by the Step Function, which passes the Account ID to Proxy Lambda, which in turn passes this argument to Remediation Lambda. The Remediation function runs the actions on the Project accounts and returns the output to the Proxy Lambda. This is then passed back to the Step Function.

The role that the Lambda function assumes using the AssumeRole mechanism is an organization-level role. It is deployed in every account and has the permissions required to perform the remediation.
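
A heavily simplified sketch of how such a Proxy Lambda could be written is shown below. The role ARN, payload shape, and function naming are assumptions made for illustration; ERGO’s actual implementation is not published here.

import json
import boto3

def handler(event, context):
    # Parameters passed in by the Step Function, as described above.
    function_name = event["remediation_function"]
    function_version = event.get("remediation_version", "$LATEST")
    target_account_id = event["account_id"]

    # Assume a role in the Service Account (placeholder role ARN).
    credentials = boto3.client("sts").assume_role(
        RoleArn="arn:aws:iam::222233334444:role/RemediationInvokeRole",
        RoleSessionName="proxy-remediation",
    )["Credentials"]

    lambda_client = boto3.client(
        "lambda",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )

    # Invoke the Remediation function and return its result to the Step Function.
    response = lambda_client.invoke(
        FunctionName=function_name,
        Qualifier=function_version,
        Payload=json.dumps({
            "account_id": target_account_id,
            "finding": event.get("finding"),
        }),
    )
    return json.loads(response["Payload"].read())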

ERGO Architecture

Figure 1. Technical Solution implementation

  1. Security Hub service in ERGO Project accounts sends security findings to Administrative Account.
  2. Findings are aggregated and sent to CloudWatch Events for filtering.
  3. CloudWatch rules invoke Step Functions as the target. Step Functions process security events based on the event type and treatment required as per CIS Standards.
  4. For events that need to be suppressed without any dependency on the Project Accounts, the Step Function invokes a Lambda function to archive the findings.
  5. For events that need to be executed on the Project accounts, a Step Function invokes a Proxy Lambda with required parameters.
  6. Proxy Lambda in turn, invokes a cross-account Remediation function in Service Account. This has the permissions to run actions in Project accounts.
  7. Based on the event type, corresponding remediation action is run on the impacted Project Account.
  8. Remediation function passes the execution result back to Proxy Lambda to complete the Security event workflow.

Failed remediations are manually resolved in exceptional conditions.

Summary

By implementing this event-driven solution, ERGO was able to increase and maintain automated compliance with the CIS AWS Foundations Benchmark standard to about 95%. The remaining findings were evaluated on a case-by-case basis, per specific project requirements. This measurable improvement in ERGO’s compliance posture was achieved with an end-to-end serverless workflow. This offloaded any ongoing platform maintenance efforts from the ERGO cloud security team. Working closely with our AWS account and service teams, ERGO will continue to evaluate and make improvements to our architecture.

How to monitor Windows and Linux servers and get internal performance metrics

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/how-to-monitor-windows-and-linux-servers-and-get-internal-performance-metrics/

This post was written by Dean Suzuki, Senior Solutions Architect.

Customers who run Windows or Linux instances on AWS frequently ask, “How do I know if my disks are almost full?” or “How do I know if my application is using all the available memory and is paging to disk?” This blog helps answer these questions by walking you through how to set up monitoring to capture these internal performance metrics.

Solution overview

If you open the Amazon EC2 console, select a running Amazon EC2 instance, and select the Monitoring tab, you can see Amazon CloudWatch metrics for that instance. Amazon CloudWatch is an AWS monitoring service. The Monitoring tab (shown in the following image) shows the metrics that can be measured external to the instance (for example, CPU utilization, network bytes in/out). However, to understand what percentage of the disk is being used or what percentage of the memory is being used, these metrics require an internal operating system view of the instance. AWS places an extra safeguard on gathering data inside a customer’s instance, so this capability is not enabled by default.

EC2 console showing Monitoring tab

To capture the server’s internal performance metrics, a CloudWatch agent must be installed on the instance. For Windows, the CloudWatch agent can capture any of the Windows performance monitor counters. For Linux, the CloudWatch agent can capture system-level metrics. For more details, please see Metrics Collected by the CloudWatch Agent. The agent can also capture logs from the server. The agent then sends this information to Amazon CloudWatch, where rules can be created to alert on certain conditions (for example, low free disk space) and automated responses can be set up (for example, perform backup to clear transaction logs). Also, dashboards can be created to view the health of your Windows servers.

There are four steps to implement internal monitoring:

  1. Install the CloudWatch agent onto your servers. AWS provides a service called AWS Systems Manager Run Command, which enables you to do this agent installation across all your servers.
  2. Run the CloudWatch agent configuration wizard, which captures what you want to monitor. These items could be performance counters and logs on the server. This configuration is then stored in AWS Systems Manager Parameter Store.
  3. Configure CloudWatch agents to use agent configuration stored in Parameter Store using the Run Command.
  4. Validate that the CloudWatch agents are sending their monitoring data to CloudWatch.

The following image shows the flow of these four steps.

Process to install and configure the CloudWatch agent

In this blog, I walk through these steps so that you can follow along. Note that you are responsible for the cost of running the environment outlined in this blog. So, once you are finished with the steps in the blog, I recommend deleting the resources if you no longer need them. For the cost of running these servers, see Amazon EC2 On-Demand Pricing. For CloudWatch pricing, see Amazon CloudWatch pricing.

If you want a video overview of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.

Deploy the CloudWatch agent

The first step is to deploy the Amazon CloudWatch agent. There are multiple ways to deploy the CloudWatch agent (see this documentation on Installing the CloudWatch Agent). In this blog, I walk through how to use the AWS Systems Manager Run Command to deploy the agent. AWS Systems Manager uses the Systems Manager agent, which comes preinstalled on many AWS-provided AMIs, such as Amazon Linux 2 and Windows Server. This AWS Systems Manager agent must be given the appropriate permissions to connect to AWS Systems Manager, and to write the configuration data to the AWS Systems Manager Parameter Store. These access rights are controlled through the use of IAM roles.

Create two IAM roles

IAM roles are identity objects to which you attach IAM policies. IAM policies define what access is allowed to AWS services. You can have users, services, or applications assume the IAM roles and get the assigned rights defined in the permissions policies.

To use System Manager, you typically create two IAM roles. The first role has permissions to write the CloudWatch agent configuration information to System Manager Parameter Store. This role is called CloudWatchAgentAdminRole.

The second role only has permissions to read the CloudWatch agent configuration from the System Manager Parameter Store. This role is called CloudWatchAgentServerRole.

For more details on creating these roles, please see the documentation on Create IAM Roles and Users for Use with the CloudWatch Agent.

Attach the IAM roles to the EC2 instances

Once you create the roles, you attach them to your Amazon EC2 instances. By attaching the IAM roles to the EC2 instances, you provide the processes running on the EC2 instance the permissions defined in the IAM role. In this blog, you create two Amazon EC2 instances. Attach the CloudWatchAgentAdminRole to the first instance that is used to create the CloudWatch agent configuration. Attach CloudWatchAgentServerRole to the second instance and any other instances that you want to monitor. For details on how to attach or assign roles to EC2 instances, please see the documentation on How do I assign an existing IAM role to an EC2 instance?.

Install the CloudWatch agent

Now that you have setup the permissions, you can install the CloudWatch agent onto the servers that you want to monitor. For details on installing the CloudWatch agent using Systems Manager, please see the documentation on Download and Configure the CloudWatch Agent.

Create the CloudWatch agent configuration

Now that you installed the CloudWatch agent on your server, run the CloudWatch agent configuration wizard to create the agent configuration. For instructions on how to run the CloudWatch agent configuration wizard, please see this documentation on Create the CloudWatch Agent Configuration File with the Wizard. To establish a command shell on the server, you can use AWS Systems Manager Session Manager to establish a session to the server and then run the CloudWatch agent configuration wizard. If you want to monitor both Linux and Windows servers, you must run the CloudWatch agent configuration on a Linux instance and on a Windows instance to create a configuration file per OS type. The configuration is unique to the OS type.

To run the Agent configuration wizard on Linux instances, run the following command:

sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

To run the Agent configuration wizard on Windows instances, run the following commands:

cd "C:\Program Files\Amazon\AmazonCloudWatchAgent"

amazon-cloudwatch-agent-config-wizard.exe

Note for Linux instances: do not select to collect the collectd metrics in the agent configuration wizard unless you have collectd installed on your Linux servers. Otherwise, you may encounter an error.

Review the Agent configuration

The CloudWatch agent configuration generated from the wizard is stored in Systems Manager Parameter Store. You can review and modify this configuration if you need to capture extra metrics. To review the agent configuration, perform the following steps:

  1. Go to the console for the System Manager service.
  2. Click Parameter Store in the left-hand navigation.
  3. You should see the parameter that was created by the CloudWatch agent configuration program. For Linux servers, the configuration is stored in AmazonCloudWatch-linux, and for Windows servers, the configuration is stored in AmazonCloudWatch-windows.

System Manager Parameter Store: Parameters created by CloudWatch agent configuration wizard

  1. Click on the parameter’s hyperlink (for example, AmazonCloudWatch-linux) to see all the configuration parameters that you specified in the configuration program.

In the following steps, I walk through an example of modifying the Windows configuration parameter (AmazonCloudWatch-windows) to add an additional metric (“Available Mbytes”) to monitor.

  1. Click the AmazonCloudWatch-windows parameter.
  2. In the parameter overview, scroll down to the “metrics” section and under “metrics_collected”, you can see the Windows performance monitor counters that will be gathered by the CloudWatch agent. If you want to add an additional perfmon counter, then you can edit and add the counter here.
  3. Press Edit at the top right of the AmazonCloudWatch-windows Parameter Store page.
  4. Scroll down in the Value section and look for “Memory.”
  5. After the “% Committed Bytes In Use” entry, put a comma “,” and then press Enter to add a blank line. Then, on that line, enter “Available Mbytes”. The following screenshot demonstrates what this configuration should look like.

AmazonCloudWatch-windows parameter contents and how to add a new metric to monitor

  1. Press Save Changes.

To modify the Linux configuration parameter (AmazonCloudWatch-linux), you perform similar steps except you click on the AmazonCloudWatch-linux parameter. Here is additional documentation on creating the CloudWatch agent configuration and modifying the configuration file.
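
If you prefer to script this change instead of editing in the console, the following boto3 sketch reads the parameter, appends the counter, and writes the configuration back. The parameter name follows the walkthrough, but the exact key path under metrics_collected depends on what the wizard generated for your server, so treat it as an assumption to verify.

import json
import boto3

ssm = boto3.client("ssm")
parameter_name = "AmazonCloudWatch-windows"

# Read the agent configuration stored by the wizard.
config = json.loads(ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"])

# Add the "Available Mbytes" counter to the Memory section if it is missing.
memory = config["metrics"]["metrics_collected"]["Memory"]
if "Available Mbytes" not in memory["measurement"]:
    memory["measurement"].append("Available Mbytes")

# Write the updated configuration back to Parameter Store.
ssm.put_parameter(
    Name=parameter_name,
    Value=json.dumps(config, indent=4),
    Type="String",
    Overwrite=True,
)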

Start the CloudWatch agent and use the configuration

In this step, start the CloudWatch agent and instruct it to use your agent configuration stored in System Manager Parameter Store.

  1. Open another tab in your web browser and go to System Manager console.
  2. Specify Run Command in the left hand navigation of the System Manager console.
  3. Press Run Command
  4. In the search bar,
    • Select Document name prefix
    • Select Equal
    • Specify AmazonCloudWatch (Note the field is case sensitive)
    • Press enter

System Manager Run Command's command document entry field

  1. Select AmazonCloudWatch-ManageAgent. This is the command that configures the CloudWatch agent.
  2. In the command parameters section,
    • For Action, select Configure
    • For Mode, select ec2
    • For Optional Configuration Source, select ssm
    • For optional configuration location, specify the Parameter Store name: AmazonCloudWatch-windows for Windows instances or AmazonCloudWatch-linux for Linux instances. Note the field is case sensitive. This tells the command to read the configuration from the Parameter Store parameter specified here.
    • For optional restart, leave yes
  3. For Targets, choose your target servers that you wish to monitor.
  4. Scroll down and press Run. The Run Command may take a couple of minutes to complete. Press the refresh button. The Run Command configures the CloudWatch agent by reading the configuration from Parameter Store and configuring the agent with those settings.

For more details on installing the CloudWatch agent using your agent configuration, please see this Installing the CloudWatch Agent on EC2 Instances Using Your Agent Configuration.
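
The same Run Command step can also be scripted. The boto3 sketch below assumes the standard AmazonCloudWatch-ManageAgent document and its usual parameter names; the instance ID and the Parameter Store name are placeholders.

import boto3

ssm = boto3.client("ssm")
ssm.send_command(
    DocumentName="AmazonCloudWatch-ManageAgent",
    InstanceIds=["i-0123456789abcdef0"],          # placeholder target instance
    Parameters={
        "action": ["configure"],
        "mode": ["ec2"],
        "optionalConfigurationSource": ["ssm"],
        "optionalConfigurationLocation": ["AmazonCloudWatch-linux"],
        "optionalRestart": ["yes"],
    },
)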

Review the data collected by the CloudWatch agents

In this step, I walk through how to review the data collected by the CloudWatch agents.

  1. In the AWS Management console, go to CloudWatch.
  2. Click Metrics on the left-hand navigation.
  3. You should see a custom namespace for CWAgent. Click on the CWAgent namespace. Please note that this might take a couple of minutes to appear. Refresh the page periodically until it appears.
  4. Then click the ImageId, InstanceId hyperlinks to see the counters under that section.

CloudWatch Metrics: Showing counters under CWAgent

  1. Review the metrics captured by the CloudWatch agent. Notice the metrics that are only observable from inside the instance (for example, LogicalDisk % Free Space). These types of metrics would not be observable without installing the CloudWatch agent on the instance. From these metrics, you could create a CloudWatch Alarm to alert you if they go beyond a certain threshold. You can also add them to a CloudWatch Dashboard to review. To learn more about the metrics collected by the CloudWatch agent, see the documentation Metrics Collected by the CloudWatch Agent.

Conclusion

In this blog, you learned how to deploy and configure the CloudWatch agent to capture the metrics on either Linux or Windows instances. If you are done with this blog, we recommend deleting the Systems Manager Parameter Store entry, the CloudWatch data, and then the EC2 instances to avoid further charges. If you would like a video tutorial of this process, please see this Monitoring Amazon EC2 Windows Instances using Unified CloudWatch Agent video.

 

 

Using Route 53 Private Hosted Zones for Cross-account Multi-region Architectures

Post Syndicated from Anandprasanna Gaitonde original https://aws.amazon.com/blogs/architecture/using-route-53-private-hosted-zones-for-cross-account-multi-region-architectures/

This post was co-written by Anandprasanna Gaitonde, AWS Solutions Architect and John Bickle, Senior Technical Account Manager, AWS Enterprise Support

Introduction

Many AWS customers have internal business applications spread over multiple AWS accounts and on-premises to support different business units. In such environments, you may find a consistent view of DNS records and domain names between on-premises and different AWS accounts useful. Route 53 Private Hosted Zones (PHZs) and Resolver endpoints on AWS create an architecture best practice for centralized DNS in hybrid cloud environment. Your business units can use flexibility and autonomy to manage the hosted zones for their applications and support multi-region application environments for disaster recovery (DR) purposes.

This blog presents an architecture that provides a unified view of the DNS while allowing different AWS accounts to manage subdomains. It utilizes PHZs with overlapping namespaces and cross-account multi-region VPC association for PHZs to create an efficient, scalable, and highly available architecture for DNS.

Architecture Overview

You can set up a multi-account environment using services such as AWS Control Tower to host applications and workloads from different business units in separate AWS accounts. However, these applications have to conform to a naming scheme based on organization policies and simpler management of DNS hierarchy. As a best practice, the integration with on-premises DNS is done by configuring Amazon Route 53 Resolver endpoints in a shared networking account. Following is an example of this architecture.

Route 53 PHZs and Resolver Endpoints

Figure 1 – Architecture Diagram

The customer in this example has on-premises applications under the customer.local domain. Applications hosted in AWS use subdomain delegation to aws.customer.local. The example here shows three applications that belong to three different teams, and those environments are located in their separate AWS accounts to allow for autonomy and flexibility. This architecture pattern follows the option of the “Multi-Account Decentralized” model as described in the whitepaper Hybrid Cloud DNS options for Amazon VPC.

This architecture involves three key components:

1. PHZ configuration: PHZ for the subdomain aws.customer.local is created in the shared Networking account. This is to support centralized management of PHZ for ancillary applications where teams don’t want individual control (Item 1a in Figure). However, for the key business applications, each of the teams or business units creates its own PHZ. For example, app1.aws.customer.local – Application1 in Account A, app2.aws.customer.local – Application2 in Account B, app3.aws.customer.local – Application3 in Account C (Items 1b in Figure). Application1 is a critical business application and has stringent DR requirements. A DR environment of this application is also created in us-west-2.

For a consistent view of DNS and efficient DNS query routing between the AWS accounts and on-premises, best practice is to associate all the PHZs to the Networking Account. PHZs created in Account A, B and C are associated with VPC in Networking Account by using cross-account association of Private Hosted Zones with VPCs. This creates overlapping domains from multiple PHZs for the VPCs of the networking account. It also overlaps with the parent sub-domain PHZ (aws.customer.local) in the Networking account. In such cases where there is two or more PHZ with overlapping namespaces, Route 53 resolver routes traffic based on most specific match as described in the Developer Guide.
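
The cross-account association described above is a two-step handshake: the account that owns the PHZ authorizes the Networking account’s VPC, and the Networking account then completes the association. The boto3 sketch below illustrates that flow; the profile names, hosted zone ID, VPC ID, and Region are placeholders.

import boto3

vpc = {"VPCRegion": "us-east-1", "VPCId": "vpc-0networkingexample"}
hosted_zone_id = "Z0APP1EXAMPLE"   # PHZ for app1.aws.customer.local in Account A

# Step 1: in Account A, authorize the Networking account VPC.
route53_account_a = boto3.Session(profile_name="account-a").client("route53")
route53_account_a.create_vpc_association_authorization(
    HostedZoneId=hosted_zone_id, VPC=vpc)

# Step 2: in the Networking account, complete the association.
route53_networking = boto3.Session(profile_name="networking").client("route53")
route53_networking.associate_vpc_with_hosted_zone(
    HostedZoneId=hosted_zone_id, VPC=vpc)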

2. Route 53 Resolver endpoints for on-premises integration (Item 2 in Figure): The networking account is used to set up the integration with on-premises DNS using Route 53 Resolver endpoints as shown in Resolving DNS queries between VPC and your network. Inbound and Outbound Route 53 Resolver endpoints are created in the VPC in us-east-1 to serve as the integration between on-premises DNS and AWS. The DNS traffic between on-premises and AWS requires an AWS Site-to-Site VPN connection or AWS Direct Connect connection to carry DNS and application traffic. For each Resolver endpoint, two or more IP addresses can be specified to map to different Availability Zones (AZs). This helps create a highly available architecture.

3. Route 53 Resolver rules (Item 3 in Figure): Forwarding rules are created only in the networking account to route DNS queries for on-premises domains (customer.local) to the on-premises DNS server. AWS Resource Access Manager (RAM) is used to share the rules to accounts A, B and C as mentioned in the section “Sharing forwarding rules with other AWS accounts and using shared rules” in the documentation. Account owners can now associate these shared rules with their VPCs the same way that they associate rules created in their own AWS accounts. If you share the rule with another AWS account, you also indirectly share the outbound endpoint that you specify in the rule as described in the section “Considerations when creating inbound and outbound endpoints” in the documentation. This implies that you can use one outbound endpoint in a Region to forward DNS queries to your on-premises network from multiple VPCs, even if the VPCs were created in different AWS accounts. Resolver starts to forward DNS queries for the domain name that’s specified in the rule to the outbound endpoint, which forwards them to the on-premises DNS servers. The rules are created in both Regions in this architecture.

This architecture provides the following benefits:

  1. Resilient and scalable
  2. Uses the VPC+2 endpoint, local caching and Availability Zone (AZ) isolation
  3. Minimal forwarding hops
  4. Lower cost: optimal use of Resolver endpoints and forwarding rules

In order to handle the DR, here are some other considerations:

  • For app1.aws.customer.local, the same PHZ is associated with VPC in us-west-2 region. While VPCs are regional, the PHZ is a global construct. The same PHZ is accessible from VPCs in different regions.
  • Failover routing policy is set up in the PHZ and failover records are created. However, Route 53 health checkers (being outside of the VPC) require a public IP for your applications. As these business applications are internal to the organization, a metric-based health check with Amazon CloudWatch can be configured, as mentioned in Configuring failover in a private hosted zone (see the CLI sketch after this list).
  • Resolver endpoints are created in VPC in another region (us-west-2) in the networking account. This allows on-premises servers to failover to these secondary Resolver inbound endpoints in case the region goes down.
  • A second set of forwarding rules is created in the networking account, which uses the outbound endpoint in us-west-2. These are shared with Account A and then associated with VPC in us-west-2.
  • In addition, to have DR across multiple on-premises locations, the on-premises servers should also have a secondary backup DNS server on-premises (not shown in the diagram).

This ensures a simple DNS architecture for the DR setup, and seamless failover for applications in case of a region failure.
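Here is the hedged CLI sketch referenced in the health-check consideration above. The alarm name, Region, and insufficient-data behavior are placeholders, so verify the exact field values against the Route 53 health check documentation.

# Create a Route 53 health check backed by a CloudWatch alarm (no public IP required)
aws route53 create-health-check \
  --caller-reference app1-failover-$(date +%s) \
  --health-check-config '{
    "Type": "CLOUDWATCH_METRIC",
    "AlarmIdentifier": { "Region": "us-east-1", "Name": "app1-primary-healthy" },
    "InsufficientDataHealthStatus": "LastKnownStatus"
  }'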

Considerations

  • If Application 1 needs to communicate to Application 2, then the PHZ from Account A must be shared with Account B. DNS queries can then be routed efficiently for those VPCs in different accounts.
  • Create additional IP addresses in a single AZ/subnet for the resolver endpoints, to handle large volumes of DNS traffic.
  • Look at Considerations while using Private Hosted Zones before implementing such architectures in your AWS environment.

Summary

Hybrid cloud environments can utilize the features of Route 53 Private Hosted Zones such as overlapping namespaces and the ability to perform cross-account and multi-region VPC association. This creates a unified DNS view for your application environments. The architecture allows for scalability and high availability for business applications.

Using Amazon CloudWatch Lambda Insights to Improve Operational Visibility

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/using-amazon-cloudwatch-lambda-insights-to-improve-operational-visibility/

To balance costs while still meeting the service levels their business requires, some customers elect to continuously monitor and optimize their AWS Lambda functions. They collect and analyze metrics and logs to monitor performance, and to isolate errors for troubleshooting purposes. They also seek to right-size function configurations by measuring function duration, CPU usage, and memory allocation. Using various tools and sources of data to do this can be time-consuming, and some customers even go so far as to build their own customized dashboards to surface and analyze this data.

We announced Amazon CloudWatch Lambda Insights as a public preview this past October for customers looking to gain deeper operational oversight and visibility into the behavior of their Lambda functions. Today, I’m pleased to announce that CloudWatch Lambda Insights is now generally available. CloudWatch Lambda Insights provides clearer and simpler operational visibility of your functions by automatically collating and summarizing Lambda performance metrics, errors, and logs in prebuilt dashboards, saving you from time-consuming, manual work.

Once enabled on your functions, CloudWatch Lambda Insights automatically starts collecting and summarizing performance metrics and logs, and, from a convenient dashboard, provides you with a one-click drill-down into metrics and errors for Lambda function requests, simplifying analysis and troubleshooting.

Exploring CloudWatch Lambda Insights
To get started, I need to enable Lambda Insights on my functions. In the Lambda console, I navigate to my list of functions, and then select the function I want to enable for Lambda Insights by clicking on its name. From the function’s configuration view I then scroll to the Monitoring tools panel, click Edit, enable Enhanced monitoring, and click Save. If you want to enable enhanced monitoring for many functions, you may find it more convenient to use AWS Command Line Interface (CLI), AWS Tools for PowerShell, or AWS CloudFormation approaches instead. Note that once enhanced monitoring has been enabled, it can take a few minutes before data begins to surface in CloudWatch.
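As a hedged sketch of the CLI approach, enabling Lambda Insights amounts to attaching the Lambda Insights extension layer and the corresponding managed policy. The layer ARN (account ID and version) varies by Region, and the role name below is a placeholder, so look up the current values in the Lambda Insights documentation.

# Attach the Lambda Insights extension layer (ARN version and Region are examples)
aws lambda update-function-configuration \
  --function-name ExpensiveFunction \
  --layers "arn:aws:lambda:us-east-1:580247275435:layer:LambdaInsightsExtension:14"

# Grant the execution role the Lambda Insights managed policy (role name is a placeholder)
aws iam attach-role-policy \
  --role-name ExpensiveFunction-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLambdaInsightsExecutionRolePolicy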

Screenshot showing enabling of Lambda Insights

In the Amazon CloudWatch Console, I start by selecting Performance monitoring beneath Lambda Insights in the navigation panel. This takes me to the Multi-function view. Metrics for all functions on which I have enabled Lambda Insights are graphed in the view. At the foot of the page there’s also a table listing the functions, summarizing some data in the graphs and adding Cold starts. The table gives me the ability to sort the data based on the metric I’m interested in.

Screenshot of metric graphs on the Lambda Insights Multi-function view

Screenshot of the Lambda Insights Multi-function view summary list

An interesting graph on this page, especially if you are trying to balance cost with performance, is Function Cost. This graph shows the direct cost of your functions in terms of megabyte milliseconds (MB-MS), which is how Lambda computes the financial charge of a function’s invocation. Hovering over the graph at a particular point in time shows more details.

Screenshot of function cost graph

Let’s examine my ExpensiveFunction further. Moving to the summary list at the bottom of the page, I click on the function name, which takes me to the Single function view (from here I can switch to my other functions using the controls at the top of the page, without needing to return to the Multi-function view). The graphs show me metrics for invocations and errors, duration, any throttling, and memory, CPU, and network usage for the selected function. To add to the detail available, the most recent 1000 invocations are also listed in a table that I can sort as needed.

Clicking View in the Trace column of a request in the invocations list takes me to the Service Lens trace view, showing where my function spent its time on that particular invocation request. I could use this to determine if changes to the business logic of the function might improve performance by reducing function duration, which will have a direct effect on cost. If I’m troubleshooting, I can view the Application or Performance logs for the function using the View logs button. Application logs are those that existed before Lambda Insights was enabled on the function, whereas Performance logs are those that Lambda Insights has collated across all my enabled functions. The log views enable me to run queries and in the case of the Performance logs I can run queries across all enabled functions in my account, for example to perform a top-N analysis to determine my most expensive functions, or see how one function compares to another.

Here’s how I can make use of Lambda Insights to check if I’m ‘moving the needle’ in the correct direction when attempting to right-size a function, by examining the effect of changes to memory allocation on function cost. The starting point for my ExpensiveFunction is 128MB. By moving from 128MB to 512MB, the data shows me that function cost, duration, and concurrency are all reduced – this is shown at (1) in the graphs. Moving from 512MB to 1024MB, (2), has no impact on function cost, but it further reduces duration by 50% and also affects the maximum concurrency. I ran two further experiments. First, moving from 1024MB to 2048MB, (3), resulted in a further reduction in duration, but the function cost started to increase, so the needle was starting to swing in the wrong direction. Finally, moving from 2048MB to 3008MB, (4), significantly increased the cost but had no effect on duration. With the aid of Lambda Insights I can infer that the sweet spot for this function (assuming latency is not a consideration) lies between 1024MB and 2048MB. All these experiments are shown in the graphs below (the concurrency graph lags slightly, as earlier invocations are finishing up as configuration changes are made).

Screenshot of function cost experiments
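If you want to run the same kind of experiment yourself, each memory step is a single CLI call. The following is a sketch that assumes the ExpensiveFunction name from above and that you drive representative traffic at each setting before comparing the results in Lambda Insights.

# Step through the memory configurations used in the experiments above
for mem in 128 512 1024 2048 3008; do
  aws lambda update-function-configuration \
    --function-name ExpensiveFunction \
    --memory-size ${mem}
  # ...generate representative load here, then compare function cost and
  # duration for this setting in the Lambda Insights dashboards
  sleep 300
done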

CloudWatch Lambda Insights gives simple and convenient operational oversight and visibility into the behavior of my AWS Lambda functions, and is available today in all regions where AWS Lambda is present.

Learn more about Amazon CloudWatch Lambda Insights in the documentation and get started today.

— Steve

Monitoring AWS Outposts capacity

Post Syndicated from Shubha Kumbadakone original https://aws.amazon.com/blogs/compute/monitoring-aws-outposts-capacity/

This post is authored by Mike Burbey, Sr. Outposts SA

AWS Outposts is a fully managed service that offers the same AWS infrastructure, AWS services, APIs, and tools to any data center, colocation space, or on-premises facility for a consistent hybrid experience. AWS Outposts is ideal for workloads that require low latency, access to on-premises systems, local data processing, data residency, and migration of applications with local system interdependencies.

As part of the AWS Shared Responsibility Model, customers are responsible for capacity planning while using AWS Outposts. Customers must forecast compute and storage needs in addition to data center space, power, and HVAC requirements along with associated lead times. This blog demonstrates how to create a dashboard and alarms to assist in capacity planning. Amazon CloudWatch is used to capture the metrics and create dashboards while Amazon Simple Notification Service (SNS) sends notifications when capacity exceeds a threshold.

Solution

Create a dashboard

  1. First, log into the account that the AWS Outpost is installed in. In the AWS console go to CloudWatch, then click on Dashboards, then click on Create dashboard to create a dashboard for AWS Outposts capacity.
  2. Provide the dashboard a name and click Create dashboard.
  3. Select a line widget and click Next.                                               
  4. This widget is metrics-based. Select Metrics and click Configure.
  5. Click Outposts.                                                                            
  6. Click By Outpost and Instance Family.                          
  7. Select all of the InstanceFamily options where the Metric Name is InstanceFamilyCapacityUtilization then click Create widget.                                                                                          
  8. The first widget has been added to the dashboard. This widget displays the percent of capacity used for the entire instance family. Additional widgets may be created to display the capacity available or for a specific instance type (in this case, c5.2xlarge).                                                                                  
  9. Now, I add an Amazon EBS capacity widget to the dashboard. To do this, click Add widget, select Line as the widget type and Metrics as the data source.
  10. Click Outposts.                                                                          
  11. Click Outpostid, VolumeType.                                          
  12. Select EBSVolumeTypeCapacityUtilization and then Create widget.
  13. The dashboard now has two widgets set up. Click Save dashboard. If you do not save the dashboard, your changes are lost. The following image shows what your two widgets should look like.
  14. Dashboards are useful for monitoring capacity over time. It is unlikely that someone is looking at the dashboard at the moment when usage increases. To ensure you are notified when an increase in utilization happens, you can create an alarm that sends a notification to Amazon Simple Notification Service (SNS).

Create alarms

When creating CloudWatch alarms, only one metric can be configured per alarm. So, a single alarm can only report on a single EC2 instance family. Additional alarms must be configured for each EC2 instance family deployed in an AWS Outpost. To learn more about CloudWatch alarms, visit the technical documentation.

  1. To create a new alarm, click Create alarm.                        
  2. Click Select metric.                                                                                                                
  3. To select a metric for EC2 capacity utilization, select Outposts, By Outpost and Instance Family. Select the Instance Family to create an alarm for. In this example, the alarm is created for the C5 Instance Family and is based on capacity utilization. Click Select metric.                                              
  4. Define the threshold to alarm when the metric is greater than or equal to 80% and click Next.
  5. When setting up the first alarm, an Amazon SNS topic must be created. The topic can be re-used when setting up additional alarms. Click Create topic. Enter a name for the SNS topic and email addresses that should receive the notification. Click Add notification, then click Next.                                        
  6. Enter a name and description for the alarm, click Next, and click Create alarm.

Amazon SNS requires topic subscriptions to be confirmed. Each email address receives an email message from AWS Notifications. Remember to create an alarm for each EC2 family type and EBS volume to ensure that alerts are received for all resources on the AWS Outpost. For more information on Amazon SNS, visit the developer guide.
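If you prefer to script this, the following is a hedged CLI sketch of one such alarm. The metric name comes from the dashboard steps above, but the namespace and dimension names are assumptions, and the Outpost ID, SNS topic ARN, and threshold are placeholders; verify them against the metrics you see in the CloudWatch console.

aws cloudwatch put-metric-alarm \
  --alarm-name outposts-c5-capacity-80pct \
  --namespace AWS/Outposts \
  --metric-name InstanceFamilyCapacityUtilization \
  --dimensions Name=OutpostId,Value=op-0123456789abcdef0 Name=InstanceFamily,Value=c5 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:outposts-capacity-alerts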

Conclusion

You now have visibility into the compute and storage capacity of your AWS Outposts, which informs your capacity planning activities. To learn about additional CloudWatch metrics available for AWS Outposts, visit the user guide.

For additional information on AWS Outposts capacity management check out this webinar to learn more about additional AWS Outposts metrics and the installation process.

Using AWS Lambda extensions to send logs to custom destinations

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/using-aws-lambda-extensions-to-send-logs-to-custom-destinations/

You can now send logs from AWS Lambda functions directly to a destination of your choice using AWS Lambda Extensions. Lambda Extensions are a new way for monitoring, observability, security, and governance tools to easily integrate with AWS Lambda. For more information, see “Introducing AWS Lambda Extensions – In preview”.

To help you troubleshoot failures in Lambda functions, AWS Lambda automatically captures and streams logs to Amazon CloudWatch Logs. This stream contains the logs that your function code and extensions generate, in addition to logs the Lambda service generates as part of the function invocation.

Previously, to send logs to a custom destination, you typically configured and operated a CloudWatch Log Group subscription, with a separate Lambda function forwarding the logs to the destination of your choice.

Logging tools, running as Lambda extensions, can now receive log streams directly from within the Lambda execution environment, and send them to any destination. This makes it even easier for you to use your preferred extensions for diagnostics.

Today, you can use extensions to send logs to Coralogix, Datadog, Honeycomb, Lumigo, New Relic, and Sumo Logic.

Overview

To receive logs, extensions subscribe using the new Lambda Logs API.

Lambda Logs API

The Lambda service then streams the logs directly to the extension. The extension can then process, filter, and route them to any preferred destination. Lambda still sends the logs to CloudWatch Logs.

You deploy extensions, including ones that use the Logs API, as Lambda layers, with the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code tools such as AWS CloudFormation, the AWS Serverless Application Model (AWS SAM), Serverless Framework, and Terraform.

Logging extensions from AWS Lambda Ready Partners and AWS Partners available at launch

Today, you can use logging extensions with the following tools:

  • The Datadog extension now makes it easier than ever to collect your serverless application logs for visualization, analysis, and archival. Paired with Datadog’s AWS integration, end-to-end distributed tracing, and real-time enhanced AWS Lambda metrics, you can proactively detect and resolve serverless issues at any scale.
  • Lumigo provides monitoring and debugging for modern cloud applications. With the open source extension from Lumigo, you can send Lambda function logs directly to an S3 bucket, unlocking new post processing use cases.
  • New Relic enables you to efficiently monitor, troubleshoot, and optimize your Lambda functions. New Relic’s extension allows you to send your Lambda service platform logs directly to New Relic’s unified observability platform, allowing you to quickly visualize data with minimal latency and cost.
  • Coralogix is a log analytics and cloud security platform that empowers thousands of companies to improve security and accelerate software delivery, allowing you to get deep insights without paying for the noise. Coralogix can now read Lambda function logs and metrics directly, without using CloudWatch or S3, reducing the latency and cost of observability.
  • Honeycomb is a powerful observability tool that helps you debug your entire production app stack. Honeycomb’s extension decreases the overhead, latency, and cost of sending events to the Honeycomb service, while increasing reliability.
  • The Sumo Logic extension enables you to get instant visibility into the health and performance of your mission-critical applications using AWS Lambda. With this extension and Sumo Logic’s continuous intelligence platform, you can now ensure that all your Lambda functions are running as expected, by analyzing function, platform, and extension logs to quickly identify and remediate errors and exceptions.

You can also build and use your own logging extensions to integrate your organization’s tooling.

Showing a logging extension to send logs directly to S3

This demo shows an example of using a simple logging extension to send logs to Amazon Simple Storage Service (S3).

To set up the example, visit the GitHub repo and follow the instructions in the README.md file.

The example extension runs a local HTTP endpoint listening for HTTP POST events. Lambda delivers log batches to this endpoint. The example creates an S3 bucket to store the logs. A Lambda function is configured with an environment variable to specify the S3 bucket name. Lambda streams the logs to the extension. The extension copies the logs to the S3 bucket.

Lambda environment variable specifying S3 bucket

The extension uses the Extensions API to register for INVOKE and SHUTDOWN events. The extension, using the Logs API, then subscribes to receive platform and function logs, but not extension logs.

As the example is an asynchronous system, logs for one invoke may be processed during the next invocation. Logs for the last invoke may be processed during the SHUTDOWN event.

Testing the function from the Lambda console, Lambda sends logs to CloudWatch Logs. The logs stream shows logs from the platform, function, and extension.

Lambda logs visible in CloudWatch Logs

The logging extension also receives the log stream directly from Lambda, and copies the logs to S3.

Browsing to the S3 bucket, the log files are available.

S3 bucket containing copied logs

Downloading the file shows the log lines. The log contains the same platform and function logs, but not the extension logs, as specified during the subscription.

[{'time': '2020-11-12T14:55:06.560Z', 'type': 'platform.start', 'record': {'requestId': '49e64413-fd42-47ef-b130-6fd16f30148d', 'version': '$LATEST'}},
{'time': '2020-11-12T14:55:06.774Z', 'type': 'platform.logsSubscription', 'record': {'name': 'logs_api_http_extension.py', 'state': 'Subscribed', 'types': ['platform', 'function']}},
{'time': '2020-11-12T14:55:06.774Z', 'type': 'platform.extension', 'record': {'name': 'logs_api_http_extension.py', 'state': 'Ready', 'events': ['INVOKE', 'SHUTDOWN']}},
{'time': '2020-11-12T14:55:06.776Z', 'type': 'function', 'record': 'Function: Logging something which logging extension will send to S3\n'},
{'time': '2020-11-12T14:55:06.780Z', 'type': 'platform.end', 'record': {'requestId': '49e64413-fd42-47ef-b130-6fd16f30148d'}},
{'time': '2020-11-12T14:55:06.780Z', 'type': 'platform.report', 'record': {'requestId': '49e64413-fd42-47ef-b130-6fd16f30148d', 'metrics': {'durationMs': 4.96, 'billedDurationMs': 100, 'memorySizeMB': 128, 'maxMemoryUsedMB': 87, 'initDurationMs': 792.41}, 'tracing': {'type': 'X-Amzn-Trace-Id', 'value': 'Root=1-5fad4cc9-70259536495de84a2a6282cd;Parent=67286c49275ac0ad;Sampled=1'}}}]

Lambda has sent specific logs directly to the subscribed extension. The extension has then copied them directly to S3.

For more example log extensions, see the Github repository.

How do extensions receive logs?

Extensions start a local listener endpoint to receive the logs using one of the following protocols:

  1. TCP – Logs are delivered to a TCP port in Newline delimited JSON format (NDJSON).
  2. HTTP – Logs are delivered to a local HTTP endpoint through PUT or POST, as an array of records in JSON format. http://sandbox:${PORT}/${PATH}. The $PATH parameter is optional.

AWS recommends using an HTTP endpoint over TCP because HTTP tracks successful delivery of the log messages to the local endpoint that the extension sets up.

Once the endpoint is running, extensions use the Logs API to subscribe to any of three different logs streams:

  • Function logs that are generated by the Lambda function.
  • Lambda service platform logs (such as the START, END, and REPORT logs in CloudWatch Logs).
  • Extension logs that are generated by extension code.

The Lambda service then sends logs to endpoint subscribers inside of the execution environment only.
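For orientation, here is a hypothetical sketch of what a subscription request could look like from inside an extension process. The endpoint path, header name, and body fields are assumptions based on the protocol description above, so check the Logs API reference for the exact schema.

# Hypothetical Logs API subscription sketch (field names are assumptions).
# ${AWS_LAMBDA_RUNTIME_API} is provided to the extension by the Lambda service;
# ${EXTENSION_ID} is returned when the extension registers with the Extensions API.
curl -s -X PUT "http://${AWS_LAMBDA_RUNTIME_API}/2020-08-15/logs" \
  -H "Lambda-Extension-Identifier: ${EXTENSION_ID}" \
  -d '{
        "schemaVersion": "2020-08-15",
        "types": ["platform", "function"],
        "buffering": { "timeoutMs": 100, "maxBytes": 262144, "maxItems": 1000 },
        "destination": { "protocol": "HTTP", "URI": "http://sandbox:8080/lambda_logs" }
      }'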

Even if an extension subscribes to one or more log streams, Lambda continues to send all logs to CloudWatch.

Performance considerations

Extensions share resources with the function, such as CPU, memory, disk storage, and environment variables. They also share permissions, using the same AWS Identity and Access Management (IAM) role as the function.

Log subscriptions consume memory resources as each subscription opens a new memory buffer to store the logs. This memory usage counts towards memory consumed within the Lambda execution environment.

For more information on resources, security and performance with extensions, see “Introducing AWS Lambda Extensions – In preview”.

What happens if Lambda cannot deliver logs to an extension?

The Lambda service stores logs before sending to CloudWatch Logs and any subscribed extensions. If Lambda cannot deliver logs to the extension, it automatically retries with backoff. If the log subscriber crashes, Lambda restarts the execution environment. The logs extension re-subscribes, and continues to receive logs.

When using an HTTP endpoint, Lambda continues to deliver logs from the last acknowledged delivery. With TCP, the extension may lose logs if an extension or the execution environment fails.

The Lambda service buffers logs in memory before delivery. The buffer size is proportional to the buffering configuration used in the subscription request. If an extension cannot process the incoming logs quickly enough, the buffer fills up. To reduce the likelihood of an out of memory event due to a slow extension, the Lambda service drops records and adds a platform.logsDropped log record to the affected extension to indicate the number of dropped records.

Disabling logging to CloudWatch Logs

Lambda continues to send logs to CloudWatch Logs even if extensions subscribe to the logs stream.

To disable logging to CloudWatch Logs for a particular function, you can amend the Lambda execution role to remove access to CloudWatch Logs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:*"
      ]
    }
  ]
}

Logs are no longer delivered to CloudWatch Logs for functions using this role, but are still streamed to subscribed extensions. You are no longer billed for CloudWatch logging for these functions.

Pricing

Logging extensions, like other extensions, share the same billing model as Lambda functions. When using Lambda functions with extensions, you pay for requests served and the combined compute time used to run your code and all extensions, in 100 ms increments. To learn more about the billing for extensions, visit the Lambda FAQs page.

Conclusion

Lambda extensions enable you to extend the Lambda service to more easily integrate with your favorite tools for monitoring, observability, security, and governance.

Extensions can now subscribe to receive log streams directly from the Lambda service, in addition to CloudWatch Logs. Today, you can install a number of available logging extensions from AWS Lambda Ready Partners and AWS Partners. Extensions make it easier to use your existing tools with your serverless applications.

To try the S3 demo logging extension, follow the instructions in the README.md file in the GitHub repository.

Extensions are now available in preview in all commercial regions other than the China regions.

For more serverless learning resources, visit https://serverlessland.com.

A New Integration for CloudWatch Alarms and OpsCenter

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/a-new-integration-for-cloudwatch-alarms-and-opscenter/

Over a year ago, I wrote about the Launch of a feature in AWS Systems Manager called OpsCenter, which allows customers to aggregate issues, events, and alerts into one place and make it easier for operations engineers and IT professionals to investigate and remediate problems. Today, I get to tell you about a new integration […]

Custom logging with AWS Batch

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/custom-logging-with-aws-batch/

This post was written by Christian Kniep, Senior Developer Advocate for HPC and AWS Batch. 

For HPC workloads, visibility into job logs is important not only to debug a failed job, but also to gain insights into a running job and track its trajectory, so you can influence the configuration of the next job or terminate a job that has gone off track.

With AWS Batch, customers are able to run batch workloads at scale, reliably and with ease, as this managed service takes care of the undifferentiated heavy lifting. The customer can then focus on submitting jobs and getting work done. Customers told us that at a certain scale, the single logging driver available within AWS Batch made it hard to separate logs as they were all ending up in the same log group in Amazon CloudWatch.

With the new release of custom logging driver support, customers are now able to adjust how job output is logged. They can not only customize the Amazon CloudWatch settings, but also enable the use of external logging frameworks such as splunk, fluentd, json-file, syslog, gelf, and journald.

This allows AWS Batch jobs to use the existing logging systems they are accustomed to, with fine-grained control of the log data for debugging and access control purposes.

In this blog, I show the benefits of custom logging with AWS Batch by adjusting the log targets for jobs. The first example will customize the Amazon CloudWatch log group, the second will log to Splunk, an external logging service.

Example setup

To showcase this new feature, I use the AWS Command Line Interface (CLI) to set up the following:

  1. IAM roles, policies, and profiles to grant access and permissions
  2. A compute environment to provide the compute resources to run jobs
  3. A job queue, which supervises the job execution and schedules jobs on a compute environment
  4. A job definition, which uses a simple job to demonstrate how the new configuration can be applied

Once those tasks are completed, I submit a job and send logs to a customized CloudWatch log-group and Splunk.

Prerequisite

To make things easier, I first set a couple of environment variables to have the information handy for later use. I use the following code to set up the environment variables.

# in case it is not already installed
sudo yum install -y jq 
export MD_URL=http://169.254.169.254/latest/meta-data
export IFACE=$(curl -s ${MD_URL}/network/interfaces/macs/)
export SUBNET_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/subnet-id)
export VPC_ID=$(curl -s ${MD_URL}/network/interfaces/macs/${IFACE}/vpc-id)
export AWS_REGION=$(curl -s ${MD_URL}/placement/availability-zone | sed 's/[a-z]$//')
export AWS_ACCT_ID=$(curl -s ${MD_URL}/identity-credentials/ec2/info |jq -r .AccountId)
export AWS_SG_DEFAULT=$(aws ec2 describe-security-groups \
--filters Name=group-name,Values=default \
|jq -r '.SecurityGroups[0].GroupId')

IAM

Because this walkthrough uses the AWS CLI, the IAM roles, policies, and instance profile are created manually in the following steps.

Trust Policies

IAM roles are defined to be used by a certain service. In the simplest case, you want a role to be used by Amazon EC2 – the service that provides compute capacity in the cloud. The policy that defines which entity is allowed to assume an IAM role is called a trust policy. To set up a trust policy for an IAM role, use the following code snippet.

cat > ec2-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

Instance role

With the IAM trust policy, I now create an ecsInstanceRole and attach the pre-defined policy AmazonEC2ContainerServiceforEC2Role. This allows an instance to interact with Amazon ECS.

aws iam create-role --role-name ecsInstanceRole \
 --assume-role-policy-document file://ec2-trust-policy.json
aws iam create-instance-profile --instance-profile-name ecsInstanceProfile
aws iam add-role-to-instance-profile \
    --instance-profile-name ecsInstanceProfile \
    --role-name ecsInstanceRole
aws iam attach-role-policy --role-name ecsInstanceRole \
 --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

Service Role

The AWS Batch service uses a role to interact with different services. The trust relationship reflects that the AWS Batch service is going to assume this role.  You can set up this role with the following logic.

cat > svc-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name AWSBatchServiceRole \
--assume-role-policy-document file://svc-trust-policy.json
aws iam attach-role-policy --role-name AWSBatchServiceRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole

In addition to interacting with Amazon ECS, the instance role needs to be able to create and write to Amazon CloudWatch log groups. To control which log group names are used, a condition is attached to the policy.

The following commands create this policy and attach it to the instance role, so that the new log group can be created.

cat > policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "logs:CreateLogGroup"
    ],
    "Resource": "*",
    "Condition": {
      "StringEqualsIfExists": {
        "batch:LogDriver": ["awslogs"],
        "batch:AWSLogsGroup": ["/aws/batch/custom/*"]
      }
    }
  }]
}
EOF
aws iam create-policy --policy-name batch-awslog-policy \
    --policy-document file://policy.json
aws iam attach-role-policy --policy-arn arn:aws:iam::${AWS_ACCT_ID}:policy/batch-awslog-policy --role-name ecsInstanceRole

At this point, I have created the IAM roles and policies so that the instances and the service are able to interact with the AWS APIs, including trust policies that define which services are meant to use them: Amazon EC2 for the ecsInstanceRole, and the AWS Batch service itself for the AWSBatchServiceRole.

Compute environment

Now, I am going to create a compute environment, which is going to spin up an instance (one vCPU target) to run the example job in.

cat > compute-environment.json << EOF
{
  "computeEnvironmentName": "od-ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "allocationStrategy": "BEST_FIT_PROGRESSIVE",
    "minvCpus": 1,
    "maxvCpus": 8,
    "desiredvCpus": 1,
    "instanceTypes": ["m5.xlarge"],
    "subnets": ["${SUBNET_ID}"],
    "securityGroupIds": ["${AWS_SG_DEFAULT}"],
    "instanceRole": "arn:aws:iam::${AWS_ACCT_ID}:instance-profile/ecsInstanceRole",
    "tags": {"Name": "aws-batch-compute"},
    "bidPercentage": 0
  },
  "serviceRole": "arn:aws:iam::${AWS_ACCT_ID}:role/AWSBatchServiceRole"
}
EOF
aws batch create-compute-environment --cli-input-json file://compute-environment.json  

Once this command completes, a compute environment is spun up in the background. This takes a moment. You can use the following command to check on the status of the compute environment.

aws batch describe-compute-environments

Once its state is ENABLED and its status is VALID, we can continue by setting up the job queue.
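For a quicker check, you can filter the response down to just the state and status fields:

aws batch describe-compute-environments \
  --compute-environments od-ce \
  --query 'computeEnvironments[0].[state,status]' \
  --output text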

Job Queue

Now that I have a compute environment up and running, I will create a job queue which accepts job submissions and schedules the jobs on the compute environment.

cat > job-queue.json << EOF
{
  "jobQueueName": "jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [{
    "order": 0,
    "computeEnvironment": "od-ce"
  }]
}
EOF
aws batch create-job-queue --cli-input-json file://job-queue.json

Job definition

The job definition is used as a template for jobs. This example runs a plain container and prints the environment variables. With the new release of AWS Batch, the logging driver awslogs now allows you to change the log group configuration within the job definition.

cat > job-definition.json << EOF
{
  "jobDefinitionName": "alpine-env",
  "type": "container",
  "containerProperties": {
  "image": "alpine",
  "vcpus": 1,
  "memory": 128,
  "command": ["env"],
  "readonlyRootFilesystem": true,
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": { 
      "awslogs-region": "${AWS_REGION}", 
      "awslogs-group": "/aws/batch/custom/env-queue",
      "awslogs-create-group": "true"}
    }
  }
}
EOF
aws batch register-job-definition --cli-input-json file://job-definition.json

Job Submission

Using the above job definition, you can now submit a job.

aws batch submit-job \
  --job-name test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-env:1

Now, you can check the log group in CloudWatch. Go to the CloudWatch console and find the Log groups section on the left.

log groups in cloudwatch

Now, click on the log group defined above, and you should see the output of the job. This allows you to debug if something within the container went wrong, or to process the logs and create alarms and reports.

cloudwatch log events
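If you prefer the CLI over the console, you can read the same output from the custom log group; the log stream name is specific to your job, so list the streams first.

# Find the most recent log stream in the custom log group
aws logs describe-log-streams \
  --log-group-name /aws/batch/custom/env-queue \
  --order-by LastEventTime --descending --max-items 1

# Fetch the job output from that stream (replace with the logStreamName returned above)
aws logs get-log-events \
  --log-group-name /aws/batch/custom/env-queue \
  --log-stream-name "<log-stream-name>"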

Splunk

Splunk is an established log engine for a broad set of customers. You can use the Docker container to set up a Splunk server quickly. More information can be found in the Splunk documentation. You need to configure the HTTP Event Collector, which provides you with a link and a token.

To send logs to Splunk, create an additional job-definition with the Splunk token and URL. Please adjust the splunk-url and splunk-token to match your Splunk setup.

{
  "jobDefinitionName": "alpine-splunk",
  "type": "container",
  "containerProperties": {
    "image": "alpine",
    "vcpus": 1,
    "memory": 128,
    "command": ["env"],
    "readonlyRootFilesystem": false,
    "logConfiguration": {
      "logDriver": "splunk",
      "options": {
        "splunk-url": "https://<splunk-url>",
        "splunk-token": "XXX-YYY-ZZZ"
      }
    }
  }
}
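Assuming you save the JSON above as job-definition-splunk.json (the file name is an assumption), registering and submitting the job works the same way as in the awslogs example:

aws batch register-job-definition --cli-input-json file://job-definition-splunk.json
aws batch submit-job \
  --job-name splunk-test-$(date +"%F_%H-%M-%S") \
  --job-queue arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-queue/jq \
  --job-definition arn:aws:batch:${AWS_REGION}:${AWS_ACCT_ID}:job-definition/alpine-splunk:1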

This forwards the logs to Splunk, as you can see in the following image.

forward to splunk

Conclusion

This blog post showed you how to apply custom logging to AWS Batch using the awslogs and Splunk logging drivers. While these are two important logging drivers, please head over to the documentation to find out about fluentd, syslog, json-file, and other drivers to find the best match for your current logging infrastructure.

 

Why Deployment Requirements are Important When Making Architectural Choices

Post Syndicated from Yusuf Mayet original https://aws.amazon.com/blogs/architecture/why-deployment-requirements-are-important-when-making-architectural-choices/

Introduction

Too often, architects fall into the trap of thinking the architecture of an application is restricted to just the runtime part of the architecture. By doing this we focus on only a single customer (such as the application’s users and how they interact with the system) and we forget about other important customers like developers and DevOps teams. This means that requirements regarding deployment ease, deployment frequency, and observability are delegated to the back burner during design time and tacked on after the runtime architecture is built. This leads to increased costs and reduced ability to innovate.

In this post, I discuss the importance of key non-functional requirements, and how they can and should influence the target architecture at design time.

Architectural patterns

When building and designing new applications, we usually start by looking at the functional requirements, which will define the functionality and objective of the application. These are all the things that the users of the application expect, such as shopping online, searching for products, and ordering. We also consider aspects such as usability to ensure a great user experience (UX).

We then consider the non-functional requirements, the so-called “ilities,” which typically include requirements regarding scalability, availability, latency, etc. These are constraints around the functional requirements, like response times for placing orders or searching for products, which will define the expected latency of the system.

These requirements—both functional and non-functional together—dictate the architectural pattern we choose to build the application. These patterns include multi-tier, event-driven architecture, microservices, and others, and each one has benefits and limitations. For example, a microservices architecture allows for a system where services can be deployed and scaled independently, but this also introduces complexity around service discovery.

Aligning the architecture to technical users’ requirements

Amazon is a customer-obsessed organization, so it’s important for us to first identify who the main customers are at each point so that we can meet their needs. The customers of the functional requirements are the application users, so we need to ensure the application meets their needs. For the most part, we will ensure that the desired product features are supported by the architecture.

But who are the users of the architecture? Not the applications’ users—they don’t care if it’s monolithic or microservices based, as long as they can shop and search for products. The main customers of the architecture are the technical teams: the developers, architects, and operations teams that build and support the application. We need to work backwards from the customers’ needs (in this case the technical team), and make sure that the architecture meets their requirements. We have therefore identified three non-functional requirements that are important to consider when designing an architecture that can equally meet the needs of the technical users:

  1. Deployability: the flow and agility to consistently deploy new features
  2. Observability: feedback about the state of the application
  3. Disposability: the ability to throw away resources and provision new ones quickly

Together these form part of the Developer Experience (DX), which is focused on providing developers with APIs, documentation, and other technologies to make the system easy to understand and use. This ensures that we design with Day 2 operations in mind.

Deployability: Flow

There are many reasons that organizations embark on digital transformation journeys, which usually involve moving to the cloud and adopting DevOps. According to Stephen Orban, GM of AWS Data Exchange, in his book Ahead in the Cloud, faster product development is often a key motivator, meaning the most important non-functional requirement is achieving flow, the speed at which you can consistently deploy new applications, respond to competitors, and test and roll out new features. As well, the architecture needs to be designed upfront to support deployability. If the architectural pattern is a monolithic application, this will hamper the developers’ ability to quickly roll out new features to production. So we need to choose and design the architecture to support easy and automated deployments. Results from years of research prove that leaders use DevOps to achieve high levels of throughput:

Graphic - Using DevOps to achieve high levels of throughput

Decisions on the pace and frequency of deployments will dictate whether to use rolling, blue/green, or canary deployment methodologies. This will then inform the architectural pattern chosen for the application.

Using AWS, in order to achieve flow of deployability, we will use services such as AWS CodePipeline, AWS CodeBuild, AWS CodeDeploy, and AWS CodeStar.

Observability: feedback

Once you have achieved a rapid and repeatable flow of features into production, you need a constant feedback loop of logs and metrics in order to detect and avoid problems. Observability is a property of the architecture that will allow us to better understand the application across the delivery pipeline and into production. This requires that we design the architecture to ensure that health reports are generated to analyze and spot trends. This includes error rates and stats from each stage of the development process, how many commits were made, build duration, and frequency of deployments. This not only allows us to measure code characteristics such as test coverage, but also developer productivity.

On AWS, we can leverage Amazon CloudWatch to gather and search through logs and metrics, AWS X-Ray for tracing, and Amazon QuickSight as an analytics tool to measure CI/CD metrics.

Disposability: automation

In his book, Cloud Strategy: A Decision-based Approach to a Successful Cloud Journey, Gregor Hohpe, Enterprise Strategist at AWS, notes that cloud and automation add a new “-ility”: disposability, which is the ability to set up and dispose of new servers in an automated and pain-free manner. Having immutable, disposable infrastructure greatly enhances your ability to achieve high levels of deployability and flow, especially when used in a CI/CD pipeline, which can create new resources and kill off the old ones.

At AWS, we can achieve disposability with serverless using AWS Lambda, or with containers running on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS), or using AWS Auto Scaling with Amazon Elastic Compute Cloud (EC2).

Three different views of the architecture

Once we have designed an architecture that caters for deployability, observability, and disposability, it exposes three lenses across which we can view the architecture:

3 views of the architecture

  1. Build lens: the focus of this part of the architecture is on achieving deployability, with the objective to give the developers an easy-to-use, automated platform that builds, tests, and pushes their code into the different environments, in a repeatable way. Developers can push code changes more reliably and frequently, and the operations team can see greater stability because environments have standard configurations and rollback procedures are automated
  2. Runtime lens: the focus is on the users of the application and on maximizing their experience by making the application responsive and highly available.
  3. Operate lens: the focus is on achieving observability for the DevOps teams, allowing them to have complete visibility into each part of the architecture.

Summary

When building and designing new applications, the functional requirements (such as UX) are usually the primary drivers for choosing and defining the architecture to support those requirements. In this post I have discussed how DX characteristics like deployability, observability, and disposability are not just operational concerns that get tacked on after the architecture is chosen. Rather, they should be as important as the functional requirements when choosing the architectural pattern. This ensures that the architecture can support the needs of both the developers and users, increasing quality and our ability to innovate.

Enhanced monitoring and automatic scaling for Apache Flink

Post Syndicated from Karthi Thyagarajan original https://aws.amazon.com/blogs/big-data/enhanced-monitoring-and-automatic-scaling-for-apache-flink/

Thousands of developers use Apache Flink to build streaming applications to transform and analyze data in real time. Apache Flink is an open-source framework and engine for processing data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications. Monitoring and scaling your applications is critical to keep your applications running successfully in a production environment.

Amazon Kinesis Data Analytics reduces the complexity of building and managing Apache Flink applications. Amazon Kinesis Data Analytics manages the underlying Apache Flink components that provide durable application state, metrics and logs, and more. Kinesis Data Analytics recently announced new Amazon CloudWatch metrics and the ability to create custom metrics to provide greater visibility into your application.

In this post, we show you how to easily monitor and automatically scale your Apache Flink applications with Amazon Kinesis Data Analytics. We walk through three examples. First, we create a custom metric in the Kinesis Data Analytics for Apache Flink application code. Second, we use application metrics to automatically scale the application. Finally, we share a CloudWatch dashboard for monitoring your application and recommend metrics that you can alarm on.

Custom metrics

Kinesis Data Analytics uses Apache Flink’s metrics system to send custom metrics to CloudWatch from your applications. For more information, see Using Custom Metrics with Amazon Kinesis Data Analytics for Apache Flink.

We use a basic word count program to illustrate the use of custom metrics. The following code shows how to extend RichFlatMapFunction to track the number of words it sees. This word count is then surfaced via the Flink metrics API.

// Imports required by this snippet (the Tokenizer is a nested class inside the application class):
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.util.Collector;

private static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

    private transient Counter counter;

    @Override
    public void open(Configuration config) {
        // Register a custom metric named TotalWords under nested metric groups
        this.counter = getRuntimeContext().getMetricGroup()
                .addGroup("kinesisanalytics")
                .addGroup("Service", "WordCountApplication")
                .addGroup("Tokenizer")
                .counter("TotalWords");
    }

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");

        // emit the pairs, counting every non-empty token
        for (String token : tokens) {
            if (token.length() > 0) {
                counter.inc();
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}

Custom metrics emitted through the Flink metrics API are forwarded to CloudWatch metrics by Kinesis Data Analytics for Apache Flink. The following screenshot shows the word count metric in CloudWatch.
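As an optional check from the CLI, you could look for the custom metric by name. The namespace and dimensions under which custom metrics surface depend on how the metric groups are configured, so treat the namespace below as an assumption and confirm it in the CloudWatch console.

aws cloudwatch list-metrics \
  --namespace "AWS/KinesisAnalytics" \
  --metric-name TotalWords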

Custom automatic scaling

This section describes how to implement an automatic scaling solution for Kinesis Data Analytics for Apache Flink based on CloudWatch metrics. You can configure Kinesis Data Analytics for Apache Flink to perform CPU-based automatic scaling. However, you can automatically scale your application based on something other than CPU utilization. To perform custom automatic scaling, use Application Auto Scaling with the appropriate metric.

For applications that read from a Kinesis stream source, you can use the metric millisBehindLatest. This captures how far behind your application is from the head of the stream.

A target tracking policy is one of two scaling policy types offered by Application Auto Scaling. You can specify a threshold value around which to vary the degree of parallelism of your Kinesis Data Analytics application. The following sample code on GitHub configures Application Auto Scaling when millisBehindLatest for the consuming application exceeds 1 minute. This increases the parallelism, which increases the number of KPUs.
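As a rough sketch of the moving parts (not the exact commands from the sample repo), registering the application as a custom-resource scalable target and attaching a target tracking policy on millisBehindLatest could look like the following. The resource ID file, policy name, namespace, dimensions, and capacity values are assumptions.

# Register the custom resource (the API Gateway endpoint that fronts the KDA
# application) as a scalable target; custom-resource-id.txt holds its identifier.
aws application-autoscaling register-scalable-target \
  --service-namespace custom-resource \
  --scalable-dimension custom-resource:ResourceType:Property \
  --resource-id file://custom-resource-id.txt \
  --min-capacity 1 \
  --max-capacity 8

# Target tracking configuration: keep millisBehindLatest around 1 minute (60000 ms).
# Namespace and dimensions are assumptions -- check the metric in CloudWatch first.
cat > scaling-config.json << EOF
{
  "TargetValue": 60000,
  "CustomizedMetricSpecification": {
    "MetricName": "millisBehindLatest",
    "Namespace": "AWS/KinesisAnalytics",
    "Dimensions": [{ "Name": "Application", "Value": "my-kda-application" }],
    "Statistic": "Maximum"
  }
}
EOF

aws application-autoscaling put-scaling-policy \
  --policy-name kda-millis-behind-latest \
  --service-namespace custom-resource \
  --scalable-dimension custom-resource:ResourceType:Property \
  --resource-id file://custom-resource-id.txt \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://scaling-config.json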

The following diagram shows how Application Auto Scaling, used with Amazon API Gateway and AWS Lambda, scales a Kinesis Data Analytics application in response to a CloudWatch alarm.

The sample code includes examples for automatic scaling based on the target tracking policy and step scaling policy.

Automatic scaling solution components

The following is a list of key components used in the automatic scaling solution. You can find these components in the AWS CloudFormation template in the GitHub repo accompanying this post.

  • Application Auto Scaling scalable target – A scalable target is a resource that Application Auto Scaling can scale in and out. It’s uniquely identified by the combination of resource ID, scalable dimension, and namespace. For more information, see RegisterScalableTarget.
  • Scaling policy – The scaling policy defines how your scalable target should scale. As described in the PutScalingPolicy documentation, Application Auto Scaling supports two policy types: TargetTrackingScaling and StepScaling. In addition, you can configure a scheduled scaling action using Application Auto Scaling. If you specify TargetTrackingScaling, Application Auto Scaling also creates corresponding CloudWatch alarms for you.
  • API Gateway – Because the scalable target is a custom resource, we have to specify an API endpoint. Application Auto Scaling invokes this to perform scaling and get information about the current state of our scalable resource. We use an API Gateway and Lambda function to implement this endpoint.
  • Lambda – API Gateway invokes the Lambda function. This is called by Application Auto Scaling to perform the scaling actions. It also fetches information such as current scale value and returns information requested by Application Auto Scaling.

Additionally, you should be aware of the following:

  • When scaling out or in, this sample only updates the overall parallelism. It doesn’t adjust parallelism per KPU.
  • When scaling occurs, the Kinesis Data Analytics application experiences downtime.
  • The throughput of a Flink application depends on many factors, such as complexity of processing and destination throughput. The step-scaling example assumes a relationship between incoming record throughput and scaling. The millisBehindLatest metric used for target tracking automatic scaling works the same way.
  • We recommend using the default scaling policy provided by Kinesis Data Analytics for CPU-based scaling, the target tracking auto scaling policy for the millisBehindLatest metric, and a step scaling auto scaling policy for a metric such as numRecordsInPerSecond. However, you can use any automatic scaling policy for the metric you choose.

CloudWatch operational dashboard

Customers often ask us about best practices and the operational aspects of Kinesis Data Analytics for Apache Flink. We created a CloudWatch dashboard that captures the key metrics to monitor. We categorize the most common metrics in this dashboard with the recommended statistics for each metric.

This GitHub repo contains a CloudFormation template to deploy the dashboard for any Kinesis Data Analytics for Apache Flink application. You can also deploy a demo application with the dashboard. The dashboard includes the following:

  • Application health metrics:
    • Use uptime to see how long the job has been running without interruption and downtime to determine if a job failed to run. Non-zero downtime can indicate issues with your application.
    • Higher-than-normal job restarts can indicate an unhealthy application.
    • Checkpoint information size, duration, and number of failed checkpoints can help you understand application health and progress. Increasing checkpoint duration values can signify application health problems like backpressure and the inability to keep up with input data. Increasing checkpoint size over time can point to an infinitely growing state that can lead to out-of-memory errors.
  • Resource utilization metrics:
    • You can check the CPU and heap memory utilization along with the thread count. You can also check the garbage collection time taken across all Flink task managers.
  • Flink application progress metrics:
    • numRecordsInPerSecond and numRecordsOutPerSecond show the number of records accepted and emitted per second.
    • numLateRecordsDropped shows the number of records this operator or task has dropped due to arriving late.
    • Input and output watermarks are valid only when using event time semantics. You can use the difference between these two values to calculate event time latency.
  • Source metrics:
    • The Kinesis Data Streams-specific metric millisBehindLatest shows how far behind the head of the stream the consumer is, indicating how far behind current time the consumer is reading. We used this metric to demonstrate Application Auto Scaling earlier in this post.
    • The Kafka-specific metric recordsLagMax shows the maximum lag in terms of number of records for any partition in this window.

The dashboard contains useful metrics to gauge the operational health of a Flink application. You can modify the threshold, configure additional alarms, and add other system or custom metrics to customize the dashboard for your use. The following screenshot shows a section of the dashboard.

Summary

In this post, we covered how to use the enhanced monitoring features for Kinesis Data Analytics for Apache Flink applications. We created custom metrics for an Apache Flink application within application code and emitted it to CloudWatch. We also used Application Auto Scaling to scale an application. Finally, we shared a CloudWatch dashboard to monitor the operational health of Kinesis Data Analytics for Apache Flink applications. For more information about using Kinesis Data Analytics, see Getting Started with Amazon Kinesis Data Analytics.


About the Authors

Karthi Thyagarajan is a Principal Solutions Architect on the Amazon Kinesis team.

Deepthi Mohan is a Sr. TPM on the Amazon Kinesis Data Analytics team.

Troubleshooting Amazon API Gateway with enhanced observability variables

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/troubleshooting-amazon-api-gateway-with-enhanced-observability-variables/

Amazon API Gateway is often used for managing access to serverless applications. Additionally, it can help developers reduce code and increase security with features like AWS WAF integration and authorizers at the API level.

Because more is handled by API Gateway, developers tell us they would like to see more data points on the individual parts of the request. This data helps developers understand each phase of the API request and how it affects the request as a whole. In response to this request, the API Gateway team has added new enhanced observability variables to the API Gateway access logs. With these new variables, developers can troubleshoot on a more granular level to quickly isolate and resolve request errors and latency issues.

The phases of an API request

API Gateway divides requests into phases, which the new variables reflect. Depending upon the features configured for the application, an API request passes through some or all of the following phases, in this order:

Phases of an API request

  • WAF: the WAF phase only appears when an AWS WAF web access control list (ACL) is configured for enhanced security. During this phase, WAF rules are evaluated and a decision is made on whether to continue or cancel the request.
  • Authenticate: the authenticate phase is only present when AWS Identity and Access Management (IAM) authorizers are used. During this phase, the credentials of the signed request are verified. Access is granted or denied based on the client’s right to assume the access role.
  • Authorizer: the authorizer phase is only present when a Lambda, JWT, or Amazon Cognito authorizer is used. During this phase, the authorizer logic is processed to verify the user’s right to access the resource.
  • Authorize: the authorize phase is only present when a Lambda or IAM authorizer is used. During this phase, the results from the authenticate and authorizer phase are evaluated and applied.
  • Integration: during this phase, the backend integration processes the request.

Each phase can add latency to the request, return a status, or raise an error. To capture this data, API Gateway now provides enhanced observability variables for each phase. The variables are named according to the phase they occur in and follow the naming structure $context.phase.property. For example, you can get data about WAF latency by using $context.waf.latency.

Some existing variables have also been given aliases to match this naming schema. For example, $context.integrationErrorMessage has a new alias of $context.integration.error. The resulting list of variables is as follows:

Phases and variables for API Gateway requests

API Gateway provides status, latency, and error data for each phase. In the authorizer and integration phases, there are additional variables you can use in logs: $context.phase.requestId provides the request ID from the downstream service, and $context.phase.integrationStatus provides the status code returned by that service.

For example, when using an AWS Lambda function as the integration, API Gateway receives two status codes. The first, $context.integration.integrationStatus, is the status of the Lambda service itself. This is usually 200, unless there is a service or permissions error. The second, $context.integration.status, is the status of the Lambda function and reports on the success or failure of the code.

A full list of access log variables is in the documentation for REST APIs, WebSocket APIs, and HTTP APIs.

A troubleshooting example

In this example, an application is built using an API Gateway REST API with a Lambda function for the backend integration. The application uses an IAM authorizer to require AWS account credentials for application access. The application also uses an AWS WAF web ACL to rate limit requests to 100 requests per IP address every five minutes. The demo application and deployment instructions can be found in the Sessions With SAM repository.

Because the application uses an AWS WAF web ACL and an IAM authorizer for security, the request passes through four phases: waf, authenticate, authorize, and integration. The access log format is configured to capture all the data regarding these phases:

{
  "requestId":"$context.requestId",
  "waf-error":"$context.waf.error",
  "waf-status":"$context.waf.status",
  "waf-latency":"$context.waf.latency",
  "waf-response":"$context.wafResponseCode",
  "authenticate-error":"$context.authenticate.error",
  "authenticate-status":"$context.authenticate.status",
  "authenticate-latency":"$context.authenticate.latency",
  "authorize-error":"$context.authorize.error",
  "authorize-status":"$context.authorize.status",
  "authorize-latency":"$context.authorize.latency",
  "integration-error":"$context.integration.error",
  "integration-status":"$context.integration.status",
  "integration-latency":"$context.integration.latency",
  "integration-requestId":"$context.integration.requestId",
  "integration-integrationStatus":"$context.integration.integrationStatus",
  "response-latency":"$context.responseLatency",
  "status":"$context.status"
}

Once the application is deployed, use Postman to test the API with a Signature Version 4 (SigV4) signed request. A scripted alternative is shown below.

Configuring Postman authorization
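
If you prefer to script the test instead of using Postman, the following Python sketch signs and sends a SigV4 request using botocore and the requests package. The invoke URL is a placeholder; credentials are resolved from your local AWS configuration.

# A minimal sketch of a SigV4-signed request to the deployed API.
import boto3
import requests  # third-party package: pip install requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

url = "https://abc123.execute-api.eu-west-1.amazonaws.com/Prod/resource"  # hypothetical invoke URL
region = "eu-west-1"

# Resolve credentials from the environment, shared config, or an assumed role.
credentials = boto3.Session().get_credentials()

# Build and sign the request for the execute-api service.
aws_request = AWSRequest(method="GET", url=url)
SigV4Auth(credentials, "execute-api", region).add_auth(aws_request)

# Send the signed request and print the status code and body.
response = requests.get(url, headers=dict(aws_request.headers))
print(response.status_code, response.text)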

To show troubleshooting with the new enhanced observability variables, the first request sent through contains invalid credentials. The user receives a 403 Forbidden error.

Client response view with invalid tokens

The access log for this request is:

{
    "requestId": "70aa9606-26be-4396-991c-405a3671fd9a",
    "waf-error": "-",
    "waf-status": "200",
    "waf-latency": "8",
    "waf-response": "WAF_ALLOW",
    "authenticate-error": "-",
    "authenticate-status": "403",
    "authenticate-latency": "17",
    "authorize-error": "-",
    "authorize-status": "-",
    "authorize-latency": "-",
    "integration-error": "-",
    "integration-status": "-",
    "integration-latency": "-",
    "integration-requestId": "-",
    "integration-integrationStatus": "-",
    "response-latency": "48",
    "status": "403"
}

The request passed through the waf phase first. Since this is the first request and the rate limit has not been exceeded, the request is passed on to the next phase, authenticate. During the authenticate phase, the user’s credentials are verified. In this case, the credentials are invalid and the request is rejected with a 403 response before invoking the downstream phases.

To correct this, the next request uses valid credentials, but those credentials do not have access to invoke the API. Again, the user receives a 403 Forbidden error.

Client response view with unauthorized tokens

The access log for this request is:

{
  "requestId": "c16d9edc-037d-4f42-adf3-eaadf358db2d",
  "waf-error": "-",
  "waf-status": "200",
  "waf-latency": "7",
  "waf-response": "WAF_ALLOW",
  "authenticate-error": "-",
  "authenticate-status": "200",
  "authenticate-latency": "8",
  "authorize-error": "The client is not authorized to perform this operation.",
  "authorize-status": "403",
  "authorize-latency": "0",
  "integration-error": "-",
  "integration-status": "-",
  "integration-latency": "-",
  "integration-requestId": "-",
  "integration-integrationStatus": "-",
  "response-latency": "52",
  "status": "403"
}

This time, the access logs show that the authenticate phase returns a 200. This indicates that the user credentials are valid for this account. However, the authorize phase returns a 403 and states, “The client is not authorized to perform this operation”. Again, the request is rejected with a 403 response before invoking downstream phases.

The last request for the API contains valid credentials for a user that has rights to invoke this API. This time the user receives a 200 OK response and the requested data.

Client response view with valid request

The log for this request is:

{
  "requestId": "ac726ce5-91dd-4f1d-8f34-fcc4ae0bd622",
  "waf-error": "-",
  "waf-status": "200",
  "waf-latency": "7",
  "waf-response": "WAF_ALLOW",
  "authenticate-error": "-",
  "authenticate-status": "200",
  "authenticate-latency": "1",
  "authorize-error": "-",
  "authorize-status": "200",
  "authorize-latency": "0",
  "integration-error": "-",
  "integration-status": "200",
  "integration-latency": "16",
  "integration-requestId": "8dc58335-fa13-4d48-8f99-2b1c97f41a3e",
  "integration-integrationStatus": "200",
  "response-latency": "48",
  "status": "200"
}

This log contains a 200 status code from each of the phases and returns a 200 response to the user. Additionally, each of the phases reports latency. This request had a total of 48 ms of latency. The latency breaks down according to the following:

Request latency breakdown

Developers can use this information to identify the cause of latency within the API request and adjust accordingly. While some phases, such as authenticate and authorize, cannot be tuned by the developer, optimizing the integration phase of this request could remove a large chunk of the latency involved.
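
To make this kind of breakdown easier to scan across many requests, a short script can rank the phases by latency for each access log entry. The following Python sketch assumes the JSON log format configured earlier in this post, exported from CloudWatch Logs to a hypothetical local file named access-logs.jsonl with one entry per line.

# A minimal sketch: rank phases by latency for each access log entry.
import json

PHASE_FIELDS = [
    "waf-latency",
    "authenticate-latency",
    "authorize-latency",
    "integration-latency",
]

def to_ms(value):
    # The access log records "-" for phases that did not run; treat those as absent.
    return int(value) if value not in (None, "", "-") else None

with open("access-logs.jsonl") as log_file:
    for line in log_file:
        entry = json.loads(line)
        phases = {field: to_ms(entry.get(field)) for field in PHASE_FIELDS}
        ranked = sorted(
            ((name, ms) for name, ms in phases.items() if ms is not None),
            key=lambda item: item[1],
            reverse=True,
        )
        total = to_ms(entry.get("response-latency"))
        summary = ", ".join(f"{name}={ms}ms" for name, ms in ranked)
        print(f"{entry.get('requestId')} total={total}ms -> {summary}")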

Conclusion

This post covers the enhanced observability variables, the phases they occur in, and the order of those phases. With these new variables, developers can quickly isolate problems and focus on resolving issues.

When configured with the proper access logging variables, API Gateway access logs can provide a detailed story of API performance. They can help developers to continually optimize that performance. To learn how to configure logging in API Gateway with AWS SAM, see the demonstration app for this blog.

#ServerlessForEveryone