Tag Archives: Best practices

Enabling load-balancing of non-HTTP(s) traffic on AWS Wavelength

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/enabling-load-balancing-of-non-https-traffic-on-aws-wavelength/

This blog post is written by Jack Chen, Telco Solutions Architect, and Robert Belson, Developer Advocate.

AWS Wavelength embeds AWS compute and storage services within 5G networks, providing mobile edge computing infrastructure for developing, deploying, and scaling ultra-low-latency applications. AWS recently introduced support for Application Load Balancer (ALB) in AWS Wavelength zones. Although ALB addresses Layer-7 load balancing use cases, some low latency applications that get deployed in AWS Wavelength Zones rely on UDP-based protocols, such as QUIC, WebRTC, and SRT, which can’t be load-balanced by Layer-7 Load Balancers. In this post, we’ll review popular load-balancing patterns on AWS Wavelength, including a proposed architecture demonstrating how DNS-based load balancing can address customer requirements for load-balancing non-HTTP(s) traffic across multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. This solution also builds a foundation for automatic scale-up and scale-down capabilities for workloads running in an AWS Wavelength Zone.

Load balancing use cases in AWS Wavelength

In the AWS Regions, customers looking to deploy highly-available edge applications often consider Amazon Elastic Load Balancing (Amazon ELB) as an approach to automatically distribute incoming application traffic across multiple targets in one or more Availability Zones (AZs). However, at the time of this publication, AWS-managed Network Load Balancer (NLB) isn’t supported in AWS Wavelength Zones and ALB is being rolled out to all AWS Wavelength Zones globally. As a result, this post will seek to document general architectural guidance for load balancing solutions on AWS Wavelength.

As one of the most prominent AWS Wavelength use cases, highly-immersive video streaming over UDP at scale, using protocols such as WebRTC, often requires a load balancing solution to accommodate surges in traffic, either due to live events or general customer access patterns. These use cases, relying on Layer-4 traffic, can’t be load-balanced from a Layer-7 ALB. Instead, Layer-4 load balancing is needed.

To date, two infrastructure deployments involving Layer-4 load balancers are most often seen:

  • Amazon EC2-based deployments: Often the environment of choice for earlier-stage enterprises and ISVs, a fleet of EC2 instances will leverage a load balancer for high-throughput use cases, such as video streaming, data analytics, or Industrial IoT (IIoT) applications
  • Amazon EKS deployments: Customers looking to optimize performance and cost efficiency of their infrastructure can leverage containerized deployments at the edge to manage their AWS Wavelength Zone applications. In turn, external load balancers could be configured to point to exposed services via NodePort objects. Furthermore, a more popular choice might be to leverage the AWS Load Balancer Controller to provision an ALB when you create a Kubernetes Ingress.

Regardless of deployment type, the following design constraints must be considered:

  • Target registration: For load balancing solutions not managed by AWS, seamless solutions to load balancer target registration must be managed by the customer. As one potential solution, visit a recent HAProxyConf presentation, Practical Advice for Load Balancing at the Network Edge.
  • Edge Discovery: Although DNS records can be populated into Amazon Route 53 for each carrier-facing endpoint, DNS won’t deterministically route mobile clients to the most optimal mobile endpoint. When available, edge discovery services are required to most effectively route mobile clients to the lowest latency endpoint.
  • Cross-zone load balancing: Given the hub-and-spoke design of AWS Wavelength, customer-managed load balancers should proxy traffic only to that AWS Wavelength Zone.

Solution overview – Amazon EC2

In this post, we’ll present a highly-available load balancing solution in a single AWS Wavelength Zone for an Amazon EC2-based deployment. In a separate post, we’ll cover the needed configurations for the AWS Load Balancer Controller in AWS Wavelength for Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

The proposed solution introduces DNS-based load balancing, a technique to abstract away the complexity of intelligent load-balancing software and allow your Domain Name System (DNS) resolvers to distribute traffic (equally, or in a weighted distribution) to your set of endpoints.

Our solution leverages the weighted routing policy in Route 53 to resolve inbound DNS queries to multiple EC2 instances running within an AWS Wavelength zone. As EC2 instances for a given workload get deployed in an AWS Wavelength zone, Carrier IP addresses can be assigned to the network interfaces at launch.

Through this solution, Carrier IP addresses attached to AWS Wavelength instances are automatically added as DNS records for the customer-provided public hosted zone.

To determine how Route 53 responds to queries, given an arbitrary number of records of a public hosted zone, Route 53 offers numerous routing policies:

Simple routing policy – In the event that you must route traffic to a single resource in an AWS Wavelength Zone, simple routing can be used. A single record can contain multiple IP addresses, but Route 53 returns the values in a random order to the client.

Weighted routing policy – To route traffic more deterministically using a set of proportions that you specify, this policy can be selected. For example, if you would like Carrier IP A to receive 50% of the traffic and Carrier IP B to receive 50% of the traffic, we’ll create two individual A records (one for each Carrier IP) with a weight of 50 and 50, respectively. Learn more about Route 53 routing policies by visiting the Route 53 Developer Guide.
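
As a rough illustration of the weighted policy described above, the following sketch builds a Route 53 change batch with one weighted A record per Carrier IP. The function name, the documentation-range IP addresses, and the choice to reuse the Carrier IP as the `SetIdentifier` are assumptions made for this example; the resulting dictionary has the shape expected by the `ChangeBatch` argument of the Route 53 `ChangeResourceRecordSets` API.

```python
from typing import Dict


def weighted_a_record_changes(
    record_name: str,
    weights_by_carrier_ip: Dict[str, int],
    ttl_seconds: int = 30,
    action: str = "UPSERT",
) -> dict:
    """Build a Route 53 ChangeBatch with one weighted A record per Carrier IP."""
    changes = []
    for carrier_ip, weight in sorted(weights_by_carrier_ip.items()):
        # Route 53 accepts weights from 0 to 255.
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {carrier_ip} must be between 0 and 255")
        changes.append(
            {
                "Action": action,
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    # SetIdentifier distinguishes records sharing a name and type
                    # (using the IP itself is an assumption of this sketch).
                    "SetIdentifier": carrier_ip,
                    "Weight": weight,
                    "TTL": ttl_seconds,
                    "ResourceRecords": [{"Value": carrier_ip}],
                },
            }
        )
    return {"Changes": changes}


# Two placeholder Carrier IPs splitting traffic 50/50.
change_batch = weighted_a_record_changes(
    "www.example.com", {"198.51.100.10": 50, "198.51.100.20": 50}
)
```

With boto3, a batch like this could be passed as `ChangeBatch=change_batch` to `change_resource_record_sets`, together with the ID of your public hosted zone.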

The proposed solution leverages weighted routing policy in Route 53 DNS to route traffic to multiple EC2 instances running within an AWS Wavelength zone.

Reference architecture

The following diagram illustrates the load-balancing component of the solution, where EC2 instances in an AWS Wavelength zone are assigned Carrier IP addresses. A weighted DNS record for a host (e.g., www.example.com) is updated with Carrier IP addresses.

DNS-based load balancing

When a device makes a DNS query, it is returned one of the Carrier IP addresses associated with the given domain name. With a large number of devices, we expect a fair distribution of load across all EC2 instances in the resource pool. Given the highly ephemeral nature of mobile edge environments, it’s likely that Carrier IPs could frequently be allocated to accommodate a workload and released shortly thereafter. However, this unpredictable behavior could yield stale DNS records, resulting in a “blackhole” – routes to endpoints that no longer exist.
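
To make the "fair distribution" expectation concrete, here is a small simulation, offered as a sketch rather than a model of Route 53 internals. It assumes each query is answered independently with probability proportional to the record weight, and the function name and placeholder IP addresses are hypothetical.

```python
import random
from collections import Counter
from typing import Dict


def simulate_dns_answers(weights: Dict[str, int], num_queries: int, seed: int = 0) -> Counter:
    """Pick one Carrier IP per query, with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same counts
    carrier_ips = list(weights)
    picks = rng.choices(
        carrier_ips, weights=[weights[ip] for ip in carrier_ips], k=num_queries
    )
    return Counter(picks)


num_queries = 10_000
counts = simulate_dns_answers({"198.51.100.10": 50, "198.51.100.20": 50}, num_queries)
shares = {ip: n / num_queries for ip, n in counts.items()}
print(shares)  # each share should land close to 0.5
```

With only a handful of clients the split can be quite uneven; the shares approach the configured proportions only as the number of queries grows.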

Time-To-Live (TTL) is a DNS attribute that specifies the amount of time, in seconds, that you want DNS recursive resolvers to cache information about this record.

In our example, we set the TTL to 30 seconds to force DNS resolvers to retrieve the latest records from the authoritative nameservers and minimize stale DNS responses. However, a lower TTL has a direct impact on cost, because recursive resolvers make more calls to Route 53 to retrieve the latest records.
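
The cost side of this trade-off can be estimated with simple arithmetic. The sketch below is an upper bound under a simplifying assumption (every recursive resolver re-queries the record as soon as its cached copy expires); the resolver count and the function name are hypothetical.

```python
import math


def max_queries_per_day(num_resolvers: int, ttl_seconds: int) -> int:
    """Upper bound on daily queries if each resolver refreshes once per TTL."""
    refreshes_per_day = math.ceil(86_400 / ttl_seconds)  # 86,400 seconds in a day
    return num_resolvers * refreshes_per_day


at_30s = max_queries_per_day(1_000, 30)    # 2,880 refreshes per resolver per day
at_300s = max_queries_per_day(1_000, 300)  # 288 refreshes per resolver per day
print(at_30s, at_300s)  # prints "2880000 288000"
```

Under these assumptions, a 30-second TTL generates ten times the query volume of a 300-second TTL, which is the price paid for fresher records.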

The core components of the solution are as follows:

Alongside the services above in the AWS Wavelength Zone, the following services are also leveraged in the AWS Region:

  • AWS Lambda – a serverless event-driven function that makes API calls to the Route 53 service to update DNS records.
  • Amazon EventBridge – a serverless event bus that reacts to EC2 instance lifecycle events and invokes the Lambda function to make DNS updates.
  • Route 53 – cloud DNS service with a domain record pointing to AWS Wavelength-hosted resources.

In this post, we intentionally leave the specific load balancing software solution up to the customer. Customers can leverage various popular load balancers available on the AWS Marketplace, such as HAProxy and NGINX. To focus our solution on the auto-registration of DNS records to create functional load balancing, this solution is designed to support stateless workloads only. To support stateful workloads, sticky sessions – a process in which the load balancer routes requests to the same target in a target group – must be configured by the underlying load balancer solution and are outside of the scope of what DNS can provide natively.

Automation overview

Using the aforementioned components, we can implement the following workflow automation:

Event-driven Auto Scaling Workflow

An Amazon CloudWatch alarm can trigger an Auto Scaling group scale-out or scale-in event, which adds or removes EC2 instances. EventBridge detects the EC2 instance state change event and invokes the Lambda function. This function updates the DNS record in Route 53 by either adding (scale out) or deleting (scale in) a weighted A record associated with the EC2 instance changing state.
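
The decision step inside such a Lambda function might look like the sketch below. This is not the function shipped with the solution: the state-to-action mapping and the names `STATE_TO_ACTION` and `plan_dns_action` are assumptions for illustration. The event fields (`detail-type`, `detail.instance-id`, `detail.state`) follow the EC2 Instance State-change Notification format delivered by EventBridge.

```python
from typing import Optional

# Assumed mapping: register the instance once it is running and deregister it as
# soon as it starts to stop or terminate. Later states ("stopped", "terminated")
# are ignored so that the same record is not deleted twice.
STATE_TO_ACTION = {
    "running": "UPSERT",
    "stopping": "DELETE",
    "shutting-down": "DELETE",
}


def plan_dns_action(event: dict) -> Optional[dict]:
    """Translate an EC2 state-change event into a planned DNS record change."""
    if event.get("detail-type") != "EC2 Instance State-change Notification":
        return None
    detail = event.get("detail", {})
    action = STATE_TO_ACTION.get(detail.get("state"))
    if action is None:
        return None  # for example "pending": nothing to do yet
    return {"instance_id": detail["instance-id"], "action": action}


sample_event = {
    "detail-type": "EC2 Instance State-change Notification",
    "source": "aws.ec2",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"},
}
print(plan_dns_action(sample_event))
```

A real handler would still need to look up the instance's Carrier IP and call Route 53. Because the Carrier IP may already be released by the time a scale-in event arrives, the handler would likely have to find the record to delete by some other key, such as the instance ID.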

Configuration of the automatic auto scaling policy is out of the scope of this post. There are many auto scaling triggers that you can consider using, based on predefined and custom metrics such as memory utilization. For demo purposes, we will be leveraging manual auto scaling.

In addition to the core components that were already described, our solution also utilizes AWS Identity and Access Management (IAM) policies and CloudWatch. Both services are key components to building AWS Well-Architected solutions on AWS. We also use AWS Systems Manager Parameter Store to keep track of user input parameters. The deployment of the solution is automated via AWS CloudFormation templates. The Lambda function provided should be uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.

Amazon Virtual Private Cloud (Amazon VPC), subnets, Carrier Gateway, and Route Tables are foundational building blocks for AWS-based networking infrastructure. In our deployment, we are creating a new VPC, one subnet in an AWS Wavelength zone of your choice, a Carrier Gateway, and updating the route table for this subnet to point the default route to the Carrier Gateway.

Wavelength VPC architecture.

Deployment prerequisites

The following are prerequisites to deploy the described solution in your account:

  • Access to an AWS Wavelength zone. If your account is not allow-listed to use AWS Wavelength zones, then opt in to AWS Wavelength zones here.
  • Public DNS Hosted Zone hosted in Route 53. You must have access to a registered public domain to deploy this solution. The zone for this domain should be hosted in the same account where you plan to deploy AWS Wavelength workloads.
    If you don’t have a public domain, then you can register a new one. Note that there will be a service charge for the domain registration.
  • Amazon S3 bucket. For the Lambda function that updates DNS records in Route 53, store the source code as a .zip file in an Amazon S3 bucket.
  • Amazon EC2 key pair. You can use an existing key pair for the deployment. If you don’t have a key pair in the Region where you plan to deploy this solution, then create one by following these instructions.
  • 4G or 5G-connected device. Although the infrastructure can be deployed independent of the underlying connected devices, testing the connectivity will require a mobile device on one of the Wavelength partner’s networks. View the complete list of Telecommunications providers and Wavelength Zone locations to learn more.

Conclusion

In this post, we demonstrated how to implement DNS-based load balancing for workloads running in an AWS Wavelength zone. We deployed a solution that uses an EventBridge rule and a Lambda function to update DNS records hosted by Route 53. If you want to learn more about AWS Wavelength, subscribe to AWS Compute Blog channel here.

Run fault tolerant and cost-optimized Spark clusters using Amazon EMR on EKS and Amazon EC2 Spot Instances

Post Syndicated from Kinnar Kumar Sen original https://aws.amazon.com/blogs/big-data/run-fault-tolerant-and-cost-optimized-spark-clusters-using-amazon-emr-on-eks-and-amazon-ec2-spot-instances/

Amazon EMR on EKS is a deployment option in Amazon EMR that allows you to run Spark jobs on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances save you up to 90% over On-Demand Instances, and are a great way to cost optimize the Spark workloads running on Amazon EMR on EKS. Because Spot is an interruptible service, if we can move or reuse the intermediate shuffle files, it improves the overall stability and SLA of the job. The latest versions of Amazon EMR on EKS have integrated Spark features to enable this capability.

In this post, we discuss these features—Node Decommissioning and Persistent Volume Claim (PVC) reuse—and their impact on increasing the fault tolerance of Spark jobs on Amazon EMR on EKS when cost optimizing using EC2 Spot Instances.

Amazon EMR on EKS and Spot

EC2 Spot Instances are spare EC2 capacity provided at a steep discount of up to 90% over On-Demand prices. Spot Instances are a great choice for stateless and flexible workloads. The caveat with this discount and spare capacity is that Amazon EC2 can interrupt an instance with a proactive or reactive (2-minute) warning when it needs the capacity back. You can provision compute capacity in an EKS cluster with Spot Instances through a managed or self-managed node group and provide cost optimization for your workloads.

Amazon EMR on EKS uses Amazon EKS to run jobs with the EMR runtime for Apache Spark, which can be cost optimized by running the Spark executors on Spot. It provides up to 61% lower costs and up to 68% performance improvement for Spark workloads on Amazon EKS. The Spark application launches a driver and executors to run the computation. Spark is a semi-fault tolerant framework that is resilient to executor loss due to an interruption and therefore can run on EC2 Spot. On the other hand, when the driver is interrupted, the job fails. Hence, we recommend running drivers on On-Demand Instances. Some of the best practices for running Spark on Amazon EKS are applicable with Amazon EMR on EKS.

EC2 Spot Instances also help in cost optimization by improving the overall throughput of the job. This can be achieved by auto-scaling the cluster using Cluster Autoscaler (for managed node groups) or Karpenter.

Though Spark executors are resilient to Spot interruptions, the shuffle files and RDD data are lost when the executor gets killed. The lost shuffle files need to be recomputed, which increases the overall runtime of the job. Apache Spark has released two features (in versions 3.1 and 3.2) that address this issue. Amazon EMR on EKS released features such as node decommissioning (version 6.3) and PVC reuse (version 6.8) to simplify recovery and reuse shuffle files, which increases the overall resiliency of your application.

Node decommissioning

The node decommissioning feature works by preventing scheduling of new jobs on the nodes that are to be decommissioned. It also moves any shuffle files or cache present in those nodes to other executors (peers). If there are no other available executors, the shuffle files and cache are moved to a remote fallback storage.

Fig 1: Node Decommissioning

Let’s look at the decommission steps in more detail.

If one of the nodes that is running executors is interrupted, the executor starts the process of decommissioning and sends the message to the driver:

21/05/05 17:41:41 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 7 decommissioned message
21/05/05 17:41:41 DEBUG TaskSetManager: Valid locality levels for TaskSet 2.0: NO_PREF, ANY
21/05/05 17:41:41 INFO KubernetesClusterSchedulerBackend: Decommission executors: 7
21/05/05 17:41:41 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_2.0, runningTasks: 10
21/05/05 17:41:41 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(7, 192.168.82.107, 39007, None)) as being decommissioning.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
21/05/05 20:22:17 INFO BlockManager: Starting block manager decommissioning process...
21/05/05 20:22:17 DEBUG FileSystem: Looking for FS supporting s3a

The executor looks for RDD or shuffle files and tries to replicate or migrate those files. It first tries to find a peer executor. If successful, it will move the files to the peer executor:

22/06/07 20:41:38 INFO ShuffleStatus: Updating map output for 46 to BlockManagerId(4, 192.168.13.235, 34737, None)
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle data block update for 0 46, ignore.
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 46, updating.

However, if it is not able to find a peer executor, it will try to move the files to a fallback storage if available.

Fig 2: Fallback Storage

The executor is then decommissioned. When a new executor comes up, the shuffle files are reused:

22/06/07 20:42:50 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
22/06/07 20:42:50 DEBUG ExecutorPodsAllocator: Requested executor with id 19 from Kubernetes.
22/06/07 20:42:50 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-bfd0a5813fd1b80f-exec-19, action ADDED
22/06/07 20:42:50 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 52, updating.
22/06/07 20:42:50 INFO ShuffleStatus: Recover 52 BlockManagerId(fallback, remote, 7337, None)
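
The peer-first, fallback-second ordering described in these steps can be summarized with a toy model. This is not Spark's implementation; the function name and the string labels it returns are purely illustrative.

```python
from typing import List, Optional


def choose_migration_target(peers: List[str], fallback_path: Optional[str]) -> Optional[str]:
    """Toy model: prefer a live peer executor, otherwise use fallback storage."""
    if peers:
        return f"peer:{peers[0]}"
    if fallback_path:
        return f"fallback:{fallback_path}"
    return None  # nowhere to migrate: the blocks would have to be recomputed


print(choose_migration_target(["executor-4"], "s3://<<bucket>>"))  # peer:executor-4
print(choose_migration_target([], "s3://<<bucket>>"))              # fallback:s3://<<bucket>>
print(choose_migration_target([], None))                           # None
```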

The key advantage of this process is that it migrates blocks and shuffle data, thereby reducing recomputation, which adds to the overall resiliency of the system and reduces runtime. This process can be triggered by a Spot interruption signal (SIGTERM) and node draining. Node draining may happen due to high-priority task scheduling or independently.

When you use Amazon EMR on EKS with managed node groups/Karpenter, the Spot interruption handling is automated, wherein Amazon EKS gracefully drains and rebalances the Spot nodes to minimize application disruption when a Spot node is at elevated risk of interruption. If you’re using managed node groups/Karpenter, the decommission gets triggered when the nodes are getting drained and because it’s proactive, it gives you more time (at least 2 minutes) to move the files. In the case of self-managed node groups, we recommend installing the AWS Node Termination Handler to handle the interruption, and the decommission is triggered when the reactive (2-minute) notification is received. We recommend using Karpenter with Spot Instances because it offers faster node scheduling with early pod binding and bin packing to optimize resource utilization.

The following code enables this configuration; more details are available on GitHub:

"spark.decommission.enabled": "true"
"spark.storage.decommission.rddBlocks.enabled": "true"
"spark.storage.decommission.shuffleBlocks.enabled" : "true"
"spark.storage.decommission.enabled": "true"
"spark.storage.decommission.fallbackStorage.path": "s3://<<bucket>>"

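As a sketch of where these settings might go, the snippet below wraps them in a `spark-defaults` classification, the shape accepted by the `configurationOverrides` argument of the EMR on EKS `StartJobRun` API. The function name is hypothetical, and the bucket placeholder is left exactly as in the settings above for you to fill in.

```python
import json


def decommission_overrides(fallback_path: str) -> dict:
    """Wrap the node decommissioning settings in a spark-defaults classification."""
    return {
        "applicationConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.decommission.enabled": "true",
                    "spark.storage.decommission.enabled": "true",
                    "spark.storage.decommission.rddBlocks.enabled": "true",
                    "spark.storage.decommission.shuffleBlocks.enabled": "true",
                    # Used only when no peer executor can take the blocks.
                    "spark.storage.decommission.fallbackStorage.path": fallback_path,
                },
            }
        ]
    }


overrides = decommission_overrides("s3://<<bucket>>")
print(json.dumps(overrides, indent=2))
```

The same dictionary could be serialized into a job-run request file or passed directly through an SDK call.
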
PVC reuse

Apache Spark enabled dynamic PVC in version 3.1, which is useful with dynamic allocation because we don’t have to pre-create the claims or volumes for the executors and delete them after completion. PVC enables true decoupling of data and processing when we’re running Spark jobs on Kubernetes, because we can use it as a local storage to spill in-process files too. The latest version of Amazon EMR 6.8 has integrated the PVC reuse feature of Spark, wherein if an executor is terminated due to EC2 Spot interruption or any other reason (JVM), then the PVC is not deleted but persisted and reattached to another executor. If there are shuffle files in that volume, then they are reused.

As with node decommission, this reduces the overall runtime because we don’t have to recompute the shuffle files. We also save the time required to request a new volume for an executor, and shuffle files can be reused without moving the files around.

The following diagram illustrates this workflow.

Fig 3: PVC Reuse

Let’s look at the steps in more detail.

If one or more of the nodes that are running executors are interrupted, the underlying pods get terminated and the driver gets the update. Note that the driver is the owner of the executors’ PVCs, so the PVCs are not deleted when the pods are terminated. See the following log output:

22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action MODIFIED

The ExecutorPodsAllocator tries to allocate new executor pods to replace the ones terminated due to interruption. During the allocation, it figures out how many of the existing PVCs have files and can be reused:

22/06/15 23:25:23 INFO ExecutorPodsAllocator: Found 2 reusable PVCs from 10 PVCs

The ExecutorPodsAllocator requests a pod, and when the pod launches, the PVC is reused. In the following example, the PVC from executor 6 is reused for new executor pod 11:

22/06/15 23:25:23 DEBUG ExecutorPodsAllocator: Requested executor with id 11 from Kubernetes.
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action ADDED
22/06/15 23:25:24 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/usr/lib/spark/conf) : log4j.properties,spark-env.sh,hive-site.xml,metrics.properties
22/06/15 23:25:24 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action MODIFIED
22/06/15 23:25:24 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim amazon-reviews-word-count-9ee82b8169a75183-exec-6-pvc-0

The shuffle files, if present in the PVC, are reused.

The key advantage of this technique is that it allows us to reuse pre-computed shuffle files in their original location, thereby reducing the time of the overall job run.

This works for both static and dynamic PVCs. Amazon EKS offers three different storage offerings, which can be encrypted too: Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. We recommend using dynamic PVCs with Amazon EBS because with static PVCs, you would need to create multiple PVCs.

The following code enables this configuration; more details are available on GitHub:

"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true"

For this to work, we need to enable PVC with Amazon EKS and mention the details in the Spark runtime configuration. For instructions, refer to How do I use persistent storage in Amazon EKS? The following code contains the Spark configuration details for using PVC as local storage; other details are available on GitHub:

"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "spark-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "10Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/var/data/spill"

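Because the volume settings repeat the same long prefix, they can be generated rather than typed. The sketch below is one way to do that and to combine the result with the two PVC reuse flags; the helper name `pvc_local_dir_conf` is hypothetical, and the values are the ones shown above. Spark treats a volume as local (spill) storage when its name starts with `spark-local-dir-`, which is why the helper only varies the numeric suffix.

```python
def pvc_local_dir_conf(index: int, storage_class: str, size_limit: str, mount_path: str) -> dict:
    """Generate executor settings for one dynamically provisioned local-dir PVC."""
    prefix = (
        "spark.kubernetes.executor.volumes.persistentVolumeClaim."
        f"spark-local-dir-{index}"
    )
    return {
        # "OnDemand" asks Spark to create the claim dynamically for each executor.
        f"{prefix}.options.claimName": "OnDemand",
        f"{prefix}.options.storageClass": storage_class,
        f"{prefix}.options.sizeLimit": size_limit,
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
    }


conf = {
    "spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
    "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
    **pvc_local_dir_conf(1, "spark-sc", "10Gi", "/var/data/spill"),
}

# Render the settings as --conf flags, for example for sparkSubmitParameters.
spark_submit_flags = " ".join(f"--conf {key}={value}" for key, value in conf.items())
print(spark_submit_flags)
```
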
Conclusion

With Amazon EMR on EKS (6.9) and the features discussed in this post, you can further reduce the overall runtime for Spark jobs when running with Spot Instances. This also improves the overall resiliency and flexibility of the job while cost optimizing the workload on EC2 Spot.

Try out the EMR on EKS workshop for improved performance when running Spark workloads on Kubernetes and cost optimize using EC2 Spot Instances.


About the Author

Kinnar Kumar Sen is a Sr. Solutions Architect at Amazon Web Services (AWS) focusing on Flexible Compute. As a part of the EC2 Flexible Compute team, he works with customers to guide them to the most elastic and efficient compute options that are suitable for their workload running on AWS. Kinnar has more than 15 years of industry experience working in research, consultancy, engineering, and architecture.

Monitor AWS workloads without a single line of code with Logz.io and Kinesis Firehose

Post Syndicated from Amos Etzion original https://aws.amazon.com/blogs/big-data/monitor-aws-workloads-without-a-single-line-of-code-with-logz-io-and-kinesis-firehose/

Observability data provides near real-time insights into the health and performance of AWS workloads, so that engineers can quickly address production issues and troubleshoot them before widespread customer impact.

As AWS workloads grow, observability data has been exploding, which requires flexible big data solutions to handle the throughput of large and unpredictable volumes of observability data.

Solution overview

One option is Amazon Kinesis Data Firehose, which is a popular service for streaming huge volumes of AWS data for storage and analytics. By pulling data from Amazon CloudWatch, Amazon Kinesis Data Firehose can deliver data to observability solutions.

Among these observability solutions is Logz.io, which can now ingest metric data from Amazon Kinesis Data Firehose and make it easier to get metrics from your AWS account to your Logz.io account for analysis, alerting, and correlation with logs and traces.

In a few clicks and a few configurations, we’ll see how you can start streaming your metric data (and soon, log data!) to Logz.io for storage and analysis.

Prerequisites

  • Logz.io account – Create a free trial here
  • Logz.io shipping token – Learn about metrics tokens here. You need to be a Logz.io administrator.
  • Access to Amazon CloudWatch and Amazon Kinesis Data Firehose with the appropriate permissions to manage HTTP endpoints.
  • Appropriate permissions to create an Amazon Simple Storage Service (Amazon S3) bucket

Sending Amazon CloudWatch metric data to Logz.io with an Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose is a service for ingesting, processing, and loading data from large, distributed sources such as logs or clickstreams into multiple consumers for storage and real-time analytics. Kinesis Data Firehose supports more than 50 sources and destinations as of today. This integration can be set up in minutes without a single line of code and enables near real-time analytics for observability data generated by AWS services by using Amazon CloudWatch, Amazon Kinesis Data Firehose, and Logz.io.

Once the integration is configured, Logz.io customers can open the Infrastructure Monitoring product to see their data coming in and populating their dashboards. To see some of the data analytics and correlation you get with Logz.io, check out this short demonstration.

Let’s begin a step-by-step tutorial for setting up the integration.

  • Start by going to Amazon Kinesis Data Firehose and creating a delivery stream with Data Firehose.

Kinesis Firehose Console

  • Next, select a source and destination. Select Direct Put as the source and Logz.io as the destination.
  • Next, configure the destination settings. Give the HTTP endpoint a name, which should include logz.io.
  • Select from the dropdown the appropriate endpoint you would like to use.

If you’re sending data to a European region, then set it to Logz.io Metrics EU. Or you can use the us-east-1 destination by selecting Logz.io Metrics US.

  • Next, add your Logz.io Shipping Token. You can find this by going to Settings in Logz.io and selecting Manage Tokens, which requires Logz.io administrator access. This ensures that your account is only ingesting data from the defined sources (e.g., this Amazon Kinesis Data Firehose delivery stream).

Kinesis Stream config

Keep Content encoding on Disabled and set your desired Retry Duration.

You can also configure Buffer hints to your preferences.

  • Next, determine your Backup settings in case something goes wrong. In most cases, it’s only necessary to back up the failed data. Simply choose an Amazon S3 bucket or create a new one to store data if it doesn’t make it to Logz.io. Then, select Create a delivery stream.
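
The console steps above need no code. For teams that prefer to keep infrastructure in scripts, the same choices map roughly onto the Firehose `CreateDeliveryStream` API, as sketched below. The function name, the endpoint URL, the ARNs, and the default retry duration are placeholders and assumptions, not values from Logz.io or from this post; the sketch also assumes that the shipping token is supplied as the HTTP endpoint access key.

```python
def logzio_delivery_stream_kwargs(
    stream_name: str,
    endpoint_url: str,
    shipping_token: str,
    role_arn: str,
    backup_bucket_arn: str,
    retry_seconds: int = 60,
) -> dict:
    """Build keyword arguments for a Direct PUT stream with an HTTP endpoint destination."""
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",
        "HttpEndpointDestinationConfiguration": {
            "EndpointConfiguration": {
                "Name": f"logz.io-{stream_name}",  # the name should include logz.io
                "Url": endpoint_url,
                "AccessKey": shipping_token,
            },
            # Corresponds to leaving Content encoding on Disabled in the console.
            "RequestConfiguration": {"ContentEncoding": "NONE"},
            "RetryOptions": {"DurationInSeconds": retry_seconds},
            # Back up only the records that fail to reach the endpoint.
            "S3BackupMode": "FailedDataOnly",
            "RoleARN": role_arn,
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": backup_bucket_arn},
        },
    }


kwargs = logzio_delivery_stream_kwargs(
    stream_name="cloudwatch-metrics",
    endpoint_url="https://example.invalid/logzio-metrics",  # placeholder URL
    shipping_token="<<shipping-token>>",
    role_arn="arn:aws:iam::111122223333:role/firehose-delivery",  # placeholder
    backup_bucket_arn="arn:aws:s3:::example-backup-bucket",  # placeholder
)
```

With boto3, these arguments could be passed as `boto3.client("firehose").create_delivery_stream(**kwargs)`.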

Now it’s time to connect Amazon CloudWatch to our Amazon Kinesis Data Firehose Delivery Stream.

  • Navigate to Amazon CloudWatch and select Streams in the Metrics menu. Select Create metrics stream.
  • Next, you can either select to send all your Amazon CloudWatch metrics to Logz.io, or only metrics from specified namespaces.

In this case, we chose Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), AWS Lambda, and Elastic Load Balancing (ELB).

  • Under Configuration, choose the Select an existing Firehose owned by your account option and choose the Amazon Kinesis Data Firehose you just configured.

Metric Streams Config

If you’d like, you can choose additional statistics in the Add additional statistics box, which provides helpful metrics in terms of percentiles to monitor like latency metrics (i.e., which services have the highest average latency). This may increase your costs.

  • Lastly, give your metric stream a name and hit Create metric stream.
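
For reference, the metric stream settings chosen in the console correspond to the arguments of the CloudWatch `PutMetricStream` API. The sketch below assembles them; the function name, the stream name, and the ARNs are placeholders, and the `json` output format is an assumption, so check which format your destination expects.

```python
from typing import List


def metric_stream_kwargs(
    name: str,
    firehose_arn: str,
    role_arn: str,
    namespaces: List[str],
    output_format: str,
) -> dict:
    """Build keyword arguments for a metric stream limited to the given namespaces."""
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,
        "RoleArn": role_arn,
        "OutputFormat": output_format,
        # An empty filter list would stream metrics from every namespace.
        "IncludeFilters": [{"Namespace": namespace} for namespace in namespaces],
    }


stream_kwargs = metric_stream_kwargs(
    name="logzio-metric-stream",  # placeholder
    firehose_arn="arn:aws:firehose:us-east-1:111122223333:deliverystream/cloudwatch-metrics",
    role_arn="arn:aws:iam::111122223333:role/metric-stream-to-firehose",
    namespaces=["AWS/EC2", "AWS/RDS", "AWS/Lambda", "AWS/ELB"],
    output_format="json",  # assumption
)
```

With boto3, these arguments could be passed as `boto3.client("cloudwatch").put_metric_stream(**stream_kwargs)`.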

That’s it! Without writing a single line of code, we configured an integration with AWS and Logz.io that enables fast and easy infrastructure monitoring through Amazon CloudWatch data collection.

Your metrics will be stored in Logz.io for 18 months out of the box, without requiring any overhead management.

You can also build dashboards and alerts to start monitoring, like the Amazon EC2 monitoring dashboard below.

ec2 monitoring dashboard Logz.io

Conclusion

This post demonstrated how to configure an integration with AWS and Logz.io for efficient infrastructure monitoring through Amazon CloudWatch.

To learn more about building metrics dashboards in Logz.io, you can watch this video.

Currently, some users might find that they are sending more data than they really need, which can raise costs. In future versions of this integration, it will be easier to narrow down the metrics to reduce costs.

Want to try it yourself? Create a Logz.io account today, navigate to our infrastructure monitoring product, and start streaming metric data to Logz.io to start monitoring.


About the authors

Amos Etzion – Product Manager at Logz.io

Charlie Klein – Product Marketing Manager at Logz.io

Mark Kriaf – Partner Solutions Architect at AWS

Organize your AWS Serverless code to prevent merge conflicts

Post Syndicated from Mark Curtis original https://aws.amazon.com/blogs/devops/organize-your-aws-serverless-code-to-prevent-merge-conflicts/

How do you prevent the most common merge conflicts when your team is working on a Serverless application? How do you make sure that your team stays productive and avoids large merge issues while trying to update the same crucial files simultaneously? The answer to both questions is code organization! You can use cfn-include and swagger-cli to organize, collaborate, and maintain a large serverless application as well as support a large or decentralized development team.

Real life inspiration

WRAP Technologies Inc. (WRAP) creates advanced technologies for the protection and security of public safety. Their WRAP Reality product allows law enforcement agencies to train their officers using virtual reality-based scenarios.

Too many cooks in the kitchen

When multiple developers collaborate on a serverless architecture built with AWS CloudFormation, and its extensions such as the AWS Serverless Application Model (SAM), the nature of specifying resources in both the template.yaml and the optional OpenAPI.yaml specification for Amazon API Gateway leads to merge conflicts, such as the one demonstrated in the following figure, where two developers are adding different API endpoints at the same time. These conflicts detract from developers’ time and agility. Furthermore, navigating and maintaining the long template files required for a larger serverless architecture slows development, as the developer scans large files to find a particular resource definition.

Figure 1. The frustrating merge conflicts.

By refactoring and organizing the CloudFormation and OpenAPI files, your development team can realize several benefits:

  • Improve developer efficiency by decomposing large, hard-to-manage files into a series of well-organized and single-purpose files.
  • Enhance developer productivity by allowing each developer to have ownership of their own code, thereby reducing the need to coordinate merges with teammates.
  • Eliminate potential merge issues for files that generate the most conflicts during the development of a typical Serverless API application.

Rapid development

WRAP partnered with AWS to develop and host the backend for their new officer training management platform. This entirely new platform was developed, completed, and available for use in a matter of months. Moreover, it’s a collaboration of developers spread across multiple teams worldwide, all contributing to the same code base. By instituting the norms and techniques of this post, WRAP created a large and maintainable serverless application with minimal developer code collisions.

Development of the WRAP Reality training management system was accomplished using CloudFormation for defining Infrastructure as Code (IaC), and an Amazon API Gateway OpenAPI specification for defining API contracts. The development team for the WRAP Reality training management service leveraged agile development for expediency, including the GitHub Flow branching strategy. However, since project contributors were not co-located, several considerations were put in place to make sure of consistency and speed of code development:

  • The API specifications and contracts were defined in OpenAPI (Swagger) specifications early in the development process, clearly defining the project structure up front, and allowing developers to independently build infrastructure components.
  • The two code assets central to the entire project – the CloudFormation template and the OpenAPI Specification – were decomposed into small, easily manageable components. This enabled components to be organized in a way that enhanced development productivity and practically eliminated the inevitable merge conflicts that come with large source code files that are being modified on a daily basis.

The development process was accelerated by utilizing OpenAPI integrations with AWS Services, as well as techniques for managing the OpenAPI specification and CloudFormation template files.

Sample project

To demonstrate these techniques, we’ll explore the following sample project, available on GitHub, which consists of API endpoints for “widget” management. This project provides the following endpoints:

  • /widget PUT: Creation of a new widget
  • /widget GET: Retrieval of a widget
  • /reports/color GET: Retrieval of a set of widgets based on the widget color
  • /reports/filterpage GET: Retrieval of widgets based on specified filters

The overall architecture of the application is shown in the following diagram:

Figure 2. Architecture Diagram

The application comprises:

  • Amazon API Gateway is a fully-managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. In this example, API Gateway serves as the web service for the API endpoints. The mapping of data to and from the API endpoints to the Lambda functions is formally defined by an OpenAPI specification file.
  • AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. In this example, four Lambda functions service the four API calls, one function per call.
  • Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. DynamoDB is used as a persistent data store for widgets and associated properties.

OpenAPI and AWS service integration

When using API Gateway, developers have the option of using proxy Lambda integrations, or formally defining the API interface in an OpenAPI yaml file. The OpenAPI specification can be leveraged to document the API prior to development, and the example/mock features of the OpenAPI specification facilitate concurrent development by quickly establishing a working infrastructure to build upon. Furthermore, API documentation can be automatically generated from the OpenAPI specification.

As the number of endpoints increases, the OpenAPI specification file can grow in size, reaching thousands of lines of code that must be updated and maintained regularly by multiple developers. To aid in management and usability, the OpenAPI file can be decomposed into separate files for endpoints, responses, fields, and schemas.

Start with a “skeleton” file as an entry point for the OpenAPI definition, and then add a separate file for the definition of each endpoint or construct. For example, the sample project entry point is api/apiSkeleton.yaml, which contains the global definitions and effectively defines a simple list of endpoints and the reference ($ref) file path to each endpoint’s definition.

/reports/color:
  $ref: './paths/reports/reportsColor.yaml'

/reports/filterpage:
  $ref: './paths/reports/reportsFilterPage.yaml'

Diving into a file referenced by an endpoint, we see that it contains all of the specification details for that endpoint. Looking at the reportsColor.yaml file reveals the full endpoint specification for /reports/color:

get:
  description: Get widgets by color
  parameters:
    - in: path
      $ref: '../../requestParameters/color.yaml'
  responses:
    200:
      description: Get All the Widgets of a color
      content:
        application/json:
          schema:
            $ref: '../../schemas/widgetList.yaml'
    . . .

In turn, this endpoint specification can include further references to yaml files defining common parameters, schemas, and even full gateway responses. For example, color.yaml defines the color path variable:

type: string
description: "The widget's color"
example: "Red"

To paraphrase a common catch phrase, “With a great many files, comes a great responsibility for organization.” To this end, we offer the following organizational structure as a start. Place all of the related API specifications in an “api” subfolder of your project. Have child subfolders for field, metadata, and gateway response definition files. Then, create child subfolder trees for each branch of your endpoints that mirror the endpoint paths. This will result in a highly-organized directory structure, as seen in the sample project:

├── api
│   ├── apiSkeleton.yaml
│   ├── fields
│   │   ├── color.yaml
│   │   ├── metadata
│   │   │   ├── count.yaml
│   │   │   ├── message.yaml
│   │   └── widgetname.yaml
│   ├── gatewayResponses
│   │   ├── error.yaml
│   │   └── notFound.yaml
│   ├── paths
│   │   ├── reports
│   │   │   ├── reportsColor.yaml
│   │   │   └── reportsFilterPage.yaml
│   │   └── widget
│   │       ├── widgetPut.yaml
│   │       └── widgetWidgetnameGet.yaml

We still need a consolidated single OpenAPI file to provide to CloudFormation during deployment to AWS. Therefore, the multiple files are combined and validated using the swagger-cli bundle command, resulting in a single file for deployment. The bundle command must be executed before a CloudFormation build. This command can also be included as a shortcut in the Makefile as the “buildOpenApi” command:

swagger-cli bundle -o api/api.yaml --dereference --type yaml api/apiSkeleton.yaml

or

make buildOpenApi

Once compiled, api/api.yaml is then used normally for API Gateway integrations and as a Postman API Collection import. As api/api.yaml is dynamically compiled, it’s included in .gitignore and not checked in to AWS CodeCommit.
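
To make the effect of bundling with dereferencing concrete, here is a minimal Python sketch of the idea: every $ref node is replaced by the content of the file it points to, with paths resolved relative to the referencing file. This is an illustration only, not how swagger-cli is implemented. The in-memory FILES mapping, its contents, and the dereference function are hypothetical stand-ins for the YAML files on disk.

```python
import posixpath

# Hypothetical stand-in for the files on disk: path -> parsed YAML content.
FILES = {
    "api/apiSkeleton.yaml": {
        "paths": {
            "/reports/color": {"$ref": "./paths/reports/reportsColor.yaml"},
        }
    },
    "api/paths/reports/reportsColor.yaml": {
        "get": {
            "description": "Get widgets by color",
            "parameters": [{"$ref": "../../fields/color.yaml"}],
        }
    },
    "api/fields/color.yaml": {"type": "string", "example": "Red"},
}


def dereference(node, current_file, files):
    """Recursively replace {'$ref': path} nodes with the referenced content."""
    if isinstance(node, dict):
        if "$ref" in node:
            # Resolve the reference relative to the file that contains it.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(current_file), node["$ref"])
            )
            return dereference(files[target], target, files)
        return {key: dereference(value, current_file, files)
                for key, value in node.items()}
    if isinstance(node, list):
        return [dereference(item, current_file, files) for item in node]
    return node


entry = "api/apiSkeleton.yaml"
bundled = dereference(FILES[entry], entry, FILES)
```

After this runs, bundled is a single nested structure with no $ref entries left, which is the property the deployment step relies on.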

cfn-include and nested stacks

The CloudFormation template that defines the infrastructure for even a simple service can grow to considerable length, perhaps thousands of lines. This presents challenges from a support and continued development perspective, as specific code locations become difficult to find and merge conflicts become commonplace.

CloudFormation Nested Stacks are a method of breaking a large CloudFormation template into separate templates. When there are clear delineations between groups of resources in a stack, breaking it into separate nested stacks makes sense. There is also a limit of 500 resources in a single CloudFormation stack, so going above that requires nested or separate stacks. Depending on the complexity of the architecture and the frequency of updates, however, the nested stacks can also become large. Furthermore, in a serverless architecture, the logical separation of architecture layers into separate stacks may not be straightforward, for example when a Lambda function is triggered by an event sent to an EventBridge event bus and then sends a different event back to the same event bus.

In these cases, CloudFormation templates can be decomposed further with cfn-include. With this technique, the top-level CloudFormation template becomes a skeleton file which contains the stack parameters, global specifications, a list of resource names without properties, and the outputs. The properties of each resource are contained in separate files, referenced by an ‘include’ directive.

CloudFormation template organization

To organize your CloudFormation template, deconstruct the template into one-file-per-resource, with one main “skeleton” file as the main entry point. This skeleton file contains the full parameters, global section, conditions, and output specification. The resources are specified by resource name in this skeleton file, and then an ‘include’ directive points to the file that contains the body of the resource declaration. See the following example of the main skeleton file with two resources:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  Widget API Service
Globals:
  Function:
    Handler: app.lambda_handler
    Runtime: python3.8
Resources:

    WidgetApi:
        !Include ./resources/apigw/widgetApiGW.yaml

    WidgetDdbTable:
        !Include ./resources/dynamodb/widgetDdbTable.yaml

Then, the resource files contain the properties of that specific resource. For example, widgetApiGW.yaml defines an API Gateway:

Type: AWS::Serverless::Api
Properties:
  DefinitionBody:
    Fn::Transform:
      Name: AWS::Include
      Parameters:
        Location: api/api.yaml
  EndpointConfiguration:
    Type: REGIONAL
  StageName: prod
  TracingEnabled: true

This approach has the benefit of breaking the CloudFormation template into multiple small files, while still maintaining a top-level holistic view. The resource definitions, which normally comprise the majority of the content and can cause merge conflicts, are moved out of the main template.

For organization, you can create a directory in your project to contain the CloudFormation scripts. This directory also contains the entry-point skeleton file. Create further sub-folders for resources, and then further folders by resource type and architecture. We found that placing applicable AWS Identity and Access Management (IAM) role resource definitions in the same folder with the applied resource facilitated easier navigation. For example:

├── cloudformation
│   ├── resources
│   │   ├── apigw
│   │   │   └── widgetApiGW.yaml
│   │   ├── dynamodb
│   │   │   └── widgetDdbTable.yaml
│   │   └── lambda
│   │       ├── layers
│   │       │   └── lambdaDDBEnv.yaml
│   │       ├── reports
│   │       │   ├── reportsColorLambda.yaml
│   │       │   └── reportsColorLambdaRole.yaml
│   │       └── widget
│   │           ├── widgetGetLambda.yaml
│   │           └── widgetGetLambdaRole.yaml
│   └── templateSkeleton.yaml

The files must be reconstituted to a single template.yaml for CloudFormation build and deployment. This is accomplished with the cfn-include command. A convenience command can optionally be included in the Makefile.

cfn-include --yaml cloudFormation/templateSkeleton.yaml > template.yaml

or

make buildTemplate

As the final template.yaml file is dynamically compiled, it’s included in .gitignore and not checked in to CodeCommit.
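
Because both generated files must exist before a CloudFormation build, the order of the two steps matters: api/api.yaml first, then template.yaml. As an alternative to the Makefile shortcuts, the sequence could be sketched in Python as follows. The function names and default paths are illustrative assumptions and are not part of the sample project. The command arguments mirror the ones shown in this post, with --type being the long form of swagger-cli's output type option.

```python
import subprocess


def build_commands(api_skeleton="api/apiSkeleton.yaml",
                   api_out="api/api.yaml",
                   template_skeleton="cloudFormation/templateSkeleton.yaml"):
    """Return the two build commands in the order they must run."""
    openapi_cmd = ["swagger-cli", "bundle", "-o", api_out,
                   "--dereference", "--type", "yaml", api_skeleton]
    template_cmd = ["cfn-include", "--yaml", template_skeleton]
    return openapi_cmd, template_cmd


def build(template_out="template.yaml"):
    """Bundle the OpenAPI files, then assemble the CloudFormation template."""
    openapi_cmd, template_cmd = build_commands()
    # Step 1: produce the single OpenAPI file referenced by the API resource.
    subprocess.run(openapi_cmd, check=True)
    # Step 2: cfn-include writes to stdout, so redirect it into the template.
    with open(template_out, "w") as out:
        subprocess.run(template_cmd, check=True, stdout=out)
```

Calling build() requires swagger-cli and cfn-include to be installed and on the PATH. check=True stops the sequence if the OpenAPI bundle fails, so a stale api/api.yaml is never packaged into the template.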

Conclusion

This post demonstrates techniques used by WRAP and AWS to rapidly develop and maintain key files in a serverless architecture. The techniques discussed in this post allowed the WRAP and AWS team to do the following:

  • Improve developer efficiency by decomposing large, hard-to-manage files into a series of well-organized and single purpose files.
  • Enhance developer productivity by allowing each developer to have ownership of their own piece of the code without having to coordinate with teammates.
  • Eliminate potential merge issues on the files that typically generate the most conflicts during the development of a typical Serverless API application.

Applying these techniques was one of the key factors in the rapid development of the WRAP Reality training framework.

About the Authors:

Tom Romano

Tom Romano is a Solutions Architect from Tampa, FL. Tom is a member of the Service Creation team for the World Wide Public Sector, who assists GovTech and EdTech customers as they create new solutions that are cloud-native, event-driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics. In his free time, Tom flies remote control model airplanes and enjoys vacationing around Florida.

Robert Maefs

Robert Maefs is a lead technologist currently working with Wrap, Inc. developing innovative Virtual Reality training simulations for law enforcement and corrections. He is a repeat entrepreneur with expertise bringing mature technologies to under-served industries. In his personal life, Robert nerds out with board games and 3D printing.

Mark Curtis

Mark Curtis is a Senior Solutions Architect at AWS. At AWS he helps EdTech and GovTech customers architect and modernize their applications using cloud native serverless services. Prior to joining AWS, he spent 18 years developing scalable applications for both EdTech and Government customers.

Juan Peredo

Juan Peredo is a Cloud Application Architect at AWS Professional Services. He enjoys working with customers to design, migrate, and optimize cloud native applications. He is a problem solver at heart who likes using emerging technologies to solve interesting problems.

Scaling AWS Outposts rack deployments with ACE racks

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/scaling-aws-outposts-rack-deployments-with-ace-racks/

This blog post is written by Eric Vasquez, Specialist Hybrid Edge Solutions Architect, and Paul Scherer, Senior Network Service Tech.

Overview

AWS Outposts brings managed, monitored AWS infrastructure, compute, and storage to your on-premises environment. It provides the same AWS APIs and console experience you would get within the AWS Region to which the Outpost is homed. You may already have an Outposts rack. An Outpost can consist of one or more racks creating a pool of consumable resources as a single logical Outpost. In this post, we will introduce you to an Aggregation, Core, Edge (ACE) rack.

Depending on your familiarity with the Outpost family, you might have already heard about an ACE rack. An ACE rack serves as an aggregation point for multi-rack Outpost deployments. ACE racks reduce the physical networking port requirements as well as the logical interfaces needed, while allowing for connectivity between multiple racks in your logical Outpost. ACE racks are recommended for customers with planned deployments beyond three racks excluding the ACE rack itself.

We recommend that all customers leverage an ACE rack if planning expansions beyond three racks in the long-term, even if the initial deployment is a single rack. An ACE rack contains four routers, and these routers can connect to either two or four customer upstream devices. For the best redundancy, reliability, and resiliency, we recommend deploying an ACE rack to four upstream customer devices.

ACE racks support 10G, 40G, and 100G connections to a customer network. However, 100G connections between each ACE router and a customer device are recommended.

Outpost extension from Region and ACE rack deployment in a 15-rack Outpost configuration

Each Outposts rack comes standard with redundant Outpost networking devices, power supplies, and two top-of-rack patch panels which serve as demarcation points between the Outpost rack and your customer networking device (CND). For the remainder of this post, we’ll refer to the Outpost Networking Devices as OND and customer switches/routers as CND. The Outpost rack ONDs form Border Gateway Protocol (BGP) neighbor relationships with either your CND or the ACE rack using point-to-point (P2P) Virtual LAN (VLAN) interfaces.

For Outposts installation without an ACE rack, each Outposts OND connects to your LAN using single-mode or multi-mode fiber with LC connectors supporting 1G, 10G, 40G, or 100G connectivity. We provide flexibility for the CNDs and allow either Layer 2 or Layer 3 devices, including firewalls. Each OND uses a single LACP port channel that carries two point-to-point VLAN virtual interfaces (VIFs) to establish two BGP relationships over the port channel to your upstream CND and aggregate total bandwidth. This results in each Outpost rack requiring a minimum of two physical uplinks, but as a general best practice we recommend two per device for a total of four uplinks, along with two LACP port channels and four VLANs to establish point-to-point (P2P) BGP peerings. Note that the IPs used in the following diagram are just examples.

Outpost Service link and Local Gateway VLAN

As rack deployments expand, so does the number of physical uplinks and VLAN interfaces required to connect each added OND to a CND. When we introduce the ACE rack, the OND is no longer attached to your CND. Instead, it goes directly to ACE devices, which provide at least one uplink to your network switch/router. In this topology, AWS owns the VLAN interface allocation and configuration between compute rack OND and the ACE routers.

Let’s cover the potential downsides to a multi-rack installation without an ACE rack. In this case, we have a three-rack Outpost deployment, with one uplink from each rack OND to the CND (two per rack). This would require you to provide six physical ports on your devices, six fiber cables, 12 VLAN VIFs, 12 P2P subnets (potentially consuming 24 IPs), and six port channels.

In comparison, with a three-rack install that sits behind an ACE rack, you provide fewer physical network ports on your devices, fewer fiber cabling uplinks, fewer VLAN VIFs, fewer port channels, and fewer P2P subnets. Each ACE router will have its own LACP port channel with two VLAN VIFs in each channel (the same as an OND-to-customer connection). The following table highlights the advantages of using an ACE rack when running a multi-rack Outpost, which becomes more desirable as you continue to scale.

Requirement | 2-Rack Outpost Installation, Without ACE | 2-Rack Outpost Installation, With ACE | 3-Rack Outpost Installation, Without ACE | 3-Rack Outpost Installation, With ACE | 4-Rack Outpost Installation, Without ACE | 4-Rack Outpost Installation, With ACE
Physical Ports | 4 | 4 | 6 | 4 | 8 | 4
Fiber Cables | 4 | 4 | 6 | 4 | 8 | 4
LACP Port Channels | 4 | 4 | 6 | 4 | 8 | 4
VLAN VIFs | 8 | 8 | 12 | 8 | 16 | 8
P2P Subnets | 8 | 8 | 12 | 8 | 16 | 8

ACE vs. Non-ACE Rack Components Comparison
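
The pattern behind the comparison can be expressed as simple arithmetic. The Python sketch below reproduces the figures above under the assumptions stated in this post: two ONDs per rack with one uplink each when there is no ACE rack, four ACE routers with one uplink each when there is one, and two VLAN VIFs (each with its own P2P subnet) per port channel. The function name is hypothetical, and the sketch is only a planning aid. Actual requirements should be confirmed with your AWS account team.

```python
def uplink_requirements(racks, with_ace):
    """Customer-side connectivity requirements for a multi-rack Outpost."""
    # With an ACE rack, the customer network connects only to the four ACE
    # routers, so the totals do not grow with the number of compute racks.
    # Without one, every rack contributes two ONDs that each need an uplink.
    devices = 4 if with_ace else 2 * racks
    return {
        "physical_ports": devices,
        "fiber_cables": devices,
        "lacp_port_channels": devices,   # one port channel per device
        "vlan_vifs": 2 * devices,        # two VIFs per port channel
        "p2p_subnets": 2 * devices,      # one P2P subnet per VIF
    }


for racks in (2, 3, 4):
    print(racks, uplink_requirements(racks, with_ace=False),
          uplink_requirements(racks, with_ace=True))
```

The without-ACE totals grow linearly with the rack count while the with-ACE totals stay fixed, which is why the benefit increases as the deployment scales.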

Furthermore, you should consider the additional weight and power requirements that an ACE rack introduces when planning for multi-rack deployments. In addition to the initial kVA requirements for the Outpost racks, you must account for the resources required for an ACE rack. An ACE rack consumes up to 10 kVA of power and weighs up to 705 lbs. Carefully planning additional capacity for these resources with your AWS account team will be critical for a successful deployment.

Similar to an Outpost rack, an ACE rack deployment is monitored by AWS. The rack provides telemetry data transmitted over a set of VPN tunnels back to the anchor points in the Region to which the Outpost is homed. This allows AWS to monitor the rack for hardware failures, performance degradation, and other alarm conditions, including links or interfaces going down and BGP drops.

As part of the Outpost ordering process, AWS will work closely with you to determine the location for install, power availability on-site, and the network configuration of both the Outposts rack and ACE rack. This includes BGP configuration and the Customer-owned IP address (CoIP) pool, which is the pool of IP addresses for route advertisements back to your CND. The CoIP pool allows resources inside your Outpost rack to communicate with on-premises resources and vice versa. Another connectivity option is Direct VPC Routing (DVR), where we advertise VPC subnets associated with your local gateway (LGW) to your on-premises networks. Outposts uses a network connection back to the Region for management purposes called the service link (SL). The SL is an encrypted set of VPN connections used whenever the Outpost communicates with your chosen home Region.

Conclusion

This post addresses the most common questions surrounding ACE racks, how an ACE rack can be deployed, and why an ACE rack would be leveraged for a multi-rack Outpost deployment. In this post, we demonstrated how an ACE rack serves as a consolidation point in your on-premises environment, making multi-rack deployments scalable, while reducing complexity and physical port allocation for connectivity between an Outpost and your LAN. In addition, we described how you can get this process started. If you want to learn more about Outposts fundamentals and how you can build your applications with AWS services using Outposts for hybrid cloud deployments, check out the Outposts user guide.

Using Workflows to Build, Test, and Deploy with Amazon CodeCatalyst

Post Syndicated from Kumar Karra original https://aws.amazon.com/blogs/devops/using-workflows-to-build-test-and-deploy-with-amazon-codecatalyst/

Amazon CodeCatalyst workflows are continuous integration and continuous delivery (CI/CD) pipelines that enable you to easily build, test and deploy applications. CodeCatalyst was announced at re:Invent 2022 and is currently in preview.

Introduction:

I recently read The Unicorn Project, the follow-up to the bestselling title The Phoenix Project from Gene Kim. After a few years at Amazon, I had forgotten how some companies write software, but it all came back to me as I read. In the book, the main character, Maxine, struggles with a complicated software development lifecycle (SDLC) after joining a new team. Some of the challenges she encounters include:

  • Continually delivering high-quality updates is complicated and slow
  • Collaborating efficiently with others is challenging
  • Managing application environments is increasingly complex
  • Setting up a new project is a time consuming chore

Amazon CodeCatalyst can help address all of these issues. CodeCatalyst is an integrated DevOps service that makes it easy for development teams to quickly build and deliver applications on AWS. Over the next few weeks, my colleagues and I will release a series of blog posts describing the individual features of CodeCatalyst and how they will help you overcome the challenges that Maxine encountered in The Unicorn Project. In this first post, I focus on Workflows and address the first bullet above, “continually delivering high-quality updates is complicated and slow”.

CodeCatalyst Workflows help you reliably deliver high-quality application updates frequently, quickly and securely. CodeCatalyst provides a visual editor (or YAML, if you prefer) to quickly assemble and configure actions to compose workflows that automate your CI/CD pipeline, test reporting and other manual processes. Workflows use provisioned compute, Lambda compute, custom container images and a managed build infrastructure to scale execution easily without sacrificing flexibility.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Walkthrough

For this walkthrough, I am going to use the Modern Three-tier Web Application blueprint. A CodeCatalyst blueprint provides a template for a new project. If you would like to follow along, you can launch the blueprint as described in Creating a project in Amazon CodeCatalyst. This will deploy the architecture shown below.

Figure 1. Modern Three-tier Web Application architecture including a presentation, application and data layer

Once the new project is launched, navigate to CI/CD > Workflows. You will see two workflows listed. Click on ApplicationDeploymentPipeline and you will be presented with the workflow pictured below. The workflow consists of six actions: 1) ensures that CDK is configured in the account; 2) builds the backend, written in Python, including unit tests; 3) deploys the backend to either AWS Lambda or AWS Fargate depending on which you selected when you launched the project; 4) runs a series of integration tests on the deployed backend; 5) builds the frontend, written with Vue, including unit tests; and finally, 6) deploys the frontend to Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront.

Figure 2. Six step Workflow described in the prior paragraph

Let’s look at a few of these actions. If you click on each action you will see details about the workflow execution. For example, I clicked on build_backend. On the logs tab, I can see the build action executes a series of steps. In this example, pip installs requirements and then pytest and coverage run a series of unit tests. If this had been a compiled language, like Java or .NET, there would have been a build step as well.

Figure 3. Logs from the build action including pip, pytest, and coverage

If I switch to the Reports tab, I see the result of the unit tests as well as code and branch coverage. In each case the test has exceeded the pass rate, indicated by the black bar on the graph. If they had not, the build would have failed.

Figure 4. Results of the unit tests including code and branch coverage

Next, let’s examine how the workflow is defined by clicking on the Edit button in the top right corner of the screen. If the editor opens in YAML mode, switch to Visual mode using the toggle above the code. If I click on WorkflowSource, I see that the Workflow is triggered by a push to the main branch. I could add additional triggers. CodeCatalyst supports triggering on Push or Pull Request. In addition, I can trigger off multiple branches, including wildcards (e.g. “release-.*”). Finally, I can trigger the workflow only when certain files in a repository change (e.g. "src/.*").

Figure 5. Trigger configuration showing various options
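
The wildcard examples above are regular expressions rather than shell globs. To build intuition for which pushes such patterns would select, here is a small Python sketch. It assumes that a pattern must match the whole branch name or file path, and that a file filter is satisfied when at least one changed file matches. Both are assumptions about the matching rules, and the function names are hypothetical, so check the CodeCatalyst documentation for the exact semantics.

```python
import re


def matches_any(patterns, value):
    """Return True if the whole value matches at least one regex pattern."""
    return any(re.fullmatch(pattern, value) for pattern in patterns)


def should_trigger(branch, changed_files, branch_patterns, file_patterns=None):
    """Decide whether a push would start the workflow under the assumed rules."""
    if not matches_any(branch_patterns, branch):
        return False
    if file_patterns is None:
        return True  # no file filter configured
    return any(matches_any(file_patterns, path) for path in changed_files)


branches = ["main", "release-.*"]
print(should_trigger("release-1.2", ["src/app.py"], branches, ["src/.*"]))
print(should_trigger("main", ["README.md"], branches, ["src/.*"]))
print(should_trigger("feature-x", ["src/app.py"], branches, ["src/.*"]))
```

Under these assumptions, only the first push starts the workflow: the second changes no file under src/, and the third is on a branch that matches neither pattern.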

Now, let’s look at the build_frontend action. This is a build action, similar to the build_backend action you looked at earlier. On the Configure tab I can see the Shell commands that will be executed during the build. Remember that the frontend is written using Vue. Here I can see npm install used to install dependencies, npm run test:unit used to run tests, and finally npm run build-only to build the Single Page App (SPA). The resulting artifacts are passed to subsequent actions in the Workflow.

Figure 6. Shell commands run in the build action

Next, let’s look at the integration_test action. A managed test action is very similar to a build action, defining a series of commands to execute. On the configuration tab (not shown), I can see that this action is again running pytest. Switching to the Outputs tab, I see that CodeCatalyst is configured to automatically discover the test reports generated by pytest and other test frameworks. In addition, I have defined a minimum pass rate of 100%. This means that the workflow should fail if any of the integration tests fail.

Figure 7. Test report configuration dialog including success criteria
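
The success criterion is a simple threshold on the share of passing tests. The Python sketch below shows that calculation. It illustrates the arithmetic only, not CodeCatalyst's own implementation, and the function names are hypothetical.

```python
def pass_rate(passed, total):
    """Percentage of tests that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def meets_success_criteria(passed, total, minimum_pass_rate=100.0):
    """Return True when the pass rate reaches the configured minimum."""
    return pass_rate(passed, total) >= minimum_pass_rate
```

With a minimum of 100%, meets_success_criteria(9, 10) is False, so a single failing integration test is enough to fail the workflow. A lower threshold such as 90% would let the same run pass.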

Finally, let’s examine the deploy_frontend action. Note that all of the actions you have looked at so far include a series of commands to run in their configuration. While these actions are highly flexible, CodeCatalyst also supports purpose-built actions. The cdk-deploy action is an example of this. As the name implies, this action deploys AWS Cloud Development Kit (CDK) resources. I could have called cdk deploy from the shell commands in a build action. However, using the purpose-built action is easier. CodeCatalyst supports many purpose-built actions developed by AWS as well as third parties. Click on the + sign in the top left corner of the screen to see a few examples. In addition, CodeCatalyst supports GitHub actions, but that is a topic for another post.

Cleanup

If you have been following along with this walkthrough, you should delete the resources you deployed so you do not continue to incur charges (see the pricing page for more details). First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and clicking the Delete project button.

Conclusion

In this post, you learned how CodeCatalyst can help you rapidly assemble automation workflows by configuring composable, pre-built actions into CI/CD pipelines. I examined actions to build, test and deploy both frontend and backend applications. In future posts, I will discuss how CodeCatalyst can address the rest of the challenges Maxine encountered in The Unicorn Project.

About the authors:

Kumar Karra

Kumar Karra is a Field Solutions Architect for AWS Small and Medium Business Customers. He has a strong background in designing and developing applications ranging from small consumer-facing applications to large mission-critical applications for enterprises. He specializes in Builder’s Experience tools and enjoys helping customers shorten their time to value by guiding them on strategies to implement fast, repeatable, testable, and scalable tools and architectures.

Kawshik Sarkar

Kawshik Sarkar is a Field Solutions Architect for AWS Small and Medium Business customers. He helps customers by designing solutions using AWS cloud services to enhance their user experience, maximize outcomes, and improve business agility. He enjoys music, podcasts, tennis, and being outdoors.

Divya Konaka Satyapal

Divya Konaka Satyapal is a Sr. Technical Account Manager for WWPS Edtech/EDU customers. Her expertise lies in DevOps and Serverless architectures. She works with customers heavily on cost optimization and overall operational excellence to accelerate their cloud journey. Outside of work, she enjoys traveling and playing tennis.

Configuration driven dynamic multi-account CI/CD solution on AWS

Post Syndicated from Anshul Saxena original https://aws.amazon.com/blogs/devops/configuration-driven-dynamic-multi-account-ci-cd-solution-on-aws/

Many organizations require durable automated code delivery for their applications. They leverage multi-account continuous integration/continuous deployment (CI/CD) pipelines to deploy code and run automated tests in multiple environments before deploying to Production. In cases where the testing strategy is release specific, you must update the pipeline before every release. Traditional pipeline stages are predefined and static in nature, and once the pipeline stages are defined it’s hard to update them. In this post, we present a configuration driven dynamic CI/CD solution per repository. The pipeline state is maintained and governed by configurations stored in Amazon DynamoDB. This gives you the advantage of automatically customizing the pipeline for every release based on the testing requirements.

By following this post, you will set up a dynamic multi-account CI/CD solution. Your pipeline will deploy and test a sample pet store API application. Refer to Automating your API testing with AWS CodeBuild, AWS CodePipeline, and Postman for more details on this application. New code deployments will be delivered with custom pipeline stages based on the pipeline configuration that you create. This solution uses services such as AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, Amazon DynamoDB, AWS Lambda, and AWS Step Functions.

Solution overview

The following diagram illustrates the solution architecture:

The image represents the solution workflow, highlighting the integration of the AWS components involved.

Figure 1: Architecture Diagram

  1. Users insert/update/delete entry in the DynamoDB table.
  2. The Step Function Trigger Lambda is invoked on all modifications.
  3. The Step Function Trigger Lambda evaluates the incoming event and does the following:
    1. On insert and update, triggers the Step Function.
    2. On delete, finds the appropriate CloudFormation stack and deletes it.
  4. Steps in the Step Function are as follows:
    1. Collect Information (Pass State) – Filters the relevant information from the event, such as repositoryName and referenceName.
    2. Get Mapping Information (Backed by CodeCommit event filter Lambda) – Retrieves the mapping information from the Pipeline config stored in the DynamoDB.
    3. Deployment Configuration Exist? (Choice State) – If the StatusCode == 200, the DynamoDB entry was found and the Initiate CloudFormation Stack step is invoked. Otherwise, the Step Function exits with a Successful status.
    4. Initiate CloudFormation Stack (Backed by stack create Lambda) – Constructs the CloudFormation parameters and creates/updates the dynamic pipeline based on the configuration stored in the DynamoDB via CloudFormation.
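
The routing performed by the Step Function Trigger Lambda in step 3 can be sketched as follows. This is a minimal illustration, not the code from the solution repository: it assumes the Lambda receives standard DynamoDB Streams records, that RepoName and RepoTag are the table keys, and that the stack name follows a hypothetical `<repo>-<branch>-pipeline` convention. It only decides which action to take and leaves the actual Step Functions and CloudFormation calls out.

```python
from typing import Any, Dict, List


def stack_name_for(repo: str, branch: str) -> str:
    # Hypothetical naming convention; the real solution may name stacks differently.
    return f"{repo}-{branch}-pipeline"


def plan_actions(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Map each DynamoDB stream record to the action the trigger would take."""
    actions: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        event_name = record.get("eventName")
        keys = record.get("dynamodb", {}).get("Keys", {})
        # Assumption: RepoName and RepoTag are string keys of the config table.
        repo = keys.get("RepoName", {}).get("S", "")
        branch = keys.get("RepoTag", {}).get("S", "")
        if event_name in ("INSERT", "MODIFY"):
            # Insert or update: start the Step Function for this repo/branch.
            actions.append({
                "action": "start_step_function",
                "repositoryName": repo,
                "referenceName": branch,
            })
        elif event_name == "REMOVE":
            # Delete: remove the CloudFormation stack that holds the pipeline.
            actions.append({
                "action": "delete_stack",
                "stackName": stack_name_for(repo, branch),
            })
    return actions
```

Keeping the decision separate from the AWS calls makes the routing easy to unit test with hand-written stream events.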

Code deliverables

The code deliverables include the following:

  1. AWS CDK app – The AWS CDK app contains the code for all the Lambdas, Step Functions, and CloudFormation templates.
  2. sample-application-repo – This directory contains the sample application repository used for deployment.
  3. automated-tests-repo – This directory contains the sample automated tests repository for testing the sample repo.

Deploying the CI/CD solution

  1. Clone this repository to your local machine.
  2. Follow the README to deploy the solution to your main CI/CD account. Upon successful deployment, the following resources should be created in the CI/CD account:
    1. A DynamoDB table
    2. Step Function
    3. Lambda Functions
  3. Navigate to the Amazon Simple Storage Service (Amazon S3) console in your main CI/CD account and search for a bucket with the name: cloudformation-template-bucket-<AWS_ACCOUNT_ID>. You should see two CloudFormation templates (templates/codepipeline.yaml and templates/childaccount.yaml) uploaded to this bucket.
  4. Run the childaccount.yaml in every target CI/CD account (Alpha, Beta, Gamma, and Prod) by going to the CloudFormation Console. Provide the main CI/CD account number as the “CentralAwsAccountId” parameter, and execute.
  5. Upon successful creation of Stack, two roles will be created in the Child Accounts:
    1. ChildAccountFormationRole
    2. ChildAccountDeployerRole

Pipeline configuration

Make an entry into devops-pipeline-table-info for the Repository name and branch combination. A sample entry can be found in sample-entry.json.

The pipeline is highly configurable, and everything can be configured through the DynamoDB entry.

The following are the top-level keys:

RepoName: Name of the repository for which AWS CodePipeline is configured.
RepoTag: Name of the branch used in CodePipeline.
BuildImage: Build image used for application AWS CodeBuild project.
BuildSpecFile: Buildspec file used in the application CodeBuild project.
DeploymentConfigurations: This key holds the deployment configurations for the pipeline. Under this key are the environment-specific configurations. In our case, we’ve named our environments Alpha, Beta, Gamma, and Prod. You can use any names you like, but make sure that the entries in the JSON match those in the codepipeline.yaml CloudFormation template, because there is a 1:1 mapping between them. Sub-level keys under DeploymentConfigurations are as follows:

  • EnvironmentName. This is the top-level key for environment specific configuration. In our case, it’s Alpha, Beta, Gamma, and Prod. Sub level keys under this are:
    • <Env>AwsAccountId: AWS account ID of the target environment.
    • Deploy<Env>: A key specifying whether or not the artifact should be deployed to this environment. Based on its value, the CodePipeline will have a deployment stage to this environment.
    • ManualApproval<Env>: Key representing whether or not manual approval is required before deployment. Enter your email or set to false.
    • Tests: Once again, this is a top-level key with sub-level keys. This key holds the test-related information to be run on specific environments. Each test that is enabled adds a step to the CodePipeline. The test-related information is also configurable: you can specify the test repository, branch name, buildspec file, and build image for the test CodeBuild project.

Execute

  1. Make an entry into the devops-pipeline-table-info DynamoDB table in the main CI/CD account. A sample entry can be found in sample-entry.json. Make sure to replace the configuration values with appropriate values for your environment. An explanation of the values can be found in the Pipeline Configuration section above.
  2. After the entry is made in the DynamoDB table, you should see a CloudFormation stack being created. This CloudFormation stack will deploy the CodePipeline in the main CI/CD account by reading and using the entry in the DynamoDB table.

Customize the solution for different combinations such as deploying to an environment while skipping for others by updating the pipeline configurations stored in the devops-pipeline-table-info DynamoDB table. The following is the pipeline configured for the sample-application repository’s main branch.

The image represents the dynamic CI/CD pipeline deployed in your account.

Figure 2: Dynamic Multi-Account CI/CD Pipeline

Clean up your dynamic multi-account CI/CD solution and related resources

To avoid ongoing charges for the resources that you created following this post, you should delete the following:

  1. The pipeline configuration stored in the DynamoDB
  2. The CloudFormation stacks deployed in the target CI/CD accounts
  3. The AWS CDK app deployed in the main CI/CD account
  4. The retained S3 buckets (empty them before deleting)

Conclusion

This configuration-driven CI/CD solution provides the ability to dynamically create and configure your pipelines in DynamoDB. IDEMIA, a global leader in identity technologies, adopted this approach for deploying their microservices based application across environments. This solution created by AWS Professional Services allowed them to dynamically create and configure their pipelines per repository per release. As Kunal Bajaj, Tech Lead of IDEMIA, states, “We worked with AWS pro-serve team to create a dynamic CI/CD solution using lambdas, step functions, SQS, and other native AWS services to conduct cross-account deployments to our different environments while providing us the flexibility to add tests and approvals as needed by the business.”

About the authors:

Anshul Saxena

Anshul is a Cloud Application Architect at AWS Professional Services and works with customers helping them in their cloud adoption journey. His expertise lies in DevOps, serverless architectures, and architecting and implementing cloud native solutions aligning with best practices.

Libin Roy

Libin is a Cloud Infrastructure Architect at AWS Professional Services. He enjoys working with customers to design and build cloud native solutions to accelerate their cloud journey. Outside of work, he enjoys traveling, cooking, playing sports and weight training.

Approaches for authenticating external applications in a machine-to-machine scenario

Post Syndicated from Patrick Sard original https://aws.amazon.com/blogs/security/approaches-for-authenticating-external-applications-in-a-machine-to-machine-scenario/

December 8, 2022: This post has been updated to reflect changes for M2M options with the new service of IAMRA. This blog post was first published November 19, 2013.

August 10, 2022: This blog post has been updated to reflect the new name of AWS Single Sign-On (SSO) – AWS IAM Identity Center. Read more about the name change here.


Amazon Web Services (AWS) supports multiple authentication mechanisms (AWS Signature v4, OpenID Connect, SAML 2.0, and more), essential in providing secure access to AWS resources. However, in a strictly machine-to-machine (m2m) scenario, not all are a good fit. In these cases, a human is not present to provide user credential input. An example of such a scenario is when an on-premises application sends data to an AWS environment, as shown in Figure 1.

This post is designed to help you decide which approach is best to securely connect your applications, either residing on premises or hosted outside of AWS, to your AWS environment when no human interaction comes into play. We will go through the various alternatives available and highlight the pros and cons of each.

Figure 1: Securely connect your external applications to AWS in machine-to-machine scenarios

Determining the best approach

Let’s start by looking at possible authentication mechanisms that AWS supports in the following table. We’ll first identify the AWS service or services where the authentication can be set up—called the AWS front-end service. Then we’ll point out the AWS service that actually handles the authentication with AWS in the background—called the AWS backend service. We will also assess each mechanism based on use case.

Table 1: Authentication mechanisms available in AWS
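Authentication mechanism | AWS front-end service | AWS backend service | Good for m2m communication?
AWS Signature v4 | All | AWS Security Token Service (AWS STS) | Yes
Mutual TLS | AWS IoT Core, Amazon API Gateway | AWS STS | Yes
OpenID Connect | Amazon Cognito, Amazon API Gateway | AWS STS | Yes
SAML | Amazon Cognito, AWS Identity and Access Management (IAM) | AWS STS | Yes
Kerberos | n/a | AWS STS | Yes
Microsoft Active Directory communication | AWS IAM Identity Center or a limited set of AWS services | AWS STS | No
IAM Roles Anywhere | AWS Identity and Access Management (IAM) | AWS STS | Yes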

Notes

We’ll now review each of these alternatives and also evaluate two additional characteristics on a 5-grade scale (from very low to very high) for each authentication mechanism:

  • Complexity: How complex is it to implement the authentication mechanism?
  • Convenience: How convenient is it to use the authentication mechanism on an ongoing basis?

As you’ll see, not all of the mechanisms are necessarily a good fit for a machine-to-machine scenario. Our focus here is on authentication of external applications, but not authentication of servers or other computers or Internet of Things (IoT) devices, which has already been documented extensively.

Active Directory–based authentication is available through either AWS IAM Identity Center or a limited set of AWS services and is meant in both cases to provide end users with access to AWS accounts and business applications. Active Directory–based authentication is also used broadly to authenticate devices such as Windows or Linux computers on a network. However, it isn’t used for authenticating applications with AWS. For that reason, we’ll exclude it from further scrutiny in this article.

Let’s look at the remaining authentication mechanisms one by one, with their respective pros and cons.

AWS Signature v4

The purpose of AWS Signature v4 is to authenticate incoming HTTP(S) requests to AWS services APIs. The AWS Signature v4 process is explained in detail in the documentation for the AWS APIs but, in a nutshell, the caller computes a signature using their credentials and then adds it to the header of the HTTP(S) request. On the other end, AWS accepts the request only if the provided signature is valid.

Figure 2: AWS Signature v4 authentication

Native to AWS, low in complexity and highly convenient, AWS Signature v4 is the natural choice for machine-to-machine authentication scenarios with AWS. It is used behind the scenes by the AWS Command Line Interface (AWS CLI) and the AWS SDKs.
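
For readers who want to see what the SDKs compute on their behalf, the signing steps can be sketched with the Python standard library alone. This is a simplified illustration rather than a complete implementation: it signs only the host and x-amz-date headers, assumes the query string is already in canonical (sorted, URI-encoded) form, and omits the X-Amz-Security-Token header that temporary credentials would require. The function name and parameters are our own.

```python
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Dict


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sign_request(method: str, host: str, path: str, query: str, payload: bytes,
                 region: str, service: str, access_key: str, secret_key: str,
                 now: datetime) -> Dict[str, str]:
    """Return the headers carrying an AWS Signature v4 for a simple request."""
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    payload_hash = hashlib.sha256(payload).hexdigest()

    # 1. Canonical request: a normalized description of what is being signed.
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, path, query, canonical_headers, signed_headers, payload_hash]
    )

    # 2. String to sign: algorithm, timestamp, scope, hash of canonical request.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # 3. Signing key: derived from the secret key by a chain of HMACs.
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    k_signing = _hmac(k_service, "aws4_request")

    # 4. Signature, placed in the Authorization header of the request.
    signature = hmac.new(
        k_signing, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    authorization = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
    return {"Host": host, "X-Amz-Date": amz_date, "Authorization": authorization}
```

In practice you would normally let the AWS CLI or an AWS SDK do this for you. The sketch is only meant to show that the secret key never travels with the request; only a signature derived from it does.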

Pros

  • AWS Signature v4 is very convenient: the signature is built in the SDKs provided by AWS and is automatically computed on the caller’s behalf. If you prefer not to use an SDK, the signature process is a simple computation that can be implemented in any programming language.
  • There are fewer credentials to manage. No need to manage tedious digital certificates or even long-lived AWS credentials, because the AWS Signature v4 process supports temporary AWS credentials.
  • There is no need to interact with a third-party identity provider: once the request is signed, you’re good to go, provided that the signature is valid.

Cons

  • If you prefer not to store long-lived AWS credentials for your on-premises applications, you must first perform authentication through a third-party identity provider to obtain temporary AWS credentials. This would require using either OpenID Connect or SAML, in addition to AWS Signature v4. You could also use IAM Roles Anywhere, which exchanges a trusted certificate for temporary AWS credentials.

Mutual TLS

Mutual TLS, more specifically the mutual authentication mechanism of the Transport Layer Security (TLS) Protocol, allows the authentication of both ends—the client and the server sides—of a communication channel. By default, the server side of the TLS channel is always authenticated. With mutual TLS, the clients must also present a valid X.509 certificate for their identity to be verified.

Amazon API Gateway has recently announced native support for mutual TLS authentication (see this blog post for more details on the new feature). You can enable mutual TLS authentication on custom domains to authenticate your regional REST and HTTP APIs (except for private or edge APIs, for which the new feature is not supported at the time of this writing).

Figure 3: Mutual TLS authentication

Mutual TLS can be both time-consuming and complicated to set up, but it is a widespread authentication mechanism.

Pros

  • Mutual TLS is widespread for IoT and business-to-business applications.

Cons

  • You need to manage the digital certificates and their lifecycles. This can add significant burden and complexity to your IT operations.
  • You also need, at an application level, to pay special care to revoked certificates to reduce the risk of misuse. Since API Gateway doesn’t automatically verify if a client certificate has been revoked, you have to implement your own logic to do so, such as by using a Lambda authorizer.
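
The revocation check mentioned in the last point could look roughly like the following Lambda authorizer. It is a sketch under several assumptions: a REQUEST-type authorizer on a REST API with mutual TLS enabled, the client certificate details available under requestContext.identity.clientCert in the event, and a hypothetical hard-coded set of revoked serial numbers standing in for a real certificate revocation list.

```python
from typing import Any, Dict

# Hypothetical revocation list of certificate serial numbers. In practice this
# would be loaded from your CA's CRL rather than hard-coded.
REVOKED_SERIALS = {"0A1B2C3D"}


def _policy(principal_id: str, effect: str, resource: str) -> Dict[str, Any]:
    # Response shape expected from a Lambda authorizer: a principal plus an
    # IAM policy that allows or denies invoking the requested method.
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": resource,
            }],
        },
    }


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    # Assumption: the client certificate presented during the TLS handshake is
    # exposed to the authorizer under requestContext.identity.clientCert.
    identity = event.get("requestContext", {}).get("identity", {})
    cert = identity.get("clientCert") or {}
    serial = str(cert.get("serialNumber", "")).upper()
    subject = cert.get("subjectDN", "unknown")

    # Deny when no certificate is present or its serial number is revoked.
    allowed = bool(serial) and serial not in REVOKED_SERIALS
    return _policy(subject, "Allow" if allowed else "Deny", event["methodArn"])
```

Denying by default when the certificate information is missing keeps the check fail-closed.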

OpenID Connect

OpenID Connect (OIDC), specifically OIDC 1.0, is a standard built on top of the OAuth 2.0 authorization framework to provide authentication for mobile and web-based applications. The OIDC client authentication method can be used by a client application to gain access to APIs exposed through Amazon API Gateway. The client application typically authenticates to an OAuth 2.0 authorization server, such as Amazon Cognito or another solution supporting that standard. As a result, the client application obtains a JSON Web Token (JWT) from the OAuth 2.0 authorization server. API Gateway then allows or denies the request based on the JWT validation. For more information about the access control part of this process, see the Amazon API Gateway documentation.

Figure 4: OIDC client authentication

OIDC can be complex to put in place, but it’s a widespread authentication mechanism, especially for mobile and web applications and microservices architecture, including machine-to-machine scenarios.
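
In a machine-to-machine setting, the step where the client application obtains a token is commonly done with the OAuth 2.0 client credentials grant. The sketch below builds such a token request with the Python standard library. The use of this particular grant, the function name, and the endpoint, client ID, client secret, and scope in the usage note are assumptions for illustration; substitute the values from your own authorization server.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str, client_secret: str,
                        scope: str) -> urllib.request.Request:
    """Build (but do not send) an OAuth 2.0 client credentials token request."""
    # The client authenticates with HTTP Basic: base64("client_id:client_secret").
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")

    # The grant type and requested scope go in a form-encoded body.
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")

    return urllib.request.Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

A caller might build the request with a hypothetical endpoint such as https://example.auth.us-east-1.amazoncognito.com/oauth2/token, send it with urllib.request.urlopen, read the access_token field from the JSON response, and pass that token in the Authorization header of calls to the API. The client secret should come from a secrets store rather than from source code.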

Pros

  • With OIDC, you avoid storing long-lived AWS credentials for your on-premises applications.
  • OIDC uses REST or JSON message flows over HTTP, which makes it a particularly good fit (compared to SAML) for application developers today.

Cons

  • You need to store and maintain a set of credentials for each client application (such as client id and client secret) and make it accessible to the application. This can add complexity to your IT operations.

SAML

SAML 2.0 is an open standard for exchanging identity and security information between applications and service providers. SAML can be used to delegate authentication to a third-party identity provider, such as an Active Directory environment that is running on premises, and to gain access to AWS by providing a valid SAML assertion. (See About SAML 2.0-based federation to learn how to configure your AWS environment to leverage SAML 2.0.)

IAM validates the SAML assertion with your identity provider and, upon success, provides a set of AWS temporary credentials to the requesting party. The whole process is described in the IAM documentation.

Figure 5: SAML authentication

SAML can be complex to put in place, but it’s a versatile authentication mechanism that can fit a lot of different use cases, including machine-to-machine scenarios.
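
The exchange of a SAML assertion for temporary credentials can be sketched as below. The function takes the STS client as a parameter so the example runs without AWS access. In a real application you would pass a client such as `boto3.client("sts")`, whose `assume_role_with_saml` operation takes the parameters shown. The helper name, the ARNs in the usage note, and the stand-in client are our own illustrations.

```python
import base64
from typing import Any, Dict


def credentials_from_saml(sts_client: Any, role_arn: str, principal_arn: str,
                          saml_response_xml: bytes,
                          duration_seconds: int = 3600) -> Dict[str, str]:
    """Exchange a SAML response from your identity provider for temporary credentials."""
    response = sts_client.assume_role_with_saml(
        RoleArn=role_arn,                # role the application will assume
        PrincipalArn=principal_arn,      # SAML provider registered in IAM
        # STS expects the SAML response base64-encoded.
        SAMLAssertion=base64.b64encode(saml_response_xml).decode("ascii"),
        DurationSeconds=duration_seconds,
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


class FakeStsClient:
    """Stand-in for an STS client so the sketch can run offline."""

    def __init__(self) -> None:
        self.last_call: Dict[str, Any] = {}

    def assume_role_with_saml(self, **kwargs: Any) -> Dict[str, Any]:
        self.last_call = kwargs
        return {"Credentials": {
            "AccessKeyId": "ASIAEXAMPLE",
            "SecretAccessKey": "example-secret",
            "SessionToken": "example-token",
        }}
```

For example, calling `credentials_from_saml(FakeStsClient(), "arn:aws:iam::123456789012:role/m2m-role", "arn:aws:iam::123456789012:saml-provider/my-idp", b"<samlp:Response/>")` returns a dictionary of temporary credentials that can then be used to sign requests with AWS Signature v4.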

Pros

  • With SAML, you not only avoid storing long-lived AWS credentials for your on-premises applications, but you can also use an existing on-premises directory, such as Active Directory, as an identity provider.
  • SAML doesn’t prescribe any particular technology or protocol by which the authentication should take place. The developer has total freedom to employ whichever is more convenient or makes more sense: key-based (such as X.509 certificates), ticket-based (such as Kerberos), or another applicable mechanism.
  • SAML is also a good fit when protocol bindings other than HTTP are needed.

Cons

  • Using SAML with AWS requires a third-party identity provider for your on-premises environment.
  • SAML also requires a trust to be established between your identity provider and your AWS environment, which adds more complexity to the process.
  • Because SAML is XML-based, it isn’t as concise or nimble as AWS Signature v4 or OIDC, for example.
  • You need to manage the SAML assertions and their lifecycles. This can add significant burden and complexity to your IT operations.

Kerberos

Initially developed by MIT, Kerberos v5 is an IETF standard protocol that enables client/server authentication on an unprotected network. It isn’t supported out-of-the-box by AWS, but you can use an identity provider, such as Active Directory, to exchange the Kerberos ticket provided to your application for either an OIDC/OAuth token or a SAML assertion that can be validated by AWS.

Figure 6: Kerberos authentication (through SAML or OIDC)

Kerberos is highly complex to set up, but it can make sense in cases where you already have an on-premises environment with Kerberos authentication in place.

Pros

  • With Kerberos, you not only avoid storing long-lived AWS credentials for your on-premises applications, but you can also use an existing on-premises directory, such as Active Directory, as an identity provider.

Cons

  • Using Kerberos with AWS requires the Kerberos ticket to be converted into something that can be accepted by AWS. Therefore, it requires you to use either the OIDC or SAML authentication mechanisms, as described previously.

IAM Roles Anywhere

IAM Roles Anywhere establishes a trust between your AWS account and the certificate authority (CA) that issues certificates to your on-premises workloads using public key infrastructure (PKI). For a detailed overview, see the blog post Extend AWS IAM roles to workloads outside of AWS with IAM Roles Anywhere. Your workloads outside of AWS use IAM Roles Anywhere to exchange X.509 certificates for temporary AWS credentials in order to interact with AWS APIs, thus removing the need for long-term credentials in your on-premises applications. IAM Roles Anywhere enables short-term credentials for numerous hybrid environment use cases including machine-to-machine scenarios.

Figure 7: IAMRA authentication process

IAM Roles Anywhere is a versatile authentication mechanism that can fit a lot of different use cases, including machine-to-machine scenarios where your on-premises workload is accessing AWS resources.

Pros

  • With IAM Roles Anywhere you avoid storing long-lived AWS credentials for your on-premises workloads.
  • You can import a certificate revocation list (CRL) from your certificate authority (CA) to support certificate revocation.

Cons

  • You need to manage the digital certificates and their lifecycles. This can add complexity to your IT operations.
  • IAM Roles Anywhere does not support callbacks to CRL distribution points (CDPs) or Online Certificate Status Protocol (OCSP) endpoints.

Conclusion

Now we’ll collect and summarize this discussion in the following table, with the complexity and convenience of each approach.

Authentication mechanism | AWS front-end service | Complexity | Convenience
AWS Signature v4 | All | Low | Very High
Mutual TLS | AWS IoT Core, Amazon API Gateway | Medium | High
OpenID Connect | Amazon Cognito, Amazon API Gateway | Medium | High
SAML | Amazon Cognito, AWS Identity and Access Management (IAM) | High | Medium
Kerberos | n/a | Very High | Low
IAM Roles Anywhere | AWS Identity and Access Management (IAM) | Medium | High

AWS Signature v4 is the most convenient and least complex mechanism of these options, but as for every situation, it’s important to start from your own requirements and context before making a choice. Additional factors may influence your choice, such as the structure or the culture of your organization, or the resources available for your project. Keeping the discussion focused on simple factors on purpose, we’ve come up with the following actionable decision helper.

Use AWS Signature v4 when:

  • You have access to AWS credentials (temporary or long-lived)
  • You want to call AWS services directly through their APIs

Use mutual TLS when:

  • The cost and effort of maintaining digital certificates is acceptable for your organization
  • Your organization already has a process in place to maintain digital certificates
  • You plan to call AWS services indirectly through custom-built APIs

Use OpenID Connect when:

  • You need or want to procure temporary AWS credentials by using a REST-based mechanism
  • You want to call AWS services directly through their APIs

Use SAML when:

  • You need to procure temporary AWS credentials
  • You already have a SAML-based authentication process in place
  • You want to call AWS services directly through their APIs

Use Kerberos when:

  • You already have a Kerberos-based authentication process in place
  • None of the previously mentioned mechanisms can be used for your use case

Use IAM Roles Anywhere when:

  • The cost and effort of maintaining digital certificates is acceptable for your organization
  • Your organization already has a process in place to maintain digital certificates
  • You want to call AWS services directly through their APIs
  • You need temporary security credentials for workloads such as servers, containers, and applications that run outside of AWS

We hope this post helps you find your way among the various alternatives that AWS offers to securely connect your external applications to your AWS environment, and to select the most appropriate mechanism for your specific use case. We look forward to your feedback.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on one of the AWS Developer forums or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Patrick Sard

Patrick works as a Solutions Architect at AWS. Apart from being a cloud enthusiast, Patrick loves practicing tai-chi (preferably Chen style), enjoys an occasional wine-tasting (he trained as a Sommelier), and is an avid tennis player.

Jeremy Ware

Jeremy is a Security Specialist Solutions Architect focused on Identity and Access Management. Jeremy and his team enable AWS customers to implement sophisticated, scalable, and secure IAM architecture and authentication workflows to solve business challenges. With a background in Security Engineering, Jeremy has spent many years working to raise security maturity at numerous global enterprises. Outside of work, Jeremy loves to explore the mountainous outdoors and participate in sports such as snowboarding, wakeboarding, and dirt bike riding.

AWS Local Zones and AWS Outposts, choosing the right technology for your edge workload

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/aws-local-zones-and-aws-outposts-choosing-the-right-technology-for-your-edge-workload/

This blog post is written by Joe Sacco, Senior Technical Account Manager.

The AWS Global Cloud Infrastructure includes 30 Launched Regions, 96 Availability Zones (AZs), 410+ Points of Presence with 400+ Edge Locations, and 13 Regional Edge Caches. With over 200 AWS services, most customer workloads can run in the AWS Regions. However, for some location-sensitive workloads with low-latency or data residency requirements, and when an AWS Region isn’t close enough, AWS offers two additional infrastructure options: AWS Local Zones and AWS Outposts. Although Local Zones and Outposts solve for similar problems, we’ll review use cases as well as the services and features available that can help you decide which offering best suits your needs.

Let’s start with an overview of Local Zones and Outposts.

What are Local Zones?

Local Zones are a new type of infrastructure deployment that places AWS compute, storage, database, and other select AWS services in large metropolitan areas closer to end users. This gives you access to single-digit millisecond latency with the use of AWS Direct Connect and the ability to meet data residency requirements. Local Zones are also connected to their parent Region via AWS’s redundant and high bandwidth private network. This gives applications running in Local Zones fast, secure, and seamless access to a complete list of services in the parent Region.

Unlike Outposts, which you deploy within your datacenter or a co-location of your choice, Local Zones are owned, managed, and operated by AWS. Local Zones eliminate the need for you to manage power, connectivity, and capacity. Furthermore, you can provision workloads on a Local Zone from your AWS Management Console just as you would for AZs and Regions today.

AWS Local Zones how it works

What is Outposts?

Outposts is a family of fully managed solutions delivering AWS infrastructure and services to virtually any on-premises or edge location for a truly consistent hybrid experience. Outposts lets you run some AWS services locally and connect to a broad range of services available in the local AWS Region. Outposts comes in two types of offerings: Outposts rack and Outposts servers, with which you can run applications and workloads on-premises using the same AWS infrastructure, services, tools, and APIs as in AWS Regions.

The Outposts rack is available as an industry standard 42U form factor. It provides the same AWS infrastructure, services, tools, and APIs to your data center or co-location space that you would find in an AWS Region.

Outposts Rack

The Outposts servers come in a 1U or 2U form factor and are designed for locations that have limited space or smaller capacity requirements. Both support different compute instances, as detailed in the Outposts servers feature page.

Outposts Servers

Customer use cases

Now that we have an overview of both Local Zones and Outposts service offerings, let’s dive into use cases, the differences between them, and how your business can leverage each to meet your workload requirements.

Low latency

Customers today require low latency computing for workloads, such as medical imaging, transaction processing for Enterprise Resource Planning (ERP) applications, enterprise migration with hybrid architecture, real-time multiplayer gaming, telco network function virtualization, and regulated gaming workloads.

Outposts can meet ultra-low latency requirements. This is accomplished by bringing AWS services on premises and to the edge at Outpost Sites. An Outpost site is the physical location where your Outpost operates, and it can be local within one of your data centers or at a co-location facility of your choice.

When accessed from within the same metro, Local Zones provide low single-digit millisecond latency to your applications. Latency between Local Zones and AWS Regions, or between Local Zones and on-premises environments, varies depending on how close the nearest Local Zone is and on the type of connection used (public internet, VPN, or AWS Direct Connect). You should always choose the closest Local Zone location to achieve the lowest possible latency. For use cases such as mobile gaming, you can deploy your applications to the Local Zone location nearest to your end users. Local Zones are generally available in 17 metros across the US and 4 outside the US, and we are continuing to launch Local Zones in 30 cities across 25 countries. Check out updates for more general availability of Local Zones.

Data residency

On occasion, data must remain in a specific geographic region for regulatory or information security reasons. Healthcare and other regulated industries, such as financial services or Oil & Gas, have specific data residency requirements.

Outposts helps meet a customer’s data residency requirements because it’s installed on premises and essentially brings AWS to where the data currently resides. This allows you to pick and control where your workloads run, and where your data will stay. Check out the full list of countries and territories where Outposts is available on the FAQs page of Outposts rack and the FAQs page of Outposts servers.

Local Zones bring AWS closer or within a customer’s geographic boundary in a fully AWS owned and operated mode. Although Local Zones can help meet data residency use cases in some scenarios, data residency requirements vary depending on the jurisdictions. Therefore, you should work closely with your compliance and information security teams when choosing the Local Zone location in which to deploy your regulated workloads.

Migration and modernization

When trying to migrate to the cloud and modernize your stack, some workloads can be challenging. Often there are on-premises applications which are difficult to move into Regions due to latency-sensitive system interdependencies between their various components. As dependencies arise, you may choose to segment these migrations into smaller pieces, which in turn requires latency-sensitive connectivity between the various parts of the application.

Outposts and Local Zones both allow for a gradual migration and modernization of your stack. You can choose to migrate parts of your workloads while still maintaining latency-sensitive connectivity between components until the entirety is ready to move.

Factors in selecting Local Zones or Outposts

Choosing between Local Zones and Outposts will depend on the following factors, and you should examine all of them together when selecting a service for your use case.

  1. Latency requirements

Local Zones can achieve low single millisecond latency when accessing within the same metro. On the other hand, Outposts can achieve ultra-low latency requirements when deployed within your datacenter or at a co-location facility of your choice. When selecting one over the other, you must work backward from your goal and workload requirements.

If you’re conducting a migration and modernization strategy which requires ultra-low latency between a workload’s application and database tiers that are difficult to migrate to the AWS Regions, then Outposts would be the right solution for you.

Alternatively, if your workload involves streaming live broadcasts to end users which requires low single millisecond latency, but your end users are located where an AWS Region isn’t available, then Local Zones distributed across various metros would work best to serve your content.

  2. Availability of services needed to support your workload

Local Zones and Outposts differ with their list of supported AWS services, and you must review your workload’s service requirements when determining the best fit for you. For example, if a customer has a computer vision workload that requires storing and retrieving large volumes of images locally using Amazon Simple Storage Service (Amazon S3), then Outposts and certain Local Zones meet this requirement while other Local Zones don’t. Learn how you can use Amazon S3 on Outposts for computer vision workloads.

Outposts rack and servers support different sets of AWS services locally. You can view comparisons between them, or visit the Outposts servers and Outposts rack feature sites for more details.

Local Zones’ features vary depending on the location in which you choose to deploy. You can view more details and a full list of supported features and services per location on our Local Zones features page.

  3. Investment and management of infrastructure on-premises

Management of the infrastructure and prerequisites are another factor when considering which AWS service best suits your needs.

Outposts is ordered through AWS, and it requires installation in a customer’s on-premises datacenter or co-location provider of their choice. Outposts rack installation is handled by AWS, while Outposts servers installation is done by the customer or a third-party of their choosing. There are power and redundant networking requirements for the Outpost Site, as well as a required subscription to AWS Enterprise Support or On-Ramp Support.

Local Zones infrastructure is fully-managed by AWS, including the power, networking, and capacity. This reduces operational management as well as the overhead cost for customers. An Enterprise support agreement isn’t required to utilize Local Zones.

You should always choose Regions or Local Zones if your use case allows, and use Outposts when a Region or Local Zone isn’t a good fit. If both Outposts and Local Zones fit a customer’s use case and requirements, then Local Zones will be the preferred choice.

  4. Regulations, compliance, and information security

If a Local Zone is either unavailable or unable to meet your residency requirements within your geographic boundary, consider Outposts, which can be deployed to a data center or co-location facility of your choice. Data residency requirements can be a factor based on your industry and the regulations to which your workload must adhere. Furthermore, you should work closely with your compliance and information security teams when choosing between Local Zones or Outposts.
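The four factors above can be read as a simple preference rule: use Local Zones when they fit, and fall back to Outposts otherwise. The following is only a rough sketch of that reasoning. The `Workload` fields and the `suggest_service` function are hypothetical names introduced for illustration, and a real decision needs input from your compliance, security, and infrastructure teams.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of the four selection factors."""
    needs_on_premises_latency: bool   # ultra-low latency to systems in your own data center
    local_zone_available: bool        # a Local Zone exists close enough to your users
    local_zone_has_services: bool     # that Local Zone offers every service the workload needs
    local_zone_meets_residency: bool  # compliance teams accept the Local Zone location


def suggest_service(workload: Workload) -> str:
    """Prefer Local Zones when every factor allows it; otherwise suggest Outposts."""
    local_zone_fits = (
        not workload.needs_on_premises_latency
        and workload.local_zone_available
        and workload.local_zone_has_services
        and workload.local_zone_meets_residency
    )
    return "Local Zones" if local_zone_fits else "Outposts"
```

For example, `suggest_service(Workload(False, True, True, True))` suggests Local Zones, while setting any single factor against Local Zones flips the suggestion to Outposts.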

Conclusion

Whether you’re dealing with latency-sensitive applications, data residency requirements, or a migration and modernization strategy, AWS provides options and flexibility for you to leverage the same AWS infrastructure, services, APIs, and tools to metro areas and on-premises locations with Local Zones and Outposts.

The decision of which technology to use will depend on several factors that we discussed above. You must work across teams within your organization to make sure that the latency requirements (low single millisecond latency within a metro for Local Zones vs. the ultra-low latency of Outposts when deployed close to or within your datacenter), data residency needs, installation prerequisites, and availability of services to support your workload are met.

Once these factors are taken into account, and you have made a choice, visit our product pages for Outposts and Local Zones with information on how you can get started.

Monitoring shared AWS Outposts rack capacity

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/monitoring-shared-aws-outposts-rack-capacity/

This post is written by Adam Imeson, Sr. Hybrid Edge Specialist Solutions Architect.

AWS Outposts rack is a fully-managed service that offers the same AWS infrastructure, APIs, tools, and a subset of AWS services to any data center, colocation space, or on-premises facility for a consistent hybrid experience. Outposts rack is ideal for workloads that require low latency, access to on-premises systems, local data processing, data residency, and migration of applications with local system interdependencies.

An Outpost is a pool of AWS compute and storage capacity deployed at a customer site. In an Outposts rack deployment, an Outpost may comprise one or more racks connected together at the site. It’s common for customers to order their Outpost in a dedicated account and then integrate with their multi-account organizational architecture by sharing the Outpost via AWS Resource Access Manager (AWS RAM). This post will explain how to set up cross-account Amazon CloudWatch metrics so that disparate stakeholders within your organization can effectively monitor your Outpost’s capacity to meet their specific needs.

Overview

The AWS account that you use to order an Outpost owns that Outpost. This includes all metrics and health events pertaining to that Outpost. Many customers must integrate Outposts into their multi-account environments, as discussed in the “Best practices: AWS Outposts in a multi-account AWS environment” posts (part 1 and part 2). This post will go into more detail on how to monitor Outposts in these environments.

The nuance here stems from the different ways to share access to AWS resources. AWS RAM allows infrastructure resources to be shared across multiple accounts. Then, the consumer accounts can launch resources on the infrastructure as though they owned it. AWS Identity and Access Management (IAM) allows customers to modify a given account’s permissions such that users in other accounts can make AWS API calls that affect the given account.

An Outpost provides infrastructure resources, so customers can share Outposts via AWS RAM. CloudWatch metrics about Outposts are data which customers retrieve using AWS API calls, so customers can share access to those metrics using IAM.

In a typical customer’s AWS Organization, there are two cases to consider. First, when the customer is sharing an Outpost to multiple development accounts, each account needs to view metrics relevant to the Outpost so that the development accounts can deploy and operate their applications.

Diagram depicting an Outpost that a customer has shared to three different accounts using RAM. The three different accounts each have a different application deployed in them.

Second, when the customer has several accounts that each own different Outposts, the customer’s centralized monitoring account needs to track metrics relevant to each of the Outposts.

Diagram depicting three accounts that each own a separate Outpost, with all three accounts sharing Outpost metrics to CloudWatch in the customer’s central monitoring account.

This post will explain strategies for both cases.

Customers must monitor the health of the Outpost’s connection to its regional control plane (the Outpost’s service link), as an Outpost is an extension of an AWS Availability Zone (AZ) and is designed to be connected to an AZ at all times. The health of the Outpost’s service link is a crucial variable when application owners are diagnosing disruptions to their application, and also when infrastructure owners are diagnosing disruptions to a site. Customers can monitor their service link’s status with the ConnectedStatus metric.

Customers also must monitor their Outposts’ current capacity. Outposts necessarily have a limited capacity footprint when compared to an AWS Region. Application owners must make informed decisions about capacity as they scale their apps over time or respond to occasional hardware failures. Infrastructure owners also must maintain a holistic view of capacity across all of the Outposts for which they are responsible so that they can plan for capacity expansion over time. Customers can monitor their Outposts’ capacity using the various capacity metrics that Outposts provide.

For an overview of how to set up a capacity dashboard and capacity-based CloudWatch alarms within a single account, see “Monitoring AWS Outposts capacity.” This post will expand on the single-account strategy by introducing cross-account capabilities. See also “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.” These two posts provide practical walkthroughs for setting up the metric flows explained below.

Setting up Outposts metric permissions for your organization

This post assumes that you have multiple Outposts in different accounts that are all part of the same Organization. You’re sharing these Outposts into accounts that development teams use to deploy and operate their applications. You also have a centralized monitoring account where your infrastructure team tracks various metrics across all accounts. Your Organization might look something like this:

A base diagram depicting six AWS accounts with different names. Outpost Account 1 contains an Outpost. Outpost Account 2 contains a different Outpost. Monitoring Account contains Amazon CloudWatch. Accounts A through C contain Applications A through C respectively.

The first Outpost is shared to Accounts A and B, and the second Outpost is only shared to Account B. This is just an example of how a customer might set up their environment so that Application A can deploy on Outpost 1, and Application B can deploy on both Outpost 1 and 2.

The same base diagram of the six AWS accounts as before, with arrows added to depict AWS RAM resource shares. Outpost Account 1 shares its Outpost to Accounts A and B. Outpost Account 2 shares its Outpost to Account B.
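As a rough sketch of the sharing layout described above, the following builds the request parameters for one AWS RAM resource share per Outpost. The account IDs, Outpost ARNs, and share names are placeholders, and the helper name `outpost_share_requests` is hypothetical. Each resulting dictionary uses parameter names accepted by the AWS RAM `CreateResourceShare` API, so it could be passed to boto3 as `boto3.client("ram").create_resource_share(**request)` from the account that owns the Outpost.

```python
def outpost_share_requests(shares: dict[str, list[str]]) -> list[dict]:
    """Build one CreateResourceShare request per Outpost ARN.

    `shares` maps an Outpost ARN to the account IDs that should consume it.
    """
    requests = []
    for outpost_arn, account_ids in shares.items():
        outpost_id = outpost_arn.rsplit("/", 1)[-1]
        requests.append({
            "name": f"share-{outpost_id}",      # placeholder naming scheme
            "resourceArns": [outpost_arn],
            "principals": sorted(account_ids),
            "allowExternalPrincipals": False,   # keep the share inside the Organization
        })
    return requests


# Placeholder identifiers mirroring the example: Outpost 1 is shared to
# Accounts A and B, and Outpost 2 is shared only to Account B.
ACCOUNT_A, ACCOUNT_B = "222222222222", "333333333333"
EXAMPLE_SHARES = {
    "arn:aws:outposts:us-east-1:111111111111:outpost/op-1111111111example": [ACCOUNT_A, ACCOUNT_B],
    "arn:aws:outposts:us-east-1:444444444444:outpost/op-2222222222example": [ACCOUNT_B],
}
REQUESTS = outpost_share_requests(EXAMPLE_SHARES)
```

Keeping the mapping as data makes it easy to review which application accounts can launch on which Outpost before any share is created.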

To enable centralized monitoring, each account shares CloudWatch metrics with the central monitoring account as described in “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.”

The same base diagram of the six AWS accounts, with arrows added to depict CloudWatch metrics being shared from all five of the other accounts to the Monitoring Account.

Now there are application accounts which can launch on the desired Outposts, and all of the accounts are sharing metrics with the central monitoring account. The team responsible for procuring and managing the Outposts can now set up dashboards in the central monitoring account in accordance with “Monitoring AWS Outposts capacity” to get a holistic view of capacity. This is valuable for capacity planning as applications naturally grow over time.

However, this may not be sufficient for operations. Consider that each application team needs to understand how much capacity is available on the Outpost that they’re using. This is crucial for teams operating highly available applications to maintain awareness of whether they still have N+1 capacity available on the Outpost to use in the event of a hardware failure. This is also important for planning expansions to the application ahead of time, as application teams have the best understanding of the future needs of their applications. Finally, application teams can use the metrics to track the operational health of the Outpost, which is crucial for root-causing any application disruptions.

You can implement this by sharing CloudWatch metrics from the Outpost accounts to the application accounts which are consuming the Outposts’ capacity, as shown in the following diagram.

The same base diagram of the six AWS accounts, with arrows depicting CloudWatch metrics being shared. Outpost Account 1 is sharing CloudWatch metrics to Accounts A and B. Outpost Account 2 is sharing CloudWatch metrics to Account B.

Walkthrough

Log in to your application account and navigate to the CloudWatch console. Open the Settings menu and choose Configure.

Screenshot of the CloudWatch Console’s Settings menu.

Scroll to the bottom. In the View cross-account cross-region section, choose Edit.

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console.

Choose your preferred account selection method from the three options and choose Save changes. I recommend the Custom account selector option, as it strikes a good balance between a simple setup and ease of use. If you choose this option, then input the Outpost owner account’s account ID and a human-readable name for the account. This name will appear in the drop-down when you’re using the CloudWatch console to view metrics from other accounts later.

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console, with the “Custom account selector” option selected and “123456789012 Outpost owner account” in the input field.

Your application account is now prepared to view metrics from the Outpost owner account. Now log in to the account that owns the Outpost and navigate to the CloudWatch console. You still need to share the Outpost’s metrics to the application account. Open the Settings page again, and choose Configure in the Cross-account cross-region section as before. This time, choose Share data in the Share your CloudWatch data section:

Screenshot of the Cross-account cross-region sub-menu in the CloudWatch console, with the “Share data” button circled in red in the “Share your CloudWatch data” section.

Choose Add account and input the application account’s account ID. Then scroll to the bottom of the page and choose Launch CloudFormation template.

Screenshot of the “Share your CloudWatch data” sub-menu in the CloudWatch console. The “Specific accounts” option in the “Sharing” section is highlighted, and the sample account ID “234567890123” is typed into the input field.

The AWS CloudFormation template will create the CloudWatch-CrossAccountSharingRole. This role gives CloudWatch read access to the AWS account that you specified, the application account. You can view and modify this role using the IAM console if you want to. For example, you might adjust the role to allow read access to an entire Organizational Unit (OU).
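To make the role adjustment more concrete, here is a minimal sketch of what a trust policy for such a sharing role could look like, built as a plain Python dictionary. This is an illustration rather than the exact document that the CloudFormation template generates. The function name and the sample organization path are hypothetical, and the OU variant assumes that you restrict access with the `aws:PrincipalOrgPaths` condition key.

```python
import json


def sharing_role_trust_policy(viewer_account_ids=(), org_path=None):
    """Return a trust policy that lets other accounts assume the sharing role.

    Pass explicit account IDs, or an AWS Organizations path (assumption:
    the OU is matched with the aws:PrincipalOrgPaths condition key).
    """
    statement = {"Effect": "Allow", "Action": "sts:AssumeRole"}
    if org_path is not None:
        # Any principal is allowed, but only if it belongs to the given OU path.
        statement["Principal"] = {"AWS": "*"}
        statement["Condition"] = {
            "ForAnyValue:StringLike": {"aws:PrincipalOrgPaths": [org_path]}
        }
    else:
        statement["Principal"] = {
            "AWS": [f"arn:aws:iam::{account}:root" for account in viewer_account_ids]
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


# Sample account ID taken from the walkthrough screenshot description.
print(json.dumps(sharing_role_trust_policy(["234567890123"]), indent=2))
```

Review any OU-wide variant with your security team before applying it, because it widens read access from a single account to every account under that organizational path.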

Now, log back in to the application account and navigate to the CloudWatch console. Choose All metrics in the left-side menu. In the Metrics section, select the Outpost owner account from the drop-down.

Screenshot of the CloudWatch console’s “All metrics” sub-page. The account selection drop-down is circled in red in the “Metrics” subsection.

You can now view the metrics from the Outpost owner account and incorporate them into the dashboards in the application account. Now the application teams can track the Outposts’ ConnectedStatus metrics to be alerted on any disconnections from the Region, and they can track the Outposts’ capacity metrics as well. It’s a best practice to alarm on Outpost capacity metrics once a consumption threshold defined by business needs has been breached.
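Once the metrics are visible in the application account, an alarm on the service link could be defined along the following lines. This is a sketch under several assumptions: the Outpost ID and account ID are placeholders, the evaluation settings are arbitrary, and treating missing data as breaching is one possible choice rather than a recommendation from this post. The dictionary uses the parameter names of the CloudWatch `PutMetricAlarm` API, including the `AccountId` field for cross-account metrics, so it could be passed to boto3 as `boto3.client("cloudwatch").put_metric_alarm(**params)`.

```python
def connected_status_alarm(outpost_id: str, owner_account_id: str,
                           period_seconds: int = 300) -> dict:
    """Build PutMetricAlarm parameters for an Outpost's ConnectedStatus metric."""
    return {
        "AlarmName": f"{outpost_id}-service-link-disconnected",
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": 1.0,                 # ConnectedStatus is 1 while the service link is up
        "EvaluationPeriods": 3,           # assumption: three consecutive periods
        "TreatMissingData": "breaching",  # assumption: missing data is treated as a disconnect
        "Metrics": [{
            "Id": "connected",
            "AccountId": owner_account_id,  # account that owns the Outpost and its metrics
            "ReturnData": True,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Outposts",
                    "MetricName": "ConnectedStatus",
                    "Dimensions": [{"Name": "OutpostId", "Value": outpost_id}],
                },
                "Period": period_seconds,
                "Stat": "Average",
            },
        }],
    }


# Placeholder identifiers for illustration only.
params = connected_status_alarm("op-1111111111example", "123456789012")
```

The same shape works for capacity metrics: swap the metric name and dimensions, and set the threshold to the utilization level that your teams have agreed on.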

Conclusion

Outposts rack allows customers to deploy AWS infrastructure into virtually any data center, colocation space, or on-premises facility. Outposts are tied to the AWS account that ordered them, and customers can share Outposts among AWS accounts within the same Organization. When multiple teams within a customer’s Organization are interacting with the same Outpost, that introduces additional monitoring surface area for capacity and service health. This post explains how customers can accommodate their teams’ different needs by sharing Outposts metrics around their Organization along with their Outposts. As best practices, customers should share their Outposts capacity and ConnectedStatus metrics to teams who are running applications on Outposts. Customers’ operations teams should also work with their stakeholders to define a maximum capacity utilization threshold for a given Outpost and alarm on that threshold.

BloomIP Automatically Identifies production issues with Amazon DevOps Guru

Post Syndicated from David Ernst original https://aws.amazon.com/blogs/devops/bloomip-automatically-identifies-production-issues-with-amazon-devops-guru/

Operational excellence is critical for BloomIP’s customers. In this post, you will see how we built a solution to automate the detection of trends and issues in production workloads by implementing Amazon DevOps Guru for our clients.

BloomIP ensures your business is ready for what’s ahead, with security, scalability, performance, and cost control. We are a cloud solutions partner that gets to know both the people and processes in your business.

The Challenge

Identifying operational issues within applications and services is time-consuming. This requires developers and cloud engineers to spend valuable time manually debugging using multiple tools. We needed to quickly identify any operational issues related to our clients’ applications, including any load balancer errors or user delays in accessing their application. Ensuring the application is up and running during certain times of the day is crucial to the success of our clients’ business. We needed to identify any downtime or performance patterns and quickly address any related issues.

Analyzing an AWS environment after any incident requires a combination of tools such as Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray. We spent hours poring over the information in each tool to try to identify patterns and troubleshooting steps. Still, identifying issues that correlate between those tools is a manual process.

Automating Identification of Operational Issues

To address the challenges of tedious and manual processes of analyzing different tools to identify patterns, we implemented Amazon DevOps Guru for many of our clients. Amazon DevOps Guru automatically ingests all related data from the services mentioned above and applies machine learning techniques to analyze and recommend fixes for abnormal behaviors. Amazon DevOps Guru organizes its findings into reactive and proactive insights.

We capture Amazon DevOps Guru insights as events using Amazon EventBridge, and send them to an Amazon SNS topic, which then notifies us via email and Slack.

Architecture diagram showing a typical 3-tier web app using AWS services and integrating the application with Amazon DevOps Guru, Amazon EventBridge, and an Amazon SNS topic to send notifications via email and Slack

Figure 1. Architecture diagram
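The notification flow in Figure 1 can be sketched in two pieces: an EventBridge event pattern that selects new DevOps Guru insights, and a small formatter that turns a matching event into a one-line message for email or Slack. This is a sketch and not BloomIP’s actual implementation. The `detail` field names (`insightSeverity`, `insightType`, `insightDescription`) are assumptions about the event payload, so the formatter falls back to defaults when they are missing, and the sample event contains made-up values.

```python
# Event pattern for an EventBridge rule that forwards new insights to SNS.
INSIGHT_EVENT_PATTERN = {
    "source": ["aws.devops-guru"],
    "detail-type": ["DevOps Guru New Insight Open"],
}


def format_insight(event: dict) -> str:
    """Turn a DevOps Guru insight event into a short notification line."""
    detail = event.get("detail", {})
    severity = str(detail.get("insightSeverity", "unknown")).upper()
    insight_type = detail.get("insightType", "UNKNOWN")
    description = detail.get("insightDescription", "no description provided")
    account = event.get("account", "unknown-account")
    region = event.get("region", "unknown-region")
    return f"[{severity}] {insight_type} insight in account {account} ({region}): {description}"


# Hypothetical sample event for illustration.
SAMPLE_EVENT = {
    "source": "aws.devops-guru",
    "detail-type": "DevOps Guru New Insight Open",
    "account": "123456789012",
    "region": "us-east-1",
    "detail": {
        "insightSeverity": "high",
        "insightType": "REACTIVE",
        "insightDescription": "ApiGateway 5XXError Anomalous In Stack TestStack",
    },
}
print(format_insight(SAMPLE_EVENT))
```

A formatter like this could run in a small function subscribed to the SNS topic, so that the Slack channel receives a readable summary instead of raw JSON.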

Results

BloomIP is leveraging DevOps Guru to scale its operations across multiple customers. Amazon DevOps Guru was easy to enable; it provides us with a single console experience to search and visualize operational data. In addition to detecting anomalies, we can see graphs and timelines related to the numerous anomalous metrics and more contextual information such as relevant events and log snippets. This helps us quickly understand the anomaly scope. Because it integrates data across multiple sources such as Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray, Amazon DevOps Guru reduces the need for us to use numerous tools.

“We were looking at a way to effortlessly scale our observability needs across multiple clients while ensuring we had the proper coverage. DevOps Guru gives us additional insight and assurance by quickly pointing out anomalies in our client’s environments. With ML-powered recommendations, DevOps Guru has allowed us to remediate repeated production issues automatically.” – Joshua Haynes, Director of Engineering, BloomIP

Conclusion

Amazon DevOps Guru provides BloomIP with a streamlined approach to visualizing operational data. It integrates data across multiple sources, including Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray, which reduces the need to use multiple tools. DevOps Guru gives you a single-console dashboard to look for and visualize anomalies in your operational data.

Start monitoring your AWS applications with Amazon DevOps Guru today.

About the authors:

David Ernst

David is a Sr. Specialist Solution Architect – DevOps, with 20+ years of experience in designing and implementing software solutions for various industries. David is an automation enthusiast and works with AWS customers to design, deploy, and manage their AWS workloads/architectures.

Abdullahi Olaoye

Abdullahi is a Senior Cloud Architect at AWS Professional Services where he works with customers of different scales to design and build IT solutions that solve business challenges. When he’s not working, he enjoys spending time with his family, traveling and learning history of different varieties through documentaries and podcasts.

Lower your Amazon OpenSearch Service storage cost with gp3 Amazon EBS volumes

Post Syndicated from Siddhant Gupta original https://aws.amazon.com/blogs/big-data/lower-your-amazon-opensearch-service-storage-cost-with-gp3-amazon-ebs-volumes/

Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open-source, distributed search and analytics suite comprising OpenSearch, a distributed search and analytics engine, and OpenSearch Dashboards, a UI and visualization tool. When you use Amazon OpenSearch Service, you configure a set of data nodes to store indexes and serve queries. The service supports instance types for data nodes with different storage options. Some supported Amazon Elastic Compute Cloud (Amazon EC2) instance types, like the R6GD or I3, have local NVMe disks. Others use Amazon Elastic Block Store (Amazon EBS) storage.

In July 2022, OpenSearch Service launched support for the next generation, general purpose SSD (gp3) EBS volumes. OpenSearch Service data nodes require low latency and high throughput storage to provide fast indexing and querying. With gp3 EBS volumes, you get higher baseline performance (IOPS and throughput) at a 9.6% lower cost than with the previously offered gp2 EBS volume type. You can provision additional IOPS and throughput independent of volume size using gp3. gp3 volumes are also more stable because they don’t use burst credits. OpenSearch support for gp3 volumes includes doubling the limit on per-data node volume sizes. With these larger volumes, you can reduce the cost of passive data by increasing the amount of storage per node.

We recommend that you consider gp3 as the best Amazon EBS option for price/performance and flexibility. In this post, I discuss the basics of gp3 and various cost-saving use cases. Migrating from previous generation storage (gp2, PIOPS, and magnetic) volumes to the latest generation gp3 volumes allows you to reduce monthly storage costs and optimize instance utilization.

Comparing gp2 and gp3

gp3 is the successor to the general purpose SSD gp2 volume. The key benefits of gp3 include higher baseline performance, 9.6% lower cost, and the ability to provision higher performance regardless of volume size. The following table summarizes the key differences between gp2 and gp3.

Volume size
  • gp3: Depends on instance type. OpenSearch Service supports a maximum of 24 TiB for R6g.12Xlarge. For the latest instance limits, see Amazon OpenSearch Service quotas.
  • gp2: Depends on instance type. OpenSearch Service supports a maximum of 12 TiB for R6g.12Xlarge.

Baseline IOPS
  • gp3: 3,000 IOPS for volume sizes up to 1,024 GiB. For volumes above 1,024 GiB, you get 3 IOPS/GiB, without burst credit complexity.
  • gp2: 3 IOPS/GiB (minimum 100 IOPS) to a maximum of 16,000 IOPS. Volumes smaller than 1 TiB can also burst up to 3,000 IOPS.

Max IOPS/volume
  • gp3: 16,000
  • gp2: 16,000

Baseline throughput
  • gp3: 125 MiB/s free for volume sizes up to 170 GiB, or 250 MiB/s free for volumes above 170 GiB.
  • gp2: Between 125 MiB/s and 250 MiB/s, depending on the volume size.

Max throughput/volume
  • gp3: 1,000 MiB/s
  • gp2: 250 MiB/s

Price for us-east-1 Region
  • gp3, storage: $0.122/GB-month.
  • gp3, IOPS: 3,000 IOPS free for volumes up to 1,024 GiB, or 3 IOPS/GiB free for volumes above 1,024 GiB. $0.008/provisioned IOPS-month over free limits.
  • gp3, throughput: 125 MiB/s free for volumes up to 170 GiB, or +250 MiB/s free for every 3 TiB for volumes above 170 GiB. $0.064/provisioned MiB/s-month over free limits.
  • gp2, storage: $0.135/GB-month.
  • gp2, IOPS and throughput: provisioning not allowed.

Instance supported
  • gp3: T3, C5, M5, R5, C6g, M6g, and R6g
  • gp2: T2, C4, M4, R4, T3, C5, M5, R5, C6g, M6g, and R6g
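The baseline figures in the comparison above reduce to a few small formulas. The sketch below encodes them as stated, with hypothetical function names. It deliberately doesn’t apply the per-volume maximum to gp3, because the comparison doesn’t say how that limit interacts with large OpenSearch Service volumes, and it uses the simpler “Baseline throughput” row rather than the per-3-TiB pricing note.

```python
def gp3_baseline_iops(size_gib: int) -> int:
    """3,000 IOPS up to 1,024 GiB, then 3 IOPS per GiB."""
    return 3000 if size_gib <= 1024 else 3 * size_gib


def gp3_baseline_throughput_mibps(size_gib: int) -> int:
    """125 MiB/s up to 170 GiB, 250 MiB/s above that."""
    return 125 if size_gib <= 170 else 250


def gp2_baseline_iops(size_gib: int) -> int:
    """3 IOPS per GiB, with a 100 IOPS floor and a 16,000 IOPS ceiling."""
    return min(16000, max(100, 3 * size_gib))


for size in (100, 500, 2048):
    print(size, gp2_baseline_iops(size), gp3_baseline_iops(size),
          gp3_baseline_throughput_mibps(size))
```

For a 500 GiB volume, this gives a gp2 baseline of 1,500 IOPS against 3,000 IOPS for gp3, which is where the higher baseline performance of gp3 shows up for smaller volumes.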

Lower your monthly bills with gp3

The ability to provision IOPS and throughput independent of volume size and support for denser (twice as large) volume sizes are two significant advantages of gp3 adoption. Together, these benefits enable multiple use cases to lower your monthly bills. In this section, we present a few examples of pricing comparisons for OpenSearch domains.

gp2 vs. gp3

This is the most common scenario, in which existing gp2 customers switch to gp3 and immediately begin saving 9.6% due to the lower monthly price per GB for gp3 storage. You can also benefit from the fact that gp3 supports volume sizes two times larger for the R5, R6g, M5, and M6g instance families. This means that you don’t need to spin up new instances for denser storage requirements and can achieve higher storage on the same instance. OpenSearch Service currently supports a maximum of 24 TiB of gp3 storage on R6g.12Xlarge instances.

PIOPS (io1) vs. gp3

OpenSearch Service supports the PIOPS SSD (io1) EBS volume type. You can switch to gp3 and provision additional IOPS and throughput to meet your specific performance requirements. The following table compares the monthly cost of PIOPS (io1) storage on r5.large.search instances with gp3 storage on r6g.large.search instances for storage requirements of 6 TiB and 16,000 IOPS. In this example, you would save 65% with gp3 adoption.

Instance cost
  • PIOPS (io1): 6 instances * $0.186/hr = $830/month. (r5.large.search can support up to 1 TiB storage for io1; to support 6 TiB we require six instances.)
  • gp3: 3 instances * $0.167/hr = $372/month. (r6g.large.search can support up to 2 TiB storage for gp3; to support 6 TiB we require three instances.)

Storage cost (6 TiB)
  • PIOPS (io1): 6,597 GB * $0.169/GB-month = $1,115/month. Notes: (a) Price for PIOPS (io1) is $0.169 per GB/month. (b) 6 TiB = 6,597 GB.
  • gp3: 6,597 GB * $0.122/GB-month = $805/month. Notes: (a) Price for gp3 storage is $0.122 per GB/month. (b) 6 TiB = 6,597 GB.

PIOPS cost (16,000 PIOPS)
  • PIOPS (io1): 16,000 IOPS * $0.088/IOPS-month = $1,408/month. Note: io1 PIOPS rate is $0.088 per IOPS-month.
  • gp3: 18,000 IOPS is included in the price for a 6 TiB volume of gp3; you don’t need to pay extra. Note: 3 IOPS/GiB of storage is included in the price.

Total monthly bill
  • PIOPS (io1): $3,353/month
  • gp3: $1,177/month
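The arithmetic behind this comparison can be reproduced with a short cost model. The sketch below assumes a 744-hour (31-day) month, which is inferred from the instance cost figures rather than stated in the post, and it uses the prices quoted above. The function names are hypothetical. Results may differ from the quoted totals by about a dollar because of rounding.

```python
HOURS_PER_MONTH = 744  # assumption: 31-day month, inferred from the figures above


def tib_to_gb(tib: float) -> int:
    """Convert binary TiB to decimal GB, as used for per-GB pricing."""
    return round(tib * 1024**4 / 1e9)


def monthly_cost(instances: int, hourly_rate: float, storage_gb: int,
                 storage_rate: float, piops: int = 0, piops_rate: float = 0.0) -> float:
    """Instance cost plus storage cost plus provisioned IOPS cost, per month."""
    instance_cost = instances * hourly_rate * HOURS_PER_MONTH
    return instance_cost + storage_gb * storage_rate + piops * piops_rate


storage_gb = tib_to_gb(6)
io1_total = monthly_cost(6, 0.186, storage_gb, 0.169, piops=16000, piops_rate=0.088)
gp3_total = monthly_cost(3, 0.167, storage_gb, 0.122)
savings = 1 - gp3_total / io1_total

print(f"io1: ${io1_total:,.0f}/month, gp3: ${gp3_total:,.0f}/month, savings: {savings:.0%}")
```

The same function can be reused for the other comparisons in this post by changing the instance count, hourly rate, and storage price.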

I3 vs. gp3

I3 instances include Non-Volatile Memory Express (NVMe) SSD-based instance storage optimized for low latency, very high random I/O performance, high sequential read throughput, and high IOPS. However, I3 uses older third-generation CPUs, and the largest supported storage size is 15 TiB with the i3.16xlarge.search instance. You should consider using the latest generation instances such as R6g with gp3 storage to get lower cost and better performance over I3 instances.

To comprehend the cost advantage, let’s compare I3 and gp3 for 12 TiB of data storage needs. By switching to gp3 along with the current generation of instances, you can reduce your monthly bills by 56%, according to the calculations in the following table.

On-demand instance cost for us-east-1 Region
  • I3.4xlarge: 4 instances * $1.99/hr = $5,922/month. Note: I3.4xlarge.search supports up to 3.8 TiB, so we require four instances to manage 12 TiB storage. Instance cost is $1.99/hr.
  • gp3 with R6g.xlarge: 4 instances * $0.335/hr = $996/month. Note: R6g.xlarge.search supports up to 3 TiB with gp3, so we require four instances to manage 12 TiB. Instance cost is $0.335/hr.

Storage cost (12 TiB)
  • I3.4xlarge: N/A (included in instance price).
  • gp3 with R6g.xlarge: 13,194 GB * $0.122/GB-month = $1,610/month. Notes: (a) 12 TiB = 13,194 GB. (b) Storage cost is $0.122 per GB/month.

Total monthly bill
  • I3.4xlarge: $5,922/month
  • gp3 with R6g.xlarge: $2,606/month

UltraWarm vs. gp3

UltraWarm is designed to provide inexpensive access to infrequently accessed data, such as logs older than 30 days. Warm storage is useful for indexes that aren’t actively being written to, are queried less frequently, and don’t require high performance. If you have large and query-intensive workloads and are attempting to use UltraWarm to optimize costs but are encountering higher query volumes than it can handle, you should consider moving some of the data volume to hot nodes with gp3 storage. UltraWarm will remain the least expensive option for warm (less frequently accessed) data, but you shouldn’t use it for hot data use cases. A combination of low-cost gp3 storage and denser instances can help you achieve cost-optimized higher performance for hot data.

The following table shows the monthly costs associated with running a 30 TiB UltraWarm workload, along with a comparison to the potential monthly costs of gp2 and gp3. With gp3, you can save up to 36% compared to gp2. Please note that UltraWarm setup does require hot data nodes; however, we excluded them in the UltraWarm column to focus on UltraWarm replacement costs with hot data nodes using gp2 and gp3.

Cost component | UltraWarm | All Hot (gp2 with R6g.8xlarge) | All Hot (gp3 with R6g.8xlarge)
Instance cost (On-Demand) | 2 UW large instances * $2.68/hr = $3,987/month | 4 instances * $2.677/hr = $7,966/month | 2 instances * $2.677/hr = $3,984/month
Storage cost (30 TiB) | 32,985 GB * $0.024/GB-month = $792/month | 32,985 GB * $0.135/GB-month = $4,453/month | 32,985 GB * $0.122/GB-month = $4,024/month
Total monthly bill | $4,779/month | $12,419/month | $8,008/month

Notes:
(1) ultrawarm1.large.search supports max 20 TiB, so we need two instances.
(2) r6g.8xlarge.search supports max 8 TiB with gp2, so we require four instances.
(3) r6g.8xlarge.search supports max 16 TiB with gp3, so we only require two instances.
(4) Storage price is $0.024 per GB/month for UltraWarm, $0.135 per GB/month for gp2, and $0.122 per GB/month for gp3.
(5) 30 TiB = 32,985 GB
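The arithmetic behind the table can be reproduced with a short script. This is a minimal sketch: the 744-hour month is an assumption inferred from the instance figures (2 * $2.68 * 744 is roughly $3,987), and the results can differ from the table by a dollar or so because the table rounds each line item before summing.

```python
HOURS_PER_MONTH = 744  # assumption: 31-day month, inferred from the table's figures
STORAGE_GB = 32_985    # 30 TiB expressed in GB, as stated in the table notes


def monthly_cost(instances, hourly_rate, storage_price_per_gb):
    """Return (instance_cost, storage_cost, total) in USD per month."""
    instance_cost = instances * hourly_rate * HOURS_PER_MONTH
    storage_cost = STORAGE_GB * storage_price_per_gb
    return instance_cost, storage_cost, instance_cost + storage_cost


scenarios = {
    "UltraWarm": monthly_cost(2, 2.68, 0.024),
    "All hot (gp2)": monthly_cost(4, 2.677, 0.135),
    "All hot (gp3)": monthly_cost(2, 2.677, 0.122),
}

for name, (instance_cost, storage_cost, total) in scenarios.items():
    print(f"{name}: instances ${instance_cost:,.0f} + storage ${storage_cost:,.0f} = ${total:,.0f}/month")

# Relative saving of the gp3 configuration over the gp2 configuration.
gp2_total = scenarios["All hot (gp2)"][2]
gp3_total = scenarios["All hot (gp3)"][2]
print(f"gp3 saving vs. gp2: {(gp2_total - gp3_total) / gp2_total:.1%}")
```

Changing the instance counts, hourly rates, or storage prices in `scenarios` lets you rerun the comparison for your own Region and workload size.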

All the preceding use cases are from a cost perspective. Before making any changes to the production environment, we recommend validating performance in a test environment for your unique workload and ensuring that configuration changes don’t result in performance degradation.

Optimize instance cost with gp3’s denser storage

OpenSearch Service increased the maximum volume size supported per instance for gp3 by 100% when compared to gp2 for the R5, R6g, M5, and M6g instance families due to gp3’s improved baseline performance. You can optimize your instance needs by taking advantage of the increased storage per instance volume. For example, R6g.large supports up to 2 TiB with gp3, but only 1 TiB with gp2. If you require support for 12 TiB of data storage, you can reconfigure your domains from twelve R6g.large data nodes to six in order to reduce your instance costs. For OpenSearch EBS instance-specific volume limits, refer to EBS volume size quotas.
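As a rough illustration of the sizing logic, the following sketch computes the minimum number of data nodes from a storage requirement and a per-node volume limit. It considers storage capacity only (no replicas, overhead, CPU, or memory needs), the function name is hypothetical, and the 6 TiB requirement is an example value, not a figure from this post.

```python
import math


def data_nodes_needed(required_tib, max_tib_per_node):
    """Smallest node count whose combined EBS capacity covers the requirement."""
    return math.ceil(required_tib / max_tib_per_node)


# Per-node limits for R6g.large quoted above: 1 TiB with gp2, 2 TiB with gp3.
required_tib = 6  # hypothetical storage requirement
gp2_nodes = data_nodes_needed(required_tib, 1)
gp3_nodes = data_nodes_needed(required_tib, 2)
print(f"gp2: {gp2_nodes} nodes, gp3: {gp3_nodes} nodes")
```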

Upgrade from gp2 to gp3

To use the EBS gp3 volume type, you must first upgrade your domain’s instances to supported instance types if they don’t already support gp3. For a list of OpenSearch Service supported instances, see EBS volume size quotas. The transition from gp2 to gp3 is seamless. You can upgrade domain configurations from existing EBS volume types such as gp2, Magnetic, and PIOPS (io1) to gp3 through the OpenSearch Service console or the UpdateDomainConfig API. The configuration change initiates a blue/green deployment, which runs in the background without interrupting your online traffic, prevents data loss, and, depending on the data size, completes in a few hours.

gp3 baseline performance and additional provisioning limits

One of gp3’s key features is the ability to scale IOPS and throughput independently of volume size. When your application requires more performance, you can scale up to 16,000 IOPS and 1,000 MiB/s throughput for an additional fee. OpenSearch Service EBS gp3 delivers a baseline performance of 3,000 IOPS and 125 MiB/s throughput at any volume size. In addition, OpenSearch Service provisions additional IOPS and throughput for larger volumes to ensure optimal performance: volumes above 1,024 GiB receive 3 IOPS/GiB, and volumes above 170 GiB receive an incremental 250 MiB/s for every 3 TiB of storage.

The following table outlines OpenSearch Service baseline IOPS and throughput, as well as the maximum amount you can provision. Note that your instance type may have additional limitations regarding how much and for how long it can support these performance baselines in a 24-hour period. For more information about instances and their limits, refer to Amazon EBS-optimized instances.


Volume size (GiB) | Baseline IOPS (included in storage price) | Baseline throughput in MiB/s (included in storage price) | Additional IOPS customers can provision | Additional throughput in MiB/s customers can provision
170 | 3,000 | 125 | 13,000 | 875
172 | 3,000 | 250 | 13,000 | 750
1,024 | 3,000 | 250 | 13,000 | 750
1,025 | 3,075 | 250 | 12,925 | 750
3,000 | 9,000 | 250 | 7,000 | 750
3,001 | 9,003 | 500 | 6,997 | 500
6,000 | 18,000 | 500 | NA | 500
6,001 | 18,003 | 750 | NA | 250
9,001 | 27,003 | 1,000 | NA | NA
24,000 | 72,000 | 2,000 | NA | NA
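The rows of this table follow a simple pattern, which the sketch below reproduces. The thresholds are inferred from the table rows themselves (3 IOPS/GiB above 1,024 GiB, and 250 MiB/s per started 3,000 GiB block above 170 GiB); they are assumptions for illustration, not an official formula, and the function names are hypothetical.

```python
import math

MAX_IOPS = 16_000       # maximum provisionable IOPS stated above
MAX_THROUGHPUT = 1_000  # maximum provisionable throughput (MiB/s) stated above


def gp3_baseline(volume_gib):
    """Return (baseline IOPS, baseline throughput in MiB/s) for a volume size."""
    iops = 3_000 if volume_gib <= 1_024 else 3 * volume_gib
    throughput = 125 if volume_gib <= 170 else 250 * math.ceil(volume_gib / 3_000)
    return iops, throughput


def additional_provisionable(volume_gib):
    """Return (extra IOPS, extra MiB/s) still available; None where the table shows NA."""
    iops, throughput = gp3_baseline(volume_gib)
    extra_iops = MAX_IOPS - iops if iops < MAX_IOPS else None
    extra_throughput = MAX_THROUGHPUT - throughput if throughput < MAX_THROUGHPUT else None
    return extra_iops, extra_throughput


for size in (170, 172, 1_024, 1_025, 3_000, 3_001, 6_000, 6_001, 9_001, 24_000):
    print(size, gp3_baseline(size), additional_provisionable(size))
```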

Do you need additional performance?

In the majority of use cases, you don’t need to provision additional IOPS and throughput, and gp3 baseline performance should suffice. You can use Amazon CloudWatch metrics to find the usage patterns, and if you observe current limits of IOPS and throughput bottlenecking your index and query performance, you should provision additional performance. For more information, refer to EBS volume metrics.

Conclusion

This post explains how OpenSearch Service general purpose SSD gp3 volumes can significantly reduce monthly storage and instance costs, making them more cost-effective than gp2 volumes. Migration to gp3 volumes with the same size and performance configurations as gp2 is the quickest and simplest way to reduce costs. You should also consider reducing instance costs by taking advantage of gp3’s support for denser storage per data node.

For more details, check out Amazon OpenSearch Service pricing and Configuration API reference for Amazon OpenSearch Service.


About the author

Siddhant Gupta is a Sr. Technical Product Manager at Amazon Web Services based in Hyderabad, India. Siddhant has been with Amazon for over five years and is currently working with the OpenSearch Service team, helping with new region launches, pricing strategy, and bringing EC2 and EBS innovations to OpenSearch Service customers. He is passionate about analytics and machine learning. In his free time, he loves traveling, fitness activities, spending time with his family, and reading non-fiction books.

Establishing a data perimeter on AWS: Allow only trusted identities to access company data

Post Syndicated from Tatyana Yatskevich original https://aws.amazon.com/blogs/security/establishing-a-data-perimeter-on-aws-allow-only-trusted-identities-to-access-company-data/

As described in an earlier blog post, Establishing a data perimeter on AWS, Amazon Web Services (AWS) offers a set of capabilities you can use to implement a data perimeter to help prevent unintended access. One type of unintended access that companies want to prevent is access to corporate data by users who do not belong to the company. The identity perimeter is the combination of AWS Identity and Access Management (AWS IAM) features and capabilities that helps you achieve this goal in AWS while fostering innovation and agility. In this blog post, I will provide an overview of some of the security risks the identity perimeter is designed to address, policy examples, and implementation guidance for establishing the perimeter.

The identity perimeter is a set of coarse-grained preventative controls that help achieve the following objectives:

  • Only trusted identities can access my resources
  • Only trusted identities are allowed from my network

Trusted identities encompass IAM principals that belong to your company, which is typically represented by an AWS Organizations organization. In AWS, an IAM principal is a person or application that can make a request for an action or operation on an AWS resource. There are also scenarios when AWS services perform actions on your behalf using identities that do not belong to your organization. You should consider both types of data access patterns when you create a definition of trusted identities that is specific to your company and your use of AWS services. All other identities are considered untrusted and should have no access except by explicit exception.

Security risks addressed by the identity perimeter

The identity perimeter helps address several security risks, including the following.

Unintended data disclosure due to misconfiguration. Some AWS services support resource-based IAM policies that you can use to grant principals (including principals outside of your organization) permissions to perform actions on the resources they are attached to. While this allows developers to configure resource-based policies based on their application requirements, you should ensure that access by untrusted identities is prohibited even if the developers grant broad access to your resources, such as Amazon Simple Storage Service (Amazon S3) buckets. Figure 1 illustrates examples of access patterns you would want to prevent, specifically principals outside of your organization accessing your S3 bucket from a non-corporate AWS account, your on-premises network, or the internet.

Figure 1: Unintended access to your S3 bucket by identities outside of your organization


Unintended data disclosure through non-corporate credentials. Some AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2) and AWS Lambda, let you run code using the IAM credentials of your choosing. Similar to on-premises environments where developers might have access to physical and virtual servers, there is a risk that the developers can bring personal IAM credentials to a corporate network and attempt to move company data to personal AWS resources. For example, Figure 2 illustrates unintended access patterns where identities outside of your AWS Organizations organization are used to transfer data from your on-premises networks or VPC to an S3 bucket in a non-corporate AWS account.

Figure 2: Unintended access from your networks by identities outside of your organization


Implementing the identity perimeter

Before you can implement the identity perimeter by using preventative controls, you need to have a way to evaluate whether a principal is trusted and do this evaluation effectively in a multi-account AWS environment. IAM policies allow you to control access based on whether the IAM principal belongs to a particular account or an organization, with the following IAM condition keys:

  • The aws:PrincipalOrgID condition key gives you a succinct way to refer to all IAM principals that belong to a particular organization. There are similar condition keys, such as aws:PrincipalOrgPaths and aws:PrincipalAccount, that allow you to define different granularities of trust.
  • The aws:PrincipalIsAWSService condition key gives you a way to refer to AWS service principals when those are used to access resources on your behalf. For example, when you create a flow log with an S3 bucket as the destination, VPC Flow Logs uses a service principal, delivery.logs.amazonaws.com, which does not belong to your organization, to publish logs to Amazon S3.

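In the context of the identity perimeter, there are two types of IAM policies that can help you ensure that the call to an AWS resource is made by a trusted identity:

  • Resource-based policies, which specify who has access to a resource and what actions they can perform
  • VPC endpoint policies, which define the maximum access allowed through a VPC endpoint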

Using the IAM condition keys and the policy types just listed, you can now implement the identity perimeter. The following table illustrates the relationship between identity perimeter objectives and the AWS capabilities that you can use to achieve them.

Data perimeter | Control objective | Implemented by using | Primary IAM capability
Identity | Only trusted identities can access my resources. | Resource-based policies | aws:PrincipalOrgID, aws:PrincipalIsAWSService
Identity | Only trusted identities are allowed from my network. | VPC endpoint policies | aws:PrincipalOrgID, aws:PrincipalIsAWSService

Let’s see how you can use these capabilities to mitigate the risk of unintended access to your data.

Only trusted identities can access my resources

Resource-based policies allow you to specify who has access to the resource and what actions they can perform. Resource-based policies also allow you to apply identity perimeter controls to mitigate the risk of unintended data disclosure due to misconfiguration. The following is an example of a resource-based policy for an S3 bucket that limits access to only trusted identities. Make sure to replace <DOC-EXAMPLE-MY-BUCKET> and <MY-ORG-ID> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceIdentityPerimeter",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<DOC-EXAMPLE-MY-BUCKET>",
        "arn:aws:s3:::<DOC-EXAMPLE-MY-BUCKET>/*"
      ],
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:PrincipalOrgID": "<MY-ORG-ID>"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}

The Deny statement in the preceding policy has two condition keys where both conditions must resolve to true to invoke the Deny effect. This means that this policy will deny any S3 action unless it is performed by an IAM principal within your organization (StringNotEqualsIfExists with aws:PrincipalOrgID) or a service principal (BoolIfExists with aws:PrincipalIsAWSService). Note that resource-based policies on AWS resources do not allow access outside of the account by default. Therefore, in order for another account or an AWS service to be able to access your resource directly, you need to explicitly grant access permissions with appropriate Allow statements added to the preceding policy.
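To make the evaluation logic concrete, here is a deliberately simplified model of how the two condition operators combine. It is a sketch for reasoning about the policy, not IAM's evaluation engine: the request context is modeled as a plain dictionary, the function names and the organization IDs are hypothetical, and "not denied" only means the Deny statement does not match (an Allow is still required for access).

```python
def string_not_equals_if_exists(context, key, expected):
    """True when the key is absent, or present with a value other than `expected`."""
    return key not in context or context[key] != expected


def bool_if_exists(context, key, expected):
    """True when the key is absent, or present with the expected boolean value."""
    return key not in context or context[key] == expected


def deny_applies(context, org_id):
    """Both condition operators must evaluate to true for the Deny to take effect."""
    return (
        string_not_equals_if_exists(context, "aws:PrincipalOrgID", org_id)
        and bool_if_exists(context, "aws:PrincipalIsAWSService", False)
    )


ORG_ID = "o-example1234"  # hypothetical organization ID

requests = {
    "principal in my organization": {
        "aws:PrincipalOrgID": ORG_ID,
        "aws:PrincipalIsAWSService": False,
    },
    "principal in another organization": {
        "aws:PrincipalOrgID": "o-other56789",
        "aws:PrincipalIsAWSService": False,
    },
    # Service principals do not belong to an organization, so the org key is absent.
    "AWS service principal": {"aws:PrincipalIsAWSService": True},
}

for label, context in requests.items():
    print(f"{label}: {'denied' if deny_applies(context, ORG_ID) else 'not denied'}")
```

In this model, only the principal from another organization satisfies both conditions, so it is the only request the Deny statement matches.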

Some AWS resources allow sharing through the use of AWS Resource Access Manager (AWS RAM). When you create a resource share in AWS RAM, you should choose Allow sharing with principals in your organization only to help prevent access from untrusted identities. In addition to the primary capabilities for the identity perimeter, you should also use the ram:RequestedAllowsExternalPrincipals condition key in the AWS Organizations service control policies (SCPs) to specify that resource shares cannot be created or modified to allow sharing with untrusted identities. For an example SCP, see Example service control policies for AWS Organizations and AWS RAM in the AWS RAM User Guide.

Only trusted identities are allowed from my network

When you access AWS services from on-premises networks or VPCs, you can use public service endpoints or connect to supported AWS services by using VPC endpoints. VPC endpoints allow you to apply identity perimeter controls to mitigate the risk of unintended data disclosure through non-corporate credentials. The following is an example of a VPC endpoint policy that allows access to all actions but limits the access to trusted identities only. Replace <MY-ORG-ID> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRequestsByOrgsIdentities",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "<MY-ORG-ID>"
        }
      }
    },
    {
      "Sid": "AllowRequestsByAWSServicePrincipals",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "Bool": {
          "aws:PrincipalIsAWSService": "true"
        }
      }
    }
  ]
}

As opposed to the resource-based policy example, the preceding policy uses Allow statements to enforce the identity perimeter. This is because VPC endpoint policies do not grant any permissions but define the maximum access allowed through the endpoint. Your developers will be using identity-based or resource-based policies to grant permissions required by their applications. We use two statements in this example policy to invoke the Allow effect in two scenarios: if an action is performed by an IAM principal that belongs to your organization (StringEquals with aws:PrincipalOrgID in the AllowRequestsByOrgsIdentities statement) or if an action is performed by a service principal (Bool with aws:PrincipalIsAWSService in the AllowRequestsByAWSServicePrincipals statement). We do not use IfExists at the end of the condition operators in this case, because we want the condition elements to evaluate to true only if the specified keys exist in the request.
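The difference between a plain condition operator and its IfExists variant can be sketched as follows. This is a simplified model with the request context as a dictionary, not the actual IAM evaluation logic; the function names and the organization ID are hypothetical.

```python
def string_equals(context, key, expected):
    """Plain operator: true only if the key is present and matches."""
    return key in context and context[key] == expected


def string_equals_if_exists(context, key, expected):
    """IfExists variant: also true when the key is missing from the request."""
    return key not in context or context[key] == expected


def endpoint_allows(context, org_id):
    """Model of the two Allow statements: either one matching is enough."""
    by_org_identity = string_equals(context, "aws:PrincipalOrgID", org_id)
    by_service_principal = context.get("aws:PrincipalIsAWSService") is True
    return by_org_identity or by_service_principal


ORG_ID = "o-example1234"  # hypothetical organization ID

# A request whose principal has no organization membership: the org key is absent.
no_org_context = {"aws:PrincipalIsAWSService": False}

print("StringEquals:", string_equals(no_org_context, "aws:PrincipalOrgID", ORG_ID))
print("StringEqualsIfExists:", string_equals_if_exists(no_org_context, "aws:PrincipalOrgID", ORG_ID))
print("Allowed through endpoint:", endpoint_allows(no_org_context, ORG_ID))
```

With the IfExists variant, the request without an organization key would match the Allow statement, which is exactly what the endpoint policy is meant to avoid.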

It is important to note that in order to apply the VPC endpoint policies to requests originating from your on-premises environment, you need to configure private connectivity to AWS through AWS Direct Connect and/or AWS Site-to-Site VPN. Proper routing rules and DNS configurations will help you to ensure that traffic to AWS services is flowing through your VPC interface endpoints and is governed by the applied policies for supported services. You might also need to implement a mechanism to prevent cross-Region API requests from bypassing the identity perimeter controls within your network.

Extending your identity perimeter

There might be circumstances when you want to grant access to your resources to principals outside of your organization. For example, you might be hosting a dataset in an Amazon S3 bucket that is being accessed by your business partners from their own AWS accounts. In order to support this access pattern, you can use the aws:PrincipalAccount condition key to include third-party account identities as trusted identities in a policy. This is shown in the following resource-based policy example. Replace <DOC-EXAMPLE-MY-BUCKET>, <MY-ORG-ID>, <THIRD-PARTY-ACCOUNT-A>, and <THIRD-PARTY-ACCOUNT-B> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceIdentityPerimeter",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<DOC-EXAMPLE-MY-BUCKET>",
        "arn:aws:s3:::<DOC-EXAMPLE-MY-BUCKET>/*"
      ],
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:PrincipalOrgID": "<MY-ORG-ID>",
          "aws:PrincipalAccount": [
            "<THIRD-PARTY-ACCOUNT-A>",
            "<THIRD-PARTY-ACCOUNT-B>"
          ]
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}

The preceding policy adds the aws:PrincipalAccount condition key to the StringNotEqualsIfExists operator. You now have a Deny statement with three condition keys where all three conditions must resolve to true to invoke the Deny effect. Therefore, this policy denies any S3 action unless it is performed by an IAM principal that belongs to your organization (StringNotEqualsIfExists with aws:PrincipalOrgID), by an IAM principal that belongs to one of the specified third-party accounts (StringNotEqualsIfExists with aws:PrincipalAccount), or by a service principal (BoolIfExists with aws:PrincipalIsAWSService).
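If you manage many buckets, you might generate this policy from a template rather than editing JSON by hand. The following is a minimal sketch that builds the same document with Python's standard library; the function name, bucket name, organization ID, and account IDs are hypothetical placeholders.

```python
import json


def build_bucket_perimeter_policy(bucket, org_id, third_party_accounts=()):
    """Build the identity perimeter bucket policy shown above as a dictionary."""
    string_conditions = {"aws:PrincipalOrgID": org_id}
    if third_party_accounts:
        # Only add the key when there are trusted third-party accounts to list.
        string_conditions["aws:PrincipalAccount"] = list(third_party_accounts)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "EnforceIdentityPerimeter",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {
                    "StringNotEqualsIfExists": string_conditions,
                    "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
                },
            }
        ],
    }


policy = build_bucket_perimeter_policy(
    "amzn-s3-demo-bucket",             # hypothetical bucket name
    "o-example1234",                   # hypothetical organization ID
    ["111122223333", "444455556666"],  # hypothetical third-party account IDs
)
print(json.dumps(policy, indent=2))
```

Calling the function without third-party accounts yields the stricter policy from the earlier section, so the same template covers both cases.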

There might also be circumstances when you want to grant access from your networks to identities external to your organization. For example, your applications could be uploading or downloading objects to or from a third-party S3 bucket by using third-party generated pre-signed Amazon S3 URLs. The principal that generates the pre-signed URL will belong to the third-party AWS account. Similar to the previously discussed S3 bucket policy, you can extend your identity perimeter to include identities that belong to trusted third-party accounts by using the aws:PrincipalAccount condition key in your VPC endpoint policy.

Additionally, some AWS services make unauthenticated requests to AWS-owned resources through your VPC endpoint. An example of such a pattern is Kernel Live Patching on Amazon Linux 2, which allows you to apply security vulnerability and critical bug patches to a running Linux kernel. Amazon EC2 makes an unauthenticated call to Amazon S3 to download packages from Amazon Linux repositories hosted on Amazon EC2 service-owned S3 buckets. To include this access pattern in your identity perimeter definition, you can choose to allow unauthenticated API calls to AWS-owned resources in the VPC endpoint policies.

The following example VPC endpoint policy demonstrates how to extend your identity perimeter to include access to Amazon Linux repositories and to Amazon S3 buckets owned by a third-party. Replace <MY-ORG-ID>, <REGION>, <ACTION>, <THIRD-PARTY-ACCOUNT-A>, and <THIRD-PARTY-BUCKET-ARN> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRequestsByOrgsIdentities",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "<MY-ORG-ID>"
        }
      }
    },
    {
      "Sid": "AllowRequestsByAWSServicePrincipals",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "Bool": {
          "aws:PrincipalIsAWSService": "true"
        }
      }
    },
    {
      "Sid": "AllowUnauthenticatedRequestsToAWSResources",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::packages.<REGION>.amazonaws.com/*",
        "arn:aws:s3:::repo.<REGION>.amazonaws.com/*",
        "arn:aws:s3:::amazonlinux.<REGION>.amazonaws.com/*",
        "arn:aws:s3:::amazonlinux-2-repos-<REGION>/*"
      ]
    },
    {
      "Sid": "AllowRequestsByThirdPartyIdentitiesToThirdPartyResources",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "<ACTION>",
      "Resource": "<THIRD-PARTY-BUCKET-ARN>",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalAccount": [
            "<THIRD-PARTY-ACCOUNT-A>"
          ]
        }
      }
    }
  ]
}

The preceding example adds two new statements to the VPC endpoint policy. The AllowUnauthenticatedRequestsToAWSResources statement allows the s3:GetObject action on buckets that host Amazon Linux repositories. The AllowRequestsByThirdPartyIdentitiesToThirdPartyResources statement allows actions on resources owned by a third-party entity by principals that belong to the third-party account (StringEquals with aws:PrincipalAccount).

Note that identity perimeter controls do not eliminate the need for additional network protections, such as making sure that your private EC2 instances or databases are not inadvertently exposed to the internet due to overly permissive security groups.

Apart from preventative controls established by the identity perimeter, we also recommend that you configure AWS Identity and Access Management Access Analyzer. IAM Access Analyzer helps you identify unintended access to your resources and data by monitoring policies applied to supported resources. You can review IAM Access Analyzer findings to identify resources that are shared with principals that do not belong to your AWS Organizations organization. You should also consider enabling Amazon GuardDuty to detect misconfigurations or anomalous access to your resources that could lead to unintended disclosure of your data. GuardDuty uses threat intelligence, machine learning, and anomaly detection to analyze data from various sources in your AWS accounts. You can review GuardDuty findings to identify unexpected or potentially malicious activity in your AWS environment, such as an IAM principal with no previous history invoking an S3 API.

IAM policy samples

This AWS git repository contains policy examples that illustrate how to implement identity perimeter controls for a variety of AWS services and actions. The policy samples do not represent a complete list of valid data access patterns and are for reference purposes only. They are intended for you to tailor and extend to suit the needs of your environment. Make sure that you thoroughly test the provided example policies before you implement them in your production environment.

Deploying the identity perimeter at scale

As discussed earlier, you implement the identity perimeter as coarse-grained preventative controls. These controls typically need to be implemented for each VPC by using VPC endpoint policies and on all resources that support resource-based policies. The effectiveness of these controls relies on their ability to scale with the environment and to adapt to its dynamic nature.

The methodology you use to deploy identity perimeter controls will depend on the deployment mechanisms you use to create and manage AWS accounts. For example, you might choose to use AWS Control Tower and the Customizations for AWS Control Tower solution (CfCT) to govern your AWS environment at scale. You can use CfCT or your custom CI/CD pipeline to deploy VPC endpoints and VPC endpoint policies that include your identity perimeter controls.

Because developers will be creating resources such as S3 buckets and AWS KMS keys on a regular basis, you might need to implement automation to enforce identity perimeter controls when those resources are created or their policies are changed. One option is to use custom AWS Config rules. Alternatively, you can choose to enforce resource deployment through AWS Service Catalog or a CI/CD pipeline. With the AWS Service Catalog approach, you can have identity perimeter controls built into the centrally controlled products that are made available to developers to deploy within their accounts. With the CI/CD pipeline approach, the pipeline can have built-in compliance checks that enforce identity perimeter controls during the deployment. If you are deploying resources with your CI/CD pipeline by using AWS CloudFormation, see the blog post Proactively keep resources secure and compliant with AWS CloudFormation Hooks.
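As an illustration of what such a compliance check might look like, the sketch below inspects a bucket policy document for the Deny statement used in this post. It is a hypothetical helper with a narrow scope: it only recognizes the exact condition pattern shown earlier and does not verify the Action or Resource scope, so treat it as a starting point rather than a complete policy analysis.

```python
def enforces_identity_perimeter(policy, org_id):
    """Return True if a Deny statement carries the org and service-principal conditions."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    for statement in statements:
        condition = statement.get("Condition", {})
        org_check = condition.get("StringNotEqualsIfExists", {}).get("aws:PrincipalOrgID")
        service_check = condition.get("BoolIfExists", {}).get("aws:PrincipalIsAWSService")
        if (
            statement.get("Effect") == "Deny"
            and statement.get("Principal") == "*"
            and org_check == org_id
            and str(service_check).lower() == "false"
        ):
            return True
    return False


ORG_ID = "o-example1234"  # hypothetical organization ID

compliant = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": "*",
        "Condition": {
            "StringNotEqualsIfExists": {"aws:PrincipalOrgID": ORG_ID},
            "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
        },
    }],
}
# A policy that only grants access and carries no perimeter statement.
non_compliant = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Principal": "*", "Action": "s3:GetObject", "Resource": "*"}],
}

print("compliant policy passes:", enforces_identity_perimeter(compliant, ORG_ID))
print("non-compliant policy passes:", enforces_identity_perimeter(non_compliant, ORG_ID))
```

A check like this could run inside a pipeline step or a custom rule before a policy is deployed; how you wire it in depends on your tooling.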

Regardless of the deployment tools you select, identity perimeter controls, along with other baseline security controls applicable to your multi-account environment, should be included in your account provisioning process. You should also audit your identity perimeter configurations periodically and upon changes in your organization, which could lead to modifications in your identity perimeter controls (for example, disabling a third-party integration). Keeping your identity perimeter controls up to date will help ensure that they are consistently enforced and help prevent unintended access during the entire account lifecycle.

Conclusion

In this blog post, you learned about the foundational elements that are needed to define and implement the identity perimeter, including sample policies that you can use to start defining guardrails that are applicable to your environment and control objectives.

Following are additional resources that will help you further explore the identity perimeter topic, including a whitepaper and a hands-on workshop.

If you have any questions, comments, or concerns, contact AWS Support or browse AWS re:Post. If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

Tatyana Yatskevich


Tatyana is a Principal Solutions Architect in AWS Identity. She works with customers to help them build and operate in AWS in the most secure and efficient manner.

Build your Apache Hudi data lake on AWS using Amazon EMR – Part 1

Post Syndicated from Suthan Phillips original https://aws.amazon.com/blogs/big-data/part-1-build-your-apache-hudi-data-lake-on-aws-using-amazon-emr/

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by bringing core warehouse and database functionality directly to a data lake on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Hudi provides table management, instantaneous views, efficient upserts/deletes, advanced indexes, streaming ingestion services, data and file layout optimizations (through clustering and compaction), and concurrency control, all while keeping your data in open-source file formats such as Apache Parquet and Apache Avro. Furthermore, Apache Hudi is integrated with open-source big data analytics frameworks, such as Apache Spark, Apache Hive, Apache Flink, Presto, and Trino.

In this post, we cover best practices when building Hudi data lakes on AWS using Amazon EMR. This post assumes that you understand Hudi data layout, file layout, and table and query types. The configuration and features can change with new Hudi versions; the concepts in this post apply to Hudi versions 0.11.0 (Amazon EMR release 6.7), 0.11.1 (Amazon EMR release 6.8), and 0.12.1 (Amazon EMR release 6.9).

Specify the table type: Copy on Write Vs. Merge on Read

When we write data into Hudi, we have the option to specify the table type: Copy on Write (CoW) or Merge on Read (MoR). This decision has to be made at the initial setup, and the table type can’t be changed after the table has been created. These two table types offer different trade-offs between ingest and query performance, and the data files are stored differently based on the chosen table type. If you don’t specify it, the default storage type CoW is used.
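As a small illustration, the table type is set through a Hudi writer option. The sketch below only assembles the options dictionary; the helper name and the table name are hypothetical, while `hoodie.table.name` and `hoodie.datasource.write.table.type` are Hudi configuration keys.

```python
VALID_TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")


def hudi_table_type_options(table_name, table_type="COPY_ON_WRITE"):
    """Return Hudi writer options that fix the table type at creation time."""
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"table_type must be one of {VALID_TABLE_TYPES}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
    }


opts = hudi_table_type_options("orders_mor", "MERGE_ON_READ")  # hypothetical table name
print(opts)

# With an existing Spark DataFrame `df`, the options would typically be passed as:
# df.write.format("hudi").options(**opts).mode("append").save("s3://<bucket>/<path>/")
```

A real write also needs the record key, partition path, and preCombine settings discussed later in this post.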

The following table summarizes the feature comparison of the two storage types.

CoW | MoR
Data is stored in base files (columnar Parquet format). | Data is stored as a combination of base files (columnar Parquet format) and log files with incremental changes (row-based Avro format).
COMMIT: Each new write creates a new version of the base files, which contain merged records from older base files and newer incoming records. Each write atomically adds a commit action to the timeline, guaranteeing that a write (and all its changes) either entirely succeeds or is entirely rolled back. | DELTA_COMMIT: Each new write creates incremental log files for updates, which are associated with the base Parquet files. For inserts, it creates a new version of the base file similar to CoW. Each write adds a delta commit action to the timeline.
Write
In case of updates, write latency is higher than MoR due to the merge cost, because each affected Parquet file must be rewritten in full with the merged updates. Additionally, writing in the columnar Parquet format (for CoW updates) is slower than writing in the row-based Avro format (for MoR updates). | No merge cost for updates during write time, and the write operation is faster because it just appends the data changes to the new log file corresponding to the base file each time.
Compaction isn’t needed because all data is directly written to Parquet files. | Compaction is required to merge the base and log files to create a new version of the base file.
Higher write amplification because new versions of base files are created for every write. Write cost will be O(number of files in storage modified by the write). | Lower write amplification because updates go to log files. Write cost will be O(1) for update-only datasets and can get higher when there are new inserts.
Read
CoW tables support snapshot queries and incremental queries. | MoR offers two ways to query the same underlying storage: ReadOptimized tables and Near-Realtime tables (snapshot queries). ReadOptimized tables support read-optimized queries, and Near-Realtime tables support snapshot queries and incremental queries.
Read-optimized queries aren’t applicable for CoW because data is already merged to base files while writing. | Read-optimized queries show the latest compacted data, which doesn’t include the freshest updates in the not yet compacted log files.
Snapshot queries have no merge cost during read. | Snapshot queries merge data while reading if not compacted and therefore can be slower than CoW while querying the latest data.

CoW is the default storage type and is preferred for simple read-heavy use cases. Use cases with the following characteristics are recommended for CoW:

  • Tables with a lower ingestion rate and use cases without real-time ingestion
  • Use cases requiring the freshest data with minimal read latency because merging cost is taken care of at the write phase
  • Append-only workloads where existing data is immutable

MoR is recommended for tables with write-heavy and update-heavy use cases. Use cases with the following characteristics are recommended for MoR:

  • Faster ingestion requirements and real-time ingestion use cases
  • Varying or bursty write patterns (for example, ingesting bulk random deletes in an upstream database) due to the zero-merge cost for updates during write time
  • Streaming use cases
  • Mix of downstream consumers, where some are looking for fresher data by paying some additional read cost, and others need faster reads with some trade-off in data freshness

For streaming use cases demanding strict ingestion performance with MoR tables, we suggest running the table services (for example, compaction and cleaning) asynchronously, which is discussed in the upcoming Part 3 of this series.

For more details on table types and use cases, refer to How do I choose a storage type for my workload?
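The guidance above can be condensed into a small rule of thumb. The following Python sketch is illustrative only: choose_table_type and its two flags are hypothetical names, and the decision rule is a simplification of the recommendations in this section. The option key hoodie.datasource.write.table.type and its two values are Hudi write configurations.

```python
def choose_table_type(update_heavy: bool, streaming: bool) -> str:
    """Hypothetical helper: map workload traits to a Hudi table type."""
    # Write-heavy, update-heavy, or streaming workloads lean toward MoR.
    if update_heavy or streaming:
        return "MERGE_ON_READ"
    # CoW is the default and suits simple read-heavy or append-only workloads.
    return "COPY_ON_WRITE"


def table_type_option(table_type: str) -> dict:
    """Build the write option that selects the table type."""
    return {"hoodie.datasource.write.table.type": table_type}


cow_options = table_type_option(choose_table_type(update_heavy=False, streaming=False))
mor_options = table_type_option(choose_table_type(update_heavy=True, streaming=True))
```

In practice you would weigh more factors (read latency targets, compaction budget) than these two flags capture.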

Select the record key, key generator, preCombine field, and record payload

This section discusses the basic configurations for the record key, key generator, preCombine field, and record payload.

Record key

Every record in Hudi is uniquely identified by a Hoodie key (similar to primary keys in databases), which is usually a pair of record key and partition path. With Hoodie keys, you can enable efficient updates and deletes on records, as well as avoid duplicate records. Hudi partitions have multiple file groups, and each file group is identified by a file ID. Hudi maps Hoodie keys to file IDs, using an indexing mechanism.

A record key that you select from your data can be unique within a partition or across partitions. If the selected record key is unique within a partition, it can be uniquely identified in the Hudi dataset using the combination of the record key and partition path. You can also combine multiple fields from your dataset into a compound record key. Record keys cannot be null.

Key generator

Key generators are different implementations to generate record keys and partition paths based on the values specified for these fields in the Hudi configuration. The right key generator has to be configured depending on the type of key (simple or composite key) and the column data type used in the record key and partition path columns (for example, TimestampBasedKeyGenerator is used for timestamp data type partition path). Hudi provides several key generators out of the box, which you can specify in your job using the following configuration.

Configuration Parameter | Description | Value
hoodie.datasource.write.keygenerator.class | Key generator class, which generates the record key and partition path | Default value is SimpleKeyGenerator

The following table describes the different types of key generators in Hudi.

Key Generator | Use Case
SimpleKeyGenerator | Use this key generator if your record key refers to a single column by name and similarly your partition path also refers to a single column by name.
ComplexKeyGenerator | Use this key generator when record key and partition paths comprise multiple columns. Columns are expected to be comma-separated in the config value (for example, "hoodie.datasource.write.recordkey.field" : "col1,col4").
GlobalDeleteKeyGenerator | Use this key generator when you can’t determine the partition of incoming records to be deleted and need to delete only based on record key. This key generator ignores the partition path while generating keys to uniquely identify Hudi records. When using this key generator, set the config hoodie.[bloom|simple|hbase].index.update.partition.path to false in order to avoid redundant data written to the storage.
NonPartitionedKeyGenerator | Use this key generator for non-partitioned datasets because it returns an empty partition for all records.
TimestampBasedKeyGenerator | Use this key generator for a timestamp data type partition path. With this key generator, the partition path column values are interpreted as timestamps. The record key is the same as before, which is a single column converted to string. If using TimestampBasedKeyGenerator, a few more configs need to be set.
CustomKeyGenerator | Use this key generator to take advantage of the benefits of SimpleKeyGenerator, ComplexKeyGenerator, and TimestampBasedKeyGenerator all at the same time. With this you can configure record key and partition paths as a single field or a combination of fields. This is helpful if you want to generate nested partitions with each partition key of different types (for example, field_3:simple,field_5:timestamp). For more information, refer to CustomKeyGenerator.

The key generator class can be automatically inferred by Hudi if the specified record key and partition path require a SimpleKeyGenerator or ComplexKeyGenerator, depending on whether there are single or multiple record key or partition path columns. For all other cases, you need to specify the key generator.
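As a rough illustration of the selection logic described in this section, the following Python sketch returns a key generator name for a given set of key and partition columns. The function pick_key_generator and its flags are hypothetical, the order of the checks is an assumption, and the names returned are the short names used in this post rather than fully qualified class names.

```python
def pick_key_generator(record_key_fields, partition_fields, *,
                       timestamp_partition=False,
                       mixed_partition_types=False,
                       delete_without_partition=False):
    """Hypothetical helper mirroring the key generator table above."""
    if delete_without_partition:
        # Deletes keyed only on the record key, partition unknown.
        return "GlobalDeleteKeyGenerator"
    if not partition_fields:
        # Non-partitioned dataset.
        return "NonPartitionedKeyGenerator"
    if mixed_partition_types:
        # Nested partitions whose keys have different types.
        return "CustomKeyGenerator"
    if timestamp_partition:
        return "TimestampBasedKeyGenerator"
    if len(record_key_fields) == 1 and len(partition_fields) == 1:
        return "SimpleKeyGenerator"
    # Multiple record key or partition path columns.
    return "ComplexKeyGenerator"


simple = pick_key_generator(["order_id"], ["order_date"])
complex_ = pick_key_generator(["col1", "col4"], ["region", "order_date"])
```

The last two branches are the cases Hudi can infer on its own; the earlier ones are the cases where you need to set the key generator class explicitly.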

The following flow chart explains how to select the right key generator for your use case.

PreCombine field

This is a mandatory field that Hudi uses to deduplicate the records within the same batch before writing them. When two records have the same record key, they go through the preCombine process, and the record with the largest value for the preCombine key is picked by default. This behavior can be customized through custom implementation of the Hudi payload class, which we describe in the next section.

The following table summarizes the configurations related to preCombine.

Configuration Parameter | Description | Value
hoodie.datasource.write.precombine.field | The field used in preCombining before the actual write. It helps select the latest record whenever there are multiple updates to the same record in a single incoming data batch. | The default value is ts. You can configure it to any column in your dataset that you want Hudi to use to deduplicate the records whenever there are multiple records with the same record key in the same batch. Currently, you can only pick one field as the preCombine field. Select a column with the timestamp data type or any column that can determine which record holds the latest version, like a monotonically increasing number.
hoodie.combine.before.upsert | During upsert, this configuration controls whether deduplication should be done for the incoming batch before ingesting into Hudi. This is applicable only for upsert operations. | The default value is true. We recommend keeping it at the default to avoid duplicates.
hoodie.combine.before.delete | Same as the preceding config, but applicable only for delete operations. | The default value is true. We recommend keeping it at the default to avoid duplicates.
hoodie.combine.before.insert | When inserted records share the same key, the configuration controls whether they should be first combined (deduplicated) before writing to storage. | The default value is false. We recommend setting it to true if the incoming inserts or bulk inserts can have duplicates.
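To make the preCombine step concrete, here is a small pure-Python model of deduplicating one incoming batch. It is a sketch of the behavior described above, not Hudi’s implementation: the function precombine and the field names id and ts are hypothetical, and how ties are broken is an arbitrary choice in this sketch.

```python
def precombine(batch, key_field, precombine_field):
    """Keep, per record key, the record with the largest preCombine value."""
    latest = {}
    for record in batch:
        key = record[key_field]
        current = latest.get(key)
        # Strictly greater wins; on a tie this sketch keeps the first record seen.
        if current is None or record[precombine_field] > current[precombine_field]:
            latest[key] = record
    return list(latest.values())


batch = [
    {"id": 1, "ts": 5, "val": "a"},
    {"id": 1, "ts": 10, "val": "b"},
    {"id": 2, "ts": 1, "val": "c"},
]
deduplicated = precombine(batch, key_field="id", precombine_field="ts")
```

After this step only one record per key remains in the batch, which is what the write phase then merges against storage.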

Record payload

Record payload defines how to merge new incoming records against old stored records for upserts.

The default OverwriteWithLatestAvroPayload payload class always overwrites the stored record with the latest incoming record. This works fine for batch jobs and most use cases. But let’s say you have a streaming job and want to prevent the late-arriving data from overwriting the latest record in storage. You need to use a different payload class implementation (DefaultHoodieRecordPayload) to determine the latest record in storage based on an ordering field, which you provide.

In the following example, Commit 1 has HoodieKey 1, Val 1, preCombine 10, and in-flight Commit 2 has HoodieKey 1, Val 2, preCombine 5.

If using the default OverwriteWithLatestAvroPayload, the Val 2 version of the record will be the final version of the record in storage (Amazon S3) because it’s the latest version of the record.

If using DefaultHoodieRecordPayload, it will honor Val 1 because the Val 2’s record version has a lower preCombine value (preCombine 5) compared to Val 1’s record version, while merging multiple versions of the record.
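The difference between the two payload classes in this example can be modeled in a few lines of Python. This is a simplified model of the merge semantics described above, not the Hudi classes themselves: both function names are hypothetical, and letting the incoming record win on a tie is an assumption of this sketch.

```python
def merge_overwrite_with_latest(stored, incoming):
    """Model of OverwriteWithLatestAvroPayload: the incoming record always wins."""
    return incoming


def merge_by_ordering_field(stored, incoming, ordering_field="preCombine"):
    """Model of DefaultHoodieRecordPayload: compare an ordering field first."""
    # Assumption: on equal ordering values the incoming record wins.
    if incoming[ordering_field] >= stored[ordering_field]:
        return incoming
    return stored


stored = {"key": 1, "val": "Val 1", "preCombine": 10}    # Commit 1
incoming = {"key": 1, "val": "Val 2", "preCombine": 5}   # late-arriving Commit 2

overwrite_result = merge_overwrite_with_latest(stored, incoming)
ordered_result = merge_by_ordering_field(stored, incoming)
```

Here overwrite_result holds Val 2 and ordered_result holds Val 1, matching the two outcomes described above.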

You can select a payload class while writing to the Hudi table using the configuration hoodie.datasource.write.payload.class.

Some useful in-built payload class implementations are described in the following table.

Payload Class | Description
OverwriteWithLatestAvroPayload (org.apache.hudi.common.model.OverwriteWithLatestAvroPayload) | Chooses the latest incoming record to overwrite any previous version of the records. Default payload class.
DefaultHoodieRecordPayload (org.apache.hudi.common.model.DefaultHoodieRecordPayload) | Uses hoodie.payload.ordering.field to determine the final record version while writing to storage.
EmptyHoodieRecordPayload (org.apache.hudi.common.model.EmptyHoodieRecordPayload) | Use this as payload class to delete all the records in the dataset.
AWSDmsAvroPayload (org.apache.hudi.common.model.AWSDmsAvroPayload) | Use this as payload class if AWS DMS is used as source. It provides support for seamlessly applying changes captured via AWS DMS. This payload implementation performs insert, delete, and update operations on the Hudi table based on the operation type for the CDC record obtained from AWS DMS.

Partitioning

Partitioning is the physical organization of files within a table. Partitions act as virtual columns and can impact the maximum parallelism we can use when writing.

Extremely fine-grained partitioning (for example, over 20,000 partitions) can create excessive overhead for the Spark engine managing all the small tasks, and can degrade query performance by reducing file sizes. Also, an overly coarse-grained partition strategy, without clustering and data skipping, can negatively impact both read and upsert performance with the need to scan more files in each partition.

The right partitioning helps improve read performance by reducing the amount of data scanned per query. It also improves upsert performance by limiting the number of files scanned to find the file group in which a specific record exists during ingest. A column frequently used in query filters would be a good candidate for partitioning.

For large-scale use cases with evolving query patterns, we suggest coarse-grained partitioning (such as date), while using fine-grained data layout optimization techniques (clustering) within each partition. This opens the possibility of data layout evolution.

By default, Hudi creates the partition folders with just the partition values. We recommend using Hive style partitioning, in which the name of the partition columns is prefixed to the partition values in the path (for example, year=2022/month=07 as opposed to 2022/07). This enables better integration with Hive metastores, such as using msck repair to fix partition paths.

To support Apache Hive style partitions in Hudi, we have to enable it in the config hoodie.datasource.write.hive_style_partitioning.

The following table summarizes the key configurations related to Hudi partitioning.

Configuration Parameter | Description | Value
hoodie.datasource.write.partitionpath.field | Partition path field. This is a required configuration that you need to pass while writing the Hudi dataset. | There is no default value set for this. Set it to the column that you have determined for partitioning the data. We recommend that it doesn’t cause extremely fine-grained partitions.
hoodie.datasource.write.hive_style_partitioning | Determines whether to use Hive style partitioning. If set to true, the names of partition folders follow <partition_column_name>=<partition_value> format. | Default value is false. Set it to true to use Hive style partitioning.
hoodie.datasource.write.partitionpath.urlencode | Indicates if we should URL encode the partition path value before creating the folder structure. | Default value is false. Set it to true if you want to URL encode the partition path value. For example, if you’re using the date format “yyyy-MM-dd HH:mm:ss”, URL encoding needs to be set to true because the : characters would otherwise result in an invalid path.

Note that if the data isn’t partitioned, you need to specifically use NonPartitionedKeyGenerator for the record key, which is explained in the previous section. Additionally, Hudi doesn’t allow partition columns to be changed or evolved.
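The effect of the Hive style and URL encoding settings on the folder layout can be sketched as follows. This is only a model of the resulting paths: partition_path is a hypothetical helper, and Python’s urllib.parse.quote is used as a stand-in for Hudi’s own escaping, which may differ in which characters it encodes.

```python
from urllib.parse import quote


def partition_path(columns, values, hive_style=False, url_encode=False):
    """Build a partition folder path from partition column names and values."""
    parts = []
    for column, value in zip(columns, values):
        # Stand-in for URL encoding of the partition value (assumption).
        text = quote(str(value), safe="") if url_encode else str(value)
        # Hive style prefixes the column name to the value.
        parts.append(f"{column}={text}" if hive_style else text)
    return "/".join(parts)


default_path = partition_path(["year", "month"], ["2022", "07"])
hive_path = partition_path(["year", "month"], ["2022", "07"], hive_style=True)
encoded_path = partition_path(["ts"], ["2022-07-01 10:15:00"],
                              hive_style=True, url_encode=True)
```

The first two calls reproduce the 2022/07 and year=2022/month=07 layouts mentioned above, and the third shows a timestamp value with the : characters encoded away.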

Choose the right index

After we select the storage type in Hudi and determine the record key and partition path, we need to choose the right index for upsert performance. Apache Hudi employs an index to locate the file group that an update/delete belongs to. This enables efficient upsert and delete operations and enforces uniqueness based on the record keys.

Global index vs. non-global index

When picking the right indexing strategy, the first decision is whether to use a global (table level) or non-global (partition level) index. The main difference between global vs. non-global indexes is the scope of key uniqueness constraints. Global indexes enforce uniqueness of the keys across all partitions of a table. The non-global index implementations enforce this constraint only within a specific partition. Global indexes offer stronger uniqueness guarantees, but they come with a higher update/delete cost. For example, global deletes with just the record key need to scan the entire dataset. HBase indexes are an exception here, but come with an operational overhead.

For large-scale global index use cases, use an HBase index or record-level index (available in Hudi 0.13) because for all other global indexes, the update/delete cost grows with the size of the table, O(size of the table).

When using a global index, be aware of the configuration hoodie.[bloom|simple|hbase].index.update.partition.path, which is already set to true by default. For existing records getting upserted to a new partition, enabling this configuration will help delete the old record in the old partition and insert it in the new partition.
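The following Python sketch models the difference in uniqueness scope, along with the effect of the update.partition.path setting. It is a toy in-memory model, not how Hudi stores or indexes data: the table is a plain dictionary keyed by (partition, record key), and the function upsert and its parameters are hypothetical.

```python
def upsert(table, record, global_index, update_partition_path=True):
    """Upsert into a dict that maps (partition, key) -> record."""
    key, partition = record["key"], record["partition"]
    if global_index:
        # A global index looks for the key in every partition.
        existing = [loc for loc in table if loc[1] == key]
        if existing and existing[0][0] != partition:
            old_location = existing[0]
            if update_partition_path:
                # Delete from the old partition, insert into the new one.
                del table[old_location]
                table[(partition, key)] = record
            else:
                # Update the record in place in its old partition.
                table[old_location] = {**record, "partition": old_location[0]}
            return table
    # Non-global index (or same partition): uniqueness is per partition only.
    table[(partition, key)] = record
    return table


first = {"key": 1, "partition": "2022/07", "val": "a"}
moved = {"key": 1, "partition": "2022/08", "val": "b"}

non_global = upsert(upsert({}, first, global_index=False), moved, global_index=False)
global_moved = upsert(upsert({}, first, global_index=True), moved, global_index=True)
global_in_place = upsert(upsert({}, first, global_index=True), moved,
                         global_index=True, update_partition_path=False)
```

With the non-global index the same key ends up in both partitions, while the global index keeps a single copy, either moved to the new partition or updated in the old one.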

Hudi index options

After picking the scope of the index, the next step is to decide which indexing option best fits your workload. The following table explains the indexing options available in Hudi as of 0.11.0.

Indexing Option | How It Works | Characteristic | Scope
Simple Index | Performs a join of the incoming upsert/delete records against keys extracted from the involved partition in case of non-global datasets and the entire dataset in case of global or non-partitioned datasets. | Easiest to configure. Suitable for basic use cases like small tables with evenly spread updates. Even for larger tables where updates are very random to all partitions, a simple index is the right choice because it directly joins with interested fields from every data file without any initial pruning. By comparison, a Bloom index adds overhead for random upserts without enough pruning benefit, because the Bloom filters could indicate true positive for most of the files and end up comparing ranges and filters against all these files. | Global/Non-global
Bloom Index (default index in EMR Hudi) | Employs Bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Bloom filter is stored in the data file footer while writing the data. | More efficient filter compared to simple index for use cases like late-arriving updates to fact tables and deduplication in event tables with ordered record keys such as timestamp. Hudi implements a dynamic Bloom filter mechanism to reduce false positives provided by Bloom filters. In general, the probability of false positives increases with the number of records in a given file. Check the Hudi FAQ for Bloom filter configuration best practices. | Global/Non-global
Bucket Index | It distributes records to buckets using a hash function based on the record keys or a subset of them. It uses the same hash function to determine which file group to match with incoming records. New indexing option since Hudi 0.11.0. | Simple to configure. It has better upsert throughput performance compared to the Bloom filter. As of Hudi 0.11.1, only fixed bucket number is supported. This will no longer be an issue with the upcoming consistent hashing bucket index feature, which can dynamically change bucket numbers. | Non-global
HBase Index | The index mapping is managed in an external HBase table. | Best lookup time, especially for large numbers of partitions and files. It comes with additional operational overhead because you need to manage an external HBase table. | Global

Use cases suitable for simple index

Simple indexes are most suitable for workloads with evenly spread updates over partitions and files on small tables, and also for larger tables with dimension-style workloads, where updates are random across all partitions. A common example is a CDC pipeline for a dimension table. In this case, updates end up touching a large number of files and partitions. Therefore, a join with no other pruning is most efficient.

Use cases suitable for Bloom index

Bloom indexes are suitable for most production workloads with uneven update distribution across partitions. For workloads with most updates to recent data like fact tables, Bloom filter rightly fits the bill. It can be clickstream data collected from an ecommerce site, bank transactions in a FinTech application, or CDC logs for a fact table.

When using a Bloom index, be aware of the following configurations:

  • hoodie.bloom.index.use.metadata – By default, it is set to false. When this flag is on, the Hudi writer gets the index metadata information from the metadata table and doesn’t need to open Parquet file footers to get the Bloom filters and stats. You prune out the files by just using the metadata table and therefore have improved performance for larger tables.
  • hoodie.bloom.index.prune.by.ranges – Enable or disable range pruning based on use case. By default, it’s already set to true. When this flag is on, range information from files is used to speed up index lookups. This is helpful if the selected record key is monotonically increasing. You can make any record key monotonically increasing by adding a timestamp prefix. If the record key is completely random and has no natural ordering (such as UUIDs), it’s better to turn this off, because range pruning will only add extra overhead to the index lookup.
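To see why range pruning pays off for ordered keys but not for random ones, consider this small Python model. It only illustrates the idea of pruning by per-file minimum and maximum record keys; the function candidate_files, the file names, and the key values are all made up for the example.

```python
def candidate_files(file_ranges, key):
    """Return the files whose [min_key, max_key] range could contain the key."""
    return [name for name, (low, high) in file_ranges.items() if low <= key <= high]


# Timestamp-prefixed keys: each file covers a narrow, non-overlapping range.
ordered_files = {
    "f1": ("2022-01-01", "2022-01-31"),
    "f2": ("2022-02-01", "2022-02-28"),
    "f3": ("2022-03-01", "2022-03-31"),
}
ordered_candidates = candidate_files(ordered_files, "2022-02-15")

# Random keys (UUID-like): every file spans almost the whole key space.
random_files = {
    "f1": ("00a1", "ff3e"),
    "f2": ("01b2", "fe90"),
    "f3": ("0003", "ffd1"),
}
random_candidates = candidate_files(random_files, "7c4d")
```

With ordered keys the lookup narrows to a single file before any Bloom filter is consulted; with random keys every file remains a candidate, so the range check is pure overhead.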

Use cases suitable for bucket index

Bucket indexes are suitable for upsert use cases on huge datasets with a large number of file groups within partitions, relatively even data distribution across partitions, and a bucket hash field that distributes records relatively evenly. A bucket index can have better upsert performance in these cases because no index lookup is involved: file groups are located using a hashing mechanism, which is very fast. This is totally different from both simple and Bloom indexes, where an explicit index lookup step is involved during write. Each bucket has a one-to-one mapping with a Hudi file group. Because the total number of buckets (defined by hoodie.bucket.index.num.buckets, default 4) is fixed, this can lead to skewed data (data distributed unevenly across buckets) and scalability issues (buckets can grow over time). These issues will be addressed in the upcoming consistent hashing bucket index, which is going to be a special type of bucket index.
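The hashing idea behind the bucket index can be illustrated with a few lines of Python. This is a conceptual sketch only: bucket_for is a hypothetical function, and zlib.crc32 is used as a convenient deterministic hash, which is not necessarily the hash function Hudi uses.

```python
import zlib
from collections import Counter


def bucket_for(record_key: str, num_buckets: int = 4) -> int:
    """Map a record key to a bucket number with a deterministic hash."""
    # crc32 is a stand-in hash for illustration (assumption).
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


# The same key always lands in the same bucket, so no index lookup is needed.
same_bucket = bucket_for("order-42") == bucket_for("order-42")

# With a fixed bucket count, each bucket grows as more keys arrive.
distribution = Counter(bucket_for(f"order-{i}") for i in range(1000))
```

Because the bucket count is fixed in this model, the only way to absorb more data is for each bucket to grow, which is the scalability concern described above.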

Use cases suitable for HBase index

HBase indexes are suitable for use cases where ingestion performance can’t be met using the other index types. These are mostly use cases with global indexes and large numbers of files and partitions. HBase indexes provide the best lookup time but come with large operational overhead, which is easier to justify if you’re already using HBase for other workloads.

For more information on choosing the right index and indexing strategies for common use cases, refer to Employing the right indexes for fast updates, deletes in Apache Hudi. As you have already seen, Hudi index performance depends heavily on the actual workload. We encourage you to evaluate different indexes for your workload and choose the one which is best suited for your use case.

Migration guidance

With Apache Hudi growing in popularity, one of the fundamental challenges is to efficiently migrate existing datasets to Apache Hudi. Apache Hudi maintains record-level metadata to perform core operations such as upserts and incremental pulls. To take advantage of Hudi’s upsert and incremental processing support, you need to add Hudi record-level metadata to your original dataset.

Using bulk_insert

The recommended way to migrate data to Hudi is to perform a full rewrite using bulk_insert. bulk_insert performs no look-up for existing records and skips writer optimizations like small file handling. Performing a one-time full rewrite is a good opportunity to write your data in Hudi format with all the metadata and indexes generated and also potentially control file size and sort data by record keys.

You can set the sort mode in a bulk_insert operation using the configuration hoodie.bulkinsert.sort.mode. bulk_insert offers the following sort modes to configure.

Sort Mode | Description
NONE | No sorting is done to the records. You can get the fastest performance (comparable to writing Parquet files with Spark) for initial load with this mode.
GLOBAL_SORT | Use this to sort records globally across Spark partitions. It is less performant in initial load than other modes because it repartitions data by partition path and sorts it by record key within each partition. This helps in controlling the number of files generated in the target, thereby controlling the target file size. Also, the generated target files will not have overlapping min-max values for record keys, which further helps speed up index look-ups during upserts/deletes by pruning out files based on record key ranges in the Bloom index.
PARTITION_SORT | Use this to sort records within Spark partitions. It is more performant for initial load than GLOBAL_SORT. If your Spark partitions in the data frame are already fairly mapped to the Hudi partitions (the data frame is already repartitioned by partition column), this mode is preferred because you can obtain records sorted by record key within each partition.

We recommend using GLOBAL_SORT mode if you can handle the one-time cost. The default sort mode changed from GLOBAL_SORT to NONE starting with EMR 6.9 (Hudi 0.12.1). During bulk_insert with GLOBAL_SORT, two configurations control the sizes of target files generated by Hudi.

Configuration Parameter | Description | Value
hoodie.bulkinsert.shuffle.parallelism | The number of files generated from the bulk insert is determined by this configuration. The higher the parallelism, the more Spark tasks processing the data. | Default value is 200. To control file size and achieve maximum performance (more parallelism), we recommend setting this to a value such that the files generated are equal to the hoodie.parquet.max.file.size. If you make parallelism really high, the max file size can’t be honored because the Spark tasks are working on smaller amounts of data.
hoodie.parquet.max.file.size | Target size for Parquet files produced by Hudi write phases. | Default value is 120 MB. If the Spark partitions generated with hoodie.bulkinsert.shuffle.parallelism are larger than this size, it splits it and generates multiple files to not exceed the max file size.

Let’s say we have a 100 GB Parquet source dataset and we’re bulk inserting with GLOBAL_SORT into a partitioned Hudi table with 10 evenly distributed Hudi partitions. We want to have the preferred target file size of 120 MB (default value for hoodie.parquet.max.file.size). The Hudi bulk insert shuffle parallelism should be calculated as follows:

  • The total data size in MB is 100 * 1024 = 102400 MB
  • hoodie.bulkinsert.shuffle.parallelism should be set to 102400/120 = ~854

Note that in reality, even with GLOBAL_SORT, each Spark partition can be mapped to more than one Hudi partition. Treat this calculation as a rough estimate only; the write can end up with more files than the parallelism specified.
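The estimate above is simple enough to capture in a helper. The following Python sketch reproduces the same arithmetic and assembles the related options; bulk_insert_parallelism is a hypothetical function name, and the result is the same rough estimate with the same caveats.

```python
import math


def bulk_insert_parallelism(total_size_gb: float, target_file_mb: int = 120) -> int:
    """Estimate shuffle parallelism as total size divided by target file size."""
    total_mb = total_size_gb * 1024
    # Round up so that no task is expected to exceed the target file size.
    return math.ceil(total_mb / target_file_mb)


parallelism = bulk_insert_parallelism(100)  # 102400 MB / 120 MB, rounded up

bulk_insert_options = {
    "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
    "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
    # 120 MB expressed in bytes.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
}
```

For the 100 GB example this yields 854, the same value as the manual calculation.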

Using bootstrapping

For customers operating at scale on hundreds of terabytes or petabytes of data, migrating your datasets to start using Apache Hudi can be time-consuming. Apache Hudi provides a feature called bootstrap to help with this challenge.

The bootstrap operation contains two modes: METADATA_ONLY and FULL_RECORD.

FULL_RECORD is the same as full rewrite, where the original data is copied and rewritten with the metadata as Hudi files.

The METADATA_ONLY mode is the key to accelerating the migration progress. The conceptual idea is to decouple the record-level metadata from the actual data by writing only the metadata columns in the Hudi files generated while the data isn’t copied over and stays in its original location. This significantly reduces the amount of data written, thereby improving the time to migrate and get started with Hudi. However, this comes at the expense of read performance, which involves the overhead of merging Hudi files and original data files to get the complete record. Therefore, you may not want to use it for frequently queried partitions.

You can pick and choose these modes at partition level. One common strategy is to tier your data. Use FULL_RECORD mode for a small set of hot partitions, which are accessed frequently, and METADATA_ONLY for a larger set of cold partitions.
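The tiering strategy can be pictured as a simple mapping from partition path to bootstrap mode. The following Python sketch is a conceptual model only: bootstrap_mode is a hypothetical function, the regular expression for hot partitions is an invented example, and this is not the Hudi bootstrap mode selector itself.

```python
import re


def bootstrap_mode(partition_path: str, hot_pattern: str) -> str:
    """Pick FULL_RECORD for hot partitions and METADATA_ONLY for the rest."""
    if re.fullmatch(hot_pattern, partition_path):
        return "FULL_RECORD"
    return "METADATA_ONLY"


# Assumption for the example: the last two months of 2022 are the hot tier.
hot_pattern = r"2022/(11|12)/.*"

partitions = ["2019/03/15", "2022/11/30", "2022/12/01"]
plan = {path: bootstrap_mode(path, hot_pattern) for path in partitions}
```

Only the two recent partitions are fully rewritten in this plan, while the older partition keeps its data in place and gets metadata only.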

Consider the following:

Catalog sync

Hudi supports syncing Hudi table partitions and columns to a catalog. On AWS, you can either use the AWS Glue Data Catalog or Hive metastore as the metadata store for your Hudi tables. To register and synchronize the metadata with your regular write pipeline, you need to either enable hive sync or run the hive_sync_tool or AwsGlueCatalogSyncTool command line utility.

We recommend enabling the hive sync feature with your regular write pipeline to make sure the catalog is up to date. If you don’t expect a new partition to be added or the schema changed as part of each batch, then we recommend enabling hoodie.datasource.meta_sync.condition.sync as well so that it allows Hudi to determine if hive sync is necessary for the job.

If you have frequent ingestion jobs and need to maximize ingestion performance, you can disable hive sync and run the hive_sync_tool asynchronously.

If you have the timestamp data type in your Hudi data, we recommend setting hoodie.datasource.hive_sync.support_timestamp to true to convert the int64 (timestamp_micros) to the hive type timestamp. Otherwise, you will see the values in bigint while querying data.

The following table summarizes the configurations related to hive_sync.

Configuration Parameter | Description | Value
hoodie.datasource.hive_sync.enable | To register or sync the table to a Hive metastore or the AWS Glue Data Catalog. | Default value is false. We recommend setting the value to true to make sure the catalog is up to date, and it needs to be enabled in every single write to avoid an out-of-sync metastore.
hoodie.datasource.hive_sync.mode | This configuration sets the mode for HiveSyncTool to connect to the Hive metastore server. For more information, refer to Sync modes. | Valid values are hms, jdbc, and hiveql. If the mode isn’t specified, it defaults to jdbc. Hms and jdbc both talk to the underlying thrift server, but jdbc needs a separate jdbc driver. We recommend setting it to ‘hms’, which uses the Hive metastore client to sync Hudi tables using thrift APIs directly. This helps when using the AWS Glue Data Catalog because you don’t need to install Hive as an application on the EMR cluster (because it doesn’t need the server).
hoodie.datasource.hive_sync.database | Name of the destination database that we should sync the Hudi table to. | Default value is default. Set this to the database name of your catalog.
hoodie.datasource.hive_sync.table | Name of the destination table that we should sync the Hudi table to. | In Amazon EMR, the value is inferred from the Hudi table name. You can set this config if you need a different table name.
hoodie.datasource.hive_sync.support_timestamp | To convert logical type TIMESTAMP_MICROS as hive type timestamp. | Default value is false. Set it to true to convert to hive type timestamp.
hoodie.datasource.meta_sync.condition.sync | If true, only sync on conditions like schema change or partition change. | Default value is false.
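Putting the recommendations from this table together, a set of catalog sync options might be assembled as in the following Python sketch. The helper hive_sync_options and its parameters are hypothetical; the option keys and recommended values come from the table above.

```python
def hive_sync_options(database, table=None, has_timestamps=False,
                      conditional_sync=False):
    """Build catalog sync options following the recommendations above."""
    options = {
        "hoodie.datasource.hive_sync.enable": "true",
        # hms syncs through the Hive metastore client directly.
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
    }
    if table is not None:
        # Only needed when the catalog table name should differ.
        options["hoodie.datasource.hive_sync.table"] = table
    if has_timestamps:
        options["hoodie.datasource.hive_sync.support_timestamp"] = "true"
    if conditional_sync:
        options["hoodie.datasource.meta_sync.condition.sync"] = "true"
    return options


sync_options = hive_sync_options("sales_db", has_timestamps=True,
                                 conditional_sync=True)
```

These options are meant to be passed along with the rest of your write options on every write so that the catalog stays in sync.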

Writing and reading Hudi datasets, and their integration with other AWS services

There are different ways you can write the data to Hudi using Amazon EMR, as explained in the following table.

Hudi Write Option | Description
Spark DataSource | You can use this option to do upsert, insert, or bulk insert for the write operation. Refer to Work with a Hudi dataset for an example of how to write data using DataSourceWrite.
Spark SQL | You can easily write data to Hudi with SQL statements. It eliminates the need to write Scala or PySpark code and lets you adopt a low-code paradigm.
Flink SQL, Flink DataStream API | If you’re using Flink for real-time streaming ingestion, you can use the high-level Flink SQL or Flink DataStream API to write the data to Hudi.
DeltaStreamer | DeltaStreamer is a self-managed tool that supports standard data sources like Apache Kafka, Amazon S3 events, DFS, AWS DMS, JDBC, and SQL sources, built-in checkpoint management, schema validations, as well as lightweight transformations. It can also operate in a continuous mode, in which a single self-contained Spark job can pull data from source, write it out to Hudi tables, and asynchronously perform cleaning, clustering, compactions, and catalog syncing, relying on Spark’s job pools for resource management. It’s easy to use and we recommend using it for all the streaming and ingestion use cases where a low-code approach is preferred. For more information, refer to Streaming Ingestion.
Spark structured streaming | For use cases that require complex data transformations of the source data frame written in Spark DataFrame APIs or advanced SQL, we recommend the structured streaming sink. The streaming source can be used to obtain change feeds out of Hudi tables for streaming or incremental processing use cases.
Kafka Connect Sink | If you standardize on the Apache Kafka Connect framework for your ingestion needs, you can also use the Hudi Connect Sink.
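For the Spark DataSource option, the configurations discussed in this post typically come together as a single set of write options. The following Python sketch assembles such a set; build_write_options, the table name, and the column names are hypothetical, and the actual write call is shown only as a comment because it needs a Spark session and a data frame that this sketch doesn’t create.

```python
ALLOWED_OPERATIONS = {"upsert", "insert", "bulk_insert"}


def build_write_options(table_name, record_key, partition_field,
                        precombine_field, operation="upsert"):
    """Collect the core Hudi write options covered in this post."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"Unsupported write operation: {operation}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.hive_style_partitioning": "true",
    }


options = build_write_options("orders", "order_id", "order_date", "updated_at")

# With a SparkSession and a DataFrame named df available, the write would
# look roughly like this (target path is a placeholder):
# df.write.format("hudi").options(**options).mode("append").save("s3://<bucket>/<prefix>/orders")
```

Keeping the options in one place like this makes it easier to reuse the same configuration across upsert and bulk insert jobs.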

Refer to the following support matrix for query support on specific query engines. The following table explains the different options to read the Hudi dataset using Amazon EMR.

Hudi Read Option | Description
Spark DataSource | You can read Hudi datasets directly from Amazon S3 using this option. The tables don’t need to be registered with Hive metastore or the AWS Glue Data Catalog for this option. You can use this option if your use case doesn’t require a metadata catalog. Refer to Work with a Hudi dataset for an example of how to read data using DataSourceReadOptions.
Spark SQL | You can query Hudi tables with DML/DDL statements. The tables need to be registered with Hive metastore or the AWS Glue Data Catalog for this option.
Flink SQL | After the Flink Hudi tables have been registered to the Flink catalog, they can be queried using Flink SQL.
PrestoDB/Trino | The tables need to be registered with Hive metastore or the AWS Glue Data Catalog for this option. This engine is preferred for interactive queries. There is a new Trino connector in upcoming Hudi 0.13, and we recommend reading datasets through this connector when using Trino for performance benefits.
Hive | The tables need to be registered with Hive metastore or the AWS Glue Data Catalog for this option.

Apache Hudi is well integrated with AWS services, and these integrations work when AWS Glue Data Catalog is used, with the exception of Athena, where you can also use a data source connector to an external Hive metastore. The following table summarizes the service integrations.

AWS Service | Description
Amazon Athena | You can use Athena for a serverless option to query a Hudi dataset on Amazon S3. Currently, it supports snapshot queries and read-optimized queries, but not incremental queries. For more details, refer to Using Athena to query Apache Hudi datasets.
Amazon Redshift Spectrum | You can use Amazon Redshift Spectrum to run analytic queries against tables in your Amazon S3 data lake with Hudi format. Currently, it supports only CoW tables. For more details, refer to Creating external tables for data managed in Apache Hudi.
AWS Lake Formation | AWS Lake Formation is used to secure data lakes and define fine-grained access control on the database and table level. Hudi is not currently supported with Amazon EMR Lake Formation integration.
AWS DMS | You can use AWS DMS to ingest data from upstream relational databases to your S3 data lakes into a Hudi dataset. For more details, refer to Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service.

Conclusion

This post covered best practices for configuring Apache Hudi data lakes using Amazon EMR. We discussed the key configurations in migrating your existing dataset to Hudi and shared guidance on how to determine the right options for different use cases when setting up Hudi tables.

The upcoming Part 2 of this series focuses on optimizations that can be done on this setup, along with monitoring using Amazon CloudWatch.


About the Authors

Suthan Phillips is a Big Data Architect for Amazon EMR at AWS. He works with customers to provide best practice and technical guidance and helps them achieve highly scalable, reliable and secure solutions for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.

Three recurring Security Hub usage patterns and how to deploy them

Post Syndicated from Tim Holm original https://aws.amazon.com/blogs/security/three-recurring-security-hub-usage-patterns-and-how-to-deploy-them/

As Amazon Web Services (AWS) Security Solutions Architects, we get to talk to customers of all sizes and industries about how they want to improve their security posture and get visibility into their AWS resources. This blog post identifies the top three most commonly used Security Hub usage patterns and describes how you can use these to improve your strategy for identifying and managing findings.

Customers have told us they want to provide security and compliance visibility to the application owners in an AWS account or to the teams that use the account; others want a single-pane-of-glass view for their security teams; and other customers want to centralize everything into a security information and event management (SIEM) system, most often due to being in a hybrid scenario.

Security Hub was launched as a posture management service that performs security checks, aggregates alerts, and enables automated remediation. Security Hub ingests findings from multiple AWS services, including Amazon GuardDuty, Amazon Inspector, AWS Firewall Manager, and AWS Health, and also from third-party services. It can be integrated with AWS Organizations to provide a single dashboard where you can view findings across your organization.

Security Hub findings are normalized into the AWS Security Findings Format (ASFF) so that users can review them in a standardized format. This reduces the need for time-consuming data conversion efforts and allows for flexible and consistent filtering of findings based on the attributes provided in the finding, as well as the use of customizable responsive actions. Partners who have integrations with Security Hub also send their findings to AWS using the ASFF to allow for consistent attribute definition and enforced criticality ratings, meaning that findings in Security Hub have a measurable rating. This helps to simplify the complexity of managing multiple findings from different providers.

Overview of the usage patterns

In this section, we outline the objectives for each usage pattern, list the typical stakeholders we have seen these patterns support, and discuss the value of deploying each one.

Usage pattern 1: Dashboard for application owners

Use Security Hub to provide visibility to application workload owners regarding the security and compliance posture of their AWS resources.

The application owner is often responsible for the security and compliance posture of the resources they have deployed in AWS. In our experience, however, it is common for large enterprises to have a separate team responsible for defining security-related privileges and to not grant application owners the ability to modify configuration settings on the AWS account that is designated as the centralized Security Hub administration account. We’ll walk through how you can enable read-only access for application owners to use Security Hub to see the overall security posture of their AWS resources.

Stakeholders: Developers and cloud teams that are responsible for the security posture of their AWS resources. These individuals are often required to resolve security events and non-compliance findings that are captured with Security Hub.

Value adds for customers: Some organizations we have worked with put the onus on workload owners to own their security findings, because they have a better understanding of the nuances of the engineering, the business needs, and the overall risk that the security findings represent. This usage pattern gives the application owners clear visibility into the security and compliance status of their workloads in the AWS accounts so that they can define appropriate mitigation actions with consideration to their business needs and risk.

Usage pattern 2: A single pane of glass for security professionals

Use Security Hub as a single pane of glass to view, triage, and take action on AWS security and compliance findings across accounts and AWS Regions.

Security Hub generates findings by running continuous, automated security checks based on supported industry standards. Additionally, Security Hub integrates with other AWS services to collect and correlate findings and uses over 60 partner integrations to simplify and prioritize findings. With these features, security professionals can use Security Hub to manage findings across their AWS landscape.

Stakeholders: Security operations, incident responders, and threat hunters who are responsible for monitoring compliance, as well as security events.

Value adds for customers: This pattern benefits customers who don’t have a SIEM but who are looking for a centralized model of security operations. By using Security Hub and aggregating findings across Regions into a single Security Hub dashboard, they get oversight of their AWS resources without the cost and complexity of managing a SIEM.

Usage pattern 3: Centralized routing to a SIEM solution

Use AWS Security Hub as a single aggregation point for security and compliance findings across AWS accounts and Regions, and route those findings in a normalized format to a centralized SIEM or log management tool.

Customers who have an existing SIEM capability and complex environments typically deploy this usage pattern. By using Security Hub, these customers gather security and compliance-related findings across the workloads in all their accounts, ingest those into their SIEM, and investigate findings or take response and remediation actions directly within their SIEM console. This mechanism also enables customers to define use cases for threat detection and analysis in a single environment, providing a holistic view of their risk.

Stakeholders: Security operations teams, incident responders, and threat hunters. This pattern supports a centralized model of security operations, where the responsibilities for monitoring and identifying both non-compliance with defined practice and security events fall within a single team in the organization.

Value adds for customers: When Security Hub aggregates the findings from workloads across accounts and Regions in a single place, those findings are normalized by using the ASFF. This means that findings are already normalized under a single format when they are sent to the SIEM. This enables faster analytics, use case definition, and dashboarding because analysts don’t have to create multi-tiered use cases for different finding structures across vendors and services.

The ASFF also enables streamlined response through security orchestration, automation, and response (SOAR) tools or AWS native orchestration tools such as Amazon EventBridge. With the ASFF, you can parse and filter events based on an attribute and customize automation.
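As an illustration of filtering on ASFF attributes, the following is a minimal Python sketch, not part of the original walkthrough. It assumes findings are already available as dictionaries and reads only the ASFF attributes Severity.Label and Compliance.Status; the function name filter_findings and the sample findings are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional

Finding = Dict[str, Any]


def filter_findings(
    findings: Iterable[Finding],
    severity_labels: Iterable[str],
    compliance_status: Optional[str] = None,
) -> List[Finding]:
    """Return findings whose Severity.Label is in severity_labels.

    If compliance_status is given, Compliance.Status must match as well.
    """
    wanted = {label.upper() for label in severity_labels}
    selected = []
    for finding in findings:
        if finding.get("Severity", {}).get("Label") not in wanted:
            continue
        if compliance_status is not None:
            if finding.get("Compliance", {}).get("Status") != compliance_status:
                continue
        selected.append(finding)
    return selected


# Made-up findings, trimmed to the attributes used above.
SAMPLE_FINDINGS: List[Finding] = [
    {"Id": "finding-1", "Severity": {"Label": "CRITICAL"}, "Compliance": {"Status": "FAILED"}},
    {"Id": "finding-2", "Severity": {"Label": "LOW"}, "Compliance": {"Status": "FAILED"}},
    {"Id": "finding-3", "Severity": {"Label": "HIGH"}},  # no Compliance block
    {"Id": "finding-4", "Severity": {"Label": "HIGH"}, "Compliance": {"Status": "PASSED"}},
]

urgent = filter_findings(SAMPLE_FINDINGS, ["HIGH", "CRITICAL"])
failed_checks = filter_findings(SAMPLE_FINDINGS, ["HIGH", "CRITICAL"], "FAILED")
print([f["Id"] for f in urgent])
print([f["Id"] for f in failed_checks])
```

Because every provider reports severity under the same attribute, the same filter applies regardless of which service or partner produced the finding.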

Overall, this usage pattern helps to improve the typical key performance indicators (KPIs) the SecOps function is measured against, such as Mean Time to Detect (MTTD) or Mean Time to Respond (MTTR) in the AWS environment.

Setting up each usage pattern

In this section, we’ll go over the steps for setting up each usage pattern.

Usage pattern 1: Dashboard for application owners

Use the following steps to set up a Security Hub dashboard for an account owner, where the owner can view and take action on security findings.

Prerequisites for pattern 1

This solution has the following prerequisites:

  1. Enable AWS Security Hub to check your environment against security industry standards and best practices.
  2. Next, enable the AWS service integrations for all accounts and Regions as desired. For more information, refer to Enabling all features in your organization.

Set up read-only permissions for the AWS application owner

The following steps are commonly performed by the security team or those responsible for creating AWS Identity and Access Management (IAM) policies.

  • Assign the AWS managed permission policy AWSSecurityHubReadOnlyAccess to the principal who will be assuming the role. Figure 1 shows an image of the permission statement.
    Figure 1: Assign permissions

  • (Optional) Create custom insights in Security Hub. Using custom insights can provide a view of areas of interest for an application owner; however, creating a new insights view is not allowed unless the following additional set of permissions is granted to the application owner role or user.
    {
      "Effect": "Allow",
      "Action": [
        "securityhub:UpdateInsight",
        "securityhub:DeleteInsight",
        "securityhub:CreateInsight"
      ],
      "Resource": "*"
    }
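The statement above is a fragment rather than a complete policy. As a rough sketch of how it could be wrapped into a full IAM policy document, the following Python snippet builds the JSON with the standard library only. The helper name build_policy_document is hypothetical, and "2012-10-17" is the current IAM policy language version.

```python
import json
from typing import Any, Dict, Iterable

# The custom insight permissions from the statement above.
CUSTOM_INSIGHT_STATEMENT: Dict[str, Any] = {
    "Effect": "Allow",
    "Action": [
        "securityhub:UpdateInsight",
        "securityhub:DeleteInsight",
        "securityhub:CreateInsight",
    ],
    "Resource": "*",
}


def build_policy_document(statements: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap one or more statements in a complete IAM policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


policy_json = json.dumps(build_policy_document([CUSTOM_INSIGHT_STATEMENT]), indent=2)
print(policy_json)
```

The resulting JSON can be attached as an additional customer managed policy alongside the read-only managed policy.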

Pattern 1 walkthrough: View the application owner’s security findings

After the read-only IAM policy has been created and applied, the application owner can access Security Hub to view the dashboard, which provides the application owner with a view of the overall security posture of their AWS resources. In this section, we’ll walk through the steps that the application owner can take to quickly view and assess the compliance and security of their AWS resources.

To view the application owner’s dashboard in Security Hub

  1. Sign in to the AWS Management Console and navigate to the AWS Security Hub service page. You will be presented with a summary of the findings. Then, depending on the security standards that are enabled, you will be presented with a view similar to the one shown in Figure 2.
    Figure 2: Summary of aggregated Security Hub standard score

    Security Hub generates its own findings by running automated and continuous checks against the rules in a set of supported security standards. On the Summary page, the Security standards card displays the security scores for each enabled standard. It also displays a consolidated security score that represents the proportion of passed controls to enabled controls across the enabled standards.

  2. Choose the hyperlink of a security standard to get an additional summarized view, as shown in Figure 3.
    Figure 3: Security Hub standards summarized view

  3. As you choose the hyperlinks for the specific findings, you will get additional details, along with recommended remediation instructions.
    Figure 4: Example of finding details view

  4. In the left menu of the Security Hub console, choose Findings to see the findings ranked according to severity. Choose the link text of the finding title to drill into the details and view additional information on possible remediation actions.
    Figure 5: Findings example

  5. In the left menu of the Security Hub console, choose Insights. You will be presented with a collection of related findings. Security Hub provides several managed insights to get you started with assessing your security posture. As shown in Figure 6, you can quickly see if your Amazon Simple Storage Service (Amazon S3) buckets have public write or read permissions. This is just one example of managed insights that help you quickly identify risks.
    Figure 6: Insights view

  6. You can create custom insights to track issues and resources that are specific to your environment. Note that creating custom insights requires IAM permissions, as described earlier in the Set up read-only permissions for the AWS application owner section. Use the following steps to create a custom insight for compliance status.

    To create a custom insight, use the Group By filter and select how you want your insights to be grouped together:

    1. In the left menu of the Security Hub console, choose Insights, and then choose Create insight in the upper right corner.
    2. By default, there will be filters included in the filter bar. Put the cursor in the filter bar, choose Group By, choose Compliance Status, and then choose Apply.
      Figure 7: Creating a custom insight

    3. For Insight name, enter a relevant name for your insight, and then choose Create insight. Your custom insight will be created.
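To make the Group By step concrete, here is a small Python sketch of what grouping by compliance status produces. It is a local stand-in and not how Security Hub computes insights. It assumes findings are ASFF dictionaries with an optional Compliance.Status attribute; the function name and the "NOT_SET" placeholder for findings without a compliance block are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def group_by_compliance_status(findings: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count findings per Compliance.Status value, similar to an insight's result list."""
    counts: Counter = Counter()
    for finding in findings:
        status = finding.get("Compliance", {}).get("Status", "NOT_SET")
        counts[status] += 1
    return dict(counts)


# Made-up findings, trimmed to the attribute being grouped on.
sample = [
    {"Id": "a", "Compliance": {"Status": "FAILED"}},
    {"Id": "b", "Compliance": {"Status": "PASSED"}},
    {"Id": "c", "Compliance": {"Status": "FAILED"}},
    {"Id": "d"},  # for example, a threat finding with no compliance status
]

grouped = group_by_compliance_status(sample)
print(grouped)
```

Each key in the result corresponds to one row of the insight, with the count of matching findings beside it.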

In this scenario, you learned how application owners can quickly assess the resources in an AWS account and get details about security risks and recommended remediation steps. For a more hands-on walkthrough that covers how to use Security Hub, consider spending 2–3 hours going through this AWS Security Hub workshop.
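The security score shown on the Summary page in step 1 is described as the proportion of passed controls to enabled controls. The arithmetic can be sketched as follows. This is a simplification under that description only: the function name is hypothetical, and the service's own calculation may treat controls without data differently.

```python
def security_score(passed_controls: int, enabled_controls: int) -> float:
    """Return passed controls as a percentage of enabled controls, rounded to one decimal."""
    if enabled_controls <= 0:
        raise ValueError("enabled_controls must be positive")
    if not 0 <= passed_controls <= enabled_controls:
        raise ValueError("passed_controls must be between 0 and enabled_controls")
    return round(100 * passed_controls / enabled_controls, 1)


# 45 of 60 enabled controls passing gives a score of 75 percent.
print(security_score(45, 60))
```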

Usage pattern 2: A single pane of glass for security professionals

To use Security Hub as a centralized source of security insight, we recommend that you choose to accept security data from the available integrated AWS services and third-party products that generate findings. Check the lists of available integrations often, because AWS continues to release new services that integrate with Security Hub. Figure 8 shows the Integrations page in Security Hub, where you can find information on how to accept findings from the many integrations that are available.

Figure 8: Security Hub integrations page

Solution architecture and workflow for pattern 2

As Figure 9 shows, you can visualize Security Hub as the centralized security dashboard. Here, Security Hub can act as both the consumer and issuer of findings. Additionally, if you have security findings you want sent to Security Hub that aren’t provided by an AWS Partner or AWS service, you can create a custom provider to provide the central visibility you need.

Figure 9: Security Hub findings flow
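For the custom provider case mentioned above, a finding has to be expressed in the ASFF before it can be imported. The following Python sketch builds a minimal finding dictionary with the required ASFF attributes. The function name, the generator ID example-custom-scanner, the finding type, and all sample values are assumptions for illustration; the ProductArn follows the default product ARN form used for custom findings in an account. Sending the finding (for example, through the BatchImportFindings API) is left out so that the sketch stays self-contained.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_custom_finding(
    account_id: str,
    region: str,
    finding_id: str,
    title: str,
    description: str,
    severity_label: str,
    resource_id: str,
    resource_type: str = "Other",
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a minimal ASFF finding for a custom (non-AWS, non-partner) source."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    timestamp = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        "Id": finding_id,
        # Default product ARN for custom findings in this account and Region.
        "ProductArn": f"arn:aws:securityhub:{region}:{account_id}:product/{account_id}/default",
        "GeneratorId": "example-custom-scanner",  # hypothetical generator name
        "AwsAccountId": account_id,
        "Types": ["Software and Configuration Checks/Vulnerabilities"],
        "CreatedAt": timestamp,
        "UpdatedAt": timestamp,
        "Severity": {"Label": severity_label},
        "Title": title,
        "Description": description,
        "Resources": [{"Type": resource_type, "Id": resource_id}],
    }


finding = build_custom_finding(
    account_id="111122223333",
    region="us-east-1",
    finding_id="example-custom-scanner/finding-0001",
    title="Outdated package detected",
    description="An internal scanner reported an outdated package.",
    severity_label="MEDIUM",
    resource_id="example-host-01",
    now=datetime(2023, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(finding["ProductArn"])
```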

Because Security Hub is integrated with many AWS services and partner solutions, customers get improved security visibility across their AWS landscape. With the integration of Amazon Detective, it’s convenient for security analysts to use Security Hub as the centralized incident triage starting point. Amazon Detective is a security incident response service that can be used to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities by collecting log data from AWS resources. To learn how to get started with Amazon Detective, we recommend watching this video.

Programmatically remediate high-volume workflows

Security teams increasingly rely on monitoring and automation to scale and keep up with the demands of their business. Using Security Hub, customers can configure automatic responses to findings based upon preconfigured rules. Security Hub gives you the option to create your own automated response and remediation solution or use the AWS provided solution, Security Hub Automated Response and Remediation (SHARR). SHARR is an extensible solution that provides predefined response and remediation actions (playbooks) based on industry compliance standards and best practices for security threats. For step-by-step instructions for setting up SHARR, refer to this blog post.

Routing to alerting and ticketing systems

For incidents you cannot or do not want to automatically remediate, either because the incident happened in an account with a production workload or some change control process must be followed, routing to an incident management environment may be necessary. The primary goal of incident response is reducing the time to resolution for critical incidents. Customers who use alerting or incident management systems can integrate Security Hub to streamline the time it takes to resolve incidents. ServiceNow ITSM, Slack and PagerDuty are examples of products that integrate with Security Hub. This allows for workflow processing, escalations, and notifications as required.
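The decision between automatic remediation and routing to a ticketing system can be sketched as a small rule. The following Python example is illustrative only: the function name, the two return values, and the idea of keeping a list of production account IDs and change-controlled resource types are assumptions, and the finding is read through the ASFF attributes AwsAccountId and Resources.

```python
from typing import Any, Dict, Iterable


def route_finding(
    finding: Dict[str, Any],
    production_accounts: Iterable[str],
    change_controlled_types: Iterable[str],
) -> str:
    """Return "ticket" when a human process must be followed, else "auto_remediate"."""
    # Findings from production accounts always go to incident management.
    if finding.get("AwsAccountId") in set(production_accounts):
        return "ticket"
    # Resources under change control also need a ticket.
    resource_types = {resource.get("Type") for resource in finding.get("Resources", [])}
    if resource_types & set(change_controlled_types):
        return "ticket"
    return "auto_remediate"


PRODUCTION_ACCOUNTS = ["111122223333"]          # hypothetical account ID
CHANGE_CONTROLLED = ["AwsRdsDbInstance"]        # hypothetical policy choice

dev_bucket = {"AwsAccountId": "444455556666", "Resources": [{"Type": "AwsS3Bucket"}]}
prod_bucket = {"AwsAccountId": "111122223333", "Resources": [{"Type": "AwsS3Bucket"}]}
dev_database = {"AwsAccountId": "444455556666", "Resources": [{"Type": "AwsRdsDbInstance"}]}

for item in (dev_bucket, prod_bucket, dev_database):
    print(route_finding(item, PRODUCTION_ACCOUNTS, CHANGE_CONTROLLED))
```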

Additionally, Incident Manager, a capability of AWS Systems Manager, also provides response plans, an escalation path, runbook automation, and active collaboration to recover from incidents. By using runbooks, customers can set up and run automation to recover from incidents. This blog post walks through setting up runbooks.

Usage pattern 3: Centralized routing to a SIEM solution

Here, we will describe how to use Splunk as an AWS Partner SIEM solution. However, note that there are many other SIEM partners available in the marketplace; the instructions to route findings to those partners’ platforms will be available in their documentation.

Solution architecture and workflow for pattern 3

Figure 10: Security Hub findings ingestion to Splunk

Figure 10 shows the use of a Security Hub delegated administrator that aggregates findings across multiple accounts and Regions, as well as other AWS services such as GuardDuty, Amazon Macie, and Amazon Inspector. These findings are then sent to Splunk through a combination of Amazon EventBridge, AWS Lambda, and Amazon Kinesis Data Firehose.
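To give a feel for the EventBridge portion of this flow, the following Python sketch shows an event pattern that selects imported Security Hub findings, a simplified local matcher, and a function that wraps each finding in an HTTP Event Collector style envelope. This is not the code that the Project Trumpet template deploys. The matcher handles only top-level fields, and the function names, the sourcetype value, and the sample event are assumptions.

```python
from typing import Any, Dict, List

# Event pattern for findings that Security Hub emits to EventBridge.
EVENT_PATTERN: Dict[str, List[str]] = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
}


def matches(pattern: Dict[str, List[str]], event: Dict[str, Any]) -> bool:
    """Simplified matcher: each pattern key must equal one of its allowed values."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def to_hec_payloads(
    event: Dict[str, Any],
    sourcetype: str = "aws:securityhub:finding",  # assumed sourcetype name
) -> List[Dict[str, Any]]:
    """Wrap each finding of a matching event in an HEC-style envelope."""
    if not matches(EVENT_PATTERN, event):
        return []
    findings = event.get("detail", {}).get("findings", [])
    return [{"sourcetype": sourcetype, "event": finding} for finding in findings]


# Made-up event, trimmed to the fields used above.
sample_event = {
    "source": "aws.securityhub",
    "detail-type": "Security Hub Findings - Imported",
    "detail": {"findings": [{"Id": "finding-1"}, {"Id": "finding-2"}]},
}
other_event = {"source": "aws.ec2", "detail-type": "EC2 Instance State-change Notification"}

payloads = to_hec_payloads(sample_event)
print(len(payloads), len(to_hec_payloads(other_event)))
```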

Prerequisites for pattern 3

This solution has the following prerequisites:

  • Enable Security Hub in your accounts, with one account defined as the delegated admin for other accounts within AWS Organizations, and enable cross-Region aggregation.
  • Set up a third-party SIEM solution; you can visit the AWS Marketplace for a list of our SIEM partners. For this walkthrough, we will be using Splunk, with the Security Hub app in Splunk and an HTTP Event Collector (HEC) with indexer acknowledgment configured.
  • Generate and deploy a CloudFormation template from Splunk’s automation, provided by Project Trumpet.
  • Enable cross-Region replication. This action can only be performed from within the delegated administrator account, or from within a standalone account that is not controlled by a delegated administrator. The aggregation Region must be a Region that is enabled by default.

Pattern 3 walkthrough: Set up centralized routing to a SIEM

To get started, first designate a Security Hub delegated administrator and configure cross-Region replication. Then you can configure integration with Splunk.

To designate a delegated administrator and configure cross-Region replication

  1. Follow the steps in Designating a Security Hub administrator account to configure the delegated administrator for Security Hub.
  2. Perform these steps to configure cross-Region replication:
    1. Sign in to the account to which you delegated Security Hub administration, and in the console, navigate to the Security Hub dashboard in your desired aggregation Region. You must have the correct permissions to access Security Hub and make this change.
    2. Choose Settings, choose Regions, and then choose Configure finding aggregation.
    3. Select the radio button that displays the Region you are currently in, and then choose Save.
    4. You will then be presented with all available Regions in which you can aggregate findings. Select the Regions you want to be part of the aggregation. You also have the option to automatically link future Regions that Security Hub becomes enabled in.
    5. Choose Save.

You have now enabled multi-Region aggregation. Navigate back to the dashboard, where findings will start to be replicated into a single view. The time it takes to replicate the findings from the Regions will vary. We recommend waiting 24 hours for the findings to be replicated into your aggregation Region.
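The console steps above also have an API counterpart. As a hedged sketch, the following Python function builds request parameters in the shape accepted by the Security Hub CreateFindingAggregator operation (RegionLinkingMode and Regions). The function name and the way the choices are mapped are assumptions, and the actual call through an AWS SDK is omitted so that the example runs without credentials.

```python
from typing import Any, Dict, Iterable, Optional


def build_aggregator_request(
    linked_regions: Optional[Iterable[str]] = None,
    excluded_regions: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Build parameters for a finding aggregator.

    - linked_regions only: link exactly those Regions (future Regions are not linked).
    - excluded_regions only: link all Regions, including future ones, except those listed.
    - neither: link all Regions, including future ones.
    """
    linked = sorted(linked_regions or [])
    excluded = sorted(excluded_regions or [])
    if linked and excluded:
        raise ValueError("specify linked_regions or excluded_regions, not both")
    if linked:
        return {"RegionLinkingMode": "SPECIFIED_REGIONS", "Regions": linked}
    if excluded:
        return {"RegionLinkingMode": "ALL_REGIONS_EXCEPT_SPECIFIED", "Regions": excluded}
    return {"RegionLinkingMode": "ALL_REGIONS"}


request = build_aggregator_request(linked_regions=["us-west-2", "eu-west-1"])
print(request)
```

The request would be issued in the aggregation Region from the delegated administrator account, matching the console procedure.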

To configure integration with Splunk

Note: These actions require that you have appropriate permissions to deploy a CloudFormation template.

  1. Navigate to https://splunktrumpet.github.io/ and enter your HEC details: the endpoint URL and HEC token. Leave Automatically generate the required HTTP Event Collector tokens on your Splunk environment unselected.
  2. Under AWS data source configuration, select only AWS CloudWatch Events, with the Security Hub findings – Imported filter applied.
  3. Download the CloudFormation template to your local machine.
  4. Sign in to the AWS Management Console in the account and Region where your Security Hub delegated administrator and Region aggregation are configured.
  5. Navigate to the CloudFormation console and choose Create stack.
  6. Choose Template is ready, and then choose Upload a template file. Upload the CloudFormation template you previously downloaded from the Splunk Trumpet page.
  7. In the CloudFormation console, on the Specify Details page, enter a name for the stack. Keep all the default settings, and then choose Next.
  8. Keep all the default settings for the stack options, and then choose Next to review.
  9. On the review page, scroll to the bottom of the page. Select the check box under the Capabilities section, next to the acknowledgment that AWS CloudFormation might create IAM resources with custom names.

    The CloudFormation template will take approximately 15–20 minutes to complete.

Test the solution for pattern 3

If you have GuardDuty enabled in your account, you can generate sample findings. Security Hub will ingest these findings and invoke the EventBridge rule to push them into Splunk. Alternatively, you can wait for findings to be generated from the periodic checks that are performed by Security Hub. Figure 11 shows an example of findings displayed in the Security Hub dashboard in Splunk.

Figure 11: Example of the Security Hub dashboard in Splunk

Conclusion

AWS Security Hub provides multiple ways to quickly assess and prioritize your security alerts and security posture. In this post, you learned about three different usage patterns that we have seen our customers implement to take advantage of the benefits and integrations offered by Security Hub. Note that these usage patterns are not mutually exclusive, but can be used together as needed.

To extend these solutions further, you can enrich Security Hub metadata with additional context by using tags, as described in this post. Configure Security Hub to ingest findings from a variety of AWS Partners to provide additional visibility and context to the overall status of your security posture. To start your 30-day free trial of Security Hub, visit AWS Security Hub.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, please start a new thread on the Security Hub forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Tim Holm

Tim is a Principal Solutions Architect at AWS. He joined AWS in 2018 and is deeply passionate about security, cloud services, and building innovative solutions that solve complex business problems.

Danny Cortegaca

Danny is a Senior Security Specialist at AWS. He joined AWS in 2021 and specializes in security architecture, threat modelling, and driving risk focused conversations.

Considerations for security operations in the cloud

Post Syndicated from Stuart Gregg original https://aws.amazon.com/blogs/security/considerations-for-security-operations-in-the-cloud/

Cybersecurity teams are often made up of different functions. Typically, these can include Governance, Risk & Compliance (GRC), Security Architecture, Assurance, and Security Operations, to name a few. Each function has its own specific tasks, but works towards a common goal—to partner with the rest of the business and help teams ship and run workloads securely.

In this blog post, I’ll focus on the role of the security operations (SecOps) function, and in particular, the considerations that you should look at when choosing the most suitable operating model for your enterprise and environment. This becomes particularly important when your organization starts to adapt and operate more workloads in the cloud.

Operational teams that manage business processes are the backbone of organizations—they pave the way for efficient running of a business and provide a solid understanding of which day-to-day processes are effective. Typically, these processes are defined within standard operating procedures (SOPs), also known as runbooks or playbooks, and business functions are centralized around them—think Human Resources, Accounting, IT, and so on. This is also true for cybersecurity and SecOps, which typically has operational oversight of security for the entire organization.

Teams adopt an operating model that inherently leans toward a delegated ownership of security when scaling and developing workloads in the cloud. The emergence of this type of delegation might cause you to re-evaluate your currently supported model, and when you do this, it’s important to understand what outcome you are trying to get to. You want to be able to quickly respond to and resolve security issues. You want to help application teams own their own security decisions. You also want to have centralized visibility of the security posture of your organization. This last objective is key to being able to identify where there are opportunities for improvement in tooling or processes that can improve the operation of multiple teams.

Three ways of designing the operating model for SecOps are as follows:

  • Centralized – A more traditional model where SecOps is responsible for identifying and remediating security events across the business. This can also include reviewing general security posture findings for the business, such as patching and security configuration issues.
  • Decentralized – Responsibility for responding to and remediating security events across the business has been delegated to the application owners and individual business units, and there is no central operations function. Typically, there will still be an overarching security governance function that takes more of a policy or principles view.
  • Hybrid – A mix of both approaches, where SecOps still has a level of responsibility and ownership for identifying and orchestrating the response to security events, while the responsibility for remediation is owned by the application owners and individual business units.

As you can see from these descriptions, the main distinction between the different models is in the team that is responsible for remediation and response. I’ll discuss the benefits and considerations of each model throughout this blog post.

The strategies and operating models that I talk about throughout this blog post will focus on the role of SecOps and organizations that operate in the cloud. It’s worth noting that these operating models don’t apply to any particular technology or cloud provider. Each model has its own benefits and challenges to consider; overall, you should aim to adopt an operating model that gets to the best business outcome, while managing risk and providing a path for continuous improvement.

Background: the centralized model

As you might expect, the most familiar and well-understood operating model for SecOps is a centralized one. Traditionally, SecOps has developed gradually from internal security staff who have a very good understanding of the mostly static on-premises infrastructure and corporate assets, such as employee laptops, servers, and databases.

Centralizing in this way provides organizations with a familiar operating model and structure. Over time, operating in this model across an industry has allowed teams to develop reliable SOPs for common security events. Analysts who deal with these incidents have a good understanding of the infrastructure, the environment, and the steps that are needed to resolve incidents. Every incident gives opportunities to update the SOPs and to share this knowledge and the lessons learned with the wider industry. This continuous feedback cycle has provided benefits to SecOps teams for many years.

When security issues occur, understanding the division of responsibility between the various teams in this model is extremely important for quick resolution and remediation. The Responsibility Assignment Matrix, also known as the RACI model, has defined roles—Responsible, Accountable, Consulted, and Informed. Utilizing a model like this will help align each employee, department, and business unit so that they are aware of their role and contact points when incidents do occur, and can use defined playbooks to quickly act upon incidents.

The pressure can be high during a security event, and incidents that involve production systems carry additional weight. Typically, in a centralized model, security events flow into a central queue that a security analyst will monitor. A common approach is the Security Operations Center (SOC), where events from multiple sources are displayed on screens and also trigger activity in the queue. Security incidents are acted upon by an experienced team that is well versed in SOPs and understands the importance of time sensitivity when dealing with such incidents. Additionally, a centralized SecOps team usually operates in a 24/7 model, which might be achieved by having teams in multiple time zones or with help from an MSSP (Managed Security Service Provider). Whichever strategy is followed, having experienced security analysts deal with security incidents is a great benefit, because experience helps to ensure efficient and thorough remediation of issues.

So, with context and background set—how does a centralized SOC look and feel when it operates in the cloud, and what are its challenges?

Centralized SOC in the cloud: the advantages

Cloud providers offer many solutions and capabilities for SOCs that operate in a centralized model. For example, you can monitor your organization’s cloud security posture as a whole, which allows for key performance indicator (KPI) benchmarking, both internally and industry wide. This can then help your organization target security initiatives, training, and awareness on lower-scoring areas.

Security orchestration, automation, and response (SOAR) is a phrase commonly used across the security industry, and the cloud unlocks this capability. Combining both native and third-party security services and solutions with automation facilitates quick resolution of security incidents. The use of SOAR means that only incidents that need human intervention are actually reviewed by the analysts. After investigation, if automation can be introduced on that alert, it’s quickly applied. Having a central place for automating alerts helps the organization to have a consistent and structured approach to the response for security events and gives analysts more time to focus on activities like threat hunting.

Additionally, such threat-hunting operations require a central security data lake or similar technology. As a result, the SecOps team helps to drive the centralization of data across the business, which is a traditional cybersecurity function.

Centralized SOC in the cloud: organizational considerations

Some KPIs that a traditional SOC would typically use are time to detect (TTD), time to acknowledge (TTA), and time to resolve (TTR). These have been good metrics that SecOps managers can use to understand and benchmark how well the SecOps team is performing, both internally and against industry benchmarks. As your organization starts to take advantage of the breadth and depth available within the cloud, how does this change the KPIs that you need to track? As stated earlier, the cloud makes it easier to track KPIs through increased visibility of your cloud footprint—although you should evaluate traditional KPIs to understand whether they still make sense to use. Some additional KPIs that should be considered are metrics that show increasing automation, reduction in human access, and the overall improvement in security posture.
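As a worked example of these KPIs, the following Python sketch computes mean time to detect, acknowledge, and resolve from incident timestamps. The definitions used here are assumptions, because teams measure these intervals differently: detection is measured from occurrence, while acknowledgment and resolution are measured from detection. The field names and incident data are made up.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List

Incident = Dict[str, datetime]


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def soc_kpis(incidents: List[Incident]) -> Dict[str, float]:
    """Mean TTD, TTA, and TTR in minutes across a list of incidents."""
    return {
        "mean_ttd_minutes": mean([minutes_between(i["occurred"], i["detected"]) for i in incidents]),
        "mean_tta_minutes": mean([minutes_between(i["detected"], i["acknowledged"]) for i in incidents]),
        "mean_ttr_minutes": mean([minutes_between(i["detected"], i["resolved"]) for i in incidents]),
    }


incidents = [
    {
        "occurred": datetime(2023, 3, 1, 10, 0),
        "detected": datetime(2023, 3, 1, 10, 10),
        "acknowledged": datetime(2023, 3, 1, 10, 15),
        "resolved": datetime(2023, 3, 1, 11, 10),
    },
    {
        "occurred": datetime(2023, 3, 2, 12, 0),
        "detected": datetime(2023, 3, 2, 12, 30),
        "acknowledged": datetime(2023, 3, 2, 12, 45),
        "resolved": datetime(2023, 3, 2, 14, 30),
    },
]

kpis = soc_kpis(incidents)
print(kpis)
```

Tracking the same calculation over time, split by whether an incident was resolved through automation or by an analyst, is one way to observe the automation-related KPIs mentioned above.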

Organizations should consider scaling factors for operational processes and capability in the centralized SOC model. Once benefits from adopting the cloud have been realized, organizations typically expand and scale up their cloud footprint aggressively. For a centralized SecOps team, this could cause a challenging battle between the wider business, which wants to expand, and the SOC, which needs the ability to fully understand and respond to issues in the environment. For example, most organizations will put together small proofs of concept (POCs) to showcase new architectures and their benefits, and these POCs may become available as blueprints for the wider organization to consume. When new blueprints are implemented, the centralized SecOps team should implement and rely on its automation capabilities to verify that the correct alerting, monitoring, and operational processes are in place.

Decentralization: all ownership with the application teams

Moving or designing workloads in the cloud provides organizations with many benefits, such as increased speed and agility, built-in native security, and the ability to launch globally in minutes. When looking at the decentralized model, business units should incorporate practices into their development pipelines to benefit from the security capabilities of the cloud. This is sometimes referred to as a shift left or DevSecOps approach—essentially building security best practices into every part of the development process, and as early as possible.

Placing the ownership of the SecOps function on the business units and application owners can provide some benefits. One immediate benefit is that the teams that create applications and architectures have first-hand knowledge and contextual awareness of their products. This knowledge is critical when security events occur, because understanding the expected behavior and information flows of workloads helps with quick remediation and resolution of issues. Having teams work on security incidents in the ways that best fit their operational processes can also increase speed of remediation.

Decentralization: organizational considerations

When considering the decentralized approach, there are some organizational considerations that you should be aware of:

Dedicated security analysts within a central SecOps function deal with security incidents day in and day out; they study the industry, have a keen eye on upcoming threats, and are also well versed in high-pressure situations. By decentralizing, you might lose the consistent, level-headed experience they offer during a security incident. Embedding security champions who have industry experience into each business unit can help ensure that security is considered throughout the development lifecycle and that incidents are resolved as quickly as possible.

Contextual information and root cause analysis from past incidents are vital data points. Having a centralized SecOps team makes it much simpler to get a broad view of the security issues affecting the whole organization, which improves the ability to take a signal from one business unit and apply that to other parts of the organization to understand if they are also vulnerable, and to help protect the organization in the future.

Decentralizing the SecOps responsibility completely can cause you to lose these benefits. As mentioned earlier, effective communication and an environment to share data is key to verifying that lessons learned are shared across business units—one way of achieving this effective knowledge sharing could be to set up a Cloud Center of Excellence (CCoE). The CCoE helps with broad information sharing, but the minimization of team hand-offs provided by a centralized SecOps function is a strong organizational mechanism to drive consistency.

Traditionally, in the centralized model, the SOC has 24/7 coverage of applications and critical business functions, which can require a large security staff. The need for 24/7 operations still exists in a decentralized model, and having to provide that capability in each application team or business unit can increase costs while making it more difficult to share information. In a decentralized model, having greater levels of automation across organizational processes can help reduce the number of humans needed for 24/7 coverage.

Blending the models: the hybrid approach

Most organizations end up using a hybrid operating model in one way or another. This model combines the benefits of the centralized and decentralized models, with clear responsibility and division of ownership between the business units and the central SecOps function.

This best-of-both-worlds scenario can be summarized by the statement “global monitoring, local response.” This means that the SecOps team and wider cybersecurity function guides the entire organization with security best practices and guardrails while also maintaining visibility for reporting, compliance, and understanding the security posture of the organization as a whole. Meanwhile, local business units have the tools, knowledge, and expertise available to confidently own remediation of security events for their applications.

In this hybrid model, you split delegation of ownership into two parts. First, the operational capability for security is centrally owned. This centrally owned capability builds upon the partnership between the application teams and the security organization, via the CCoE. This gives the benefits of consistency, tooling expertise, and lessons learned from past security incidents. Second, the resolution of day-to-day security events and security posture findings is delegated to the business units. This empowers the people closest to the business problem to own service improvement in ways that best suit that team’s way of working, whether that’s through ChatOps and automation, or through the tools available in the cloud. Examples of the types of events you might want to delegate for resolution are items such as patching, configuration issues, and workload-specific security events. It’s important to provide these teams with a well-defined escalation route to the central security organization for issues that require specialist security knowledge, such as forensics or other investigations.

A RACI is particularly important when you operate in this hybrid model. Making sure that there is a clear set of responsibilities between the business units and the SecOps team is crucial to avoid confusion when security incidents occur.

Conclusion

The cloud has the ability to unlock new capabilities for your organization. Increased security, speed, and agility are just some of the benefits you can gain when you move workloads to the cloud. The traditional centralized SecOps model offers a consistent approach to security detection and response for your organization. Decentralization of the response provides application teams with direct exposure to the consequences of their design decisions, which can speed up improvement. The hybrid model, where application teams are responsible for the resolution of issues, can improve the time to fix issues while freeing up SecOps to continue their work. The hybrid operating model complements the capabilities of the cloud and enables application owners and business units to work in ways that best suit them while maintaining a high bar for security across the organization.

Whichever operating model and strategy you decide to embark on, it’s important to remember the core principles that you should aim for:

  • Enable effective risk management across the business
  • Drive security awareness and embed security champions where possible
  • When you scale, maintain organization-wide visibility of security events
  • Help application owners and business units to work in ways that work best for them
  • Work with application owners and business units to understand the cyber landscape

The cloud offers many benefits for your organization, and your security organization is there to help teams ship and operate securely. This confidence will lead to realized productivity and continued innovation—which is good for both internal teams and your customers.

 
Stuart Gregg

Stuart enjoys providing thought leadership and being a trusted advisor to customers. In his spare time Stuart can be seen either eating snacks, running marathons or dabbling in the odd Ironman.

Introducing the price-capacity-optimized allocation strategy for EC2 Spot Instances

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/introducing-price-capacity-optimized-allocation-strategy-for-ec2-spot-instances/

This blog post is written by Jagdeep Phoolkumar, Senior Specialist Solution Architect, Flexible Compute and Peter Manastyrny, Senior Product Manager Tech, EC2 Core.

Amazon EC2 Spot Instances are unused Amazon Elastic Compute Cloud (Amazon EC2) capacity in the AWS Cloud available at up to a 90% discount compared to On-Demand prices. One of the best practices for using EC2 Spot Instances is to be flexible across a wide range of instance types to increase the chances of getting the aggregate compute capacity. Amazon EC2 Auto Scaling and Amazon EC2 Fleet make it easy to configure a request with a flexible set of instance types, as well as use a Spot allocation strategy to determine how to fulfill Spot capacity from the Spot Instance pools that you provide in your request.

The existing allocation strategies available in Amazon EC2 Auto Scaling and Amazon EC2 Fleet are called “lowest-price” and “capacity-optimized”. The lowest-price allocation strategy allocates Spot Instance pools where the Spot price is currently the lowest. Customers told us that in some cases the lowest-price strategy picks the Spot Instance pools that are not optimized for capacity availability and results in more frequent Spot Instance interruptions. As an improvement over lowest-price allocation strategy, in August 2019 AWS launched the capacity-optimized allocation strategy for Spot Instances, which helps customers tap into the deepest Spot Instance pools by analyzing capacity metrics. Since then, customers have seen a significantly lower interruption rate with capacity-optimized strategy when compared to the lowest-price strategy. You can read more about these customer stories in the Capacity-Optimized Spot Instance Allocation in Action at Mobileye and Skyscanner blog post. The capacity-optimized allocation strategy strictly selects the deepest pools. Therefore, sometimes it can pick high-priced pools even when there are low-priced pools available with marginally less capacity. Customers have been telling us that, for an optimal experience, they would like an allocation strategy that balances the best trade-offs between lowest-price and capacity-optimized.

Today, we’re excited to share the new price-capacity-optimized allocation strategy that makes Spot Instance allocation decisions based on both the price and the capacity availability of Spot Instances. The price-capacity-optimized allocation strategy should be the first preference and the default allocation strategy for most Spot workloads.

This post illustrates how the price-capacity-optimized allocation strategy selects Spot Instances in comparison with lowest-price and capacity-optimized. Furthermore, it discusses some common use cases of the price-capacity-optimized allocation strategy.

Overview

The price-capacity-optimized allocation strategy makes Spot allocation decisions based on both capacity availability and Spot prices. In comparison to the lowest-price allocation strategy, the price-capacity-optimized strategy doesn’t always attempt to launch in the absolute lowest priced Spot Instance pool. Instead, price-capacity-optimized attempts to diversify as much as possible across the multiple low-priced pools with high capacity availability. As a result, the price-capacity-optimized strategy in most cases has a higher chance of getting Spot capacity and delivers lower interruption rates when compared to the lowest-price strategy. If you factor in the cost associated with retrying the interrupted requests, then the price-capacity-optimized strategy becomes even more attractive from a savings perspective over the lowest-price strategy.

We recommend the price-capacity-optimized allocation strategy for workloads that require optimization of cost savings, Spot capacity availability, and interruption rates. For existing workloads using lowest-price strategy, we recommend price-capacity-optimized strategy as a replacement. The capacity-optimized allocation strategy is still suitable for workloads that either use similarly priced instances, or ones where the cost of interruption is so significant that any cost saving is inadequate in comparison to a marginal increase in interruptions.

Walkthrough

In this section, we illustrate how the price-capacity-optimized allocation strategy deploys Spot capacity when compared to the other two allocation strategies. The following example configuration shows how Spot capacity could be allocated in an Auto Scaling group using the different allocation strategies:

{
    "AutoScalingGroupName": "myasg",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-abcde12345"
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {
                            "Min": 4,
                            "Max": 4
                        },
                        "MemoryMiB": {
                            "Min": 0,
                            "Max": 16384
                        },
                        "InstanceGenerations": [
                            "current"
                        ],
                        "BurstablePerformance": "excluded",
                        "AcceleratorCount": {
                            "Max": 0
                        }
                    }
                }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "spot-allocation-strategy"
        }
    },
    "MinSize": 10,
    "MaxSize": 100,
    "DesiredCapacity": 60,
    "VPCZoneIdentifier": "subnet-a12345a,subnet-b12345b,subnet-c12345c"
}

First, Amazon EC2 Auto Scaling attempts to balance capacity evenly across Availability Zones (AZs). Next, Amazon EC2 Auto Scaling applies the Spot allocation strategy using the 30+ instance types selected by attribute-based instance type selection, in each Availability Zone. The results after testing different allocation strategies are as follows:

  • Price-capacity-optimized strategy diversifies over multiple low-priced Spot Instance pools that are optimized for capacity availability.
  • Capacity-optimized strategy identifies Spot Instance pools that are only optimized for capacity availability.
  • Lowest-price strategy by default allocates the two lowest priced Spot Instance pools that aren’t optimized for capacity availability.

To find out how each allocation strategy fares regarding Spot savings and capacity, we compare ‘Cost of Auto Scaling group’ (number of instances x Spot price/hour for each type of instance) and ‘Spot interruptions rate’ (number of instances interrupted/number of instances launched) for each allocation strategy. We use fictional numbers for the purpose of this post. However, you can use the Cloud Intelligence Dashboards to find the actual Spot Saving, and the Amazon EC2 Spot interruption dashboard to log Spot Instance interruptions. The example results after a 30-day period are as follows:

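| Allocation strategy      | Instance allocation          | Cost of Auto Scaling group | Spot interruptions rate |
|--------------------------|------------------------------|----------------------------|-------------------------|
| price-capacity-optimized | 40 c6i.xlarge, 20 c5.xlarge  | $4.80/hour                 | 3%                      |
| capacity-optimized       | 60 c5.xlarge                 | $5.00/hour                 | 2%                      |
| lowest-price             | 30 c5a.xlarge, 30 m5n.xlarge | $4.75/hour                 | 20%                     |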

As per the above table, with the price-capacity-optimized strategy, the cost of the Auto Scaling group is only 5 cents (1%) higher, whereas the rate of Spot interruptions is six times lower (3% vs 20%) than the lowest-price strategy. In summary, from this exercise you learn that the price-capacity-optimized strategy provides the optimal Spot experience that is the best of both the lowest-price and capacity-optimized allocation strategies.
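
The comparison can be reproduced with a few lines of arithmetic. The sketch below uses the fictional figures from the table; the `effective_cost` helper and its `rework_fraction` parameter are a hypothetical toy model of retry cost, added only to illustrate why interruptions erode the savings of the lowest-price strategy.

```python
# Fictional figures copied from the example table in this post.
strategies = {
    "price-capacity-optimized": {"cost_per_hour": 4.80, "interruption_rate": 0.03},
    "capacity-optimized": {"cost_per_hour": 5.00, "interruption_rate": 0.02},
    "lowest-price": {"cost_per_hour": 4.75, "interruption_rate": 0.20},
}


def compare(candidate, baseline):
    """Compare the hourly cost and interruption rate of two strategies."""
    c, b = strategies[candidate], strategies[baseline]
    extra_cost = c["cost_per_hour"] - b["cost_per_hour"]
    return {
        "extra_cost_per_hour": round(extra_cost, 2),
        "extra_cost_pct": round(100 * extra_cost / b["cost_per_hour"], 1),
        "interruption_ratio": round(b["interruption_rate"] / c["interruption_rate"], 1),
    }


def effective_cost(name, rework_fraction):
    """Toy model (assumption): inflate the hourly cost by the work that must
    be repeated, taken as interruption rate times the share of work lost."""
    s = strategies[name]
    return s["cost_per_hour"] * (1 + s["interruption_rate"] * rework_fraction)


print(compare("price-capacity-optimized", "lowest-price"))
```

Under this toy model, even a modest `rework_fraction` is enough for the price-capacity-optimized group to come out cheaper than the lowest-price group once retries are counted.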

Common use-cases of price-capacity-optimized allocation strategy

Earlier we mentioned that the price-capacity-optimized allocation strategy is recommended for most Spot workloads. To elaborate further, in this section we explore some of these common workloads.

Stateless and fault-tolerant workloads

Stateless workloads that can complete ongoing requests within two minutes of a Spot interruption notice, and the fault-tolerant workloads that have a low cost of retries, are the best fit for the price-capacity-optimized allocation strategy. This category has workloads such as stateless containerized applications, microservices, web applications, data and analytics jobs, and batch processing.

Workloads with a high cost of interruption

Workloads that have a high cost of interruption associated with an expensive cost of retries should implement checkpointing to lower the cost of interruptions. By using checkpointing, you make the price-capacity-optimized allocation strategy a good fit for these workloads, as it allocates capacity from the low-priced Spot Instance pools that offer a low Spot interruptions rate. This category has workloads such as long Continuous Integration (CI), image and media rendering, Deep Learning, and High Performance Compute (HPC) workloads.
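
A minimal checkpointing sketch is shown below, assuming a job that can express its progress as a small JSON document. The function names, the state layout, and the local file path are illustrative assumptions; on Spot Instances the checkpoint would need to live on storage that outlasts the instance.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def process(items, path, checkpoint_every=10):
    """Sum the items, resuming from the last checkpoint after an interruption."""
    state = load_checkpoint(path)
    for index in range(state["next_item"], len(items)):
        state["total"] += items[index]
        state["next_item"] = index + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

With this pattern, a replacement instance repeats at most `checkpoint_every` items of work instead of the whole job, which is what lowers the cost of each interruption.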

Conclusion

We recommend that customers use the price-capacity-optimized allocation strategy as the default option. The price-capacity-optimized strategy helps Amazon EC2 Auto Scaling groups and Amazon EC2 Fleet provision target capacity with an optimal experience. Updating to the price-capacity-optimized allocation strategy is as simple as updating a single parameter in an Amazon EC2 Auto Scaling group and Amazon EC2 Fleet.
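
For an Auto Scaling group, that single-parameter change might look like the following boto3 sketch. It assumes the group (here the `myasg` name from the walkthrough) already has a mixed instances policy, and the helper names are hypothetical. boto3 is imported lazily so that the request can be inspected without it.

```python
def allocation_strategy_update(group_name, strategy="price-capacity-optimized"):
    """Build the keyword arguments for update_auto_scaling_group."""
    return {
        "AutoScalingGroupName": group_name,
        "MixedInstancesPolicy": {
            "InstancesDistribution": {"SpotAllocationStrategy": strategy}
        },
    }


def apply_update(group_name, strategy="price-capacity-optimized"):
    """Send the update; requires boto3 and AWS credentials."""
    import boto3  # third-party dependency, imported only when applying

    client = boto3.client("autoscaling")
    client.update_auto_scaling_group(
        **allocation_strategy_update(group_name, strategy)
    )


print(allocation_strategy_update("myasg"))
```

Printing the request first is a low-risk way to review the change before calling `apply_update`.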

To learn more about allocation strategies for Spot Instances, visit the Spot allocation strategies documentation page.

Running AI-ML Object Detection Model to Process Confidential Data using Nitro Enclaves

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/running-ai-ml-object-detection-model-to-process-confidential-data-using-nitro-enclaves/

This blog post was written by Antoine Awad, Solutions Architect, Kevin Taylor, Senior Solutions Architect, and Joel Desaulniers, Senior Solutions Architect.

Machine Learning (ML) models are used for inferencing of highly sensitive data in many industries such as government, healthcare, financial, and pharmaceutical. These industries require tools and services that protect their data in transit, at rest, and isolate data while in use. During processing, threats may originate from the technology stack such as the operating system or programs installed on the host which we need to protect against. Having a process that enforces the separation of roles and responsibilities within an organization minimizes the ability of personnel to access sensitive data. In this post, we walk you through how to run ML inference inside AWS Nitro Enclaves to illustrate how your sensitive data is protected during processing.

We are using a Nitro Enclave to run ML inference on sensitive data which helps reduce the attack surface area when the data is decrypted for processing. Nitro Enclaves enable you to create isolated compute environments within Amazon EC2 instances to protect and securely process highly sensitive data. Enclaves have no persistent storage, no interactive access, and no external networking. Communication between your instance and your enclave is done using a secure local channel called a vsock. By default, even an admin or root user on the parent instance will not be able to access the enclave.

Overview

Our example use-case demonstrates how to deploy an AI/ML workload and run inferencing inside Nitro Enclaves to securely process sensitive data. We use an image to demonstrate the process of how data can be encrypted, stored, transferred, decrypted and processed when necessary, to minimize the risk to your sensitive data. The workload uses an open-source AI/ML model to detect objects in an image, representing the sensitive data, and returns a summary of the type of objects detected. The image below is used for illustration purposes to provide clarity on the inference that occurs inside the Nitro Enclave. It was generated by adding bounding boxes to the original image based on the coordinates returned by the AI/ML model.

Figure 1 – Image of airplanes with bounding boxes

To encrypt this image, we use a Python script (the Encryptor app, see Figure 2) that runs on an EC2 instance. In a real-world scenario, this step would be performed in a secure environment, such as a Nitro Enclave or a secured workstation, before transferring the encrypted data. The Encryptor app uses AWS KMS envelope encryption with a symmetric customer master key (CMK) to encrypt the data.

Figure 2 – Image Encryption with AWS KMS using Envelope Encryption

Note, it’s also possible to use asymmetrical keys to perform the encryption/decryption.

Now that the image is encrypted, let’s look at each component and its role in the solution architecture, see Figure 3 below for reference.

  1. The Client app reads the encrypted image file and sends it to the Server app over the vsock (secure local communication channel).
  2. The Server app, running inside a Nitro Enclave, extracts the encrypted data key and sends it to AWS KMS for decryption. Once the data key is decrypted, the Server app uses it to decrypt the image and run inference on it to detect the objects in the image. Once the inference is complete, the results are returned to the Client app without exposing the original image or sensitive data.
  3. To allow the Nitro Enclave to communicate with AWS KMS, we use the KMS Enclave Tool which uses the vsock to connect to AWS KMS and decrypt the encrypted key.
  4. The vsock-proxy (packaged with the Nitro CLI) routes incoming traffic from the KMS Tool to AWS KMS provided that the AWS KMS endpoint is included on the vsock-proxy allowlist. The response from AWS KMS is then sent back to the KMS Enclave Tool over the vsock.

As part of the request to AWS KMS, the KMS Enclave Tool extracts and sends a signed attestation document to AWS KMS containing the enclave’s measurements to prove its identity. AWS KMS will validate the attestation document before decrypting the data key. Once validated, the data key is decrypted and securely returned to the KMS Tool which securely transfers it to the Server app to decrypt the image.

Figure 3 – Solution architecture diagram for this blog post
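
To make the Client app to Server app exchange in step 1 more concrete, the sketch below shows one way to move a payload over a vsock stream using Python's standard `socket` module. The 8-byte length prefix, the helper names, and the CID and port values are assumptions for illustration. The sample project may frame its messages differently.

```python
import socket
import struct

HEADER = struct.Struct("!Q")  # assumed framing: 8-byte big-endian length


def send_framed(sock, payload: bytes) -> None:
    """Send the payload preceded by its length."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock, size: int) -> bytes:
    """Read exactly size bytes, since recv may return partial data."""
    chunks = []
    remaining = size
    while remaining:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_framed(sock) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return _recv_exact(sock, length)


def connect_to_enclave(cid: int, port: int) -> socket.socket:
    """Open a vsock stream to the enclave (Linux only)."""
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    sock.connect((cid, port))
    return sock
```

On the parent instance, a client could call `connect_to_enclave(enclave_cid, port)` with the CID reported when the enclave starts and a port of your choosing, send the encrypted image with `send_framed`, and read the detection summary with `recv_framed`.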

Environment Setup

Prerequisites

Before we get started, you will need the following prerequisites to deploy the solution:

  1. AWS account
  2. AWS Identity and Access Management (IAM) role with appropriate access

AWS CloudFormation Template

We are going to use AWS CloudFormation to provision our infrastructure.

  1. Download the CloudFormation (CFN) template nitro-enclave-demo.yaml. This template orchestrates an EC2 instance with the required networking components such as a VPC, Subnet and NAT Gateway.
  2. Log in to the AWS Management Console and select the AWS Region where you’d like to deploy this stack. In the example, we select Canada (Central).
  3. Open the AWS CloudFormation console at: https://console.aws.amazon.com/cloudformation/
  4. Choose Create Stack, Template is ready, Upload a template file. Choose File to select nitro-enclave-demo.yaml that you saved locally.
  5. Choose Next, enter a stack name such as NitroEnclaveStack, choose Next.
  6. On the subsequent screens, leave the defaults, and continue to select Next until you arrive at the Review step.
  7. At the Review step, scroll to the bottom, place a checkmark in “I acknowledge that AWS CloudFormation might create IAM resources with custom names.”, and click “Create stack”.
  8. The stack status is initially CREATE_IN_PROGRESS. It will take around 5 minutes to complete. Click the Refresh button periodically to refresh the status. Upon completion, the status changes to CREATE_COMPLETE.
  9. Once completed, click on the “Resources” tab and search for “NitroEnclaveInstance”, then click on its “Physical ID” to navigate to the EC2 instance.
  10. On the Amazon EC2 page, select the instance and click “Connect”.
  11. Choose “Session Manager” and click “Connect”.

EC2 Instance Configuration

Now that the EC2 instance has been provisioned and you are connected to it, follow these steps to configure it:

  1. Install the Nitro Enclaves CLI which will allow you to build and run a Nitro Enclave application:
    sudo amazon-linux-extras install aws-nitro-enclaves-cli -y
    sudo yum install aws-nitro-enclaves-cli-devel -y
    
  2. Verify that the Nitro Enclaves CLI was installed successfully by running the following command:
    nitro-cli --version

  3. To download the application from GitHub and build a Docker image, install Git, add ssm-user to the ne and docker groups, and then start and enable the Docker service by executing the following commands:
    sudo yum install git -y
    sudo usermod -aG ne ssm-user
    sudo usermod -aG docker ssm-user
    sudo systemctl start docker && sudo systemctl enable docker
    

Nitro Enclave Configuration

A Nitro Enclave is an isolated environment which runs within the EC2 instance, hence we need to specify the resources (CPU & Memory) that the Nitro Enclaves allocator service dedicates to the enclave.

  1. Enter the following commands to set the memory that the Nitro Enclaves allocator service can allocate to your enclave container, and to start and enable the service (these commands leave the configured CPU count unchanged):
    ALLOCATOR_YAML=/etc/nitro_enclaves/allocator.yaml
    MEM_KEY=memory_mib
    DEFAULT_MEM=20480
    sudo sed -r "s/^(\s*${MEM_KEY}\s*:\s*).*/\1${DEFAULT_MEM}/" -i "${ALLOCATOR_YAML}"
    sudo systemctl start nitro-enclaves-allocator.service && sudo systemctl enable nitro-enclaves-allocator.service
    
  2. To verify the configuration has been applied, run the following command and note the values for memory_mib and cpu_count:
    cat /etc/nitro_enclaves/allocator.yaml

Creating a Nitro Enclave Image

Download the Project and Build the Enclave Base Image

Now that the EC2 instance is configured, download the workload code and build the enclave base Docker image. This image contains the Nitro Enclaves Software Development Kit (SDK) which allows an enclave to request a cryptographically signed attestation document from the Nitro Hypervisor. The attestation document includes unique measurements (SHA384 hashes) that are used to prove the enclave’s identity to services such as AWS KMS.

  1. Clone the GitHub project:
    cd ~/ && git clone https://github.com/aws-samples/aws-nitro-enclaves-ai-ml-object-detection.git
  2. Navigate to the cloned project’s folder and build the “enclave_base” image:
    cd ~/aws-nitro-enclaves-ai-ml-object-detection/enclave-base-image
    sudo docker build ./ -t enclave_base

    Note: The above step will take approximately 8-10 minutes to complete.

Build and Run The Nitro Enclave Image

To build the Nitro Enclave image of the workload, build a docker image of your application and then use the Nitro CLI to build the Nitro Enclave image:

  1. Download TensorFlow pre-trained model:
    cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
    mkdir -p models/faster_rcnn_openimages_v4_inception_resnet_v2_1 && cd models/
    wget -O tensorflow-model.tar.gz https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1?tf-hub-format=compressed
    tar -xvf tensorflow-model.tar.gz -C faster_rcnn_openimages_v4_inception_resnet_v2_1
  2. Navigate to the use-case folder and build the docker image for the application:
    cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
    sudo docker build ./ -t nitro-enclave-container-ai-ml:latest
  3. Use the Nitro CLI to build an Enclave Image File (.eif) using the docker image you built in the previous step:
    sudo nitro-cli build-enclave --docker-uri nitro-enclave-container-ai-ml:latest --output-file nitro-enclave-container-ai-ml.eif
  4. The output of the previous step includes the platform configuration register (PCR) hashes and a Nitro Enclave image file (.eif). Take note of the PCR0 value, which is a hash of the enclave image file. Example PCR0:
    {
        "Measurements": {
            "PCR0": "7968aee86dc343ace7d35fa1a504f955ee4e53f0d7ad23310e7df535a187364a0e6218b135a8c2f8fe205d39d9321923"
            ...
        }
    }
  5. Launch the Nitro Enclave container using the Enclave Image File (.eif) generated in the previous step and allocate resources to it. You should allocate at least 4 times the EIF file size for enclave memory. This is necessary because the tmpfs filesystem uses half of the memory and the remainder of the memory is used to uncompress the initial initramfs where the application executable resides. For CPU allocation, you should allocate CPU in full cores, i.e., 2x vCPU for x86 hyper-threaded instances.
    In our case, we are going to allocate 14 GiB (14,336 MiB) for the enclave:

    sudo nitro-cli run-enclave --cpu-count 2 --memory 14336 --eif-path nitro-enclave-container-ai-ml.eif

    Note: Allow a few seconds for the server to boot up prior to running the Client app in the below section “Object Detection using Nitro Enclaves”.
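
The sizing rule in step 5 can be expressed as a small helper. This is a sketch under stated assumptions: rounding up to whole-GiB steps is a choice made for this example, and the 3.5 GiB file size below is a hypothetical value picked because it yields the 14,336 MiB used in the command above.

```python
def enclave_memory_mib(eif_size_bytes: int, factor: int = 4, step_mib: int = 1024) -> int:
    """Return at least factor times the EIF size, in MiB, rounded up to step_mib."""
    mib = 1024 * 1024
    needed_mib = -(-eif_size_bytes * factor // mib)  # ceiling division
    return -(-needed_mib // step_mib) * step_mib


# Hypothetical 3.5 GiB EIF file
print(enclave_memory_mib(int(3.5 * 1024**3)))
```

In practice you could pass `os.path.getsize("nitro-enclave-container-ai-ml.eif")` as the size and use the result as the `--memory` value, provided it fits within the memory configured for the allocator service.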

Update the KMS Key Policy to Include the PCR0 Hash

Now that you have the PCR0 value for your enclave image, update the KMS key policy to only allow your Nitro Enclave container access to the KMS key.

  1. Navigate to AWS KMS in your AWS Console and make sure you are in the same region where your CloudFormation template was deployed
  2. Select “Customer managed keys”
  3. Search for a key with alias “EnclaveKMSKey” and click on it
  4. Click “Edit” on the “Key Policy”
  5. Scroll to the bottom of the key policy and replace the value of “EXAMPLETOBEUPDATED” for the “kms:RecipientAttestation:PCR0” key with the PCR0 hash you noted in the previous section and click “Save changes”
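
For reference, a key policy statement that ties decryption to an enclave measurement could be built as in the sketch below. The statement ID, the principal ARN, and the use of `StringEqualsIgnoreCase` are assumptions for illustration. The statement created by the CloudFormation template may differ, so treat this as a way to understand the condition key and not as a replacement for the template's policy.

```python
import json
import re


def enclave_decrypt_statement(principal_arn: str, pcr0: str) -> dict:
    """Build a statement allowing kms:Decrypt only with a matching PCR0."""
    # PCR values are SHA384 hashes, so expect 96 hexadecimal characters.
    if not re.fullmatch(r"[0-9a-fA-F]{96}", pcr0):
        raise ValueError("PCR0 must be a 96-character hexadecimal string")
    return {
        "Sid": "AllowDecryptFromEnclaveOnly",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {
            "StringEqualsIgnoreCase": {"kms:RecipientAttestation:PCR0": pcr0}
        },
    }


# Hypothetical role ARN and placeholder measurement
statement = enclave_decrypt_statement(
    "arn:aws:iam::111122223333:role/ExampleEnclaveInstanceRole", "0" * 96
)
print(json.dumps(statement, indent=2))
```

Validating the hash length before saving the policy catches copy-and-paste truncation, which would otherwise surface later as a failed decrypt call from the enclave.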

AI/ML Object Detection using a Nitro Enclave

Now that you have an enclave image file, run the components of the solution.

Requirements Installation for Client App

  1. Install the python requirements using the following command:
    cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
    pip3 install -r requirements.txt
  2. Set the region that your CloudFormation stack is deployed in. In our case, we selected Canada (Central), so replace “ca-central-1” if you deployed your stack elsewhere:
    CFN_REGION=ca-central-1
  3. Run the following command to encrypt the image using the AWS KMS key “EnclaveKMSKey”. The command reads the region from the CFN_REGION variable, so make sure that variable holds the region where you deployed your CloudFormation template:
    python3 ./envelope-encryption/encryptor.py --filePath ./images/air-show.jpg --cmkId alias/EnclaveKMSkey --region $CFN_REGION
  4. Verify that the output contains: file encrypted? True
    Note: The previous command generates two files: an encrypted image file and an encrypted data key file. The data key file is generated so we can demonstrate an attempt from the parent instance at decrypting the data key.

Launching VSock Proxy

Launch the VSock Proxy, which proxies requests from the Nitro Enclave to an external endpoint, in this case AWS KMS. Note that the file vsock-proxy-config.yaml contains the allowlist of endpoints that an enclave can communicate with.

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
vsock-proxy 8001 "kms.$CFN_REGION.amazonaws.com" 443 --config vsock-proxy-config.yaml &

Object Detection using Nitro Enclaves

Send the encrypted image to the enclave to decrypt the image and use the AI/ML model to detect objects and return a summary of the objects detected:

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
python3 client.py --filePath ./images/air-show.jpg.encrypted | jq -C '.'

The previous step takes around a minute to complete when first called. Inside the enclave, the server application decrypts the image, runs it through the AI/ML model to generate a list of objects detected and returns that list to the client application.

Parent Instance Credentials

Attempt to Decrypt Data Key using Parent Instance Credentials

To prove that the parent instance is not able to decrypt the content, attempt to decrypt the data key using the parent’s credentials:

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
aws kms decrypt --ciphertext-blob fileb://images/air-show.jpg.data_key.encrypted --region $CFN_REGION

Note: The command is expected to fail with AccessDeniedException, since the parent instance is not allowed to decrypt the data key.

Cleaning up

  1. Open the AWS CloudFormation console at: https://console.aws.amazon.com/cloudformation/.
  2. Select the stack you created earlier, such as NitroEnclaveStack.
  3. Choose Delete, then choose Delete Stack.
  4. The stack status is initially DELETE_IN_PROGRESS. Click the Refresh button periodically to refresh its status. The status changes to DELETE_COMPLETE after it’s finished and the stack name no longer appears in your list of active stacks.

Conclusion

In this post, we showcase how to process sensitive data with Nitro Enclaves using an AI/ML model deployed on Amazon EC2, as well as how to integrate an enclave with AWS KMS to restrict access to an AWS KMS CMK so that only the Nitro Enclave is allowed to use the key and decrypt the image.

We encrypt the sample data with envelope encryption to illustrate how to protect, transfer and securely process highly sensitive data. This process would be similar for any kind of sensitive information such as personally identifiable information (PII), healthcare or intellectual property (IP) which could also be the AI/ML model.

Dig deeper by exploring how to further restrict your AWS KMS CMK using additional PCR hashes such as PCR1 (hash of the Linux kernel and bootstrap), PCR2 (hash of the application), and other hashes available to you.

Also, try our comprehensive Nitro Enclave workshop which includes use-cases at different complexity levels.

Reducing Your Organization’s Carbon Footprint with Amazon CodeGuru Profiler

Post Syndicated from Isha Dua original https://aws.amazon.com/blogs/devops/reducing-your-organizations-carbon-footprint-with-codeguru-profiler/

It is crucial to examine every functional area when firms reorient their operations toward sustainable practices. Making informed decisions is necessary to reduce the environmental effect of an IT stack when creating, deploying, and maintaining it. To build a sustainable business for our customers and for the world we all share, we have deployed data centers that provide the efficient, resilient service our customers expect while minimizing our environmental footprint—and theirs. While we work to improve the energy efficiency of our datacenters, we also work to help our customers improve their operations on the AWS cloud. This two-pronged approach is based on the concept of the shared responsibility between AWS and AWS’ customers. As shown in the diagram below, AWS focuses on optimizing the sustainability of the cloud, while customers are responsible for sustainability in the cloud, meaning that AWS customers must optimize the workloads they have on the AWS cloud.

Figure 1. Shared responsibility model for sustainability


Just by migrating to the cloud, AWS customers become significantly more sustainable in their technology operations. On average, AWS customers use 77% fewer servers, 84% less power, and a 28% cleaner power mix, ultimately reducing their carbon emissions by 88% compared to when they ran workloads in their own data centers. These improvements are attributable to the technological advancements and economies of scale that AWS datacenters bring. However, there are still significant opportunities for AWS customers to make their cloud operations more sustainable. To uncover this, we must first understand how emissions are categorized.

The Greenhouse Gas Protocol organizes carbon emissions into the following scopes, along with relevant emission examples within each scope for a cloud provider such as AWS:

  • Scope 1: All direct emissions from the activities of an organization or under its control. For example, fuel combustion by data center backup generators.
  • Scope 2: Indirect emissions from electricity purchased and used to power data centers and other facilities. For example, emissions from commercial power generation.
  • Scope 3: All other indirect emissions from activities of an organization from sources it doesn’t control. AWS examples include emissions related to data center construction, and the manufacture and transportation of IT hardware deployed in data centers.

From an AWS customer perspective, emissions from customer workloads running on AWS are accounted for as indirect emissions, and part of the customer’s Scope 3 emissions. Each workload deployed generates a fraction of the total AWS emissions from each of the previous scopes. The actual amount varies per workload and depends on several factors including the AWS services used, the energy consumed by those services, the carbon intensity of the electric grids serving the AWS data centers where they run, and the AWS procurement of renewable energy.

At a high level, AWS customers approach optimization initiatives at three levels:

  • Application (Architecture and Design): Using efficient software designs and architectures to minimize the average resources required per unit of work.
  • Resource (Provisioning and Utilization): Monitoring workload activity and modifying the capacity of individual resources to prevent idling due to over-provisioning or under-utilization.
  • Code (Code Optimization): Using code profilers and other tools to identify the areas of code that use up the most time or resources as targets for optimization.

In this blogpost, we will concentrate on code-level sustainability improvements and how they can be realized using Amazon CodeGuru Profiler.

How CodeGuru Profiler improves code sustainability

Amazon CodeGuru Profiler collects runtime performance data from your live applications and provides recommendations that can help you fine-tune your application performance. Using machine learning algorithms, CodeGuru Profiler can help you find your most CPU-intensive lines of code, which contribute the most to your Scope 3 emissions. CodeGuru Profiler then suggests ways to improve the code to make it less CPU demanding. It also provides different visualizations of profiling data to help you identify what code is running on the CPU, see how much time is consumed, and find ways to reduce CPU utilization. Optimizing your code with CodeGuru Profiler leads to the following:

  • Improvements in application performance
  • Reduction in cloud cost, and
  • Reduction in the carbon emissions attributable to your cloud workload.

When your code performs the same task with less CPU, your applications run faster, customer experience improves, and your costs fall alongside your cloud emissions. CodeGuru Profiler generates the recommendations that help you make your code faster by using an agent that continuously samples stack traces from your application. The stack traces indicate how much time the CPU spends on each function or method in your code. That information is then transformed into CPU and latency data that is used to detect anomalies. When anomalies are detected, CodeGuru Profiler generates recommendations that clearly outline what you should do to remediate the situation. Although CodeGuru Profiler has several visualizations that help you visualize your code, in many cases, customers can implement these recommendations without reviewing the visualizations. Let’s demonstrate this with a simple example.
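To make the sampling idea concrete, here is a minimal Python sketch of how sampled stack traces can be aggregated into a per-function share of CPU time. It illustrates the general technique of a sampling profiler only; it is not CodeGuru Profiler's agent or algorithm, and the function name `self_time_share` and the sample stacks are hypothetical.

```python
from collections import Counter


def self_time_share(samples):
    """Estimate each function's share of CPU time from sampled stacks.

    `samples` is a list of stack traces, each a list of function names
    ordered from the outermost caller to the function running on the CPU.
    """
    # The last frame of each sample is the code that was executing.
    counts = Counter(stack[-1] for stack in samples if stack)
    total = sum(counts.values())
    if total == 0:
        return {}
    # With evenly spaced samples, sample share approximates time share.
    return {name: n / total for name, n in counts.most_common()}


# Hypothetical samples, for illustration only.
example_samples = (
    [["lambda_handler", "create_client"]] * 8
    + [["lambda_handler", "print"]] * 2
)
shares = self_time_share(example_samples)
```

In this made-up data, the function at the top of most samples is the first candidate for optimization, which is the kind of hotspot a profiler surfaces.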

Demonstration: Using CodeGuru Profiler to optimize a Lambda function

In this demonstration, CodeGuru Profiler identifies the inefficiencies in an AWS Lambda function.

Building our Lambda Function (10mins)

To keep this demonstration quick and simple, let’s create a simple Lambda function that displays ‘Hello World’. Before writing the code for this function, let’s review two important concepts. First, when writing Python code that runs on AWS and calls AWS services, two critical steps are required:

  • Import the AWS SDK for Python (boto3)
  • Create the AWS SDK service client

The Python code lines (that will be part of our function) that execute the steps listed above are shown below:

import boto3 #this will import the AWS SDK library for Python
VariableName = boto3.client('dynamodb') #this will create the AWS SDK service client

Secondly, AWS Lambda functions consist of two functional sections:

  • Initialization code
  • Handler code

The first time a function is invoked (i.e., a cold start), Lambda downloads the function code, creates the required runtime environment, runs the initialization code, and then runs the handler code. During subsequent invocations (warm starts), to keep execution time low, Lambda bypasses the initialization code and goes straight to the handler code. AWS Lambda is designed such that the SDK service client created during initialization persists into the handler code execution. For this reason, AWS SDK service clients should be created in the initialization code. If the code lines for creating the AWS SDK service client are placed in the handler code, the AWS SDK service client will be recreated every time the Lambda function is invoked, needlessly increasing the duration of the Lambda function during cold and warm starts. This inadvertently increases CPU demand (and cost), which in turn increases the carbon emissions attributable to the customer’s code. Below, you can see the green and brown versions of the same Lambda function.
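As a rough text rendering of the two versions, the following sketch contrasts a "green" handler that reuses a client created during initialization with a "brown" handler that recreates it on every invocation. To keep it runnable without AWS credentials or the AWS SDK, it uses a hypothetical `FakeClient` stand-in that counts how many clients are built; in a real function, `make_client` would be `boto3.client`. The helper `clients_created_by` is also an assumption made for illustration.

```python
class FakeClient:
    """Stand-in for an AWS SDK service client; counts how many were built."""

    created = 0

    def __init__(self, service_name):
        FakeClient.created += 1
        self.service_name = service_name


def make_client(service_name):
    # In a real Lambda function this would be boto3.client(service_name).
    return FakeClient(service_name)


# Green version: the client is created once, in the initialization code.
green_client = make_client("dynamodb")


def green_handler(event, context):
    _ = green_client.service_name  # reuse the client that already exists
    return "Hello World"


# Brown version: the client is recreated inside the handler on every call.
def brown_handler(event, context):
    client = make_client("dynamodb")
    return "Hello World"


def clients_created_by(handler, invocations):
    """Count how many clients a handler builds across warm invocations."""
    before = FakeClient.created
    for _ in range(invocations):
        handler({}, None)
    return FakeClient.created - before
```

Calling `clients_created_by(green_handler, 100)` shows no additional clients beyond the one built at initialization, whereas the brown handler builds one per invocation.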

Now that we understand the importance of structuring our Lambda function code for efficient execution, let’s create a Lambda function that recreates the SDK service client. We will then watch CodeGuru Profiler flag this issue and generate a recommendation.

  1. Open AWS Lambda from the AWS Console and click on Create function.
  2. Select Author from scratch, name the function ‘demo-function’, select Python 3.9 under runtime, select x86_64 under Architecture.
  3. Expand Permissions, then choose whether to create a new execution role or use an existing one.
  4. Expand Advanced settings, and then select Function URL.
  5. For Auth type, choose AWS_IAM or NONE.
  6. Select Configure cross-origin resource sharing (CORS). By selecting this option during function creation, your function URL allows requests from all origins by default. You can edit the CORS settings for your function URL after creating the function.
  7. Choose Create function.
  8. In the code editor tab of the code source window, copy and paste the code below:
#initialization code
import json
import boto3

#handler code
def lambda_handler(event, context):
  client = boto3.client('dynamodb') #create AWS SDK service client (deliberately inside the handler for this demo)
  #simple codeblock for demonstration purposes
  output = 'Hello World'
  print(output)
  #handler function return
  return output

Ensure that the handler code is properly indented.

  9. Save the code, Deploy, and then Test.
  10. For the first execution of this Lambda function, a test event configuration dialog will appear. On the Configure test event dialog window, leave the selection as the default (Create new event), enter ‘demo-event’ as the Event name, and leave the hello-world template as the Event template.
  11. When you run the code by clicking on Test, the console should return ‘Hello World’.
  12. To simulate actual traffic, let’s run a curl loop that invokes the Lambda function repeatedly, pausing 0.06 seconds between requests. On a bash terminal, run the following command:
while true; do curl {Lambda Function URL}; sleep 0.06; done

If you do not have Git Bash installed, you can use AWS Cloud9, which supports curl commands.

Enabling CodeGuru Profiler for our Lambda function

We will now set up CodeGuru Profiler to monitor our Lambda function. For Lambda functions running on Java 8 (Amazon Corretto), Java 11, and Python 3.8 or 3.9 runtimes, CodeGuru Profiler can be enabled through a single click in the configuration tab in the AWS Lambda console. Other runtimes can be enabled by following a series of steps that can be found in the CodeGuru Profiler documentation for Java and Python.

Our demo code is written in Python 3.9, so we will enable Profiler from the configuration tab in the AWS Lambda console.

  1. On the AWS Lambda console, select the demo-function that we created.
  2. Navigate to Configuration > Monitoring and operations tools, and click Edit on the right side of the page.

  3. Scroll down to Amazon CodeGuru Profiler and click the button next to Code profiling to turn it on. After enabling Code profiling, click Save.

Note: CodeGuru Profiler requires 5 minutes of Lambda runtime data to generate results. After your Lambda function provides this runtime data, which may need multiple runs if your Lambda function has a short runtime, it will display within the Profiling group page in the CodeGuru Profiler console. The profiling group will be given a default name (i.e., aws-lambda-<lambda-function-name>), and it will take approximately 15 minutes after CodeGuru Profiler receives the runtime data for this profiling group to appear. Be patient. Although our function duration is ~33ms, our curl script invokes the application once every 0.06 seconds. This should give Profiler sufficient information to profile our function in a couple of hours. After 5 minutes, our profiling group should appear in the list of active profiling groups as shown below.
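For a sense of scale, the figures quoted above can be turned into a back-of-the-envelope estimate. This sketch assumes that "5 minutes of runtime data" means cumulative function execution time, that each invocation runs for about 33 ms, and that the loop issues one request at a time. These are simplifying assumptions for illustration, not documented CodeGuru Profiler behavior, and the function names are hypothetical.

```python
import math


def invocations_needed(required_runtime_s, duration_s):
    """Invocations needed to accumulate the required execution time."""
    return math.ceil(required_runtime_s / duration_s)


def loop_time_s(invocations, sleep_s, duration_s=0.0):
    """Wall-clock time for a sequential loop: each pass waits for the
    function to finish (duration_s) and then sleeps (sleep_s)."""
    return invocations * (sleep_s + duration_s)


needed = invocations_needed(required_runtime_s=5 * 60, duration_s=0.033)
# Lower bound: counts only the sleep between requests.
lower_bound_s = loop_time_s(needed, sleep_s=0.06)
# Slightly more realistic: also counts the function duration itself.
with_duration_s = loop_time_s(needed, sleep_s=0.06, duration_s=0.033)

print(needed, round(lower_bound_s / 60, 1), round(with_duration_s / 60, 1))
```

Under these assumptions, roughly nine thousand invocations are needed, so the loop has to run for at least several minutes before enough runtime data exists. Network latency on each curl request would lengthen that further.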

Depending on how frequently your Lambda function is invoked, it can take up to 15 minutes to aggregate profiles, after which you can see your first visualization in the CodeGuru Profiler console. The granularity of the first visualization depends on how active your function was during those first 5 minutes of profiling—an application that is idle most of the time doesn’t have many data points to plot in the default visualization. However, you can remedy this by looking at a wider time period of profiled data, for example, a day or even up to a week, if your application has very low CPU utilization. For our demo function, a recommendation should appear after about an hour. By this time, the profiling groups list should show that our profiling group now has one recommendation.

Profiler has now flagged the repeated creation of the SDK service client with every invocation.

From the information provided, we can see that our CPU is spending 5x more computing time than expected on the recreation of the SDK service client. The estimated cost impact of this inefficiency is also provided. In production environments, the cost impact of seemingly minor inefficiencies can scale very quickly to several kilograms of CO2 and hundreds of dollars as the invocation frequency and the number of Lambda functions increase.

CodeGuru Profiler integrates with Amazon DevOps Guru, a fully managed service that makes it easy for developers and operators to improve the performance and availability of their applications. Amazon DevOps Guru analyzes operational data and application metrics to identify behaviors that deviate from normal operating patterns. Once these operational anomalies are detected, DevOps Guru presents intelligent recommendations that address current and predicted future operational issues. By integrating with CodeGuru Profiler, customers can now view operational anomalies and code optimization recommendations on the DevOps Guru console. The integration, which is enabled by default, is only applicable to Lambda resources that are supported by CodeGuru Profiler and monitored by both DevOps Guru and CodeGuru.

We can now stop the curl loop (Control+C) so that the Lambda function stops running. Next, we delete the profiling group that was created when we enabled profiling in Lambda, and then delete the Lambda function or repurpose as needed.

Conclusion

Cloud sustainability is a shared responsibility between AWS and our customers. While we work to make our data centers more sustainable, customers also have to work to make their code, resources, and applications more sustainable, and CodeGuru Profiler can help you improve code sustainability, as demonstrated above. To start profiling your code today, visit the CodeGuru Profiler documentation page. To start monitoring your applications, head over to the Amazon DevOps Guru documentation page.

About the authors:

Isha Dua

Isha Dua is a Senior Solutions Architect based in San Francisco Bay Area. She helps AWS Enterprise customers grow by understanding their goals and challenges, and guiding them on how they can architect their applications in a cloud native manner while making sure they are resilient and scalable. She’s passionate about machine learning technologies and Environmental Sustainability.

Christian Tomeldan

Christian Tomeldan is a DevOps Engineer turned Solutions Architect. Operating out of San Francisco, he is passionate about technology and conveys that passion to customers ensuring they grow with the right support and best practices. He focuses his technical depth mostly around Containers, Security, and Environmental Sustainability.

Ifeanyi Okafor

Ifeanyi Okafor is a Product Manager with AWS. He enjoys building products that solve customer problems at scale.

Simplifying Amazon EC2 instance type flexibility with new attribute-based instance type selection features

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/simplifying-amazon-ec2-instance-type-flexibility-with-new-attribute-based-instance-type-selection-features/

This blog is written by Rajesh Kesaraju, Sr. Solution Architect, EC2-Flexible Compute and Peter Manastyrny, Sr. Product Manager, EC2.

Today AWS is adding two new attributes for the attribute-based instance type selection (ABS) feature to make it even easier to create and manage instance type flexible configurations on Amazon EC2. The new network bandwidth attribute allows customers to request instances based on the network requirements of their workload. The new allowed instance types attribute is useful for workloads that have some instance type flexibility but still need more granular control over which instance types to run on.

The two new attributes are supported in EC2 Auto Scaling Groups (ASG), EC2 Fleet, Spot Fleet, and Spot Placement Score.

Before exploring the new attributes in detail, let us review the core ABS capability.

ABS refresher

ABS lets you express your instance type requirements as a set of attributes, such as vCPU, memory, and storage when provisioning EC2 instances with ASG, EC2 Fleet, or Spot Fleet. Your requirements are translated by ABS to all matching EC2 instance types, simplifying the creation and maintenance of instance type flexible configurations. ABS identifies the instance types based on attributes that you set in ASG, EC2 Fleet, or Spot Fleet configurations. When Amazon EC2 releases new instance types, ABS will automatically consider them for provisioning if they match the selected attributes, removing the need to update configurations to include new instance types.

ABS helps you to shift from an infrastructure-first to an application-first paradigm. ABS is ideal for workloads that need generic compute resources and do not necessarily require the hardware differentiation that the Amazon EC2 instance type portfolio delivers. By defining a set of compute attributes instead of specific instance types, you allow ABS to always consider the broadest and newest set of instance types that qualify for your workload. When you use EC2 Spot Instances to optimize your costs and save up to 90% compared to On-Demand prices, instance type diversification is the key to access the highest amount of Spot capacity. ABS provides an easy way to configure and maintain instance type flexible configurations to run fault-tolerant workloads on Spot Instances.

We recommend ABS as the default compute provisioning method for instance type flexible workloads including containerized apps, microservices, web applications, big data, and CI/CD.

Now, let us dive deep on the two new attributes: network bandwidth and allowed instance types.

How network bandwidth attribute for ABS works

Network bandwidth attribute allows customers with network-sensitive workloads to specify their network bandwidth requirements for compute infrastructure. Some of the workloads that depend on network bandwidth include video streaming, networking appliances (e.g., firewalls), and data processing workloads that require faster inter-node communication and high-volume data handling.

The network bandwidth attribute uses the same min/max format as other ABS attributes (e.g., vCPU count or memory) that assume a numeric value or range (e.g., min: ‘10’ or min: ‘15’; max: ‘40’). Note that setting the minimum network bandwidth does not guarantee that your instance will achieve that network bandwidth. ABS will identify instance types that support the specified minimum bandwidth, but the actual bandwidth of your instance might go below the specified minimum at times.

Two important things to remember when using the network bandwidth attribute are:

  • ABS will only take burst bandwidth values into account when evaluating maximum values. When evaluating minimum values, only the baseline bandwidth will be considered.
    • For example, if you specify the minimum bandwidth as 10 Gbps, instances that have burst bandwidth of “up to 10 Gbps” will not be considered, as their baseline bandwidth is lower than the minimum requested value (e.g., m5.4xlarge is burstable up to 10 Gbps with a baseline bandwidth of 5 Gbps).
    • Alternatively, c5n.2xlarge, which is burstable up to 25 Gbps with a baseline bandwidth of 10 Gbps will be considered because its baseline bandwidth meets the minimum requested value.
  • Our recommendation is to only set a value for maximum network bandwidth if you have specific requirements to restrict instances with higher bandwidth. That would help to ensure that ABS considers the broadest possible set of instance types to choose from.
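The baseline-versus-burst rule above can be sketched as a small filter. This is an illustration of the rule as described in this post, not the ABS implementation: the `Bandwidth` type and `matches` function are hypothetical, the two instance figures are the ones quoted in the bullets, and treating the maximum as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bandwidth:
    baseline_gbps: float
    burst_gbps: float  # equal to the baseline for non-burstable instances


def matches(bw: Bandwidth,
            min_gbps: Optional[float] = None,
            max_gbps: Optional[float] = None) -> bool:
    """Apply the min/max rule: minimum vs. baseline, maximum vs. burst."""
    if min_gbps is not None and bw.baseline_gbps < min_gbps:
        return False
    if max_gbps is not None and bw.burst_gbps > max_gbps:
        return False
    return True


# Figures taken from the examples in the bullets above.
catalog = {
    "m5.4xlarge": Bandwidth(baseline_gbps=5, burst_gbps=10),
    "c5n.2xlarge": Bandwidth(baseline_gbps=10, burst_gbps=25),
}

# Requesting a minimum of 10 Gbps keeps only c5n.2xlarge.
selected = [name for name, bw in catalog.items() if matches(bw, min_gbps=10)]
```

Leaving `max_gbps` unset mirrors the recommendation above: without a maximum, no instance type is excluded for having more bandwidth than needed.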

Using the network bandwidth attribute in ASG

In this example, let us look at a high-performance computing (HPC) workload or similar network bandwidth sensitive workload that requires a high volume of inter-node communications. We use ABS to select instances that have at minimum 10 Gbps of network bandwidth and at least 32 vCPUs and 64 GiB of memory.

To get started, you can create or update an ASG or EC2 Fleet set up with ABS configuration and specify the network bandwidth attribute.

The following example shows an ABS configuration with network bandwidth attribute set to a minimum of 10 Gbps. In this example, we do not set a maximum limit for network bandwidth. This is done to remain flexible and avoid restricting available instance type choices that meet our minimum network bandwidth requirement.

Create the following configuration file and name it: my_asg_network_bandwidth_configuration.json

{
    "AutoScalingGroupName": "network-bandwidth-based-instances-asg",
    "DesiredCapacityType": "units",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "LaunchTemplate-x86",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {"Min": 32},
                        "MemoryMiB": {"Min": 65536},
                        "NetworkBandwidthGbps": {"Min": 10}
                    }
                }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 30,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    },
    "MinSize": 1,
    "MaxSize": 10,
    "DesiredCapacity":10,
    "VPCZoneIdentifier": "subnet-f76e208a, subnet-f76e208b, subnet-f76e208c"
}
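The InstanceRequirements block in this file expresses memory in MiB, while the requirement was stated as 64 GiB. As a minimal sketch, the same block can be generated in Python with the unit conversion made explicit. The helper names `gib_to_mib` and `instance_requirements` are hypothetical; the keys are the ones used in the configuration file above.

```python
import json


def gib_to_mib(gib):
    """Convert GiB to MiB (1 GiB = 1024 MiB)."""
    return gib * 1024


def instance_requirements(min_vcpus, min_memory_gib, min_network_gbps=None):
    """Build an InstanceRequirements block with minimum values only."""
    requirements = {
        "VCpuCount": {"Min": min_vcpus},
        "MemoryMiB": {"Min": gib_to_mib(min_memory_gib)},
    }
    if min_network_gbps is not None:
        # No "Max" is set, so instance types with more bandwidth stay eligible.
        requirements["NetworkBandwidthGbps"] = {"Min": min_network_gbps}
    return requirements


hpc_requirements = instance_requirements(
    min_vcpus=32, min_memory_gib=64, min_network_gbps=10
)
print(json.dumps({"InstanceRequirements": hpc_requirements}, indent=4))
```

Generating the block this way keeps the workload's stated requirements (32 vCPUs, 64 GiB, 10 Gbps) in one place and avoids hand-converting units.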

Next, let us create an ASG from the my_asg_network_bandwidth_configuration.json file using the following command:

aws autoscaling create-auto-scaling-group --cli-input-json file://my_asg_network_bandwidth_configuration.json

As a result, you have created an ASG that may include instance types m5.8xlarge, m5.12xlarge, m5.16xlarge, m5n.8xlarge, and c5.9xlarge, among others. The actual selection at the time of the request is made by capacity optimized Spot allocation strategy. If EC2 releases an instance type in the future that would satisfy the attributes provided in the request, that instance will also be automatically considered for provisioning.

Considered Instances (not an exhaustive list)

Instance Type    Network Bandwidth
m5.8xlarge       10 Gbps
m5.12xlarge      12 Gbps
m5.16xlarge      20 Gbps
m5n.8xlarge      25 Gbps
c5.9xlarge       10 Gbps
c5.12xlarge      12 Gbps
c5.18xlarge      25 Gbps
c5n.9xlarge      50 Gbps
c5n.18xlarge     100 Gbps

Now let us focus our attention on another new attribute – allowed instance types.

How allowed instance types attribute works in ABS

As discussed earlier, ABS lets us provision compute infrastructure based on our application requirements instead of selecting specific EC2 instance types. Although this infrastructure agnostic approach is suitable for many workloads, some workloads, while having some instance type flexibility, still need to limit the selection to specific instance families, and/or generations due to reasons like licensing or compliance requirements, application performance benchmarking, and others. Furthermore, customers have asked us to provide the ability to restrict the auto-consideration of newly released instances types in their ABS configurations to meet their specific hardware qualification requirements before considering them for their workload. To provide this functionality, we added a new allowed instance types attribute to ABS.

The allowed instance types attribute allows ABS customers to narrow down the list of instance types that ABS considers for selection to a specific list of instances, families, or generations. It takes a comma-separated list of specific instance types, instance families, and wildcard (*) patterns. Note that it does not use the full regular expression syntax.

For example, consider a container-based web application that can run only on 5th generation instances from the compute optimized (c), general purpose (m), or memory optimized (r) families. This can be specified as "AllowedInstanceTypes": ["c5*", "m5*", "r5*"].

Another example could be to limit the ABS selection to only memory-optimized instances for big data Spark workloads. It can be specified as "AllowedInstanceTypes": ["r6*", "r5*", "r4*"].

Note that you cannot use both the existing exclude instance types and the new allowed instance types attributes together, because it would lead to a validation error.
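To see how the wildcard patterns differ, the following sketch approximates the matching with Python's `fnmatch` module. This is an assumption made for illustration, not the ABS implementation: `fnmatch` also gives special meaning to `?` and `[...]`, which the ABS patterns may not. The helpers `is_allowed` and `check_exclusive` are hypothetical; `ExcludedInstanceTypes` is the API name of the exclude instance types attribute.

```python
from fnmatch import fnmatchcase


def is_allowed(instance_type, allowed_patterns):
    """Return True if the instance type matches any allowed pattern."""
    return any(fnmatchcase(instance_type, p) for p in allowed_patterns)


def check_exclusive(requirements):
    """Mirror the rule that allowed and excluded lists cannot be combined."""
    if requirements.get("AllowedInstanceTypes") and requirements.get(
        "ExcludedInstanceTypes"
    ):
        raise ValueError(
            "AllowedInstanceTypes and ExcludedInstanceTypes "
            "cannot be specified together"
        )


candidates = ["c5.xlarge", "c5n.xlarge", "m5.2xlarge", "r5.large", "c6i.xlarge"]

# "c5*" also matches variants such as c5n; "c5.*" matches only the c5 family.
broad = [t for t in candidates if is_allowed(t, ["c5*", "m5*", "r5*"])]
narrow = [t for t in candidates if is_allowed(t, ["c5.*", "m5.*"])]
```

The difference between `c5*` and `c5.*` matters when a workload has been qualified on one family but not on its variants.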

Using allowed instance types attribute in ASG

Let us look at the InstanceRequirements section of an ASG configuration file for a sample web application. The AllowedInstanceTypes attribute is configured as ["c5.*", "m5.*", "c4.*", "m4.*"], which means that ABS will limit the instance type consideration set to any instance from the 4th and 5th generations of the c or m families. The additional attributes require a minimum of 4 vCPUs and 16 GiB of RAM and allow both Intel and AMD processors.

Create the following configuration file and name it: my_asg_allow_instance_types_configuration.json

{
    "AutoScalingGroupName": "allow-instance-types-based-instances-asg",
    "DesiredCapacityType": "units",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "LaunchTemplate-x86",
                "Version": "$Latest"
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {"Min": 4},
                        "MemoryMiB": {"Min": 16384},
                        "CpuManufacturers": ["intel", "amd"],
                        "AllowedInstanceTypes": ["c5.*", "m5.*", "c4.*", "m4.*"]
                    }
                }
            ]
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 30,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    },
    "MinSize": 1,
    "MaxSize": 10,
    "DesiredCapacity":10,
    "VPCZoneIdentifier": "subnet-f76e208a, subnet-f76e208b, subnet-f76e208c"
}

Once you create the ASG from this file (using the same aws autoscaling create-auto-scaling-group --cli-input-json command as before, pointed at my_asg_allow_instance_types_configuration.json), it may include instance types like m5.xlarge, m5.2xlarge, c5.xlarge, and c5.2xlarge, among others. The actual selection at the time of the request is made by the capacity-optimized Spot allocation strategy. Note that if EC2 later releases a new instance type that satisfies the other attributes in the request but is not a member of the 4th or 5th generation of the m or c families specified in the allowed instance types attribute, that instance type will not be considered for provisioning.

Selected Instances (not an exhaustive list)

m5.xlarge
m5.2xlarge
m5.4xlarge
c5.xlarge
c5.2xlarge
m4.xlarge
m4.2xlarge
m4.4xlarge
c4.xlarge
c4.2xlarge

As you can see, ABS considers a broad set of instance types for provisioning; however, they all meet the compute attributes that are required for your workload.

Cleanup

To delete both ASGs and terminate all the instances, execute the following commands:

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name network-bandwidth-based-instances-asg --force-delete

aws autoscaling delete-auto-scaling-group --auto-scaling-group-name allow-instance-types-based-instances-asg --force-delete

Conclusion

In this post, we explored the two new ABS attributes – network bandwidth and allowed instance types. Customers can use these attributes to select instances based on network bandwidth and to limit the set of instances that ABS selects from. The two new attributes, as well as the existing set of ABS attributes enable you to save time on creating and maintaining instance type flexible configurations and make it even easier to express the compute requirements of your workload.

ABS represents a paradigm shift in the way that our customers interact with compute, making it easier than ever to request diversified compute resources at scale. We recommend ABS as a tool to help you identify and access the largest amount of EC2 compute capacity for your instance type flexible workloads.