Tag Archives: Technical How-to

Configure Hadoop YARN CapacityScheduler on Amazon EMR on Amazon EC2 for multi-tenant heterogeneous workloads

2022-08-16 Suvojit Dasgupta

Post Syndicated from Suvojit Dasgupta original https://aws.amazon.com/blogs/big-data/configure-hadoop-yarn-capacityscheduler-on-amazon-emr-on-amazon-ec2-for-multi-tenant-heterogeneous-workloads/

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster resource manager responsible for assigning computational resources (CPU, memory, I/O), and scheduling and monitoring jobs submitted to a Hadoop cluster. This generic framework allows for effective management of cluster resources for distributed data processing frameworks, such as Apache Spark, Apache MapReduce, and Apache Hive. When supported by the framework, Amazon EMR by default uses Hadoop YARN. Please note that not all frameworks offered by Amazon EMR use Hadoop YARN, such as Trino/Presto and Apache HBase.

In this post, we discuss various components of Hadoop YARN, and understand how components interact with each other to allocate resources, schedule applications, and monitor applications. We dive deep into the specific configurations to customize Hadoop YARN’s CapacityScheduler to increase cluster efficiency by allocating resources in a timely and secure manner in a multi-tenant cluster. We take an opinionated look at the configurations for CapacityScheduler and configure them on Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) to solve for the common resource allocation, resource contention, and job scheduling challenges in a multi-tenant cluster.

We dive deep into CapacityScheduler because Amazon EMR uses CapacityScheduler by default, and CapacityScheduler has benefits over other schedulers for running workloads with heterogeneous resource consumption.

Solution overview

Modern data platforms often run applications on Amazon EMR with the following characteristics:

Heterogeneous resource consumption patterns by jobs, such as computation-bound jobs, I/O-bound jobs, or memory-bound jobs
Multiple teams running jobs with an expectation to receive an agreed-upon share of cluster resources and complete jobs in a timely manner
Cluster admins often have to cater to one-time requests for running jobs without impacting scheduled jobs
Cluster admins want to ensure users are using their assigned capacity and not using others
Cluster admins want to utilize the resources efficiently and allocate all available resources to currently running jobs, but want to retain the ability to reclaim resources automatically should there be a claim for the agreed-upon cluster resources from other jobs

To illustrate these use cases, let’s consider the following scenario:

user1 and user2 don’t belong to any team and use cluster resources periodically on an ad hoc basis
A data platform and analytics program has two teams:
- A data_engineering team, containing user3
- A data_science team, containing user4
user5 and user6 (and many other users) sporadically use cluster resources to run jobs

Based on this scenario, the scheduler queue may look like the following diagram. Take note of the common configurations applied to all queues, the overrides, and the user/groups-to-queue mappings.

Capacity Scheduler Queue Setup

In the subsequent sections, we will understand the high-level components of Hadoop YARN, discuss the various types of schedulers available in Hadoop YARN, review the core concepts of CapacityScheduler, and showcase how to implement this CapacityScheduler queue setup on Amazon EMR (on Amazon EC2). You can skip to Code walkthrough section if you are already familiar with Hadoop YARN and CapacityScheduler.

Overview of Hadoop YARN

At a high level, Hadoop YARN consists of three main components:

ResourceManager (one per primary node)
ApplicationMaster (one per application)
NodeManager (one per node)

The following diagram shows the main components and their interaction with each other.

Apache Hadoop Yarn Architecture Diagram1

Before diving further, let’s clarify what Hadoop YARN’s ResourceContainer (or container) is. A ResourceContainer represents a collection of physical computational resources. It’s an abstraction used to bundle resources into distinct, allocatable unit.

ResourceManager

The ResourceManager is responsible for resource management and making allocation decisions. It’s the ResourceManager’s responsibility to identify and allocate resources to a job upon submission to Hadoop YARN. The ResourceManager has two main components:

ApplicationsManager (not to be confused with ApplicationMaster)
Scheduler

ApplicationsManager

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for running ApplicationMaster, and providing the service for restarting the ApplicationMaster on failure.

Scheduler

The Scheduler is responsible for scheduling allocation of resources to the jobs. The Scheduler performs its scheduling function based on the resource requirements of the jobs. The Scheduler is a pluggable interface. Hadoop YARN currently provides three implementations:

CapacityScheduler – A pluggable scheduler for Hadoop that allows for multiple tenants to securely share a cluster such that jobs are allocated resources in a timely manner under constraints of allocated capacities. The implementation is available on GitHub. The Java concrete class is org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler. In this post, we primarily focus on CapacityScheduler, which is the default scheduler on Amazon EMR (on Amazon EC2).
FairScheduler – A pluggable scheduler for Hadoop that allows Hadoop YARN applications to share resources in clusters fairly. The implementation is available on GitHub. The Java concrete class is org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
FifoScheduler – A pluggable scheduler for Hadoop that allows Hadoop YARN applications share resources in clusters in a first-in-first-out basis. The implementation is available on GitHub. The Java concrete class is org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.

ApplicationMaster

Upon negotiating the first container by ApplicationsManager, the per-application ApplicationMaster has the responsibility of negotiating the rest of the appropriate resources from the Scheduler, tracking their status, and monitoring progress.

NodeManager

The NodeManager is responsible for launching and managing containers on a node.

Hadoop YARN on Amazon EMR

By default, Amazon EMR (on Amazon EC2) uses Hadoop YARN for cluster management for the distributed data processing frameworks that support Hadoop YARN as a resource manager, like Apache Spark, Apache MapReduce, and Apache Hive. Amazon EMR provides multiple sensible default settings that work for most scenarios. However, every data platform is different and has specific needs. Amazon EMR provides the ability to customize the setting at cluster creation using configuration classifications . You can also reconfigure Amazon EMR cluster applications and specify additional configuration classifications for each instance group in a running cluster using AWS Command Line Interface (AWS CLI), or the AWS SDK.

CapacityScheduler

CapacityScheduler depends on ResourceCalculator to identify the available resources and calculate the allocation of the resources to ApplicationMaster. The ResourceCalculator is an abstract Java class. Hadoop YARN currently provides two implementations:

DefaultResourceCalculator – In DefaultResourceCalculator, resources are calculated based on memory alone.
DominantResourceCalculator – DominantResourceCalculator is based on the Dominant Resource Fairness (DRF) model of resource allocation. The paper Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, Ghodsi et al. [2011] describes DRF as follows: “DRF computes the share of each resource allocated to that user. The maximum among all shares of a user is called that user’s dominant share, and the resource corresponding to the dominant share is called the dominant resource. Different users may have different dominant resources. For example, the dominant resource of a user running a computation-bound job is CPU, while the dominant resource of a user running an I/O-bound job is bandwidth. DRF simply applies max-min fairness across users’ dominant shares. That is, DRF seeks to maximize the smallest dominant share in the system, then the second-smallest, and so on.”

Because of DRF, DominantResourceCalculator is a better ResourceCalculator for data processing environments running heterogeneous workloads. By default, Amazon EMR uses DefaultResourceCalculator for CapacityScheduler. This can be verified by checking the value of yarn.scheduler.capacity.resource-calculator parameter in /etc/hadoop/conf/capacity-scheduler.xml.

Code walkthrough

CapacityScheduler provides multiple parameters to customize the scheduling behavior to meet specific needs. For a list of available parameters, refer to Hadoop: CapacityScheduler.

Refer to the configurations section in cloudformation/templates/emr.yaml to review all the CapacityScheduler parameters set as part of this post. In this example, we use two classifiers of Amazon EMR (on Amazon EC2):

yarn-site – The classification to update yarn-site.xml
capacity-scheduler – The classification to update capacity-scheduler.xml

For various types of classification available in Amazon EMR, refer to Customizing cluster and application configuration with earlier AMI versions of Amazon EMR.

In the AWS CloudFormation template, we have modified the ResourceCalculator of CapacityScheduler from the defaults, DefaultResourceCalculator to DominantResourceCalculator. Data processing environments tends to run different kinds of jobs, for example, computation-bound jobs consuming heavy CPU, I/O-bound jobs consuming heavy bandwidth, and memory-bound jobs consuming heavy memory. As previously stated, DominantResourceCalculator is better suited for such environments due to its Dominant Resource Fairness model of resource allocation. If your data processing environment only runs memory-bound jobs, then modifying this parameter isn’t necessary.

You can find the codebase in the AWS Samples GitHub repository.

Prerequisites

For deploying the solution, you should have the following prerequisites:

An AWS account
The AWS Command Line Interface (AWS CLI) installed
The GIT Command Line Interface (GIT CLI) installed
Permission to create AWS resources
Familiarity with AWS CloudFormation and Amazon EMR

Deploy the solution

To deploy the solution, complete the following steps:

Download the source code from the AWS Samples GitHub repository:

git clone [email protected]:aws-samples/amazon-emr-yarn-capacity-scheduler.git

Create an Amazon Simple Storage Service (Amazon S3) bucket:

aws s3api create-bucket --bucket emr-yarn-capacity-scheduler-<AWS_ACCOUNT_ID>-<AWS_REGION> --region <AWS_REGION>

Copy the cloned repository to the Amazon S3 bucket:

aws s3 cp --recursive amazon-emr-yarn-capacity-scheduler s3://emr-yarn-capacity-scheduler-<AWS_ACCOUNT_ID>-<AWS_REGION>/

Update the amazon-emr-yarn-capacity-scheduler/cloudformation/parameters/parameters.json file with appropriate values for the following keys. We have provided sensible defaults wherever possible. You should update the values to fit your specific requirements.

1. ArtifactsS3Repository – The S3 bucket name that was created in the previous step (emr-yarn-capacity-scheduler-<AWS_ACCOUNT_ID>-<AWS_REGION>).
2. emrKeyName – An existing EC2 key name. If you don’t have an existing key and want to create a new key, refer to Use an Amazon EC2 key pair for SSH credentials.
3. clientCIDR – The CIDR range of the client machine for accessing the EMR cluster via SSH. You can run the following command to identify the IP of the client machine: echo "$(curl -s http://checkip.amazonaws.com)/32"

Deploy the AWS CloudFormation templates:

aws cloudformation create-stack \
--stack-name emr-yarn-capacity-scheduler \
--template-url https://emr-yarn-capacity-scheduler-<AWS_ACCOUNT_ID>-<AWS_REGION>.s3.amazonaws.com/cloudformation/templates/main.yaml \
--parameters file://amazon-emr-yarn-capacity-scheduler/cloudformation/parameters/parameters.json \
--capabilities CAPABILITY_NAMED_IAM \
--region <AWS_REGION>

On the AWS CloudFormation console, check for the successful deployment of the following stacks.

AWS CloudFormation Stack Deployment

On the Amazon EMR console, check for the successful creation of the emr-cluster-capacity-scheduler cluster.
Choose the cluster and on the Configurations tab, review the properties under the capacity-scheduler and yarn-site classification labels.

AWS EMR Configurations

Access the Hadoop YARN resource manager UI on the emr-cluster-capacity-scheduler cluster to review the CapacityScheduler setup. For instructions on how to access the UI on Amazon EMR, refer to View web interfaces hosted on Amazon EMR clusters.

Apache Hadoop YARN UI

SSH into the emr-cluster-capacity-scheduler cluster and review the following files.For instructions on how to SSH into the EMR primary node, refer to Connect to the master node using SSH.

- /etc/hadoop/conf/yarn-site.xml
- /etc/hadoop/conf/capacity-scheduler.xml

All the parameters set using the yarn-site and capacity-scheduler classifiers are reflected in these files. If an admin wants to update CapacityScheduler configs, they can directly update capacity-scheduler.xml and run the following command to apply the changes without interrupting any running jobs and services:

yarn rmadmin -resfreshQueues

Changes to yarn-site.xml require the ResourceManager service to be restarted, which interrupts the running jobs. As a best practice, refrain from manual modifications and use version control for change management.

The CloudFormation template adds a bootstrap action to create test users (user1, user2, user3, user4, user5 and user6) on all the nodes and adds a step script to create HDFS directories for the test users.

Users can SSH into the primary node, sudo as different users and submit Spark jobs to verify the job submission and CapacityScheduler behavior:

[hadoop@ip-xx-x-xx-xxx ~]$ sudo su - user1
[user1@ip-xx-x-xx-xxx ~]$ spark-submit --master yarn --deploy-mode cluster \
--class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar

You can validate the results from the resource manager web UI.

Apache Hadoop YARN Jobs List

Clean up

To avoid incurring future charges, delete the resources you created.

Delete the CloudFormation stack:

aws cloudformation delete-stack --stack-name emr-yarn-capacity-scheduler

Delete the S3 bucket:

aws s3 rb s3://emr-yarn-capacity-scheduler-<AWS_ACCOUNT_ID>-<AWS_REGION> --force

The command deletes the bucket and all files underneath it. The files may not be recoverable after deletion.

Conclusion

In this post, we discussed Apache Hadoop YARN and its various components. We discussed the types of schedulers available in Hadoop YARN. We dived deep in to the specifics of Hadoop YARN CapacityScheduler and the use of Dominant Resource Fairness to efficiently allocate resources to submitted jobs. We also showcased how to implement the discussed concepts using AWS CloudFormation.

We encourage you to use this post as a starting point to implement CapacityScheduler on Amazon EMR (on Amazon EC2) and customize the solution to meet your specific data platform goals.

About the authors

Suvojit Dasgupta is a Sr. Lakehouse Architect at Amazon Web Services. He works with customers to design and build data solutions on AWS.

Bharat Gamini is a Data Architect focused on big data and analytics at Amazon Web Services. He helps customers architect and build highly scalable, robust, and secure cloud-based analytical solutions on AWS.

How to incorporate ACM PCA into your existing Windows Active Directory Certificate Services

2022-08-10 Geoff Sweet

Post Syndicated from Geoff Sweet original https://aws.amazon.com/blogs/security/how-to-incorporate-acm-pca-into-your-existing-windows-active-directory-certificate-services/

Using certificates to authenticate and encrypt data is vital to any enterprise security. For example, companies rely on certificates to provide TLS encryption for web applications so that client data is protected. However, not all certificates need to be issued from a publicly trusted certificate authority (CA). A privately trusted CA can be leveraged to issue certificates to help protect data in transit on resources such as load-balancers and also device authentication for endpoints and IoT devices. Many organizations already have that privately trusted CA running in their Microsoft Active Directory architecture via Active Directory Certificate Services (ADCS).

This post outlines how you can use Microsoft’s Windows 2019 ADCS to sign an AWS Certificate Manager (ACM) Private Certificate Authority (Private CA) instance, extending your existing ADCS system into your AWS environment. This will allow you to issue certificates via ACM for resources like Application Load Balancer that are trusted by your Active Directory members. The ACM PCA documentation talks about how to use an external CA to sign the ACM PCA certificate. However, it leaves the details of the external CA outside of the documentation scope.

Why use ACM PCA?

AWS Certificate Manager Private Certificate Authority (ACM Private CA or ACM PCA) is a private CA service that extends ACM certificate management capabilities to both public and private certificates. ACM PCA provides a highly available private CA service without the upfront investment and ongoing maintenance costs of operating your own private CA. ACM PCA allows developers to be more agile by providing them with APIs to create and deploy private certificates programmatically. You also have the flexibility to create private certificates for applications that require custom certificate lifetimes or resource names.

Why use ACM PCA with Windows Active Directory?

Many enterprises already use Active Directory to manage their IT resources. Whether it is on-premises or built into your AWS accounts, Active Directory’s built-in CA can be extended by ACM PCA. Using your ADCS to sign an ACM PCA means that members of your Active Directory automatically trust certificates issued by that ACM PCA. Keep in mind that these are still private certificates, and they are intended to be used just like certificates from ADCS itself. They will not be trusted by unmanaged devices, because these are not signed by a publicly trusted external CA. Therefore, systems like Mac and Linux may require that you manually deploy the ADCS certificate chain in order to trust certificates issued by your new ACM PCA.

This means it is more efficient for you to rapidly deploy certificates to your endpoint workstations for authentication. Or you can protect internal-only workloads with certificates that are constrained to your internal domain namespace. These tasks can be done conveniently through AWS APIs and the AWS SDK.

Solution overview

In the following sections, we will configure Microsoft ADCS to be able to sign a subordinate CA, deploy and sign ACM PCA, and then test the solution using a private website that is protected by a TLS certificate issued from the ACM PCA.

Configure Microsoft ADCS

Microsoft ADCS is normally deployed as part of your Windows Active Directory architecture. It can be extended to do multiple different types of certificate signing depending on your environment’s needs. Each of these different types of certificates is defined by a template that you must enable and configure. Each template contains configuration information about how Microsoft ADCS will issue the certificate type. You can copy and configure templates differently depending on your environment’s requirements. The specifics of each type of template is outside the scope of this blog post.

To configure ADCS to sign subordinate CAs

On the CA server that will be signing the private CA certificate, open the Certification Authority Microsoft Management Console (MMC).
In the left-side tree view, expand the name of the server.
Open the context (right-click) menu for Certificate Templates and choose Manage.

Figure 1: Navigating to the Manage option for the certificate templates

This opens the Certificate Template Console, which is populated with the list of optional templates.
Scroll down, open the context (right-click) menu for Subordinate Certification Authority, and choose Duplicate Template, as shown in Figure 2. This will create a duplicate of the template that you can alter for your needs, while leaving the original template unaltered for future use. Selecting Duplicate Template immediately opens the configuration for the new template.

Figure 2: Select Duplicate Template to create a copy of the Subordinate Certification Authority template

To configure and use the new template

On the new template configuration page, choose the General tab, and change the template display name to something that uniquely identifies it. The example in this post uses the name Subordinate Certification Authority – Private CA.
Select the check box for Publish certificate in Active Directory, and then choose OK. The new template appears in the list of available templates. Close the Certificate Templates Console.
Return to the Certification Authority MMC. Open the context (right-click) menu for Certificate Templates again, but this time choose New -> Certificate Template to Issue.

Figure 3: Issue the new Certificate Template you created for subordinate Cas
In the dialog box that appears, choose the new template you created in Step 1 of this procedure, and then choose OK.

That’s all that’s needed! Your CA is now ready to issue certificates for subordinate CAs in your public key infrastructure. Open a browser from either the ADCS CA server itself or through a network connection to the ADCS CA server, and use the following URL to access the certificate server’s certificate signing interface.

http://<hostname-of-your-ca-with-domain>/certsrv/certrqxt.asp

Now you can see that in the Certificate Templates list, you can choose the Subordinate Certification Authority template that you created, as shown in Figure 4.

Figure 4: The interface to sign certificates on your CA now shows the new certificate template you created

Deploy and sign the ACM Private CA’s certificate

In this step, you will deploy the ACM PCA, which is the first step to create a subordinate CA to deploy in your AWS account. The process of deploying the ACM PCA is well documented, so this post will not go into depth about the deployment itself. Instead, this procedure focuses on the steps for taking the certificate signing request (CSR) and signing it against the ADCS, and then covers the additional steps to convert the certificates that ADCS produces into the certificate format that ACM PCA expects.

After the ACM PCA is initially deployed, it needs to have a certificate signed to authenticate it. ACM PCA offers two options for signing the new instance’s certificate. You can choose to sign either through another ACM PCA instance, or via an external CA. Since you are using ADCS in this walkthrough, you will use the process of an external CA. The ACM PCA deployment is now at a point where it needs its CSR signed by Microsoft ADCS. You should see that it is ready in the AWS Management Console for ACM PCA.

To deploy and sign the ACM PCA’s certificate

When the ACM PCA is ready, in the ACM PCA console, begin the Install subordinate CA certificate process by choosing External private CA for the CA type.

Figure 5: Options for signing the new instance’s certificate
You will then be provided the certificate signing request (CSR) for the ACM PCA. Copy and paste the CSR content into the ADCS CA signing URL you visited earlier on the CA server. Then choose Next. The next page is where you will paste in the new signed certificate and certchain in a later step.
From the ADCS CA URL, be sure that the new Subordinate Certification Authority template is selected, and then choose Submit. The new certificate will be issued to you. The ADCS issuing page provides two different formats for the certificate, either as Distinguished Encoding Rules (DER) or base64-encoded.
Copy the base64-encoded files for both the certificate and the certchain to your local computer. The certificate is already in Privacy Enhanced Mail (PEM) format, and its contents can be pasted into the ACM PCA certificate input in the console. However, you must convert the certchain into the format required by the ACM PCA by following these steps:
1. To convert the format of the certchain, use the openssl tool from the command line. The process of installing the openssl tool is outside the scope of this blog post. Refer to the OpenSSL site documentation for installation options for your operating system.
2. Use the following command to convert the certchain file from Public Key Cryptographic Standards #7 (PKCS7) to PEM.
  openssl pkcs7 -print_certs -in certnew.p7b -out certchain.pem
3. Using a text editor, open the certchain.pem file and copy the last certificate block from the file, starting with —–BEGIN CERTIFICATE—– and ending with —–END CERTIFICATE—–. You will notice that the file begins with the signed certificate and includes subject= and issuer= statements. ACM PCA only accepts the content that is the certificate chain.
Return to the ACM PCA console page from Step 1, and paste the text the you just copied into the input area provided for the certificate chain. After this step is complete, the private CA is now signed by your corporate PKI.

Test the solution

Now that the ACM PCA is online, one of the things it can do is issue certificates via ACM that are trusted by your corporate Active Directory joined clients. These certificates can be used in services such as Application Load Balancers to provide TLS protected endpoints that are unique to your organization and trusted only by your internal clients.

From a client joined to our test Active Directory, Internet Explorer shows that it trusts the TLS certificate issued by AWS Certificate Manager and used on the Application Load Balancer for a private site.

Figure 6: Internet Explorer showing that it trusts the TLS certificate

For this demo, we created a test web server that is hosting an example webpage. The web server is behind an AWS Application Load Balancer. The TLS certificate attached to the Application Load Balancer is issued from the new ACM PCA.

Conclusion

Organizations that have Microsoft Active Directory deployed can use Active Directory’s Certificate Services to issue certificates for private resources. This blog post shows how you can extend that certificate trust to AWS Certificate Manager Private CA. This provides a way for your developers to issue private certificates automatically, which are trusted by your Active Directory domain-joined clients or clients that have the ADCS certificate chain installed.

For more information on hybrid public key infrastructure (PKI) on AWS, refer to these blog posts:

For more information on certificates for Mac and Linux, refer to the following resources:

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Building AWS Lambda governance and guardrails

2022-08-09 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-aws-lambda-governance-and-guardrails/

When building serverless applications using AWS Lambda, there are a number of considerations regarding security, governance, and compliance. This post highlights how Lambda, as a serverless service, simplifies cloud security and compliance so you can concentrate on your business logic. It covers controls that you can implement for your Lambda workloads to ensure that your applications conform to your organizational requirements.

The Shared Responsibility Model

The AWS Shared Responsibility Model distinguishes between what AWS is responsible for and what customers are responsible for with cloud workloads. AWS is responsible for “Security of the Cloud” where AWS protects the infrastructure that runs all the services offered in the AWS Cloud. Customers are responsible for “Security in the Cloud”, managing and securing their workloads. When building traditional applications, you take on responsibility for many infrastructure services, including operating systems and network configuration.

Traditional application shared responsibility

One major benefit when building serverless applications is shifting more responsibility to AWS so you can concentrate on your business applications. AWS handles managing and patching the underlying servers, operating systems, and networking as part of running the services.

Serverless application shared responsibility

For Lambda, AWS manages the application platform where your code runs, which includes patching and updating the managed language runtimes. This reduces the attack surface while making cloud security simpler. You are responsible for the security of your code and AWS Identity and Access Management (IAM) to the Lambda service and within your function.

Lambda is SOC, HIPAA, PCI, and ISO-compliant. For more information, see Compliance validation for AWS Lambda and the latest Lambda certification and compliance readiness services in scope.

Lambda isolation

Lambda functions run in separate isolated AWS accounts that are dedicated to the Lambda service. Lambda invokes your code in a secure and isolated runtime environment within the Lambda service account. A runtime environment is a collection of resources running in a dedicated hardware-virtualized Micro Virtual Machines (MVM) on a Lambda worker node.

Lambda workers are bare metal EC2 Nitro instances, which are managed and patched by the Lambda service team. They have a maximum lease lifetime of 14 hours to keep the underlying infrastructure secure and fresh. MVMs are created by Firecracker, an open source virtual machine monitor (VMM) that uses Linux’s Kernel-based Virtual Machine (KVM) to create and manage MVMs securely at scale.

MVMs maintain a strong separation between runtime environments at the virtual machine hardware level, which increases security. Runtime environments are never reused across functions, function versions, or AWS accounts.

Isolation model for AWS Lambda workers

Network security

Lambda functions always run inside secure Amazon Virtual Private Cloud (Amazon VPCs) owned by the Lambda service. This gives the Lambda function access to AWS services and the public internet. There is no direct network inbound access to Lambda workers, runtime environments, or Lambda functions. All inbound access to a Lambda function only comes via the Lambda Invoke API, which sends the event object to the function handler.

You can configure a Lambda function to connect to private subnets in a VPC in your account if necessary, which you can control with IAM condition keys . The Lambda function still runs inside the Lambda service VPC but sends all network traffic through your VPC. Function outbound traffic comes from your own network address space.

AWS Lambda service VPC with VPC-to-VPC NAT to customer VPC

To give your VPC-connected function access to the internet, route outbound traffic to a NAT gateway in a public subnet. Connecting a function to a public subnet doesn’t give it internet access or a public IP address, as the function is still running in the Lambda service VPC and then routing network traffic into your VPC.

All internal AWS traffic uses the AWS Global Backbone rather than traversing the internet. You do not need to connect your functions to a VPC to avoid connectivity to AWS services over the internet. VPC connected functions allow you to control and audit outbound network access.

You can use security groups to control outbound traffic for VPC-connected functions and network ACLs to block access to CIDR IP ranges or ports. VPC endpoints allow you to enable private communications with supported AWS services without internet access.

You can use VPC Flow Logs to audit traffic going to and from network interfaces in your VPC.

Runtime environment re-use

Each runtime environment processes a single request at a time. After Lambda finishes processing the request, the runtime environment is ready to process an additional request for the same function version. For more information on how Lambda manages runtime environments, see Understanding AWS Lambda scaling and throughput.

Data can persist in the local temporary filesystem path, in globally scoped variables, and in environment variables across subsequent invocations of the same function version. Ensure that you only handle sensitive information within individual invocations of the function by processing it in the function handler, or using local variables. Do not re-use files in the local temporary filesystem to process unencrypted sensitive data. Do not put sensitive or confidential information into Lambda environment variables, tags, or other freeform fields such as Name fields.

For more Lambda security information, see the Lambda security whitepaper.

Multiple accounts

AWS recommends using multiple accounts to isolate your resources because they provide natural boundaries for security, access, and billing. Use AWS Organizations to manage and govern individual member accounts centrally. You can use AWS Control Tower to automate many of the account build steps and apply managed guardrails to govern your environment. These include preventative guardrails to limit actions and detective guardrails to detect and alert on non-compliance resources for remediation.

Lambda access controls

Lambda permissions define what a Lambda function can do, and who or what can invoke the function. Consider the following areas when applying access controls to your Lambda functions to ensure least privilege:

Execution role

Lambda functions have permission to access other AWS resources using execution roles. This is an AWS principal that the Lambda service assumes which grants permissions using identity policy statements assigned to the role. The Lambda service uses this role to fetch and cache temporary security credentials, which are then available as environment variables during a function’s invocation. It may re-use them across different runtime environments that use the same execution role.

Ensure that each function has its own unique role with the minimum set of permissions..

Identity/user policies

IAM identity policies are attached to IAM users, groups, or roles. These policies allow users or callers to perform operations on Lambda functions. You can restrict who can create functions, or control what functions particular users can manage.

Resource policies

Resource policies define what identities have fine-grained inbound access to managed services. For example, you can restrict which Lambda function versions can add events to a specific Amazon EventBridge event bus. You can use resource-based policies on Lambda resources to control what AWS IAM identities and event sources can invoke a specific version or alias of your function. You also use a resource-based policy to allow an AWS service to invoke your function on your behalf. To see which services support resource-based policies, see “AWS services that work with IAM”.

Attribute-based access control (ABAC)

With attribute-based access control (ABAC), you can use tags to control access to your Lambda functions. With ABAC, you can scale an access control strategy by setting granular permissions with tags without requiring permissions updates for every new user or resource as your organization scales. You can also use tag policies with AWS Organizations to standardize tags across resources.

Permissions boundaries

Permissions boundaries are a way to delegate permission management safely. The boundary places a limit on the maximum permissions that a policy can grant. For example, you can use boundary permissions to limit the scope of the execution role to allow only read access to databases. A builder with permission to manage a function or with write access to the applications code repository cannot escalate the permissions beyond the boundary to allow write access.

Service control policies

When using AWS Organizations, you can use Service control policies (SCPs) to manage permissions in your organization. These provide guardrails for what actions IAM users and roles within the organization root or OUs can do. For more information, see the AWS Organizations documentation, which includes example service control policies.

Code signing

As you are responsible for the code that runs in your Lambda functions, you can ensure that only trusted code runs by using code signing with the AWS Signer service. AWS Signer digitally signs your code packages and Lambda validates the code package before accepting the deployment, which can be part of your automated software deployment process.

Auditing Lambda configuration, permissions and access

You should audit access and permissions regularly to ensure that your workloads are secure. Use the IAM console to view when an IAM role was last used.

IAM last used

IAM access advisor

Use IAM access advisor on the Access Advisor tab in the IAM console to review when was the last time an AWS service was used from a specific IAM user or role. You can use this to remove IAM policies and access from your IAM roles.

IAM access advisor

AWS CloudTrail

AWS CloudTrail helps you monitor, log, and retain account activity to provide a complete event history of actions across your AWS infrastructure. You can monitor Lambda API actions to ensure that only appropriate actions are made against your Lambda functions. These include CreateFunction, DeleteFunction, CreateEventSourceMapping, AddPermission, UpdateEventSourceMapping, UpdateFunctionConfiguration, and UpdateFunctionCode.

AWS CloudTrail

IAM Access Analyzer

You can validate policies using IAM Access Analyzer, which provides over 100 policy checks with security warnings for overly permissive policies. To learn more about policy checks provided by IAM Access Analyzer, see “IAM Access Analyzer policy validation”.

You can also generate IAM policies based on access activity from CloudTrail logs, which contain the permissions that the role used in your specified date range.

IAM Access Analyzer

AWS Config

AWS Config provides you with a record of the configuration history of your AWS resources. AWS Config monitors the resource configuration and includes rules to alert when they fall into a non-compliant state.

For Lambda, you can track and alert on changes to your function configuration, along with the IAM execution role. This allows you to gather Lambda function lifecycle data for potential audit and compliance requirements. For more information, see the Lambda Operators Guide.

AWS Config includes Lambda managed config rules such as lambda-concurrency-check, lambda-dlq-check, lambda-function-public-access-prohibited, lambda-function-settings-check, and lambda-inside-vpc. You can also write your own rules.

There are a number of other AWS services to help with security compliance.

AWS Audit Manager: Collect evidence to help you audit your use of cloud services.
Amazon GuardDuty: Detect unexpected and potentially unauthorized activity in your AWS environment.
Amazon Macie: Evaluates your content to identify business-critical or potentially confidential data.
AWS Trusted Advisor: Identify opportunities to improve stability, save money, or help close security gaps.
AWS Security Hub: Provides security checks and recommendations across your organization.

Conclusion

Lambda makes cloud security simpler by taking on more responsibility using the AWS Shared Responsibility Model. Lambda implements strict workload security at scale to isolate your code and prevent network intrusion to your functions. This post provides guidance on assessing and implementing best practices and tools for Lambda to improve your security, governance, and compliance controls. These include permissions, access controls, multiple accounts, and code security. Learn how to audit your function permissions, configuration, and access to ensure that your applications conform to your organizational requirements.

For more serverless learning resources, visit Serverless Land.

How to prepare your application to scale reliably with Amazon EC2

2022-08-08 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/how-to-prepare-your-application-to-scale-reliably-with-amazon-ec2/

This blog post is written by, Gabriele Postorino, Senior Technical Account Manager, and Giorgio Bonfiglio, Principal Technical Account Manager

In this post, we’ll discuss how you can prepare for planned and unplanned scaling events with

Most of the challenges related to horizontal scaling can be mitigated by optimizing the architectural implementation and applying improvements in operational processes.

In the following sections, we’ll explore this in depth. Recommendations can be applied partially or fully – they come with different complexities, and each one will help you reduce the risk of facing insufficient capacity errors or scaling delays, as well as deliver enhancements in areas such as fault tolerance, elasticity, and cost optimization.

Architectural best practices

Instance capacity can be regarded as being divided into “pools” defined by AZ (such as us-east-1a), instance type (for example m5.xlarge), and tenancy. Combining the following two guidelines will widen the capacity pools available to scale out your fleets of instances. This will help you reduce costs, transparently recover from failures, and increase your application scalability.

Instance flexibility

Whether you’re migrating a new workload to the cloud, or tuning an existing workload, you’ll likely evaluate which compute configuration options are available and determine the right configuration for your application.

If your workload is already running on EC2 instances, you might already be aware of the instance type that it runs best on. Let’s say that your application is RAM intensive, and you found that r6i.4xlarge instances are best suited for it.

However, relying on a single instance type might result in artificially limiting your ability to scale compute resources for your workload when needed. It’s always a good idea to explore how your workload behaves when running on other instance types: you might find that your application can serve double the number of requests served by one r6i.4xlarge instance when using one r6i.8xlarge instance or four r6i.2xlarge instances.

Furthermore, there’s no reason to limit your options to a single instance family, generation, or processor type. For example, m6a.8xlarge instances offer the same amount of RAM of r6i.4xlarge and might be used to run your application if needed.

Amazon EC2 Auto Scaling helps you make sure that you have the right number of EC2 instances available to handle the load for your application.

Auto Scaling groups can be configured to respond to scaling events by selecting the type of instance to launch among a list of instance types. You can statically populate the list in advance, as in the following screenshot,

or dynamically define it by a set of instance attributes as shown in the subsequent screenshot:

For example, by setting the requirements to a minimum of 8 vCPUs, 64GiB of Memory, and a RAM/CPU ratio of 8 (just like r6i.2xlarge instances), up to 73 instance types can be included in the list of suitable instances. They will be selected for launch starting from the lowest priced instance types. If the request can’t be fulfilled in full by the lowest priced instance type, then additional instances will be launched from the second lowest instance type pool, and so on.

Instance distribution

Each AWS Region consists of multiple, isolated Availability Zones (AZ), interconnected with high-bandwidth, low-latency networking. Spreading a workload across AZs is a well-established resiliency best practice. It will make sure that your end users aren’t impacted in the case of a single AZ, data center, or rack failures, as each AZ has its own distinct instance capacity pools that you can leverage to scale your application fleets.

EC2 Auto Scaling can manage the optimal distribution of EC2 instances in a group across all AZs in a Region automatically, as well as deal with temporary failures transparently. To do so, it must be configured to use at least one subnet in each AZ. Then, it will attempt to distribute instances evenly across AZs and automatically cycle through AZs in case of temporary launch failures.

Operational best practices

The way that your workload is operated also impacts your ability to scale it when needed. Failure management and appropriate scaling techniques will help you maximize the availability of your environment.

Failure management

On-Demand capacity isn’t guaranteed to always be available. There might be short windows of time when AWS doesn’t have enough available On-Demand capacity to fulfill your specific request: as the availability of On-Demand capacity changes frequently, it’s important that your launch processes implement retry mechanisms.

Retries and fallbacks are managed automatically by EC2 Auto Scaling. But if you have a custom workflow to launch instances, it should be able to work with server error codes, in particular InsufficientInstanceCapacity or InternalError, by retrying the launch request. For a complete list of error codes for the EC2 API, please refer to our documentation.

Another option provided by EC2 is represented by EC2 Fleets. EC2 Fleet is a feature that helps to implement instance flexibility best practices. Instead of calling RunInstances with one instance type and retrying, EC2 Fleet in Instant mode considers all provided instance types, using a list of instances or Attribute Based Instance selection, and provisions capacity from the pools configured by the EC2 Fleet call where capacity was available.

Scaling technique

Launching EC2 instances as soon as you have an initial indication of increased load, in smaller batches and over a longer time span, helps increase your application performance and reliability while reducing costs and minimizing disruptions.

In the graph above, two different scaling techniques are depicted. Scaling approach #1 adds a large number of instances less frequently, while approach #2 launches a smaller batch of instances more frequently. Adopting the first approach risks your application not being able to sustain the increase in load in a timely manner. This will potentially cause an impact on end users and leave the operations team with little time to resolve.

Effective capacity planning

On-Demand Instances are best suited for applications with irregular, uninterruptible workloads. Interruptible workloads can avail of Spot Instances that pick from spare EC2 capacity. They cost less than On-Demand Instances but can be interrupted with a two-minute warning.

If your workload has a stable baseline utilization that hardly changes over time, then you can reserve capacity for your baseline usage of EC2 instances using open On-Demand Capacity Reservations and cover them with Savings Plans to get discounted rates with a one-year or three-year commitment, with the latter offering the bigger discounts.

Open On-Demand Capacity Reservations and Savings Plans aren’t tightly related to the EC2 instances that they cover at a certain point in time. Rather they shift to other usage, matching all of the parameters of the respective On-Demand Capacity Reservation or Savings Plan (e.g., Instance Type, Operating System, AZ, tenancy) in your account or across accounts for which you have sharing enabled. This lets you be dynamic even with your stable baseline. For example, during a rolling update or a blue/green deployment, On-Demand Capacity Reservations and Savings Plans will automatically cover any instances that match the respective criteria.

ODCR Fleets

There are times when you can’t apply all of the recommended mitigating actions in anticipation of a planned event. In those cases, you might want to use On-Demand Capacity Reservation Fleets to reserve capacity in advance for additional peace of mind. Capacity reservation fleets let you define capacity requests across multiple instance types, up to a target capacity that you specify. They can be created and managed using the AWS Command Line Interface (AWS CLI) and the AWS APIs.

Key concepts of Capacity Reservation Fleets are the total target capacity and the instance type weight. The instance type weight expresses the number of capacity units that each instance of a specific instance type counts toward the total target capacity.

Let’s say your workload is memory-bound, you expect to need 1,6TiB of RAM, and you want to use r6i instances. You can create a Capacity Reservation Fleet for r6i instances defining weights for each instance type in the family based on the relative amount of memory that they have in an instance type specification json file.

instanceTypeSpecification.json:
[
    {             
        "InstanceType": "r6i.2xlarge",                       
        "InstancePlatform":"Linux/UNIX",            
        "Weight": 1,
        "AvailabilityZone":"eu-west-1a",        
        "EbsOptimized": true,           
        "Priority" : 3
    },
    { 
        "InstanceType": "r6i.4xlarge",                        
        "InstancePlatform":"Linux/UNIX",            
        "Weight": 2,
        "AvailabilityZone":"eu-west-1a",        
        "EbsOptimized": true,            
        "Priority" : 2
    },
    {             
        "InstanceType": "r6i.8xlarge",                        
        "InstancePlatform":"Linux/UNIX",           
        "Weight": 4,
        "AvailabilityZone":"eu-west-1a",       
        "EbsOptimized": true,            
        "Priority" : 1
    }
]

Then, you want to use this specification to create a Capacity Reservation Fleet that takes care of the underlying Capacity Reservations needed to fulfill your request:

$ aws ec2 create-capacity-reservation-fleet \
--total-target-capacity 25 \
--allocation-strategy prioritized \
--instance-match-criteria open \
--tenancy default \
--end-date 2022-05-31T00:00:00.000Z \
--instance-type-specifications file://instanceTypeSpecification.json

In this example, I set the target capacity to 25, which is the number of r6i.2xlarge needed to get 1,6TiB of total memory across the fleet. As you might have noticed, Capacity Reservation Fleets can be created with an end date. They will automatically cancel themselves and the Capacity Reservations that they created when the end date is reached, so that you don’t need to.

AWS Infrastructure Event Management

Last but not least, our teams can offer the AWS Infrastructure Event Management (IEM) program. Part of select AWS Support offerings, the IEM program has been designed to help you with planning and executing events that impact your infrastructure on AWS. By requesting an IEM engagement, you will be supported by AWS experts during all of the phases of your event.Starting from your business outcomes and success criteria, we’ll assess your infrastructure readiness for the event, evaluate risks, and recommend specific actions to mitigate them. The AWS experts will focus on your application architecture as a whole and dive deep into each of its components with your respective teams. They might also engage with other AWS teams to notify them of the upcoming event, and get specific prescriptive guidance when needed. During the event, AWS experts will have the context needed to help you resolve any issue that might arise as quickly as possible. The program is included in the Enterprise and Enterprise On-Ramp Support plans and is available to Business Support customers for an additional fee.

Conclusion

Whether you’re planning for a big future event, or you want to make sure that your application can withstand unexpected increases in traffic, it’s important that you consider what we discussed in this article:

Use as many instance types as you can, don’t limit your workload to use a single instance type when it could also use a lot more types
Distribute your EC2 instances across all AZs in the Region
Expect failures: manage retries and fallback options
Make use of EC2 Autoscaling and EC2 Fleet whenever possible
Avoid scaling spikes: start scaling earlier, in smaller chunks, and more frequently
Reserve capacity only when you really need to

For further study, we recommend the Well-Architected Framework Reliability and Operational Excellence pillars as starting points. Moreover, if you have an event coming up, talk to your Technical Account Manager, your Account Team, or contact us to find out how we can help!

Best practices to optimize cost and performance for AWS Glue streaming ETL jobs

2022-08-03 Gonzalo Herreros

Post Syndicated from Gonzalo Herreros original https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-cost-and-performance-for-aws-glue-streaming-etl-jobs/

AWS Glue streaming extract, transform, and load (ETL) jobs allow you to process and enrich vast amounts of incoming data from systems such as Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), or any other Apache Kafka cluster. It uses the Spark Structured Streaming framework to perform data processing in near-real time.

This post covers use cases where data needs to be efficiently processed, delivered, and possibly actioned in a limited amount of time. This can cover a wide range of cases, such as log processing and alarming, continuous data ingestion and enrichment, data validation, internet of things, machine learning (ML), and more.

We discuss the following topics:

Development tools that help you code faster using our newly launched AWS Glue Studio notebooks
How to monitor and tune your streaming jobs
Best practices for sizing and scaling your AWS Glue cluster, using our newly launched features like auto scaling and the small worker type G 0.25X

Development tools

AWS Glue Studio notebooks can speed up the development of your streaming job by allowing data engineers to work using an interactive notebook and test code changes to get quick feedback—from business logic coding to testing configuration changes—as part of tuning.

Before you run any code in the notebook (which would start the session), you need to set some important configurations.

The magic %streaming creates the session cluster using the same runtime as AWS Glue streaming jobs. This way, you interactively develop and test your code using the same runtime that you use later in the production job.

Additionally, configure Spark UI logs, which will be very useful for monitoring and tuning the job.

See the following configuration:

%streaming
%%configure
{
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://your_bucket/sparkui/"
}

For additional configuration options such as version or number of workers, refer to Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks.

To visualize the Spark UI logs, you need a Spark history server. If you don’t have one already, refer to Launching the Spark History Server for deployment instructions.

Structured Streaming is based on streaming DataFrames, which represent micro-batches of messages.
The following code is an example of creating a stream DataFrame using Amazon Kinesis as the source:

kinesis_options = {
  "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
  "startingPosition": "TRIM_HORIZON",
  "inferSchema": "true",
  "classification": "json"
}
kinesisDF = glueContext.create_data_frame_from_options(
   connection_type="kinesis",
   connection_options=kinesis_options
)

The AWS Glue API helps you create the DataFrame by doing schema detection and auto decompression, depending on the format. You can also build it yourself using the Spark API directly:

kinesisDF = spark.readStream.format("kinesis").options(**kinesis_options).load()

After your run any code cell, it triggers the startup of the session, and the application soon appears in the history server as an incomplete app (at the bottom of the page there is a link to display incomplete apps) named GlueReplApp, because it’s a session cluster. For a regular job, it’s listed with the job name given when it was created.

History server home page

From the notebook, you can take a sample of the streaming data. This can help development and give an indication of the type and size of the streaming messages, which might impact performance.

Monitor the cluster with Structured Streaming

The best way to monitor and tune your AWS Glue streaming job is using the Spark UI; it gives you the overall streaming job trends on the Structured Streaming tab and the details of each individual micro-batch processing job.

Overall view of the streaming job

On the Structured Streaming tab, you can see a summary of the streams running in the cluster, as in the following example.

Normally there is just one streaming query, representing a streaming ETL. If you start multiple in parallel, it’s good if you give it a recognizable name, calling queryName() if you use the writeStream API directly on the DataFrame.

After a good number of batches are complete (such as 10), enough for the averages to stabilize, you can use Avg Input/sec column to monitor how many events or messages the job is processing. This can be confusing because the column to the right, Avg Process/sec, is similar but often has a higher number. The difference is that this process time tells us how efficient our code is, whereas the average input tells us how many messages the cluster is reading and processing.

The important thing to note is that if the two values are similar, it means the job is working at maximum capacity. It’s making the best use of the hardware but it likely won’t be able to cope with an increase in volume without causing delays.

In the last column is the latest batch number. Because they’re numbered incrementally from zero, this tells us how many batches the query has processed so far.

When you choose the link in the “Run ID” column of a streaming query, you can review the details with graphs and histograms, as in the following example.

The first two rows correspond to the data that is used to calculate the averages shown on the summary page.

For Input Rate, each data point is calculated by dividing the number of events read for the batch by the time passed between the current batch start and the previous batch start. In a healthy system that is able to keep up, this is equal to the configured trigger interval (in the GlueContext.forEachBatch() API, this is set using the option windowSize).

Because it uses the current batch rows with the previous batch latency, this graph is often unstable in the first batches until the Batch Duration (the last line graph) stabilizes.

In this example, when it stabilizes, it gets completely flat. This means that either the influx of messages is constant or the job is hitting the limit per batch set (we discuss how to do this later in the post).

Be careful if you set a limit per batch that is constantly hit, you could be silently building a backlog, but everything could look good in the job metrics. To monitor this, have a metric of latency measuring the difference between the message timestamp when it gets created and the time it’s processed.

Process Rate is calculated by dividing the number of messages in a batch by the time it took to process that batch. For instance, if the batch contains 1,000 messages, and the trigger interval is 10 seconds but the batch only needed 5 seconds to process it, the process rate would be 1000/5 = 200 msg/sec. while the input rate for that batch (assuming the previous batch also ran within the interval) is 1000/10 = 100 msg/sec.

This metric is useful to measure how efficient our code processing the batch is, and therefore it can get higher than the input rate (this doesn’t mean it’s processing more messages, just using less time). As mentioned earlier, if both metrics get close, it means the batch duration is close to the interval and therefore additional traffic is likely to start causing batch trigger delays (because the previous batch is still running) and increase latency.

Later in this post, we show how auto scaling can help prevent this situation.

Input Rows shows the number of messages read for each batch, like input rate, but using volume instead of rate.

It’s important to note that if the batch processes the data multiple times (for example, writing to multiple destinations), the messages are counted multiple times. If the rates are greater than the expected, this could be the reason. In general, to avoid reading messages multiple times, you should cache the batch while processing it, which is the default when you use the GlueContext.forEachBatch() API.

The last two rows tell us how long it takes to process each batch and how is that time spent. It’s normal to see the first batches take much longer until the system warms up and stabilizes.
The important thing to look for is that the durations are roughly stable and well under the configured trigger interval. If that’s not the case, the next batch gets delayed and could start a compounding delay by building a backlog or increasing batch size (if the limit allows taking the extra messages pending).

In Operation Duration, the majority of time should be spent on addBatch (the mustard color), which is the actual work. The rest are fixed overhead, therefore the smaller the batch process, the more percentage of time that will take. This represents the trade-off between small batches with lower latency or bigger batches but more computing efficient.

Also, it’s normal for the first batch to spend significant time in the latestOffset (the brown bar), locating the point at which it needs to start processing when there is no checkpoint.

The following query statistics show another example.

In this case, the input has some variation (meaning it’s not hitting the batch limit). Also, the process rate is roughly the same as the input rate. This tells us the system is at max capacity and struggling to keep up. By comparing the input rows and input rate, we can guess that the interval configured is just 3 seconds and the batch duration is barely able to meet that latency.

Finally, in Operation Duration, you can observe that because the batches are so frequent, a significant amount of time (proportionally speaking) is spent saving the checkpoint (the dark green bar).

With this information, we can probably improve the stability of the job by increasing the trigger interval to 5 seconds or more. This way, it checkpoints less often and has more time to process data, which might be enough to get batch duration consistently under the interval. The trade-off is that the latency between when a message is published and when it’s processed is longer.

Monitor individual batch processing

On the Jobs tab, you can see how long each batch is taking and dig into the different steps the processing involves to understand how the time is spent. You can also check if there are tasks that succeed after retry. If this happens continuously, it can silently hurt performance.

For instance, the following screenshot shows the batches on the Jobs tab of the Spark UI of our streaming job.

Each batch is considered a job by Spark (don’t confuse the job ID with the batch number; they only match if there is no other action). The job group is the streaming query ID (this is important only when running multiple queries).

The streaming job in this example has a single stage with 100 partitions. Both batches processed them successfully, so the stage is marked as succeeded and all the tasks completed (100/100 in the progress bar).

However, there is a difference in the first batch: there were 20 task failures. You know all the failed tasks succeeded in the retries, otherwise the stage would have been marked as failed. For the stage to fail, the same task would have to fail four times (or as configured by spark.task.maxFailures).

If the stage fails, the batch fails as well and possibly the whole job; if the job was started by using GlueContext.forEachBatch(), it has a number of retries as per the batchMaxRetries parameter (three by default).

These failures are important because they have two effects:

They can silently cause delays in the batch processing, depending on how long it took to fail and retry.
They can cause records to be sent multiple times if the failure is in the last stage of the batch, depending on the type of output. If the output is files, in general it won’t cause duplicates. However, if the destination is Amazon DynamoDB, JDBC, Amazon OpenSearch Service, or another output that uses batching, it’s possible that some part of the output has already been sent. If you can’t tolerate any duplicates, the destination system should handle this (for example, being idempotent).

Choosing the description link takes you to the Stages tab for that job. Here you can dig into the failure: What is the exception? Is it always in the same executor? Does it succeed on the first retry or took multiple?

Ideally, you want to identify these failures and solve them. For example, maybe the destination system is throttling us because doesn’t have enough provisioned capacity, or a larger timeout is needed. Otherwise, you should at least monitor it and decide if it is systemic or sporadic.

Sizing and scaling

Defining how to split the data is a key element in any distributed system to run and scale efficiently. The design decisions on the messaging system will have a strong influence on how the streaming job will perform and scale, and thereby affect the job parallelism.

In the case of AWS Glue Streaming, this division of work is based on Apache Spark partitions, which define how to split the work so it can be processed in parallel. Each time the job reads a batch from the source, it divides the incoming data into Spark partitions.

For Apache Kafka, each topic partition becomes a Spark partition; similarly, for Kinesis, each stream shard becomes a Spark partition. To simplify, I’ll refer to this parallelism level as number of partitions, meaning Spark partitions that will be determined by the input Kafka partitions or Kinesis shards on a one-to-one basis.

The goal is to have enough parallelism and capacity to process each batch of data in less time than the configured batch interval and therefore be able to keep up. For instance, with a batch interval of 60 seconds, the job lets 60 seconds of data build up and then processes that data. If that work takes more than 60 seconds, the next batch waits until the previous batch is complete before starting a new batch with the data that has built up since the previous batch started.

It’s a good practice to limit the amount of data to process in a single batch, instead of just taking everything that has been added since the last one. This helps make the job more stable and predictable during peak times. It allows you to test that the job can handle volume of data without issues (for example, memory or throttling).

To do so, specify a limit when defining the source stream DataFrame:

For Kinesis, specify the limit using kinesis.executor.maxFetchRecordsPerShard, and revise this number if the number of shards changes substantially. You might need to increase kinesis.executor.maxFetchTimeInMs as well, in order to allow more time to read the batch and make sure it’s not truncated.
For Kafka, set maxOffsetsPerTrigger, which divides that allowance equally between the number of partitions.

The following is an example of setting this config for Kafka (for Kinesis, it’s equivalent but using Kinesis properties):

kafka_properties= {
  "kafka.bootstrap.servers": "bootstrapserver1:9092",
  "subscribe": "mytopic",
  "startingOffsets": "latest",
  "maxOffsetsPerTrigger": "5000000"
}
# Pass the properties as options when creating the DataFrame
spark.spark.readStream.format("kafka").options(**kafka_properties).load()

Initial benchmark

If the events can be processed individually (no interdependency such as grouping), you can get a rough estimation of how many messages a single Spark core can handle by running with a single partition source (one Kafka partition or one Kinesis shard stream) with data preloaded into it and run batches with a limit and the minimum interval (1 second). This simulates a stress test with no downtime between batches.

For these repeated tests, clear the checkpoint directory, use a different one (for example, make it dynamic using the timestamp in the path), or just disable the checkpointing (if using the Spark API directly), so you can reuse the same data.
Leave a few batches to run (at least 10) to give time for the system and the metrics to stabilize.

Start with a small limit (using the limit configuration properties explained in the previous section) and do multiple reruns, increasing the value. Record the batch duration for that limit and the throughput input rate (because it’s a stress test, the process rate should be similar).

In general, larger batches tend to be more efficient up to a point. This is because the fixed overhead taken for each to checkpoint, plan, and coordinate the nodes is more significant if the batches are smaller and therefore more frequent.

Then pick your reference initial settings based on the requirements:

If a goal SLA is required, use the largest batch size whose batch duration is less than half the latency SLA. This is because in the worst case, a message that is stored just after a batch is triggered has to wait at least the interval and then the processing time (which should be less than the interval). When the system is keeping up, the latency in this worst case would be close to twice the interval, so aim for the batch duration to be less than half the target latency.
In the case where the throughput is the priority over latency, just pick the batch size that provides a higher average process rate and define an interval that allows some buffer over the observed batch duration.

Now you have an idea of the number of messages per core our ETL can handle and the latency. These numbers are idealistic because the system won’t scale perfectly linearly when you add more partitions and nodes. You can use the messages per core obtained to divide the total number of messages per second to process and get the minimum number of Spark partitions needed (each core handles one partition in parallel).

With this number of estimated Spark cores, calculate the number of nodes needed depending on the type and version, as summarized in the following table.

AWS Glue Version	Worker Type	vCores	Spark Cores per Worker
2	G 1X	4	8
2	G 2X	8	16
3	G 0.25X	2	2
3	G 1X	4	4
3	G 2X	8	8

Using the newer version 3 is preferable because it includes more optimizations and features like auto scaling (which we discuss later). Regarding size, unless the job has some operation that is heavy on memory, it’s preferable to use the smaller instances so there aren’t so many cores competing for memory, disk, and network shared resources.

Spark cores are equivalent to threads; therefore, you can have more (or less) than the actual cores available in the instance. This doesn’t mean that having more Spark cores is going to necessarily be faster if they’re not backed by physical cores, it just means you have more parallelism competing for the same CPU.

Sizing the cluster when you control the input message system

This is the ideal case because you can optimize the performance and the efficiency as needed.

With the benchmark information you just gathered, you can define your initial AWS Glue cluster size and configure Kafka or Kinesis with the number of partitions or topics estimated, plus some buffer. Test this baseline setup and adjust as needed until the job can comfortably meet the total volume and required latency.

For instance, if we have determined that we need 32 cores to be well within the latency requirement for the volume of data to process, then we can create an AWS Glue 3.0 cluster with 9 G.1X nodes (a driver and 8 workers with 4 cores = 32) which reads from a Kinesis data stream with 32 shards.

Imagine that the volume of data in that stream doubles and we want to keep the latency requirements. To do so, we double the number of workers (16 + 1 driver = 17) and the number of shards on the stream (now 64). Remember this is just a reference and needs to be validated; in practice you might need more or less nodes depending on the cluster size, if the destination system can keep up, complexity of transformations, or other parameters.

Sizing the cluster when you don’t control the message system configuration

In this case, your options for tuning are much more limited.

Check if a cluster with the same number of Spark cores as existing partitions (determined by the message system) is able to keep up with the expected volume of data and latency, plus some allowance for peak times.

If that’s not the case, adding more nodes alone won’t help. You need to repartition the incoming data inside AWS Glue. This operation adds an overhead to redistribute the data internally, but it’s the only way the job can scale out in this scenario.

Let’s illustrate with an example. Imagine we have a Kinesis data stream with one shard that we don’t control, and there isn’t enough volume to justify asking the owner to increase the shards. In the cluster, significant computing for each message is needed; for each message, it runs heuristics and other ML techniques to take action depending on the calculations. After running some benchmarks, the calculations can be done promptly for the expected volume of messages using 8 cores working in parallel. By default, because there is only one shard, only one core will process all the messages sequentially.

To solve this scenario, we can provision an AWS Glue 3.0 cluster with 3 G 1X nodes to have 8 worker cores available. In the code repartition, the batch distributes the messages randomly (as evenly as possible) between them:

def batch_function(data_frame, batch_id):
    # Repartition so the udf is called in parallel for each partition
    data_frame.repartition(8).foreach(process_event_udf)

glueContext.forEachBatch(frame=streaming_df, batch_function=batch_function)

If the messaging system resizes the number of partitions or shards, the job picks up this change on the next batch. You might need to adjust the cluster capacity accordingly with the new data volume.

The streaming job is able to process more partitions than Spark cores are available, but might cause inefficiencies because the additional partitions will be queued and won’t start being processed until others finish. This might result in many nodes being idle while the remaining partitions finish and the next batch can be triggered.

When the messages have processing interdependencies

If the messages to be processed depend on other messages (either in the same or previous batches), that’s likely to be a limiting factor on the scalability. In that case, it might help to analyze a batch (job in Spark UI) to see where the time is spent and if there are imbalances by checking the task duration percentiles on the Stages tab (you can also reach this page by choosing a stage on the Jobs tab).

Auto scaling

Up to now, you have seen sizing methods to handle a stable stream of data with the occasional peak.
However, for variable incoming volumes of data, this isn’t cost-effective because you need to size for the worst-case scenario or accept higher latency at peak times.

This is where AWS Glue Streaming 3.0 auto scaling comes in. You can enable it for the job and define the maximum number of workers you want to allow (for example, using the number you have determined needed for the peak times).

The runtime monitors the trend of time spent on batch processing and compares it with the configured interval. Based on that, it makes a decision to increase or decrease the number of workers as needed, being more aggressive as the batch times get near or go over the allowed interval time.

The following screenshot is an example of a streaming job with auto scaling enabled.

Splitting workloads

You have seen how to scale a single job by adding nodes and partitioning the data as needed, which is enough on most cases. As the cluster grows, there is still a single driver and the nodes have to wait for the others to complete the batch before they can take additional work. If it reaches a point that increasing the cluster size is no longer effective, you might want to consider splitting the workload between separate jobs.

In the case of Kinesis, you need to divide the data into multiple streams, but for Apache Kafka, you can divide a topic into multiple jobs by assigning partitions to each one. To do so, instead of the usual subscribe or subscribePattern where the topics are listed, use the property assign to assign using JSON a subset of the topic partitions that the job will handle (for example, {"topic1": [0,1,2]}). At the time of this writing, it’s not possible to specify a range, so you need to list all the partitions, for instance building that list dynamically in the code.

Sizing down

For low volumes of traffic, AWS Glue Streaming has a special type of small node: G 0.25X, which provides two cores and 4 GB RAM for a quarter of the cost of a DPU, so it’s very cost-effective. However, even with that frugal capacity, if you have many small streams, having a small cluster for each one is still not practical.

For such situations, there are currently a few options:

Configure the stream DataFrame to feed from multiple Kafka topics or Kinesis streams. Then in the DataFrame, use the columns topic and streamName, for Kafka and Kinesis sources respectively, to determine how to handle the data (for example, different transformations or destinations). Make sure the DataFrame is cached, so you don’t read the data multiple times.
If you have a mix of Kafka and Kinesis sources, you can define a DataFrame for each, join them, and process as needed using the columns mentioned in the previous point.
The preceding two cases require all the sources to have the same batch interval and links their processing (for example, a busier stream can delay a slower one). To have independent stream processing inside the same cluster, you can trigger the processing of separate stream’s DataFrames using separate threads. Each stream is monitored separately in the Spark UI, but you’re responsible for starting and managing those threads and handle errors.

Settings

In this post, we showed some config settings that impact performance. The following table summarizes the ones we discussed and other important config properties to use when creating the input stream DataFrame.

Property	Applies to	Remarks
`maxOffsetsPerTrigger`	Kafka	Limit of messages per batch. Divides the limit evenly among partitions.
`kinesis.executor.maxFetchRecordsPerShard`	Kinesis	Limit per each shard, therefore should be revised if the number of shards changes.
`kinesis.executor.maxFetchTimeInMs`	Kinesis	When increasing the batch size (either by increasing the batch interval or the previous property), the executor might need more time, allotted by this property.
`startingOffsets`	Kafka	Normally you want to read all the data available and therefore use `earliest`. However, if there is a big backlog, the system might take a long time to catch up and instead use `latest` to skip the history.
`startingposition`	Kinesis	Similar to startingOffsets, in this case the values to use are `TRIM_HORIZON` to backload and `LATEST` to start processing from now on.
`includeHeaders`	Kafka	Enable this flag if you need to merge and split multiple topics in the same job (see the previous section for details).
`kinesis.executor.maxconnections`	Kinesis	When writing to Kinesis, by default it uses a single connection. Increasing this might improve performance.
`kinesis.client.avoidEmptyBatches`	Kinesis	It’s best to set it to true to avoid wasting resources (for example, generating empty files) when there is no data (like the Kafka connector does). `GlueContext.forEachBatch` prevents empty batches by default.

Further optimizations

In general, it’s worth doing some compression on the messages to save on transfer time (at the expense of some CPU, depending on the compression type used).

If the producer compresses the messages individually, AWS Glue can detect it and decompress automatically in most cases, depending on the format and type. For more information, refer to Adding Streaming ETL Jobs in AWS Glue.

If using Kafka, you have the option to compress the topic. This way, the compression is more effective because it’s done in batches, end-to-end, and it’s transparent to the producer and consumer.

By default, the GlueContext.forEachBatch function caches the incoming data. This is helpful if the data needs to be sent to multiple sinks (for example, as Amazon S3 files and also to update a DynamoDB table) because otherwise the job would read the data multiple times from the source. But it can be detrimental to performance if the volume of data is big and there is only one output.

To disable this option, set persistDataFrame as false:

glueContext.forEachBatch(
    frame=myStreamDataFrame,
    batch_function=processBatch,
    options={
        "windowSize": "30 seconds",
        "checkpointLocation": myCheckpointPath,
        "persistDataFrame":  "false"
    }
)

In streaming jobs, it’s common to have to join streaming data with another DataFrame to do enrichment (for example, lookups). In that case, you want to avoid any shuffle if possible, because it splits stages and causes data to be moved between nodes.

When the DataFrame you’re joining to is relatively small to fit in memory, consider using a broadcast join. However, bear in mind it will be distributed to the nodes on every batch, so it might not be worth it if the batches are too small.

If you need to shuffle, consider enabling the Kryo serializer (if using custom serializable classes you need to register them first to use it).

As in any AWS Glue jobs, avoid using custom udf() if you can do the same with the provided API like Spark SQL. User-defined functions (UDFs) prevent the runtime engine from performing many optimizations (the UDF code is a black box for the engine) and in the case of Python, it forces the movement of data between processes.

Avoid generating too many small files (especially columnar like Parquet or ORC, which have overhead per file). To do so, it might be a good idea to coalesce the micro-batch DataFrame before writing the output. If you’re writing partitioned data to Amazon S3, repartition based on columns can significantly reduce the number of output files created.

Conclusion

In this post, you saw how to approach sizing and tuning an AWS Glue streaming job in different scenarios, including planning considerations, configuration, monitoring, tips, and pitfalls.

You can now use these techniques to monitor and improve your existing streaming jobs or use them when designing and building new ones.

About the author

Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.

Leverage L2 constructs to reduce the complexity of your AWS CDK application

2022-07-25 David Boldt

Post Syndicated from David Boldt original https://aws.amazon.com/blogs/devops/leverage-l2-constructs-to-reduce-the-complexity-of-your-aws-cdk-application/

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to define your cloud application resources using familiar programming languages. AWS CDK uses the familiarity and expressive power of programming languages for modeling your applications. Constructs are the basic building blocks of AWS CDK apps. A construct represents a “cloud component” and encapsulates everything that AWS CloudFormation needs to create the component. Furthermore, AWS Construct Library lets you ease the process of building your application using predefined templates and logic. Three levels of constructs exist:

L1 – These are low-level constructs called Cfn (short for CloudFormation) resources. They’re periodically generated from the AWS CloudFormation Resource Specification. The name pattern is CfnXyz, where Xyz is name of the resource. When using these constructs, you must configure all of the resource properties. This requires a full understanding of the underlying CloudFormation resource model and its corresponding attributes.
L2 – These represent AWS resources with a higher-level, intent-based API. They provide additional functionality with defaults, boilerplate, and glue logic that you’d be writing yourself with L1 constructs. AWS constructs offer convenient defaults and reduce the need to know all of the details about the AWS resources that they represent. This is done while providing convenience methods that make it simpler to work with the resources and as a result creating your application.
L3 – These constructs are called patterns. They’re designed to complete common tasks in AWS, often involving multiple types of resources.

In this post, I show a sample architecture and how the complexity of an AWS CDK application is reduced by using L2 constructs.

Overview of the sample architecture

This solution uses Amazon API Gateway, AWS Lambda, and Amazon DynamoDB. I implement a simple serverless web application. The application receives a POST request from a user via API Gateway and forwards it to a Lambda function using proxy integration. The Lambda function writes the request body to a DynamoDB table.

The sample code can be found on GitHub.

Walkthrough

You can follow the instructions in the README file of the GitHub repository to deploy the stack. In the following walkthrough, I explain each logical unit and the differences when implementing it using L1 and L2 constructs. Before each code sample, I’ll show the path in the GitHub repository where you can find its source.

Create the DynamoDB table

First, I create a DynamoDB table to store the request content.

L1 construct

With L1 constructs, I must define each attribute of a table separately. For the DynamoDB table, these are keySchema, attributeDefinitions, and provisionedThroughput. They all require detailed CloudFormation knowledge, for example, how a keyType is defined.

lib/level1/database/infrastructure.ts

this.cfnDynamoDbTable = new dynamodb.CfnTable(
   this, 
   "CfnDynamoDbTable", 
   {
      keySchema: [
         {
            attributeName: props.attributeName,
            keyType: "HASH",
         },
      ],
      attributeDefinitions: [
         {
            attributeName: props.attributeName,
            attributeType: "S",
         },
      ],
      provisionedThroughput: {
         readCapacityUnits: 5,
         writeCapacityUnits: 5,
      },
   },
);

L2 construct

The corresponding L2 construct lets me use the default values for readCapacity (5) and writeCapacity (5). To further reduce the complexity, I define the attributes and the partition key simultaneously. In addition, I utilize the dynamodb.AttributeType.STRING enum.

lib/level2/database/infrastructure.ts

this.dynamoDbTable = new dynamodb.Table(
   this, 
   "DynamoDbTable", 
   {
      partitionKey: {
         name: props.attributeName,
         type: dynamodb.AttributeType.STRING,
      },
   },
);

Create the Lambda function

Next, I create a Lambda function which receives the request and stores the content in the DynamoDB table. The runtime code uses Node.js.

L1 construct

When creating a Lambda function using L1 construct, I must specify all of the properties at creation time – the business logic code location, runtime, and the function handler. This includes the role for the Lambda function to assume. As a result, I must provide the Attribute Resource Name (ARN) of the role. In the “Granting permissions” sections later in this post, I show how to create this role.

lib/level1/api/infrastructure.ts

const cfnLambdaFunction = new lambda.CfnFunction(
   this, 
   "CfnLambdaFunction", 
   {
      code: {
         zipFile: fs.readFileSync(
            path.resolve(__dirname, "runtime/index.js"),
            "utf8"
         ),
      },
      role: this.cfnIamLambdaRole.attrArn,
      runtime: "nodejs16.x",
      handler: "index.handler",
      environment: {
         variables: {
            TABLE_NAME: props.dynamoDbTableArn,
         },
      },
   },
);

L2 construct

I can achieve the same result with less complexity by leveraging the NodejsFunction L2 construct for Lambda function. It sets a default version for Node.js runtime unless another one is explicitly specified. The construct creates a Lambda function with automatic transpiling and bundling of TypeScript or Javascript code. This results in smaller Lambda packages that contain only the code and dependencies needed to run the function, and it uses esbuild under the hood. The Lambda function handler code is located in the runtime directory of the API logical unit. I provide the path to the Lambda handler file in the entry property. I don’t have to specify the handler function name, because the NodejsFunction construct uses the handler name by default. Moreover, a Lambda execution role isn’t required to be provided during L2 Lambda construct creation. If no role is specified, then a default one is generated which has permissions for Lambda execution. In the section ‘Granting Permissions’, I describe how to customize the role after creating the construct.

lib/level2/api/infrastructure.ts

this.lambdaFunction = new lambda_nodejs.NodejsFunction(
   this, 
   "LambdaFunction", 
   {
      entry: path.resolve(__dirname, "runtime/index.ts"),
      runtime: lambda.Runtime.NODEJS_16_X,
      environment: {
         TABLE_NAME: props.dynamoDbTableName,
      },
   },
);

Create API Gateway REST API

Next, I define the API Gateway REST API to receive POST requests with Cross-origin resource sharing (CORS) enabled.

L1 construct

Every step, from creating a new API Gateway REST API, to the deployment process, must be configured individually. With an L1 construct, I must have a good understanding of CORS and the exact configuration of headers and methods.

Furthermore, I must know all of the specifics, such as for the Lambda integration type I must know how to construct the URI.

lib/level1/api/infrastructure.ts

const cfnApiGatewayRestApi = new apigateway.CfnRestApi(
   this, 
   "CfnApiGatewayRestApi", 
   {
      name: props.apiName,
   },
);

const cfnApiGatewayPostMethod = new apigateway.CfnMethod(
   this, 
   "CfnApiGatewayPostMethod", 
   {
      httpMethod: "POST",
      resourceId: cfnApiGatewayRestApi.attrRootResourceId,
      restApiId: cfnApiGatewayRestApi.ref,
      authorizationType: "NONE",
      integration: {
         credentials: cfnIamApiGatewayRole.attrArn,
         type: "AWS_PROXY",
         integrationHttpMethod: "ANY",
         uri:
            "arn:aws:apigateway:" +
            Stack.of(this).region +
            ":lambda:path/2015-03-31/functions/" +
            cfnLambdaFunction.attrArn +
            "/invocations",
            passthroughBehavior: "WHEN_NO_MATCH",
      },
   },
);

const CfnApiGatewayOptionsMethod = new apigateway.CfnMethod(
    this,
    "CfnApiGatewayOptionsMethod",
   {    
      // fields omitted
   },
);

const cfnApiGatewayDeployment = new apigateway.CfnDeployment(
    this,
    "cfnApiGatewayDeployment",
    {
      restApiId: cfnApiGatewayRestApi.ref,
      stageName: "prod",
    },
);

L2 construct

Creating an API Gateway REST API with CORS enabled is simpler with L2 constructs. I can leverage the defaultCorsPreflightOptions property and the construct builds the required options method. To set origins and methods, I can use the apigateway.Cors enum. To configure the Lambda proxy option, all I need to do is to set the proxy variable in the method to true. A default deployment is created automatically.

lib/level2/api/infrastructure.ts

this.api = new apigateway.RestApi(
   this, 
   "ApiGatewayRestApi", 
   {
      defaultCorsPreflightOptions: {
         allowOrigins: apigateway.Cors.ALL_ORIGINS,
         allowMethods: apigateway.Cors.ALL_METHODS,
      },
   },
);

this.api.root.addMethod(
    "POST",
    new apigateway.LambdaIntegration(this.lambdaFunction, {
      proxy: true,
    })
);

Granting permissions

In the sample application, I must give permissions to two different resources:

API Gateway REST API to invoke the Lambda function.
Lambda function to write data to the DynamoDB table.

L1 construct

For both resources, I must define AWS Identity and Access Management (IAM) roles. This requires in-depth knowledge of IAM, how policies are structured, and which actions are required. In the following code snippet, I start by creating the policy documents. Afterward, I create a role for each resource. These are provided at creation time to the corresponding constructs as shown earlier.

lib/level1/api/infrastructure.ts

const cfnLambdaAssumeIamPolicyDocument = {
    // fields omitted
};

this.cfnLambdaIamRole = new iam.CfnRole(
   this, 
   "cfnLambdaIamRole", 
   {
      assumeRolePolicyDocument: cfnLambdaAssumeIamPolicyDocument,
      managedPolicyArns: [
        "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
      ],
   },
);
    
const cfnApiGatewayAssumeIamPolicyDocument = {
   // fields omitted
};

const cfnApiGatewayInvokeLambdaIamPolicyDocument = {
   Version: "2012-10-17",
   Statement: [
      {
         Action: ["lambda:InvokeFunction"],
         Resource: [cfnLambdaFunction.attrArn],
         Effect: "Allow",
      },
   ],
};

const cfnApiGatewayIamRole = new iam.CfnRole(
   this, 
   "cfnApiGatewayIamRole", 
   {
      assumeRolePolicyDocument: cfnApiGatewayAssumeIamPolicyDocument,
      policies: [{
         policyDocument: cfnApiGatewayInvokeLambdaIamPolicyDocument,
         policyName: "ApiGatewayInvokeLambdaIamPolicy",
      }],
   },
);

The database construct exposes a function to grant write access to any IAM role. The function creates a policy, which allows dynamodb:PutItem on the database table and adds it as an additional policy to the role.

lib/level1/database/infrastructure.ts

grantWriteData(cfnIamRole: iam.CfnRole) {
   const cfnPutDynamoDbIamPolicyDocument = {
      Version: "2012-10-17",
      Statement: [
         {
            Action: ["dynamodb:PutItem"],
            Resource: [this.cfnDynamoDbTable.attrArn],
            Effect: "Allow",
         },
      ],
   };

    cfnIamRole.policies = [{
        policyDocument: cfnPutDynamoDbIamPolicyDocument,
        policyName: "PutDynamoDbIamPolicy",
    }];
}

At this point, all permissions are in place, except that Lambda function doesn’t have permissions to write data to the DynamoDB table yet. To grant write access, I call the grantWriteData function of the Database construct with the IAM role of the Lambda function.

lib/deployment.ts

database.grantWriteData(api.cfnLambdaIamRole)

L2 construct

Creating an API Gateway REST API with the LambdaIntegration construct generates the IAM role and attaches the role to the API Gateway REST API method. Giving the Lambda function permission to write to the DynamoDB table can be achieved with the following single line:

lib/deployment.ts

database.dynamoDbTable.grantWriteData(api.lambdaFunction);

Using L3 constructs

To reduce complexity even further, I can leverage L3 constructs. In the case of this sample architecture, I can utilize the LambdaRestApi construct. This construct uses a default Lambda proxy integration. It automatically generates a method and a deployment, and grants permissions. As a result, I can achieve the same with even less code.

const restApi = new apigateway.LambdaRestApi(
   this, 
   "restApiLevel3", 
   {
      handler: this.lambdaFunction,
      defaultCorsPreflightOptions: {
         allowOrigins: apigateway.Cors.ALL_ORIGINS,
         allowMethods: apigateway.Cors.ALL_METHODS
      },
   },
);

Cleanup

Many services in this post are available in the AWS Free Tier. However, using this solution may incur costs, and you should tear down the stack if you don’t need it anymore. Cleanup steps are included in the RADME file of the GitHub repository.

Conclusion

In this post, I highlight the difference between using L1 and L2 AWS CDK constructs with an example architecture. Leveraging L2 constructs reduces the complexity of your application by using predefined patterns, boiler plate, and glue logic. They offer convenient defaults and reduce the need to know all of the details about the AWS resources they represent, while providing convenient methods that make it simpler to work with the resource. Additionally, I showed how to reduce complexity for common tasks even further by using an L3 construct.

Visit the AWS CDK documentation to learn more about building resilient, scalable, and cost-efficient architectures with the expressive power of a programming language.

Author:

Stream Amazon EMR on EKS logs to third-party providers like Splunk, Amazon OpenSearch Service, or other log aggregators

2022-07-20 Matthew Tan

Post Syndicated from Matthew Tan original https://aws.amazon.com/blogs/big-data/stream-amazon-emr-on-eks-logs-to-third-party-providers-like-splunk-amazon-opensearch-service-or-other-log-aggregators/

Spark jobs running on Amazon EMR on EKS generate logs that are very useful in identifying issues with Spark processes and also as a way to see Spark outputs. You can access these logs from a variety of sources. On the Amazon EMR virtual cluster console, you can access logs from the Spark History UI. You also have flexibility to push logs into an Amazon Simple Storage Service (Amazon S3) bucket or Amazon CloudWatch Logs. In each method, these logs are linked to the specific job in question. The common practice of log management in DevOps culture is to centralize logging through the forwarding of logs to an enterprise log aggregation system like Splunk or Amazon OpenSearch Service (successor to Amazon Elasticsearch Service). This enables you to see all the applicable log data in one place. You can identify key trends, anomalies, and correlated events, and troubleshoot problems faster and notify the appropriate people in a timely fashion.

EMR on EKS Spark logs are generated by Spark and can be accessed via the Kubernetes API and kubectl CLI. Therefore, although it’s possible to install log forwarding agents in the Amazon Elastic Kubernetes Service (Amazon EKS) cluster to forward all Kubernetes logs, which include Spark logs, this can become quite expensive at scale because you get information that may not be important for Spark users about Kubernetes. In addition, from a security point of view, the EKS cluster logs and access to kubectl may not be available to the Spark user.

To solve this problem, this post proposes using pod templates to create a sidecar container alongside the Spark job pods. The sidecar containers are able to access the logs contained in the Spark pods and forward these logs to the log aggregator. This approach allows the logs to be managed separately from the EKS cluster and uses a small amount of resources because the sidecar container is only launched during the lifetime of the Spark job.

Implementing Fluent Bit as a sidecar container

Fluent Bit is a lightweight, highly scalable, and high-speed logging and metrics processor and log forwarder. It collects event data from any source, enriches that data, and sends it to any destination. Its lightweight and efficient design coupled with its many features makes it very attractive to those working in the cloud and in containerized environments. It has been deployed extensively and trusted by many, even in large and complex environments. Fluent Bit has zero dependencies and requires only 650 KB in memory to operate, as compared to FluentD, which needs about 40 MB in memory. Therefore, it’s an ideal option as a log forwarder to forward logs generated from Spark jobs.

When you submit a job to EMR on EKS, there are at least two Spark containers: the Spark driver and the Spark executor. The number of Spark executor pods depends on your job submission configuration. If you indicate more than one spark.executor.instances, you get the corresponding number of Spark executor pods. What we want to do here is run Fluent Bit as sidecar containers with the Spark driver and executor pods. Diagrammatically, it looks like the following figure. The Fluent Bit sidecar container reads the indicated logs in the Spark driver and executor pods, and forwards these logs to the target log aggregator directly.

Architecture of Fluent Bit sidecar

Pod templates in EMR on EKS

A Kubernetes pod is a group of one or more containers with shared storage, network resources, and a specification for how to run the containers. Pod templates are specifications for creating pods. It’s part of the desired state of the workload resources used to run the application. Pod template files can define the driver or executor pod configurations that aren’t supported in standard Spark configuration. That being said, Spark is opinionated about certain pod configurations and some values in the pod template are always overwritten by Spark. Using a pod template only allows Spark to start with a template pod and not an empty pod during the pod building process. Pod templates are enabled in EMR on EKS when you configure the Spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile. Spark downloads these pod templates to construct the driver and executor pods.

Forward logs generated by Spark jobs in EMR on EKS

A log aggregating system like Amazon OpenSearch Service or Splunk should always be available that can accept the logs forwarded by the Fluent Bit sidecar containers. If not, we provide the following scripts in this post to help you launch a log aggregating system like Amazon OpenSearch Service or Splunk installed on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

We use several services to create and configure EMR on EKS. We use an AWS Cloud9 workspace to run all the scripts and to configure the EKS cluster. To prepare to run a job script that requires certain Python libraries absent from the generic EMR images, we use Amazon Elastic Container Registry (Amazon ECR) to store the customized EMR container image.

Create an AWS Cloud9 workspace

The first step is to launch and configure the AWS Cloud9 workspace by following the instructions in Create a Workspace in the EKS Workshop. After you create the workspace, we create AWS Identity and Access Management (IAM) resources. Create an IAM role for the workspace, attach the role to the workspace, and update the workspace IAM settings.

Prepare the AWS Cloud9 workspace

Clone the following GitHub repository and run the following script to prepare the AWS Cloud9 workspace to be ready to install and configure Amazon EKS and EMR on EKS. The shell script prepare_cloud9.sh installs all the necessary components for the AWS Cloud9 workspace to build and manage the EKS cluster. These include the kubectl command line tool, eksctl CLI tool, jq, and to update the AWS Command Line Interface (AWS CLI).

$ sudo yum -y install git
$ cd ~ 
$ git clone https://github.com/aws-samples/aws-emr-eks-log-forwarding.git
$ cd aws-emr-eks-log-forwarding
$ cd emreks
$ bash prepare_cloud9.sh

All the necessary scripts and configuration to run this solution are found in the cloned GitHub repository.

Create a key pair

As part of this particular deployment, you need an EC2 key pair to create an EKS cluster. If you already have an existing EC2 key pair, you may use that key pair. Otherwise, you can create a key pair.

Install Amazon EKS and EMR on EKS

After you configure the AWS Cloud9 workspace, in the same folder (emreks), run the following deployment script:

$ bash deploy_eks_cluster_bash.sh 
Deployment Script -- EMR on EKS
-----------------------------------------------

Please provide the following information before deployment:
1. Region (If your Cloud9 desktop is in the same region as your deployment, you can leave this blank)
2. Account ID (If your Cloud9 desktop is running in the same Account ID as where your deployment will be, you can leave this blank)
3. Name of the S3 bucket to be created for the EMR S3 storage location
Region: [xx-xxxx-x]: < Press enter for default or enter region > 
Account ID [xxxxxxxxxxxx]: < Press enter for default or enter account # > 
EC2 Public Key name: < Provide your key pair name here >
Default S3 bucket name for EMR on EKS (do not add s3://): < bucket name >
Bucket created: XXXXXXXXXXX ...
Deploying CloudFormation stack with the following parameters...
Region: xx-xxxx-x | Account ID: xxxxxxxxxxxx | S3 Bucket: XXXXXXXXXXX

...

EKS Cluster and Virtual EMR Cluster have been installed.

The last line indicates that installation was successful.

Log aggregation options

There are several log aggregation and management tools on the market. This post suggests two of the more popular ones in the industry: Splunk and Amazon OpenSearch Service.

Option 1: Install Splunk Enterprise

To manually install Splunk on an EC2 instance, complete the following steps:

Launch an EC2 instance.
Install Splunk.
Configure the EC2 instance security group to permit access to ports 22, 8000, and 8088.

This post, however, provides an automated way to install Spunk on an EC2 instance:

Download the RPM install file and upload it to an accessible Amazon S3 location.
Upload the following YAML script into AWS CloudFormation.
Provide the necessary parameters, as shown in the screenshots below.
Choose Next and complete the steps to create your stack.

Splunk CloudFormation screen - 1

Splunk CloudFormation screen - 2

Splunk CloudFormation screen - 3

Alternatively, run an AWS CLI script like the following:

aws cloudformation create-stack \
--stack-name "splunk" \
--template-body file://splunk_cf.yaml \
--parameters ParameterKey=KeyName,ParameterValue="< Name of EC2 Key Pair >" \
  ParameterKey=InstanceType,ParameterValue="t3.medium" \
  ParameterKey=LatestAmiId,ParameterValue="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2" \
  ParameterKey=VPCID,ParameterValue="vpc-XXXXXXXXXXX" \
  ParameterKey=PublicSubnet0,ParameterValue="subnet-XXXXXXXXX" \
  ParameterKey=SSHLocation,ParameterValue="< CIDR Range for SSH access >" \
  ParameterKey=VpcCidrRange,ParameterValue="172.20.0.0/16" \
  ParameterKey=RootVolumeSize,ParameterValue="100" \
  ParameterKey=S3BucketName,ParameterValue="< S3 Bucket Name >" \
  ParameterKey=S3Prefix,ParameterValue="splunk/splunk-8.2.5-77015bc7a462-linux-2.6-x86_64.rpm" \
  ParameterKey=S3DownloadLocation,ParameterValue="/tmp" \
--region < region > \
--capabilities CAPABILITY_IAM

After you build the stack, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the internal and external DNS for the Splunk instance.

You use these later to configure the Splunk instance and log forwarding.

Splunk CloudFormation output screen

To configure Splunk, go to the Resources tab for the CloudFormation stack and locate the physical ID of EC2Instance.
Choose that link to go to the specific EC2 instance.
Select the instance and choose Connect.

Connect to Splunk Instance

On the Session Manager tab, choose Connect.

Connect to Instance

You’re redirected to the instance’s shell.

Install and configure Splunk as follows:

$ sudo /opt/splunk/bin/splunk start --accept-license
…
Please enter an administrator username: admin
Password must contain at least:
   * 8 total printable ASCII character(s).
Please enter a new password: 
Please confirm new password:
…
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available......... Done
The Splunk web interface is at http://ip-xx-xxx-xxx-x.us-east-2.compute.internal:8000

Enter the Splunk site using the SplunkPublicDns value from the stack outputs (for example, http://ec2-xx-xxx-xxx-x.us-east-2.compute.amazonaws.com:8000). Note the port number of 8000.
Log in with the user name and password you provided.

Splunk Login

Configure HTTP Event Collector

To configure Splunk to be able to receive logs from Fluent Bit, configure the HTTP Event Collector data input:

Go to Settings and choose Data input.
Choose HTTP Event Collector.
Choose Global Settings.
Select Enabled, keep port number 8088, then choose Save.
Choose New Token.
For Name, enter a name (for example, emreksdemo).
Choose Next.
For Available item(s) for Indexes, add at least the main index.
Choose Review and then Submit.
In the list of HTTP Event Collect tokens, copy the token value for emreksdemo.

You use it when configuring the Fluent Bit output.

splunk-http-collector-list

Option 2: Set up Amazon OpenSearch Service

Your other log aggregation option is to use Amazon OpenSearch Service.

Provision an OpenSearch Service domain

Provisioning an OpenSearch Service domain is very straightforward. In this post, we provide a simple script and configuration to provision a basic domain. To do it yourself, refer to Creating and managing Amazon OpenSearch Service domains.

Before you start, get the ARN of the IAM role that you use to run the Spark jobs. If you created the EKS cluster with the provided script, go to the CloudFormation stack emr-eks-iam-stack. On the Outputs tab, locate the IAMRoleArn output and copy this ARN. We also modify the IAM role later on, after we create the OpenSearch Service domain.

iam_role_emr_eks_job

If you’re using the provided opensearch.sh installer, before you run it, modify the file.

From the root folder of the GitHub repository, cd to opensearch and modify opensearch.sh (you can also use your preferred editor):

[../aws-emr-eks-log-forwarding] $ cd opensearch
[../aws-emr-eks-log-forwarding/opensearch] $ vi opensearch.sh

Configure opensearch.sh to fit your environment, for example:

# name of our Amazon OpenSearch cluster
export ES_DOMAIN_NAME="emreksdemo"

# Elasticsearch version
export ES_VERSION="OpenSearch_1.0"

# Instance Type
export INSTANCE_TYPE="t3.small.search"

# OpenSearch Dashboards admin user
export ES_DOMAIN_USER="emreks"

# OpenSearch Dashboards admin password
export ES_DOMAIN_PASSWORD='< ADD YOUR PASSWORD >'

# Region
export REGION='us-east-1'

Run the script:

[../aws-emr-eks-log-forwarding/opensearch] $ bash opensearch.sh

Configure your OpenSearch Service domain

After you set up your OpenSearch service domain and it’s active, make the following configuration changes to allow logs to be ingested into Amazon OpenSearch Service:

On the Amazon OpenSearch Service console, on the Domains page, choose your domain.

Opensearch Domain Console

On the Security configuration tab, choose Edit.

Opensearch Security Configuration

For Access Policy, select Only use fine-grained access control.
Choose Save changes.

The access policy should look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:xx-xxxx-x:xxxxxxxxxxxx:domain/emreksdemo/*"
    }
  ]
}

When the domain is active again, copy the domain ARN.

We use it to configure the Amazon EMR job IAM role we mentioned earlier.

Choose the link for OpenSearch Dashboards URL to enter Amazon OpenSearch Service Dashboards.

Opensearch Main Console

In Amazon OpenSearch Service Dashboards, use the user name and password that you configured earlier in the opensearch.sh file.
Choose the options icon and choose Security under OpenSearch Plugins.

opensearch menu

Choose Roles.
Choose Create role.

opensearch-create-role-button

Enter the new role’s name, cluster permissions, and index permissions. For this post, name the role fluentbit_role and give cluster permissions to the following:
1. indices:admin/create
2. indices:admin/template/get
3. indices:admin/template/put
4. cluster:admin/ingest/pipeline/get
5. cluster:admin/ingest/pipeline/put
6. indices:data/write/bulk
7. indices:data/write/bulk*
8. create_index

opensearch-create-role-button

In the Index permissions section, give write permission to the index fluent-*.
On the Mapped users tab, choose Manage mapping.
For Backend roles, enter the Amazon EMR job execution IAM role ARN to be mapped to the fluentbit_role role.
Choose Map.

opensearch-map-backend

To complete the security configuration, go to the IAM console and add the following inline policy to the EMR on EKS IAM role entered in the backend role. Replace the resource ARN with the ARN of your OpenSearch Service domain.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "es:ESHttp*"
            ],
            "Resource": "arn:aws:es:us-east-2:XXXXXXXXXXXX:domain/emreksdemo"
        }
    ]
}

The configuration of Amazon OpenSearch Service is complete and ready for ingestion of logs from the Fluent Bit sidecar container.

Configure the Fluent Bit sidecar container

We need to write two configuration files to configure a Fluent Bit sidecar container. The first is the Fluent Bit configuration itself, and the second is the Fluent Bit sidecar subprocess configuration that makes sure that the sidecar operation ends when the main Spark job ends. The suggested configuration provided in this post is for Splunk and Amazon OpenSearch Service. However, you can configure Fluent Bit with other third-party log aggregators. For more information about configuring outputs, refer to Outputs.

Fluent Bit ConfigMap

The following sample ConfigMap is from the GitHub repo:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-sidecar-config
  namespace: sparkns
  labels:
    app.kubernetes.io/name: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-application.conf
    @INCLUDE input-event-logs.conf
    @INCLUDE output-splunk.conf
    @INCLUDE output-opensearch.conf

  input-application.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/user/*/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  input-event-logs.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/apps/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  output-splunk.conf: |
    [OUTPUT]
        Name            splunk
        Match           *
        Host            < INTERNAL DNS of Splunk EC2 Instance >
        Port            8088
        TLS             On
        TLS.Verify      Off
        Splunk_Token    < Token as provided by the HTTP Event Collector in Splunk >

  output-opensearch.conf: |
[OUTPUT]
        Name            es
        Match           *
        Host            < HOST NAME of the OpenSearch Domain | No HTTP protocol >
        Port            443
        TLS             On
        AWS_Auth        On
        AWS_Region      < Region >
        Retry_Limit     6

In your AWS Cloud9 workspace, modify the ConfigMap accordingly. Provide the values for the placeholder text by running the following commands to enter the VI editor mode. If preferred, you can use PICO or a different editor:

[../aws-emr-eks-log-forwarding] $  cd kube/configmaps
[../aws-emr-eks-log-forwarding/kube/configmaps] $ vi emr_configmap.yaml

# Modify the emr_configmap.yaml as above
# Save the file once it is completed

Complete either the Splunk output configuration or the Amazon OpenSearch Service output configuration.

Next, run the following commands to add the two Fluent Bit sidecar and subprocess ConfigMaps:

[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_configmap.yaml
[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_entrypoint_configmap.yaml

You don’t need to modify the second ConfigMap because it’s the subprocess script that runs inside the Fluent Bit sidecar container. To verify that the ConfigMaps have been installed, run the following command:

$ kubectl get cm -n sparkns
NAME                         DATA   AGE
fluent-bit-sidecar-config    6      15s
fluent-bit-sidecar-wrapper   2      15s

Set up a customized EMR container image

To run the sample PySpark script, the script requires the Boto3 package that’s not available in the standard EMR container images. If you want to run your own script and it doesn’t require a customized EMR container image, you may skip this step.

Run the following script:

[../aws-emr-eks-log-forwarding] $ cd ecr
[../aws-emr-eks-log-forwarding/ecr] $ bash create_custom_image.sh <region> <EMR container image account number>

The EMR container image account number can be obtained from How to select a base image URI. This documentation also provides the appropriate ECR registry account number. For example, the registry account number for us-east-1 is 755674844232.

To verify the repository and image, run the following commands:

$ aws ecr describe-repositories --region < region > | grep emr-6.5.0-custom
            "repositoryArn": "arn:aws:ecr:xx-xxxx-x:xxxxxxxxxxxx:repository/emr-6.5.0-custom",
            "repositoryName": "emr-6.5.0-custom",
            "repositoryUri": " xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/emr-6.5.0-custom",

$ aws ecr describe-images --region < region > --repository-name emr-6.5.0-custom | jq .imageDetails[0].imageTags
[
  "latest"
]

Prepare pod templates for Spark jobs

Upload the two Spark driver and Spark executor pod templates to an S3 bucket and prefix. The two pod templates can be found in the GitHub repository:

emr_driver_template.yaml – Spark driver pod template
emr_executor_template.yaml – Spark executor pod template

The pod templates provided here should not be modified.

Submitting a Spark job with a Fluent Bit sidecar container

This Spark job example uses the bostonproperty.py script. To use this script, upload it to an accessible S3 bucket and prefix and complete the preceding steps to use an EMR customized container image. You also need to upload the CSV file from the GitHub repo, which you need to download and unzip. Upload the unzipped file to the following location: s3://<your chosen bucket>/<first level folder>/data/boston-property-assessment-2021.csv.

The following commands assume that you launched your EKS cluster and virtual EMR cluster with the parameters indicated in the GitHub repo.

Variable	Where to Find the Information or the Value Required
EMR_EKS_CLUSTER_ID	Amazon EMR console virtual cluster page
EMR_EKS_EXECUTION_ARN	IAM role ARN
EMR_RELEASE	emr-6.5.0-latest
S3_BUCKET	The bucket you create in Amazon S3
S3_FOLDER	The preferred prefix you want to use in Amazon S3
CONTAINER_IMAGE	The URI in Amazon ECR where your container image is
SCRIPT_NAME	emreksdemo-script or a name you prefer

Alternatively, use the provided script to run the job. Change the directory to the scripts folder in emreks and run the script as follows:

[../aws-emr-eks-log-forwarding] cd emreks/scripts
[../aws-emr-eks-log-forwarding/emreks/scripts] bash run_emr_script.sh < S3 bucket name > < ECR container image > < script path>

Example: bash run_emr_script.sh emreksdemo-123456 12345678990.dkr.ecr.us-east-2.amazonaws.com/emr-6.5.0-custom s3://emreksdemo-123456/scripts/scriptname.py

After you submit the Spark job successfully, you get a return JSON response like the following:

{
    "id": "0000000305e814v0bpt",
    "name": "emreksdemo-job",
    "arn": "arn:aws:emr-containers:xx-xxxx-x:XXXXXXXXXXX:/virtualclusters/upobc00wgff5XXXXXXXXXXX/jobruns/0000000305e814v0bpt",
    "virtualClusterId": "upobc00wgff5XXXXXXXXXXX"
}

What happens when you submit a Spark job with a sidecar container

After you submit a Spark job, you can see what is happening by viewing the pods that are generated and the corresponding logs. First, using kubectl, get a list of the pods generated in the namespace where the EMR virtual cluster runs. In this case, it’s sparkns. The first pod in the following code is the job controller for this particular Spark job. The second pod is the Spark executor; there can be more than one pod depending on how many executor instances are asked for in the Spark job setting—we asked for one here. The third pod is the Spark driver pod.

$ kubectl get pods -n sparkns
NAME                                        READY   STATUS    RESTARTS   AGE
0000000305e814v0bpt-hvwjs                   3/3     Running   0          25s
emreksdemo-script-1247bf80ae40b089-exec-1   0/3     Pending   0          0s
spark-0000000305e814v0bpt-driver            3/3     Running   0          11s

To view what happens in the sidecar container, follow the logs in the Spark driver pod and refer to the sidecar. The sidecar container launches with the Spark pods and persists until the file /var/log/fluentd/main-container-terminated is no longer available. For more information about how Amazon EMR controls the pod lifecycle, refer to Using pod templates. The subprocess script ties the sidecar container to this same lifecycle and deletes itself upon the EMR controlled pod lifecycle process.

$ kubectl logs spark-0000000305e814v0bpt-driver -n sparkns  -c custom-side-car-container --follow=true

Waiting for file /var/log/fluentd/main-container-terminated to appear...
AWS for Fluent Bit Container Image Version 2.24.0Start wait: 1652190909
Elapsed Wait: 0
Not found count: 0
Waiting...
Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/05/10 13:55:09] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=11
[2022/05/10 13:55:09] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/10 13:55:09] [ info] [cmetrics] version=0.3.1
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #0 started
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #1 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #0 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #1 started
[2022/05/10 13:55:09] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/10 13:55:09] [ info] [sp] stream processor started
Waiting for file /var/log/fluentd/main-container-terminated to appear...
Last heartbeat: 1652190914
Elapsed Time since after heartbeat: 0
Found count: 0
list files:
-rw-r--r-- 1 saslauth 65534 0 May 10 13:55 /var/log/fluentd/main-container-terminated
Last heartbeat: 1652190918

…

[2022/05/10 13:56:09] [ info] [input:tail:tail.0] inotify_fs_add(): inode=58834691 watch_fd=6 name=/var/log/spark/user/spark-0000000305e814v0bpt-driver/stdout-s3-container-log-in-tail.pos
[2022/05/10 13:56:09] [ info] [input:tail:tail.1] inotify_fs_add(): inode=54644346 watch_fd=1 name=/var/log/spark/apps/spark-0000000305e814v0bpt
Outside of loop, main-container-terminated file no longer exists
ls: cannot access /var/log/fluentd/main-container-terminated: No such file or directory
The file /var/log/fluentd/main-container-terminated doesn't exist anymore;
TERMINATED PROCESS
Fluent-Bit pid: 11
Killing process after sleeping for 15 seconds
root        11     8  0 13:55 ?        00:00:00 /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf
root       114     7  0 13:56 ?        00:00:00 grep fluent
Killing process 11
[2022/05/10 13:56:24] [engine] caught signal (SIGTERM)
[2022/05/10 13:56:24] [ info] [input] pausing tail.0
[2022/05/10 13:56:24] [ info] [input] pausing tail.1
[2022/05/10 13:56:24] [ warn] [engine] service will shutdown in max 5 seconds
[2022/05/10 13:56:25] [ info] [engine] service has stopped (0 pending tasks)
[2022/05/10 13:56:25] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=54644346 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917120 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917121 watch_fd=2
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834690 watch_fd=3
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834692 watch_fd=4
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834689 watch_fd=5
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834691 watch_fd=6
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopped

View the forwarded logs in Splunk or Amazon OpenSearch Service

To view the forwarded logs, do a search in Splunk or on the Amazon OpenSearch Service console. If you’re using a shared log aggregator, you may have to filter the results. In this configuration, the logs tailed by Fluent Bit are in the /var/log/spark/*. The following screenshots show the logs generated specifically by the Kubernetes Spark driver stdout that were forwarded to the log aggregators. You can compare the results with the logs provided using kubectl:

kubectl logs < Spark Driver Pod > -n < namespace > -c spark-kubernetes-driver --follow=true

…
root
 |-- PID: string (nullable = true)
 |-- CM_ID: string (nullable = true)
 |-- GIS_ID: string (nullable = true)
 |-- ST_NUM: string (nullable = true)
 |-- ST_NAME: string (nullable = true)
 |-- UNIT_NUM: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- ZIPCODE: string (nullable = true)
 |-- BLDG_SEQ: string (nullable = true)
 |-- NUM_BLDGS: string (nullable = true)
 |-- LUC: string (nullable = true)
…

|02108|RETAIL CONDO           |361450.0            |63800.0        |5977500.0      |
|02108|RETAIL STORE DETACH    |2295050.0           |988200.0       |3601900.0      |
|02108|SCHOOL                 |1.20858E7           |1.20858E7      |1.20858E7      |
|02108|SINGLE FAM DWELLING    |5267156.561085973   |1153400.0      |1.57334E7      |
+-----+-----------------------+--------------------+---------------+---------------+
only showing top 50 rows

The following screenshot shows the Splunk logs.

splunk-result-driver-stdout

The following screenshots show the Amazon OpenSearch Service logs.

opensearch-result-driver-stdout

Optional: Include a buffer between Fluent Bit and the log aggregators

If you expect to generate a lot of logs because of high concurrent Spark jobs creating multiple individual connects that may overwhelm your Amazon OpenSearch Service or Splunk log aggregation clusters, consider employing a buffer between the Fluent Bit sidecars and your log aggregator. One option is to use Amazon Kinesis Data Firehose as the buffering service.

Kinesis Data Firehose has built-in delivery to both Amazon OpenSearch Service and Splunk. If using Amazon OpenSearch Service, refer to Loading streaming data from Amazon Kinesis Data Firehose. If using Splunk, refer to Configure Amazon Kinesis Firehose to send data to the Splunk platform and Choose Splunk for Your Destination.

To configure Fluent Bit to Kinesis Data Firehose, add the following to your ConfigMap output. Refer to the GitHub ConfigMap example and add the @INCLUDE under the [SERVICE] section:

     @INCLUDE output-kinesisfirehose.conf
…

  output-kinesisfirehose.conf: |
    [OUTPUT]
        Name            kinesis_firehose
        Match           *
        region          < region >
        delivery_stream < Kinesis Firehose Stream Name >

Optional: Use data streams for Amazon OpenSearch Service

If you’re in a scenario where the number of documents grows rapidly and you don’t need to update older documents, you need to manage the OpenSearch Service cluster. This involves steps like creating a rollover index alias, defining a write index, and defining common mappings and settings for the backing indexes. Consider using data streams to simplify this process and enforce a setup that best suits your time series data. For instructions on implementing data streams, refer to Data streams.

Clean up

To avoid incurring future charges, delete the resources by deleting the CloudFormation stacks that were created with this script. This removes the EKS cluster. However, before you do that, remove the EMR virtual cluster first by running the delete-virtual-cluster command. Then delete all the CloudFormation stacks generated by the deployment script.

If you launched an OpenSearch Service domain, you can delete the domain from the OpenSearch Service domain. If you used the script to launch a Splunk instance, you can go to the CloudFormation stack that launched the Splunk instance and delete the CloudFormation stack. This removes remove the Splunk instance and associated resources.

You can also use the following scripts to clean up resources:

To remove an OpenSearch Service domain, run the delete_opensearch_domain.sh script
To remove the Splunk resource, run the remove_splunk_deployment.sh script
To remove the EKS cluster, including the EMR virtual cluster, VPC, and other IAM roles, run the remove_emr_eks_deployment.sh script

Conclusion

EMR on EKS facilitates running Spark jobs on Kubernetes to achieve very fast and cost-efficient Spark operations. This is made possible through scheduling transient pods that are launched and then deleted the jobs are complete. To log all these operations in the same lifecycle of the Spark jobs, this post provides a solution using pod templates and Fluent Bit that is lightweight and powerful. This approach offers a decoupled way of log forwarding based at the Spark application level and not at the Kubernetes cluster level. It also avoids routing through intermediaries like CloudWatch, reducing cost and complexity. In this way, you can address security concerns and DevOps and system administration ease of management while providing Spark users with insights into their Spark jobs in a cost-efficient and functional way.

If you have questions or suggestions, please leave a comment.

About the Author

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services and provides guidance to customers developing solutions with AWS Analytics services on their analytics workloads.

Selecting Network Switches for Your AWS Outposts

2022-07-19 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/selecting-network-switches-for-your-aws-outposts/

This blog post is written by, Frankie Negro, Outposts Solution Architect.

AWS Outposts is a family of fully managed solutions that extend AWS infrastructure, services, APIs, and tools to customer premises. Outposts is available in a variety of form factors, from 1U and 2U Outposts servers (https://aws.amazon.com/outposts/servers/) to 42U Outposts racks (https://aws.amazon.com/outposts/rack/). AWS Outposts is ideal for workloads that require low-latency access to on-premises systems, local data processing, data residency, and application migration with local system interdependencies.

When operating and consuming services in the AWS Regions, the underlying networking layer is completely abstracted. You do not need to be aware of the underlying networking topology, device port speeds, connectors, transports, links, and media types. Instead, the focus is on design, with the architecture leveraging the high-level constructs available for the Amazon Virtual Private Cloud (VPC), such as VPCs, Subnets, Route Tables, Security Groups, and network access control lists. The network bandwidth available for an Amazon Elastic Compute Cloud (Amazon EC2) instance depends on the number of vCPUs that it has.

AWS Outposts requires a dedicated network connection to an AWS Region defined by the customer when ordering the product. This connection is called the Service Link, and it connects to either public or private anchors (not both) in a specific Availability Zone (AZ) in the selected parent Region. AWS recommends redundant connections that meet the bandwidth requirements for Outposts rack and Outposts servers.

The purpose of AWS Outposts is to fulfill use cases where the workload has requirements that prevent or make it unfeasible to operate in the AWS Regions. Most of these use cases, such as low latency and local data processing, require strong and reliable network infrastructure to handle a high volume of packets per second.

The construct connecting AWS Outposts to the customer on-premises network is called local gateway (LGW) for Outposts rack and local network interface (LNI) for Outposts Servers. These logical elements mediate the data traffic between Outposts and the customer premises.

On Outposts rack, Service Link and LGW traffic flows through the same network connection, which can be a single link per physical device or an aggregated link. Network packets sent to the Region or to your local network are segregated using distinct virtual LANs (VLANs) on Outposts rack. The smaller family members, Outposts servers, use two distinct physical ports.

The physical network elements providing the connections between the devices and services are called Outposts Networking Devices (ONDs) on the AWS side and Customer Networking Devices (CNDs) on the customer side. For its part, Outposts rack can deliver throughput up to 400 Gbps, aggregating 4 x 100 Gbps uplinks to support Service Link and LGW network traffic, while an Outposts server can provide a 10 Gbps dedicated network port for each traffic.

The upstream devices you provide play a fundamental role in the harmonic coexistence and operation at the ethernet physical and data link layers, which are the basis for performance and stability of the upper network and transport layers as defined by the OSI model. A careful selection of your upstream networking devices must combine reliable operations, cost effectiveness, and long-term vision.

The physical layer (L1)

Here we are talking about physical cables and media interfaces. There are no supported options for UTP Cables with RJ-45 connectors, as Outposts rack only supports Fiber Optic cables with Lucent Connectors (LC). For short distances you can use MMF (Multi-Mode Fiber) or MMF OM4 (Optical Multimode) with LC. Longer distances can be achieved using SMF (Single Mode Fiber). Distance limits depend on the Fiber Mode and Type.

Each Outposts server has one physical QSFP+ interface. A 4-way breakout cable is supplied with SFP+ transceivers. You will use two interfaces: One for the LNI traffic and another for the Service Link traffic.

With this is mind, RJ-45 ports on upstream switches will not suit any AWS Outposts connections. Switch models that combine RF-45 and optical ports can be used in conjunction with copper ethernet cables category 8 (CAT8), which support up to 40 Gbps speeds, to connect other segments while the optical ports can be used for AWS Outposts.

When evaluating your upstream switches, bear in mind that Outposts rack switches are always capable of 1 / 10 / 40 / 100 Gbps speeds, and it is the same equipment regardless of the selected AWS Outposts resource ID and uplink connection speed defined during the order process.

It is recommended to account for future traffic needs from the beginning and specify upstream switches with 40 or 100 Gbps ports rather than start small and upgrade in the future. Upgrades and changes always carry risk, so limiting future risk by minimizing the need for upgrades will help mitigate issues and provide a stable, productive environment.

Another characteristic to look for when selecting your networking devices is “non-blocking” switches. These switches can handle all ports at full capacity simultaneously, without contention. It is a simple feature to select, and you can expect high performance out-of-the box without having to go too deep into details such as buffering mechanisms.

The Data Link layer (L2)

This layer establishes and terminates the logical links between nodes and exchange frames end-to-end. Outposts rack requires that your upstream devices support 802.1Q (Dot1q) standards that implement the VLAN support needed to segregate traffic to be forwarded to the Region (via Service Link) from traffic to be forwarded to the customer’s local network.

Most core switches ship with this capability. One good spec to evaluate is the maximum size of the MAC Address Table per VLAN supported by the switch. If the MAC Table gets full, your equipment may fail over to broadcast mode in that VLAN, which introduces additional stress in the network and is a potential exploit condition.

Another common feature for core switches is to support link aggregation or bundle links together so they act like a single, logical link. While AWS Outposts will work with just a single connection per OND, a recommended fault tolerance and high availability best practice is to aggregate multiple paths to withstand the failure of one or multiple members of the logical aggregation group.

As defined in the AWS Well-Architected Framework Reliability pillar design principles, to observe best practices of Automatically recover from failure and Scale horizontally to increase aggregate workload availability, you should consider implementing, for example, 4 x 10 Gbps instead of a single 40 Gbps uplink. AWS Outposts uses link aggregation control protocol (LACP) aggregations with the immediate customer network device (CND) according to the IEEE 802.3ad standard.

To learn more about how you can architect Outposts for network failures, check out the AWS Outposts High Availability Design and Architecture Considerations at this URL.

The logical interface defined as a result of the link aggregation (LAG) can be configured as an ethernet trunk port defined in the IEEE 802.1q standard to allow the use of multiple VLANs. Alternatively, the logical interface can be configured as an L3 interface with the Service Link and LGW defined as VLAN sub-interfaces. This is how AWS Outposts segregates traffic forwarded to Service Link from packets sent to the customer local network.

The Network layer (L3)

At this layer, we get into routing and logical addressing. Outposts rack requires Border Gateway Protocol (BGP) to dynamically exchange routes. Each OND device will establish eBGP peering with the upstream routing device for the Service Link and the LGW.

The architectural decision will be a trade-off between discrete components for routing and switching and an L3 switch capable of BGP routing. This aspect requires a careful assessment. It is common for a core switch to offer L3 capabilities, but BGP support is not available in most cases.

Switch design often aims for excelling at L2 and basic L3. If the network design requires advanced routing features or large IP routing tables, the safest path is to specify a powerful L2 switch and a dedicated L3 router.

Redundant equipment for fault tolerance is recommended as well. AWS does not have restrictions on how the customer implements core switches, but it’s always a good practice to keep it simple and standard, avoiding designs that include proprietary solutions, such as Virtual Chassis and Switch Clustering, because it can make troubleshooting difficult.

Conclusion

In this post, I showed the importance of dedicating time and effort to carefully evaluating the networking landscape where your AWS Outposts will be deployed, assessing the network device options available to you, designing for high-availability, and selecting switch models with proper feature sets and future-proof specifications.

The performance and operation of your AWS Outposts is largely dependent on your network substrate, and all efforts dedicated to making good decisions will be time well spent, allowing you to get the best value out of your hybrid solution while focusing on creating compelling applications and addressing your use cases with AWS Outposts.

Understanding AWS Lambda scaling and throughput

2022-07-18 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/understanding-aws-lambda-scaling-and-throughput/

AWS Lambda provides a serverless compute service that can scale from a single request to hundreds of thousands per second. When designing your application, especially for high load, it helps to understand how Lambda handles scaling and throughput. There are two components to consider: concurrency and transactions per second.

Concurrency of a system is the ability to process more than one task simultaneously. You can measure concurrency at a point in time to view how many tasks the system is doing in parallel. The number of transactions a system can process per second is not the same as concurrency, because a transaction can take more or less than a second to process.

This post shows you how concurrency and transactions per second work within the Lambda lifecycle. It also covers ways to measure, control, and optimize them.

The Lambda runtime environment

Lambda invokes your code in a secure and isolated runtime environment. The following shows the lifecycle of requests to a single function.

First request to invoke your function

At the first request to invoke your function, Lambda creates a new runtime environment. It then runs the function’s initialization (init) code, which is the code outside the main handler. Lambda then runs the function handler code as the invocation. This receives the event payload and processes your business logic.

Each runtime environment processes a single request at a time. While a single runtime environment is processing a request, it cannot process other requests.

After Lambda finishes processing the request, the runtime environment is ready to process an additional request for the same function. As the initialization code has already run, for request 2, Lambda runs only the function handler code as the invocation.

Lambda ready to process an additional request for the same function

If additional requests arrive while processing request 1, Lambda creates new runtime environments. In this example, Lambda creates new runtime environments for requests 2, 3, 4, and 5. It runs the function init and the invocation.

Lambda creating new runtime environments

When requests arrive, Lambda reuses available runtime environments, and creates new ones if necessary. The following example shows the behavior for additional requests after 1-5.

Lambda reusing runtime environments

Once request 1 completes, the runtime environment is available to process another request. When request 6 arrives, Lambda re-uses request 1s runtime environment and runs the invocation. This process continues for requests 7 and 8, which reuse the runtime environments from requests 2 and 3. When request 9 arrives, Lambda creates a new runtime environment as there isn’t an existing one available. When request 10 arrives, it reuses the runtime environment freed up after request 4.

The number of runtime environments determines the concurrency. This is the sum of all concurrent requests for currently running functions at a particular point in time. For a single runtime environment, the number of concurrent requests is 1.

Single runtime environment concurrent requests

For the example with requests 1–10, the Lambda function concurrency at particular times is the following:

Lambda concurrency at particular times

time	concurrency
t1	3
t2	5
t3	4
t4	6
t5	5
t6	2

When the number of requests decreases, Lambda stops unused runtime environments to free up scaling capacity for other functions.

Invocation duration, concurrency, and transactions per second

The number of transactions Lambda can process per second is the sum of all invokes for that period. If a function takes 1 second to run, and 10 invocations happen concurrently, Lambda creates 10 runtime environments. In this case, Lambda processes 10 requests per second.

Lambda transactions per second for 1 second invocation

If the function duration is halved to 500 ms, the Lambda concurrency remains 10. However, transactions per second are now 20.

Lambda transactions per second for 500 ms second invocation

If the function takes 2 seconds to run, during the initial second, transactions per second is 0. However, averaged over time, transactions per second is 5.

You can view concurrency using Amazon CloudWatch metrics. Use the metric name ConcurrentExecutions to view concurrent invocations for all or individual functions.

CloudWatch metrics ConcurrentExecutions

You can also estimate concurrent requests from the number of requests per unit of time, in this case seconds, and their average duration, using the formula:

RequestsPerSecond x AvgDurationInSeconds = concurrent requests

If a Lambda function takes an average 500 ms to run, at 100 requests per second, the number of concurrent requests is 50:

100 requests/second x 0.5 sec = 50 concurrent requests

If you half the function duration to 250 ms and the number of requests per second doubles to 200 requests/second, the number of concurrent requests remains the same:

200 requests/second x 0.250 sec = 50 concurrent requests.

Reducing a function’s duration can increase the transactions per second that a function can process. For more information on reducing function duration, watch this re:Invent video.

Scaling quotas

There are two scaling quotas to consider with concurrency. Account concurrency quota and burst concurrency quota.

Account concurrency is the maximum concurrency in a particular Region. This is shared across all functions in an account. The default Regional concurrency quota starts at 1,000, which you can increase with a service ticket.

The burst concurrency quota provides an initial burst of traffic for each function, between 500 and 3000 per minute, depending on the Region.

After this initial burst, functions can scale by another 500 concurrent invocations per minute for all Regions. If you reach the maximum number of concurrent requests, further requests are throttled.

For synchronous invocations, Lambda returns a throttling error (429) to the caller, which must retry the request. With asynchronous and event source mapping invokes, Lambda automatically retries the requests. See Error handling and automatic retries in AWS Lambda for more detail.

A scaling quota example

The following walks through how account and burst concurrency work for an example application.

Anticipating handling additional load, the application builders have raised account concurrency to 7,000. There are no other Lambda functions running in this account, so this function can use all available account concurrency.

Lambda account and burst concurrency

08:59: The application already has a steady stream of requests, using 1,000 concurrent runtime environments. Each Lambda invocation takes 250 ms, so transactions per second are 4,000.
09:00: There is a large spike in traffic at 5,000 sustained requests. 1000 requests use the existing runtime environments. Lambda uses the 3,000 available burst concurrency to create new environments to handle the additional load. 1,000 requests are throttled as there is not enough burst concurrency to handle all 5,000 requests. Transactions per second are 16,000.
09:01: Lambda scales by another 500 concurrent invocations per minute. 500 requests are still throttled. The application can now handle 4,500 concurrent requests.
09:02: Lambda scales by another 500 concurrent invocations per minute. No requests are throttled. The application can now handle all 5,000 requests.
09:03: The application continues to handle the sustained 5000 requests. The burst concurrency quota rises to 500.
09:04: The application sees another spike in traffic, this time unexpected. 3,000 new sustained requests arrive, a combination of 8,000 requests. 5,000 requests use the existing runtime environments. Lambda uses the now available burst concurrency of 1,000 to create new environments to handle the additional load. 1,000 requests are throttled as there is not enough burst concurrency. Another 1,000 requests are throttled as the account concurrency quota has been reached.
09:05: Lambda scales by another 500 concurrent requests. The application can now handle 6,500 requests. 500 requests are throttled as there is not enough burst concurrency. 1,000 requests are still throttled as the account concurrency quota has been reached.
09:06: Lambda scales by another 500 concurrent requests. The application can now handle 7,000 requests. 1,000 requests are still throttled as the account concurrency quota has been reached.
09:07: The application continues to handle the sustained 7,000 requests. 1,000 requests are still throttled as the account concurrency quota has been reached. Transactions per second are 28,000.

Service Quotas is an AWS service that helps you manage your quotas for many AWS services. Along with looking up the values, you can also request a limit increase from the Service Quotas console.

Service Quotas console

Reserved concurrency

You can configure a Reserved concurrency setting for your Lambda functions to allocate a maximum concurrency limit for a function. This assigns part of the account concurrency quota to a specific function. This protects and ensures that a function is not throttled and can always scale up to the reserved concurrency value.

It can also help protect downstream resources. If a database or external API can only handle 2 concurrent connections, you can ensure Lambda can’t scale beyond 2 concurrent invokes. This ensures Lambda doesn’t overwhelm the downstream service. You can also use reserved concurrency to avoid race conditions to ensure that your function can only run one concurrent invocation at a time.

Reserved concurrency

You can also set the function concurrency to zero, which stops any function invocations and acts as an off-switch. This can be useful to stop Lambda invocations when you have an issue with a downstream resource. It can give you time to fix the issue before removing or increasing the concurrency to allow invocations to continue.

Provisioned Concurrency

The function initialization process can introduce latency for your applications. You can reduce this latency by configuring Provisioned Concurrency for a function version or alias. This prepares runtime environments in advance, running the function initialization process, so the function is ready to invoke when needed.

This is primarily useful for synchronous requests to ensure you have enough concurrency before an expected traffic spike. You can still burst above this using standard concurrency. The following example shows Provisioned Concurrency configured as 10. Lambda runs the init process for 10 functions, and then when requests arrive, immediately runs the invocation.

Provisioned Concurrency

You can use Application Auto Scaling to adjust Provisioned Concurrency automatically based on Lambda’s utilization metric.

Operational metrics

There are CloudWatch metrics available to monitor your account and function concurrency to ensure that your applications can scale as expected. Monitor function Invocations and Duration to understand throughput. Throttles show throttled invocations.

ConcurrentExecutions tracks the total number of runtime environments that are processing events. Ensure this doesn’t reach your account concurrency to avoid account throttling. Use the metric for individual functions to see which are using account concurrency, and also ensure reserved concurrency is not too high. For example, a function may have a reserved concurrency of 2000, but is only using 10.

UnreservedConcurrentExecutions show the number of function invocations without reserved concurrency. This is your available account concurrency buffer.

Use ProvisionedConcurrencyUtilization to ensure you are not paying for Provisioned Concurrency that you are not using. The metric shows the percentage of allocated Provisioned Concurrency in use.

ProvisionedConcurrencySpilloverInvocations show function invocations using standard concurrency, above the configured Provisioned Concurrency value. This may show that you need to increase Provisioned Concurrency.

Conclusion

Lambda provides a highly scalable compute service. Understanding how Lambda scaling and throughput works can help you design your application, especially for high load.

This post explains concurrency and transactions per second. It shows how account and burst concurrency quotas work. You can configure reserved concurrency to ensure that your functions can always scale, and also use it to protect downstream resources. Use Provisioned Concurrency to scale up Lambda in advance of invokes.

For more serverless learning resources, visit Serverless Land.

Tighten your package security with CodeArtifact Package Origin Control toolkit

2022-07-15 Davide Semenzin

Post Syndicated from Davide Semenzin original https://aws.amazon.com/blogs/devops/tighten-your-package-security-with-codeartifact-package-origin-control-toolkit/

Introduction

AWS CodeArtifact is a fully managed artifact repository service that makes it easy for organizations to securely store and share software packages used for application development. On Jul14 2022 we introduced a new feature called Package Origin Controls which allows customers to protect themselves against “dependency substitution“ or “dependency confusion” attacks.

This class of supply chain attacks can be carried out when an attacker with knowledge of an organization’s internally published package names (for example: Sample-Package=1.0.0) is able to publish such name(s) in a public repository. Package managers contain dependency resolution logic that pulls the latest version of a package. The attacker abuses this logic by publishing a high version number of a package with the same name as the organization’s package (for example: Sample-Package=99.0.0). The package manager then would resolve any requests for that package by pulling the attacker’s package version with malicious code instead of the internally published dependency.

In order for this type of attack to be successful, the organization must source their package versions from both internal and remote repositories at the same time. For example, your pip installation could be configured with multiple package indexes, both internal and external; or, as a CodeArtifact user you may have both the repository containing your private packages as well as an external connection to PyPI in the upstream graph of your current repository. In either case, the package manager is able to obtain package versions from more than one source. This causes the package manager to resolve the higher version number from the remote repository, instead of the trusted internal version.

A few strategies can be used to mitigate this kind of mixing: a simple one is to instruct the package manager to only source from an internal repository. While effective, this is often not practical, because it either significantly degrades developer experience or requires a lot of effort in order to set up, maintain, and vet external dependencies. Another mitigation consists in using explicit version pinning, which is also effective, though it might re-introduce the dependency substitution risk upon dependency upgrade without manual vetting. Some package managers also support namespaces or other types of dependency scoping, which are also helpful in preventing this class of attacks, but when available may not always actionable for existing packages due to the large amount of work required to do the renaming.

CodeArtifact is adding another tool to strengthen your software supply chain by introducing per-package per-repository controls which allow you to more precisely configure and control how package versions are sourced. For each package in your repository, you are now able to decide whether to allow or block sourcing versions from both upstream sources and direct publishing. These flags enable you to prevent mixed versions scenarios for all the types of packages supported by CodeArtifact without the need for additional package manager configuration.

While packages retained or published after the launch of this feature come with tighter-by-default origin configurations, in keeping with the principle of least astonishment we decided not to apply any of these policies retroactively. Therefore, your existing packages in CodeArtifact will not have their origin configuration changed and your setup will continue to work continue to work as it did before the feature was released.

Should you want to leverage this feature to tighten the security posture of your existing packages, we are releasing a toolkit to make it easier to bulk-set policy values in your repositories. This blog post describes how to use it.

Solution overview

The purpose of the Package Origin Control toolkit is to provide repository administrators with an easy way to set Origin Control policies in bulk on packages that have not received the default protection because they pre-date feature release. This can be achieved by blocking upstream versions for internal packages. In this blog post we will focus on this use-case, though the toolkit does support blocking publishing package versions to avoid a potentially vulnerable mixed state for external packages as well.

The toolkit is comprised of two scripts: a first one called generate_package_configurations.py for creating a manifest file listing the packages in a domain alongside their proposed origin configuration to apply, and a second one named apply_package_configurations.py, that reads the manifest file and applies the configuration within.

generate_package_configurations.py can operate on a whole repository, or on a subset of packages (specified either via filters, or though a list) and supports two origin control resolution modes:

A manual one where you supply the origin configuration you would like to set for all packages in scope. This is a good option if, for instance, you already maintain a list of internal packages, or if they are published in a consistent internal namespace which allows for them to be easily selected.
An automated one, which tries to identify what packages should have their upstreams blocked by analyzing the upstream repository graph and external connections, looking for evidence that package versions are only available from the repository at hand- in which case it determines it can disable sourcing of upstream versions can be done without risk of breaking builds. This is a good option if you want a quick way to tighten your security posture without having to manually analyze your whole repository.

With the manifest created, apply_package_configurations.py takes it as an input and effects the changes specified in it by calling the new PutPackageOriginConfiguration API (link). Precisely because it is meant to set these values in bulk, this script supports backup and revert operations by default, as well as dry-run and step-by-step confirmation options. If you identify an issue after applying origin control changes, you will be able to safely revert to the original, working configuration before trying again.

In this blog post we will cover how to use these tools:

To block package versions from upstream sources for all recommended packages in a repository
To block upstreams for a list of packages you already have
To revert to the original state in case of an incorrect configuration push

Prerequisites

The following prerequisites are required before you begin:

Set up the Package Origin Control toolkit as described in the README on GitHub. You will need a working installation of Python 3.6 or later as well as the ability to install dependencies like the Python AWS SDK. The AWS CLI is not required.
Write permissions on the CodeArtifact repository where you want to add package origin controls (see this link for additional info.)

Procedures

To block package versions from upstream sources for all recommended packages in a repository

Introduction

This procedure should be considered if you have a CodeArtifact repository with a variety of package formats and upstreams, and you want to have the toolkit automatically resolve what packages are safe to block upstreams for. It will block acquisition of new versions from upstreams only if two conditions are met:

the target repository doesn’t have access to an external connection
no versions of the package are available via any of its upstream repositories (either because the target repository itself doesn’t have any upstreams or because none of the upstreams have the package).

Therefore, we assume there isn’t an immediate External Connection attached to the target repository for the package format(s) you are trying to run this script against (because in that case the script would fall back on leaving things as-is for all packages).

Steps

Make sure you have completed the required prerequisites described above
Identify the target repository in your domain you want to automatically apply origin controls for, e.g. myrepo
Identify the query parameters that define the list of packages you want to target. The script supports the same filters as the ListPackagesAPI.
1. To match all packages in a repo, run : python generate_package_configurations.py --region us-west-2 --domain mydomain --repository myrepo
  Please note that you always need to specify the AWS region and CodeArtifact domain alongside the repository.
2. To match only some packages in a repo, for example only Python packages whose name begins with “internal_software_”: python generate_package_configurations.py --region us-west-2 --domain mydomain --repository myrepo --format pypi --prefix internal_software_*
If necessary, you can review the produced manifest file, which you will be able to find in the same folder under the name origin-configuration_mydomain_myrepo.csv (unless you have specified a different filename and path via the --output-file option)
If the manifest file looks correct, you can apply the changes by calling the second stage: python apply_package_configurations.py --region us-west-2 --domain mydomain --repository myrepo --input origin-configuration_mydomain_myrepo.csv

To block package versions from upstream sources for all packages in a repository matching a list you maintain

Introduction

This procedure should be considered if you have a set of packages within your repository you know you want to apply origin control restrictions to. Rather than relying on a query, you can use such a list as an input to create a manifest.

Steps

Create a file containing a list of package names (and package names only). Multiple namespaces and formats are not supported and you will need to re-run this procedure for each. The expected file format is one package name per line. In this example we will want to block upstreams for three packages of the npm format (format is always mandatory when specifying a list of packages). As an example, a small input file is going to look something like this:

requests
numpy
django

(more information about this option can be found in the README)

Generate the manifest by supplying this file to the first stage, alongside the desired origin control configuration. In this case, we want to block upstreams for all packages in the supplied list (for more information about the origin control configuration string, consult the documentation): python generate_package_configurations.py --region us-west-2 --domain mydomain --repository myrepo --format npm --from-list inputfile.csv --set-restrictions publish=ALLOW,upstream=BLOCK

If necessary, you can review the produced manifest file, which you will be able to find in the same folder under the name origin-configuration_mydomain_myrepo.csv (unless you have specified a different filename and path via the –output-file option)

Run the apply_package_configurations.py script to update the package origin controls in your repository: apply_package_configurations.py --region us-west-2 --domain mydomain --repository myrepo --input origin-configuration_mydomain_myrepo.csv

To revert to the original state in case of an incorrect configuration push

Introduction

Erroneously bulk-changing your origin configuration can lead to broken builds and confusing failure modes for developers. To mitigate this risk, the toolkit backs up the existing configuration before making any changes and lets you easily revert them if need be.

Steps

Identify the manifest containing the configuration you want to revert. The toolkit automatically creates a backup file t for every input manifest you provide. This is the file produced by the first stage, which by default takes a name like origin-configuration_[domain]_[repository], for example origin-configuration_mydomain_myrepo.csv
Run the second stage script in restore mode: python apply_package_configurations.py --region us-west-2 --domain mydomain --repository myrepo --input origin-configuration_mydomain_myrepo.csv --restore

Conclusion

In this blog post we have explained how to use the Origin Control toolkit to improve package security, focusing on restricting upstream package versions. We have demonstrated both an automated repository-wide application of the toolkit, which tries to minimize the amount of repository administrator work by applying a restriction heuristic, as well as a manual mode where a repository administrator can effect fine-grained origin control changes. Finally, we showed how these changes can be reverted using the built-in backup feature.

Author:

Automating Amazon EC2-Windows EBS Volumes monitoring and creating alarms

2022-07-13 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/automating-amazon-ec2-windows-ebs-volumes-monitoring-and-creating-alarms/

This blog post is written by, Santhosh Kumar Adapa, Database Consultant, AWS WWCO ProServe, Jeevan Shetty, Database Consultant, AWS WWCO ProServe, and Bhanu Ganesh Gudivada, Consultant Databases, AWS WWCO ProServe.

Customers who are running fleets of Amazon Elastic Compute Cloud (Amazon EC2) instances use advanced monitoring techniques to observe their operational performance. Capabilities like aggregated and custom dimensions help customers categorize and customize their metrics across server fleets for fast and efficient decision making. Customers require visibility into not only infrastructure metrics (such as CPU and memory), but also disk usage metrics.

Monitoring Amazon EC2-Windows Amazon Elastic Block Store (Amazon EBS) Volumes usage is critical, especially when customers have a large fleet of Amazon EC2 Windows servers running to host their databases and applications in AWS. Generally, we see issues with EC2 instances running out of disk space, and free disk space isn’t a metric that is directly available with Amazon CloudWatch. Amazon CloudWatch agent helps solve this problem. After installing and configuring the CloudWatch agent on your EC2 instance, the agent will send metric data with the disk utilization to CloudWatch. The next step is to create a CloudWatch alarm to monitor the disk utilization metric.

In this post, we showcase the steps to automate the monitoring and creating alarms for EBS volumes attached to Amazon EC2 Windows instances. Alarms are created using AWS Lambda that monitors the free disk space and alerts whenever thresholds are crossed using Amazon Simple Notification Service (Amazon SNS).

Solution overview

To demonstrate the solution we first install and configure the CloudWatch agent in your Amazon EC2 Windows instance, and then the agent will send metric data with the disk utilization to CloudWatch. To monitor the disk on each Amazon EC2 Windows instance, we’ll use two custom Metrics, “FreeStorageSpaceInMB” and “FreeStorageSpaceInPercent”, that are collected by CloudWatch agent and pushed to CloudWatch.

The following diagram illustrates the architecture used in this post:

Amazon EC2 Windows instance with attached Amazon EBS Volumes to be monitored for free disk usage. The EC2 instance is configured with Amazon CloudWatch Agent.
CloudWatch agent is configured to monitor the “FreeStorageSpaceInMB” and “FreeStorageSpaceInPercent” metrics, and pushed to AWS CloudWatch.
Lambda function that can be invoked to create CloudWatch alarms for each disk attached to the EC2 instance.
CloudWatch Alarms are created with warnings and critical thresholds based on storage size.
Amazon SNS is used to send alerts when the CloudWatch Alarms crosses the threshold.
AWS Identity and Access Management (IAM) to provide permission to the Lambda function to get Amazon EBS metrics and to create CloudWatch Alarms.

Prerequisites

You will need the following prerequisites:

To implement this solution, you must have an Amazon EC2 Windows instance configured with Amazon CloudWatch Agent by following the steps documented in the article – How to monitor Windows and Linux servers and get internal performance metrics.
To monitor the “FreeStorageSpaceInMB” and “FreeStorageSpaceInPercent” metrics for Amazon EBS volumes attached to the EC2 instance, the CloudWatch agent configuration JSON should have the following section:

"LogicalDisk": {
	"measurement": [
	{
		"name":"% Free Space",
		"rename":"FreeStorageSpaceInPercent",
		"unit":"Percent"
	},
	{
		"name":"Free Megabytes",
		"rename":"FreeStorageSpaceInMB",
		"unit":"Megabytes"
	}
	],
	"metrics_collection_interval": 10,
	"resources": [
		"*"
	]
},

Amazon EC2 host or bastion host with an IAM role attached that has permissions to create an IAM role, Lambda function, and run Amazon Relational Database Service (Amazon RDS) CLI commands. A Lambda function and an IAM role are created using AWS Serverless Application Model (SAM).

AWS SAM

In this section, we provide the steps to create an IAM role and deploy a Lambda function using AWS SAM.

Log in to the Amazon EC2 host and install the AWS SAM CLI.
Download the source code and deploy it by running the following command:

git clone https://github.com/aws-samples/aws-ec2-windows-ebs-volumes-monitoring

cd aws-ec2-windows-ebs-volumes-monitoring/ebs_volumes_monitoring
sam deploy --guided

3. Provide the following parameters:

1. Stack Name – Name for the AWS CloudFormation stack.
2. AWS Region – AWS Region where the stack is being deployed.

The following is the sample output when you run sam deploy –guided with default arguments:

=========================================
Stack Name [ebs-volumes-monitoring]: ebs-volumes-monitoring
AWS Region [us-west-2]:
#Shows you resources changes to be deployed and require a 'Y' to initiate deploy
Confirm changes before deploy [y/N]:
#SAM needs permission to be able to create roles to connect to the resources in your template
Allow SAM CLI IAM role creation [Y/n]:
#Preserves the state of previously provisioned resources when an operation fails
Disable rollback [y/N]:
Save arguments to configuration file [Y/n]:
SAM configuration file [samconfig.toml]:
SAM configuration environment [default]:

In the following sections, we describe the AWS services deployed with AWS SAM.

IAM role

AWS SAM creates an IAM role with policies to describe EC2 instances, as well as List, Get, and Put CloudWatch metrics. Furthermore, it attaches an AWS managed IAM policy called AWSLambdaBasicExecutionRole to the IAM role. This role is attached to the Lambda functions to create Amazon EBS volume alarms for EC2 instances.

Lambda function

AWS SAM also deploys the Lambda function. It uses Python 3.8 and accepts two parameters:

Hostname: Amazon EC2 Windows instance name, or, if we must configure alarms for multiple servers, then you can use a wild card character, such as Instance_name* or Instance_name?
sns_topic_name: ARN of the SNS topic that is used to configure CloudWatch Alarms. Notification is sent to the SNS topic when the Amazon EBS Volume metric crosses the threshold.

Invoking Lambda function

After the SAM deployment is successful, we can invoke the Lambda function with the instance name and the SNS Topic ARN. The Lambda function creates two alarms (Warning and Critical) for every Amazon EBS volume attached to the instance. The Warning and Critical values can be changed in the Lambda code so that there are two different values depending on the size of the disk drive. Furthermore, the alarms are configured to send notifications to the SNS Topic. The following is the sample command to invoke the Lambda function:

aws lambda invoke --function-name ec2-ebs-metric --cli-binary-format raw-in-base64-out \
--payload '{"hostname": "Windows*", "sns_topic_name": "arn:aws:sns:us-west-2:123456789:notify_dba" }' response.json

Verifying CloudWatch Alarms:

Verify the CloudWatch Alarms that are created in the CloudWatch console. The following screenshot shows the CloudWatch alarms created for an EC2 instance with four disks. There are two alarms (Warning and Critical) created for every disk (four disks in total). Therefore, we see eight CloudWatch alarms.

Checking CloudWatch Logs:

After running the Lambda function, to verify the log, go to Lambda Service page, select the Lambda function created, navigate to the Monitor tab, and then select “View logs in CloudWatch”. Then, go to the latest log file to check the CloudWatch log files for any errors.

Select the latest Log Steam to check the details of the last Lambda function execution.

The log file shows the full details of the Lambda function execution. Furthermore, it shows the CloudWatch alarms configured for each disk, as well as if there are any errors generated during execution.

Cleanup

To clean up the resources used in this post, complete the following steps:

Delete the CloudFormation stack by running below command and replacing STACK_NAME with stack name provided in step 3a above, under section “AWS SAM”

sam delete --stack-name STACK_NAME

Confirm the stack has been deleted by running below command. Replace STACK_NAME as mentioned in previous step.

aws cloudformation list-stacks --query "StackSummaries[?contains(StackName,' STACK_NAME ')].StackStatus"

Delete any CloudWatch alarms created by the Lambda function by following the document – Editing or deleting a CloudWatch alarm.

Conclusion

In this post, we demonstrated how the requirement of monitoring Amazon EC2 Windows EBS Volumes usage is critical. In particular, this is essential when customers have a large fleet of Amazon EC2 Windows servers running to host their databases and applications in the cloud. We showcased the process of automating the free disk monitoring using Lambda and notifying through Amazon SNS when disks cross the storage threshold. By implementing such monitoring, customers can prevent issues with EC2 instances running out of disk space thus preventing critical production outages.

Provide any thoughts or questions in the comments section. We also encourage you to explore CloudWatch monitoring and try out additional use cases mentioned in the documentation.

How to Mitigate Docker Hub’s pull rate limit error through AWS Code build and ECR

2022-07-11 Bijith Nair

Post Syndicated from Bijith Nair original https://aws.amazon.com/blogs/devops/how-to-mitigate-docker-hubs-pull-rate-limit-error-through-aws-code-build-and-ecr/

How to mitigate Docker Hub’s pull rate limit errors

Docker, Inc. has announced that its hosted repository service, Docker Hub, will begin limiting the rate at which the Docker images are being pulled. The pull rate limit will purely be based on the individual IP Address. The anonymous user can do 100 pulls per 6 hours per IP Address, while the authenticated user can do 200 pulls per 6 hours.

This post shows how you can overcome those errors while you’re working with AWS Developer Tools, such as AWS CodeCommit, AWS CodeBuild, and Amazon Elastic Container Registry (Amazon ECR).

Solution overview

The workflow and architecture of the solution work as follows:

The developer will push the code, in this case (buildspec, Dockerfile, README.md) to CodeCommit by using Git client installed locally.
CodeBuild will pull the latest commit id from CodeCommit.
CodeBuild will build the base Docker image by going through the build steps listed in buildspec and Dockerfile.

Finally, Docker image will be pushed to Amazon ECR, and it can be used for future deployments.

AWS Services Overview:

Figure 1. Architectural Overview – Leverage AWS Developer tools to mitigate Docker hub’s pull rate limit error.

For this post, we’ll be using the following AWS services

AWS CodeCommit – a fully-managed source control service that hosts secure Git-based repositories.
AWS CodeBuild – a fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy.
Amazon ECR – an AWS managed container image registry service that is secure, scalable, and reliable.

Prerequisites

The prerequisites, before we build the pipeline, are as follows:

Install a Git client to clone the source code and to push the code to CodeCommit.
Dockerfile – Sample file to produce the Docker image and push it to an Amazon ECR.
Install Docker daemon on your localhost or laptop, and make sure that it’s up and running.The latest version of the AWS Command Line Interface (AWS CLI). For more information, see Installing, updating, and uninstalling the AWS CLI.
An AWS account with local credentials properly configured (typically under ~/.aws/credentials).
An IAM user with Git credentials.

Setup Amazon ECR repositories

We’ll be setting up two Amazon ECR repositories. The first repository, “golang”, will hold the base image which we will be migrating from Docker hub. And the second repository, “mydemorepo”, will hold the final image for your sample application.

Figure 2. ECR Repositories with Source and Target image.

Migrate your existing Docker image from Docker Hub to Amazon ECR

Note that for this post, I’ll be using golang:1.12-alpine as the base Docker image to be migrated to Amazon ECR.

Once you have your base repository setup, the next step is to pull the golang:1.12-alpine public image from docker hub to you laptop/Server.

To do that, log in to your local server where you have deployed all of the prerequisites. This includes the Docker engine, Git Client, AWS CLI, etc., and run the following command:

docker pull golang:1.12-alpine

Authenticate to your Amazon ECR private registry, and click here to learn more about private registry authentication.

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 12345678910.dkr.ecr.us-east-1.amazonaws.com

Apply the appropriate Tag to your image.

docker tag golang:1.12-alpine 12345678910.dkr.ecr.us-east-1.amazonaws.com/golang:1.12-alpine

Push your image to the Amazon ECR repository.

docker push 150359982618.dkr.ecr.us-east-1.amazonaws.com/golang:1.12-alpine

Figure 3. The Source Image.

Once your image gets pushed to Amazon ECR with all of the required tags, the next step is to use this base image as the source image to build our sample application. Furthermore, while going through the “Overview of the Solution” section, you might have noticed that we must have a source code repository for CodeBuild to fetch the latest code from, and to build a sample application. Let’s go through how to set up your source code repository.

Set up a source code repository using CodeCommit

Let’s start by setting up the repository name. In this case, let’s call it “mysampleapp”. Keep rest of the settings as is, and then select create.

Figure 4. Screenshot for creating a repository under AWS Code Commit.

Once the repository gets created, go to the top-right corner. Under Clone URL, select Clone HTTPS.

Figure 5. Clone URL to copy the code Locally on your desktop or server.

Go back to your local server or laptop where you want to clone the repo, and authenticate your repository. Learn more about how to HTTPS connections to AWS Codecommit repositories.

git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/mysampleapp mysampleapp

Note that you have your empty repository cloned locally with a folder name “mysampleapp”. The next step is to set up Dockerfile and buildspec.

Set up a Dockerfile and buildspec for building the base image of your Sample Application

To build a Docker image, we’ll set up three files under the “mysampleapp” folder.

Buildspec.yml
Dockerfile
README.md

Figure 6. List out the base code pushed to AWS Code commit.

Note that I have also listed the content inside of buildspec.yml, Dockerfile, and ReadME.md in case you want a sample code to test the scenario.

buildspec.yml

version: 0.2
phases:
pre_build:
commands:
- echo Logging in to Amazon ECR...
- REPOSITORY_URI=12345678910.dkr.ecr.us-east-1.amazonaws.com/mydemorepo
- AWS_DEFAULT_REGION=us-east-1
- AWS_ACCOUNT_ID=12345678910
- aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com
- IMAGE_REPO_NAME=mydemorepo
- COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
- IMAGE_TAG=build-$(echo $CODEBUILD_BUILD_ID | awk -F":" '{print $2}')
build:
commands:
- echo Build started on `date`
- echo Building the Docker image...
- docker build -t $REPOSITORY_URI:latest .
- docker images
- docker tag $REPOSITORY_URI:latest $REPOSITORY_URI:$IMAGE_TAG
post_build:
commands:
- echo Build completed on `date`
- echo Pushing the Docker image...
- docker push $REPOSITORY_URI:latest
- docker push $REPOSITORY_URI:$IMAGE_TAG

Dockerfile

Point your Dockfile to use Amazon ECR repository instead of Docker hub.

FROM golang:1.12-alpine AS build << Replace the public Image with ECR private image

FROM 150359982618.dkr.ecr.us-east-1.amazonaws.com/golang: 1.12-alpine

AS build

#Install git

RUN apk add --no-cache git

#Get the hello world package from a GitHub repository

RUN go get github.com/golang/example/hello

WORKDIR /go/src/github.com/golang/example/hello

#Build the project and send the output to /bin/HelloWorld

RUN go build -o /bin/HelloWorld

README.md

> Demo Repository has files related to CodeBuild spec and Dockerfile

* buildspec.yaml
* Dockerfile

Now, commit your changes locally, and push them to Amazon ECR. Note that to learn more about how to set up HTTPS users using Git Credentials, click here.

git add .
git commit -m "My First Commit - Sample App"
git push -u origin master

Now, go back to your AWS Management Console, and Under Developer Tools, select CodeCommit. Go to mysampleapp, and you should see three files with the latest commits.

Figure 7. Screenshot with buildspec, Dockerfile and README

That concludes our setup to CodeCommit. Next, we’ll set up a build project.

Set up a CodeBuild project

Enter the project name, in this case let’s use “mysamplbuild”.

Figure 8. Setting up a Build Project.

Select the provider as CodeCommit, followed by the repository and the branch that contains the latest commit.

Figure 9. Selecting the source code provider which is Code Commit and the source repository.

Select the runtime as standard, and then choose the latest Amazon Linux image. Make sure that the environment type is set to Linux. Privileged option should be checked. For the rest of the options, go with the defaults.

Figure 10. Required environment variables and attributes.

Figure 11. Choose the buildspec.

Once the project gets created, select the project, and select start Build.

Figure 12. Screenshot of the build project along with details.

Once you trigger the build, you will notice that CodeBuild is pulling the image from Amazon ECR instead of Docker hub.

Figure 13. The log file for the build.

Clean up

To avoid incurring future charges, delete the resources.

Conclusion

Developers can use AWS to host both their private and public container images. This decreases the need to use different public websites and registries. Public images will be geo-replicated for reliable availability around the world, and they’ll offer fast downloads to quickly serve up images on-demand. Anyone (with or without an AWS account) will be able to browse and pull containerized software for use in their own applications.

Author:

Use Amazon Athena parameterized queries to provide data as a service

2022-07-11 Blayze Stefaniak

Post Syndicated from Blayze Stefaniak original https://aws.amazon.com/blogs/big-data/use-amazon-athena-parameterized-queries-to-provide-data-as-a-service/

Amazon Athena now provides you more flexibility to use parameterized queries for any query you send to Athena, and we recommend you use them as the best practice for your Athena queries moving forward so you benefit from the security, reusability, and simplicity they offer. In a previous post, Improve reusability and security using Amazon Athena parameterized queries, we explained how parameterized queries with prepared statements provide reusability of queries, protection against SQL injection, and masking of query strings from AWS CloudTrail events. In this post, we explain how you can run Athena parameterized queries using the ExecutionParameters property in your StartQueryExecution requests. We provide a sample application you can reference for using parameterized queries, with and without prepared statements. Athena parameterized queries can be integrated into many data driven applications, and we walk you through a sample data as a service application to see how parameterized queries can plug in.

Customers tell us they are finding new ways to make effective use of their data assets by providing data as a service (DaaS). In this post, we share a sample architecture using parameterized queries applied in the form of a DaaS application. This is helpful for many types of organizations, whether you’re working with an enterprise making data available to other lines of business, a regulator making reports available to your industry, a company monetizing your data assets, an independent software vendor (ISV) enabling your applications’ tenants to query their data when they need it, or trying to share data at scale in other ways. In DaaS applications, you can provide predefined queries to run against your governed datasets with values your users input. You can expand your DaaS application to break away from monolithic data infrastructure by treating data as a product (DaaP) and providing a distribution of datasets, which have distinct domain-specific data pipelines. You can authorize these datasets to consumers in your DaaS application permissions. You can use Athena parameterized queries as a way to predefine your queries, which you can use to run queries across your datasets, and serve as a layer of protection for your DaaS applications. This post first describes how parameterized queries work, then applies parameterized queries in the form of a DaaS application.

Feature overview

In any query you send to Athena, you can use positional parameters declared by a question mark (?) in your query string, then declare values as execution parameters sequentially in your StartQueryExecution request. You can use execution parameters with your existing prepared statements and also with any SQL queries in Athena. You can still take advantage of the reusability and security benefits of parameterized queries, and using execution parameters also masks your query’s parameters when viewing recent queries in Athena. You can also change from building SQL query strings manually to using execution parameters; this allows you to run parameterized queries without needing to first create prepared statements. For more information on using execution parameters, refer to StartQueryExecution.

Previously, you could only run parameterized queries by first creating prepared statements in your Athena workgroup, then running parameterized queries while passing variables into an EXECUTE SQL statement with the USING clause. You are no longer required to create and maintain prepared statements across all of your Athena workgroups to take advantage of parameterization. This is helpful if you run the same queries across multiple workgroups or otherwise do not need the prepared statements feature.

You can continue to use Athena workgroups to isolate, implement individual cost constraints, and track query-related metrics for tenants within your multi-tenant application. For example, your DaaS application’s customers can run the same queries against your dataset with separate workgroups. For more information on Athena workgroups, refer to Using workgroups for running queries.

Changing your code to use parameterized queries

Changing your existing code to use parameterized queries is a small change which will have an immediate positive impact. Previously, you were required to build your query string value manually using environment variables as parameter placeholders. Manipulating the query string can be burdensome and has an inherent risk for injecting undesired values or SQL fragments (such as SQL operators), regardless of intent. You can now replace variables in your query string with a question mark (?), and declare your variable values sequentially with the ExecutionParameters option. By doing so, you take advantage of the security benefits of parameterized queries, and your queries are less complicated to author and maintain. The syntax change is shown in the following code, using the AWS Command Line Interface (AWS CLI) as an example.

Previously, running queries against Athena without execution parameters:

aws athena start-query-execution \
--query-string "SELECT * FROM table WHERE x = $ARG1 AND y = $ARG2 AND z = $ARG3" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup

Now, running parameterized queries against Athena with execution parameters:

aws athena start-query-execution \
--query-string "SELECT * FROM table WHERE x = ? AND y = ? AND z = ?" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup \
--execution-parameters $ARG1 $ARG2 $ARG3

The following is an example of a command that creates a prepared statement in your Athena workgroup. To learn more about creating prepared statements, refer to Querying with prepared statements.

aws athena start-query-execution \
--query-string "PREPARE my-prepared-statement FROM SELECT * FROM table WHERE x = ? AND y = ? AND z = ?" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup

Previously, running parameterized queries against prepared statements without execution parameters:

aws athena start-query-execution \
--query-string "EXECUTE my-prepared-statement USING $ARG1, $ARG2, $ARG3“ \
--query-execution-context "Database"="default" \
--work-group myWorkGroup

Now, running parameterized queries against prepared statements with execution parameters:

aws athena start-query-execution \
--query-string "EXECUTE my-prepared-statement" \
--query-execution-context "Database"="default" \
--work-group myWorkGroup \
--execution-parameters $ARG1 $ARG2 $ARG3

Sample architecture

The purpose of this sample architecture is to apply the ExecutionParameters feature when running Athena queries, with and without prepared statements. This is not intended to be a DaaS solution for use with your production data.

This sample architecture exhibits a DaaS application with a user interface (UI) that presents three Athena parameterized queries written against the public Amazon.com customer reviews dataset. The following figure depicts this workflow when a user submits a query to Athena. This example uses AWS Amplify to host a front-end application. The application calls an Amazon API Gateway HTTP API, which invokes AWS Lambda functions to authenticate requests, fetch the Athena prepared statements and named queries, and run the parameterized queries against Athena. The Lambda function uses the name of the Athena workgroup, statement name, statement type (prepared statement or not), and a list of query parameters input by the user. Athena queries data in an Amazon Simple Storage Service (Amazon S3), bucket which is cataloged in AWS Glue, and presents results to the user on the DaaS application UI.

End-users of the DaaS application UI can run only parameterized queries against Athena. The DaaS application UI demonstrates two ways to run parameterized queries with execution parameters: with and without prepared statements. In both cases, the Lambda function submits the query, waits for the query to complete, and provides the results that match the query parameters. The following figure depicts the DaaS application UI.

You may want your users to have the ability to list all Athena prepared statements within your Athena workgroup, select a statement, input arguments, and run the query; on the left side of the DaaS application UI, you use an EXECUTE statement to query the data lake with an Athena prepared statement. You may have several reporting queries maintained in your code base. In this case, your users select a statement, input arguments, and run the query. On the right side of the DaaS application UI, you use a SELECT statement to use Athena parameterized queries without prepared statements.

Prerequisites

This post uses the following AWS services to demonstrate a DaaS architecture pattern that uses Athena to query the Amazon.com customer reviews dataset:

AWS Amplify
Amazon API Gateway
Amazon Athena
AWS CloudFormation
Amazon DynamoDB
AWS Glue and the AWS Glue Data Catalog
AWS Identity and Access Management (IAM)
AWS Lambda
Amazon S3

This post assumes you have the following:

An AWS account. For instructions, refer to Creating an AWS account.
A CloudTrail trail created. For instructions, refer to Creating a trail.
An S3 bucket created. For instructions, refer to Creating a bucket.
Node.js version 16+ installed on your device.

Deploy the CloudFormation stack

In this section, you deploy a CloudFormation template that creates the following resources:

AWS Glue Data Catalog database
AWS Glue Data Catalog table
An Athena workgroup
Three Athena prepared statements
Three Athena named queries
The API Gateway HTTP API
The Lambda execution role for Athena queries
The Lambda execution role for API Gateway HTTP API authorization
Five Lambda functions:
- Update the AWS Glue Data Catalog
- Authorize API Gateway requests
- Submit Athena queries
- List Athena prepared statements
- List Athena named queries

Note that this CloudFormation template was tested in AWS Regions ap-southeast-2, ca-central-1, eu-west-2, us-east-1, us-east-2, and us-west-2. Note that deploying this into your AWS account will incur cost. Steps for cleaning up the resources are included later in this post.

To deploy the CloudFormation stack, follow these steps:

Navigate to this post’s GitHub repository.
Clone the repository or copy the CloudFormation template athena-parameterized-queries.yaml.
On the AWS CloudFormation console, choose Create stack.
Select Upload a template file and choose Choose file.
Upload athena-parameterized-queries.yaml, then choose Next.
On the Specify stack details page, enter the stack name athena-parameterized-queries.
On the same page, there are two parameters:
1. For S3QueryResultsBucketName, enter the S3 bucket name in your AWS account and in the same AWS Region as where you’re running your CloudFormation stack. (For this post, we use the bucket name value, like my-bucket).
2. For APIPassphrase, enter a passphrase to authenticate API requests. You use this later.
Choose Next.
On the Configure stack options page, choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and choose Create stack.

The script takes less than two minutes to run and change to a CREATE_COMPLETE state. If you deploy the stack twice in the same AWS account and Region, some resources may already exist, and the process fails with a message indicating the resource already exists in another template.

On the Outputs tab, copy the APIEndpoint value to use later.

For least-privilege authorization for deployment of the CloudFormation template, you can create an AWS CloudFormation service role with the following IAM policy actions. To do this, you must create an IAM policy and IAM role, and choose this role when configuring stack options. You need to replace the values for ${Partition}, ${AccountId}, and ${Region} with your own values; for more information on these values, refer to Pseudo parameters reference.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "IAM",
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:UntagRole",
                "iam:TagRole",
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:PassRole",
                "iam:GetRolePolicy",
                "iam:PutRolePolicy",
                "iam:AttachRolePolicy",
                "iam:TagPolicy",
                "iam:DeleteRolePolicy",
                "iam:DetachRolePolicy",
                "iam:UntagPolicy"
            ],
            "Resource": [
                "arn:${Partition}:iam::${AccountId}:role/LambdaAthenaExecutionRole-athena-parameterized-queries",
                "arn:${Partition}:iam::${AccountId}:role/service-role/LambdaAthenaExecutionRole-athena-parameterized-queries",
                "arn:${Partition}:iam::${AccountId}:role/service-role/LambdaAuthorizerExecutionRole-athena-parameterized-queries",
                "arn:${Partition}:iam::${AccountId}:role/LambdaAuthorizerExecutionRole-athena-parameterized-queries"
            ]
        },
        {
            "Sid": "LAMBDA",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:GetFunction",
                "lambda:InvokeFunction",
                "lambda:AddPermission",
                "lambda:DeleteFunction",
                "lambda:RemovePermission",
                "lambda:UpdateFunctionConfiguration"
            ],
            "Resource": [
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaRepairFunction-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaAthenaFunction-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaAuthorizerFunction-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:GetPrepStatements-athena-parameterized-queries",
                "arn:${Partition}:lambda:${Region}:${AccountId}:function:GetNamedQueries-athena-parameterized-queries"
            ]
        },
        {
            "Sid": "ATHENA",
            "Effect": "Allow",
            "Action": [
                "athena:GetWorkGroup",
                "athena:CreateWorkGroup",
                "athena:DeleteWorkGroup",
                "athena:DeleteNamedQuery",
                "athena:CreateNamedQuery",
                "athena:CreatePreparedStatement",
                "athena:DeletePreparedStatement",
                "athena:GetPreparedStatement"
            ],
            "Resource": [
                "arn:${Partition}:athena:${Region}:${AccountId}:workgroup/ParameterizedStatementsWG"
            ]
        },
        {
            "Sid": "GLUE",
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:CreateTable",
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:${Partition}:glue:${Region}:${AccountId}:catalog",
                "arn:${Partition}:glue:${Region}:${AccountId}:database/athena_prepared_statements",
                "arn:${Partition}:glue:${Region}:${AccountId}:table/athena_prepared_statements/*",
                "arn:${Partition}:glue:${Region}:${AccountId}:userDefinedFunction/athena_prepared_statements/*"
            ]
        },
        {
            "Sid": "APIGATEWAY",
            "Effect": "Allow",
            "Action": [
                "apigateway:DELETE",
                "apigateway:PUT",
                "apigateway:PATCH",
                "apigateway:POST",
                "apigateway:TagResource",
                "apigateway:UntagResource"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis/*/integrations*",
                "arn:${Partition}:apigateway:${Region}::/apis/*/stages*",
                "arn:${Partition}:apigateway:${Region}::/apis/*/authorizers*",
                "arn:${Partition}:apigateway:${Region}::/apis/*/routes*",
                "arn:${Partition}:apigateway:${Region}::/tags/arn%3Aaws%3Aapigateway%3A${Region}%3A%3A%2Fv2%2Fapis%2F*"
            ]
        },
        {
            "Sid": "APIGATEWAYMANAGEAPI",
            "Effect": "Allow",
            "Action": [
                "apigateway:DELETE",
                "apigateway:PUT",
                "apigateway:PATCH",
                "apigateway:POST",
                "apigateway:GET"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis"
            ],
            "Condition": {
                "StringEquals": {
                    "apigateway:Request/ApiName": "AthenaAPI-athena-parameterized-queries"
                }
            }
        },
        {
            "Sid": "APIGATEWAYMANAGEAPI2",
            "Effect": "Allow",
            "Action": [
                "apigateway:DELETE",
                "apigateway:PUT",
                "apigateway:PATCH",
                "apigateway:POST",
                "apigateway:GET"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis/*"
            ],
            "Condition": {
                "StringEquals": {
                    "apigateway:Resource/ApiName": "AthenaAPI-athena-parameterized-queries"
                }
            }
        },
        {
            "Sid": "APIGATEWAYGET",
            "Effect": "Allow",
            "Action": [
                "apigateway:GET"
            ],
            "Resource": [
                "arn:${Partition}:apigateway:${Region}::/apis/*"
            ]
        },
        {
            "Sid": "LAMBDALAYER",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:${Partition}:lambda:*:280475519630:layer:boto3-1_24*"
            ]
        }
    ]
}

After you create the CloudFormation stack, you use the AWS management console to deploy an Amplify application and view the Lambda functions. The following is the scoped-down IAM policy that you can attach to an IAM user or role to perform these operations:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmplifyCreateApp",
            "Effect": "Allow",
            "Action": [
                "amplify:CreateBranch",
                "amplify:StartDeployment",
                "amplify:CreateDeployment",
                "amplify:CreateApp",
                "amplify:StartJob"
            ],
            "Resource": "arn:${Partition}:amplify:${Region}:${AccountId}:apps/*"
        },
        {
            "Sid": "AmplifyList",
            "Effect": "Allow",
            "Action": "amplify:List*",
            "Resource": "arn:${Partition}:amplify:${Region}:${AccountId}:apps/*"
        },
        {
            "Sid": "AmplifyGet",
            "Effect": "Allow",
            "Action": "amplify:GetJob",
            "Resource": "arn:${Partition}:amplify:${Region}:${AccountId}:apps/*"
        },
        {
            "Sid": "LambdaList",
            "Effect": "Allow",
            "Action": [
                "lambda:GetAccountSettings",
                "lambda:ListFunctions"
            ],
            "Resource": "*"
        },
        {
            "Sid": "LambdaFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:GetFunction"
            ],
            "Resource": "arn:${Partition}:lambda:${Region}:${AccountId}:function:LambdaAthenaFunction-athena-parameterized-queries"
        }
    ]
}

Note that you need the following IAM policy when deploying your Amplify application to set a global password, and when cleaning up your resources to delete the Amplify application. Remember to replace ${AppARN} with the ARN of the Amplify application. You can find the ARN after creating the Amplify app on the General tab in the App Settings section of the Amplify console.

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Sid": "UpdateAndDeleteAmplifyApp",
           "Effect": "Allow",
            "Action": [
                "amplify:DeleteApp",
                "amplify:UpdateApp"
            ],
           "Resource": "${AppARN}"
       }
   ]
}

Deploy the Amplify application

In this section, you deploy your Amplify application.

In the cloned repository, open web-application/.env in a text editor.
Set AWS_API_ENDPOINT as the APIEndpoint value from the CloudFormation stack Outputs For example: AWS_API_ENDPOINT="https://123456abcd.execute-api.your-region.amazonaws.com".
Set API_AUTH_CODE as the value you input as the CloudFormation stack’s APIPassphrase parameter argument. For example: API_AUTH_CODE="YOUR_PASSPHRASE".
Navigate to the web-application/ directory and run npm install.
Run npm run build to compile distribution assets.
On the Amplify console, choose All apps.
Choose New app.
Select Host web app, select Deploy without Git provider, then choose Continue.
For App name, enter Athena Parameterized Queries App.
For Environment name¸ you don’t need to enter a value.
Select Drag and Drop.
Locate the dist/ directory inside web-application/, drag it into the window and drop it. Ensure you drag the entire directory, not the files within it.
Choose Save and deploy to deploy the web application on Amplify.

This step takes less than a minute to complete.

Under App settings, choose Access control, then choose Manage access.
Select Apply a global password, then enter values for Username and Password.

You use these credentials to access your Amplify application.

Access your Amplify application and run queries

In this section, you use the Amplify application to run Athena parameterized queries against the Amazon.com customer reviews dataset. The left side of the application shows how you can run parameterized queries using Athena prepared statements. The right side of the application shows how you can run parameterized queries without prepared statements, such as if the queries are written in your code. The sample in this post uses named queries within the Athena workgroup. For more information about named queries, refer to NamedQuery.

Open the Amplify web application link located under Domain. For example: https://dev123.abcd12345xyz.amplifyapp.com/.
In the Sign in prompt, enter the user name and password you provided as the Amplify application global password.
For Workgroup Name, choose the ParameterizedStatementsWG workgroup.
Choose a statement example on the Prepared Statement or SQL Statement drop-down menu.

Selecting a statement displays a description about the query, including examples of parameters you can try with this statement, and the original SQL query string. SQL parameters of type string must be surrounded by single quotes, for example: 'your_string_value'.

Enter your query parameters.

The following figure shows an example of the parameters to input for the product_helpful_reviews prepared statement.

Choose Run Query to send the query request to the API endpoint.

After the query runs, the sample application presents the results in a table format, as depicted in the following screenshot. This is one of many ways to present results, and your application can display results in the format which makes the most sense for your users. The complete query workflow is depicted in the previous architecture diagram.

Screenshot of the sample application's query results rendered in a table format. The table has columns for product_id, product_title, star_rating, helpful_votes, review_headline, and review_body. The query returned two results, which are 4 star reviews for the Amazon Smile eGift Card.

Using execution parameters with the AWS SDK for Python (Boto3)

In this section, you inspect the Lambda function code for using the StartQueryExecution API with and without prepared statements.

On the Lambda console, choose Functions.
Navigate to the LambdaAthenaFunction-athena-parameterized-queries function.
Choose the Code Source window.

Examples of passing parameters to the Athena StartQueryExecution API using the AWS SDK for Python (Boto3) begin on lines 39 and 49. Note the ExecutionParameters option on lines 45 and 55.

The following code uses execution parameters with Athena prepared statements:

response = athena.start_query_execution(
    QueryString=f'EXECUTE {statement}', # Example: "EXECUTE prepared_statement_name"
    WorkGroup=workgroup,
    QueryExecutionContext={
        'Database': 'athena_prepared_statements'
    },
    ExecutionParameters=input_parameters
)

The following code uses execution parameters without Athena prepared statements:

response = athena.start_query_execution(
    QueryString=statement, # Example: "SELECT * FROM TABLE WHERE parameter_name = ?"
    WorkGroup=workgroup,
    QueryExecutionContext={
        'Database': 'athena_prepared_statements'
    },
    ExecutionParameters=input_parameters
)

Clean up

In this post, you created several components, which generate cost. To avoid incurring future charges, remove the resources with the following steps:

Delete the S3 bucket’s results prefix created after you ran a query on your workgroup.

With the default template, the prefix is named <S3QueryResultsBucketName>/athena-results. Use caution in this step. Unless you are using versioning on your S3 bucket, deleting S3 objects can’t be undone.

On the Amplify console, select the app to delete and on the Actions menu, choose Delete app, then confirm.
On the AWS CloudFormation console, select the stack to delete, choose Delete, and confirm.

Conclusion

In this post, we showed how you can build a DaaS application using Athena parameterized queries. The StartQueryExecution API in Athena now supports execution parameters, which allows you to run any Athena query as a parameterized query. You can decouple your execution parameters from your query strings, and use parameterized queries without being limited to the Athena workgroups where you have created prepared statements. You can take advantage of the security benefits Athena offers with parameterized queries, and developers no longer need to build query strings manually. In this post, you learned how to use execution parameters, and you deployed a DaaS reference architecture to see how parameterized queries can be applied.

You can get started with Athena parameterized queries by using the Athena console, the AWS CLI, or the AWS SDK. To learn more about Athena, refer to the Amazon Athena User Guide.

Thanks for reading this post! If you have questions about Athena prepared statements and parameterized queries, don’t hesitate to leave a comment.

About the Authors

Blayze Stefaniak is a Senior Solutions Architect for the Technical Strategist Program supporting Executive Customer Programs in AWS Marketing. He has experience working across industries including healthcare, automotive, and public sector. He is passionate about breaking down complex situations into something practical and actionable. In his spare time, you can find Blayze listening to Star Wars audiobooks, trying to make his dogs laugh, and probably talking on mute.

Daniel Tatarkin is a Solutions Architect at Amazon Web Services (AWS) supporting Federal Financial organizations. He is passionate about big data analytics and serverless technologies. Outside of work, he enjoys learning about personal finance, coffee, and trying out new programming languages for fun.

Matt Boyd is a Senior Solutions Architect at AWS working with federal financial organizations. He is passionate about effective cloud management and governance, as well as data governance strategies. When he’s not working, he enjoys running, weight lifting, and teaching his elementary-age son ethical hacking skills.

Accelerate machine learning with AWS Data Exchange and Amazon Redshift ML

2022-07-08 Yadgiri Pottabhathini

Post Syndicated from Yadgiri Pottabhathini original https://aws.amazon.com/blogs/big-data/accelerate-machine-learning-with-aws-data-exchange-and-amazon-redshift-ml/

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service, without requiring you to become an expert in ML.

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. AWS Data Exchange for Amazon Redshift enables you to access and query tables in Amazon Redshift without extracting, transforming, and loading files (ETL).

As a subscriber, you can browse through the AWS Data Exchange catalog and find data products that are relevant to your business with data stored in Amazon Redshift, and subscribe to the data from the providers without any further processing, and no need for an ETL process.

If the provider data is not already available in Amazon Redshift, many providers will add the data to Amazon Redshift upon request.

In this post, we show you the process of subscribing to datasets through AWS Data Exchange without ETL, running ML algorithms on an Amazon Redshift cluster, and performing local inference and production.

Solution overview

The use case for the solution in this post is to predict ticket sales for worldwide events based on historical ticket sales data using a regression model. The data or ETL engineer can build the data pipeline by subscribing to the Worldwide Event Attendance product on AWS Data Exchange without ETL. You can then create the ML model in Redshift ML using the time series ticket sales data and predict future ticket sales.

To implement this solution, you complete the following high-level steps:

Subscribe to datasets using AWS Data Exchange for Amazon Redshift.
Connect to the datashare in Amazon Redshift.
Create the ML model using the SQL notebook feature of the Amazon Redshift query editor V2.

The following diagram illustrates the solution architecture.

Prerequisites

Before starting this walkthrough, you must complete the following prerequisites:

Make sure you have an existing Amazon Redshift cluster with RA3 node type. If not, you can create a provisioned Amazon Redshift cluster.
Make sure the Amazon Redshift cluster is encrypted, because the data provider is encrypted and Amazon Redshift data sharing requires homogeneous encryption configurations. For more details on homogeneous encryption, refer to Data sharing considerations in Amazon Redshift.
Create an AWS Identity Access and Management (IAM) role with access to SageMaker and Amazon Simple Storage Service (Amazon S3) and attach it to the Amazon Redshift cluster. Refer to Cluster setup for using Amazon Redshift ML for more details.
Create an S3 bucket for storing the training data and model output.

Please note that some of the above AWS resources in this walkthrough will incur charges. Please remember to delete the resources when you’re finished.

Subscribe to an AWS Data Exchange product with Amazon Redshift data

To subscribe to an AWS Data Exchange public dataset, complete the following steps:

On the AWS Data Exchange console, choose Explore available data products.
In the navigation pane, under Data available through, select Amazon Redshift to filter products with Amazon Redshift data.
Choose Worldwide Event Attendance (Test Product).
Choose Continue to subscribe.
Confirm the catalog is subscribed by checking that it’s listed on the Subscriptions page.

Predict tickets sold using Redshift ML

To set up prediction using Redshift ML, complete the following steps:

On the Amazon Redshift console, choose Datashares in the navigation pane.
On the Subscriptions tab, confirm that the AWS Data Exchange datashare is available.
In the navigation pane, choose Query editor v2.
Connect to your Amazon Redshift cluster in the navigation pane.

Amazon Redshift provides a feature to create notebooks and run your queries; the notebook feature is currently in preview and available in all Regions. For the remaining part of this tutorial, we run queries in the notebook and create comments in the markdown cell for each step.

Choose the plus sign and choose Notebook.
Choose Add markdown.
Enter Show available data share in the cell.
Choose Add SQL.
Enter and run the following command to see the available datashares for the cluster:
```
SHOW datashares;
```

You should be able to see worldwide_event_test_data, as shown in the following screenshot.

Note the producer_namespace and producer_account values from the output, which we use in the next step.
Choose Add markdown and enter Create database from datashare with producer_namespace and procedure_account.
Choose Add SQL and enter the following code to create a database to access the datashare. Use the producer_namespace and producer_account values you copied earlier.
```
CREATE DATABASE ml_blog_db FROM DATASHARE worldwide_event_test_data OF ACCOUNT 'producer_account' NAMESPACE 'producer_namespace';
```
Choose Add markdown and enter Create new table to consolidate features.

Choose Add SQL and enter the following code to create a new table called event consisting of the event sales by date and assign a running serial number to split into training and validation datasets:

CREATE TABLE event AS
	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday, ROW_NUMBER() OVER (ORDER BY RANDOM()) r
	FROM "ml_blog_db"."public"."sales" s
	INNER JOIN "ml_blog_db"."public"."event" e
	ON s.eventid = e.eventid
	INNER JOIN "ml_blog_db"."public"."date" d
	ON s.dateid = d.dateid;

Event name, quantity sold, sale time, day, week, month, quarter, year, and holiday are columns in the dataset from AWS Data Exchange that are used as features in the ML model creation.

Choose Add markdown and enter Split the dataset into training dataset and validation dataset.

Choose Add SQL and enter the following code to split the dataset into training and validation datasets:

CREATE TABLE training_data AS 
	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
	FROM event
	WHERE r >
	(SELECT COUNT(1) * 0.2 FROM event);

	CREATE TABLE validation_data AS 
	SELECT eventname, qtysold, saletime, day, week, month, qtr, year, holiday
	FROM event
	WHERE r <=
	(SELECT COUNT(1) * 0.2 FROM event);

Choose Add markdown and enter Create ML model.

Choose Add SQL and enter the following command to create the model. Replace the your_s3_bucket parameter with your bucket name.

CREATE MODEL predict_ticket_sold
	FROM training_data
	TARGET qtysold
	FUNCTION predict_ticket_sold
	IAM_ROLE 'default'
	PROBLEM_TYPE regression
	OBJECTIVE 'mse'
	SETTINGS (s3_bucket 'your_s3_bucket',
	s3_garbage_collect off,
	max_runtime 5000);

Note: It can take up to two hours to create and train the model.
The following screenshot shows the example output from adding our markdown and SQL.

Choose Add markdown and enter Show model creation status. Continue to next step once the Model State has changed to Ready.
Choose Add SQL and enter the following command to get the status of the model creation:
```
SHOW MODEL predict_ticket_sold;
```

Move to the next step after the Model State has changed to READY.

Choose Add markdown and enter Run the inference for eventname Jason Mraz.
When the model is ready, you can use the SQL function to apply the ML model to your data. The following is sample SQL code to predict the tickets sold for a particular event using the predict_ticket_sold function created in the previous step:
```
SELECT eventname,
	predict_ticket_sold(
	eventname, saletime, day, week, month, qtr, year, holiday ) AS predicted_qty_sold,
	day, week, month
	FROM event
	Where eventname = 'Jason Mraz';
```

The following is the output received by applying the ML function predict_ticket_sold on the original dataset. The output of the ML function is captured in the field predicted_qty_sold, which is the predicted ticket sold quantity.

Share notebooks

To share the notebooks, complete the following steps:

Create an IAM role with the managed policy AmazonRedshiftQueryEditorV2FullAccess attached to the role.
Add a principal tag to the role with the tag name sqlworkbench-team.
Set the value of this tag to the principal (user, group, or role) you’re granting access to.
After you configure these permissions, navigate to the Amazon Redshift console and choose Query editor v2 in the navigation pane. If you haven’t used the query editor v2 before, please configure your account to use query editor v2.
Choose Notebooks in the left pane and navigate to My notebooks.
Right-click on the notebook you want to share and choose Share with my team.
You can confirm that the notebook is shared by choosing Shared to my team and checking that the notebook is listed.

Summary

In this post, we showed you how to build an end-to-end pipeline by subscribing to a public dataset through AWS Data Exchange, simplifying data integration and processing, and then running prediction using Redshift ML on the data.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.

About the Authors

Yadgiri Pottabhathini is a Sr. Analytics Specialist Solutions Architect. His role is to assist customers in their cloud data warehouse journey and help them evaluate and align their data analytics business objectives with Amazon Redshift capabilities.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Configuring low latency connectivity between AWS Outposts rack and on-premises data using CoIP

2022-07-05 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/configuring-low-latency-connectivity-between-aws-outposts-rack-using-coip-and-on-premises-data/

This blog post is written by, Leonardo Azize Martins, Cloud Infrastructure Architect, Professional Services.

AWS Outposts rack enables applications that need to run on-premises due to low latency, local data processing, or local data storage needs by connecting Outposts rack to your on-premises network via the local gateway (LGW).

Each Outpost rack includes a local gateway to provide low latency connectivity between the Outpost and any local data sources, end users, local machinery and equipment, or local databases. If you have an Outpost rack, you can include a local gateway as the target in your VPC subnet route table where the destination is your on-premises network. Local gateways are only available for Outposts rack and can only be used in route tables where the VPC has been associated with an LGW.

In a previous blog post on connecting AWS Outposts to on-premises data sources, you learned the different use cases to use AWS Outposts connected with your local network. In this post, you will dive deep into local gateway usage and specific details about it. You will learn how to use Outpost local gateway when it is configured as Customer-owned IP addresses (CoIP) mode. You will also learn how to integrate it with your Amazon Virtual Private Cloud (Amazon VPC) and how different routes work regarding Amazon Elastic Compute Cloud (Amazon EC2) instances running on Outposts.

Overview of solution

The primary role of a local gateway is to provide connectivity from an Outpost to your local on-premises LAN. It also provides AWS Outposts connectivity to the internet through your on-premises network via the LGW, so you don’t need to rely on an internet gateway (IGW). The local gateway can also provide a data plane path back to the AWS Region. If you already have connectivity between your LAN and the Region through AWS Site-to-Site VPN or AWS Direct Connect, you can use the same path to connect from the Outpost to the AWS Region privately.

Outpost subnet

Public subnets and private subnets are important concepts to understand for Outposts networking. A public subnet has a route to an internet gateway. The same concept applies to Outpost subnets, when a public subnet exists on Outposts, it will have a route to an internet gateway, which will use a service link as a communication path between an Outpost and the internet gateway in the parent AWS Region.

A private subnet does not have a direct route to the internet gateway. It will only be local inside the VPC, or it could have a route to a Network Address Translation (NAT) gateway. In both cases, the communication between Outpost subnets and AWS Region subnets will be done via the service link.

Your subnet can be private and only be allowed to communicate with your on-premises network. You just need a route pointing to the LGW.

You can also provide internet connectivity to your Outpost subnets via LGW. In this case, it will not use the service link. As soon as it traverses the LGW and goes to your next hop, it will follow your routing flow to the internet.

Routing

By default, every Outpost subnet inherits the main route table from its VPC. You can create a custom route table and associate it with an Outpost subnet. You can include a local gateway as the target when the destination is your on-premises network. A local gateway can only be used in VPC and subnet route tables that are exclusively associated with an Outpost subnet. If the route table is associated with an Outpost subnet and a Region subnet, it will not allow you to add a local gateway as the target.

Local gateways are also not supported in the main route table.

The local gateway advertises Outpost IP address ranges to your on-premises network via BGP. In the other direction, from an on-premises network to the Outpost, it doesn’t use BGP, which means there is no propagation. You need to configure your VPC route table with static routes.

As of this writing, the LGW does not support jumbo frame.

Outposts IP addresses

Outposts can be configured in customer-owned IP (CoIP) mode.

Customer-owned IP

During the installation process, uses information that you provide about your on-premises network to create an address pool, which is known as a customer-owned IP address pool (CoIP pool). AWS then assigns it to the local gateway for use and advertises back to your on-premises network through BGP.

CoIP addresses provide local or external connectivity to resources in your Outpost subnets through your on-premises network. You can assign these IP addresses to resources on your Outpost, such as an EC2 instance, by allocating a new Elastic IP address from the customer-owned IP pool and then assigning this new Elastic IP address to your EC2 instance.

A local gateway serves as NAT for EC2 instances that have been assigned addresses from your customer-owned IP pool.

You can optionally share your customer-owned pool with multiple AWS accounts in your AWS Organizations using the AWS Resource Access Manager (RAM). After you share the pool, participants can allocate and associate Elastic IP addresses from the customer-owned IP pool.

Communication between your Outpost and on-premises network will use the CoIP Elastic IP addresses to address instances in the Outpost; the VPC CIDR range is not used.

Walkthrough

You will follow the steps required to configure your VPC to use LGW configured as CoIP, including:

Associate your VPC with the LGW route table.
Create an Outposts subnet.
Create and associate the VPC route table with the subnet.
Add a route to on-premises network with LGW as the target.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account.
An Outpost that consists of one or more Outposts racks configured in CoIP mode.

Associate VPC with LGW route table

Use the following procedure to associate a VPC with the LGW route table. You can’t associate VPCs that have a CIDR block conflict.

Open the AWS Outposts console.
In the navigation pane, choose Local gateway route tables.
Select the route table and then choose Actions, Associate VPC.
For VPC ID, select the VPC to associate with the local gateway route table.
Choose Associate VPC.

Create an Outpost subnet

You can add Outpost subnets to any VPC in the parent AWS Region for the Outpost. When you do so, the VPC also spans the Outpost.

Open the AWS Outposts console.
On the navigation pane, choose Outposts.
Select the Outpost, and then choose Actions, Create subnet.
Select the VPC and specify an IP address range for the subnet.
Choose Create.

Create and associate VPC route table with the Outpost subnet

You can create a custom route table for your VPC using the Amazon VPC console. It is a best practice to have one specific route table for each subnet.

Open the Amazon VPC console.
In the navigation pane, choose Route Tables.
Choose Create route table.
For VPC, choose your VPC.
Choose Create.
On the Subnet associations tab, choose Edit subnet associations.
Select the check box for the subnet to associate with the route table and then choose Save associations.

Add a route to on-premises network with LGW as the target

You can add a route to a route table using the Amazon VPC console.

Open the Amazon VPC console.
In the navigation pane, choose Route Tables and select
Choose Actions, Edit routes.
Choose Add route. For Destination, enter the destination CIDR block, a single IP address, or the ID of a prefix list.
Choose Save routes.

Allocate and associate a customer-owned IP address

Open the Amazon EC2 console.
In the navigation pane, choose Elastic IPs.
Choose Allocate new address.
For Network Border Group, select the location from which the IP address is advertised.
For Public IPv4 address pool, choose Customer owned IPv4 address pool.
For Customer owned IPv4 address pool, select the pool that you configured.
Choose Allocate and close the confirmation screen.
In the navigation pane, choose Elastic IPs.
Select an Elastic IP address and choose Actions, Associate address.
Select the instance from Instance and then choose Associate.

For more information about Launch an instance on your Outpost, refer to the AWS Outposts User Guide.

Cleaning up

To avoid incurring future charges, delete the resources, like EC2 instances.

Conclusion

In this post, I covered how to use the Outposts rack local gateway to communicate with your on-premises network. You learned how a subnet route table can influence the connectivity of public or private Outpost instances.

To learn more, check out our Outposts local gateway documentation and the networking reference architecture.

Analyze logs with Dynatrace Davis AI Engine using Amazon Kinesis Data Firehose HTTP endpoint delivery

2022-07-05 Erick Leon

Post Syndicated from Erick Leon original https://aws.amazon.com/blogs/big-data/analyze-logs-with-dynatrace-davis-ai-engine-using-amazon-kinesis-data-firehose-http-endpoint-delivery/

This blog post is co-authored with Erick Leon, Sr. Technical Alliance Manager from Dynatrace.

Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. With just a few clicks, you can create fully-managed delivery streams that auto scale on demand to match the throughput of your data. Customers already use Kinesis Data Firehose to ingest raw data from various data sources, including logs from AWS services. Kinesis Data Firehose now supports delivering streaming data to Dynatrace. Dynatrace begins analyzing incoming data within minutes of Amazon CloudWatch data generation.

Starting today, you can use Kinesis Data Firehose to send CloudWatch Metrics and Logs directly to the Dynatrace observability platform to perform your explorations and analysis. Dynatrace, an AWS Partner Network (APN) has provided full observability into AWS Services by ingesting CloudWatch metrics that are published by AWS services. Dynatrace ingests this data to perform root-cause analysis using the Dynatrace Davis AI engine.

In this post, we describe the Kinesis Data Firehose and related Dynatrace integration.

Prerequisites

For this walkthrough, you should have the following prerequisites:

AWS account.
Access to the CloudWatch and Kinesis Data Firehose with permissions to manage HTTP endpoints.
Dynatrace Intelligent Observability Platform account, or get a free 15 day trial here.
Dynatrace version 1.182+.
An updated AWS monitoring policy to include the additional AWS services.
To update the AWS Identity and Access Management (IAM) policy, use the JSON in the link above, containing the monitoring policy (permissions) for all supporting services.
Dynatrace API token: create token with the following permission and keep readily available in a notepad.

Figure 1 – Dynatrace API Token

How it works

Figure 2 – Amazon Kinesis Data Firehose HTTP endpoint delivery

Simply create a log stream for your Amazon services to deliver your context rich logs to the Amazon CloudWatch Logs service. Next, select your Dynatrace HTTP endpoint to enhance your logs streams with the power of the Dynatrace Intelligence Platform. Finally, you can also back up your logs to an Amazon Simple Storage Service (Amazon S3) bucket.

Setup instructions

To add a service to monitoring, follow these steps:

In the Dynatrace menu, go to Settings > Cloud and virtualization, and select AWS.
On the AWS overview page, scroll down and select the desired AWS instance. Select the Edit button.
Scroll down and select Add service. Choose the service name from the drop-down, and select Add service.
Select Save changes.

To process and deliver AWS CloudWatch Metrics to Dynatrace, follow these steps.

Figure 3 – AWS Console

On the Amazon Kinesis services page, select the radio button for Kinesis Data Firehose and select the Create delivery stream button.

Figure 4 – Amazon Kinesis

Choose the “Direct PUT” from the drop down, and from Destination drop down, choose “Dynatrace”.

Figure 5 – Amazon Kinesis Data Firehose

Delivery stream name – Give your stream a new name, for example: – “KFH-StreamToDynatrace”

Figure 6 – Delivery stream name

In the section “Destination settings”:

Figure 7 – Destination settings

HTTP endpoint name – “Dynatrace”.
HTTP endpoint URL – From the drop down, select “Dynatrace – US”.
API token – Enter Dynatrace API TOKEN created in the prerequisite section.
API URL – enter the Dynatrace URL for your tenant, for example: https://xxxxx.live.dynatrace.com
Back Up Settings – Either select an existing S3 bucket or create a new one and add details and select the Create delivery stream button.

Figure 8 – Backup settings

Once successful, your AWS Console will look like the following:

Figure 9 – Amazon Kinesis Data Firehose

The Dynatrace Experience

Once the initial setups are completed in both Dynatrace and the AWS Console, follow these steps to visualize your new KHF stream data in the Dynatrace console.

Log in to the Dynatrace Console, and on the left side menu expand the “infrastructure” section, and select “AWS”
From the screen, select the AWS account that you want to add the KFH stream to.
Next, you’ll see a virtualization of your AWS assets for the account selected. Select the box marked “Supporting Services”.
Next, press the “Configure services” button.
Next, select “Add service”.
From the drop down, select “Kinesis Data Firehose”.
Next, select the “Add metric” button, and select the metrics that you want to see for this stream. Dynatrace has a comprehensive list of metrics that can be selected from the UI. The list can be found in this link.

Troubleshooting

After configuration, load to the new KFH stream no data in the Dynatrace Console.
1. Check the Error Logs tab check to make sure that the Destination URL is correct for the Dynatrace Tenant.

Figure 10 – Destination error logs

Invalid or misconfigured Dynatrace API token or scope isn’t properly set.

Figure 11 – Destination error logs

Conclusion

In this post, we demonstrate the Kinesis Data Firehose and related Dynatrace integration. In addition, engineers can use CloudWatch Metrics to explore their production systems alongside events in Dynatrace. This provides a seamless, current view of your system (from logs to events and traces) in a single data store.

To learn more about CloudWatch Service, see the Amazon CloudWatch home page. If you have any questions, post them on the AWS CloudWatch service forum.

If you haven’t yet signed up for Dynatrace, then you can try out Kinesis Data Firehose with Dynatrace with a free Dynatrace trial.

About the Authors

Erick Leon is a Technical Alliances Sr. Manager at Dynatrace, Observability Practice Architect, and Customer Advocate. He promotes strong technical integrations with a focus on AWS. With over 15 years as a Dynatrace customer, his real-world experiences and lessons learned bring valuable insights into the Dynatrace Intelligent Observability Platform.

Shashiraj Jeripotula (Raj) is a San Francisco-based Sr. Partner Solutions Architect at AWS. He works with various independent software vendors (ISVs), and partners who specialize in cloud management tools and DevOps to develop joint solutions and accelerate cloud adoption on AWS.

Understanding the lifecycle of Amazon EC2 Dedicated Hosts

2022-07-01 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/understanding-the-lifecycle-of-amazon-ec2-dedicated-hosts/

This post is written by Benjamin Meyer, Sr. Solutions Architect, and Pascal Vogel, Associate Solutions Architect.

Amazon Elastic Compute Cloud (Amazon EC2) Dedicated Hosts enable you to run software on dedicated physical servers. This lets you comply with corporate compliance requirements or per-socket, per-core, or per-VM licensing agreements by vendors, such as Microsoft, Oracle, and Red Hat. Dedicated Hosts are also required to run Amazon EC2 Mac Instances.

The lifecycles and states of Amazon EC2 Dedicated Hosts and Amazon EC2 instances are closely connected and dependent on each other. To operate Dedicated Hosts correctly and consistently, it is critical to understand the interplay between Dedicated Hosts and EC2 Instances. In this post, you’ll learn how EC2 instances are reliant on their (dedicated) hosts. We’ll also dive deep into their respective lifecycles, the connection points of these lifecycles, and the resulting considerations.

What is an EC2 instance?

An EC2 instance is a virtual server running on top of a physical Amazon EC2 host. EC2 instances are launched using a preconfigured template called Amazon Machine Image (AMI), which packages the information required to launch an instance. EC2 instances come in various CPU, memory, storage and GPU configurations, known as instance types, to enable you to choose the right instance for your workload. The process of finding the right instance size is known as right sizing. Amazon EC2 builds on the AWS Nitro System, which is a combination of dedicated hardware and the lightweight Nitro hypervisor. The EC2 instances that you launch in your AWS Management Console via Launch Instances are launched on AWS-controlled physical hosts.

What is an Amazon EC2 Bare Metal instance?

Bare Metal instances are instances that aren’t using the Nitro hypervisor. Bare Metal instances provide direct access to physical server hardware. Therefore, they let you run legacy workloads that don’t support a virtual environment, license-restricted business-critical applications, or even your own hypervisor. Workloads on Bare Metal instances continue to utilize AWS Cloud features, such as Amazon Elastic Block Store (Amazon EBS), Elastic Load Balancing (ELB), and Amazon Virtual Private Cloud (Amazon VPC).

What is an Amazon EC2 Dedicated Host?

An Amazon EC2 Dedicated Host is a physical server fully dedicated to a single customer. With visibility of sockets and physical cores of the Dedicated Host, you can address corporate compliance requirements, such as per-socket, per-core, or per-VM software licensing agreements.

You can launch EC2 instances onto a Dedicated Host. Instance families such as M5, C5, R5, M5n, C5n, and R5n allow for the launching of different instance sizes, such as4xlarge and 8xlarge, to the same host. Other instance families only support a homogenous launching of a single instance size. For more details, see Dedicated Host instance capacity.

As an example, let’s look at an M6i Dedicated Host. M6i Dedicated Hosts have 2 sockets and 64 physical cores. If you allocate a M6i Dedicated Host, then you can specify what instance type you’d like to support for allocation. In this case, possible instance sizes are:

large
xlarge
2xlarge
4xlarge
8xlarge
12xlarge
16xlarge
24xlarge
32xlarge
metal

The number of instances that you can launch on a single M6i Dedicated Host depends on the selected instance size. For example:

In the case of xlarge (4 vCPUs), a maximum of 32 m6i.xlarge instances can be scheduled on this Dedicated Host.
In the case of 8xlarge (32 vCPUs), a maximum of 4 m6i.8xlarge instances can be scheduled on this Dedicated Host.
In the case of metal (128 vCPUs), a maximum of 1 m6i.metal instance can be scheduled on this Dedicated Host.

When launching an EC2 instance on a Dedicated Host, you’re billed for the Dedicated Host but not for the instance. The cost for Amazon EBS volumes is the same as in the case of regular EC2 instances.

Exemplary M6i Dedicated Host instance selections: m6i.xlarge, m6i.8xlarge and m6i.metal

Understanding the EC2 instance lifecycle

Amazon EC2 instance lifecycle states and transitions

Throughout its lifecycle, an EC2 instance transitions through different states, starting with its launch and ending with its termination. Upon Launch, an EC2 instance enters the pending state. You can only launch EC2 instances on Dedicated Hosts in the available state. You aren’t billed for the time that the EC2 instance is in any state other than running. When launching an EC2 instance on a Dedicated Host, you’re billed for the Dedicated Host but not for the instance. Depending on the user action, the instance can transition into three different states from the running state:

Via Reboot from the running state, the instance enters the rebooting state. Once the reboot is complete, it reenters the running state.
In the case of an Amazon EBS-backed instance, a Stop or Stop-Hibernate transitions the running instance into the stopping state. After reaching the stopped state, it remains there until further action is taken. Via Start, the instance will reenter the pending and subsequently the running state. Via Terminate from the stopped state, the instance will enter the terminated state. As part of a Stop or Stop-Hibernate and subsequent Start, the EC2 instance may move to a different AWS-managed host. On Reboot, it remains on the same AWS-managed host.
Via Terminate from the running state, the instance will enter the shutting-down state, and finally the terminated state. An instance can’t be started from the terminated state.

Understanding the Amazon EC2 Dedicated Host lifecycle

Amazon EC2 Dedicated Host lifecycle states and transitions

An Amazon EC2 Dedicated Host enters the available state as soon as you allocate it in your AWS account. Only if the Dedicated Host is in the available state, you can launch EC2 instances on it. You aren’t billed for the time that your Dedicated Host is in any state other than available. From the available state, the following states and state transitions can be reached:

You can Release the Dedicated Host, transitioning it into the released state. Amazon EC2 Mac Instances Dedicated Hosts have a minimum allocation time of 24h. They can’t be released within the 24h. You can’t release a Dedicated Host that contains instances in one of the following states: pending, running, rebooting, stopping, or shutting down. Consequently, you must Stop or Terminate any EC2 instances on the Dedicated Host and wait until it’s in the available state before being able to release it. Once an instance is in the stopped state, you can move it to a different Dedicated Host by modifying its Instance placement configuration.
The Dedicated Host may enter the pending state due to a number of reasons. In case of an EC2 Mac instance, stopping or terminating a Mac instance initiates a scrubbing workflow of the underlying Dedicated Host, during which it enters the pending state. This scrubbing workflow includes tasks such as erasing the internal SSD, resetting NVRAM, and more, and it can take up to 50 minutes to complete. Additionally, adding or removing a Dedicated Host to or from a Resource Group can cause the Dedicated Host to go into the pending state. From the pending state, the Dedicated Host will reenter the available state.
The Dedicated Host may enter the under-assessment state if AWS is investigating a possible issue with the underlying infrastructure, such as a hardware defect or network connectivity event. While the host is in the under-assessment state, all of the EC2 instances running on it will have the impaired status. Depending on the nature of the underlying issue and if it’s configured, the Dedicated Host will initiate host auto recovery.

If Dedicated Host Auto Recovery is enabled for your host, then AWS attempts to restart the instances currently running on a defect Dedicated Host on an automatically allocated replacement Dedicated Host without requiring your manual intervention. When host recovery is initiated, the AWS account owner is notified by email and by an AWS Health Dashboard event. A second notification is sent after the host recovery has been successfully completed. Initially, the replacement Dedicated Host is in the pending state. EC2 instances running on the defect dedicated Host remain in the impaired status throughout this process. For more information, see the Host Recovery documentation.

Once all of the EC2 instances have been successfully relaunched on the replacement Dedicated Host, it enters the available state. Recovered instances reenter the running state. The original Dedicated Host enters the released-permanent-failure state. However, if the EC2 instances running on the Dedicated Host don’t support host recovery, then the original Dedicated Host enters the permanent-failure state instead.

Conclusion

In this post, we’ve explored the lifecycles of Amazon EC2 instances and Amazon EC2 Dedicated Hosts. We took a close look at the individual lifecycle states and how both lifecycles must be considered in unison to operate EC2 Instances on EC2 Dedicated Hosts correctly and consistently. To learn more about operating Amazon EC2 Dedicated Hosts, visit the EC2 Dedicated Hosts User Guide.

Sink Amazon Kinesis Data Analytics Apache Flink output to Amazon Keyspaces using Apache Cassandra Connector

2022-06-30 Pratik Patel

Post Syndicated from Pratik Patel original https://aws.amazon.com/blogs/big-data/sink-amazon-kinesis-data-analytics-apache-flink-output-to-amazon-keyspaces-using-apache-cassandra-connector/

Amazon Keyspaces (for Apache Cassandra) is a scalable, highly available, and managed Apache Cassandra–compatible database service. With Amazon Keyspaces you don’t have to provision, patch, or manage servers, and you don’t have to install, maintain, or operate software. Amazon Keyspaces is serverless, so you only pay for the resources that you use and the service can automatically scale tables up and down in response to application traffic. You can use Amazon Keyspaces to store large volumes of data, such as entries in a log file or the message history for a chat application as Amazon Keyspaces offers virtually unlimited throughput and storage. You can also use Amazon Keyspaces to store information about devices for Internet of Things (IoT) applications or player profiles for games.

A popular use case in the wind energy sector is to protect wind turbines from wind speed. Engineers and analysts often want to see real-time aggregated wind turbine speed data to analyze the current situation out in the field. Furthermore, they need access to historical aggregated wind turbine speed data to build machine learning (ML) models which can help them take preventative actions on wind turbines. Customers often ingest high-velocity IoT data into Amazon Kinesis Data Streams and use Amazon Kinesis Data Analytics, AWS Lambda, or Amazon Kinesis Client Library (KCL) applications to aggregate IoT data in real-time and store it in Amazon Keyspaces, Amazon DynamoDB, or Amazon Timestream.

In this post, we demonstrate how to aggregate sensor data using Amazon Kinesis Data Analytics and persist aggregated sensor data in to Amazon Keyspaces using Apache Flink’s Apache Cassandra Connector.

Architecture

BDB-2063-kda-keyspaces-architecture

In the architecture diagram above, Lambda simulates wind speed sensor data and ingests sensor data into Amazon Kinesis Data Stream. Amazon Kinesis Data Analytics Apache Flink application reads wind speed sensor data from Amazon Kinesis Data Stream in real-time and aggregates wind speed sensor data using a five minutes tumbling window and storing aggregated wind speed sensor data into Amazon Keyspaces table. Aggregated wind speed sensor data stored in Amazon Keyspaces can be used by engineers and analysts to review real-time dashboards or to perform historical analysis on specific wind turbine.

Deploying resources using AWS CloudFormation

After you sign in to your AWS account, launch the AWS CloudFormation template by choosing Launch Stack:

The CloudFormation template configures the following resources in your account:

One Lambda function which simulates wind turbine data
One Amazon Kinesis Data Stream
One Amazon Kinesis Data Analytics Apache Flink application
An AWS Identity and Access Management (IAM) role (service execution role) for Amazon Kinesis Data Analytics Apache Flink application
One Amazon Keyspaces Table: turbine_aggregated_sensor_data

After you complete the setup, sign in to the Kinesis Data Analytics console. On the Kinesis Data Analytics applications page, choose the Streaming applications tab, where you can see the Streaming application in the ready status. Select the Streaming application, choose Run, and wait until the Streaming application is in running status. It can take a couple of minutes for the Streaming application to get into running status.

Now that we have deployed all of the resources using CloudFormation template, let’s review deployed resources and how they function.

Format of wind speed sensor data

Lambda simulates wind turbine speed data every one minute and ingests it into Amazon Kinesis Data Stream. Each wind turbine sensor data message consists of two attributes: turbineId and speed.

{
  "turbineId": "turbine-0001",
  "speed": 60
}

Schema of destination Amazon Keyspaces table

We’ll store aggregated sensor data in to destination turbine_aggregated_sensor_data Amazon Keyspaces table. turbine_aggregated_sensor_data table has on-demand capacity mode enabled. Amazon Keyspaces (for Apache Cassandra) on-demand capacity mode is a flexible billing option capable of serving thousands of requests per second without capacity planning. This option offers pay-per-request pricing for read and write requests so that you pay only for what you use. When you choose on-demand mode, Amazon Keyspaces can scale the throughput capacity for your table up to any previously reached traffic level instantly, and then back down when application traffic decreases. If a workload’s traffic level hits a new peak, then the service adapts rapidly to increase throughput capacity for your table.

BDB-2063-keyspaces-table BDB-2063-keyspaces-table-def-1 BDB-2063-keyspaces-table-def-2

Apache Flink code to aggregate and persist data in Amazon Keyspaces Table

Apache Flink source code used by this post can be found on the KeyspacesSink section of Kinesis Data Analytics Java Examples public git repository.

The following code snippet demonstrates how incoming wind turbine messages are getting aggregated using a five-minute tumbling window and produces a DataStream of TurbineAggregatedRecord records.

DataStream<TurbineAggregatedRecord> result = input
.map(new WindTurbineInputMap())
.keyBy(t -> t.turbineId)
.window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
.reduce(new AggregateReducer())
.map(new AggregateMap());

The following code snippet demonstrates how Amazon Keyspaces table name and column names are annotated on the TurbineAggregatedRecord class.

@Table(keyspace = "sensor_data", name = "turbine_aggregated_sensor_data", readConsistency = "LOCAL_QUORUM", writeConsistency = "LOCAL_QUORUM")
public class TurbineAggregatedRecord {

@Column(name = "turbineid")
@PartitionKey(0)
private String turbineid = "";

@Column(name = "reported_time")
private long reported_time = 0;

@Column(name = "max_speed")
private long max_speed = 0;

@Column(name = "min_speed")
private long min_speed = 0;

@Column(name = "avg_speed")
private long avg_speed = 0;

The following code snippet demonstrates the implementation of Apache Cassandra Connector to sink aggregated wind speed sensor data TurbineAggregatedRecord into Amazon Keyspaces table. We’re using SigV4AuthProvider with Apache Cassandra Connector. The SigV4 authentication plugin lets you use IAM credentials for users or roles when connecting to Amazon Keyspaces. Instead of requiring a user name and password, this plugin signs API requests using access keys.

CassandraSink.addSink(result)
                .setClusterBuilder(
                        new ClusterBuilder() {

                            private static final long serialVersionUID = 2793938419775311824L;

                            @Override
                            public Cluster buildCluster(Cluster.Builder builder) {
                                return builder
                                        .addContactPoint("cassandra."+ region +".amazonaws.com")
                                        .withPort(9142)
                                        .withSSL()
                                        .withAuthProvider(new SigV4AuthProvider(region))
                                        .withLoadBalancingPolicy(
                                                DCAwareRoundRobinPolicy
                                                        .builder()
                                                        .withLocalDc(region)
                                                        .build())
                                        .withQueryOptions(queryOptions)
                                        .build();
                            }
                        })
                .setMapperOptions(() -> new Mapper.Option[] {Mapper.Option.saveNullFields(true)})
                .setDefaultKeyspace("sensor_data")
                .build();

Review output in Amazon Keyspaces Table

Once Amazon Kinesis Data Analytics Apache Flink application aggregates wind turbine sensor data and persists aggregated data in Amazon Keyspaces table, we can query and review aggregated data using Amazon Keyspaces CQL editor as illustrated in the following.

select * from sensor_data.turbine_aggregated_sensor_data

BDB-2063-cql-editor BDB-2063-cql-editor-result

Clean up

To avoid incurring future charges, complete the following steps:

Empty Amazon S3 bucket created by AWS CloudFormation stack.
Delete AWS CloudFormation stack.

Conclusion

As you’ve learned in this post, you can build Amazon Kinesis Data Analytics Apache Flink application to read sensor data from Amazon Kinesis Data Streams, perform aggregations, and persist aggregated sensor data in Amazon Keyspaces using Apache Cassandra Connector. There are several use cases in IoT and Application development to move data quickly through the analytics pipeline and persist data in Amazon Keyspaces.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.

About the Author

Pratik Patel is a Sr Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively helps in keeping customer’s AWS environments operationally healthy.

Jenkins high availability and disaster recovery on AWS

2022-06-29 James Bland

Post Syndicated from James Bland original https://aws.amazon.com/blogs/devops/jenkins-high-availability-and-disaster-recovery-on-aws/

We often hear from customers about their challenges architecting Jenkins for scale and high availability (HA). Jenkins was originally built as a continuous integration (CI) system to test software before it was committed to a repository. Since its beginning, Jenkins has grown out of necessity versus grand master plan. Developers who extended Jenkins favored speed of creating functionality over performance or scalability of the entire system. This is not to say that it’s impossible to scale Jenkins, it’s only mentioned here to highlight the challenges and technical debt that has accumulated because of the prioritization of features versus developing towards a specific architecture. In this post, we discuss these challenges and our proposed solution.

Challenges with Jenkins at scale and HA

Business and customer demand are forcing organizations to increase the speed and agility at which they release features and functionality. As organizations make this transition, the usage of continuous integration and continuous delivery (CI/CD) increases, which drives the need to scale Jenkins. Overlay this with an organization that commits hundreds of changes per day and works around the clock, with developers dispersed globally, and you end up with an operational situation where there is no room for downtime. To mitigate the risk of impacting an organization’s ability to release when they need it, developers require a system that not only scales but is also highly available.

The ability to scale Jenkins and provide HA comes down to two problems. One is the ability to scale compute to handle additional jobs, and the second is storage. To scale compute, we typically do it in one of two ways, horizontally or vertically. Horizontally means we scale Jenkins to add additional compute nodes. Scaling vertically means we scale Jenkins by adding more resources to the compute node.

Let’s start with the storage problem. Jenkins is designed around the local file system. Anyone who has spent time around Jenkins is aware that logs, cloned repos, plugins, and build artifacts are stored into JENKINS_HOME. Local file systems, while good for single-server designs, tend to be a challenge when HA comes into the picture. In on-premises designs, administrators have often used Network File System (NFS) and Storage Area Networks (SAN) to achieve some scale and resiliency. This type of design comes with a trade-off of performance and doesn’t provide the true HA and inherent disaster recovery (DR) required to meet the demands of the business.

Because of the local file system constraint, there are two native families of storage available in AWS: Amazon Elastic Block Store (Amazon EBS) and Amazon Elastic File System (Amazon EFS). Amazon EBS is great for a single-server design in a single Availability Zone. The challenge is trying to scale a single-server design to support HA. Because of the requirement to assign an EBS volume to a specific Availability Zone, you can’t automatically transition the EBS volume to another Availability Zone and attach it to a Jenkins instance. If you don’t mind having an impact on Recovery Time Objective (RTO) and Recovery Point Objective (RPO), a solution using Amazon EBS snapshots copied to additional Availability Zones might work. Although EBS snapshot copy is possible, it’s not a recommended solution because it doesn’t scale and has complexities in building and maintaining this type of solution.

Amazon EFS as an alternative has worked well for customers that don’t have high usage patterns of Jenkins. All Jenkins instances within the Region can access the Amazon EFS file system and data durably stored in multiple Availability Zones. If a single Availability Zone experiences an outage, the Jenkins file system is still accessible from other Availability Zones providing HA for the storage layer. This solution is not recommended for high-usage systems due to the way that Jenkins reads and writes data. Jenkins’s access pattern is skewed towards writing data such as logs, cloned repos, and building artifacts versus reading data. Amazon EFS, on the other hand, is designed for workloads that read more than they write. On high-usage workloads, customers have experienced Jenkins build slowness and Jenkins page load latency. This is why Amazon EFS isn’t recommended for high-usage Jenkins systems.

Solution for Jenkins at scale and HA

Solving the compute problem is relatively straightforward by using Amazon Elastic Kubernetes Service (Amazon EKS). In the context of Jenkins, an organization would run Jenkins in an Amazon EKS cluster that spans multiple Availability Zones, as shown in the following diagram.

Diagram showing Jenkins deployment in Amazon EKS with three availability zones inside a VPC

Figure 1 –Jenkins deployment in Amazon EKS with multiple availability zones.

Jenkins Controller and Agent would run in an Availability Zone as a Kubernetes pod. Amazon EKS is designed around Desired State Configuration (DSC), which means that it continuously make sure that the running environment matches the configuration that has been applied to Amazon EKS. In practice, when Amazon EKS is told that you want a single pod of Jenkins running, it monitors and makes sure that pod is always running. If an Availability Zone is unavailable, Amazon EKS launches a new node in another Availability Zone and deploys all pods to meet any necessary constraints defined in Amazon EKS. With this option, we still need to have the data in other Availability Zones, which we cover later in this post.

The only option of scaling Jenkins controllers is vertical. Scaling Jenkins horizontally could lead to an undesirable state because the system wasn’t designed to have multiple instances of Jenkins attached to the same storage layer. There is no exclusive file locking mechanism to ensure data consistency. For organizations that have exhausted the limits with vertical scaling, the recommendation is to run multiple independent Jenkins controllers and separate them per team or group. Vertical scaling of Jenkins is simpler in Amazon EKS. Node sizes and container memory are controlled by configuration. Increasing memory size is as simple as changing a container’s memory setting. Due to the ease of changing configuration, it’s best to start with a lower memory setting, monitor performance, and increase as necessary. You want to find a good balance between price and performance.

For Jenkins agents, there are many options to scale the compute. In the context of scale and HA, the best options are to use AWS CodeBuild, AWS Fargate for Amazon EKS, or Amazon EKS managed node groups. With CodeBuild, you don’t need to provision, manage, or scale your build servers. CodeBuild scales continuously and processes multiple builds concurrently. You can use the Jenkins plugin for CodeBuild to integrate CodeBuild with Jenkins. Fargate is a good option but has some challenges if you’re trying to build container images within a container due to permissions necessary that aren’t exposed in Fargate. For additional information on how to overcome this challenge with Jenkins, refer to How to build container images with Amazon EKS on Fargate.

Now let’s look at the storage layer and see how LINBIT is helping organizations solve this problem with LINSTOR. LINBIT’s LINSTOR is an open-source management tool designed to manage block storage devices. Its primary use case is to provide Linux block storage for Kubernetes and other public and private cloud platforms. LINBIT also provides enterprise subscription for LINSTOR, which include technical support with SLA.

The following diagram illustrates a LINSTOR storage solution running on Amazon EKS using multiple Availability Zones and Amazon Simple Storage Service (Amazon S3) for snapshots.

Diagram showing LINSTOR storage solution running on Amazon EKS across three availability zone with snapshot stored in Amazon S3.

Figure 2. LINSTOR storage solution running on Amazon EKS using multiple availability zones and S3 for snapshot.

LINSTOR is composed of a control plane and a data plane. The control plane consists of a set of containers deployed into Amazon EKS and is responsible for managing the data plane. The data plane consists of a collection of open-source block storage software, most importantly LINBIT’s Distributed Replicated Storage System (DRBD) software. DRBD is responsible for provisioning and synchronously replicating storage between Amazon EKS worker instances in different Availability Zones.

LINSTOR is deployed via Helm into Amazon EKS, and the LINSTOR cluster is initialized by the LINSTOR Operator. Once deployed, LINSTOR volumes and volume snapshots are managed via Kubernetes Storage Classes and Snapshot Classes in a Kubernetes native fashion. LINSTOR volumes are backed by LINSTOR objects known as storage pools, which are composed of one or more EBS volumes attached to each Amazon EKS worker instance.

LINSTOR volumes layer DRBD on top of the worker’s attached EBS volume to enable synchronous replication between peers in the Amazon EKS cluster. This ensures that you have an identical copy of your persistent volume on the EBS volumes in each Availability Zone. In the event of an Availability Zone outage or planned migration, Amazon EKS moves the Jenkins deployment to another Availability Zone where the persistent volume copy is available. In terms of scaling, LINBIT DRDB supports up to 32 replicas per volume, with a maximum size of 1 PiB per volume. LINSTOR node itself can scale beyond hundreds of nodes, as shown in this case study.

LINSTOR also provides an HA Controller component in its control plane to speed up failover times during outages. LINSTOR’s HA Controller looks for pods with a specific label, and if LINSTOR’s persistent volumes replication network becomes interrupted (like during an Availability Zone outage), LINSTOR reschedules the pod sooner than the default Kubernetes pod-eviction-timeout.

LINBIT provides a detailed full installation for Jenkins HA in AWS. A sample of LINSTOR’s helm values supporting these features is as follows:

operator:
  satelliteSet:
    storagePools:
      lvmThinPools:
      - name: lvm-thin
        thinVolume: thinpool
        volumeGroup: ""
        devicePaths:
        - /dev/nvme1n1
    kernelModuleInjectionMode: Compile
stork:
  enabled: false
csi:
  enableTopology: true
etcd:
  replicas: 3
haController:
  replicas: 3

After LINSTOR is deployed, you create a Kubernetes StorageClass supporting persistent volumes with three replicas using the following example:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "linstor-csi-lvm-thin-r3"
provisioner: linstor.csi.linbit.com
parameters:
  allowRemoteVolumeAccess: "false"
  autoPlace: "3"
  storagePool: "lvm-thin"
  DrbdOptions/Disk/disk-flushes: "no"
  DrbdOptions/Disk/md-flushes: "no"
  DrbdOptions/Net/max-buffers: "10000"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

Finally, Jenkins helm charts are deployed into Amazon EKS with the following Helm values to request a PV from the LINSTOR StorageClass:

persistence:
  storageClass: linstor-csi-lvm-thin-r3
  size: "200Gi"
controller:
  serviceType: LoadBalancer
  podLabels:
    linstor.csi.linbit.com/on-storage-lost: remove

To protect against entire AWS Region outages and provide disaster recovery, LINSTOR takes volume snapshots and replicates it cross-Region using Amazon S3. LINSTOR requires read and write access to the target S3 bucket using AWS credentials provided as Kubernetes secrets:

kind: Secret
apiVersion: v1
metadata:
  name: linstor-csi-s3-access
  namespace: default
type: linstor.csi.linbit.com/s3-credentials.v1
immutable: true
stringData:
  access-key: REDACTED
  secret-key: REDACTED

The target S3 bucket is referenced as a snapshot shipping target using a LINSTOR S3 VolumeSnapshotClass. The following example shows a VolumeSnapshotClass referencing the S3 bucket’s secret and additional configuration for the target S3 bucket:

kind: VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
metadata:
  name: linstor-csi-snapshot-class-s3
driver: linstor.csi.linbit.com
deletionPolicy: Delete
parameters:
  snap.linstor.csi.linbit.com/type: S3
  snap.linstor.csi.linbit.com/remote-name: s3-us-west-2
  snap.linstor.csi.linbit.com/allow-incremental: "false"
  snap.linstor.csi.linbit.com/s3-bucket: name-of-bucket-123
  snap.linstor.csi.linbit.com/s3-endpoint: http://s3.us-west-2.amazonaws.com
  snap.linstor.csi.linbit.com/s3-signing-region: us-west-2
  snap.linstor.csi.linbit.com/s3-use-path-style: "false"
  # Secret to store access credentials
  csi.storage.k8s.io/snapshotter-secret-name: linstor-csi-s3-access
  csi.storage.k8s.io/snapshotter-secret-namespace: default

Jenkins deployment persistent volume claim (PVC) is stored as a snapshot in Amazon S3 by using a standard Kubernetes volumeSnapshot definition with LINSTOR’s snapshot class for Amazon S3:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: jenkins-dr-snapshot-0
spec:
  volumeSnapshotClassName: linstor-csi-snapshot-class-s3
  source:
    persistentVolumeClaimName: <jenkins-pvc-name>

Conclusion

In this post, we explained the challenges to scale Jenkins for HA and DR. We also reviewed Jenkins storage architecture with Amazon EBS and Amazon EFS and where to apply these. We demonstrated how you can use Amazon EKS to scale Jenkins compute for HA and how AWS partner solutions such as LINBIT LINSTOR can help scale Jenkins storage for HA and DR. Combining both solutions can help organizations maintain their ability to deploy software with speed and agility. We hope you found this post useful as you think through building your CI/CD infrastructure in AWS. To learn more about running Jenkins in Amazon EKS, check out Orchestrate Jenkins Workloads using Dynamic Pod Autoscaling with Amazon EKS. To find out more information about LINBIT’s LINSTOR, check the Jenkins technical guide.

Authors:

Field Notes: Integrating Active Directory Federation Service with AWS Single Sign-On

2022-06-25 Shirin Bano

Post Syndicated from Shirin Bano original https://aws.amazon.com/blogs/architecture/field-notes-integrating-active-directory-federation-service-with-aws-single-sign-on/

Enterprises use Active Directory Federation Services (AD FS) with single sign-on, to solve operational and security challenges by allowing the usage of a single set of credentials for multiple applications. This improves the user experience and helps manage access to the applications in a centralized way.

AWS offers a native cloud-based single sign-on solution called AWS Single Sign-On (AWS SSO). This service helps centrally manage SSO access and user permissions to all the AWS accounts and cloud applications. AWS SSO supports identity federation with SAML 2.0, allowing integration with AD FS solutions. This helps enterprises migrate to AWS, who have a hybrid environment with on-premises AD FS and need access to AWS accounts and cloud applications. Users can sign in to the AWS SSO portal with their corporate credentials thus reducing the admin overhead of maintaining separate credentials on AWS SSO.

Note: you can skip AD FS and connect your Active Directory to AWS SSO directly, instead. This gives you a simpler integration and with AD FS, enables you to use WebAuthn and TOTP MFA, and gives you a free and easy SAML IdP for apps. However, if you have specific constraints that require using AD FS, this blog will help you configure that.

This section explains the authentication flow with AD FS and AWS SSO integration. You can use Identity Provider (AD FS) initiated or Service Provider (AWS SSO) initiated authentication methods.

Following are the steps involved for both Identity Provider (IdP) and Service Provider (SP) initiated authentication methods:

1. IdP Initiated Authentication Flow

Authentication Flow:

You access the SSO user-portal URL. The authentication flow depends on how you initiate the login request. There are 2 methods in which you can access the SSO user-portal.

IdP (AD FS) Initiated Authentication Method

1.a. This method is followed when users access the AD FS SSO user-portal URL. Some organizations prefer this method when they have a federation system built into their on-premises network and they start using AWS Services. The AWS SSO and AD FS integration allows them to continue using the AD FS user-portal URL, and to login even after they move to AWS.

The following diagram outlines the architecture for the IdP (AD FS) Initiated Authentication Method.

AD FS Reference Architecture

2. SP Initiated Authentication Flow

The following diagram outlines the architecture for an SP Initiated Authentication flow.

SP Initiated Authentication Flow

SP (AWS SSO) Initiated Authentication Method

This method is followed when users access the AWS SSO user-portal URL, for example, https://d-12345c789.awsapps.com/start.
Once the request arrives at the AWS SSO endpoint, it is redirected to the AD FS user-portal URL.
The user then goes to the AD FS user-portal URL, for example, https://acmecorp.com/adfs/ls after which the traffic flow is similar to the IdP Initiated Authentication method.
You are asked to enter the username and password after which it is authenticated against the Active Directory.
You receive a SAML assertion, as an authentication response, from AD FS. The assertion identifies you and includes attributes about you as the user.
You are redirected to the AWS SSO endpoint and it posts the SAML Assertion.
AWS SSO endpoint calls the AssumeRoleWithSAML API to the STS service for temporary security credentials on your behalf. This creates a console sign-in URL that uses those credentials.
AWS sends the sign-in URL back to you as a redirect. You are then re-directed to the AWS SSO Application page, where you can choose the account to log into or the cloud/custom application to use.

Process to Integrate AD FS with AWS SSO

In this section, we show the configurations needed to establish a trust between AD FS and AWS SSO. This allows you to log into AWS accounts using the credentials configured in AD FS.

Step 1: Build SAML Trust Relationship between AD FS and AWS SSO

Get AWS SSO SAML metadata information.
Log into the AWS account where you have configured AWS SSO. On the AWS SSO console, select Dashboard and then Choose your identity source.
On the settings page, select Change, next to the Identity source.
Change the identity source and select External identity provider.
Under Service provider metadata, select show individual metadata information.
Make a note of AWS SSO Sign-in URL, AWS SSO ACS URL, and AWS SSO issuer URL, as these will be used to configure AWS SSO as the relying party in the AD FS settings.

Figure 1 – Service Provider metadata

Add AWS SSO as a Relying Party in AD FS

Go to AD FS Management from the Tools menu in the Server Manager.
Select Add Relying Party Trust.
For Add Relying Party Trust Wizard, choose Claims aware and select Start.
For Select Data Source, select Enter data about the relying party manually.
For Specify Display Name add a user-friendly name for example – AWS SSO.
For Configure URL, select the option Enable support for the SAML 2.0 WebSSO protocol.
Enter the value for AWS SSO ACS URL that you got in the previous step (Figure-1).

Figure 2 – Add AWS SSO as a Relying Party in AD FS

8. For Configure Identifiers, add the AWS SSO Issuer URL (Figure-1), in the Relying party trust identifiers box and select Add.

9. Leave the rest of the configuration as default and click Next until the relying party trust is successfully added.

Figure 3 – Configure Identifiers

Add Claim Issuance Policy

Select the Relying Party Trust you created in the previous step and go to Edit Claim Issuance Policy.
In the Edit Claim Issuance Policy for AWS SSO dialog box, select Add rule.
In the Add Transform Claim Rule Wizard from the drop-down menu for Claim rule template, select Transform an incoming claim.
Enter a name for the claim rule, for this example – Rule for SSO.
Select UPN for Incoming claim type, Name ID for Outgoing claim type and Email for Outgoing name ID format.

Figure 4 – Transform Claim Rule

Note: The rule language for the above rule is:

c:[Type == "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/upn"]

=> issue(Type = "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/nameidentifier", Issuer = c.Issuer, OriginalIssuer = c.OriginalIssuer, Value = c.Value, ValueType = c.ValueType, Properties["http://schemas.xmlsoap.org/ws/2005/05/identity/claimproperties/format"] = "urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress");

Get AD FS metadata from the Windows machine

1. Enter this meta-data document endpoint URL, https://acmecorp.com/federationmetadata/2007-06/federationmetadata.xml, in your web browser, replacing acmecorp.com with your domain used for AD DS.

2. Download the federationmetadata.xml file on your local machine as it will be needed for the AWS SSO configuration.

Upload AD FS metadata to AWS SSO

From the AWS SSO console, select Dashboard and go to Choose your identity source.
On the settings page, select Change next to the Identity source.
Change the identity source and select External identity provider.
Under Identity provider metadata, Browse and upload the AD FS metadata.

Figure 5 – Upload AD FS metadata to AWS SSO

Step 2: Provision Users in AWS SSO

You must provision users in AWS SSO, to make it aware of the users in your IdP. There are 2 ways of provisioning the users in AWS SSO:

Automatic Provisioning

With SAML, we do not have a way to query the IdP to learn about the users and groups. However, AWS SSO support System for Cross-domain Identity Management (SCIM) v2.0 standard. With SCIM you can keep the identities in AWS SSO in sync with the identities from your IdP which support SCIM (like Azure AD). Refer to the guide on Automatic Provisioning for more information.

Manual Provisioning

Some IdPs do not support SCIM. In that case, you will need to manually provision the users in AWS SSO. The username in AWS SSO should be identical to the username configured in your IdP. In this setup, we are using the email address as the username. Adding users manually can be tedious and is prone to errors. You can implement this solution to programmatically create users and groups into AWS SSO from a CSV file with user and group information.

For this demonstration, we show how to manually provision the user in AWS SSO. You can also go with Automatic Provisioning, if your IdP supports it.

Manually Provision user in AWS SSO

Add User from the Users section in AWS SSO console
For Username, enter the email address of the user that was created in Windows AD

Note: Since the Outgoing NameId format is set as email address (Figure 3), the username should match the email address of the user configured in Windows AD. Make sure that the values entered for Username and Email address exactly match the values in AD DS, as the credentials are verified against the values in AD DS.

Figure 6 – Edit user details

Next, we show how to create a new Permission Set and how a user is assigned to an AWS account. If you already have the permission set configured and users assigned to accounts, skip to Step-4 to verify your settings.

Step 3: Manage Access Permissions for the User

This step defines the permission boundaries for the user provisioned in AWS SSO that allows them to access AWS Accounts.

AWS SSO is integrated with AWS Organizations and users have the capability to use their IdP credentials to log into the accounts in the Organization. You can access the primary (master) account as well as the member accounts. Permission sets define the level of access for the users and groups for the AWS accounts. Refer to this Permission Sets document for more details.

In this example, we create a custom permission set for Read Only access to CloudWatch Logs for the log archive account in the organization.

We have not covered how to manage access to your custom application with AWS SSO. For more details on this, review our documentation on Manage SSO to your applications.

Create a Custom Permission Set

On the AWS SSO Console, choose AWS Accounts and then select Permission Sets. Select Create permission Set.
Select Create a custom permission set on the Create new permission set page and select Next.
Enter Name and description for the Permission Set and select Attach AWS managed policies.
Choose CloudWatchLogsReadOnlyAccess, from the list of AWS managed policies

Assign a User to AWS Accounts

This step is used to define which AWS Accounts a user can access. It also defines the Permission Set that the user can use while accessing an AWS Account.

On the AWS SSO console, select AWS Accounts and choose the AWS Organizations tab. You will see the list of accounts in the organization.
Select the account(s) for which the user should have access. You can select multiple accounts.
Choose Assign users and select the user from the list of users. You have the option of selecting multiple users or groups.
In the next step, select the permission set we created in the previous step.

Figure 7 – Assign a User to AWS Accounts

5. Select Finish.

Step 4: Verify your settings

The AD FS and AWS SSO configurations are now complete. It is now time to verify the configurations.

1. If you follow the SP initiated authentication method and entered the AWS SSO user-portal URL, it will re-direct you to the IdP URL and you will land on the same page.You should see the following login page:

Figure 8 – AD FS Login page

If you follow the SP initiated authentication method and entered the AWS SSO user-portal URL, it will re-direct you to the IdP URL and you will land on the same page.

2. After you enter the user credentials, i.e the email address and password for the user. You will be re-directed to the AWS SSO page. All the accounts and applications for which the user is provisioned for are shown on the following page. You can see the permission set(s) for the user after selecting the account

Figure 9 – AWS SSO Sign On Page

3. Select Management console to access the console of the account.

4. Go to the CloudWatch Console and then to logs to verify your access.

Conclusion

In this walk-through, we showed how you can use your corporate credentials in AD FS, to log in to your AWS account and cloud applications. This eliminated the need to maintain separate credentials on AWS, thereby giving a better user experience. We did this by establishing a trust between AD FS and AWS SSO. We described the steps on how to manually add users in AWS SSO. We also demonstrated how to create a permission set and assign a user to an account using that permission set. In addition, we provided illustrations of what you should see when accessing AWS SSO user-portal URL (SP Initiated) or the AD FS user-portal URL (IdP Initiated).

We hope this post helps you to understand how the AWS SSO integrates with Windows AD FS.

If you have any questions or feedback, please leave a comment below.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.