
Build a serverless analytics application with Amazon Redshift and Amazon API Gateway

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/build-a-serverless-analytics-application-with-amazon-redshift-and-amazon-api-gateway/

Serverless applications are a modern way to deliver analytics across business departments and engineering teams. Business teams can gain meaningful insights by simplifying their reporting through web applications and distributing it to a broader audience.

Use cases can include the following:

  • Dashboarding – A webpage consisting of tables and charts where each component can offer insights to a specific business department.
  • Reporting and analysis – An application where you can trigger large analytical queries with dynamic inputs and then view or download the results.
  • Management systems – An application that provides a holistic view of the internal company resources and systems.
  • ETL workflows – A webpage where internal company individuals can trigger specific extract, transform, and load (ETL) workloads in a user-friendly environment with dynamic inputs.
  • Data abstraction – Decouple and refactor underlying data structure and infrastructure.
  • Ease of use – An application where you want to give a large set of users controlled access to analytics without having to onboard each user onto a technical platform. Query updates can be completed in an organized manner, and maintenance has minimal overhead.

In this post, you will learn how to build a serverless analytics application using Amazon Redshift Data API and Amazon API Gateway WebSocket and REST APIs.

Amazon Redshift is fully managed by AWS, so you no longer need to worry about data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, monitoring nodes and drives to recover from failures, or backups. The Data API simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later.
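The following is a minimal sketch of this asynchronous pattern using the AWS SDK for Python (Boto3); the cluster identifier, database, secret ARN, and table name are placeholders rather than values from this solution.

import boto3

# The Data API has its own client; no JDBC/ODBC driver or persistent connection is needed.
client = boto3.client("redshift-data")

# Submit the query; the call returns immediately with a statement ID.
statement = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="dev",                            # placeholder
    SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-creds",  # placeholder
    Sql="SELECT COUNT(*) FROM sales;",         # placeholder query
)

# Later, from the same or a different process, check the status and fetch the results.
if client.describe_statement(Id=statement["Id"])["Status"] == "FINISHED":
    rows = client.get_statement_result(Id=statement["Id"])["Records"]
    print(rows)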

API Gateway is a fully managed service that makes it easy for developers to publish, maintain, monitor, and secure APIs at any scale. With API Gateway, you can create RESTful APIs and WebSocket APIs that enable real-time two-way communication applications. API Gateway supports containerized and serverless workloads, as well as web applications. API Gateway acts as a reverse proxy to many of the compute resources that AWS offers.

Event-driven model

Event-driven applications are increasingly popular among customers. Analytical reporting web applications can be implemented through an event-driven model. The applications run in response to events such as user actions and unpredictable query events. Decoupling the producer and consumer processes allows greater flexibility in application design. This design can be achieved with the Data API and API Gateway WebSocket and REST APIs.

Both REST API calls and WebSockets establish communication between the client and the backend. Due to the popularity of REST, you may wonder why WebSockets are needed and how they contribute to an event-driven design.

What are WebSockets and why do we need them?

Unidirectional communication is customary when building analytical web solutions. In traditional environments, the client initiates a REST API call to run a query on the backend and either synchronously or asynchronously waits for the query to complete. The “wait” aspect is typically implemented through polling: because the client doesn’t know when a backend process will complete, it repeatedly sends requests to the backend to check.
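As a rough illustration of that pattern, a polling loop looks something like the following sketch. It is shown here against the Data API's describe_statement call, but the same shape applies to a client repeatedly calling a status REST endpoint; the statement ID and interval are placeholders.

import time
import boto3

client = boto3.client("redshift-data")

def wait_for_query(statement_id, interval_seconds=5):
    # Naive polling loop: repeatedly ask the backend whether the query is done.
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            return status
        # Every check that comes back unfinished is an "empty" request:
        # it consumes bandwidth and compute but delivers no value to the user.
        time.sleep(interval_seconds)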

What is the problem with polling? Main challenges include the following:

  • Increased network traffic – A large number of users performing empty checks impacts your backend resources and doesn’t scale well.
  • Wasted cost – Empty requests don’t deliver any value to the business, yet you pay for the resources they consume.
  • Delayed response – Polling is scheduled at time intervals. If the query completes in between these intervals, the user only sees the results after the next check. This delay impacts the user experience and, in some cases, may result in UI deadlocks.

For more information on polling, check out From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.

WebSockets offer an alternative to REST for establishing communication between the front end and the backend. WebSockets enable you to create a full-duplex communication channel between the client and the server. In this bidirectional scenario, the client can make a request to the server and is notified when the process is complete. The connection remains open, with minimal network overhead, until the response is received.

You may wonder why REST is still needed, since you can transfer response data over WebSockets. A WebSocket is a lightweight protocol designed for real-time messaging between systems; it is not designed for handling large analytical query results, and in API Gateway each frame’s payload can only hold up to 32 KB. Therefore, the REST API performs the large data retrieval.

By using the Data API and API Gateway, you can build decoupled event-driven web applications for your data analytical needs. You can create WebSocket APIs with API Gateway and establish a connection between the client and your backend services. You can then initiate requests to perform analytical queries with the Data API. Due to the Data API’s asynchronous nature, the query completion generates an event to notify the client through the WebSocket channel. The client can decide to either retrieve the query results through a REST API call or perform other follow-up actions. The event-driven architecture enables bidirectional interoperable messages and data while keeping your system components agnostic.
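As a simplified sketch of the notification leg of this flow (not the repository's actual SendMessage function; the endpoint URL, connection ID, and message fields are assumptions), a backend Lambda function can push the completion event over the open WebSocket connection through the API Gateway Management API:

import json
import boto3

# The endpoint URL is your WebSocket API's callback URL (placeholder shown).
management_api = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/production",
)

def notify_client(connection_id, statement_id, topic_name):
    # Tell the connected browser its query finished; it can then call the REST API for results.
    message = {"topicName": topic_name, "statementId": statement_id, "status": "FINISHED"}
    management_api.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(message).encode("utf-8"),
    )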

Solution overview

In this post, we show how to create a serverless event-driven web application by querying with the Data API in the backend, establishing a bidirectional communication channel between the user and the backend with the WebSocket feature in API Gateway, and retrieving the results using its REST API feature. Instead of designing an application with long-running API calls, you can use the Data API. The Data API allows you to run SQL queries asynchronously, removing the need to hold long, persistent database connections.

The web application is protected using Amazon Cognito, which authenticates users before they can use the web app and authorizes the REST API calls made from the application.

Other relevant AWS services in this solution include AWS Lambda and Amazon EventBridge. Lambda is a serverless, event-driven compute resource that enables you to run code without provisioning or managing servers. EventBridge is a serverless event bus allowing you to build event-driven applications.

The solution creates a lightweight WebSocket connection between the browser and the backend. When a user submits a request using WebSockets to the backend, a query is submitted to the Data API. When the query is complete, the Data API sends an event notification to EventBridge. EventBridge signals the system that the data is available and notifies the client. Afterwards, a REST API call is performed to retrieve the query results for the client to view.
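The EventBridge piece of that flow can be sketched as follows; this is illustrative only, and the rule name, target ARN, and event pattern strings are assumptions rather than values taken from the sample repository (the Data API publishes a status-change event when a statement is submitted with WithEvent enabled):

import json
import boto3

events = boto3.client("events")

# Match statement status-change events emitted by the Redshift Data API (pattern values are assumed).
events.put_rule(
    Name="redshift-data-statement-complete",
    EventPattern=json.dumps({
        "source": ["aws.redshift-data"],
        "detail-type": ["Redshift Data Statement Status Change"],
    }),
    State="ENABLED",
)

# Route matching events to the Lambda function that notifies the WebSocket client.
events.put_targets(
    Rule="redshift-data-statement-complete",
    Targets=[{"Id": "SendMessage", "Arn": "arn:aws:lambda:us-east-1:111122223333:function:SendMessage"}],
)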

We have published this solution on the AWS Samples GitHub repository and will be referencing it during the rest of this post.

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with AWS CloudFormation templates run as part of the shell script with some parameter variables.

The application performs the following steps (note the corresponding numbered steps in the process flow):

  1. A web application is provisioned on AWS Amplify; the user needs to sign up first by providing their email and a password to access the site.
  2. The user verifies their credentials using a pin sent to their email. This step is mandatory for the user to then log in to the application and continue access to the other features of the application.
  3. After the user is signed up and verified, they can sign in to the application and request data through their web or mobile clients with input parameters. This initiates a WebSocket connection in API Gateway. (Flow 1, 2)
  4. The connection request is handled by a Lambda function, OnConnect, which initiates an asynchronous database query in Amazon Redshift using the Data API. The SQL query is taken from a SQL script in Amazon Simple Storage Service (Amazon S3) with dynamic input from the client. (Flow 3, 4, 6, 7)
  5. In addition, the OnConnect Lambda function stores the connection ID, statement identifier, and topic name in an Amazon DynamoDB table. The topic name is an extra parameter that can be used if users want to implement multiple reports on the same webpage; it allows the front end to map responses to the correct report (see the sketch after this list). (Flow 3, 4, 5)
  6. The Data API runs the query initiated in step 4. When the operation is complete, an event notification is sent to EventBridge. (Flow 8)
  7. EventBridge activates an event rule to redirect that event to another Lambda function, SendMessage. (Flow 9)
  8. The SendMessage function notifies the client that the SQL query is complete via API Gateway. (Flow 10, 11, 12)
  9. After the notification is received, the client performs a REST API call (GET) to fetch the results. (Flow 13, 14, 15, 16)
  10. The GetResult function is triggered, which retrieves the SQL query result and returns it to the client.
  11. The user is now able to view the results on the webpage.
  12. When clients disconnect from their browser, API Gateway automatically deletes the connection information from the DynamoDB table using the onDisconnect function. (Flow 17, 18, 19)
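A condensed sketch of the connect-and-query step (steps 4 and 5 above) is shown below. The table name, SQL, topic name, and cluster details are illustrative assumptions, not code from the repository, which loads the SQL script from Amazon S3 and takes dynamic input from the client.

import boto3

dynamodb = boto3.resource("dynamodb")
data_api = boto3.client("redshift-data")

def handler(event, context):
    # API Gateway supplies the WebSocket connection ID on $connect.
    connection_id = event["requestContext"]["connectionId"]

    # Start the query asynchronously; WithEvent=True asks the Data API to
    # publish a status-change event to EventBridge when the statement finishes.
    statement = data_api.execute_statement(
        ClusterIdentifier="my-redshift-cluster",   # placeholder
        Database="dev",                            # placeholder
        SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:redshift-creds",  # placeholder
        Sql="SELECT 1;",                           # the real solution reads the SQL script from Amazon S3
        WithEvent=True,
    )

    # Persist the mapping so SendMessage and GetResult can find this client later.
    dynamodb.Table("websocket-connections").put_item(   # placeholder table name
        Item={
            "connectionId": connection_id,
            "statementId": statement["Id"],
            "topicName": "tripReport",                  # placeholder topic name
        }
    )
    return {"statusCode": 200}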

Prerequisites

Prior to deploying your event-driven web application, ensure you have the following:

  • An Amazon Redshift cluster in your AWS environment – This is your backend data warehousing solution to run your analytical queries. For instructions to create your Amazon Redshift cluster, refer to Getting started with Amazon Redshift.
  • An S3 bucket that you have access to – The S3 bucket will be your object storage solution where you can store your SQL scripts. To create your S3 bucket, refer to Create your first S3 bucket.

Deploy CloudFormation templates

The code associated with the design is available in the following GitHub repository. You can clone the repository inside an AWS Cloud9 environment in your AWS account. The AWS Cloud9 environment comes with the AWS Command Line Interface (AWS CLI) installed, which is used to run the CloudFormation templates to set up the AWS infrastructure. Make sure that the jq utility is installed; we use it to parse the JSON output during the run of the script.

The complete architecture is set up using three CloudFormation templates:

  • cognito-setup.yaml – Creates the Amazon Cognito user pool and web app client, which are used for authentication and for protecting the REST API
  • backend-setup.yaml – Creates all the required Lambda functions and the WebSocket and REST APIs, and configures them in API Gateway
  • webapp-setup.yaml – Creates the web application hosting on Amplify to connect and communicate with the WebSocket and REST APIs

These CloudFormation templates are run using the script.sh shell script, which takes care of all the dependencies as required.

A generic template is provided for you to customize your own DDL SQL scripts as well as your own query SQL scripts. We have created sample scripts for you to follow along.

  1. Download the sample DDL script and upload it to an existing S3 bucket.
  2. Change the IAM role value to your Amazon Redshift cluster’s IAM role, which must have the AmazonS3ReadOnlyAccess permission.

For this post, we copy the New York Taxi Data 2015 dataset from a public S3 bucket.

  1. Download the sample query script and upload it to an existing S3 bucket.
  2. Upload the modified sample DDL script and the sample query script into a preexisting S3 bucket that you own, and note down the S3 URI path.

If you want to run your own customized version, modify the DDL and query script to fit your scenario.

  1. Edit the script.sh file before you run it and set the values for the following parameters:
    • RedshiftClusterEndpoint (aws_redshift_cluster_ep) – Your Amazon Redshift cluster endpoint, available on the AWS Management Console
    • DBUsername (aws_dbuser_name) – Your Amazon Redshift database user name
    • DDBTableName (aws_ddbtable_name) – The name of the DynamoDB table that will be created
    • WebsocketEndpointSSMParameterName (aws_wsep_param_name) – The parameter name that stores the WebSocket endpoint in AWS Systems Manager Parameter Store
    • RestApiEndpointSSMParameterName (aws_rapiep_param_name) – The parameter name that stores the REST API endpoint in Parameter Store
    • DDLScriptS3Path (aws_ddl_script_path) – The S3 URI to the DDL script that you uploaded
    • QueryScriptS3Path (aws_query_script_path) – The S3 URI to the query script that you uploaded
    • AWSRegion (aws_region) – The Region where the AWS infrastructure is being set up
    • CognitoPoolName (aws_user_pool_name) – The name you want to give to your Amazon Cognito user pool
    • ClientAppName (aws_client_app_name) – The name of the client app to be configured for the web app to handle user authentication

The default acceptable values are already provided as part of the downloaded code.

  1. Run the script using the following command:
./script.sh

During deployment, AWS CloudFormation creates and triggers the Lambda function SetupRedshiftLambdaFunction, which sets up an Amazon Redshift database table and populates data into the table. The following diagram illustrates this process.

Use the demo app

When the shell script is complete, you can start interacting with the demo web app:

  1. On the Amplify console, under All apps in the navigation pane, choose DemoApp.
  2. Choose Run build.

The DemoApp web application goes through the Provision, Build, and Deploy phases.

  1. When it’s complete, use the URL provided to access the web application.

The following screenshot shows the web application page. It has minimal functionality: you can sign in, sign up, or verify a user.

  1. Choose Sign Up.

  1. For Email ID, enter an email.
  2. For Password, enter a password that is at least eight characters long and has at least one uppercase letter, one lowercase letter, one number, and one special character.
  3. Choose Let’s Enroll.

The Verify your Login to Demo App page opens.

  1. Enter your email and the verification code sent to the email you specified.
  2. Choose Verify.


You’re redirected to a login page.

  1. Sign in using your credentials.

You’re redirected to the demoPage.html website.

  1. Choose Open Connection.

You now have an active WebSocket connection between your browser and your backend AWS environment.

  1. For Trip Month, specify a month (for this example, December) and choose Submit.

You have now defined the month and year you want to query your data on. After a few seconds, you can see the output delivered over the WebSocket.

You may continue using the active WebSocket connection for additional queries—just choose a different month and choose Submit again.

  1. When you’re done, choose Close Connection to close the WebSocket connection.

For exploratory purposes, while your WebSocket connection is active, you can navigate to your DynamoDB table on the DynamoDB console to view the items that are currently stored. After the WebSocket connection is closed, the items stored in DynamoDB are deleted.

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon S3 console, navigate to the S3 bucket containing the sample DDL script and query script and delete them from the bucket.
  2. On the Amazon Redshift console, navigate to your Amazon Redshift cluster and delete the data you copied over from the sample DDL script.
    1. Run truncate nyc_yellow_taxi;
    2. Run drop table nyc_yellow_taxi;
  3. On the AWS CloudFormation console, navigate to the CloudFormation stacks and choose Delete. Delete the stacks in the following order:
    1. WebappSetup
    2. BackendSetup
    3. CognitoSetup

All resources created in this solution will be deleted.

Monitoring

You can monitor your event-driven web application events, user activity, and API usage with Amazon CloudWatch and AWS CloudTrail. Most areas of this solution already have logging enabled. To view your API Gateway logs, you can turn on CloudWatch Logs. Lambda comes with default logging and monitoring, which can be accessed through CloudWatch.

Security

You can secure access to the application using Amazon Cognito, which is a developer-centric and cost-effective customer authentication, authorization, and user management solution. It provides both identity store and federation options that can scale easily. Amazon Cognito supports logins with social identity providers and SAML or OIDC-based identity providers, and supports various compliance standards. It operates on open identity standards (OAuth2.0, SAML 2.0, and OpenID Connect). You can also integrate it with API Gateway to authenticate and authorize the REST API calls either using the Amazon Cognito client app or a Lambda function.
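As a rough sketch of how a client might authenticate and then call the protected REST API (the client ID, credentials, endpoint, and query parameter are placeholders, and the exact auth flow depends on how your app client is configured):

import boto3
import requests

cognito = boto3.client("cognito-idp")

# Exchange user credentials for tokens (USER_PASSWORD_AUTH must be enabled on the app client).
auth = cognito.initiate_auth(
    ClientId="example-app-client-id",        # placeholder
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "user@example.com", "PASSWORD": "example-password"},
)
id_token = auth["AuthenticationResult"]["IdToken"]

# Call the protected REST endpoint with the token in the Authorization header,
# where a Cognito authorizer on API Gateway validates it.
response = requests.get(
    "https://example.execute-api.us-east-1.amazonaws.com/prod/results",  # placeholder
    headers={"Authorization": id_token},
    params={"statementId": "example-statement-id"},                      # placeholder
)
print(response.json())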

Considerations

The nature of this application includes a front-end client initiating SQL queries to Amazon Redshift. An important consideration is the potential for malicious activity from the client, such as SQL injection. With the current implementation, that is not possible: the SQL queries preexist in your AWS environment and are DQL statements (they don’t alter the data or structure). However, as you develop this application to fit your business, you should evaluate these areas of risk.

AWS offers a variety of security services to help you secure your workloads and applications in the cloud, including AWS Shield, AWS Network Firewall, AWS Web Application Firewall, and more. For more information and a full list, refer to Security, Identity, and Compliance on AWS.

Cost optimization

The AWS services that the CloudFormation templates provision in this solution are all serverless. In terms of cost optimization, you only pay for what you use. This model also allows you to scale without manual intervention. Review the pricing pages for each service used in this solution to determine the associated costs.

Conclusion

In this post, we showed you how to create an event-driven application using the Amazon Redshift Data API and API Gateway WebSocket and REST APIs. The solution helps you build data analytical web applications in an event-driven architecture, decouple your application, optimize long-running database queries processes, and avoid unnecessary polling requests between the client and the backend.

You also used serverless technologies: API Gateway, Lambda, DynamoDB, and EventBridge. You didn’t have to manage or provision any servers throughout this process.

This event-driven, serverless architecture offers greater extensibility and simplicity, making it easier to maintain and release new features. Adding new components or third-party products is also simplified.

With the instructions in this post and the generic CloudFormation templates we provided, you can customize your own event-driven application tailored to your business. For feedback or contributions, we welcome you to contact us through the AWS Samples GitHub Repository by creating an issue.


About the Authors

David Zhang is an AWS Data Architect in Global Financial Services. He specializes in designing and implementing serverless analytics infrastructure, data management, ETL, and big data systems. He helps customers modernize their data platforms on AWS. David is also an active speaker and contributor to AWS conferences, technical content, and open-source initiatives. During his free time, he enjoys playing volleyball, tennis, and weightlifting. Feel free to connect with him on LinkedIn.

Manash Deb is a Software Development Manager in the AWS Directory Service team. With over 18 years of software dev experience, his passion is designing and delivering highly scalable, secure, zero-maintenance applications in the AWS identity and data analytics space. He loves mentoring and coaching others and to act as a catalyst and force multiplier, leading highly motivated engineering teams, and building large-scale distributed systems.

Pavan Kumar Vadupu Lakshman Manikya is an AWS Solutions Architect who helps customers design robust, scalable solutions across multiple industries. With a background in enterprise architecture and software development, Pavan has contributed in creating solutions to handle API security, API management, microservices, and geospatial information system use cases for his customers. He is passionate about learning new technologies and solving, automating, and simplifying customer problems using these solutions.

Managing Dev Environments with Amazon CodeCatalyst

Post Syndicated from Ryan Bachman original https://aws.amazon.com/blogs/devops/managing-dev-environments-with-amazon-codecatalyst/

An Amazon CodeCatalyst Dev Environment is a cloud-based development environment that you can use in CodeCatalyst to quickly work on the code stored in the source repositories of your project. The project tools and application libraries included in your Dev Environment are defined by a devfile in the source repository of your project.

Introduction

In the previous CodeCatalyst post, Team Collaboration with Amazon CodeCatalyst, I focused on CodeCatalyst’s collaboration capabilities and how they related to The Unicorn Project’s main protagonist. At the beginning of Chapter 2, Maxine is struggling to configure her development environment. She is two days into her new job and still cannot build the application code. She has identified over 100 dependencies she is missing. The documentation is out of date and nobody seems to know where the dependencies are stored. I can sympathize with Maxine. In this post, I will focus on managing development environments to show how CodeCatalyst removes the burden of managing workload-specific configurations and produces reliable on-demand development environments.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Have an AWS Builder ID for signing in to CodeCatalyst.

Belong to a space and have the space administrator role assigned to you in that space. For more information, see Creating a space in CodeCatalyst, Managing members of your space, and Space administrator role.

Have an AWS account associated with your space and have the IAM role in that account. For more information about the role and role policy, see Creating a CodeCatalyst service role.

Walkthrough

As with the previous posts in our CodeCatalyst series, I am going to use the Modern Three-tier Web Application blueprint.  Blueprints provide sample code and CI/CD workflows to help make getting started easier across different combinations of programming languages and architectures. To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the blueprint.

One of the most difficult aspects of my time spent as a developer was finding ways to quickly contribute to a new project. Whenever I found myself working on a new project, getting to the point where I could meaningfully contribute to the project’s code base was always more difficult than writing the actual code. A major contributor to this inefficiency was the lack of a process for managing my local development environment. I will be exploring how CodeCatalyst can help solve this challenge. For this walkthrough, I want to add a new test that will allow local testing of Amazon DynamoDB. To achieve this, I will use a CodeCatalyst Dev Environment.

CodeCatalyst Dev Environments are managed cloud-based development environments that you can use to access and modify code stored in a source repository. You can launch a project-specific dev environment that automates checkout of your project’s repo, or you can launch an empty environment to use for accessing third-party source providers. You can learn more about CodeCatalyst Dev Environments in the CodeCatalyst User Guide.

CodeCatalyst user interface showing Create Dev Environment

Figure 1. Creating a new Dev Environment

To begin, I navigate to the Dev Environments page under the Code section of the navigation menu. I then choose Create Dev Environment to launch my environment. For this post, I am using the AWS Cloud9 IDE, but you can follow along with the IDE you are most comfortable using. On the next screen, I select Work in New Branch, assign local_testing as the new branch name, and branch from main. I leave the remaining default options and choose Create.

Create Dev Environment user interface with work in a new branch selected

Figure 2. Dev Environment Create Options

After waiting less than a minute, my IDE is ready in a new tab and I am ready to begin work. The first thing I see in my dev environment is an information window asking me if I want to navigate to the Dev Environment Settings. Because I need to enable local testing of DynamoDB, not only for myself but for other developers that will collaborate on this project, I need to update the project’s devfile. I choose to navigate to the settings tab because I know it contains information on the project’s devfile and lets me open the file for editing.

AWS Toolkit prompting to Open Dev Environment Settings.

Figure 3. Toolkit Welcome Banner

Devfiles allow you to model a Dev Environment’s configuration and dependencies so that you can reproduce consistent Dev Environments and reduce the manual effort of setting up future environments. The tools and application libraries included in your Dev Environment are defined by the devfile in the source repository of your project. Since this project was created from a blueprint, one is provided. For blank projects, a default CodeCatalyst devfile is created when you first launch an environment. To learn more about devfiles, see https://devfile.io.

In the settings tab, I find a link to the devfile that is configured. When I click the edit button, a new file tab opens and I can make changes. I first add an env section to the container that hosts the dev environment. By adding an environment variable and value, any new dev environment created from this project’s repository will include that value. Next, I add a second container to the dev environment that will run DynamoDB locally by adding a new container component. I use Amazon’s verified DynamoDB Docker image for my environment. Attaching additional images allows you to extend the dev environment and include tools or services that can be made available locally. My updates are highlighted in the green sections below.

Devfile.yaml with environment variable and DynamoDB container added

Figure 4. Example Devfile

I save my changes and navigate back to the Dev Environment Settings tab. I notice that my changes were automatically detected, and I am prompted to restart my development environment for the changes to take effect. Modifications to the devfile require a restart. You can restart a dev environment using the toolkit or from the CodeCatalyst UI.

AWS Toolkit prompt asking to restart the dev environment

Figure 5. Dev Environment Settings

After waiting a few seconds for my dev environment to restart, I am ready to write my test. I use the IDE’s file explorer, expand the repo’s ./tests/unit folder, and create a new file named test_dynamodb.py. Using the IS_LOCAL environment variable I configured in the devfile, I can include a conditional in my test that sets the endpoint that Amazon’s Python SDK (Boto3) will use to connect to the DynamoDB service. This way, I can run tests locally before pushing my changes and still have tests complete successfully in my project’s workflow. My full test file is included below.

Python unit test with local code added

Figure 6. Dynamodb test file
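A sketch of what such a test might look like follows; the table name, key schema, and local endpoint port are hypothetical, and the actual file in the repository may differ:

import os
import boto3

def test_put_and_get_item():
    # When IS_LOCAL is set (by the devfile), point Boto3 at the DynamoDB container
    # running inside the dev environment instead of the AWS service endpoint.
    endpoint_url = "http://localhost:8000" if os.environ.get("IS_LOCAL") else None
    dynamodb = boto3.resource("dynamodb", endpoint_url=endpoint_url)

    table = dynamodb.create_table(
        TableName="mysfits-test",                                     # hypothetical table name
        KeySchema=[{"AttributeName": "id", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "id", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    table.wait_until_exists()

    table.put_item(Item={"id": "1", "name": "example"})
    assert table.get_item(Key={"id": "1"})["Item"]["name"] == "example"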

Now that I have completed my changes to the dev environment using the devfile and added a test, I am ready to run my test locally to verify.  I will use pytest to ensure the tests are passing before pushing any changes.  From the repo’s root folder, I run the command pip install -r requirements-dev.txt.  Once my dependencies are installed, I then issue the command pytest -k unit.  All tests pass as I expect.

Result of the pytest shown at the command line

Figure 7. Pytest test results

Rather than manually installing my development dependencies in each environment, I could also use the devfile to include commands and automate the execution of those commands during the dev environment lifecycle events.  You can refer to the links for commands and events for more information.

Finally, I am ready to push my changes back to my CodeCatalyst source repository. I use the git extension of Cloud9 to review my changes. After confirming my changes are what I expect, I use the git extension to stage, commit, and push the new test file and the modified devfile so other collaborators can adopt the improvements I made.

Figure 8.  Changes reviewed in CodeCatalyst Cloud9 git extension.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that the AWS CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Conclusion

In this post, you learned how CodeCatalyst provides configurable on-demand dev environments. You also learned how devfiles help you define a consistent experience for developing within a CodeCatalyst project. Please follow our DevOps blog channel as I continue to explore how CodeCatalyst solves Maxine’s and other builders’ challenges.

About the author:

Ryan Bachman

Ryan Bachman is a Sr. Specialist Solutions Architect at AWS, and specializes in working with customers to improve their DevOps practices. Ryan has over 20 years of professional experience as a technologist, and has held roles in many different domains to include development, networking architecture, and technical product management. He is passionate about automation and helping customers increase software development productivity.

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-2-progressive-delivery-on-amazon-eks-with-flagger-and-gloo-edge-ingress-controller/

In the last post, OfferUp modernized its DevOps platform with Amazon EKS and Flagger to accelerate time to market, we talked about hypergrowth and the technical challenges encountered by OfferUp in its existing DevOps platform. As a reminder, we presented how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger to gain developer’s velocity, automate faster deployment, and achieve lower cost of ownership.

In this post, we discuss the technical steps to build a DevOps platform that enables progressive deployment of microservices on Amazon EKS. Progressive delivery exposes a new version of the software incrementally to ingress traffic and continuously measures the success rate of the metrics before allowing all of the new traffic to flow to the newer version of the software. Flagger is a graduated Cloud Native Computing Foundation (CNCF) project that enables progressive canary delivery, along with blue/green and A/B testing, while measuring metrics like HTTP/gRPC request success rate and latency. Flagger shifts and routes traffic between app versions using a service mesh or an Ingress controller.

We leverage the Gloo Edge ingress controller for traffic routing; Prometheus, Datadog, and Amazon CloudWatch for application metrics analysis; and Slack to send notifications. Flagger posts messages to Slack when a deployment has been initialized, when a new revision has been detected, and when the canary analysis fails or succeeds.

Prerequisite steps to build the modern DevOps platform

You need an AWS account and an AWS Identity and Access Management (IAM) user to build the DevOps platform. If you don’t have an AWS account with administrator access, create one now. Create an IAM user and assign it an admin role. You can build this platform in any AWS Region; however, I will use the us-west-1 Region throughout this post. You can use a laptop (Mac or Windows) or an Amazon Elastic Compute Cloud (Amazon EC2) instance as a client machine to install all of the necessary software to build the GitOps platform. For this post, I launched an Amazon EC2 instance (with the Amazon Linux 2 AMI) as the client and installed all of the prerequisite software. You need the awscli, git, eksctl, kubectl, and helm applications to build the GitOps platform. Here are the prerequisite steps:

  1. Create a named profile (eks-devops) with the config and credentials file:

aws configure --profile eks-devops

AWS Access Key ID [None]: xxxxxxxxxxxxxxxxxxxxxx

AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxx

Default region name [None]: us-west-1

Default output format [None]:

View and verify your current IAM profile:

export AWS_PROFILE=eks-devops

aws sts get-caller-identity

  1. If the Amazon EC2 instance doesn’t have git preinstalled, then install git in your Amazon EC2 instance:

sudo yum update -y

sudo yum install git -y

Check git version

git version

Git clone the repo and download all of the prerequisite software in the home directory.

git clone https://github.com/aws-samples/aws-gloo-flux.git

  1. Download all of the prerequisite software from install.sh which includes awscli, eksctl, kubectl, helm, and docker:

cd aws-gloo-flux/eks-flagger/

ls -lt

chmod 700 install.sh ecr-setup.sh

. install.sh

Check the version of the software installed:

aws --version

eksctl version

kubectl version -o json

helm version

docker --version

docker info

If the docker info shows an error like “permission denied”, then reboot the Amazon EC2 instance or re-log in to the instance again.

  1. Create an Amazon Elastic Container Repository (Amazon ECR) and push application images.

Amazon ECR is a fully managed container registry that makes it easy for developers to share and deploy container images and artifacts. The ecr-setup.sh script will create a new Amazon ECR repository and push the podinfo images (6.0.0, 6.0.1, 6.0.2, 6.1.0, 6.1.5, and 6.1.6) to Amazon ECR. Run the ecr-setup.sh script with an ECR repository name (for example, ps-flagger-repository) and a Region (for example, us-west-1):

./ecr-setup.sh <ps-flagger-repository> <us-west-1>

You’ll see output like the following (truncated).

###########################################################

Successfully created ECR repository and pushed podinfo images to ECR #

Please note down the ECR repository URI          

xxxxxx.dkr.ecr.us-west-1.amazonaws.com/ps-flagger-repository                                                   

Technical steps to build the modern DevOps platform

This post shows you how to use the Gloo Edge ingress controller and Flagger to automate canary releases for progressive deployment on an Amazon EKS cluster. Flagger requires a Kubernetes cluster v1.16 or newer and Gloo Edge ingress 1.6.0 or newer. This post provides a step-by-step approach to install the Amazon EKS cluster with a managed node group, the Gloo Edge ingress controller, and Flagger for Gloo in the Amazon EKS cluster. Once the cluster, metrics infrastructure, and Flagger are installed, we can install the sample application itself. We’ll use the standard Podinfo application used in the Flagger project and the accompanying loadtester tool. The Flagger “podinfo” backend service will be called by Gloo’s “VirtualService”, which is the root routing object for the Gloo Gateway. A virtual service describes the set of routes to match for a set of domains. We’ll automate the canary promotion, with the new image of the “podinfo” service, from version 6.0.0 to version 6.0.1. We’ll also create a scenario by injecting an error for automated canary rollback while deploying version 6.0.2.

  1. Use myeks-cluster.yaml to create your Amazon EKS cluster with a managed nodegroup. The myeks-cluster.yaml deployment file has the cluster name set to ps-eks-66, the Region set to us-west-1, availabilityZones set to [us-west-1a, us-west-1b], the Kubernetes version set to 1.24, and the nodegroup Amazon EC2 instance type set to m5.2xlarge. You can change these values if you want to build the cluster in a different Region or Availability Zone.

eksctl create cluster -f myeks-cluster.yaml

Check the Amazon EKS Cluster details:

kubectl cluster-info

kubectl version -o json

kubectl get nodes -o wide

kubectl get pods -A -o wide

Deploy the Metrics Server:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

kubectl get deployment metrics-server -n kube-system

Update the kubeconfig file to interact with your cluster:

# aws eks update-kubeconfig --name <ekscluster-name> --region <AWS_REGION>

kubectl config view

cat $HOME/.kube/config

  1. Create a namespace “gloo-system” and install Gloo with its Helm chart. Gloo Edge is an Envoy-based, Kubernetes-native ingress controller that facilitates and secures application traffic.

helm repo add gloo https://storage.googleapis.com/solo-public-helm

kubectl create ns gloo-system

helm upgrade -i gloo gloo/gloo --namespace gloo-system

  1. Install Flagger and the Prometheus add-on in the same gloo-system namespace. Flagger is a Cloud Native Computing Foundation project and part of the Flux family of GitOps tools.

helm repo add flagger https://flagger.app

helm upgrade -i flagger flagger/flagger \

--namespace gloo-system \

--set prometheus.install=true \

--set meshProvider=gloo

  1. [Optional] If you’re using Datadog as a monitoring tool, then deploy Datadog agents as a DaemonSet using the Datadog Helm chart. Replace RELEASE_NAME and DATADOG_API_KEY accordingly. If you aren’t using Datadog, then skip this step. For this post, we leverage the Prometheus open-source monitoring tool.

helm repo add datadog https://helm.datadoghq.com

helm repo update

helm install <RELEASE_NAME> \

    --set datadog.apiKey=<DATADOG_API_KEY> datadog/datadog

Integrate Amazon EKS/ K8s Cluster with the Datadog Dashboard – go to the Datadog Console and add the Kubernetes integration.

  1. [Optional] If you’re using the Slack communication tool and have admin access, Flagger can be configured to send alerts to the Slack chat platform by integrating the Slack alerting system with Flagger. If you don’t have admin access in Slack, skip this step.

helm upgrade -i flagger flagger/flagger \

--set slack.url=https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK \

--set slack.channel=general \

--set slack.user=flagger \

--set clusterName=<my-cluster>

  1. Create a namespace “apps”; the application and load testing service will be deployed into this namespace.

kubectl create ns apps

Create a deployment and a horizontal pod autoscaler for your custom application or service for which canary deployment will be done.

kubectl -n apps apply -k app

kubectl get deployment -A

kubectl get hpa -n apps

Deploy the load testing service to generate traffic during the canary analysis.

kubectl -n apps apply -k tester

kubectl get deployment -A

kubectl get svc -n apps

  1. Use apps-vs.yaml to create a Gloo virtual service definition that references a route table that will be generated by Flagger.

kubectl apply -f ./apps-vs.yaml

kubectl get vs -n apps

[Optional] If you have your own domain name, then open apps-vs.yaml in vi editor and replace podinfo.example.com with your own domain name to run the app in that domain.

  1. Use canary.yaml to create a canary custom resource. Review the service, analysis, and metrics sections of the canary.yaml file.

kubectl apply -f ./canary.yaml

After a couple of seconds, Flagger will create the canary objects. When the bootstrap finishes, Flagger will set the canary status to “Initialized”.

kubectl -n apps get canary podinfo

NAME      STATUS        WEIGHT   LASTTRANSITIONTIME

podinfo   Initialized   0        2023-xx-xxTxx:xx:xxZ

Gloo automatically creates an ELB. Once the load balancer is provisioned and health checks pass, we can find the sample application at the load balancer’s public address. Note down the ELB’s Public address:

kubectl get svc -n gloo-system --field-selector 'metadata.name==gateway-proxy'   -o=jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}{"\n"}'

Validate that your application is running; you’ll see output with version 6.0.0.

curl <load balancer’s public address> -H "Host:podinfo.example.com"

Trigger progressive deployments and monitor the status

You can trigger a canary deployment by updating the application container image from 6.0.0 to 6.0.1.

kubectl -n apps set image deployment/podinfo  podinfod=<ECR URI>:6.0.1

Flagger detects that the deployment revision changed and starts a new rollout.

kubectl -n apps describe canary/podinfo

Monitor all canaries; the promoted status condition can have one of the following statuses: Initialized, Waiting, Progressing, Promoting, Finalizing, Succeeded, or Failed.

watch kubectl get canaries --all-namespaces

curl < load balancer’s public address> -H "Host:podinfo.example.com"

Once the canary promotion is complete, validate your application. You can see that the version of the application has changed from 6.0.0 to 6.0.1.

{

  "hostname": "podinfo-primary-658c9f9695-4pqbl",

  "version": "6.0.1",

  "revision": "",

  "color": "#34577c",

  "logo": "https://raw.githubusercontent.com/stefanprodan/podinfo/gh-pages/cuddle_clap.gif",

  "message": "greetings from podinfo v6.0.1",

}

[Optional] Open podinfo application from the laptop browser

Find both of the IP addresses associated with the load balancer.

dig < load balancer’s public address >

Open the /etc/hosts file on the laptop and add both of the load balancer’s IP addresses to the hosts file.

sudo vi /etc/hosts

<Public IP address of LB Target node> podinfo.example.com

e.g.

xx.xx.xxx.xxx podinfo.example.com

xx.xx.xxx.xxx podinfo.example.com

Type “podinfo.example.com” in your browser and you’ll find the application in form similar to this:

Figure 1: Greetings from podinfo v6.0.1

Automated rollback

While the canary analysis is running, you’ll generate HTTP 500 errors and high latency to check whether Flagger pauses the rollout and rolls back the faulted version. Flagger performs an automatic rollback in the case of failure.

Introduce another canary deployment with podinfo image version 6.0.2 and monitor the status of the canary.

kubectl -n apps set image deployment/podinfo podinfod=<ECR URI>:6.0.2

Run HTTP 500 errors or a high-latency error from a separate terminal window.

Generate HTTP 500 errors:

watch curl -H 'Host:podinfo.example.com' <load balancer’s public address>/status/500

Generate high latency:

watch curl -H 'Host:podinfo.example.com' < load balancer’s public address >/delay/2

When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero, and the rollout is marked as failed.

kubectl get canaries --all-namespaces

kubectl -n apps describe canary/podinfo

Cleanup

When you’re done experimenting, you can delete all of the resources created during this series to avoid any additional charges. Let’s walk through deleting all of the resources used.

Delete Flagger resources and apps namespace
kubectl delete canary podinfo -n  apps

kubectl delete HorizontalPodAutoscaler podinfo -n apps

kubectl delete deployment podinfo -n   apps

helm -n gloo-system delete flagger

helm -n gloo-system delete gloo

kubectl delete namespace apps

Delete Amazon EKS Cluster
After you’ve finished with the cluster and nodes that you created for this tutorial, you should clean up by deleting the cluster and nodes with the following command:

eksctl delete cluster --name <cluster name> --region <region code>

Delete Amazon ECR

aws ecr delete-repository --repository-name ps-flagger-repository  --force

Conclusion

This post explained the process for setting up an Amazon EKS cluster and how to leverage Flagger for progressive deployments along with Prometheus and the Gloo Edge ingress controller. You can enhance the deployments by integrating Flagger with Slack, Datadog, and webhook notifications for progressive deployments. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities such as new Kubernetes version deployments.

We encourage you to look into modernizing your DevOps platform from monolithic architecture to microservice-based architecture with Amazon EKS, and leverage Flagger with the right Ingress controller for secured and automated service releases.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

About the authors:

Purna Sanyal

Purna Sanyal is a technology enthusiast and an architect at AWS, helping digital native customers solve their business problems with successful adoption of cloud native architecture. He provides technical thought leadership, architecture guidance, and conducts PoCs to enable customers’ digital transformation. He is also passionate about building innovative solutions around Kubernetes, database, analytics, and machine learning.

Automate data lineage on Amazon MWAA with OpenLineage

Post Syndicated from Stephen Said original https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/

In modern data architectures, datasets are combined across an organization using a variety of purpose-built services to unlock insights. As a result, data governance becomes a key component for data consumers and producers to know that their data-driven decisions are based on trusted and accurate datasets. One aspect of data governance is data lineage, which captures the flow of data as it goes through various systems and allows consumers to understand how a dataset was derived.

In order to capture data lineage consistently across various analytical services, you need a common lineage model and a robust job orchestrator that can tie together diverse data flows. One possible solution is the open-source OpenLineage project. It provides a technology-agnostic metadata model for capturing data lineage and integrates with widely used tools. For job orchestration, it integrates with Apache Airflow, which you can run on AWS conveniently through the managed service Amazon Managed Workflows for Apache Airflow (Amazon MWAA). OpenLineage provides a plugin for Apache Airflow that extracts data lineage from Directed Acyclic Graphs (DAGs).

In this post, we show how to get started with data lineage on AWS using OpenLineage. We provide a step-by-step configuration guide for the openlineage-airflow plugin on Amazon MWAA. Additionally, we share an AWS Cloud Development Kit (AWS CDK) project that deploys a pre-configured demo environment for evaluating and experiencing OpenLineage first-hand.

OpenLineage on Apache Airflow

In the following example, Airflow turns OLTP data into a star schema on Amazon Redshift Serverless.

After staging and preparing source data from Amazon Simple Storage Service (Amazon S3), fact and dimension tables are eventually created. For this, Airflow orchestrates the execution of SQL statements that create and populate tables on Redshift Serverless.

Overview on DAGs in Amazon MWAA
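As a rough sketch of the kind of DAG involved (not the actual pipeline code from this example), the openlineage-airflow plugin ships extractors for common SQL operators, so a task like the following can have its input and output tables captured automatically; the DAG ID, connection ID, and SQL are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="rs_staging_to_dm_example",          # placeholder DAG ID
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Runs against Redshift through a Postgres-style Airflow connection (placeholder conn ID);
    # the plugin's SQL extractor can derive the input and output tables from this statement.
    load_dim_users = PostgresOperator(
        task_id="load_dim_users",
        postgres_conn_id="redshift_serverless",
        sql="INSERT INTO dm.dim_users SELECT * FROM staging.users;",
    )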

The openlineage-airflow plugin collects metadata about creation of datasets and dependencies between them. This allows us to move from a jobs-centric approach of Airflow to a datasets-centric approach, improving the observability of workflows.

The following screenshot shows parts of the captured lineage for the previous example. It’s displayed in Marquez, an open-source metadata service for collection and visualization of data lineage with support for the OpenLineage standard. In Marquez, you can analyze the upstream datasets and transformations that eventually create the user dimension table on the right.

Data lineage graph in marquez

The example in this post is based on SQL and Amazon Redshift. OpenLineage also supports other transformation engines and data stores such as Apache Spark and dbt.

Solution overview

The following diagram shows the AWS setup required to capture data lineage using OpenLineage.

Solution overview

The workflow includes the following components:

  1. The openlineage-airflow plugin is configured on Airflow as a lineage backend. Metadata about the DAG runs is passed by Airflow core to the plugin, which converts it into OpenLineage format and sends it to an external metadata store. In our demo setup, we use Marquez as the metadata store.
  2. The openlineage-airflow plugin receives its configuration from environment variables. To populate these variables on Amazon MWAA, a custom Airflow plugin is used. First, the plugin reads source values from AWS Secrets Manager. Then, it creates environment variables.
  3. Secrets Manager is configured as a secrets backend. Typically, this type of configuration is stored in Airflow’s native metadata database. However, this approach has limitations. For instance, in case of multiple Airflow environments, you need to track and store credentials across multiple environments, and updating credentials requires you to update all the environments. With a secrets backend, you can centralize configuration.
  4. For demo purposes, we collect data lineage from a data pipeline, which creates a star schema in Redshift Serverless.

In the following sections, we walk you through the steps for end-to-end configuration.

Install the openlineage-airflow plugin

Specify the following dependency in the requirements.txt file of the Amazon MWAA environment. Note that the latest Airflow version currently available on Amazon MWAA is 2.4.3; for this post, use the compatible version 0.19.2 of the plugin:

openlineage-airflow==0.19.2

For more details on installing Python dependencies on Amazon MWAA, refer to Installing Python dependencies.

For Airflow < 2.3, configure the plugin’s lineage backend through the following configuration overrides on the Amazon MWAA environment and load it immediately at Airflow start by disabling lazy load of plugins:

AirflowConfigurationOptions:
    core.lazy_load_plugins: False
    lineage.backend: openlineage.lineage_backend.OpenLineageBackend

For more information on configuration overrides, refer to Configuration options overview.

Configure the Secrets Manager backend with Amazon MWAA

Using Secrets Manager as a secrets backend for Amazon MWAA is straightforward. First, provide the execution role of Amazon MWAA with read permission to Secrets Manager. You can use the following policy template as a starting point:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": "arn:aws:secretsmanager:AWS_REGION:<YOUR_ACCOUNT_ID>:secret:"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:ListSecrets",
      "Resource": ""
    }
  ]
}

Second, configure Secrets Manager as a backend in Amazon MWAA through the following configuration overrides:

AirflowConfigurationOptions:
secrets.backend: airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend
secrets.backend_kwargs: '{"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}'

For more information on configuring a secrets backend in Amazon MWAA, refer to Configuring an Apache Airflow connection using a Secrets Manager secret and Move your Apache Airflow connections and variables to AWS Secrets Manager.

Deploy a custom envvar plugin to Amazon MWAA

Apache Airflow has a built-in plugin manager through which it can be extended with custom functionality. In our case, this functionality is to populate OpenLineage-specific environment variables based on values in Secrets Manager. Natively, Amazon MWAA allows environment variables with the prefix AIRFLOW__, but the openlineage-airflow plugin expects the prefix OPENLINEAGE__.

The following Python code is used in the plugin. We assume the file is called envvar_plugin.py:

from airflow.plugins_manager import AirflowPlugin
from airflow.models import Variable
import os

# Read the Marquez URL from the Airflow variable (backed by Secrets Manager) and
# expose it as the environment variable the openlineage-airflow plugin expects.
os.environ["OPENLINEAGE_URL"] = Variable.get('OPENLINEAGE_URL', default_var='')

# Registering the class as an Airflow plugin ensures the code above runs at startup.
class EnvVarPlugin(AirflowPlugin):
  name = "env_var_plugin"

Amazon MWAA has a mechanism to install a plugin through a zip archive. You zip your code, upload the archive to an S3 bucket, and pass the URL to the file to Amazon MWAA:

zip plugins.zip envvar_plugin.py

Upload plugins.zip to an S3 bucket and configure the URL in Amazon MWAA. The following screenshot shows the configuration via the Amazon MWAA console.

Configuration of a custom plugin in Amazon MWAA

For more information on installing custom plugins on Amazon MWAA, refer to Creating a custom plugin that generates runtime environment variables.

Configure connectivity between the openlineage-airflow plugin and Marquez

As a last step, store the URL to Marquez in Secrets Manager. For this, create a secret called airflow/variables/OPENLINEAGE_URL with value <protocol>://<hostname/ip>:<port> (for example, https://marquez.mysite.com:5000).

Configuration of OPENLINEAGE_URL as secret
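You can create this secret on the console or programmatically; the following is a minimal Boto3 sketch (the Marquez URL is a placeholder):

import boto3

secretsmanager = boto3.client("secretsmanager")

# The name must sit under the variables_prefix configured for the secrets backend
# (airflow/variables), so Airflow exposes it as the variable OPENLINEAGE_URL.
secretsmanager.create_secret(
    Name="airflow/variables/OPENLINEAGE_URL",
    SecretString="https://marquez.mysite.com:5000",   # placeholder URL
)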

In case you need to spin up Marquez on AWS, you have multiple options to host, including running it on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Compute Cloud (Amazon EC2). Refer to Running Marquez on AWS or check out our infrastructure template in the next section to deploy Marquez on AWS.

Deploy with an AWS CDK-based solution template

Assuming you want to set up a demo infrastructure for all of the above in one step, you can use the following template based on the AWS CDK.

The template has the following prerequisites:

Complete the following steps to deploy the template:

  1. Clone GitHub repository and install Python dependencies. Bootstrap the AWS CDK if required.
    git clone https://github.com/aws-samples/aws-mwaa-openlineage
    cd aws-mwaa-openlineage
    python3 -m venv .env && source .env/bin/activate
    python3 -m pip install -r requirements.txt
    cdk bootstrap

  2. Update the value for the variable EXTERNAL_IP in constants.py to your outbound IP for connecting to the internet:
    # Set variable to outbound IP for connecting to the internet.
    EXTERNAL_IP = "255.255.255.255"

    This configures security groups so that you can access Marquez but block other clients. constants.py is found in the root folder of the cloned repository.

  3. Deploy the VPC_S3 stack to provision a new VPC dedicated for this solution as well as the security groups that are used by the different components:
    cdk deploy vpc-s3

    It creates a new S3 bucket and uploads the source raw data based on the TICKIT sample database. This serves as the landing area from the OLTP database. We then need to parse the metadata of these files through an AWS Glue crawler, which facilitates the native integration between Amazon Redshift and the S3 data lake.

  4. Deploy the lineage stack to create an EC2 instance that hosts Marquez:
    cdk deploy marquez

    Access the Marquez web UI through https://{ec2.public_dns_name}:3000/. This URL is also available as part of the AWS CDK outputs for the lineage stack.

  5. Deploy the Amazon Redshift stack to create a Redshift Serverless endpoint:
    cdk deploy redshift

  6. Deploy the Amazon MWAA stack to create an Amazon MWAA environment:
    cdk deploy mwaa

    You can access the Amazon MWAA UI through the URL provided in the AWS CDK output.

Test a sample data pipeline

On Amazon MWAA, you can see an example data pipeline deployed that consists of two DAGs. It builds a star schema on top of the TICKIT sample database. One DAG is responsible for loading data from the S3 data lake into an Amazon Redshift staging layer; the second DAG loads data from the staging layer to the dimensional model.

Datamodel of star schema

Open the Amazon MWAA UI through the URL obtained in the deployment steps and launch the following DAGs: rs_source_to_staging and rs_staging_to_dm. As part of the run, the lineage metadata is sent to Marquez.
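
If you want to trigger the DAGs programmatically rather than through the UI, one commonly used pattern is the Amazon MWAA CLI token endpoint. The following sketch assumes a placeholder environment name of mwaa-openlineage and should be checked against the current Amazon MWAA documentation:

import base64
import boto3
import requests  # third-party HTTP client, assumed to be available

mwaa = boto3.client("mwaa")
ENV_NAME = "mwaa-openlineage"  # placeholder: use your environment name

def trigger_dag(dag_id: str) -> str:
    # Request a short-lived CLI token and the web server hostname.
    token = mwaa.create_cli_token(Name=ENV_NAME)
    response = requests.post(
        f"https://{token['WebServerHostname']}/aws_mwaa/cli",
        headers={
            "Authorization": f"Bearer {token['CliToken']}",
            "Content-Type": "text/plain",
        },
        data=f"dags trigger {dag_id}",
    )
    response.raise_for_status()
    # The response contains base64-encoded stdout of the Airflow CLI call.
    return base64.b64decode(response.json()["stdout"]).decode("utf-8")

# In practice, wait for the staging DAG to finish before triggering the
# dimensional model DAG.
for dag in ("rs_source_to_staging", "rs_staging_to_dm"):
    print(trigger_dag(dag))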

After the DAG has been run, open the Marquez URL obtained in the deployment steps. In Marquez, you can find the lineage metadata for the computed star schema and related data assets on Amazon Redshift.

Clean up

Delete the AWS CDK stacks to avoid ongoing charges for the resources that you created. Run the following command in the aws-mwaa-openlineage project directory so that all resources are undeployed:

cdk destroy --all

Summary

In this post, we showed you how to automate data lineage with OpenLineage on Amazon MWAA. As part of this, we covered how to install and configure the openlineage-airflow plugin on Amazon MWAA. Additionally, we provided a ready-to-use infrastructure template for a complete demo environment.

We encourage you to explore what else can be achieved with OpenLineage. A job orchestrator like Apache Airflow is only one piece of a data platform and not all possible data lineage can be captured on it. We recommend exploring OpenLineage’s integration with other platforms like Apache Spark or dbt. For more information, refer to Integrations.

Additionally, we recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon MWAA and data governance on AWS.


About the Authors

Stephen Said is a Senior Solutions Architect and works with Digital Native Businesses. His areas of interest are data analytics, data platforms and cloud-native software engineering.

Vishwanatha Nayak is a Senior Solutions Architect at AWS. He works with large enterprise customers helping them design and build secure, cost-effective, and reliable modern data platforms using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.

Paul Villena is an Analytics Solutions Architect with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure-as-code, serverless technologies and coding in Python.

Manually Approving Security Changes in CDK Pipeline

Post Syndicated from original https://aws.amazon.com/blogs/devops/manually-approving-security-changes-in-cdk-pipeline/

In this post, I will show you how to add a manual approval to AWS Cloud Development Kit (CDK) Pipelines to confirm security changes before deployment. With this solution, when a developer commits a change, the CDK pipeline identifies an IAM permissions change, pauses execution, and sends a notification to a security engineer to manually approve or reject the change before it is deployed.

Introduction

In my role I talk to a lot of customers that are excited about the AWS Cloud Development Kit (CDK). One of the things they like is that L2 constructs often generate IAM and other security policies. This can save a lot of time and effort over hand coding those policies. Most customers also tell me that the policies generated by CDK are more secure than the policies they generate by hand.

However, these same customers are concerned that their security engineering team does not know what is in the policies CDK generates. In the past, these customers spent a lot of time crafting a handful of IAM policies that developers can use in their apps. These policies were well understood, but overly permissive because they were often reused across many applications.

Customers want more visibility into the policies CDK generates. Luckily, CDK provides a mechanism to approve security changes. If you are using CDK, you have probably been prompted to approve security changes when you run cdk deploy at the command line. That works great on a developer’s machine, but customers want to build the same confirmation into their continuous delivery pipeline. CDK provides a mechanism for this with the ConfirmPermissionsBroadening action. Note that ConfirmPermissionsBroadening is only supported by the AWS CodePipeline deployment engine.

Background

Before I talk about ConfirmPermissionsBroadening, let me review how CDK creates IAM policies. Consider the “Hello, CDK” application created in AWS CDK Workshop. At the end of this module, I have an AWS Lambda function and an Amazon API Gateway defined by the following CDK code.

// defines an AWS Lambda resource
const hello = new lambda.Function(this, 'HelloHandler', {
  runtime: lambda.Runtime.NODEJS_14_X,    // execution environment
  code: lambda.Code.fromAsset('lambda'),  // code loaded from "lambda" directory
  handler: 'hello.handler'                // file is "hello", function is "handler"
});

// defines an API Gateway REST API resource backed by our "hello" function.
new apigw.LambdaRestApi(this, 'Endpoint', {
  handler: hello
});

Note that I did not need to define the IAM Role or Lambda Permissions. I simply passed a reference to the Lambda function to the API Gateway (line 10 above). CDK understood what I was doing and generated the permissions for me. For example, CDK generated the following Lambda Permission, among others.

{
  "Effect": "Allow",
  "Principal": {
    "Service": "apigateway.amazonaws.com"
  },
  "Action": "lambda:InvokeFunction",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloHandler2E4FBA4D",
  "Condition": {
    "ArnLike": {
      "AWS:SourceArn": "arn:aws:execute-api:us-east-1:123456789012:9y6ioaohv0/prod/*/"
    }
  }
}

Notice that CDK generated a narrowly scoped policy that allows a specific API (line 10 above) to call a specific Lambda function (line 7 above). This policy cannot be reused elsewhere. Later in the same workshop, I created a Hit Counter Construct using a Lambda function and an Amazon DynamoDB table. Again, I associated them using a single line of CDK code.

table.grantReadWriteData(this.handler);

As in the prior example, CDK generated a narrowly scoped IAM policy. This policy allows the Lambda function to perform certain actions (lines 4-11) on a specific table (line 14 below).

{
  "Effect": "Allow",
  "Action": [
    "dynamodb:BatchGetItem",
    "dynamodb:ConditionCheckItem",
    "dynamodb:DescribeTable",
    "dynamodb:GetItem",
    "dynamodb:GetRecords",
    "dynamodb:GetShardIterator",
    "dynamodb:Query",
    "dynamodb:Scan"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:123456789012:table/HelloHitCounterHits"
  ]
}

As you can see, CDK is doing a lot of work for me. In addition, CDK is creating narrowly scoped policies for each resource, rather than sharing a broadly scoped policy in multiple places.

CDK Pipelines Permissions Checks

Now that I have reviewed how CDK generates policies, let’s discuss how I can use this in a Continuous Deployment pipeline. Specifically, I want to allow CDK to generate policies, but I want a security engineer to review any changes using a manual approval step in the pipeline. Of course, I don’t want security to be a bottleneck, so I will only require approval when security statements or traffic rules are added. The pipeline should skip the manual approval if there are no new security rules added.

Let’s continue to use CDK Workshop as an example. In the CDK Pipelines module, I used CDK to configure AWS CodePipeline to deploy the “Hello, CDK” application I discussed above. One of the last things I do in the workshop is add a validation test using a post-deployment step. Adding a permission check is similar, but I will use a pre-deployment step to ensure the permission check happens before deployment.

First, I will import ConfirmPermissionsBroadening from the pipelines package:

import {ConfirmPermissionsBroadening} from "aws-cdk-lib/pipelines";

Then, I can simply add ConfirmPermissionsBroadening to the deployStage using the addPre method as follows.

const deploy = new WorkshopPipelineStage(this, 'Deploy');
const deployStage = pipeline.addStage(deploy);

deployStage.addPre(
  new ConfirmPermissionsBroadening("PermissionCheck", {
    stage: deploy
  })
);

deployStage.addPost(
    // Post Deployment Test Code Omitted
)

Once I commit and push this change, a new manual approval step called PermissionCheck.Confirm is added to the Deploy stage of the pipeline. In the future, if I push a change that adds additional rules, the pipeline will pause here and await manual approval as shown in the screenshot below.

Figure 1. Pipeline waiting for manual review

Figure 1. Pipeline waiting for manual review

When the security engineer clicks the review button, she is presented with the following dialog. From here, she can click the URL to see a summary of the change I am requesting which was captured in the build logs. She can also choose to approve or reject the change and add comments if needed.

Figure 2. Manual review dialog with a link to the build logs

Figure 2. Manual review dialog with a link to the build logs

When the security engineer clicks the review URL, she is presented with the following summary of security changes.

Figure 3. Summary of security changes in the build logs

Figure 3. Summary of security changes in the build logs

The final feature I want to add is an email notification so the security engineer knows when there is something to approve. To accomplish this, I create a new Amazon Simple Notification Service (SNS) topic and subscription and associate it with the ConfirmPermissionsBroadening Check.

// Create an SNS topic and subscription for security approvals
const topic = new sns.Topic(this, 'SecurityApproval');
topic.addSubscription(new subscriptions.EmailSubscription('[email protected]'));

deployStage.addPre(
  new ConfirmPermissionsBroadening("PermissionCheck", {
    stage: deploy,
    notificationTopic: topic
  })
);

With the notification configured, the security engineer will receive an email when an approval is needed. She will have an opportunity to review the security change I made and assess the impact. This gives the security engineering team the visibility they want into the policies CDK is generating. In addition, the approval step is skipped if a change does not add security rules, so the security engineer does not become a bottleneck in the deployment process.

Conclusion

AWS Cloud Development Kit (CDK) automates the generation of IAM and other security policies. This can save a lot of time and effort but security engineering teams want visibility into the policies CDK generates. To address this, CDK Pipelines provides the ConfirmPermissionsBroadening action. When you add ConfirmPermissionsBroadening to your CI/CD pipeline, CDK will wait for manual approval before deploying a change that includes new security rules.

About the author:

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Build near real-time logistics dashboards using Amazon Redshift and Amazon Managed Grafana for better operational intelligence

Post Syndicated from Paul Villena original https://aws.amazon.com/blogs/big-data/build-near-real-time-logistics-dashboards-using-amazon-redshift-and-amazon-managed-grafana-for-better-operational-intelligence/

Amazon Redshift is a fully managed data warehousing service that is currently helping tens of thousands of customers manage analytics at scale. It continues to lead price-performance benchmarks, and separates compute and storage so each can be scaled independently and you only pay for what you need. It also eliminates data silos by simplifying access to your operational databases, data warehouse, and data lake with consistent security and governance policies.

With the Amazon Redshift streaming ingestion feature, it’s easier than ever to access and analyze data coming from real-time data sources. It simplifies the streaming architecture by providing native integration between Amazon Redshift and the streaming engines in AWS, which are Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming data sources like system logs, social media feeds, and IoT streams can continue to push events to the streaming engines, and Amazon Redshift simply becomes just another consumer. Before Amazon Redshift streaming was available, we had to stage the streaming data first in Amazon Simple Storage Service (Amazon S3) and then run the copy command to load it into Amazon Redshift. Eliminating the need to stage data in Amazon S3 results in faster performance and improved latency. With this feature, we can ingest hundreds of megabytes of data per second and have a latency of just a few seconds.

Another common challenge for our customers is the additional skill required when using streaming data. In Amazon Redshift streaming ingestion, only SQL is required. We use SQL to do the following:

  • Define the integration between Amazon Redshift and our streaming engines with the creation of external schema
  • Create the different streaming database objects that are actually materialized views
  • Query and analyze the streaming data
  • Generate new features that are used to predict delays using machine learning (ML)
  • Perform inferencing natively using Amazon Redshift ML

In this post, we build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. Our example is an operational intelligence dashboard for a logistics company that provides situational awareness and augmented intelligence for their operations team. From this dashboard, the team can see the current state of their consignments and their logistics fleet based on events that happened only a few seconds ago. It also shows the consignment delay predictions of an Amazon Redshift ML model that helps them proactively respond to disruptions before they even happen.

Solution overview

This solution is composed of the following components, and the provisioning of resources is automated using the AWS Cloud Development Kit (AWS CDK):

  • Multiple streaming data sources are simulated through Python code running in our serverless compute service, AWS Lambda
  • The streaming events are captured by Amazon Kinesis Data Streams, which is a highly scalable serverless streaming data service
  • We use the Amazon Redshift streaming ingestion feature to process and store the streaming data and Amazon Redshift ML to predict the likelihood of a consignment getting delayed
  • We use AWS Step Functions for serverless workflow orchestration
  • The solution includes a consumption layer built on Amazon Managed Grafana where we can visualize the insights and even generate alerts through Amazon Simple Notification Service (Amazon SNS) for our operations team

The following diagram illustrates our solution architecture.

Prerequisites

The project has the following prerequisites:

Sample deployment using the AWS CDK

The AWS CDK is an open-source project that allows you to define your cloud infrastructure using familiar programming languages. It uses high-level constructs to represent AWS components to simplify the build process. In this post, we use Python to define the cloud infrastructure due to its familiarity to many data and analytics professionals.

Clone the GitHub repository and install the Python dependencies:

git clone https://github.com/aws-samples/amazon-redshift-streaming-workshop
cd amazon-redshift-streaming-workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Next, bootstrap the AWS CDK. This sets up the resources required by the AWS CDK to deploy into the AWS account. This step is only required if you haven’t used the AWS CDK in the deployment account and Region.

cdk bootstrap

Deploy all stacks:

cdk deploy IngestionStack 
cdk deploy RedshiftStack 
cdk deploy StepFunctionStack

The entire deployment time takes 10–15 minutes.

Access streaming data using Amazon Redshift streaming ingestion

The AWS CDK deployment provisions an Amazon Redshift cluster with the appropriate default IAM role to access the Kinesis data stream. We can create an external schema to establish a connection between the Amazon Redshift cluster and the Kinesis data stream:

CREATE EXTERNAL SCHEMA ext_kinesis FROM KINESIS
IAM_ROLE default;

For instructions on how to connect to the cluster, refer to Connecting to the Redshift Cluster.

We use a materialized view to parse data in the Kinesis data stream. In this case, the whole payload is ingested as is and stored using the SUPER data type in Amazon Redshift. Data stored in streaming engines is usually in semi-structured format, and the SUPER data type provides a fast and efficient way to analyze semi-structured data within Amazon Redshift.


See the following code:

CREATE MATERIALIZED VIEW consignment_stream AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(from_varbyte(kinesis_data, 'utf-8')) AS consignment_data
FROM ext_kinesis.consignment_stream
WHERE is_utf8(kinesis_data)
AND is_valid_json(from_varbyte(kinesis_data, 'utf-8'));

Refreshing the materialized view invokes Amazon Redshift to read data directly from the Kinesis data stream and load it into the materialized view. This refresh can be done automatically by adding the AUTO REFRESH clause in the materialized view definition. However, in this example, we are orchestrating the end-to-end data pipeline using AWS Step Functions.

REFRESH MATERIALIZED VIEW consignment_stream;
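
In this solution the refresh is orchestrated by Step Functions, but you can also issue it programmatically, for example with the Amazon Redshift Data API; the following is a minimal sketch with placeholder cluster, database, and user names:

import boto3

redshift_data = boto3.client("redshift-data")

# execute_statement is asynchronous; use describe_statement to poll for completion.
redshift_data.execute_statement(
    ClusterIdentifier="redshift-cluster-1",  # placeholder
    Database="dev",                          # placeholder
    DbUser="awsuser",                        # placeholder
    Sql="REFRESH MATERIALIZED VIEW consignment_stream;",
)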

Now we can start running queries against our streaming data and unify it with other datasets like logistics fleet data. If we want to know the distribution of our consignments across different states, we can easily unpack the contents of the JSON payload using the PartiQL syntax.

SELECT cs.consignment_data.origin_state::VARCHAR,
COUNT(1) number_of_consignments,
AVG(on_the_move) running_fleet,
AVG(scheduled_maintenance + unscheduled_maintenance) under_maintenance
FROM consignment_stream cs
INNER JOIN fleet_summary fs
on TRIM(cs.consignment_data.origin_state::VARCHAR) = fs.vehicle_location
GROUP BY 1

Generate features using Amazon Redshift SQL functions

The next step is to transform and enrich the streaming data using Amazon Redshift SQL to generate additional features that will be used by Amazon Redshift ML for its predictions. We use date and time functions to identify the day of the week, and calculate the number of days between the order date and target delivery date.

We also use geospatial functions, specifically ST_DistanceSphere, to calculate the distance between origin and destination locations. The GEOMETRY data type within Amazon Redshift provides a cost-effective way to analyze geospatial data such as longitude and latitudes at scale. In this example, the addresses have already been converted to longitude and latitude. However, if you need to perform geocoding, you can integrate Amazon Location Services with Amazon Redshift using user-defined functions (UDFs). On top of geocoding, Amazon Location Service also allows you to more accurately calculate route distance between origin and destination, and even specify waypoints along the way.

We use another materialized view to persist these transformations. A materialized view provides a simple yet efficient way to create data pipelines using its incremental refresh capability. Amazon Redshift identifies the incremental changes from the last refresh and only updates the target materialized view based on these changes. In this materialized view, all our transformations are deterministic, so we expect our data to be consistent when going through a full refresh or an incremental refresh.


See the following code:

CREATE MATERIALIZED VIEW consignment_transformed AS
SELECT
consignment_data.consignmentid::INT consignment_id,
consignment_data.consignment_date::TIMESTAMP consignment_date,
consignment_data.delivery_date::TIMESTAMP delivery_date,
consignment_data.origin_state::VARCHAR origin_state,
consignment_data.destination_state::VARCHAR destination_state,
consignment_data.revenue::FLOAT revenue,
consignment_data.cost::FLOAT cost,
DATE_PART(dayofweek, consignment_data.consignment_date::TIMESTAMP)::INT day_of_week,
DATE_PART(hour, consignment_data.consignment_date::TIMESTAMP)::INT "hour",
DATEDIFF(days,
consignment_data.consignment_date::TIMESTAMP,
consignment_data.delivery_date::TIMESTAMP
)::INT days_to_deliver,
(ST_DistanceSphere(
ST_Point(consignment_data.origin_lat::FLOAT, consignment_data.origin_long::FLOAT),
ST_Point(consignment_data.destination_lat::FLOAT, consignment_data.destination_long::FLOAT)
) / 1000 --convert to km
) delivery_distance
FROM consignment_stream;

Predict delays using Amazon Redshift ML

We can use this enriched data to make predictions on the delay probability of a consignment. Amazon Redshift ML is a feature of Amazon Redshift that allows you to use the power of Amazon Redshift to build, train, and deploy ML models directly within your data warehouse.

The training of a new Amazon Redshift ML model has been initiated as part of the AWS CDK deployment using the CREATE MODEL statement. The training dataset is defined in the FROM clause, and TARGET defines which column the model is trying to predict. The FUNCTION clause defines the name of the function that is used to make predictions.

CREATE MODEL ml_delay_prediction -- already executed by CDK
FROM (SELECT * FROM ext_s3.consignment_train)
TARGET probability
FUNCTION fnc_delay_probability
IAM_ROLE default
SETTINGS (
MAX_RUNTIME 1800, --seconds
S3_BUCKET '<ingestionstack-s3bucketname>' --replace S3 bucket name
)

This simplified model is trained using historical observations, and the training process takes around 30 minutes to complete. You can check the status of the training job by running the SHOW MODEL statement:

SHOW MODEL ml_delay_prediction;

When the model is ready, we can start making predictions on new data that is streamed into Amazon Redshift. Predictions are generated using the Amazon Redshift ML function that was defined during the training process. We pass the calculated features from the transformed materialized view into this function, and the prediction results populate the delay_probability column.

This final output is persisted into the consignment_predictions table, and Step Functions is orchestrating the ongoing incremental data load into this target table. We use a table for the final output, instead of a materialized view, because ML predictions have randomness involved and it may give us non-deterministic results. Using a table gives us more control on how data is loaded.


See the following code:

CREATE TABLE consignment_predictions AS
SELECT *, fnc_delay_probability(
day_of_week, "hour", days_to_deliver, delivery_distance) delay_probability
FROM consignment_transformed;

Create an Amazon Managed Grafana dashboard

We use Amazon Managed Grafana to create a near real-time logistics dashboard. Amazon Managed Grafana is a fully managed service that makes it easy to create, configure, and share interactive dashboards and charts for monitoring your data. We can also use Grafana to set up alerts and notifications based on specific conditions or thresholds, allowing you to quickly identify and respond to issues.

The high-level steps in setting up the dashboard are as follows:

  1. Create a Grafana workspace.
  2. Set up Grafana authentication using AWS IAM Identity Center (successor to AWS Single Sign-On) or using direct SAML integration.
  3. Configure Amazon Redshift as a Grafana data source.
  4. Import the JSON file for the near real-time logistics dashboard.

A more detailed set of instructions is available in the GitHub repository for your reference.
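
If you want to script the workspace creation instead of using the console, a minimal sketch with the AWS SDK for Python might look like the following; it assumes IAM Identity Center is already enabled, and the workspace name and data source list are illustrative:

import boto3

grafana = boto3.client("grafana")

# Create a service-managed workspace in the current account.
workspace = grafana.create_workspace(
    accountAccessType="CURRENT_ACCOUNT",
    authenticationProviders=["AWS_SSO"],
    permissionType="SERVICE_MANAGED",
    workspaceName="logistics-dashboard",  # illustrative name
    workspaceDataSources=["REDSHIFT"],
)
print(workspace["workspace"]["id"])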

Clean up

To avoid ongoing charges, delete the resources deployed. Access the Amazon Linux 2 environment and run the AWS CDK destroy command. Delete the Grafana objects related to this deployment.

cd amazon-redshift-streaming-workshop
source .venv/bin/activate
cdk destroy --all

Conclusion

In this post, we showed how easy it is to build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. We created an end-to-end modern data pipeline using only SQL. This shows how Amazon Redshift is a powerful platform for democratizing your data—it enables a wide range of users, including business analysts, data scientists, and others, to work with and analyze data without requiring specialized technical skills or expertise.

We encourage you to explore what else can be achieved with Amazon Redshift and Amazon Managed Grafana. We also recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon Redshift.


About the Author

Paul Villena is an Analytics Solutions Architect in AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure-as-code, serverless technologies and coding in Python.

How to revoke federated users’ active AWS sessions

Post Syndicated from Matt Howard original https://aws.amazon.com/blogs/security/how-to-revoke-federated-users-active-aws-sessions/

When you use a centralized identity provider (IdP) for human user access, changes that an identity administrator makes to a user within the IdP won’t invalidate the user’s existing active Amazon Web Services (AWS) sessions. This is due to the nature of session durations that are configured on assumed roles. This situation presents a challenge for identity administrators.

In this post, you’ll learn how to revoke access to specific users’ sessions on AWS assumed roles through the use of AWS Identity and Access Management (IAM) policies and service control policies (SCPs) via AWS Organizations.

Session duration overview

When you configure IAM roles, you have the option of configuring a maximum session duration that specifies how long a session is valid. By default, the temporary credentials provided to the user will last for one hour, but you can change this to a value of up to 12 hours.
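
If you want to tighten an existing role, you can lower its maximum session duration with a single API call; a minimal sketch (the role name is illustrative):

import boto3

iam = boto3.client("iam")

# Reduce the maximum session duration of a federated role to one hour (in seconds).
iam.update_role(RoleName="roleexample", MaxSessionDuration=3600)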

When a user assumes a role in AWS by using their IdP credentials, that role’s credentials will remain valid for the length of their session duration. It’s convenient for end users to have a maximum session duration set to 12 hours, because this prevents their sessions from frequently timing out and then requiring re-login. However, a longer session duration also poses a challenge if you, as an identity administrator, attempt to revoke or modify a user’s access to AWS from your IdP.

For example, user John Doe is leaving the company and you want to verify that John has his privileges within AWS revoked. If John has access to IAM roles with long session durations, then he might have residual access to AWS despite having his session revoked or his user identity deleted within the IdP. Perhaps John assumed a role for his daily work at 8 AM and then you revoked his credentials within the IdP at 9 AM. Because John had already assumed an AWS role, he would still have access to AWS through that role for the duration of the configured session, until 8 PM if the session was configured for 12 hours. Therefore, as a security best practice, AWS recommends that you do not set the session duration length longer than is needed. This example is displayed in Figure 1.

Figure 1: Session duration overview

Figure 1: Session duration overview

In order to restrict access despite the session duration being active, you could update the roles that are assumable from an IdP with a deny-all policy or delete the role entirely. However, this is a disruptive action for the users that have access to this role. If the role was deleted or the policy was updated to deny all, then users would no longer be able to assume the role or access their AWS environment. Instead, the recommended approach is to revoke access based on the specific user’s principalId or sourceIdentity values.

The principalId is the unique identifier for the entity that made the API call. When requests are made with temporary credentials, such as assumed roles through IdPs, this value also includes the session name, such as [email protected]. The sourceIdentity identifies the original user identity that is making the request, such as a user who is authenticated through SAML federation from an IdP. As a best practice, AWS recommends that you configure this value within the IdP, because this improves traceability for user sessions within AWS. You can find more information on this functionality in the blog post, How to integrate AWS STS SourceIdentity with your identity provider.

Identify the principalId and sourceIdentity by using CloudTrail

You can use AWS CloudTrail to review the actions taken by a user, role, or AWS service that are recorded as events. In the following procedure, you will use CloudTrail to identify the principalId and sourceIdentity contained in the CloudTrail record contents for your IdP assumed role.

To identify the principalId and sourceIdentity by using CloudTrail

  1. Assume a role in AWS by signing in through your IdP.
  2. Perform an action such as creating an S3 bucket.
  3. Navigate to the CloudTrail service.
  4. In the navigation pane, choose Event History.
  5. For Lookup attributes, choose Event name. For Event name, enter CreateBucket.
  6. Figure 2: Looking up the CreateBucket event in the CloudTrail event history

    Figure 2: Looking up the CreateBucket event in the CloudTrail event history

  7. Select the corresponding event record and review the event details. An example showing the userIdentity element is as follows.

"userIdentity": {
	"type": "AssumedRole",
	"principalId": 
"AROATVGBKRLCHXEXAMPLE:[email protected]",
	"arn": "arn:aws:sts::111122223333:assumed-
role/roleexample/[email protected]",
	"accountId": "111122223333",
	"accessKeyId": "ASIATVGBKRLCJEXAMPLE",
	"sessionContext": {
		"sessionIssuer": {
			"type": "Role",
			"principalId": "AROATVGBKRLCHXEXAMPLE",
			"arn": 
"arn:aws:iam::111122223333:role/roleexample",
			"accountId": "111122223333",
			"userName": "roleexample"
		},
		"webIdFederationData": {},
		"attributes": {
			"creationDate": "2022-07-05T15:48:28Z",
			"mfaAuthenticated": "false"
		},
		"sourceIdentity": "[email protected]"
	}
}

In this event record, you can see that principalId is "AROATVGBKRLCHXEXAMPLE:[email protected]" and sourceIdentity was specified as [email protected]. Now that you have these values, let’s explore how you can revoke access by using SCP and IAM policies.
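
You can also retrieve these values programmatically; a minimal sketch that looks up recent CreateBucket events and prints the identifiers used in the policies that follow:

import json
import boto3

cloudtrail = boto3.client("cloudtrail")

events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "CreateBucket"}],
    MaxResults=10,
)
for event in events["Events"]:
    identity = json.loads(event["CloudTrailEvent"]).get("userIdentity", {})
    # principalId includes the session name; sourceIdentity is set by the IdP.
    print(
        identity.get("principalId"),
        identity.get("sessionContext", {}).get("sourceIdentity"),
    )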

Use an SCP to deny users based on IdP user name or revoke session token

First, you will create an SCP, a policy that can be applied to an organization to offer central control of the maximum available permissions across the accounts in the organization. More information on SCPs, including steps to create and apply them, can be found in the AWS Organizations User Guide.

The SCP will have a deny-all statement with a condition for aws:userid, which will evaluate the principalId field; and a condition for aws:SourceIdentity, which will evaluate the sourceIdentity field. In the following example SCP, the users John Doe and Mary Major are prevented from accessing AWS, in member accounts, regardless of their session duration, because each action will check against their aws:userid and aws:SourceIdentity values and be denied accordingly.

SCP to deny access based on IdP user name


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "aws:userid": [
                        "*:[email protected]",
                        "*:[email protected]"
                    ]
                }
            }
        },
        {
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceIdentity": [
                        "[email protected]",
                        "[email protected]"
                    ]
                }
            }
        }
    ]
}
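
If you manage SCPs with the AWS SDK rather than the console, creating and attaching the policy could look like the following sketch; the statement is abbreviated to one condition, and the policy name and target ID are placeholders:

import json
import boto3

org = boto3.client("organizations")

scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"StringLike": {"aws:userid": ["*:[email protected]"]}},
        }
    ],
}

policy = org.create_policy(
    Content=json.dumps(scp_document),
    Description="Deny access for a specific federated user",
    Name="DenyFederatedUserSessions",  # placeholder name
    Type="SERVICE_CONTROL_POLICY",
)

# Attach the SCP to an organizational unit or member account (placeholder ID).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-example-12345678",
)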

Use an IAM policy to revoke access in the AWS Organizations management account

SCPs do not affect users or roles in the AWS Organizations management account and instead only affect the member accounts in the organization. Therefore, using an SCP alone to deny access may not be sufficient. However, identity administrators can revoke access in a similar way within their management account by using the following procedure.

To create an IAM policy in the management account

  1. Sign in to the AWS Management Console by using your AWS Organizations management account credentials.
  2. Follow these steps to use the JSON policy editor to create an IAM policy. Use the JSON of the SCP shown in the preceding section, SCP to deny access based on IdP user name, in the IAM JSON editor.
  3. Follow these steps to add the IAM policy to roles that IdP users may assume within the account.

Revoke active sessions when role chaining

At this point, the user actions on the IdP assumable roles within the AWS organization have been blocked. However, there is still an edge case if the target users use role chaining (use an IdP assumedRole credential to assume a second role) that uses a different RoleSessionName than the one assigned by the IdP. In a role chaining situation, the users will still have access by using the cached credentials for the second role.

This is where the sourceIdentity field is valuable. After a source identity is set, it is present in requests for AWS actions that are taken during the role session. The value that is set persists when a role is used to assume another role (role chaining). The value that is set cannot be changed during the role session. Therefore, it’s recommended that you configure the sourceIdentity field within the IdP as explained previously. This concept is shown in Figure 3.

Figure 3: Role chaining with sourceIdentity configured

Figure 3: Role chaining with sourceIdentity configured

A user assumes an IAM role via their IdP (#1), and the CloudTrail record displays sourceIdentity: [email protected] (#2). When the user assumes a new role within AWS (#3), that CloudTrail record continues to display sourceIdentity: [email protected] despite the principalId changing (#4).
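
To see this persistence yourself, you can chain roles with the AWS SDK. This is a simplified sketch: in practice the first session comes from the IdP (for example, through AssumeRoleWithSAML), and the role ARNs and session names here are placeholders:

import boto3

sts = boto3.client("sts")

# First session (normally obtained through the IdP).
first = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/roleexample",
    RoleSessionName="first-session",
)
creds = first["Credentials"]

# Role chaining: use the first session's credentials to assume a second role.
chained_sts = boto3.client(
    "sts",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
second = chained_sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/second-role",
    RoleSessionName="chained-session",
)
# Actions taken with the second session keep the sourceIdentity set on the
# first session; it cannot be changed during the chained session.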

However, if a second role is assumed in the account through role chaining and the sourceIdentity is not set, then it’s recommended that you revoke the issued session tokens for the second role. In order to do this, you can use the SCP policy at the end of this section, SCP to revoke active sessions for assumed roles. When you use this policy, the issued credentials related to the roles specified will be revoked for the users currently using them, and only users who were not denied through the previous SCP or IAM policies restricting their aws:userid will be able to reassume the target roles to obtain a new temporary credential.

If you take this approach, you will need to use an SCP to apply across the organization’s member accounts. The SCP must have the human-assumable roles for role chaining listed and a token issue time set to a specific time when you want users’ access revoked. (Normally, this time window would be set to the present time to immediately revoke access, but there might be circumstances in which you wish to revoke the access at a future date, such as when a user moves to a new project or team and therefore requires different access levels.) In addition, you will need to follow the same procedures in your management account by creating a customer-managed policy by using the same JSON with the condition statement for aws:PrincipalArn removed. Then attach the customer managed policy to the individual roles that are human-assumable through role chaining.

SCP to revoke active sessions for assumed roles


{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "RevokeActiveSessions",
			"Effect": "Deny",
			"Action": [
				"*"
			],
			"Resource": [
				"*"
			],
			"Condition": {
				"StringEquals": {
					"aws:PrincipalArn": [
						"arn:aws:iam::<account-id>:role/<role-name>",
						"arn:aws:iam::<account-id>:role/<role-name>"
					]
				},
				"DateLessThan": {
					"aws:TokenIssueTime": "2022-06-01T00:00:00Z"
				}
			}
		}
	]
}

Conclusion and final recommendations

In this blog post, I demonstrated how you can revoke a federated user’s active AWS sessions by using SCPs and IAM policies that restrict the use of the aws:userid and aws:SourceIdentity condition keys. I also shared how you can handle a role chaining situation with the aws:TokenIssueTime condition key.

This exercise demonstrates the importance of configuring the session duration parameter on IdP assumed roles. As a security best practice, you should set the session duration to no longer than what is needed to perform the role. In some situations, that could mean an hour or less in a production environment and a longer session in a development environment. Regardless, it’s important to understand the impact of configuring the maximum session duration in the user’s environment and also to have proper procedures in place for revoking a federated user’s access.

This post also covered the recommendation to set the sourceIdentity for assumed roles through the IdP. This value cannot be changed during role sessions and therefore persists when a user conducts role chaining. Following this recommendation minimizes the risk that a user might have assumed another role with a different session name than the one assigned by the IdP and helps prevent the edge case scenario of revoking active sessions based on TokenIssueTime.

You should also consider other security best practices, described in the Security Pillar of the AWS Well-Architected Framework, when you revoke users’ AWS access. For example, rotate credentials such as IAM access keys in situations where they are regularly used and shared among users. The example solutions in this post would not have prevented a user from performing AWS actions if that user had IAM access keys configured for a separate IAM user in the environment. Organizations should limit long-lived security credentials such as IAM keys and instead rotate them regularly or avoid their use altogether. Also, the concept of least privilege is highly important to limit the access that users have and scope it solely to the requirements that are needed to perform their job functions. Lastly, you should adopt a centralized identity provider coupled with the AWS IAM Identity Center (successor to AWS Single Sign-On) service in order to centralize identity management and avoid the need for multiple credentials for users.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Identity and Access Management re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Matt Howard

Matt is a Principal Technical Account Manager (TAM) for AWS Enterprise Support. As a TAM, Matt provides advocacy and technical guidance to help customers plan and build solutions using AWS best practices. Outside of AWS, Matt enjoys spending time with family, sports, and video games.

Setting up a secure CI/CD pipeline in a private Amazon Virtual Private Cloud with no public internet access

Post Syndicated from MJ Kubba original https://aws.amazon.com/blogs/devops/setting-up-a-secure-ci-cd-pipeline-in-a-private-amazon-virtual-private-cloud-with-no-public-internet-access/

With the rise of the cloud and increased security awareness, the use of private Amazon VPCs with no public internet access also expanded rapidly. This setup is recommended to ensure proper security through isolation. The isolation requirement also applies to code pipelines, in which developers deploy their application modules, software packages, and other dependencies and bundles throughout the development lifecycle. This is done without having to push larger bundles from the developer space to the staging space or the target environment. Furthermore, AWS CodeArtifact is used as an artifact management service that will help organizations of any size to securely store, publish, and share software packages used in their software development process.

We’ll walk through the steps required to build a secure, private continuous integration/continuous deployment (CI/CD) pipeline with no public internet access while maintaining log retention in Amazon CloudWatch. We’ll utilize AWS CodeCommit for source, CodeArtifact for the modules and software packages, and Amazon Simple Storage Service (Amazon S3) as artifact storage.

Prerequisites

The prerequisites for following along with this post include:

  • An AWS Account
  • A Virtual Private Cloud (Amazon VPC)
  • A CI/CD pipeline – This can be CodePipeline, Jenkins, or any CI/CD tool you want to integrate CodeArtifact with; we will use CodePipeline in this walkthrough.

Solution walkthrough

The main service we’ll focus on is CodeArtifact, a fully managed artifact repository service that makes it easy for organizations of any size to securely store, publish, and share software packages used in their software development process. CodeArtifact works with commonly used package managers and build tools, such as Maven and Gradle (Java), npm and yarn (JavaScript), pip and twine (Python), or NuGet (.NET).

Users push code to CodeCommit. CodePipeline detects the change and starts the pipeline. In CodeBuild, the build stage uses the private endpoints to download the required software packages without going over the internet.

The preceding diagram shows how the requests remain private within the VPC and don’t go through the internet gateway, by going from CodeBuild over the private endpoint to the CodeArtifact service, all within the private subnet.

The requests will use the following VPC endpoints to connect to these AWS services:

  • CloudWatch Logs endpoint (for CodeBuild to put logs in CloudWatch)
  • CodeArtifact endpoints
  • AWS Security Token Service (AWS STS) endpoint
  • Amazon Simple Storage Service (Amazon S3) endpoint

Walkthrough

  1. Create a CodeCommit Repository:
    1. Navigate to your CodeCommit Console then click on Create repository
Screenshot: Create repository button

Figure 2. Screenshot: Create repository button.

    1. Type in a name for the repository, then click Create
Screenshot: Repository setting with name shown as "Private" and empty Description

Figure 3. Screenshot: Repository setting with name shown as “Private” and empty Description.

    1. Scroll down and click Create file
Figure 4. Create file button.

Figure 4. Create file button.

    1. Copy the example buildspec.yml file and paste it to the editor

Example buildspec.yml file:

version: 0.2
phases:
  install:
    runtime-versions:
      nodejs: 16
    commands:
      - export AWS_STS_REGIONAL_ENDPOINTS=regional
      - ACCT=`aws sts get-caller-identity --region ${AWS_REGION} --query Account --output text`
      - aws codeartifact login --tool npm --repository Private --domain private --domain-owner ${ACCT}
      - npm install
  build:
    commands:
      - node index.js
    1. Name the file buildspec.yml, type in your name and your email address, then choose Commit changes
Figure 5. Screenshot: Create file page.

Figure 5. Screenshot: Create file page.

  2. Create a CodeArtifact repository
    1. Navigate to your CodeArtifact Console then click on Create repository
    2. Give it a name and select npm-store as the public upstream repository
Figure 6. Screenshot: Create repository page with Repository name "Private".

Figure 6. Screenshot: Create repository page with Repository name “Private”.

    1. For the Domain, select This AWS account and enter a domain name
Figure 7. Screenshot: Select domain page.

Figure 7. Screenshot: Select domain page.

    1. Click Next then Create repository
Figure 8. Screenshot: Create repository review page.

Figure 8. Screenshot: Create repository review page.

  3. Create a CI/CD pipeline using CodePipeline
    1. Navigate to your CodePipeline Console then click on Create pipeline
Figure 9. Screenshot: Create pipeline button.

Figure 9. Screenshot: Create pipeline button.

    1. Type a name, leave the Service role as “New service role” and click next
Figure 10. Screenshot: Choose pipeline setting page with pipeline name "Private".

Figure 10. Screenshot: Choose pipeline setting page with pipeline name “Private”.

    1. Select AWS CodeCommit as your Source provider
    2. Then choose the CodeCommit repository you created earlier and for branch select main then click Next
Figure 11. Screenshot: Create pipeline add source stage.

Figure 11. Screenshot: Create pipeline add source stage.

    1. For the Build Stage, Choose AWS CodeBuild as the build provider, then click Create Project
Figure 12. Screenshot: Create pipeline add build stage.

Figure 12. Screenshot: Create pipeline add build stage.

    1. This will open a new window to create the new project. Give this project a name
Figure 13. Screenshot: Create pipeline create build project window.

Figure 13. Screenshot: Create pipeline create build project window.

    1. Scroll down to the Environment section and select Managed image,
    2. For Operating system select “Amazon Linux 2”,
    3. Runtime “Standard” and
    4. For Image select aws/codebuild/amazonlinux2-x86_64-standard:4.0
      For the Image version: Always use the latest image for this runtime version
    5. Select Linux for the Environment type
    6. Leave the Privileged option unchecked and set Service Role to “New service role”
Figure 14. Screenshot: Create pipeline create build project, setting up environment window.

Figure 14. Screenshot: Create pipeline create build project, setting up environment window.

    1. Expand Additional configurations and scroll down to the VPC section. Select the desired VPC, your subnets (we recommend selecting multiple Availability Zones to ensure high availability), and a security group (the security group rules must allow the resources that use the VPC endpoint to communicate with the endpoint network interface; the default VPC security group is used here as an example)
Figure 15. Screenshot: Create pipeline create build project networking window.

Figure 15. Screenshot: Create pipeline create build project networking window.

    1. Scroll down to the Buildspec and select “Use a buildspec file” and type “buildspec.yml” for the Buildspec name
Figure 16. Screenshot: Create pipeline create build project buildspec window.

Figure 16. Screenshot: Create pipeline create build project buildspec window.

    1. Select the CloudWatch logs option. You can leave the group name and stream empty; this will let the service use the default values. Click Continue to CodePipeline
Figure 17. Screenshot: Create pipeline create build project logs window.

Figure 17. Screenshot: Create pipeline create build project logs window.

    1. This will create the new CodeBuild project and update the CodePipeline page; now you can click Next
Figure 18. Screenshot: Create pipeline add build stage window.

Figure 18. Screenshot: Create pipeline add build stage window.

    1.  Since we are not deploying this to any environment, you can skip the deploy stage and click “Skip deploy stage”

Figure 19. Screenshot: Create pipeline add deploy stage.

Figure 20. Screenshot: Create pipeline skip deployment stage confirmation.

Figure 20. Screenshot: Create pipeline skip deployment stage confirmation.

    1. After you get the popup, click Skip again. You’ll see the review page; scroll all the way down and click Create pipeline
  4. Create a VPC endpoint for Amazon CloudWatch Logs. This will enable CodeBuild to send execution logs to CloudWatch:
    a. Navigate to your VPC console, and from the navigation menu on the left select “Endpoints”.
Figure 21. Screenshot: VPC endpoint.

Figure 21. Screenshot: VPC endpoint.

    b. Click the Create endpoint button.
Figure 22. Screenshot: Create endpoint.

Figure 22. Screenshot: Create endpoint.

    c. For Service category, select “AWS Services”. You can set a name for the new endpoint; make sure to use something descriptive.
Figure 23. Screenshot: Create endpoint page.

Figure 23. Screenshot: Create endpoint page.

    d. From the list of services, search for the endpoint by typing logs in the search bar and selecting the one with com.amazonaws.us-west-2.logs.
      This walkthrough can be done in any Region that supports the services. I am going to be using us-west-2; select the appropriate Region for your workload.
Figure 24. Screenshot: create endpoint select services with com.amazonaws.us-west-2.logs selected.

Figure 24. Screenshot: create endpoint select services with com.amazonaws.us-west-2.logs selected.

    e. Select the VPC that you want the endpoint to be associated with, and make sure that the Enable DNS name option is checked under additional settings.
Figure 25. Screenshot: create endpoint VPC setting shows VPC selected.

Figure 25. Screenshot: create endpoint VPC setting shows VPC selected.

    f. Select the subnets where you want the endpoint to be associated; you can leave the security group as default and the policy as empty.
Figure 26. Screenshot: create endpoint subnet setting shows 2 subnet selected and default security group selected.

Figure 26. Screenshot: create endpoint subnet setting shows 2 subnet selected and default security group selected.

    g. Select Create endpoint.
Figure 27. Screenshot: create endpoint button.

Figure 27. Screenshot: create endpoint button.
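
If you prefer to create the endpoint with code instead of the console, a minimal sketch with the AWS SDK for Python follows; the VPC, subnet, and security group IDs are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Interface endpoint for CloudWatch Logs with private DNS enabled.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.logs",
    SubnetIds=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)

The same call, with a different ServiceName, applies to the CodeArtifact and STS endpoints created in the next steps; the S3 endpoint uses VpcEndpointType="Gateway" and route table IDs instead of subnets and security groups.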

  5. Create a VPC endpoint for CodeArtifact. At the time of writing this article, CodeArtifact has two endpoints: one is for API operations like service-level operations and authentication, and the other is for using the service, such as getting modules for our code. We’ll need both endpoints to automate working with CodeArtifact. Therefore, we’ll create both endpoints with DNS enabled.

In addition, we’ll need AWS Security Token Service (AWS STS) endpoint for get-caller-identity API call:

Follow steps a-c from Part 4 (creating the CloudWatch Logs endpoint above).

a. From the list of services, you can search for the endpoint by typing codeartifact in the search bar and selecting the one with com.amazonaws.us-west-2.codeartifact.api.

Figure 28. Screenshot: create endpoint select services with com.amazonaws.us-west-2.codeartifact.api selected.

Figure 28. Screenshot: create endpoint select services with com.amazonaws.us-west-2.codeartifact.api selected.

Follow steps e-g from Part 4.

Then, repeat the same for the com.amazonaws.us-west-2.codeartifact.repositories service.

Figure 29. Screenshot: create endpoint select services with com.amazonaws.us-west-2.codeartifact.repositories selected.

Figure 29. Screenshot: create endpoint select services with com.amazonaws.us-west-2.codeartifact.repositories selected.

  6. Enable a VPC endpoint for AWS STS:

Follow steps a-c from Part 4

a. From the list of services you can search for the endpoint by typing sts in the search bar and selecting the one with com.amazonaws.us-west-2.sts.

Figure 30. Screenshot: create endpoint select services with com.amazonaws.us-west-2.sts selected.

Figure 30. Screenshot: create endpoint select services with com.amazonaws.us-west-2.sts selected.

Then follow steps e-g from Part 4.

  7. Create a VPC endpoint for S3:

Follow steps a-c from Part 4

a. From the list of services, you can search for the endpoint by typing s3 in the search bar and selecting the one with com.amazonaws.us-west-2.s3; select the one with the type Gateway

Then select your VPC, and select the route tables for your subnets; this will automatically update the route tables with the new S3 endpoint.

Figure 31. Screenshot: create endpoint select services with com.amazonaws.us-west-2.s3 selected.

Figure 31. Screenshot: create endpoint select services with com.amazonaws.us-west-2.s3 selected.

  8. Now we have all of the endpoints set. The last step is to update your pipeline to point at the CodeArtifact repository when pulling your code dependencies. I’ll use CodeBuild buildspec.yml as an example here.

Make sure that your CodeBuild AWS Identity and Access Management (IAM) role has the permissions to perform STS and CodeArtifact actions.

Navigate to the IAM console and click Roles from the left navigation menu, then search for your IAM role name. In our case, since we selected the “New service role” option in step 2.k, the role was created with the name “codebuild-Private-service-role” (codebuild-<BUILD PROJECT NAME>-service-role)

Figure 32. Screenshot: IAM roles with codebuild-Private-service-role role shown in search.

Figure 32. Screenshot: IAM roles with codebuild-Private-service-role role shown in search.

From the Add permissions menu, click on Create inline policy

Search for STS in the services then select STS

Figure 34. Screenshot: IAM visual editor with sts shown in search.

Figure 34. Screenshot: IAM visual editor with sts shown in search.

Search for “GetCallerIdentity” and select the action

Figure 35. Screenshot: IAM visual editor with GetCallerIdentity in search and action selected.

Figure 35. Screenshot: IAM visual editor with GetCallerIdentity in search and action selected.

Repeat the same with “GetServiceBearerToken”

Figure 36. Screenshot: IAM visual editor with GetServiceBearerToken in search and action selected.

Figure 36. Screenshot: IAM visual editor with GetServiceBearerToken in search and action selected.

Click on Review, add a name then click on Create policy

Figure 37. Screenshot: Review page and Create policy button.

Figure 37. Screenshot: Review page and Create policy button.

You should see the new inline policy added to the list

Figure 38. Screenshot: shows the new in-line policy in the list.

Figure 38. Screenshot: shows the new in-line policy in the list.

For CodeArtifact actions, we will do the same on that role; click on Create inline policy

Figure 39. Screenshot: attach policies.

Figure 39. Screenshot: attach policies.

Search for CodeArtifact in the services then select CodeArtifact

Figure 40. Screenshot: select service with CodeArtifact in search.

Figure 40. Screenshot: select service with CodeArtifact in search.

Search for “GetAuthorizationToken” in actions and select that action in the check box

Figure 41. CodeArtifact: with GetAuthorizationToken in search.

Figure 41. CodeArtifact: with GetAuthorizationToken in search.

Repeat for “GetRepositoryEndpoint” and “ReadFromRepository”

Click on Resources to fix the 2 warnings, then click on Add ARN on the first one “Specify domain resource ARN for the GetAuthorizationToken action.”

Figure 42. Screenshot: with all selected filed and 2 warnings.

Figure 42. Screenshot: with all selected filed and 2 warnings.

You’ll get a pop-up with fields for Region, Account and Domain name. Enter your Region, your account number, and the domain name; we used “private” when we created our domain earlier.

Figure 43. Screenshot: Add ARN page.

Figure 43. Screenshot: Add ARN page.

Then click Add

Repeat the same process for “Specify repository resource ARN for the ReadFromRepository and 1 more”. This time we will provide the Region, account ID, domain name, and repository name; we used “Private” for the repository we created earlier and “private” for the domain

Figure 44. Screenshot: add ARN page.

Figure 44. Screenshot: add ARN page.

Note that it is a best practice to specify the resource we are targeting. We could use the “Any” checkbox, but we want to narrow the scope of our IAM role as much as we can.
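
If you would rather attach this inline policy with code than through the visual editor, a sketch could look like the following; the account ID, Region, and policy name are placeholders, and the domain and repository match the values used in this walkthrough:

import json
import boto3

iam = boto3.client("iam")

account_id = "111122223333"  # placeholder
region = "us-west-2"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sts:GetCallerIdentity", "sts:GetServiceBearerToken"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["codeartifact:GetAuthorizationToken"],
            "Resource": f"arn:aws:codeartifact:{region}:{account_id}:domain/private",
        },
        {
            "Effect": "Allow",
            "Action": [
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ReadFromRepository",
            ],
            "Resource": f"arn:aws:codeartifact:{region}:{account_id}:repository/private/Private",
        },
    ],
}

# Attach the inline policy to the CodeBuild service role.
iam.put_role_policy(
    RoleName="codebuild-Private-service-role",
    PolicyName="CodeArtifactAccess",  # placeholder name
    PolicyDocument=json.dumps(policy),
)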

  1. Navigate to CodeCommit, then click on the repo you created earlier in step 1
Figure 45. Screenshot: CodeCommit repo.

Click on the Add file dropdown, then the Create file button

Paste the following in the editor space:

{
  "dependencies": {
    "mathjs": "^11.2.0"
  }
}

Name the file “package.json”

Add your name and email, and an optional commit message

Repeat this process for “index.js” and paste the following in the editor space:

const { sqrt } = require('mathjs')
console.log(sqrt(49).toString())

Figure 46. Screenshot: CodeCommit Commit changes button.


This will force the pipeline to kick off and start building the application

Figure 47. Screenshot: CodePipeline.

This is a very simple application that gets the square root of 49 and logs it to the screen. If you click on the Details link from the pipeline build stage, you'll see the output of running the NodeJS application. The logs are stored in CloudWatch, and you can navigate there by clicking on the View entire log link ("Showing the last xx lines of the build log. View entire log").

Figure 48. Screenshot: Showing the last 54 lines of the build log. View entire log.

We used an npm example in the buildspec.yml above; a similar setup can be used for pip and twine.
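
For reference, the equivalent login commands for a Python project look like the following (the account ID is a placeholder); pip is used when installing packages and twine when publishing them:

aws codeartifact login --tool pip --repository Private --domain private --domain-owner 111122223333
aws codeartifact login --tool twine --repository Private --domain private --domain-owner 111122223333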

For Maven, Gradle, and NuGet, you must set environment variables and change your settings.xml and build.gradle files, as well as install the plugin for your IDE. For more information, see here.

Cleanup

Navigate to VPC endpoint from the AWS console and delete the endpoints that you created.

Navigate to CodePipeline and delete the Pipeline you created.

Navigate to CodeBuild and delete the Build Project created.

Navigate to CodeCommit and delete the Repository you created.

Navigate to CodeArtifact and delete the Repository and the domain you created.

Navigate to IAM and delete the Roles created:

For CodeBuild: codebuild-<Build Project Name>-service-role

For CodePipeline: AWSCodePipelineServiceRole-<Region>-<Project Name>

Conclusion

In this post, we deployed a full CI/CD pipeline with CodePipeline orchestrating CodeBuild to build and test a small NodeJS application, using CodeArtifact to download the application code dependencies, all without going to the public internet, while keeping the build logs in CloudWatch.

About the author:

MJ Kubba

MJ Kubba is a Solutions Architect who enjoys working with public sector customers to build solutions that meet their business needs. MJ has over 15 years of experience designing and implementing software solutions. He has a keen passion for DevOps and cultural transformation.

Team Collaboration with Amazon CodeCatalyst

Post Syndicated from original https://aws.amazon.com/blogs/devops/team-collaboration-with-amazon-codecatalyst/

Amazon CodeCatalyst enables teams to collaborate on features, tasks, bugs, and any other work involved when building software. CodeCatalyst was announced at re:Invent 2022 and is currently in preview.

Introduction:

In a prior post in this series, Using Workflows to Build, Test, and Deploy with Amazon CodeCatalyst, I discussed reading The Unicorn Project, by Gene Kim, and how the main character, Maxine, struggles with a complicated software development lifecycle (SDLC) after joining a new team. Some of the challenges she encounters include:

  • Continually delivering high-quality updates is complicated and slow
  • Collaborating efficiently with others is challenging
  • Managing application environments is increasingly complex
  • Setting up a new project is a time-consuming chore

In this post, I will focus on the second bullet, and how CodeCatalyst helps you collaborate from anywhere with anyone.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Walkthrough

Similar to the prior post, I am going to use the Modern Three-tier Web Application blueprint in this walkthrough. A CodeCatalyst blueprint provides a template for a new project. If you would like to follow along, you can launch the blueprint as described in Creating a project in Amazon CodeCatalyst.  This will deploy the Mythical Mysfits sample application shown in the following image.

Figure 1. The Mythical Mysfits user interface showing header and three Mysfits

For this Walkthrough, let us assume that I need to make a simple change to the application. The legal department would like to add a footer that includes the text “© 2023 Demo Organization.” I will create an issue in CodeCatalyst to track this work and use CodeCatalyst to track the change throughout the entire Software Development Life Cycle (SDLC).

CodeCatalyst organizes projects into Spaces. A space represents your company, department, or group; and contains projects, members, and the associated cloud resources you create in CodeCatalyst. In this walkthrough, my Space currently includes two members, Brian Beach and Panna Shetty, as shown in the following screenshot.  Note that both users are administrators, but CodeCatalyst supports multiple roles. You can read more about roles in members of your space.

Figure 2. The space members configuration page showing two users

To begin, Brian creates a new issue to track the request from legal. He assigns the issue to Panna, but leaves it in the backlog for now. Note that CodeCatalyst supports multiple metadata fields to organize your work. This issue is not impacting users and is relatively simple to fix. Therefore, Brian has categorized it as low priority and estimated the effort as extra small (XS). Brian has also added a label, so all the requests from legal can be tracked together. Note that these metadata fields are customizable. You can read more in configuring issue settings.

Figure 3. Create issue dialog box with name, description and metadata

CodeCatalyst supports rich markdown in the description field. You can read about this in Markdown tips and tricks. In the following screenshot, Brian types “@app.vue” which brings up an inline search for people, issues, and code to help Panna find the relevant bit of code that needs changing later.

Figure 4. Create issue dialog box with type-ahead overlay

When Panna is ready to begin work on the new feature, she moves the issue from "Backlog" to "In progress." CodeCatalyst allows users to manage their work using a Kanban-style board. Panna can simply drag-and-drop issues on the board to move the issue from one state to another. Given the small team, Brian and Panna use a single board. However, CodeCatalyst allows you to create multiple views filtered by the metadata fields discussed earlier. For example, you might create a label called Sprint-001, and use that to create a board for the sprint.

Figure 5. Kanban board showing to do, in progress and in review columns

Panna creates a new branch for the change called feature_add_copyright and uses the link in the issue description to navigate to the source code repository. This change is so simple that she decides to edit the file in the browser and commits the change. Note that for more complex changes, CodeCatalyst supports Dev Environments. The next post in this series will be dedicated to Dev Environments. For now, you just need to know that a Dev Environment is a cloud-based development environment that you can use to quickly work on the code stored in the source repositories of your project.

Figure 6. Editor with new lines highlighted

Panna also creates a pull request to merge the feature branch into the main branch. She identifies Brian as a required reviewer. Panna then moves the issue to the "In review" column on the Kanban board so the rest of the team can track the progress. Once Brian reviews the change, he approves and merges the pull request.

Figure 7. Pull request details with title, description, and reviewer assigned

When the pull request is merged, a workflow is configured to run automatically on code changes to build, test, and deploy the change. Note that Workflows were covered in the prior post in this series. Once the workflow is complete, Panna is notified in the team’s Slack channel. You can read more about notifications in working with notifications in CodeCatalyst. She verifies the change in production and moves the issue to the done column on the Kanban board.

Figure 8. Kanban board showing in progress, in review, and done columns

Once the deployment completes, you will see the footer added at the bottom of the page.

Figure 9. The Mythical Mysfits user interface showing footer and three Mysfits

At this point the issue is complete and you have seen how this small team collaborated to progress through the entire software development lifecycle (SDLC).

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Conclusion

In this post, you learned how CodeCatalyst can help you rapidly collaborate with other developers. I used issues to track features and bugs, assigned code reviews, and managed pull requests. In future posts I will continue to discuss how CodeCatalyst can address the rest of the challenges Maxine encountered in The Unicorn Project.

About the authors:

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Panna Shetty

Panna Shetty is a Sr. Solutions Architect with Amazon Web Services (AWS), working with public sector customers. She enjoys helping customers architect and build scalable and reliable modern applications using cloud-native technologies.

Secure CDK deployments with IAM permission boundaries

Post Syndicated from Brian Farnhill original https://aws.amazon.com/blogs/devops/secure-cdk-deployments-with-iam-permission-boundaries/

The AWS Cloud Development Kit (CDK) accelerates cloud development by allowing developers to use common programming languages when modelling their applications. To take advantage of this speed, developers need to operate in an environment where permissions and security controls don’t slow things down, and in a tightly controlled environment this is not always the case. Of particular concern is the scenario where a developer has permission to create AWS Identity and Access Management (IAM) entities (such as users or roles), as these could have permissions beyond that of the developer who created them, allowing for an escalation of privileges. This approach is typically controlled through the use of permission boundaries for IAM entities, and in this post you will learn how these boundaries can now be applied more effectively to CDK development – allowing developers to stay secure and move fast.

Time to read: 10 minutes
Learning level: Advanced (300)
Services used:

AWS Cloud Development Kit (CDK)

AWS Identity and Access Management (IAM)

Applying custom permission boundaries to CDK deployments

When the CDK deploys a solution, it assumes an AWS CloudFormation execution role to perform operations on the user's behalf. This role is created during the bootstrapping phase by the AWS CDK Command Line Interface (CLI). This role should be configured to represent the maximum set of actions that CloudFormation can perform on the developer's behalf, while not compromising any compliance or security goals of the organisation. This can become complicated when developers need to create IAM entities (such as IAM users or roles) and assign permissions to them, as those permissions could be escalated beyond their existing access levels. Taking away the ability to create these entities is one way to solve the problem. However, doing this would be a significant impediment to developers, as they would have to ask an administrator to create them every time. This is made more challenging when you consider that security conscious practices will create individual IAM roles for every individual use case, such as each AWS Lambda Function in a stack. Rather than taking this approach, IAM permission boundaries can help in two ways – first, by ensuring that all actions are within the overlap of the user's permissions and the boundary, and second by ensuring that any IAM entities that are created also have the same boundary applied. This blocks the path to privilege escalation without restricting the developer's ability to create IAM identities. With the latest version of the AWS CDK CLI these boundaries can be applied to the execution role automatically when running the bootstrap command, as well as being added to IAM entities that are created in a CDK stack.

To use a permission boundary in the CDK, first create an IAM policy that will act as the boundary. This should define the maximum set of actions that the CDK application will be able to perform on the developer’s behalf, both during deployment and operation. This step would usually be performed by an administrator who is responsible for the security of the account, ensuring that the appropriate boundaries and controls are enforced. Once created, the name of this policy is provided to the bootstrap command. In the example below, an IAM policy called “developer-policy” is used to demonstrate the command.
cdk bootstrap --custom-permissions-boundary developer-policy
Once this command runs, a new bootstrap stack will be created (or an existing stack will be updated) so that the execution role has this boundary applied to it. Next, you can ensure that any IAM entities that are created will have the same boundaries applied to them. This is done by either using a CDK context variable, or the permissionBoundary attribute on those resources. To explain this in some detail, let’s use a real world scenario and step through an example that shows how this feature can be used to restrict developers from using the AWS Config service.

Installing or upgrading the AWS CDK CLI

Before beginning, ensure that you have the latest version of the AWS CDK CLI tool installed. Follow the instructions in the documentation to complete this. You will need version 2.54.0 or higher to make use of this new feature. To check the version you have installed, run the following command.

cdk --version

Creating the policy

First, let’s begin by creating a new IAM policy. Below is a CloudFormation template that creates a permission policy for use in this example. In this case the AWS CLI can deploy it directly, but this could also be done at scale through a mechanism such as CloudFormation Stack Sets. This template has the following policy statements:

  1. Allow all actions by default – this allows you to deny the specific actions that you choose. You should carefully consider your approach to allow/deny actions when creating your own policies though.
  2. Deny the creation of users or roles unless the “developer-policy” permission boundary is used. Additionally limit the attachment of permissions boundaries on existing entities to only allow “developer-policy” to be used. This prevents the creation or change of an entity that can escalate outside of the policy.
  3. Deny the ability to change the policy itself so that a developer can’t modify the boundary they will operate within.
  4. Deny the ability to remove the boundary from any user or role
  5. Deny any actions against the AWS Config service

Here items 2, 3 and 4 all ensure that the permission boundary works correctly – they are controls that prevent the boundary being removed, tampered with, or bypassed. The real focus of this policy in terms of the example is items 1 and 5 – where you allow everything, except the specific actions that are denied (creating a deny list of actions, rather than an allow list approach).

Resources:
  PermissionsBoundary:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Statement:
          # ----- Begin base policy ---------------
          # If permission boundaries do not have an explicit allow
          # then the effect is deny
          - Sid: ExplicitAllowAll
            Action: "*"
            Effect: Allow
            Resource: "*"
          # Default permissions to prevent privilege escalation
          - Sid: DenyAccessIfRequiredPermBoundaryIsNotBeingApplied
            Action:
              - iam:CreateUser
              - iam:CreateRole
              - iam:PutRolePermissionsBoundary
              - iam:PutUserPermissionsBoundary
            Condition:
              StringNotEquals:
                iam:PermissionsBoundary:
                  Fn::Sub: arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/developer-policy
            Effect: Deny
            Resource: "*"
          - Sid: DenyPermBoundaryIAMPolicyAlteration
            Action:
              - iam:CreatePolicyVersion
              - iam:DeletePolicy
              - iam:DeletePolicyVersion
              - iam:SetDefaultPolicyVersion
            Effect: Deny
            Resource:
              Fn::Sub: arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/developer-policy
          - Sid: DenyRemovalOfPermBoundaryFromAnyUserOrRole
            Action: 
              - iam:DeleteUserPermissionsBoundary
              - iam:DeleteRolePermissionsBoundary
            Effect: Deny
            Resource: "*"
          # ----- End base policy ---------------
          # -- Begin Custom Organization Policy --
          - Sid: DenyModifyingOrgCloudTrails
            Effect: Deny
            Action: config:*
            Resource: "*"
          # -- End Custom Organization Policy --
        Version: "2012-10-17"
      Description: "Bootstrap Permission Boundary"
      ManagedPolicyName: developer-policy
      Path: /

Save the above locally as developer-policy.yaml and then you can deploy it with a CloudFormation command in the AWS CLI:

aws cloudformation create-stack --stack-name DeveloperPolicy \
        --template-body file://developer-policy.yaml \
        --capabilities CAPABILITY_NAMED_IAM

Creating a stack to test the policy

To begin, create a new CDK application that you will use to test and observe the behaviour of the permission boundary. Create a new directory with a TypeScript CDK application in it by executing these commands.

mkdir DevUsers && cd DevUsers
cdk init --language typescript

Once this is done, you should also make sure that your account has a CDK bootstrap stack deployed with the cdk bootstrap command. To start with, do not apply a permission boundary to it; you can add that later and observe how it changes the behaviour of your deployment. Because the bootstrap command is not using the --cloudformation-execution-policies argument, it will default to arn:aws:iam::aws:policy/AdministratorAccess which means that CloudFormation will have full access to the account until the boundary is applied.

cdk bootstrap

Once the command has run, create an AWS Config Rule in your application to be sure that this works without issue before the permission boundary is applied. Open the file lib/dev_users-stack.ts and edit its contents to reflect the sample below.


import * as cdk from 'aws-cdk-lib';
import { ManagedRule, ManagedRuleIdentifiers } from 'aws-cdk-lib/aws-config';
import { Construct } from "constructs";

export class DevUsersStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new ManagedRule(this, 'AccessKeysRotated', {
      configRuleName: 'access-keys-policy',
      identifier: ManagedRuleIdentifiers.ACCESS_KEYS_ROTATED,
      inputParameters: {
        maxAccessKeyAge: 60, // default is 90 days
      },
    });
  }
}

Next you can deploy with the CDK CLI using the cdk deploy command, which will succeed (the output below has been truncated to show a summary of the important elements).

❯ cdk deploy
✨  Synthesis time: 3.05s
✅  DevUsersStack
✨  Deployment time: 23.17s

Stack ARN:
arn:aws:cloudformation:ap-southeast-2:123456789012:stack/DevUsersStack/704a7710-7c11-11ed-b606-06d79634f8d4

✨  Total time: 26.21s

Before you deploy the permission boundary, remove this stack again with the cdk destroy command.

❯ cdk destroy
Are you sure you want to delete: DevUsersStack (y/n)? y
DevUsersStack: destroying... [1/1]
✅ DevUsersStack: destroyed

Using a permission boundary with the CDK test application

Now apply the permission boundary that you created above and observe the impact it has on the same deployment. To update your bootstrap stack with the permission boundary, re-run the cdk bootstrap command with the new --custom-permissions-boundary parameter.

cdk bootstrap --custom-permissions-boundary developer-policy

After this command executes, the CloudFormation execution role will be updated to use that policy as a permission boundary, which based on the deny rule for config:* will cause this same application deployment to fail. Run cdk deploy again to confirm this and observe the error message.

❌ Deployment failed: Error: Stack Deployments Failed: Error: The stack
named DevUsersStack failed creation, it may need to be manually deleted 
from the AWS console: 
  ROLLBACK_COMPLETE: 
    User: arn:aws:sts::123456789012:assumed-role/cdk-hnb659fds-cfn-exec-role-123456789012-ap-southeast-2/AWSCloudFormation
    is not authorized to perform: config:PutConfigRule on resource: access-keys-policy with an explicit deny in a
    permissions boundary

This shows you that the action was denied specifically due to the use of a permissions boundary, which is what was expected.

Applying permission boundaries to IAM entities automatically

Next let’s explore how the permission boundary can be extended to IAM entities that are created by a CDK application. The concern here is that a developer who is creating a new IAM entity could assign it more permissions than they have themselves – the permission boundary manages this by ensuring that entities can only be created that also have the boundary attached. You can validate this by modifying the stack to deploy a Lambda function that uses a role that doesn’t include the boundary. Open the file lib/dev_users-stack.ts again and edit its contents to reflect the sample below.

import * as cdk from 'aws-cdk-lib';
import { PolicyStatement } from "aws-cdk-lib/aws-iam";
import {
  AwsCustomResource,
  AwsCustomResourcePolicy,
  PhysicalResourceId,
} from "aws-cdk-lib/custom-resources";
import { Construct } from "constructs";

export class DevUsersStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new AwsCustomResource(this, "Resource", {
      onUpdate: {
        service: "ConfigService",
        action: "putConfigRule",
        parameters: {
          ConfigRule: {
            ConfigRuleName: "SampleRule",
            Source: {
              Owner: "AWS",
              SourceIdentifier: "ACCESS_KEYS_ROTATED",
            },
            InputParameters: '{"maxAccessKeyAge":"60"}',
          },
        },
        physicalResourceId: PhysicalResourceId.of("SampleConfigRule"),
      },
      policy: AwsCustomResourcePolicy.fromStatements([
        new PolicyStatement({
          actions: ["config:*"],
          resources: ["*"],
        }),
      ]),
    });
  }
}

Here the AwsCustomResource is used to provision a Lambda function that will attempt to create a new config rule. This is the same result as the previous stack but in this case the creation of the rule is done by a new IAM role that is created by the CDK construct for you. Attempting to deploy this will result in a failure – run cdk deploy to observe this.

❌ Deployment failed: Error: Stack Deployments Failed: Error: The stack named 
DevUsersStack failed creation, it may need to be manually deleted from the AWS 
console: 
  ROLLBACK_COMPLETE: 
    API: iam:CreateRole User: arn:aws:sts::123456789012:assumed-
role/cdk-hnb659fds-cfn-exec-role-123456789012-ap-southeast-2/AWSCloudFormation
    is not authorized to perform: iam:CreateRole on resource:
arn:aws:iam::123456789012:role/DevUsersStack-
AWS679f53fac002430cb0da5b7982bd2287S-1EAD7M62914OZ
    with an explicit deny in a permissions boundary

The error message here details that the stack was unable to deploy because the call to iam:CreateRole was denied, as the new role was not being created with the required permission boundary attached. The CDK now offers a straightforward way to set a default permission boundary on all IAM entities that are created, via the CDK context variable core:permissionsBoundary in the cdk.json file.

{
  "context": {
     "@aws-cdk/core:permissionsBoundary": {
       "name": "developer-policy"
     }
  }
}

This approach is useful because now you can import constructs that create IAM entities (such as those found on Construct Hub or out of the box constructs that create default IAM roles) and have the boundary apply to them as well. There are alternative ways to achieve this, such as setting a boundary on specific roles, which can be used in scenarios where this approach does not fit. Make the change to your cdk.json file and run the CDK deploy again. This time the custom resource will attempt to create the config rule using its IAM role instead of the CloudFormation execution role. It is expected that the boundary will also protect this Lambda function in the same way – run cdk deploy again to confirm this. Note that the deployment updates from CloudFormation show that the role creation succeeds this time, and a new error message is generated.

❌ Deployment failed: Error: Stack Deployments Failed: Error: The stack named
DevUsersStack failed creation, it may need to be manually deleted from the AWS 
console:
  ROLLBACK_COMPLETE: 
    Received response status [FAILED] from custom resource. Message returned: User:
    arn:aws:sts::123456789012:assumed-role/DevUsersStack-
AWS679f53fac002430cb0da5b7982bd2287S-84VFVA7OGC9N/DevUsersStack-
AWS679f53fac002430cb0da5b7982bd22872-MBnArBmaaLJp
    is not authorized to perform: config:PutConfigRule on resource: SampleRule with an explicit deny in a permissions boundary

In this error message you can see that the user it refers to is DevUsersStack-AWS679f53fac002430cb0da5b7982bd2287S-84VFVA7OGC9N rather than the CloudFormation execution role. This is the role being used by the custom Lambda function resource, and when it attempts to create the Config rule it is rejected because of the permissions boundary in the same way. Here you can see how the boundary is applied consistently to all IAM entities that are created in your CDK app, which ensures the administrative controls apply to everything a developer does with a minimal amount of overhead.
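
For completeness, here is a minimal sketch of the alternative mentioned earlier: applying the boundary in code to a specific scope instead of (or in addition to) the cdk.json context variable. It assumes the same developer-policy managed policy name; IAM entities created within the scope it is applied to will get the boundary attached.

import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class DevUsersStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Look up the existing managed policy by name and apply it as the
    // permissions boundary for all IAM entities created within this stack
    const boundary = iam.ManagedPolicy.fromManagedPolicyName(
      this, 'Boundary', 'developer-policy');
    iam.PermissionsBoundary.of(this).apply(boundary);

    // ... define the rest of the stack's resources here
  }
}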

Cleanup

At this point you can either choose to remove the CDK bootstrap stack if you no longer require it, or remove the permission boundary from the stack. To remove it, delete the CDKToolkit stack from CloudFormation with this AWS CLI command.

aws cloudformation delete-stack --stack-name CDKToolkit

If you want to keep the bootstrap stack, you can remove the boundary by following these steps:

  1. Browse to the CloudFormation page in the AWS console, and select the CDKToolkit stack.
  2. Select the ‘Update’ button. Choose “Use Current Template” and then press ‘Next’
  3. On the parameters page, find the value InputPermissionsBoundary, which will have developer-policy as the value, and delete the text in this input to leave it blank. Press ‘Next’ and then on the following page, press ‘Next’ again
  4. On the final page, scroll to the bottom and check the box acknowledging that CloudFormation might create IAM resources with custom names, and choose ‘Submit’

With the permission boundary no longer being used, you can now remove the stack that created it as the final step.

aws cloudformation delete-stack --stack-name DeveloperPolicy

Conclusion

Now you can see how IAM permission boundaries can easily be integrated into CDK development, helping ensure developers have the control they need while administrators can ensure that security is managed in a way that meets the needs of the organisation as well.

With this being understood, there are next steps you can take to further expand on the use of permission boundaries. The CDK Security and Safety Developer Guide document on GitHub outlines these approaches, as well as ways to think about your approach to permissions on deployment. It’s recommended that developers and administrators review this, and work to develop an appropriate approach to permission policies that suits your security goals.

Additionally, the permission boundary concept can be applied in a multi-account model where each Stage has a unique boundary name applied. This can allow for scenarios where a lower-level environment (such as a development or beta environment) has more relaxed permission boundaries that suit troubleshooting and other developer-specific actions, while the higher-level environments (such as gamma or production) have more restricted permission boundaries to ensure that security risks are appropriately managed. The mechanism for implementing this is also defined in the security and safety developer guide.

About the authors:

Brian Farnhill

Brian Farnhill is a Software Development Engineer at AWS, helping public sector customers in APAC create impactful solutions running in the cloud. His background is in building solutions and helping customers improve DevOps tools and processes. When he isn’t working, you’ll find him either coding for fun or playing online games.

David Turnbull

David Turnbull is a Software Development Engineer at AWS, helping public sector customers in APAC create impactful solutions running in the cloud. He likes to comprehend new programming languages and has used this to stray out of his line. David writes computer simulations for fun.

Code conversion from Greenplum to Amazon Redshift: Handling arrays, dates, and regular expressions

Post Syndicated from Jagrit Shrestha original https://aws.amazon.com/blogs/big-data/code-conversion-from-greenplum-to-amazon-redshift-handling-arrays-dates-and-regular-expressions/

Amazon Redshift is a fully managed service for data lakes, data analytics, and data warehouses for startups, medium enterprises, and large enterprises. Amazon Redshift is used by tens of thousands of businesses around the globe for modernizing their data analytics platform.

Greenplum is an open-source, massively parallel database used for analytics, mostly for on-premises infrastructure. Greenplum is based on the PostgreSQL database engine.

Many customers have found migration to Amazon Redshift from Greenplum an attractive option instead of managing on-premises Greenplum for the following reasons:

Even though both Greenplum and Amazon Redshift use the open-source PostgreSQL database engine, migration still requires a lot of planning and manual intervention. This post covers the key functions and considerations while performing code conversion from Greenplum to Amazon Redshift. It is focused on the migration of procedures, functions, and views.

Solution overview

AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) can migrate most of the objects in a heterogeneous database migration from Greenplum to Amazon Redshift. But there are some situations where code conversion teams encounter errors and warnings for views, procedures, and functions while creating them in Amazon Redshift. To address this type of situation, manual conversion of the code is required.

This post focuses on how to handle the following while migrating from Greenplum to Amazon Redshift:

  • Arrays
  • Dates and timestamps
  • Regular expressions (regex)

Please note that for this post, we use Greenplum 4.3 and Amazon Redshift PostgreSQL 8.2.

Working with array functions

The AWS SCT doesn’t convert array functions while migrating from Greenplum or PostgreSQL to Amazon Redshift. Developers need to extensively convert those functions manually. This post outlines the most common array functions:

  • ARRAY_UPPER
  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT and JSON_ARRAY_LENGTH
  • UNNEST ()
  • STRING_AGG()
  • ANY ARRAY()

ARRAY_UPPER()

This function returns the upper bound of an array. It can be used to extract the nth element from an array in PostgreSQL or Greenplum.

The Greenplum code is as follows:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"','NULL'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['222-333-4444','201-301-4001','AAA-BBB-CCCC'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['333-444-5555','301-401-3001','DDD-EEE-FFFF'] as PhoneNumbers
)
Select Firstname, PhoneNumbers[ARRAY_UPPER(PhoneNumbers,1)] from temp1

There is no function to extract an element from an array in Amazon Redshift; however, there are two JSON functions that can be used for this purpose:

  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT() – Returns a JSON array element in the outermost array of a JSON string
  • JSON_ARRAY_LENGTH() – Returns the number of elements in the outer array of a JSON string

See the following code:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['"222-333-4444"','"201-301-4001"','"AAA-BBB-CCCC"'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['"333-444-5555"','"301-401-3001"','"DDD-EEE-FFFF"'] as PhoneNumbers
)

Select
FirstName
,('['+array_to_string(phoneNumbers,',')+']') as JSONConvertedField
,JSON_EXTRACT_ARRAY_ELEMENT_TEXT
(
'['+array_to_string(phoneNumbers,',')+']'
,JSON_ARRAY_LENGTH('['+array_to_string(phoneNumbers,',')+']')-1
) as LastElementFromArray
from temp1

UNNEST()

UNNEST() is PostgreSQL’s system function for semi-structured data, expanding an array, or a combination of arrays, to a set of rows. It was introduced to improve database performance when handling thousands of records for inserts, updates, and deletes.

You can use UNNEST() for basic array, multiple arrays, and multiple arrays with different lengths.

Some of Amazon Redshift functions used to unnest arrays are split_part, json_extract_path_text, json_array_length, and json_extract_array_element_text.

In Greenplum, the UNNEST function is used to expand an array to a set of rows:

Select 'A', unnest(array[1,2])

Output
A 1
A 2

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
)

select
FirstName
,LastName
,unnest(array['Mobile'::text,'HomePhone'::text]) as PhoneType
,unnest(array[MobilePhone::text,HomePhone::text]) as PhoneNumber
from
temp1
order by 1,2,3

Amazon Redshift doesn’t support the UNNEST function; you can use the following workaround:

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
),
ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select
FirstName
,LastName
,split_part('Mobile,Home',',',ns.n::int) as PhoneType
,split_part(MobilePhone|| '&&' || HomePhone, '&&', ns.n::int) as PhoneNumber
from
temp1, ns
where
ns.n<=regexp_count('Mobile,Home',',')+1
order by 1,2,3

When the element of the array is itself an array, use the JSON_EXTRACT_ARRAY_ELEMENT_TEXT() function and JSON_ARRAY_LENGTH:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["arrayelement1","arrayelement2"]',ns.n-1)
from ns
where
ns.n<=JSON_ARRAY_LENGTH('["arrayelement1","arrayelement2"]')

STRING_AGG()

The STRING_AGG() function is an aggregate function that concatenates a list of strings and places a separator between them. The function doesn’t add the separator at the end of the string. See the following code:

STRING_AGG ( expression, separator [order_by_clause] )

The Greenplum code is as follows:

with temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)
Select dept,STRING_AGG(FirstName||' '||LastName,' ; ') as Employees from temp1 group by dept order by 1

The Amazon Redshift equivalent for the STRING_AGG() function is LISTAGG(). This aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string:

LISTAGG(expression, separator [order_by_clause])

See the following code:

Create temporary Table temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)

Select dept,LISTAGG(FirstName||' '||LastName,' ; ') as Employees from temp1
group by dept
order by 1

ANY ARRAY()

The PostgreSQL ANY ARRAY() function evaluates and compares the left-hand expression to each element in the array:

Select * from temp1 where DeptName = ANY ARRAY('10-F','20-F','30-F')

In Amazon Redshift, the evaluation can be achieved with an IN operator:

Select * from temp1 where DeptName IN ('10-F','20-F','30-F')

Working with date functions

In this section, we discuss calculating the difference between date_part for Greenplum and datediff for Amazon Redshift.

When the application needs to calculate the difference between the subfields of dates for Greenplum, it uses the function date_part, which allows you to retrieve subfields such as year, month, week, and day. In the following example queries, we calculate the number of completion_days by calculating the difference between originated_date and eco_date.

To calculate the difference between the subfields of the date, Amazon Redshift has the function datediff. The following queries show an example of how to calculate the completion_days as the difference between eco_date and orginated_date. DATEDIFF determines the number of date part boundaries that are crossed between the two expressions.

We compare the Greenplum and Amazon Redshift queries as follows:

  • Difference by year

The following Greenplum query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT date_part('year', TIMESTAMP '2009-01-01') - date_part('year', TIMESTAMP '2008-12-31') as year;

The following Amazon Redshift query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT datediff(year, '2008-12-31', '2009-01-01') as year;
  • Difference by month

The following Greenplum query returns 1 month between 2009-01-01 and 2008-12-31:

SELECT (date_part('year', '2009-01-01' :: date) - date_part('year', '2008-12-31' :: date)) * 12 + (date_part('month', '2009-01-01' :: date) - date_part('month', '2008-12-31' :: date)) as month;

The following Amazon Redshift query returns 1 month between 2009-01-01 and 2008-12-31:

SELECT datediff(month, '2008-12-31', '2009-01-01') as month;
  • Difference by week

The following Greenplum query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT date_part('week', timestamp '2009-01-01') - date_part('week', timestamp '2008-12-31') as week;

The following Amazon Redshift query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT datediff(week, '2008-12-31', '2009-01-01') as week;
  • Difference by day

The following Greenplum query returns 1 day:

SELECT date_part('day', '2009-01-01 24:00:00' :: timestamp - '2008-12-31 24:00:00' :: timestamp) as day;

The following Amazon Redshift query returns 1 day:

SELECT datediff(day, '2008-12-31', '2009-01-01') as day;
  • Difference by hour

The following Greenplum query returns 1 hour:

SELECT date_part('hour', '2009-01-01 22:56:10' :: timestamp - '2008-12-31 21:54:55' :: timestamp) as hour;

The following Amazon Redshift query returns 1 hour:

SELECT datediff(hour, '2009-01-01 21:56:10', '2009-01-01 22:56:10') as hour;
  • Difference by minute

The following Greenplum query returns 3 minutes:

SELECT date_part('minute', '2009-01-01 22:56:10' :: timestamp - '2009-01-01 21:53:10' :: timestamp) as minutes;

The following Amazon Redshift query returns 1 minute:

SELECT datediff(minute, '2009-01-01 21:56:10', '2009-01-01 21:57:55') as minute;
  • Difference by second

The following Greenplum query returns 40 seconds:

SELECT date_part('second', '2009-01-01 22:56:50' :: timestamp - '2009-01-01 21:53:10' :: timestamp) as seconds;

The following Amazon Redshift query returns 45 seconds:

SELECT datediff(second, '2009-01-01 21:56:10', '2009-01-01 21:56:55') as seconds;

Now let’s look at how we use Amazon Redshift to calculate days and weeks in seconds.

The following Amazon Redshift query displays 2 days:

SELECT datediff(second, '2008-12-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24) as days;

The following Amazon Redshift query displays 9 weeks:

SELECT datediff(second, '2008-10-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24*7) as weeks;

For Greenplum, the date subfields need to be in single quotes, whereas for Amazon Redshift, we can use date subfields such as year, month, week, day, minute, and second without quotes. For Greenplum, we have to subtract one date_part expression from another, whereas for Amazon Redshift we pass the two dates as comma-separated arguments.

Extract ISOYEAR from date

ISO 8601 defines a week-numbering year. It begins with the Monday of the week containing the 4th of January, so for dates in early January or late December, the ISO year may be different from the Gregorian year. An ISO year has 52 or 53 full weeks (364 or 371 days). The extra week is called a leap week; a year with such a week is called a leap year.

The following Greenplum query displays the ISOYEAR 2020:

SELECT extract(ISOYEAR from '2019-12-30' :: date) as ISOYEARS;

The following Amazon Redshift query displays the ISOYEAR 2020:

SELECT to_char('2019-12-30' :: date, 'IYYY') as ISOYEARS;

The generate_series() function

Greenplum has adopted the PostgreSQL function generate_series(). But the generate_series function works differently with Amazon Redshift while retrieving records from the table because it’s a leader node-only function.

To display a series of numbers in Amazon Redshift, run the following query on the leader node. In this example, it displays 10 rows, numbered 1–10:

SELECT generate_series(1,10);

To display a series of days for a given date, use the following query. It extracts the day from the given date and subtracts 1, to display a series of numbers from 0–6:

SELECT generate_series(0, extract(day from date '2009-01-07') :: int - 1);

But for queries that fetch records from a table, join with another table's rows, and process data on the compute nodes, it doesn't work, and generates an Invalid Operation error message. The following code is an example of a SQL statement that works for Greenplum but fails for Amazon Redshift:

SELECT column_1
FROM table_1 t1
JOIN table_2 t2
ON t2.code = t1.code
CROSS JOIN generate_series(1,12) gen(fiscal_month)
WHERE condition_1

For Amazon Redshift, the solution is to create a table to store the series data, and rewrite the code as follows:

SELECT column1
FROM table_t1 t1
JOIN table_t2 t2
ON t2.code = t1.code
CROSS JOIN (select "number" as fiscal_month FROM table_t3 WHERE "number"<=12) gen
WHERE condition_1
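
One way to populate a numbers table like table_t3 (with a "number" column, as referenced in the rewritten query above) is to derive it from row_number() over a system table, similar to the earlier UNNEST workaround. This is only a sketch; size the table to the largest series you need:

create table table_t3 as
select row_number() over(order by 1) as "number"
from pg_tables
limit 1000;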

Working with regular expressions (regex functions)

Amazon Redshift and Greenplum both support three conditions for pattern matching:

  • LIKE
  • SIMILAR TO
  • POSIX operators

In this post, we don’t discuss all of these pattern matching options in detail. Instead, we discuss a few regex functions and regex escape characters that aren’t supported by Amazon Redshift.

Regexp_split_to_table function

The regexp_split_to_table function splits a string using a POSIX regular expression pattern as the delimiter.

This function has the following syntax:

Regexp_split_to_table(string,pattern [,flags])

For Greenplum, we use the following query:

select regexp_split_to_table('bat,cat,hat', '\,') as regexp_split_table_GP

For Amazon Redshift, the regexp_split_to_table function has to be converted using the Amazon Redshift split_part function:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)
Select split_part('bat,cat,hat', ',', ns.n::int) as regexp_split_table_rs
from ns
where ns.n <= regexp_count('bat,cat,hat', ',') + 1

Another way to convert regexp_split_to_table is as follows:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)
Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["' || replace('bat,cat,hat', ',', '","') || '"]', ns.n-1) as regexp_split_table_rs
from ns
where ns.n <= JSON_ARRAY_LENGTH('["' || replace('bat,cat,hat', ',', '","') || '"]')

Substring from regex expressions

Substring (the string from the regex pattern) extracts the substring or value matching the pattern that is passed in. If there is no match, null is returned. For more information, refer to Pattern Matching.

We use the following code in Greenplum:

create temp table data1 ( col1 varchar );
insert into data1 values ('hellohowareyou 12\687687abcd');
select substring( col1 from '[A-Za-z]+$') from data1;

We can use the regexp_substr function to convert this code to Amazon Redshift. It returns the characters extracted from a string by searching for a regular expression pattern. The syntax is as follows:

REGEXP_SUBSTR ( source_string, pattern [, position [, occurrence [, parameters ] ] ] )
select regexp_substr( col1, '[A-Za-z]+$') as substring_from_rs from data1

Key points while converting regular expression escapes

The Postgres escape character E doesn’t work in Amazon Redshift. Additionally, the following Greenplum regular expression constraints aren’t supported in Amazon Redshift:

  • \m – Matches only at the beginning of a word
  • \y – Matches only at the beginning or end of a word

For Amazon Redshift, use \\< and \\>, or [[:<:]] and [[:>:]] instead.

Use the following code for Greenplum:

select col1,
case
when (col1) ~ E'\\m[0-9]{2}[A-Z]{1}[0-9]{1}' then
regexp_replace(col1, E'([0-9]{2})([A-Z]{1})([0-9]{1})', E'\\2')
else 'nothing'
end as regex_test
from temp1123

Use the following code for Amazon Redshift:

select col1,
case
when (col1) ~ '\\<[0-9]{2}[A-Z]{1}[0-9]{1}\\>' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1})','\\2')
else 'nothing'
end as regex_test
from temp1123

OR

select col1,
case
when (col1) ~ '[[:<:]][0-9]{2}[A-Z]{1}[0-9]{1}[[:>:]]' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1})','\\2')
else 'nothing'
end as regex_test
from temp1123

Conclusion

For a heterogeneous database migration from Greenplum to Amazon Redshift, you can use AWS DMS and the AWS SCT to migrate most of the database objects, such as tables, views, stored procedures, and functions.

There are some situations in which a function is used in the source environment and the target environment doesn’t support the same function. In this case, manual conversion is required to produce the same result set and complete the database migration.

In some cases, use of a new window function supported by the target environment proves more efficient for analytical queries to process petabytes of data.

This post covered several situations where manual code conversion is required, which can also improve code efficiency and make queries more efficient.

If you have any questions or suggestions, please share your feedback.


About the Authors

Jagrit Shrestha is a Database Consultant at Amazon Web Services (AWS). He works as a database specialist, helping customers migrate their on-premises database workloads to AWS and providing technical guidance.

Ishwar Adhikary is a Database Consultant at Amazon Web Services (AWS). He works closely with customers to modernize their database and application infrastructures. His focus area is the migration of relational databases from on-premises data centers to the AWS Cloud.

Shrenik Parekh works as a Database Consultant at Amazon Web Services (AWS). His expertise is in database migration assessment and database migration, and in modernizing database environments with purpose-built databases using AWS Cloud database services. He is also focused on AWS services for data analytics. In his spare time, he loves hiking, yoga, and other outdoor activities.

Santhosh Meenhallimath is a Data Architect at AWS. He works on building analytical solutions, building data lakes, and migrating databases into AWS.

Build a search application with Amazon OpenSearch Serverless

Post Syndicated from Aish Gunasekar original https://aws.amazon.com/blogs/big-data/build-a-search-application-with-amazon-opensearch-serverless/

In this post, we demonstrate how to build a simple web-based search application using the recently announced Amazon OpenSearch Serverless, a serverless option for Amazon OpenSearch Service that makes it easy to run petabyte-scale search and analytics workloads without having to think about clusters. The benefit of using OpenSearch Serverless as a backend for your search application is that it automatically provisions and scales the underlying resources based on the search traffic demands, so you don’t have to worry about infrastructure management. You can simply focus on building your search application and analyzing the results. OpenSearch Serverless is powered by the open-source OpenSearch project, which consists of a search engine, and OpenSearch Dashboards, a visualization tool to analyze your search results.

Solution overview

There are many ways to build a search application. In our example, we create a simple JavaScript front end and call Amazon API Gateway, which triggers an AWS Lambda function upon receiving user queries. As shown in the following diagram, API Gateway acts as a broker between the front end and the OpenSearch Serverless collection. When the user queries the front-end webpage, API Gateway passes requests to the Python Lambda function, which runs the queries on the OpenSearch Serverless collection and returns the search results.

To get started with the search application, you must first upload the relevant dataset, a movie catalog in this case, to the OpenSearch collection and index it to make it searchable.

Create a collection in OpenSearch Serverless

A collection in OpenSearch Serverless is a logical grouping of one or more indexes that represent a workload. You can create a collection using the AWS Management Console or AWS Software Development Kit (AWS SDK). Follow the steps in Preview: Amazon OpenSearch Serverless – Run Search and Analytics Workloads without Managing Clusters to create and configure a collection in OpenSearch Serverless.
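
If you prefer the AWS CLI to the console, a collection can be created with a command along the lines of the following sketch (the collection name movie-search matches the one used later in this post; note that an encryption security policy covering the collection name needs to exist before the collection becomes active):

aws opensearchserverless create-collection --name movie-search --type SEARCH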

Create an index and ingest data

After your collection is created and active, you can upload the movie data to an index in this collection. Indexes hold documents, and each document in this example represents a movie record. Documents are comparable to rows in the database table. Each document (the movie record) consists of 10 fields that are typically searched for in a movie catalog, like the director, actor, release date, genre, title, or plot of the movie. The following is a sample movie JSON document:

{
"directors": ["David Yates"],
"release_date": "2011-07-07T00:00:00Z",
"rating": 8.1,
"genres": ["Adventure", "Family", "Fantasy", "Mystery"],
"plot": "Harry, Ron and Hermione search for Voldemort's remaining Horcruxes in their effort to destroy the Dark Lord.",
"title": "Harry Potter and the Deathly Hallows: Part 2",
"rank": 131,
"running_time_secs": 7800,
"actors": ["Daniel Radcliffe", "Emma Watson", "Rupert Grint"],
"year": 2011
}

For the search catalog, you can upload the sample-movies.bulk dataset sourced from the Internet Movies Database (IMDb). OpenSearch Serverless offers the same ingestion pipeline and clients to ingest the data as OpenSearch Service, such as Fluentd, Logstash, and Postman. Alternatively, you can use the OpenSearch Dashboards Dev Tools to ingest and search the data without configuring any additional pipelines. To do so, log in to OpenSearch Dashboards using your SAML credentials and choose Dev tools.

To create a new index, use the PUT command followed by the index name:

PUT movies-index

A confirmation message is displayed upon successful creation of your index.

After the index is created, you can ingest documents into the index. OpenSearch provides the option to ingest multiple documents in one request using the _bulk request. Enter POST /_bulk in the left pane as shown in the following screenshot, then copy and paste the contents of the sample-movies.bulk file you downloaded earlier.
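
Each document in the bulk payload is preceded by an action line that names the target index. The following is a short illustrative excerpt of the bulk format (not the actual contents of sample-movies.bulk):

POST /_bulk
{ "index": { "_index": "movies-index" } }
{ "title": "Harry Potter and the Deathly Hallows: Part 2", "year": 2011, "rating": 8.1 }
{ "index": { "_index": "movies-index" } }
{ "title": "The Shawshank Redemption", "year": 1994, "rating": 9.3 }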

You have successfully created the movies index and uploaded 1,500 records into the catalog! Now let’s integrate the movie catalog with your search application.

Integrate the Lambda function with an OpenSearch Serverless endpoint

In this step, you create a Lambda function that queries the movie catalog in OpenSearch Serverless and returns the result. For more information, see our tutorial on creating a Lambda function for connecting to and querying an OpenSearch Service domain. You can reuse the same code by replacing the parameters to align to OpenSearch Serverless’s requirements. Replace <my-region> with your corresponding region (for example, us-west-2), use aoss instead of es for service, replace <hostname> with the OpenSearch collection endpoint, and <index-name> with your index (in this case, movies-index).

The following is a snippet of the Lambda code. You can find the complete code in the tutorial.

import boto3
import json
import requests
from requests_aws4auth import AWS4Auth

region = '<my-region>'
service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

host = '<hostname>' 
# The OpenSearch collection endpoint 
index = '<index-name>'
url = host + '/' + index + '/_search'

# Lambda execution starts here
def lambda_handler(event, context):

This Lambda function returns a list of movies based on a search string (such as movie title, director, or actor) provided by the user.
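
The body of the handler is elided above; a minimal sketch of it, following the pattern from the linked tutorial and reusing the url and awsauth variables defined in the snippet, could look like the following:

def lambda_handler(event, context):
    # Build the query DSL from the user's search string (the full query is shown later in this post)
    query = {
        "size": 25,
        "query": {
            "multi_match": {
                "query": event['queryStringParameters']['q'],
                "fields": ["title", "plot", "actors"]
            }
        }
    }
    headers = {"Content-Type": "application/json"}

    # Sign the request with SigV4 credentials for the aoss service and run the search
    r = requests.get(url, auth=awsauth, headers=headers, data=json.dumps(query))

    # Return the raw OpenSearch response to API Gateway, allowing CORS for the web front end
    return {
        "statusCode": 200,
        "headers": {
            "Access-Control-Allow-Origin": "*",
            "Access-Control-Allow-Headers": "Content-Type"
        },
        "isBase64Encoded": False,
        "body": r.text
    }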

Next, you need to configure the permissions in OpenSearch Serverless’s data access policy to let the Lambda function access the collection.

  1. On the Lambda console, navigate to your function.
  2. On the Configuration tab, in the Permissions section, under Execution role, copy the value for Role name.
  3. Add this role name as one of the principals of your movie-search collection’s data access policy.

Principals can be AWS Identity and Access Management (IAM) users, role ARNs, or SAML identities. These principals must be within the current AWS account.

After you add the role name as a principal, you can see the role ARN updated in your rule, as shown in the following screenshot.

Now you can grant collection and index permissions to this principal.

For more details about data access policies, refer to Data access control for Amazon OpenSearch Serverless. Skipping this step or not running it correctly will result in permission errors, and your Lambda code won’t be able to query the movie catalog.
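If you prefer to manage this permission programmatically instead of through the console, the following is a minimal sketch using boto3; the policy name, account ID, and role name are placeholders, and the permissions shown are limited to what the search Lambda function needs (describing the collection and index and reading documents):

import boto3
import json

aoss = boto3.client('opensearchserverless')

policy_rules = [{
    "Rules": [
        {
            "ResourceType": "collection",
            "Resource": ["collection/movie-search"],
            "Permission": ["aoss:DescribeCollectionItems"]
        },
        {
            "ResourceType": "index",
            "Resource": ["index/movie-search/*"],
            "Permission": ["aoss:DescribeIndex", "aoss:ReadDocument"]
        }
    ],
    "Principal": ["arn:aws:iam::<account-id>:role/<lambda-execution-role-name>"]
}]

aoss.create_access_policy(
    name='movie-search-lambda-access',  # hypothetical policy name
    type='data',
    policy=json.dumps(policy_rules),
    description='Read access for the search Lambda function'
)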

Configure API Gateway

API Gateway acts as a front door for applications to access the code running on Lambda. To create, configure, and deploy the API for the GET method, refer to the steps in the tutorial. For API Gateway to pass the requests to the Lambda function, configure it as a trigger to invoke the Lambda function.
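The console adds the required resource-based permission to the Lambda function when you configure the trigger. If you script the setup instead, a minimal sketch of that permission looks like the following (the function name, statement ID, and source ARN are placeholders):

import boto3

lambda_client = boto3.client('lambda')

lambda_client.add_permission(
    FunctionName='<your-search-function>',
    StatementId='apigateway-invoke',
    Action='lambda:InvokeFunction',
    Principal='apigateway.amazonaws.com',
    SourceArn='arn:aws:execute-api:<region>:<account-id>:<api-id>/*/GET/*'
)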

The next step is to integrate it with the front end.

Test the web application

To build the front-end UI, you can download the following sample JavaScript web service. Open the scripts/search.js file and update the apigatewayendpoint variable to point to your API Gateway endpoint:

var apigatewayendpoint = 'https://kxxxxxxzzz.execute-api.us-west-2.amazonaws.com/opensearch-api-test/';
// Update this variable to point to your API Gateway endpoint.

You can access the front-end application by opening index.html in your browser. When the user runs a query on the front-end application, it calls API Gateway and Lambda to serve up the content hosted in the OpenSearch Serverless collection.
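You can also exercise the API directly to confirm the round trip before testing in the browser. The following is a minimal sketch; the endpoint value is illustrative, and depending on how you configured the API’s resource path, you may need to append a resource (for example, /search) before the query string:

import requests

apigatewayendpoint = 'https://kxxxxxxzzz.execute-api.us-west-2.amazonaws.com/opensearch-api-test/'
response = requests.get(apigatewayendpoint, params={'q': 'Harry Potter'})
print(response.status_code)
print(response.json())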

When you search the movie catalog, the Lambda function runs the following query:

    # Put the user query into the query DSL for more accurate search results.
    # Certain fields can be boosted (^) to weight their influence on relevance.
    query = {
        "size": 25,
        "query": {
            "multi_match": {
                "query": event['queryStringParameters']['q'],
                "fields": ["title", "plot", "actors"]
            }
        }
    }

The query returns documents based on a provided query string. Let’s look at the parameters used in the query:

  • size – The size parameter is the maximum number of documents to return. In this case, a maximum of 25 results is returned.
  • multi_match – You use a match query when matching larger pieces of text, especially when you’re using OpenSearch’s relevance to sort your results. With a multi_match query, you can query across multiple fields specified in the query.
  • fields – The list of fields you are querying.

In a search for “Harry Potter,” the document with the matching term both in the title and plot fields appears higher than other documents with the matching term only in the title field.
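If you want to weight the fields explicitly, the caret (^) syntax mentioned in the code comment is one option. The following is a variation of the same query (not part of the original function) that counts title matches four times, and plot matches twice, as much as actor matches:

    query = {
        "size": 25,
        "query": {
            "multi_match": {
                "query": event['queryStringParameters']['q'],
                "fields": ["title^4", "plot^2", "actors"]
            }
        }
    }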

Congratulations! You have configured and deployed a search application fronted by API Gateway, running Lambda functions for the queries served by OpenSearch Serverless.

Clean up

To avoid unwanted charges, delete the OpenSearch Serverless collection, Lambda function, and API Gateway API that you created.

Conclusion

In this post, you learned how to build a simple search application using OpenSearch Serverless. With OpenSearch Serverless, you don’t have to worry about managing the underlying infrastructure. OpenSearch Serverless supports the same ingestion and query APIs as the OpenSearch Project. You can quickly get started by ingesting the data into your OpenSearch Serverless collection, and then perform searches on the data using your web interface.

In subsequent posts, we dive deeper into many other search queries and features that you can use to make your search application even more effective.

We would love to hear how you are building your search applications today. If you’re just getting started with OpenSearch Serverless, we recommend getting hands-on with the Getting started with Amazon OpenSearch Serverless workshop.


About the authors

Aish Gunasekar is a Specialist Solutions Architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and support them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Pavani Baddepudi is a senior product manager working in search services at AWS. Her interests include distributed systems, networking, and security.

How Contino improved collaboration with Amazon CodeCatalyst

Post Syndicated from Chetan Makvana original https://aws.amazon.com/blogs/devops/how-contino-improved-collaboration-with-amazon-codecatalyst/

Amazon CodeCatalyst is a modern software development service that empowers teams to deliver software on AWS easily and quickly. CodeCatalyst provides one place where you can plan, code, build, test, and deploy applications with continuous integration/continuous delivery (CI/CD) tools. It also helps streamline team collaboration. Developers on modern software teams are usually distributed, work independently, and use disparate tools. Often, ad hoc collaboration is necessary to resolve problems. Today, developers are forced to do this across many tools, which distracts them from their primary task: adding business-critical features and enhancing their quality and completeness.

In this post, we explain how Contino uses CodeCatalyst to onboard their engineering team onto new projects, eliminate the overhead of managing disparate tools, and streamline collaboration among different stakeholders.

The Problem

Contino helps customers migrate their applications to the cloud, and then improves their architecture by taking full advantage of cloud-native features to improve agility, performance, and scalability. This usually involves the build-out of a central landing zone platform. A landing zone is a set of standard building blocks that allows customers to automatically create accounts, infrastructure, and environments that are pre-configured in line with security policies, compliance guidelines, and cloud-native best practices. Some features are common to most landing zones, for example creating secure container images, AMIs, and environment setup boilerplate. In order to provide maximum value to customers, Contino develops in-house versions of such features, incorporating AWS best practices, and later rolls them out to the customer’s environment with some customization. Contino’s technical consultants who are not currently assigned to customer work, collectively known as ‘Squad 0’, work on these features. Squad 0 builds the foundation for the work that will be re-used by other squads that work directly with Contino’s customers. As the technical consultants are typically on Squad 0 for a short period, it is critical that they can be productive in this short time, without spending too much time getting set up.

To build these foundational services, Contino was looking for something more integrated that would allow them to quickly set up development environments, enable collaboration between Squad 0 members, invite other squads to validate foundational services usage for their respective customers, and provide access to different AWS accounts and git repos centrally from one place. Historically, Contino has used disparate tools to achieve this, which meant having to grant and revoke access to the various AWS accounts individually on a continual basis. With these disparate tools, granting access to the tools needed for squads to be productive was non-trivial.

The Solution

It was at this point Contino participated in the private beta for CodeCatalyst prior to the public preview. CodeCatalyst has allowed Contino to move to a structure, as shown in Figure 1 below. A Project Manager at Contino creates a different project for each foundational service and invites Squad 0 members to join the relevant project. With CodeCatalyst, Squad 0 technical consultants use features like CI/CD, source repositories, and issue trackers to build foundational services. This helps eliminate the overhead of managing and integrating developer tools and provides more time to focus on developing code. Once Squad 0 is ready with the foundational services, they invite customer squads using their email address to validate the readiness of the project for use with their customers. Finally, members of Squad 0 use Cloud9 Dev Environments from within CodeCatalyst to rapidly create consistent cloud development environments, without manual configuration, so they can work on new or multiple projects simultaneously, without conflict.

Figure 1: CodeCatalyst with multiple account connections

Contino uses CI/CD to conduct multi-account deployments. Contino typically does one of two types of deployments: 1. Traditional sequential application deployment that is promoted from one environment to another, for example dev -> test -> prod, and 2. Parallel deployment, for example, a security control that is required to be deployed out into multiple AWS accounts at the same time. CodeCatalyst solves this problem by making it easier to construct workflows using a workflow definition file that can deploy either sequentially or in parallel to multiple AWS accounts. Figure 2 shows parallel deployment.

CodeCatalyst provides a feature to add CI/CD pipeline for Dev, Test and Production accounts

Figure 2: CI/CD with CodeCatalyst

The Value

CodeCatalyst has reduced the time it takes for members of Squad 0 to complete the necessary on-boarding to work on foundational services from 1.5 days to about 1 hour. These tasks include setting up connections to source repositories, setting up development environments, configuring IAM roles and trust relationships, and so on. With support for integrated tools and better collaboration, CodeCatalyst minimized the overhead of ad hoc collaboration. Squad 0 can spend more time writing code to build foundational services. This has led to tasks being completed, on average, 20% faster. This increased productivity led to increased value delivered to Contino’s customers. As Squad 0 is more productive, more foundational services are available for other squads to reuse for their respective customers. Now, Contino’s teams on the ground working directly with customers can re-use these services with some customization for the specific needs of the customer.

Conclusion

Amazon CodeCatalyst brings together everything software development teams need to plan, code, build, test, and deploy applications on AWS into a streamlined, integrated experience. With CodeCatalyst, developers can spend more time developing application features and less time setting up project tools, creating and managing CI/CD pipelines, provisioning and configuring various development environments, or coordinating with team members. With CodeCatalyst, the Contino engineers can improve productivity and focus on rapidly developing application code that captures business value for their customers.

About the authors:

Mark Faiers

Mark Faiers started out as a software engineer and later transitioned into DevOps, and Cloud. He has worked across numerous technology stacks and industries, including Healthcare, FinTech, and Logistics. Mark is currently working as an AWS consultant to some of the biggest Financial and Insurance firms in the U.K., as well as running the AWS Practice at Contino. He is especially passionate about serverless, and sustainability.

Chetan Makvana

Chetan Makvana is a senior solutions architect working with global systems integrators at AWS. He works with AWS partners and customers to provide them with architectural guidance for building scalable architecture and execute strategies to drive adoption of AWS services. He is a technology enthusiast and a builder with a core area of interest on serverless and DevOps. Outside of work, he enjoys binge-watching, traveling and music.

Accelerate your data exploration and experimentation with the AWS Analytics Reference Architecture library

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/accelerate-your-data-exploration-and-experimentation-with-the-aws-analytics-reference-architecture-library/

Organizations use their data to solve complex problems by starting small, running iterative experiments, and refining the solution. Although the power of experiments can’t be ignored, organizations have to be cautious about the cost-effectiveness of such experiments. If time is spent creating the underlying infrastructure for enabling experiments, it further adds to the cost.

Developers need an integrated development environment (IDE) for data exploration and debugging of workflows, and different compute profiles for running these workflows. If you choose Amazon EMR for such use cases, you can use an IDE called Amazon EMR Studio for data exploration, transformation, version control, and debugging, and run Spark jobs to process large volumes of data. Deploying Amazon EMR on Amazon EKS simplifies management, reduces costs, and improves performance. However, a data engineer or IT administrator needs to spend time creating the underlying infrastructure, configuring security, and creating a managed endpoint for users to connect to. This means such projects have to wait until these experts create the infrastructure.

In this post, we show how a data engineer or IT administrator can use the AWS Analytics Reference Architecture (ARA) to accelerate infrastructure deployment, saving your organization both time and money spent on these data analytics experiments. We use the library to deploy an Amazon Elastic Kubernetes Service (Amazon EKS) cluster, configure it to use Amazon EMR on EKS, and deploy a virtual cluster, managed endpoints, and EMR Studio. You can then either run jobs on the virtual cluster or run exploratory data analysis with Jupyter notebooks on Amazon EMR Studio and Amazon EMR on EKS. The following architecture represents the infrastructure you will deploy with the AWS Analytics Reference Architecture.

cdk-emr-eks-studio-architecture

Prerequisites

To follow along, you need to have an AWS account that is bootstrapped with the AWS Cloud Development Kit (AWS CDK). For instructions, refer to Bootstrapping. The following tutorial uses TypeScript, and requires version 2 or later of the AWS CDK. If you don’t have the AWS CDK installed, refer to Install the AWS CDK.

Set up an AWS CDK project

To deploy resources using the ARA, you first need to set up an AWS CDK project and install the ARA library. Complete the following steps:

  1. Create a folder named emr-eks-app:
    mkdir emr-eks-app && cd emr-eks-app

  2. Initialize an AWS CDK project in an empty directory and run the following command:
    cdk init app --language typescript

  3. Install the ARA library:
    npm install aws-analytics-reference-architecture --save

  4. In lib/emr-eks-app.ts, import the ARA library as follows. The first line imports the ARA library; the second imports the AWS Identity and Access Management (IAM) module used to define policies:
    import * as ara from 'aws-analytics-reference-architecture'; 
    import * as iam from 'aws-cdk-lib/aws-iam';

Create and define an EKS cluster and compute capacity

To create an EMR on EKS virtual cluster, you first need to deploy an EKS cluster. The ARA library defines a construct called EmrEksCluster. The construct provisions an EKS cluster, enables IAM roles for service accounts, and deploys a set of supporting controllers, such as the certificate manager controller (needed by the managed endpoint used by Amazon EMR Studio) and a cluster autoscaler, so the cluster is elastic and saves cost when no jobs are submitted to it.

In lib/emr-eks-app.ts, add the following line:

const emrEks = ara.EmrEksCluster.getOrCreate(this, {
   eksAdminRoleArn: ROLE_ARN,
   eksClusterName: CLUSTER_NAME,
   autoscaling: ara.Autoscaler.KARPENTER,
});

To learn more about the properties you can customize, refer to EmrEksClusterProps. The EmrEksCluster construct has two mandatory parameters. The first, eksAdminRoleArn, is the role you use to interact with the Kubernetes control plane; it must have administrative permissions to create or update the cluster. The second, autoscaling, selects the autoscaling mechanism, either Karpenter or the native Kubernetes Cluster Autoscaler. In this post we use Karpenter, which we recommend for its faster autoscaling and simplified node management and provisioning. Now you’re ready to define the compute capacity.

One way to define worker nodes in Amazon EKS is to use managed node groups. We use one node group called tooling, which hosts CoreDNS, the ingress controller, the certificate manager, Karpenter, and any other pod that is necessary for running EMR on EKS jobs or managed endpoints. We also define default Karpenter Provisioners that define the capacity to be used for jobs submitted by EMR on EKS. These Provisioners are optimized for different Spark use cases (critical jobs, non-critical jobs, experimentation, and interactive sessions). The construct also allows you to submit your own Provisioner, defined by a Kubernetes manifest, through a method called addKarpenterProvisioner. Let’s discuss the predefined Provisioners.

Default Provisioners configurations

The default provisioners are set for rapid experimentation and are always created by default. However, if you don’t want to use them, you can set the defaultNodeGroups parameter to false in the EmrEksCluster properties at creation time. The Provisioners are defined as follows and are created in each of the subnets that are used by Amazon EKS:

  • Critical provisioner – This Provisioner is dedicated to supporting jobs with aggressive SLAs that are time sensitive. It uses On-Demand Instances, which aren’t stopped, unlike Spot Instances, and their lifecycle follows that of the job. The nodes use instance stores (NVMe disks physically attached to the host) that offer high I/O throughput and allow better Spark performance, because they’re used as temporary storage for disk spill and shuffle. The instance types used in the node are of the m6gd family. The instances use the AWS Graviton processor, which offers better price/performance than x86 processors. To use this provisioner in your jobs, you reference its sample configuration in the configuration override of the EMR on EKS job submission, as shown later in this post.
  • Non-critical provisioner – This Provisioner leverages Spot Instances to save costs for jobs that aren’t time sensitive or that are used for experiments. Because these jobs aren’t critical, they can tolerate interruption if an instance is reclaimed. The instance types used in the node are of the m6gd family; the driver runs On-Demand and the executors run on Spot Instances.
  • Notebook provisioner – This Provisioner is for running the managed endpoints that are used by Amazon EMR Studio for data exploration with Amazon EMR on EKS. The instances are of the t3 family and are On-Demand for the driver and Spot Instances for the executors to keep the cost low. If the executor instances are stopped, new ones are started by Karpenter. If the executor instances are stopped too often, you can define your own Provisioner that uses On-Demand Instances.

The following link provides more details about how each of the Provisioners is defined. One important property of the default Provisioners is that one is created for each Availability Zone. This is important because it allows you to reduce inter-AZ network transfer cost when Spark runs a shuffle.

For this post, we use the default Provisioners, so you don’t need to add any lines of code for this section. If you want to add your own Provisioners, you can use the addKarpenterProvisioner method to apply your own manifests. You can use helper methods in the Utils class, like readYamlDocument to read a YAML document and loadYaml to load YAML files, and pass them as arguments to the addKarpenterProvisioner method.

Deploy the virtual cluster and an execution role

A virtual cluster is a Kubernetes namespace that Amazon EMR is registered with; when you submit a job, the driver and executor pods run in the associated namespace. The EmrEksCluster construct offers a method called addEmrVirtualCluster, which creates the virtual cluster for you. The method takes EmrVirtualClusterOptions as a parameter, which has the following attributes:

  • name – The name of your virtual cluster.
  • createNamespace – An optional field that creates the EKS namespace. This is of type Boolean and by default it doesn’t create a separate EKS namespace, so your virtual cluster is created in the default namespace.
  • eksNamespace – The name of the EKS namespace to be linked with the virtual EMR cluster. If no namespace is supplied, the construct uses the default namespace.
  1. In lib/emr-eks-app.ts, add the following line to create your virtual cluster:
    const virtualCluster = emrEks.addEmrVirtualCluster(this, {
       name: 'my-emr-eks-cluster',
       eksNamespace: 'batchjob',
       createNamespace: true
    });

    Now we create the execution role, which is an IAM role that is used by the driver and executor to interact with AWS services. Before we can create the execution role for Amazon EMR, we need to first create the ManagedPolicy. Note that in the following code, we create a policy to allow access to the Amazon Simple Storage Service (Amazon S3) bucket and Amazon CloudWatch logs.

  2. In lib/emr-eks-app.ts, add the following line to create the policy:
    const emrEksPolicy = new iam.ManagedPolicy(this,'managed-policy',
    { statements: [ 
       new iam.PolicyStatement({ 
           effect: iam.Effect.ALLOW, 
           actions:['s3:PutObject','s3:GetObject','s3:ListBucket'], 
           resources:['YOUR-DATA-S3-BUCKET']
        }), 
       new iam.PolicyStatement({ 
           effect: iam.Effect.ALLOW, 
           actions:['logs:PutLogEvents','logs:CreateLogStream','logs:DescribeLogGroups','logs:DescribeLogStreams'], 
           resources:['arn:aws:logs:*:*:*'] 
        })
       ] 
    });

    If you want to use the AWS Glue Data Catalog, add its permission in the preceding policy.

    Now we create the execution role for Amazon EMR on EKS using the policy defined in the previous step using the createExecutionRole instance method. The driver and executor pods can then assume this role to access and process data. The role is scoped in such a way that only pods in the virtual cluster namespace can assume it. To learn more about the condition implemented by this method to restrict access to the role to only pods that are created by Amazon EMR on EKS in the namespace of the virtual cluster, refer to Using job execution roles with Amazon EMR on EKS.

  3. In lib/emr-eks-app.ts, add the following line to create the execution role:
    const role = emrEks.createExecutionRole(this, 'emr-eks-execution-role', emrEksPolicy, 'batchjob', 'execRoleJob');

    The preceding code produces an IAM role called execRoleJob with the IAM policy defined in emrEksPolicy and scoped to the namespace batchjob.

  4. Lastly, we output parameters that are important for the job run:
// Virtual cluster Id to reference in jobs
new cdk.CfnOutput(this, 'VirtualClusterId', { value: virtualCluster.attrId });

// Job config for each nodegroup
new cdk.CfnOutput(this, 'CriticalConfig', { value: emrEks.criticalDefaultConfig });

// Execution role arn
new cdk.CfnOutput(this, 'ExecRoleArn', { value: role.roleArn });

Deploy Amazon EMR Studio and provision users

To deploy an EMR Studio for data exploration and job authoring, the ARA library has a construct called NotebookPlatform. This construct allows you to deploy as many EMR Studios as you need (within the account limit), set them up with the authentication mode that suits you, and assign users to them. To learn more about the authentication modes available in Amazon EMR Studio, refer to Choose an authentication mode for Amazon EMR Studio.

The construct creates all the necessary IAM roles and policies needed by Amazon EMR Studio. It also creates an S3 bucket where all the notebooks are stored by Amazon EMR Studio. The bucket is encrypted with a customer managed key (CMK) generated by the AWS CDK stack. The following steps show you how to create your own EMR Studio with the construct.

The notebook platform construct takes NotebookPlatformProps as a property, which allows you to define your EMR Studio, a namespace, the name of the EMR Studio, and its authentication mode.

  1. In lib/emr-eks-app.ts, add the following line:
    const notebookPlatform = new ara.NotebookPlatform(this, 'platform-notebook', {
    emrEks: emrEks,
    eksNamespace: 'dataanalysis',
    studioName: 'platform',
    studioAuthMode: ara.StudioAuthMode.IAM,
    });

    For this post, we use IAM users so that you can easily reproduce it in your own account. However, if you have IAM federation or single sign-on (SSO) already in place, you can use them instead of IAM users. To learn more about the parameters of NotebookPlatformProps, refer to NotebookPlatformProps.

    Next, we need to create and assign users to the Amazon EMR Studio. For this, the construct has a method called addUser that takes a list of users and either assigns them to Amazon EMR Studio in the case of SSO, or updates the IAM policy to allow access to Amazon EMR Studio for the provided IAM users. Each user can also have multiple managed endpoints, each with its own Amazon EMR version, its own set of Amazon Elastic Compute Cloud (Amazon EC2) instances, and its own permissions through job execution roles.

  2. In lib/emr-eks-app.ts, add the following line:
    notebookPlatform.addUser([{
      identityName: '<NAME-OF-EXISTING-IAM-USER>',
      notebookManagedEndpoints: [{
        emrOnEksVersion: 'emr-6.8.0-latest',
        executionPolicy: emrEksPolicy,
        managedEndpointName: 'myendpoint'
      }],
    }]);

    In the preceding code, for the sake of brevity, we reuse the same IAM policy that we created in the execution role.

    Note that the construct optimizes the number of managed endpoints that are created. If two endpoints have the same name, then only one is created.

  3. Now that we have defined our deployment, we can deploy it:
   npm run build && cdk deploy

You can find a sample project that contains all the steps of this walkthrough in the following GitHub repository.

When the deployment is complete, the output contains the S3 bucket containing the assets for podTemplate, the link for the EMR Studio, and the EMR Studio virtual cluster ID. The following screenshot shows the output of the AWS CDK after the deployment is complete.

CDK output

Submit jobs

Because we’re using the default Provisioners, we use the pod templates that are defined by the construct and available on the ARA GitHub repository. These are uploaded for you by the construct to an S3 bucket called <clustername>-emr-eks-assets; you only need to refer to them in your Spark job. In this job, you also use the job parameters in the output at the end of the AWS CDK deployment. These parameters allow you to use the AWS Glue Data Catalog and implement Spark on Kubernetes best practices like dynamicAllocation and pod collocation. At the end of cdk deploy, ARA outputs sample job configurations that include the best practices listed earlier, which you can use to submit a job. You can submit a job as follows.

A job run is a unit of work, such as a Spark JAR file, that is submitted to the EMR on EKS cluster. We start a job using the start-job-run command. Note that you specify the Amazon S3 path to the pod templates through the Spark properties in the configuration overrides (you can also pass them in sparkSubmitParameters), as shown in the following command:

aws emr-containers start-job-run \
  --virtual-cluster-id <CLUSTER-ID> \
  --name <SPARK-JOB-NAME> \
  --execution-role-arn <ROLE-ARN> \
  --release-label emr-6.8.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "<S3URI-SPARK-JOB>"
    }
  }' \
  --configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
          "spark.sql.catalogImplementation": "hive",
          "spark.dynamicAllocation.enabled": "true",
          "spark.dynamicAllocation.minExecutors": "8",
          "spark.dynamicAllocation.maxExecutors": "40",
          "spark.kubernetes.allocation.batch.size": "8",
          "spark.executor.cores": "8",
          "spark.kubernetes.executor.request.cores": "7",
          "spark.executor.memory": "28G",
          "spark.driver.cores": "2",
          "spark.kubernetes.driver.request.cores": "2",
          "spark.driver.memory": "6G",
          "spark.dynamicAllocation.executorAllocationRatio": "1",
          "spark.dynamicAllocation.shuffleTracking.enabled": "true",
          "spark.dynamicAllocation.shuffleTracking.timeout": "300s",
          "spark.kubernetes.driver.podTemplateFile": "<S3URI-CRITICAL-DRIVER>",
          "spark.kubernetes.executor.podTemplateFile": "<S3URI-CRITICAL-EXECUTOR>"
        }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "<Log_Group_Name>",
        "logStreamNamePrefix": "<Log_Stream_Prefix>"
      }
    }
  }'

The code takes the following values:

  • <CLUSTER-ID> – The EMR virtual cluster ID
  • <SPARK-JOB-NAME> – The name of your Spark job
  • <ROLE-ARN> – The execution role you created
  • <S3URI-SPARK-JOB> – The Amazon S3 URI of your Spark job
  • <S3URI-CRITICAL-DRIVER> – The Amazon S3 URI of the driver pod template (for example, s3://<EKS-CLUSTER-NAME>-emr-eks-assets-<ACCOUNT-ID>-<REGION>/<EKS-CLUSTER-NAME>/pod-template/critical-driver.yaml), which you get from the AWS CDK output
  • <S3URI-CRITICAL-EXECUTOR> – The Amazon S3 URI of the executor pod template (for example, s3://<EKS-CLUSTER-NAME>-emr-eks-assets-<ACCOUNT-ID>-<REGION>/<EKS-CLUSTER-NAME>/pod-template/critical-executor.yaml), also from the AWS CDK output
  • <Log_Group_Name> – Your CloudWatch log group name
  • <Log_Stream_Prefix> – Your CloudWatch log stream prefix

You can go to the Amazon EMR console to check the status of your job and to view logs. You can also check the status by running the describe-job-run command:

aws emr-containers describe-job-run --virtual-cluster-id <CLUSTER-ID> --id <JOB-RUN-ID>
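If you prefer to check the status programmatically, the following is a minimal boto3 sketch of polling the same API (the cluster and job run IDs are placeholders):

import time

import boto3

emr_containers = boto3.client('emr-containers')

while True:
    job_run = emr_containers.describe_job_run(
        virtualClusterId='<CLUSTER-ID>',
        id='<JOB-RUN-ID>'
    )['jobRun']
    print(job_run['state'])
    if job_run['state'] in ('COMPLETED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(30)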

Explore data using Amazon EMR Studio

In this section, we show how you can create a workspace in Amazon EMR Studio and connect to the Amazon EKS managed endpoint from the workspace. From the output, use the link to Amazon EMR Studio to navigate to the EMR Studio deployment. You must sign in with the IAM username you provided in the addUser method.

Create a Workspace

To create a Workspace, complete the following steps:

  1. Log in to the EMR Studio created by the AWS CDK.
  2. Choose Create Workspace.
  3. Enter a workspace name and an optional description.
  4. Select Allow Workspace Collaboration if you want to work with other Studio users in this Workspace in real time.
  5. Choose Create Workspace.

create-emr-studio-workspace

After you create the Workspace, choose it from the list of Workspaces to open the JupyterLab environment.
emr studio workspace running

The following screenshot shows what the terminal looks like. For more information about the user interface, refer to Understand the Workspace user interface.

EMR Studio workspace view

Connect to an EMR on EKS managed endpoint

You can easily connect to the EMR on EKS managed endpoint from the Workspace.

  1. In the navigation pane, on the Clusters menu, select EMR Cluster on EKS for Cluster type.
    The virtual clusters appear on the EMR Cluster on EKS drop-down menu, and the endpoint appears on the Endpoint drop-down menu. If there are multiple endpoints, they appear here, and you can easily switch between endpoints from the Workspace.
  2. Select the appropriate endpoint and choose Attach.
    attach to managedendpoint

Work with a notebook

You can now open a notebook and connect to a preferred kernel to do your tasks. For instance, you can select a PySpark kernel, as shown in the following screenshot.
select-kernel

Explore your data

The first step of our data exploration exercise is to create a Spark session and then load the New York taxi dataset from the S3 bucket into a data frame. Use the following code block to load the data into a data frame, replacing the placeholder with the Amazon S3 URI of the location where the dataset resides.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from datetime import datetime

spark = SparkSession.builder.appName("SparkEDAA").getOrCreate()

# Load the dataset into a data frame; adjust the reader (csv/parquet) to match the dataset format
nyTaxi = spark.read.option("header", "true").csv("<YOUR-S3-URI>")

After we load the data into a data frame, we replace the data of the current_date column with the actual current date, count the number of rows, and save the data into a Parquet file:

print("Total number of records: " + str(updatedNYTaxi.count()))
updatedNYTaxi.write.parquet("<YOUR-S3-PATH>")

The following screenshot shows the result of our notebook running on Amazon EMR Studio and with PySpark running on Amazon EMR on EKS.
notebook execution

Clean up

To clean up after this post, run cdk destroy.

Conclusion

In this post, we showed how you can use the ARA to quickly deploy a data analytics infrastructure and start experimenting with your data. You can find the full example referenced in this post in the GitHub repository. The AWS Analytics Reference Architecture implements common analytics patterns and AWS best practices to offer you ready-to-use constructs for your experiments. One of these patterns is the data mesh, and you can learn how to use it in this blog post.

You can also explore other constructs offered in this library to experiment with AWS Analytics services before transitioning your workload for production.


About the Authors

co-author-1Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Sandipan Bhaumik is a Senior Analytics Specialist Solutions Architect based in London. He has worked with customers in different industries like Banking & Financial Services, Healthcare, Power & Utilities, Manufacturing and Retail helping them solve complex challenges with large-scale data platforms. At AWS he focuses on strategic accounts in the UK and Ireland and helps customers to accelerate their journey to the cloud and innovate using AWS analytics and machine learning services. He loves playing badminton, and reading books.

How Novo Nordisk built a modern data architecture on AWS

Post Syndicated from Jonatan Selsing original https://aws.amazon.com/blogs/big-data/how-novo-nordisk-built-a-modern-data-architecture-on-aws/

Novo Nordisk is a leading global pharmaceutical company, responsible for producing life-saving medicines that reach more than 34 million patients each day. They do this following their triple bottom line—that they must strive to be environmentally sustainable, socially sustainable, and financially sustainable. The combination of using AWS and data supports all these targets.

Data is pervasive throughout the entire value chain of Novo Nordisk, spanning foundational research, manufacturing lines, sales and marketing, clinical trials, pharmacovigilance, and patient-facing data-driven applications. Therefore, getting the foundation around how data is stored, safeguarded, and used in a way that provides the most value is one of the central drivers of improved business outcomes.

Together with AWS Professional Services, we’re building a data and analytics solution using a modern data architecture. The collaboration between Novo Nordisk and AWS Professional Services is a strategic and long-term close engagement, where developers from both organizations have worked together closely for years. The data and analytics environments are built around the core tenets of the data mesh—decentralized domain ownership of data, data as a product, self-service data infrastructure, and federated computational governance. This enables the users of the environment to work with data in the way that drives the best business outcomes. We have combined this with elements from evolutionary architectures that will allow us to adapt different functionalities as AWS continuously develops new services and capabilities.

In this series of posts, you will learn how Novo Nordisk and AWS Professional Services built a data and analytics ecosystem to speed up innovation at petabyte scale:

  • In this first post, you will learn how the overall design has enabled the individual components to come together in a modular way. We dive deep into how we built a data management solution based on the data mesh architecture.
  • The second post discusses how we built a trust network between the systems that comprise the entire solution. We show how we use event-driven architectures, coupled with the use of attribute-based access controls, to ensure permission boundaries are respected at scale.
  • In the third post, we show how end-users can consume data from their tool of choice, without compromising data governance. This includes how to configure Okta, AWS Lake Formation, and Microsoft Power BI to enable SAML-based federated use of Amazon Athena for an enterprise business intelligence (BI) activity.

Pharma-compliant environment

As a pharmaceutical company, Novo Nordisk must comply with GxP. GxP is a general abbreviation for the “Good x Practice” quality guidelines and regulations defined by regulators such as the European Medicines Agency, the U.S. Food and Drug Administration, and others. These guidelines are designed to ensure that medicinal products are safe and effective for their intended use. In the context of a data environment, GxP compliance involves implementing integrity controls for data used in decision-making and processes, and guides how change management processes are implemented to continuously ensure compliance over time.

Because this data environment supports teams across the whole organization, each individual data owner must retain accountability on their data. Features were designed to provide data owners autonomy and transparency when managing their data, enabling them to take this responsibility. This includes the capability to handle personally identifiable information (PII) data and other sensitive workloads. To provide traceability on the environment, audit capabilities were added, which we describe more in this post.

Solution overview

The full solution is a sprawling landscape of independent services that work together to enable data and analytics with a decentralized data governance model at petabyte scale. Schematically, it can be represented as in the following figure.

Novo Nordisk Modern Data Architecture on AWS

The architecture is split into three independent layers: data management, virtualization, and consumption. The end-user sits in the consumption layer and works with their tool of choice, which is meant to abstract the AWS-native resources into application primitives as much as possible. The consumption layer is integrated into the virtualization layer, which abstracts the access to data. The purpose of the virtualization layer is to translate between data consumption and data management solutions. The access to data is managed by what we refer to as data management solutions. We discuss one of our versatile data management solutions later in this post. Each layer in this architecture is independent of the others and instead only relies on well-defined interfaces.

Central to this architecture is that access is encapsulated in an AWS Identity and Access Management (IAM) role session. The data management layer focuses on providing the IAM role with the right permissions and governance, the virtualization layer provides access to the role, and the consumption layer abstracts the use of the roles in the tools of choice.

Technical architecture

Each of the three layers in the overall architecture has a distinct responsibility, but no singular implementation. Think of them as abstract classes. They can be implemented in concrete classes, and in our case they rely on foundational AWS services and capabilities. Let’s go through each of the three layers.

Data management layer

The data management layer is responsible for providing access to and governance of data. As illustrated in the following diagram, a minimal construct in the data management layer is the combination of an Amazon Simple Storage Service (Amazon S3) bucket and an IAM role that gives access to the S3 bucket. This construct can be expanded to include granular permissions with Lake Formation, auditing with AWS CloudTrail, and security response capabilities from AWS Security Hub. The following diagram also shows that a single data management solution has no singular span. It can cross many AWS accounts and be comprised of any number of IAM role combinations.

Data Management Architecture

We have purposely not illustrated the trust policy of these roles in this figure, because those are a collaborative responsibility between the virtualization layer and the data management layer. We go into detail of how that works in the next post in this series. Data engineering professionals often interface directly with the data management layer, where they curate and prepare data for consumption.

Virtualization layer

The purpose of the virtualization layer is to keep track of who can do what. It doesn’t have any capabilities in itself, but translates the requirements from the data management ecosystems to the consumption layers and vice versa. It enables end-users on the consumption layer to access and manipulate data on one or more data management ecosystems, according to their permissions. This layer abstracts from end-users the technical details on data access, such as permission model, role assumptions, and storage location. It owns the interfaces to the other layers and enforces the logic of the abstraction. In the context of hexagonal architectures (see Developing evolutionary architecture with AWS Lambda), the interface layer plays the role of the domain logic, ports, and adapters. The other two layers are actors. The data management layer communicates the state of the layer to the virtualization layer and conversely receives information about the service landscape to trust. The virtualization layer architecture is shown in the following diagram.

Virtualization Layer Architecture

Consumption layer

The consumption layer is where the end-users of the data products are sitting. This can be data scientists, business intelligence analysts, or any third party that generates value from consuming the data. It’s important for this type of architecture that the consumption layer has a hook-based sign-in flow, where the authorization into the application can be modified at sign-in time. This is to translate the AWS-specific requirement into the target applications. After the session in the client-side application has successfully been started, it’s up to the application itself to instrument for data layer abstraction, because this will be application specific. And this is an additional important decoupling, where some responsibility is pushed to the decentralized units. Many modern software as a service (SaaS) applications support these built-in mechanisms, such as Databricks or Domino Data Lab, whereas more traditional client-side applications like RStudio Server have more limited native support for this. In the case where native support is missing, a translation down to the OS user session can be done to enable the abstraction. The consumption layer is shown schematically in the following diagram.

Consumption Layer Architecture

When using the consumption layer as intended, the users don’t know that the virtualization layer exists. The following diagram illustrates the data access patterns.

Data Access Patterns

Modularity

One of the main advantages of adopting the hexagonal architecture pattern, and delegating both the consuming layer and the data management layer to primary and secondary actors, is that they can be changed or replaced as new functionalities are released that require new solutions. This gives a hub-and-spoke type pattern, where many different types of producer/consumer systems can be connected and work simultaneously in union. An example of this is that the current solution running in Novo Nordisk supports multiple, simultaneous data management solutions, which are exposed in a homogenous way in the consuming layer. This includes both a data lake, the data mesh solution presented in this post, and several independent data management solutions. And these are exposed to multiple types of consuming applications, from custom managed, self-hosted applications, to SaaS offerings.

Data management ecosystem

To scale the usage of the data and increase the freedom, Novo Nordisk, jointly with AWS Professional Services, built a data management and governance environment, named Novo Nordisk Enterprise DataHub (NNEDH). NNEDH implements a decentralized distributed data architecture, and data management capabilities such as an enterprise business data catalog and data sharing workflow. NNEDH is an example of a data management ecosystem in the conceptual framework introduced earlier.

Decentralized architecture: From a centralized data lake to a distributed architecture

Novo Nordisk’s centralized data lake consists of 2.3 PB of data from more than 30 business data domains worldwide, serving more than 2,000 internal users throughout the value chain. It has been running successfully for several years. It is one of the data management ecosystems currently supported.

Within the centralized data architecture, data from each data domain is copied, stored, and processed in one central location: a central data lake hosted in one data storage. This pattern has challenges at scale because it retains the data ownership with the central team. At scale, this model slows down the journey toward a data-driven organization, because ownership of the data isn’t sufficiently anchored with the professionals closest to the domain.

The monolithic data lake architecture is shown in the following diagram.

Monolithic Data Lake Architecture

Within the decentralized distributed data architecture, the data from each domain is kept within the domain on its own data storage and compute account. In this case, the data is kept close to domain experts, because they’re the ones who know their own data best and are ultimately the owner of any data products built around their data. They often work closely with business analysts to build the data product and therefore know what good data means to consumers of their data products. In this case, the data responsibility is also decentralized, where each domain has its own data owner, putting the accountability onto the true owners of the data. Nevertheless, this model might not work at small scale, for example an organization with only one business unit and tens of users, because it would introduce more overhead on the IT team to manage the organization data. It better suits large organizations, or small and medium ones that would like to grow and scale.

The Novo Nordisk data mesh architecture is shown in the following diagram.

Novo Nordisk Data Mesh Architecture

Data domains and data assets

To enable the scalability of data domains across the organization, it’s mandatory to have a standard permission model and data access pattern. This standard must not be so restrictive that it blocks specific use cases, but it should standardize the interface used between the data management and virtualization layers.

The data domains on NNEDH are implemented by a construct called an environment. An environment is composed of at least one AWS account and one AWS Region. It’s a workplace where data domain teams can work and collaborate to build data products. It links the NNEDH control plane to the AWS accounts where the data and compute of the domain reside. The data access permissions are also defined at the environment level, managed by the owner of the data domain. The environments have three main components: a data management and governance layer, data assets, and optional blueprints for data processing.

For data management and governance, the data domains rely on Lake Formation, AWS Glue, and CloudTrail. The deployment method and setup of these components is standardized across data domains. This way, the NNEDH control plane can provide connectivity and management to data domains in a standardized way.

The data assets of each domain residing in an environment are organized in a dataset, which is a collection of related data used for building a data product. It includes technical metadata such as data format, size, and creation time, and business metadata such as the producer, data classification, and business definition. A data product can use one or several datasets. It is implemented through managed S3 buckets and the AWS Glue Data Catalog.

Data processing can be implemented in different ways. NNEDH provides blueprints for data pipelines with predefined connectivity to data assets to speed up the delivery of data products. Data domain users have the freedom to use any other compute capability on their domain, for example using AWS services not predefined on the blueprints or accessing the datasets from other analytics tools implemented in the consumption layer, as mentioned earlier in this post.

Data domain personas and roles

On NNEDH, the permission levels on data domains are managed through predefined personas, for example data owner, data stewards, developers, and readers. Each persona is associated with an IAM role that has a predefined permission level. These permissions are based on the typical needs of users on these roles. Nevertheless, to give more flexibility to data domains, these permissions can be customized and extended as needed.

The permissions associated with each persona are related only to actions allowed on the AWS account of the data domain. For the accountability on data assets, the data access to the assets is managed by specific resource policies instead of IAM roles. Only the owner of each dataset, or data stewards delegated by the owner, can grant or revoke data access.

On the dataset level, a required persona is the data owner. Typically, they work closely with one or many data stewards as data products managers. The data steward is the data subject matter expert of the data product domain, responsible for interpreting collected data and metadata to derive deep business insights and build the product. The data steward bridges between business users and technical teams on each data domain.

Enterprise business data catalog

To enable freedom and make the organization data assets discoverable, a web-based portal data catalog is implemented. It indexes in a single repository metadata from datasets built on data domains, breaking data silos across the organization. The data catalog enables data search and discovery across different domains, as well as automation and governance on data sharing.

The business data catalog implements data governance processes within the organization. It ensures the data ownership—someone in the organization is responsible for the data origin, definition, business attributes, relationships, and dependencies.

The central construct of a business data catalog is a dataset. It’s the search unit within the business catalog, having both technical and business metadata. To collect technical metadata from structured data, it relies on AWS Glue crawlers to recognize and extract data structures from the most popular data formats, including CSV, JSON, Avro, and Apache Parquet. It provides information such as data type, creation date, and format. The metadata can be enriched by business users by adding a description of the business context, tags, and data classification.
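The following is a minimal boto3 sketch of how such a crawler could be registered and started for one dataset; the crawler, role, database, and S3 path names are placeholders, and the actual NNEDH implementation may configure crawlers differently:

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='<dataset-name>-crawler',
    Role='<glue-service-role-arn>',
    DatabaseName='<domain-database>',
    Targets={'S3Targets': [{'Path': 's3://<dataset-bucket>/<dataset-prefix>/'}]}
)
glue.start_crawler(Name='<dataset-name>-crawler')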

The dataset definition and related metadata are stored in an Amazon Aurora Serverless database and Amazon OpenSearch Service, enabling you to run textual queries on the data catalog.

Data sharing

NNEDH implements a data sharing workflow, enabling peer-to-peer data sharing across AWS accounts using Lake Formation. The workflow is as follows:

  1. A data consumer requests access to the dataset.
  2. The data owner grants access by approving the access request. They can delegate the approval of access requests to the data steward.
  3. Upon the approval of an access request, a new permission is added to the specific dataset in Lake Formation of the producer account (a code sketch of this grant follows the figure).

The data sharing workflow is shown schematically in the following figure.

Data Sharing Workflow
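For step 3, the following is a minimal boto3 sketch of what such a cross-account grant could look like; the account IDs, database, and table names are placeholders, and the actual workflow in NNEDH automates this through its control plane:

import boto3

lakeformation = boto3.client('lakeformation')

lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': '<CONSUMER-ACCOUNT-ID>'},
    Resource={
        'Table': {
            'CatalogId': '<PRODUCER-ACCOUNT-ID>',
            'DatabaseName': '<dataset-database>',
            'Name': '<dataset-table>'
        }
    },
    Permissions=['SELECT'],
    PermissionsWithGrantOption=[]
)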

Security and audit

The data in the Novo Nordisk data mesh lies in AWS accounts owned by Novo Nordisk business accounts. The configuration and the states of the data mesh are stored in Amazon Relational Database Service (Amazon RDS). The Novo Nordisk security architecture is shown in the following figure.

Novo Nordisk Distributed Security and Audit Architecture

Access and edits to the data in NNEDH need to be logged for audit purposes. We need to be able to tell who modified data, when the modification happened, and what modifications were applied. In addition, we need to be able to answer why the modification was allowed by that person at that time.

To meet these requirements, we use the following components:

  • CloudTrail to log API calls. We specifically enable CloudTrail data event logging for S3 buckets and objects (see the sketch after this list). By activating the logging, we can trace back any modification to any file in the data lake to the person who made the modification. We enforce usage of source identity for IAM role sessions to ensure user traceability.
  • We use Amazon RDS to store the configuration of the data mesh. We log queries against the RDS database. Together with CloudTrail, this log allows us to answer why a specific person was able to modify a specific file in Amazon S3 at a specific time.
  • Amazon CloudWatch to log activities across the mesh.
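The following is a minimal boto3 sketch of enabling the CloudTrail data event logging mentioned in the first item; the trail and bucket names are placeholders:

import boto3

cloudtrail = boto3.client('cloudtrail')

cloudtrail.put_event_selectors(
    TrailName='<nnedh-trail>',
    EventSelectors=[{
        'ReadWriteType': 'All',
        'IncludeManagementEvents': True,
        'DataResources': [{
            'Type': 'AWS::S3::Object',
            'Values': ['arn:aws:s3:::<nnedh-dataset-bucket>/']
        }]
    }]
)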

In addition to those logging mechanisms, the S3 buckets are created using the following properties:

  • The bucket is encrypted using server-side encryption with AWS Key Management Service (AWS KMS) and customer managed keys
  • Amazon S3 versioning is activated by default
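The following is a minimal boto3 sketch of applying these two bucket properties; the bucket name and key ARN are placeholders:

import boto3

s3 = boto3.client('s3')

s3.put_bucket_encryption(
    Bucket='<nnedh-dataset-bucket>',
    ServerSideEncryptionConfiguration={
        'Rules': [{
            'ApplyServerSideEncryptionByDefault': {
                'SSEAlgorithm': 'aws:kms',
                'KMSMasterKeyID': '<customer-managed-key-arn>'
            }
        }]
    }
)

s3.put_bucket_versioning(
    Bucket='<nnedh-dataset-bucket>',
    VersioningConfiguration={'Status': 'Enabled'}
)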

Access to the data in NNEDH is controlled at the group level instead of for individual users. The group corresponds to a group defined in the Novo Nordisk directory. To keep track of the person who modified the data in the data lakes, we use the source identity mechanism explained in the post How to relate IAM role activity to corporate identity, as sketched in the following code.
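The following is a minimal boto3 sketch of setting a source identity when a data access role is assumed, so that CloudTrail entries can be traced back to the corporate identity; the role ARN and user identifier are placeholders, and the role’s trust policy must allow the sts:SetSourceIdentity action:

import boto3

sts = boto3.client('sts')

session = sts.assume_role(
    RoleArn='<data-domain-role-arn>',
    RoleSessionName='nnedh-data-access',
    SourceIdentity='<corporate-user-id>'
)
credentials = session['Credentials']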

Conclusion

In this post, we showed how Novo Nordisk built a modern data architecture to speed up the delivery of data-driven use cases. It includes a distributed data architecture, to scale the usage to petabyte scale for over 2,000 internal users throughout the value chain, as well as a distributed security and audit architecture handling data accountability and traceability on the environment to meet their compliance requirements.

The next post in this series describes the implementation of distributed data governance and control at scale of Novo Nordisk’s modern data architecture.


About the Authors

Jonatan Selsing is former research scientist with a PhD in astrophysics that has turned to the cloud. He is currently the Lead Cloud Engineer at Novo Nordisk, where he enables data and analytics workloads at scale. With an emphasis on reducing the total cost of ownership of cloud-based workloads, while giving full benefit of the advantages of cloud, he designs, builds, and maintains solutions that enable research for future medicines.

Hassen Riahi is a Sr. Data Architect at AWS Professional Services. He holds a PhD in Mathematics & Computer Science on large-scale data management. He works with AWS customers on building data-driven solutions.

Anwar Rizal is a Senior Machine Learning consultant based in Paris. He works with AWS customers to develop data and AI solutions to sustainably grow their business.

Moses Arthur comes from a mathematics and computational research background and holds a PhD in Computational Intelligence specialized in Graph Mining. He is currently a Cloud Product Engineer at Novo Nordisk building GxP-compliant enterprise data lakes and analytics platforms for Novo Nordisk global factories producing digitalized medical products.

Alessandro Fior is a Sr. Data Architect at AWS Professional Services. With over 10 years of experience delivering data and analytics solutions, he is passionate about designing and building modern and scalable data platforms that accelerate companies to get value from their data.

Kumari Ramar is an Agile certified and PMP certified Senior Engagement Manager at AWS Professional Services. She delivers data and AI/ML solutions that speed up cross-system analytics and machine learning models, which enable enterprises to make data-driven decisions and drive new innovations.

Develop a serverless application in Python using Amazon CodeWhisperer

Post Syndicated from Rafael Ramos original https://aws.amazon.com/blogs/devops/develop-a-serverless-application-in-python-using-amazon-codewhisperer/

While writing code to develop applications, developers must keep up with multiple programming languages, frameworks, software libraries, and popular cloud services from providers such as AWS. Even though developers can find code snippets on developer communities, to either learn from them or repurpose the code, manually searching for the snippets with an exact or even similar use case is a distracting and time-consuming process. They have to do all of this while making sure that they’re following the correct programming syntax and best coding practices.

Amazon CodeWhisperer, a machine learning (ML) powered coding aid for developers, lets you overcome those challenges. Developers can simply write a comment that outlines a specific task in plain English, such as “upload a file to S3.” Based on this, CodeWhisperer automatically determines which cloud services and public libraries are best suited for the specified task, creates the specific code on the fly, and then recommends the generated code snippets directly in the IDE. And this isn’t about copy-pasting code from the web, but generating code based on the context of your file, such as which libraries and versions you have, as well as the existing code. Moreover, CodeWhisperer seamlessly integrates with your Visual Studio Code and JetBrains IDEs so that you can stay focused and never leave the development environment. At the time of this writing, CodeWhisperer supports Java, Python, JavaScript, C#, and TypeScript.

In this post, we’ll build a full-fledged, event-driven, serverless application for image recognition. With the aid of CodeWhisperer, you’ll write your own code that runs on top of AWS Lambda to interact with Amazon Rekognition, Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), Amazon Simple Storage Service (Amazon S3), and third-party HTTP APIs to perform image recognition. The users of the application can interact with it by either sending the URL of an image for processing, or by listing the images and the objects present on each image.

Solution overview

To make our application easier to digest, we’ll split it into three segments:

  1. Image download – The user provides an image URL to the first API. A Lambda function downloads the image from the URL and stores it on an S3 bucket. Amazon S3 automatically sends a notification to an Amazon SNS topic informing that a new image is ready for processing. Amazon SNS then delivers the message to an Amazon SQS queue.
  2. Image recognition – A second Lambda function handles the orchestration and processing of the image. It receives the message from the Amazon SQS queue, sends the image for Amazon Rekognition to process, stores the recognition results on a DynamoDB table, and sends a message with those results as JSON to a second Amazon SNS topic used in section three. A user can list the images and the objects present on each image by calling a second API which queries the DynamoDB table.
  3. 3rd-party integration – The last Lambda function reads the message from the second Amazon SQS queue. At this point, the Lambda function must deliver that message to a fictitious external e-mail server HTTP API that supports only XML payloads. Because of that, the Lambda function converts the JSON message to XML. Lastly, the function sends the XML object via HTTP POST to the e-mail server.

The following diagram depicts the architecture of our application:


Figure 1. Architecture diagram depicting the application architecture. It contains the service icons with the component explained on the text above.

Prerequisites

Before getting started, you must have the following prerequisites:

Configure environment

We already created the scaffolding for the application that we’ll build, which you can find on this Git repository. This application is represented by a CDK app that describes the infrastructure according to the architecture diagram above. However, the actual business logic of the application isn’t provided. You’ll implement it using CodeWhisperer. This means that we already declared the infrastructure components using AWS CDK, such as the API Gateway endpoints, the DynamoDB table, and the SNS topics and SQS queues. If you’re new to AWS CDK, we encourage you to go through the CDK workshop later on.

Deploying AWS CDK apps into an AWS environment (a combination of an AWS account and region) requires that you provision resources that the AWS CDK needs to perform the deployment. These resources include an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments. The process of provisioning these initial resources is called bootstrapping. The required resources are defined in an AWS CloudFormation stack, called the bootstrap stack, which is usually named CDKToolkit. Like any CloudFormation stack, it appears in the CloudFormation console once it has been deployed.

After cloning the repository, let’s deploy the application (still without the business logic, which we’ll implement later on using CodeWhisperer). For this post, we’ll implement the application in Python. Therefore, make sure that you’re under the python directory. Then, use the cdk bootstrap command to bootstrap an AWS environment for AWS CDK. Replace {AWS_ACCOUNT_ID} and {AWS_REGION} with corresponding values first:

cdk bootstrap aws://{AWS_ACCOUNT_ID}/{AWS_REGION}

For more information about bootstrapping, refer to the documentation.

The last step to prepare your environment is to enable CodeWhisperer on your IDE. See Setting up CodeWhisperer for VS Code or Setting up Amazon CodeWhisperer for JetBrains to learn how to do that, depending on which IDE you’re using.

Image download

Let’s get started by implementing the first Lambda function, which is responsible for downloading an image from the provided URL and storing that image in an S3 bucket. Open the get_save_image.py file from the python/api/runtime/ directory. This file contains an empty Lambda function handler and the input parameters needed to integrate this Lambda function:

  • url is the URL of the input image provided by the user,
  • name is the name of the image provided by the user, and
  • S3_BUCKET is the S3 bucket name defined by our application infrastructure.

Write a comment in natural language that describes the required functionality, for example:

# Function to get a file from url

To trigger CodeWhisperer, hit the Enter key after entering the comment and wait for a code suggestion. If you want to manually trigger CodeWhisperer, then you can hit Option + C on MacOS or Alt + C on Windows. You can browse through multiple suggestions (if available) with the arrow keys. Accept a code suggestion by pressing Tab. Discard a suggestion by pressing Esc or typing a character.

For more information on how to work with CodeWhisperer, see Working with CodeWhisperer in VS Code or Working with Amazon CodeWhisperer from JetBrains.

You should get a suggested implementation of a function that downloads a file using a specified URL. The following image shows an example of the code snippet that CodeWhisperer suggests:


Figure 2. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called get_file_from_url with the implementation suggestion to download a file using the requests lib.

Be aware that CodeWhisperer uses artificial intelligence (AI) to provide code recommendations, and that this is non-deterministic. The result you get in your IDE may be different from the one on the image above. If needed, fine-tune the code, as CodeWhisperer generates the core logic, but you might want to customize the details depending on your requirements.
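
For reference, a representative implementation of such a suggestion might look like the following sketch. The function name matches the screenshot description, but the body is illustrative and not the exact CodeWhisperer output.

import requests

def get_file_from_url(url):
    """Download the content at the given URL and return it as bytes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content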

Let’s try another action, this time to upload the image to an S3 bucket:

# Function to upload image to S3

As a result, CodeWhisperer generates a code snippet similar to the following one:


Figure 3. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called upload_image with the implementation suggestion to download a file using the requests lib and upload it to S3 using the S3 client.

Now that you have the functions with the functionalities to download an image from the web and upload it to an S3 bucket, you can wire up both functions in the Lambda handler function by calling each function with the correct inputs.
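
A possible wiring of the handler, combining the two generated functions, is sketched below. The event shape (API Gateway query string parameters) and the S3_BUCKET environment variable name are assumptions based on the architecture described above; the scaffolding in the repository may pass these inputs differently.

import os

import boto3
import requests

s3_client = boto3.client("s3")
S3_BUCKET = os.environ["S3_BUCKET"]  # assumed to be injected by the CDK stack

def get_file_from_url(url):
    """Download the content at the given URL and return it as bytes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content

def upload_image(image_bytes, name):
    """Upload the image bytes to the S3 bucket under the given object key."""
    s3_client.put_object(Bucket=S3_BUCKET, Key=name, Body=image_bytes)

def handler(event, context):
    # API Gateway passes url and name as query string parameters (assumption)
    params = event.get("queryStringParameters") or {}
    url = params["url"]
    name = params["name"]

    upload_image(get_file_from_url(url), name)
    return {"statusCode": 200, "body": f"{name} stored in {S3_BUCKET}"}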

Image recognition

Now let’s implement the Lambda function responsible for sending the image to Amazon Rekognition for processing, storing the results in a DynamoDB table, and sending a message with those results as JSON to a second Amazon SNS topic. Open the image_recognition.py file from the python/recognition/runtime/ directory. This file contains an empty Lambda function handler and the input parameters needed to integrate this Lambda function:

  • queue_url is the URL of the Amazon SQS queue to which this Lambda function is subscribed,
  • table_name is the name of the DynamoDB table, and
  • topic_arn is the ARN of the Amazon SNS topic to which this Lambda function publishes.

Using CodeWhisperer, implement the business logic of the next Lambda function as you did in the previous section. For example, to detect the labels from an image using Amazon Rekognition, write the following comment:

# Detect labels from image with Rekognition

And as a result, CodeWhisperer should give you a code snippet similar to the one in the following image:


Figure 4. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called detect_labels with the implementation suggestion to use the Rekognition SDK to detect labels on the given image.
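
Again, the exact suggestion will vary; a representative detect_labels implementation using the boto3 Rekognition client might look like this sketch (the bucket and key arguments are assumptions about how the handler passes the image location):

import boto3

rekognition = boto3.client("rekognition")

def detect_labels(bucket, key):
    """Detect labels for an image stored in S3 using Amazon Rekognition."""
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=10,
        MinConfidence=80,
    )
    return response["Labels"]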

You can continue generating the other functions that you need to fully implement the business logic of your Lambda function. Here are some examples that you can use:

  • # Save labels to DynamoDB
  • # Publish item to SNS
  • # Delete message from SQS

Following the same approach, open the list_images.py file from the python/recognition/runtime/ directory to implement the logic to list all of the labels from the DynamoDB table. As you did previously, type a comment in plain English:

# Function to list all items from a DynamoDB table
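
A representative suggestion for this comment is a paginated scan, along the lines of the following sketch (again, CodeWhisperer output varies and your generated code may differ):

import boto3

dynamodb = boto3.resource("dynamodb")

def list_items(table_name):
    """Return all items from the DynamoDB table, following scan pagination."""
    table = dynamodb.Table(table_name)
    items = []
    response = table.scan()
    items.extend(response.get("Items", []))
    # Keep scanning while DynamoDB reports more pages
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response.get("Items", []))
    return items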

Other frequently used code

Interacting with AWS isn’t the only way that you can leverage CodeWhisperer. You can use it to implement repetitive tasks, such as creating unit tests and converting message formats, or to implement algorithms such as sorting, string matching, and parsing. The last Lambda function that we’ll implement as part of this post is to convert a JSON payload received from Amazon SQS to XML. Then, we’ll POST this XML to an HTTP endpoint.

Open the send_email.py file from the python/integration/runtime/ directory. This file contains an empty Lambda function handler. An event is a JSON-formatted document that contains data for a Lambda function to process. Type a comment with your intent to get the code snippet:

# Transform json to xml

As CodeWhisperer uses the context of your files to generate code, depending on the imports that you have on your file, you’ll get an implementation such as the one in the following image:


Figure 5. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called json_to_xml with the implementation suggestion to transform JSON payload into XML payload.
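
For reference, a minimal sketch that handles a flat JSON object with the standard library looks like the following; a real suggestion may instead use a third-party library, depending on your imports, and nested structures would need recursion:

import json
import xml.etree.ElementTree as ET

def json_to_xml(json_string, root_tag="message"):
    """Convert a flat JSON object into a simple XML string."""
    data = json.loads(json_string)
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

# Example:
# json_to_xml('{"image": "cat.jpg", "labels": "Cat, Pet"}')
# returns '<message><image>cat.jpg</image><labels>Cat, Pet</labels></message>'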

Repeat the same process with a comment such as # Send XML string with HTTP POST to get the last function implementation. Note that the email server isn’t part of this implementation. You can mock it, or simply ignore this HTTP POST step. Lastly, wire up both functions in the Lambda handler function by calling each function with the correct inputs.

Deploy and test the application

To deploy the application, run the command cdk deploy --all. You should get a confirmation message, and after a few minutes your application will be up and running on your AWS account. As outputs, the APIStack and RekognitionStack will print the API Gateway endpoint URLs. It will look similar to this example:

Outputs:
...
APIStack.RESTAPIEndpoint01234567 = https://examp1eid0.execute-api.{your-region}.amazonaws.com/prod/
  1. The first endpoint expects two string parameters: url (the image file URL to download) and name (the target file name that will be stored on the S3 bucket). Use any image URL you like, but remember that you must encode an image URL before passing it as a query string parameter to escape the special characters. Use an online URL encoder of your choice for that. Then, use the curl command to invoke the API Gateway endpoint:
curl -X GET 'https://examp1eid0.execute-api.eu-east-2.amazonaws.com/prod?url={encoded-image-URL}&name={file-name}'

Replace {encoded-image-URL} and {file-name} with the corresponding values. Also, make sure that you use the correct API endpoint that you’ve noted from the AWS CDK deploy command output as mentioned above.

  2. It will take a few seconds for the processing to happen in the background. Once it’s ready, see what has been stored in the DynamoDB table by invoking the List Images API (make sure that you use the correct URL from the output of your deployed AWS CDK stack):
curl -X GET 'https://examp1eid7.execute-api.eu-east-2.amazonaws.com/prod'

After you’re done, to avoid unexpected charges to your account, make sure that you clean up your AWS CDK stacks. Use the cdk destroy command to delete the stacks.

Conclusion

In this post, we’ve seen how to get a significant productivity boost with the help of ML. With that, as a developer, you can stay focused on your IDE and reduce the time that you spend searching online for code snippets that are relevant for your use case. By writing comments in natural language, you get context-based snippets to implement full-fledged applications. In addition, CodeWhisperer comes with a mechanism called reference tracker, which detects whether a code recommendation might be similar to particular CodeWhisperer training data. The reference tracker lets you easily find and review that reference code and see how it’s used in the context of another project. Lastly, CodeWhisperer provides the ability to run scans on your code (generated by CodeWhisperer as well as written by you) to detect security vulnerabilities.

During the preview period, CodeWhisperer is available to all developers across the world for free. Get started with the free preview on JetBrains, VS Code or AWS Cloud9.

About the author:

Rafael Ramos

Rafael is a Solutions Architect at AWS, where he helps ISVs on their journey to the cloud. He spent over 13 years working as a software developer, and is passionate about DevOps and serverless. Outside of work, he enjoys playing tabletop RPG, cooking and running marathons.

Caroline Gluck

Caroline is an AWS Cloud application architect based in New York City, where she helps customers design and build cloud native data science applications. Caroline is a builder at heart, with a passion for serverless architecture and machine learning. In her spare time, she enjoys traveling, cooking, and spending time with family and friends.

Jason Varghese

Jason is a Senior Solutions Architect at AWS guiding enterprise customers on their cloud migration and modernization journeys. He has served in multiple engineering leadership roles and has over 20 years of experience architecting, designing and building scalable software solutions. Jason holds a bachelor’s degree in computer engineering from the University of Oklahoma and an MBA from the University of Central Oklahoma.

Dmitry Balabanov

Dmitry is a Solutions Architect with AWS where he focuses on building reusable assets for customers across multiple industries. With over 15 years of experience in designing, building, and maintaining applications, he still loves learning new things. When not at work, he enjoys paragliding and mountain trekking.

Impact of Infrastructure failures on shard in Amazon OpenSearch Service

Post Syndicated from Bukhtawar Khan original https://aws.amazon.com/blogs/big-data/impact-of-infrastructure-failures-on-shard-in-amazon-opensearch-service/

Amazon OpenSearch Service is a managed service that makes it easy to secure, deploy, and operate OpenSearch and legacy Elasticsearch clusters at scale in the AWS Cloud. Amazon OpenSearch Service provisions all the resources for your cluster, launches it, and automatically detects and replaces failed nodes, reducing the overhead of self-managed infrastructures. The service makes it easy for you to perform interactive log analytics, real-time application monitoring, website searches, and more by offering the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions).

In the latest service software release, we’ve updated the shard allocation logic to be load-aware so that when shards are redistributed in case of any node failures, the service disallows surviving nodes from getting overloaded by shards previously hosted on the failed node. This is especially important for Multi-AZ domains to provide consistent and predictable cluster performance.

If you would like more background on shard allocation logic in general, please see Demystifying Elasticsearch shard allocation.

The challenge

An Amazon OpenSearch Service domain is said to be “balanced” when the number of nodes is equally distributed across the configured Availability Zones, and the total number of shards is distributed equally across all the available nodes without concentration of the shards of any one index on any one node. OpenSearch also has a property called “Zone Awareness” that, when enabled, ensures that the primary shard and its corresponding replica are allocated in different Availability Zones. If you have more than one copy of data, having multiple Availability Zones provides better fault tolerance and availability. When the domain is scaled out or in, or during the failure of nodes, OpenSearch automatically redistributes the shards between the available nodes while obeying the allocation rules based on zone awareness.

While the shard-balancing process ensures that shards are evenly distributed across Availability Zones, in some cases, if there is an unexpected failure in a single zone, shards will get reallocated to the surviving nodes. This might result in the surviving nodes getting overwhelmed, impacting cluster stability.

For instance, if one node in a three-node cluster goes down, OpenSearch redistributes the unassigned shards, as shown in the following diagram. Here “P” represents a primary shard copy, whereas “R” represents a replica shard copy.

This behavior of the domain can be explained in two parts – during failure and during recovery.

During failure

A domain deployed across multiple Availability Zones can encounter multiple types of failures during its lifecycle.

Complete zone failure

A cluster may lose a single Availability Zone, and with it all the nodes in that zone, for a variety of reasons. Today, the service tries to place the lost nodes in the remaining healthy Availability Zones. The service also tries to re-create the lost shards in the remaining nodes while still following the allocation rules. This can result in some unintended consequences.

  • When the shards of the impacted zone are getting reallocated to healthy zones, they trigger shard recoveries that can increase the latencies as it consumes additional CPU cycles and network bandwidth.
  • For an n-AZ, n-copy setup (n>1), the nth shard copy gets allocated to the surviving n-1 Availability Zones. This can be undesirable because it skews the shard distribution, which can also result in unbalanced traffic across nodes. These nodes can get overloaded, leading to further failures.

Partial zone failure

In the event of a partial zone failure, or when the domain loses only some of the nodes in an Availability Zone, Amazon OpenSearch Service tries to replace the failed nodes as quickly as possible. However, if it takes too long to replace the nodes, OpenSearch tries to allocate the unassigned shards of that zone onto the surviving nodes in the Availability Zone. If the service cannot replace the nodes in the impacted Availability Zone, it may launch them in another configured Availability Zone, which may further skew the distribution of shards both across and within the zones. This again has unintended consequences.

  • If the nodes on the domain do not have enough storage space to accommodate the additional shards, the domain can be write-blocked, impacting indexing operation.
  • Due to the skewed distribution of shards, the domain may also experience skewed traffic across the nodes, which can further increase the latencies or timeouts for read and write operations.

Recovery

Today, in order to maintain the desired node count of the domain, Amazon OpenSearch Service launches data nodes in the remaining healthy Availability Zones, similar to the scenarios described in the failure section above. In order to ensure proper node distribution across all the Availability Zones after such an incident, manual intervention by AWS is needed.

What’s changing

To improve overall failure handling and minimize the impact of failures on domain health and performance, Amazon OpenSearch Service is making the following changes:

  • Forced Zone Awareness: OpenSearch has a preexisting shard balancing configuration called forced awareness that is used to set the Availability Zones to which shards need to be allocated. For example, if you have an awareness attribute called zone and configure nodes in zone1 and zone2, you can use forced awareness to prevent OpenSearch from allocating replicas if only one zone is available:
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2

With this example configuration, if you start two nodes with node.attr.zone set to zone1 and create an index with five shards and one replica, OpenSearch creates the index and allocates the five primary shards but no replicas. Replicas are only allocated once nodes with node.attr.zone set to zone2 are available.

Amazon OpenSearch Service will use the forced awareness configuration on Multi-AZ domains to ensure that shards are only allocated according to the rules of zone awareness. This would prevent the sudden increase in load on the nodes of the healthy Availability Zones.

  • Load-Aware Shard Allocation: Amazon OpenSearch Service will take into consideration factors like the provisioned capacity, actual capacity, and total shard copies to calculate if any node is overloaded with more shards based on expected average shards per node. It would prevent shard assignment when any node has allocated a shard count that goes beyond this limit.

Note that any unassigned primary copy would still be allowed on the overloaded node to prevent the cluster from any imminent data loss.
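
The exact allocator formula is internal to the service, but the idea can be illustrated with a simplified sketch: compare the shards already on a node with the expected average per node, plus a small skew allowance, before assigning another replica. The skew_allowance value below is purely illustrative and not the service's actual threshold.

import math

def exceeds_shard_limit(total_shard_copies, provisioned_nodes, shards_on_node, skew_allowance=0.1):
    """Simplified illustration: is this node already holding more shards than
    the expected average per node, plus a small allowance?"""
    expected_per_node = total_shard_copies / provisioned_nodes
    limit = math.ceil(expected_per_node * (1 + skew_allowance))
    return shards_on_node >= limit

# With 36 shard copies across 6 provisioned nodes, the expected average is 6 per node,
# so a node already holding 7 shards would be skipped for new replica assignments.
print(exceeds_shard_limit(36, 6, 7))  # True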

Similarly, to address the manual recovery issue (as described in the Recovery section above), Amazon OpenSearch Service is also making changes to its internal scaling component. With the newer changes in place, Amazon OpenSearch Service will not launch nodes in the remaining Availability Zones even when it goes through the previously described failure scenario.

Visualizing the current and new behavior

For example, an Amazon OpenSearch Service domain is configured with 3-AZ, 6 data nodes, 12 primary shards, and 24 replica shards. The domain is provisioned across AZ-1, AZ-2, and AZ-3, with two nodes in each of the zones.

Current shard allocation:
Total number of shards: 12 Primary + 24 Replica = 36 shards
Number of Availability Zones: 3
Number of shards per zone (zone awareness is true): 36/3 = 12
Number of nodes per Availability Zone: 2
Number of shards per node: 12/2 = 6

The following diagram provides a visual representation of the domain setup. The circles denote the count of shards allocated to the node. Amazon OpenSearch Service will allocate six shards per node.

During a partial zone failure, where one node in AZ-3 fails, the replacement for the failed node is assigned to a remaining zone, and the shards in the zone are redistributed based on the available nodes. After the changes described above, the cluster will neither create a new node in another zone nor redistribute shards after the failure of the node.


In the diagram above, with the loss of one node in AZ-3, Amazon OpenSearch Service would try to launch the replacement capacity in the same zone. However, due to an outage, the zone might be impaired and fail to launch the replacement. In such an event, the service tries to launch the deficit capacity in another healthy zone, which might lead to zone imbalance across Availability Zones, and the shards of the impacted zone get packed onto the surviving node in the same zone. With the new behavior, however, the service would still attempt to launch capacity in the same zone but would avoid launching deficit capacity in other zones to prevent imbalance. The shard allocator would also ensure that the surviving nodes don’t get overloaded.


Similarly, in case all the nodes in AZ-3 are lost, or AZ-3 becomes impaired, Amazon OpenSearch Service currently brings up the lost nodes in the remaining Availability Zones and also redistributes the shards onto those nodes. However, after the new changes, Amazon OpenSearch Service will neither allocate nodes to the remaining zones nor try to re-allocate lost shards to them. Amazon OpenSearch Service will wait for the recovery to happen and for the domain to return to the original configuration after recovery.

If your domain is not provisioned with enough capacity to withstand the loss of an Availability Zone, you may experience a drop in throughput for your domain. It is therefore strongly recommended to follow the best practices while sizing your domain, which means having enough resources provisioned to withstand the loss of a single Availability Zone.


Currently, once the domain recovers, the service requires manual intervention to balance capacity across Availability Zones, which also involves shard movements. However, with the new behavior, no intervention is needed during the recovery process because the capacity returns in the impacted zone and the shards are automatically allocated to the recovered nodes. This ensures that there are no competing priorities on the remaining resources.

What you can expect

After you update your Amazon OpenSearch Service domain to the latest service software release, the domains that have been configured with best practices will have more predictable performance even after losing one or many data nodes in an Availability Zone. There will be fewer cases of shard over-allocation on a node. It is a good practice to provision sufficient capacity to be able to tolerate a single zone failure.

You might at times see a domain turning yellow during such unexpected failures since we won’t assign replica shards to overloaded nodes. However, this does not mean that there will be data loss in a well-configured domain. We will still make sure that all primaries are assigned during the outages. There is an automated recovery in place, which will take care of balancing the nodes in the domain and ensuring that the replicas are assigned once the failure recovers.

Update the service software of your Amazon OpenSearch Service domain to get these new changes applied to your domain. More details on the service software update process are in the Amazon OpenSearch Service documentation.

Conclusion

In this post, we saw how Amazon OpenSearch Service recently improved the logic to distribute nodes and shards across Availability Zones during zonal outages.

This change will help the service ensure more consistent and predictable performance during node or zonal failures. Domains won’t see increased latencies or write blocks while processing reads and writes, which previously surfaced at times due to over-allocation of shards on nodes.


About the authors

Bukhtawar Khan is a Senior Software Engineer working on Amazon OpenSearch Service. He is interested in distributed and autonomous systems. He is an active contributor to OpenSearch.

Anshu Agarwal is a Senior Software Engineer working on AWS OpenSearch at Amazon Web Services. She is passionate about solving problems related to building scalable and highly reliable systems.

Shourya Dutta Biswas is a Software Engineer working on AWS OpenSearch at Amazon Web Services. He is passionate about building highly resilient distributed systems.

Rishab Nahata is a Software Engineer working on OpenSearch at Amazon Web Services. He is fascinated about solving problems in distributed systems. He is active contributor to OpenSearch.

Ranjith Ramachandra is an Engineering Manager working on Amazon OpenSearch Service at Amazon Web Services.

Jon Handler is a Senior Principal Solutions Architect, specializing in AWS search technologies – Amazon CloudSearch, and Amazon OpenSearch Service. Based in Palo Alto, he helps a broad range of customers get their search and log analytics workloads deployed right and functioning well.

Stream VPC flow logs to Amazon OpenSearch Service via Amazon Kinesis Data Firehose

Post Syndicated from Chaitanya Shah original https://aws.amazon.com/blogs/big-data/stream-vpc-flow-logs-to-amazon-opensearch-service-via-amazon-kinesis-data-firehose/

Amazon Virtual Private Cloud (Amazon VPC) flow logs enable you to track the IP traffic going to and from the network interfaces in your VPC for your workloads. Analyzing VPC flow logs helps you understand how your applications are communicating over your VPC network and acts as a primary source of information about the network traffic in your VPC. After collecting the flow logs, the next step is to perform log analysis to understand user or application behavior and patterns and make informed decisions. You can analyze the logs using log analytics tools such as Amazon OpenSearch Service.

Amazon Kinesis Data Firehose is a fully managed service for delivering near real-time streaming data to various destinations for storage and performing near real-time analytics. With its extensible data transformation capabilities, you can also streamline log processing and log delivery pipelines into a single Firehose delivery stream.

Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), as well as visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). Amazon OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management processing trillions of requests per month.

In this post, you will learn how to ingest VPC flow logs with Kinesis Data Firehose and deliver them to an Amazon OpenSearch Service for analysis using OpenSearch Service Dashboards.

Overview of solution

This solution uses native integration of VPC flow logs streaming to Kinesis Data Firehose. We use a Firehose delivery stream to buffer the streamed VPC flow logs, and deliver those to an OpenSearch Service destination endpoint. We use Amazon OpenSearch Service Dashboards to create an index pattern for the VPC flow logs to analyze and visualize the logs in a near-real time. The following diagram illustrates this architecture.

Solution Architecture

We walk you through the following high-level steps:

  1. Create an OpenSearch Service domain for storing and analyzing the VPC flow logs.
  2. Create a Firehose delivery stream to deliver the flow logs to the OpenSearch Service domain.
  3. Create a VPC flow log subscription to the delivery stream.
  4. Explore VPC flow logs in OpenSearch Service Dashboards
    • Create role mapping with an OpenSearch Service user to the Kinesis Data Firehose service role. Because we’re using a public access domain for OpenSearch Service, we have to map the delivery stream AWS Identity and Access Management (IAM) role to the OpenSearch Service primary user to deliver logs in bulk to the OpenSearch Service domain.
    • Create an index pattern in OpenSearch Service Dashboards to enable analysis and visualization of VPC logs.

Prerequisites

As a prerequisite, you need to create an Amazon Simple Storage Service (Amazon S3) bucket to store the Firehose delivery stream backups and failed logs.

Create an Amazon OpenSearch Service domain

For demonstration purposes, and to limit the costs, we create an OpenSearch Service domain with the Development and testing deployment type and public access to the dashboard. For instructions, refer to Create an Amazon OpenSearch Service domain. Note that we select Public access only for demo purposes. For production, we recommend using VPC access for security reasons.

When it’s complete, the OpenSearch Service domain shows as Active.

OpenSearch Domain

Create a Kinesis Data Firehose delivery stream

Now that your Amazon OpenSearch Service domain is active, you can create a Firehose delivery stream where VPC flow logs are streamed.

  1. On the Amazon Kinesis console, choose Kinesis Data Firehose in the navigation pane, then choose Create delivery stream.
  2. Choose Direct PUT as the source and set the destination as Amazon OpenSearch Service.
  3. For Delivery stream name, enter PUT-OPENSEARCH-STREAM-DEMO.
  4. In the Destination settings section, choose Browse and choose the previously created Amazon OpenSearch Service domain.
  5. For Index name, enter vpcflowlogs.
  6. For Index rotation, choose Every day.
  7. For this post, we set Buffer size to 5 MiB and Buffer interval to 900 seconds. You can modify these settings to optimize ingestion throughput and near-real-time behavior.
    Kinesis Stream Destination setting
  8. In the Backup settings section, for Source record backup in Amazon S3, select Failed events only so you only save the data that fails to deliver to Amazon OpenSearch Service.
  9. For S3 bucket, choose Browse and choose the S3 bucket you created to store failed logs and backups.
  10. Optionally, you can input a prefix for backup files and error files.
  11. Select GZIP for Compression for data records.
  12. For Encryption for data records, select Disabled.
  13. Expand Advanced settings, and for Amazon CloudWatch error logging, select Enabled.
  14. Choose Create delivery stream.

When the delivery stream is active, proceed to the next step.
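
If you prefer to script the delivery stream instead of using the console, the boto3 call is roughly the following sketch. All ARNs are hypothetical placeholders, the IAM role must already allow Firehose to write to the domain and the backup bucket, and the parameter shapes may differ slightly between SDK versions, so treat this as an outline rather than a drop-in script.

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="PUT-OPENSEARCH-STREAM-DEMO",
    DeliveryStreamType="DirectPut",
    AmazonopensearchserviceDestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-opensearch-role",
        "DomainARN": "arn:aws:es:us-east-1:111122223333:domain/vpc-flow-logs-demo",
        "IndexName": "vpcflowlogs",
        "IndexRotationPeriod": "OneDay",
        "BufferingHints": {"IntervalInSeconds": 900, "SizeInMBs": 5},
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-opensearch-role",
            "BucketARN": "arn:aws:s3:::example-firehose-backup-bucket",
            "CompressionFormat": "GZIP",
        },
    },
)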

Create a VPC flow logs subscription

Now you create a VPC flow logs subscription for the Firehose delivery stream you created in the previous step.

  1. On the Amazon VPC console, choose Your VPCs.
  2. Select the VPC for which to create the flow log.
  3. On the Actions menu, choose Create flow log.
  4. Select All to send all flow log records to Amazon OpenSearch Service.

If you want to filter the flow logs, you can select either Accept or Reject.

  5. For Maximum aggregation interval, select 10 minutes or the minimum setting of 1 minute if you need the flow log data to be available for near-real-time analysis in Amazon OpenSearch Service.
  6. For Destination, select Send to Kinesis Firehose in the same account if the delivery stream is set up on the same account where you create the VPC flow logs.
  7. For Log record format, if you leave it at AWS default format, the flow logs are sent as version 2 format.

Alternatively, you can specify which fields you need the flow logs to capture and send to an Amazon OpenSearch Service. For more information on log format and available fields, refer to Flow log records.

  8. Choose Create flow log. If you prefer to script this step, a boto3 sketch of the equivalent call follows.
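
The boto3 equivalent of this console flow is sketched below. The VPC ID and delivery stream ARN are placeholders, and the call assumes the Kinesis Data Firehose destination for VPC flow logs is available in your Region.

import boto3

ec2 = boto3.client("ec2")

response = ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],  # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="kinesis-data-firehose",
    LogDestination="arn:aws:firehose:us-east-1:111122223333:deliverystream/PUT-OPENSEARCH-STREAM-DEMO",
    MaxAggregationInterval=60,  # 1-minute aggregation for near-real-time analysis
)
print(response["FlowLogIds"])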

Now let’s explore the VPC flow logs in Amazon OpenSearch Service.

Explore VPC flow logs in Amazon OpenSearch Service Dashboards

In the final step, we set up OpenSearch Service Dashboards to explore the VPC flow logs.

  1. On the OpenSearch Service console, choose Domains in the navigation pane.
  2. Choose the domain you created.
  3. Under OpenSearch Dashboards URL, choose the link to open a new tab.
  4. Log in with the user you created during OpenSearch Service domain setup.
  5. Select Private for Select your tenant, then choose Confirm.

Because we used a public access domain for OpenSearch Service, you need to map the role created for the Firehose delivery stream to the OpenSearch Service Dashboards user, so that the delivery stream can deliver logs in bulk to the OpenSearch Service domain.

  1. On the menu icon, choose Security.
  2. Choose Roles.
  3. Choose the all_access role.
  4. On the Mapped users tab, choose Manage mapping.
  5. For Backend roles, enter the IAM role ARN created for the Firehose delivery stream.
  6. Choose Map.
  7. Now that mapping is complete, choose the menu icon, then choose Stack management.
  8. Choose Index Patterns, then choose Create index pattern.
  9. For Index pattern name, enter vpcflowlogs*.
  10. Choose Next step.
  11. Navigate to the Discover menu option. You can see the VPC flow logs from your VPC in this dashboard. Now you can search and visualize the flow logs that are being streamed in near-real time to the OpenSearch Service domain.
    OpenSearch Service Dashboard Discover

Clean up

After you test out this solution, remember to delete all the resources you created to avoid incurring future charges:

  1. Delete your Amazon OpenSearch Service domain.
  2. Delete the VPC flow logs subscription.
  3. Delete the Firehose delivery stream.
  4. Delete the S3 bucket for the VPC flow logs backup and failed logs.
  5. If you created a new VPC and new resources in the VPC, delete the resources and VPC.

Conclusion

In this post, we walked through a solution for how to integrate VPC flow logs with a Kinesis Data Firehose delivery stream, deliver them to an Amazon OpenSearch Service destination with no code, and visualize them in OpenSearch Service Dashboards.

Try this new quick and hassle-free way of sending your VPC flow logs to Amazon OpenSearch Service using Kinesis Data Firehose.


About the Author

Chaitanya Shah is a Sr. Technical Account Manager with AWS, based out of New York. He has over 22 years of experience working with enterprise customers. He loves to code and actively contributes to the AWS solutions labs to help customers solve complex problems. He provides guidance to AWS customers on best practices for their AWS Cloud migrations. He is also specialized in AWS data transfer and the data and analytics domain.

How to get best price performance from your Amazon Redshift Data Sharing deployment

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/how-to-get-best-price-performance-from-your-amazon-redshift-data-sharing-deployment/

Amazon Redshift is a fast, scalable, secure, and fully-managed data warehouse that enables you to analyze all of your data using standard SQL easily and cost-effectively. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and regions without needing to copy or move data from one cluster to another.

Amazon Redshift Data Sharing was initially launched in March 2021, and support for cross-account data sharing was added in August 2021. Cross-Region support became generally available in February 2022. This provides full flexibility and agility to share data across Redshift clusters in the same AWS account, different accounts, or different Regions.

Amazon Redshift Data Sharing is used to fundamentally redefine Amazon Redshift deployment architectures into a hub-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and most importantly do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?”, and “How do I get the best price performance for workload isolation?”. As workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, a continuous strategy to size both consumer and producer clusters to maximize the performance and minimize cost should be implemented. In this post, we provide a step-by-step approach to help you determine your producer and consumer clusters sizes for the best price performance based on your specific workload.

Generic consumer sizing guidance

The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify accordingly to cater your specific use case scenario.

Size your producer cluster

You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started, and you should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Size and setup initial consumer cluster

You should always size your consumer cluster based on your compute needs. One way to get started is to follow the generic cluster sizing guide similar to the producer cluster above.

Setup Amazon Redshift data sharing

Set up data sharing from producer to consumer once you have both the producer and consumer clusters set up. Refer to this post for guidance on how to set up data sharing.
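
The referenced post has the full walkthrough; at a high level, the setup boils down to a handful of SQL statements, which you can run from a script through the Redshift Data API. The sketch below uses hypothetical cluster identifiers, database names, and namespace GUIDs, and shares everything in the public schema purely as an example.

import boto3

rsd = boto3.client("redshift-data")

# Producer side: create a datashare, add objects, and grant it to the consumer namespace
producer_sql = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'consumer-namespace-guid';",
]
for sql in producer_sql:
    rsd.execute_statement(
        ClusterIdentifier="producer-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )

# Consumer side: create a local database from the shared data
rsd.execute_statement(
    ClusterIdentifier="consumer-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="CREATE DATABASE sales_db FROM DATASHARE sales_share OF NAMESPACE 'producer-namespace-guid';",
)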

Test consumer only workload on initial consumer cluster

Test consumer only workload on the new initial consumer cluster. This can be done by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.

Test consumer only workload on different consumer cluster configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.

On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.

As a rule of thumb, size up the consumer cluster by 2x the initial cluster configuration incrementally until it meets your workload requirements.

Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.
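
You can trigger elastic resize from the console or script it; a minimal boto3 sketch looks like the following, with the cluster identifier, node type, and node count as placeholders for the configuration under test.

import boto3

redshift = boto3.client("redshift")

# Elastic resize the consumer cluster to the next configuration under test
redshift.resize_cluster(
    ClusterIdentifier="consumer-cluster",  # placeholder
    NodeType="ra3.4xlarge",                # placeholder
    NumberOfNodes=4,                       # placeholder
    Classic=False,                         # False requests an elastic resize where supported
)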

Test producer only workload on different producer cluster configurations

Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.

To achieve this, you can rerun the producer-only workload on 1/2x of the original producer size and evaluate the workload performance. Resize the cluster up or down accordingly based on the result, and then select the minimum producer configuration that meets your workload performance requirements.

Re-evaluate after a full workload run over time

As Amazon Redshift continues evolving, and there are continuous performance and scalability improvement releases, data sharing performance will continue improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:

  • Ingestion rate and amount of data change
  • Query pattern and characteristic
  • Workload changes
  • Concurrency
  • Maintenance activities, for example vacuum, analyze, and ATO

This is why you must re-evaluate the producer and consumer cluster sizing using the strategy above on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Automated sizing solutions

If your environment involves a more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not be feasible to use the manual method from the generic guidance above. Instead, you can leverage the solution in this section to automatically replay the workload from your production cluster on the test consumer and producer clusters to evaluate the performance.

The Simple Replay utility will be leveraged as the automated solution to guide you through the process of getting the right producer and consumer cluster sizes for the best price performance.

Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.

Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between these Amazon Redshift clusters seamlessly and select the clusters configuration that best meet your price performance target.

The following diagram shows the solution architecture.

Architecutre for testing simple replay

Solution walkthrough

Follow these steps to go through the solution to size your consumer and producer clusters.

Size your production cluster

You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation on the production cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Identify the workload to be isolated

You might have different workloads running on your original cluster, but the first step is to identify the most critical workload to the business that we want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload can be isolated.

Setup Simple Replay

Once you know your critical workload, you must enable audit logging in the production cluster where the critical workload identified above is running, to capture query activities and store them in Amazon Simple Storage Service (Amazon S3). Note that it may take up to 3 hours for the audit logs to be delivered to Amazon S3. Once the audit log is available, proceed to set up Simple Replay and then extract the critical workload from the audit log. Note that start_time and end_time can be used as parameters to filter the critical workload if those workloads run in certain time periods, for example 9 AM to 11 AM. Otherwise, it will extract all of the logged activities.
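
If you want to script the audit logging step, a minimal boto3 sketch is shown below. The cluster identifier and bucket name are placeholders, and the target bucket must already have a bucket policy that allows Amazon Redshift to deliver logs.

import boto3

redshift = boto3.client("redshift")

# Enable audit logging on the production cluster, delivered to Amazon S3
redshift.enable_logging(
    ClusterIdentifier="production-cluster",    # placeholder
    BucketName="example-redshift-audit-logs",  # placeholder; needs the Redshift log delivery bucket policy
    S3KeyPrefix="auditlogs/",
)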

Baseline workload

Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance with an isolated environment.

Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.

Setup initial producer and consumer test clusters

Create a producer cluster with the same production cluster configuration by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Furthermore, setup data sharing between the producer and consumer.

Replay workload on initial producer and consumer

Replay the producer only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Replay the consumer only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to include only consumer queries, for example the user that runs consumer queries.

Evaluate the performance of these replays against the baseline and workload performance requirements.

Replay consumer workload on different configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or you can follow these steps to test on smaller configurations to see if you can further reduce costs and still get the performance that you need.

Compare initial consumer performance results against your workload requirements:

  1. If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down accordingly based on the result until it meets your workload requirements. The purpose is to get a sweet spot where you’re comfortable with the performance requirements and get the lowest price possible.
  2. If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements.

Replay producer workload on different configurations

Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced and you should evaluate your producer cluster’s workload performance to seek the opportunity to downsize to save on costs.

The steps are similar to consumer replay. Elastic resize the producer cluster incrementally starting with 1/2x the original size, replay the producer only workload and evaluate the performance, and then further resize up or down until it meets your workload performance requirements. The purpose is to get a sweet spot where you’re comfortable with the workload performance requirements and get the lowest price possible. Once you have the desired producer cluster configuration, retry replay consumer workloads on the consumer cluster to make sure that the performance wasn’t impacted by producer cluster configuration changes. Finally, you should replay both producer and consumer workloads concurrently to make sure that the performance is achieved in a full workload scenario.

Re-evaluate after a full workload run over time

Similar to the generic guidance, you should re-evaluate the producer and consumer clusters sizing using the previous strategy on occasion, especially after full workload deployment to gain the new best price performance from your cluster’s configuration.

Clean up

Running these sizing tests in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save on costs. We also recommend pausing your clusters when they’re not in use.

Applying Amazon Redshift and data sharing best practices

Proper sizing of both your producer and consumer clusters will give you a good start to get the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance. In this case, understanding and following best practices are equally important.

General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.

There are numerous data sharing-specific best practices that you should follow to make sure that you maximize performance. Refer to this post for more details.

Summary

There is no one-size-fits-all recommendation on producer and consumer cluster sizes. It varies by workload and your performance SLA. The purpose of this post is to provide you with guidance for how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes to get the best price performance. Consider testing your workloads on the producer and consumer using Simple Replay before adopting it in production to get the best price performance.


About the Authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance and operational excellence at scale in their cloud journey. He has a keen interest in Data Analytics as well.

Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT)

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/migrate-google-bigquery-to-amazon-redshift-using-aws-schema-conversion-tool-sct/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. Using Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers. The following are just some of the notable capabilities:

  • Amazon Redshift seamlessly integrates with broader analytics services on AWS. This enables you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. Amazon Redshift lets you build lake house architectures and then perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning (ML), and more.
  • You don’t need to worry about workloads, such as ETL, dashboards, ad-hoc queries, and so on, interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get one hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is easy-to-use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault tolerance is built in. All of the data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Failed hardware is automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, and serverless web services-based or event-driven applications and so on.
  • Redshift ML makes it easy for data scientists to create, train, and deploy ML models using familiar SQL. They can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row and column level security controls on sensitive data, and so on.
  • Amazon Redshift integrates seamlessly with other AWS services and third-party tools. You can move, transform, load, and query large datasets quickly and reliably.

In this post, we provide a walkthrough for migrating a data warehouse from Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. Furthermore, AWS SCT can scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your BigQuery project. First, we create an Amazon Redshift database into which BigQuery data is migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert BigQuery schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from BigQuery, upload it into the S3 bucket, and then copy to Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

  1. A workstation with AWS SCT, Amazon Corretto 11, and Amazon Redshift drivers.
    1. You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local desktop as a workstation. In this walkthrough, we use an Amazon EC2 Windows instance. To create it, use this guide.
    2. To download and install AWS SCT on the EC2 instance that you previously created, use this guide.
    3. Download the Amazon Redshift JDBC driver from this location.
    4. Download and install Amazon Corretto 11.
  2. A GCP service account that AWS SCT can use to connect to your source BigQuery project.
    1. Grant BigQuery Admin and Storage Admin roles to the service account.
    2. Copy the Service account key file, which was created in the Google cloud management console, to the EC2 instance that has AWS SCT.
    3. Create a Cloud Storage bucket in GCP to store your source data during migration.

This walkthrough covers the following steps:

  • Create an Amazon Redshift Serverless Workgroup and Namespace
  • Create the Amazon S3 bucket and folder
  • Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT
    • Connect to the Google BigQuery source
    • Connect to the Amazon Redshift Target
    • Convert the BigQuery schema to Amazon Redshift
    • Analyze the assessment report and address the action items
    • Apply converted schema to target Amazon Redshift
  • Migrate data using AWS SCT data extraction agents
    • Generating Trust and Key Stores (Optional)
    • Install and start data extraction agent
    • Register data extraction agent
    • Add virtual partitions for large tables (Optional)
    • Create a local migration task
    • Start the Local Data Migration Task
  • View Data in Amazon Redshift

Create an Amazon Redshift Serverless Workgroup and Namespace

In this step, we create an Amazon Redshift Serverless workgroup and namespace. A workgroup is a collection of compute resources, and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Amazon Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create an Amazon Redshift Serverless workgroup and namespace (a scripted alternative is sketched after this list):

  • Navigate to the Amazon Redshift console.
  • In the upper right, choose the AWS Region that you want to use.
  • Expand the Amazon Redshift pane on the left and choose Redshift Serverless.
  • Choose Create Workgroup.
  • For Workgroup name, enter a name that describes the compute resources.
  • Verify that the VPC is the same VPC as the EC2 instance that has AWS SCT.
  • Choose Next.

  • For Namespace name, enter a name that describes your dataset.
  • In the Database name and password section, select the Customize admin user credentials checkbox.
    • For Admin user name, enter a username of your choice, for example awsuser.
    • For Admin user password, enter a password of your choice, for example MyRedShiftPW2022.
  • Choose Next. Note that data in the Amazon Redshift Serverless namespace is encrypted by default.
  • In the Review and Create page, choose Create.
  • Create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace, as described in the following steps. Note that there can be only one default IAM role.
    • Navigate to the Amazon Redshift Serverless Dashboard.
    • Under Namespaces / Workgroups, choose the namespace that you just created.
    • Navigate to Security and encryption.
    • Under Permissions, choose Manage IAM roles.
    • On the Manage IAM roles page, choose the Manage IAM roles drop-down, and then choose Create IAM role.
    • Under Specify an Amazon S3 bucket for the IAM role to access, choose one of the following methods:
      • Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with a name starting with redshift.
      • Choose Any Amazon S3 bucket to allow the created IAM role to access all of the S3 buckets.
      • Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
    • Choose Create IAM role as default. Amazon Redshift automatically creates and sets the IAM role as default.
  • Capture the Endpoint for the Amazon Redshift Serverless workgroup that you just created.
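If you want to script the preceding setup instead of clicking through the console, a minimal boto3 sketch might look like the following. The namespace name, workgroup name, credentials, and base capacity are placeholder examples, and networking settings are omitted so that the account defaults apply.

import boto3

rs_serverless = boto3.client("redshift-serverless")

# Create the namespace that holds database objects and users
# (names and credentials are placeholder examples).
rs_serverless.create_namespace(
    namespaceName="bq-migration-namespace",
    dbName="dev",
    adminUsername="awsuser",
    adminUserPassword="MyRedShiftPW2022",
)

# Create the workgroup that provides compute for the namespace.
rs_serverless.create_workgroup(
    workgroupName="bq-migration-workgroup",
    namespaceName="bq-migration-namespace",
    baseCapacity=32,  # RPUs; adjust for your workload
)

# Capture the endpoint to use later as the AWS SCT server name.
# The workgroup can take a few minutes to become available, so this
# lookup may need to be retried.
endpoint = rs_serverless.get_workgroup(
    workgroupName="bq-migration-workgroup"
)["workgroup"]["endpoint"]["address"]
print(endpoint)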

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create the S3 bucket:

  • Navigate to the Amazon S3 console.
  • Choose Create bucket. The Create bucket wizard opens.
  • For Bucket name, enter a unique DNS-compliant name for your bucket (for example, uniquename-bq-rs). See the rules for bucket naming when choosing a name.
  • For AWS Region, choose the Region in which you created the Amazon Redshift Serverless workgroup.
  • Choose Create bucket.
  • In the Amazon S3 console, navigate to the S3 bucket that you just created (for example, uniquename-bq-rs).
  • Choose Create folder to create a new folder.
  • For Folder name, enter incoming and choose Create folder.
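The same bucket and folder can also be created programmatically. The following boto3 sketch assumes the placeholder bucket name uniquename-bq-rs and the us-east-2 Region; for us-east-1, omit CreateBucketConfiguration.

import boto3

s3 = boto3.client("s3", region_name="us-east-2")  # use the same Region as the workgroup

# Create the staging bucket (the name is a placeholder and must be globally unique).
s3.create_bucket(
    Bucket="uniquename-bq-rs",
    CreateBucketConfiguration={"LocationConstraint": "us-east-2"},
)

# Create the incoming/ folder by writing a zero-byte object with a trailing slash.
s3.put_object(Bucket="uniquename-bq-rs", Key="incoming/")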

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

To convert BigQuery schema to the Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that we created previously, and then launch AWS SCT.

Follow these steps using AWS SCT:

Connect to the BigQuery Source

  • From the File Menu choose Create New Project.
  • Choose a location to store your project files and data.
  • Provide a meaningful but memorable name for your project, such as BigQuery to Amazon Redshift.
  • To connect to the BigQuery source data warehouse, choose Add source from the main menu.
  • Choose BigQuery and choose Next. The Add source dialog box appears.
  • For Connection name, enter a name to describe the BigQuery connection. AWS SCT displays this name in the tree in the left panel.
  • For Key path, provide the path of the service account key file that was previously created in the Google cloud management console.
  • Choose Test Connection to verify that AWS SCT can connect to your source BigQuery project.
  • Once the connection is successfully validated, choose Connect.

Connect to the Amazon Redshift Target

Follow these steps to connect to Amazon Redshift:

  • In AWS SCT, choose Add Target from the main menu.
  • Choose Amazon Redshift, then choose Next. The Add Target dialog box appears.
  • For Connection name, enter a name to describe the Amazon Redshift connection. AWS SCT displays this name in the tree in the right panel.
  • For Server name, enter the Amazon Redshift Serverless workgroup endpoint captured previously.
  • For Server port, enter 5439.
  • For Database, enter dev.
  • For User name, enter the username chosen when creating the Amazon Redshift Serverless workgroup.
  • For Password, enter the password chosen when creating the Amazon Redshift Serverless workgroup.
  • Uncheck the Use AWS Glue box.
  • Choose Test Connection to verify that AWS SCT can connect to your target Amazon Redshift workgroup.
  • Choose Connect to connect to the Amazon Redshift target.

Alternatively, you can use connection values that are stored in AWS Secrets Manager.

Convert the BigQuery schema to Amazon Redshift

After the source and target connections are successfully made, you see the source BigQuery object tree on the left pane and target Amazon Redshift object tree on the right pane.

Follow these steps to convert BigQuery schema to the Amazon Redshift format:

  • On the left pane, right-click on the schema that you want to convert.
  • Choose Convert Schema.
  • A dialog box appears with the question "The objects might already exist in the target database. Replace?" Choose Yes.

Once the conversion is complete, you see a new schema created on the Amazon Redshift pane (right pane) with the same name as your BigQuery schema.

The sample schema that we used has 16 tables, 3 views, and 3 procedures. You can see these objects in the Amazon Redshift format in the right pane. AWS SCT converts all of the BigQuery code and data objects to the Amazon Redshift format. Furthermore, you can use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects. However, some of the objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from Main view to Assessment Report view.

The Summary tab shows objects that were converted automatically, and objects that weren’t converted automatically. Green represents automatically converted or with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action Items tab shows the recommended actions for each conversion issue. If you select an action item from the list, AWS SCT highlights the object to which the action item applies.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database/schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT takes some actions automatically while converting the schema to Amazon Redshift; objects with these actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the RECORD and JSON datatype columns in BigQuery table ncaaf_referee_data to the SUPER datatype in Amazon Redshift. The partition key in the ncaaf_referee_data table is converted to the distribution key and sort key in Amazon Redshift.

Apply converted schema to target Amazon Redshift

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and then choose Apply to database.

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this walkthrough, we show how to configure AWS SCT extraction agents to extract data from BigQuery and migrate to Amazon Redshift.

First, install AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For big datasets, you can use several data extraction agents to increase the data migration speed.

Generating trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all of the data passed between the applications remains private and integral. To use SSL communication, you must generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, navigate to Settings → Global Settings → Security.
  2. Choose Generate trust and key store.
  3. Enter the name and password for trust and key stores and choose a location where you would like to store them.
  4. Choose Generate.

Install and configure Data Extraction Agent

In the installation package for AWS SCT, you can find a sub-folder named agents (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Listening port, enter the port number on which the agent listens. It is 8192 by default.
  2. For Add a source vendor, enter no, as you don’t need drivers to connect to BigQuery.
  3. For Add the Amazon Redshift driver, enter YES.
  4. For Enter Redshift JDBC driver file or files, enter the location where you downloaded Amazon Redshift JDBC drivers.
  5. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.
  6. For Enable SSL communication, enter yes. Enter no here if you don't want to use SSL.
  7. For Key store, enter the storage location chosen when creating the trust and key store.
  8. For Key store password, enter the password for the key store.
  9. For Enable client SSL authentication, enter yes.
  10. For Trust store, enter the storage location chosen when creating the trust and key store.
  11. For Trust store password, enter the password for the trust store.
*************************************************
*                                               *
*     AWS SCT Data Extractor Configuration      *
*              Version 2.0.1.666                *
*                                               *
*************************************************
User name: Administrator
User home: C:\Windows\system32\config\systemprofile
*************************************************
Listening port [8192]: 8192
Add a source vendor [YES/no]: no
No one source data warehouse vendors configured. AWS SCT Data Extractor cannot process data extraction requests.
Add the Amazon Redshift driver [YES/no]: YES
Enter Redshift JDBC driver file or files: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\redshift-jdbc42-2.1.0.9.jar
Working folder [C:\Windows\system32\config\systemprofile]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject
Enable SSL communication [YES/no]: YES
Setting up a secure environment at "C:\Windows\system32\config\systemprofile". This process will take a few seconds...
Key store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftKeyStore
Key store password:
Re-enter the key store password:
Enable client SSL authentication [YES/no]: YES
Trust store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftTrustStore
Trust store password:
Re-enter the trust store password:

Starting Data Extraction Agent(s)

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, double-click C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

  • On the computer that has the extraction agent installed, from a command prompt or terminal window, run the command for your operating system.
  • To check the status of the agent, run the same command but replace start with status.
  • To stop an agent, run the same command but replace start with stop.
  • To restart an agent, run the RestartAgent.bat file.

Register the Data Extraction Agent

Follow these steps to register the Data Extraction Agent:

  1. In AWS SCT, change the view to Data Migration view (other) and choose + Register.
  2. In the connection tab:
    1. For Description, enter a name to identify the Data Extraction Agent.
    2. For Host name, if you installed the Data Extraction Agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT Data Extraction Agent is installed. It’s recommended to install the Data Extraction Agents on Linux for better performance.
    3. For Port, enter the number entered for the Listening Port when installing the AWS SCT Data Extraction Agent.
    4. Select the checkbox to use SSL (if using SSL) to encrypt the AWS SCT connection to the Data Extraction Agent.
  3. If you’re using SSL, then in the SSL Tab:
    1. For Trust store, choose the trust store name created when generating the trust and key stores.
    2. For Key store, choose the key store name created when generating the trust and key stores.
  4. Choose Test Connection.
  5. Once the connection is validated successfully, choose Register.

Add virtual partitions for large tables (optional)

You can use AWS SCT to create virtual partitions to optimize migration performance. When virtual partitions are created, AWS SCT extracts the data in parallel for partitions. We recommend creating virtual partitions for large tables.

Follow these steps to create virtual partitions:

  1. Deselect all objects on the source database view in AWS SCT.
  2. Choose the table for which you would like to add virtual partitioning.
  3. Right-click on the table, and choose Add Virtual Partitioning.
  4. You can use List, Range, or Auto Split partitions. To learn more about virtual partitioning, refer to Use virtual partitioning in AWS SCT. In this example, we use Auto split partitioning, which generates range partitions automatically: you specify the start value, the end value, and the partition size (interval), and AWS SCT determines the partitions for you. For a demonstration, on the Lineorder table:
    1. For Start Value, enter 1000000.
    2. For End Value, enter 3000000.
    3. For Interval, enter 1000000 to indicate partition size.
    4. Choose Ok.

You can see the automatically generated partitions under the Virtual Partitions tab. In this example, AWS SCT automatically created the following five partitions on the partitioning field (a sketch of the matching predicates follows the list):

    1. <1000000
    2. >=1000000 and <=2000000
    3. >2000000 and <=3000000
    4. >3000000
    5. IS NULL
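To make the Auto split behavior concrete, here is a small illustrative Python sketch (not part of AWS SCT) that derives the same style of range predicates from a start value, end value, and interval. The column name lo_orderkey is a hypothetical example.

def auto_split_predicates(column, start, end, interval):
    """Illustrate the range predicates that Auto split partitioning produces."""
    predicates = [f"{column} < {start}"]
    lower = start
    while lower < end:
        upper = lower + interval
        op = ">=" if lower == start else ">"
        predicates.append(f"{column} {op} {lower} AND {column} <= {upper}")
        lower = upper
    predicates.append(f"{column} > {end}")
    predicates.append(f"{column} IS NULL")
    return predicates

# Example matching the Lineorder walkthrough values above.
for predicate in auto_split_predicates("lo_orderkey", 1_000_000, 3_000_000, 1_000_000):
    print(predicate)

Running this with the values from the walkthrough prints the five ranges shown in the preceding list.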

Create a local migration task

To migrate data from BigQuery to Amazon Redshift, create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local migration task:

  1. In AWS SCT, under the schema name in the left pane, right-click on Standard tables.
  2. Choose Create Local task.
  3. There are three migration modes from which you can choose:
    1. Extract the source data and store it on a local PC or virtual machine (VM) where the agent runs.
    2. Extract the data and upload it to an S3 bucket.
    3. Extract, upload, and copy, which extracts data to an S3 bucket and then copies it to Amazon Redshift. For this walkthrough, choose Extract upload and copy.
  4. In the Advanced tab, for Google CS bucket folder, enter the Google Cloud Storage bucket/folder that you created earlier in the GCP Management Console. AWS SCT stores the extracted data in this location.
  5. In the Amazon S3 Settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket that you created earlier. The AWS SCT data extraction agent uploads the data into the S3 bucket/folder before copying to Amazon Redshift.
  6. Choose Test Task.
  7. Once the task is successfully validated, choose Create.

Start the Local Data Migration Task

To start the task, choose the Start button in the Tasks tab.

  • First, the Data Extraction Agent extracts data from BigQuery into the GCP storage bucket.
  • Then, the agent uploads the data to Amazon S3 and runs a COPY command to load the data into Amazon Redshift.
  • At this point, AWS SCT has successfully migrated data from the source BigQuery table to the Amazon Redshift table.

View data in Amazon Redshift

After the data migration task executes successfully, you can connect to Amazon Redshift and validate the data.

Follow these steps to validate the data in Amazon Redshift:

  1. Navigate to the Amazon Redshift Query Editor v2.
  2. Double-click on the Amazon Redshift Serverless workgroup name that you created.
  3. Choose the Federated User option under Authentication.
  4. Choose Create Connection.
  5. Create a new editor by choosing the + icon.
  6. In the editor, write a query that selects from the schema and the table or view that you want to verify. Explore the data, run ad hoc queries, and build visualizations, charts, and views. You can also validate row counts programmatically with the Amazon Redshift Data API, as sketched after this list.
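As an optional programmatic check, the following sketch uses the Amazon Redshift Data API through boto3 to run a row-count query against a migrated table. The workgroup, database, schema, and table names are placeholder examples.

import time
import boto3

data_api = boto3.client("redshift-data")

# Submit an asynchronous count query against the migrated table
# (workgroup, database, schema, and table names are placeholders).
resp = data_api.execute_statement(
    WorkgroupName="bq-migration-workgroup",
    Database="dev",
    Sql="SELECT COUNT(*) FROM my_bigquery_schema.lineorder;",
)
statement_id = resp["Id"]

# Poll until the statement finishes, then fetch the result.
while True:
    desc = data_api.describe_statement(Id=statement_id)
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = data_api.get_statement_result(Id=statement_id)
    print("Row count:", result["Records"][0][0]["longValue"])
else:
    print("Query did not finish:", desc.get("Error"))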

The following is a side-by-side comparison between the source BigQuery and target Amazon Redshift for the sports dataset that we used in this walkthrough.

Clean up any AWS resources that you created for this exercise

Follow these steps to terminate the EC2 instance:

  1. Navigate to the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the check box for the EC2 instance that you created.
  4. Choose Instance state, and then Terminate instance.
  5. Choose Terminate when prompted for confirmation.

Follow these steps to delete the Amazon Redshift Serverless workgroup and namespace:

  1. Navigate to Amazon Redshift Serverless Dashboard.
  2. Under Namespaces / Workgroups, choose the workgroup that you created.
  3. Under Actions, choose Delete workgroup.
  4. Select the checkbox Delete the associated namespace.
  5. Uncheck Create final snapshot.
  6. Enter delete in the delete confirmation text box and choose Delete.

Follow these steps to delete the S3 bucket:

  1. Navigate to the Amazon S3 console.
  2. Choose the bucket that you created. If the bucket still contains objects, empty it first.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket in the text input field.
  5. Choose Delete bucket.
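If you prefer to script the preceding cleanup steps, the following boto3 sketch deletes the same resources. All of the identifiers shown are the placeholder examples used earlier, and in practice you should wait for the workgroup deletion to finish before deleting its namespace.

import boto3

rs_serverless = boto3.client("redshift-serverless")
s3 = boto3.resource("s3")
ec2 = boto3.client("ec2")

# Delete the Redshift Serverless workgroup, then its namespace.
# Wait for the workgroup deletion to complete before deleting the namespace.
rs_serverless.delete_workgroup(workgroupName="bq-migration-workgroup")
rs_serverless.delete_namespace(namespaceName="bq-migration-namespace")

# Empty and delete the staging bucket (placeholder name).
bucket = s3.Bucket("uniquename-bq-rs")
bucket.objects.all().delete()
bucket.delete()

# Terminate the EC2 workstation (replace the placeholder instance ID).
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])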

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. Following this walkthrough, you can understand how a data migration task extracts, downloads, and then migrates data from BigQuery to Amazon Redshift. The solution that we presented in this post performs a one-time migration of database objects and data. Data changes made in BigQuery when the migration is in progress won’t be reflected in Amazon Redshift. When data migration is in progress, put your ETL jobs to BigQuery on hold or replay the ETLs by pointing to Amazon Redshift after the migration. Consider using the best practices for AWS SCT.

AWS SCT has some limitations when using BigQuery as a source. For example, AWS SCT can't convert subqueries in analytic functions, geography functions, statistical aggregate functions, and so on. Find the full list of limitations in the AWS SCT user guide. We plan to address these limitations in future releases. Despite these limitations, you can use AWS SCT to automatically convert most of your BigQuery code and storage objects.

Download and install AWS SCT, sign in to the AWS Management Console, check out Amazon Redshift Serverless, and start migrating!


About the authors

Cedrick Hoodye is a Solutions Architect at AWS with a focus on database migrations using AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT). He works on challenges related to database migrations and works closely with EdTech, Energy, and ISV customers to help them realize the true potential of the DMS service. He has helped migrate hundreds of databases into the AWS Cloud using DMS and SCT.

Amit Arora is a Solutions Architect with a focus on Database and Analytics at AWS. He works with our Financial Technology and Global Energy customers and AWS certified partners to provide technical assistance and design customer solutions on cloud migration projects, helping customers migrate and modernize their existing databases to the AWS Cloud.

Jagadish Kumar is an Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and about enabling customers to achieve success with their large-scale data projects.