Tag Archives: Best practices

Modernize your Java application with Amazon Q Developer

Post Syndicated from Chetan Makvana original https://aws.amazon.com/blogs/devops/modernize-your-java-application-with-amazon-q-developer/

Many organizations have critical legacy Java applications that are increasingly difficult to maintain. Modernizing these applications is a necessary, daunting, and risky task that takes focus away from creating new value or features. Common obstacles include undocumented code, outdated frameworks and libraries, security vulnerabilities, missing logging and error handling, and missing input validation. Amazon Q Developer simplifies and accelerates the modernization of existing Java applications. It can analyze code to highlight areas for potential improvements, assist with resolving technical debt, suggest code optimizations, and facilitate the transition to current frameworks and libraries.

This blog post explores how to modernize legacy Java applications using Amazon Q Developer. We will take an example of Unicorn Store API, a Java application with Java 8 running on Amazon Elastic Compute Cloud (Amazon EC2). First, we will upgrade the underlying runtime from Java 8 to Java 17 and other common dependencies, including Spring. Then, we will reduce technical debt within the code by improving modularity and logging. Finally, we will redeploy this application in a container image using a modern computing option, AWS Fargate.

The Unicorn Store API provides CRUD operations to manage Unicorn records in a database. It is built with Maven.

Follow these steps to modernize this application and bring it to Fargate using Amazon Q Developer.

  1. Upgrade the application to Java 17 to leverage the latest features.
  2. Reduce existing technical debt in the codebase.
  3. Make the application cloud native and deploy it to AWS.

In this walkthrough, we use the IntelliJ IDEA IDE with the latest version of the Amazon Q Developer plugin for IntelliJ IDEA.

Upgrade from Java 8 to Java 17

Outdated applications require increased effort to maintain security and stability. As a developer, you must continually relearn framework changes and optimizations that others have discovered in previous upgrades. The effort required to maintain the application makes it difficult to balance necessary updates with adding new features.

With Amazon Q Developer agent for code transformation, you can keep applications updated and supported in just a few steps. This removes vulnerabilities from unsupported versions, improves performance, and frees up time to focus on adding new features. Amazon Q Developer agent for code transformation accelerates application maintenance, upgrades, and migration in minutes. It takes much of the undifferentiated work out of maintaining, upgrading, and migrating existing application workloads, saving days or even months of effort when moving from older language versions.

Let’s upgrade our Unicorn Store API from Java 8 to Java 17 using Amazon Q Developer agent for code transformation to leverage the latest features and optimization. In IntelliJ IDE, you enter /transform in the Amazon Q chat panel and provide the necessary details for Amazon Q Developer to start upgrading the project.

IntelliJ IDE showing the Amazon Q chat panel open with the /transform command entered and details provided about the project to upgrade. This triggers the Amazon Q Developer agent for code transformation to analyze the code, generate a transformation plan, and upgrade frameworks and libraries like Spring, Spring Boot, JUnit, and Log4j to be compatible with Java 17.

Amazon Q Developer agent for code transformation automatically analyzes the existing code, generates a transformation plan, and completes the transformation tasks suggested by the plan. While doing so, it upgrades popular libraries and frameworks, including Spring, Spring Boot, JUnit, JakartaEE, Mockito, Hibernate, and Log4j, to their latest available major versions that are compatible with Java 17. It also updates deprecated code components according to Java 17 recommendations. To get started with the Amazon Q Developer agent for code transformation capability, you can read and follow the steps at Upgrade language versions with Amazon Q Developer agent for Code Transformation.

Once complete, you can review the transformed code, complete with build and test results, before accepting the changes.

In the IntelliJ IDE, Amazon Q Developer agent for code transformation summarizes all the changes it made after the transformation of the current project to Java 17 is complete. User clicks on the View diff button. The list of all the files that have been modified or added is displayed. User has the option to review the diff, and then accept or reject them.

Reduce technical debt in the codebase

Technical debt accumulates in any codebase over time. Some technical debt may be unavoidable to meet deadlines, but it must be tracked and prioritized to pay back later. If left unmanaged, compounding technical debt will make development slower and more expensive. Reducing technical debt should be an ongoing team effort, but it often falls behind other priorities. Amazon Q Developer streamlines modernizing legacy Java code by identifying and remediating technical debt. It reduces the time and resources it takes to analyze the code by providing a list of issues that contribute to technical debt in a codebase. This makes it easier for software development teams to prioritize technical debt items and make informed decisions about which to address first.

Let’s find the list of technical debt in our Unicorn Store API. In IntelliJ IDE, use the Send to Prompt option to send the highlighted code to the Amazon Q chat panel, and prompt it to provide a list of all technical debt. Amazon Q Developer lists all technical debt in detail.

In the IntelliJ IDE, user has opened the code for one of the classes (UnicornController.java in this case). User right-clicks inside this file. This opens a context window. User selects Amazon Q, then selects Send to Prompt. The code is now displayed in the Amazon Q Chat panel. The user enters this message in the chat window: “Analyze the selected code and list all technical debt”. Amazon Q Developer generates a list of all potential technical debt or areas of improvement. It explains the impact of each technical debt.

Once you identify the technical debt, the next step is to gradually remediate it. Amazon Q Developer reduces the time it takes to implement the remediation code. As a developer, you can interact with Amazon Q Developer agent for software development within your IDE to get help with code suggestions for a specific task that you are trying to accomplish. It uses the code in the whole project as context and provides an implementation plan that includes the code updates it plans to make across all the files in the project. You can review the plan and, once you are satisfied with it, ask Amazon Q Developer to generate the code based on the proposed plan. This saves developers effort compared to manual updates.

For the technical debt identified for Unicorn Store API in the above step, let’s use Amazon Q Developer to address the missing logging technical debt. In IntelliJ IDE, enter /dev in the Amazon Q chat panel with the details on the logging technical debt. Amazon Q Developer generates an implementation plan and code to add logging based on the full project context. To get started with Amazon Q Developer agent for software development, you can refer to the steps at Develop software with the Amazon Q Developer agent for software development.

In the IntelliJ IDE, the user opens Amazon Q Chat panel, enters /dev and provides input about the task detail “Implement logging for critical REST endpoints. Log relevant error details such as request parameters and stack traces to facilitate debugging and troubleshooting”. The Amazon Q Developer agent for software development generates the steps to implement this task, including changes that will be made to existing files and any new files that will be created. User reviews all the changes, then clicks on Generate code button. Amazon Q generates a list of code suggestions. User clicks on a file from the list, and reviews the diff between the existing code in that file, and the new code generated by Amazon Q Developer. User then clicks on Insert Code button.

Modernizing legacy Java code requires continuous refactoring to incrementally enhance quality and avoid accumulating technical debt over time. Amazon Q Developer simplifies this iterative process through its Refactor capability. It provides a refactored version of the selected code, along with an explanation of each change and its benefit, so you can understand what changed and why. You can read further about this capability at Explain and update code with Amazon Q Developer.

Let’s leverage this feature to refine methods in the UnicornController class in our Unicorn Store API project. Amazon Q Developer furnishes the updated code with better code readability or efficiency, among other improvements, for you to review.

In the IntelliJ IDE, user highlights a section of code (createUnicorn and updateUnicorn methods in the UnicornController class in this case), and right-clicks on the highlighted code. This opens a context window. User selects Amazon Q, and then selects Refactor code. The code is now displayed in the Amazon Q Chat panel. Amazon Q Developer generates a refactored version of the selected code. It also provides a list of all the changes made and the benefit of changing the code.

Make the application cloud native and deploy to AWS

The final step in the modernization journey is to make the application cloud-native and deploy to AWS. Cloud native is the software approach of building, deploying, and managing modern applications in cloud computing environments. These cloud native technologies support fast and frequent changes to applications without impacting service delivery, providing adopters with an innovative, competitive advantage. Let’s see how Amazon Q Developer can assist in making our Unicorn Store API project cloud native.

In IntelliJ IDE, open the Amazon Q Chat, and prompt Amazon Q Developer to provide a recommended approach to make the project cloud native and deploy to AWS.

In the IntelliJ IDE, user sends the content of the pom.xml file to Amazon Q Chat window, and asks Amazon Q Developer “What is the recommended way to make this micro service, cloud native and deploy to AWS?” Amazon Q generates the steps and a high-level implementation guide. In this case, it includes suggestions to containerize the application, use Amazon Elastic Container Registry, deploy to Amazon Elastic Container Service (ECS), implement a load balancer, leverage AWS Fargate, implement monitoring and logging, implement CI/CD, leverage other AWS managed services, implement security best practices, monitor and optimize the deployment.

Amazon Q Developer analyzes the code and details the steps involved in making this application cloud native. The steps are to containerize the application, push the container image to Amazon Elastic Container Registry (Amazon ECR), and run the containers in a serverless manner with Amazon Elastic Container Service (Amazon ECS) on Fargate. The application is accessed through an Application Load Balancer (ALB) and monitored with Amazon CloudWatch, with supporting resources such as an Amazon Virtual Private Cloud (Amazon VPC) and subnets.

Let’s ask Amazon Q Developer to implement the steps outlined in the previous chat conversation. First, ask Amazon Q Developer to create a Dockerfile to containerize the application. The containerization process streamlines application development by decoupling the software from the underlying hardware and other dependencies. This approach enhances speed, efficiency, and security by isolating different components within the containerized environment.

In the IntelliJ IDE, user continues the conversation in the previous chat window. User asks Amazon Q Developer to provide an implementation for step 1 (containerize the application) and create a Dockerfile for the application. Amazon Q generates a Dockerfile that includes all the necessary steps. User creates a new file named Dockerfile, and copies the contents from the chat window to the new file. Next, user opens the terminal within the IntelliJ IDE. User copies the build and run commands from the chat and pastes them in the terminal window. The application successfully builds in the Docker container. Finally, the application runs successfully as a Docker container.

Having successfully developed a container-based application, let’s leverage Amazon Q Developer’s capabilities to generate an AWS CloudFormation template. This template will enable us to deploy the required resources to AWS using Infrastructure as Code (IaC). IaC allows us to programmatically provision and manage our computing infrastructure, eliminating the need for manual processes and configurations. Manual infrastructure management can be time-consuming and error-prone, especially when dealing with large-scale applications.

To facilitate the creation of the CloudFormation template, let’s revisit the suggestions from our previous conversation and compile a list of the resources that need to be provisioned in AWS. Once you have this list, you can ask Amazon Q Developer to generate the CloudFormation template based on these resource requirements.

In the IntelliJ IDE, user copies the steps outlined in the previous conversation, and asks Amazon Q Developer to create a CloudFormation template to deploy this application to AWS. Amazon Q Developer creates the full CloudFormation template to deploy all resources. User creates a new YAML file, and copies the contents from the chat window to the new file. The generated CloudFormation contains the steps to deploy VPCs, subnets, ECR, ECS cluster, Tasks, LoadBalancer, CloudWatch logs, etc. User reviews the generated code for accuracy before deployment.

Amazon Q Developer can generate the CloudFormation template with all the required resources as outlined in the steps to deploy the container in AWS in a secure, reliable, and scalable manner.

Now that we have the CloudFormation template, let’s deploy it, then push the local Docker image of our Unicorn Store API to Amazon ECR and start the Fargate tasks required to run the application in AWS.

In the IntelliJ IDE, user runs the commands to push the docker image of the application created using Amazon Q Developer.

In this way, you can use Amazon Q Developer to make your application cloud native: it designs the steps to deploy to the cloud, helps migrate your application to a container-based solution, and even writes infrastructure as code scripts to deploy your application to AWS.

Conclusion

Amazon Q Developer empowers developers to simplify and accelerate the modernization of legacy Java applications. By leveraging Amazon Q Developer, developers can bring outdated applications up to current frameworks and deploy them to AWS in a cloud-native architecture. This streamlines the process, reducing the effort, risk, and maintenance required. Developers save significant time and resources, which can now be used to focus on building new features and enhancing modernized applications rather than managing technical debt.

To learn more about Amazon Q Developer, see the following resources:

Chetan Makvana

Chetan Makvana is a Senior Solutions Architect with Amazon Web Services. He works with AWS partners and customers to provide them with architectural guidance for building scalable architecture and implementing strategies to drive adoption of AWS services. He is a technology enthusiast and a builder with a core area of interest on generative AI, serverless, and DevOps. Outside of work, he enjoys watching shows, traveling, and music.

Venugopalan Vasudevan

Venugopalan Vasudevan is a Senior Specialist Solutions Architect at Amazon Web Services (AWS), where he specializes in AWS Generative AI services. His expertise lies in helping customers leverage cutting-edge services like Amazon Q and Amazon Bedrock to streamline development processes, accelerate innovation, and drive digital transformation. Venugopalan is dedicated to facilitating the Next Generation Developer experience, enabling developers to work more efficiently and creatively through the integration of Generative AI into their workflows.

Surabhi Tandon

Surabhi Tandon is a Senior Technical Account Manager at Amazon Web Services (AWS). She helps enterprise customers achieve operational excellence and supports their cloud journey on AWS by providing strategic technical guidance. Surabhi is a builder with an interest in Generative AI, automation, and DevOps. Outside of work, she enjoys hiking, reading, and spending time with family and friends.

Amazon MWAA best practices for managing Python dependencies

Post Syndicated from Mike Ellis original https://aws.amazon.com/blogs/big-data/amazon-mwaa-best-practices-for-managing-python-dependencies/

Customers with data engineers and data scientists are using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as a central orchestration platform for running data pipelines and machine learning (ML) workloads. To support these pipelines, they often require additional Python packages, such as Apache Airflow Providers. For example, a pipeline may require the Snowflake provider package for interacting with a Snowflake warehouse, or the Kubernetes provider package for provisioning Kubernetes workloads. As a result, they need to manage these Python dependencies efficiently and reliably, ensuring compatibility with each other and the base Apache Airflow installation.

Python includes the tool pip to handle package installations. To install a package, you add the name to a special file named requirements.txt. The pip install command instructs pip to read the contents of your requirements file, determine dependencies, and install the packages. Amazon MWAA runs the pip install command using this requirements.txt file during initial environment startup and subsequent updates. For more information, see How it works.

Creating a reproducible and stable requirements file is key for reducing pip installation and DAG errors. A defined set of requirements also provides consistency across nodes in an Amazon MWAA environment. This is most important during worker auto scaling, where additional worker nodes are provisioned and differing dependencies could lead to inconsistencies and task failures. Finally, this strategy promotes consistency across different Amazon MWAA environments, such as dev, qa, and prod.

This post describes best practices for managing your requirements file in your Amazon MWAA environment. It defines the steps needed to determine your required packages and package versions, create and verify your requirements.txt file with package versions, and package your dependencies.

Best practices

The following sections describe the best practices for managing Python dependencies.

Specify package versions in the requirements.txt file

When creating a Python requirements.txt file, you can specify just the package name, or the package name and a specific version. Adding a package without version information instructs the pip installer to download and install the latest available version, subject to compatibility with other installed packages and any constraints. The package versions selected during environment creation may be different than the version selected during an auto scaling event later on. This version change can create package conflicts leading to pip install errors. Even if the updated package installs properly, code changes in the package can affect task behavior, leading to inconsistencies in output. To avoid these risks, it’s best practice to add the version number to each package in your requirements.txt file.
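As an illustration of this practice, the following is a minimal sketch of a check that flags requirement lines without an exact version pin. The function name find_unpinned and the regular expression are assumptions made for this example; they are not part of pip or Amazon MWAA, and the pattern covers only simple name==version entries.

```python
import re

# Hypothetical helper: matches "name", optional "[extras]", then an exact "==version" pin.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]*\])?\s*==\s*[^\s;]+")


def find_unpinned(requirements_text):
    """Return requirement lines that lack an exact '==' version pin."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


sample = """\
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake
apache-airflow-providers-dbt-cloud[http]==3.5.1
"""

print(find_unpinned(sample))  # prints ['apache-airflow-providers-snowflake']
```

A check like this could run in a CI/CD pipeline before the requirements file is uploaded, so that an unpinned package is caught before it reaches an environment.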

Use the constraints file for your Apache Airflow version

A constraints file contains the packages, with versions, verified to be compatible with your Apache Airflow version. This file adds an additional validation layer to prevent package conflicts. Because the constraints file plays such an important role in preventing conflicts, beginning with Apache Airflow v2.7.2 on Amazon MWAA, your requirements file must include a --constraint statement. If a --constraint statement is not supplied, Amazon MWAA will specify a compatible constraints file for you.

Constraint files are available for each Airflow version and Python version combination. The URLs have the following form:

https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt
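For example, the URL pattern above can be filled in with a small helper. This is only a sketch; the function name constraints_url is an assumption introduced for illustration.

```python
def constraints_url(airflow_version, python_version):
    """Build the constraints file URL for an Airflow and Python version pair."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


# Airflow 2.8.1 with Python 3.11, the combination used later in this post
print(constraints_url("2.8.1", "3.11"))
# prints https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt
```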

The official Apache Airflow constraints are guidelines, and if your workflows require newer versions of a provider package, you may need to modify your constraints file and include it in your DAG folder. When doing so, the best practices outlined in this post become even more important to guard against package conflicts.
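To see where a requirements file departs from the official constraints, a comparison along the following lines may help. It is a rough sketch, not how pip itself resolves constraints: it handles only exact name==version lines, and the helper names (normalize, parse_pins, deviations) as well as the 5.3.0 version in the sample are made up for illustration.

```python
import re


def normalize(name):
    """Normalize a project name (lowercase; runs of '-', '_', '.' become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(text):
    """Map normalized package name -> pinned version for 'name==version' lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-") or "==" not in line:
            continue  # skip blanks, pip options, and entries without an exact pin
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0].strip()  # drop extras such as [http]
        pins[normalize(name)] = version.split(";", 1)[0].strip()
    return pins


def deviations(requirements_text, constraints_text):
    """Return {name: (required, constrained)} for pins that differ."""
    required = parse_pins(requirements_text)
    constrained = parse_pins(constraints_text)
    return {
        name: (version, constrained[name])
        for name, version in required.items()
        if name in constrained and constrained[name] != version
    }


# Illustrative excerpt only, not the full constraints file
constraints = "apache-airflow-providers-snowflake==5.2.1\npyOpenSSL==23.3.0\n"
# 5.3.0 is a hypothetical newer version chosen for the example
requirements = "apache-airflow-providers-snowflake==5.3.0\npyOpenSSL==23.3.0\n"

print(deviations(requirements, constraints))
# prints {'apache-airflow-providers-snowflake': ('5.3.0', '5.2.1')}
```

Each reported package is one that would need a matching change in a modified constraints file.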

Create a .zip archive of all dependencies

Creating a .zip file containing the packages in your requirements file and specifying this as the package repository source makes sure the exact same wheel files are used during your initial environment setup and subsequent node configurations. The pip installer will use these local files for installation rather than connecting to the external PyPI repository.

Test the requirements.txt file and dependency .zip file

Testing your requirements file before release to production is key to avoiding installation and DAG errors. It’s best practice to test both locally, with the MWAA local runner, and in a dev or staging Amazon MWAA environment before deploying to production. You can use continuous integration and delivery (CI/CD) deployment strategies to perform the requirements and package installation testing, as described in Automating a DAG deployment with Amazon Managed Workflows for Apache Airflow.

Solution overview

This solution uses the MWAA local runner, an open source utility that replicates an Amazon MWAA environment locally. You use the local runner to build and validate your requirements file, and package the dependencies. In this example, you install the snowflake and dbt-cloud provider packages. You then use the MWAA local runner and a constraints file to determine the exact version of each package compatible with Apache Airflow. With this information, you then update the requirements file, pinning each package to a version, and retest the installation. When you have a successful installation, you package your dependencies and test in a non-production Amazon MWAA environment.

We use MWAA local runner v2.8.1 for this walkthrough and walk through the following steps:

  1. Download and build the MWAA local runner.
  2. Create and test a requirements file with package versions.
  3. Package dependencies.
  4. Deploy the requirements file and dependencies to a non-production Amazon MWAA environment.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Set up the MWAA local runner

First, you download the MWAA local runner version matching your target MWAA environment, then you build the image.

Complete the following steps to configure the local runner:

  1. Clone the MWAA local runner repository with the following command:
    git clone git@github.com:aws/aws-mwaa-local-runner.git -b v2.8.1

  2. With Docker running, build the container with the following command:
    cd aws-mwaa-local-runner
    ./mwaa-local-env build-image

Create and test a requirements file with package versions

Building a versioned requirements file makes sure all Amazon MWAA components have the same package versions installed. To determine the compatible versions for each package, you start with a constraints file and an un-versioned requirements file, allowing pip to resolve the dependencies. Then you create your versioned requirements file from pip’s installation output.

The following diagram illustrates this workflow.

Requirements file testing process

To build an initial requirements file, complete the following steps:

  1. In your MWAA local runner directory, open requirements/requirements.txt in your preferred editor.

The default requirements file will look similar to the following:

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-mysql==5.5.1
  2. Replace the existing packages with the following package list:
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake
apache-airflow-providers-dbt-cloud[http]
  3. Save requirements.txt.
  4. In a terminal, run the following command to generate the pip install output:
./mwaa-local-env test-requirements

test-requirements runs pip install, which handles resolving the compatible package versions. Using a constraints file makes sure the selected packages are compatible with your Airflow version. The output will look similar to the following:

Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 sortedcontainers-2.4.0

The message beginning with Successfully installed is the output of interest. This shows which dependencies, and their specific version, pip installed. You use this list to create your final versioned requirements file.

Your output will also contain Requirement already satisfied messages for packages already available in the base Amazon MWAA environment. You do not add these packages to your requirements.txt file.

  5. Update the requirements file with the list of versioned packages from the test-requirements command. The updated file will look similar to the following code:
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

Next, you test the updated requirements file to confirm no conflicts exist.

  6. Rerun the test-requirements command:
./mwaa-local-env test-requirements

A successful test will not produce any errors. If you encounter dependency conflicts, return to the previous step and update the requirements file with additional packages, or package versions, based on pip’s output.
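The step of turning the Successfully installed message into pinned entries can also be scripted. The following sketch assumes each item in that message has the form name-version, with the version after the last hyphen; the function name pins_from_pip_output is hypothetical. Extras such as [http] do not appear in pip’s output, so they have to be added back by hand.

```python
def pins_from_pip_output(output):
    """Turn pip's 'Successfully installed ...' line into 'name==version' entries."""
    prefix = "Successfully installed "
    pins = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # ignore 'Requirement already satisfied' and other messages
        for item in line[len(prefix):].split():
            name, version = item.rsplit("-", 1)  # version follows the last hyphen
            pins.append(f"{name}=={version}")
    return pins


pip_output = (
    "Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 "
    "apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 "
    "snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 "
    "sortedcontainers-2.4.0"
)

for pin in pins_from_pip_output(pip_output):
    print(pin)
```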

Package dependencies

If your Amazon MWAA environment has a private webserver, you must package your dependencies into a .zip file, upload the file to your S3 bucket, and specify the package location in your Amazon MWAA instance configuration. Because a private webserver can’t access the PyPI repository through the internet, pip will install the dependencies from the .zip file.

If you’re using a public webserver configuration, you also benefit from a static .zip file, which makes sure the package information remains unchanged until it is explicitly rebuilt.

This process uses the versioned requirements file created in the previous section and the package-requirements feature in the MWAA local runner.

To package your dependencies, complete the following steps:

  1. In a terminal, navigate to the directory where you installed the local runner.
  2. Download the constraints file for your Python version and your version of Apache Airflow and place it in the plugins directory. For this post, we use Python 3.11 and Apache Airflow v2.8.1:
curl -o plugins/constraints-2.8.1-3.11.txt https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt
  3. In your requirements file, update the constraints URL to the local downloaded file.

The --constraint statement instructs pip to compare the package versions in your requirements.txt file to the allowed version in the constraints file. Downloading a specific constraints file to your plugins directory enables you to control the constraint file location and contents.

The updated requirements file will look like the following code:

--constraint "/usr/local/airflow/plugins/constraints-2.8.1-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0
  4. Run the following command to create the .zip file:
./mwaa-local-env package-requirements

package-requirements creates an updated requirements file named packaged_requirements.txt and zips all dependencies into plugins.zip. The updated requirements file looks like the following code:

--find-links /usr/local/airflow/plugins
--no-index
--constraint "/usr/local/airflow/plugins/constraints-2.8.1-3.11.txt"
apache-airflow-providers-snowflake==5.2.1
apache-airflow-providers-dbt-cloud[http]==3.5.1
pyOpenSSL==23.3.0
snowflake-connector-python==3.6.0
snowflake-sqlalchemy==1.5.1
sortedcontainers==2.4.0

Note the reference to the local constraints file and the plugins directory. The --find-links statement instructs pip to install packages from /usr/local/airflow/plugins rather than the public PyPI repository.
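Because --no-index prevents pip from falling back to PyPI, every pinned package needs a matching file in the archive. The sketch below checks a .zip file for a wheel per pinned requirement. It assumes the dependencies are stored as wheel files named according to the standard wheel convention (distribution-version-...whl) and that versions are written identically in both places; the helper names are hypothetical, and source distributions are not covered.

```python
import io
import re
import zipfile


def normalize(name):
    """Normalize a project name (lowercase; runs of '-', '_', '.' become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_wheels(zip_file, pinned_requirements):
    """Return the pinned requirements that have no matching wheel in the archive."""
    available = set()
    with zipfile.ZipFile(zip_file) as archive:
        for entry in archive.namelist():
            filename = entry.rsplit("/", 1)[-1]
            if filename.endswith(".whl"):
                parts = filename[:-4].split("-")  # distribution, version, tags...
                if len(parts) >= 2:
                    available.add((normalize(parts[0]), parts[1]))
    missing = []
    for requirement in pinned_requirements:
        name, version = requirement.split("==", 1)
        name = name.split("[", 1)[0]  # drop extras such as [http]
        if (normalize(name), version) not in available:
            missing.append(requirement)
    return missing


# Build a small in-memory archive to demonstrate the check
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("apache_airflow_providers_snowflake-5.2.1-py3-none-any.whl", b"")
buffer.seek(0)

print(missing_wheels(buffer, [
    "apache-airflow-providers-snowflake==5.2.1",
    "sortedcontainers==2.4.0",
]))  # prints ['sortedcontainers==2.4.0']
```

In practice, the first argument would be the path to the plugins.zip file produced by package-requirements.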

Deploy the requirements file

After you achieve an error-free requirements installation and package your dependencies, you’re ready to deploy the assets to a non-production Amazon MWAA environment. Even when verifying and testing requirements with the MWAA local runner, it’s best practice to deploy and test the changes in a non-prod Amazon MWAA environment before deploying to production. For more information about creating a CI/CD pipeline to test changes, refer to Deploying to Amazon Managed Workflows for Apache Airflow.

To deploy your changes, complete the following steps:

  1. Upload your requirements.txt file and plugins.zip file to your Amazon MWAA environment’s S3 bucket.

For instructions on specifying a requirements.txt version, refer to Specifying the requirements.txt version on the Amazon MWAA console. For instructions on specifying a plugins.zip file, refer to Installing custom plugins on your environment.

The Amazon MWAA environment will update and install the packages in your plugins.zip file.
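The upload and environment update can also be scripted, for example from a CI/CD job. The following is a minimal sketch using the AWS SDK for Python (Boto3) client calls put_object and update_environment. It assumes versioning is enabled on the bucket (so put_object returns a VersionId) and that the files sit at the bucket root under the keys requirements.txt and plugins.zip; the function names, bucket name, environment name, and local paths are all placeholders, not values from this post.

```python
def upload(s3, bucket, key, local_path):
    """Upload a local file and return the resulting S3 object version."""
    with open(local_path, "rb") as handle:
        response = s3.put_object(Bucket=bucket, Key=key, Body=handle)
    return response["VersionId"]  # present only when bucket versioning is enabled


def deploy(s3, mwaa, environment_name, bucket, requirements_file, plugins_file):
    """Upload both files, then point the environment at the new object versions."""
    requirements_version = upload(s3, bucket, "requirements.txt", requirements_file)
    plugins_version = upload(s3, bucket, "plugins.zip", plugins_file)
    mwaa.update_environment(
        Name=environment_name,
        RequirementsS3Path="requirements.txt",
        RequirementsS3ObjectVersion=requirements_version,
        PluginsS3Path="plugins.zip",
        PluginsS3ObjectVersion=plugins_version,
    )
    return requirements_version, plugins_version


# Example call with placeholder names (requires boto3 and AWS credentials):
# import boto3
# deploy(boto3.client("s3"), boto3.client("mwaa"),
#        "my-nonprod-mwaa-env", "my-mwaa-bucket",
#        "requirements.txt", "plugins.zip")
```

Passing the clients in as arguments keeps the functions easy to exercise with stand-in objects before pointing them at a real environment.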

After the update is complete, verify the provider package installation in the Apache Airflow UI.

  1. Access the Apache Airflow UI in Amazon MWAA.
  2. From the Apache Airflow menu bar, choose Admin, then Providers.

The list of providers, and their versions, is shown in a table. In this example, the page reflects the installation of apache-airflow-providers-dbt-cloud version 3.5.1 and apache-airflow-providers-snowflake version 5.2.1. This list only contains the provider packages installed, not all supporting Python packages. Provider packages that are part of the base Apache Airflow installation will also appear in the list. The following image is an example of the package list; note the apache-airflow-providers-dbt-cloud and apache-airflow-providers-snowflake packages and their versions.

Airflow UI with installed packages

To verify all package installations, view the results in Amazon CloudWatch Logs. Amazon MWAA creates a log stream for the requirements installation and the stream contains the pip install output. For instructions, refer to Viewing logs for your requirements.txt.

A successful installation results in the following message:

Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 sortedcontainers-2.4.0

If you encounter any installation errors, you should determine the package conflict, update the requirements file, run the local runner test, re-package the plugins, and deploy the updated files.
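If you export the pip output from CloudWatch Logs, you can check it against the versions you expect instead of reading it by eye. The following is a minimal sketch that parses the success line shown earlier; the function names and the expected dictionary are illustrative assumptions, not part of Amazon MWAA or the local runner.

```python
def parse_installed(line: str) -> dict:
    """Return {package: version} parsed from a pip 'Successfully installed' line."""
    prefix = "Successfully installed "
    if not line.startswith(prefix):
        return {}
    packages = {}
    for token in line[len(prefix):].split():
        # pip prints each entry as name-version; split on the last hyphen
        name, _, version = token.rpartition("-")
        packages[name] = version
    return packages


def find_mismatches(installed: dict, expected: dict) -> dict:
    """Return the expected pins that are missing or installed at another version."""
    return {name: version for name, version in expected.items()
            if installed.get(name) != version}


log_line = (
    "Successfully installed apache-airflow-providers-dbt-cloud-3.5.1 "
    "apache-airflow-providers-snowflake-5.2.1 pyOpenSSL-23.3.0 "
    "snowflake-connector-python-3.6.0 snowflake-sqlalchemy-1.5.1 "
    "sortedcontainers-2.4.0"
)
# Hypothetical pins you expect from your requirements file
expected = {
    "apache-airflow-providers-dbt-cloud": "3.5.1",
    "apache-airflow-providers-snowflake": "5.2.1",
}
mismatches = find_mismatches(parse_installed(log_line), expected)
print(mismatches or "all expected provider versions installed")
```

An empty result means every expected package appears in the log line at the pinned version.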

Clean up

If you created an Amazon MWAA environment specifically for this post, delete the environment and S3 objects to avoid incurring additional charges.

Conclusion

In this post, we discussed several best practices for managing Python dependencies in Amazon MWAA and how to use the MWAA local runner to implement these practices. These best practices reduce DAG and pip installation errors in your Amazon MWAA environment. For additional details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


About the Author


Mike Ellis is a Technical Account Manager at AWS and an Amazon MWAA specialist. In addition to assisting customers with Amazon MWAA, he contributes to the Apache Airflow open source project.

Announcing updates to the AWS Well-Architected Framework guidance

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance-2/

We are excited to announce the availability of an enhanced AWS Well-Architected Framework. In this update, you’ll find expanded guidance across all six pillars of the Framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

In this release, we updated the implementation guidance for the new and existing best practices to be more prescriptive. This includes enhanced recommendations and steps on reusable architecture patterns focused on specific business outcomes.

A brief history

The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

Figure 1. 2024 AWS Well-Architected guidance timeline

In 2012, we published the first version of the Framework. In 2015, we released the AWS Well-Architected Framework whitepaper. We added the Operational Excellence pillar in 2016. We released pillar-specific whitepapers and AWS Well-Architected Lenses in 2017. The following year, the AWS Well-Architected Tool was launched.

In 2020, we released the new version of the Well-Architected Framework guidance, more lenses, and an API integration with the AWS Well-Architected Tool. We added the sixth pillar, Sustainability, in 2021. In 2022, dedicated HTML pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance. By December 2023, we improved more than 75% of the Framework’s best practices. As of June 2024, more than 95% of the Framework’s best practices have been refreshed at least once.

What’s new

The Well-Architected Framework supports customers as they mature in their cloud journey by providing guidance to help achieve accurate business, environment, and workload solutions. Well-Architected is committed to providing such information to customers by continually evolving and updating our guidance.

The content updates and prescriptive guidance improvements in this release provide more complete coverage across AWS, helping customers make informed decisions when developing implementation plans. We added or expanded on guidance for the following services in this update: Amazon Chime, Amazon CloudWatch, Amazon CodeGuru Security, Amazon CodeWhisperer, Amazon DevOps Guru, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon ElastiCache, Amazon EventBridge, Amazon GuardDuty, Amazon Q, Amazon Route 53, Amazon Security Lake, Amazon Simple Notification Service (Amazon SNS), AWS Billing and Cost Management, AWS Budgets, AWS Compute Optimizer, AWS Config, AWS Control Tower, AWS Cost Optimization Hub, AWS Customer Carbon Footprint Tool, AWS Data Exports, AWS Data Lifecycle Manager, AWS Elastic Disaster Recovery, AWS Fault Injection Service, AWS Global Accelerator, AWS Health, AWS Local Zones, AWS Organizations, AWS Outposts, AWS Resilience Hub, AWS Security Hub, AWS Systems Manager, AWS Trusted Advisor, and AWS Wickr.

Pillar updates

Operational Excellence

In the Operational Excellence Pillar, we updated 30 best practices across six questions. This includes OPS01, OPS02, OPS03, OPS07, OPS10, and OPS11. This update includes a refreshed structure and improved prescriptive guidance with updates on observability, generative AI capabilities, operating models, and the evolution of operational practices.

As part of this update, we consolidated four best practices into two (OPS01-BP07 merged into OPS01-BP06, OPS03-BP08 merged into OPS03-BP04) and changed the titles of seven best practices. Additionally, we added one new design principle to highlight the importance of aligning operating models to business outcomes and reordered design principles according to their priority from foundational to specialized. We updated three design principles and changed the title of one design principle. We’ve also updated the operating model guidance section of the pillar to be more prescriptive, showcasing pathways to evolving operating models.

The implementation guidance in best practices includes guidance on implementing generative AI capabilities with Amazon Q (Q Developer, Q Business, Q in QuickSight), the latest capabilities from Amazon CloudWatch Network Monitor, Amazon CloudWatch Internet Monitor, Amazon CloudWatch Logs, Amazon CloudWatch best practice alarms, cross-account observability, log-based alarms, log data protection, and AWS Health.

Security

In the Security Pillar, we updated 28 best practices across 10 questions. This includes SEC01, SEC02, SEC03, SEC04, SEC05, SEC06, SEC07, SEC08, SEC09, and SEC10. Best practice updates include removing duplication, clarifying desired outcomes, and providing robust prescriptive implementation guidance. As part of this update, we merged SEC01-BP05 into SEC01-BP04. We deleted two practices, SEC08-BP05 and SEC09-BP03, to remove the duplication of guidance covered across other existing practices. We updated the titles for 14 practices and changed the order of nine practices to improve clarity and flow.

Reliability

In the Reliability Pillar, we updated 11 best practices across six questions. This includes REL02, REL04, REL05, REL06, REL07, and REL08, with three best practices changing titles: REL04-BP01, REL05-BP06, and REL06-BP05. We improved resources available in best practices to include more recent blog posts, technical talks, and presentations. We also improved the prescriptive guidance by expanding on implementation steps. We added guidance on new services and service features for AWS Resilience Hub, Amazon Route 53, Amazon Route 53 Application Recovery Controller, AWS Fault Injection Service, and Amazon CloudWatch Synthetics.

Performance Efficiency

In the Performance Efficiency Pillar, we updated nine best practices across three questions. This includes PERF01, PERF03, and PERF05. We improved the prescriptive guidance on these best practices and added pillar-specific guidance on services including Amazon DevOps Guru and Amazon ElastiCache Serverless. We’ve updated the resources section of all best practices with new and relevant resources.

Cost Optimization

In the Cost Optimization Pillar, we updated eight best practices across five questions. This includes COST01, COST02, COST03, COST05, and COST11. One new best practice added in COST06 highlights the benefits of using shared resources for organizational cost optimization. The improved best practices include guidance on AWS services and features including the AWS Cost Optimization Hub, AWS Billing and Cost Management features, and AWS Data Exports. These updates also cover sample key performance indicators (KPIs) for tracking optimization efforts, elaborate on the use of cost allocation tags, and discuss the split cost allocation for Amazon EKS and Amazon ECS to separate costs of containerized workloads. Additionally, the updates offer improved prescriptive and clear guidance on budgeting and forecasting. Finally, you’ll find guidance on using automations to reduce costs.

Sustainability

In the Sustainability Pillar, we updated 18 best practices across six questions. This includes SUS01, SUS02, SUS03, SUS04, SUS05, and SUS06. We improved the prescriptive guidance on these best practices and added pillar-specific guidance on services, including AWS Local Zones, AWS Outposts, Amazon Chime, AWS Wickr, Amazon CodeWhisperer, and AWS Customer Carbon Footprint Tool. We’ve expanded lists of resources across all best practices with new and relevant resources.

Conclusion

This release includes updates and improvements to the Framework guidance totaling 105 best practices. As of this release, we’ve updated 95% of the existing Framework best practices at least once since October 2022. With this release, we have refreshed 100% of the Operational Excellence, Security, Performance Efficiency, Cost Optimization, and Sustainability Pillars, as well as 79% of Reliability Pillar best practices. Best practice updates in this release across Operational Excellence, Security, and Reliability (a total of 66) are first-time updates since major Framework improvements started in 2022.

The content is available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Updates in this release are also available in the AWS Well-Architected Tool, which you can use to review your workloads, address important design considerations, and help you follow the AWS Well-Architected Framework guidance.

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices and pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.

Implement disaster recovery with Amazon Redshift

Post Syndicated from Nita Shah original https://aws.amazon.com/blogs/big-data/implement-disaster-recovery-with-amazon-redshift/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

The objective of a disaster recovery plan is to reduce disruption by enabling quick recovery in the event of a disaster that leads to system failure. Disaster recovery plans also allow organizations to make sure they meet all compliance requirements for regulatory purposes, providing a clear roadmap to recovery.

This post outlines proactive steps you can take to mitigate the risks associated with unexpected disruptions and make sure your organization is better prepared to respond and recover Amazon Redshift in the event of a disaster. With built-in features such as automated snapshots and cross-Region replication, you can enhance your disaster resilience with Amazon Redshift.

Disaster recovery planning

Any kind of disaster recovery planning has two key components:

  • Recovery Point Objective (RPO) – RPO is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
  • Recovery Time Objective (RTO) – RTO is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.
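To make the two definitions concrete, the following minimal sketch evaluates a single incident against an RPO and an RTO. The timestamps and targets are made-up values for illustration, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def check_objectives(last_recovery_point, outage_start, service_restored, rpo, rto):
    """Compare one incident against the RPO and RTO targets."""
    # RPO looks backward: how much data written before the outage is lost
    data_loss_window = outage_start - last_recovery_point
    # RTO looks forward: how long the service stayed unavailable
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: last recovery point 30 minutes before the outage,
# service restored 45 minutes after it started
result = check_objectives(
    last_recovery_point=datetime(2024, 6, 1, 9, 30),
    outage_start=datetime(2024, 6, 1, 10, 0),
    service_restored=datetime(2024, 6, 1, 10, 45),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=30),
)
print(result["rpo_met"], result["rto_met"])
```

In this example the 30-minute data loss window is within the 1-hour RPO, but the 45-minute downtime exceeds the 30-minute RTO.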

To develop your disaster recovery plan, you should complete the following tasks:

  • Define your recovery objectives for downtime and data loss (RTO and RPO) for data and metadata. Make sure your business stakeholders are engaged in deciding appropriate goals.
  • Identify recovery strategies to meet the recovery objectives.
  • Define a fallback plan to return production to the original setup.
  • Test out the disaster recovery plan by simulating a failover event in a non-production environment.
  • Develop a communication plan to notify stakeholders of downtime and its impact to the business.
  • Develop a communication plan for progress updates, and recovery and availability.
  • Document the entire disaster recovery process.

Disaster recovery strategies

Amazon Redshift is a cloud-based data warehouse that supports many recovery capabilities out of the box to address unforeseen outages and minimize downtime.

Amazon Redshift RA3 instance types and Redshift Serverless store their data in Redshift Managed Storage (RMS). RMS is backed by Amazon Simple Storage Service (Amazon S3), which is highly available and durable by default.

In the following sections, we discuss the various failure modes and associated recovery strategies.

Using backups

Backing up data is an important part of data management. Backups protect against human error, hardware failure, virus attacks, power outages, and natural disasters.

Amazon Redshift supports two kinds of snapshots: automatic and manual, which can be used to recover data. Snapshots are point-in-time backups of the Redshift data warehouse. Amazon Redshift stores these snapshots internally with RMS by using an encrypted Secure Sockets Layer (SSL) connection.

Redshift provisioned clusters offer automated snapshots with a default retention of 1 day, which can be extended up to 35 days. These snapshots are taken after every 5 GB of data change per node or every 8 hours, whichever comes first, and the minimum interval between two snapshots is 15 minutes. The 5 GB threshold is per node, so the total data change across the cluster must exceed 5 GB times the number of nodes to trigger a snapshot. You can also set a custom snapshot schedule with frequencies between 1–24 hours. You can use the AWS Management Console or the ModifyCluster API to manage how long your automated backups are retained by modifying the retention period (the AutomatedSnapshotRetentionPeriod parameter in the API). If you want to turn off automated backups altogether, you can set the retention period to 0 (not recommended). For additional details, refer to Automated snapshots.
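As an illustration of changing the retention period programmatically, the following sketch builds the request for the ModifyCluster API as exposed by the AWS SDK for Python (Boto3), where the automated retention is passed as AutomatedSnapshotRetentionPeriod. The cluster identifier, Region, and helper names are assumptions for the example, and the 0–35 day check simply mirrors the limits described above.

```python
def retention_request(cluster_id: str, days: int) -> dict:
    """Build the ModifyCluster parameters for an automated snapshot retention change."""
    # 0 turns automated snapshots off (not recommended); 35 is the maximum
    if not 0 <= days <= 35:
        raise ValueError("automated snapshot retention must be between 0 and 35 days")
    return {
        "ClusterIdentifier": cluster_id,
        "AutomatedSnapshotRetentionPeriod": days,
    }


def apply_retention(cluster_id: str, days: int, region: str) -> dict:
    """Send the change to Amazon Redshift (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("redshift", region_name=region)
    return client.modify_cluster(**retention_request(cluster_id, days))


# Hypothetical cluster name; prints the request without calling AWS
print(retention_request("my-redshift-cluster", 7))
```

Keeping the request builder separate from the API call lets you validate and review the change before it is applied.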

Amazon Redshift Serverless automatically creates recovery points approximately every 30 minutes. These recovery points have a default retention of 24 hours, after which they get automatically deleted. You do have the option to convert a recovery point into a snapshot if you want to retain it longer than 24 hours.

Both Amazon Redshift provisioned and serverless clusters offer manual snapshots that can be taken on-demand and be retained indefinitely. Manual snapshots allow you to retain your snapshots longer than automated snapshots to meet your compliance needs. Manual snapshots accrue storage charges, so it’s important that you delete them when you no longer need them. For additional details, refer to Manual snapshots.

Amazon Redshift integrates with AWS Backup to help you centralize and automate data protection across all your AWS services, in the cloud, and on premises. With AWS Backup for Amazon Redshift, you can configure data protection policies and monitor activity for different Redshift provisioned clusters in one place. You can create and store manual snapshots for Redshift provisioned clusters. This lets you automate and consolidate backup tasks that you had to do separately before, without any manual processes. To learn more about setting up AWS Backup for Amazon Redshift, refer to Amazon Redshift backups. As of this writing, AWS Backup does not integrate with Redshift Serverless.

Node failure

A Redshift data warehouse is a collection of computing resources called nodes.
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from Amazon S3 first to allow you to resume querying your data as quickly as possible.

If this is a single-node cluster (which is not recommended for customer production use), there is only one copy of the data in the cluster. When it’s down, AWS needs to restore the cluster from the most recent snapshot on Amazon S3, and that becomes your RPO.

We recommend using at least two nodes for production.

Cluster failure

Each cluster has a leader node and one or more compute nodes. In the event of a cluster failure, you must restore the cluster from a snapshot. Snapshots are point-in-time backups of a cluster. A snapshot contains data from all databases that are running on your cluster. It also contains information about your cluster, including the number of nodes, node type, and admin user name. If you restore your cluster from a snapshot, Amazon Redshift uses the cluster information to create a new cluster. Then it restores all the databases from the snapshot data. Note that the new cluster is available before all of the data is loaded, so you can begin querying the new cluster in minutes. The cluster is restored in the same AWS Region and a random, system-chosen Availability Zone, unless you specify another Availability Zone in your request.
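The restore described here can be sketched with the Boto3 restore_from_cluster_snapshot call. The identifiers, Region, and helper names below are placeholders for illustration; the Availability Zone is only sent when you want to override the system-chosen one.

```python
def restore_request(new_cluster_id: str, snapshot_id: str, availability_zone: str = None) -> dict:
    """Build the RestoreFromClusterSnapshot parameters."""
    params = {
        "ClusterIdentifier": new_cluster_id,
        "SnapshotIdentifier": snapshot_id,
    }
    # Without this key, Amazon Redshift picks the Availability Zone itself
    if availability_zone:
        params["AvailabilityZone"] = availability_zone
    return params


def restore_cluster(region: str, new_cluster_id: str, snapshot_id: str,
                    availability_zone: str = None) -> dict:
    """Start the restore (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("redshift", region_name=region)
    return client.restore_from_cluster_snapshot(
        **restore_request(new_cluster_id, snapshot_id, availability_zone)
    )


# Hypothetical identifiers; prints the request without calling AWS
print(restore_request("restored-cluster", "my-snapshot-id", "us-east-2b"))
```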

Availability Zone failure

A Region is a physical location around the world where data centers are located. An Availability Zone is one or more discrete data centers with redundant power, networking, and connectivity in a Region. Availability Zones enable you to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. All Availability Zones in a Region are interconnected over fully redundant, dedicated metro fiber that provides high-bandwidth, high-throughput, low-latency networking between them.

To recover from Availability Zone failures, you can use one of the following approaches:

  • Relocation capabilities (active-passive) – If your Redshift data warehouse is a single-AZ deployment and the cluster’s Availability Zone becomes unavailable, then Amazon Redshift will automatically move your cluster to another Availability Zone without any data loss or application changes. To activate this, you must enable cluster relocation for your provisioned cluster through configuration settings, which is automatically enabled for Redshift Serverless. Cluster relocation is free of cost, but it is a best-effort approach subject to resource availability in the Availability Zone being recovered in, and RTO can be impacted by other issues related to starting up a new cluster. This can result in recovery times between 10–60 minutes. To learn more about configuring Amazon Redshift relocation capabilities, refer to Build a resilient Amazon Redshift architecture with automatic recovery enabled.
  • Amazon Redshift Multi-AZ (active-active) – A Multi-AZ deployment allows you to run your data warehouse in multiple Availability Zones simultaneously and continue operating in unforeseen failure scenarios. No application changes are required to maintain business continuity because the Multi-AZ deployment is managed as a single data warehouse with one endpoint. Multi-AZ deployments reduce recovery time by guaranteeing capacity to automatically recover and are intended for customers with mission-critical analytics applications that require the highest levels of availability and resiliency to Availability Zone failures. This also allows you to implement a solution that is more compliant with the recommendations of the Reliability Pillar of the AWS Well-Architected Framework. Our pre-launch tests found that the RTO with Amazon Redshift Multi-AZ deployments is 60 seconds or less in the unlikely case of an Availability Zone failure. To learn more about configuring Multi-AZ, refer to Enable Multi-AZ deployments for your Amazon Redshift data warehouse. As of this writing, Redshift Serverless does not support Multi-AZ.
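One way to choose between the two approaches is to compare your RTO with the recovery times quoted above. The following sketch uses the upper bounds from this post (60 minutes for relocation, 60 seconds for Multi-AZ) as assumed worst cases; the names are illustrative, and because relocation is best effort, its figure is not a guarantee.

```python
from datetime import timedelta

# Upper bounds quoted in this post; relocation is best effort, so treat
# its value as indicative rather than guaranteed
WORST_CASE_RECOVERY = {
    "cluster relocation": timedelta(minutes=60),
    "Multi-AZ": timedelta(seconds=60),
}


def options_within_rto(rto: timedelta) -> list:
    """Return the approaches whose quoted worst-case recovery fits the RTO."""
    return [name for name, worst in WORST_CASE_RECOVERY.items() if worst <= rto]


print(options_within_rto(timedelta(minutes=5)))
print(options_within_rto(timedelta(hours=2)))
```

With a 5-minute RTO only Multi-AZ qualifies, whereas a 2-hour RTO leaves both approaches available.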

Region failure

Amazon Redshift currently supports single-Region deployments for clusters. However, you have several options to help with disaster recovery or accessing data across multi-Region scenarios.

Use a cross-Region snapshot

You can configure Amazon Redshift to copy snapshots for a cluster to another Region. To configure cross-Region snapshot copy, you need to enable this copy feature for each data warehouse (serverless and provisioned) and configure where to copy snapshots and how long to keep copied automated or manual snapshots in the destination Region. When cross-Region copy is enabled for a data warehouse, all new manual and automated snapshots are copied to the specified Region. In the event of a Region failure, you can restore your Redshift data warehouse in a new Region using the latest cross-Region snapshot.
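For a provisioned cluster, cross-Region copy can be turned on with the Boto3 enable_snapshot_copy call, issued against the source Region. The sketch below only covers provisioned clusters (Redshift Serverless uses a separate API that is not shown), and the cluster name, Regions, and helper names are assumptions.

```python
def snapshot_copy_request(cluster_id: str, source_region: str,
                          destination_region: str, retention_days: int = 7) -> dict:
    """Build the EnableSnapshotCopy parameters for a provisioned cluster."""
    if source_region == destination_region:
        raise ValueError("destination Region must differ from the source Region")
    # Retention of copied automated snapshots in the destination Region
    if not 1 <= retention_days <= 35:
        raise ValueError("retention must be between 1 and 35 days")
    return {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": destination_region,
        "RetentionPeriod": retention_days,
    }


def enable_copy(cluster_id: str, source_region: str, destination_region: str,
                retention_days: int = 7) -> dict:
    """Enable the copy from the source Region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without the SDK

    client = boto3.client("redshift", region_name=source_region)
    return client.enable_snapshot_copy(
        **snapshot_copy_request(cluster_id, source_region, destination_region, retention_days)
    )


# Hypothetical cluster and Regions; prints the request without calling AWS
print(snapshot_copy_request("my-redshift-cluster", "us-east-2", "us-west-2"))
```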

The following diagram illustrates this architecture.

For more information about how to enable cross-Region snapshots, refer to the following:

Use a custom domain name

A custom domain name is easier to remember and use than the default endpoint URL provided by Amazon Redshift. With CNAME, you can quickly route traffic to a new cluster or workgroup created from a snapshot in a failover situation. When a disaster happens, connections can be rerouted centrally with minimal disruption, without clients having to change their configuration.

For high availability, you should have a warm-standby cluster or workgroup available that regularly receives restored data from the primary cluster. This backup data warehouse could be in another Availability Zone or in a separate Region. You can redirect clients to the secondary Redshift cluster by setting up a custom domain name in the unlikely scenario of an entire Region failure.

In the following sections, we discuss how to use a custom domain name to handle Region failure in Amazon Redshift. Make sure the following prerequisites are met:

  • You need a registered domain name. You can use Amazon Route 53 or a third-party domain registrar to register a domain.
  • You need to configure cross-Region snapshots for your Redshift cluster or workgroup.
  • Turn on cluster relocation for your Redshift cluster. Use the AWS Command Line Interface (AWS CLI) to turn on relocation for a Redshift provisioned cluster. For Redshift Serverless, this is automatically enabled. For more information, see Relocating your cluster.
  • Take note of your Redshift endpoint. You can locate the endpoint by navigating to your Redshift workgroup or provisioned cluster name on the Amazon Redshift console.

Set up a custom domain with Amazon Redshift in the primary Region

In the hosted zone that Route 53 created when you registered the domain, create records to tell Route 53 how you want to route traffic to your Redshift endpoint by completing the following steps:

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Choose your hosted zone.
  3. On the Records tab, choose Create record.
  4. For Record name, enter your preferred subdomain name.
  5. For Record type, choose CNAME.
  6. For Value, enter the Redshift endpoint name. Make sure to provide the value by removing the colon (:), port, and database. For example, redshift-provisioned.eabc123.us-east-2.redshift.amazonaws.com.
  7. Choose Create records.

  1. Use the CNAME record name to create a custom domain in Amazon Redshift. For instructions, see Use custom domain names with Amazon Redshift.

You can now connect to your cluster using the custom domain name. The JDBC URL will be similar to jdbc:redshift://prefix.rootdomain.com:5439/dev?sslmode=verify-full, where prefix.rootdomain.com is your custom domain name and dev is the default database. Use your preferred editor to connect to this URL using your user name and password.
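The endpoint trimming and the resulting JDBC URL can be sketched as follows. The helper names are hypothetical, and the sketch assumes the endpoint is copied from the console in host:port/database form.

```python
def cname_value(endpoint: str) -> str:
    """Drop the colon, port, and database from a Redshift endpoint."""
    return endpoint.split(":", 1)[0]


def jdbc_url(custom_domain: str, port: int = 5439, database: str = "dev") -> str:
    """Build the JDBC URL that clients use with the custom domain name."""
    return f"jdbc:redshift://{custom_domain}:{port}/{database}?sslmode=verify-full"


endpoint = "redshift-provisioned.eabc123.us-east-2.redshift.amazonaws.com:5439/dev"
print(cname_value(endpoint))
print(jdbc_url("prefix.rootdomain.com"))
```

The first value is what you enter in the CNAME record, and the second is the URL your SQL client connects to.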

Steps to handle a Regional failure

In the unlikely situation of a Regional failure, complete the following steps:

  1. Use a cross-Region snapshot to restore a Redshift cluster or workgroup in your secondary Region.
  2. Turn on cluster relocation for your Redshift cluster in the secondary Region. Use the AWS CLI to turn on relocation for a Redshift provisioned cluster.
  3. Use the CNAME record name from the Route 53 hosted zone setup to create a custom domain in the newly created Redshift cluster or workgroup.
  4. Take note of the endpoint of the newly created Redshift cluster or workgroup.

Next, you need to update the Redshift endpoint in Route 53 to achieve seamless connectivity.

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Choose your hosted zone.
  3. On the Records tab, select the CNAME record you created.
  4. Under Record details, choose Edit record.
  5. Change the value to the newly created Redshift endpoint. Make sure to provide the value by removing the colon (:), port, and database. For example, redshift-provisioned.eabc567.us-west-2.redshift.amazonaws.com.
  6. Choose Save.

Now when you connect to your custom domain name using the same JDBC URL from your application, you should be connected to your new cluster in your secondary Region.
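If you prefer to script the record update rather than use the console, the same change can be expressed as an UPSERT through the Route 53 ChangeResourceRecordSets API. The following is a sketch under assumptions: the hosted zone ID, record name, and 300-second TTL are placeholders, and the helper names are hypothetical.

```python
def cname_upsert(record_name: str, target_endpoint: str, ttl: int = 300) -> dict:
    """Build a Route 53 change batch that points the CNAME at a new endpoint."""
    return {
        "Comment": "Repoint Redshift custom domain to the secondary Region",
        "Changes": [
            {
                # UPSERT updates the record if it exists and creates it otherwise
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }
        ],
    }


def repoint_record(hosted_zone_id: str, record_name: str, target_endpoint: str) -> dict:
    """Apply the change (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the change batch can be built without the SDK

    client = boto3.client("route53")
    return client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=cname_upsert(record_name, target_endpoint),
    )


# Endpoint of the restored cluster, with the colon, port, and database removed
batch = cname_upsert(
    "prefix.rootdomain.com",
    "redshift-provisioned.eabc567.us-west-2.redshift.amazonaws.com",
)
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```

A shorter TTL lets clients pick up the new endpoint sooner, at the cost of more frequent DNS lookups.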

Use active-active configuration

For business-critical applications that require high availability, you can set up an active-active configuration at the Region level. There are many ways to make sure all writes occur to all clusters; one way is to keep the data in sync between the two clusters by ingesting data concurrently into the primary and secondary cluster. You can also use Amazon Kinesis to sync the data between two clusters. For more details, see Building Multi-AZ or Multi-Region Amazon Redshift Clusters.

Additional considerations

In this section, we discuss additional considerations for your disaster recovery strategy.

Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows you to run SQL queries against exabytes of data stored in Amazon S3. With Redshift Spectrum, you don’t have to load or extract the data from Amazon S3 into Amazon Redshift before querying.

If you’re using external tables with Redshift Spectrum, you need to make sure they are configured and accessible on your secondary failover cluster.

You can set this up with the following steps:

  1. Replicate existing S3 objects between the primary and secondary Region.
  2. Replicate data catalog objects between the primary and secondary Region.
  3. Set up AWS Identity and Access Management (IAM) policies for accessing the S3 bucket residing in the secondary Region.

Cross-Region data sharing

With Amazon Redshift data sharing, you can securely share read access to live data across Redshift clusters, workgroups, AWS accounts, and Regions without manually moving or copying the data.

If you’re using cross-Region data sharing and one of the Regions has an outage, you need to have a business continuity plan to fail over your producer and consumer clusters to minimize the disruption.

In the event of an outage affecting the Region where the producer cluster is deployed, you can take the following steps to create a new producer cluster in another Region using a cross-Region snapshot and by reconfiguring data sharing, allowing your system to continue operating:

  1. Create a new Redshift cluster using the cross-Region snapshot. Make sure you have correct node type, node count, and security settings.
  2. Identify the Redshift data shares that were previously configured for the original producer cluster.
  3. Recreate these data shares on the new producer cluster in the target Region.
  4. Update the data share configurations in the consumer cluster to point to the newly created producer cluster.
  5. Confirm that the necessary permissions and access controls are in place for the data shares in the consumer cluster.
  6. Verify that the new producer cluster is operational and the consumer cluster is able to access the shared data.
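The recreate and repoint steps above come down to a handful of SQL statements. The sketch below generates them for a share within one AWS account; the share, schema, database, and namespace values are placeholders, and the helpers assume trusted identifiers because they do no quoting or escaping. Adapt the statements to your own objects and to cross-account sharing if you use it.

```python
def producer_statements(share: str, schema: str, consumer_namespace: str) -> list:
    """SQL to run on the new producer cluster to recreate a datashare."""
    return [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD ALL TABLES IN SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]


def consumer_statement(database: str, share: str, producer_namespace: str) -> str:
    """SQL to run on the consumer so a database points at the new producer."""
    return (
        f"CREATE DATABASE {database} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer_namespace}';"
    )


# Placeholder names and namespace IDs for illustration only
for statement in producer_statements("salesshare", "public", "<consumer-namespace-id>"):
    print(statement)
print(consumer_statement("sales_db", "salesshare", "<new-producer-namespace-id>"))
```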

In the event of an outage in the Region where the consumer cluster is deployed, you will need to create a new consumer cluster in a different Region. This makes sure all applications that are connecting to the consumer cluster continue to function as expected, with proper access.

The steps to accomplish this are as follows:

  1. Identify an alternate Region that is not affected by the outage.
  2. Provision a new consumer cluster in the alternate Region.
  3. Provide necessary access to data sharing objects.
  4. Update the application configurations to point to the new consumer cluster.
  5. Validate that all the applications are able to connect to the new consumer cluster and are functioning as expected.

For additional information on how to configure data sharing, refer to Sharing datashares.

Federated queries

With federated queries in Amazon Redshift, you can query and analyze data across operational databases, data warehouses, and data lakes. If you’re using federated queries, you need to set up federated queries from the failover cluster as well to prevent any application failure.

Summary

In this post, we discussed various failure scenarios and recovery strategies associated with Amazon Redshift. Disaster recovery solutions make restoring your data and workloads seamless so you can get business operations back online quickly after a catastrophic event.

As an administrator, you can now work on defining your Amazon Redshift disaster recovery strategy and implement it to minimize business disruptions. You should develop a comprehensive plan that includes:

  • Identifying critical Redshift resources and data
  • Establishing backup and recovery procedures
  • Defining failover and failback processes
  • Enforcing data integrity and consistency
  • Implementing disaster recovery testing and drills

Try out these strategies for yourself, and leave any questions and feedback in the comments section.


About the authors

Nita Shah is a Senior Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Jason Pedreza is a Senior Redshift Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com and Amazon Devices. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Agasthi Kothurkar is an AWS Solutions Architect, and is based in Boston. Agasthi works with enterprise customers as they transform their business by adopting the Cloud. Prior to joining AWS, he worked with leading IT consulting organizations on customer engagements spanning Cloud Architecture, Enterprise Architecture, IT Strategy, and Transformation. He is passionate about applying Cloud technologies to resolve complex real world business problems.

Let’s Architect! Migrating to the cloud with AWS

Post Syndicated from Federica Ciuffo original https://aws.amazon.com/blogs/architecture/lets-architect-migrating-to-the-cloud-with-aws/

In today’s digital world, businesses are increasingly turning to the cloud for its scalability, agility, and cost-effectiveness. Migrating your data center to the cloud can be a daunting task, but with the right approach and tools, it can be a successful journey. This Let’s Architect! blog post will guide you through the process of migrating to the cloud with AWS, leveraging the proven AWS Cloud Adoption Framework (AWS CAF) and exploring valuable resources to help you navigate each step.

AWS Cloud Adoption Framework

The AWS Cloud Adoption Framework (CAF) provides a comprehensive approach to planning, designing, and deploying your cloud migration. This robust framework outlines a four-phase methodology that guides you through every stage of the process, from strategy and planning to ongoing management and optimization. Here’s a closer look at the four phases of the AWS CAF:

  • Envision: Identify business transformation opportunities that align with your strategic goals and demonstrate how the cloud will accelerate your business outcomes.
  • Align: Assess your organization’s cloud readiness by identifying capability gaps across six key perspectives (Business, People, Governance, Platform, Security, and Operations). Address these gaps by developing strategies, ensuring stakeholder alignment, and implementing relevant change management activities.
  • Launch: Select impactful pilot initiatives and deploy them in production. These pilots should showcase the value proposition of the cloud and provide valuable insights for further refinement.
  • Scale: Focus on expanding production pilots and business value to desired scale and ensuring that the business benefits associated with your cloud investments are realized and sustained.

Figure 1. The AWS CAF recommends four iterative and incremental cloud transformation phases

Take me to this whitepaper!

Large-scale migration and modernization

Migrating a large-scale data center to the cloud requires careful planning and execution. This video session focuses on valuable lessons learned from the thousands of enterprises who have migrated and modernized their on-premises workloads with AWS. Dive deep on technical lessons learned, mental models used, how to set up teams to modernize as they migrate, and how to engage with AWS Professional Services and AWS Partners for success. Finally, you will get insights on the latest AWS migration and modernization tools.


Figure 2. Migrating to AWS Cloud unlocked major benefits for Live Nation, including a 58% cost saving

Take me to this video!

Dive deep into different AWS DMS migration options

At the heart of any successful data migration lies a robust database migration strategy. AWS Database Migration Service (AWS DMS) empowers you with a comprehensive suite of tools to seamlessly move and replicate your data. This session explains the various options offered by AWS DMS, including logical replication, managed native methods for export, import, and replication, and bulk extract and load functionalities. Through these options, you’ll gain a thorough understanding of how to migrate and replicate your data, along with the distinct advantages of each approach. The session also explores performance considerations to ensure optimal migration efficiency. Finally, you will learn how modern capabilities like serverless technologies, auto scaling, and schema conversion can simplify migrations.

Figure 3. AWS DMS Schema Conversion converts your existing database schemas and a majority of the database code objects to a format compatible with the target database

Take me to this video!

Application Migration with AWS

Migrating and modernizing your applications is a crucial aspect of your cloud adoption strategy. The Application Migration with AWS workshop series provides hands-on experience with planning and executing application migrations. You’ll learn practical techniques like database replatforming, application rehosting, and containerization to make your move to the cloud smooth and efficient.


Figure 4. As part of this lab, you will perform a database migration with AWS DMS

Take me to this workshop!

But the journey doesn’t end there. As your applications scale in the cloud, managing that growth becomes key. This is where infrastructure as code (IaC) comes in, and AWS CDK takes IaC a step further by allowing you to write infrastructure code in familiar programming languages you already know. This streamlines your migration by leveraging your existing coding knowledge. We recommend this AWS CDK workshop to get started with CDK for infrastructure automation.

See you next time!

Thanks for reading! With this post, we provided resources to help you navigate your cloud migration journey with confidence and success. In the next blog, we will talk about Well-Architected best practices!

To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.

Best practices working with self-hosted GitHub Action runners at scale on AWS

Post Syndicated from Shilpa Sharma original https://aws.amazon.com/blogs/devops/best-practices-working-with-self-hosted-github-action-runners-at-scale-on-aws/

Overview

GitHub Actions is a continuous integration and continuous deployment platform that enables the automation of build, test and deployment activities for your workload. GitHub Self-Hosted Runners provide a flexible and customizable option to run your GitHub Action pipelines. These runners allow you to run your builds on your own infrastructure, giving you control over the environment in which your code is built, tested, and deployed. This reduces your security risk and costs, and gives you the ability to use specific tools and technologies that may not be available in GitHub hosted runners. In this blog, I explore security, performance and cost best practices to take into consideration when deploying GitHub Self-Hosted Runners to your AWS environment.

Best Practices

Understand your security responsibilities

GitHub Self-hosted runners, by design, execute code defined within a GitHub repository, either through the workflow scripts or through the repository build process. You must understand that the security of your AWS runner execution environments is dependent upon the security of your GitHub implementation. Whilst a complete overview of GitHub security is outside the scope of this blog, I recommend that before you begin integrating your GitHub environment with your AWS environment, you review and understand at least the following GitHub security configurations.

  • Federate your GitHub users, and manage the lifecycle of identities through a directory.
  • Limit administrative privileges of GitHub repositories, and restrict who is able to administer permissions, write to repositories, modify repository configurations or install GitHub Apps.
  • Limit control over GitHub Actions runner registration and group settings
  • Limit control over GitHub workflows, and follow GitHub’s recommendations on using third-party actions
  • Do not allow public repositories access to self-hosted runners

Reduce security risk with short-lived AWS credentials

Make use of short-lived credentials wherever you can. They expire by default within 1 hour, and you do not need to rotate or explicitly revoke them. Short lived credentials are created by the AWS Security Token Service (STS). If you use federation to access your AWS account, assume roles, or use Amazon EC2 instance profiles and Amazon ECS task roles, you are using STS already!

In almost all cases, you do not need long-lived AWS Identity and Access Management (IAM) credentials (access keys) even for services that do not “run” on AWS – you can extend IAM roles to workloads outside of AWS without requiring you to manage long-term credentials. With GitHub Actions, we suggest you use OpenID Connect (OIDC). OIDC is a decentralized authentication protocol that is natively supported by STS using sts:AssumeRoleWithWebIdentity, GitHub and many other providers. With OIDC, you can create least-privilege IAM roles tied to individual GitHub repositories and their respective actions. GitHub Actions exposes an OIDC provider to each action run that you can utilize for this purpose.
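As a rough illustration of tying a least-privilege role to a single repository, the sketch below builds an IAM trust policy document for the GitHub OIDC provider. The account ID, organization, repository, and branch are placeholder values, and the helper name `buildGitHubOidcTrustPolicy` is hypothetical. It assumes the GitHub OIDC provider has already been registered in the account.

```javascript
// Builds an IAM role trust policy that lets workflows from one GitHub
// repository and branch assume the role through GitHub's OIDC provider.
// All input values are illustrative placeholders.
function buildGitHubOidcTrustPolicy({ accountId, org, repo, branch }) {
  const provider = 'token.actions.githubusercontent.com';
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Principal: {
          Federated: `arn:aws:iam::${accountId}:oidc-provider/${provider}`,
        },
        Action: 'sts:AssumeRoleWithWebIdentity',
        Condition: {
          StringEquals: {
            // Audience used when the token is requested for AWS STS
            [`${provider}:aud`]: 'sts.amazonaws.com',
            // Restrict the role to one repository and one branch
            [`${provider}:sub`]: `repo:${org}/${repo}:ref:refs/heads/${branch}`,
          },
        },
      },
    ],
  };
}

const policy = buildGitHubOidcTrustPolicy({
  accountId: '111122223333',
  org: 'testorg-poc',
  repo: 'github-actions-test-repo',
  branch: 'main',
});
console.log(JSON.stringify(policy, null, 2));
```

Matching on the `sub` claim is what scopes the role to a repository. Loosening it, for example with a wildcard and `StringLike`, widens who can assume the role, so treat that as a deliberate decision.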

Short lived AWS credentials with GitHub self-hosted runners


If you have many repositories that you wish to grant an individual role to, you may run into a hard limit on the number of IAM roles in a single account. While I advocate solving this problem with a multi-account strategy, you can alternatively scale this approach by:

  • using attribute based access control (ABAC) to match claims in the GitHub token (such as repository name, branch, or team) to the AWS resource tags.
  • using role based access control (RBAC) by logically grouping the repositories in GitHub into Teams or applications to create a smaller set of roles.
  • using an identity broker pattern to vend credentials dynamically based on the identity provided to the GitHub workflow.

Use Ephemeral Runners

Configure your GitHub Action runners to run in “ephemeral” mode, which creates (and destroys) individual short-lived compute environments per job on demand. The short environment lifespan and per-build isolation reduce the risk of data leakage, even in multi-tenanted continuous integration environments, as each build job remains isolated from others on the underlying host.

As each job runs on a new environment created on demand, there is no need for a job to wait for an idle runner, simplifying auto-scaling. With the ability to scale runners on demand, you do not need to worry about turning build infrastructure off when it is not needed (for example out of office hours), giving you a cost-efficient setup. To optimize the setup further, consider allowing developers to tag workflows with instance type tags and launch specific instance types that are optimal for respective workflows.

There are a few considerations to take into account when using ephemeral runners:

  • A job will remain queued until the runner EC2 instance has launched and is ready. This can take up to 2 minutes to complete. To speed up this process, consider using an optimized AMI with all prerequisites installed.
  • Since each job is launched on a fresh runner, utilizing caching on the runner is not possible. For example, Docker images and libraries will always be pulled from source.

Use Runner Groups to isolate your runners based on security requirements

By using ephemeral runners in a single GitHub runner group, you are creating a pool of resources in the same AWS account that are used by all repositories sharing this runner group. Your organizational security requirements may dictate that your execution environments must be isolated further, such as by repository or by environment (such as dev, test, prod).

Runner groups allow you to define the runners that will execute your workflows on a repository-by-repository basis. Creating multiple runner groups not only allows you to provide different types of compute environments, but also allows you to place your workflow executions in locations within AWS that are isolated from each other. For example, you may choose to locate your development workflows in one runner group and test workflows in another, with each ephemeral runner group being deployed to a different AWS account.

Runners by definition execute code on behalf of your GitHub users. At a minimum, I recommend that your ephemeral runner groups are contained within their own AWS account and that this AWS account has minimal access to other organizational resources. When access to organizational resources is required, this can be given on a repository-by-repository basis through IAM role assumption with OIDC, and these roles can be given least-privilege access to the resources they require.

Optimize runner start up time using Amazon EC2 warm-pools

Ephemeral runners provide strong build isolation, simplicity and security. Since the runners are launched on demand, the job will be required to wait for the runner to launch and register itself with GitHub. While this usually happens in under 2 minutes, this wait time might not be acceptable in some scenarios.

We can use a warm pool of pre-registered ephemeral runners to reduce the wait time. These runners will listen to the incoming GitHub workflow events actively and as soon as an incoming workflow event is queued, it is picked up readily by the warm pool of registered EC2 runners.

While there can be multiple strategies to manage the warm pool, I recommend the following strategy which uses AWS Lambda for scaling up and scaling down the ephemeral runners:

GitHub self-hosted runners warm pool flow


A GitHub workflow event is created on a trigger such as a push of code to a master repository or the merge of a pull request. This event triggers a Lambda function via a webhook and an Amazon API Gateway endpoint. The Lambda function validates the GitHub workflow event payload and logs events for observability and for building metrics. It can optionally be used to replenish the warm pool. Separate backend Lambda functions launch, scale up, and scale down the warm pool of EC2 instances. The EC2 instances, or runners, are registered with GitHub at launch. The registered runners listen for incoming GitHub workflow events using GitHub’s internal job queue, and as soon as a workflow event is triggered, GitHub assigns it to one of the runners in the warm pool for job execution. The runner is automatically de-registered once the job completes. A job can be a build or deploy request, as defined in your GitHub workflow.

With a warm pool in place, wait time is expected to drop by 70-80%.

Considerations

  • Complexity increases because runners can be over-provisioned. The extent depends on how long a runner EC2 instance takes to launch and reach a ready state and how frequently the scale-up Lambda function is configured to run. For example, if the scale-up Lambda function runs every 1 minute and the EC2 runner requires 2 minutes to launch, the scale-up Lambda function will launch 2 instances. The mitigation is to use Auto Scaling groups to manage the EC2 warm pool and desired capacity, with predictive scaling policies tied to incoming GitHub workflow events, that is, build job requests.
  • This strategy may have to be revised when supporting Windows or Mac based runners given the spin up times can vary.
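The over-provisioning arithmetic in the first consideration can be sketched as follows. This is a simplified model, not the behavior of any particular scaler: it assumes every scale-up invocation that runs before the first instance is ready launches one more instance for the same queued job. The function name is hypothetical.

```javascript
// Estimates how many instances get launched for one queued job when the
// scale-up function keeps firing before the first instance is ready.
// Simplified model: one launch per invocation until the first launch completes.
function instancesLaunchedBeforeReady(launchSeconds, scaleUpIntervalSeconds) {
  return Math.ceil(launchSeconds / scaleUpIntervalSeconds);
}

// Scale-up runs every 60 seconds, the runner needs 120 seconds to become ready
console.log(instancesLaunchedBeforeReady(120, 60)); // 2
```

Under this model, over-provisioning shrinks either by making the scale-up interval at least as long as the launch time or by shortening the launch itself.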

Use an optimized AMI to speed up the launch of GitHub self-hosted runners

Amazon Machine Images (AMI) provide a pre-configured, optimized image that can be used to launch the runner EC2 instance. By using AMIs, you will be able to reduce the launch time of a new runner since dependencies and tools are already installed. Consistency across builds is guaranteed due to all instances running the same version of dependencies and tools. Runners will benefit from increased stability and security compliance as images are tested and approved before being distributed for use as runner instances.

When building an AMI for use as a GitHub self-hosted runner the following considerations need to be made:

  • Choosing the right OS base image for the builds. This will depend on your tech stack and toolset.
  • Install the GitHub runner app as part of the image. Ensure automatic runner updates are enabled to reduce the overhead of managing runner versions. If a specific runner version must be used, you can disable automatic runner updates to avoid untested changes. Keep in mind that, if disabled, a runner will need to be updated manually within 30 days of a new version becoming available.
  • Install build tools and dependencies from trusted sources.
  • Ensure runner logs are captured and forwarded to your security information and event management (SIEM) of choice.
  • The runner requires internet connectivity to access GitHub. This may require configuring proxy settings on the instance depending on your networking setup.
  • Configure any artifact repositories the runner requires. This includes sources and authentication.
  • Automate the creation of the AMI using tools such as EC2 Image Builder to achieve consistency.

Use Spot instances to save costs

The cost of scaling up the runners and maintaining a warm pool can be minimized by using Spot Instances, which can save up to 90% compared to On-Demand prices. However, some longer-running builds or batch jobs cannot tolerate a Spot Instance being terminated on 2 minutes’ notice. A mixed pool of instances is a good option here: route such jobs to On-Demand EC2 instances and the rest to Spot Instances to cater for diverse build needs. This can be done by assigning labels to the runner during launch or registration. For the On-Demand instances, you can put a Savings Plan in place to reduce costs.
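A minimal sketch of the label-based routing described above. The label names `long-running` and `batch` are hypothetical; use whatever labels your workflows and runners agree on.

```javascript
// Labels that mark a job as unable to tolerate a Spot interruption.
// These label names are illustrative, not a GitHub or AWS convention.
const ON_DEMAND_LABELS = new Set(['long-running', 'batch']);

// Decides which capacity type a job should run on, based on its labels.
function chooseCapacityType(jobLabels) {
  const needsOnDemand = jobLabels.some((label) => ON_DEMAND_LABELS.has(label));
  return needsOnDemand ? 'on-demand' : 'spot';
}

console.log(chooseCapacityType(['self-hosted', 'linux']));        // spot
console.log(chooseCapacityType(['self-hosted', 'long-running'])); // on-demand
```

Defaulting to Spot keeps the cheaper option as the norm, so only jobs that explicitly opt in pay On-Demand prices.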

Record runner metrics using Amazon CloudWatch for Observability

It is vital for the observability of the overall platform to generate metrics for the EC2 based GitHub self-hosted runners. Examples of the GitHub runners metrics can be: the number of GitHub workflow events queued or completed in a minute, or number of EC2 runners up and available in the warm pool etc.

We can log the triggered workflow events and runner logs in Amazon CloudWatch and then use CloudWatch embedded metrics to collect metrics such as number of workflow events queued, in progress and completed. Using elements like “started_at” and “completed_at” timings which are part of workflow event payload we can calculate build wait time.

As an example, below is a sample incoming GitHub workflow event logged in Amazon CloudWatch Logs:

    {
      "hostname": "xxx.xxx.xxx.xxx",
      "requestId": "aafddsd55-fgcf555",
      "date": "2022-10-11T05:50:35.816Z",
      "logLevel": "info",
      "logLevelId": 3,
      "filePath": "index.js",
      "fullFilePath": "/var/task/index.js",
      "fileName": "index.js",
      "lineNumber": 83889,
      "columnNumber": 12,
      "isConstructor": false,
      "functionName": "handle",
      "argumentsArray": [
        "Processing Github event",
        "{\"event\":\"workflow_job\",\"repository\":\"testorg-poc/github-actions-test-repo\",\"action\":\"queued\",\"name\":\"jobname-buildanddeploy\",\"status\":\"queued\",\"started_at\":\"2022-10-11T05:50:33Z\",\"completed_at\":null,\"conclusion\":null}"
      ]
    }

To turn the logged elements above into metrics by capturing "status":"queued", "repository":"testorg-poc/github-actions-test-repo", "name":"jobname-buildanddeploy", and the workflow "event", you can build embedded metrics in the application code or in a metrics Lambda function using any of the CloudWatch embedded metric format client libraries for the language of your choice (see “Creating logs in embedded metric format using the client libraries” in the Amazon CloudWatch documentation).

Essentially, what one of those libraries does under the hood is map elements from the log event into dimension fields, so that CloudWatch can read them and generate a metric.

    console.log(
      JSON.stringify({
        message: '[Embedded Metric]', // Identifier for metric logs in CW logs
        build_event_metric: 1, // Metric Name and value
        status: `${status}`, // Dimension name and value
        eventName: `${eventName}`,
        repository: `${repository}`,
        name: `${name}`,

        _aws: {
          Timestamp: Date.now(),
          CloudWatchMetrics: [
            {
              Namespace: `demo_2`,
              Dimensions: [['status','eventName','repository','name']],
              Metrics: [
                {
                  Name: 'build_event_metric',
                  Unit: 'Count',
                },
              ],
            },
          ],
        },
      })
    );

A sample architecture:

Consumption of GitHub webhook events


The CloudWatch metrics can be published to your dashboards or forwarded to any external tool based on requirements. Once you have metrics, CloudWatch alarms and notifications can be configured to manage pool exhaustion.

Conclusion

In this blog post, we outlined several best practices covering security, scalability and cost efficiency when using GitHub Actions with EC2 self-hosted runners on AWS. We covered how using short-lived credentials combined with ephemeral runners will reduce security and build contamination risks. We also showed how runners can be optimized for faster startup and job execution using optimized AMIs and warm EC2 pools. Last but not least, cost efficiencies can be maximized by using Spot instances for runners in the right scenarios.


Fault-isolated, zonal deployments with AWS CodeDeploy

Post Syndicated from Michael Haken original https://aws.amazon.com/blogs/devops/fault-isolated-zonal-deployments-with-aws-codedeploy/

In this blog post you’ll learn how to use a new feature in AWS CodeDeploy to deploy your application one Availability Zone (AZ) at a time to help increase the operational resilience of your services through improved fault isolation.

Introducing change to a system can be a time of risk. Even the most advanced CI/CD systems with comprehensive testing and phased deployments can still promote a bad change into production. A common approach to reduce this risk is using fractional deployments and closely monitoring critical metrics like availability and latency to gauge a deployment’s success. If the deployment shows signs of failure, the CI/CD system initiates an automated rollback.

This blog post will show you how to leverage CodeDeploy zonal deployments as part of a holistic AZ independent (AZI) architecture strategy, both patterns that many AWS service teams follow. With this feature, you no longer need to distinguish between infrastructure or deployment failures in order to respond to the event. You can use the same observability tools and recovery techniques for both, which allows you to contain the scope of impact to a single AZ and mitigate the impact more quickly and with less complexity. First, let’s define what an AZI architecture is so we can understand how this feature in CodeDeploy supports it.

Availability Zone independence

Fault isolation is an architectural pattern that limits the scope of impact of failures by creating independent fault containers that don’t share fate. It also allows you to quickly recover from failures by shifting traffic or resources away from the impaired fault container. AWS provides a number of different fault isolation boundaries, but the ones most people are familiar with are AZs and Regions. When you build multi-AZ architectures, you can design your application to implement AZI that uses the fault boundary provided by AZs to keep the interaction of resources isolated to their respective AZ (to the greatest extent possible).

A three tier Availability Zone independent architecture deployed across three AZs

An Availability Zone independent (AZI) architecture implemented by disabling cross-zone load balancing and using an independent database read replica per AZ. Only database writes have to cross an AZ boundary.

The result is that the impacts from an impairment in one AZ don’t cascade to resources in other AZs, making the operation of your application in one AZ independent from events in the others. You should also monitor the health of each AZ independently, for example by looking at per-AZ load balancer HTTPCode_Target_5XX_Count metrics, or by sending synthetic requests to the load balancer nodes in each AZ and recording availability and latency metrics for each. When an event occurs that impacts your availability or latency in a single AZ, you can use capabilities like Amazon Route 53 Application Recovery Controller zonal shift to shift traffic away from that AZ to quickly mitigate impact, often in single-digit minutes.

Using zonal shift to shift traffic away from a single AZ

Using zonal shift to move traffic away from the AZ experiencing a service impairment

Traditional deployment strategy challenges

During an event, SRE, engineering, or operations teams can spend a lot of time trying to figure out if the source of impact is an infrastructure problem or related to a failed deployment. Then, based on the identified cause, they may take different mitigation actions. Thus, precious time is spent investigating the source of impact and deciding on the appropriate mitigation plan.

When the cause is due to a failed deployment, traditionally rollbacks are used to mitigate the problem. But rollbacks, even when automated, take time to complete. For example, let’s say your deployment batches take 5 minutes to complete, you deploy in 10% batches, and you’re halfway through a deployment to 100 instances when the rollback is initiated. This means it’s going to take at least 25 minutes to finish the rollback (5 batches, each taking 5 minutes to re-deploy to). And it’s entirely possible during that time that instances where the new software was deployed continue to pass health checks, but result in errors being returned to your customers. In the worst case, if all instances had been deployed to, this event could last for almost an hour with customers being impacted during the entire rollback process. In some cases, deployments can’t be rolled back and have to be rolled forward, meaning a new, updated version needs to be deployed to fix the previous deployment. Writing the code for the new deployment and testing it adds to the recovery time of your system and can be error prone.
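The rollback arithmetic above can be written out as a small calculation. This is only a sketch of the example in the text, with the simplifying assumption that every batch already deployed to must be re-deployed sequentially. The function and parameter names are hypothetical.

```javascript
// Estimates rollback duration when each already-deployed batch has to be
// re-deployed one after another (simplifying assumption).
function rollbackMinutes({ batchPercent, minutesPerBatch, percentDeployed }) {
  const batchesToRedeploy = Math.ceil(percentDeployed / batchPercent);
  return batchesToRedeploy * minutesPerBatch;
}

// Halfway through a deployment in 10% batches that take 5 minutes each
console.log(rollbackMinutes({ batchPercent: 10, minutesPerBatch: 5, percentDeployed: 50 })); // 25

// Worst case: every instance has already been deployed to
console.log(rollbackMinutes({ batchPercent: 10, minutesPerBatch: 5, percentDeployed: 100 })); // 50
```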

Additionally, if your unit of deployment includes multiple AZs, then your potential scope of impact from a failed deployment isn’t predictably bounded. For example, if your CodeDeploy deployment groups target Amazon Elastic Compute Cloud (Amazon EC2) instances based on tags or an Amazon EC2 Auto Scaling group that spans multiple AZs, then you could see impact across the whole Region, even if you’re using fractional deployments. There’s not a smaller fault container that helps consistently limit the scope of impact to a predictable size.

Let’s look at how we can overcome these two challenges by using zonal deployments with CodeDeploy.

Zonal deployments with AWS CodeDeploy

One of the best practices we follow at AWS, described in My CI/CD pipeline is my release captain, is performing fractional deployments aligned to intended fault isolation boundaries, like individual hosts, cells, AZs, and Regions. When we release a change, the deployment is separated into waves, which represent fault containers (like Regions) that are deployed to in parallel, and those are further separated into stages. Within a single Region, the deployment starts with a one-box environment, representing a single host, then moves on to fractional batches (like 10% at a time) inside a single AZ, waits for a period of bake time, moves on to the next AZ, and so on until we’ve completed rolling out the change.

Four stages in a deployment pipeline showing per AZ deployments with bake time.

Deployment stages aligned to intended fault isolation boundaries within a single deployment wave for one Region

By aligning each stage to an expected fault isolation boundary, we create well-defined fault containers that provide an understood and bounded scope of impact in the case that something goes wrong with a deployment. You can take advantage of this same deployment strategy in your own applications by using zonal deployments in CodeDeploy. To utilize this capability, you need to define a custom deployment configuration, as shown below.

The configuration options for a CodeDeploy deployment configuration using a zonal configuration

Creating a zonal deployment configuration that deploys to 10% of the EC2 instances in each AZ at a time, one AZ at a time

This configuration defines a few important properties. First, it enables the zonal configuration, which ensures deployments will be phased one AZ at a time. In this case, updates will be deployed to batches of 10% of the instances in each AZ (see the minimum number of healthy instances per Availability Zone for more details on configuring this setting). Second, it defines a monitor duration, which is the bake time where the effects of the changes are observed before moving on to the next AZ. This ensures sufficient use of the new software to discover any potential bugs or problems before moving on. The value in this example is defined as 900 seconds, or 15 minutes. You should ensure this value is longer than the time it takes for your alarms to trigger. For example, if you are using an M of N alarm for availability and/or latency, that is using 3 data points out of 5 with 1-minute intervals, you need to make sure your bake time is set to greater than 600 seconds, otherwise, you might move on to the next AZ before your alarm has a chance to mark the deployment as unsuccessful. Finally, I’ve also defined a first zone monitor duration. This overrides the “monitor duration” for the first AZ being deployed to. This is useful since the first AZ is acting as our canary or one-box environment and we may want to wait additional time to be really confident the deployment is successful before moving on to the second AZ.
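For readers who prefer to define this configuration programmatically, the sketch below assembles a request body with the properties discussed above. The field names follow my understanding of the CodeDeploy `CreateDeploymentConfig` API and should be checked against the current API reference. The configuration name and the 1800-second first zone monitor duration are made-up values, and the helper name is hypothetical.

```javascript
// Assembles parameters for a zonal deployment configuration.
// Field names are assumed from the CodeDeploy CreateDeploymentConfig API;
// verify them before use. Input values are illustrative.
function buildZonalDeploymentConfig({ name, batchPercent, monitorSeconds, firstZoneMonitorSeconds }) {
  // Deploying batchPercent at a time means keeping the rest healthy
  const minimumHealthy = { type: 'FLEET_PERCENT', value: 100 - batchPercent };
  return {
    deploymentConfigName: name,
    computePlatform: 'Server',
    minimumHealthyHosts: { ...minimumHealthy },
    zonalConfig: {
      monitorDurationInSeconds: monitorSeconds, // bake time per AZ
      firstZoneMonitorDurationInSeconds: firstZoneMonitorSeconds, // longer bake for the first AZ
      minimumHealthyHostsPerZone: { ...minimumHealthy },
    },
  };
}

const config = buildZonalDeploymentConfig({
  name: 'Zonal10PercentExample', // hypothetical name
  batchPercent: 10,
  monitorSeconds: 900,
  firstZoneMonitorSeconds: 1800, // hypothetical value
});
console.log(JSON.stringify(config, null, 2));
```

The resulting object could be passed to the SDK or CLI call that creates the deployment configuration. Only the shape of the zonal settings is shown here.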

If your service is deployed behind a load balancer with cross-zone load balancing disabled (which is important to achieve AZI), carefully consider your batch size. The load balancer evenly splits traffic across AZs regardless of how many healthy hosts are in each AZ. Ensure your batch size is small enough that the temporary reduction in capacity during each batch doesn’t overwhelm the remaining instances in the same AZ. You can use the CodeDeploy minimum healthy hosts per AZ option to ensure there are enough healthy hosts in the AZ during a deployment batch or Elastic Load Balancing (ELB) target group minimum healthy target count with DNS failover to shift traffic away from the AZ if too few targets are present.

Recovering from a failed zonal deployment

When a failure occurs, the highest priority is mitigating the impact, not fixing the root cause. While an automated rollback can help achieve both for a failed deployment, using a zonal shift can improve your recovery time. Let’s take a simple dashboard like the following figure. The top graph shows your availability as perceived by customers through using the regional endpoint of your load balancer like https://load-balancer-name-and-id.elb.us-east-1.amazonaws.com. The graphs below it show the measured availability from Amazon CloudWatch Synthetics canaries that test the load balancer endpoints in each AZ using endpoints like https://us-east-1a.load-balancer-name-and-id.elb.us-east-1.amazonaws.com.

Dashboards showing a drop in availability in one AZ that also impacts the regional customer experience

Dashboard showing impact in one AZ that affects the availability of the service

We can see that something starts impacting resources in AZ1 at 10:38, causing an availability drop. As we would expect, this impact is also seen by customers, shown in the top graph, but it’s unclear what the underlying cause of the availability drop is. Using the approach described in this post, it turns out that it doesn’t matter. Within a few minutes, at 10:41, the CloudWatch composite alarm monitoring the health of AZ1 transitions to the ALARM state and invokes a Lambda function that reads the alarm’s definition to get the AZ ID and ALB ARN involved, and initiates the zonal shift. It’s important that the alarm logic only reacts when a single AZ is impacted. If there were impact in more than one AZ, we would need to treat this as a Regional issue.
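The single-AZ rule can be sketched as a small decision function. This is an illustrative outline rather than the Lambda function used in the post: the input shape (a map of AZ ID to alarm state), the ARN, the one-hour expiry, and the comment text are all assumptions. The returned field names follow my understanding of the zonal shift `StartZonalShift` API and should be verified.

```javascript
// Returns zonal shift parameters only when exactly one AZ is in ALARM.
// With zero or multiple impaired AZs it returns null, because that case
// is either healthy or should be handled as a Regional issue.
function planZonalShift(azAlarmStates, resourceArn) {
  const impaired = Object.entries(azAlarmStates)
    .filter(([, state]) => state === 'ALARM')
    .map(([azId]) => azId);

  if (impaired.length !== 1) {
    return null;
  }
  return {
    resourceIdentifier: resourceArn,
    awayFrom: impaired[0],
    expiresIn: '1h', // illustrative expiry
    comment: 'Automated shift after single-AZ availability alarm',
  };
}

// Placeholder ARN for illustration only
const albArn = 'arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example-alb/0123456789abcdef';

console.log(planZonalShift({ 'use1-az1': 'ALARM', 'use1-az2': 'OK', 'use1-az3': 'OK' }, albArn));
console.log(planZonalShift({ 'use1-az1': 'ALARM', 'use1-az2': 'ALARM', 'use1-az3': 'OK' }, albArn)); // null
```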

The process of identifying a failed deployment in a single AZ and responding with a zonal shift.

After a failed deployment to AZ1, an automatically initiated zonal shift quickly mitigates the customer impact

Then, after a few more minutes, at 10:44, we can see availability from the customer perspective has gone back up to 100% by shifting traffic away from AZ1.

Dashboards showing the regional customer experience has recovered while the AZ is still impacted by the failed deployment

The impact of the failed deployment is mitigated by shifting traffic away from AZ1

It turns out the cause of impact in this case was a failed deployment, and we can see that our synthetic canaries still see the failure while the deployment is rolling back, but we’ve achieved our primary goal of quickly removing the impact to the customer experience. From the start of impact to mitigation, 6 minutes elapsed, which was significantly faster than waiting for the deployment to roll back completely. After the rollback is complete, at 10:58, 20 minutes after the start of the event, we can see the alarm transition back to the OK state and availability return to normal in AZ1, meaning we can end the zonal shift and return to normal operation.

Dashboards showing that after the rollback is complete, the impact to the single AZ has also dissipated

After the deployment rollback is complete, the availability in AZ1 recovers and the zonal shift can be ended

Conclusion

Performing zonal deployments helps improve the effectiveness of AZI architectures. Aligning your deployments to your intended fault isolation boundaries, in this case AZs, creates a predictable scope of impact and helps prevent cascading failures. This in turn allows you to use a common set of observability and mitigation tools for both single-AZ infrastructure events and failed deployments, which can mitigate the impact faster than automated rollbacks. Additionally, removing the ambiguity operators face when selecting a recovery strategy further reduces recovery time and complexity. Learn more about zonal deployments in AWS CodeDeploy here.

Michael Haken


Michael is a Senior Principal Solutions Architect on the AWS Strategic Accounts team where he helps customers innovate, differentiate their business, and transform their customer experiences. He has 15 years’ experience supporting financial services, public sector, and digital native customers. Michael has his B.A. from UVA and M.S. in Computer Science from Johns Hopkins. Outside of work you’ll find him playing with his family and dogs on his farm.

Stream multi-tenant data with Amazon MSK

Post Syndicated from Emanuele Levi original https://aws.amazon.com/blogs/big-data/stream-multi-tenant-data-with-amazon-msk/

Real-time data streaming has become prominent in today’s world of instantaneous digital experiences. Modern software as a service (SaaS) applications across all industries rely more and more on continuously generated data from different data sources such as web and mobile applications, Internet of Things (IoT) devices, social media platforms, and ecommerce sites. Processing these data streams in real time is key to delivering responsive and personalized solutions, and maximizes the value of data by processing it as close to the event time as possible.

AWS helps SaaS vendors by providing the building blocks needed to implement a streaming application with Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), and real-time processing applications with Amazon Managed Service for Apache Flink.

In this post, we look at implementation patterns a SaaS vendor can adopt when using a streaming platform as a means of integration between internal components, where streaming data is not directly exposed to third parties. In particular, we focus on Amazon MSK.

Streaming multi-tenancy patterns

When building streaming applications, you should take the following dimensions into account:

  • Data partitioning – Event streaming and storage needs to be isolated at the appropriate level, physical or logical, based on tenant ownership
  • Performance fairness – The performance coupling of applications processing streaming data for different tenants must be controlled and limited
  • Tenant isolation – A solid authorization strategy needs to be put in place to make sure tenants can access only their data

Underpinning all interactions with a multi-tenant system is the concept of SaaS identity. For more information, refer to SaaS Architecture Fundamentals.

SaaS deployment models

Tenant isolation is not optional for SaaS providers, and tenant isolation approaches will differ depending on your deployment model. The model is influenced by business requirements, and the models are not mutually exclusive. Trade-offs must be weighed across individual services to achieve a proper balance of isolation, complexity, and cost. There is no universal solution, and a SaaS vendor needs to carefully weigh their business and customer needs against three isolation strategies: silo, pool and bridge (or combinations thereof).

In the following sections, we explore these deployment models across data isolation, performance fairness, and tenant isolation dimensions.

Silo model

The silo model represents the highest level of data segregation, but also the highest running cost. Having a dedicated MSK cluster per tenant increases the risk of overprovisioning and requires duplication of management and monitoring tooling.

Having a dedicated MSK cluster per tenant makes sure tenant data partitioning occurs at the disk level when using an Amazon MSK Provisioned model. Both Amazon MSK Provisioned and Serverless clusters support server-side encryption at rest. Amazon MSK Provisioned further allows you to use a customer managed AWS Key Management Service (AWS KMS) key (see Amazon MSK encryption).

In a silo model, Kafka ACLs and quotas are not strictly required unless your business requirements call for them. Performance fairness is guaranteed because the resources of the entire MSK cluster are dedicated to applications producing and consuming events of a single tenant. This means spikes of traffic on a specific tenant can’t impact other tenants, and there is no risk of cross-tenant data access. As a drawback, having a provisioned cluster per tenant requires a right-sizing exercise per tenant, with a higher risk of overprovisioning than in the pool or bridge models.

You can implement tenant isolation at the MSK cluster level with AWS Identity and Access Management (IAM) policies, creating per-cluster credentials, depending on the authentication scheme in use.

Pool model

The pool model is the simplest model where tenants share resources. A single MSK cluster is used for all tenants with data split into topics based on the event type (for example, all events related to orders go to the topic orders), and all tenants’ events are sent to the same topic. The following diagram illustrates this architecture.

Image showing a single streaming topic with multiple producers and consumers

This model maximizes operational simplicity, but reduces the tenant isolation options available because the SaaS provider won’t be able to differentiate per-tenant operational parameters and all responsibilities of isolation are delegated to the applications producing and consuming data from Kafka. The pool model also doesn’t provide any mechanism of physical data partitioning, nor performance fairness. A SaaS provider with these requirements should consider either a bridge or silo model. If you don’t have requirements to account for parameters such as per-tenant encryption keys or tenant-specific data operations, a pool model offers reduced complexity and can be a viable option. Let’s dig deeper into the trade-offs.

A common strategy to implement consumer isolation is to identify the tenant within each event using a tenant ID. The options available with Kafka are passing the tenant ID either as event metadata (header) or part of the payload itself as an explicit field. With this approach, the tenant ID will be used as a standardized field across all applications within both the message payload and the event header. This approach can reduce the risk of semantic divergence when components process and forward messages because event headers are handled differently by different processing frameworks and could be stripped when forwarded. Conversely, the event body is often forwarded as a single object and no contained information is lost unless the event is explicitly transformed. Including the tenant ID in the event header as well may simplify the implementation of services allowing you to specify tenants that need to be recovered or migrated without requiring the provider to deserialize the message payload to filter by tenant.
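The following sketch illustrates carrying the tenant ID in both places. It is library-agnostic: the header name `tenant-id`, the payload field `tenant_id`, and the helper names are hypothetical, and headers are modeled as a list of (name, bytes) tuples, a shape that common Python Kafka clients accept.

```python
import json
from typing import Dict, List, Optional, Tuple

TENANT_HEADER = "tenant-id"  # hypothetical header name
Headers = List[Tuple[str, bytes]]


def build_event(tenant_id: str, body: Dict) -> Tuple[bytes, Headers]:
    """Serialize an event with the tenant ID in the payload and the header."""
    payload = dict(body, tenant_id=tenant_id)  # explicit field in the body
    value = json.dumps(payload).encode("utf-8")
    headers = [(TENANT_HEADER, tenant_id.encode("utf-8"))]
    return value, headers


def header_tenant(headers: Headers) -> Optional[str]:
    """Read the tenant ID from the headers, or None if it was stripped."""
    for key, raw in headers:
        if key == TENANT_HEADER:
            return raw.decode("utf-8")
    return None


def select_for_tenant(records: List[Tuple[bytes, Headers]],
                      tenant_id: str) -> List[Tuple[bytes, Headers]]:
    """Filter records by header only, without deserializing any payload."""
    return [r for r in records if header_tenant(r[1]) == tenant_id]


if __name__ == "__main__":
    records = [build_event("tenant1", {"order_id": 7}),
               build_event("tenant2", {"order_id": 8})]
    print(len(select_for_tenant(records, "tenant1")))
```

If an intermediate component drops the headers, the payload field still identifies the tenant, at the cost of deserializing the message.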

When specifying the tenant ID using either a header or as a field in the event, consumer applications will not be able to selectively subscribe to the events of a specific tenant. With Kafka, a consumer subscribes to a topic and receives all events sent to that topic for all tenants. Only after receiving an event will the consumer be able to inspect the tenant ID to filter for the tenant of interest, making access segregation virtually impossible. This means sensitive data must be encrypted to make sure a tenant can’t read another tenant’s data when viewing these events. In Kafka, server-side encryption can only be set at the cluster level, where all tenants sharing a cluster will share the same server-side encryption key.

In Kafka, data retention can only be set on the topic. In the pool model, events belonging to all tenants are sent to the same topic, so tenant-specific operations like deleting all data for a tenant will not be possible. The immutable, append-only nature of Kafka only allows an entire topic to be deleted, not selective events belonging to a specific tenant. If specific customer data in the stream requires the right to be forgotten, such as for GDPR, a pool model will not work for that data and silo should be considered for that specific data stream.

Bridge model

In the bridge model, a single Kafka cluster is used across all tenants, but events from different tenants are segregated into different topics. With this model, there is a topic for each group of related events per tenant. You can simplify operations by adopting a topic naming convention such as including the tenant ID in the topic name. This will practically create a namespace per tenant, and also allows different administrators to manage different tenants, setting permissions with a prefix ACL, and avoiding naming clashes (for example, events related to orders for tenant 1 go to tenant1.orders and orders of tenant 2 go to tenant2.orders). The following diagram illustrates this architecture.
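A naming convention like this can be sketched in a few lines. The helper names are hypothetical, and the rule that a tenant ID must not contain the separator is an assumption added here so that one tenant's prefix can never match another tenant's topics.

```python
from typing import Iterable, List

SEPARATOR = "."


def topic_name(tenant_id: str, event_type: str) -> str:
    """Build a per-tenant topic name such as 'tenant1.orders'."""
    if not tenant_id or SEPARATOR in tenant_id:
        raise ValueError("tenant ID must be non-empty and must not contain '.'")
    return f"{tenant_id}{SEPARATOR}{event_type}"


def tenant_prefix(tenant_id: str) -> str:
    """Prefix to use for prefix-based permissions on a tenant's topics."""
    # The trailing separator keeps 'tenant1' from matching 'tenant10.orders'.
    return f"{tenant_id}{SEPARATOR}"


def topics_for_tenant(all_topics: Iterable[str], tenant_id: str) -> List[str]:
    """Return only the topics in the given tenant's namespace."""
    prefix = tenant_prefix(tenant_id)
    return [t for t in all_topics if t.startswith(prefix)]


if __name__ == "__main__":
    topics = [topic_name("tenant1", "orders"), topic_name("tenant10", "orders")]
    print(topics_for_tenant(topics, "tenant1"))
```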

Image showing multiple producers and consumers each publishing to a stream-per-tenant

With the bridge model, server-side encryption using a per-tenant key is not possible. Data from different tenants is stored in the same MSK cluster, and server-side encryption keys can be specified per cluster only. For the same reason, data segregation can only be achieved at file level, because separate topics are stored in separate files. Amazon MSK stores all topics within the same Amazon Elastic Block Store (Amazon EBS) volume.

The bridge model offers per-tenant customization, such as retention policy or max message size, because Kafka allows you to set these parameters per topic. The bridge model also simplifies segregating and decoupling event processing per tenant, allowing a stronger isolation between separate applications that process data of separate tenants.
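As an illustration of per-tenant customization, the following sketch maps a subscription tier to topic-level settings. The tier names and values are hypothetical; `retention.ms` and `max.message.bytes` are the Kafka topic-level configuration keys for retention time and maximum message size. The resulting dictionary could be supplied as the topic configuration when creating each tenant's topics with your Kafka admin tooling.

```python
from typing import Dict

# Hypothetical subscription tiers and limits, for illustration only.
TIER_SETTINGS = {
    "basic": {"retention_days": 1, "max_message_kb": 256},
    "premium": {"retention_days": 7, "max_message_kb": 1024},
}

MS_PER_DAY = 24 * 60 * 60 * 1000


def topic_config_for(tier: str) -> Dict[str, str]:
    """Translate a tier into Kafka topic-level configuration entries."""
    settings = TIER_SETTINGS[tier]
    return {
        # Kafka topic configuration values are passed as strings.
        "retention.ms": str(settings["retention_days"] * MS_PER_DAY),
        "max.message.bytes": str(settings["max_message_kb"] * 1024),
    }


if __name__ == "__main__":
    print(topic_config_for("basic"))
    print(topic_config_for("premium"))
```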

To summarize, the bridge model offers the following capabilities:

  • Tenant processing segregation – A consumer application can selectively subscribe to the topics belonging to specific tenants and only receive events for those tenants. A SaaS provider will be able to delete data for specific tenants, selectively deleting the topics belonging to that tenant.
  • Selective scaling of the processing – With Kafka, the maximum number of parallel consumers is determined by the number of partitions of a topic, and the number of partitions can be set per topic, and therefore per tenant.
  • Performance fairness – You can implement performance fairness using Kafka quotas, supported by Amazon MSK, preventing the services processing a particularly busy tenant to consume too many cluster resources, at the expense of other tenants. Refer to the following two-part series for more details on Kafka quotas in Amazon MSK, and an example implementation for IAM authentication.
  • Tenant isolation – You can implement tenant isolation using IAM access control or Apache Kafka ACLs, depending on the authentication scheme that is used with Amazon MSK. Both IAM and Kafka ACLs allow you to control access per topic. You can authorize an application to access only the topics belonging to the tenant it is supposed to process.
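For the IAM option, a per-tenant consumer policy might look like the sketch below. It assumes the topic naming convention with the tenant ID as a prefix, and additionally assumes that consumer group names follow the same prefix, which is a convention invented for this example. The action names and ARN layout follow our understanding of IAM access control for Amazon MSK; verify them against the current documentation before use. All identifiers in the example call are placeholders.

```python
import json
from typing import Dict


def tenant_consumer_policy(region: str, account_id: str, cluster_name: str,
                           cluster_uuid: str, tenant_id: str) -> Dict:
    """Build a read-only IAM policy scoped to one tenant's topics."""
    base = f"arn:aws:kafka:{region}:{account_id}"
    cluster_path = f"{cluster_name}/{cluster_uuid}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Allow the client to connect to the shared cluster.
                "Effect": "Allow",
                "Action": ["kafka-cluster:Connect"],
                "Resource": f"{base}:cluster/{cluster_path}",
            },
            {   # Read only topics in this tenant's namespace.
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeTopic",
                           "kafka-cluster:ReadData"],
                "Resource": f"{base}:topic/{cluster_path}/{tenant_id}.*",
            },
            {   # Consumer groups, assumed to share the tenant prefix.
                "Effect": "Allow",
                "Action": ["kafka-cluster:DescribeGroup",
                           "kafka-cluster:AlterGroup"],
                "Resource": f"{base}:group/{cluster_path}/{tenant_id}.*",
            },
        ],
    }


if __name__ == "__main__":
    policy = tenant_consumer_policy("us-east-1", "111122223333",
                                    "example-cluster", "example-uuid", "tenant1")
    print(json.dumps(policy, indent=4))
```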

Trade-offs in a SaaS environment

Although each model provides different capabilities for data partitioning, performance fairness, and tenant isolation, they also come with different costs and complexities. During planning, it’s important to identify what trade-offs you are willing to make for typical customers, and provide a tier structure to your client subscriptions.

The following table summarizes the supported capabilities of the three models in a streaming application.

Capability | Pool | Bridge | Silo
Per-tenant encryption at rest | No | No | Yes
Can implement right to be forgotten for single tenant | No | Yes | Yes
Per-tenant retention policies | No | Yes | Yes
Per-tenant event size limit | No | Yes | Yes
Per-tenant replayability | Yes (must implement with logic in consumers) | Yes | Yes
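One way to use this summary during planning is to encode it as a lookup and pick the most shared model that still meets a tier's requirements. This is a sketch only: the capability keys are hypothetical names for the rows in the table, and the ordering from pool to silo reflects the increasing isolation and cost described in this post.

```python
from typing import Dict, Set

CAPABILITIES: Dict[str, Dict[str, bool]] = {
    "pool": {
        "per_tenant_encryption": False,
        "right_to_be_forgotten": False,
        "per_tenant_retention": False,
        "per_tenant_event_size": False,
        "per_tenant_replay": True,  # requires filtering logic in consumers
    },
    "bridge": {
        "per_tenant_encryption": False,
        "right_to_be_forgotten": True,
        "per_tenant_retention": True,
        "per_tenant_event_size": True,
        "per_tenant_replay": True,
    },
    "silo": {
        "per_tenant_encryption": True,
        "right_to_be_forgotten": True,
        "per_tenant_retention": True,
        "per_tenant_event_size": True,
        "per_tenant_replay": True,
    },
}

# Ordered from most shared to most isolated.
MODELS_BY_ISOLATION = ["pool", "bridge", "silo"]


def most_shared_model(required: Set[str]) -> str:
    """Return the first model, in sharing order, that meets every requirement."""
    for model in MODELS_BY_ISOLATION:
        if all(CAPABILITIES[model][capability] for capability in required):
            return model
    raise ValueError("no model satisfies the requested capabilities")


if __name__ == "__main__":
    print(most_shared_model({"per_tenant_retention", "right_to_be_forgotten"}))
```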

Anti-patterns

In the bridge model, we discussed tenant segregation by topic. An alternative would be segregating by partition, where all messages of a given type are sent to the same topic (for example, orders), but each tenant has a dedicated partition. This approach has many disadvantages and we strongly discourage it. In Kafka, partitions are the unit of horizontal scaling and balancing of brokers and consumers. Assigning partitions per tenants can introduce unbalancing of the cluster, and operational and performance issues that will be hard to overcome.

Some level of data isolation, such as per-tenant encryption keys, could be achieved using client-side encryption, delegating any encryption or decryption to the producer and consumer applications. This approach would allow you to use a separate encryption key per tenant. We don’t recommend this approach because it introduces a higher level of complexity in both the consumer and producer applications. It may also prevent you from using most of the standard programming libraries, Kafka tooling, and most Kafka ecosystem services, like Kafka Connect or MSK Connect.

Conclusion

In this post, we explored three patterns that SaaS vendors can use when architecting multi-tenant streaming applications with Amazon MSK: the pool, bridge, and silo models. Each model presents different trade-offs between operational simplicity, tenant isolation level, and cost efficiency.

The silo model dedicates a full MSK cluster to each tenant, offering a straightforward tenant isolation approach but incurring higher maintenance effort and cost per tenant. The pool model offers increased operational and cost efficiencies by sharing all resources across tenants, but provides limited data partitioning, performance fairness, and tenant isolation capabilities. Finally, the bridge model offers a good compromise between operational and cost efficiencies while providing a good range of options to create robust tenant isolation and performance fairness strategies.

When architecting your multi-tenant streaming solution, carefully evaluate your requirements around tenant isolation, data privacy, per-tenant customization, and performance guarantees to determine the appropriate model. Combine models if needed to find the right balance for your business. As you scale your application, reassess isolation needs and migrate across models accordingly.

As you’ve seen in this post, there is no one-size-fits-all pattern for streaming data in a multi-tenant architecture. Carefully weighing your streaming outcomes and customer needs will help determine the correct trade-offs you can make while making sure your customer data is secure and auditable. Continue your learning journey on SkillBuilder with our SaaS curriculum, get hands-on with an AWS Serverless SaaS workshop or Amazon EKS SaaS workshop, or dive deep with Amazon MSK Labs.


About the Authors

Emanuele Levi is a Solutions Architect in the Enterprise Software and SaaS team, based in London. Emanuele helps UK customers on their journey to refactor monolithic applications into modern microservices SaaS architectures. Emanuele is mainly interested in event-driven patterns and designs, especially when applied to analytics and AI, where he has expertise in the fraud-detection industry.

Lorenzo Nicora is a Senior Streaming Solution Architect helping customers across EMEA. He has been building cloud-native, data-intensive systems for over 25 years, working across industries, in consultancies and product companies. He has leveraged open-source technologies extensively and contributed to several projects, including Apache Flink.

Nicholas Tunney is a Senior Partner Solutions Architect for Worldwide Public Sector at AWS. He works with Global SI partners to develop architectures on AWS for clients in the government, nonprofit healthcare, utility, and education sectors. He is also a core member of the SaaS Technical Field Community where he gets to meet clients from all over the world who are building SaaS on AWS.

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Post Syndicated from Sudipta Mitra original https://aws.amazon.com/blogs/big-data/design-a-data-mesh-pattern-for-amazon-emr-based-data-lakes-using-aws-lake-formation-with-hive-metastore-federation/

In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery.

One of the key challenges in modern big data management is facilitating efficient data sharing and access control across multiple EMR clusters. Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. With the AWS Glue Data Catalog federation to external Hive metastore feature, you can now apply data governance to the metadata residing across those EMR clusters and analyze them using AWS analytics services such as Amazon Athena, Amazon Redshift Spectrum, AWS Glue ETL (extract, transform, and load) jobs, EMR notebooks, EMR Serverless using Lake Formation for fine-grained access control, and Amazon SageMaker Studio. For detailed information on managing your Apache Hive metastore using Lake Formation permissions, refer to Query your Apache Hive metastore with AWS Lake Formation permissions.

In this post, we present a methodology for deploying a data mesh consisting of multiple Hive data warehouses across EMR clusters. This approach enables organizations to take advantage of the scalability and flexibility of EMR clusters while maintaining control and integrity of their data assets across the data mesh.

Use cases for Hive metastore federation for Amazon EMR

Hive metastore federation for Amazon EMR is applicable to the following use cases:

  • Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3) and HBase. These data lakes require governance for access without the necessity of moving data to consumer accounts. The data resides on Amazon S3, which reduces the storage costs significantly.
  • Centralized catalog for published data – Multiple producers release data currently governed by their respective entities. For consumer access, a centralized catalog is necessary where producers can publish their data assets.
  • Consumer personas – Consumers include data analysts who run queries on the data lake, data scientists who prepare data for machine learning (ML) models and conduct exploratory analysis, as well as downstream systems that run batch jobs on the data within the data lake.
  • Cross-producer data access – Consumers may need to access data from multiple producers within the same catalog environment.
  • Data access entitlements – Data access entitlements involve implementing restrictions at the database, table, and column levels to provide appropriate data access control.

Solution overview

The following diagram shows how data from producers with their own Hive metastores (left) can be made available to consumers (right) using Lake Formation permissions enforced in a central governance account.

Producer and consumer are logical concepts used to indicate the production and consumption of data through a catalog. An entity can act both as a producer of data assets and as a consumer of data assets. The onboarding of producers is facilitated by sharing metadata, whereas the onboarding of consumers is based on granting permission to access this metadata.

The solution consists of multiple steps in the producer, catalog, and consumer accounts:

  1. Deploy the AWS CloudFormation templates and set up the producer, central governance and catalog, and consumer accounts.
  2. Test access to the producer cataloged Amazon S3 data using EMR Serverless in the consumer account.
  3. Test access using Athena queries in the consumer account.
  4. Test access using SageMaker Studio in the consumer account.

Producer

Producers create data within their AWS accounts using an Amazon EMR-based data lake and Amazon S3. Multiple producers then publish this data into a central catalog (data lake technology) account. Each producer account, along with the central catalog account, has either VPC peering or AWS Transit Gateway enabled to facilitate AWS Glue Data Catalog federation with the Hive metastore.

For each producer, an AWS Glue Hive metastore connector AWS Lambda function is deployed in the catalog account. This enables the Data Catalog to access Hive metastore information at runtime from the producer. The data lake locations (the S3 bucket location of the producers) are registered in the catalog account.

Central catalog

A catalog offers governed and secure data access to consumers. Federated databases are established within the catalog account’s Data Catalog using the Hive connection, managed by the catalog Lake Formation admin (LF-Admin). These federated databases in the catalog account are then shared by the data lake LF-Admin with the consumer LF-Admin of the external consumer account.

Data access entitlements are managed by applying access controls as needed at various levels, such as the database or table.

Consumer

The consumer LF-Admin grants the necessary permissions or restricted permissions to roles such as data analysts, data scientists, and downstream batch processing engine AWS Identity and Access Management (IAM) roles within its account.

Data access entitlements are managed by applying access control based on requirements at various levels, such as databases and tables.

Prerequisites

You need three AWS accounts with admin access to implement this solution. It is recommended to use test accounts. The producer account will host the EMR cluster and S3 buckets. The catalog account will host Lake Formation and AWS Glue. The consumer account will host EMR Serverless, Athena, and SageMaker notebooks.

Set up the producer account

Before you launch the CloudFormation stack, gather the following information from the catalog account:

  • Catalog AWS account ID (12-digit account ID)
  • Catalog VPC ID (for example, vpc-xxxxxxxx)
  • VPC CIDR (catalog account VPC CIDR; it should not overlap 10.0.0.0/16)

The VPC CIDR of the producer and catalog can’t overlap due to VPC peering and Transit Gateway requirements. The VPC CIDR should be a VPC from the catalog account where the AWS Glue metastore connector Lambda function will be eventually deployed.
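Before launching the stack, you can check the catalog VPC CIDR against the producer CIDR with Python's standard `ipaddress` module. This is a small convenience sketch; the candidate CIDR values in the example are hypothetical.

```python
import ipaddress

PRODUCER_CIDR = "10.0.0.0/16"  # CIDR of the VPC created by the producer stack


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share any address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


if __name__ == "__main__":
    # Hypothetical catalog VPC CIDRs to check against the producer VPC.
    for candidate in ("10.0.128.0/20", "10.1.0.0/16", "172.31.0.0/16"):
        status = "overlaps" if cidrs_overlap(PRODUCER_CIDR, candidate) else "ok"
        print(f"{candidate}: {status}")
```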

The CloudFormation stack for the producer creates the following resources:

  • S3 bucket to host data for the Hive metastore of the EMR cluster.
  • VPC with the CIDR 10.0.0.0/16. Make sure there is no existing VPC with this CIDR in use.
  • VPC peering connection between the producer and catalog account.
  • Amazon Elastic Compute Cloud (Amazon EC2) security groups for the EMR cluster.
  • IAM roles required for the solution.
  • EMR 6.10 cluster launched with Hive.
  • Sample data downloaded to the S3 bucket.
  • A database and external tables, pointing to the downloaded sample data, in its Hive metastore.

Complete the following steps:

  1. Launch the template PRODUCER.yml. It’s recommended to use an IAM role that has administrator privileges.
  2. Gather the values for the following on the CloudFormation stack’s Outputs tab:
    1. VpcPeeringConnectionId (for example, pcx-xxxxxxxxx)
    2. DestinationCidrBlock (10.0.0.0/16)
    3. S3ProducerDataLakeBucketName

Set up the catalog account

The CloudFormation stack for the catalog account creates the Lambda function for federation. Before you launch the template, on the Lake Formation console, add the IAM role and user deploying the stack as the data lake admin.

Then complete the following steps:

  1. Launch the template CATALOG.yml.
  2. For the RouteTableId parameter, use the catalog account VPC RouteTableId. This is the VPC where the AWS Glue Hive metastore connector Lambda function will be deployed.
  3. On the stack’s Outputs tab, copy the value for LFRegisterLocationServiceRole (arn:aws:iam::account-id:role/role-name).
  4. Confirm that the Data Catalog settings have the IAM access control options unchecked and that the current cross-account version is set to 4.

  1. Log in to the producer account and add the following bucket policy to the producer S3 bucket that was created during the producer account setup. Add the ARN of LFRegisterLocationServiceRole to the Principal section and provide the S3 bucket name under the Resource section.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::account-id:role/role-name"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::s3-bucket-name/*",
                "arn:aws:s3:::s3-bucket-name"
            ]
        }
    ]
}
  1. In the producer account, on the Amazon EMR console, navigate to the primary node EC2 instance to get the value for Private IP DNS name (IPv4 only) (for example, ip-xx-x-x-xx.us-west-1.compute.internal).

  1. Switch to the catalog account and deploy the AWS Glue Data Catalog federation Lambda function (GlueDataCatalogFederation-HiveMetastore).

The default Region is set to us-east-1. Change it to your desired Region before deploying the function.

Use the VPC that was used as the CloudFormation input for the VPC CIDR. You can use the VPC’s default security group ID. If using another security group, make sure its outbound rules allow traffic to 0.0.0.0/0.

Next, you create a federated database in Lake Formation.

  1. On the Lake Formation console, choose Data sharing in the navigation pane.
  2. Choose Create database.

  1. Provide the following information:
    1. For Connection name, choose your connection.
    2. For Database name, enter a name for your database.
    3. For Database identifier, enter emrhms_salesdb (this is the database created on the EMR Hive metastore).
  2. Choose Create database.

  1. On the Databases page, select the database and on the Actions menu, choose Grant to grant describe permissions to the consumer account.

  1. Under Principals, select External accounts and choose your account ARN.
  2. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database and table.
  3. Under Table permissions, provide the following information:
    1. For Table permissions, select Select and Describe.
    2. For Grantable permissions, select Select and Describe.
  4. Under Data permissions, select All data access.
  5. Choose Grant.

  1. On the Tables page, select your table and on the Actions menu, choose Grant to grant select and describe permissions.

  1. Under Principals, select External accounts and choose your account ARN.
  2. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database.
  3. Under Database permissions, provide the following information:
    1. For Database permissions, select Create table and Describe.
    2. For Grantable permissions, select Create table and Describe.
  4. Choose Grant.
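The same grants can also be scripted. The sketch below only builds the request dictionaries, shaped after the parameters of the boto3 Lake Formation client's `grant_permissions` call; the account IDs, database name, and table name in the example are placeholders, and you should confirm the request shape against the current API reference.

```python
from typing import Dict, List


def table_grant_request(consumer_account_id: str, catalog_id: str,
                        database: str, table: str) -> Dict:
    """Request granting Select and Describe on one table, with grant option."""
    permissions: List[str] = ["SELECT", "DESCRIBE"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"CatalogId": catalog_id,
                               "DatabaseName": database,
                               "Name": table}},
        "Permissions": permissions,
        "PermissionsWithGrantOption": list(permissions),
    }


def database_grant_request(consumer_account_id: str, catalog_id: str,
                           database: str) -> Dict:
    """Request granting Create table and Describe on the database."""
    permissions: List[str] = ["CREATE_TABLE", "DESCRIBE"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Database": {"CatalogId": catalog_id, "Name": database}},
        "Permissions": permissions,
        "PermissionsWithGrantOption": list(permissions),
    }


if __name__ == "__main__":
    # Placeholder identifiers. With credentials for the catalog account, each
    # request could be passed to
    # boto3.client("lakeformation").grant_permissions(**request).
    print(table_grant_request("111122223333", "444455556666",
                              "federated_salesdb", "example_table"))
```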

Set up the consumer account

Consumers include data analysts who run queries on the data lake, data scientists who prepare data for ML models and conduct exploratory analysis, as well as downstream systems that run batch jobs on the data within the data lake.

The consumer account setup in this section shows how you can query the shared Hive metastore data using Athena for the data analyst persona, EMR Serverless to run batch scripts, and SageMaker Studio for the data scientist to further use data in the downstream model building process.

For EMR Serverless and SageMaker Studio, if you’re using the default IAM service role, add the required Data Catalog and Lake Formation IAM permissions to the role and use Lake Formation to grant table permission access to the role’s ARN.

Data analyst use case

In this section, we demonstrate how a data analyst can query the Hive metastore data using Athena. Before you get started, on the Lake Formation console, add the IAM role or user deploying the CloudFormation stack as the data lake admin.

Then complete the following steps:

  1. Run the CloudFormation template CONSUMER.yml.
  2. If the catalog and consumer accounts are not part of the organization in AWS Organizations, navigate to the AWS Resource Access Manager (AWS RAM) console and manually accept the resources shared from the catalog account.
  3. On the Lake Formation console, on the Databases page, select your database and on the Actions menu, choose Create resource link.

  1. Under Database resource link details, provide the following information:
    1. For Resource link name, enter a name.
    2. For Shared database’s region, choose a Region.
    3. For Shared database, choose your database.
    4. For Shared database’s owner ID, enter the account ID.
  2. Choose Create.
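If you prefer to script the resource link, it can be expressed as an AWS Glue database whose target is the shared database. The sketch below builds the request for the boto3 Glue client's `create_database` call; the link name, account ID, database name, and Region in the example are placeholders.

```python
from typing import Dict


def resource_link_request(link_name: str, owner_account_id: str,
                          shared_database: str, region: str) -> Dict:
    """Build a create_database request for a database resource link."""
    return {
        "DatabaseInput": {
            "Name": link_name,
            "TargetDatabase": {
                "CatalogId": owner_account_id,    # shared database's owner ID
                "DatabaseName": shared_database,  # database shared by the catalog
                "Region": region,                 # shared database's Region
            },
        }
    }


if __name__ == "__main__":
    # Placeholder values. With credentials for the consumer account, the
    # request could be passed to boto3.client("glue").create_database(**request).
    print(resource_link_request("salesdb_link", "444455556666",
                                "federated_salesdb", "us-east-1"))
```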

Now you can use Athena to query the table on the consumer side, as shown in the following screenshot.

Batch job use case

Complete the following steps to set up EMR Serverless to run a sample Spark job to query the existing table:

  1. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
  2. Choose Get started.

  1. Choose Create and launch EMR Studio.

  1. Under Application settings, provide the following information:
    1. For Name, enter a name.
    2. For Type, choose Spark.
    3. For Release version, choose the current version.
    4. For Architecture, select x86_64.
  2. Under Application setup options, select Use custom settings.

  1. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  2. Choose Create and start application.

  1. On the application details page, on the Job runs tab, choose Submit job run.

  1. Under Job details, provide the following information:
    1. For Name, enter a name.
    2. For Runtime role, choose Create new role.
    3. Note the IAM role that gets created.
    4. For Script location, enter the S3 bucket location created by the CloudFormation template (the script is emr-serverless-query-script.py).
  2. Choose Submit job run.

  1. Add the following AWS Glue access policy to the IAM role created in the previous step (provide your Region and the account ID of your catalog account):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:GetUserDefinedFunctions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:1234567890:catalog",
                "arn:aws:glue:us-east-1:1234567890:database/*",
                "arn:aws:glue:us-east-1:1234567890:table/*/*"
            ]
        }
    ]
}
  1. Add the following Lake Formation access policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "LakeFormation:GetDataAccess",
            "Resource": "*"
        }
    ]
}
  1. On the Databases page, select the database and on the Actions menu, choose Grant to grant Lake Formation access to the EMR Serverless runtime role.
  2. Under Principals, select IAM users and roles and choose your role.
  3. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database.
  4. Under Resource link permissions, for Resource link permissions, select Describe.
  5. Choose Grant.

  1. On the Databases page, select the database and on the Actions menu, choose Grant on target.

  1. Provide the following information:
    1. Under Principals, select IAM users and roles and choose your role.
    2. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database and table.
    3. Under Table permissions, for Table permissions, select Select.
    4. Under Data permissions, select All data access.
  2. Choose Grant.

  1. Submit the job again by cloning it.
  2. When the job is complete, choose View logs.

The output should look like the following screenshot.
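The two IAM policies in the steps above differ between deployments only in the Region and the catalog account ID. As a minimal sketch, the following Python helpers generate both documents from those two values. The function names and the example Region and account ID are assumptions for illustration, not part of the CloudFormation templates.

```python
import json

# Glue actions copied from the access policy shown in the steps above.
GLUE_ACTIONS = [
    "glue:GetDatabase", "glue:CreateDatabase", "glue:GetDatabases",
    "glue:CreateTable", "glue:GetTable", "glue:UpdateTable",
    "glue:DeleteTable", "glue:GetTables", "glue:GetPartition",
    "glue:GetPartitions", "glue:CreatePartition",
    "glue:BatchCreatePartition", "glue:GetUserDefinedFunctions",
]


def glue_catalog_policy(region: str, catalog_account_id: str) -> dict:
    """Build the Glue access policy scoped to one Region and catalog account."""
    prefix = f"arn:aws:glue:{region}:{catalog_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(GLUE_ACTIONS),
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/*",
                    f"{prefix}:table/*/*",
                ],
            }
        ],
    }


def lake_formation_policy() -> dict:
    """Build the Lake Formation data access policy (no Region or account needed)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "lakeformation:GetDataAccess",
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical Region and account ID; substitute your own values.
    print(json.dumps(glue_catalog_policy("us-east-1", "123456789012"), indent=4))
    print(json.dumps(lake_formation_policy(), indent=4))
```

Serializing with json.dumps also guards against hand-editing mistakes such as a missing comma between keys.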

Data scientist use case

For this use case, a data scientist queries the data through SageMaker Studio. Complete the following steps:

  1. Set up SageMaker Studio.
  2. Confirm that the domain user role has been granted permission by Lake Formation to SELECT data from the table.
  3. Follow steps similar to the batch job use case to grant access.

The following screenshot shows an example notebook.

Clean up

We recommend deleting the CloudFormation stack after use, because the deployed resources will incur costs. There are no prerequisites to delete the producer, catalog, and consumer CloudFormation stacks. To delete the Hive metastore connector stack on the catalog account (serverlessrepo-GlueDataCatalogFederation-HiveMetastore), first delete the federated database you created.

Conclusion

In this post, we explained how to create a federated Hive metastore for deploying a data mesh architecture with multiple Hive data warehouses across EMR clusters.

By using Data Catalog metadata federation, organizations can construct a sophisticated data architecture. This approach not only seamlessly extends your Hive data warehouse but also consolidates access control and fosters integration with various AWS analytics services. Through effective data governance and meticulous orchestration of the data mesh architecture, organizations can provide data integrity, regulatory compliance, and enhanced data sharing across EMR clusters.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore how to implement a data mesh architecture across multiple EMR clusters. To learn more and get started, refer to the following resources:


About the Authors

Sudipta Mitra is a Senior Data Architect for AWS who is passionate about helping customers build modern data analytics applications by making innovative use of the latest AWS services and their constantly evolving features. He is a pragmatic architect who works backwards from customer needs, making customers comfortable with the proposed solution and helping them achieve tangible business outcomes. His main areas of work are Data Mesh, Data Lake, Knowledge Graph, Data Security and Data Governance.

Deepak Sharma is a Senior Data Architect with the AWS Professional Services team, specializing in big data and analytics solutions. With extensive experience in designing and implementing scalable data architectures, he collaborates closely with enterprise customers to build robust data lakes and advanced analytical applications on the AWS platform.

Nanda Chinnappa is a Cloud Infrastructure Architect with AWS Professional Services at Amazon Web Services. Nanda specializes in Infrastructure Automation, Cloud Migration, Disaster Recovery and Databases, including Amazon RDS and Amazon Aurora. He helps AWS customers adopt the AWS Cloud and realize their business outcomes by executing cloud computing initiatives.

AWS CodeBuild Managed Self-Hosted GitHub Action Runners

Post Syndicated from Matt Laver original https://aws.amazon.com/blogs/devops/aws-codebuild-managed-self-hosted-github-action-runners/

AWS CodeBuild now supports managed self-hosted GitHub Action runners, allowing you to build powerful CI/CD capabilities right beside your code and quickly implement a build, test, and deploy pipeline. Last year, AWS announced that customers can define their GitHub Actions steps within any phase of a CodeBuild buildspec file. With a self-hosted runner, by contrast, jobs are dispatched from GitHub Actions on GitHub.com to a system that you deploy and manage.

With the recent announcement that AWS CodeBuild now supports managed GitHub Action runners, AWS can take care of managing the hosting of GitHub Action self-hosted runners within CodeBuild allowing teams to run their GitHub Actions workflow jobs natively within AWS.

For customers managing their self-hosted runners on their own infrastructure, CodeBuild can now provide a secure, scalable and lower latency solution. In addition, CodeBuild managed self-hosted GitHub Action runners bring features, such as:

With the compute options available, customers can now run tests on hardware and operating system combinations that closely match production and reduce manual operational tasks by shifting the management of the runners to AWS.

In this blog, I will explore how AWS managed GitHub Action self-hosted runners work by building and deploying an application to AWS using GitHub Actions.

Architecture overview

The architecture of what I’ll be building can be seen below:

Figure 1. Architecture diagram of AWS CodeBuild running managed Self-Hosted GitHub Actions Runners

The architecture above shows how a developer pushes code changes to GitHub, which triggers CodeBuild through a webhook. CodeBuild then runs the defined GitHub Action Workflow, which builds the application and deploys it to AWS Lambda.

Step 1. Build an AWS Lambda Function

I’ll start with a simple application to demonstrate how to build and deploy an application on AWS via a managed self-hosted GitHub Actions runner. We’ve written before about why AWS is the best place to run Rust, and Amazon CTO Werner Vogels has been an outspoken advocate for exploring energy-efficient programming languages like Rust. AWS also has great guides on using Rust to build on AWS, such as:

Cargo Lambda is one of the simplest ways to run, build, and deploy Rust Lambda functions on AWS. I’ll start with the Getting Started guide:

  1. Navigate to GitHub.com and create a new GitHub repository

    Figure 2. Create a new GitHub repository

  2. Clone the repository locally:
    git clone https://github.com/{{user-name}}/rust-api-demo.git
  3. From within the cloned repository, install Cargo Lambda. For macOS and Linux:
    brew tap cargo-lambda/cargo-lambda
    brew install cargo-lambda

    Windows users can follow the guide to see all the ways that you can install Cargo Lambda in your system.

  4. Use Cargo Lambda to create a new project:
    cargo lambda new new-lambda-project && cd new-lambda-project

It’s now possible to explore the project. In this case, I am using JetBrains RustRover with Amazon Q Developer installed to increase my productivity while working on the application:

Figure 3. JetBrains RustRover with Amazon Q Developer

Amazon Q Developer is available on a free tier and provides real-time code suggestions as well as advanced suggestions such as in-built chat to reason with the code we’re working on.

  5. Add, commit, and push the code to your GitHub account:
    git add .
    git commit -m "Initial Commit"
    git push origin

Step 2. Create AWS CodeBuild Project

The AWS documentation outlines how to set up self-hosted GitHub Actions runners in AWS CodeBuild. The key is to set up a GitHub webhook event of event type WORKFLOW_JOB_QUEUED so that CodeBuild only processes GitHub Actions workflow jobs.

I will create a new CodeBuild project as per the documentation to connect CodeBuild to our GitHub repository and correctly configure a webhook to trigger the GitHub Actions.

  1. Open the AWS CodeBuild console
  2. Create a build project.
    • In Source:
      • For Source provider, choose GitHub.
      • For Repository, choose Repository in my GitHub account.
      • For Repository URL, enter https://github.com/user-name/repository-name.
    • In Primary source webhook events:
      • For Webhook – optional, select Rebuild every time a code change is pushed to this repository.
      • For Event type, select WORKFLOW_JOB_QUEUED. Once this is enabled, builds will only be triggered by GitHub Actions workflow jobs events.

        Figure 4. WORKFLOW_JOB_QUEUED Event Type

    • In Environment:
      • Choose a supported Environment image and Compute. Note that you have the option to override the image and instance settings by using a label in your GitHub Actions workflow YAML.
    • In Buildspec:
      • Note that your Buildspec will be ignored. Instead, CodeBuild will override it to use commands that set up the self-hosted runner. This project’s primary responsibility is to set up a self-hosted runner in CodeBuild to run GitHub Actions workflow jobs.
  3. Continue with the remaining default options and select Create build project.

CodeBuild Service Role Permissions

To create and deploy a Lambda function, the CodeBuild service role requires the necessary permissions. The service role can be seen when editing the CodeBuild project:

Figure 5. CodeBuild Service Role Permissions

The required Lambda permissions are documented in the Cargo Lambda documentation:

  • lambda:GetFunction
  • lambda:CreateFunction
  • lambda:UpdateFunctionCode

In addition, there are also IAM permissions required:

  • iam:CreateRole
  • iam:AttachRolePolicy
  • iam:UpdateAssumeRolePolicy
  • iam:PassRole

Add the required permissions to the service role for the CodeBuild project:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:UpdateAssumeRolePolicy",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::{AWS:Account}:role/AWSLambdaBasicExecutionRole",
                "arn:aws:iam::{AWS:Account}:role/cargo-lambda-role*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:UpdateFunctionCode",
                "lambda:GetFunction"
            ],
            "Resource": "arn:aws:lambda:{AWS:Region}:{AWS:Account}:function:{function-name}"
        }
    ]
} 

Note that I do not need to manage IAM permissions outside of our AWS Account, for example GitHub does not need to know about our AWS permissions.
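Before triggering a deployment, it can help to confirm that a policy document covers every action listed above. The following is a minimal sketch of such a pre-flight check in Python. The helper names are assumptions, and the check deliberately ignores Resource scoping, Deny statements, and conditions, so it is a quick sanity test rather than a substitute for IAM policy evaluation.

```python
import fnmatch

# Actions listed in this section as required by Cargo Lambda deployments.
REQUIRED_ACTIONS = [
    "lambda:GetFunction",
    "lambda:CreateFunction",
    "lambda:UpdateFunctionCode",
    "iam:CreateRole",
    "iam:AttachRolePolicy",
    "iam:UpdateAssumeRolePolicy",
    "iam:PassRole",
]


def allowed_patterns(policy: dict) -> list:
    """Collect lower-cased action patterns from all Allow statements."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        patterns.extend(action.lower() for action in actions)
    return patterns


def missing_actions(policy: dict, required=REQUIRED_ACTIONS) -> list:
    """Return the required actions that no Allow statement matches.

    IAM action names are case-insensitive and may contain wildcards,
    so matching is done on lower-cased names with fnmatch.
    """
    patterns = allowed_patterns(policy)
    return [
        action
        for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns)
    ]


if __name__ == "__main__":
    # Hypothetical policy that forgot the IAM permissions.
    incomplete = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "lambda:*", "Resource": "*"}],
    }
    print(missing_actions(incomplete))
```

An empty list means every required action is matched by at least one Allow statement.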

Step 3. Create a GitHub Action Workflow

GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that provides automation through building, testing, and deploying applications. In this section we will create a GitHub Action Workflow to build and deploy our Lambda function.

  1. Navigate back to our GitHub project and create a workflow within the .github/workflows directory. The Simple Workflow is a good starting point:

    Figure 6. Create a Simple Workflow

  2. Update the Job to include the tooling required to build our Rust Lambda function, the details can be found in the GitHub Actions section. Our workflow file should now look like this:
name: rust-api-demo-cicd

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

env:
  CARGO_TERM_COLOR: always

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
      - name: Install Zig toolchain
        uses: korandoru/setup-zig@v1
        with:
          zig-version: 0.10.0
      - name: Install Cargo Lambda
        uses: jaxxstorm/action-install-gh-release@v1.9.0
        with:
          repo: cargo-lambda/cargo-lambda
          tag: v0.14.0 
          platform: linux 
          arch: x86_64 
          # Add your build steps below

The above GitHub Actions Workflow currently runs on GitHub. However, I now want to make two further changes:

  • Define an AWS CodeBuild runner
  • Define Build and Deploy Lambda steps

Define an AWS CodeBuild runner

A GitHub Actions workflow is made up of one or more jobs, and each job runs in a runner environment specified by runs-on. To specify CodeBuild as a runner, the value for runs-on takes the format:

runs-on: codebuild-<CodeBuildProjectName>-${{ github.run_id }}-${{ github.run_attempt }}

I will update the <CodeBuildProjectName> to the CodeBuild project name that was entered in Step 2, for example “GitHubActionsDemo”.

When configuring CodeBuild as a runner environment, BuildSpecs are ignored. In order to define the specification of our build environments it is possible to pass in variables including:

  • EC2 compute builds: Image, image version, instance size
  • Lambda compute builds: environment type, runtime version, instance size

For further details, see the action runner guide.
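To make the label format concrete, here is a minimal Python sketch that renders the runs-on expression for a project name and parses a resolved label (after GitHub has substituted the run ID and attempt) back into its parts. The function names and the sample run ID are assumptions for illustration. The parser handles only the base format shown above, not labels with additional override suffixes.

```python
import re

# Base label format only: codebuild-<project>-<run_id>-<run_attempt>
RESOLVED_LABEL = re.compile(
    r"^codebuild-(?P<project>.+)-(?P<run_id>\d+)-(?P<run_attempt>\d+)$"
)


def runs_on_expression(project_name: str) -> str:
    """Return the runs-on value to place in the workflow YAML."""
    return (
        "codebuild-" + project_name
        + "-${{ github.run_id }}-${{ github.run_attempt }}"
    )


def parse_resolved_label(label: str):
    """Split a resolved label into project, run ID, and attempt.

    Returns None when the label does not follow the base format.
    """
    match = RESOLVED_LABEL.match(label)
    if match is None:
        return None
    return {
        "project": match.group("project"),
        "run_id": int(match.group("run_id")),
        "run_attempt": int(match.group("run_attempt")),
    }


if __name__ == "__main__":
    print(runs_on_expression("GitHubActionsDemo"))
    # Hypothetical run ID and attempt number.
    print(parse_resolved_label("codebuild-GitHubActionsDemo-9876543210-1"))
```

Because the run ID and attempt are part of the label, each workflow job asks for a runner label unique to that run.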

Define Build and Deploy Lambda steps

The last change is to add steps to check out our code onto the runner and then build and deploy using cargo lambda:

      - name: Build Rust API
        uses: actions/checkout@v4
      - run: cargo lambda build --release
      - run: cargo lambda deploy

The final workflow looks like this:

name: rust-api-demo-cicd

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

env:
  CARGO_TERM_COLOR: always

jobs:
  build:
    runs-on: codebuild-GitHubActionsDemo-${{ github.run_id }}-${{ github.run_attempt }}
    steps:
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
      - name: Install Zig toolchain
        uses: korandoru/setup-zig@v1
        with:
          zig-version: 0.10.0
      - name: Install Cargo Lambda
        uses: jaxxstorm/action-install-gh-release@v1.9.0
        with:
          repo: cargo-lambda/cargo-lambda
          tag: v0.14.0 
          platform: linux 
          arch: x86_64 
          # Add your build steps below
      - name: Build Rust API
        uses: actions/checkout@v4
      - run: cargo lambda build --release
      - run: cargo lambda deploy

When I commit changes to the workflow to the main branch it will trigger the GitHub Action.

Step 4. Test our GitHub Action Workflow

The GitHub Action is currently triggered on all pushes and pull requests to the main branch:

Figure 7. Trigger a build

Note that GitHub is where the CI/CD process is being driven, so the build logs are available in GitHub while the job is running:

Figure 8. GitHub Action Logs

As the build progresses through the deployment step, the details of the Lambda function deployed are shown:

Figure 9. Deployment ARN

Navigating back to the AWS Console, the deployed Lambda Function can be seen:

Figure 10. Lambda Deployed

And finally, opening the CodeBuild console, it’s possible to observe the status of the Managed GitHub Actions Runner, the build number and also the duration:

Figure 11. Lambda Deployed via Managed Self-Hosted GitHub Action runner

Clean Up

To avoid incurring future charges:

  1. Delete the Lambda created via the deployment in Step 4.
  2. Delete the CodeBuild Project created in Step 2.

Conclusion

As I’ve shown in this blog, setting up GitHub Actions workflows that run on AWS is now even easier: CodeBuild projects can receive GitHub Actions workflow job events and run them on ephemeral CodeBuild hosts. AWS customers benefit from native integration with AWS, which brings security and convenience through features such as defining service role permissions with AWS IAM or passing credentials as environment variables to build jobs with AWS Secrets Manager.

Using CodeBuild’s reserved capacity, you can provision a fleet of CodeBuild hosts that persist your build environment. These hosts remain available to receive subsequent build requests, which reduces build start-up latencies. It is also possible to compile your software within your VPC and access resources such as Amazon Relational Database Service, Amazon ElastiCache, or any service endpoints that are only reachable from within a specific VPC.

CodeBuild-hosted GitHub Actions runners are supported in all CodeBuild regions and customers managing their CI/CD processes via GitHub Actions can use the compute platforms CodeBuild offers, including Lambda, Windows, Linux, Linux GPU-enhanced and Arm-based instances powered by AWS Graviton Processors.

Read more in our documentation for GitHub Action runner in AWS CodeBuild.

About the author:

Matt Laver is a Senior Solutions Architect at AWS working with SMB customers in EMEA. He is passionate about DevOps and loves helping customers find simple solutions to difficult problems.

Secure file sharing solutions in AWS: A security and cost analysis guide, Part 1

Post Syndicated from Sumit Bhati original https://aws.amazon.com/blogs/security/how-to-securely-transfer-files-with-presigned-urls/

July 28, 2025: This post has been updated and expanded into a comprehensive two-part series covering multiple AWS file sharing solutions. This new series provides in-depth analysis of security and cost considerations to help you make informed decisions based on your requirements.


Note: This is Part 1 of a two-part post. You can read Part 2 here.

Sharing files with an outside entity—to share data between business partners or facilitate customer access to files—is a common use case for Amazon Web Services (AWS) customers. Organizations must balance security, cost, and usability. In a business-to-business data sharing scenario, these challenges become even more complex because human interaction is often minimal or absent, requiring robust automated solutions. Many AWS services offer multiple options for granting access. The one that’s best for your use case depends on multiple factors.

This post helps you decide which AWS services to use to implement a file sharing approach that suits your business needs. We focus on security controls and cost implications, describe some of the trade-offs, and highlight key differences to help you make an informed decision based on your specific requirements. We go through each option, highlighting their strengths and limitations, and provide guidance on choosing the right solution for your use case.

Understand your needs first

The first step in designing an AWS file sharing solution is to develop a clear understanding of your requirements and constraints. Because there are several possible design patterns and a number of different AWS services to consider, you need to start by identifying and prioritizing the features that you need. Gather the following information to guide your approach:

Access patterns and scale

When planning for access patterns and scale, there are a few key factors to keep in mind. First, consider how files are shared—machine-to-machine, human-to-machine, or human-to-human—because that impacts security and performance. Then, think about transfer frequency—are files exchanged only once a day, or are thousands moving every hour? If download control matters, setting limits on how often a file can be accessed might be necessary. File sizes also play a role, from typical everyday transfers to the largest files you need to support. Finally, total data volume shapes how much information you’ll be transferring on a regular basis.
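Transfer frequency and file size together determine the total data volume, so a rough estimate is a useful input when comparing solutions. The following Python sketch shows the arithmetic. The function name and the example workload (1,000 files per hour averaging 5 MiB) are assumptions for illustration only.

```python
def monthly_transfer_volume_gib(
    files_per_hour: float,
    average_file_size_mib: float,
    hours_per_day: float = 24,
    days_per_month: float = 30,
) -> float:
    """Estimate the monthly transfer volume in GiB from frequency and file size."""
    total_mib = (
        files_per_hour * average_file_size_mib * hours_per_day * days_per_month
    )
    return total_mib / 1024  # 1 GiB = 1024 MiB


if __name__ == "__main__":
    # Hypothetical workload: 1,000 files per hour, 5 MiB each, around the clock.
    volume = monthly_transfer_volume_gib(files_per_hour=1000, average_file_size_mib=5)
    print(f"Estimated volume: {volume:,.1f} GiB per month")
```

Running the same estimate with your largest expected file size gives an upper bound to check against service limits and data transfer pricing.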

Technical requirements

Your choice of solution will be influenced by technical constraints and capabilities. Protocol requirements often drive initial decisions, such as whether you need SFTP, FTPS, or HTTPS access. Consider existing systems that must interface with your solution and how they’ll connect. Performance considerations span several dimensions: acceptable latency for file transfers, geographic distribution of your users, bandwidth requirements, and whether you need built-in retry mechanisms for failed transfers. Additionally, think about how many simultaneous transfers your solution needs to support.

Security and compliance

Security and compliance requirements will definitely influence your file sharing strategy. Consider who controls encryption keys—whether managed by AWS or your organization—and what key rotation policies are needed. Authentication needs often vary—you might be authenticating individual users, specific systems, or entire business entities, using methods ranging from passwords to API keys, multi-factor authentication, or certificates. Your audit requirements will influence your choices in logging and monitoring capabilities. You might have geographic considerations like data sovereignty requirements, storage location restrictions, and access controls that consider the recipient’s location. If your data is subject to a law, like GDPR in Europe or HIPAA in the United States, or if your data is regulated by a standard like the Payment Card Industry’s Data Security Standard (PCI-DSS), you will need to consult with your own legal and compliance advisors to see what is required. When assessing risk tolerance, consider the security triad of confidentiality, integrity, and availability—some use cases might tolerate brief periods of unavailability but cannot risk data exposure, while others prioritize continuous availability.

Operational requirements

Day-to-day operations bring their own set of considerations. File retention policies determine how long data needs to be kept, while auto-deletion capabilities might be necessary for managing storage and compliance. Consider what kind of reporting and monitoring of file transfer activities you need. Do you need monthly reports, daily reports, or perhaps detailed real-time tracking of transfer activities? By adding error handling and notification systems, you can help make sure that problems are caught and addressed promptly. Disaster recovery requirements, expressed through recovery point objectives (RPO) and recovery time objectives (RTO), help determine the resilience needed in your solution.

Business constraints

Your solution must operate within your business constraints, such as budget limitations, technical limitations, timelines, available expertise, and service level agreements (SLAs). Budget limitations include initial implementation costs and ongoing operational expenses. Consider other parties’ technical limitations—they might use specific protocols such as SFTP, require mobile device compatibility, or operate older systems that have limited cryptographic capabilities. Implementation timelines influence choices between managed services that can be deployed quickly and custom solutions that require more time and expertise. The expertise available for solution maintenance is also a consideration. SLAs for file transfers might specify availability and performance requirements that you’re obligated to meet. To meet these constraints, you must estimate how much your file sharing needs will grow over time and determine if you need a regional or a global solution.

By carefully considering these aspects, you’ll be better prepared to evaluate different AWS file sharing solutions and select the one that best fits your use case. Understanding your requirements for uploads and downloads will help determine if your use case can be supported through a single AWS service or needs a combination of services.

Solutions

Let’s start by looking at the various file sharing mechanisms that AWS supports. The following table lists the solutions described in this post, identifying the key AWS services needed for each one along with its security features, cost implications, and Region control.

| Solution | AWS services | Security features | Cost* | Region control |
| --- | --- | --- | --- | --- |
| AWS Transfer Family | Transfer Family, Amazon S3, API Gateway, and Lambda | Managed security, encryption in transit and at rest, IAM integration, and custom authentication | $0.30 per hour per protocol, data transfer fees, and storage costs | Can deploy to specific AWS Regions; can only transfer files to and from S3 buckets in the same Region |
| Transfer Family web apps | Transfer Family, S3, and CloudFront | Browser-based access, IAM Identity Center integration, and S3 Access Grants | Pay-per-file operation, CloudFront costs, and storage costs | Uses CloudFront (global) for web access, but backend components can be Region-specific |
| Amazon S3 presigned URLs | S3 | Time-limited URLs, IAM controls for URL generation, and HTTPS | S3 request and data transfer fees | Can be restricted to specific Regions |
| Serverless application with Amazon S3 presigned URLs | S3, AWS Lambda, and API Gateway | Time-limited URLs, HTTPS, IAM controls, and customizable authentication | Pay per request and minimal infrastructure cost | Components can be Region-specific |

The following table shows the solutions described in Part 2.

| Solution | AWS services | Security features | Cost* | Region control |
| --- | --- | --- | --- | --- |
| CloudFront signed URLs | CloudFront, Amazon S3, and Lambda | Optional edge security using AWS Lambda@Edge, AWS WAF integration, SSL/TLS, geo restrictions, and AWS Shield Standard (included automatically) | Content delivery network (CDN) costs, request pricing, and data transfer fees | Global service by design; origin can be AWS Region-specific |
| Amazon VPC endpoint service | PrivateLink, VPC, and NLB | Complete network isolation, private connectivity, and multi-layer security | Endpoint hourly charges, NLB costs, and data processing fees | Service endpoints are strictly Region-specific; must create endpoints in each Region where access is needed |
| S3 Access Points | S3, IAM, VPC (for VPC-specific access points) | Dedicated IAM policies per access point; VPC-only access restrictions available; works with bucket policies for layered security; supports AWS PrivateLink for private network access; compatible with S3 Block Public Access settings | No additional charge for S3 Access Points; standard S3 request pricing applies; data transfer fees apply based on standard S3 rates; VPC endpoint charges apply when using VPC endpoints with access points | Access points are Region-specific; each access point is created in the same Region as its S3 bucket; cross-Region access requires separate access points in each Region; VPC-specific access points are limited to the VPC’s Region |

* Pricing information provided is based on AWS service rates at the time of publication and is intended as an estimation only. Additional costs may be incurred depending on your specific implementation and usage patterns. For the most current and accurate pricing details, please consult the official AWS pricing pages for each service mentioned.

Let’s examine the solutions in detail.

AWS Transfer Family

AWS Transfer Family is a managed file transfer service for SFTP, FTPS, and AS2 protocols. It integrates directly with Amazon Simple Storage Service (Amazon S3) for storage and supports custom identity providers for authentication through Amazon API Gateway and AWS Lambda.

As shown in Figure 1, when a user initiates a file transfer, Transfer Family authenticates them through the configured identity provider using API Gateway and Lambda. After authentication succeeds, the service maps the user to an AWS Identity and Access Management (IAM) role that defines their S3 bucket access permissions. The service encrypts data in transit using TLS 1.2 and data at rest using S3 server-side encryption.

Figure 1: AWS Transfer Family architecture
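The custom identity provider step in this flow can be sketched as a small Lambda handler in Python. This is an illustrative sketch, not the post's implementation: it assumes the function receives username and password fields in its event, and the in-memory user store, role ARN, bucket name, and password are all hypothetical. A response containing a Role tells Transfer Family which IAM role to map the user to, and an empty response signals that authentication failed.

```python
import hashlib
import hmac

# Hypothetical user store for illustration only. A real deployment would look
# users up in a directory or secrets store and use a salted password hash.
USERS = {
    "partner-a": {
        "password_sha256": hashlib.sha256(b"example-password").hexdigest(),
        "role_arn": "arn:aws:iam::123456789012:role/transfer-partner-a",
        "home_directory": "/example-bucket/partner-a",
    }
}


def lambda_handler(event, context):
    """Authenticate a Transfer Family user and return their role mapping."""
    user = USERS.get(event.get("username", ""))
    password = event.get("password", "")
    if user is None or not password:
        return {}  # empty response: authentication failed

    supplied = hashlib.sha256(password.encode("utf-8")).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(supplied, user["password_sha256"]):
        return {}

    # The role defines the user's S3 permissions; the home directory is
    # where their session starts.
    return {
        "Role": user["role_arn"],
        "HomeDirectory": user["home_directory"],
    }


if __name__ == "__main__":
    print(lambda_handler({"username": "partner-a", "password": "example-password"}, None))
    print(lambda_handler({"username": "partner-a", "password": "wrong"}, None))
```

Keeping the handler free of AWS SDK calls makes the authentication decision easy to unit test before wiring it to API Gateway.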

Transfer Family automatically handles scaling from zero to thousands of concurrent users, manages high availability across Availability Zones, and minimizes infrastructure management. It records detailed metrics and logs in Amazon CloudWatch for monitoring and auditing, supporting compliance requirements with activity tracking.

It’s important to note that Transfer Family also offers service-managed authentication. This simpler setup stores user credentials (passwords or SSH keys) directly in Transfer Family, minimizing the need for external identity providers. Service-managed authentication is best suited if you have a small number of users or no existing identity management system, or when you want to have a disconnected identity system and don’t want to give external partners an account in your identity provider system.

Pros

One of the biggest advantages of Transfer Family is how it provides the reliability and scalability of Amazon S3 for storing your data, while keeping that data available to existing client applications and workflows. The service integrates with existing authentication systems through custom identity providers, while maintaining security through IAM policies. Its auto-scaling capabilities handle variable workloads, from occasional transfers to high-volume scenarios.

Transfer Family also offers detailed CloudWatch logging and audit trails for file transfer activities, which should be sufficient for most logging and audit needs. It encrypts data in transit using TLS 1.2 and at rest using Amazon S3 server-side encryption. You can implement fine-grained access controls through IAM roles and integrate with AWS Organizations for multi-account management. The service supports VPC endpoints for secure internal access and custom domain names for branded endpoints.

Because data is stored in S3, some of your requirements will be fulfilled by configuring S3, not the Transfer Family services. Data retention (for example, avoiding deletion and scheduling deletion) is achieved through S3 Object Lock and S3 Lifecycle Events.
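As a sketch of the scheduled-deletion side of data retention, the following Python helper builds an S3 Lifecycle configuration that expires objects under a prefix after a number of days. The function name, rule ID, prefix, and bucket name are assumptions for illustration. The resulting dictionary has the shape accepted by the boto3 put_bucket_lifecycle_configuration call, which is shown only as a comment so the example runs without AWS credentials.

```python
import json


def expiration_rule(prefix: str, days: int, rule_id: str = "expire-shared-files") -> dict:
    """Build a Lifecycle configuration that deletes objects under a prefix after N days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical prefix and retention period.
    config = expiration_rule(prefix="partner-a/inbound/", days=30)
    print(json.dumps(config, indent=4))
    # Applying it would look like this (requires boto3 and AWS credentials):
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="example-bucket", LifecycleConfiguration=config)
```

Preventing deletion before a retention date is a separate control handled by S3 Object Lock, which this sketch does not cover.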

Cons

The pricing structure of Transfer Family includes $0.30 per hour for each protocol you enable and data transfer fees based on data volume. There can be additional charges for custom domain names. If you use VPC endpoints for secure internal access to Amazon S3, there will also be VPC data charges. If you have high-volume transfers or multiple endpoints across AWS Regions, you will face increased costs. Because the data ultimately lives in S3, S3 storage and request pricing applies as well.

Custom identity provider implementations (such as SAML or OAuth) require additional configuration and introduce extra steps during authentication. Compared to service-managed authentication, these steps add latency that affects transfer initiation times.

The Regional nature of Transfer Family means you must choose between deploying in a single Region (simpler management but potential latency for global users) or multiple Regions (better performance but higher costs at $0.30 per protocol per hour per Region). A multi-Region deployment can also serve as a disaster recovery strategy or meet Regional data isolation requirements.
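To make the single-Region versus multi-Region trade-off concrete, the following Python sketch estimates only the hourly endpoint charge, using the $0.30 per protocol per hour per Region rate quoted in this post. The function name, the 730-hour month, and the example inputs are assumptions for illustration. Data transfer, S3, and VPC charges are excluded, and you should verify the rate against current pricing.

```python
from decimal import Decimal

# Rate quoted in this post; verify against current Transfer Family pricing.
RATE_PER_PROTOCOL_HOUR = Decimal("0.30")
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_endpoint_cost(protocols: int, regions: int, hours: int = HOURS_PER_MONTH) -> Decimal:
    """Estimate the endpoint charge only (no data transfer, S3, or VPC costs)."""
    if protocols < 1 or regions < 1:
        raise ValueError("protocols and regions must be at least 1")
    return RATE_PER_PROTOCOL_HOUR * protocols * regions * hours


if __name__ == "__main__":
    # One protocol (for example, SFTP) in one Region versus two Regions.
    print("single Region:", monthly_endpoint_cost(protocols=1, regions=1))
    print("two Regions:  ", monthly_endpoint_cost(protocols=1, regions=2))
```

Because the charge scales linearly with both protocols and Regions, enabling a second protocol costs the same as adding a second Region for one protocol.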

Transfer Family web apps

Transfer Family web apps provide browser-based access to Amazon S3, enabling users to upload and download files through a web interface. With the web apps, you can create a branded, secure, and highly available portal for your users to browse, upload, and download data in S3. Web apps are built using Storage Browser for S3 and offer the same user functionalities in a fully managed offering without having to write code or host your own application.

When a user accesses the web application, authentication occurs through AWS IAM Identity Center, and S3 Access Grants determine their permissions to specific S3 buckets or prefixes. The access grant permissions can be either read-only or read and write. After authentication succeeds, users can upload or download files directly through the web interface. The service uses Amazon CloudFront for content delivery and implements SSL/TLS encryption for data transfers, while S3 provides server-side encryption for data at rest. Figure 2 shows a simplified Transfer Family web app architecture.

Figure 2: Simplified Transfer Family web app architecture

The web application automatically scales to accommodate varying numbers of users and provides high availability through the CloudFront global edge network. It minimizes the need for custom web application development and provides logging through AWS CloudTrail and CloudWatch. You can customize the user experience by implementing custom domains through CloudFront distributions.

Transfer Family web apps support multiple authentication methods, with IAM Identity Center being one of the primary options. Identity Center provides simplified user management and integration with existing identity providers, along with useful mechanisms such as multi-factor authentication (MFA), strong password policies, and resetting lost passwords. It’s not the only authentication method available; you can also use custom identity providers for authentication, providing flexibility in how you manage user access to the web application.

Pros

Transfer Family web apps minimize the need to build and maintain custom web interfaces for Amazon S3 file sharing. They provide seamless integration with IAM Identity Center for user management and authentication, enabling you to use existing identity providers. The service offers fine-grained access control through S3 Access Grants, allowing precise permission management at the bucket and prefix level. Its integration with CloudFront provides global availability and enhanced performance, while CloudTrail logging offers audit capabilities.

The service provides robust security features including SSL/TLS encryption, CORS policy management, and optional integration with AWS WAF for protection against bots, web scrapers, DDoS events, and more. You can implement custom domains for branded experiences and use CloudFront security features including DDoS protection using AWS Shield. The web interface offers intuitive file management capabilities without requiring client software or that users have technical expertise.

Cons

Transfer Family web apps require using IAM Identity Center, which might require additional setup and configuration if you’re not currently using this service. The web interface currently requires the Identity Center identities to live in the same AWS account as the S3 buckets. That might create design challenges if you want to keep identities in one AWS account and data storage in another. Implementation requires careful cross-origin resource sharing (CORS) configuration for each S3 bucket.

The service incurs costs for both Transfer Family and associated services, including CloudFront distribution and data transfer fees. Custom domain implementation requires additional configuration and SSL certificate management through AWS Certificate Manager (ACM). The web interface is well suited for humans to upload or download, but it’s not as good for automated workflows that transfer files from machine to machine. You must carefully manage user assignments and access grants to maintain security, adding administrative overhead.

S3 pre-signed URLs

Amazon S3 pre-signed URLs enable secure, time-limited access to objects in S3 without requiring the file recipient to have an identity in your identity systems. The URLs are generated using the AWS SDK or AWS Command Line Interface (AWS CLI), granting specific permissions (GET, PUT) that are valid for up to seven days. When accessing files, S3 validates the cryptographically signed parameters in these URLs before permitting access to objects. This provides a direct method for secure file sharing through HTTPS endpoints.

The solution requires only an S3 bucket and appropriate IAM permissions for URL generation. S3 handles the authentication of the pre-signed URL parameters and manages access to objects. File transfers occur directly between users and S3 through HTTPS endpoints, with the pre-signed URL controlling the access patterns.

Amazon S3 provides security features including server-side encryption, access logging, and CloudTrail integration. The security of pre-signed URLs is primarily managed through expiration times and specific operation permissions defined during URL generation.
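As a small illustration of the time limit carried in the signed parameters, the following Python sketch reads the `X-Amz-Date` and `X-Amz-Expires` query parameters of a Signature Version 4 pre-signed URL and computes when the URL stops working. The bucket name, object key, and signature in the example URL are hypothetical placeholders, and the helper names are assumptions. The sketch does not verify the signature; only S3 can do that.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

MAX_EXPIRES_SECONDS = 7 * 24 * 60 * 60  # seven-day maximum described in this post

# Hypothetical URL for illustration; the signature value is not real.
EXAMPLE_URL = (
    "https://amzn-s3-demo-bucket.s3.amazonaws.com/report.csv"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Date=20240101T000000Z"
    "&X-Amz-Expires=3600"
    "&X-Amz-SignedHeaders=host"
    "&X-Amz-Signature=example"
)


def presigned_url_expiry(url: str) -> datetime:
    """Return the UTC time at which the pre-signed URL expires."""
    params = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(
        params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    expires_in = int(params["X-Amz-Expires"][0])
    if not 1 <= expires_in <= MAX_EXPIRES_SECONDS:
        raise ValueError("X-Amz-Expires must be between 1 second and 7 days")
    return signed_at + timedelta(seconds=expires_in)


def is_expired(url: str, now: datetime = None) -> bool:
    """Report whether the URL has passed its expiration time."""
    now = now or datetime.now(timezone.utc)
    return now >= presigned_url_expiry(url)


if __name__ == "__main__":
    print("expires at:", presigned_url_expiry(EXAMPLE_URL).isoformat())
```

A check like this can help a distribution process avoid sending out URLs that are about to expire.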

Pros

Amazon S3 pre-signed URLs follow a straightforward pay-per-use pricing model, charging only for S3 storage, requests, and data transfers. For example, if you create pre-signed URLs but the object isn’t actually downloaded, you pay storage costs as usual, but you don’t pay transfer costs. The solution uses the native scalability of S3 to handle varying numbers of concurrent users without additional infrastructure. You can implement granular access controls through URL expiration times and specific operation permissions (GET, PUT, DELETE).

Access is controlled through URL expiration enforcement. Amazon S3 server access logging and CloudTrail integration enable audit capabilities. The solution’s simplicity makes it ideal for basic file sharing needs while maintaining security and scalability.

Cons

A pre-signed URL can be used by anyone who has access to the URL. That’s the goal of this design: You don’t need to have an identity for the user. Pre-signed URLs can be reused an unlimited number of times until they expire. To improve security, short expiration times can limit the potential for URL re-use. Shorter expiration times, however, require the recipient to download the file soon after the URL is created.

When implementing this solution, you should establish processes for secure URL generation and distribution. Set your URL expiration times based on realistic expectations about how quickly your recipients will download the files. A web or mobile app where the user selects a link to download something (such as a document, an image, a data file) and they expect the download to start immediately is a good candidate for this design.

The solution works with files up to 5 GB for single operations. To share a file larger than 5 GB, you must split the file into multiple parts and issue multiple pre-signed URLs, and the recipient must then download all the parts and join them together correctly. This isn’t a good solution for sharing large files. Also, distributing large files as a single download can be difficult if the recipient doesn’t have good connectivity. Amazon S3 can start an object download from the middle of the object, but a download started by selecting a pre-signed URL cannot. So, if the recipient transfers 1 GB of a 2 GB download and their connection is then disrupted, they cannot pick up where they left off. They must restart from the beginning, which is undesirable. Overall, this design is unsuitable for transmitting large files over unreliable internet connections.
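The splitting step described above can be sketched as a small planning function. This Python example computes the inclusive byte ranges for each part of a large file; the function names and the example sizes are assumptions for illustration. It only plans the parts: generating a pre-signed URL per part and reassembling the parts on the recipient’s side are separate steps.

```python
from typing import List, Tuple

FIVE_GB = 5 * 1024 ** 3  # single-operation limit described in this post


def part_ranges(total_size: int, part_size: int = FIVE_GB) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte offsets for each part of a file."""
    if total_size < 0 or part_size <= 0:
        raise ValueError("total_size must be >= 0 and part_size must be > 0")
    return [
        (start, min(start + part_size, total_size) - 1)
        for start in range(0, total_size, part_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format an inclusive byte range in HTTP Range header syntax."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # A hypothetical 12 GB file split into parts of at most 5 GB.
    for start, end in part_ranges(12 * 1024 ** 3):
        print(range_header(start, end))
```

The number of parts, and therefore the number of URLs the recipient must handle, grows with file size, which is one reason this design is a poor fit for large files.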

You should enable appropriate monitoring through Amazon S3 access logs and CloudTrail to track usage patterns and meet security compliance.

This solution is particularly effective if you’re seeking straightforward, secure file sharing capabilities where the files are small enough to download in one request, and where you have a secure mechanism to share the download URLs.

Serverless web application with S3 presigned URLs

Amazon S3 presigned URLs combined with a custom web application enable secure, time-limited access to S3 objects. The application generates URLs that grant specific S3 permissions (GET, PUT) for between one minute and seven days. When requesting file access, the application authenticates users and generates presigned URLs using the AWS SDK with defined permissions and expiration times.

The web application uses API Gateway and Lambda functions for authentication and URL generation. Amazon S3 validates the cryptographically signed parameters in these URLs before permitting access to objects. File transfers occur directly between users and S3 through HTTPS endpoints, with the application controlling the access patterns. The architecture is shown in Figure 3.

Figure 3: Amazon S3 pre-signed URLs architecture

The web application can implement security controls including request logging, rate limiting (requests per second), and authentication workflows. CloudWatch logs record API access patterns and Lambda execution metrics, while Amazon S3 access logging records object-level operations.

Pros

Amazon S3 presigned URLs follow a pay-per-use pricing model. This solution charges only for API Gateway requests, Lambda executions, and S3 operations performed. The serverless architecture scales automatically from zero to thousands of concurrent users without infrastructure management. You can implement custom security controls and business logic for specific access requirements through API Gateway authorizers (using custom identity solutions or Amazon Cognito) and Lambda functions.

The solution enforces security through URL expiration (maximum seven days), IAM policies restricting URL generation permissions, and HTTPS encryption for data transfers. Custom authentication workflows integrate with existing identity providers (SAML, OIDC). Additional security features include IP-based restrictions, required request headers, and request validation through AWS WAF. This solution would be good, for example, if you have a variety of files or a variety of buckets and you’re trying to build a unified front-end where people can download various files without knowing which bucket the files are stored in or what URL expiration time is appropriate. You can configure the frontend to look at tags on objects, tags on buckets, object names, or another attribute that fits your use case, and then choose a URL expiration time based on that attribute. For example, objects from buckets tagged Data Classification: Restricted might expire after 1 minute, whereas objects from buckets tagged Data Classification: Public might be valid for 7 days.
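The tag-based expiration idea can be sketched in a few lines of Python. The mapping below uses the two examples from the paragraph above (Restricted expires after 1 minute, Public after 7 days). The function name, the shape of the tag dictionary, and the 15-minute default for untagged or unrecognized buckets are assumptions for illustration, not part of any AWS API.

```python
# Expiration times, in seconds, keyed by the value of the
# "Data Classification" tag used as an example in this post.
EXPIRY_BY_CLASSIFICATION = {
    "Restricted": 60,              # 1 minute
    "Public": 7 * 24 * 60 * 60,    # 7 days, the maximum for a pre-signed URL
}

# Assumption: fall back to a short lifetime when the tag is missing or unknown.
DEFAULT_EXPIRY_SECONDS = 15 * 60


def expiry_seconds(bucket_tags: dict) -> int:
    """Choose a pre-signed URL lifetime from a bucket's tags."""
    classification = bucket_tags.get("Data Classification")
    return EXPIRY_BY_CLASSIFICATION.get(classification, DEFAULT_EXPIRY_SECONDS)


if __name__ == "__main__":
    print(expiry_seconds({"Data Classification": "Restricted"}))
    print(expiry_seconds({"Data Classification": "Public"}))
    print(expiry_seconds({}))
```

In a Lambda function, the returned value would be passed as the expiration time when the SDK generates the URL. Defaulting to a short lifetime keeps untagged data on the conservative side.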

Cons

Building a custom web application requires developing and maintaining the code for URL generation, authentication, and error handling logic. The application must track URL expiration times and implement mechanisms that permit retries for failed transfers. Monitoring systems must track URL usage, detect abuse patterns, and send alerts for security violations through CloudWatch metrics and logs.

One limitation of this solution is the 10 MB size limit imposed by API Gateway. This affects how your application handles file uploads and downloads. For uploads, files under 10 MB can be uploaded directly through API Gateway. Larger files require implementing multipart uploads, where the client splits the file into chunks and sends each chunk separately. For downloads, files under 10 MB can be downloaded directly through API Gateway, but for larger files, your application should generate a pre-signed URL for direct Amazon S3 access, bypassing API Gateway.
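The size-based routing described above amounts to a simple threshold check. The following Python sketch shows one way to express it; the function names and the returned labels are hypothetical, and the limit is taken from the 10 MB figure in this post.

```python
API_GATEWAY_LIMIT_BYTES = 10 * 1024 * 1024  # 10 MB limit described in this post


def download_path(size_bytes: int) -> str:
    """Pick how a download should be served, based on object size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    # Small objects can pass through API Gateway; larger ones bypass it.
    return "api_gateway" if size_bytes < API_GATEWAY_LIMIT_BYTES else "presigned_url"


def upload_path(size_bytes: int) -> str:
    """Pick how an upload should be handled, based on file size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    # Files at or above the limit need to be split into chunks.
    return "api_gateway" if size_bytes < API_GATEWAY_LIMIT_BYTES else "multipart"


if __name__ == "__main__":
    print(download_path(2 * 1024 * 1024))    # small file
    print(download_path(50 * 1024 * 1024))   # large file
```

Treating a file of exactly 10 MB as too large is a deliberately conservative choice in this sketch.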

URL generation errors or misconfigured IAM permissions can expose objects to unauthorized access. The HTTPS-only protocol limits integration with SFTP and FTPS clients. Files larger than 5 GB require multipart upload implementation, and network interruptions need custom resume logic. This design will incur some extra charges if the number of file transfers is in the millions. Lambda functions cost $0.20 per million requests, and API Gateway costs $1.00 per million requests. Analyze your expected access patterns to determine whether these extra costs will be significant and whether they’re worth the additional flexibility of custom transfer logic.
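To gauge whether the request charges matter for your workload, you can run a quick estimate. This Python sketch uses the per-million rates quoted above and assumes exactly one API Gateway request and one Lambda invocation per file transfer. That one-to-one ratio, the function name, and the example volume are assumptions, and Lambda duration charges and S3 costs are not included.

```python
from decimal import Decimal

# Rates quoted in this post; verify against current pricing.
LAMBDA_PER_MILLION = Decimal("0.20")
API_GATEWAY_PER_MILLION = Decimal("1.00")


def request_charges(transfers: int) -> Decimal:
    """Estimate request charges, assuming one API call and one invocation per transfer."""
    if transfers < 0:
        raise ValueError("transfers must not be negative")
    millions = Decimal(transfers) / Decimal(1_000_000)
    return (LAMBDA_PER_MILLION + API_GATEWAY_PER_MILLION) * millions


if __name__ == "__main__":
    # Hypothetical volume of 5 million transfers in a month.
    print(request_charges(5_000_000))
```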

Decision matrix: When to use each solution

The following table summarizes the characteristics of the solutions presented in the two parts of this post. See Part 2 for full descriptions of the solutions not covered in Part 1.

| Characteristics | Transfer Family | Transfer Family web app | S3 pre-signed URLs (Direct) | Serverless web application with S3 pre-signed URL | CloudFront signed URLs (Part 2) | VPC endpoint service (Part 2) | S3 Object Lambda (Part 2) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Protocol support | SFTP, FTPS, and AS2 | HTTPS (web-based) | HTTPS | HTTPS | HTTPS with CDN | A TCP-based protocol | HTTPS |
| Global distribution | Global endpoint support | CloudFront integration | Global S3 access | Global S3 access | Global edge network acceleration | Direct AWS backbone access | Global S3 access with Regional endpoints |
| Pricing model | Hourly service rate and usage | Pay per file operation | Pay-per-request | Pay-per-request and application costs | Pay-per-request with caching savings | Hourly endpoint rate and usage | No additional charge for access points; standard S3 request pricing applies |
| Content processing | Direct S3 integration | Built-in web interface | Direct S3 access | Custom app processing | Edge-based file processing | Access files through private network | Direct S3 access with customized permissions per access point |
| Authentication options | Custom IdP and service-managed | IAM Identity Center | IAM | Custom authentication possible | IAM, custom authentication, and edge validation | VPC security controls and custom authentication | IAM policies, VPC endpoint policies, resource-based policies |
| Upload capabilities | Unlimited file size | Web interface upload | Up to 5 GB direct and multipart for larger | Up to 10 MB using API Gateway | Optimized for global ingestion | Unlimited file size over private connection | Same as standard S3 |
| Download capabilities | Unlimited file size | Browser-based downloads | Up to 5 GB using a single URL | Up to 10 MB using API Gateway | Accelerated downloads using global edge locations | Unlimited file size over private connection | Same as standard S3 with customized access controls |
| Example use cases | Enterprise file transfer systems; B2B data exchange; Compliance-focused transfers | Browser-based file sharing; Internal document management; Client portals | Simple direct S3 access; Temporary file sharing; Mobile app backend | Custom file sharing systems; Integrated web applications; Enhanced S3 access control | Global content delivery; Media distribution; Web application assets | Private network transfers; Custom protocol support; Secure enterprise data exchange | Simplified data access management at scale; Multi-application access to shared datasets; VPC-restricted data access |

The following list gives you a quick overview of the strengths of each solution presented in the two parts of this post.

  • Transfer Family is the optimal choice for organizations that require legacy file transfer protocols such as SFTP, FTPS, or AS2 and must integrate with existing authentication systems. It’s ideal for scenarios with strict compliance and audit requirements, where operational overhead needs to be minimized. While the solution comes with higher costs because of its managed service nature, it’s often the lowest-friction option to support existing enterprise use cases that depend on these protocols.
  • Transfer Family web apps suit organizations that need browser-based file sharing without custom development. They integrate with IAM Identity Center for user authentication and use Amazon S3 Access Grants for permission management. The solution works well for internal document sharing, client portals, and scenarios requiring a branded web interface. While limited to web browser access, they provide built-in features like MFA and password management without infrastructure maintenance.
  • Amazon S3 pre-signed URLs excel in scenarios where simplicity, cost-effectiveness, and temporary access are key requirements. This solution is ideal if you’re seeking a straightforward file sharing mechanism without the need for custom application development or additional infrastructure. This approach shines in environments that require a quick implementation of secure file sharing and cost-effective solutions with minimal overhead.
  • Serverless web application with S3 presigned URLs best serves scenarios where cost optimization is paramount and the HTTPS protocol meets your requirements. This solution shines in environments that need simple, direct file sharing capabilities with quick implementation timelines. It’s particularly effective for moderate usage patterns where serverless architecture can provide cost benefits. The solution’s simplicity makes it ideal for web applications and scenarios where complex file transfer protocols aren’t necessary, though careful consideration must be given to its 10 MB file size limitation for single operations using API Gateway.

In Part 2:

  • CloudFront signed URLs excel in situations that demand global content distribution with high performance requirements. This solution is the clear choice when your architecture needs built-in DDoS protection and performance optimization through caching. It’s particularly valuable when content delivery speed is crucial and you require security at edge locations. The solution’s global reach and caching capabilities make it cost-effective for large-scale content distribution, though it’s primarily optimized for download scenarios rather than uploads.
  • Amazon VPC endpoint service is the preferred choice if you require complete network isolation and maximum security. This solution is ideal when you need support for custom protocols while maintaining private network connectivity. It’s particularly suitable for scenarios with extremely high security requirements and when you have the necessary resources to manage networking configurations. While this solution requires significant expertise and investment, it provides the highest level of security and control for sensitive data transfers.
  • S3 Access Points are best suited for scenarios that require simplified data access management at scale. This solution excels when you need to provide different access patterns to the same underlying data for multiple applications or user groups. It’s ideal if you prefer a structured approach to permissions and need network-level access controls. While primarily focused on simplifying complex access scenarios without modifying bucket policies, it offers unique capabilities for VPC-restricted access and granular permissions management, though subject to certain service limits and configuration requirements.

Conclusion

In this first part of a two-part post, you’ve learned about multiple solutions for secure file sharing using AWS services and the pros and cons of each. You can find additional options in Part 2. The optimal solution depends on your specific organizational requirements, technical capabilities, and budget constraints. You don’t have to choose just one option; you can implement multiple solutions to address different use cases, creating a file sharing strategy that balances security, cost, and operational efficiency.

Swapnil Singh

Swapnil is a Senior Solutions Architect for AWS World Wide Public Sector. As a Product Acceleration Solutions Architect at AWS, she currently works with GovTech customers to ideate, design, validate, and launch products using cloud-native technologies and modern development practices.

Sumit Bhati

Sumit is a Senior Customer Solutions Manager at AWS, specializing in expediting the cloud journey for enterprise customers. Sumit is dedicated to assisting customers through every phase of their cloud adoption, from accelerating migrations to modernizing workloads and facilitating the integration of innovative practices.

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

Post Syndicated from Krishna Gogineni original https://aws.amazon.com/blogs/big-data/migrate-a-petabyte-scale-data-warehouse-from-actian-vectorwise-to-amazon-redshift/

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

In this post, we discuss how a financial services industry customer achieved scalability, resiliency, and availability by migrating from an on-premises Actian Vectorwise data warehouse to Amazon Redshift.

Challenges

The customer’s use case required a high-performing, highly available, and scalable data warehouse to process queries against large datasets in a low-latency environment. Their Actian Vectorwise system was designed to replace Excel plugins and stock screeners but eventually evolved into a much larger and more ambitious portfolio analysis solution running multiple API clusters on premises, serving some of the largest financial services firms worldwide. The success of their products drove a 30% year-over-year increase in usage, and with it a growing demand for high performance and scalability. The customer needed to keep up with the increased volume of read requests, but they couldn’t do this without deploying additional hardware in the data center. There was also a customer mandate that business-critical products must move from their existing hardware to cloud-based solutions or be deemed on the path to obsolescence. In addition, the business started moving customers onto a new commercial model, under which new projects would need to provision a new cluster, which meant that they needed improved performance, scalability, and availability.

They faced the following challenges:

  • Scalability – The customer understood that infrastructure maintenance was a growing issue and, although operations were a consideration, the existing implementation didn’t have a scalable and efficient solution to meet the advanced sharding requirements needed for query, reporting, and analysis. Over-provisioning data warehouse capacity to meet unpredictable workloads resulted in 30% of that capacity being underutilized during normal operations.
  • Availability and resiliency – Because the customer was running business-critical analytical workloads, it required the highest levels of availability and resiliency, which was a concern with the on-premises data warehouse solution.
  • Performance – Some of their queries needed to be processed with priority, and users were starting to experience performance degradation, with longer-running query times, as the solution was used more and more.

The need for a scalable and efficient solution to manage customer demand, address infrastructure maintenance concerns, replace legacy tooling, and tackle availability led the customer to choose Amazon Redshift as the future state solution. If these concerns were not addressed, the customer would be unable to grow their user base.

Legacy architecture

The customer’s platform was the main source for one-time, batch, and content processing. It served many enterprise use cases across API feeds, content mastering, and analytics interfaces. It was also the single strategic platform within the company for entity screening, on-the-fly aggregation, and other one-time, complex request workflows.

The following diagram illustrates the legacy architecture.

The architecture consists of many layers:

  • Rules engine – The rules engine was responsible for intercepting every incoming request. Based on the nature of the request, it routed the request to the API cluster that could optimally process that specific request based on the response time requirement.
  • API – Scalability was one of the primary challenges with the existing on-premises system. It wasn’t possible to quickly scale up and down API service capacity to meet growing business demand. Both the API and data store had to support a highly volatile workload pattern. This included simple data retrieval requests that had to be processed within a few milliseconds vs. power user-style batch requests with complex analytics-based workloads that could take several seconds and significant compute resources to process. To separate these different workload patterns, the API and data store infrastructure was split into multiple isolated physical clusters. This made sure each workload group was provisioned with sufficient reserved capacity to meet the respective response time expectations. However, this model of reserving capacity for each workload type resulted in suboptimal usage of compute resources because each cluster would only process a specific workload type.
  • Data store – The data store used a custom data model that had been highly optimized to meet low-latency query response requirements. The current on-premises data store wasn’t horizontally scalable, and there was no built-in replication or data sharding capability. Due to this limitation, multiple database instances were created to meet concurrent scalability and availability requirements because the schema wasn’t generic per dataset. This model caused operational maintenance overhead and wasn’t easily expandable.
  • Data ingestion – Pentaho was used to ingest data sourced from multiple data publishers into the data store. The ingestion framework itself didn’t have any major challenges. However, the primary bottleneck was due to scalability issues associated with the data store. Because the data store didn’t support sharding or replication, data ingestion had to explicitly ingest the same data concurrently across multiple database nodes within a single transaction to provide data consistency. This significantly impacted overall ingestion speed.

Overall, the legacy architecture didn’t support workload prioritization, so physical resources were reserved for each workload type instead. The downside of this approach was over-provisioning. The system also integrated with legacy backend services that were all hosted on premises.

Solution overview

Amazon Redshift is an industry-leading cloud data warehouse. Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes using AWS-designed hardware and machine learning (ML) to deliver the best price-performance at any scale.

Amazon Redshift is designed for high-performance data warehousing, which provides fast query processing and scalable storage to handle large volumes of data efficiently. Its columnar storage format minimizes I/O and improves query performance by reading only the relevant data needed for each query, resulting in faster data retrieval. Lastly, you can integrate Amazon Redshift with data lakes like Amazon Simple Storage Service (Amazon S3), combining structured and semi-structured data for comprehensive analytics.

The following diagram illustrates the architecture of the new solution.

In the following sections, we discuss the features of this solution and how it addresses the challenges of the legacy architecture.

Rules engine and API

Amazon API Gateway is a fully managed service that helps developers deliver secure, robust, API-driven application backends at any scale. To address the scalability and availability requirements of the rules and routing layer, we introduced API Gateway to route client requests to different integration paths using routes and parameter mappings. Having API Gateway as the entry point allowed the customer to move away from designing, testing, and maintaining their own rules engine. In their legacy environment, handling fluctuating amounts of traffic posed a significant challenge. API Gateway addressed this issue by acting as a proxy and automatically scaling to accommodate varying traffic demands, providing optimal performance and reliability.

Data storage and processing

Amazon Redshift allowed the customer to meet their scalability and performance requirements. Amazon Redshift features such as workload management (WLM), massively parallel processing (MPP) architecture, concurrency scaling, and parameter groups helped address the requirements:

  • WLM provided the ability for query prioritization and managing resources effectively
  • The MPP architecture model provided horizontal scalability
  • Concurrency scaling added cluster capacity to handle unpredictable and spiky workloads
  • Parameter groups defined configuration parameters that control database behavior

Together, these capabilities allowed them to meet their scalability and performance requirements in a managed fashion.

Data distribution

The legacy data center architecture was unable to partition the data without deploying additional hardware in the data center, and it couldn’t handle read workloads efficiently.

The MPP architecture of Amazon Redshift distributes data efficiently across all the compute nodes, so heavy workloads run in parallel, which lowered response times. Its MPP engine and architecture separate compute and storage for efficient scaling and performance.

Operational efficiency and hygiene

Infrastructure maintenance and operational efficiency were concerns for the customer in their current state architecture. Amazon Redshift is a fully managed service that takes care of data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, and monitoring nodes and drives to recover from failures or backups. Amazon Redshift periodically performs maintenance to apply fixes, enhancements, and new features to your Redshift data warehouse. As a result, the customer’s operational costs were reduced by 500%, and they are now able to spend more time innovating and building mission-critical applications.

Workload management

Amazon Redshift WLM resolved issues with the legacy architecture where longer-running queries consumed all the resources, causing other queries to run slower and impacting performance SLAs. With automatic WLM, the customer was able to create separate WLM queues with different priorities, which allowed them to manage the priorities for the critical SLA-bound workloads and other non-critical workloads. With short query acceleration (SQA) enabled, Amazon Redshift prioritized selected short-running queries ahead of longer-running queries. Furthermore, the customer benefited from using query monitoring rules in WLM to apply performance boundaries that control poorly designed queries and take action when a query goes beyond those boundaries. To learn more about WLM, refer to Implementing workload management.

Workload isolation

In the legacy architecture, all the workloads—extract, transform, and load (ETL); business intelligence (BI); and one-time workloads—were running on the same on-premises data warehouse, leading to the noisy neighbor problem and performance issues with the increase in users and workloads.

With the new solution architecture, this issue is remediated using data sharing in Amazon Redshift. With data sharing, the customer is able to share live data with security and ease across Redshift clusters, AWS accounts, or AWS Regions for read purposes, without the need to copy any data.

Data sharing improved the agility of the customer’s organization by giving them instant, granular, and high-performance access to data across Redshift clusters without the need to copy or move it manually. With data sharing, the customer has live access to data, so their users can see the most up-to-date and consistent information as it’s updated in Redshift clusters. Data sharing provides workload isolation by running ETL workloads in their own Redshift cluster and sharing data with other BI and analytical workloads in their respective Redshift clusters.

Scalability

With the legacy architecture, the customer struggled to handle unpredictable, spiky workloads during large events and had to over-provision database capacity. Using concurrency scaling and elastic resize allowed the customer to meet their scalability requirements and handle unpredictable and spiky workloads.

Data migration to Amazon Redshift

The customer used a home-grown process to extract the data from Actian Vectorwise and store it in Amazon S3 as CSV files. The data from Amazon S3 was then ingested into Amazon Redshift.

The loading process used the COPY command, which is the recommended way to load data into Amazon Redshift. COPY is the most efficient way to load a table because it uses the Amazon Redshift MPP architecture to read and load data in parallel from a file or multiple files in an S3 bucket.
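As a rough illustration of the load step, the following sketch assembles a COPY statement for CSV files under an S3 prefix. The table name, bucket, and IAM role ARN are hypothetical placeholders, and the helper `build_copy_statement` is not part of any AWS library. Pointing COPY at a prefix that holds several files is what lets Redshift load them in parallel.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str,
                         ignore_header: int = 1) -> str:
    """Return a Redshift COPY statement that loads CSV files from an S3 prefix.

    The inputs are interpolated directly, so they should be trusted,
    operator-supplied values (never end-user input).
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS CSV\n"
        f"IGNOREHEADER {ignore_header};"
    )


if __name__ == "__main__":
    # Hypothetical staging table, bucket prefix, and role ARN.
    print(build_copy_statement(
        "staging.trades",
        "s3://example-bucket/trades/",
        "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
    ))
```

The generated statement would then be run through whichever SQL client or driver the ETL tool already uses.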

To learn about the best practices for source data files to load using the COPY command, see Loading data files.

After the data is ingested into Redshift staging tables from Amazon S3, transformation jobs are run from Pentaho to apply the incremental changes to the final reporting tables.

The following diagram illustrates this workflow.

Key considerations for the migration

There are three ways of migrating an on-premises data warehouse to Amazon Redshift: one-step, two-step, and wave-based migration. To minimize the risk of migrating over 20 databases that vary in complexity, we decided on the wave-based approach. The fundamental concept behind wave-based migration involves dividing the migration program into projects based on factors such as complexity and business outcomes. The implementation then migrates each project individually or by combining certain projects into a wave. Subsequent waves follow, which may or may not be dependent on the results of the preceding wave.

This strategy requires both the legacy data warehouse and Amazon Redshift to operate concurrently until the migration and validation of all workloads are successfully complete. This provides a smooth transition while making sure the on-premises infrastructure can be retired only after thorough migration and validation have taken place.

In addition, within each wave, we followed a set of phases to make sure that each wave was successful:

  • Assess and plan
  • Design the Amazon Redshift environment
  • Migrate the data
  • Test and validate
  • Perform cutover and optimizations

In the process, we didn’t want to rewrite the legacy code for each migration. SQL compatibility was very important because of existing knowledge within the organization and downstream application consumption, so we migrated the data to Amazon Redshift with minimal code changes. After the data was ingested into the Redshift cluster, we adjusted the tables for best performance.

One of the main benefits we realized as part of the migration was the option to integrate data in Amazon Redshift with other business groups in the future that use AWS Data Exchange, without significant effort.

We performed blue/green deployments to make sure that the end-users didn’t encounter any latency degradation while retrieving the data. We migrated the end-users in a phased manner to measure the impact and adjust the cluster configuration as needed.

Results

The customer’s decision to use Amazon Redshift for their solution was further reinforced by the platform’s ability to handle both structured and semi-structured data seamlessly. Amazon Redshift allows the customer to efficiently analyze and derive valuable insights from their diverse range of datasets, including equities and institutional data, all while using standard SQL commands that teams are already comfortable with.

Through rigorous testing, Amazon Redshift consistently met the customer’s stringent SLAs and delivered subsecond query response times. With the AWS migration, the customer achieved a 5% improvement in query performance. Scaling the clusters took minutes, compared to 6 months in the data center. Operational cost was reduced by 500% due to the simplicity of the Redshift cluster operations in AWS. Stability of the clusters improved by 100%. Upgrades and patching cycle time improved by 200%. Overall, the improved operational posture resulted in significant savings for the team and the platform in general. In addition, the ability to scale the overall architecture based on market data trends in a resilient and highly available way not only met the customer demand in terms of time to market, but also significantly reduced the operational costs and total cost of ownership.

Conclusion

In this post, we covered how a large financial services customer improved performance and scalability, and reduced their operational costs by migrating to Amazon Redshift. This enabled the customer to grow and onboard new workloads into Amazon Redshift for their business-critical applications.

To learn about other migration use cases, refer to the following:


About the Authors

Krishna Gogineni is a Principal Solutions Architect at AWS helping financial services customers. Krishna is a Cloud-Native Architecture evangelist helping customers transform the way they build software. Krishna works with customers to learn their unique business goals, and then super-charge their ability to meet these goals through software delivery that leverages industry best practices/tools such as DevOps, Data Lakes, Data Analytics, Microservices, Containers, and Continuous Integration/Continuous Delivery.

Dayananda Shenoy is a Senior Solution Architect with over 20 years of experience designing and architecting backend services for financial services products. Currently, he leads the design and architecture of distributed, high-performance, low-latency analytics services for a data provider. He is passionate about solving scalability and performance challenges in distributed systems, leveraging emerging technologies that improve existing tech stacks and add value to the business to enhance customer experience.

Vishal Balani is a Sr. Customer Solutions Manager based out of New York. He works closely with Financial Services customers to help them leverage cloud for business agility, innovation, and resiliency. He has extensive experience leading large-scale cloud migration programs. Outside of work he enjoys spending time with family, tinkering with a new project, or riding his bike.

Ranjan Burman is a Sr. PostgreSQL Database Specialist SA. He specializes in RDS & Aurora PostgreSQL. He has more than 18 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Muthuvelan Swaminathan is an Enterprise Solutions Architect based out of New York. He works with enterprise customers providing architectural guidance in building resilient, cost-effective and innovative solutions that address business needs.

How to enable one-click unsubscribe email with Amazon Pinpoint

Post Syndicated from Zip Zieper original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-enable-one-click-unsubscribe-email-with-amazon-pinpoint/

Amazon Pinpoint customers who use campaigns, journeys, or the SendMessages API to send more than 5,000 marketing email messages per day are considered “bulk senders”. If your organization meets these criteria, you are now subject to new requirements that were recently established by Google, Yahoo, and other large ISPs/ESPs. These providers have mandated these requirements to help protect their users’ inboxes. Detailed information about these requirements is provided in the Amazon Simple Email Service (SES) bulk sender updates blog post.

Per these new requirements, Pinpoint customers that send marketing email messages in bulk must meet all of these criteria:

  • Fully authenticate their email sending domains with SPF, DKIM and DMARC. See this blog.
  • Provide a clearly visible unsubscribe link in the body and/or footer of each message.
  • Enable the “List-Unsubscribe” and “List-Unsubscribe-Post” headers for one-click unsubscribe (the subject of this blog post). You can learn more about these headers and how they are used in SES in this related blog post.
  • Honor all unsubscribe POST requests within 48 hours, after which time you shouldn’t be sending emails to the now unsubscribed end-user.
  • Actively monitor spam complaint rates, and take the steps needed to ensure these rates remain below acceptable levels as defined by the ESPs.

This blog post provides Pinpoint customers with the steps necessary to enable the one-click unsubscribe button via email headers for “List-Unsubscribe” and “List-Unsubscribe-Post” as defined by RFC 2369 and RFC 8058.

Unsubscribe Process Overview

Pinpoint now supports the inclusion of the “List-Unsubscribe” and “List-Unsubscribe-Post” email headers that enable compatible email client apps to render a one-click unsubscribe button when displaying emails from a subscription list. When you include these headers in the emails you send with Pinpoint, end-users who want to unsubscribe from your emails can do so by clicking the unsubscribe button in their email app (see image). Once pressed, the unsubscribe button sends a POST request to the URL you have defined in the “List-Unsubscribe” header.

You, the Pinpoint customer, are responsible for defining the “List-Unsubscribe” and “List-Unsubscribe-Post” headers, as well as supplying the system or process invoked by the “List-Unsubscribe” and “List-Unsubscribe-Post” email headers. Your system or process must, when activated by the unsubscribe action, update that end-user’s preferences accordingly so that within 48 hours, any end-user who unsubscribes will no longer receive unwanted emails.
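To make the receiving side more concrete, here is a minimal sketch of the check an unsubscribe endpoint might perform. It assumes the form-encoded POST body `List-Unsubscribe=One-Click` described in RFC 8058. The function names, the `token` argument (whatever identifier your unsubscribe URL carries), and the `opt_out` callback are hypothetical and stand in for your own web framework and preference store.

```python
from typing import Callable
from urllib.parse import parse_qs


def is_one_click_unsubscribe(method: str, body: str) -> bool:
    """Return True if the request looks like an RFC 8058 one-click POST."""
    if method.upper() != "POST":
        return False
    # Only form-encoded bodies are handled in this sketch.
    fields = parse_qs(body)
    return fields.get("List-Unsubscribe") == ["One-Click"]


def handle_unsubscribe(method: str, body: str, token: str,
                       opt_out: Callable[[str], None]) -> int:
    """Validate the request, record the opt-out, and return an HTTP status."""
    if not is_one_click_unsubscribe(method, body):
        return 400
    opt_out(token)  # e.g. update the recipient's stored preferences
    return 200


if __name__ == "__main__":
    unsubscribed = []
    status = handle_unsubscribe(
        "POST", "List-Unsubscribe=One-Click", "example-token", unsubscribed.append
    )
    print(status, unsubscribed)
```

Keeping the validation separate from the opt-out action makes it easier to plug the same check into an API Gateway handler, a Lambda function, or an existing web application.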

If you only use Pinpoint’s campaigns and journeys, you may elect to use the Pinpoint endpoint’s OptOut attribute to store the user’s unsubscribe preferences. Possible values for OptOut are: ALL, the user has opted out and doesn’t want to receive any messages; and, NONE, the user hasn’t opted out and wants to receive all messages. It is important to note, however, that the SendMessages API ignores the Pinpoint endpoint’s OptOut attribute.
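As a sketch of how the OptOut attribute could be set from code, the helper below calls the Pinpoint `update_endpoint` operation through a Boto3 client that the caller supplies. The helper name and the idea of calling it from your unsubscribe process are assumptions; the project and endpoint IDs are values you would look up yourself.

```python
def set_endpoint_opt_out(pinpoint_client, project_id: str, endpoint_id: str,
                         opted_out: bool = True):
    """Set the endpoint's OptOut attribute to ALL (opted out) or NONE.

    pinpoint_client is expected to be a Boto3 Pinpoint client, for example
    boto3.client("pinpoint", region_name="us-east-1").
    """
    return pinpoint_client.update_endpoint(
        ApplicationId=project_id,
        EndpointId=endpoint_id,
        EndpointRequest={"OptOut": "ALL" if opted_out else "NONE"},
    )
```

Because the SendMessages API ignores OptOut, a sender that uses that API would still need to check the stored preference before building its address list.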

If you do not currently offer your recipients the option to unsubscribe from unwanted emails, you will need to develop and deploy a system or process to receive end-user unsubscribe requests to comply with these new requirements. An example solution with sample code to process email opt-out requests for Pinpoint can be found here. You can read more about this example in this blog post.

REQUIRED: Update the SES IAM role used by Pinpoint

Because Pinpoint uses SES resources for sending email messages, when using campaigns or journeys you must now create (or update) an IAM Orchestration sending role to grant Pinpoint service access to your SES resources. This allows Pinpoint to send emails via SES. To add or update the IAM role, follow the steps outlined in the Pinpoint documentation.

Note – If you are sending emails directly via the SendMessages API, you do not need an IAM Orchestration sending role, but you must have permissions for ses:SendEmail and ses:SendRawEmail.

Add easy unsubscribe email headers:

The steps you need to take to enable one-click unsubscribe in your Pinpoint emails depend on how you send emails, and whether or not you use templates, as shown below:

Decision tree for adding headers

Use SendMessages with the AWS SDK or CLI

Using the AWS CLI: add headers for “List-Unsubscribe” and “List-Unsubscribe-Post” as shown in the example below:

aws pinpoint send-messages \
--region us-east-1 \
--application-id ce796be37f32f178af652b26eexample \
--message-request '{
    "Addresses": {
        "[email protected]": {"ChannelType": "EMAIL"}
    },
    "MessageConfiguration": {
        "EmailMessage": {
            "SimpleEmail": {
                "Subject": {"Data":"URL with easy unsubscribe headers", "Charset":"UTF-8"},
                "TextPart": {"Data":"with headers list-unsubscribe and list-unsubscribe-post.\n\nUnsubscribe: <https://www.example.com/preferences>", "Charset":"UTF-8"},
                "HtmlPart": {"Data":"<html><body>with headers list-unsubscribe and list-unsubscribe-post<br><br><a ses:tags=\"unsubscribeLinkTag:optout\" href=\"https://example.com/?address=x&topic=x\">Unsubscribe</a></body></html>", "Charset":"UTF-8"},
                "Headers": [
                    {"Name":"List-Unsubscribe", "Value":"<https://example.com/?address=x&topic=x>, <mailto: [email protected]?subject=TopicUnsubscribe>"},
                    {"Name":"List-Unsubscribe-Post", "Value":"List-Unsubscribe=One-Click"}
                ]
            }
        }
    }
}'
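The two headers in the CLI example above can also be assembled programmatically. The following is a small sketch, not part of the Pinpoint SDK: `build_unsubscribe_headers` is a hypothetical helper that wraps each target in angle brackets (as RFC 2369 describes) and insists on an HTTPS URL (as RFC 8058 requires for one-click unsubscribe). The URL and mailbox in the usage lines are placeholders.

```python
from typing import Dict, List, Optional


def build_unsubscribe_headers(https_url: str,
                              mailto: Optional[str] = None) -> List[Dict[str, str]]:
    """Build the List-Unsubscribe and List-Unsubscribe-Post header entries."""
    if not https_url.lower().startswith("https://"):
        raise ValueError("one-click unsubscribe requires an HTTPS URL")
    targets = [f"<{https_url}>"]
    if mailto:
        # Optional fallback for clients that only support mailto unsubscribes.
        targets.append(f"<mailto:{mailto}>")
    return [
        {"Name": "List-Unsubscribe", "Value": ", ".join(targets)},
        {"Name": "List-Unsubscribe-Post", "Value": "List-Unsubscribe=One-Click"},
    ]


if __name__ == "__main__":
    for header in build_unsubscribe_headers(
        "https://example.com/?address=x&topic=x",
        "unsubscribe@example.com?subject=TopicUnsubscribe",
    ):
        print(f'{header["Name"]}: {header["Value"]}')
```

The returned list has the same shape as the "Headers" array in the request, so it can be dropped into the SimpleEmail section of a SendMessages call.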

Send an email message

Below is an example using the SendMessages API from the AWS SDK for Python (Boto3) that includes the List-Unsubscribe headers. This example assumes that you’ve already installed and updated the SDK for Python (Boto3) to the latest version available. For more information, see Quickstart in the AWS SDK for Python (Boto3) API Reference.

import logging  # Logging library to log messages
import boto3  # AWS SDK for Python
from botocore.exceptions import ClientError  # Exception handling for boto3
import hashlib  # Library to generate unique hashes

# Configure logger
logger = logging.getLogger(__name__)

# Define constants
CHARSET = "UTF-8"
REGION = 'us-east-1'

def send_email_message(
    pinpoint_client,
    project_id, 
    sender,
    to_addresses,
    subject,
    html_message,
    text_message,
):
    """
    Sends an email message with HTML and plain text versions.

    :param pinpoint_client: A Boto3 Pinpoint client.
    :param project_id: The Amazon Pinpoint project ID to use when you send this message.
    :param sender: The "From" address. This address must be verified in
                   Amazon Pinpoint in the AWS Region you're using to send email.
    :param to_addresses: The list of addresses on the "To" line. If your Amazon Pinpoint account
                         is in the sandbox, these addresses must be verified.
    :param subject: The subject line of the email.
    :param html_message: The HTML content of the email.
    :param text_message: The plain text content of the email.
    :return: A dict of to_addresses and their message IDs.
    """
    try:
        # Create a dictionary of addresses with unique unsubscribe URLs
        # The addresses are encoded using the SHA256 hashing algorithm from the hashlib library
        # to create a unique and obfuscated unsubscribe URL for each recipient. This ensures
        # that the unsubscribe link is specific to each individual recipient, preventing
        # potential abuse or unauthorized unsubscribes. The hashed value is appended to the
        # base unsubscribe URL, allowing the email service to identify the intended recipient
        # when the unsubscribe link is clicked, while also protecting the recipient's personal
        # email address from being directly exposed in the URL.
        addresses = {
            address: {
                "ChannelType": "EMAIL",
                "Substitutions": {
                    "unsubscribeURL": [f"https://example.com/unsub/{hashlib.sha256(address.encode()).hexdigest()}"],
                }
            }
            for address in to_addresses
        }
        
        # Send email using Amazon Pinpoint
        response = pinpoint_client.send_messages(
            ApplicationId=project_id,
            MessageRequest={
                "Addresses": addresses,
                "MessageConfiguration": {
                    "EmailMessage": {
                        "FromAddress": sender,
                        "SimpleEmail": {
                            "Subject": {"Charset": CHARSET, "Data": subject},
                            "HtmlPart": {"Charset": CHARSET, "Data": html_message},
                            "TextPart": {"Charset": CHARSET, "Data": text_message},
                            "Headers": [
                                # RFC 2369 encloses each List-Unsubscribe URL in angle brackets
                                {"Name": "List-Unsubscribe", "Value": "<{{unsubscribeURL}}>"},
                                {"Name": "List-Unsubscribe-Post", "Value": "List-Unsubscribe=One-Click"}
                            ],
                        },
                    }
                }
            }
        )
    except ClientError as e:
        # Log exception if sending email fails
        logger.exception("Couldn't send email: %s", e)
        raise
    else:
        # Return a dictionary of addresses and their respective message IDs
        return {
            address: message["MessageId"]
            for address, message in response["MessageResponse"]["Result"].items()
        }

def main():
    # Sample data for sending email
    project_id = "ce796be37f32f178af652b26eexample"  # Amazon Pinpoint project ID
    sender = "[email protected]"  # Verified sender email address
    to_addresses = ["[email protected]", "[email protected]", "[email protected]"]  # Recipient email addresses
    subject = "Amazon Pinpoint Unsubscribe Headers Test (SDK for Python (Boto3))"  # Email subject
    text_message = """Amazon Pinpoint Test (SDK for Python)
    -------------------------------------
    This email was sent with Amazon Pinpoint using the AWS SDK for Python (Boto3).
    For more information, see https://aws.amazon.com/sdk-for-python/
                """  # Plain text message
    html_message = """<html>
    <head></head>
    <body>
      <h1>Amazon Pinpoint Test (SDK for Python (Boto3))</h1>
      <p>This email was sent with
        <a href='https://aws.amazon.com/pinpoint/'>Amazon Pinpoint</a> using the
        <a href='https://aws.amazon.com/sdk-for-python/'>
          AWS SDK for Python (Boto3)</a>.</p>
    </body>
    </html>
                """  # HTML message

    # Create a Pinpoint client
    pinpoint_client = boto3.client("pinpoint", region_name=REGION)

    print("Sending email.")
    # Send email and print message IDs
    try:
        message_ids = send_email_message(
            pinpoint_client,
            project_id,
            sender,
            to_addresses,
            subject,
            html_message,
            text_message,
        )
        print(f"Message sent! Message IDs: {message_ids}")
    except ClientError as e:
        print(f"Failed to send messages: {e}")

# Entry point of the script
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)  # Set logging level to INFO
    main()

Send an email message with an existing email template

If you use message templates to send email messages via the AWS SDK for Python (Boto3), you can add the headers for List-Unsubscribe and List-Unsubscribe-Post into the template, and then fill those variables with unique values per recipient, as shown in the code example below. First, you would create the template via the UI and add the headers in the new fields as shown in the image below.

Or you can create the template, with headers, via the AWS CLI:

aws pinpoint create-email-template --template-name MyEmailTemplate \
--email-template-request '{
    "Subject": "Amazon Pinpoint Unsubscribe Headers Test using email template",
    "TextPart": "Hello, welcome to our service. We are glad to have you with us. If you wish to unsubscribe, click here: {{unsubscribeURL}}",
    "HtmlPart": "<html><body><h1>Hello, welcome to our service</h1><p>We are glad to have you with us.</p><p>If you wish to unsubscribe, click <a href=\"{{unsubscribeURL}}\">here</a>.</p></body></html>",
    "DefaultSubstitutions": "{\"unsubscribeURL\": \"https://example.com/unsubscribe\"}",
    "Headers": [
            {"Name": "List-Unsubscribe","Value": "<{{unsubscribeURL}}>"},
            {"Name": "List-Unsubscribe-Post","Value": "List-Unsubscribe=One-Click"}
        ]
  }'

In this next example, we include a secret hash key. With this format, the unsubscribe URL includes the Pinpoint project ID and a hash of the email address combined with the secret key, which makes the link harder to guess or forge than a hash of the address alone.

import logging  # Logging library to log messages
import boto3  # AWS SDK for Python
from botocore.exceptions import ClientError  # Exception handling for boto3
import hashlib  # Library to generate unique hashes

# Configure logger
logger = logging.getLogger(__name__)

# Define constants
REGION = 'us-east-1'
HASH_SECRET_KEY = "my_secret_key"  # Replace with your secret key

def send_templated_email_message(
    pinpoint_client, 
    project_id, 
    sender, 
    to_addresses, 
    template_name, 
    template_version
):
    """
    Sends an email message using an existing email template.

    :param pinpoint_client: A Boto3 Pinpoint client.
    :param project_id: The Amazon Pinpoint project ID to use when you send this message.
    :param sender: The "From" address. This address must be verified in
                   Amazon Pinpoint in the AWS Region you're using to send email.
    :param to_addresses: The list of addresses on the "To" line. If your Amazon Pinpoint account
                         is in the sandbox, these addresses must be verified.
    :param template_name: The name of the email template to use when sending the message.
    :param template_version: The version number of the message template.

    :return: A dict of to_addresses and their message IDs.
    """
    try:
        # Create a dictionary of addresses with unique unsubscribe URLs
        # The addresses are encoded using the SHA256 hashing algorithm from the hashlib library
        # to create a unique and obfuscated unsubscribe URL for each recipient. This ensures
        # that the unsubscribe link is specific to each individual recipient, preventing
        # potential abuse or unauthorized unsubscribes. The hashed value is appended to the
        # base unsubscribe URL, allowing the email service to identify the intended recipient
        # when the unsubscribe link is clicked, while also protecting the recipient's personal
        # email address from being directly exposed in the URL.
        addresses = {
            address: {
                "ChannelType": "EMAIL",
                "Substitutions": {
                    "unsubscribeURL": [
                        f"https://www.example.com/preferences/index.html?pid={project_id}&h={hashlib.sha256((address + HASH_SECRET_KEY).encode()).hexdigest()}"
                    ]
                }
            }
            for address in to_addresses
        }
        # Send templated email using Amazon Pinpoint
        response = pinpoint_client.send_messages(
            ApplicationId=project_id,
            MessageRequest={
                "Addresses": addresses,
                "MessageConfiguration": {"EmailMessage": {"FromAddress": sender}},
                "TemplateConfiguration": {
                    "EmailTemplate": {
                        "Name": template_name,
                        "Version": template_version,
                    },
                },
            },
        )
    except ClientError as e:
        # Log exception if sending email fails
        logger.exception("Couldn't send email: %s", e)
        raise
    else:
        # Return a dictionary of addresses and their respective message IDs
        return {
            address: message["MessageId"]
            for address, message in response["MessageResponse"]["Result"].items()
        }


def main():
    # Sample data for sending email
    project_id = "ce796be37f32f178af652b26eexample"  # Amazon Pinpoint project ID
    sender = "[email protected]"  # Verified sender email address
    to_addresses = ["[email protected]", "[email protected]", "[email protected]"]  # Recipient email addresses
    template_name = "MyEmailTemplate"
    template_version = "1"

    # Create a Pinpoint client
    pinpoint_client = boto3.client("pinpoint", region_name=REGION)
    print("Sending email.")
    # Send email and print message IDs
    try:
        message_ids = send_templated_email_message(
            pinpoint_client,
            project_id,
            sender,
            to_addresses,
            template_name,
            template_version,
        )
        print(f"Message sent! Message IDs: {message_ids}")
    except ClientError as e:
        print(f"Failed to send messages: {e}")
        
# Entry point of the script
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)  # Set logging level to INFO
    main()
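On the receiving side, the unsubscribe system has to map the `h` value in the URL back to a recipient. The sketch below recomputes the same SHA-256 hash used in the example above and compares it in constant time. The function names and the in-memory list of known addresses are assumptions for illustration; a real system would more likely look the hash up in a database or endpoint attribute.

```python
import hashlib
import hmac
from typing import Iterable, Optional

HASH_SECRET_KEY = "my_secret_key"  # Must match the key used when sending


def unsubscribe_hash(address: str, secret: str = HASH_SECRET_KEY) -> str:
    """Recreate the hash that the sending code places in the unsubscribe URL."""
    return hashlib.sha256((address + secret).encode()).hexdigest()


def find_address_for_hash(received_hash: str, known_addresses: Iterable[str],
                          secret: str = HASH_SECRET_KEY) -> Optional[str]:
    """Return the address whose hash matches received_hash, or None."""
    for address in known_addresses:
        expected = unsubscribe_hash(address, secret)
        # compare_digest avoids leaking match position through timing.
        if hmac.compare_digest(expected.encode(), received_hash.encode()):
            return address
    return None


if __name__ == "__main__":
    recipients = ["first@example.com", "second@example.com"]  # placeholders
    token = unsubscribe_hash("second@example.com")
    print(find_address_for_hash(token, recipients))
```

The linear scan is only suitable for small lists. Storing the hash alongside each endpoint when the message is sent avoids recomputing it for every request.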

Pinpoint Campaigns via API (runtime)

If you send emails using Pinpoint campaigns via the API call (runtime), you can add the headers as described below:

"EmailMessage": {
    "Body": "string",
    "Title": "string",
    "HtmlBody": "string",
    "FromAddress": "string",
    "Headers": [
        {
            "Name": "string",
            "Value": "string"
        }
    ]
}
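The structure above can be filled in from Python before it is passed to a campaign API call. This is a sketch under assumptions: `build_campaign_email_message` is a hypothetical helper, the field names are taken from the fragment above, and our understanding is that the resulting dictionary belongs under the "EmailMessage" key of the campaign's message configuration.

```python
from typing import Any, Dict


def build_campaign_email_message(title: str, html_body: str, text_body: str,
                                 from_address: str,
                                 unsubscribe_url: str) -> Dict[str, Any]:
    """Build a campaign EmailMessage that carries the one-click headers."""
    return {
        "Title": title,
        "HtmlBody": html_body,
        "Body": text_body,
        "FromAddress": from_address,
        "Headers": [
            {"Name": "List-Unsubscribe", "Value": f"<{unsubscribe_url}>"},
            {"Name": "List-Unsubscribe-Post", "Value": "List-Unsubscribe=One-Click"},
        ],
    }


if __name__ == "__main__":
    # Placeholder content and addresses for illustration only.
    message = build_campaign_email_message(
        "Monthly update",
        "<html><body><p>Hello</p></body></html>",
        "Hello",
        "sender@example.com",
        "https://example.com/unsub/example-token",
    )
    print(message["Headers"])
```

For a campaign that targets many recipients, the `unsubscribe_url` argument would typically be a message variable rather than a literal URL, so that each recipient receives their own link.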

Pinpoint Campaigns & Journeys via AWS Console

The Pinpoint console enables you to create (or update) your email templates to add support for up to 15 different headers, including the “List-Unsubscribe” and “List-Unsubscribe-Post” headers. Open an existing template, or create a new one, in the Pinpoint console, scroll to the bottom of the visual message editor, expand the Headers option, and insert the header names and values. Note that if you only use the console UI to send your campaigns and journeys, you can store the encoded List-Unsubscribe URL as an attribute in the endpoint, then use that attribute as the header value, as shown below:
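One way to store the encoded URL on the endpoint is sketched here, assuming a custom attribute named `unsubscribeURL` (the attribute name and the helper are hypothetical). The helper calls the Pinpoint `update_endpoint` operation through a Boto3 client supplied by the caller. The template header value would then reference the attribute with a message variable such as `{{Attributes.unsubscribeURL}}`.

```python
# Header value to enter in the template, assuming the attribute name below.
HEADER_VALUE = "<{{Attributes.unsubscribeURL}}>"


def store_unsubscribe_url(pinpoint_client, project_id: str, endpoint_id: str,
                          unsubscribe_url: str):
    """Save the recipient's unsubscribe URL as a custom endpoint attribute.

    pinpoint_client is expected to be a Boto3 Pinpoint client, for example
    boto3.client("pinpoint", region_name="us-east-1").
    """
    return pinpoint_client.update_endpoint(
        ApplicationId=project_id,
        EndpointId=endpoint_id,
        # Custom attribute values are lists of strings.
        EndpointRequest={"Attributes": {"unsubscribeURL": [unsubscribe_url]}},
    )
```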

Conclusion

In this blog post, we provided Pinpoint customers with the information and guidance needed to enable a one-click unsubscribe link in their recipients’ compatible email apps via the “List-Unsubscribe” and “List-Unsubscribe-Post” email headers. Following this guidance, together with properly authenticating your email sending domains and keeping spam complaints below prescribed thresholds, will help ensure high rates of Pinpoint email deliverability.

We welcome your comments on this post below. For additional information, refer to these resources, or contact your AWS account team.

About the Authors

Zip

Zip is an Amazon Pinpoint and Amazon Simple Email Service Sr. Specialist Solutions Architect at AWS. Outside of work he enjoys time with his family, cooking, mountain biking and plogging.

Darren Roback

Darren is a Senior Solutions Architect with Amazon Web Services based in St. Louis, Missouri. He has a background in Security and Compliance, Serverless Event-Driven Architecture, and Enterprise Architecture. At AWS, Darren partners with customers to help them solve business challenges with AWS technology. Outside of work, Darren enjoys spending time in his shop working on woodworking projects.

Bruno Giorgini

Bruno Giorgini is a Senior Solutions Architect specializing in Pinpoint and SES. With over two decades of experience in the IT industry, Bruno has been dedicated to assisting customers of all sizes in achieving their objectives. When he is not crafting innovative solutions for clients, Bruno enjoys spending quality time with his wife and son, exploring the scenic hiking trails around the SF Bay Area.

In-place version upgrades for applications on Amazon Managed Service for Apache Flink now supported

Post Syndicated from Jeremy Ber original https://aws.amazon.com/blogs/big-data/in-place-version-upgrades-for-applications-on-amazon-managed-service-for-apache-flink-now-supported/

If you are an existing user of Amazon Managed Service for Apache Flink and are excited about the recent announcement of support for Apache Flink runtime version 1.18, you can now statefully migrate your existing applications that use older versions of Apache Flink to a more recent version, including Apache Flink version 1.18. With in-place version upgrades, you can upgrade your application runtime version simply, statefully, and without incurring data loss or adding orchestration to your workload.

Apache Flink is an open source distributed processing engine, offering powerful programming interfaces for both stream and batch processing, with first-class support for stateful processing and event time semantics. Apache Flink supports multiple programming languages (Java, Python, Scala, and SQL) and multiple APIs with different levels of abstraction, which can be used interchangeably in the same application.

Managed Service for Apache Flink is a fully managed, serverless experience in running Apache Flink applications, and now supports Apache Flink 1.18.1, the latest released version of Apache Flink at the time of writing.

In this post, we explore in-place version upgrades, a new feature offered by Managed Service for Apache Flink. We provide guidance on getting started and offer detailed insights into the feature. Later, we deep dive into how the feature works and some sample use cases.

This post is complemented by an accompanying video on in-place version upgrades, and code samples to follow along.

Use the latest features within Apache Flink without losing state

With each new release of Apache Flink, we observe continuous improvements across all aspects of the stateful processing engine, from connector support to API enhancements, language support, checkpoint and fault tolerance mechanisms, data format compatibility, state storage optimization, and various other enhancements. To learn more about the features supported in each Apache Flink version, you can consult the Apache Flink blog, which discusses at length each of the Flink Improvement Proposals (FLIPs) incorporated into each of the versioned releases. For the most recent version of Apache Flink supported on Managed Service for Apache Flink, we have curated some notable additions to the framework you can now use.

With the release of in-place version upgrades, you can now upgrade to any version of Apache Flink within the same application, retaining state in between upgrades. This feature is also useful for applications that don’t require retaining state, because it makes the runtime upgrade process seamless. You don’t need to create a new application in order to upgrade in-place. In addition, logs, metrics, application tags, application configurations, VPCs, and other settings are retained between version upgrades. Any existing automation or continuous integration and continuous delivery (CI/CD) pipelines built around your existing applications don’t require changes post-upgrade.

In the following sections, we share best practices and considerations while upgrading your applications.

Make sure your application code runs successfully in the latest version

Before upgrading to a newer runtime version of Apache Flink on Managed Service for Apache Flink, update your application code, dependency versions, and client configurations to match the target runtime version, because certain Apache Flink APIs and connectors can behave inconsistently across versions. Additionally, the Apache Flink interfaces your application uses may have changed between versions and require code updates. Refer to Upgrading Applications and Flink Versions for more information about how to avoid any unexpected inconsistencies.

The next recommended step is to test your application locally with the newly upgraded Apache Flink runtime. Make sure the correct version is specified in your build file for each of your dependencies. This includes the Apache Flink runtime and API and the recommended connectors for the new Apache Flink runtime. Running your application with realistic data and throughput profiles can surface issues with code compatibility and API changes before you deploy onto Managed Service for Apache Flink.

After you have sufficiently tested your application with the new runtime version, you can begin the upgrade process. Refer to General best practices and recommendations for more details on how to test the upgrade process itself.

It is strongly recommended to test your upgrade path on a non-production environment to avoid service interruptions to your end-users.

Build your application JAR and upload to Amazon S3

You can build your Maven projects by following the instructions in How to use Maven to configure your project. If you’re using Gradle, refer to How to use Gradle to configure your project. For Python applications, refer to the GitHub repo for packaging instructions.

Next, you can upload this newly created artifact to Amazon Simple Storage Service (Amazon S3). It is strongly recommended to upload this artifact with a different name or different location than the existing running application artifact to allow for rolling back the application should issues arise. Use the following code:

aws s3 cp <<artifact>> s3://<<bucket-name>>/path/to/file.extension

The following is an example:

aws s3 cp target/my-upgraded-application.jar s3://my-managed-flink-bucket/1_18/my-upgraded-application.jar
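As a minimal sketch of the naming advice above, the following Python helpers derive a per-runtime S3 key so the upgraded artifact never overwrites the one the running application uses. The function names and the `1_18/` prefix convention are illustrative assumptions modeled on the example command, not part of any AWS tooling.

```python
from typing import List


def versioned_artifact_key(jar_name: str, flink_version: str) -> str:
    """Build an S3 key such as '1_18/app.jar' from a Flink version like '1.18.1'.

    Only the major and minor components are used for the prefix (an assumed
    convention that mirrors the example command in this post).
    """
    major_minor = flink_version.split(".")[:2]
    prefix = "_".join(major_minor)
    return f"{prefix}/{jar_name}"


def s3_cp_command(local_path: str, bucket: str, key: str) -> List[str]:
    """Return the argument list for the `aws s3 cp` upload (not executed here)."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]


if __name__ == "__main__":
    key = versioned_artifact_key("my-upgraded-application.jar", "1.18.1")
    print(" ".join(s3_cp_command("target/my-upgraded-application.jar",
                                 "my-managed-flink-bucket", key)))
```

Keeping the previous artifact at its original key means a rollback can still find the code it expects.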

Take a snapshot of the current running application

It is recommended to take a snapshot of your current running application state prior to starting the upgrade process. This enables you to roll back your application statefully if issues occur during or after your upgrade. Even if your applications don’t use state directly in the case of windows, process functions, or similar, they may still use Apache Flink state in the case of a source like Apache Kafka or Amazon Kinesis, remembering the position in the topic or shard it last left off before restarting. This helps prevent duplicate data entering the stream processing application.
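The snapshot step can be scripted. The following is a sketch, assuming a boto3 `kinesisanalyticsv2` client is passed in as `client`; the function name, polling interval, and retry count are hypothetical choices made for illustration.

```python
import time


def snapshot_before_upgrade(client, app_name, snapshot_name,
                            poll_seconds=5.0, max_polls=60):
    """Request a snapshot and wait until it reports READY.

    `client` is expected to behave like boto3.client("kinesisanalyticsv2").
    """
    client.create_application_snapshot(
        ApplicationName=app_name, SnapshotName=snapshot_name
    )
    for _ in range(max_polls):
        details = client.describe_application_snapshot(
            ApplicationName=app_name, SnapshotName=snapshot_name
        )["SnapshotDetails"]
        status = details["SnapshotStatus"]
        if status == "READY":
            return details
        if status == "FAILED":
            raise RuntimeError(f"Snapshot {snapshot_name} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Snapshot {snapshot_name} not READY after {max_polls} polls")


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   client = boto3.client("kinesisanalyticsv2")
#   snapshot_before_upgrade(client, "my-app", "pre-upgrade-1-18")
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise against a stub before pointing it at a real application.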

Some things to keep in mind:

  • Stateful downgrades are not compatible and will not be accepted due to snapshot incompatibility.
  • Validation of the state snapshot compatibility happens when the application attempts to start in the new runtime version. This will happen automatically for applications in RUNNING mode, but for applications that are upgraded in READY state, the compatibility check will only happen when the application starts by calling the RunApplication action.
  • Stateful upgrades from an older version of Apache Flink to a newer version are generally compatible with rare exceptions. Make sure your current Flink version is snapshot-compatible with the target Flink version by consulting the Apache Flink state compatibility table.
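Because stateful downgrades are rejected, it can help to check the direction of a runtime change before submitting it. The following sketch assumes runtime identifiers of the form `FLINK-<major>_<minor>`, as used in the commands in this post; the helper names are hypothetical. It only checks ordering and does not replace the Apache Flink state compatibility table.

```python
import re


def parse_runtime(runtime: str):
    """Turn 'FLINK-1_18' into the tuple (1, 18) for numeric comparison."""
    match = re.fullmatch(r"FLINK-(\d+)_(\d+)", runtime)
    if match is None:
        raise ValueError(f"Unrecognized runtime identifier: {runtime!r}")
    return int(match.group(1)), int(match.group(2))


def is_downgrade(current: str, target: str) -> bool:
    """Return True when the target runtime is older than the current one."""
    return parse_runtime(target) < parse_runtime(current)


if __name__ == "__main__":
    print(is_downgrade("FLINK-1_18", "FLINK-1_15"))
    print(is_downgrade("FLINK-1_8", "FLINK-1_11"))
```

Comparing parsed integers instead of raw strings matters here: as text, `1_8` sorts after `1_11`, which would misclassify that upgrade as a downgrade.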

Begin the upgrade of a running application

After you have tested your new application, uploaded the artifacts to Amazon S3, and taken a snapshot of the current application, you are now ready to begin upgrading your application. You can upgrade your applications using the UpdateApplication action:

aws kinesisanalyticsv2 update-application \
    --region ${region} \
    --application-name ${appName} \
    --current-application-version-id 1 \
    --runtime-environment-update "FLINK-1_18" \
    --application-configuration-update '{
        "ApplicationCodeConfigurationUpdate": {
            "CodeContentTypeUpdate": "ZIPFILE",
            "CodeContentUpdate": {
                "S3ContentLocationUpdate": {
                    "BucketARNUpdate": "'${bucketArn}'",
                    "FileKeyUpdate": "1_18/amazon-msf-java-stream-app-1.0.jar"
                }
            }
        }
    }'

This command invokes several processes to perform the upgrade:

  • Compatibility check – The API will check if your existing snapshot is compatible with the target runtime version. If it is compatible, your application will transition into UPDATING status; otherwise, the upgrade is rejected and your application continues processing data unaffected.
  • Restore from latest snapshot with new code – The application will then attempt to start using the most recent snapshot. If the application starts running and behavior appears in-line with expectations, no further action is needed.
  • Manual intervention may be required – Keep a close watch on your application throughout the upgrade process. If there are unexpected restarts, failures, or issues of any kind, it is recommended to roll back to the previous version of your application.

When the application is in RUNNING status in the new application version, it is still recommended to closely monitor the application for any unexpected behavior, state incompatibility, restarts, or anything else related to performance.
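For pipelines that use an SDK rather than the CLI, the UpdateApplication request shown earlier in this section can be assembled as a plain dictionary. This is a sketch under the assumption that the result is passed to the boto3 `kinesisanalyticsv2` client's `update_application` call; the builder function and the placeholder names in the usage note are hypothetical.

```python
def build_upgrade_request(app_name, current_version_id, target_runtime,
                          bucket_arn, file_key):
    """Assemble keyword arguments for an in-place runtime upgrade."""
    return {
        "ApplicationName": app_name,
        "CurrentApplicationVersionId": current_version_id,
        # Target runtime, for example "FLINK-1_18".
        "RuntimeEnvironmentUpdate": target_runtime,
        "ApplicationConfigurationUpdate": {
            "ApplicationCodeConfigurationUpdate": {
                "CodeContentTypeUpdate": "ZIPFILE",
                "CodeContentUpdate": {
                    "S3ContentLocationUpdate": {
                        "BucketARNUpdate": bucket_arn,
                        "FileKeyUpdate": file_key,
                    }
                },
            }
        },
    }


# Usage (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   request = build_upgrade_request(
#       "my-app", 1, "FLINK-1_18",
#       "arn:aws:s3:::my-managed-flink-bucket",
#       "1_18/amazon-msf-java-stream-app-1.0.jar")
#   boto3.client("kinesisanalyticsv2").update_application(**request)
```

Building the request separately from sending it makes it straightforward to log or review exactly what will change before the upgrade starts.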

Unexpected issues while upgrading

If you encounter any issues with your application following the upgrade, you can roll back your running application to the previous application version. This is the recommended approach if your application is unhealthy or unable to take checkpoints or snapshots while upgrading. Additionally, it’s recommended to roll back if you observe unexpected behavior from the application.

There are several scenarios to be aware of when upgrading that may require a rollback:

  • An application stuck in UPDATING status for any reason can be rolled back to the original runtime with the RollbackApplication action.
  • If an application successfully upgrades to a newer Apache Flink runtime and switches to RUNNING status, but exhibits unexpected behavior, it can be reverted to the prior application version with the RollbackApplication action.
  • If the UpdateApplication call itself fails, the upgrade does not take place to begin with.
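The scenarios above can be summarized as a small decision helper. This is only an illustrative sketch: the function, its return labels, and the 30-minute "stuck" threshold are assumptions, and the right threshold depends on your application's state size and startup time.

```python
def next_step(update_accepted, status, seconds_in_status=0,
              healthy=True, stuck_after_seconds=1800):
    """Suggest what to do after attempting an in-place upgrade."""
    if not update_accepted:
        # The update request failed, so the upgrade never started.
        return "fix-and-retry"
    if status == "UPDATING":
        # Assumed threshold for treating the application as stuck.
        if seconds_in_status >= stuck_after_seconds:
            return "rollback"
        return "wait"
    if status == "RUNNING":
        return "monitor" if healthy else "rollback"
    return "investigate"


if __name__ == "__main__":
    print(next_step(True, "UPDATING", seconds_in_status=3600))
    print(next_step(True, "RUNNING", healthy=False))
```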

Edge cases

There are several known issues you may face when upgrading your Apache Flink versions on Managed Service for Apache Flink. Refer to Precautions and known issues for more details to see if they apply to your specific applications. In this section, we walk through one such use case of state incompatibility.

Consider a scenario where you have an Apache Flink application currently running on runtime version 1.11, using the Amazon Kinesis Data Streams connector for data retrieval. Due to notable alterations made to the Kinesis Data Streams connector across various Apache Flink runtime versions, transitioning directly from 1.11 to 1.13 or higher while preserving state may pose difficulties. Notably, there are disparities in the software packages employed: Amazon Kinesis Connector vs. Apache Kinesis Connector. Consequently, this difference will lead to complications when attempting to restore state from older snapshots.

For this specific scenario, it’s recommended to use the Amazon Kinesis Connector Flink State Migrator, a tool to help migrate Kinesis Data Streams connectors to Apache Kinesis Data Stream connectors without losing state in the source operator.

For illustrative purposes, let’s walk through the code to upgrade the application:

aws kinesisanalyticsv2 update-application \
    --region ${region} \
    --application-name ${appName} \
    --current-application-version-id 1 \
    --runtime-environment-update "FLINK-1_13" \
    --application-configuration-update '{
        "ApplicationCodeConfigurationUpdate": {
            "CodeContentTypeUpdate": "ZIPFILE",
            "CodeContentUpdate": {
                "S3ContentLocationUpdate": {
                    "BucketARNUpdate": "'${bucketArn}'",
                    "FileKeyUpdate": "1_13/new-kinesis-application-1-13.jar"
                }
            }
        }
    }'

This command will issue an update command and run all compatibility checks. Additionally, the application may even start, displaying the RUNNING status on the Managed Service for Apache Flink console and API.

However, on closer inspection of your Apache Flink Dashboard, looking at the fullRestarts metric and application behavior, you may find that the application has failed to start because the state from the 1.11 version of the application is incompatible with the new application, due to the connector change described previously.
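One way to spot this situation programmatically is to look at how the fullRestarts value changes over a short window. The sketch below assumes you have already retrieved a time-ordered list of fullRestarts samples (for example, from your metrics system) and that the metric is a cumulative counter; the helper names and the threshold of three restarts are hypothetical.

```python
def restart_increase(samples):
    """Sum the positive increments between consecutive counter samples.

    Decreases are ignored, on the assumption that they indicate the
    counter was reset rather than restarts being undone.
    """
    return sum(max(later - earlier, 0)
               for earlier, later in zip(samples, samples[1:]))


def looks_like_restart_loop(samples, max_restarts=3):
    """Flag a window whose restart count exceeds the assumed threshold."""
    return restart_increase(samples) > max_restarts


if __name__ == "__main__":
    print(looks_like_restart_loop([0, 0, 2, 5, 9]))
    print(looks_like_restart_loop([4, 4, 4]))
```

An application can report RUNNING status while this check fails, which is why monitoring restarts alongside status is useful after an upgrade.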

You can roll back to the previous running version, restoring from the successfully taken snapshot, as shown in the following code. If the application has no snapshots, Managed Service for Apache Flink will reject the rollback request.

aws kinesisanalyticsv2 rollback-application --application-name ${appName} --current-application-version-id 2 --region ${region}

After issuing this command, your application should be running again in the original runtime without any data loss, thanks to the application snapshot that was taken previously.
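The same rollback can be issued from Python. The following sketch assumes `client` behaves like a boto3 `kinesisanalyticsv2` client; it looks up the current application version ID first, because the rollback request must reference it. The function name is hypothetical.

```python
def rollback_to_previous_version(client, app_name):
    """Roll back an application, using its current version ID as the guard."""
    detail = client.describe_application(
        ApplicationName=app_name
    )["ApplicationDetail"]
    version_id = detail["ApplicationVersionId"]
    client.rollback_application(
        ApplicationName=app_name,
        CurrentApplicationVersionId=version_id,
    )
    return version_id


# Usage (requires boto3 and AWS credentials; the name is a placeholder):
#   import boto3
#   rollback_to_previous_version(boto3.client("kinesisanalyticsv2"), "my-app")
```

Reading the version ID at call time avoids hard-coding a value such as 2, which would be rejected if the application had been updated again in the meantime.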

This scenario is meant as a precaution, and a recommendation that you should test your application upgrades in a lower environment prior to production. For more details about the upgrade process, along with general best practices and recommendations, refer to In-place version upgrades for Apache Flink.

Conclusion

In this post, we covered the upgrade path for existing Apache Flink applications running on Managed Service for Apache Flink and how you should make modifications to your application code, dependencies, and application JAR prior to upgrading. We also recommended taking snapshots of your application prior to the upgrade process, along with testing your upgrade path in a lower environment. We hope you found this post helpful and that it provides valuable insights into upgrading your applications seamlessly.

To learn more about the new in-place version upgrade feature from Managed Service for Apache Flink, refer to In-place version upgrades for Apache Flink, the how-to video, the GitHub repo, and Upgrading Applications and Flink Versions.


About the Authors

Jeremy Ber

Jeremy Ber boasts over a decade of expertise in stream processing, with the last four years dedicated to AWS as a Streaming Specialist Solutions Architect. With a robust ten-year career background, Jeremy’s commitment to stream processing, notably Apache Flink, underscores his professional endeavors. Transitioning from Software Engineer to his current role, Jeremy prioritizes assisting customers in resolving complex challenges with precision. Whether elucidating Amazon Managed Streaming for Apache Kafka (Amazon MSK) or navigating AWS’s Managed Service for Apache Flink, Jeremy’s proficiency and dedication ensure efficient problem-solving. In his professional approach, excellence is maintained through collaboration and innovation.

Krzysztof Dziolak is a Sr. Software Engineer on Amazon Managed Service for Apache Flink. He works with the product team and customers to make streaming solutions more accessible to the engineering community.

Use AWS Data Exchange to seamlessly share Apache Hudi datasets

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/use-aws-data-exchange-to-seamlessly-share-apache-hudi-datasets/

Apache Hudi was originally developed by Uber in 2016 to bring to life a transactional data lake that could quickly and reliably absorb updates to support the massive growth of the company’s ride-sharing platform. Apache Hudi is now widely used to build very large-scale data lakes by many across the industry. Today, Hudi is the most active and high-performing open source data lakehouse project, known for fast incremental updates and a robust services layer.

Apache Hudi serves as an important data management tool because it allows you to bring full online transaction processing (OLTP) database functionality to data stored in your data lake. As a result, Hudi users can store massive amounts of data with the data scaling costs of a cloud object store, rather than the more expensive scaling costs of a data warehouse or database. It also provides data lineage, integration with leading access control and governance mechanisms, and incremental ingestion of data for near real-time performance. AWS, along with its partners in the open source community, has embraced Apache Hudi in several services, offering Hudi compatibility in Amazon EMR, Amazon Athena, Amazon Redshift, and more.

AWS Data Exchange is a service provided by AWS that enables you to find, subscribe to, and use third-party datasets in the AWS Cloud. A dataset in AWS Data Exchange is a collection of data that can be changed or updated over time. It also provides a platform through which a data producer can make their data available for consumption for subscribers.

In this post, we show how you can take advantage of the data sharing capabilities in AWS Data Exchange on top of Apache Hudi.

Benefits of AWS Data Exchange

AWS Data Exchange offers a series of benefits to both parties. For subscribers, it provides a convenient way to access and use third-party data without the need to build and maintain data delivery, entitlement, or billing technology. Subscribers can find and subscribe to thousands of products from qualified AWS Data Exchange providers and use them with AWS services. For providers, AWS Data Exchange offers a secure, transparent, and reliable channel to reach AWS customers. It eliminates the need to build and maintain data delivery, entitlement, and billing technology, allowing providers to focus on creating and managing their datasets.

To become a provider on AWS Data Exchange, there are a few steps to determine eligibility. Providers need to register to be a provider, make sure their data meets the legal eligibility requirements, and create datasets, revisions, and import assets. Providers can define public offers for their data products, including prices, durations, data subscription agreements, refund policies, and custom offers. The AWS Data Exchange API and AWS Data Exchange console can be used for managing datasets and assets.

Overall, AWS Data Exchange simplifies the process of data sharing in the AWS Cloud by providing a platform for customers to find and subscribe to third-party data, and for providers to publish and manage their data products. It offers benefits for both subscribers and providers by eliminating the need for complex data delivery and entitlement technology and providing a secure and reliable channel for data exchange.

Solution overview

Combining the scale and operational capabilities of Apache Hudi with the secure data sharing features of AWS Data Exchange enables you to maintain a single source of truth for your transactional data. Simultaneously, it enables automatic business value generation by allowing other stakeholders to use the insights that the data can provide. This post shows how to set up such a system in your AWS environment using Amazon Simple Storage Service (Amazon S3), Amazon EMR, Amazon Athena, and AWS Data Exchange. The following diagram illustrates the solution architecture.

Set up your environment for data sharing

You need to register as a data provider before you create datasets and list them in AWS Data Exchange as data products. Complete the following steps to register as a data provider:

  1. Sign in to the AWS account that you want to use to list and manage products on AWS Data Exchange.
    As a provider, you are responsible for complying with these guidelines and the Terms and Conditions for AWS Marketplace Sellers and the AWS Customer Agreement. AWS may update these guidelines. AWS removes any product that breaches these guidelines and may suspend the provider from future use of the service. AWS Data Exchange may have some AWS Regional requirements; refer to Service endpoints for more information.
  2.  Open the AWS Marketplace Management Portal registration page and enter the relevant information about how you will use AWS Data Exchange.
  3. For Legal business name, enter the name that your customers see when subscribing to your data.
  4. Review the terms and conditions and select I have read and agree to the AWS Marketplace Seller Terms and Conditions.
  5. Select the information related to the types of products you will be creating as a data provider.
  6. Choose Register & Sign into Management Portal.

If you want to submit paid products to AWS Marketplace or AWS Data Exchange, you must provide your tax and banking information. You can add this information on the Settings page:

  1. Choose the Payment information tab.
  2. Choose Complete tax information and complete the form.
  3. Choose Complete banking information and complete the form.
  4. Choose the Public profile tab and update your public profile.
  5. Choose the Notifications tab and configure an additional email address to receive notifications.

You’re now ready to configure seamless data sharing with AWS Data Exchange.

Upload Apache Hudi datasets to AWS Data Exchange

After you create your Hudi datasets and register as a data provider, complete the following steps to create the datasets in AWS Data Exchange:

  1. Sign in to the AWS account that you want to use to list and manage products on AWS Data Exchange.
  2. On the AWS Data Exchange console, choose Owned data sets in the navigation pane.
  3. Choose Create data set.
  4. Select the dataset type you want to create (for this post, we select Amazon S3 data access).
  5. Choose Choose Amazon S3 locations.
  6. Choose the Amazon S3 location where you have your Hudi datasets.

After you add the Amazon S3 location to register in AWS Data Exchange, a bucket policy is generated.

  1. Copy the JSON file and update the bucket policy in Amazon S3.
  2. After you update the bucket policy, choose Next.
  3. Wait for the CREATE_S3_DATA_ACCESS_FROM_S3_BUCKET job to show as Completed, then choose Finalize data set.

Publish a product using the registered Hudi dataset

Complete the following steps to publish a product using the Hudi dataset:

  1. On the AWS Data Exchange console, choose Products in the navigation pane.
    Make sure you’re in the Region where you want to create the product.
  2. Choose Publish new product to start the workflow to create a new product.
  3. Choose which product visibility you want to have: public (it will be publicly available in AWS Data Exchange catalog as well as the AWS Marketplace websites) or private (only the AWS accounts you share with will have access to it).
  4. Select the sensitive information category of the data you are publishing.
  5. Choose Next.
  6. Select the dataset that you want to add to the product, then choose Add selected to add the dataset to the new product.
  7. Define access to your dataset revisions based on time. For more information, see Revision access rules.
  8. Choose Next.
  9. Provide the information for a new product, including a short description.
    One of the required fields is the product logo, which must be in a supported image format (PNG, JPG, or JPEG) and the file size must be 100 KB or less.
  10. Optionally, in the Define product section, under Data dictionaries and samples, select a dataset and choose Edit to upload a data dictionary to the product.
  11. For Long description, enter the description to display to your customers when they look at your product. Markdown formatting is supported.
  12. Choose Next.
  13. Based on your choice of product visibility, configure the offer, renewal, and data subscription agreement.
  14. Choose Next.
  15. Review all the products and offer information, then choose Publish to create the new private product.

Manage permissions and access controls for shared datasets

Datasets that are published on AWS Data Exchange can only be used when customers are subscribed to the products. Complete the following steps to subscribe to the data:

  1. On the AWS Data Exchange console, choose Browse catalog in the navigation pane.
  2. In the search bar, enter the name of the product you want to subscribe to and press Enter.
  3. Choose the product to view its detail page.
  4. On the product detail page, choose Continue to Subscribe.
  5. Choose your preferred price and duration combination, choose whether to enable auto-renewal for the subscription, and review the offer details, including the data subscription agreement (DSA).
    The dataset is available in the US East (N. Virginia) Region.
  6. Review the pricing information, choose the pricing offer and, if you and your organization agree to the DSA, pricing, and support information, choose Subscribe.

After the subscription has gone through, you will be able to see the product on the Subscriptions page.

Create a table in Athena using an Amazon S3 access point

Complete the following steps to create a table in Athena:

  1. Open the Athena console.
  2. If this is the first time using Athena, choose Explore Query Editor and set up the S3 bucket where query results will be written:
    Athena will display the results of your query on the Athena console, or send them through your ODBC/JDBC driver if that is what you are using. Additionally, the results are written to the result S3 bucket.

    1. Choose View settings.
    2. Choose Manage.
    3. Under Query result location and encryption, choose Browse Amazon S3 to choose the location where query results will be written.
    4. Choose Save.
    5. Choose a bucket and folder you want to automatically write the query results to.
  3. Complete the following steps to create a workgroup:
    1. In the navigation pane, choose Workgroups.
    2. Choose Create workgroup.
    3. Enter a name for your workgroup (for this post, data_exchange), select your analytics engine (Athena SQL), and select Turn on queries on requester pay buckets in Amazon S3.
      This is important to access third-party datasets.
    4. In the Athena query editor, choose the workgroup you created.
    5. Run the following DDL to create the table:

Now you can run your analytical queries using Athena SQL statements. The following screenshot shows an example of the query results.
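The workgroup created in the console steps above can also be defined in code. This is a sketch that assumes the resulting dictionary is passed to the boto3 Athena client's `create_work_group` call; the builder function and the results bucket in the usage note are placeholders, and engine selection is left at the account default.

```python
def build_workgroup_request(name, output_location):
    """Assemble keyword arguments for an Athena workgroup that can query
    requester pays buckets, which is needed for third-party datasets."""
    return {
        "Name": name,
        "Configuration": {
            "ResultConfiguration": {"OutputLocation": output_location},
            # Console equivalent: "Turn on queries on requester pay buckets
            # in Amazon S3".
            "RequesterPaysEnabled": True,
        },
    }


# Usage (requires boto3 and AWS credentials; bucket name is a placeholder):
#   import boto3
#   request = build_workgroup_request(
#       "data_exchange", "s3://my-athena-results-bucket/data_exchange/")
#   boto3.client("athena").create_work_group(**request)
```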

Enhanced customer collaboration and experience with AWS Data Exchange and Apache Hudi

AWS Data Exchange provides a secure and simple interface to access high-quality data. With access to over 3,500 datasets, you can use leading high-quality data in your analytics and data science. Additionally, the ability to add Hudi datasets as shown in this post allows you to enable deeper integration with lakehouse use cases. There are several potential use cases where having Apache Hudi datasets integrated into AWS Data Exchange can accelerate business outcomes, such as the following:

  • Near real-time updated datasets – One of Apache Hudi’s defining features is the ability to provide near real-time incremental data processing. As new data flows in, Hudi allows that data to be ingested in real time, providing a central source of up-to-date truth. AWS Data Exchange supports dynamically updated datasets, which can keep up with these incremental updates. For downstream customers that rely on the most up-to-date information for their use cases, the combination of Apache Hudi and AWS Data Exchange means that they can subscribe to a dataset in AWS Data Exchange and know that they’re getting incrementally updated data.
  • Incremental pipelines and processing – Hudi supports incremental processing and updates to data in the data lake. This is especially valuable because it enables you to update or process only the data that has changed, and to maintain the materialized views that are valuable for your business use case.
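To make the incremental use case more concrete, the following sketch builds the read options for a Hudi incremental query, which returns only records committed after a given instant. The option keys are Hudi's Spark datasource read options; the helper function, the instant value, and the table path in the usage note are hypothetical, and running the query itself requires a Spark session with the Hudi bundle available.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build Spark read options for a Hudi incremental query."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        # Only commits after this instant time are returned.
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


# Usage inside a Spark job (path and instant are placeholders):
#   opts = incremental_read_options("20240101000000")
#   changes = spark.read.format("hudi").options(**opts).load("s3://my-bucket/hudi/table/")
```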

Best practices and recommendations

We recommend the following best practices for security and compliance:

  • Enable AWS Lake Formation or other data governance systems as part of creating the source data lake
  • To maintain compliance, you can use the guides provided by AWS Artifact

For monitoring and management, you can enable Amazon CloudWatch logs on your EMR clusters along with CloudWatch alerts to maintain pipeline health.

Conclusion

Apache Hudi enables you to bring to life massive amounts of data stored in Amazon S3 for analytics. It provides full OLAP capabilities, enables incremental processing and querying, and retains the ability to run deletes so you can remain GDPR compliant. Combining this with the secure, reliable, and user-friendly data sharing capabilities of AWS Data Exchange means that the business value unlocked by a Hudi lakehouse doesn’t need to remain limited to the producer that generates this data.

For more use cases about using AWS Data Exchange, see Learning Resources for Using Third-Party Data in the Cloud. To learn more about creating Apache Hudi data lakes, refer to Build your Apache Hudi data lake on AWS using Amazon EMR – Part 1. You can also consider using a fully managed lakehouse product such as Onehouse.


About the Authors

Saurabh Bhutyani is a Principal Analytics Specialist Solutions Architect at AWS. He is passionate about new technologies. He joined AWS in 2019 and works with customers to provide architectural guidance for running generative AI use cases, scalable analytics solutions and data mesh architectures using AWS services like Amazon Bedrock, Amazon SageMaker, Amazon EMR, Amazon Athena, AWS Glue, AWS Lake Formation, and Amazon DataZone.

Ankith Ede is a Data & Machine Learning Engineer at Amazon Web Services, based in New York City. He has years of experience building Machine Learning, Artificial Intelligence, and Analytics based solutions for large enterprise clients across various industries. He is passionate about helping customers build scalable and secure cloud based solutions at the cutting edge of technology innovation.

Chandra Krishnan is a Solutions Engineer at Onehouse, based in New York City. He works on helping Onehouse customers build business value from their data lakehouse deployments and enjoys solving exciting challenges on behalf of his customers. Prior to Onehouse, Chandra worked at AWS as a Data and ML Engineer, helping large enterprise clients build cutting edge systems to drive innovation in their organizations.

Accelerate your Software Development Lifecycle with Amazon Q

Post Syndicated from Chetan Makvana original https://aws.amazon.com/blogs/devops/accelerate-your-software-development-lifecycle-with-amazon-q/

Software development teams are constantly looking for ways to accelerate their software development lifecycle (SDLC) to release quality software faster. Amazon Q, a generative AI–powered assistant, can help software development teams work more efficiently throughout the SDLC—from research to maintenance.

Software development teams spend significant time on undifferentiated tasks while analyzing requirements, building, testing, and operating applications. Trained on 17 years’ worth of AWS expertise, Amazon Q can transform how you build, deploy, and operate applications and workloads on AWS. By automating mundane tasks, Amazon Q enables development teams to spend more time innovating and building. Amazon Q can speed up onboarding, reduce context switching, and accelerate development of applications on AWS.

This blog post explores how Amazon Q can accelerate development tasks across the SDLC using an example To-Do API project. Throughout this blog, we will navigate through the various phases of the SDLC while implementing To-Do API by leveraging Amazon Q Business and Amazon Q Developer. We will walk through common use cases for Amazon Q Business in the planning and research phases, and Amazon Q Developer in the research, design, development, testing, and maintenance phases.

Planning

As a product owner, you spend significant time on requirements analysis and user story development. You research internal documents like functional specifications and business requirements to understand the desired functionalities and objectives. Manually sifting through documentation is time consuming. You can leverage Amazon Q Business to quickly extract relevant information from your internal documents or wikis, such as Confluence. Amazon Q Business quickly connects to your business data, information, and systems so that you can have tailored conversations, solve problems, generate content, and take actions relevant to your business. Amazon Q Business offers over 40 built-in connectors to popular enterprise applications and document repositories, including Amazon Simple Storage Service (Amazon S3), Confluence, Salesforce and more, enabling you to create a generative AI solution with minimal configuration. Amazon Q Business also provides plugins to interact with third-party applications. These plugins support read and write actions that can help boost end user productivity.

So, instead of digging through internal documentation, you can simply ask Amazon Q Business about requirements in natural language. It provides immediate and relevant information, helping you streamline tasks and accelerate problem solving.

For our To-Do API example, let’s assume the business requirements are documented in Confluence and Jira is used for issue management. You can configure Amazon Q Business with Confluence and Jira through the Confluence connector and Jira plugin, respectively. To understand requirements, you may ask Amazon Q Business for an overview of the use case, business drivers, non-functional requirements, and other related questions. Amazon Q Business then pulls the relevant details from the Confluence documents and presents them to you in a clear and concise manner. This allows you to save time gathering requirements and focus more on developing user stories.

Ask Amazon Q Business to understand requirements from the requirement document available on Confluence

Once you have a good understanding of the requirements, you can ask Amazon Q Business to write a user story and even create a Jira issue for you. For the To-Do API use case, Amazon Q Business generates the user stories tailored to the requirements and creates the corresponding Jira ticket ready for your team, saving you time and ensuring efficiency in the project workflow.

Ask Amazon Q Business to create issue in Jira

Research and Design

Let’s consider a scenario where the user story mentioned above is assigned to you and you have to implement it based on the technology stack described in the Confluence page.

First, you ask Amazon Q Business to gain insights into the technology stacks aligning with the organization’s development guidelines. Amazon Q Business promptly provides you with details sourced from the internal development guidelines document hosted on Confluence along with references and citations.

Ask Amazon Q Business to gain insight in technology stack detail from Confluence.

As a developer, you can use Amazon Q Developer in your integrated development environment (IDE) to get software development assistance, including code explanation, code generation, and code improvements such as debugging and optimization. Amazon Q Developer can help by analyzing the requirements, assessing different approaches, and creating an implementation plan and sample code. It can investigate options, weigh tradeoffs, recommend best practices, and even brainstorm with you to optimize the design.

Let’s see how Amazon Q Developer can help analyze the user story, design the solution, and brainstorm with you to arrive at an implementation plan.

Ask Amazon Q Developer to design To-Do API.

Let us further refine the design by adding non-functional requirements such as security and performance.

Ask Amazon Q Developer to design with non-functional requirements.

Develop and Test

Amazon Q Developer can generate code snippets that meet your specified business and technical needs. You can review the auto-generated code, manually copy and paste it into your editor, or use the Insert at cursor option to incorporate it directly into your source code. This allows you to rapidly prototype and iterate on new capabilities for your application. Amazon Q Developer uses the context of your conversation to inform future responses for the duration of your conversation. This helps you focus on building applications because you don’t have to leave your IDE to get answers and context-specific coding guidance.

Amazon Q Developer is particularly useful for answering questions related to the following areas:

  • Building on AWS, including AWS service selection, limits, and best practices.
  • General software development concepts, including programming language syntax and application development.
  • Writing code, including explaining code, debugging code, and writing unit tests.
  • Upgrading and modernizing existing application code using Amazon Q Developer Agent for Code Transformation.

Expanding on the same user story design generated by Amazon Q Developer, you can ask Amazon Q Developer to implement the API and refine it based on additional requirements and parameters. Let’s collaborate with Amazon Q Developer to expand our design into an implementation. You can leverage Amazon Q Developer’s expertise to ideate, evaluate options, and arrive at an optimal solution. Amazon Q Developer can have an intelligent discussion to brainstorm creative new test cases based on the requirements. It can then help construct an implementation plan, suggesting ways to efficiently add robust, comprehensive tests that cover edge cases.

Let’s ask Amazon Q Developer to generate code based on the design.

Ask Amazon Q Developer to generate code

Now, let’s ask Amazon Q Developer to implement the AWS Lambda function.

Ask Amazon Q Developer to generate AWS Lambda function.

Amazon Q Developer can provide code examples and snippets that show how to implement the design. You can review the code, get Amazon Q Developer’s feedback, and seamlessly integrate it into the project. Collaborating with Amazon Q Developer allows you to amplify your productivity by leveraging its knowledge to quickly iterate and enrich our application capabilities.

Amazon Q Developer can also review the code and find opportunities for improvement and optimization based on performance and other parameters. Let us ask Amazon Q Developer to find any opportunities for improvement in the code for our To-Do API.

Ask Amazon Q Developer for code improvements

Debugging and Troubleshooting

Amazon Q Developer can assist you with troubleshooting and debugging your code. For unfamiliar error codes or exception types, you can ask Amazon Q Developer to research their meaning and common resolutions. Amazon Q Developer can also help by analyzing your applications’ debug logs, highlighting any anomalies, errors, or warnings that could indicate potential issues.

Amazon Q Developer can troubleshoot network connectivity issues caused by misconfiguration, providing concise problem analysis and resolution suggestions. Amazon Q Developer can also research AWS best practices to identify areas not aligned with recommendations. For code issues, it can answer questions and debug problems within supported IDEs. Leveraging its knowledge of AWS services and their interactions, Amazon Q Developer can provide service-specific guidance. In the AWS Management Console, Amazon Q Developer can troubleshoot errors you receive while working with AWS services such as insufficient permissions, incorrect configuration, and exceeding service limits.

Let’s test our To-Do API by hitting the Amazon API Gateway endpoint using cURL.

Test To-Do API in IDE.

The API Gateway endpoint invokes the Lambda function to insert records in the Amazon DynamoDB table. Because the call returns an Internal Server Error, let’s go to the Lambda console to troubleshoot this further and test the function directly by creating a test event for the POST method. You can troubleshoot different console errors with Amazon Q Developer, directly in the AWS Management Console. For the above error, Amazon Q analyzes the issue and helps find the resolution. Amazon Q explains how to fix this error directly on the console by adding an environment variable for the DynamoDB table name.

Ask Amazon Q to troubleshoot issue in the AWS Management Console.

Now, let’s ask Amazon Q Developer in the IDE to generate code to fix this error. Amazon Q Developer then generates a code snippet to set the desired environment variable in the AWS Cloud Development Kit (AWS CDK) code for the Lambda function.

Ask Amazon Q Developer to generate CDK.
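
The fix above adds an environment variable for the DynamoDB table name. As a complementary, hedged sketch (not the code generated in this walkthrough), a Python handler could check for that variable up front and report a clear message when it is missing. The variable name TABLE_NAME, the helper get_table_name, and the response shape are all assumptions made for illustration.

```python
import json
import os

# Hypothetical variable name; use whatever name your function and CDK stack agree on.
TABLE_ENV_VAR = "TABLE_NAME"


def get_table_name(env=os.environ):
    """Return the DynamoDB table name, or raise a descriptive error if it is unset."""
    name = env.get(TABLE_ENV_VAR, "").strip()
    if not name:
        raise RuntimeError(f"Environment variable {TABLE_ENV_VAR} is not set")
    return name


def handler(event, context, env=os.environ):
    """Illustrative POST handler that validates configuration before any write."""
    try:
        table_name = get_table_name(env)
    except RuntimeError as err:
        # Surface the misconfiguration instead of an opaque Internal Server Error.
        return {"statusCode": 500, "body": json.dumps({"error": str(err)})}
    item = json.loads(event.get("body") or "{}")
    # A real function would write `item` to the table here (for example with boto3).
    return {"statusCode": 201, "body": json.dumps({"table": table_name, "item": item})}
```

Passing the environment in as a parameter keeps the check easy to exercise locally without deploying the function.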

Conclusion

In this post, you learned how to leverage Amazon Q Business and Amazon Q Developer to streamline SDLC and accelerate time-to-market. With its deep understanding of code and AWS resources, Amazon Q Developer enables development teams to work efficiently throughout the research, design, development, testing, and review phases. By automating mundane tasks, offering expert guidance, generating code snippets, optimizing implementations, and troubleshooting issues, Amazon Q Developer allows developers to redirect their focus towards higher-value activities that drive innovation. Moreover, through Amazon Q Business, teams can leverage the power of generative AI to expedite the requirements planning and research phases.

Chetan Makvana

Chetan Makvana is a Senior Solutions Architect with Amazon Web Services. He works with AWS partners and customers to provide them with architectural guidance for building scalable architecture and implementing strategies to drive adoption of AWS services. He is a technology enthusiast and a builder with a core area of interest on generative AI, serverless, and DevOps. Outside of work, he enjoys watching shows, traveling, and music.

Suruchi Saxena

Suruchi Saxena is a Cloud/DevOps Engineer working at Amazon Web Services. She has a background in Generative AI and DevOps, leveraging years of IT experience to drive transformational changes for AWS customers. She specializes in architecting and managing cloud-based solutions, automation, code delivery and analysis, infrastructure as code, and continuous integration/delivery. In her free time, she enjoys traveling and reading.

Venugopalan Vasudevan

Venugopalan Vasudevan is a Senior Specialist Solutions Architect at Amazon Web Services (AWS), where he specializes in AWS Generative AI services. His expertise lies in helping customers leverage cutting-edge services like Amazon Q and Amazon Bedrock to streamline development processes, accelerate innovation, and drive digital transformation. Venugopalan is dedicated to facilitating the Next Generation Developer experience, enabling developers to work more efficiently and creatively through the integration of Generative AI into their workflows.

Automate Terraform Deployments with Amazon CodeCatalyst and Terraform Community action

Post Syndicated from Vineeth Nair original https://aws.amazon.com/blogs/devops/automate-terraform-deployments-with-amazon-codecatalyst-and-terraform-community-action/

Amazon CodeCatalyst integrates continuous integration and deployment (CI/CD) by bringing key development tools together on one platform. With the entire application lifecycle managed in one tool, CodeCatalyst empowers rapid, dependable software delivery. CodeCatalyst offers a range of actions. An action is the main building block of a workflow and defines a logical unit of work to perform during a workflow run. Typically, a workflow includes multiple actions that run sequentially or in parallel depending on how you’ve configured them.

Introduction

Infrastructure as code (IaC) has become a best practice for managing IT infrastructure. IaC uses code to provision and manage your infrastructure in a consistent, programmatic way. Terraform by HashiCorp is one of the most common tools for IaC.

With Terraform, you define the desired end state of your infrastructure resources in declarative configuration files. Terraform determines the necessary steps to reach the desired state and provisions the infrastructure automatically. This removes the need for manual processes while enabling version control, collaboration, and reproducibility across your infrastructure.

In this blog post, we will demonstrate using the “Terraform Community Edition” action in CodeCatalyst to create resources in an AWS account.

Amazon CodeCatalyst workflow overview
Figure 1: Amazon CodeCatalyst Action

Prerequisites

To follow along with the post, you will need the following items:

Walkthrough

In this walkthrough we create an Amazon S3 bucket using the Terraform Community Edition action in Amazon CodeCatalyst. The action will execute the Terraform commands needed to apply your configuration. You configure the action with a specified Terraform version. When the action runs it uses that Terraform version to deploy your Terraform templates, provisioning the defined infrastructure. This action will run terraform init to initialize the working directory, terraform plan to preview changes, and terraform apply to create the Amazon S3 bucket based on the Terraform configuration in a target AWS Account. At the end of the post your workflow will look like the following:

Figure 2: Amazon CodeCatalyst Workflow with Terraform Community Action

Create the base workflow

To begin, we create a workflow that will execute our Terraform code. In the CodeCatalyst project, click on CI/CD in the left pane and select Workflows. In the Workflows pane, click on Create Workflow.

Figure 3: Creating Amazon CodeCatalyst Workflow

We have taken an existing repository my-sample-terraform-repository as a source repository.

Figure 4: Creating Workflow from source repository

Once the source repository is selected, select Branch as main and click Create. You will have an empty workflow. You can edit the workflow from within the CodeCatalyst console. Click on the Commit button to create an initial commit:

Figure 5: Initial Workflow commit

On the Commit Workflow dialogue, add a commit message, and click on Commit. Ignore any validation errors at this stage:

Figure 6: Completing Initial Commit for Workflow

Connect to CodeCatalyst Dev Environment

For this post, we will use an AWS Cloud9 Dev Environment to edit our workflow. Your first step is to connect to the dev environment. Select Code → Dev Environments.

Figure 7: Navigate to CodeCatalyst Dev Environments

If you do not already have a Dev Environment you can create an instance by selecting the Create Dev Environment dropdown and selecting AWS Cloud9 (in browser). Leave the options as default and click on Create to provision a new Dev Environment.

Figure 8: Create CodeCatalyst Dev Environment

Once the Dev Environment has been provisioned, you are redirected to a Cloud9 instance in the browser. The Dev Environment automatically clones the existing repository for the Terraform project code. We first create a main.tf file in the root of the repository with the Terraform code for creating an Amazon S3 bucket. To do this, we right-click the repository folder in the tree-pane view on the left side of the Cloud9 console window and select New File.

Figure 9: Creating a new file in Cloud9

We are presented with a new file, which we name main.tf; this file will store the Terraform code. We then edit main.tf by right-clicking the file and selecting open. We insert the code below into main.tf. The code has a Terraform resource block to create an Amazon S3 bucket. The configuration also uses Terraform AWS data sources to obtain the AWS Region and AWS account ID, which form part of the bucket name. Finally, we use a backend block to configure Terraform to use an Amazon S3 bucket to store Terraform state data. To save our changes, we select File -> Save.

Figure 10: Adding Terraform Code

Now let’s start creating the Terraform workflow using the Amazon CodeCatalyst Terraform Community action. Within your repository, go to the .codecatalyst/workflows directory and open the <workflowname.yaml> file.

Figure 11: Creating CodeCatalyst Workflow

The below code snippet is an example workflow definition with terraform plan and terraform apply. We will enter this into our workflow file, with the relevant configuration settings for our environment.

The workflow does the following:

  • When a change is pushed to the main branch, a new workflow execution is triggered. This workflow carries out a Terraform plan and a subsequent apply operation.
    Name: terraform-action-workflow
    Compute:
      Type: EC2
      Fleet: Linux.x86-64.Large
    SchemaVersion: "1.0"
    Triggers:
      - Type: Push
        Branches:
          -  main
    Actions: 
      PlanTerraform:
        Identifier: codecatalyst-labs/provision-with-terraform-community@v1
        Environment:
          Name: dev 
          Connections:
            - Name: codecatalyst
              Role: CodeCatalystWorkflowDevelopmentRole # The IAM role to be used
        Inputs:
          Sources:
            - WorkflowSource
        Outputs:
          Artifacts:
            - Name: tfplan # generates a tfplan output artifact
              Files:
                - tfplan.out
        Configuration:
          AWSRegion: eu-west-2
          StateBucket: tfstate-bucket # The Terraform state S3 Bucket
          StateKey: terraform.tfstate # The Terraform state file
          StateKeyPrefix: states/ # The path to the state file (optional)
          StateTable: tfstate-table # The Dynamo DB database
          TerraformVersion: '1.5.1' # The Terraform version to be used
          TerraformOperationMode: plan # The Terraform operation- can be plan or apply
      ApplyTerraform:
        Identifier: codecatalyst-labs/provision-with-terraform-community@v1
        DependsOn:
          - PlanTerraform
        Environment:
          Name: dev 
          Connections:
            - Name: codecatalyst
              Role: CodeCatalystWorkflowDevelopmentRole
        Inputs:
          Sources:
            - WorkflowSource
          Artifacts:
            - tfplan
        Configuration:
          AWSRegion: eu-west-2
          StateBucket: tfstate-bucket
          StateKey: terraform.tfstate
          StateKeyPrefix: states/
          StateTable: tfstate-table
          TerraformVersion: '1.5.1'
          TerraformOperationMode: apply
  • Key configuration parameters are:
    • Environment.Name: The name of our CodeCatalyst Environment
    • Environment.Connections.Name: The name of the CodeCatalyst connection
    • Environment.Connections.Role: The IAM role used for the workflow
    • AWSRegion: The AWS region that hosts the Terraform state bucket
    • Identifier: codecatalyst-labs/provision-with-terraform-community@v1
    • StateBucket: The Terraform state bucket
    • StateKey: The Terraform statefile e.g. terraform.tfstate
    • StateKeyPrefix: The folder location of the State file (optional)
    • StateTable: The DynamoDB State table
    • TerraformVersion: The version of Terraform to be installed
    • TerraformOperationMode: The operation mode for Terraform – this can be either ‘plan’ or ‘apply’

The workflow now contains CodeCatalyst actions for Terraform plan and Terraform apply.

To save our changes, we select File -> Save. We can then commit them to our git repository by typing the following at the terminal:

git add . && git commit -m 'adding terraform workflow and main.tf' && git push

The above command adds the workflow file and Terraform code to be tracked by git. It then commits the code and pushes the changes to the CodeCatalyst git repository. Because we have a branch trigger defined for main, this triggers a run of the workflow. We can monitor the status of the workflow in the CodeCatalyst console by selecting CI/CD -> Workflows. Locate your workflow and click on Runs to view the status. You will be able to observe that the workflow has completed successfully and the Amazon S3 bucket has been created.

Figure 12: CodeCatalyst Workflow Status

Cleaning up

If you have been following along with this workflow, you should delete the resources that you have deployed to avoid further charges. The walkthrough will create an Amazon S3 bucket named <your-aws-account-id>-<your-aws-region>-terraform-sample-bucket in your AWS account. In the AWS Console > S3, locate the bucket that was created, then select and click Delete to remove the bucket.

Conclusion

In this post, we explained how you can easily get started deploying IaC to your AWS accounts with Amazon CodeCatalyst. We outlined how the Terraform Community Edition action can streamline the process of planning and applying Terraform configurations and how to create a workflow that can leverage this action. Get started with Amazon CodeCatalyst today.

Richard Merritt

Richard Merritt is a Senior DevOps Consultant at Amazon Web Services (AWS), Professional Services. He works with AWS customers to accelerate their journeys to the cloud by providing scalable, secure and robust DevOps solutions.

Vineeth Nair

Vineeth Nair is a DevOps Architect at Amazon Web Services (AWS), Professional Services. He collaborates closely with AWS customers to support and accelerate their journeys to the cloud and within the cloud ecosystem by building performant, resilient, scalable, secure and cost efficient solutions.

Nagaraju Basavaraju

Nagaraju is a seasoned DevOps Architect at AWS, UKI. He specializes in assisting customers in designing and implementing secure, scalable, and resilient hybrid and cloud-native solutions with DevOps methodologies. With a profound passion for cloud infrastructure, observability and automation, Nagaraju is also an avid contributor to Open-Source projects related to Terraform and AWS CDK.

Debojit Bhadra

Debojit is a DevOps consultant who specializes in helping customers deliver secure and reliable solutions using AWS services. He concentrates on infrastructure development and building serverless solutions with AWS and DevOps. Apart from work, Debojit enjoys watching movies and spending time with his family.

Analyze more demanding as well as larger time series workloads with Amazon OpenSearch Serverless 

Post Syndicated from Satish Nandi original https://aws.amazon.com/blogs/big-data/analyze-more-demanding-as-well-as-larger-time-series-workloads-with-amazon-opensearch-serverless/

In today’s data-driven landscape, managing and analyzing vast amounts of data, especially logs, is crucial for organizations to derive insights and make informed decisions. However, handling this data efficiently presents a significant challenge, prompting organizations to seek scalable solutions without the complexity of infrastructure management.

Amazon OpenSearch Serverless lets you run OpenSearch in the AWS Cloud, without worrying about scaling infrastructure. With OpenSearch Serverless, you can ingest, analyze, and visualize your time-series data. Without the need for infrastructure provisioning, OpenSearch Serverless simplifies data management and enables you to derive actionable insights from extensive repositories.

We recently announced a new capacity level of 10TB for Time-series data per account per Region, which includes one or more indexes within a collection. With the support for larger datasets, you can unlock valuable operational insights and make data-driven decisions to troubleshoot application downtime, improve system performance, or identify fraudulent activities.

In this post, we discuss this new capability and how you can analyze larger time series datasets with OpenSearch Serverless.

10TB Time-series data size support in OpenSearch Serverless

The compute capacity for data ingestion and search or query in OpenSearch Serverless is measured in OpenSearch Compute Units (OCUs). These OCUs are shared among various collections, each containing one or more indexes within the account. To accommodate larger datasets, OpenSearch Serverless now supports up to 200 OCUs per account per AWS Region, each for indexing and search respectively, doubling from the previous limit of 100. You configure the maximum OCU limits on search and indexing independently to manage costs. You can also monitor real-time OCU usage with Amazon CloudWatch metrics to gain a better perspective on your workload’s resource consumption.

Working with larger datasets and analyses needs more memory and CPU. With 10TB data size support, OpenSearch Serverless is introducing vertical scaling up to eight times the size of a 1-OCU system. For example, OpenSearch Serverless can deploy a larger system equivalent to eight 1-OCU systems. The system uses a hybrid of horizontal and vertical scaling to address the needs of the workload. There are also improvements to the shard reallocation algorithm to reduce shard movement during heat remediation, vertical scaling, or routine deployment.

In our internal testing for 10TB Time-series data, we set the Max OCU to 48 for Search and 48 for Indexing. We set the data retention to 5 days using data lifecycle policies, and set the deployment type to “Enable redundancy” to make sure the data is replicated across Availability Zones. This leads to 12–24 hours of data in hot storage (OCU disk memory) and the rest in Amazon Simple Storage Service (Amazon S3) storage. We observed an average ingestion of 2.3 TiB per day, with an average ingestion performance of 49.15 GiB per OCU per day, a maximum of 52.47 GiB per OCU per day, and a minimum of 32.69 GiB per OCU per day. The performance depends on several factors, such as document size, mapping, and other parameters, so results may vary for your workload.
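
As a rough, back-of-the-envelope illustration of how these figures relate, the following Python sketch divides a daily ingestion volume by a per-OCU throughput to estimate how many indexing OCUs a similar workload might consume. The function name estimate_indexing_ocus is hypothetical, and the calculation assumes throughput scales linearly with OCUs, which may not hold for your documents and mappings.

```python
import math

GIB_PER_TIB = 1024


def estimate_indexing_ocus(daily_ingest_tib, gib_per_ocu_per_day):
    """Return the number of indexing OCUs needed to sustain a daily ingest volume."""
    daily_ingest_gib = daily_ingest_tib * GIB_PER_TIB
    return math.ceil(daily_ingest_gib / gib_per_ocu_per_day)


# Figures reported above: 2.3 TiB/day at an average of 49.15 GiB per OCU per day.
print(estimate_indexing_ocus(2.3, 49.15))  # prints 48
# Planning with the observed minimum throughput gives a more conservative figure.
print(estimate_indexing_ocus(2.3, 32.69))  # prints 73
```

With the average throughput, the estimate lines up with the indexing limit of 48 OCUs used in the test. Treat the result as a starting point for setting your maximum OCU limit, then adjust based on the CloudWatch metrics for your own workload.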

Set max OCU to 200

You can start using our expanded capacity today by setting your OCU limits for indexing and search to 200. You can still set the limits to less than 200 to maintain a maximum cost during high traffic spikes. You only pay for the resources consumed, not for the max OCU configuration.

Ingest the data

You can use the load generation scripts shared in the following workshop, or you can use your own application or data generator to create a load. You can run multiple instances of these scripts to generate a burst in indexing requests. As shown in the following screenshot, we tested with an index, sending approximately 10 TB of data. We used our load generator script to send the traffic to a single index, retaining data for 5 days, and used a data lifecycle policy to delete data older than 5 days.

Auto scaling in OpenSearch Serverless with new vertical scaling

Before this release, OpenSearch Serverless auto-scaled by horizontally adding the same-size capacity to handle increases in traffic or load. With the new feature of vertical scaling to a larger size capacity, it can optimize the workload by providing a more powerful compute unit. The system will intelligently decide whether horizontal scaling or vertical scaling is more price-performance optimal. Vertical scaling also improves auto-scaling responsiveness, because vertical scaling helps to reach the optimal capacity faster compared to the incremental steps taken through horizontal scaling. Overall, vertical scaling has significantly improved the response time for auto scaling.

Conclusion

We encourage you to take advantage of the 10TB index support and put it to the test! Migrate your data, explore the improved throughput, and take advantage of the enhanced scaling capabilities. Our goal is to deliver a seamless and efficient experience that aligns with your requirements.

To get started, refer to Log analytics the easy way with Amazon OpenSearch Serverless. To get hands-on experience with OpenSearch Serverless, follow the Getting started with Amazon OpenSearch Serverless workshop, which has a step-by-step guide for configuring and setting up an OpenSearch Serverless collection.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and has years of experience in networking, security and ML/AI. He holds a Bachelor’s degree in Computer Science and an MBA in Entrepreneurship. In his free time, he likes to fly airplanes, hang gliders and ride his motorcycle.

Michelle Xue is a Sr. Software Development Manager working on Amazon OpenSearch Serverless. She works closely with customers to help them onboard OpenSearch Serverless and incorporates customer feedback into the Serverless roadmap. Outside of work, she enjoys hiking and playing tennis.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Detect and handle data skew on AWS Glue

Post Syndicated from Salim Tutuncu original https://aws.amazon.com/blogs/big-data/detect-and-handle-data-skew-on-aws-glue/

AWS Glue is a fully managed, serverless data integration service provided by Amazon Web Services (AWS) that uses Apache Spark as one of its backend processing engines (as of this writing, you can use Python Shell, Spark, or Ray).

Data skew occurs when the data being processed is not evenly distributed across the Spark cluster, causing some tasks to take significantly longer to complete than others. This can lead to inefficient resource utilization, longer processing times, and ultimately, slower performance. Data skew can arise from various factors, including uneven data distribution, skewed join keys, or uneven data processing patterns. Even though the biggest issue is often having nodes running out of disk during shuffling, which leads to nodes falling like dominoes and job failures, it’s also important to mention that data skew is hidden. The stealthy nature of data skew means it can often go undetected because monitoring tools might not flag an uneven distribution as a critical issue, and logs don’t always make it evident. As a result, a developer may observe that their AWS Glue jobs are completing without apparent errors, yet the system could be operating far from its optimal efficiency. This hidden inefficiency not only increases operational costs due to longer runtimes but can also lead to unpredictable performance issues that are difficult to diagnose without a deep dive into the data distribution and task run patterns.

For example, in a dataset of customer transactions, if one customer has significantly more transactions than the others, it can cause a skew in the data distribution.

Identifying and handling data skew issues is key to having good performance on Apache Spark and therefore on AWS Glue jobs that use Spark as a backend. In this post, we show how you can identify data skew and discuss the different techniques to mitigate data skew.

How to detect data skew

When an AWS Glue job has issues with local disks (split disk issues), doesn’t scale with the number of workers, or has low CPU usage (you can enable Amazon CloudWatch metrics for your job to be able to see this), you may have a data skew issue. You can detect data skew with data analysis or by using the Spark UI. In this section, we discuss how to use the Spark UI.

The Spark UI provides a comprehensive view of Spark applications, including the number of tasks, stages, and their duration. To use it, you need to enable Spark UI event logs for your job runs. This is enabled by default on the AWS Glue console; once enabled, Spark event log files are created during the job run and stored in your S3 bucket. Then, those logs are parsed, and you can use the AWS Glue serverless Spark UI to visualize them. You can refer to this blog post for more details. For jobs where the AWS Glue serverless Spark UI does not work because it has a limit of 512 MB of logs, you can set up the Spark UI using an EC2 instance.

You can use the Spark UI to identify which tasks are taking longer to complete than others, and if the data distribution among partitions is balanced or not (remember that in Spark, one partition is mapped to one task). If there is data skew, you will see that some partitions have significantly more data than others. The following figure shows an example of this. We can see that one task is taking a lot more time than the others, which can indicate data skew.

Another thing that you can use is the summary metrics for each stage. The following screenshot shows another example of data skew.

Each of these summary metrics is the value below which a certain percentage of tasks completed. For example, the 75th percentile task duration indicates that 75% of tasks completed in less time than this value. When the tasks are evenly distributed, you will see similar numbers in all the percentiles. When there is data skew, you will see very biased values in each percentile. In the preceding example, it didn’t write many shuffle files (less than 50 MiB) in Min, 25th percentile, Median, and 75th percentile. However, in Max, it wrote 460 MiB, 10 times the 75th percentile. It means there was at least one task (or up to 25% of tasks) that wrote much bigger shuffle files than the rest of the tasks. You can also see that the duration of the task in Max is 46 seconds and the Median is 2 seconds. These are all indicators that your dataset may have data skew.
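
To make the percentile reasoning concrete, here is a minimal Python sketch that computes the same five summary values from a list of task durations and flags a stage when the slowest task is far above the median. The function names and the ratio threshold of 5 are assumptions chosen for illustration; they are not values used by the Spark UI.

```python
import statistics


def summarize_tasks(durations):
    """Return Min, 25th percentile, Median, 75th percentile, and Max for task metrics."""
    q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
    return {"min": min(durations), "p25": q1, "median": median, "p75": q3, "max": max(durations)}


def looks_skewed(durations, ratio_threshold=5.0):
    """Flag a stage when its slowest task far exceeds the median task."""
    summary = summarize_tasks(durations)
    return summary["max"] > ratio_threshold * summary["median"]


# Durations in seconds: most tasks take 2 seconds, one takes 46 seconds.
print(looks_skewed([2] * 19 + [46]))       # True
print(looks_skewed([2, 2, 3, 2, 3, 2]))    # False
```

The first call mirrors the example above, where the Max duration is 46 seconds against a Median of 2 seconds. The same comparison can be applied to other per-task metrics, such as shuffle write size.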

AWS Glue interactive sessions

You can use interactive sessions to load your data from the AWS Glue Data Catalog or just use Spark methods to load the files, such as Parquet or CSV, that you want to analyze. You can use a script similar to the following to detect data skew from the partition size perspective. Note that the more important issue is data skew during shuffling, and this script does not detect that kind of skew:

from pyspark.sql.functions import spark_partition_id, asc, desc
#input_dataframe being the dataframe where you want to check for data skew
partition_sizes_df=input_dataframe\
    .withColumn("partitionId", spark_partition_id())\
    .groupBy("partitionId")\
    .count()\
    .orderBy(asc("count"))\
    .withColumnRenamed("count","partition_size")
#calculate average and standar deviation for the partition sizes
avg_size = partition_sizes_df.agg({"partition_size": "avg"}).collect()[0][0]
std_dev_size = partition_sizes_df.agg({"partition_size": "stddev"}).collect()[0][0]

# Mark a partition as 'skewed' when the absolute difference between its size and
# the average size (avg_size) is greater than twice the standard deviation (std_dev_size)
skewed_partitions_df = partition_sizes_df.filter(abs(partition_sizes_df["partition_size"] - avg_size) > 2 * std_dev_size)
if skewed_partitions_df.count() > 0:
    skewed_partitions = [row["partitionId"] for row in skewed_partitions_df.collect()]
    print(f"The following partitions have significantly different sizes: {skewed_partitions}")
else:
    print("No data skew detected.")

The script calculates the average and standard deviation of the partition sizes with the agg() function and uses the filter() function to identify partitions whose size differs significantly from the average. If it finds any, it prints their partition IDs; otherwise, it prints that no data skew is detected.

This code assumes that your data is structured, and you may need to modify it if your data is of a different type.
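The two-standard-deviation rule can be easier to follow with concrete numbers. The following is a minimal plain-Python sketch of the same arithmetic, assuming hypothetical partition sizes; the function name and the sample values are illustrative only. It uses the sample standard deviation, which is also what Spark's stddev aggregate computes.

```python
import statistics

def flag_skewed_partitions(partition_sizes, num_std_devs=2.0):
    """Return indexes of partitions whose size is more than num_std_devs
    sample standard deviations away from the average size."""
    avg_size = statistics.mean(partition_sizes)
    std_dev_size = statistics.stdev(partition_sizes)  # sample standard deviation
    return [
        index
        for index, size in enumerate(partition_sizes)
        if abs(size - avg_size) > num_std_devs * std_dev_size
    ]

# Hypothetical row counts per partition: seven similar partitions and one much larger
partition_sizes = [100, 110, 95, 105, 98, 102, 99, 900]
print(flag_skewed_partitions(partition_sizes))

# With only four partitions, a similar outlier is not flagged
print(flag_skewed_partitions([10, 10, 10, 100]))
```

The second call returns an empty list because the outlier itself inflates the standard deviation. With five or fewer partitions, no value can be more than two sample standard deviations from the mean, so this check is only meaningful when the DataFrame has more partitions than that.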

How to handle data skew

You can use different techniques in AWS Glue to handle data skew; there is no single universal solution. The first thing to do is confirm that you’re using the latest AWS Glue version. For example, AWS Glue 4.0, which is based on Spark 3.3, enables by default some configurations, such as Adaptive Query Execution (AQE), that can help improve performance when data skew is present.

The following are some of the techniques that you can employ to handle data skew:

  • Filter and perform – If you know which keys are causing the skew, you can filter them out, perform your operations on the non-skewed data, and then handle the skewed keys separately.
  • Implementing incremental aggregation – In large datasets, a single aggregation operation (like sum, average, or count) can be resource-intensive. You can break a large aggregation into smaller stages that perform intermediate actions, such as filtering, grouping, or additional aggregations. This can help distribute the workload across the nodes and reduce the size of intermediate data.
  • Using a custom partitioner – If your data has a specific structure or distribution, you can create a custom partitioner that partitions your data based on its characteristics. This can help make sure that data with similar characteristics is in the same partition and reduce the size of the largest partition.
  • Using broadcast join – If your dataset is small but exceeds the spark.sql.autoBroadcastJoinThreshold value (default is 10 MB), you have the option to either provide a hint to use broadcast join or adjust the threshold value to accommodate your dataset. This can be an effective strategy to optimize join operations and mitigate data skew issues resulting from shuffling large amounts of data across nodes.
  • Salting – This involves adding a random prefix or suffix (the salt) to the key of skewed data. By doing this, you distribute the data more evenly across the partitions. After processing, you can remove the salt to get the original key values.

These are just a few techniques to handle data skew in PySpark; the best approach will depend on the characteristics of your data and the operations you are performing.
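As an illustration of incremental aggregation, the following plain-Python sketch simulates a sum computed in two stages: partial sums per (key, salt) pair first, then a final sum per key. In Spark, each stage would be its own groupBy aggregation; the function name, the sample rows, and the seed are assumptions made for this example.

```python
import random
from collections import defaultdict

def two_stage_sum(rows, num_salts, seed=42):
    """Sum values per key in two stages: partial sums per (key, salt), then a final sum per key."""
    rng = random.Random(seed)

    # Stage 1: spread each key across num_salts buckets and aggregate within each bucket
    partial_sums = defaultdict(int)
    for key, value in rows:
        salt = rng.randrange(num_salts)
        partial_sums[(key, salt)] += value

    # Stage 2: combine the (at most num_salts) partial results of each key
    totals = defaultdict(int)
    for (key, _salt), partial in partial_sums.items():
        totals[key] += partial
    return dict(totals), dict(partial_sums)

# Hypothetical input: key "a" is heavily skewed
rows = [("a", 1)] * 9000 + [("b", 1)] * 500 + [("c", 1)] * 500
totals, partial_sums = two_stage_sum(rows, num_salts=4)
print(totals)
print(len(partial_sums))
```

The first stage splits the work for the skewed key into several smaller groups, and the second stage only has to combine a handful of partial results per key. This works directly for aggregations such as sum and count; an average needs the sum and the count carried through both stages.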

The following is an example of joining skewed data with the salting technique:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, ceil, rand, concat, col

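# This example assumes two existing DataFrames, skewed_data and non_skewed_data,
# that share a "key" column
spark = SparkSession.builder.getOrCreate()

# Define the number of salt values
num_salts = 3

# Row count above which a key is treated as skewed (example value; tune it for your data)
skew_threshold = 1000000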

# Function to identify skewed keys
def identify_skewed_keys(df, key_column, threshold):
    key_counts = df.groupBy(key_column).count()
    return key_counts.filter(key_counts['count'] > threshold).select(key_column)

# Identify skewed keys
skewed_keys = identify_skewed_keys(skewed_data, "key", skew_threshold)

# Splitting the dataset
skewed_data_subset = skewed_data.join(skewed_keys, ["key"], "inner")
non_skewed_data_subset = skewed_data.join(skewed_keys, ["key"], "left_anti")

# Apply salting to skewed data
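# Assign each row a random integer salt from 0 to num_salts - 1 (fixed seed for repeatable runs)
skewed_data_subset = skewed_data_subset.withColumn("salt", (rand(seed=42) * num_salts).cast("int"))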
skewed_data_subset = skewed_data_subset.withColumn("salted_key", concat(col("key"), lit("_"), col("salt")))

# Replicate skewed rows in non-skewed dataset
def replicate_skewed_rows(df, keys, multiplier):
    replicated_df = df.join(keys, ["key"]).crossJoin(spark.range(multiplier).withColumnRenamed("id", "salt"))
    replicated_df = replicated_df.withColumn("salted_key", concat(col("key"), lit("_"), col("salt")))
    return replicated_df.drop("salt")

replicated_non_skewed_data = replicate_skewed_rows(non_skewed_data, skewed_keys, num_salts)

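# Perform the JOIN operation on the salted keys for skewed data; drop the helper
# columns so the result has the same columns as the regular join below
result_skewed = skewed_data_subset.join(replicated_non_skewed_data.drop("key"), "salted_key")\
    .drop("salted_key", "salt")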

# Perform regular join on non-skewed data
result_non_skewed = non_skewed_data_subset.join(non_skewed_data, "key")

# Combine results
final_result = result_skewed.union(result_non_skewed)

In this code, we first define the number of salt values (num_salts) and identify the skewed keys. We then add a salt column to the skewed subset using the withColumn() function, setting its value to a random number derived from the rand() function, and append it to the key to build the salted_key column. The function replicate_skewed_rows replicates each row of the non-skewed dataset (non_skewed_data) that has a skewed key num_salts times, once per salt value. This ensures that each salted key in the skewed data has a matching row. Finally, a join operation is performed on the salted_key column between the skewed and non-skewed datasets. This join is more balanced compared to a direct join on the original key, because salting and replication have mitigated the data skew.

The rand() function used in this example generates a random number between 0 and 1 for each row. To achieve consistent results across different runs of the code, pass a fixed seed to rand(). You can choose any fixed integer value for the seed.

The following figures illustrate the data distribution before (left) and after (right) salting. The heavily skewed key2 is identified and salted into key2_0, key2_1, and key2_2, balancing the data distribution and preventing any single node from being overloaded. After processing, the results can be aggregated back, so that the final output is consistent with the unsalted key values.
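The effect described above can also be reproduced with a small plain-Python simulation that counts rows per key before and after salting. The key counts, the helper names salt_keys and unsalt, and the seed are assumptions for illustration only.

```python
import random
from collections import Counter

def salt_keys(keys, skewed_keys, num_salts, seed=42):
    """Append a random salt suffix (_0 to _<num_salts - 1>) to every occurrence of a skewed key."""
    rng = random.Random(seed)
    return [
        f"{key}_{rng.randrange(num_salts)}" if key in skewed_keys else key
        for key in keys
    ]

def unsalt(key, skewed_keys):
    """Remove the salt suffix to recover the original key."""
    base, _, suffix = key.rpartition("_")
    return base if base in skewed_keys and suffix.isdigit() else key

# Hypothetical dataset in which key2 is heavily skewed
keys = ["key1"] * 100 + ["key2"] * 900 + ["key3"] * 100
before = Counter(keys)
after = Counter(salt_keys(keys, skewed_keys={"key2"}, num_salts=3))

# Aggregate the salted counts back to the original keys
restored = Counter()
for key, count in after.items():
    restored[unsalt(key, {"key2"})] += count

print(before)
print(after)
```

After salting, the 900 rows of key2 are split across key2_0, key2_1, and key2_2, so the largest group is roughly a third of its original size, and removing the suffix recovers the original counts.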

Other techniques to use on skewed data during the join operation

When you’re performing skewed joins, you can use salting or broadcasting techniques, or divide your data into skewed and regular parts before joining the regular data and broadcasting the skewed data.

If you are using Spark 3, Adaptive Query Execution includes automatic optimizations that try to mitigate data skew issues on joins. You can tune these optimizations through dedicated Apache Spark configuration properties.
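As a sketch of what that tuning can look like, the following snippet lists the Spark 3 properties that control skew join handling in Adaptive Query Execution and applies them through spark.conf.set(). The helper name apply_settings and the dictionary name are hypothetical, and the values shown are the defaults documented for Spark 3.3 to the best of our knowledge; verify them against the documentation for your Spark version before changing them.

```python
# Spark 3 adaptive skew join properties; values are believed to match the Spark 3.3 defaults
SKEW_JOIN_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # A partition is treated as skewed when it is larger than this factor
    # multiplied by the median partition size...
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    # ...and also larger than this absolute size
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256MB",
}

def apply_settings(spark, settings=SKEW_JOIN_SETTINGS):
    """Apply each property to the given SparkSession through spark.conf.set()."""
    for name, value in settings.items():
        spark.conf.set(name, value)

# Usage in a job or interactive session where `spark` is the active SparkSession:
# apply_settings(spark)
```

Lowering the factor or the threshold makes Spark split skewed partitions more aggressively, at the cost of more tasks.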

Conclusion

This post provided details on how to detect data skew in your data integration jobs using AWS Glue and different techniques for handling it. Having a good data distribution is key to achieving the best performance on distributed processing systems like Apache Spark.

Although this post focused on AWS Glue, the same concepts apply to jobs you may be running on Amazon EMR using Apache Spark or Amazon Athena for Apache Spark.

As always, AWS welcomes your feedback. Please leave your comments and questions in the comments section.


About the Authors

Salim Tutuncu is a Sr. PSA Specialist on Data & AI, based in Amsterdam with a focus on the EMEA North and EMEA Central regions. With a rich background in the technology sector that spans roles as a Data Engineer, Data Scientist, and Machine Learning Engineer, Salim has built a formidable expertise in navigating the complex landscape of data and artificial intelligence. His current role involves working closely with partners to develop long-term, profitable businesses leveraging the AWS Platform, particularly in Data and AI use cases.

Angel Conde Manjon is a Sr. PSA Specialist on Data & AI, based in Madrid, and focuses on EMEA South and Israel. He has previously worked on research related to Data Analytics and Artificial Intelligence in diverse European research projects. In his current role, Angel helps partners develop businesses centered on Data and AI.