Tag Archives: Amazon RDS

Manage your Digital Microscopy Data using OMERO on AWS

Post Syndicated from Travis Berkley original https://aws.amazon.com/blogs/architecture/manage-your-digital-microscopy-data-using-omero-on-aws/

The Open Microscopy Environment (OME) consortium develops open-source software and format standards for microscopy data. OME Remote Objects (OMERO) is an open-source image data management platform designed to support digital pathology and cellular biology studies. You can access, share, and work with various biological data. This can include histopathology, high content screening, electron microscopy, and even non-image genotype data. Deploying this open source tool on Amazon Web Services (AWS) allows you to access your image data in a secure central repository. You can take advantage of elastic storage by growing the archive as needed without provisioning excess storage beforehand. OMERO has a web interface, which facilitates data access and visualization. It also supports connections through the OMERO client or other third-party image analysis tools, like CellProfiler™, QuPath, Fiji, ImageJ, and others.

The challenge of microscopy data

Saint Louis University (SLU) School of Medicine Research Microscopy and Histology Core required a centralized system for both distribution and hosting. The solution had to provide research imaging distribution to both internal and external clients. It also needed to host an educational platform for microscope images. SLU decided that the open-source software OMERO was an ideal fit for them.

In order to provide speed, ease of access, and security for the University’s computer networks, SLU decided the solution had to be hosted in the cloud. By partnering with AWS, SLU established a robust system for their clients. The privately hosted images on OMERO represent research material databases used by University researchers. OMERO also hosts teaching datasets for resident and fellow education. Other publicly hosted repositories provide access to source images for future publishing standards and regulations. SLU reported that the implementation was extraordinarily smooth for a non-programmer. In addition, the system design allowed for advanced data management to control costs and security.

Reviewing the OMERO architecture

OMERO is a typical three-tier web application, consisting of the following components:

  • OMERO.web provides access to OMERO’s data hierarchies and also enables annotation, organization, and visualization of data. This web browser-based client of OMERO.server exposes the annotation-based data-sharing mechanism.
  • OMERO.server is a middleware server application that provides access to image data and metadata stored in a series of databases. It contains a multi-threaded, image-rendering engine and supports a wide range (>140) of image pyramid formats through the Bio-Formats Java library. This Java application facilitates remote access and interoperability for modern scientific studies. It also exposes an API to allow any OMERO client to access the original data and any derived measurements.
  • OMERO relational database (PostgreSQL) provides the underlying storage facilities. This storage backend contains the processed metadata associated with the binary images, measurement specification, user information, structured annotations, and more.
Figure 1. Architectural diagram for a highly available (HA) deployment of OMERO on AWS including data ingestion options

To achieve the highly available (HA) deployment in the diagram, follow the guidance from this GitHub repository. Since OMERO only supports one writer per mounted network file share, the HA deployment runs one OMERO read+write server and one read-only server; otherwise, multiple writer instances would compete for access to Amazon Elastic File System (EFS). If HA is not a requirement, you can lower costs by deploying only the read+write OMERO.server.

OMERO is deployed on AWS using AWS CloudFormation (CFN) templates, which deploy two nested CFN stacks: one for storage and one for compute. The storage template creates an EFS file system and an Amazon Relational Database Service (RDS) instance running PostgreSQL. EFS provides the option to move files to an Infrequent Access storage class after a certain number of days to save storage cost, and RDS offers a Multi-AZ option to improve business continuity. The compute template creates Amazon Elastic Container Service (Amazon ECS) containers for the OMERO web and server functions. You have the option to deploy the OMERO containers on the AWS Fargate or Amazon EC2 launch type. It also creates an Application Load Balancer (ALB) with duration-based stickiness enabled and an AWS Certificate Manager (ACM) certificate for Transport Layer Security (TLS) termination at the ALB. Only the ALB is publicly accessible; the web portal is protected behind it in private subnets. A VPC and subnets are required, which can be obtained via this CFN template. The deployment also requires the hosted zone ID and fully qualified domain name in Amazon Route 53, which are used to validate the TLS certificate. If higher security is not a requirement, there is an option to deploy without a registered domain and hosted zone in Route 53. You will then be able to access OMERO.web through the Application Load Balancer DNS name without TLS encryption.

Additionally, the containers of OMERO.web and OMERO.server can be extended with plugins. The landing page for login can be customized with logos, brands, or disclaimers. Build a new Docker container image with specific configuration changes to enrich the functionality of this open source platform.

You can use Amazon ECS Exec to access the OMERO command line interface (CLI) and import images within the OMERO.server container, running on either the AWS Fargate or EC2 launch type. You can also run Amazon ECS Exec from AWS CloudShell. The OMERO CFN templates enable Amazon ECS Exec commands by default; you only need to install the AWS CLI and the Session Manager plugin on your client or in AWS CloudShell to initiate the commands. When you import images within the OMERO.server container instances, you can use OMERO in-place import to avoid redundant copies of the image files on Amazon EFS. Alternatively, you can access the Windows desktop OMERO client, OMERO.insight, via the application virtualization service Amazon AppStream 2.0, which connects to OMERO.server in the same VPC. Amazon AppStream 2.0 allows Amazon S3 to be used as home folder storage, so you can import images directly from Amazon S3 into OMERO.server.
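As a rough illustration (not the project’s official tooling), the following boto3 sketch shows how the ECS Exec API could be called to run an OMERO in-place import inside the OMERO.server container. The cluster, task, container, and file names are placeholders, and in practice the returned session is handled by the AWS CLI with the Session Manager plugin rather than by boto3 directly.

import boto3

ecs = boto3.client('ecs')

# Cluster, task, container, and paths below are placeholders for your own deployment.
response = ecs.execute_command(
    cluster='omero-cluster',
    task='0123456789abcdef0',           # task ID or ARN of the OMERO.server task
    container='omeroserver',
    interactive=True,                    # ECS Exec currently requires interactive sessions
    # In-place import links the file already on EFS instead of copying it again
    command='omero import --transfer=ln_s /data/images/sample.ome.tiff'
)

# The returned session is streamed by the Session Manager plugin (the AWS CLI handles this for you).
print(response['session']['sessionId'])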

AWS offers multiple options to move your microscopic image data from on premises facilities to the cloud storage, as illustrated in Figure 1:

  1. Use AWS Transfer Family to copy data directly from on-premises devices to EFS.
  2. Alternatively, transfer data directly from your on-premises Network File System (NFS) to EFS using AWS DataSync. AWS DataSync can also be used to transfer files from S3 to EFS (a minimal sketch follows this list).
  3. Set up AWS Storage Gateway, in particular File Gateway, to move your image files from on premises to Amazon Simple Storage Service (S3) first. An S3 Lifecycle policy can archive images to lower-cost storage classes. You can track storage activity metrics using Amazon S3 Storage Lens and gain insights on storage cost using cost allocation tags. Once the files are in Amazon S3, you can either set up AWS DataSync to transfer files from S3 to EFS, or import the files directly into OMERO.server.
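As a minimal sketch of option 2 (the location ARNs and task name are placeholders; you would first create DataSync locations for your NFS share or S3 bucket and for your EFS file system), the transfer can be set up with boto3:

import boto3

datasync = boto3.client('datasync')

# Location ARNs are placeholders; create them first with the create_location_* calls
# for your source (NFS or S3) and destination (EFS).
task = datasync.create_task(
    SourceLocationArn='arn:aws:datasync:us-east-1:111122223333:location/loc-source',
    DestinationLocationArn='arn:aws:datasync:us-east-1:111122223333:location/loc-efs',
    Name='microscopy-images-to-efs'
)

# Kick off a one-time transfer; DataSync also supports schedules.
execution = datasync.start_task_execution(TaskArn=task['TaskArn'])
print(execution['TaskExecutionArn'])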

To find the latest developments for this solution, check out the digital pathology on AWS repository on GitHub.

Conclusion

Researchers and scientists at Saint Louis University were able to grow their image repository on AWS without the concern of fixed storage limits. They can scale their compute environment up or down as their research requirements dictate. Managed services like Amazon ECS and Amazon RDS significantly reduce the operational workload on researchers. SLU reports that this platform is of great use to their researchers. Other universities, academic medical centers, and pharmaceutical and biotechnology companies can also use this cloud-based image data management platform to collect, visualize, and share access to their image data assets.

Building well-architected serverless applications: Regulating inbound request rates – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-regulating-inbound-request-rates-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Reliability question REL1: How do you regulate inbound request rates?

This post continues part 1 of this reliability question. Previously, I cover controlling inbound request rates using throttling. I go through how to use throttling to control steady-rate and burst-rate requests. I show some solutions for performance testing to identify the request rates that your workload can sustain before impacting performance.

Good practice: Use, analyze, and enforce API quotas

API quotas limit the maximum number of requests a given API key can submit within a specified time interval. Metering API consumers provides a better understanding of how different consumers use your workload at sustained and burst rates at any point in time. With this information, you can define fine-grained rate limits and multiple quota tiers, group them according to consumer needs, and adjust the limits on a regular basis.

Segregate API consumers’ steady-rate requests and their quotas into multiple buckets or tiers

Amazon API Gateway usage plans allow your API consumer to access selected APIs at agreed-upon request rates and quotas. These help your consumers meet their business requirements and budget constraints. Create and attach API keys to usage plans to control access to certain API stages. I show how to create usage plans and how to associate them with API keys in “Building well-architected serverless applications: Controlling serverless API access – part 2”.
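The same configuration can be scripted. The following boto3 sketch creates a usage plan with throttle and quota limits and attaches an existing API key to it; the API ID, stage name, and key ID are placeholders:

import boto3

apigateway = boto3.client('apigateway')

# The API ID, stage name, and key ID below are placeholders.
plan = apigateway.create_usage_plan(
    name='gold-tier',
    throttle={'rateLimit': 100.0, 'burstLimit': 200},   # steady-rate and burst limits
    quota={'limit': 100000, 'period': 'MONTH'},          # maximum requests per month
    apiStages=[{'apiId': 'a1b2c3d4e5', 'stage': 'prod'}]
)

apigateway.create_usage_plan_key(
    usagePlanId=plan['id'],
    keyId='abcdef1234',      # existing API key ID
    keyType='API_KEY'
)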

API key associated with usage plan

You can extract utilization data from usage plans to analyze API usage on a per-API key basis. In the example, I show how to use usage plans to see how many requests are made.
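As a hedged sketch, the per-key utilization data can be pulled with boto3 (the usage plan ID and date range are placeholders):

import boto3

apigateway = boto3.client('apigateway')

# The usage plan ID is a placeholder; the dates bound the reporting window.
usage = apigateway.get_usage(
    usagePlanId='usageplan123',
    startDate='2021-09-01',
    endDate='2021-09-30'
)

# 'items' maps each API key ID to its daily used/remaining request counts.
print(usage['items'])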

View API key usage

This allows you to generate billing documents and determine whether your customers need higher or lower limits. Provide a mechanism for customers to request higher limits preemptively, so that when they anticipate greater API usage, they can act proactively.

API Gateway Lambda authorizers can dynamically associate API keys to a given request. This can be used where you do not control API consumers, or want to associate API keys based on your own criteria. For more information, see the documentation.

You can also visualize usage plans with Amazon QuickSight using enriched API Gateway access logs.

Visualize usage plans with Amazon QuickSight

Define whether your API consumers are end users or machines

Understanding your API consumers helps you manage how they connect to your API. This helps you define a request access pattern strategy, which can distinguish between end users and machines.

Machine consumers make automated connections to your API, which may require a different access pattern from end users. You may decide to prioritize end user consumers to provide a better experience. Machine consumers may be able to handle request throttling automatically.

Best practice: Use mechanisms to protect non-scalable resources

Limit component throughput by enforcing how many transactions it can accept

AWS Lambda functions can scale faster than traditional resources, such as relational databases and cache systems. Protect your non-scalable resources by ensuring that components that scale quickly do not exceed the throughput of downstream systems. This can prevent system performance from degrading. There are a number of ways to achieve this, either directly or via buffer mechanisms such as queues and streams.

For relational databases such as Amazon RDS, you can limit the number of connections per user, in addition to the global maximum number of connections. With Amazon RDS Proxy, your applications can pool and share database connections to improve their ability to scale.

Amazon RDS Proxy

For additional options for using RDS with Lambda, see the AWS Serverless Hero blog post “How To: Manage RDS Connections from AWS Lambda Serverless Functions”.
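As a minimal sketch (assuming a MySQL database fronted by an RDS Proxy endpoint and the pymysql client used later in this series), a Lambda function connects to the proxy endpoint exactly as it would to the database, and the proxy pools connections on its behalf. The endpoint and credential variables are placeholders:

import os
import pymysql

# The proxy endpoint and credentials are placeholders; in practice, retrieve
# credentials from Secrets Manager or use IAM database authentication.
conn = pymysql.connect(
    host=os.environ['PROXY_ENDPOINT'],   # e.g. an RDS Proxy endpoint instead of the DB endpoint
    user=os.environ['DB_USER'],
    passwd=os.environ['DB_PASSWORD'],
    db=os.environ['DB_NAME'],
    connect_timeout=5
)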

Cache results, and only connect to and fetch data from databases when needed. This reduces the load on the downstream database. Adjust the maximum number of connections for caching systems. Include a caching expiration mechanism to prevent serving stale records. For more information on caching implementation patterns and considerations, see “Caching Best Practices”.

Lambda provides managed scaling. When a function is first invoked, the Lambda service creates an instance of the function to process the event. This is called a cold start. After completion, the function remains available for a period of time to process subsequent events. These are called warm starts. If other events arrive while the function is busy, Lambda creates more instances of the function to handle these requests concurrently as cold starts. The following example shows 10 events processed in six concurrent requests.

Lambda concurrency

You can control the number of concurrent function invocations to both reserve and limit the maximum concurrency your function can achieve. You can configure reserved concurrency to set the maximum number of concurrent instances for the function. This can protect downstream resources such as a database by ensuring Lambda can only scale up to the number of connections the database can support.

For example, you may have a traditional database or external API that can only support a maximum of 50 concurrent connections. You can set the maximum number of concurrent Lambda functions using the function concurrency settings. Setting the value to 50 ensures that the traditional database or external API is not overwhelmed.
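A hedged boto3 sketch of applying that limit (the function name is a placeholder):

import boto3

lambda_client = boto3.client('lambda')

# Cap the function at 50 concurrent executions so the downstream
# database or API never sees more than 50 connections.
lambda_client.put_function_concurrency(
    FunctionName='my-database-function',
    ReservedConcurrentExecutions=50
)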

Edit Lambda concurrency

You can also set the Lambda function concurrency to 0, which disables the Lambda function in the event of anomalies.

Another solution to protect downstream resources is to use an intermediate buffer. A buffer can persistently store messages in a stream or queue until a receiver processes them. This helps you control how fast messages are processed, which can protect the load on downstream resources.

Amazon Kinesis Data Streams allows you to collect and process large streams of data records in real time, and can act as a buffer. Streams consist of a set of shards that contain a sequence of data records. When using Lambda to process records, it processes one batch of records at a time from each shard.

Kinesis Data Streams controls concurrency at the shard level, meaning that a single shard has a single concurrent invocation. This can reduce downstream calls to non-scalable resources such as a traditional database. Kinesis Data Streams also supports batch windows of up to 5 minutes and configurable batch sizes. These can also be used to control how frequently invocations occur.
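A hedged boto3 sketch of wiring a Lambda function to a Kinesis stream with a batch size and batch window (the stream ARN and function name are placeholders):

import boto3

lambda_client = boto3.client('lambda')

# One concurrent invocation per shard; batching further limits how often the function is called.
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:111122223333:stream/example-stream',
    FunctionName='process-stream-records',
    StartingPosition='LATEST',
    BatchSize=100,                          # records per invocation (per shard)
    MaximumBatchingWindowInSeconds=300      # wait up to 5 minutes to fill a batch
)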

To learn how to manage scaling with Kinesis, see the documentation. To learn more about how Lambda works with Kinesis, read the blog series “Building serverless applications with streaming data”.

Lambda and Kinesis shards

Amazon Simple Queue Service (SQS) is a fully managed serverless message queuing service that enables you to decouple and scale microservices. You can offload tasks from one component of your application by sending them to a queue and processing them asynchronously.

SQS can act as a buffer, using a Lambda function to process the messages. Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages. Lambda reads messages in batches and invokes your function once for each batch. When your function successfully processes a batch, Lambda deletes its messages from the queue.

You can protect downstream resources using the Lambda concurrency controls. This limits the number of concurrent Lambda functions that pull messages off the queue. The messages persist in the queue until Lambda can process them. For more information, see “Using AWS Lambda with Amazon SQS”.
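A hedged boto3 sketch of connecting a function to a queue (the queue ARN and function name are placeholders); combined with the concurrency controls above, this limits how quickly messages are pulled off the queue:

import boto3

lambda_client = boto3.client('lambda')

# Messages wait in the queue until the function has capacity to process them.
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:sqs:us-east-1:111122223333:example-queue',
    FunctionName='process-queue-messages',
    BatchSize=10     # messages per invocation
)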

Lambda and SQS

Conclusion

Regulating inbound requests helps you adapt different scaling mechanisms based on customer demand. You can achieve better throughput for your workloads and make them more reliable by controlling requests to a rate that your workload can support.

In this post, I cover using, analyzing, and enforcing API quotas using usage plans and API keys. I show mechanisms to protect non-scalable resources such as using RDS Proxy to protect downstream databases. I show how to control the number of Lambda invocations using concurrency controls to protect downstream resources. I explain how you can use streams and queues as an intermediate buffer to store messages persistently until a receiver processes them.

In the next post in the series, I cover the second reliability question from the Well-Architected Serverless Lens, building resiliency into serverless applications.

For more serverless learning resources, visit Serverless Land.

Data preparation using an Amazon RDS for MySQL database with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-an-amazon-rds-for-mysql-database-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, or Amazon Aurora and Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an RDS database, store the cleaned data in an S3 data lake, and build a business intelligence (BI) report.

Use case overview

For our use case, we use three datasets:

  • A school dataset that contains school details like school ID and school name
  • A student dataset that contains student details like student ID, name, and age
  • A student study details dataset that contains student study time, health, country, and more

The following diagram shows the relation of these tables.

For our use case, this data is collected by a survey organization after an annual exam, and updates are made in Amazon RDS for MySQL using a JavaScript-based frontend application. We join the tables to create a single view and create aggregated data through a series of data preparation steps, and the business team uses the output data to create BI reports.

Solution overview

The following diagram illustrates our solution architecture. We use Amazon RDS to store data, DataBrew for data preparation, Amazon Athena for data analysis with standard SQL, and Amazon QuickSight for business reporting.

The workflow includes the following steps:
  1. Create a JDBC connection for RDS and a DataBrew project. DataBrew does the transformation to find the top performing students across all the schools considered for analysis.
  2. The DataBrew job writes the final output to our S3 output bucket.
  3. After the output data is written, we can create external tables on top of it with Athena CREATE TABLE statements and load partitions with MSCK REPAIR TABLE commands.
  4. Business users can use QuickSight for BI reporting, which fetches data through Athena. Data analysts can also use Athena to analyze the complete refreshed dataset.

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use three mock datasets. You can download the DDL code and data files from GitHub.

  1. Create the RDS for MySQL instance to capture the student health data.
  2. Make sure you have set up the correct security group for Amazon RDS. For more information, see Setting Up a VPC to Connect to JDBC Data Stores.
  3. Create three tables: student_tbl, study_details_tbl, and school_tbl. You can use DDLsql to create the database objects.
  4. Upload the student.csv, study_details.csv, and school.csv files in their respective tables. You can use student.sql, study_details.sql, and school.sql to insert the data in the tables.

Create an Amazon RDS connection

To create your Amazon RDS connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.

  1. For Connection name, enter a name (for example, student_db-conn).
  2. For Connection type, select JDBC.
  3. For Database type, choose MySQL.

  1. Provide other parameters like RDS endpoint, port, database name, and database login credentials.

  1. In the Network options section, choose the VPC, subnet, and security group of your RDS instance.
  2. Choose Create connection.

Create your datasets

We have three tables in Amazon RDS: school_tbl, student_tbl, and study_details_tbl. To use these tables, we first need to create a dataset for each table.

To create the datasets, complete the following steps (we walk you through creating the school dataset):

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.

  1. For Dataset name, enter school-dataset.
  2. Choose the connection you created (AwsGlueDatabrew-student-db-conn).
  3. For Table name, enter school_tbl.
  4. Choose Create dataset.

  1. Repeat these steps for the student_tbl and study_details_tbl tables, and name the new datasets student-dataset and study-detail-dataset, respectively.

All three datasets are available to use on the Datasets page.

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project Name, enter my-rds-proj.
  4. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. For Dataset name, select study-detail-dataset.

  1. For Role name, choose your AWS Identity and Access Management (IAM) role to use with DataBrew.
  2. Choose Create project.

You can see a success message along with our RDS study_details_tbl table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Open an Amazon RDS project and build a transformation recipe

In a DataBrew interactive session, you can cleanse and normalize your data using over 250 built-in transforms. In this post, we use DataBrew to identify top performing students by performing a few transforms and finding students who got marks greater than or equal to 60 in the last annual exam.

First, we use DataBrew to join all three RDS tables. To do this, we perform the following steps:

  1. Navigate to the project you created.
  2. Choose Join.

  1. For Select dataset, choose student-dataset.
  2. Choose Next.

  1. For Select join type, select Left join.
  2. For Join keys, choose student_id for Table A and deselect student_id for Table B.
  3. Choose Finish.

Repeat the steps for school-dataset based on the school_id key.

  1. Choose MERGE to merge first_name and last_name.
  2. Enter a space as a separator.
  3. Choose Apply.

We now filter the rows based on marks value greater than or equal to 60 and add the condition as a recipe step.

  1. Choose FILTER.

  1. Provide the source column and filter condition and choose Apply.

The final data shows the top-performing students, who had marks greater than or equal to 60.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter top-performer-student.

For this post, we use Parquet as the output format.

  1. For File type, choose PARQUET.
  2. For S3 location, enter the S3 path of the output folder.

  1. For Role name, choose an existing role or create a new one.
  2. Choose Create and run job.

  1. Navigate to the Jobs page and wait for the top-performer-student job to complete. (You can also start and monitor the job programmatically, as sketched after these steps.)

  1. Choose the Destination link to navigate to Amazon S3 to access the job output.
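If you later want to rerun or monitor this recipe job programmatically instead of through the console, a hedged boto3 sketch looks like the following (the job name matches the one created above):

import boto3

databrew = boto3.client('databrew')

# Start a new run of the recipe job created in the console.
run = databrew.start_job_run(Name='top-performer-student')

# Check the state of that run (poll until it succeeds or fails).
status = databrew.describe_job_run(
    Name='top-performer-student',
    RunId=run['RunId']
)
print(status['State'])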

Run an Athena query

Let’s validate the aggregated table output in Athena by running a simple SELECT query. The following screenshot shows the output.
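As an illustration only (the database, table, and output location names are placeholders that depend on how you cataloged the job output), such a validation query could also be issued with boto3:

import boto3

athena = boto3.client('athena')

# Database, table, and output location are placeholders.
query = athena.start_query_execution(
    QueryString='SELECT * FROM student_db.top_performer_student LIMIT 10',
    QueryExecutionContext={'Database': 'student_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)

# In practice, poll get_query_execution until the state is SUCCEEDED,
# then fetch the first page of results.
results = athena.get_query_results(QueryExecutionId=query['QueryExecutionId'])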

Create reports in QuickSight

Now let’s do our final step of the architecture, which is creating BI reports through QuickSight by connecting to the Athena aggregated table.

  1. On the QuickSight console, choose Athena as your data source.

  1. Choose the database and catalog you have in Athena.
  2. Select your table.
  3. Choose Select.

Now you can create a quick report to visualize your output, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after you receive notification that the data refresh is complete. We recommend using SPICE storage to get better performance.

Clean up

Delete the following resources that might accrue cost over time:

  • The RDS instance
  • The recipe job top-performer-student
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project my-rds-proj and its associated recipe my-rds-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an RDS database. We learned how to use this connection to create a DataBrew dataset for each table, and how to reuse this connection multiple times. We also saw how we can bring data from Amazon RDS into DataBrew and seamlessly apply transformations and run recipe jobs that refresh transformed data for BI reporting.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Query your Oracle database using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-your-oracle-database-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Oracle as your transactional data store, you may need to join the data in your data lake with Oracle on Amazon Relational Database Service (Amazon RDS), Oracle running on Amazon Elastic Compute Cloud (Amazon EC2), or an on-premises Oracle database, for example to build a dashboard or create consolidated reporting.

In these use cases, Amazon Athena Federated Query allows you to seamlessly access the data in your Oracle database without having to move the data to the S3 data lake. This removes the overhead of managing such jobs.

In this post, we walk you through a step-by-step configuration to set up Athena Federated query using AWS Lambda to access data in Oracle on Amazon RDS.

For this post, we will be using the Oracle Athena Federated query connector developed by Trianz. The runtime includes Oracle XE running on Amazon EC2 and Amazon RDS. Your Oracle instance can be on Amazon RDS, Amazon EC2, or on premises. You can deploy the Trianz Oracle AFQ connector available in the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Amazon Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into the Lambda function’s memory, the connector spills the data to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Oracle instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Oracle XE instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have an Oracle database up and running.

Create a secret for the Oracle instance

Our first step is to create a secret for the Oracle instance with a username and password using Secrets Manager.
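The console steps below walk through this; as a hedged programmatic alternative (the secret name, username, and password are placeholders), the same secret could be created with boto3:

import json
import boto3

secretsmanager = boto3.client('secretsmanager')

# Key-value pairs mirror what you would enter in the console.
secretsmanager.create_secret(
    Name='OracleAFQ_XE',    # matches the SecretNamePrefix used later in this post
    SecretString=json.dumps({
        'username': 'admin',          # placeholder
        'password': 'example-pass'    # placeholder
    })
)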

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Enter the credentials as key-value pairs (username, password) for your Oracle XE instance.

  1. For Secret name, enter a name for your secret. Use the prefix OracleAFQ so it’s easy to find.
  2. Leave the remaining fields at their defaults and choose Next.
  3. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, I use athena-accelerator/oracle.

Configure Athena federation with the Oracle XE instance

To configure Athena federation with your Oracle instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzOracleAthenaJDBC.

  1. For Application name, enter TrianzOracleAthenaJDBC.
  2. For SecretNamePrefix, enter OracleAFQ_XE.
  3. For SpillBucket, enter athena-accelerator/oracle.
  4. For JDBCConnectorConfig, use the format oracle://jdbc:oracle:thin:${secretname}@//hostname:port/servicename.

For example, we enter oracle://jdbc:oracle:thin:${OracleAFQ_XE}@//12.345.67.89:1521/xe.

  1. For DisableSpillEncryption, enter false.
  2. For LambdaFunctionName, enter oracleconnector.
  3. For SecurityGroupID, enter the security group ID where the Oracle instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, enter the subnets where the Oracle instance is running, separated by commas.

Make sure the subnet is in a VPC and has a NAT gateway and an internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your federated queries using lambda:trianzoracle against tables in the Oracle database. trianzoracle is the name of the Lambda function that we created in step 7 of the previous section of this blog.

lambda:trianzoracle references the data source connector Lambda function, using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following query joins the dataset between Oracle and the S3 data lake.
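As an illustration only (the schema, table, and column names here are hypothetical), a federated join of this shape references the Oracle tables through the lambda:trianzoracle connector and joins them with a table in the S3 data lake:

import boto3

athena = boto3.client('athena')

# Schema, table, and column names are hypothetical; the Oracle tables are
# referenced through the Lambda connector catalog ("lambda:trianzoracle").
query = """
SELECT o.customer_id, o.order_total, c.customer_segment
FROM "lambda:trianzoracle".sales.orders o
JOIN s3_datalake_db.customer_dim c
  ON o.customer_id = c.customer_id
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://athena-accelerator/results/'}  # placeholder
)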

Key performance best practices

If you’re considering Athena Federated query with Oracle, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Oracle database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Oracle database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Oracle to your S3 data lake.
  • The star schema is a commonly used data model in Oracle. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Oracle. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Oracle database’s session and connection limits to ensure that connections are not queuing or being rejected, and to confirm the database is sized to handle the concurrent connections that federated queries generate.

Conclusion

In this post, you learned how to configure and use Athena Federated query with Oracle. Now you don’t need to wait for all the data in your Oracle data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Oracle for better performance. When queries are well written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solutions Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Building well-architected serverless applications: Implementing application workload security – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-implementing-application-workload-security-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Security question SEC3: How do you implement application security in your workload?

This post continues part 1 of this security question. Previously, I cover reviewing security awareness documentation such as the Common Vulnerabilities and Exposures (CVE) database. I show how to use GitHub security features to inspect and manage code dependencies. I then show how to validate inbound events using Amazon API Gateway request validation.

Required practice: Store secrets that are used in your code securely

Store secrets such as database passwords or API keys in a secrets manager. Using a secrets manager allows for auditing access, easier rotation, and prevents exposing secrets in application source code. There are a number of AWS and third-party solutions to store and manage secrets.

AWS Partner Network (APN) member Hashicorp provides Vault to keep secrets and application data secure. Vault has a centralized workflow for tightly controlling access to secrets across applications, systems, and infrastructure. You can store secrets in Vault and access them from an AWS Lambda function to, for example, access a database. You can use the Vault Agent for AWS to authenticate with Vault, receive the database credentials, and then perform the necessary queries. You can also use the Vault AWS Lambda extension to manage the connectivity to Vault.

AWS Systems Manager Parameter Store allows you to store configuration data securely, including secrets, as parameter values.

AWS Secrets Manager enables you to replace hardcoded credentials in your code with an API call to Secrets Manager to retrieve the secret programmatically. You can protect, rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. You can also generate secure secrets. By default, Secrets Manager does not write or cache the secret to persistent storage.

Parameter Store integrates with Secrets Manager. For more information, see “Referencing AWS Secrets Manager secrets from Parameter Store parameters.”

To show how Secrets Manager works, deploy the solution detailed in “How to securely provide database credentials to Lambda functions by using AWS Secrets Manager”.

The AWS CloudFormation stack deploys an Amazon RDS MySQL database with a randomly generated password. This is stored in Secrets Manager using a secret resource. A Lambda function behind an API Gateway endpoint returns the record count in a table from the database, using the required credentials. Lambda function environment variables store the database connection details and which secret to return for the database password. The password is not stored as an environment variable, nor in the Lambda function application code.

Lambda environment variables for Secrets Manager

The application flow is as follows:

  1. Clients call the API Gateway endpoint
  2. API Gateway invokes the Lambda function
  3. The Lambda function retrieves the database secrets using the Secrets Manager API
  4. The Lambda function connects to the RDS database using the credentials from Secrets Manager and returns the query results

View the password secret value in the Secrets Manager console, which is randomly generated as part of the stack deployment.

Example password stored in Secrets Manager

The Lambda function includes the following code to retrieve the secret from Secrets Manager. The function then uses it to connect to the database securely.

import json
import os

import boto3
import pymysql

# Connection details are provided through Lambda environment variables
secret_name = os.environ['SECRET_NAME']
rds_host = os.environ['RDS_HOST']
name = os.environ['RDS_USERNAME']
db_name = os.environ['RDS_DB_NAME']
# The function's Region is available via the AWS_REGION environment variable set by Lambda
region_name = os.environ['AWS_REGION']

# Retrieve the database password from Secrets Manager
session = boto3.session.Session()
client = session.client(
	service_name='secretsmanager',
	region_name=region_name
)
get_secret_value_response = client.get_secret_value(
	SecretId=secret_name
)
...
secret = get_secret_value_response['SecretString']
j = json.loads(secret)
password = j['password']
...
# Connect to the database using the retrieved credentials
conn = pymysql.connect(
	rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)

Browsing to the endpoint URL specified in the CloudFormation output displays the number of records. This confirms that the Lambda function has successfully retrieved the secure database credentials and queried the table for the record count.

Lambda function retrieving database credentials

Audit secrets access through a secrets manager

Monitor how your secrets are used to confirm that the usage is expected, and log any changes to them. This helps to ensure that any unexpected usage or change can be investigated, and unwanted changes can be rolled back.

Hashicorp Vault uses Audit devices that keep a detailed log of all requests and responses to Vault. Audit devices can append logs to a file, write to syslog, or write to a socket.

Secrets Manager supports logging API calls with AWS CloudTrail. CloudTrail captures all API calls for Secrets Manager as events. This includes calls from the Secrets Manager console and from code calling the Secrets Manager APIs.

Viewing the CloudTrail event history shows the requests to secretsmanager.amazonaws.com. This shows the requests from the console in addition to the Lambda function.
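The same events can be queried programmatically; a minimal boto3 sketch:

import boto3

cloudtrail = boto3.client('cloudtrail')

# Look up recent events emitted by Secrets Manager.
events = cloudtrail.lookup_events(
    LookupAttributes=[{
        'AttributeKey': 'EventSource',
        'AttributeValue': 'secretsmanager.amazonaws.com'
    }],
    MaxResults=20
)

for event in events['Events']:
    print(event['EventName'], event['EventTime'])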

CloudTrail showing access to Secrets Manager

Secrets Manager also works with Amazon EventBridge so you can trigger alerts when administrator-specified operations occur. You can configure EventBridge rules to alert on deleted secrets or secret rotation. You can also create an alert if anyone tries to use a secret version while it is pending deletion. This can identify and alert when there is an attempt to use an out-of-date secret.

Enforce least privilege access to secrets

Access to secrets must be tightly controlled because the secrets contain sensitive information. Create AWS Identity and Access Management (IAM) policies that enable minimal access to secrets to prevent credentials being accidentally used or compromised. Secrets that have policies that are too permissive could be misused by other environments or developers. This can lead to accidental data loss or compromised systems. For more information, see “Authentication and access control for AWS Secrets Manager”.

Rotate secrets frequently

Rotating your workload secrets is important. This prevents misuse of your secrets since they become invalid within a configured time period.

Secrets Manager allows you to rotate secrets on a schedule or on demand. This enables you to replace long-term secrets with short-term ones, significantly reducing the risk of compromise. Secrets Manager creates a CloudFormation stack with a Lambda function to manage the rotation process for you. Secrets Manager has native integrations with Amazon RDS, Amazon Redshift, and Amazon DocumentDB. It populates the function with the Amazon Resource Name (ARN) of the secret. You specify the permissions to rotate the credentials, and how often you want to rotate the secret.

The CloudFormation stack creates a MySecretRotationSchedule resource with a MyRotationLambda function to rotate the secret every 30 days.

MySecretRotationSchedule:
    Type: AWS::SecretsManager::RotationSchedule
    DependsOn: SecretRDSInstanceAttachment
    Properties:
        SecretId: !Ref MyRDSInstanceRotationSecret
        RotationLambdaARN: !GetAtt MyRotationLambda.Arn
        RotationRules:
            AutomaticallyAfterDays: 30
MyRotationLambda:
    Type: AWS::Serverless::Function
    Properties:
        Runtime: python3.7
        Role: !GetAtt MyLambdaExecutionRole.Arn
        Handler: mysql_secret_rotation.lambda_handler
        Description: 'This is a lambda to rotate MySql user passwd'
        FunctionName: 'cfn-rotation-lambda'
        CodeUri: 's3://devsecopsblog/code.zip'
        Environment:
            Variables:
                SECRETS_MANAGER_ENDPOINT: !Sub 'https://secretsmanager.${AWS::Region}.amazonaws.com'

View and edit the rotation settings in the Secrets Manager console.

Secrets Manager rotation settings

Manually rotate the secret by selecting Rotate secret immediately. This invokes the Lambda function, which updates the database password and updates the secret in Secrets Manager.
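The same rotation can be triggered programmatically; a minimal boto3 sketch (the secret ID is a placeholder):

import boto3

secretsmanager = boto3.client('secretsmanager')

# Invokes the configured rotation Lambda function immediately.
secretsmanager.rotate_secret(SecretId='MyRDSInstanceRotationSecret')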

View the updated secret in Secrets Manager, where the password has changed.

Secrets Manager password change

Browse to the endpoint URL to confirm you can still access the database with the updated credentials.

Access endpoint with updated Secret Manager password

You can provide your own code to customize a Lambda rotation function for other databases or services. The code includes the commands required to interact with your secured service to update or add credentials.

Conclusion

Implementing application security in your workload involves reviewing and automating security practices at the application code level. By implementing code security, you can protect against emerging security threats. You can improve the security posture by checking for malicious code, including third-party dependencies.

In this post, I continue from part 1, looking at securely storing, auditing, and rotating secrets that are used in your application code.

In the next post in the series, I start to cover the reliability pillar from the Well-Architected Serverless Lens with regulating inbound request rates.

For more serverless learning resources, visit Serverless Land.

Overview of Data Transfer Costs for Common Architectures

Post Syndicated from Birender Pal original https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/

Data transfer charges are often overlooked while architecting a solution in AWS. Considering data transfer charges while making architectural decisions can help save costs. This blog post will help identify potential data transfer charges you may encounter while operating your workload on AWS. Service charges are out of scope for this blog, but should be carefully considered when designing any architecture.

Data transfer between AWS and internet

There is no charge for inbound data transfer across all services in all Regions. Data transfer from AWS to the internet is charged per service, with rates specific to the originating Region. Refer to the pricing pages for each service—for example, the pricing page for Amazon Elastic Compute Cloud (Amazon EC2)—for more details.

Data transfer within AWS

Data transfer within AWS could be from your workload to other AWS services, or it could be between different components of your workload.

Data transfer between your workload and other AWS services

When your workload accesses AWS services, you may incur data transfer charges.

Accessing services within the same AWS Region

If the internet gateway is used to access the public endpoint of the AWS services in the same Region (Figure 1 – Pattern 1), there are no data transfer charges. If a NAT gateway is used to access the same services (Figure 1 – Pattern 2), there is a data processing charge (per gigabyte (GB)) for data that passes through the gateway.

Figure 1. Accessing AWS services in same Region

Accessing services across AWS Regions

If your workload accesses services in different Regions (Figure 2), there is a charge for data transfer across Regions. The charge depends on the source and destination Region (as described on the Amazon EC2 Data Transfer pricing page).

Figure 2. Accessing AWS services in different Region

Data transfer within different components of your workload

Charges may apply if there is data transfer between different components of your workload. These charges vary depending on where the components are deployed.

Workload components in same AWS Region

Data transfer within the same Availability Zone is free. One way to achieve high availability for a workload is to deploy in multiple Availability Zones.

Consider a workload with two application servers running on Amazon EC2 and a database running on Amazon Relational Database Service (Amazon RDS) for MySQL (Figure 3). For high availability, each application server is deployed into a separate Availability Zone. Here, data transfer charges apply for cross-Availability Zone communication between the EC2 instances. Data transfer charges also apply between Amazon EC2 and Amazon RDS. Consult the Amazon RDS for MySQL pricing guide for more information.

Figure 3. Workload components across Availability Zones

To minimize the impact of a database instance failure, enable a multi-Availability Zone configuration within Amazon RDS to deploy a standby instance in a different Availability Zone. Replication between the primary and standby instances does not incur additional data transfer charges. However, data transfer charges will apply for any consumers outside the current primary instance’s Availability Zone. Refer to the Amazon RDS pricing page for more detail.

A common pattern is to deploy workloads across multiple VPCs in your AWS network. Two approaches to enabling VPC-to-VPC communication are VPC peering connections and AWS Transit Gateway. Data transfer over a VPC peering connection that stays within an Availability Zone is free. Data transfer over a VPC peering connection that crosses Availability Zones will incur a data transfer charge for ingress/egress traffic (Figure 4).

Figure 4. VPC peering connection

Transit Gateway can interconnect hundreds or thousands of VPCs (Figure 5). Cost elements for Transit Gateway include an hourly charge for each attached VPC, AWS Direct Connect, or AWS Site-to-Site VPN. Data processing charges apply for each GB sent from a VPC, Direct Connect, or VPN to Transit Gateway.

Figure 5. VPC peering using Transit Gateway in same Region

Workload components in different AWS Regions

If workload components communicate across multiple Regions using VPC peering connections or Transit Gateway, additional data transfer charges apply. If the VPCs are peered across Regions, standard inter-Region data transfer charges will apply (Figure 6).

Figure 6. VPC peering across Regions

For peered Transit Gateways, you will incur data transfer charges on only one side of the peer. Data transfer charges do not apply for data sent from a peering attachment to a Transit Gateway. The data transfer for this cross-Region peering connection is in addition to the data transfer charges for the other attachments (Figure 7).

Figure 7. Transit Gateway peering across Regions

Data transfer between AWS and on-premises data centers

Data transfer will occur when your workload needs to access resources in your on-premises data center. There are two common options to help achieve this connectivity: Site-to-Site VPN and Direct Connect.

Data transfer over AWS Site-to-Site VPN

One option to connect workloads to an on-premises network is to use one or more Site-to-Site VPN connections (Figure 8 – Pattern 1). These charges include an hourly charge for the connection and a charge for data transferred from AWS. Refer to Site-to-Site VPN pricing for more details. Another option to connect multiple VPCs to an on-premises network is to use a Site-to-Site VPN connection to a Transit Gateway (Figure 8 – Pattern 2). The Site-to-Site VPN will be considered another attachment on the Transit Gateway. Standard Transit Gateway pricing applies.

Figure 8. Site-to-Site VPN patterns

Data transfer over AWS Direct Connect

Direct Connect can be used to connect workloads in AWS to on-premises networks. Direct Connect incurs a fee for each hour the connection port is used and data transfer charges for data flowing out of AWS. Data transfer into AWS is $0.00 per GB in all locations. The data transfer charges depend on the source Region and the Direct Connect provider location. Direct Connect can also connect to the Transit Gateway if multiple VPCs need to be connected (Figure 9). Direct Connect is considered another attachment on the Transit Gateway and standard Transit Gateway pricing applies. Refer to the Direct Connect pricing page for more details.

Figure 9. Direct Connect patterns

A Direct Connect gateway can be used to share a Direct Connect across multiple Regions. When using a Direct Connect gateway, there will be outbound data charges based on the source Region and Direct Connect location (Figure 10).

Figure 10. Direct Connect gateway

General tips

Data transfer charges apply based on the source, destination, and amount of traffic. Here are some general tips for when you start planning your architecture:

  • Avoid routing traffic over the internet when connecting to AWS services from within AWS by using VPC endpoints:
    • VPC gateway endpoints allow communication to Amazon S3 and Amazon DynamoDB without incurring data transfer charges.
    • VPC interface endpoints are available for some AWS services. This type of endpoint incurs hourly service charges and data transfer charges.
  • Use Direct Connect instead of the internet for sending data to on-premises networks.
  • Traffic that crosses an Availability Zone boundary typically incurs a data transfer charge. Use resources from the local Availability Zone whenever possible.
  • Traffic that crosses a Regional boundary will typically incur a data transfer charge. Avoid cross-Region data transfer unless your business case requires it.
  • Use the AWS Free Tier. Under certain circumstances, you may be able to test your workload free of charge.
  • Use the AWS Pricing Calculator to help estimate the data transfer costs for your solution.
  • Use a dashboard to better visualize data transfer charges – this workshop will show how.

Conclusion

AWS provides the ability to deploy across multiple Availability Zones and Regions. With a few clicks, you can create a distributed workload. As you increase your footprint across AWS, it helps to understand various data transfer charges that may apply. This blog post provided information to help you make an informed decision and explore different architectural patterns to save on data transfer costs.

Better performance for less: AWS continues to beat Azure on SQL Server price/performance

Post Syndicated from Fred Wurden original https://aws.amazon.com/blogs/compute/sql-server-runs-better-on-aws/

By Fred Wurden, General Manager, AWS Enterprise Engineering (Windows, VMware, RedHat, SAP, Benchmarking)

AWS R5b.8xlarge delivers better performance at lower cost than Azure E64_32s_v4 for a SQL Server workload

In this blog, we will review a recent benchmark that Principled Technologies published on 2/25. The benchmark found that an Amazon Elastic Compute Cloud (Amazon EC2) R5b.8xlarge instance delivered better performance for a SQL Server workload at a lower cost when directly tested against an Azure E64_32s_v4 VM.

Behind the study: Understanding how SQL Server performed better, for a lower cost with an AWS EC2 R5b instance

Principled Technologies tested an online transaction processing (OLTP) workload for SQL Server 2019 on both an R5b instance on Amazon EC2 with Amazon Elastic Block Store (EBS) as storage and an Azure E64_32s_v4 VM. This particular Azure VM was chosen as an equivalent to the R5b instance, as both instances have comparable specifications for input/output operations per second (IOPS) performance, use Intel Xeon processors from the same generation (Cascade Lake), and offer the same number of cores (32). For storage, Principled Technologies mirrored storage configurations across the Azure VM and the EC2 instance (which used EBS), maxing out the IOPS specs on each while offering a direct comparison between instances.

Test Configurations

Source: Principled Technologies

When benchmarking, Principled Technologies ran a TPC-C-like OLTP workload from HammerDB v3.3 on both instances, testing against new orders per minute (NOPM) performance. NOPM shows the number of new-order transactions completed in one minute as part of a serialized business workload. HammerDB claims that because NOPM is “independent of any particular database implementation [it] is the recommended primary metric to use.”

The results: SQL Server on AWS EC2 R5b delivered 2x the performance of the Azure VM and was 62% less expensive

Graphs that show AWS instance outperformed the Azure instance

Source: Principled Technologies

These test results from the Principled Technologies report show the price/performance and performance comparisons. The performance metric is new orders per minute (NOPM); higher is better. The price/performance calculations are based on the cost of on-demand, License Included SQL Server instances and storage to achieve 1,000 NOPM; lower is better.
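As a rough illustration of how a price-per-1,000-NOPM figure is derived (the numbers below are hypothetical and are not taken from the report):

# Illustrative only: hypothetical inputs, not figures from the Principled Technologies report
hourly_cost_usd = 10.0   # on-demand, License Included instance plus storage, in $/hour
nopm = 200000            # new orders per minute measured by HammerDB

# Cost to achieve 1,000 NOPM: total hourly cost divided by thousands of NOPM (lower is better)
price_per_1000_nopm = hourly_cost_usd / (nopm / 1000.0)
print(round(price_per_1000_nopm, 4))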

An EC2 r5b.8xlarge instance powered by an Intel Xeon Scalable processor delivered better SQL Server NOPM performance on the HammerDB benchmark and a lower price per 1,000 NOPM than an Azure E64_32s_v4 VM powered by similar Intel Xeon Scalable processors.

On top of that, AWS’s storage price/performance exceeded Azure’s. The Azure managed disks offered 53 percent more storage than the EBS volumes, yet the EC2 instance with EBS storage cost 24 percent less than the Azure VM with managed disks. Even if the Azure storage were reduced by that difference (something customers cannot do), EBS would still have cost 13 percent less per GB than the Azure managed disks.

Why AWS is the best cloud to run your Windows and SQL Server workloads

To us, these results aren’t surprising. In fact, they’re in line with the success that customers have found running Windows on AWS for over 12 years. Customers like Pearson and Expedia have found better performance and greater cost savings by moving their Windows, SQL Server, and .NET workloads to AWS. In fact, RepricerExpress migrated its Windows and SQL Server environments from Azure to AWS to slash outbound bandwidth costs while gaining agility and performance.

Not only do we offer better price/performance for your Windows workloads, but we also offer better ways to run Windows in the cloud. Whether you want to rehost your databases to EC2, move to managed with Amazon Relational Database Service (Amazon RDS) for SQL Server, or even modernize to cloud-native databases, AWS stands ready to help you get the most out of the cloud.

 


To learn more on migrating Windows Server or SQL Server, visit Windows on AWS. For more stories about customers who have successfully migrated and modernized SQL Server workloads with AWS, visit our Customer Success page. Contact us to start your migration journey today.

Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS

Post Syndicated from Islam Ghanim original https://aws.amazon.com/blogs/architecture/field-notes-stopping-an-automatically-started-database-instance-with-amazon-rds/

Customers who need to keep an Amazon Relational Database Service (Amazon RDS) instance stopped for more than 7 days look for ways to efficiently re-stop the database after it is automatically started by Amazon RDS. If the database is started and there is no mechanism to stop it, customers start to pay for the instance’s hourly cost. Moreover, customers with database licensing agreements could incur penalties for running beyond their licensed cores/users.

Stopping and starting a DB instance is faster than creating a DB snapshot, and then restoring the snapshot. However, if you plan to keep the Amazon RDS instance stopped for an extended period of time, it is advised to terminate your Amazon RDS instance and recreate it from a snapshot when needed.

This blog provides a step-by-step approach to automatically stop an RDS instance once the auto-restart activity is complete. This saves any costs incurred once the instance is turned on. The proposed architecture is fully serverless and requires no management overhead. It relies on AWS Step Functions and a set of Lambda functions to monitor RDS instance state and stop the instance when required.

Architecture overview

Given the autonomous nature of the architecture and to avoid management overhead, the architecture leverages serverless components.

  • The architecture relies on RDS event notifications. Once a stopped RDS instance is started by AWS because it exceeded the maximum time in the stopped state, RDS generates an event (RDS-EVENT-0154).
  • The RDS event is pushed to a dedicated SNS topic rds-event-notifications-topic.
  • The Lambda function start-statemachine-execution-lambda is subscribed to the SNS topic rds-event-notifications-topic.
    • The function filters messages with event code: RDS-EVENT-0154. In order to restrict the ‘force shutdown’ activity further, the function validates that the RDS instance is tagged with auto-restart-protection and that the tag value is set to ‘yes’.
    • Once all conditions are met, the Lambda function starts the AWS Step Functions state machine execution.
  • The AWS Step Functions state machine integrates with two Lambda functions in order to retrieve the instance state, as well as attempt to stop the RDS instance.
    • In case the instance state is not ‘available’, the state machine waits for 5 minutes and then re-checks the state.
    • Finally, when the Amazon RDS instance state is ‘available’, the state machine attempts to stop the Amazon RDS instance.

Prerequisites

In order to implement the steps in this post, you need an AWS account as well as an IAM user with permissions to provision and delete resources of the following AWS services:

  • Amazon RDS
  • AWS Lambda
  • AWS Step Functions
  • AWS CloudFormation
  • Amazon SNS
  • AWS IAM

Architecture implementation

You can implement the architecture using the AWS Management Console or AWS CLI.  For faster deployment, the architecture is available on GitHub. For more information on the repo, visit GitHub.

The steps below explain how to build the end-to-end architecture from within the AWS Management Console:

Create an SNS topic

  • Open the Amazon SNS console.
  • On the Amazon SNS dashboard, under Common actions, choose Create Topic.
  • In the Create new topic dialog box, for Topic name, enter a name for the topic (rds-event-notifications-topic).
  • Choose Create topic.
  • Note the Topic ARN for the next task (for example, arn:aws:sns:us-east-1:111122223333:my-topic).
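If you prefer to script this step, a minimal boto3 sketch (not part of the original walkthrough) that creates the same topic looks like the following:

import boto3

# Create the SNS topic that will receive RDS event notifications
snsClient = boto3.client('sns')
response = snsClient.create_topic(Name='rds-event-notifications-topic')

# Keep the Topic ARN; it is needed when creating the RDS event subscription
print(response['TopicArn'])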

Configure RDS event notifications

Amazon RDS uses Amazon Simple Notification Service (Amazon SNS) to provide notification when an Amazon RDS event occurs. These notifications can be in any notification form supported by Amazon SNS for an AWS Region, such as an email, a text message, or a call to an HTTP endpoint.

For this architecture, RDS generates an event indicating that an instance has automatically restarted because it exceeded the maximum duration to remain stopped. This specific RDS event (RDS-EVENT-0154) belongs to the ‘notification’ category. For more information, visit Using Amazon RDS Event Notification.

To subscribe to an RDS event notification

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose Event subscriptions.
  • In the Event subscriptions pane, choose Create event subscription.
  • In the Create event subscription dialog box, do the following:
    • For Name, enter a name for the event notification subscription (RdsAutoRestartEventSubscription).
    • For Send notifications to, choose the SNS topic created in the previous step (rds-event-notifications-topic).
    • For Source type, choose ‘Instances’, since our source will be RDS instances.
    • For Instances to include, choose ‘All instances’. Instances are included or excluded based on the tag, auto-restart-protection. This is to keep the architecture generic and to avoid regular configurations moving forward.
    • For Event categories to include, choose ‘Select specific event categories’.
    • For Specific event, choose ‘notification’. This is the category under which the RDS event of interest falls. For more information, review Using Amazon RDS Event Notification.
    •  Choose Create.
    • The Amazon RDS console indicates that the subscription is being created.
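If you prefer to create the subscription programmatically, the following boto3 sketch does the equivalent; the subscription name and event category match the values above, and the topic ARN is a placeholder to replace with your own:

import boto3

rdsClient = boto3.client('rds')

# Subscribe the SNS topic to RDS 'notification' events for DB instances
rdsClient.create_event_subscription(
    SubscriptionName='RdsAutoRestartEventSubscription',
    SnsTopicArn='arn:aws:sns:us-east-1:111122223333:rds-event-notifications-topic',  # replace with your Topic ARN
    SourceType='db-instance',
    EventCategories=['notification'],
    Enabled=True
)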

Create Lambda functions

Following are the three Lambda functions required for the architecture to work:

  1. start-statemachine-execution-lambda – the function subscribes to the newly created SNS topic (rds-event-notifications-topic) and starts the AWS Step Functions state machine execution.
  2. retrieve-rds-instance-state-lambda – the function is triggered by the AWS Step Functions state machine to retrieve an RDS instance state (for example, available or stopped).
  3. stop-rds-instance-lambda – the function is triggered by the AWS Step Functions state machine in order to attempt to stop an RDS instance.

First, create the Lambda functions’ execution role.

To create an execution role

  • Open the roles page in the IAM console.
  • Choose Create role.
  • Create a role with the following properties.
    • Trusted entity – Lambda.
    • Permissions – AWSLambdaBasicExecutionRole.
    • Role name – rds-auto-restart-lambda-role.
    • The AWSLambdaBasicExecutionRole policy has the permissions that the function needs to write logs to CloudWatch Logs.

Now, create a new policy and attach it to the role in order to allow the Lambda function to start an AWS Step Functions state machine execution, stop an Amazon RDS instance, retrieve the RDS instance status, list tags, and add tags.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "rds:AddTagsToResource",
                "rds:ListTagsForResource",
                "rds:DescribeDBInstances",
                "states:StartExecution",
                "rds:StopDBInstance"
            ],
            "Resource": "*"
        }
    ]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name (rds-auto-restart-lambda-policy) and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.

To link the new policy to the AWS Lambda execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role created for the three Lambda functions. After selecting the identities, choose Attach policy.
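For reference, the same policy creation and attachment can be scripted with boto3. The following minimal sketch assumes the role rds-auto-restart-lambda-role already exists and reuses the policy document shown above:

import json
import boto3

iamClient = boto3.client('iam')

policyDocument = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": [
            "rds:AddTagsToResource",
            "rds:ListTagsForResource",
            "rds:DescribeDBInstances",
            "states:StartExecution",
            "rds:StopDBInstance"
        ],
        "Resource": "*"
    }]
}

# Create the customer managed policy
policy = iamClient.create_policy(
    PolicyName='rds-auto-restart-lambda-policy',
    PolicyDocument=json.dumps(policyDocument)
)

# Attach the policy to the Lambda execution role
iamClient.attach_role_policy(
    RoleName='rds-auto-restart-lambda-role',
    PolicyArn=policy['Policy']['Arn']
)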

Given the principle of least privilege, it is recommended to create three different roles, each restricting a function’s access to only the resources it needs.

Repeat the following steps three times to create the three Lambda functions. The functions differ only in (1) their code and (2) their triggers:

  • Open the Lambda console.
  • Choose Create function.
  • Configure the following settings:
    • Name
      • start-statemachine-execution-lambda
      • retrieve-rds-instance-state-lambda
      • stop-rds-instance-lambda
    • Runtime – Python 3.8.
    • Role – Choose an existing role.
    • Existing role – rds-auto-restart-lambda-role.
    • Choose Create function.
    • To configure a test event, choose Test.
    • For Event name, enter test.
  • Choose Create.
  • For the Lambda function —  start-statemachine-execution-lambda, use the following Python 3.8 sample code:
import json
import boto3
import logging
import os

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):

    #log input event
    LOGGER.info("RdsAutoRestart Event Received, now checking if event is eligible. Event Details ==> %s", event)

    #Input event from the SNS topic originated from RDS event notifications
    snsMessage = json.loads(event['Records'][0]['Sns']['Message'])
    rdsInstanceId = snsMessage['Source ID']
    stepFunctionInput = {"rdsInstanceId": rdsInstanceId}
    rdsEventId = snsMessage['Event ID']

    #Retrieve RDS instance ARN
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceArn = db_instance['DBInstanceArn']

    # Filter on the Auto Restart RDS Event. Event code: RDS-EVENT-0154. 

    if 'RDS-EVENT-0154' in rdsEventId:

        #log input event
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        #Verify that the instance is tagged with the auto-restart-protection tag. The tag is used to classify instances that are required to be stopped once started.

        tagCheckPass = 'false'
        rdsInstanceTags = rdsClient.list_tags_for_resource(ResourceName=rdsInstanceArn)
        for rdsInstanceTag in rdsInstanceTags["TagList"]:
            if 'auto-restart-protection' in rdsInstanceTag["Key"]:
                if 'yes' in rdsInstanceTag["Value"]:
                    tagCheckPass = 'true'
                    #log instance tags
                    LOGGER.info("RdsAutoRestart verified that the instance is tagged auto-restart-protection = yes, now starting the Step Functions Flow")
                else:
                    tagCheckPass = 'false'


        #log instance tags
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        if 'true' in tagCheckPass:

            #Initialise StepFunctions Client
            stepFunctionsClient = boto3.client('stepfunctions')

            # Start StepFunctions WorkFlow
            # StepFunctionsArn is stored in an environment variable
            stepFunctionsArn = os.environ['STEPFUNCTION_ARN']
            stepFunctionsResponse = stepFunctionsClient.start_execution(
                stateMachineArn=stepFunctionsArn,
                name=event['Records'][0]['Sns']['MessageId'],
                input=json.dumps(stepFunctionInput)
            )

    else:

        LOGGER.info("RdsAutoRestart Event detected, and event is not eligible")

    return {
            'statusCode': 200
        }

And then, configure an SNS source trigger for the function start-statemachine-execution-lambda. RDS event notifications will be published to this SNS topic:

  • In the Designer pane, choose Add trigger.
  • In the Trigger configurations pane, select SNS as a trigger.
  • For SNS topic, choose the SNS topic previously created (rds-event-notifications-topic)
  • For Enable trigger, keep it checked.
  • Choose Add.
  • Choose Save.

For the Lambda function — retrieve-rds-instance-state-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    

    #log input event
    LOGGER.info(event)
    
    #rdsInstanceId is passed as input to the lambda function from the AWS StepFunctions state machine.  
    rdsInstanceId = event['rdsInstanceId']
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceState = db_instance['DBInstanceStatus']
    return {
        'statusCode': 200,
        'rdsInstanceState': rdsInstanceState,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

For the Lambda function, stop-rds-instance-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    
    #log input event
    LOGGER.info(event)
    
    rdsInstanceId = event['rdsInstanceId']
    
    #Stop RDS instance
    rdsClient.stop_db_instance(DBInstanceIdentifier=rdsInstanceId)
    
    #Tagging
    
    
    return {
        'statusCode': 200,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

Create a Step Function

AWS Step Functions will execute the following service logic:

  1. Retrieve RDS instance state by calling Lambda function, retrieve-rds-instance-state-lambda. The Lambda function then returns the parameter, rdsInstanceState.
  2. If the rdsInstanceState parameter value is ‘available’, then the state machine will step into the next action calling the Lambda function, stop-rds-instance-lambda. If the rdsInstanceState is not ‘available’, the state machine will then wait for 5 minutes and then re-check the RDS instance state again.
  3. Stopping an RDS instance is an asynchronous operation and accordingly the state machine will keep polling the instance state once every 5 minutes until the rdsInstanceState parameter value becomes ‘stopped’. Only then, the state machine execution will complete successfully.

  • An RDS instance’s path to the ‘available’ state may vary depending on the maintenance activities scheduled for the instance.
  • Once the RDS notification event is generated, the instance goes through multiple states until it becomes ‘available’.
  • The 5-minute timer makes sure that the automation flow keeps attempting to stop the instance once it becomes available.
  • The second polling loop makes sure that the flow doesn’t end until the instance status changes to ‘stopped’, at which point the system administrator is notified.

To create an AWS Step Functions state machine

  • Sign in to the AWS Management Console and open the AWS Step Functions console.
  • In the navigation pane, choose State machines.
  • In the State machines pane, choose Create state machine.
  • On the Define state machine page, choose Author with code snippets. For Type, choose Standard.
  • Enter a Name for your state machine, stop-rds-instance-statemachine.
  • In the State machine definition pane, add the following state machine definition using the ARNs of the two Lambda functions created earlier, as shown in the following code sample:
{
  "Comment": "stop-rds-instance-statemachine: Automatically shutting down RDS instance after a forced Auto-Restart",
  "StartAt": "retrieveRdsInstanceState",
  "States": {
    "retrieveRdsInstanceState": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceAvailable"
    },
    "isInstanceAvailable": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "available",
          "Next": "stopRdsInstance"
        }
      ],
      "Default": "waitFiveMinutes"
    },
    "waitFiveMinutes": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRdsInstanceState"
    },
    "stopRdsInstance": {
      "Type": "Task",
      "Resource": "stop-rds-instance-lambda Arn",
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "retrieveRDSInstanceStateStopping": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceStopped"
    },
    "isInstanceStopped": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "stopped",
          "Next": "notifyDatabaseAdmin"
        }
      ],
      "Default": "waitFiveMinutesStopping"
    },
    "waitFiveMinutesStopping": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "notifyDatabaseAdmin": {
      "Type": "Pass",
      "Result": "World",
      "End": true
    }
  }
}

This is the definition of the state machine, written in Amazon States Language, which is used to describe the execution flow of an AWS Step Functions state machine.

Choose Next.

  • In the Name pane, enter a name for your state machine, stop-rds-instance-statemachine.
  • In the Permissions pane, choose Create new role. Take note of the new role’s name displayed at the bottom of the page (for example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd).
  • Choose Create state machine.
  • By default, the created role only grants the state machine access to CloudWatch Logs. Since the state machine will have to invoke Lambda functions, another IAM policy has to be associated with the new role.
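As an alternative to the console steps above, the state machine can also be created with boto3. The following minimal sketch assumes the definition above was saved locally as state_machine.json with the Lambda ARNs substituted, and that an execution role ARN is available (the ARN below is a placeholder):

import boto3

sfnClient = boto3.client('stepfunctions')

# Load the Amazon States Language definition (Lambda ARNs already substituted)
with open('state_machine.json') as definition_file:
    definition = definition_file.read()

response = sfnClient.create_state_machine(
    name='stop-rds-instance-statemachine',
    definition=definition,
    roleArn='arn:aws:iam::111122223333:role/StepFunctions-stop-rds-instance-statemachine-role',  # placeholder role ARN
    type='STANDARD'
)
print(response['stateMachineArn'])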

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "*"
        }
    ]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name (rds-auto-restart-stepfunctions-policy) and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy.
  • Choose Create policy to save your work.

To link the new policy to the AWS Step Functions execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role created for the state machine (example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd). After selecting the identities, choose Attach policy.

 

Testing the architecture

In order to test the architecture, create a test RDS instance, tag it with auto-restart-protection tag and set the tag value to yes. While the RDS instance is still in creation process, test the Lambda function —  start-statemachine-execution-lambda with a sample event that simulates that the instance was started as it exceeded the maximum time to remain stopped (RDS-EVENT-0154).
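If you prefer to tag the test instance programmatically rather than through the console, a short boto3 call such as the following sketch does it (the instance ARN is a placeholder):

import boto3

rdsClient = boto3.client('rds')

# Tag the test RDS instance so the automation is allowed to stop it
rdsClient.add_tags_to_resource(
    ResourceName='arn:aws:rds:us-east-1:111122223333:db:my-test-instance',  # replace with your instance ARN
    Tags=[{'Key': 'auto-restart-protection', 'Value': 'yes'}]
)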

To invoke a function

  • Sign in to the AWS Management Console and open the Lambda console.
  • In navigation pane, choose Functions.
  • In Functions pane, choose start-statemachine-execution-lambda.
  • In the upper right corner, choose Test.
  • In the Configure test event page, choose Create new test event and in Event template, leave the default Hello World option.
    {
    "Records": [
        {
        "EventSource": "aws:sns",
        "EventVersion": "1.0",
        "EventSubscriptionArn": "<RDS Event Subscription Arn>",
        "Sns": {
            "Type": "Notification",
            "MessageId": "10001-2d55da-9a73-5e42d46748c0",
            "TopicArn": "<SNS Topic Arn>",
            "Subject": "RDS Notification Message",
            "Message": "{\"Event Source\":\"db-instance\",\"Event Time\":\"2020-07-09 15:15:03.031\",\"Identifier Link\":\"https://console.aws.amazon.com/rds/home?region=<region>#dbinstance:id=<RDS instance id>\",\"Source ID\":\"<RDS instance id>\",\"Event ID\":\"http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154\",\"Event Message\":\"DB instance started\"}",
            "Timestamp": "2020-07-09T15:15:03.991Z",
            "SignatureVersion": "1",
            "Signature": "YsuM+L6N8rk+pBPBWoWeRcSuYqo/BN5v9D2lyoSg0B0uS46Q8NZZSoZWaIQi25TXfHY3RYXCXF9WbVGXiWa4dJs2Mjg46anM+2j6z9R7BDz0vt25qCrCyWhmWtc7yeETrlwa0jCtR/wxXFFexRwynqlZeDfvQpf/x+KNLrnJlT61WZ2FMTHYs124RwWU8NY3pm1Os0XOIvm8rfv3ywm1ccZfP4rF7Lfn+2EK6a0635Z/5aiyIlldNZxbgRYTODJYroO9INTlF7NPzVV1Y/K0E9aaL/wQgLZNquXQGCAxPFWy5lxJKeyUocOWcG48KJGIBUC36JJaqVdIilbZ9HvxTg==",
            "SigningCertUrl": "https://sns.<region>.amazonaws.com/SimpleNotificationService-a86cb10b4e1f29c941702d737128f7b6.pem",
            "UnsubscribeUrl": "https://sns.<region>.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=<arn>",
            "MessageAttributes": {}
        }
        }
    ]
    }
start-statemachine-execution-lambda uses the SNS MessageId parameter as the name for the AWS Step Functions execution. The name field must be unique for a certain period of time; accordingly, with every test run, the MessageId parameter value must be changed.
  • Choose Create and then choose Test. Each user can create up to 10 test events per function. Those test events are not available to other users.
  • AWS Lambda executes your function on your behalf. The handler in your Lambda function receives and then processes the sample event.
  • Upon successful execution, view results in the console.
  • The Execution result section shows the execution status as succeeded and also shows the function execution results, returned by the return statement. Following is a sample response of the test execution:

Now, verify the execution of the AWS Step Functions state machine:

  • Sign in to the AWS Management Console and open the AWS Step Functions console.
  • In the navigation pane, choose State machines.
  • In the State machine pane, choose stop-rds-instance-statemachine.
  • In the Executions pane, choose the execution with the Name value passed in the test event MessageId parameter.
  • In the Visual workflow pane, the real-time execution status is displayed:

  • Under the Step details tab, all details related to inputs, outputs and exceptions are displayed:

Monitoring

It is recommended to use Amazon CloudWatch to monitor all the components of this architecture. AWS Step Functions logs the state of each execution, along with the inputs and outputs of every step in the flow, so when things go wrong, you can diagnose and debug problems quickly.

Cost

When you build the architecture using serverless components, you pay for what you use with no upfront infrastructure costs. Cost will depend on the number of RDS instances tagged to be protected against an automatic start.

Architectural considerations

This architecture has to be deployed per AWS Account per Region.

Conclusion

This blog post demonstrated how to build a fully serverless architecture that monitors and stops RDS instances that AWS restarts so they don’t fall behind on required maintenance updates. The architecture helps you avoid the costs incurred by a started instance’s running hours, as well as any licensing implications. Feel free to submit enhancements to the GitHub repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Updating opt-in status for Amazon Pinpoint channels

Post Syndicated from Varinder Dhanota original https://aws.amazon.com/blogs/messaging-and-targeting/updating-opt-in-status-for-amazon-pinpoint-channels/

In many real-world scenarios, customers use home-grown or third-party systems to manage their campaign-related information. This includes user preferences, segmentation, targeting, interactions, and more. To create customer-centric engagement experiences with such existing systems, you need to migrate or integrate them into Amazon Pinpoint. Luckily, many AWS services and mechanisms can help to streamline this integration in a resilient and cost-effective way.

In this blog post, we demonstrate a sample solution that captures changes from an on-premises application’s database by utilizing AWS Integration and Transfer Services and updates Amazon Pinpoint in real-time.

If you are looking for a serverless, mobile-optimized preference center allowing end users to manage their Pinpoint communication preferences and attributes, you can also check the Amazon Pinpoint Preference Center.

Architecture


In this scenario, users’ SMS opt-in/opt-out preferences are managed by a home-grown customer application. Users interact with the application over its web interface. The application saves the customer preferences in a MySQL database.

This solution’s flow of events is triggered by a change (insert/update/delete) in the database. The change event is captured by AWS Database Migration Service (AWS DMS), which is configured with an ongoing replication task. This task continuously monitors the specified database and forwards the change event to an Amazon Kinesis Data Streams stream. Raw events buffered in this stream are polled by an AWS Lambda function. This function transforms the event and makes it ready to be passed to the Amazon Pinpoint API. This API call, in turn, changes the opt-in/opt-out subscription status of the channel for that user.

Ongoing replication tasks can be created against multiple types of database engines, including Oracle, MS-SQL, PostgreSQL, and more. In this blog post, we use a MySQL-based RDS instance to demonstrate this architecture. The instance will have a database we name pinpoint_demo and one table we name optin_status. In this sample, the table holds details about a user and their opt-in preference for SMS messages.

userid   phone          optin   lastupdate
user1    +12341111111   1       1593867404
user2    +12341111112   1       1593867404
user3    +12341111113   1       1593867404

Prerequisites

  1. AWS CLI is configured with an active AWS account and appropriate access.
  2. You have an understanding of Amazon Pinpoint concepts. You will be using Amazon Pinpoint to create a segment, populate endpoints, and validate phone numbers. For more details, see the Amazon Pinpoint product page and documentation.

Setup

First, you clone the repository that contains a stack of templates to your local environment. Make sure you have configured your AWS CLI with AWS credentials. Follow the steps below to deploy the CloudFormation stack:

  1. Clone the git repository containing the CloudFormation templates:
    git clone https://github.com/aws-samples/amazon-pinpoint-rds-integration.git
    cd amazon-pinpoint-rds-integration
  2. You need an S3 Bucket to hold the template:
    aws s3 create-bucket --bucket <YOUR-BUCKET-NAME>
  3. Run the following command to package the CloudFormation templates:
    aws cloudformation package --template-file template_stack.yaml --output-template-file template_out.yaml --s3-bucket <YOUR-BUCKET-NAME>
  4. Deploy the stack with the following command:
    aws cloudformation deploy --template-file template_out.yaml --stack-name pinpointblogstack --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM

The AWS CloudFormation stack will create and configure resources for you. Some of the resources it will create are:

  • Amazon RDS instance with MySQL
  • AWS Database Migration Service replication instance
  • AWS Database Migration Service source endpoint for MySQL
  • AWS Database Migration Service target endpoint for Amazon Kinesis Data Streams
  • Amazon Kinesis Data Streams stream
  • AWS Lambda Function
  • Amazon Pinpoint Application
  • A Cloud9 environment as a bastion host

The deployment can take up to 15 minutes. You can track its progress in the CloudFormation console’s Events tab.

Populate RDS data

The CloudFormation stack outputs the DNS address of the RDS endpoint and the Cloud9 environment upon completion. The Cloud9 environment acts as a bastion host and allows you to reach the RDS instance endpoint that CloudFormation deployed into the private subnet.

  1. Open the AWS Console and navigate to the Cloud9 service.
  2. Click on the Open IDE button to reach your IDE environment.
  3. At the console pane of your IDE, type the following to log in to your RDS instance. You can find the RDS endpoint address in the Outputs section of the CloudFormation stack, under the key RDSInstanceEndpoint.
    mysql -h <YOUR_RDS_ENDPOINT> -uadmin -pmypassword
    use blog_db;
  4. Issue the following command to create a table that holds the user’s opt-in status:
    create table optin_status (
      userid varchar(50) not null,
      phone varchar(50) not null,
      optin tinyint default 1,
      lastupdate TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
  5. Next, load sample data into the table. The following inserts nine users for this demo:
    
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user1', '+12341111111', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user2', '+12341111112', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user3', '+12341111113', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user4', '+12341111114', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user5', '+12341111115', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user6', '+12341111116', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user7', '+12341111117', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user8', '+12341111118', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user9', '+12341111119', 1);
  6. The table’s optin column holds the SMS opt-in status, and the phone column holds the phone number for a specific user.

Start the DMS Replication Task

Now that the environment is ready, you can start the DMS replication task and start watching the changes in this table.

  1. From the AWS DMS Console, go to the Database Migration Tasks section.
  2. Select the Migration task named blogreplicationtask.
  3. From the Actions menu, click on Restart/Resume to start the migration task. Wait until the task’s Status transitions from Ready to Starting, and then to Replication ongoing.
  4. At this point, all changes on the source database are replicated into a Kinesis stream. Before introducing the AWS Lambda function that polls this stream, configure the Amazon Pinpoint application.

Inspect the AWS Lambda Function

An AWS Lambda function has been created to receive the events. The Lambda function uses Python and Boto3 to read the records delivered by Kinesis Data Streams. It then performs update_endpoint API calls in order to add, update, or delete endpoints in the Amazon Pinpoint application.

The Lambda code and configuration are accessible through the Lambda console. In order to inspect the Python code, choose Functions on the left side, then select the function whose name starts with pinpointblogstack-MainStack.

Note the PINPOINT_APPID variable under the Environment variables section. This variable provides the Lambda function with the Amazon Pinpoint application ID to use for the API call.
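The exact deployed code ships with the CloudFormation stack. As a rough sketch only, a handler that reads DMS change records from the Kinesis stream and upserts Pinpoint endpoints could look like the following; the DMS record layout (data/metadata keys) and the choice of userid as the endpoint ID are assumptions made for illustration:

import base64
import json
import os
import boto3

pinpointClient = boto3.client('pinpoint')
APPLICATION_ID = os.environ['PINPOINT_APPID']

def lambda_handler(event, context):
    #Each Kinesis record carries one DMS change event (assumed data/metadata layout)
    for record in event['Records']:
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        row = payload.get('data', {})
        operation = payload.get('metadata', {}).get('operation')

        if not row.get('userid'):
            continue

        if operation == 'delete':
            #Deletes could be handled with delete_endpoint; skipped in this sketch
            continue

        #Map the table's optin flag to Pinpoint's OptOut attribute (1 = subscribed, assumption from the sample data)
        optOut = 'NONE' if str(row.get('optin')) == '1' else 'ALL'

        #update_endpoint is an upsert: it creates the endpoint if it does not exist
        pinpointClient.update_endpoint(
            ApplicationId=APPLICATION_ID,
            EndpointId=row['userid'],  #assumed endpoint ID scheme
            EndpointRequest={
                'ChannelType': 'SMS',
                'Address': row.get('phone'),
                'OptOut': optOut
            }
        )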


Inspect Amazon Pinpoint Application in Amazon Pinpoint Console

A Pinpoint application is needed by the Lambda function to update the endpoints. This application has been created with an SMS channel by the CloudFormation template. Once the data from the RDS database has been imported into Pinpoint as SMS endpoints, you can validate the import by creating a segment in Pinpoint.


Testing

With the Lambda function ready, you can now test the whole solution.

  1. To initiate the end-to-end test, go to the Cloud9 terminal and perform the following SQL statements on the optin_status table:
    UPDATE optin_status SET optin=0 WHERE userid='user1';
    UPDATE optin_status SET optin=0 WHERE userid='user2';
    UPDATE optin_status SET optin=0 WHERE userid='user3';
    UPDATE optin_status SET optin=0 WHERE userid='user4';
  2. These statements cause four changes in the database, which are collected by DMS and passed to the Kinesis Data Streams stream.
  3. This triggers the Lambda function, which constructs an update_endpoint API call to the Amazon Pinpoint application.
  4. The update_endpoint operation is an upsert operation. Therefore, if the endpoint does not exist on the Amazon Pinpoint application, it creates one. Otherwise, it updates the current endpoint.
  5. In the initial dataset, all the opt-in values are 1. Therefore, these endpoints will be created with an OptOut value of NONE in Amazon Pinpoint.
  6. All OptOut=NONE typed endpoints are considered as active endpoints. Therefore, they are available to be used within segments.

Create Amazon Pinpoint Segment

  1. In order to see these changes, go to the Pinpoint console. Click on PinpointBlogApp.
  2. Click on Segments on the left side. Then click Create a segment.
  3. For the segment name, enter US-Segment.
  4. Select Endpoint from the Filter dropdown.
  5. Under the Choose an endpoint attribute dropdown, select Country.
  6. For Choose values enter US.
    Note: As you do this, the right panel Segment estimate will refresh to show the number of endpoints eligible for this segment filter.
  7. Click Create segment at the bottom of the page.
  8. Once the new segment is created, you are directed to the newly created segment with configuration details. You should see five eligible endpoints corresponding to database table rows.
  9. Now, change one row by issuing the following SQL statement. This simulates a user opting out from SMS communication for one of their numbers.
    UPDATE optin_status SET optin=0 WHERE userid='user5';
  10. After the update, go to the Amazon Pinpoint console. Check the eligible endpoints again. You should only see four eligible endpoints.


Cleanup

If you no longer want to incur further charges, delete the CloudFormation stack named pinpointblogstack. Select it and click Delete.


Conclusion

This solution walks you through how opt-in change events are delivered from Amazon RDS to Amazon Pinpoint. You can use this solution for other use cases as well, such as importing segments from a third-party application like Salesforce, or importing other channel types like email, push, and voice. To learn more about Amazon Pinpoint, visit our website.

Building AWS Glue Spark ETL jobs by bringing your own JDBC drivers for Amazon RDS

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/building-aws-glue-spark-etl-jobs-by-bringing-your-own-jdbc-drivers-for-amazon-rds/

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue has native connectors to connect to supported data sources either on AWS or elsewhere using JDBC drivers. Additionally, AWS Glue now enables you to bring your own JDBC drivers (BYOD) to your Glue Spark ETL jobs. This feature enables you to connect to data sources with custom drivers that aren’t natively supported in AWS Glue, such as MySQL 8 and Oracle 18. You can also use multiple JDBC driver versions in the same AWS Glue job, enabling you to migrate data between source and target databases with different versions. For more information, see Connection Types and Options for ETL in AWS Glue.

This post shows how to build AWS Glue Spark ETL jobs and set up connections with custom drivers for Oracle 18 and MySQL 8 databases.

Solution overview

We discuss three different use cases in this post, using AWS Glue, Amazon RDS for MySQL, and Amazon RDS for Oracle.

In the following architecture, we connect to Oracle 18 using an external ojdbc7.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.


In the second scenario, we connect to MySQL 8 using an external mysql-connector-java-8.0.19.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to MySQL 8.


In the third scenario, we set up a connection where we connect to Oracle 18 and MySQL 8 using external drivers from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.


Prerequisites

Before getting started, you must complete the following prerequisites:

  1. Create an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console. Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template provided in this post.
  2. Create an IAM policy for AWS Glue.
  3. Before setting up the AWS Glue job, you need to download drivers for Oracle and MySQL, which we discuss in the next section.

Downloading drivers for Oracle and MySQL

To download the required drivers for Oracle and MySQL, complete the following steps:

  1. Download the MySQL JDBC connector.
  2. Select the operating system as platform independent and download the .tar.gz or .zip file (for example, mysql-connector-java-8.0.19.tar.gz or mysql-connector-java-8.0.19.zip) and extract it.
  3. Pick the MySQL connector .jar file (such as mysql-connector-java-8.0.19.jar) and upload it into your Amazon Simple Storage Service (Amazon S3) bucket.
  4. Make a note of that path because you use it later in the AWS Glue job to point to the JDBC driver.
  5. Similarly, download the Oracle JDBC connector (ojdbc7.jar).

This post is tested with the mysql-connector-java-8.0.19.jar and ojdbc7.jar drivers, but based on your database types, you can download and use the appropriate version of JDBC drivers supported by the database.

  1. Upload the Oracle JDBC 7 driver (ojdbc7.jar) to your S3 bucket.
  2. Make a note of that path, because you use it in the AWS Glue job to establish the JDBC connection with the database.
  3. Make sure to upload the three scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) in an S3 bucket.
  4. Save the following code as OracleBYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_oracle18_options_source_emp = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "employee",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    # Read DynamicFrame from Oracle 
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle 
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)

  1. Save the following code as MySQLBYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    connection_mysql8_options_source_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_mysql8_options_target_emp_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "emp_dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    
    # Read from JDBC databases with custom driver
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    
    #print "Applied mapping to the Glue DynamicFrame"
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="mysql", connection_options=connection_mysql8_options_target_emp_dept)
    

  1. Save the following code as CrossDB_BYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    # Read DynamicFrames from MySQL and Oracle
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)
    
    

Provisioning resources with AWS CloudFormation

The generic workflow of setting up a connection with your own custom JDBC drivers involves various steps. It’s a manual configuration that is error prone and adds overhead when repeating the steps between environments and accounts. With AWS CloudFormation, you can provision your application resources in a safe, repeatable manner, allowing you to build and rebuild your infrastructure and applications without having to perform manual actions or write custom scripts. The declarative code in the file captures the intended state of the resources to create, and allows you to automate the creation of AWS resources.

We provide this CloudFormation template for you to use. Review and customize it to suit your needs. Some of the resources deployed by this stack incur costs as long as they remain in use, like Amazon RDS for Oracle and Amazon RDS for MySQL.

This CloudFormation template creates the following resources:

  • A VPC
  • Two subnets
  • A route table
  • An internet gateway
  • A MySQL 8 database
  • An Oracle 18 database

To provision your resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Next.
  4. Choose Launch Stack:

This step automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console as required.

  1. For Stack name, enter a name.
  2. Change the other parameters as needed or keep the following default values:
    1. Oracle user name – oraadmin
    2. Oracle password – oraadmin123
    3. MySQL username – MySQLadmin
    4. MySQL password – MYSQLadmin123


  1. Choose Next.
  2. Choose Next.
  3. Review the details and choose Create.

This stack creation can take up to 20 minutes.

After the stack creation is complete, go to the Outputs tab on the AWS CloudFormation console and note the following values (you use these in later steps):

  • MySQLJDBCConnectionString
  • OracleJDBCConnectionString

Configuring an AWS Glue ETL job using your own drivers

Before creating the AWS Glue ETL job, run the SQL script (database_scripts.sql) on both databases (Oracle and MySQL) to create tables and insert data. For more information about connecting to the RDS DB instance, see How can I troubleshoot connectivity to an Amazon RDS DB instance that uses a public or private subnet of a VPC?

To set up AWS Glue connections, complete the following steps:

  1. On the AWS Glue console, under Databases, choose Connections.
  2. Choose Add Connection.
  3. For Connection Name, enter a name for your connection.
  4. For Connection Type, choose JDBC.
  5. For JDBC URL, enter a URL, such as jdbc:oracle:thin://@<hostname>:1521/ORCL for Oracle or jdbc:mysql://<hostname>:3306/mysql for MySQL.
  6. Enter the user name and password for the database.
  7. Select the VPC in which you created the RDS instance (Oracle and MySQL).
  8. Choose the subnet within your VPC. Refer to the CloudFormation stack Outputs tab for the subnet name.
  9. Choose the security group of the database. Refer to the CloudFormation stack Outputs tab for security group name.
  10. Choose Next.
  11. Check the connection details and choose Finish.

Make sure to add a connection for both databases (Oracle and MySQL).

Creating endpoints and a security group

Before testing the connection, make sure you create an AWS Glue endpoint and S3 endpoint in the VPC in which databases are created. Complete the following steps for both Oracle and MySQL instances:

  1. To create your AWS Glue endpoint, on the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose AWS Glue.
  4. Choose com.amazonaws.<region>.glue (for example, com.amazonaws.us-west-2.glue).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL instance.
  6. Choose the security group of the RDS instances.
  7. Choose Create endpoint.

To create your S3 endpoint, you use Amazon Virtual Private Cloud (Amazon VPC).

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose Amazon S3.
  4. Choose com.amazonaws.<region>.s3 (for example, com.amazonaws.us-west-2.s3).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL instance.
  6. Choose the route table ID.
  7. Choose Create endpoint.

The RDS for Oracle or RDS for MySQL security group must include itself as a source in its inbound rules.

  1. On the Security Groups page, choose Edit Inbound Rules.
  2. Choose Add rule.
  3. For Type, choose All Traffic.
  4. For Source, choose the same security group (for example, glue-byod-stack1…).
  5. Choose Save Rules.

If both databases are in the same VPC and subnet, you don’t need to create a connection for the MySQL and Oracle databases separately. The reason for setting up an AWS Glue connection to the databases is to establish a private connection between the RDS instances in the VPC and AWS Glue via the S3 endpoint, AWS Glue endpoint, and Amazon RDS security group. It’s not required to test the JDBC connection because that connection is established by the AWS Glue job when you run it. If you test the connection with MySQL 8, it fails because the AWS Glue connection doesn’t support the MySQL 8.0 driver at the time of writing this post, which is why you need to bring your own driver.

Setting up AWS Glue ETL jobs

You’re now ready to set up your ETL job in AWS Glue. Complete the following steps for both connections:

  1. Edit the following parameters in the scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) and upload them in Amazon S3:
    1. url
    2. user
    3. password
    4. customJdbcDriverS3Path for sources and target tables

You can find the database endpoints (url) on the CloudFormation stack Outputs tab; the other parameters are mentioned earlier in this post. If you use another driver, make sure to change customJdbcDriverClassName to the corresponding class in the driver.
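
For reference, the following is a minimal PySpark sketch of how such a script can read from a MySQL 8 source using a driver that you supply from Amazon S3. The endpoint, table name, credentials, and JAR path are placeholders, and the OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py scripts remain the reference implementations.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Connection options for a MySQL 8 source; the custom driver JAR is loaded from Amazon S3
mysql8_options = {
    "url": "jdbc:mysql://<hostname>:3306/mysql",   # database endpoint from the stack Outputs tab
    "dbtable": "<source_table>",                   # placeholder table name
    "user": "<user>",
    "password": "<password>",
    "customJdbcDriverS3Path": "s3://<bucket>/mysql-connector-java-8.0.xx.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
}

# Read the source table into a DynamicFrame using the supplied driver
source_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=mysql8_options,
)
print(source_dyf.count())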

Alternatively, you can pass these values as AWS Glue job parameters and retrieve them in the script using getResolvedOptions.
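
For example, a minimal sketch of that approach looks like the following; the parameter names (--url, --db_user, --db_password, --driver_s3_path) are illustrative choices, not names the scripts require.

import sys
from awsglue.utils import getResolvedOptions

# Job parameters are supplied on the job definition or at run time, for example:
# --url jdbc:mysql://<hostname>:3306/mysql --db_user admin --db_password <password> --driver_s3_path s3://<bucket>/<driver>.jar
args = getResolvedOptions(sys.argv, ["JOB_NAME", "url", "db_user", "db_password", "driver_s3_path"])

connection_options = {
    "url": args["url"],
    "user": args["db_user"],
    "password": args["db_password"],
    "dbtable": "<source_table>",                      # placeholder table name
    "customJdbcDriverS3Path": args["driver_s3_path"],
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
}

Passing credentials this way keeps them out of the script itself; storing them in AWS Secrets Manager and retrieving them at run time is another option.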

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Add Job.
  3. For Job Name, enter a name.
  4. For IAM role, choose the IAM role you created as a prerequisite.
  5. For Type, choose Spark.
  6. For Glue version, choose the latest version that runs Spark with Python 3.
  7. For This job runs, choose An existing script that you provide.
  8. Choose the Amazon S3 path where the script (OracleBYOD.py, MySQLBYOD.py, or CrossDB_BYOD.py) is stored.
  9. Under Advanced properties, enable Job bookmark.

Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.

  1. Keep the remaining settings as their defaults and choose Next.
  2. For Connections, choose the Amazon RDS for Oracle connection for OracleBYOD.py, Amazon RDS for MySQL connection for MySQLBYOD.py, or Amazon RDS for Oracle and Amazon RDS for MySQL connection for CrossDB_BYOD.py.
  3. Choose Save job and edit scripts.
  4. Choose Run Job.
  5. When the job is complete, validate the data loaded in the target table.

Cleaning up

After you finish, don't forget to delete the CloudFormation stack, because some of the AWS resources it deployed in this post incur a cost as long as they remain provisioned.

Deleting the CloudFormation stack deletes all the AWS resources that it created.

  1. On the AWS CloudFormation console, on the Stacks page, select the stack to delete. The stack must be currently running.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted.

Summary

In this post, we showed you how to build AWS Glue ETL Spark jobs and set up connections with custom drivers for Oracle 18 and MySQL 8 databases using AWS CloudFormation. You can follow the same approach to bring your own drivers for database versions that AWS Glue doesn't support natively.

If you have any questions or suggestions, please leave a comment.


About the Authors

Srikanth Sopirala is a Sr. Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family and road biking.

Naresh Gautam is a Sr. Analytics Specialist Solutions Architect at AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.

Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/sharing-amazon-redshift-data-securely-across-amazon-redshift-clusters-for-workload-isolation/

Amazon Redshift data sharing allows for a secure and easy way to share live data for read purposes across Amazon Redshift clusters. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query runs.

In this post, we discuss how to use Amazon Redshift data sharing to achieve workload isolation across diverse analytics use cases and achieve business-critical SLAs. For more information about this new feature, see Announcing Amazon Redshift data sharing (preview).

How to use Amazon Redshift data sharing

Amazon Redshift data sharing allows a producer cluster to share data objects to one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data.

Data sharing between Amazon Redshift clusters is a two-step process. First, the producer cluster administrator that wants to share data creates an Amazon Redshift data share, a new named object introduced with this release to serve as a unit of sharing. The producer cluster adds the needed database objects such as schemas, tables, and views to the data share and specifies a list of consumer clusters with which to share the data share. Following that, privileged users on consumer clusters create an Amazon Redshift local database reference from the data share made available to them and grant permissions on the database objects to appropriate users and groups. Users and groups can then list the shared objects as part of the standard metadata queries and start querying immediately.

Solution overview

For this post, we walk through a use case in which the producer cluster is a central ETL cluster hosting enterprise sales data: a 3 TB Cloud DW benchmark dataset based on TPC-DS. This cluster serves multiple BI and data science clusters purpose-built for distinct business groups within the organization. One such group is the sales BI team, who run BI reports using customer sales data created in the central ETL cluster, joined with the product reviews dataset that they loaded into the BI cluster they manage.

This approach helps the sales BI team isolate data lifecycle management between the enterprise sales dataset in the ETL producer and the product reviews data that they fully manage in the BI consumer cluster, which simplifies data stewardship. It also adds agility, lets you size each cluster independently to provide workload isolation, and creates a simple cost charge-back model.

As depicted in the following diagram, the central ETL cluster etl_cluster hosts the sales data in a schema named sales and performs transformations to create the semantic layer required for BI reports in a new schema named bi_semantic. We demonstrate how to build the semantic layer later in this post. A superuser in etl_cluster then creates a data share named salesdatashare, adds the bi_semantic schema and all objects in that schema to the data share, and grants usage permissions to the BI consumer cluster named bi_cluster. Keep in mind that a data share is simply a metadata container that represents what data is shared from producer to consumer; no data is actually moved.

The superuser in the BI consumer cluster creates a local database reference named sales_semantic from the data share (step 2 in the preceding diagram). The BI users use the product reviews dataset in the local schema named product_reviews and join with bi_semantic data for reporting purposes (step 3).

You can find the script for loading the product reviews dataset (which we use in this post) into bi_cluster, and you can load the Cloud DW benchmark dataset into etl_cluster using this github link. Loading these datasets into the respective Amazon Redshift clusters is outside the scope of this post, and is a prerequisite for following the instructions we outline.

The following diagram depicts the cloud DW benchmark data model used.

The following table summarizes the data.

Table Name             Rows
STORE_SALES            8,639,936,081
CUSTOMER_ADDRESS       15,000,000
CUSTOMER               30,000,000
CUSTOMER_DEMOGRAPHICS  1,920,800
ITEM                   360,000
DATE_DIM               73,049

Building a BI semantic layer

A BI semantic layer is a representation of enterprise data in a way that simplifies BI reporting requirements and offers better performance. In our use case, the BI semantic layer transforms sales data to create a customer denormalized dataset and another dataset for all store sales by product in a given year. The following queries are run on the etl_cluster to create the BI semantic layer.

  1. Create a new schema to host BI semantic tables with the following SQL:
    Create schema bi_semantic;

  2. Create a denormalized customer view with select columns required for sales BI team:
    create view bi_semantic.customer_denorm 
    as
    select
    	c_customer_sk,
    	c_customer_id,
    	c_birth_year,
    	c_birth_country,
    	c_last_review_date_sk,
    	ca_city,
    	ca_state,
    	ca_zip,
    	ca_country,
    	ca_gmt_offset,
    	cd_gender,
    	cd_marital_status,
    	cd_education_status
    from sales.customer c, sales.customer_address ca, sales.customer_demographics cd
    where
    c.c_current_addr_sk=ca.ca_address_sk
    and c.c_current_cdemo_sk=cd.cd_demo_sk;

  1. Create a second view for all product sales with columns required for BI team:
    create view bi_semantic.product_sales
    as 
    select 
    	i_item_id,
    	i_product_name,
    	i_current_price,
    	i_wholesale_cost,
    	i_brand_id,
    	i_brand,
    	i_category_id,
    	i_category,
    	i_manufact,
    	d_date,
    	d_moy,
    	d_year,
    	d_quarter_name,
    	ss_customer_sk,
    	ss_store_sk,
    	ss_sales_price,
    	ss_list_price,
    	ss_net_profit,
    	ss_quantity,
    	ss_coupon_amt
    from sales.store_sales ss, sales.item i, sales.date_dim d
    where ss.ss_item_sk=i.i_item_sk
    and ss.ss_sold_date_sk=d.d_date_sk;

Sharing data across Amazon Redshift clusters

Now, let’s share the bi_semantic schema in the etl_cluster with the bi_cluster.

  1. Create a data share in the etl_cluster using the following command when connected to the etl_cluster. The producer cluster superuser and database owners can create data share objects. By default, PUBLICACCESSIBLE is false. If the producer cluster is publicly accessible, you can add PUBLICACCESSIBLE = true to the following command:
    CREATE DATASHARE SalesDatashare;

  1. Add the BI semantic views to the data share. To add objects to the data share, add the schema before adding objects. Use ALTER DATASHARE to share the entire schema; to share tables, views, and functions in a given schema; and to share objects from multiple schemas:
    ALTER DATASHARE SalesDatashare ADD SCHEMA bi_semantic;
    ALTER DATASHARE SalesDatashare ADD ALL TABLES IN SCHEMA bi_semantic;

The next step requires a cluster namespace GUID from the bi_cluster. One way to find the namespace value of a cluster is to run the SQL statement select current_namespace when connected to the bi_cluster. Another way is on the Amazon Redshift console: choose your Amazon Redshift consumer cluster, and find the value under Namespace located in the General information section.
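
If you prefer to retrieve the namespace programmatically, the Amazon Redshift Data API can run the same statement; the following boto3 sketch is an example with placeholder cluster identifier, database, and user values.

import time
import boto3

client = boto3.client("redshift-data")

# Run "select current_namespace" on the consumer cluster (bi_cluster)
resp = client.execute_statement(
    ClusterIdentifier="bi-cluster",   # placeholder cluster identifier
    Database="dev",                   # placeholder database name
    DbUser="awsuser",                 # placeholder database user
    Sql="select current_namespace;",
)

# Wait for the statement to finish, then read the single-row result
while client.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = client.get_statement_result(Id=resp["Id"])
print(result["Records"][0][0]["stringValue"])   # the consumer cluster namespace GUID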

  1. Add consumers to the data share using the following command:
    GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE '1m137c4-1187-4bf3-8ce2-e710b7100eb2';

  1. View the list of the objects added to the share using the following command. The share type is outbound on the producer cluster.
    DESC DATASHARE salesdatashare;

The following screenshot shows our list of objects.

Consuming the data share from the consumer BI Amazon Redshift cluster

From the bi_cluster, let’s review, consume, and set permissions on the data share for end-user consumption.

  1. On the consumer BI cluster, view the data shares using the following command as any user:
    SHOW DATASHARES;

The following screenshot shows our results. Consumers should be able to see the objects within the incoming share but not the full list of consumers associated with the share. For more information about querying the metadata of shares, see DESC DATASHARE.

  1. Start the consumption by creating a local database from the salesdatashare. Cluster users with the permission to do so can create a database from the shares. We use the namespace from the etl_cluster.
    CREATE DATABASE Sales_semantic from DATASHARE SalesDatashare OF NAMESPACE '45b137c4-1287-4vf3-8cw2-e710b7138nd9'; 

Consumers should be able to see databases that they created from the share, along with the databases local to the cluster, at any point by querying SVV_REDSHIFT* tables. Data share objects aren’t available for queries until a local database reference is created using a create database statement.

  1. Run the following command to list the databases in bi_cluster:
    select * from svv_redshift_databases;

The following screenshot shows that both the local and shared databases are listed so that you can explore and navigate metadata for shared datasets.

  1. Grant usage on the database to bi_group, where bi_group is a local Amazon Redshift group with BI users added to that group:
    GRANT USAGE ON DATABASE sales_semantic TO bi_group;

Querying as the BI user

In this section, you connect as a user in the bi_group who got access to the shared data. The user is still connected to the local database on the bi_cluster but can query the shared data via the new cross-database query functionality in Amazon Redshift.

  1. Review the list of objects in the share by running the following SQL:
    SELECT schema_name, table_name, table_type FROM  svv_redshift_tables
         where database_name = 'sales_semantic'

The following screenshot shows our results.

  1. Review the list of columns in the customer_denorm view:
    SELECT * FROM  svv_redshift_columns 
       where database_name = 'sales_semantic' and table_name = 'customer_denorm';

The following screenshot shows our results.

  1. Query the shared objects using three-part notation just like querying any other local database object, using a notation <database>.<schema>.<view/table>:
    select count(*) from sales_semantic.bi_semantic.customer_denorm;

Following is your result:

28950139

  1. Analyze the local product reviews data by joining the shared customer_denorm data to identify the top ratings by customer states for this BI report:
    SELECT PR.product_category, c.ca_state AS customer_state,
                  count(PR.star_rating) AS cnt
          FROM product_reviews.amazon_reviews PR,               --local data
               sales_semantic.bi_semantic.customer_denorm  C    --shared data
          WHERE  PR.customer_id = C.c_customer_sk
             AND PR.marketplace = 'US'
          GROUP BY 1, 2
          order by cnt desc
          Limit 10;

The following screenshot shows our results.

Adding a data science consumer

Now, let’s assume the company has decided to spin up a data science team to help with new sales strategies, and this team performs analytics on the sales data. The data science team is new and has very different access patterns and SLA requirements compared to the BI team. Thanks to the data sharing feature, onboarding new use cases such as this is easy.

We add a data science consumer cluster named ds_cluster. Because the data science users need access to data in salesdatashare, the superuser in the etl_cluster can simply grant access to the ds_cluster by adding it as another consumer for the share, without moving any data:

GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE '1h137c4-1187-4w53-8de2-e710b7100es2';

The following diagram shows our updated architecture with the data science consumer (step 4).

This way, multiple clusters of different sizes can access the same dataset and isolate workloads to meet their SLA requirements. Users in these respective clusters are granted access to shared objects to meet their stringent security requirements. The producer keeps control of the data and at any point can remove certain objects from the share or remove access to the share for any of these clusters, and the consumers immediately lose access to the data. Also, as more data is ingested into the producer cluster, the consumer sees transactionally consistent data instantly.
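
For example, revoking a consumer's access or removing an object from the share comes down to statements like the ones below. This sketch issues them through the Amazon Redshift Data API with placeholder identifiers; you can just as easily run the same SQL directly on etl_cluster.

import boto3

client = boto3.client("redshift-data")

revoke_statements = [
    # Remove a single object from the share
    "ALTER DATASHARE salesdatashare REMOVE TABLE bi_semantic.product_sales;",
    # Remove a consumer cluster's access to the entire share
    "REVOKE USAGE ON DATASHARE salesdatashare FROM NAMESPACE '<consumer-namespace-guid>';",
]

for sql in revoke_statements:
    client.execute_statement(
        ClusterIdentifier="etl-cluster",   # placeholder producer cluster identifier
        Database="dev",                    # placeholder database name
        DbUser="awsuser",                  # placeholder superuser
        Sql=sql,
    )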

Monitoring and security

Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and usage across all the consumers and revoke access instantly when necessary. The permissions are granted by the superusers from both the producer and the consumer clusters to define who gets access to what objects, similar to the grant commands used in the earlier scenario. You can use the following commands to audit the usage and activities for the data share.

Track all changes to the data share and the shared database imported from the data share with the following code:

Select username, share_name, recordtime, action, 
         share_object_type, share_object_name 
  from svl_datashare_change_log
   order by recordtime desc;

The following screenshot shows our results.

Track data share access activity (usage), which is relevant only on the producer, with the following code:

Select * from svl_datashare_usage;

The following screenshot shows our results.

Summary

Amazon Redshift data sharing provides workload isolation by allowing multiple consumer clusters to query live data from a producer cluster seamlessly, without the need to unload and reload it. We also presented a step-by-step guide for securely sharing data from a producer to multiple consumer clusters.


About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.

Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-federated-querying-to-amazon-aurora-mysql-and-amazon-rds-for-mysql/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in April 2020, we announced general availability for federated querying to Amazon Aurora PostgreSQL and Amazon Relational Database Service (Amazon RDS) for PostgreSQL to enable you to query data across your operational databases, your data warehouse, and your data lake to gain faster and deeper insights not possible otherwise.

Today, we’re launching a new feature of Amazon Redshift federated query to Amazon Aurora MySQL and Amazon RDS for MySQL, extending federated querying to operational databases in the MySQL family. With this lake house architecture expansion to support more operational data stores, you can query and combine data more easily in real time and store data in open file formats in your Amazon Simple Storage Service (Amazon S3) data lake. Your data can then be more available to other analytics and machine learning (ML) tools, rather than siloed in disparate data stores.

In this post, we share information about how to get started with this new federated query feature to MySQL.

Prerequisites

To try this new feature, create a new Amazon Redshift cluster on the sql_preview maintenance track and an Aurora MySQL instance, and load sample TPC data into both data stores. To make sure your Aurora MySQL DB instances can accept connections from the Amazon Redshift cluster, both your Amazon Redshift cluster and Aurora MySQL instances should be in the same Amazon Virtual Private Cloud (Amazon VPC) and subnet group. This way, you can add the security group for the Amazon Redshift cluster to the inbound rules of the security group for the Aurora MySQL DB instance.

If your Amazon Redshift cluster and Aurora MySQL instances are in different VPCs, you can set up VPC peering or other networking to allow Amazon Redshift to make connections to your Aurora MySQL instances. For more information about VPC networking, see Working with a DB instance in a VPC.

Configuring AWS Secrets Manager for remote database credentials

Amazon Redshift needs database credentials to issue a federated query to a MySQL database. AWS Secrets Manager provides a centralized service to manage secrets and can be used to store your MySQL database credentials. Because Amazon Redshift retrieves and uses these credentials, they are transient, not stored in any generated code, and discarded after the query runs.

Storing credentials in Secrets Manager takes only a few minutes. To store a new secret, complete the following steps:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. For Select secret type, select Credentials for RDS database.
  4. For User name, enter a name.
  5. For Password, enter a password.
  6. For Select the encryption key, choose DefaultEncryptionKey.
  7. For Select which RDS database this secret will access, choose your database.


  1. Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post).
  2. Choose Next.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console. The secret ARN is needed in the subsequent step.
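
If you script this step instead of using the console, the following boto3 sketch stores the credentials and prints the secret ARN; the secret name and values are placeholders, and the username/password keys reflect what Amazon Redshift federated query expects to find in the secret.

import json
import boto3

client = boto3.client("secretsmanager")

# Store the MySQL credentials as a JSON secret
resp = client.create_secret(
    Name="aurora-mysql-federated-creds",   # placeholder secret name
    SecretString=json.dumps({"username": "admin", "password": "<password>"}),
)
print(resp["ARN"])   # use this ARN in the IAM policy and the CREATE EXTERNAL SCHEMA statement

# For an existing secret, look up the ARN instead
print(client.describe_secret(SecretId="aurora-mysql-federated-creds")["ARN"])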


Setting up IAM role

You can now pull everything together by embedding the secret ARN into an AWS Identity and Access Management (IAM) policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "<SecretARN>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your cluster.
  3. On the Actions drop-down menu, choose Manage IAM roles.


  1. Choose and add the IAM role you just created.

Setting up external schema

The final step is to create an external schema to connect to your Aurora MySQL instance. The following example code creates the external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – Aurora MySQL cluster endpoint
  • IAM_Role – IAM role created from the previous step
  • Secret_ARN – Secret ARN

After you set up the external schema, you’re ready to run some queries to test different use cases.

Querying live operational data

You can now query real-time operational data in your Aurora MySQL instance from Amazon Redshift. Note that isolation level is read committed for MySQL. See the following code:

dev=# select top 10 ws_order_number from mysqlfq.web_sales;
 ws_order_number 
-----------------
        93628990
       157020207
         4338647
        41395871
        58468186
       171095867
        12514566
        74946143
         3418243
        67054239
(10 rows)

Querying mysqlfq.web_sales in Amazon Redshift routes the request to MySQL tpc database and web_sales table. If you examine the query plan, you can see the query runs at the MySQL instance as shown by the step Remote MySQL Seq Scan:

dev=# explain select top 10 ws_order_number from mysqlfq.web_sales;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..0.20 rows=10 width=8)
   ->  XN MySQL Query Scan web_sales  (cost=0.00..6869.28 rows=343464 width=8)
         ->  Remote MySQL Seq Scan mysqlfq.web_sales  (cost=0.00..3434.64 rows=343464 width=8)
(3 rows)

Simplifying ELT and ETL

You can also extract operational data directly from your Aurora MySQL instance and load it into Amazon Redshift. See the following code:

dev=# create table staging_customer as select c_customer_id from mysqlfq.customer where c_customer_id not in (select c_customer_id from customer);
SELECT
dev=# select count(*) from staging_customer;
 count  
--------
 350000
(1 row)

The preceding code uses CTAS to create and load incremental data from your operational MySQL instance into a staging table in Amazon Redshift. You can then perform transformation and merge operations from the staging table to the target table. For more information, see Updating and inserting new data.

Combining operational data with data from your data warehouse and data lake

You can combine live operational data from your Aurora MySQL instance with data from your Amazon Redshift data warehouse and S3 data lake by creating a late binding view.

To access your S3 data lake historical data via Amazon Redshift Spectrum, create an external table:

create external schema mysqlspectrum
from data catalog
database 'spectrumdb'
iam_role '<IAMRole>'
create external database if not exists;
 
create external table mysqlspectrum.customer 
stored as parquet 
location 's3://<yourS3bucket>/customer/'
as select * from customer where c_customer_sk <= 100000;

You can then run queries on the view to gain insight on data across the three sources:

drop view vwCustomer;
create view vwCustomer as
select c_customer_sk, 'redshift' as source from public.customer where c_customer_sk > 100000
union all
select c_customer_sk, 'mysql' as source from mysqlfq.customer
union all
select c_customer_sk, 's3' as source from mysqlspectrum.customer
with no schema binding;

select * from vwCustomer where c_customer_sk in (1, 149712,29033279);

You should see the following three records as output:

dev=# select * from vwCustomer where c_customer_sk in (1, 149712,29033279);
 c_customer_sk |  source  
---------------+----------
      29033279 | mysql
             1 | s3
        149712 | redshift
(3 rows)

If you examine the query plan, you can see that the predicates are pushed down to your MySQL instance to run:

dev=# explain select * from vwCustomer where c_customer_sk in (1,149712,29033279);
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan vwcustomer  (cost=0.00..48398.40 rows=6988 width=36)
   ->  XN Append  (cost=0.00..48328.52 rows=6988 width=4)
         ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..40000.03 rows=3 width=4)
               ->  XN Seq Scan on customer  (cost=0.00..40000.00 rows=3 width=4)
                     Filter: (((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279)) AND (c_customer_sk > 100000))
         ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..6548.63 rows=5492 width=4)
               ->  XN MySQL Query Scan customer  (cost=0.00..6493.71 rows=5492 width=4)
                     ->  Remote MySQL Seq Scan mysqlfq.customer  (cost=0.00..6438.79 rows=5492 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
         ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..1779.86 rows=1493 width=4)
               ->  XN S3 Query Scan customer  (cost=0.00..1764.93 rows=1493 width=4)
                     ->  S3 Seq Scan mysqlspectrum.customer location:"s3://<yourS3bucket>/customer" format:PARQUET  (cost=0.00..1750.00 rows=1493 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
(13 rows)

Available Now

Amazon Redshift federated querying to Aurora MySQL and Amazon RDS for MySQL is now available for public preview with Amazon Redshift release version 1.0.21591 or later. Refer to the AWS Region Table for Amazon Redshift availability and to check the version of your clusters.


About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next-generation big data analytics platform using AWS technologies.

Zhouyi Yang is a Software Development Engineer for the Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Use a single AWS Managed Microsoft AD for Amazon RDS for SQL Server instances in multiple Regions

Post Syndicated from Jeremy Girven original https://aws.amazon.com/blogs/security/use-a-single-aws-managed-microsoft-ad-for-amazon-rds-for-sql-server-instances-in-multiple-regions/

Many Amazon Web Services (AWS) customers use Active Directory to centralize user authentication and authorization for a variety of applications and services. For these customers, Active Directory is a critical piece of their IT infrastructure.

AWS offers AWS Directory Service for Microsoft Active Directory, also known as AWS Managed Microsoft AD, to provide a highly available and resilient Active Directory service that is built on Microsoft Active Directory.

AWS also offers Amazon Relational Database Service (Amazon RDS) for SQL Server. Amazon RDS enables you to prioritize application development by managing time-consuming database administration tasks including provisioning, backups, software patching, monitoring, and hardware scaling. If you require Windows authentication with Amazon RDS for SQL Server, Amazon RDS for SQL Server instances need to be integrated with AWS Managed Microsoft AD.

With the release of AWS Managed Microsoft AD cross-Region support, you need only one AWS Managed Microsoft AD that spans multiple AWS Regions, which simplifies directory management and configuration. It also simplifies trusts between the AWS Managed Microsoft AD domain and your on-premises domain: only a single trust is required, compared to the previous pattern of one AWS Managed Microsoft AD per Region, each of which required its own trust if you wanted to allow on-premises objects access to your AWS Managed Microsoft AD domain. Cross-Region support provides an additional benefit when using your on-premises users and groups with Amazon RDS for SQL Server: you need only a single, one-way, outgoing trust between your multi-Region AWS Managed Microsoft AD and your on-premises domain.

As detailed in this post, to enable AWS Managed Microsoft AD cross-Region support, you create a new AWS Managed Microsoft AD and extend it to multiple Regions (as shown in Figure 1 below). Once you’ve extended your directory, you deploy an Amazon RDS SQL Server instance in each Region, integrating each instance with the same directory. Finally, you install SQL Server Management Studio (SSMS) on an instance joined to the AWS Managed Microsoft AD directory. You use that instance to connect to the RDS SQL Server instances using the same domain user account.

Figure 1: High level diagram of resources deployed in this post

The architecture in Figure 1 includes a network connection between the Regions. That connection isn’t required for AWS Managed Microsoft AD to function; if you don’t need network connectivity between your Regions, you can disregard the network link in the diagram. In this walkthrough, however, you use a single Amazon Elastic Compute Cloud (Amazon EC2) instance in one Region, so the connection between the Amazon VPCs in the two Regions is needed to allow that instance to connect to a domain controller in each Region.

Prerequisites for AWS Managed Microsoft AD cross-Region Support

  1. An AWS Managed Microsoft AD deployed in a Region of your choice. If you don’t have one already deployed, you can follow the instructions in Create Your AWS Managed Microsoft AD directory to create one. For this post, I recommend that you use us-east-1.
  2. The VPCs in the two Regions must be peered in order to complete the steps in this blog post. Creating and accepting a VPC peering connection has information on how to create a peering connection between Regions. Be aware of unsupported VPC peering configurations.
  3. A Windows Server instance joined to your managed Active Directory domain. Join an EC2 Instance to Your AWS Managed Microsoft AD Directory has instructions if you need assistance.
  4. Install the Active Directory administration tools onto your domain-joined instance. Installing the Active Directory Administration Tools has detailed instructions.

Extend your AWS Managed Microsoft AD to another Region

We’ve made the process to extend your directory to another Region straightforward. There is no charge to add another Region itself; you pay only for the resources of your directory running in the new Region. See the AWS Directory Service pricing page for additional information on pricing with new Regions. For example, in this post you extend your directory into the us-east-2 Region, which adds the cost of two new domain controllers. Figure 3 shows the additional cost to extend the directory.
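
If you prefer to automate this step rather than use the console walkthrough that follows, AWS Directory Service exposes an AddRegion API; the following boto3 sketch is a minimal example with placeholder directory, VPC, and subnet IDs.

import boto3

# Call AddRegion from the directory's primary Region (us-east-1 in this post)
ds = boto3.client("ds", region_name="us-east-1")

ds.add_region(
    DirectoryId="d-1234567890",        # placeholder directory ID
    RegionName="us-east-2",            # Region you're extending the directory to
    VPCSettings={
        "VpcId": "vpc-0123456789abcdef0",                      # VPC in us-east-2 for the new domain controllers
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],   # two subnets in that VPC
    },
)

# Track replication status for each Region the directory spans
regions = ds.describe_regions(DirectoryId="d-1234567890")
print([(r["RegionName"], r["Status"]) for r in regions["RegionsDescription"]])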

Let’s walk through the steps of setting up Windows Authentication with Amazon RDS for SQL Server instances in multiple Regions using a single cross-Region AWS Managed Microsoft AD.

To extend your directory to another Region:

  1. In the AWS Directory Service console navigation pane, choose Directories.

    Note: You should see a list of your AWS Managed Microsoft AD directories.

  2. Choose the Directory ID of the directory you want to expand to another Region.
  3. Go to the Directory details page. In the Multi-region replication section, select Add Region.

    Figure 2: Directory details and new multi-Region replication pane

  4. On the Add region page:
    1. For Region to add, select the Region you want to extend your directory to.
    2. For VPC, select the Amazon Virtual Private Cloud (Amazon VPC) for the new domain controllers to use.
    3. For Subnets, select two unique subnets in the Amazon VPC that you selected in the preceding step.
    4. Once you have everything to your liking, choose Add.
      Figure 3: Add a Region

      In the background, AWS is provisioning two new AWS managed domain controllers in the Region you selected. It could take up to 2 hours for your directory to become available in the Region.

Note: Your managed domain controllers in the home Region are fully functional during this process.

  • On the Directory details page, in Multi-Region replication, the status should be Active when the process has completed. Now you’re ready to deploy your Amazon RDS SQL Server instances.

Enable Amazon RDS for SQL Server

Integrating Amazon RDS into AWS Managed Microsoft AD is exactly the same process as it was before the cross-Region feature was released. This post goes through that original process with only one change, which is that you select the same directory ID for both Regions.

Create an Amazon RDS SQL Server instance in each Region using the same directory

The steps for creating an Amazon RDS SQL Server instance in each Region are the same. The following process will create the first instance. Once you’ve completed the process, you change the AWS Management Console Region to the Region you extended your directory to and repeat the process.
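
The console steps below map to a single CreateDBInstance call per Region. If you prefer to provision the instances programmatically, the following boto3 sketch shows the relevant parameters; all identifiers are placeholders, and the role named in DomainIAMRoleName is assumed to have the AmazonRDSDirectoryServiceAccess managed policy attached.

import boto3

# Run once per Region, changing region_name, the security group, and the instance identifier
rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="sqlserver-primary-region",        # placeholder instance name
    Engine="sqlserver-se",                                  # SQL Server Standard Edition
    LicenseModel="license-included",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="<master-password>",                 # placeholder
    Port=1433,
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],           # placeholder security group
    Domain="d-1234567890",                                  # same AWS Managed Microsoft AD directory ID in both Regions
    DomainIAMRoleName="rds-directoryservice-access-role",   # placeholder role name
)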

To create an Amazon RDS SQL Server instance:

  1. In the AWS Managed Microsoft AD directory primary Region, go to the Amazon RDS console navigation pane and choose Create database.
  2. Choose Microsoft SQL Server.
  3. You can leave the default values, except for the following settings:
    1. Under Settings select Master and Confirm password.
    2. Under Connectivity, expand Additional connectivity configuration:
      1. Choose Create new to create a new VPC security group.
      2. Enter a name in New VPC security group name.
      3. Select No preference for Availability Zone.
      4. Enter 1433 for Database port.
      Figure 4: Connectivity settings

  4. Select the Enable Microsoft SQL Server Windows authentication check box and then choose Browse Directory.

    Figure 5: Enable Microsoft SQL Server Windows authentication selected

  5. Select your directory and select Choose.

    Figure 6: Select a directory

  6. Choose Create database.
  7. Repeat these steps in your expanded Region. Note that the Directory ID will be the same for both Regions. You can complete the next section while your Amazon RDS SQL instances are provisioning.

Create an Active Directory user and group to delegate SQL administrative rights

The following steps walk you through the process of creating an Active Directory user and group for delegation. Following this process, you add the user to the group you just created and to the AWS Delegated Server Administrators group.

To create a user and group:

  1. Log in to the domain-joined instance with a domain user account that has permissions to create Active Directory users and groups.
  2. Choose Start, enter dsa.msc, and press Enter.
  3. In Active Directory Users and Computers, right-click on the Users OU, select New, and then Group. The New Object – Group window pops up.
    1. Fill in the Group name boxes with your choice of name.
    2. For Group Scope, select Domain local.
    3. For Group type, select Security.
    4. Choose OK.
  4. In Active Directory Users and Computers, right-click on your Users OU and select New and then User. The New Object – User window pops up.
    1. Fill in the boxes with your choice of information, and then choose Next.
    2. Enter your choice of password and clear User must change password at next logon, then choose Next.
    3. On the confirmation page, choose Finish.
  5. Double-click on the user you just created. The user account properties window appears.
    1. Select the Member of tab.
    2. Choose Add.
    3. Enter the name of the group that you previously created and choose Check Names. Next, enter AWS Delegated Server Administrators and choose Check Names again. If you do not receive any error, choose OK, and then OK again.
  6. The Member of tab for the user should include the two groups you just added. Choose OK to close the properties page.

Delegate SQL Server permissions in each Region using the Active Directory group you just created

The following steps guide you through the process of modifying the Amazon RDS SQL security group, installing SQL Server Management Studio (SSMS), and delegating permission in SQL to your Active Directory group.

Modify the Amazon RDS SQL security group

In these next steps, you modify the security group you created with your Amazon RDS instances, allowing your Windows Server instance to connect to the Amazon RDS SQL Server instances over port 1433.
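
If you manage your infrastructure with code, the same inbound rule can be added with a single API call; the following boto3 sketch uses placeholder IDs and the example private IP from the steps below.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allow the Windows Server instance (by its private IP /32) to reach SQL Server on port 1433
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # placeholder: the RDS SQL security group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 1433,
            "ToPort": 1433,
            "IpRanges": [{"CidrIp": "10.0.0.10/32", "Description": "Windows Server instance"}],
        }
    ],
)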

To modify the security group:

  1. From the Amazon Elastic Compute Cloud (Amazon EC2) console, select Security Groups under the Network & Security navigation section.
  2. Select the new Amazon RDS SQL security group that was created with your Amazon RDS SQL instance and select Edit inbound rules.
  3. Choose Add rule and enter the following:
    1. Type – Select Custom TCP.
    2. Protocol – Select TCP.
    3. Port range – Enter 1433.
    4. Source – Select Custom.
    5. Enter the private IP of your instance with a /32. An example would be 10.0.0.10/32.
  4. Choose Save rules.

    Figure 7: Create a security group rule

  5. Repeat these steps on the security group of your other Amazon RDS SQL instance in the other Region.

Install SQL Server Management Studio

All of the steps after the first are done on the Windows Server instance from Prerequisite 3.

To install SSMS:

  1. On your local computer, download SQL Server Management Studio (SSMS).

    Note: If desired, you can disable IE Enhanced Security Configuration and download directly to the Windows Server instance using IE or any other browser, and skip to step 3.

  2. RDP into your Windows Server instance and copy SSMS-Setup-ENU.exe to your RDP session.
  3. Run the file on your Windows Server instance.
  4. Choose Install.

    Figure 8: Install SSMS

  5. It might take a few minutes to install. When complete, choose Close.

Delegate permissions in SSMS

All of the following steps are performed on the Windows Server instance from Prerequisite 3. Log in to the Amazon RDS SQL instance using the SQL master user account. Next, create a SQL login for the Active Directory group you created previously and give it elevated permission to the Amazon RDS SQL instance.

To delegate permissions:

  1. Start SSMS.
  2. On the Connect to Server window, enter or select:
    1. Server name – Your Amazon RDS SQL Server endpoint.
    2. Authentication – Select SQL Server Authentication.
    3. Login – Enter the master user name you used when you launched your Amazon RDS SQL instance. The default is admin.
    4. Password – Enter the password for the master user name.
    5. Choose Connect.
    Figure 9: Connect to server

  3. In SSMS, choose New Query at the top of the window.

    Figure 10: New query

  4. In the query window, enter the following query. Replace <CORP\SQL-Admins> with the name of the group you created earlier.
    CREATE LOGIN [<CORP\SQL-Admins>] FROM WINDOWS WITH DEFAULT_DATABASE = [master],
       DEFAULT_LANGUAGE = [us_english];
    

    Figure 11: Query SQL database

  5. Choose Execute on the menu bar. You should see a Commands completed successfully message.

    Figure 12: Commands completed successfully

  6. Next, navigate to the Logins folder in the Object Explorer pane. Right-click on the group you added with the SQL command in step 5 and select Properties.

    Figure 13: Open group properties

  7. Select Server Roles and select the processadmin and setupadmin checkboxes. Then choose OK.

    Figure 14: Configure server roles

  8. You can log off from the instance. For the next steps, you log in to the instance using the user account you created previously.
  9. Repeat these steps on the Amazon RDS SQL instance in the other Region.

Connect to the Amazon RDS SQL Server with the same Active Directory user in both Regions

All of the steps are performed on the Windows Server instance from Prerequisite 3. You must log in to the instance using the account you created earlier. You then log in to the Amazon RDS SQL instance using Windows authentication with that account.
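
The SSMS steps below verify the connection interactively. As a small alternative example, a short pyodbc script run on the same domain-joined instance while logged in as that domain user can use Windows authentication too; the endpoint is a placeholder, and the script assumes the Microsoft ODBC Driver 17 for SQL Server is installed on the instance.

import pyodbc

# Trusted_Connection=yes uses the Windows credentials of the logged-in domain user
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<rds-sqlserver-endpoint>,1433;"   # placeholder RDS SQL Server endpoint
    "DATABASE=master;"
    "Trusted_Connection=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT SUSER_SNAME();")       # returns the login used for the session
print(cursor.fetchone()[0])
conn.close()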

  1. Log in to the instance with the user account you created earlier.
  2. Start SSMS.
  3. On the Connect to Server window, enter or select:
    1. Server name: Your Amazon RDS SQL Server endpoint.
    2. Authentication: Select Windows Authentication.
    3. Choose Connect.
    Figure 15: Connect to server

  4. You should be logged in to SSMS. If you aren’t logged in, make sure you added your user account to the group you created earlier and try again.
  5. Repeat these steps using the other Amazon RDS SQL instance endpoint for the server name. You should be able to connect to both Amazon RDS SQL instances using the same user account.

Summary

In this post, you extended your AWS Managed Microsoft AD into a new Region. You then deployed Amazon RDS for SQL Server in multiple Regions attached to the same AWS Managed Microsoft AD directory. You then tested authentication to both Amazon RDS SQL instances using the same Active Directory user.

To learn more about using AWS Managed Microsoft AD or AD Connector, visit the AWS Directory Service documentation. For general information and pricing, see the AWS Directory Service home page. If you have comments about this blog post, submit a comment in the Comments section below. If you have implementation or troubleshooting questions, start a new thread on the AWS Directory Service forum or contact AWS Support.

Author

Jeremy Girven

Jeremy is a Solutions Architect specializing in Microsoft workloads on AWS. He has over 15 years of experience with Microsoft Active Directory and over 23 years of industry experience. One of his fun projects is using SSM to automate the Active Directory build processes in AWS. To see more, check out the Active Directory AWS Quick Start.

Accessing and visualizing external tables in an Apache Hive metastore with Amazon Athena and Amazon QuickSight

Post Syndicated from James Sun original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-external-tables-in-an-apache-hive-metastore-with-amazon-athena-and-amazon-quicksight/

Many organizations have an Apache Hive metastore that stores the schemas for their data lake. Amazon Athena is a good fit for querying that data because of its serverless nature: it makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas hosted in an external Hive metastore. In response to customers’ requests, AWS announced Athena’s support for Hive metastore in early 2020. This extends Athena’s ability to query external data stores, with the combined benefit of better performance and lower cost.

In this post, we provide an AWS CloudFormation template that configures a remote Hive metastore based on Amazon Relational Database Service (Amazon RDS) for MySQL, with Amazon EMR located in a private subnet to perform ETL tasks. We then demonstrate how you can use a Spark step to pull COVID-19 datasets from a public repository and transform the data into a performant Parquet columnar storage format. We also walk through the steps to query the data with Athena and visualize it with Amazon QuickSight. QuickSight is a fully managed data visualization service; it lets you easily create and publish interactive dashboards by analyzing data from various data sources, including Athena.

Solution walkthrough

The following diagram shows the architecture for our solution.


As shown in the preceding architecture, we have an Availability Zone within a VPC in an AWS Region. The Availability Zone hosts subnets that are either public or private. A multi-master EMR cluster that has Hive software components installed is launched in a private subnet with egress internet traffic through a NAT gateway to download data from public sites for analysis. The multi-master feature also ensures that the primary nodes are highly available. The Hive metastore is backed by a remote RDS for MySQL instance located in the same private subnet.

We also have an Amazon Simple Storage Service (Amazon S3)-based data lake. A Spark step in Amazon EMR retrieves the data in CSV format, saves it in the provisioned S3 bucket, and transforms the data into Parquet format. The Spark step creates an external Hive table referencing the Parquet data and is ready for Athena to query.

With Athena, you can create a data source connector based on an AWS Lambda function to access the Hive metastore hosted on the EMR cluster by using the Apache Thrift interface.

The connector is registered in Athena as a catalog; when you reference the catalog in a SQL statement, Athena invokes the Lambda function. The function exits if the connector isn’t active for 15 minutes. For queries that run longer than 15 minutes, it’s recommended to let the query complete before retrieving the query results from an Amazon S3 location that you specify.

In Athena, you can compose a SQL statement against the Hive tables with predicates to further limit the size of the query result for faster visualization by QuickSight.

Deploying the resources with AWS CloudFormation

To demonstrate our solution, we provide a CloudFormation template that you can download to easily deploy the necessary AWS resources. The template creates the following resources to simulate the environment:

  • VPC and subnets – A VPC with one public and one private subnets. A NAT gateway is also created to allow outbound internet access from the EMR cluster to download public COVID-19 datasets from the Johns Hopkins GitHub repo.
  • EMR cluster – A multi-master EMR cluster with Hive, running on three primary nodes (m5.xlarge) and two core nodes (m5.xlarge), is launched to support the thrift connection required by the Athena Lambda connectors.
  • Amazon RDS for MySQL database – An RDS for MySQL primary instance is launched in the same subnet as the EMR cluster. The RDS instance serves as the Hive metastore backend data store.
  • S3 bucket – An S3 bucket stores files in Parquet format by Amazon EMR and is accessed later by Athena.
  • AWS IAM users – Two AWS Identity and Access Management (IAM) users belonging to different user groups. The first user, the data engineer, has permissions to access the Lambda-based Athena data source connector. The other user, the salesperson, does not have permissions to access the connector.

To get started, you need to have an AWS account. If you don’t have one, go to aws.amazon.com to sign up for one. Then complete the following steps:

  1. Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
  2. Choose Launch Stack to launch the CloudFormation template:

This template has been tested in the US East (N. Virginia) Region.

  1. Choose Next.

You’re prompted to enter a few launch parameters.

  1. For Stack name, enter a name for the stack (for example, athena-hms).
  2. For Hive Metastore Host Number, choose the Amazon EMR primary node to connect to (or use the default value).
  3. Continue to choose Next and leave the other parameters at their defaults.
  4. On the review page, select the three check boxes to acknowledge the capabilities that AWS CloudFormation requires to create resources.
  5. Choose Create stack.

The stack takes 15–20 minutes to complete.

  1. On the Outputs tab of the stack details, save the key-value pairs to use later.

When the EMR cluster is provisioned, it uses a bootstrap action to install the necessary Python libraries. It runs an Amazon EMR Spark step (a PySpark script) that downloads three COVID-19 datasets for confirmed, recovered, and death cases from the Johns Hopkins GitHub repo in CSV format and stores them in the csv subfolder of the S3 bucket created by the CloudFormation stack. Lastly, the final transformed data is converted to Parquet format and external Hive tables are created referencing the Parquet data located in the parquet subfolder of the S3 bucket.

The source code for the bootstrap action and the Spark step is available for your reference.
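
Purely as an illustration of the pattern the Spark step implements (the bucket, paths, and table handling below are placeholders, not the actual step code), such a PySpark step might look like the following.

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark register tables in the remote Hive metastore used by the EMR cluster
spark = SparkSession.builder.appName("covid-csv-to-parquet").enableHiveSupport().getOrCreate()

bucket = "s3://<provisioned-bucket>"   # placeholder bucket name from the CloudFormation outputs

# Read one of the downloaded CSV files and rewrite it in Parquet format
confirmed = spark.read.csv(bucket + "/covid_data/csv/confirmed/", header=True, inferSchema=True)

# Writing with an explicit path registers an unmanaged (external) Hive table over the Parquet files,
# so Athena can later query it through the Hive metastore connector
(confirmed.write
    .mode("overwrite")
    .format("parquet")
    .option("path", bucket + "/covid_data/parquet/confirmed/")
    .saveAsTable("default.covid_confirmed_cases"))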

To validate the data, on the Amazon S3 console, choose the bucket name from the CloudFormation template outputs. You should see a covid_data folder in the bucket. The folder contains the two subfolders, csv and parquet, which store the raw CSV and transformed Parquet data, respectively.

Querying the data in Athena

The CloudFormation template creates two users belonging to two different AWS IAM groups. The de_user_us-east-1_athena-hms user is a data engineer with permissions to access the Lambda function to communicate with the Hive metastore using the Athena data source connector. They belong to the group athena-hms-DataEngineerGroup-xxxxxxx. The sales_user_us-east-1_athena-hms user is a salesperson with no access to the connector. They belong to the group athena-hms-SalesGroup-xxxxx. 

To query the data, first retrieve your secret values in AWS Secrets Manager:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose DataEngineerUserCreds.

  1. Choose Retrieve secret value.

  1. Save the username and password.

  1. Repeat these steps for SalesUserCreds.

With the data in place, we can now use Athena to query it.

Accessing data as the data engineer

To query the data as the de_user_us-east-1_athena-hms user, complete the following steps:

  1. Sign in to the Athena console as the de_user_us-east-1_athena-hms user with the credentials retrieved from Secrets Manager.

After logging in, we need to create a data source for Hive metastore.

  1. On the navigation bar, choose Data sources.
  2. Choose Connect data source.

  1. For Choose where your data is located, select Query data in Amazon S3.
  2. For Choose a metadata catalog, select Apache Hive metastore.
  3. Choose Next.

  1. For Lambda function, choose the function the CloudFormation template created with the key HiveMetastoreFunctionName.
  2. For Catalog name, enter a name for the Athena catalog (for example, demo_hive_metastore).
  3. Choose Connect.

You should now have a catalog named demo_hive_metastore with the catalog type Hive metastore.

  1. On the navigation bar, choose Query editor.
  2. Enter the following SQL statement:
    SELECT *
    FROM demo_hive_metastore.default.covid_confirmed_cases
    WHERE country = 'US'
            OR country = 'Italy'
            OR country = 'India'
            OR country = 'Brazil'
            OR country = 'Spain'
            OR country = 'United Kingdom'

This SQL statement selects all the columns in the covid_confirmed_cases table with predicates to only include a few countries of interest. We use a table name with the pattern <catalog name>.<hive database name>.<hive table name in the database>, which for this post translates to demo_hive_metastore.default.covid_confirmed_cases.
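
You can also run this kind of query programmatically. The following boto3 sketch passes the Hive metastore catalog in QueryExecutionContext and adds an aggregation for brevity; the output location is a placeholder.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT country, count(*) AS record_count
FROM demo_hive_metastore.default.covid_confirmed_cases
WHERE country IN ('US', 'Italy', 'India', 'Brazil', 'Spain', 'United Kingdom')
GROUP BY country
"""

start = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Catalog": "demo_hive_metastore", "Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://<provisioned-bucket>/athena-results/"},   # placeholder
)

# Wait for the query to complete, then print the result rows
qid = start["QueryExecutionId"]
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])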

  1. Choose Run query.

The following screenshot shows your query results.

Make sure you completely sign out of the console before moving on to the next steps.

Accessing data as the salesperson

Sign in to the console as sales_user_us-east-1_athena-hms user. Because the salesperson user doesn’t have the appropriate IAM policies to access the Hive metastore connection, you can’t see the tables.

Permissions policies

In addition to the managed AmazonAthenaFullAccess policy, the data engineer’s IAM group has the <stack-name>-DataBucketReadWriteAccessPolicy-xxxxx and <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policies created by the CloudFormation template attached. Therefore, the data engineer can view the tables, but the salesperson can’t.

These policies are available on the IAM console, on the Permissions tab for the group <stack-name>-DataEngineerGroup-xxxxx.

The following sample JSON code is the <stack-name>-DataBucketReadWriteAccessPolicy-xxxxx policy to allow access to the provisioned S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::<provisioned bucket name>",
                "arn:aws:s3:::<provisioned bucket name>/",
                "arn:aws:s3:::<provisioned bucket name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

The following sample JSON code is the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy to allow access to the Lambda Hive metastore function:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "lambda:GetFunction",
                "lambda:GetLayerVersion",
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:us-east-1:<account id>:function:athena-hms-HiveMetastoreFunction"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
               "arn:aws:s3:::<provisioned bucket name>/hms_spill"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Next, we walk through using QuickSight to visualize the results. Make sure you completely sign out of the console as the salesperson user before proceeding to the next steps. 

Signing up for QuickSight

You can skip this section if you have already signed up for QuickSight previously.

  1. Sign in to the console as the IAM power user who deployed the CloudFormation template or any user with enough IAM privileges to set up QuickSight.
  2. On the QuickSight console, choose Sign up for QuickSight.
  3. Select either the Standard or Enterprise edition for QuickSight.
  4. Choose Continue.
  5. For QuickSight account name, enter your AWS account ID.
  6. For Notification email address, enter your email.
  7. Select Amazon S3.
  8. Select the provisioned S3 bucket to grant QuickSight permission to access.
  9. Choose Finish.

Your QuickSight account should now be set up.

Attaching the Hive metastore access policy

Before you can use QuickSight, you have to attach the Hive metastore access policy to the QuickSight service role.

  1. On the IAM console, search for the service role aws-quicksight-service-role-v0.
  2. Choose the role.
  3. Search for the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy.
  4. Select the policy and attach it to the QuickSight service role.
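
If you prefer to script this step rather than use the console, a boto3 sketch like the following attaches the policy; the policy ARN is a placeholder for the one your CloudFormation stack created.

import boto3

iam = boto3.client("iam")

# Placeholder: the ARN of the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy
policy_arn = "arn:aws:iam::<account id>:policy/<stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx"

# Attach the Hive metastore access policy to the QuickSight service role
iam.attach_role_policy(RoleName="aws-quicksight-service-role-v0", PolicyArn=policy_arn)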

Creating your data source and performing data conversions

Before we can create visualizations, we need to set up our data source.

  1. Download the SQL script covid-19.sql.
  2. On the QuickSight console, choose Datasets in the navigation pane.
  3. Choose New dataset.
  4. Choose Athena.
  5. In the pop-up window, for Data source name, enter demo_hive_metastore.
  6. Choose Validate.
  7. When the connection is validated, choose Create data source.
  8. In the next window, choose Use custom SQL.
  9. Enter the content of the covid-19.sql script in the query window.
  10. Choose Confirm query.
  11. Leave Import to SPICE for quicker analytics selected.
  12. Choose Visualize.

Now we perform a few data type conversions before visualizing the data.

  1. Choose the Edit icon next to Data set on the menu bar.
  2. Choose the … icon.
  3. Choose Edit.
  4. Expand date and choose Change data type.
  5. Choose Date.
  6. Enter yyyy-MM-dd to convert the date format.
  7. Choose Update.

Now we create a coordinate using the latitude and longitude values.

  1. Expand lat and choose Add to coordinates.
  2. Leave Create new geospatial coordinates selected.
  3. Choose Add.
  4. In the pop-up window, for Name your coordinates, enter coordinate.
  5. For Field to use for longitude, choose lon.
  6. Choose Create coordinates.
  7. Choose Save and visualize on the menu bar.

Creating visualizations in QuickSight

Now we can visualize our data. For this post, we create a map visualization.

  1. For Visual types, choose the map visual type.
  2. For Geospatial, drag coordinates.
  3. For Size, drag confirmed.
  4. For Color, drag country.

This world map shows the accumulated confirmed cases for selected countries over time; you need to use a filter to look at confirmed cases on a specific date.

  1. In the navigation pane, choose Filter.
  2. Choose the + icon.
  3. For Filter type, choose Time range.
  4. Choose start and end dates, such as 2020-09-10 00:00 and 2020-09-11 00:00, respectively.
  5. Choose Apply.

This plots the confirmed cases on September 10, 2020, for these countries.

Similarly, you can choose other visual types, such as a line chart, and generate the mortality rate for selected countries over time.

Using highly available primary nodes of the Amazon EMR cluster

The EMR cluster has a multi-master configuration with three primary nodes running to meet high availability requirements. At any time, the Lambda function communicates with one of these three EMR primary nodes. In the rare event that this node goes down, you can quickly re-establish the Athena data source connector to the external Hive metastore by failing over to another active primary node.

To perform this failover, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose athena-hms.
  3. Choose Update.
  4. Choose Use current template.
  5. Choose Next.
  6. For Hive Metastore Host Number, choose a host other than the current one you’re using.
  7. Choose Next.
  8. Acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Update stack.

In less than a minute, you should be able to access the Hive metastore and continue to query on the Athena console.
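
If you prefer to script the failover instead of clicking through the console, a boto3 sketch like the following performs the same stack update. The parameter key HiveMetastoreHostNumber is an assumption based on the console label above; check the stack’s Parameters tab for the exact key, and pass UsePreviousValue for any other parameters your stack defines.

import boto3

cfn = boto3.client("cloudformation")

# Assumption: the parameter key is HiveMetastoreHostNumber; verify it on the stack's Parameters tab.
cfn.update_stack(
    StackName="athena-hms",
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "HiveMetastoreHostNumber", "ParameterValue": "2"},
        # Pass {"ParameterKey": "...", "UsePreviousValue": True} for every other stack parameter
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)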

Cleaning up

You may want to clean up the demo environment when you’re done. To do so, on the AWS CloudFormation console, select the athena-hms stack and choose Delete.

This action also deletes the S3 bucket and any data in it. If you want to retain the data for future use, you should make a copy of the bucket before you delete it.

Summary

In this post, we walked through a solution using Athena to query external Hive tables with public COVID-19 datasets hosted in an S3 bucket and visualizing the data with QuickSight. We provided a CloudFormation template to automate the deployment of necessary AWS services for the demo. We encourage you to use these managed and scalable services for your specific use cases in production.


About the Authors

James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.

 

 

Chinmayi Narasimhadevara is a Solutions Architect with Amazon Web Services. Chinmayi has over 15 years of experience in information technology and has worked with organizations ranging from large enterprises to mid-sized startups. She helps AWS customers leverage the correct mix of AWS services to achieve success for their business goals.

 

 

Gagan Brahmi is a Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

New – Amazon RDS on Graviton2 Processors

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-amazon-rds-on-graviton2-processors/

I recently wrote a post to announce the availability of M6g, R6g and C6g families of instances on Amazon Elastic Compute Cloud (EC2). These instances offer better cost-performance ratio than their x86 counterparts. They are based on AWS-designed AWS Graviton2 processors, utilizing 64-bit Arm Neoverse N1 cores.

Starting today, you can also benefit from better cost-performance for your Amazon Relational Database Service (RDS) databases, compared to the previous M5 and R5 generation of database instance types, with the availability of AWS Graviton2 processors for RDS. You can choose between M6g and R6g instance families and three database engines (MySQL 8.0.17 and higher, MariaDB 10.4.13 and higher, and PostgreSQL 12.3 and higher).

M6g instances are ideal for general purpose workloads. R6g instances offer 50% more memory than their M6g counterparts and are ideal for memory intensive workloads, such as Big Data analytics.

Graviton2 instances provide up to 35% performance improvement and up to 52% price-performance improvement for RDS open source databases, based on internal testing of workloads with varying characteristics of compute and memory requirements.

The Graviton2 instance family includes several new performance optimizations such as larger L1 and L2 caches per core, higher Amazon Elastic Block Store (EBS) throughput than comparable x86 instances, fully encrypted RAM, and many others as detailed on this page. You can benefit from these optimizations with minimal effort, by provisioning or migrating your RDS instances today.

RDS instances are available in multiple configurations, starting with 2 vCPUs, with 8 GiB memory for M6g, and 16 GiB memory for R6g with up to 10 Gbps of network bandwidth, giving you new entry-level general purpose and memory optimized instances. The table below shows the list of instance sizes available for you:

Instance Size   vCPU   Memory (GiB)        Dedicated EBS       Network Bandwidth
                       M6g        R6g      Bandwidth (Mbps)    (Gbps)
large           2      8          16       Up to 4750          Up to 10
xlarge          4      16         32       Up to 4750          Up to 10
2xlarge         8      32         64       Up to 4750          Up to 10
4xlarge         16     64         128      4750                Up to 10
8xlarge         32     128        256      9000                12
12xlarge        48     192        384      13500               20
16xlarge        64     256        512      19000               25

Let’s Start Your First Graviton2 Based Instance
To start a new RDS instance, I use the AWS Management Console or the AWS Command Line Interface (CLI), just like usual, and select one of the db.m6g or db.r6g instance types (this page in the documentation has all the details).

RDS Launch Graviton2 instance

Using the CLI, it would be:

aws rds create-db-instance \
 --region us-west-2 \
 --db-instance-identifier $DB_INSTANCE_NAME \
 --db-instance-class db.m6g.large \
 --engine postgres \
 --engine-version 12.3 \
 --allocated-storage 20 \
 --master-username $MASTER_USER \
 --master-user-password $MASTER_PASSWORD

The CLI confirms with:

{
    "DBInstance": {
        "DBInstanceIdentifier": "newsblog",
        "DBInstanceClass": "db.m6g.large",
        "Engine": "postgres",
        "DBInstanceStatus": "creating",
...
}

Migrating to Graviton2 instances is easy. In the AWS Management Console, I select my database and click Modify.

Modify RDS database

Then I select the new DB instance class:

modify db instance

Or, using the CLI, I can use the modify-db-instance API call.

A short service interruption occurs when you switch instance types. By default, the modification happens during your next maintenance window, unless you enable the ApplyImmediately option.
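
For reference, here is a boto3 sketch of the same migration; the instance identifier matches the earlier example, and ApplyImmediately opts out of waiting for the maintenance window.

import boto3

rds = boto3.client("rds")

# Switch an existing instance to a Graviton2 class now instead of at the next maintenance window
rds.modify_db_instance(
    DBInstanceIdentifier="newsblog",
    DBInstanceClass="db.m6g.large",
    ApplyImmediately=True,
)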

You can provision new Graviton2 Amazon Relational Database Service (RDS) instances, or migrate existing ones, in all Regions where EC2 M6g and R6g are available: the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Frankfurt) AWS Regions.

As usual, let us know your feedback on the AWS Forum or through your usual AWS contact.

— seb

How to delete user data in an AWS data lake

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure,” or “right to be forgotten,” which may require you to implement a solution to delete specific users’ personal data.

In the context of the AWS big data and analytics ecosystem, every architecture, regardless of the problem it targets, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Despite its versatility and feature completeness, Amazon S3 doesn’t come with an out-of-the-box way to map a user identifier to S3 keys of objects that contain user’s data.

This post walks you through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.

Reference architecture

To address the challenge of implementing a data purge framework, we reduced the problem to the straightforward use case of deleting a user’s data from a platform that uses AWS for its data pipeline. The following diagram illustrates this use case.

We’re introducing the idea of building and maintaining an index metastore that keeps track of the location of each user’s records and allows us to locate them efficiently, reducing the search space.

You can use the following architecture diagram to delete a specific user’s data within your organization’s AWS data lake.

For this initial version, we created three user flows that map each task to a fitting AWS service:

Flow 1: Real-time metastore update

The S3 ObjectCreated or ObjectDelete events trigger an AWS Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. You can implement a simple workflow for any other storage layer, such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). We use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but our approach is flexible to any other technology.

Flow 2: Purge data

When a user asks for their data to be deleted, we trigger an AWS Step Functions state machine through Amazon CloudWatch to orchestrate the workflow. Its first step triggers a Lambda function that queries the metadata index to identify the storage layers that contain user records and generates a report that’s saved to an S3 report bucket. A Step Functions activity is created and picked up by a Node.js-based Lambda worker that sends an email to the approver through Amazon Simple Email Service (SES) with approve and reject links.

The following diagram shows a graphical representation of the Step Functions state machine as seen on the AWS Management Console.

The approver selects one of the two links, which then calls an Amazon API Gateway endpoint that invokes Step Functions to resume the workflow. If you choose the approve link, Step Functions triggers a Lambda function that takes the report stored in the bucket as input, deletes the objects or records from the storage layer, and updates the index metastore. When the purging job is complete, Amazon Simple Notification Service (SNS) sends a success or fail email to the user.

The following diagram represents the Step Functions flow on the console if the purge flow completed successfully.

For the complete code base, see step-function-definition.json in the GitHub repo.

Flow 3: Batch metastore update

This flow refers to the use case of an existing data lake for which an index metastore needs to be created. You can orchestrate the flow through AWS Step Functions, which takes historical data as input and updates the metastore through a batch job. Our current implementation doesn’t include a sample script for this user flow.

Our framework

We now walk you through the two use cases we followed for our implementation:

  • You have multiple user records stored in each Amazon S3 file
  • A user has records stored in heterogeneous AWS storage layers

Within these two approaches, we demonstrate alternatives that you can use to store your index metastore.

Indexing by S3 URI and row number

For this use case, we use a free tier RDS Postgres instance to store our index. We created a simple table with the following code:

CREATE UNLOGGED TABLE IF NOT EXISTS user_objects (
    userid TEXT,
    s3path TEXT,
    recordline INTEGER
);

You can index on user_id to optimize query performance. On object upload, for each row, you need to insert into the user_objects table a row that indicates the user ID, the URI of the target Amazon S3 object, and the row that corresponds to the record. For instance, when uploading the following JSON input, enter the following code:

{"user_id":"V34qejxNsCbcgD8C0HVk-Q","body":"…"}
{"user_id":"ofKDkJKXSKZXu5xJNGiiBQ","body":"…"}
{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body ":"…"}

We insert the tuples into user_objects in the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json. See the following code:

("V34qejxNsCbcgD8C0HVk-Q", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 0)
("ofKDkJKXSKZXu5xJNGiiBQ", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 1)
("UgMW8bLE0QMJDCkQ1Ax5Mg", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 2)

You can implement the index update operation by using a Lambda function triggered on any Amazon S3 ObjectCreated event.
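
A minimal sketch of such a function is shown below, assuming the JSON-lines input from the earlier example and a psycopg2 dependency packaged as a Lambda layer; the database endpoint and credentials are placeholders, and the full implementation in the GitHub repo also handles updates and deletes.

import json
import boto3
import psycopg2  # assumption: provided to the function via a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    # One index row per (user, object, line number) for each newly created object
    conn = psycopg2.connect(host="<rds endpoint>", dbname="<database>",
                            user="<user>", password="<password>")  # placeholders
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            for line_no, line in enumerate(body.splitlines()):
                user_id = json.loads(line)["user_id"]
                cur.execute(
                    "INSERT INTO user_objects (userid, s3path, recordline) VALUES (%s, %s, %s)",
                    (user_id, "s3://{}/{}".format(bucket, key), line_no),
                )
    conn.close()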

When we get a delete request from a user, we need to query our index to get some information about where we have stored the data to delete. See the following code:

SELECT s3path,
       ARRAY_AGG(recordline)
FROM user_objects
WHERE userid = 'V34qejxNsCbcgD8C0HVk-Q'
GROUP BY s3path;

The preceding example SQL query returns rows like the following:

("s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json", {2102,529})

The output indicates that lines 529 and 2102 of S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user’s data and need to be purged. We then need to download the object, remove those rows, and overwrite the object. For a Python implementation of the Lambda function that implements this functionality, see deleteUserRecords.py in the GitHub repo.

Having the record line available allows you to perform the deletion efficiently in byte format. For implementation simplicity, we purge the rows by replacing the deleted rows with an empty JSON object. You pay a slight storage overhead, but you don’t need to update subsequent row metadata in your index, which would be costly. To eliminate empty JSON objects, we can implement an offline vacuum and index update process.
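
The following is a stripped-down sketch of that purge step (the full version is deleteUserRecords.py in the GitHub repo); it takes the s3path and row numbers returned by the index query above.

import boto3

s3 = boto3.client("s3")

def purge_rows(s3path, rows_to_delete):
    # s3path looks like "s3://bucket/key"; rows_to_delete is the ARRAY_AGG result, for example [2102, 529]
    bucket, key = s3path[len("s3://"):].split("/", 1)
    lines = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8").splitlines()
    for row in rows_to_delete:
        lines[row] = "{}"  # replace the user's record with an empty JSON object
    s3.put_object(Bucket=bucket, Key=key, Body="\n".join(lines).encode("utf-8"))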

Indexing by file name and grouping by index key

For this use case, we created a DynamoDB table to store our index. We chose DynamoDB because of its ease of use and scalability; you can use its on-demand pricing model so you don’t need to guess how many capacity units you might need. When files are uploaded to the data lake, a Lambda function parses the file name (for example, 1001-.csv) to identify the user identifier and populates the DynamoDB metadata table. Userid is the partition key, and each different storage layer has its own attribute. For example, if user 1001 had data in Amazon S3 and Amazon RDS, their records look like the following code:

{"userid:": 1001, "s3":{"s3://path1", "s3://path2"}, "RDS":{"db1.table1.column1"}}

For a sample Python implementation of this functionality, see update-dynamo-metadata.py in the GitHub repo.

On delete request, we query the metastore table, which is DynamoDB, and generate a purge report that contains details on what storage layers contain user records, and storage layer specifics that can speed up locating the records. We store the purge report to Amazon S3. For a sample Lambda function that implements this logic, see generate-purge-report.py in the GitHub repo.
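
A condensed sketch of that lookup is shown below; the table name and report bucket are placeholders for the values in your deployment (see generate-purge-report.py for the full logic).

import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def generate_purge_report(userid):
    # Placeholders: metadata table name and report bucket
    table = dynamodb.Table("<index metastore table>")
    item = table.get_item(Key={"userid": userid}).get("Item", {})
    report_key = "purge-reports/{}.json".format(userid)
    s3.put_object(Bucket="<report bucket>", Key=report_key,
                  Body=json.dumps(item, default=str).encode("utf-8"))
    return report_key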

After the purging is approved, we use the report as input to delete the required resources. For a sample Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.

Implementation and technology alternatives

We explored and evaluated multiple implementation options, all of which present tradeoffs, such as implementation simplicity, efficiency, critical data compliance, and feature completeness:

  • Scan every record of the data file to create an index – Whenever a file is uploaded, we iterate through its records and generate tuples (userid, s3Uri, row_number) that are then inserted to our metadata storing layer. On delete request, we fetch the metadata records for requested user IDs, download the corresponding S3 objects, perform the delete in place, and re-upload the updated objects, overwriting the existing object. This is the most flexible approach because it supports a single object to store multiple users’ data, which is a very common practice. The flexibility comes at a cost because it requires downloading and re-uploading the object, which introduces a network bottleneck in delete operations. User activity datasets such as customer product reviews are a good fit for this approach, because it’s unexpected to have multiple records for the same user within each partition (such as a date partition), and it’s preferable to combine multiple users’ activity in a single file. It’s similar to what was described in the section “Indexing by S3 URI and row number” and sample code is available in the GitHub repo.
  • Store metadata as file name prefix – Adding the user ID as the prefix of the uploaded object under the different partitions that are defined based on query pattern enables you to reduce the required search operations on delete request. The metadata handling utility finds the user ID from the file name and maintains the index accordingly. This approach is efficient in locating the resources to purge but assumes a single user per object, and requires you to store user IDs within the filename, which might require InfoSec considerations. Clickstream data, where you would expect to have multiple click events for a single customer on a single date partition during a session, is a good fit. We covered this approach in the section “Indexing by file name and grouping by index key” and you can download the codebase from the GitHub repo.
  • Use a metadata file – Along with uploading a new object, we also upload a metadata file that’s picked up by an indexing utility to create and maintain the index up to date. On delete request, we query the index, which points us to the records to purge. A good fit for this approach is a use case that already involves uploading a metadata file whenever a new object is uploaded, such as uploading multimedia data, along with their metadata. Otherwise, uploading a metadata file on every object upload might introduce too much of an overhead.
  • Use the tagging feature of AWS services – Whenever a new file is uploaded to Amazon S3, we use the Put Object Tagging Amazon S3 operation to add a key-value pair for the user identifier. Whenever there is a user data delete request, it fetches objects with that tag and deletes them. This option is straightforward to implement using the existing Amazon S3 API and can therefore be a very initial version of your implementation. However, it involves significant limitations. It assumes a 1:1 cardinality between Amazon S3 objects and users (each object only contains data for a single user), searching objects based on a tag is limited and inefficient, and storing user identifiers as tags might not be compliant with your organization’s InfoSec policy. A minimal sketch of this option follows this list.
  • Use Apache Hudi – Apache Hudi is becoming a very popular option to perform record-level data deletion on Amazon S3. Its current version is restricted to Amazon EMR, and you can use it if you start to build your data lake from scratch, because you need to store your data as Hudi datasets. Hudi is a very active project and additional features and integrations with more AWS services are expected.
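
To make the tagging option above concrete, a minimal boto3 sketch follows; the bucket, key, and user ID are placeholders, and note that finding tagged objects later still means listing the bucket and checking tags object by object, which is exactly the inefficiency called out in the list.

import boto3

s3 = boto3.client("s3")
bucket = "<data lake bucket>"  # placeholder

# On upload: tag the object with the user identifier it belongs to (assumes one user per object)
s3.put_object_tagging(
    Bucket=bucket,
    Key="year=2018/month=2/day=26/input.json",
    Tagging={"TagSet": [{"Key": "userid", "Value": "V34qejxNsCbcgD8C0HVk-Q"}]},
)

# On delete request: there is no server-side search by tag, so list and filter
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        tags = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])["TagSet"]
        if {"Key": "userid", "Value": "V34qejxNsCbcgD8C0HVk-Q"} in tags:
            s3.delete_object(Bucket=bucket, Key=obj["Key"])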

The key implementation decision of our approach is separating the storage layer we use for our data and the one we use for our metadata. As a result, our design is versatile and can be plugged in any existing data pipeline. Similar to deciding what storage layer to use for your data, there are many factors to consider when deciding how to store your index:

  • Concurrency of requests – If you don’t expect too many simultaneous inserts, even something as simple as Amazon S3 could be a starting point for your index. However, if you get multiple concurrent writes for multiple users, you need to look into a service that copes better with transactions.
  • Existing team knowledge and infrastructure – In this post, we demonstrated using DynamoDB and RDS Postgres for storing and querying the metadata index. If your team has no experience with either of those but is comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or any other storage layer, use those. Furthermore, if you’re already running (and paying for) a MySQL database that’s not used to capacity, you could use that for your index for no additional cost.
  • Size of index – The volume of your metadata is orders of magnitude lower than your actual data. However, if your dataset grows significantly, you might need to consider going for a scalable, distributed storage solution rather than, for instance, a relational database management system.

Conclusion

GDPR has transformed best practices and introduced several extra technical challenges in designing and implementing a data lake. The reference architecture and scripts in this post may help you delete data in a manner that’s compliant with GDPR.

Let us know your feedback in the comments and how you implemented this solution in your organization, so that others can learn from it.

 


About the Authors

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

 

 

 

 

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Understanding database options for your serverless web applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/understanding-database-options-for-your-serverless-web-applications/

Many web developers use relational databases to store and manage data in their web applications. As you migrate to a serverless approach for web development, there are also other options available. These can help improve the scale, performance, and cost-effectiveness of your workloads. In this blog post, I highlight use-cases for different serverless database services, and common patterns useful for web applications.

Using Amazon RDS with serverless web applications

You can access Amazon RDS directly from your AWS Lambda functions. The RDS database, such as Amazon Aurora, is configured within the customer VPC. The Lambda function must be configured with access to the same VPC:

Lambda connecting to RDS

There are special considerations for this design in busy serverless applications. It’s common for popular web applications to experience “spiky” usage, where traffic volumes shift rapidly and unpredictably. Serverless services such as AWS Lambda and Amazon API Gateway are designed to automatically scale to meet these traffic increases.

However, relational databases are connection-based, so they are intended to work with a few long-lived clients, such as web servers. By contrast, Lambda functions are ephemeral and short-lived, so their database connections are numerous and brief. If Lambda scales up to hundreds or thousands of instances, you may overwhelm downstream relational databases with connection requests. This is typically only an issue for moderately busy applications. If you are using a Lambda function for low-volume tasks, such as running daily SQL reports, you do not experience this behavior.

The Amazon RDS Proxy service is built to solve the high-volume use-case. It pools the connections between the Lambda service and the downstream RDS database. This means that a scaling Lambda function is able to reuse connections via the proxy. As a result, the relational database is not overwhelmed with connections requests from individual Lambda functions. This does not require code changes in many cases. You only need to replace the database endpoint with the proxy endpoint in your Lambda function.

Lambda to RDS Proxy to RDS diagram

As a result, if you need to use a relational database in a high-volume web application, you can use RDS Proxy with minimal changes required.
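
As an illustration, a minimal Python sketch of that change is shown below, assuming a pymysql dependency and environment variables for the endpoints; the only difference from a direct connection is the host name.

import os
import pymysql  # assumption: packaged with the function or provided by a layer

def handler(event, context):
    connection = pymysql.connect(
        # Previously: host=os.environ["DB_ENDPOINT"] (the RDS instance endpoint)
        host=os.environ["PROXY_ENDPOINT"],  # now: the RDS Proxy endpoint
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
        connect_timeout=5,
    )
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        return cursor.fetchone()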

Using Amazon DynamoDB as a high-performance operational database

Amazon DynamoDB is a high-performance key-value and document database that operates with single-digit millisecond response times at any scale. This is a NoSQL database that is a natural fit for many serverless workloads, especially web applications. It can operate equally well for low and high usage workloads. Unlike relational databases, the performance of a well-architected DynamoDB table is not adversely affected by heavy usage or large amounts of data storage.

For web applications, DynamoDB tables are ideal for storing common user configuration and application data. When integrated with Amazon Cognito, you can restrict row-level access to the current user context. This makes it a frequent choice for multi-tenant web applications that host data for many users.

DynamoDB tables can be useful for lookups of key-based information, in addition to geo-spatial queries in many cases. DynamoDB is not connection-based, so this integration works even if a Lambda function scales up to hundreds or thousands of concurrent executions. You can query directly from Lambda with minimal code:

const AWS = require('aws-sdk')
AWS.config.region = process.env.AWS_REGION
const documentClient = new AWS.DynamoDB.DocumentClient()

// Construct params
const params = {
  TableName: 'myDDBtable',
  Item: {
    partitionKey: 'user-123',
    sortKey: Date.now(),
    name: 'Alice',
    cartItems: 3
  }
}

// Store in DynamoDB
const result = await documentClient.put(params).promise()

Using advanced patterns in DynamoDB, it’s possible to build equivalent features frequently found in relational schemas. For example, one-to-many tables, many-to-many tables, and ACID transactions can all be modeled in a single DynamoDB table.

Combining DynamoDB with RDS

While DynamoDB remains highly performant for high volumes of traffic, you need to understand data access patterns for your application before designing the schema. There are times where you need to perform ad hoc queries, or where downstream application users must use SQL-based tools to interact with databases.

In this case, combining both DynamoDB and RDS in your architecture can provide a resilient and flexible solution. For example, for a high-volume transactional web application, you can use DynamoDB to ingest data from your frontend application. For ad hoc SQL-based analytics, you could also use Amazon Aurora.

By using DynamoDB streams, you can process updates to a DynamoDB table using a Lambda function. In a simple case, this function can update tables in RDS, keeping the two databases synchronized. For example, when sales transactions are saved in DynamoDB, a Lambda function can post the sales information to transaction tables in Aurora.

DynamoDB to RDS architecture

Both the Lambda function and RDS database operate within the customer’s VPC, while DynamoDB is outside the VPC. DynamoDB Streams can invoke Lambda functions configured to access the VPC. In this model, RDS users can then run ad hoc SQL queries without impacting operational data managed by DynamoDB.
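
Here is a minimal sketch of that synchronization function, assuming a pymysql dependency and the item shape from the earlier DynamoDB example; the table and column names are hypothetical.

import os
import pymysql  # assumption: packaged with the function; endpoint and credentials come from environment variables

def handler(event, context):
    conn = pymysql.connect(host=os.environ["RDS_PROXY_ENDPOINT"],
                           user=os.environ["DB_USER"],
                           password=os.environ["DB_PASSWORD"],
                           database=os.environ["DB_NAME"])
    try:
        with conn.cursor() as cur:
            # Replay only newly inserted items from the DynamoDB stream into a hypothetical transactions table
            for record in event["Records"]:
                if record["eventName"] != "INSERT":
                    continue
                item = record["dynamodb"]["NewImage"]
                cur.execute(
                    "INSERT INTO transactions (user_id, sale_time, cart_items) VALUES (%s, %s, %s)",
                    (item["partitionKey"]["S"], item["sortKey"]["N"], item["cartItems"]["N"]),
                )
        conn.commit()
    finally:
        conn.close()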

High-volume ETL processes between DynamoDB and RDS

For high-volume workloads capturing large numbers of transactions in DynamoDB, Lambda may still scale rapidly and exhaust the RDS connection pool. To process these flows, you may introduce Amazon Kinesis Data Firehose to help with data replication between DynamoDB and RDS.

ETL processing with DynamoDB and RDS

  1. New and updated items in DynamoDB are sent to a DynamoDB stream. The stream invokes a stream processing Lambda function, sending batches of records to Kinesis Data Firehose.
  2. Kinesis buffers incoming messages and performs data transformations using a Lambda function. It then writes the output to Amazon S3, buffering by size (1–128 MB) or interval (60–900 seconds).
  3. The Kinesis Data Firehose transformation uses a custom Lambda function for processing records as needed.
  4. Amazon S3 is a durable store for these batches of transformed records. As objects are written, S3 invokes a Lambda function.
  5. The Lambda function loads the objects from S3, then connects to RDS and imports the data.

This approach supports high transaction volumes, enabling table item transformation before loading into RDS. The RDS concurrent connection pool is optimized by upstream batching and buffering, which reduces the number of concurrent Lambda functions and RDS connections.
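
To illustrate step 3, a Kinesis Data Firehose transformation function follows the documented record format: decode each base64 record, transform it, and return it with a result status. The field mapping here is a placeholder.

import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Placeholder transformation: reshape the item before it is written to S3 and loaded into RDS
        transformed = {"user_id": payload.get("partitionKey"), "cart_items": payload.get("cartItems")}
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # other valid values are "Dropped" and "ProcessingFailed"
            "data": base64.b64encode((json.dumps(transformed) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}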

Conclusion

Web developers commonly use relational databases in building their applications. When migrating to serverless architectures, a web developer can continue to use databases like RDS, or take advantage of other options available. RDS Proxy enables developers to pool database connections and use connection-based databases with ephemeral functions.

DynamoDB provides high-performance, low-latency NoSQL support, which is ideal for many busy web applications with spiky traffic volumes. However, it’s also possible to use both services to take advantage of the throughput of DynamoDB, together with the flexibility of ad hoc SQL queries in RDS.

For extremely high traffic volumes, you can introduce Kinesis Data Firehose to batch and transform data between DynamoDB and RDS. In this case, you separate the operational database from the analytics database. This solution uses multiple serverless services to handle scaling automatically.

To learn more about AWS serverless database solutions for web developers, visit https://aws.amazon.com/products/databases/.

New – Create Amazon RDS DB Instances on AWS Outposts

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-create-amazon-rds-db-instances-on-aws-outposts/

Late last year I told you about AWS Outposts and invited you to Order Yours Today. As I told you at the time, this is a comprehensive, single-vendor compute and storage offering that is designed to meet the needs of customers who need local processing and very low latency in their data centers and on factory floors. Outposts uses the hardware that we use in AWS public regions.

I first told you about Amazon RDS back in 2009. This fully managed service makes it easy for you to launch, operate, and scale a relational database. Over the years we have added support for multiple open source and commercial databases, along with tons of features, all driven by customer requests.

DB Instances on AWS Outposts
Today I am happy to announce that you can now create RDS DB Instances on AWS Outposts. We are launching with support for MySQL and PostgreSQL, with plans to add other database engines in the future (as always, let us know what you need so that we can prioritize it).

You can make use of important RDS features including scheduled backups to Amazon Simple Storage Service (S3), built-in encryption at rest and in transit, and more.

Creating a DB Instance
I can create a DB Instance using the RDS Console, API (CreateDBInstance), CLI (create-db-instance), or CloudFormation (AWS::RDS::DBInstance).

I’ll use the Console, taking care to select the AWS Region that serves as “home base” for my Outpost. I open the Console and click Create database to get started:

I select On-premises for the Database location, and RDS on Outposts for the On-premises database option:

Next, I choose the Virtual Private Cloud (VPC). The VPC must already exist, and it must have a subnet for my Outpost. I also choose the Security Group and the Subnet:

Moving forward, I select the database engine, and version. We’re launching with support for MySQL 8.0.17 and PostgreSQL 12.2-R1, with plans to add more engines and versions based on your feedback:

I give my DB Instance a name (jb-database-2), and enter the credentials for the master user:

Then I choose the size of the instance. I can select between Standard classes (db.m5):

and Memory Optimized classes (db.r5):

Next, I configure the desired amount of SSD storage:

One thing to keep in mind is that each Outpost has a large, but finite amount of compute power and storage. If there’s not enough of either one free when I attempt to create the database, the request will fail.

Within the Additional configuration section I can set up several database options, customize my backups, and set up the maintenance window. Once everything is ready to go, I click Create database:

As usual when I use RDS, the state of my instance starts out as Creating and transitions to Available when my DB Instance is ready:

After the DB instance is ready, I simply configure my code (running in my VPC or in my Outpost) to use the new endpoint:

Things to Know
Here are a couple of things to keep in mind about this new way to use Amazon RDS:

Operations & Functions – Much of what you already know about RDS works as expected and is applicable. You can rename, reboot, stop, start, tag DB instances, and you can make use of point-in-time recovery; you can scale the instance up and down, and automatic minor version upgrades work as expected. You cannot make use of read replicas or create highly available clusters.

Backup & Recover – Automated backups work as expected, and are stored in S3. You can use them to create a fresh DB Instance in the cloud or in any of your Outposts. Manual snapshots also work, and are stored on the Outpost. They can be used to create a fresh DB Instance on the same Outpost.

Encryption – The storage associated with your DB instance is encrypted, as are your DB snapshots, both with KMS keys.

Pricing – RDS on Outposts pricing is based on a management fee that is charged on an hourly basis for each database that is managed. For more information, check out the RDS on Outposts pricing page.

Available Now
You can start creating RDS DB Instances on your Outposts today.

Jeff;

 

Amazon RDS Proxy – Now Generally Available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-rds-proxy-now-generally-available/

At AWS re:Invent 2019, we launched the preview of Amazon RDS Proxy, a fully managed, highly available database proxy for Amazon Relational Database Service (RDS) that makes applications more scalable, more resilient to database failures, and more secure. Following the preview for the MySQL engine, we extended it with PostgreSQL compatibility. Today, I am pleased to announce that RDS Proxy is now generally available for both engines.

Many applications, including those built on modern serverless architectures using AWS Lambda, Fargate, Amazon ECS, or EKS, can have a large number of open connections to the database server, and may open and close database connections at a high rate, exhausting database memory and compute resources.

Amazon RDS Proxy allows applications to pool and share connections established with the database, improving database efficiency, application scalability, and security. With RDS Proxy, failover times for Amazon Aurora and RDS databases are reduced by up to 66%, and database credentials, authentication, and access can be managed through integration with AWS Secrets Manager and AWS Identity and Access Management (IAM).

Amazon RDS Proxy can be enabled for most applications with no code change, and you don’t need to provision or manage any additional infrastructure and only pay per vCPU of the database instance for which the proxy is enabled.

Amazon RDS Proxy – Getting started
You can get started with Amazon RDS Proxy in just a few clicks by going to the AWS management console and creating an RDS Proxy endpoint for your RDS databases. In the navigation pane, choose Proxies and Create proxy. You can also see the proxy panel below.

To create your proxy, specify the Proxy identifier, a unique name of your choosing, and choose the database engine – either MySQL or PostgreSQL. Choose the encryption setting if you want the proxy to enforce TLS/SSL for all connections between the application and the proxy, and specify a time period that a client connection can be idle before the proxy closes it.

A client connection is considered idle when the application doesn’t submit a new request within the specified time after the previous request completed. The underlying connection between the proxy and database stays open and is returned to the connection pool. Thus, it’s available to be reused for new client connections.

Next, choose one RDS DB instance or Aurora DB cluster in Database to access through this proxy. The list only includes DB instances and clusters with compatible database engines, engine versions, and other settings.

Specify Connection pool maximum connections, a value between 1 and 100. This setting represents the percentage of the max_connections value that RDS Proxy can use for its connections. If you only intend to use one proxy with this DB instance or cluster, you can set it to 100. For details about how RDS Proxy uses this setting, see Connection Limits and Timeouts.

Choose at least one Secrets Manager secret associated with the RDS DB instance or Aurora DB cluster that you intend to access with this proxy, and select an IAM role that has permission to access the Secrets Manager secrets you chose. If you don’t have an existing secret, please click Create a new secret before setting up the RDS proxy.

After setting VPC Subnets and a security group, click Create proxy. If you need more detail about the settings, refer to the documentation.

You can see the new RDS proxy after waiting a few minutes and then point your application to the RDS Proxy endpoint. That’s it!

You can also create an RDS proxy easily via AWS CLI command.

aws rds create-db-proxy \
    --db-proxy-name channy-proxy \
    --role-arn iam_role \
    --engine-family { MYSQL|POSTGRESQL } \
    --vpc-subnet-ids space_separated_list \
    [--vpc-security-group-ids space_separated_list] \
    [--auth ProxyAuthenticationConfig_JSON_string] \
    [--require-tls | --no-require-tls] \
    [--idle-client-timeout value] \
    [--debug-logging | --no-debug-logging] \
    [--tags comma_separated_list]

How RDS Proxy works
Let’s see an example that demonstrates how open connections continue working during a failover when you reboot a database or it becomes unavailable due to a problem. This example uses a proxy named channy-proxy and an Aurora DB cluster with DB instances instance-8898 and instance-9814. When the failover-db-cluster command is run from the Linux command line, the writer instance that the proxy is connected to changes to a different DB instance. You can see that the DB instance associated with the proxy changes while the connection remains open.

$ mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p
Enter password:
...
mysql> select @@aurora_server_id;
+--------------------+
| @@aurora_server_id |
+--------------------+
| instance-9814 |
+--------------------+
1 row in set (0.01 sec)

mysql>
[1]+ Stopped mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p
$ # Initially, instance-9814 is the writer.
$ aws rds failover-db-cluster --db-cluster-id cluster-56-2019-11-14-1399
JSON output
$ # After a short time, the console shows that the failover operation is complete.
$ # Now instance-8898 is the writer.
$ fg
mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p

mysql> select @@aurora_server_id;
+--------------------+
| @@aurora_server_id |
+--------------------+
| instance-8898 |
+--------------------+
1 row in set (0.01 sec)

mysql>
[1]+ Stopped mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p
$ aws rds failover-db-cluster --db-cluster-id cluster-56-2019-11-14-1399
JSON output
$ # After a short time, the console shows that the failover operation is complete.
$ # Now instance-9814 is the writer again.
$ fg
mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p

mysql> select @@aurora_server_id;
+--------------------+
| @@aurora_server_id |
+--------------------+
| instance-9814 |
+--------------------+
1 row in set (0.01 sec)
+---------------+---------------+
| Variable_name | Value |
+---------------+---------------+
| hostname | ip-10-1-3-178 |
+---------------+---------------+
1 row in set (0.02 sec)

With RDS Proxy, you can build applications that can transparently tolerate database failures without needing to write complex failure handling code. RDS Proxy automatically routes traffic to a new database instance while preserving application connections.

You can review the demo for an overview of RDS Proxy and the steps you need take to access RDS Proxy from a Lambda function.

If you want to know how your serverless applications maintain excellent performance even at peak loads, please read this blog post. For a deeper dive into using RDS Proxy for MySQL with serverless, visit this post.

The following are a few things that you should be aware of:

  • Currently, RDS Proxy is available for the MySQL and PostgreSQL engine family. This engine family includes RDS for MySQL 5.6 and 5.7, PostgreSQL 10.11 and 11.5.
  • In an Aurora cluster, all of the connections in the connection pool are handled by the Aurora primary instance. To perform load balancing for read-intensive workloads, you still use the reader endpoint directly for the Aurora cluster.
  • Your RDS Proxy must be in the same VPC as the database. Although the database can be publicly accessible, the proxy can’t be.
  • Proxies don’t support compressed mode. For example, they don’t support the compression used by the --compress or -C options of the mysql command.

Now Available!
Amazon RDS Proxy is generally available in US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (London), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney) and Asia Pacific (Tokyo) regions for Aurora MySQL, RDS for MySQL, Aurora PostgreSQL, and RDS for PostgreSQL, and it includes support for Aurora Serverless and Aurora Multi-Master.

Take a look at the product page, pricing, and the documentation to learn more. Please send us feedback either in the AWS forum for Amazon RDS or through your usual AWS support contacts.

Channy;

Introducing the serverless LAMP stack – part 2 relational databases

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/introducing-the-serverless-lamp-stack-part-2-relational-databases/

In this post, you learn how to use an Amazon Aurora MySQL relational database in your serverless applications. I show how to pool and share connections to the database with Amazon RDS Proxy, and how to choose configurations. The code examples in this post are written in PHP and can be found in this GitHub repository. The concepts can be applied to any AWS Lambda supported runtime.

The serverless LAMP stack

This serverless LAMP stack architecture is first discussed in this post. This architecture uses a PHP Lambda function (or multiple functions) to read and write to an Amazon Aurora MySQL database.

Amazon Aurora provides high performance and availability for MySQL and PostgreSQL databases. The underlying storage scales automatically to meet demand, up to 64 tebibytes (TiB). An Amazon Aurora DB instance is created inside a virtual private cloud (VPC) to prevent public access. To connect to the Aurora database instance from a Lambda function, that Lambda function must be configured to access the same VPC.

Database memory exhaustion can occur when connecting directly to an RDS database. This is caused by a surge in database connections or by a large number of connections opening and closing at a high rate. This can lead to slower queries and limited application scalability. Amazon RDS Proxy is implemented to solve this problem. RDS Proxy is a fully managed database proxy feature for Amazon RDS. It establishes a database connection pool that sits between your application and your relational database and reuses connections in this pool. This protects the database against oversubscription, without the memory and CPU overhead of opening a new database connection each time. Credentials for the database connection are securely stored in AWS Secrets Manager. They are accessed via an AWS Identity and Access Management (IAM) role. This enforces strong authentication requirements for database applications without a costly migration effort for the DB instances themselves.

The following steps show how to connect to an Amazon Aurora MySQL database running inside a VPC. The connection is made from a Lambda function running PHP. The Lambda function connects to the database via RDS Proxy. The database credentials that RDS Proxy uses are held in Secrets Manager and accessed via IAM authentication.

RDS Proxy with IAM Authentication

RDS Proxy with IAM authentication

Getting started

RDS Proxy is currently in preview and not recommended for production workloads. For a full list of available Regions, refer to the RDS Proxy pricing page.

Creating an Amazon RDS Aurora MySQL database

Before creating an Aurora DB cluster, you must meet the prerequisites, such as creating a VPC and an RDS DB subnet group. For more information on how to set this up, see DB cluster prerequisites.

  1. Call the create-db-cluster AWS CLI command to create the Aurora MySQL DB cluster.
    aws rds create-db-cluster \
    --db-cluster-identifier sample-cluster \
    --engine aurora-mysql \
    --engine-version 5.7.12 \
    --master-username admin \
    --master-user-password secret99 \
    --db-subnet-group-name default-vpc-6cc1cf0a \
    --vpc-security-group-ids sg-d7cf52a3 \
    --enable-iam-database-authentication true
  2. Add a new DB instance to the cluster.
    aws rds create-db-instance \
        --db-instance-class db.r5.large \
        --db-instance-identifier sample-instance \
        --engine aurora-mysql  \
        --db-cluster-identifier sample-cluster
  3. Store the database credentials as a secret in AWS Secrets Manager.
    aws secretsmanager create-secret \
    --name MyTestDatabaseSecret \
    --description "My test database secret created with the CLI" \
    --secret-string '{"username":"admin","password":"secret99","engine":"mysql","host":"<REPLACE-WITH-YOUR-DB-WRITER-ENDPOINT>","port":"3306","dbClusterIdentifier":"<REPLACE-WITH-YOUR-DB-CLUSTER-NAME>"}'

    Make a note of the resulting ARN for later

    {
        "VersionId": "eb518920-4970-419f-b1c2-1c0b52062117", 
        "Name": "MySampleDatabaseSecret", 
        "ARN": "arn:aws:secretsmanager:eu-west-1:1234567890:secret:MySampleDatabaseSecret-JgEWv1"
    }

    This secret is used by RDS Proxy to maintain a connection pool to the database. To access the secret, the RDS Proxy service requires permissions to be explicitly granted.

  4. Create an IAM policy that provides secretsmanager permissions to the secret.
    aws iam create-policy \
    --policy-name my-rds-proxy-sample-policy \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetResourcePolicy",
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret",
            "secretsmanager:ListSecretVersionIds"
          ],
          "Resource": [
            "<the-arn-of-the-secret>”
          ]
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetRandomPassword",
            "secretsmanager:ListSecrets"
          ],
          "Resource": "*"
        }
      ]
    }'
    

    Make a note of the resulting policy ARN, which you need to attach to a new role.

    {
        "Policy": {
            "PolicyName": "my-rds-proxy-sample-policy", 
            "PermissionsBoundaryUsageCount": 0, 
            "CreateDate": "2020-06-04T12:21:25Z", 
            "AttachmentCount": 0, 
            "IsAttachable": true, 
            "PolicyId": "ANPA6JE2MLNK3Z4EFQ5KL", 
            "DefaultVersionId": "v1", 
            "Path": "/", 
            "Arn": "arn:aws:iam::1234567890112:policy/my-rds-proxy-sample-policy", 
            "UpdateDate": "2020-06-04T12:21:25Z"
         }
    }
    
  5. Create an IAM Role that has a trust relationship with the RDS Proxy service. This allows the RDS Proxy service to assume this role to retrieve the database credentials.

    aws iam create-role --role-name my-rds-proxy-sample-role --assume-role-policy-document '{
     "Version": "2012-10-17",
     "Statement": [
      {
       "Sid": "",
       "Effect": "Allow",
       "Principal": {
        "Service": "rds.amazonaws.com"
       },
       "Action": "sts:AssumeRole"
      }
     ]
    }'
    
  6. Attach the new policy to the role:
    aws iam attach-role-policy \
    --role-name my-rds-proxy-sample-role \
    --policy-arn arn:aws:iam::123456789:policy/my-rds-proxy-sample-policy
    

Create an RDS Proxy

  1. Use the AWS CLI to create a new RDS Proxy. Replace the --role-arn and SecretArn values with those created in the previous steps.
    aws rds create-db-proxy \
    --db-proxy-name sample-db-proxy \
    --engine-family MYSQL \
    --auth '{
            "AuthScheme": "SECRETS",
            "SecretArn": "arn:aws:secretsmanager:eu-west-1:123456789:secret:exampleAuroraRDSsecret1-DyCOcC",
             "IAMAuth": "REQUIRED"
          }' \
    --role-arn arn:aws:iam::123456789:role/my-rds-proxy-sample-role \
    --vpc-subnet-ids  subnet-c07efb9a subnet-2bc08b63 subnet-a9007bcf
    

    To enforce IAM authentication for users of the RDS Proxy, the IAMAuth value is set to REQUIRED. This is a more secure alternative to embedding database credentials in the application code base.

    The Aurora DB cluster and its associated instances are referred to as the targets of that proxy.

  2. Add the database cluster to the proxy with the register-db-proxy-targets command.
    aws rds register-db-proxy-targets \
    --db-proxy-name sample-db-proxy \
    --db-cluster-identifiers sample-cluster
    

Deploying a PHP Lambda function with VPC configuration

This GitHub repository contains a Lambda function with a PHP runtime provided by a Lambda layer. The function uses the MySQLi PHP extension to connect to the RDS Proxy. The extension has been installed and compiled along with a PHP executable using this command:

The PHP executable is packaged together with a Lambda bootstrap file to create a PHP custom runtime. More information on building your own custom runtime for PHP can be found in this post.

Deploy the application stack using the AWS Serverless Application Model (AWS SAM) CLI:

sam deploy -g

When prompted, enter the SecurityGroupIds and the SubnetIds for your Aurora DB cluster.

The SAM template attaches the SecurityGroupIds and SubnetIds parameters to the Lambda function using the VpcConfig sub-resource.

Lambda creates an elastic network interface for each combination of security group and subnet in the function’s VPC configuration. The function can only access resources (and the internet) through that VPC.

Adding RDS Proxy to a Lambda Function

  1. Go to the Lambda console.
  2. Choose the PHPHelloFunction that you just deployed.
  3. Choose Add database proxy at the bottom of the page.
  4. Choose existing database proxy then choose sample-db-proxy.
  5. Choose Add.

Using the RDS Proxy from within the Lambda function

The Lambda function imports three libraries from the AWS PHP SDK. These are used to generate a password token from the database credentials stored in Secrets Manager.

The AWS PHP SDK libraries are provided by the PHP-example-vendor layer. Using Lambda layers in this way creates a mechanism for incorporating additional libraries and dependencies as the application evolves.

The function’s handler, named index, is the entry point of the function code. First, getenv() is called to retrieve the environment variables set by the SAM application’s deployment. These are saved as local variables and available for the duration of the Lambda function’s execution.

The AuthTokenGenerator class generates an RDS auth token for use with IAM authentication. This is initialized by passing in the credential provider to the SDK client constructor. The createToken() method is then invoked, with the Proxy endpoint, port number, Region, and database user name provided as method parameters. The resultant temporary token is then used to connect to the proxy.
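To make that flow concrete, here is a minimal sketch of the token generation step. The environment variable names and the layer's autoloader path are assumptions for illustration; the repository's code may name them differently.

<?php
require '/opt/vendor/autoload.php';   // AWS SDK autoloader, assumed to be provided by the vendor layer

use Aws\Credentials\CredentialProvider;
use Aws\Rds\AuthTokenGenerator;

// Environment variables set at deployment (names are illustrative).
$endpoint = getenv('DB_PROXY_ENDPOINT');
$port     = getenv('DB_PORT');
$region   = getenv('AWS_REGION');       // set automatically by Lambda
$username = getenv('DB_USER');

// Generate a short-lived IAM authentication token for the proxy endpoint.
$generator = new AuthTokenGenerator(CredentialProvider::defaultProvider());
$token     = $generator->createToken($endpoint . ':' . $port, $region, $username);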

The PHP mysqli class represents a connection between PHP and a MySQL database. The real_connect() method is used to open a connection to the database via RDS Proxy. Instead of providing the database host endpoint as the first parameter, the proxy endpoint is given. The database user name, temporary token, database name, and port number are also provided. The constant MYSQLI_CLIENT_SSL is set to ensure that the connection uses SSL encryption.

Once a connection has been established, the connection object can be used. In this example, a SHOW TABLES query is executed. The connection is then closed, and the result is encoded to JSON and returned from the Lambda function.
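Continuing the sketch above, the connection and query flow inside the handler might look like the following; again, the environment variable names are placeholders rather than the repository's actual code.

// Inside the index() handler, after generating $token as shown above.
$mysqli = mysqli_init();
$mysqli->real_connect(
    $endpoint,                 // RDS Proxy endpoint instead of the database host
    $username,
    $token,                    // temporary IAM auth token in place of a password
    getenv('DB_NAME'),         // illustrative environment variable
    (int) $port,
    null,
    MYSQLI_CLIENT_SSL          // require an SSL-encrypted connection
);

// Run a simple query, close the connection, and return the result as JSON.
$tables = [];
$result = $mysqli->query('SHOW TABLES');
while ($row = $result->fetch_row()) {
    $tables[] = $row[0];
}
$mysqli->close();

return ['statusCode' => 200, 'body' => json_encode($tables)];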


RDS Proxy monitoring and performance tuning

RDS Proxy allows you to monitor and adjust connection limits and timeout intervals without changing application code.

Limit the timeout wait period that is most suitable for your application with the connection borrow timeout option. This specifies how long to wait for a connection to become available in the connection pool before returning a timeout error.

Adjust the idle connection timeout interval to help your applications handle stale resources. This can save your application from mistakenly leaving open connections that hold important database resources.

Multiple applications using a single database can each use an RDS Proxy to divide the connection quotas across each application. Set the maximum proxy connections as a percentage of the max_connections configuration (for MySQL).

The following example shows how to change the MaxConnectionsPercent setting for a proxy target group.

aws rds modify-db-proxy-target-group \
--db-proxy-name sample-db-proxy \
--target-group-name default \
--connection-pool-config '{"MaxConnectionsPercent": 75 }'

Response:

{
    "TargetGroups": [
        {
            "DBProxyName": "sample-db-proxy",
            "TargetGroupName": "default",
            "TargetGroupArn": "arn:aws:rds:eu-west-1:####:target-group:prx-tg-03d7fe854604e0ed1",
            "IsDefault": true,
            "Status": "available",
            "ConnectionPoolConfig": {
            "MaxConnectionsPercent": 75,
            "MaxIdleConnectionsPercent": 50,
            "ConnectionBorrowTimeout": 120,
            "SessionPinningFilters": []
        	},            
"CreatedDate": "2020-06-04T16:14:35.858000+00:00",
            "UpdatedDate": "2020-06-09T09:08:50.889000+00:00"
        }
    ]
}

When RDS Proxy detects a session state change that isn’t appropriate for reuse across connections, it keeps the session on the same connection until the session ends. This behavior is called pinning. Performance tuning for RDS Proxy involves maximizing connection reuse by minimizing pinning.

Monitor the Amazon CloudWatch metric DatabaseConnectionsCurrentlySessionPinned to see how frequently pinning occurs in your application.

Amazon CloudWatch collects and processes raw data from RDS Proxy into readable, near real-time metrics. Use these metrics to observe the number of connections and the memory associated with connection management. This can help identify whether a database instance or cluster would benefit from using RDS Proxy, for example when it handles many short-lived connections or opens and closes connections at a high rate.
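As a rough sketch, you could retrieve one of these metrics with the AWS SDK for PHP as shown below. The AWS/RDS namespace and ProxyName dimension reflect my assumption about where RDS Proxy publishes its metrics; verify the exact names in the CloudWatch console for your account.

<?php
require '/opt/vendor/autoload.php';   // assumed autoloader path

use Aws\CloudWatch\CloudWatchClient;

$cloudWatch = new CloudWatchClient([
    'region'  => 'eu-west-1',
    'version' => 'latest',
]);

// Average pinning activity for the proxy over the last hour, in 5-minute periods.
$result = $cloudWatch->getMetricStatistics([
    'Namespace'  => 'AWS/RDS',                                    // assumed namespace for RDS Proxy metrics
    'MetricName' => 'DatabaseConnectionsCurrentlySessionPinned',
    'Dimensions' => [['Name' => 'ProxyName', 'Value' => 'sample-db-proxy']],  // assumed dimension name
    'StartTime'  => strtotime('-1 hour'),
    'EndTime'    => time(),
    'Period'     => 300,
    'Statistics' => ['Average'],
]);

print_r($result['Datapoints']);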

Conclusion

In this post, you learn how to create and configure an RDS Proxy to manage connections from a PHP Lambda function to an Aurora MySQL database. You see how to enforce strong authentication requirements by using Secrets Manager and IAM authentication. You deploy a Lambda function that uses Lambda layers to store the AWS PHP SDK as a dependency.

You can create secure, scalable, and performant serverless applications with relational databases. Do this by placing the RDS Proxy service between your database and your Lambda functions. You can also migrate your existing MySQL database to an Aurora DB cluster without altering the database. Using RDS Proxy and Lambda, you can build serverless PHP applications faster, with less code.

Find more PHP examples with the Serverless LAMP stack.