Tag Archives: Amazon RDS

Enabling data classification for Amazon RDS database with Macie

Post Syndicated from Bruno Silveira original https://aws.amazon.com/blogs/security/enabling-data-classification-for-amazon-rds-database-with-amazon-macie/

Customers have been asking us about ways to use Amazon Macie data discovery on their Amazon Relational Database Service (Amazon RDS) instances. This post shows how to do so by using AWS Database Migration Service (AWS DMS) to extract data from Amazon RDS, store it on Amazon Simple Storage Service (Amazon S3), and then classify the data using Macie. The resulting Macie findings are also made available for the appropriate teams to query with Amazon Athena.

The challenge

Let’s suppose you need to find sensitive data in an RDS-hosted database using Macie, which currently only supports S3 as a data source. Therefore, you need to extract the data from RDS and store it in S3. In addition, you need an interface for audit teams to review these findings.

Solution overview

Figure 1: Solution architecture workflow

The architecture of the solution in Figure 1 can be described as follows:

  1. A MySQL engine running on RDS is populated with the Sakila sample database.
  2. A DMS task connects to the Sakila database, transforms the data into a set of Parquet compressed files, and loads them into the dcp-macie bucket.
  3. A Macie classification job analyzes the objects in the dcp-macie bucket using a combination of techniques such as machine learning and pattern matching to determine whether the objects contain sensitive data and to generate detailed reports on the findings.
  4. Amazon EventBridge routes the Macie findings reports events to Amazon Kinesis Data Firehose.
  5. Kinesis Data Firehose stores these reports in the dcp-glue bucket.
  6. S3 event notification triggers an AWS Lambda function whenever an object is created in the dcp-glue bucket.
  7. The Lambda function named Start Glue Workflow starts a Glue Workflow.
  8. Glue Workflow transforms the data from JSONL to the Apache Parquet file format and places it in the dcp-athena bucket. This provides better query performance and optimized storage usage, because Parquet is a binary, column-oriented storage format.
  9. Athena is used to query and visualize data generated by Macie.

Note: For better readability, the S3 bucket nomenclature omits the suffix containing the AWS Region and AWS account ID used to meet the global uniqueness naming requirement (for example, dcp-athena-us-east-1-123456789012).

The Sakila database schema consists of the following tables:

  • actor
  • address
  • category
  • city
  • country
  • customer

Building the solution

Prerequisites

Before configuring the solution, the AWS Identity and Access Management (IAM) user must be granted appropriate access to the services used in this solution, shown in Figure 1.

You can find an IAM policy with the required permissions here.

Step 1 – Deploying the CloudFormation template

You’ll use CloudFormation to quickly and consistently provision the AWS resources illustrated in Figure 1. Through a pre-built template file, it creates the infrastructure using an Infrastructure-as-Code (IaC) approach.

  1. Download the CloudFormation template.
  2. Go to the CloudFormation console.
  3. Select the Stacks option in the left menu.
  4. Select Create stack and choose With new resources (standard).
  5. On Step 1 – Specify template, choose Upload a template file, select Choose file, and select the file template.yaml downloaded previously.
  6. On Step 2 – Specify stack details, enter a name of your preference for Stack name. You might also adjust the parameters as needed, like the parameter CreateRDSServiceRole to create a service role for RDS if it does not exist in the current account.
  7. On Step 3 – Configure stack options, select Next.
  8. On Step 4 – Review, check the box for I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then select Create Stack.
  9. Wait for the stack to show status CREATE_COMPLETE.

Note: It is expected that provisioning will take around 10 minutes to complete.

Step 2 – Running an AWS DMS task

To extract the data from the Amazon RDS instance, you need to run an AWS DMS task. This makes the data available for Amazon Macie in an S3 bucket in Parquet format.

  1. Go to the AWS DMS console.
  2. In the left menu, select Database migration tasks.
  3. Select the task Identifier named rdstos3task.
  4. Select Actions.
  5. Select the option Restart/Resume.

When the Status changes to Load Complete, the task has finished and you will be able to see the migrated data in your target bucket (dcp-macie).

Inside each folder, you can see one or more Parquet files with names similar to LOAD00000001.parquet. Now you can use Macie to discover whether you have sensitive data in your database contents as exported to S3.
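
If you prefer to script this step rather than use the console, the following is a minimal boto3 sketch that restarts the task and waits for it to stop. It assumes the task identifier rdstos3task created by the CloudFormation template; the polling interval and status handling are simplified for illustration.

import time
import boto3

dms = boto3.client("dms")

# Look up the replication task created by the CloudFormation stack (rdstos3task).
task = dms.describe_replication_tasks(
    Filters=[{"Name": "replication-task-id", "Values": ["rdstos3task"]}]
)["ReplicationTasks"][0]

# Restart the full load so the RDS data is re-exported to the dcp-macie bucket.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTaskArn"],
    StartReplicationTaskType="reload-target",
)

# Poll until the task stops; the console shows this as "Load complete".
while True:
    status = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-id", "Values": ["rdstos3task"]}]
    )["ReplicationTasks"][0]["Status"]
    print("Task status:", status)
    if status == "stopped":
        break
    time.sleep(30)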

Step 3 – Running a classification job with Amazon Macie

Now you need to create a data classification job so you can assess the contents of your S3 bucket. The job you create will run once and evaluate the complete contents of your S3 bucket to determine whether it can identify PII among the data. As mentioned earlier, this job only uses the managed data identifiers available with Macie; you could also add your own custom identifiers.

  1. Go to the Macie console.
  2. Select the S3 buckets option in the left menu.
  3. Choose the S3 bucket dcp-macie containing the output data from the DMS task. You may need to wait a minute and select the Refresh icon if the bucket names do not display.

  4. Select Create job.
  5. Select Next to continue.
  6. Create a job with the following scope.
    1. Sensitive data Discovery options: One-time job
    2. Sampling Depth: 100%
    3. Leave all other settings with their default values
  7. Select Next to continue.
  8. Select Next again to skip past the Custom data identifiers section.
  9. Give the job a name and description.
  10. Select Next to continue.
  11. Verify the details of the job you have created and select Submit to continue.

You will see a green banner stating that The Job was successfully created. The job can take up to 15 minutes to conclude and the Status will change from Active to Complete. To open the findings from the job, select the job’s check box, choose Show results, and select Show findings.
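
If you want to automate this step, the Macie API exposes the same options. The following is a minimal boto3 sketch that mirrors the one-time, 100% sampling job configured above; the bucket name is a placeholder that you would replace with your dcp-macie bucket (including its Region and account ID suffix).

import boto3

macie = boto3.client("macie2")
sts = boto3.client("sts")
account_id = sts.get_caller_identity()["Account"]

# Placeholder: replace with your dcp-macie bucket name, including the
# Region and account ID suffix created by the CloudFormation template.
bucket_name = "dcp-macie-us-east-1-123456789012"

response = macie.create_classification_job(
    jobType="ONE_TIME",                 # one-time job, as configured in the console
    name="dcp-classification-job",
    samplingPercentage=100,             # evaluate 100% of the objects
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": account_id, "buckets": [bucket_name]}
        ]
    },
)
print("Created Macie job:", response["jobId"])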
 

Figure 2: Macie Findings screen

Note: You can navigate in the findings and select each checkbox to see the details.

Step 4 – Enabling querying on classification job results with Amazon Athena

  1. Go to the Athena console and open the Query editor.
  2. If it’s your first time using Athena, you will see the message Before you run your first query, you need to set up a query result location in Amazon S3. Learn more. Select the link presented with this message.
  3. In the Settings window, choose Select and then choose the bucket dcp-assets to store the Athena query results.
  4. (Optional) To store the query results encrypted, check the box for Encrypt query results and select your preferred encryption type. To learn more about Amazon S3 encryption types, see Protecting data using encryption.
  5. Select Save.

Step 5 – Query Amazon Macie results with Amazon Athena

It might take a few minutes for the data to complete the flow between Amazon Macie and AWS Glue. After it’s finished, you’ll be able to see in Athena’s console the table dcp_athena within the database dcp.

Then, select the three dots next to the table dcp_athena and select the Preview table option to see a data preview, or run your own custom queries.
 

Figure 3: Athena table preview
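
You can also run queries programmatically. The following is a minimal boto3 sketch, assuming the dcp database and dcp_athena table created by the AWS Glue workflow and the dcp-assets bucket configured as the query result location; the query itself is intentionally generic because the table schema depends on the findings that were ingested.

import time
import boto3

athena = boto3.client("athena")

# Placeholder: the query results bucket configured in Step 4.
output_location = "s3://dcp-assets-us-east-1-123456789012/athena-results/"

query = "SELECT * FROM dcp.dcp_athena LIMIT 10"

execution_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "dcp"},
    ResultConfiguration={"OutputLocation": output_location},
)["QueryExecutionId"]

# Wait for the query to finish, then print the first rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])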

As your environment grows, this blog post on Top 10 Performance Tuning Tips for Amazon Athena can help you apply partitioning of data and consolidate your data into larger files if needed.

Clean up

After you finish, to clean up the solution and avoid unnecessary expenses, complete the following steps:

  1. Go to the Amazon S3 console.
  2. Navigate to each of the buckets listed below and delete all of their objects:
    • dcp-assets
    • dcp-athena
    • dcp-glue
    • dcp-macie
  3. Go to the CloudFormation console.
  4. Select the Stacks option in the left menu.
  5. Choose the stack you created in Step 1 – Deploying the CloudFormation template.
  6. Select Delete and then select Delete Stack in the pop-up window.

Conclusion

In this blog post, we showed how you can find Personally Identifiable Information (PII), and other data defined as sensitive by Macie’s Managed Data Identifiers, in an RDS-hosted MySQL database. You can use this solution with other relational databases like PostgreSQL, SQL Server, or Oracle, whether hosted on RDS or EC2. If you’re using Amazon DynamoDB, you may also find the blog post Detecting sensitive data in DynamoDB with Macie useful.

By classifying your data, you can inform your management of appropriate data protection and handling controls for the use of that data.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Bruno Silveira

Bruno is a Solutions Architect Manager in the Public Sector team with a focus on educational institutions in Brazil. His previous career was in government, financial services, utilities, and nonprofit institutions. Bruno is an enthusiast of cloud security and an appreciator of good rock’n roll with a good beer.

Author

Thiago Pádua

Thiago is a Solutions Architect in the AWS Worldwide Public Sector team, working on the development and support of partners. He is experienced in software development and systems integration, mainly in the telecommunications industry. He has a special interest in microservices, serverless, and containers.

Migrate Resources Between AWS Accounts

Post Syndicated from Ashok Srirama original https://aws.amazon.com/blogs/architecture/migrate-resources-between-aws-accounts/

Have you ever wondered how to move resources between Amazon Web Services (AWS) accounts? You can really view this as a migration of resources. Migrating resources from one AWS account to another may be desired or required due to your business needs. Following are a few scenarios where this may be of benefit:

  1. When you acquire, sell, or merge overseas operations from other businesses.
  2. When you move regional operations from one Managed Service Provider (MSP) to another.
  3. When you reorganize your AWS account and organizational structure.

This process may involve migrating the infrastructure either partially or completely.

In this blog, we will discuss various approaches to migrating resources based on type, configuration, and workload needs. Usually, the first consideration is infrastructure. What’s in your environment? What are the interdependencies? How will you migrate each resource?

Using this information, you can outline a plan on how you will approach migrating each of the resources in your portfolio, and in what order.

Here are some considerations to address for a typical migration: infrastructure, compute, storage, and databases.

Let’s look at each of these considerations in detail.

Migrating infrastructure

To migrate infrastructure that includes ephemeral resources, you can use one of the following Infrastructure as Code (IaC) approaches, shown in Figure 1. IaC templates are like programming scripts that automate the provisioning of IT resources.

Figure 1. Approaches to migrate infrastructure using IaC

1. If you are already using AWS CloudFormation templates, you can easily import the existing templates to the target AWS account.

AWS CloudFormation simplifies provisioning and management on AWS. You can create templates for quick and reliable provisioning of services or applications (called “stacks”).

2. You can use tools like Former2 to templatize your existing resources in the source AWS account and deploy them in the target account.

Former2 is an open-source project that allows you to generate IaC templates. For example, AWS CloudFormation or HashiCorp Terraform can be generated from the existing resources within your AWS account. Read Accelerate infrastructure as code development with open source Former2 for step-by-step guidance.
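
Whichever approach produces the template, deploying it into the target account can also be scripted. The following is a minimal boto3 sketch, assuming you have credentials for the target account and a local template file; the profile, stack, and file names are placeholders.

import boto3

# Use a profile (or assumed role) that belongs to the target AWS account.
session = boto3.Session(profile_name="target-account")
cfn = session.client("cloudformation")

with open("exported-template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="migrated-workload",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required if the template creates named IAM resources
)

# Block until the stack finishes creating.
cfn.get_waiter("stack_create_complete").wait(StackName="migrated-workload")
print("Stack created in target account")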

Migrating compute resources

To migrate compute resources that have a persistent state, you can use one of the following approaches, shown in Figure 2. These provide a virtual computing environment, allowing you to launch instances with a variety of operating systems.

Figure 2. Approaches to migrate compute resources

1. If you are already using the AWS Backup service and AWS Organizations to centrally manage backup policies, you can enable the AWS Backup cross-account management feature. This lets you manage, monitor, and restore your backups, and copy backup jobs across AWS accounts. Ensure you have both accounts in the same AWS Organization. Once the backups are available in the target account, you can restore EC2 instances. Follow detailed instructions at Creating backup copies across AWS accounts.

AWS Backup is a fully managed data protection service that centralizes and automates data protection across AWS services, in the cloud, and on-premises. You can configure backup policies and monitor activity for your AWS resources. You can automate and consolidate backup tasks that were previously performed service-by-service. This removes the need to create custom scripts and use manual processes.

2. Create an Amazon Machine Image of your EC2 instances and share it with the target account. You can launch new EC2 instances using the shared AMI. Follow step-by-step instructions: How do I transfer an Amazon EC2 instance or AMI to a different AWS account?

Amazon Machine Image (AMI) provides the information required to launch an instance. Specify an AMI and then launch multiple instances from a single AMI with the same configuration. You can use different AMIs to launch instances when you need instances with different configurations.
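
As a sketch of the AMI-sharing approach, the following boto3 calls add a launch permission for the target account. The AMI ID and account number are placeholders; for AMIs backed by encrypted snapshots, you must also share the KMS key, as described in the linked instructions.

import boto3

ec2 = boto3.client("ec2")

ami_id = "ami-0123456789abcdef0"   # placeholder: AMI created from your EC2 instance
target_account = "111122223333"    # placeholder: target AWS account ID

# Grant the target account permission to launch instances from this AMI.
ec2.modify_image_attribute(
    ImageId=ami_id,
    Attribute="launchPermission",
    OperationType="add",
    UserIds=[target_account],
)

# The target account can now launch instances (or copy the AMI) using this image ID.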

For migrating non-persistent compute resources, refer to the Migrating infrastructure section.

Migrating storage resources

AWS offers various storage services including object, file, and block storage. To migrate objects from an S3 bucket, you can take the following approaches, shown in Figure 3a.

Figure 3a. Approaches to migrate S3 buckets

  1. Use Amazon S3 command line interface (CLI) commands to copy the initial load of objects from the source account to the target account (a boto3 sketch of this initial copy follows this list). Read How can I copy S3 objects from another AWS account? After the initial copy, you can enable the Amazon S3 replication feature to continuously replicate object changes across accounts. Add a bucket policy that grants the source account permission to replicate objects into the destination bucket. Read this walkthrough on how to configure replication.

  2. If the S3 bucket contains a large number of objects, use Amazon S3 Batch Operations to copy objects across AWS accounts in bulk. Read Cross-account bulk transfer of files using Amazon S3 Batch Operations.
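
The initial copy in the first approach can also be scripted. The following is a minimal boto3 sketch; it assumes the calling identity has read access to the source bucket and write access to the destination bucket (for example, through bucket policies), and the bucket names are placeholders.

import boto3

s3 = boto3.client("s3")

source_bucket = "source-account-bucket"  # placeholder
dest_bucket = "target-account-bucket"    # placeholder

# Copy every object from the source bucket into the destination bucket.
# Note: copy_object works for objects up to 5 GB; larger objects need
# multipart copy (for example, the higher-level s3 resource copy() helper).
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source_bucket):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=dest_bucket,
            Key=obj["Key"],
            CopySource={"Bucket": source_bucket, "Key": obj["Key"]},
        )
        print("Copied", obj["Key"])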

To migrate files from an Amazon EFS file system, you can take the following approach, shown in Figure 3b.

Figure 3b. Approach to migrate EFS file systems

Use AWS DataSync agent to transfer data from one EFS file system to another. AWS DataSync is an online transfer service that simplifies moving, copying, and synchronizing large amounts of data between on-premises storage systems and AWS storage services. Read Transferring file data across AWS Regions and accounts using AWS DataSync for step-by-step guidance.

Migrating database resources

AWS offers various purpose-built database engines. These include relational, key-value, document, in-memory, graph, time series, wide column, and ledger databases. To migrate relational databases, you can take one of the following approaches, shown in Figure 4.

Figure 4. Approaches to migrate relational database resources

1. If you want to continuously replicate data changes, use AWS Database Migration Service (AWS DMS) to replicate your data across AWS accounts with high availability. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. You can set up a DMS task for either one-time migration or on-going replication. An on-going replication task keeps your source and target databases in sync. Once set up, the on-going replication task will continuously apply source changes to the target with minimal latency. Learn how to Set Up AWS DMS for Cross-Account Migration.

AWS DMS is a cloud service that streamlines the migration of relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.

  2. Use RDS snapshots to create and share database backups across AWS accounts (a boto3 sketch follows this list). Use the shared snapshots to launch new Amazon Relational Database Service (Amazon RDS) instances in the target account. Read step-by-step instructions: How can I share an encrypted Amazon RDS DB snapshot with another account?

3. Use AWS Backup to create backup policies that automatically back up your AWS resources. Use AWS Backup cross-account management feature to manage and monitor your backup, restore, and copy jobs across AWS accounts. Once the backups are available in the target account, you can restore RDS instances. Learn about Creating backup copies across AWS accounts.
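
For the snapshot-sharing approach (option 2), the following is a minimal boto3 sketch. It assumes a manual snapshot encrypted with a customer managed KMS key that has already been shared with the target account; all identifiers and account numbers are placeholders.

import boto3

rds = boto3.client("rds")

snapshot_id = "mydb-manual-snapshot"   # placeholder: manual DB snapshot identifier
target_account = "111122223333"        # placeholder: target AWS account ID

# In the source account: share the snapshot with the target account.
rds.modify_db_snapshot_attribute(
    DBSnapshotIdentifier=snapshot_id,
    AttributeName="restore",
    ValuesToAdd=[target_account],
)

# In the target account: copy the shared snapshot (re-encrypting with a key
# in that account), then restore a new DB instance from the copy.
# rds_target.copy_db_snapshot(
#     SourceDBSnapshotIdentifier="arn:aws:rds:us-east-1:999999999999:snapshot:mydb-manual-snapshot",  # placeholder
#     TargetDBSnapshotIdentifier="mydb-copied-snapshot",
#     KmsKeyId="alias/target-account-cmk",  # placeholder
# )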

In this section, we discussed relational database migration. You can also use AWS DMS for migrating other databases. Read supported AWS DMS source and target databases.

Conclusion

In this blog post, we discussed various approaches you can take to migrate resources from one account to another depending upon the resource type and configuration. Additionally, you can also explore CloudEndure Migration for continuous data replication. Learn more about Migrating workloads across AWS Regions with CloudEndure Migration.

Field Notes: Automate Disaster Recovery for AWS Workloads with Druva

Post Syndicated from Girish Chanchlani original https://aws.amazon.com/blogs/architecture/field-notes-automate-disaster-recovery-for-aws-workloads-with-druva/

This post was co-written by Akshay Panchmukh, Product Manager, Druva and Girish Chanchlani, Sr Partner Solutions Architect, AWS.

The Uptime Institute’s Annual Outage Analysis 2021 report estimated that 40% of outages or service interruptions in businesses cost between $100,000 and $1 million, while about 17% cost more than $1 million. To guard against this, it is critical for you to have a sound data protection and disaster recovery (DR) strategy to minimize the impact on your business. With the greater adoption of the public cloud, most companies are either following a hybrid model with critical workloads spread across on-premises data centers and the cloud or are all in the cloud.

In this blog post, we focus on how Druva, a SaaS based data protection solution provider, can help you implement a DR strategy for your workloads running in Amazon Web Services (AWS). We explain how to set up protection of AWS workloads running in one AWS account, and fail them over to another AWS account or Region, thereby minimizing the impact of disruption to your business.

Overview of the architecture

In the following architecture, we describe how you can protect your AWS workloads from outages and disasters. You can quickly set up a DR plan using Druva’s simple and intuitive user interface, and within minutes you are ready to protect your AWS infrastructure.

Figure 1. Druva architecture

Druva’s cloud DR is built on AWS using native services to provide a secure operating environment for comprehensive backup and DR operations. With Druva, you can:

  • Seamlessly create cross-account DR sites based on source sites by cloning Amazon Virtual Private Clouds (Amazon VPCs) and their dependents.
  • Set up backup policies to automatically create and copy snapshots of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Relational Database Service (Amazon RDS) instances to DR Regions based on recovery point objective (RPO) requirements.
  • Set up service level objective (SLO) based DR plans with options to schedule automated tests of DR plans and ensure compliance.
  • Monitor implementation of DR plans easily from the Druva console.
  • Generate compliance reports for DR failover and test initiation.

Other notable features include support for automated runbook initiation, selection of target AWS instance types for DR, and simplified orchestration and testing to help protect and recover data at scale. Druva provides the flexibility to adopt evolving infrastructure across geographic locations, adhere to regulatory requirements (such as GDPR and CCPA), and recover workloads quickly following disasters, helping meet your business-critical recovery time objectives (RTOs). The solution can take snapshots as frequently as every five minutes, improving RPOs. Because it is a software as a service (SaaS) solution, Druva helps lower costs by eliminating traditional administration and maintenance of storage hardware and software, upgrades, patches, and integrations.

We will show you how to set up Druva to protect your AWS workloads and automate DR.

Step 1: Log into the Druva platform and provide access to AWS accounts

After you sign in to the Druva Cloud Platform, you will need to grant Druva third-party access to your AWS account by pressing the Add New Account button and following the steps shown in Figure 2.

Figure 2. Add AWS account information

Druva uses AWS Identity and Access Management (IAM) roles to access and manage your AWS workloads. To help you with this, Druva provides an AWS CloudFormation template to create a stack or stack set that generates the following:

  • IAM role
  • IAM instance profile
  • IAM policy

The generated Amazon Resource Name (ARN) of the IAM role is then linked to Druva so that it can run backup and DR jobs for your AWS workloads. Note that Druva follows all security protocols and best practices recommended by AWS. All access permissions to your AWS resources and Regions are controlled by IAM.
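
Druva’s CloudFormation template creates these resources for you, but as an illustration of the underlying pattern, the following sketch creates a cross-account role that an external party can assume. The account ID, external ID, role name, and attached policy are placeholders for illustration only, not Druva’s actual values.

import json
import boto3

iam = boto3.client("iam")

# Placeholders only: the external party's account ID and the external ID it provides.
external_account_id = "999999999999"
external_id = "example-external-id"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{external_account_id}:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
    }],
}

iam.create_role(
    RoleName="example-backup-vendor-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Cross-account role assumed by a third-party backup service",
)

# Attach only the permissions the service actually needs (least privilege).
iam.attach_role_policy(
    RoleName="example-backup-vendor-role",
    PolicyArn="arn:aws:iam::aws:policy/AWSBackupFullAccess",  # placeholder policy
)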

After you have logged into Druva and set up your account, you can now set up DR for your AWS workloads. The following steps will allow you to set up DR for AWS infrastructure.

Step 2: Identify the source environment

A source environment refers to a logical grouping of Amazon VPCs, subnets, security groups, and other infrastructure components required to run your application.

Figure 3. Create a source environment

In this step, create your source environment by selecting the appropriate AWS resources you’d like to set up for failover. Druva currently supports Amazon EC2 and Amazon RDS as sources that can be protected. With Druva’s automated DR, you can fail over these resources to a secondary site with the press of a button.

Figure 4. Add resources to a source environment

Note that creating a source environment does not create or update existing resources or configurations in your AWS account. It only saves this configuration information with Druva’s service.

Step 3: Clone the environment

The next step is to clone the source environment to a Region where you want to fail over in case of a disaster. Druva supports cloning the source environment to another Region or AWS account that you have selected. Cloning essentially replicates the source infrastructure in the target Region or account, which allows the resources to be failed over quickly and seamlessly.

Figure 5. Clone the source environment

Step 4: Set up a backup policy

You can create a new backup policy or use an existing backup policy to create backups in the cloned or target Region. This enables Druva to restore instances using the backup copies.

Figure 6. Create a new backup policy or use an existing backup policy

Figure 7. Customize the frequency of backup schedules

Step 5: Create the DR plan

A DR plan is a structured set of instructions designed to recover resources in the event of a failure or disaster. DR aims to get you back to the production-ready setup with minimal downtime. Follow these steps to create your DR plan.

  1. Create DR Plan: Press the Create Disaster Recovery Plan button to open the DR plan creation page.

Figure 8. Create a disaster recovery plan

  2. Name: Enter the name of the DR plan.
    Service Level Objective (SLO): Enter your RPO and RTO.
    • Recovery Point Objective – Example: If you set your RPO as 24 hours, your backup was scheduled daily at 8:00 PM, and a disaster occurred at 7:59 PM, you would be able to recover the data that was backed up on the previous day at 8:00 PM. You would lose the data generated after that last backup (up to 24 hours of data loss).
    • Recovery Time Objective – Example: If you set your RTO as 30 hours, when a disaster occurred, you would be able to recover all critical IT services within 30 hours from the point in time the disaster occurred.

Figure 9. Add your RTO and RPO requirements

  3. Create your plan based on the source environment, target environment, and resources.
    Environments
    • Source Account – By default, this is the Druva account in which you are currently creating the DR plan.
    • Source Environment – Select the source environment applicable within the Source Account (your Druva account in which you’re creating the DR plan).
    • Target Account – Select the same or a different target account.
    • Target Environment – Select the Target Environment, applicable within the Target Account.
    Resources
    • Create Policy – If you do not have a backup policy, then you can create one.
    • Add Resources – Add resources from the source environment that you want to restore. Make sure the verification column shows a ‘Valid Backup Policy’ status. This ensures that the backup policy is frequently creating copies as per the RPO defined previously.
Figure 10. Create DR plan based off the source environment, target environment, and resources

  4. Identify the target instance type, test plan instance type, and run books for this DR plan.
    • Target Instance Type – The target instance type can be selected. If an instance type is not selected, you can configure the plan to:
      • Select a larger instance type.
      • Select a smaller instance type.
      • Fail the creation of the instance.
    • Test Plan Instance Type – There are many options. To reduce cost, a lower instance type can be selected from all available AWS instance types.
    • Run Books – Select this option if you would like to shut down the source server after failover occurs.

Figure 11. Identify target instance type, test plan instance type, and run books

Step 6: Test the DR plan

After you have defined your DR plan, it is time to test it so that you can—when necessary—initiate a failover of resources to the target Region. You can now easily try this on the resources in the cloned environment without affecting your production environment.

Figure 12. Test the DR plan

Testing your DR plan will help you to find answers for some of the questions like: How long did the recovery take? Did I meet my RTO and RPO objectives?

Step 7: Initiate the DR plan

After you have successfully tested the DR plan, it can easily be initiated with the click of a button to fail over your resources from the source Region or account to the target Region or account.

Conclusion

With the growth of cloud-based services, businesses need to ensure that mission-critical applications that power their businesses are always available. Any loss of service has a direct impact on the bottom line, which makes business continuity planning a critical element to any organization. Druva offers a simple DR solution which will help you keep your business always available.

Druva provides unified backup and cloud DR with no need to manage hardware, software, or costs and complexity. It helps automate DR processes to ensure your teams are prepared for any potential disasters while meeting compliance and auditing requirements.

With Druva, you can easily validate your RTO and RPO with automated, regular DR testing, use cross-account DR for protection against attacks and accidental deletions, and ensure backups are isolated from your primary production account for DR planning. With cross-Region DR, you can duplicate the entire Amazon VPC environment across Regions to protect against Region-wide failures. In conclusion, Druva is a complete solution built with the goal of protecting your native AWS workloads from any disaster.

To learn more, visit: https://www.druva.com/use-cases/aws-cloud-backup/

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.
Akshay Panchmukh

Akshay Panchmukh is the Product Manager focusing on cloud-native workloads at Druva. He loves solving complex problems for customers and helping them achieve their business goals. He leverages his SaaS product and design thinking experience to help enterprises build a complete data protection solution. He works combining the best of technology, data, product, empathy, and design to execute a successful product vision.

Middleware-assisted Zero-downtime Live Database Migration to AWS

Post Syndicated from Seif Elharaki original https://aws.amazon.com/blogs/architecture/middleware-assisted-zero-downtime-live-database-migration-to-aws/

When trying to figure out how to refactor your applications to leverage AWS Managed Services, you have some decisions to make. You may have decided to move your storage layer to AWS before the computational layer. This may help with using advanced database features, in addition to reducing costs associated with writing and reading data. AWS Professional Services recently helped a large customer with this implementation.

With more than a quarter billion daily users, this customer uses highly transactional NoSQL databases that are a few hundred GBs in size, and the volume of data is growing rapidly. The downtime requirements for the migration were stringent, as their applications are used globally, 24-7. The source data layer was Cloud Datastore, which runs outside of AWS. The destination was Amazon DynamoDB. Several hundred globally distributed applications (writing to and reading from the database) had little or no room for refactoring in the initial phase. While the go-to solution for this scenario is usually AWS Database Migration Service, Cloud Datastore is not yet supported by AWS DMS as a source.

Using a configurable middleware to migrate in-use data layer

The architectural approach chosen was to develop a custom middleware that the applications would communicate with, rather than directly calling the database. Common database operations such as Get, Put, Delete, conditional updates and deletes, and transactions would be issued against this middleware, which is loaded as an in-memory library. Data would be read from and written to this middleware layer. It would then issue reads and writes to multiple databases with configurable load factors. The solution was developed and tested in stages (Dual-Write Single-Read, Dual-Write Dual-Read, and Single-Write Single-Read), as shown in the diagrams following.
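
The middleware itself was custom-built for this customer, but the core idea can be sketched as a thin wrapper that routes writes (and reads) to one or both databases according to its configured stage. The adapter interfaces and stage names below are illustrative assumptions, not the actual implementation.

from enum import Enum


class Stage(Enum):
    DUAL_WRITE_SINGLE_READ = 1
    DUAL_WRITE_DUAL_READ = 2
    SINGLE_WRITE_SINGLE_READ = 3


class MigrationMiddleware:
    """In-memory library the applications call instead of a database client."""

    def __init__(self, source_db, target_db, stage):
        self.source = source_db   # e.g. a Cloud Datastore adapter
        self.target = target_db   # e.g. a DynamoDB adapter
        self.stage = stage

    def put(self, key, item):
        # Writes go to both databases until the final stage.
        if self.stage in (Stage.DUAL_WRITE_SINGLE_READ, Stage.DUAL_WRITE_DUAL_READ):
            self.source.put(key, item)
        self.target.put(key, item)

    def get(self, key):
        if self.stage == Stage.DUAL_WRITE_SINGLE_READ:
            return self.source.get(key)
        if self.stage == Stage.DUAL_WRITE_DUAL_READ:
            source_item = self.source.get(key)
            target_item = self.target.get(key)
            if source_item != target_item:
                # Validation hook: log or reconcile discrepancies between stores.
                print(f"Data mismatch for key {key}")
            return source_item
        # Single-Write Single-Read: operate solely on the target database.
        return self.target.get(key)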

Architecture: routing database traffic to multiple storage targets

Figure 1 shows the initial state when the application layer communicates directly with the source database.

Figure 1. Architecture of initial state: Generic 2 or 3 tier application with application and storage layers running inside or outside AWS Cloud

Figures 2 and 3 illustrate the intermediate and final states, respectively, with database traffic moved to DynamoDB progressively.

Figure 2. Architecture of intermediate state: The middleware layer introduced to switch database traffic between source and target databases

Figure 3. Architecture of post-migration final state

A closer look into the configurable stages of the migration

Initially, the middleware should be tested with the source database alone. It can then be configured to work with DynamoDB in Dual-Write mode. Reads will still continue from the source database. The target database is synchronized by copying old data in parallel.

In the next stage, reads are expanded to the target database. Reading from two sources allows in-memory comparison of the final result set. This ensures consistency of the data being returned. Upon successful validation, the system is finally configured as Single-Write Single-Read, operating solely on the target database. This is the “Point of no Return,” where the target database surges ahead with new data. In this mode, the migration is deemed complete, and the older database is ready to be taken offline.

This multi-stage approach results in a “live migration” of the data layer to DynamoDB with zero downtime. Higher-level applications are also left intact. This increases the speed and accuracy of the overall migration.

Configurable stages of migration load balanced traffic to underlying databases

The middleware layer acts as a valve or switch regulating traffic from the applications to one or more databases. This allows support for a canary-like load balancing, where a certain percentage of traffic can be diverted in either direction. We can visualize this behavior with the analogy of a 3-stage dial, as shown in Figures 4 through 6. These stages are developed and tested in a non-production environment with production-like data. All related sets of tables should be migrated together.

Dual-Write Single-Read stage

Figure 4. Migration Stage 1: Dual-Write Single-Read mode

In this stage, shown in Figure 4, data is written to both the source database and the target database (DynamoDB). At this point, data is read only from the source, because the target is not ready to handle reads yet. While new data is being written to the target database, older data is copied and backfilled by background processes.

To avoid data corruption while copying older data, don't change that data while you're bringing the target database on par with the source. The middleware can implement a locking mechanism for write operations based on primary keys. One way to monitor the movement of older data is a temporary table, which the copying process can update. The middleware can read this table to allow or deny a write operation. In most use cases, writes taper off with time, making it easier for older data to be copied without running into contention.

Dual-Write Dual-Read stage

Figure 5. Migration Stage 2: Dual-Write Dual-Read mode

The prerequisite of this stage is the parity of content between the two databases. As shown in Figure 5, both reads and writes are routed to both databases. In this stage, the middleware layer activates data validation. The records that are read from both the data layers are compared and contrasted for accuracy and consistency. This allows any discrepancies in the data to be fixed and the solution redeployed.

Single-Write Single-Read stage

Figure 6. Migration Stage 3: Single-Write Single-Read mode

In this stage shown in Figure 6, all reads and writes are directed only to the target database on AWS. This is the “Point of no Return”, as new data is written to the target database alone. The source database falls behind, and can be taken offline for eventual retirement.

Dealing with differences in database features

Apart from acting as a switch, the job of the middleware layer in this design pattern is to accept, translate, and forward the generic database call. For example, when it receives a “Put” call, it must invoke the “Put” API on the specific underlying target. After due translation, it follows the rules governing the corresponding service. The middleware layer does this twice for two different underlying databases, when operated in Dual-Write or Dual-Read modes.

You must deal with differences in databases in terms of specific features, limits, and limitations. The following is a non-exhaustive list of such areas:

  1. Specific quantitative limits: DynamoDB imposes a size limit of 16 MB on Transactions. This limit is likely different for the source database.
  2. Behavioral differences for features like indices: Cloud Datastore supports writing empty values to indexed fields which DynamoDB doesn’t support.
  3. Behavioral differences for primary and secondary keys: Other databases might not treat keys the same way DynamoDB treats its hash and sort keys.
  4. Differences in capacity, throughput, and latency: The middleware may need to throttle or even decline requests. This can happen if it starts operating in an Availability Zone where one underlying database is able to scale, but the other can’t scale.

An object-oriented approach can be an efficient way to deal with such differences. Create a base class encapsulating features that are common to different databases. Then use inheritance and polymorphism to account for the differences. This can ensure reusability, readability, and maintainability.
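
As a sketch of that object-oriented structure, assume a small base class that defines the common operations, with database-specific subclasses overriding only the behavior that differs. The empty-indexed-value difference mentioned earlier is used here as an illustrative example.

class DatabaseAdapter:
    """Base class: behavior common to all underlying databases."""

    def put(self, key, item):
        self._write(key, self.normalize_item(item))

    def normalize_item(self, item):
        # Default: no changes needed.
        return item

    def _write(self, key, item):
        raise NotImplementedError


class DynamoDbAdapter(DatabaseAdapter):
    def normalize_item(self, item):
        # Example difference: drop empty string values that DynamoDB would
        # reject on indexed attributes (Cloud Datastore accepts them).
        return {k: v for k, v in item.items() if v != ""}

    def _write(self, key, item):
        ...  # call the DynamoDB PutItem API here (omitted in this sketch)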

As the AWS Professional Services team has experienced, the resulting tool can be reused several times in a large organization to migrate different application suites. It can potentially be applied to other use cases. These include, but are not limited to:

  1. Support for more storage configurations and databases, and abstraction of the application code base by making it largely agnostic of the underlying database technology
  2. Extensive database compatibility testing using granular migration stages
  3. Modularization and containerization with computational platforms such as Amazon Elastic Kubernetes Service (EKS) or Amazon Elastic Container Service (ECS)

Conclusion

This design pattern showcases the power of abstraction in enabling live database migrations. Several optimizations are possible based on the rate of writes and pre-existing size of the database. The key benefit of this approach is the elimination of the need to make extensive changes in the application layer. This can result in significant savings in terms of effort, time, and cost, especially if different applications are managed by different teams in an organization. In addition, migration to DynamoDB alone can save AWS customers significantly. This depends on the size and access pattern of data, and whether the solution is architected for cost-savings. Refer to the Cost Optimization Pillar of the Well-Architected Framework for further best practices.

Further reading

Field Notes: How to Back Up a Database with KMS Encryption Using AWS Backup

Post Syndicated from Ram Konchada original https://aws.amazon.com/blogs/architecture/field-notes-how-to-back-up-a-database-with-kms-encryption-using-aws-backup/

An AWS security best practice from The 5 Pillars of the AWS Well-Architected Framework is to ensure that data is protected both in transit and at rest. One option is to use SSL/TLS to encrypt data in transit, and use cryptographic keys to encrypt data at rest. To meet your organization’s disaster recovery goals, periodic snapshots of databases should be scheduled and stored across Regions and across administrative accounts. This helps you meet your Recovery Point Objective (RPO) and Recovery Time Objective (RTO) targets.

From a security standpoint, these snapshots should also be encrypted. Consider a scenario where one administrative AWS account is used for running the Amazon Relational Database Service (Amazon RDS) instance and its backups. If this AWS account is compromised, data cannot be recovered from either the production instance or the backups. In addition, Amazon RDS snapshots encrypted using AWS managed customer master keys (CMKs) cannot be copied across accounts. Cross-account backup helps you avoid this situation.

This blog post presents a solution that helps you to perform cross-account backup using AWS Backup service and restore database instance snapshots encrypted using AWS Key Management Service (KMS) CMKs across the accounts. This can help you to meet your security, compliance, and business continuity requirements. Although the solution uses RDS as the database choice, it can be applied to other database services (with some limitations).

Architecture

Figure 1 illustrates the solution described in this blog post.

Figure 1: Reference architecture for Amazon RDS with AWS Backup using KMS over two different accounts

Solution overview

When your resources like Amazon RDS (including Aurora clusters) are encrypted, cross-account copy can only be performed if they are encrypted by AWS KMS customer managed customer master keys (CMK). The default vault is encrypted using Service Master Keys (SMK). Therefore, to perform cross-account backups, you must use CMK encrypted vaults instead of using your default backup vault.

  1. In the source account, create a backup of the Amazon RDS instance encrypted with customer managed key.
  2. Give the backup account access to the customer-managed AWS KMS encryption key used by the source account’s RDS instance.
  3. In the backup account, ensure that you have a backup vault encrypted using a customer-managed key created in the backup account. Also, make sure that this vault is shared with the different account using vault policy.
  4. Copy the encrypted snapshots to the target account. This will re-encrypt snapshots using the target vault account’s AWS KMS encryption keys in the target Region.

Prerequisites

  • Ensure you are in the correct AWS Region of operation.
  • Two AWS accounts within the same AWS Organization.
  • Source account where you have a CMK encrypted Amazon RDS instance.
  • Opt-in to cross-account backup.
  • Backup account to which you will copy the encrypted snapshots.
  • A backup vault encrypted with backup CMK (different from source CMK) in the backup account.
  • An IAM role to perform cross-account backup. You can also use the AWSBackupDefaultServiceRole.

Solution

In this blog post, two accounts are used that are part of the same organization. Ensure that you update your account IDs accordingly.

Source account – 111222333444
AWS Backup account – 666777888999

Create a customer-managed key in the source account

Step 1 – Create CMKs

You can create symmetric or asymmetric CMKs in the AWS Management Console; this solution uses a symmetric CMK. During this process, you will determine the cryptographic configuration of your CMK and the origin of its key material. You cannot change these properties after the CMK is created. You can also set the key policy for the CMK, which can be changed later. Follow key rotation best practices to ensure maximum security.

  1. Choose Create key.
  2. To create a symmetric CMK, for key type choose Symmetric. Use AWS KMS as the key material origin, and choose the single-region key for Regionality.
  3. Choose Next, and proceed with Step 2.

Step 2 – Add labels

  1. Type an alias for the CMK (alias cannot begin with aws/. The aws/ prefix is reserved by Amazon Web Services to represent AWS managed CMKs in your account).
  2. Add a description that identifies the key usage.
  3. Add tags based on an appropriate tagging strategy.
  4. Choose Next, and proceed with Step 3.

Step 3 – Key administrative permissions

Select the IAM users and roles that can administer the CMK. Ensure that least privilege design is implemented when assigning roles and permissions, in addition to following best practices.

Step 4 – Key usage permissions

Next, we will need to define the key usage and permissions. Complete the following steps:

  1. Select the IAM users and roles that can use the CMK for cryptographic operations.
  2. Within the Other AWS accounts section, type the 12-digit account number of the backup account.
Figure 2: Key usage permissions

Step 5 – Key configuration

Review the chosen settings, and press the Finish button to complete key creation.

Figure 3: Key configuration

 

Cross-account access key policy

Read the blog post on sharing custom encryption keys more securely between accounts using AWS Key Management Service for more information.
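
For reference, the statement that grants the backup account use of the source CMK can also be added programmatically. The following is a minimal sketch with a placeholder key ID; in practice, the console wizard in Step 4 writes an equivalent statement for you.

import json
import boto3

kms = boto3.client("kms")

key_id = "1234abcd-12ab-34cd-56ef-1234567890ab"  # placeholder: source account CMK
backup_account = "666777888999"

# put_key_policy replaces the whole policy, so fetch the current one and append.
policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
policy["Statement"].append({
    "Sid": "AllowBackupAccountUseOfTheKey",
    "Effect": "Allow",
    "Principal": {"AWS": f"arn:aws:iam::{backup_account}:root"},
    "Action": [
        "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
        "kms:GenerateDataKey*", "kms:DescribeKey", "kms:CreateGrant",
    ],
    "Resource": "*",
})
kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))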

Step 6 – CMK verification

Within the AWS KMS console page, verify that the CMK key has been successfully created and status is enabled.

Create an Amazon RDS database in source account

  1. Choose the correct AWS Region.
  2. Navigate to RDS through the console search option.
  3. Choose Create a Database option, and choose your Database type.
  4. In Database encryption settings, use the CMK key you created in the preceding steps.
  5. Create the database.
  6. Follow Amazon RDS security best practices.

Create an AWS Backup vault in the source account

  1. On the AWS Backup service, navigate to AWS Backup > AWS Backup vault.
  2. Create a backup vault by specifying the name, and add tags based on an appropriate tagging strategy.

Create an on-demand AWS Backup in the source account

For the purpose of this solution, we will create an on-demand backup from the AWS Backup dashboard. You can also choose an existing snapshot if it is already available.

  1. Choose Create on-demand backup. Choose resource type as Amazon RDS, and select the appropriate database name. Choose the option to create backup now. Complete the setup by providing an appropriate IAM role and tag values (you can use the prepopulated default IAM role).
  2. Wait for the backup to be created, processed, and completed. This may take several hours, depending on the size of the database. Note: this step might fail if the on-demand backup window is too close to, or overlaps, the scheduled backup time determined by the Amazon RDS service. If this occurs, create an on-demand backup after the scheduled backup is completed.
Figure 4: Create on-demand backup

  3. Confirm the status is completed once the backup process has finished.
Figure 5: Completed on-demand backup

  4. If you navigate to the backup vault, you should see the recovery point stored within the source account’s vault.
Figure 6: Backup vault

Prepare AWS Backup account (666777888999)

Create SYMMETRIC CMK in the backup account

Follow the same steps as before to create a symmetric CMK in the backup account. Ensure that you do not grant the source AWS account access to this key.

Figure 7: CMK in backup account

Add IAM policy to users and roles to use CMK created in source account

The key policy in the account that owns the CMK sets the valid range of permissions. However, users and roles in the external account cannot use the CMK until you attach IAM policies that delegate those permissions, or use grants to manage access to the CMK. The IAM policies are set in the external account and should follow the best practices for IAM policies. Review the blog post on sharing custom encryption keys more securely between accounts using AWS Key Management Service for more information.

Create an AWS Backup vault in backup account

In the backup account, navigate to the “Backup vaults” section on the AWS Backup service page, and choose Create Backup vault. Next, provide a backup vault name, and choose the CMK key you previously created. In addition, you can specify Backup vault tags. Finally, press the Create Backup vault button.

Figure 8: Create backup vault

Allow access to the backup vault from organization (or organizational unit)

This step enables multiple accounts within the organization to store snapshots in the backup vault.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "backup:CopyIntoBackupVault",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-XXXXXXXXXX"
                }
            }
        }
    ]
}

Copy recovery point from source account vault to backup account vault

Initiate a recovery point copy by navigating to the backup vault in the source account, and create a copy job. Select the correct Region, provide the backup vault ARN, and press the Copy button.

Figure 9: Copy job initiation

Next, allow the backup account to copy the data back into the source account by adding permissions to your backup vault “sourcebackupvault” access policy.

Figure 10: Allow AWS Backup vault access

Initiate copy job

From the backup vault in the source account, press the Copy button to copy a recovery point to the backup account (shown in Figure 11).

Figure 11: Initiate copy job
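
The same copy can be started with the AWS Backup API. The following is a minimal boto3 sketch; the recovery point and destination vault ARNs are placeholders, and the IAM role must be allowed to copy into the destination vault.

import boto3

backup = boto3.client("backup")

response = backup.start_copy_job(
    RecoveryPointArn="arn:aws:rds:us-east-1:111222333444:snapshot:awsbackup-job-example",  # placeholder
    SourceBackupVaultName="sourcebackupvault",
    DestinationBackupVaultArn="arn:aws:backup:us-east-1:666777888999:backup-vault:targetbackupvault",  # placeholder
    IamRoleArn="arn:aws:iam::111222333444:role/service-role/AWSBackupDefaultServiceRole",
)
print("Copy job started:", response["CopyJobId"])

# Check progress until the job reports COMPLETED.
job = backup.describe_copy_job(CopyJobId=response["CopyJobId"])["CopyJob"]
print("State:", job["State"])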

Verify copy job is successfully completed

Verify that the copy job is completed and the recovery point is copied successfully to the AWS Backup account vault.

Figure 12: Copy job is completed

Restore Amazon RDS database in AWS Backup account

Initiate restore of recovery point

In the backup account, navigate to the backup vault on the AWS Backup service page. Push the Restore button to initiate the recovery job.

Figure 13: Restore recovery point

Restore AWS Backup

The process of restoring the AWS backup will automatically detect the database (DB) engine. Choose the DB instance class, storage type, and the Multi-AZ configuration based on your application requirements. Set the encryption key to the CMK created in the backup account.

Scroll down to bottom of the page, and press the Restore backup button.

Figure 14: Restore backup

Restore job verification

Confirm that Restore job is completed in the Status field.

Figure 15: Restore job verification

Database verification

Once the job completes, the database is restored. This can be verified on the Management Console of the RDS service.

Conclusion

In this blog post, we showed you how to back up AWS KMS encrypted RDS instances across accounts using AWS Backup and CMKs. We also verified the encryption keys used by the source and backup snapshots.

Using AWS Backup cross-account backup and cross-Region copy capabilities, you can quickly restore data to your backup accounts in your preferred AWS Region. This helps AWS Backup users minimize business impact in the event of compromised accounts, unexpected disasters, or service interruptions. You can create consistent database snapshots and recover them across Regions to meet your security, compliance, RTO, and RPO requirements.

Thanks for reading this blog post. If you have any feedback or questions, please add them in the comments section.

 

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Building well-architected serverless applications: Optimizing application performance – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-optimizing-application-performance-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

PERF 1. Optimizing your serverless application’s performance

Evaluate and optimize your serverless application’s performance based on access patterns, scaling mechanisms, and native integrations. This allows you to continuously gain more value per transaction. You can improve your overall experience and make more efficient use of the platform in terms of both value and resources.

Good practice: Measure and optimize function startup time

Evaluate your AWS Lambda function startup time for both performance and cost.

Take advantage of execution environment reuse to improve the performance of your function.

Lambda invokes your function in a secure and isolated runtime environment, and manages the resources required to run your function. When a function is first invoked, the Lambda service creates an instance of the function to process the event. This is called a cold start. After completion, the function remains available for a period of time to process subsequent events. These are called warm starts.

Lambda functions must contain a handler method in your code that processes events. During a cold start, Lambda runs the function initialization code, which is the code outside the handler, and then runs the handler code. During a warm start, Lambda runs the handler code.

Lambda function cold and warm starts

Initialize SDK clients, objects, and database connections outside of the function handler so that they are started during the cold start process. These connections then remain during subsequent warm starts, which improves function performance and cost.

Lambda provides a writable local file system available at /tmp. This is local to each function but shared between subsequent invocations within the same execution environment. You can download and cache assets locally in the /tmp folder during the cold start. This data is then available locally by all subsequent warm start invocations, improving performance.
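
As a simple illustration of this pattern, the following sketch downloads a file from Amazon S3 into /tmp during initialization and reuses it on warm starts; the bucket and key names are placeholders.

import os
import boto3

s3 = boto3.client("s3")

ASSET_PATH = "/tmp/reference-data.json"
BUCKET = "example-assets-bucket"   # placeholder
KEY = "reference-data.json"        # placeholder

# Runs once per execution environment (cold start); warm starts skip the download.
if not os.path.exists(ASSET_PATH):
    s3.download_file(BUCKET, KEY, ASSET_PATH)


def lambda_handler(event, context):
    # The cached file is already on local disk for every warm invocation.
    with open(ASSET_PATH) as f:
        data = f.read()
    return {"bytes_cached": len(data)}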

In the serverless airline example used in this series, the confirm booking Lambda function initializes a number of components during the cold start. These include the Lambda Powertools utilities and creating a session to the Amazon DynamoDB table BOOKING_TABLE_NAME.

import os
import boto3
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit
from botocore.exceptions import ClientError

logger = Logger()
tracer = Tracer()
metrics = Metrics()

session = boto3.Session()
dynamodb = session.resource("dynamodb")
table_name = os.getenv("BOOKING_TABLE_NAME", "undefined")
table = dynamodb.Table(table_name)

Analyze and improve startup time

There are a number of steps you can take to measure and optimize Lambda function initialization time.

You can view the function cold start initialization time using Amazon CloudWatch Logs and AWS X-Ray. A log REPORT line for a cold start includes the Init Duration value. This is the time the initialization code takes to run before the handler.

CloudWatch Logs cold start report line

When X-Ray tracing is enabled for a function, the trace includes the Initialization segment.

X-Ray trace cold start showing initialization segment

A subsequent warm start REPORT line does not include the Init Duration value, and is not present in the X-Ray trace:

CloudWatch Logs warm start report line

X-Ray trace warm start without showing initialization segment

CloudWatch Logs Insights allows you to search and analyze CloudWatch Logs data over multiple log groups. There are some useful searches to understand cold starts.

Understand cold start percentage over time:

filter @type = "REPORT"
| stats
  sum(strcontains(
    @message,
    "Init Duration"))
  / count(*)
  * 100
  as coldStartPercentage,
  avg(@duration)
  by bin(5m)
Cold start percentage over time

Cold start count and InitDuration:

filter @type="REPORT" 
| fields @memorySize / 1000000 as memorySize
| filter @message like /(?i)(Init Duration)/
| parse @message /^REPORT.*Init Duration: (?<initDuration>.*) ms.*/
| parse @log /^.*\/aws\/lambda\/(?<functionName>.*)/
| stats count() as coldStarts, median(initDuration) as medianInitDuration, max(initDuration) as maxInitDuration by functionName, memorySize
Cold start count and InitDuration

Once you have measured cold start performance, there are a number of ways to optimize startup time. For Python, you can use the PYTHONPROFILEIMPORTTIME=1 environment variable.

PYTHONPROFILEIMPORTTIME environment variable

This shows how long each package import takes to help you understand how packages impact startup time.

Python import time

Previously, for the AWS Node.js SDK, you enabled HTTP keep-alive in your code to maintain TCP connections. Enabling keep-alive allows you to avoid setting up a new TCP connection for every request. Since AWS SDK version 2.463.0, you can also set the Lambda function environment variable AWS_NODEJS_CONNECTION_REUSE_ENABLED=1 to make the SDK reuse connections by default.

You can configure Lambda’s provisioned concurrency feature to pre-initialize a requested number of execution environments. This runs the cold start initialization code so that they are prepared to respond immediately to your function’s invocations.
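
As a sketch of how this could be configured programmatically (the function name and alias below are hypothetical; provisioned concurrency targets a published version or alias, not $LATEST), the boto3 Lambda client exposes a provisioned concurrency API:

import boto3

lambda_client = boto3.client("lambda")

# Request 10 pre-initialized execution environments for the "live" alias
response = lambda_client.put_provisioned_concurrency_config(
    FunctionName="confirm-booking",   # hypothetical function name
    Qualifier="live",                 # hypothetical alias
    ProvisionedConcurrentExecutions=10,
)
print(response["Status"])  # IN_PROGRESS while environments are being prepared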

Use Amazon RDS Proxy to pool and share database connections to improve function performance. For additional options for using RDS with Lambda, see the AWS Serverless Hero blog post “How To: Manage RDS Connections from AWS Lambda Serverless Functions”.

Choose frameworks that load quickly on function initialization startup. For example, prefer simpler Java dependency injection frameworks like Dagger or Guice over more complex frameworks such as Spring. When using the AWS SDK for Java, there are some cold start performance optimization suggestions in the documentation. For further Java performance optimization tips, see the AWS re:Invent session, “Best practices for AWS Lambda and Java”.

To minimize deployment packages, choose lightweight web frameworks optimized for Lambda. For example, use MiddyJS, Lambda API JS, or Python Chalice instead of Node.js Express, Python Django, or Flask.

If your function has many objects and connections, consider splitting the function into multiple, specialized functions. These are individually smaller and have less initialization code. I cover designing smaller, single purpose functions from a security perspective in “Managing application security boundaries – part 2”.

Minimize your deployment package size to only its runtime necessities

Smaller functions also allow you to separate functionality. Only import the libraries and dependencies that are necessary for your application processing. Use code bundling when you can to reduce the impact of file system lookup calls. This also includes deployment package size.

For example, if you only use Amazon DynamoDB in the AWS SDK, instead of importing the entire SDK, you can import an individual service. Compare the following three examples as shown in the Lambda Operator Guide:

// Instead of const AWS = require('aws-sdk'), use:
const DynamoDB = require('aws-sdk/clients/dynamodb')

// Instead of const AWSXRay = require('aws-xray-sdk'), use:
const AWSXRay = require('aws-xray-sdk-core')

// Instead of const AWS = AWSXRay.captureAWS(require('aws-sdk')), use:
const dynamodb = new DynamoDB.DocumentClient()
AWSXRay.captureAWSClient(dynamodb.service)

In testing, importing the DynamoDB library instead of the entire AWS SDK was 125 ms faster. Importing the X-Ray core library was 5 ms faster than the X-Ray SDK. Similarly, when wrapping a service initialization, preparing a DocumentClient before wrapping showed a 140-ms gain. Version 3 of the AWS SDK for JavaScript supports modular imports, which can further help reduce unused dependencies.

For additional options for optimizing AWS Node.js SDK imports, see the AWS Serverless Hero blog post.

Conclusion

Evaluate and optimize your serverless application’s performance based on access patterns, scaling mechanisms, and native integrations. You can improve your overall experience and make more efficient use of the platform in terms of both value and resources.

In this post, I cover measuring and optimizing function startup time. I explain cold and warm starts and how to reuse the Lambda execution environment to improve performance. I show a number of ways to analyze and optimize the initialization startup time. I explain how only importing necessary libraries and dependencies increases application performance.

This well-architected question continues in part 2, where I look at designing your function to take advantage of concurrency via asynchronous and stream-based invocations. I cover measuring, evaluating, and selecting optimal capacity units.

For more serverless learning resources, visit Serverless Land.

Manage your Digital Microscopy Data using OMERO on AWS

Post Syndicated from Travis Berkley original https://aws.amazon.com/blogs/architecture/manage-your-digital-microscopy-data-using-omero-on-aws/

The Open Microscopy Environment (OME) consortium develops open-source software and format standards for microscopy data. OME Remote Objects (OMERO) is an open source, image data management platform designed to support digital pathology and cellular biology studies. You can access, share, and work with various biological data. This can include histopathology, high content screening, electron microscopy, and even non-image genotype data. Deploying this open source tool on Amazon Web Services (AWS) allows you to access your image data in a secure central repository. You can take advantage of elastic storage by growing the archive as needed without provisioning excess storage beforehand. OMERO has a web interface, which facilitates data access and visualization. It also supports connection through the OMERO client or other third-party image analysis tools, like CellProfilerTM, QuPath, Fiji, ImageJ, and others.

The challenge of microscopy data

Saint Louis University (SLU) School of Medicine Research Microscopy and Histology Core required a centralized system for both distribution and hosting. The solution had to provide research imaging distribution to both internal and external clients. It also needed to host an educational platform for microscope images. SLU decided that the open source software OMERO was an ideal fit for them.

In order to provide speed, ease of access, and security for the University’s computer networks, SLU decided the solution must be hosted in the cloud. By partnering with AWS, SLU established a robust system for their clients. The privately hosted images on OMERO represent research material databases used by University researchers. OMERO also hosts teaching datasets for resident and fellow education. Other publicly hosted repositories provide access to source images for future publishing standards and regulations. SLU reported that the implementation was extraordinarily smooth for a non-programmer. In addition, the system design allowed for advanced data management to control costs and security.

Reviewing the OMERO architecture

OMERO is a typical three-tier web application, consisting of the following components:

  • OMERO.web provides access to OMERO’s data hierarchies and also enables annotation, organization, and visualization of data. This web browser-based client of OMERO.server exposes the annotation-based data-sharing mechanism.
  • OMERO.server is a middleware server application that provides access to image data and metadata stored in a series of databases. It contains a multi-threaded, image-rendering engine and supports a wide range (>140) of image pyramid formats through the Bio-Formats Java library. This Java application facilitates remote access and interoperability for modern scientific studies. It also exposes an API to allow any OMERO client to access the original data and any derived measurements.
  • OMERO relational database (PostgreSQL) provides the underlying storage facilities. This storage backend contains the processed metadata associated with the binary images, measurement specification, user information, structured annotations, and more.
Figure 1. Architectural diagram for a highly available (HA) deployment of OMERO on AWS including data ingestion options

To achieve the highly available (HA) deployment in the diagram, follow the guidance from this GitHub repository. Since OMERO only supports one writer per mounted network file share, there is one OMERO read+write server and one read-only server in the HA deployment. Otherwise, multiple instances will compete to get first access to Amazon Elastic File System (EFS). If HA is not a requirement, you can lower costs by deploying only the read+write OMERO.server.

OMERO is deployed on AWS using AWS CloudFormation (CFN) templates, which deploy two nested CFN stacks, one for storage and one for compute. The storage template creates an EFS volume and an Amazon Relational Database Service (RDS) instance of PostgreSQL. EFS provides the option to move files to an infrequent access storage class after a certain number of days to save storage cost. RDS has a Multi-AZ option to improve business continuity.

The compute template creates Amazon Elastic Container Service (Amazon ECS) containers for the OMERO web and server functions. You have the option to deploy the OMERO containers on the AWS Fargate or Amazon EC2 launch type. It also creates an Application Load Balancer (ALB) with duration-based stickiness enabled and an AWS Certificate Manager (ACM) certificate for Transport Layer Security (TLS) termination at the ALB. Only the ALB is publicly accessible; the web portal is protected behind it in private subnets. A VPC and subnets are required, which can be obtained via this CFN template. The deployment also requires the hosted zone ID and fully qualified domain name in Amazon Route 53, which are used to validate the TLS certificate. If higher security is not a requirement, there is an option to deploy without the registered domain and the hosted zone in Route 53. You will then be able to access the OMERO web portal through the Application Load Balancer DNS name without TLS encryption.

Additionally, the containers of OMERO.web and OMERO.server can be extended with plugins. The landing page for login can be customized with logos, brands, or disclaimers. Build a new Docker container image with specific configuration changes to enrich the functionality of this open source platform.

You can use Amazon ECS Exec to access the OMERO command line interface (CLI) to import images within the OMERO.server container, running on either the AWS Fargate or EC2 launch type. You can also run Amazon ECS Exec via AWS CloudShell. The OMERO CFN templates enable Amazon ECS Exec commands by default. You only need to install the AWS CLI and the Session Manager plugin on your clients or AWS CloudShell to initiate the commands. When you import images within the OMERO.server container instances, you can use the OMERO in-place import to avoid redundant copies of the image files on Amazon EFS. Alternatively, you can access the Windows desktop OMERO client, OMERO.insight, via the application virtualization service Amazon AppStream 2.0. This connects to the OMERO.server in the same VPC. Amazon AppStream 2.0 allows Amazon S3 to be used as home folder storage, so you can import images directly from Amazon S3 to OMERO.server.

AWS offers multiple options to move your microscopic image data from on premises facilities to the cloud storage, as illustrated in Figure 1:

  1. Use AWS Transfer Family to copy data directly from on premises devices to EFS
  2. Alternatively, transfer data directly from your on-premises Network File System (NFS) to EFS using AWS DataSync. AWS DataSync can also be used to transfer files from S3 to EFS.
  3. Set up AWS Storage Gateway, in particular File Gateway, to move your image files from on premises to Amazon Simple Storage Service (S3) first. A storage lifecycle policy can archive images. You can track the storage activity metrics using Amazon S3 Storage Lens and gain insights on storage cost using cost allocation tags. Once the files are in Amazon S3, you can either set up AWS DataSync to transfer files from S3 to EFS, or directly import files into OMERO.server.

To find the latest development to this solution, check out digital pathology on AWS repository on GitHub.

Conclusion

Researchers and scientists at St. Louis University were able to grow their image repository on AWS without the concern of fixed storage limits. They can scale their compute environment up or down as their research requirements dictate. The managed services, like Amazon ECS and RDS, are able to significantly reduce the operational workloads from researchers. SLU reports that this platform is of great use to their researchers. Other universities, academic medical centers, and pharmaceutical and biotechnology companies can also use this cloud-based image data management platform to collect, visualize, and share access to their image data assets.

Building well-architected serverless applications: Regulating inbound request rates – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-regulating-inbound-request-rates-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Reliability question REL1: How do you regulate inbound request rates?

This post continues part 1 of this reliability question. Previously, I cover controlling inbound request rates using throttling. I go through how to use throttling to control steady-rate and burst rate requests. I show some solutions for performance testing to identify the request rates that your workload can sustain before impacting performance.

Good practice: Use, analyze, and enforce API quotas

API quotas limit the maximum number of requests a given API key can submit within a specified time interval. Metering API consumers provides a better understanding of how different consumers use your workload at sustained and burst rates at any point in time. With this information, you can determine fine-grained rate limiting for multiple quota limits. These limits can be set according to the needs of different consumer groups, and you can adjust them on a regular basis.

Segregate API consumers’ steady-rate requests and their quota into multiple buckets or tiers

Amazon API Gateway usage plans allow your API consumer to access selected APIs at agreed-upon request rates and quotas. These help your consumers meet their business requirements and budget constraints. Create and attach API keys to usage plans to control access to certain API stages. I show how to create usage plans and how to associate them with API keys in “Building well-architected serverless applications: Controlling serverless API access – part 2”.

API key associated with usage plan

You can extract utilization data from usage plans to analyze API usage on a per-API key basis. In the example, I show how to use usage plans to see how many requests are made.

View API key usage

This allows you to generate billing documents and determine whether your customers need higher or lower limits. Have a mechanism to allow customers to request higher limits preemptively. When customers anticipate greater API usage, they can take action proactively.
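
As a minimal sketch, assuming hypothetical usage plan and API key IDs, you could also pull this utilization data programmatically with the boto3 API Gateway client:

import boto3

apigw = boto3.client("apigateway")

# Hypothetical usage plan and API key IDs
usage = apigw.get_usage(
    usagePlanId="a1b2c3",
    keyId="d4e5f6",
    startDate="2021-06-01",
    endDate="2021-06-30",
)

# "items" maps each API key ID to daily [used, remaining] pairs
for key_id, daily in usage.get("items", {}).items():
    total_used = sum(day[0] for day in daily)
    print(f"{key_id}: {total_used} requests in the period")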

API Gateway Lambda authorizers can dynamically associate API keys to a given request. This can be used where you do not control API consumers, or want to associate API keys based on your own criteria. For more information, see the documentation.

You can also visualize usage plans with Amazon QuickSight using enriched API Gateway access logs.

Visualize usage plans with Amazon QuickSight

Define whether your API consumers are end users or machines

Understanding your API consumers helps you manage how they connect to your API. This helps you define a request access pattern strategy, which can distinguish between end users or machines.

Machine consumers make automated connections to your API, which may require a different access pattern to end users. You may decide to prioritize end user consumers to provide a better experience. Machine consumers may be able to handle request throttling automatically.

Best practice: Use mechanisms to protect non-scalable resources

Limit component throughput by enforcing how many transactions it can accept

AWS Lambda functions can scale faster than traditional resources, such as relational databases and cache systems. Protect your non-scalable resources by ensuring that components that scale quickly do not exceed the throughput of downstream systems. This prevents system performance from degrading. There are a number of ways to achieve this, either directly or via buffer mechanisms such as queues and streams.

For relational databases such as Amazon RDS, you can limit the number of connections per user, in addition to the global maximum number of connections. With Amazon RDS Proxy, your applications can pool and share database connections to improve their ability to scale.

Amazon RDS Proxy

For additional options for using RDS with Lambda, see the AWS Serverless Hero blog post “How To: Manage RDS Connections from AWS Lambda Serverless Functions”.

Cache results and only connect to, and fetch data from databases when needed. This reduces the load on the downstream database. Adjust the maximum number of connections for caching systems. Include a caching expiration mechanism to prevent serving stale records. For more information on caching implementation patterns and considerations, see “Caching Best Practices”.
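
A minimal sketch of this pattern caches query results in the execution environment with a time-based expiration (the 60-second TTL is an assumption about acceptable staleness):

import time

_CACHE = {}
TTL_SECONDS = 60  # assumption: one minute of staleness is acceptable here

def get_with_cache(key, fetch_fn):
    # Return a cached value if it is still fresh, otherwise fetch and cache it
    entry = _CACHE.get(key)
    if entry and time.time() - entry["timestamp"] < TTL_SECONDS:
        return entry["value"]
    value = fetch_fn()  # for example, a database query
    _CACHE[key] = {"value": value, "timestamp": time.time()}
    return value

Because this cache lives in the execution environment, each concurrent instance maintains its own copy; a shared cache such as Amazon ElastiCache is an option when consistency across instances matters.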

Lambda provides managed scaling. When a function is first invoked, the Lambda service creates an instance of the function to process the event. This is called a cold start. After completion, the function remains available for a period of time to process subsequent events. These are called warm starts. If other events arrive while the function is busy, Lambda creates more instances of the function to handle these requests concurrently as cold starts. The following example shows 10 events processed in six concurrent requests.

Lambda concurrency

You can control the number of concurrent function invocations to both reserve and limit the maximum concurrency your function can achieve. You can configure reserved concurrency to set the maximum number of concurrent instances for the function. This can protect downstream resources such as a database by ensuring Lambda can only scale up to the number of connections the database can support.

For example, you may have a traditional database or external API that can only support a maximum of 50 concurrent connections. You can set the maximum number of concurrent Lambda functions using the function concurrency settings. Setting the value to 50 ensures that the traditional database or external API is not overwhelmed.
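
As a sketch of setting this limit with the boto3 Lambda client (the function name below is hypothetical):

import boto3

lambda_client = boto3.client("lambda")

# Cap concurrency at the 50 connections the downstream system can support
lambda_client.put_function_concurrency(
    FunctionName="orders-db-writer",  # hypothetical function name
    ReservedConcurrentExecutions=50,
)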

Edit Lambda concurrency

You can also set the reserved concurrency to zero, which throttles all invocations to the function, effectively disabling it in the event of anomalies.

Another solution to protect downstream resources is to use an intermediate buffer. A buffer can persistently store messages in a stream or queue until a receiver processes them. This helps you control how fast messages are processed, which can protect the load on downstream resources.

Amazon Kinesis Data Streams allows you to collect and process large streams of data records in real time, and can act as a buffer. Streams consist of a set of shards that contain a sequence of data records. When using Lambda to process records, it processes one batch of records at a time from each shard.

Kinesis Data Streams controls concurrency at the shard level, meaning that a single shard has a single concurrent invocation. This can reduce downstream calls to non-scalable resources such as a traditional database. Kinesis Data Streams also supports batch windows of up to 5 minutes and configurable batch sizes. These can also be used to control how frequently invocations occur.
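
A sketch of configuring these controls when creating the event source mapping with boto3 (the stream ARN, function name, and values are illustrative):

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    FunctionName="stream-processor",    # hypothetical function name
    StartingPosition="LATEST",
    BatchSize=200,                      # records per batch, up to the service limit
    MaximumBatchingWindowInSeconds=60,  # wait up to 60 seconds to fill a batch
)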

To learn how to manage scaling with Kinesis, see the documentation. To learn more about how Lambda works with Kinesis, read the blog series “Building serverless applications with streaming data”.

Lambda and Kinesis shards

Amazon Simple Queue Service (SQS) is a fully managed serverless message queuing service that enables you to decouple and scale microservices. You can offload tasks from one component of your application by sending them to a queue and processing them asynchronously.

SQS can act as a buffer, using a Lambda function to process the messages. Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages. Lambda reads messages in batches and invokes your function once for each batch. When your function successfully processes a batch, Lambda deletes its messages from the queue.

You can protect downstream resources using the Lambda concurrency controls. This limits the number of concurrent Lambda functions that pull messages off the queue. The messages persist in the queue until Lambda can process them. For more information, see “Using AWS Lambda with Amazon SQS”.
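
A minimal handler sketch for this pattern (the process function is a placeholder for your business logic):

def lambda_handler(event, context):
    # Lambda delivers a batch of SQS messages in event["Records"]
    for record in event["Records"]:
        process(record["body"])
    # Returning without raising an error lets Lambda delete the batch from the queue

def process(body):
    # Placeholder for your business logic
    print(f"Processing message: {body}")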

Lambda and SQS

Conclusion

Regulating inbound requests helps you adapt different scaling mechanisms based on customer demand. You can achieve better throughput for your workloads and make them more reliable by controlling requests to a rate that your workload can support.

In this post, I cover using, analyzing, and enforcing API quotas using usage plans and API keys. I show mechanisms to protect non-scalable resources such as using RDS Proxy to protect downstream databases. I show how to control the number of Lambda invocations using concurrency controls to protect downstream resources. I explain how you can use streams and queues as an intermediate buffer to store messages persistently until a receiver processes them.

In the next post in the series, I cover the second reliability question from the Well-Architected Serverless Lens, building resiliency into serverless applications.

For more serverless learning resources, visit Serverless Land.

Data preparation using an Amazon RDS for MySQL database with AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/data-preparation-using-an-amazon-rds-for-mysql-database-with-aws-glue-databrew/

With AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, or Amazon Aurora and Amazon Relational Database Service (Amazon RDS) databases. You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this post, we use DataBrew to clean data from an RDS database, store the cleaned data in an S3 data lake, and build a business intelligence (BI) report.

Use case overview

For our use case, we use three datasets:

  • A school dataset that contains school details like school ID and school name
  • A student dataset that contains student details like student ID, name, and age
  • A student study details dataset that contains student study time, health, country, and more

The following diagram shows the relation of these tables.

For our use case, this data is collected by a survey organization after an annual exam, and updates are made in Amazon RDS for MySQL using a JavaScript-based frontend application. We join the tables to create a single view and create aggregated data through a series of data preparation steps, and the business team uses the output data to create BI reports.

Solution overview

The following diagram illustrates our solution architecture. We use Amazon RDS to store data, DataBrew for data preparation, Amazon Athena for data analysis with standard SQL, and Amazon QuickSight for business reporting.

The workflow includes the following steps:
  1. Create a JDBC connection for RDS and a DataBrew project. DataBrew does the transformation to find the top performing students across all the schools considered for analysis.
  2. The DataBrew job writes the final output to our S3 output bucket.
  3. After the output data is written, we can create external tables on top of it with Athena CREATE TABLE statements and load partitions with MSCK REPAIR commands (see the example after this list).
  4. Business users can use QuickSight for BI reporting, which fetches data through Athena. Data analysts can also use Athena to analyze the complete refreshed dataset.
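
As a sketch of step 3 (the database, table, S3 locations, and partitioning below are illustrative assumptions), you could submit the DDL and repair command through the boto3 Athena client:

import boto3

athena = boto3.client("athena")

# Hypothetical database, table, S3 locations, and partition column
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS student_db.top_performer_student (
  student_id int,
  full_name string,
  marks int
)
PARTITIONED BY (school_id int)
STORED AS PARQUET
LOCATION 's3://example-databrew-output-bucket/top-performer-student/'
"""

for query in (DDL, "MSCK REPAIR TABLE student_db.top_performer_student"):
    athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": "s3://example-athena-results-bucket/"},
    )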

Prerequisites

To complete this solution, you should have an AWS account.

Prelab setup

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use three mock datasets. You can download the DDL code and data files from GitHub.

  1. Create the RDS for MySQL instance to capture the student health data.
  2. Make sure you have set up the correct security group for Amazon RDS. For more information, see Setting Up a VPC to Connect to JDBC Data Stores.
  3. Create three tables: student_tbl, study_details_tbl, and school_tbl. You can use DDLsql to create the database objects.
  4. Upload the student.csv, study_details.csv, and school.csv files into their respective tables. You can use student.sql, study_details.sql, and school.sql to insert the data in the tables.

Create an Amazon RDS connection

To create your Amazon RDS connection, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.

  1. For Connection name, enter a name (for example, student_db-conn).
  2. For Connection type, select JDBC.
  3. For Database type, choose MySQL.

  1. Provide other parameters like RDS endpoint, port, database name, and database login credentials.

  1. In the Network options section, choose the VPC, subnet, and security group of your RDS instance.
  2. Choose Create connection.

Create your datasets

We have three tables in Amazon RDS: school_tbl, student_tbl, and study_details_tbl. To use these tables, we first need to create a dataset for each table.

To create the datasets, complete the following steps (we walk you through creating the school dataset):

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.

  1. For Dataset name, enter school-dataset.
  2. Choose the connection you created (AwsGlueDatabrew-student-db-conn).
  3. For Table name, enter school_tbl.
  4. Choose Create dataset.

  1. Repeat these steps for the student_tbl and study_details_tbl tables, and name the new datasets student-dataset and study-detail-dataset, respectively.

All three datasets are available to use on the Datasets page.

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project Name, enter my-rds-proj.
  4. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  1. For Select a dataset, select My datasets.
  2. For Dataset name, select study-detail-dataset.

  1. For Role name, choose your AWS Identity and Access Management (IAM) role to use with DataBrew.
  2. Choose Create project.

You can see a success message along with our RDS study_details_tbl table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Open an Amazon RDS project and build a transformation recipe

In a DataBrew interactive session, you can cleanse and normalize your data using over 250 built-in transforms. In this post, we use DataBrew to identify top performing students by performing a few transforms and finding students who got marks greater than or equal to 60 in the last annual exam.

First, we use DataBrew to join all three RDS tables. To do this, we perform the following steps:

  1. Navigate to the project you created.
  2. Choose Join.

  1. For Select dataset, choose student-dataset.
  2. Choose Next.

  1. For Select join type, select Left join.
  2. For Join keys, choose student_id for Table A and deselect student_id for Table B.
  3. Choose Finish.

Repeat the steps for school-dataset based on the school_id key.

  1. Choose MERGE to merge first_name and last_name.
  2. Enter a space as a separator.
  3. Choose Apply.

We now filter the rows based on marks value greater than or equal to 60 and add the condition as a recipe step.

  1. Choose FILTER.

  1. Provide the source column and filter condition and choose Apply.

The final data shows the top performing students, who had marks greater than or equal to 60.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter top-performer-student.

For this post, we use Parquet as the output format.

  1. For File type, choose PARQUET.
  2. For S3 location, enter the S3 path of the output folder.

  1. For Role name, choose an existing role or create a new one.
  2. Choose Create and run job.

  1. Navigate to the Jobs page and wait for the top-performer-student job to complete.

  1. Choose the Destination link to navigate to Amazon S3 to access the job output.

Run an Athena query

Let’s validate the aggregated table output in Athena by running a simple SELECT query. The following screenshot shows the output.

Create reports in QuickSight

Now let’s do our final step of the architecture, which is creating BI reports through QuickSight by connecting to the Athena aggregated table.

  1. On the QuickSight console, choose Athena as your data source.

  1. Choose the database and catalog you have in Athena.
  2. Select your table.
  3. Choose Select.

Now you can create a quick report to visualize your output, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after you receive notification about the completion of the data refresh. We recommend using SPICE storage to get better performance.

Clean up

Delete the following resources that might accrue cost over time:

  • The RDS instance
  • The recipe job top-performer-student
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project my-rds-proj and its associated recipe my-rds-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to create a JDBC connection for an RDS database. We learned how to use this connection to create a DataBrew dataset for each table, and how to reuse this connection multiple times. We also saw how we can bring data from Amazon RDS into DataBrew and seamlessly apply transformations and run recipe jobs that refresh transformed data for BI reporting.


About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Query your Oracle database using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-your-oracle-database-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Oracle as your transactional data store, you may need to join the data in your data lake with Oracle on Amazon Relational Database Service (Amazon RDS), Oracle running on Amazon Elastic Compute Cloud (Amazon EC2), or an on-premises Oracle database, for example to build a dashboard or create consolidated reporting.

In these use cases, Amazon Athena Federated Query allows you to seamlessly access the data in your Oracle database without having to move the data to the S3 data lake. This removes the overhead of managing such jobs.

In this post, we walk you through a step-by-step configuration to set up Athena Federated query using AWS Lambda to access data in Oracle on Amazon RDS.

For this post, we will be using the Oracle Athena Federated query connector developed by Trianz. The runtime includes Oracle XE running on Amazon EC2 and Amazon RDS. Your Oracle instance can be on Amazon RDS, Amazon EC2, or on premises. You can deploy the Trianz Oracle AFQ connector available in the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Amazon Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into Lambda RAM runtime memory, it spills the data to Amazon S3 and is later accessed by Athena.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Oracle instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Oracle XE instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have an Oracle database up and running.

Create a secret for the Oracle instance

Our first step is to create a secret for the Oracle instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Enter the credentials as key-value pairs (username, password) for your Oracle XE instance.

  1. For Secret name, enter a name for your secret. Use the prefix OracleAFQ so it’s easy to find.
  2. Leave the remaining fields at their defaults and choose Next.
  3. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, I use athena-accelerator/oracle.

Configure Athena federation with the Oracle XE instance

To configure Athena federation with your Oracle instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzOracleAthenaJDBC.

  1. For Application name, enter TrianzOracleAthenaJDBC.
  2. For SecretNamePrefix, enter OracleAFQ_XE.
  3. For SpillBucket, enter athena-accelerator/oracle.
  4. For JDBCConnectorConfig, use the format oracle://jdbc:oracle:thin:${secretname}@//hostname:port/servicename.

For example, we enter oracle://jdbc:oracle:thin:${OracleAFQ_XE}@//12.345.67.89:1521/xe.

  1. For DisableSpillEncryption, enter false.
  2. For LambdaFunctionName, enter oracleconnector.
  3. For SecurityGroupID, enter the security group ID where the Oracle instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, use the subnets where the Oracle instance is running with comma separation.

Make sure the subnet is in a VPC and has NAT gateway and internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your federated queries using lambda:trianzoracle against tables in the Oracle database. Here, trianzoracle is the name of the Lambda function created in the previous section of this blog.

lambda:trianzoracle is a reference data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following query joins the dataset between Oracle and the S3 data lake. The schema and table names are illustrative; you can submit the query from the Athena console or programmatically, for example with the boto3 Athena client as shown below.
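
import boto3

athena = boto3.client("athena")

# "lambda:trianzoracle" references the Oracle connector; the schema and
# table names here are hypothetical
QUERY = """
SELECT o.order_id, o.order_total, c.customer_name
FROM "lambda:trianzoracle".sales.orders o
JOIN datalake.customers c
  ON o.customer_id = c.customer_id
LIMIT 10
"""

athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://athena-accelerator/results/"},
)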

Key performance best practices

If you’re considering Athena Federated query with Oracle, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Oracle database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Oracle database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Oracle to your S3 data lake.
  • The star schema is a commonly used data model in Oracle. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Oracle. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Oracle database connection and session limits to ensure there is no excessive queuing. Additionally, size your Oracle database to handle the concurrent connections that federated queries can open.

Conclusion

In this post, you learned how to configure and use Athena Federated query with Oracle. Now you don’t need to wait for all the data in your Oracle data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Oracle for better performance. When queries are well written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Building well-architected serverless applications: Implementing application workload security – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-implementing-application-workload-security-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Security question SEC3: How do you implement application security in your workload?

This post continues part 1 of this security question. Previously, I cover reviewing security awareness documentation such as the Common Vulnerabilities and Exposures (CVE) database. I show how to use GitHub security features to inspect and manage code dependencies. I then show how to validate inbound events using Amazon API Gateway request validation.

Required practice: Store secrets that are used in your code securely

Store secrets such as database passwords or API keys in a secrets manager. Using a secrets manager allows for auditing access, easier rotation, and prevents exposing secrets in application source code. There are a number of AWS and third-party solutions to store and manage secrets.

AWS Partner Network (APN) member Hashicorp provides Vault to keep secrets and application data secure. Vault has a centralized workflow for tightly controlling access to secrets across applications, systems, and infrastructure. You can store secrets in Vault and access them from an AWS Lambda function to, for example, access a database. You can use the Vault Agent for AWS to authenticate with Vault, receive the database credentials, and then perform the necessary queries. You can also use the Vault AWS Lambda extension to manage the connectivity to Vault.

AWS Systems Manager Parameter Store allows you to store configuration data securely, including secrets, as parameter values.

AWS Secrets Manager enables you to replace hardcoded credentials in your code with an API call to Secrets Manager to retrieve the secret programmatically. You can protect, rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. You can also generate secure secrets. By default, Secrets Manager does not write or cache the secret to persistent storage.

Parameter Store integrates with Secrets Manager. For more information, see “Referencing AWS Secrets Manager secrets from Parameter Store parameters.”

To show how Secrets Manager works, deploy the solution detailed in “How to securely provide database credentials to Lambda functions by using AWS Secrets Manager”.

The AWS CloudFormation stack deploys an Amazon RDS MySQL database with a randomly generated password. This is stored in Secrets Manager using a secret resource. A Lambda function behind an API Gateway endpoint returns the record count in a table from the database, using the required credentials. Lambda function environment variables store the database connection details and which secret to return for the database password. The password is not stored as an environment variable, nor in the Lambda function application code.

Lambda environment variables for Secrets Manager

The application flow is as follows:

  1. Clients call the API Gateway endpoint
  2. API Gateway invokes the Lambda function
  3. The Lambda function retrieves the database secrets using the Secrets Manager API
  4. The Lambda function connects to the RDS database using the credentials from Secrets Manager and returns the query results

View the password secret value in the Secrets Manager console, which is randomly generated as part of the stack deployment.

Example password stored in Secrets Manager

The Lambda function includes the following code to retrieve the secret from Secrets Manager. The function then uses it to connect to the database securely.

import json
import os

import boto3
import pymysql

# Connection details and the secret name are passed in as environment variables
secret_name = os.environ['SECRET_NAME']
rds_host = os.environ['RDS_HOST']
name = os.environ['RDS_USERNAME']
db_name = os.environ['RDS_DB_NAME']
region_name = os.environ['AWS_REGION']

# Retrieve the database password from Secrets Manager
session = boto3.session.Session()
client = session.client(
	service_name='secretsmanager',
	region_name=region_name
)
get_secret_value_response = client.get_secret_value(
	SecretId=secret_name
)
...
secret = get_secret_value_response['SecretString']
j = json.loads(secret)
password = j['password']
...
# Connect to the database using the retrieved credentials
conn = pymysql.connect(
	host=rds_host, user=name, passwd=password, db=db_name, connect_timeout=5)

Browsing to the endpoint URL specified in the CloudFormation output displays the number of records. This confirms that the Lambda function has successfully retrieved the secure database credentials and queried the table for the record count.

Lambda function retrieving database credentials

Audit secrets access through a secrets manager

Monitor how your secrets are used to confirm that the usage is expected, and log any changes to them. This helps to ensure that any unexpected usage or change can be investigated, and unwanted changes can be rolled back.

Hashicorp Vault uses Audit devices that keep a detailed log of all requests and responses to Vault. Audit devices can append logs to a file, write to syslog, or write to a socket.

Secrets Manager supports logging API calls with AWS CloudTrail. CloudTrail captures all API calls for Secrets Manager as events. This includes calls from the Secrets Manager console and from code calling the Secrets Manager APIs.

Viewing the CloudTrail event history shows the requests to secretsmanager.amazonaws.com. This shows the requests from the console in addition to the Lambda function.

CloudTrail showing access to Secrets Manager

Secrets Manager also works with Amazon EventBridge so you can trigger alerts when administrator-specified operations occur. You can configure EventBridge rules to alert on deleted secrets or secret rotation. You can also create an alert if anyone tries to use a secret version while it is pending deletion. This can identify and alert when there is an attempt to use an out-of-date secret.
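
As an illustrative sketch (the rule name and target ARN are hypothetical), an EventBridge rule matching Secrets Manager DeleteSecret calls recorded by CloudTrail could be created with boto3:

import json

import boto3

events = boto3.client("events")

# Match Secrets Manager DeleteSecret API calls delivered via CloudTrail
events.put_rule(
    Name="alert-on-secret-deletion",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.secretsmanager"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["secretsmanager.amazonaws.com"],
            "eventName": ["DeleteSecret"],
        },
    }),
)

# Send matching events to a hypothetical SNS topic for alerting
events.put_targets(
    Rule="alert-on-secret-deletion",
    Targets=[{
        "Id": "sns-alert",
        "Arn": "arn:aws:sns:us-east-1:123456789012:secret-alerts",
    }],
)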

Enforce least privilege access to secrets

Access to secrets must be tightly controlled because the secrets contain sensitive information. Create AWS Identity and Access Management (IAM) policies that enable minimal access to secrets to prevent credentials being accidentally used or compromised. Secrets that have policies that are too permissive could be misused by other environments or developers. This can lead to accidental data loss or compromised systems. For more information, see “Authentication and access control for AWS Secrets Manager”.

Rotate secrets frequently

Rotating your workload secrets is important. This prevents misuse of your secrets since they become invalid within a configured time period.

Secrets Manager allows you to rotate secrets on a schedule or on demand. This enables you to replace long-term secrets with short-term ones, significantly reducing the risk of compromise. Secrets Manager creates a CloudFormation stack with a Lambda function to manage the rotation process for you. Secrets Manager has native integrations with Amazon RDS, Amazon Redshift, and Amazon DocumentDB. It populates the function with the Amazon Resource Name (ARN) of the secret. You specify the permissions to rotate the credentials, and how often you want to rotate the secret.

The CloudFormation stack creates a MySecretRotationSchedule resource with a MyRotationLambda function to rotate the secret every 30 days.

MySecretRotationSchedule:
  Type: AWS::SecretsManager::RotationSchedule
  DependsOn: SecretRDSInstanceAttachment
  Properties:
    SecretId: !Ref MyRDSInstanceRotationSecret
    RotationLambdaARN: !GetAtt MyRotationLambda.Arn
    RotationRules:
      AutomaticallyAfterDays: 30

MyRotationLambda:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: python3.7
    Role: !GetAtt MyLambdaExecutionRole.Arn
    Handler: mysql_secret_rotation.lambda_handler
    Description: 'This is a lambda to rotate MySql user passwd'
    FunctionName: 'cfn-rotation-lambda'
    CodeUri: 's3://devsecopsblog/code.zip'
    Environment:
      Variables:
        SECRETS_MANAGER_ENDPOINT: !Sub 'https://secretsmanager.${AWS::Region}.amazonaws.com'

View and edit the rotation settings in the Secrets Manager console.

Secrets Manager rotation settings

Manually rotate the secret by selecting Rotate secret immediately. This invokes the Lambda function, which updates the database password and updates the secret in Secrets Manager.

View the updated secret in Secrets Manager, where the password has changed.

Secrets Manager password change

Browse to the endpoint URL to confirm you can still access the database with the updated credentials.

Access endpoint with updated Secret Manager password

You can provide your own code to customize a Lambda rotation function for other databases or services. The code includes the commands required to interact with your secured service to update or add credentials.
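
The rotation function is invoked with a Step value that walks through the four rotation phases. A minimal skeleton of such a handler, with the service-specific logic left as placeholders, might look like this:

import boto3

secrets_client = boto3.client("secretsmanager")

def lambda_handler(event, context):
    # Secrets Manager invokes the rotation function with these fields
    arn = event["SecretId"]
    token = event["ClientRequestToken"]
    step = event["Step"]

    if step == "createSecret":
        # Generate a new secret value and stage it as AWSPENDING
        new_password = secrets_client.get_random_password()["RandomPassword"]
        # ... call put_secret_value with VersionStages=["AWSPENDING"] ...
    elif step == "setSecret":
        # Update the credential in the secured service (for example, the database)
        pass
    elif step == "testSecret":
        # Verify that the pending credential works against the service
        pass
    elif step == "finishSecret":
        # Move the AWSCURRENT stage to the new version to complete rotation
        pass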

Conclusion

Implementing application security in your workload involves reviewing and automating security practices at the application code level. By implementing code security, you can protect against emerging security threats. You can improve the security posture by checking for malicious code, including third-party dependencies.

In this post, I continue from part 1, looking at securely storing, auditing, and rotating secrets that are used in your application code.

In the next post in the series, I start to cover the reliability pillar from the Well-Architected Serverless Lens with regulating inbound request rates.

For more serverless learning resources, visit Serverless Land.

Overview of Data Transfer Costs for Common Architectures

Post Syndicated from Birender Pal original https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/

Data transfer charges are often overlooked while architecting a solution in AWS. Considering data transfer charges while making architectural decisions can help save costs. This blog post will help identify potential data transfer charges you may encounter while operating your workload on AWS. Service charges are out of scope for this blog, but should be carefully considered when designing any architecture.

Data transfer between AWS and internet

There is no charge for inbound data transfer across all services in all Regions. Data transfer from AWS to the internet is charged per service, with rates specific to the originating Region. Refer to the pricing pages for each service—for example, the pricing page for Amazon Elastic Compute Cloud (Amazon EC2)—for more details.

Data transfer within AWS

Data transfer within AWS could be from your workload to other AWS services, or it could be between different components of your workload.

Data transfer between your workload and other AWS services

When your workload accesses AWS services, you may incur data transfer charges.

Accessing services within the same AWS Region

If the internet gateway is used to access the public endpoint of the AWS services in the same Region (Figure 1 – Pattern 1), there are no data transfer charges. If a NAT gateway is used to access the same services (Figure 1 – Pattern 2), there is a data processing charge (per gigabyte (GB)) for data that passes through the gateway.

Figure 1. Accessing AWS services in same Region

Accessing services across AWS Regions

If your workload accesses services in different Regions (Figure 2), there is a charge for data transfer across Regions. The charge depends on the source and destination Region (as described on the Amazon EC2 Data Transfer pricing page).

Figure 2. Accessing AWS services in different Region

Data transfer within different components of your workload

Charges may apply if there is data transfer between different components of your workload. These charges vary depending on where the components are deployed.

Workload components in same AWS Region

Data transfer within the same Availability Zone is free. One way to achieve high availability for a workload is to deploy in multiple Availability Zones.

Consider a workload with two application servers running on Amazon EC2 and a database running on Amazon Relational Database Service (Amazon RDS) for MySQL (Figure 3). For high availability, each application server is deployed into a separate Availability Zone. Here, data transfer charges apply for cross-Availability Zone communication between the EC2 instances. Data transfer charges also apply between Amazon EC2 and Amazon RDS. Consult the Amazon RDS for MySQL pricing guide for more information.

Figure 3. Workload components across Availability Zones

To minimize impact of a database instance failure, enable a multi-Availability Zone configuration within Amazon RDS to deploy a standby instance in a different Availability Zone. Replication between the primary and standby instances does not incur additional data transfer charges. However, data transfer charges will apply from any consumers outside the current primary instance Availability Zone. Refer to the Amazon RDS pricing page for more detail.

A common pattern is to deploy workloads across multiple VPCs in your AWS network. Two approaches to enabling VPC-to-VPC communication are VPC peering connections and AWS Transit Gateway. Data transfer over a VPC peering connection that stays within an Availability Zone is free. Data transfer over a VPC peering connection that crosses Availability Zones will incur a data transfer charge for ingress/egress traffic (Figure 4).

Figure 4. VPC peering connection

Transit Gateway can interconnect hundreds or thousands of VPCs (Figure 5). Cost elements for Transit Gateway include an hourly charge for each attached VPC, AWS Direct Connect, or AWS Site-to-Site VPN. Data processing charges apply for each GB sent from a VPC, Direct Connect, or VPN to Transit Gateway.

Figure 5. VPC peering using Transit Gateway in same Region

Workload components in different AWS Regions

If workload components communicate across multiple Regions using VPC peering connections or Transit Gateway, additional data transfer charges apply. If the VPCs are peered across Regions, standard inter-Region data transfer charges will apply (Figure 6).

Figure 6. VPC peering across Regions

For peered Transit Gateways, you will incur data transfer charges on only one side of the peer. Data transfer charges do not apply for data sent from a peering attachment to a Transit Gateway. The data transfer for this cross-Region peering connection is in addition to the data transfer charges for the other attachments (Figure 7).

Figure 7. Transit Gateway peering across Regions

Data transfer between AWS and on-premises data centers

Data transfer will occur when your workload needs to access resources in your on-premises data center. There are two common options to help achieve this connectivity: Site-to-Site VPN and Direct Connect.

Data transfer over AWS Site-to-Site VPN

One option to connect workloads to an on-premises network is to use one or more Site-to-Site VPN connections (Figure 8 – Pattern 1). These charges include an hourly charge for the connection and a charge for data transferred from AWS. Refer to Site-to-Site VPN pricing for more details. Another option to connect multiple VPCs to an on-premises network is to use a Site-to-Site VPN connection to a Transit Gateway (Figure 8 – Pattern 2). The Site-to-Site VPN will be considered another attachment on the Transit Gateway. Standard Transit Gateway pricing applies.

Site-to-Site VPN patterns

Figure 8. Site-to-Site VPN patterns

Data transfer over AWS Direct Connect

Direct Connect can be used to connect workloads in AWS to on-premises networks. Direct Connect incurs a fee for each hour the connection port is used and data transfer charges for data flowing out of AWS. Data transfer into AWS is $0.00 per GB in all locations. The data transfer charges depend on the source Region and the Direct Connect provider location. Direct Connect can also connect to the Transit Gateway if multiple VPCs need to be connected (Figure 9). Direct Connect is considered another attachment on the Transit Gateway and standard Transit Gateway pricing applies. Refer to the Direct Connect pricing page for more details.

Direct Connect patterns

Figure 9. Direct Connect patterns

A Direct Connect gateway can be used to share a Direct Connect across multiple Regions. When using a Direct Connect gateway, there will be outbound data charges based on the source Region and Direct Connect location (Figure 10).

Direct Connect gateway

Figure 10. Direct Connect gateway

General tips

Data transfer charges apply based on the source, destination, and amount of traffic. Here are some general tips for when you start planning your architecture:

  • Avoid routing traffic over the internet when connecting to AWS services from within AWS by using VPC endpoints:
    • VPC gateway endpoints allow communication to Amazon S3 and Amazon DynamoDB without incurring data transfer charges (see the sketch after this list).
    • VPC interface endpoints are available for some AWS services. This type of endpoint incurs hourly service charges and data transfer charges.
  • Use Direct Connect instead of the internet for sending data to on-premises networks.
  • Traffic that crosses an Availability Zone boundary typically incurs a data transfer charge. Use resources from the local Availability Zone whenever possible.
  • Traffic that crosses a Regional boundary will typically incur a data transfer charge. Avoid cross-Region data transfer unless your business case requires it.
  • Use the AWS Free Tier. Under certain circumstances, you may be able to test your workload free of charge.
  • Use the AWS Pricing Calculator to help estimate the data transfer costs for your solution.
  • Use a dashboard to better visualize data transfer charges – this workshop will show how.
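
As a sketch of the first tip above, the following boto3 call creates an S3 gateway endpoint so that S3-bound traffic stays on the AWS network without data transfer charges; the Region, VPC ID, and route table ID are placeholders:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Create a gateway endpoint for Amazon S3 (IDs are placeholders).
ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=['rtb-0123456789abcdef0']
)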

Conclusion

AWS provides the ability to deploy across multiple Availability Zones and Regions. With a few clicks, you can create a distributed workload. As you increase your footprint across AWS, it helps to understand various data transfer charges that may apply. This blog post provided information to help you make an informed decision and explore different architectural patterns to save on data transfer costs.

Better performance for less: AWS continues to beat Azure on SQL Server price/performance

Post Syndicated from Fred Wurden original https://aws.amazon.com/blogs/compute/sql-server-runs-better-on-aws/

By Fred Wurden, General Manager, AWS Enterprise Engineering (Windows, VMware, RedHat, SAP, Benchmarking)

AWS R5b.8xlarge delivers better performance at lower cost than Azure E64_32s_v4 for a SQL Server workload

In this blog, we will review a recent benchmark that Principled Technologies published on 2/25. The benchmark found that an Amazon Elastic Compute Cloud (Amazon EC2) R5b.8xlarge instance delivered better performance for a SQL Server workload at a lower cost when directly tested against an Azure E64_32s_v4 VM.

Behind the study: Understanding how SQL Server performed better, for a lower cost with an AWS EC2 R5b instance

Principled Technologies tested an online transaction processing (OLTP) workload for SQL Server 2019 on both an R5b instance on Amazon EC2 with Amazon Elastic Block Store (EBS) as storage and Azure E64_32s_v4. This particular Azure VM was chosen as an equivalent to the R5b instance, as both instances have comparable specifications for input/output operations per second (IOPS) performance, use Intel Xeon processors from the same generation (Cascade Lake), and offer the same number of cores (32). For storage, Principled Technologies mirrored storage configurations across the Azure VM and the EC2 instance (which used Amazon Elastic Block Store (EBS)), maxing out the IOPS specs on each while offering a direct comparison between instances.

Test Configurations

Source: Principled Technologies

When benchmarking, Principled Technologies ran a TPC-C-like OLTP workload from HammerDB v3.3 on both instances, testing against new orders per minute (NOPM) performance. NOPM shows the number of new-order transactions completed in one minute as part of a serialized business workload. HammerDB claims that because NOPM is “independent of any particular database implementation [it] is the recommended primary metric to use.”

The results: SQL Server on the AWS EC2 R5b instance delivered 2x the performance of the Azure VM at 62% lower cost

Graphs that show AWS instance outperformed the Azure instance

Source: Principled Technologies

These test results from the Principled Technologies report show the price/performance and performance comparisons. The performance metric is new orders per minute (NOPM); higher is better. The price/performance calculation is based on the cost of on-demand, License Included SQL Server instances and storage required to achieve 1,000 NOPM; lower is better.

An EC2 r5b.8xlarge instance powered by an Intel Xeon Scalable processor delivered better SQL Server NOPM performance on the HammerDB benchmark and a lower price per 1,000 NOPM than an Azure E64_32s_v4 VM powered by similar Intel Xeon Scalable processors.

On top of that, AWS’s storage price-performance exceeded Azure’s. The Azure managed disks offered 53 percent more storage than the EBS volumes, yet the EC2 instance with EBS storage cost 24 percent less than the Azure VM with managed disks. Even if the Azure storage were reduced by that difference (something customers cannot actually do), EBS would still have cost 13 percent less per GB than the Azure managed disks.

Why AWS is the best cloud to run your Windows and SQL Server workloads

To us, these results aren’t surprising. In fact, they’re in line with the success that customers have found running Windows on AWS for over 12 years. Customers like Pearson and Expedia have found better performance and enhanced cost savings by moving their Windows, SQL Server, and .NET workloads to AWS. RepricerExpress, for example, migrated its Windows and SQL Server environments from Azure to AWS to slash outbound bandwidth costs while gaining agility and performance.

Not only do we offer better price-performance for your Windows workloads, but we also offer better ways to run Windows in the cloud. Whether you want to rehost your databases on EC2, move to a managed service with Amazon Relational Database Service (Amazon RDS) for SQL Server, or even modernize to cloud-native databases, AWS stands ready to help you get the most out of the cloud.

 


To learn more on migrating Windows Server or SQL Server, visit Windows on AWS. For more stories about customers who have successfully migrated and modernized SQL Server workloads with AWS, visit our Customer Success page. Contact us to start your migration journey today.

Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS

Post Syndicated from Islam Ghanim original https://aws.amazon.com/blogs/architecture/field-notes-stopping-an-automatically-started-database-instance-with-amazon-rds/

Customers who need to keep an Amazon Relational Database Service (Amazon RDS) instance stopped for more than 7 days look for ways to efficiently re-stop the database after it is automatically started by Amazon RDS. If the database is started and there is no mechanism to stop it, customers start to pay for the instance’s hourly cost. Moreover, customers with database licensing agreements could incur penalties for running beyond their licensed cores/users.

Stopping and starting a DB instance is faster than creating a DB snapshot, and then restoring the snapshot. However, if you plan to keep the Amazon RDS instance stopped for an extended period of time, it is advised to terminate your Amazon RDS instance and recreate it from a snapshot when needed.

This blog provides a step-by-step approach to automatically stop an RDS instance once the auto-restart activity is complete. This saves any costs incurred once the instance is turned on. The proposed architecture is fully serverless and requires no management overhead. It relies on AWS Step Functions and a set of Lambda functions to monitor RDS instance state and stop the instance when required.

Architecture overview

Given the autonomous nature of the architecture and to avoid management overhead, the architecture leverages serverless components.

  • The architecture relies on RDS event notifications. Once a stopped RDS instance is started by AWS because it exceeded the maximum time in the stopped state, RDS generates an event (RDS-EVENT-0154).
  • The RDS event is pushed to a dedicated SNS topic rds-event-notifications-topic.
  • The Lambda function start-statemachine-execution-lambda is subscribed to the SNS topic rds-event-notifications-topic.
    • The function filters messages with event code: RDS-EVENT-0154. In order to restrict the ‘force shutdown’ activity further, the function validates that the RDS instance is tagged with auto-restart-protection and that the tag value is set to ‘yes’.
    • Once all conditions are met, the Lambda function starts the AWS Step Functions state machine execution.
  • The AWS Step Functions state machine integrates with two Lambda functions in order to retrieve the instance state, as well as attempt to stop the RDS instance.
    • In case the instance state is not ‘available’, the state machine waits for 5 minutes and then re-checks the state.
    • Finally, when the Amazon RDS instance state is ‘available’, the state machine attempts to stop the Amazon RDS instance.

Prerequisites

In order to implement the steps in this post, you need an AWS account as well as an IAM user with permissions to provision and delete resources of the following AWS services:

  • Amazon RDS
  • AWS Lambda
  • AWS Step Functions
  • AWS CloudFormation
  • Amazon SNS
  • AWS IAM

Architecture implementation

You can implement the architecture using the AWS Management Console or the AWS CLI. For faster deployment, the complete architecture is available on GitHub.

The steps below explain how to build the end-to-end architecture from within the AWS Management Console:

Create an SNS topic

  • Open the Amazon SNS console.
  • On the Amazon SNS dashboard, under Common actions, choose Create Topic.
  • In the Create new topic dialog box, for Topic name, enter a name for the topic (rds-event-notifications-topic).
  • Choose Create topic.
  • Note the Topic ARN for the next task (for example, arn:aws:sns:us-east-1:111122223333:my-topic).
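
If you prefer to script this step, the same topic can be created with boto3; this is a minimal sketch:

import boto3

sns = boto3.client('sns')

# Create the topic that will receive RDS event notifications and print its ARN.
response = sns.create_topic(Name='rds-event-notifications-topic')
print(response['TopicArn'])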

Configure RDS event notifications

Amazon RDS uses Amazon Simple Notification Service (Amazon SNS) to provide notification when an Amazon RDS event occurs. These notifications can be in any notification form supported by Amazon SNS for an AWS Region, such as an email, a text message, or a call to an HTTP endpoint.

For this architecture, RDS generates an event indicating that an instance has automatically restarted because it exceeded the maximum duration to remain stopped. This specific RDS event (RDS-EVENT-0154) belongs to the ‘notification’ category. For more information, visit Using Amazon RDS Event Notification.

To subscribe to an RDS event notification

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose Event subscriptions.
  • In the Event subscriptions pane, choose Create event subscription.
  • In the Create event subscription dialog box, do the following:
    • For Name, enter a name for the event notification subscription (RdsAutoRestartEventSubscription).
    • For Send notifications to, choose the SNS topic created in the previous step (rds-event-notifications-topic).
    • For Source type, choose ‘Instances’, because our source will be RDS instances.
    • For Instances to include, choose ‘All instances’. Instances are included or excluded based on the auto-restart-protection tag; this keeps the architecture generic and avoids additional configuration as instances are added.
    • For Event categories to include, choose ‘Select specific event categories’.
    • For Specific event, choose ‘notification’. This is the category under which the RDS event of interest falls. For more information, review Using Amazon RDS Event Notification.
    •  Choose Create.
    • The Amazon RDS console indicates that the subscription is being created.
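
The subscription can also be created programmatically. The following boto3 sketch mirrors the console steps; the topic ARN is a placeholder:

import boto3

rds = boto3.client('rds')

# Subscribe the SNS topic to 'notification' category events from all DB instances.
rds.create_event_subscription(
    SubscriptionName='RdsAutoRestartEventSubscription',
    SnsTopicArn='arn:aws:sns:us-east-1:111122223333:rds-event-notifications-topic',
    SourceType='db-instance',
    EventCategories=['notification'],
    Enabled=True
)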

Create Lambda functions

Following are the three Lambda functions required for the architecture to work:

  1. start-statemachine-execution-lambda subscribes to the newly created SNS topic (rds-event-notifications-topic) and starts the AWS Step Functions state machine execution.
  2. retrieve-rds-instance-state-lambda is triggered by the AWS Step Functions state machine to retrieve an RDS instance state (for example, available or stopped).
  3. stop-rds-instance-lambda is triggered by the AWS Step Functions state machine to attempt to stop an RDS instance.

First, create the Lambda functions’ execution role.

To create an execution role

  • Open the roles page in the IAM console.
  • Choose Create role.
  • Create a role with the following properties.
    • Trusted entity – Lambda.
    • Permissions – AWSLambdaBasicExecutionRole.
    • Role name – rds-auto-restart-lambda-role.
    • The AWSLambdaBasicExecutionRole policy has the permissions that the function needs to write logs to CloudWatch Logs.

Now, create a new policy and attach it to the role to allow the Lambda functions to start an AWS Step Functions state machine execution, stop an Amazon RDS instance, retrieve the RDS instance status, and list and add tags.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "rds:AddTagsToResource",
                "rds:ListTagsForResource",
                "rds:DescribeDBInstances",
                "states:StartExecution",
                "rds:StopDBInstance"
            ],
            "Resource": "*"
        }
    ]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name (rds-auto-restart-lambda-policy) and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.

To link the new policy to the AWS Lambda execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role created for the three Lambda functions. After selecting the identities, choose Attach policy.

Following the principle of least privilege, it is recommended to create three separate roles, each restricting a function’s access to only the resources it needs.

Repeat the following steps three times to create the three Lambda functions. The only differences between the three functions are their code and triggers:

  • Open the Lambda console.
  • Choose Create function.
  • Configure the following settings:
    • Name
      • start-statemachine-execution-lambda
      • retrieve-rds-instance-state-lambda
      • stop-rds-instance-lambda
    • Runtime – Python 3.8.
    • Role – Choose an existing role.
    • Existing role – rds-auto-restart-lambda-role.
    • Choose Create function.
    • To configure a test event, choose Test.
    • For Event name, enter test.
  • Choose Create.
  • For the Lambda function —  start-statemachine-execution-lambda, use the following Python 3.8 sample code:
import json
import boto3
import logging
import os

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):

    #log input event
    LOGGER.info("RdsAutoRestart Event Received, now checking if event is eligible. Event Details ==> ", event)

    #Input event from the SNS topic originated from RDS event notifications
    snsMessage = json.loads(event['Records'][0]['Sns']['Message'])
    rdsInstanceId = snsMessage['Source ID']
    stepFunctionInput = {"rdsInstanceId": rdsInstanceId}
    rdsEventId = snsMessage['Event ID']

    #Retrieve RDS instance ARN
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceArn = db_instance['DBInstanceArn']

    # Filter on the Auto Restart RDS Event. Event code: RDS-EVENT-0154. 

    if 'RDS-EVENT-0154' in rdsEventId:

        #log input event
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        #Verify that the instance is tagged with the auto-restart-protection tag. The tag is used to identify instances that should be stopped again once auto-started.

        tagCheckPass = 'false'
        rdsInstanceTags = rdsClient.list_tags_for_resource(ResourceName=rdsInstanceArn)
        for rdsInstanceTag in rdsInstanceTags["TagList"]:
            if 'auto-restart-protection' in rdsInstanceTag["Key"]:
                if 'yes' in rdsInstanceTag["Value"]:
                    tagCheckPass = 'true'
                    #log instance tags
                    LOGGER.info("RdsAutoRestart verified that the instance is tagged auto-restart-protection = yes, now starting the Step Functions Flow")
                else:
                    tagCheckPass = 'false'


        #log tag check result
        LOGGER.info("RdsAutoRestart tag check complete, tagCheckPass = %s", tagCheckPass)

        if 'true' in tagCheckPass:

            #Initialise StepFunctions Client
            stepFunctionsClient = boto3.client('stepfunctions')

            # Start StepFunctions WorkFlow
            # StepFunctionsArn is stored in an environment variable
            stepFunctionsArn = os.environ['STEPFUNCTION_ARN']
            stepFunctionsResponse = stepFunctionsClient.start_execution(
                stateMachineArn=stepFunctionsArn,
                name=event['Records'][0]['Sns']['MessageId'],
                input=json.dumps(stepFunctionInput)
            )

    else:

        LOGGER.info("RdsAutoRestart Event detected, and event is not eligible")

    return {
            'statusCode': 200
        }

Then, configure an SNS trigger for the function start-statemachine-execution-lambda. RDS event notifications will be published to this SNS topic:

  • In the Designer pane, choose Add trigger.
  • In the Trigger configurations pane, select SNS as a trigger.
  • For SNS topic, choose the SNS topic previously created (rds-event-notifications-topic)
  • For Enable trigger, keep it checked.
  • Choose Add.
  • Choose Save.

For the Lambda function — retrieve-rds-instance-state-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    

    #log input event
    LOGGER.info(event)
    
    #rdsInstanceId is passed as input to the lambda function from the AWS StepFunctions state machine.  
    rdsInstanceId = event['rdsInstanceId']
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceState = db_instance['DBInstanceStatus']
    return {
        'statusCode': 200,
        'rdsInstanceState': rdsInstanceState,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

For the Lambda function, stop-rds-instance-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    
    #log input event
    LOGGER.info(event)
    
    rdsInstanceId = event['rdsInstanceId']
    
    #Stop RDS instance
    rdsClient.stop_db_instance(DBInstanceIdentifier=rdsInstanceId)
    
    #Tagging
    
    
    return {
        'statusCode': 200,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

Create a Step Functions state machine

AWS Step Functions will execute the following service logic:

  1. Retrieve RDS instance state by calling Lambda function, retrieve-rds-instance-state-lambda. The Lambda function then returns the parameter, rdsInstanceState.
  2. If the rdsInstanceState parameter value is ‘available’, then the state machine will step into the next action calling the Lambda function, stop-rds-instance-lambda. If the rdsInstanceState is not ‘available’, the state machine will then wait for 5 minutes and then re-check the RDS instance state again.
  3. Stopping an RDS instance is an asynchronous operation and accordingly the state machine will keep polling the instance state once every 5 minutes until the rdsInstanceState parameter value becomes ‘stopped’. Only then, the state machine execution will complete successfully.

  • An RDS instance’s path to the ‘available’ state can vary depending on the maintenance activities scheduled for the instance.
  • Once the RDS notification event is generated, the instance goes through multiple states before it becomes ‘available’.
  • The 5-minute timer ensures the automation flow keeps attempting to stop the instance once it becomes available.
  • The second polling loop ensures the flow doesn’t end until the instance status changes to ‘stopped’, at which point the system administrator is notified.

To create an AWS Step Functions state machine

  • Sign in to the AWS Management Console and open the AWS Step Functions console.
  • In the navigation pane, choose State machines.
  • In the State machines pane, choose Create state machine.
  • On the Define state machine page, choose Author with code snippets. For Type, choose Standard.
  • Enter a Name for your state machine, stop-rds-instance-statemachine.
  • In the State machine definition pane, add the following state machine definition using the ARNs of the two Lambda functions created earlier, as shown in the following code sample:
{
  "Comment": "stop-rds-instance-statemachine: Automatically shutting down RDS instance after a forced Auto-Restart",
  "StartAt": "retrieveRdsInstanceState",
  "States": {
    "retrieveRdsInstanceState": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceAvailable"
    },
    "isInstanceAvailable": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "available",
          "Next": "stopRdsInstance"
        }
      ],
      "Default": "waitFiveMinutes"
    },
    "waitFiveMinutes": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRdsInstanceState"
    },
    "stopRdsInstance": {
      "Type": "Task",
      "Resource": "stop-rds-instance-lambda Arn",
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "retrieveRDSInstanceStateStopping": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceStopped"
    },
    "isInstanceStopped": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "stopped",
          "Next": "notifyDatabaseAdmin"
        }
      ],
      "Default": "waitFiveMinutesStopping"
    },
    "waitFiveMinutesStopping": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "notifyDatabaseAdmin": {
      "Type": "Pass",
      "Result": "World",
      "End": true
    }
  }
}

This is the state machine definition, written in Amazon States Language, which describes the execution flow of the AWS Step Functions state machine.
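
If you script your deployments, the same state machine can be created through the API. The following boto3 sketch assumes the definition above is saved to a local file and that an execution role already exists; the file name and role ARN are placeholders. The console steps below achieve the same result and also create the execution role for you.

import boto3

sfn = boto3.client('stepfunctions')

# Load the Amazon States Language definition shown above (file name is a placeholder).
with open('stop-rds-instance-statemachine.json') as f:
    definition = f.read()

# Create the state machine; the role must be allowed to invoke the Lambda functions.
sfn.create_state_machine(
    name='stop-rds-instance-statemachine',
    definition=definition,
    roleArn='arn:aws:iam::111122223333:role/rds-auto-restart-stepfunctions-role'
)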

Choose Next.

  • In the Name pane, enter a name for your state machine, stop-rds-instance-statemachine.
  • In the Permissions pane, choose Create new role. Take note of the new role’s name displayed at the bottom of the page (for example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd).
  • Choose Create state machine
  • By default, the created role only grants the state machine access to CloudWatch Logs. Because the state machine has to invoke Lambda functions, another IAM policy must be attached to the new role.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "*"
        }
    ]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name rds-auto-restart-stepfunctions-policy and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy.
  • Choose Create policy to save your work.

To link the new policy to the AWS Step Functions execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role created for the state machine (for example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd). After selecting the identities, choose Attach policy.

 

Testing the architecture

To test the architecture, create a test RDS instance, tag it with the auto-restart-protection tag, and set the tag value to yes. While the RDS instance is still being created, test the Lambda function start-statemachine-execution-lambda with a sample event that simulates the instance being started because it exceeded the maximum time to remain stopped (RDS-EVENT-0154).

To invoke a function

  • Sign in to the AWS Management Console and open the Lambda console.
  • In navigation pane, choose Functions.
  • In Functions pane, choose start-statemachine-execution-lambda.
  • In the upper right corner, choose Test.
  • In the Configure test event page, choose Create new test event and in Event template, leave the default Hello World option.
    {
    "Records": [
        {
        "EventSource": "aws:sns",
        "EventVersion": "1.0",
        "EventSubscriptionArn": "<RDS Event Subscription Arn>",
        "Sns": {
            "Type": "Notification",
            "MessageId": "10001-2d55da-9a73-5e42d46748c0",
            "TopicArn": "<SNS Topic Arn>",
            "Subject": "RDS Notification Message",
            "Message": "{\"Event Source\":\"db-instance\",\"Event Time\":\"2020-07-09 15:15:03.031\",\"Identifier Link\":\"https://console.aws.amazon.com/rds/home?region=<region>#dbinstance:id=<RDS instance id>\",\"Source ID\":\"<RDS instance id>\",\"Event ID\":\"http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154\",\"Event Message\":\"DB instance started\"}",
            "Timestamp": "2020-07-09T15:15:03.991Z",
            "SignatureVersion": "1",
            "Signature": "YsuM+L6N8rk+pBPBWoWeRcSuYqo/BN5v9D2lyoSg0B0uS46Q8NZZSoZWaIQi25TXfHY3RYXCXF9WbVGXiWa4dJs2Mjg46anM+2j6z9R7BDz0vt25qCrCyWhmWtc7yeETrlwa0jCtR/wxXFFexRwynqlZeDfvQpf/x+KNLrnJlT61WZ2FMTHYs124RwWU8NY3pm1Os0XOIvm8rfv3ywm1ccZfP4rF7Lfn+2EK6a0635Z/5aiyIlldNZxbgRYTODJYroO9INTlF7NPzVV1Y/K0E9aaL/wQgLZNquXQGCAxPFWy5lxJKeyUocOWcG48KJGIBUC36JJaqVdIilbZ9HvxTg==",
            "SigningCertUrl": "https://sns.<region>.amazonaws.com/SimpleNotificationService-a86cb10b4e1f29c941702d737128f7b6.pem",
            "UnsubscribeUrl": "https://sns.<region>.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=<arn>",
            "MessageAttributes": {}
        }
        }
    ]
    }
start-statemachine-execution-lambda uses the SNS MessageId parameter as the name for the AWS Step Functions execution. Execution names must be unique for a period of time; accordingly, change the MessageId parameter value with every test run.
  • Choose Create and then choose Test. Each user can create up to 10 test events per function. Those test events are not available to other users.
  • AWS Lambda executes your function on your behalf. The handler in your Lambda function receives and then processes the sample event.
  • Upon successful execution, view results in the console.
  • The Execution result section shows the execution status as succeeded and also shows the function execution results, returned by the return statement. Following is a sample response of the test execution:

Now, verify the execution of the AWS Step Functions state machine:

  • Sign in to the AWS Management Console and open the AWS Step Functions console.
  • In navigation pane, choose State machines.
  • In the State machine pane, choose stop-rds-instance-statemachine.
  • In the Executions pane, choose the execution with the Name value passed in the test event MessageId parameter.
  • In the Visual workflow pane, the real-time execution status is displayed:

  • Under the Step details tab, all details related to inputs, outputs and exceptions are displayed:

Monitoring

It is recommended to use Amazon CloudWatch to monitor all the components of this architecture. AWS Step Functions logs the state of the execution and the inputs and outputs of each step in the flow, so when things go wrong, you can diagnose and debug problems quickly.

Cost

When you build the architecture using serverless components, you pay for what you use with no upfront infrastructure costs. Cost will depend on the number of RDS instances tagged to be protected against an automatic start.

Architectural considerations

This architecture has to be deployed per AWS Account per Region.

Conclusion

This blog post demonstrated how to build a fully serverless architecture that monitors and stops RDS instances that AWS has automatically restarted, which also helps you avoid falling behind on required maintenance updates. The architecture saves the costs incurred by a started instance’s running hours and any licensing implications. Feel free to submit enhancements to the GitHub repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers

Updating opt-in status for Amazon Pinpoint channels

Post Syndicated from Varinder Dhanota original https://aws.amazon.com/blogs/messaging-and-targeting/updating-opt-in-status-for-amazon-pinpoint-channels/

In many real-world scenarios, customers are using home-grown or 3rd party systems to manage their campaign related information. This includes user preferences, segmentation, targeting, interactions, and more. To create customer-centric engagement experiences with such existing systems, migrating or integrating into Amazon Pinpoint is needed. Luckily, many AWS services and mechanisms can help to streamline this integration in a resilient and cost-effective way.

In this blog post, we demonstrate a sample solution that captures changes from an on-premises application’s database by utilizing AWS Integration and Transfer Services and updates Amazon Pinpoint in real-time.

If you are looking for a serverless, mobile-optimized preference center allowing end users to manage their Pinpoint communication preferences and attributes, you can also check the Amazon Pinpoint Preference Center.

Architecture

Architecture

In this scenario, users’ SMS opt-in/opt-out preferences are managed by a home-grown customer application. Users interact with the application over its web interface. The application saves the customer preferences in a MySQL database.

This solution’s flow of events is triggered by a change (insert / update / delete) happening in the database. The change event is then captured by AWS Database Migration Service (DMS), which is configured with an ongoing replication task. This task continuously monitors a specified database and forwards the change event to an Amazon Kinesis Data Streams stream. Raw events that are buffered in this stream are polled by an AWS Lambda function. This function transforms the event and makes it ready to be passed to the Amazon Pinpoint API. This API call will, in turn, change the opt-in/opt-out subscription status of the channel for that user.

Ongoing replication tasks can be created against multiple database engines, including Oracle, Microsoft SQL Server, PostgreSQL, and more. In this blog post, we use a MySQL-based RDS instance to demonstrate this architecture. The instance will have a database we name pinpoint_demo and one table we name optin_status. In this sample, we assume the table holds details about a user and their opt-in preference for SMS messages.

userid    phone            optin    lastupdate
user1     +12341111111     1        1593867404
user2     +12341111112     1        1593867404
user3     +12341111113     1        1593867404

Prerequisites

  1. AWS CLI is configured with an active AWS account and appropriate access.
  2. You have an understanding of Amazon Pinpoint concepts. You will be using Amazon Pinpoint to create a segment, populate endpoints, and validate phone numbers. For more details, see the Amazon Pinpoint product page and documentation.

Setup

First, you clone the repository that contains a stack of templates to your local environment. Make sure you have configured your AWS CLI with AWS credentials. Follow the steps below to deploy the CloudFormation stack:

  1. Clone the git repository containing the CloudFormation templates:
    git clone https://github.com/aws-samples/amazon-pinpoint-rds-integration.git
    cd amazon-pinpoint-rds-integration
  2. You need an S3 Bucket to hold the template:
    aws s3api create-bucket --bucket <YOUR-BUCKET-NAME>
  3. Run the following command to package the CloudFormation templates:
    aws cloudformation package --template-file template_stack.yaml --output-template-file template_out.yaml --s3-bucket <YOUR-BUCKET-NAME>
  4. Deploy the stack with the following command:
    aws cloudformation deploy --template-file template_out.yaml --stack-name pinpointblogstack --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM

The AWS CloudFormation stack will create and configure resources for you. Some of the resources it will create are:

  • Amazon RDS instance with MySQL
  • AWS Database Migration Service replication instance
  • AWS Database Migration Service source endpoint for MySQL
  • AWS Database Migration Service target endpoint for Amazon Kinesis Data Streams
  • Amazon Kinesis Data Streams stream
  • AWS Lambda Function
  • Amazon Pinpoint Application
  • A Cloud9 environment as a bastion host

The deployment can take up to 15 minutes. You can track its progress in the CloudFormation console’s Events tab.
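
You can also wait for the deployment to finish programmatically. A small boto3 sketch using the built-in CloudFormation waiter:

import boto3

cfn = boto3.client('cloudformation')

# Block until the stack reaches CREATE_COMPLETE (an exception is raised if creation fails).
cfn.get_waiter('stack_create_complete').wait(StackName='pinpointblogstack')
print('Stack created')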

Populate RDS data

Upon completion, the CloudFormation stack outputs the DNS address of the RDS endpoint and the Cloud9 environment. The Cloud9 environment acts as a bastion host and allows you to reach the RDS instance endpoint deployed into the private subnet by CloudFormation.

  1. Open the AWS Console and navigate to the Cloud9 service.
    Cloud9Console
  2. Click on the Open IDE button to reach your IDE environment.
    Cloud9Env
  3. At the console pane of your IDE, type the following to log in to your RDS instance. You can find the RDS endpoint address in the Outputs section of the CloudFormation stack, under the key name RDSInstanceEndpoint.
    mysql -h <YOUR_RDS_ENDPOINT> -uadmin -pmypassword
    use blog_db;
  4. Issue the following command to create a table that holds the user’s opt-in status:
    create table optin_status (
      userid varchar(50) not null,
      phone varchar(50) not null,
      optin tinyint default 1,
      lastupdate TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
  5. Next, load sample data into the table. The following inserts nine users for this demo:
    
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user1', '+12341111111', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user2', '+12341111112', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user3', '+12341111113', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user4', '+12341111114', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user5', '+12341111115', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user6', '+12341111116', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user7', '+12341111117', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user8', '+12341111118', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user9', '+12341111119', 1);
  6. The table’s optin column holds the SMS opt-in status for the user identified by the userid and phone columns.

Start the DMS Replication Task

Now that the environment is ready, you can start the DMS replication task and start watching the changes in this table.

  1. From the AWS DMS Console, go to the Database Migration Tasks section.
    DMSMigTask
  2. Select the Migration task named blogreplicationtask.
  3. From the Actions menu, click on Restart/Resume to start the migration task. Wait until the task’s Status transitions from Ready through Starting to Replication ongoing.
  4. At this point, all changes on the source database are replicated into the Kinesis stream. The next sections look at the AWS Lambda function that polls this stream and the Amazon Pinpoint application it updates.

Inspect the AWS Lambda Function

An AWS Lambda function has been created to receive the events. The Lambda function uses Python and Boto3 to read the records delivered by Kinesis Data Streams. It then performs the update_endpoint API calls in order to add, update, or delete endpoints in the Amazon Pinpoint application.

Lambda code and configuration is accessible through the Lambda Functions Console. In order to inspect the Python code, click the Functions item on the left side. Select the function starting with pinpointblogstack-MainStack by clicking on the function name.

Note the PINPOINT_APPID variable under the Environment variables section. This variable provides the Lambda function with the Amazon Pinpoint application ID used in the API call.

LambdaPPAPPID
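
The exact code is part of the deployed stack; the following is only a simplified sketch of the general shape of such a handler. It assumes the default DMS-to-Kinesis JSON message format (data and metadata keys) and uses the userid column as the Pinpoint endpoint ID; delete handling and error handling are omitted.

import base64
import json
import os

import boto3

pinpoint = boto3.client('pinpoint')
APP_ID = os.environ['PINPOINT_APPID']

def lambda_handler(event, context):
    # Each Kinesis record carries a base64-encoded DMS change event.
    for record in event['Records']:
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        row = payload.get('data', {})
        if not row:
            continue
        # Map the optin column to Pinpoint's OptOut attribute.
        opt_out = 'NONE' if row.get('optin') == 1 else 'ALL'
        pinpoint.update_endpoint(
            ApplicationId=APP_ID,
            EndpointId=row['userid'],
            EndpointRequest={
                'ChannelType': 'SMS',
                'Address': row['phone'],
                'OptOut': opt_out
            }
        )
    return {'statusCode': 200}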

Inspect Amazon Pinpoint Application in Amazon Pinpoint Console

A Pinpoint application is needed by the Lambda Function to update the endpoints. This application has been created with an SMS Channel by the CloudFormation template. Once the data from the RDS database has been imported into Pinpoint as SMS endpoints, you can validate this import by creating a segment in Pinpoint.

PinpointProject

Testing

With the Lambda function ready, you now test the whole solution.

  1. To initiate the end-to-end test, go to the Cloud9 terminal. Perform the following SQL statements on the optin_status table:
    UPDATE optin_status SET optin=0 WHERE userid='user1';
    UPDATE optin_status SET optin=0 WHERE userid='user2';
    UPDATE optin_status SET optin=0 WHERE userid='user3';
    UPDATE optin_status SET optin=0 WHERE userid='user4';
  2. These statements cause four changes in the database, which are collected by DMS and passed to the Kinesis Data Streams stream.
  3. This triggers the Lambda function, which constructs an update_endpoint API call to the Amazon Pinpoint application.
  4. The update_endpoint operation is an upsert operation. Therefore, if the endpoint does not exist on the Amazon Pinpoint application, it creates one. Otherwise, it updates the current endpoint.
  5. In the initial dataset, all the opt-in values are 1. Therefore, these endpoints will be created with an OptOut value of NONE in Amazon Pinpoint.
  6. All endpoints with OptOut=NONE are considered active endpoints. Therefore, they are available to be used within segments.
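
To spot-check a single endpoint after the update, you can read it back with boto3. In this sketch, the application ID is a placeholder and the endpoint ID follows the userid convention described above:

import boto3

pinpoint = boto3.client('pinpoint')

# Read one endpoint and print its opt-out status ('NONE' = active, 'ALL' = opted out).
endpoint = pinpoint.get_endpoint(
    ApplicationId='<PINPOINT_APP_ID>',
    EndpointId='user1'
)['EndpointResponse']
print(endpoint.get('OptOut'))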

Create Amazon Pinpoint Segment

  1. In order to see these changes, go to the Pinpoint console. Click on PinpointBlogApp.
    PinpointConsole
  2. Click on Segments on the left side. Then click Create a segment.
    PinpointSegment
  3. For the segment name, enter US-Segment.
  4. Select Endpoint from the Filter dropdown.
  5. Under the Choose an endpoint attribute dropdown, select Country.
  6. For Choose values enter US.
    Note: As you do this, the right panel Segment estimate will refresh to show the number of endpoints eligible for this segment filter.
  7. Click Create segment at the bottom of the page.
    PinpointSegDetails
  8. Once the new segment is created, you are directed to the newly created segment with configuration details. You should see five eligible endpoints corresponding to database table rows.
    PinpointSegUpdate
  9. Now, change one row by issuing the following SQL statement. This simulates a user opting out from SMS communication for one of their numbers.
    UPDATE optin_status SET optin=0 WHERE userid='user5';
  10. After the update, go to the Amazon Pinpoint console. Check the eligible endpoints again. You should only see four eligible endpoints.

PinpointSegUpdate

Cleanup

If you no longer want to incur further charges, delete the CloudFormation stack named pinpointblogstack. Select it and choose Delete.

PinpointCleanup

Conclusion

This solution walks you through how opt-in change events are delivered from Amazon RDS to Amazon Pinpoint. You can use this solution in other use cases as well. Some examples are importing segments from a third-party application like Salesforce and importing other types of channels like email, push, and voice. To learn more about Amazon Pinpoint, visit our website.

Building AWS Glue Spark ETL jobs by bringing your own JDBC drivers for Amazon RDS

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/building-aws-glue-spark-etl-jobs-by-bringing-your-own-jdbc-drivers-for-amazon-rds/

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue has native connectors to connect to supported data sources either on AWS or elsewhere using JDBC drivers. Additionally, AWS Glue now enables you to bring your own JDBC drivers (BYOD) to your Glue Spark ETL jobs. This feature enables you to connect to data sources with custom drivers that aren’t natively supported in AWS Glue, such as MySQL 8 and Oracle 18. You can also use multiple JDBC driver versions in the same AWS Glue job, enabling you to migrate data between source and target databases with different versions. For more information, see Connection Types and Options for ETL in AWS Glue.

This post shows how to build AWS Glue Spark ETL jobs and set up connections with custom drivers for Oracle 18 and MySQL 8 databases.

Solution overview

We discuss three different use cases in this post, using AWS Glue, Amazon RDS for MySQL, and Amazon RDS for Oracle.

In the following architecture, we connect to Oracle 18 using an external ojdbc7.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

In the second scenario, we connect to MySQL 8 using an external mysql-connector-java-8.0.19.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to MySQL 8.

In the third scenario, we set up a connection where we connect to Oracle 18 and MySQL 8 using external drivers from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

Prerequisites

Before getting started, you must complete the following prerequisites:

  1. Create an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console. Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template provided in this post.
  2. Create an IAM policy for AWS Glue.
  3. Before setting up the AWS Glue job, you need to download drivers for Oracle and MySQL, which we discuss in the next section.

Downloading drivers for Oracle and MySQL

To download the required drivers for Oracle and MySQL, complete the following steps:

  1. Download the MySQL JDBC connector.
  2. Select the operating system as platform independent and download the .tar.gz or .zip file (for example, mysql-connector-java-8.0.19.tar.gz or mysql-connector-java-8.0.19.zip) and extract it.
  3. Pick the MySQL connector .jar file (such as mysql-connector-java-8.0.19.jar) and upload it to your Amazon Simple Storage Service (Amazon S3) bucket.
  4. Make a note of that path because you use it later in the AWS Glue job to point to the JDBC driver.
  5. Similarly, download the Oracle JDBC connector (ojdbc7.jar).

This post was tested with the mysql-connector-java-8.0.19.jar and ojdbc7.jar drivers, but based on your database types, you can download and use an appropriate version of the JDBC drivers supported by the database.
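
If you prefer to script the uploads, the following boto3 sketch copies both drivers to the location referenced later by customJdbcDriverS3Path; the bucket and folder names are placeholders:

import boto3

s3 = boto3.client('s3')

# Upload the JDBC drivers to S3 (bucket and folder are placeholders).
s3.upload_file('mysql-connector-java-8.0.19.jar', '<Bucket>', '<Folder>/mysql-connector-java-8.0.19.jar')
s3.upload_file('ojdbc7.jar', '<Bucket>', '<Folder>/ojdbc7.jar')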

  1. Upload the Oracle JDBC 7 driver (ojdbc7.jar) to your S3 bucket.
  2. Make a note of that path, because you use it in the AWS Glue job to establish the JDBC connection with the database.
  3. Make sure to upload the three scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) to an S3 bucket.
  4. Save the following code as OracleBYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_oracle18_options_source_emp = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "employee",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    # Read DynamicFrame from Oracle 
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle 
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)

  1. Save the following code as MySQLBYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    connection_mysql8_options_source_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_mysql8_options_target_emp_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "emp_dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    
    # Read from JDBC databases with custom driver
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    
    #print "Applied mapping to the Glue DynamicFrame"
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="mysql", connection_options=connection_mysql8_options_target_emp_dept)
    

  1. Save the following code as CrossDB_BYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    # Read the employee DynamicFrame from MySQL and the dept DynamicFrame from Oracle
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)
    
    

Provisioning resources with AWS CloudFormation

The generic workflow of setting up a connection with your own custom JDBC drivers involves various steps. It’s a manual configuration that is error prone and adds overhead when repeating the steps between environments and accounts. With AWS CloudFormation, you can provision your application resources in a safe, repeatable manner, allowing you to build and rebuild your infrastructure and applications without having to perform manual actions or write custom scripts. The declarative code in the file captures the intended state of the resources to create, and allows you to automate the creation of AWS resources.

We provide this CloudFormation template for you to use. Review and customize it to suit your needs. Some of the resources deployed by this stack incur costs as long as they remain in use, like Amazon RDS for Oracle and Amazon RDS for MySQL.
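
If you prefer to launch the reviewed and customized template programmatically instead of through the console, the following is a minimal boto3 sketch; the stack name, template URL, and parameter keys are placeholders that you would align with the actual template:

import boto3

cfn = boto3.client('cloudformation', region_name='us-east-1')

# Placeholder template URL and parameter keys -- align these with the
# template you downloaded or customized.
cfn.create_stack(
    StackName='glue-byod-stack',
    TemplateURL='https://s3.amazonaws.com/<your-bucket>/glue-byod-template.yaml',
    Parameters=[
        {'ParameterKey': 'MySQLPassword', 'ParameterValue': 'MYSQLadmin123'},
        {'ParameterKey': 'OraclePassword', 'ParameterValue': 'oraadmin123'},
    ],
    Capabilities=['CAPABILITY_NAMED_IAM'],  # only needed if the template creates IAM resources
)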

This CloudFormation template creates the following resources:

  • A VPC
  • Two subnets
  • A route table
  • An internet gateway
  • A MySQL 8 database
  • An Oracle 18 database

To provision your resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Next.
  4. Choose Launch Stack:

This step automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console as required.

  1. For Stack name, enter a name.
  2. Change the other parameters as needed or keep the following default values:
    1. Oracle user name: oraadmin
    2. Oracle password: oraadmin123
    3. MySQL username: MySQLadmin
    4. MySQL password: MYSQLadmin123

  1. Choose Next.
  2. Choose Next.
  3. Review the details and choose Create.

This stack creation can take up to 20 minutes.

After the stack creation is complete, go to the Outputs tab on the AWS CloudFormation console and note the following values (you use these in later steps):

  • MySQLJDBCConnectionString
  • OracleJDBCConnectionString

Configuring an AWS Glue ETL job using your own drivers

Before creating an AWS Glue ETL, run the SQL script (database_scripts.sql) on both the databases (Oracle and MySQL) to create tables and insert data. For more information about connecting to the RDS DB instance, see How can I troubleshoot connectivity to an Amazon RDS DB instance that uses a public or private subnet of a VPC?

To set up AWS Glue connections, complete the following steps:

  1. On the AWS Glue console, under Databases, choose Connections.
  2. Choose Add Connection.
  3. For Connection Name, enter a name for your connection.
  4. For Connection Type, choose JDBC.
  5. For JDBC URL, enter a URL, such as jdbc:oracle:thin://@<hostname>:1521/ORCL for Oracle or jdbc:mysql://<hostname>:3306/mysql for MySQL.
  6. Enter the user name and password for the database.
  7. Select the VPC in which you created the RDS instance (Oracle and MySQL).
  8. Choose the subnet within your VPC. Refer to the CloudFormation stack Outputs tab for the subnet name.
  9. Choose the security group of the database. Refer to the CloudFormation stack Outputs tab for security group name.
  10. Choose Next.
  11. Check the connection details and choose Finish.

Make sure to add a connection for both databases (Oracle and MySQL).

Creating endpoints and a security group

Before testing the connection, make sure you create an AWS Glue endpoint and S3 endpoint in the VPC in which databases are created. Complete the following steps for both Oracle and MySQL instances:

  1. To create your AWS Glue endpoint, on the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose AWS Glue.
  4. Choose com.amazonaws.<region>.glue (for example, com.amazonaws.us-west-2.glue).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL instance.
  6. Choose the security group of the RDS instances.
  7. Choose Create endpoint.

To create your S3 endpoint, you use Amazon Virtual Private Cloud (Amazon VPC).

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose Amazon S3.
  4. Choose com.amazonaws.<region>.s3 (for example, com.amazonaws.us-west-2.s3).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL instance.
  6. Choose the route table ID.
  7. Choose Create endpoint.

The RDS for Oracle or RDS for MySQL security group must include itself as a source in its inbound rules.

  1. On the Security Groups page, choose Edit Inbound Rules.
  2. Choose Add rule.
  3. For Type, choose All Traffic.
  4. For Source, choose the same security group (for example, glue-byod-stack1…).
  5. Choose Save Rules.

If both databases are in the same VPC and subnet, you don’t need to create a separate connection for the MySQL and Oracle databases. The purpose of the AWS Glue connection to the databases is to establish a private connection between the RDS instances in the VPC and AWS Glue via the S3 endpoint, AWS Glue endpoint, and Amazon RDS security group. You don’t need to test the JDBC connection, because that connection is established by the AWS Glue job when you run it. If you test the connection with MySQL 8, it fails because the AWS Glue connection doesn’t support the MySQL 8.0 driver at the time of writing this post, which is why you need to bring your own driver.

Setting up AWS Glue ETL jobs

You’re now ready to set up your ETL job in AWS Glue. Complete the following steps for both connections:

  1. Edit the following parameters in the scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) and upload them in Amazon S3:
    1. url
    2. user
    3. password
    4. customJdbcDriverS3Path for sources and target tables

You can find the database endpoints (url) on the CloudFormation stack Outputs tab; the other parameters are mentioned earlier in this post. If you use another driver, make sure to change customJdbcDriverClassName to the corresponding class in the driver.

Alternatively, you can pass these values as AWS Glue job parameters and retrieve them in the script using getResolvedOptions, as sketched below.
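
For example, the following is a minimal sketch of reading the connection details from job parameters instead of hard-coding them; the parameter names (db_url, db_user, db_password, driver_path) are illustrative and not part of the provided scripts:

import sys
from awsglue.utils import getResolvedOptions

# Pass --db_url, --db_user, --db_password, and --driver_path as job parameters
# on the AWS Glue console or CLI; getResolvedOptions reads them from sys.argv.
args = getResolvedOptions(sys.argv, ['db_url', 'db_user', 'db_password', 'driver_path'])

connection_mysql8_options_source_emp = {
    "url": args['db_url'],
    "dbtable": "employee",
    "user": args['db_user'],
    "password": args['db_password'],
    "customJdbcDriverS3Path": args['driver_path'],
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}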

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Add Job.
  3. For Job Name, enter a name.
  4. For IAM role, choose the IAM role you created as a prerequisite.
  5. For Type, choose Spark.
  6. For Glue Version, choose Python (latest version).
  7. For This job runs, choose An existing script that you provide.
  8. Choose the Amazon S3 path where the script (OracleBYOD.py, MySQLBYOD.py, or CrossDB_BYOD.py) is stored.
  9. Under Advanced properties, enable Job bookmark.

Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.
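
Note that bookmarks only take effect when the script initializes and commits a Job object; the BYOD scripts import Job but never initialize it. If you rely on bookmarks, the pattern looks roughly like the following sketch (JDBC sources may also need additional bookmark key configuration):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# JOB_NAME is passed automatically by AWS Glue when the job runs.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # start bookmark tracking for this run

# ... read, transform, and write DynamicFrames as in the scripts above ...

job.commit()  # persist bookmark state so the next run skips processed data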

  1. Keep the remaining settings as their defaults and choose Next.
  2. For Connections, choose the Amazon RDS for Oracle connection for OracleBYOD.py, Amazon RDS for MySQL connection for MySQLBYOD.py, or Amazon RDS for Oracle and Amazon RDS for MySQL connection for CrossDB_BYOD.py.
  3. Choose Save job and edit scripts.
  4. Choose Run Job.
  5. When the job is complete, validate the data loaded in the target table.

Cleaning up

After you finish, don’t forget to delete the CloudFormation stack, because some of the AWS resources deployed by the stack in this post incur a cost as long as you continue to use them.

You can delete the CloudFormation stack to delete all AWS resources created by the stack.

  1. On the AWS CloudFormation console, on the Stacks page, select the stack to delete. The stack must be currently running.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted.

Summary

In this post, we showed you how to build AWS Glue ETL Spark jobs and set up connections with custom drivers with Oracle18 and MySQL8 databases using AWS CloudFormation. You can use this solution to use your custom drivers for databases not supported natively by AWS Glue.

If you have any questions or suggestions, please leave a comment.


About the Authors

Srikanth Sopirala is a Sr. Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family and road biking.

 

 

Naresh Gautam is a Sr. Analytics Specialist Solutions Architect at AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.

 

 

Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/sharing-amazon-redshift-data-securely-across-amazon-redshift-clusters-for-workload-isolation/

Amazon Redshift data sharing allows for a secure and easy way to share live data for read purposes across Amazon Redshift clusters. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query runs.

In this post, we discuss how to use Amazon Redshift data sharing to achieve workload isolation across diverse analytics use cases and achieve business-critical SLAs. For more information about this new feature, see Announcing Amazon Redshift data sharing (preview).

How to use Amazon Redshift data sharing

Amazon Redshift data sharing allows a producer cluster to share data objects to one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data.

Data sharing between Amazon Redshift clusters is a two-step process. First, the producer cluster administrator that wants to share data creates an Amazon Redshift data share, a new named object introduced with this release to serve as a unit of sharing. The producer cluster adds the needed database objects such as schemas, tables, and views to the data share and specifies a list of consumer clusters with which to share the data share. Following that, privileged users on consumer clusters create an Amazon Redshift local database reference from the data share made available to them and grant permissions on the database objects to appropriate users and groups. Users and groups can then list the shared objects as part of the standard metadata queries and start querying immediately.

Solution overview

For this post, we use a use case in which the producer cluster is a central ETL cluster hosting enterprise sales data, a 3 TB Cloud DW benchmark dataset based on the TPC-DS benchmark dataset. This cluster serves multiple BI and data science clusters purpose-built for distinct business groups within the organization. One such group is the sales BI team, who runs BI reports using customer sales data created in the central ETL cluster and joined with the product reviews dataset that they loaded into the BI cluster they manage.

This approach helps the sales BI team isolate data lifecycle management between the enterprise sales dataset in the ETL producer from the product reviews data that they fully manage in the BI consumer cluster to simplify data stewardship. It also allows for agility, allows sizing clusters independently to provide workload isolation, and creates a simple cost charge-back model.

As depicted in the following diagram, the central ETL cluster etl_cluster hosts the sales data in a schema named sales. We demonstrate how to build the semantic layer later in this post. A superuser in etl_cluster then creates a data share named salesdatashare, adds the bi_semantic schema and all objects in that schema to the data share, and grants usage permissions to the BI consumer cluster named bi_cluster. Keep in mind that a data share is simply a metadata container and represents what data is shared from producer to consumer. No data is actually moved.

The superuser in the BI consumer cluster creates a local database reference named sales_semantic from the data share (step 2 in the preceding diagram). The BI users use the product reviews dataset in the local schema named product_reviews and join with bi_semantic data for reporting purposes (step 3).

You can find the script for the product reviews dataset, which we use in this post to load the dataset into bi_cluster. You can load the DW benchmark dataset into etl_cluster using this GitHub link. Loading these datasets into the respective Amazon Redshift clusters is outside the scope of this post, and is a prerequisite to following the instructions we outline.

The following diagram depicts the cloud DW benchmark data model used.

The following table summarizes the data.

Table Name Rows
STORE_SALES 8,639,936,081
CUSTOMER_ADDRESS 15,000,000
CUSTOMER 30,000,000
CUSTOMER_DEMOGRAPHICS 1,920,800
ITEM 360,000
DATE_DIM 73,049

Building a BI semantic layer

A BI semantic layer is a representation of enterprise data in a way that simplifies BI reporting requirements and offers better performance. In our use case, the BI semantic layer transforms sales data to create a customer denormalized dataset and another dataset for all store sales by product in a given year. The following queries are run on the etl_cluster to create the BI semantic layer.

  1. Create a new schema to host BI semantic tables with the following SQL:
    Create schema bi_semantic;

  2. Create a denormalized customer view with select columns required for sales BI team:
    create view bi_semantic.customer_denorm 
    as
    select
    	c_customer_sk,
    	c_customer_id,
    	c_birth_year,
    	c_birth_country,
    	c_last_review_date_sk,
    	ca_city,
    	ca_state,
    	ca_zip,
    	ca_country,
    	ca_gmt_offset,
    	cd_gender,
    	cd_marital_status,
    	cd_education_status
    from sales.customer c, sales.customer_address ca, sales.customer_demographics cd
    where
    c.c_current_addr_sk=ca.ca_address_sk
    and c.c_current_cdemo_sk=cd.cd_demo_sk;

  1. Create a second view for all product sales with columns required for BI team:
    create view bi_semantic.product_sales
    as 
    select 
    	i_item_id,
    	i_product_name,
    	i_current_price,
    	i_wholesale_cost,
    	i_brand_id,
    	i_brand,
    	i_category_id,
    	i_category,
    	i_manufact,
    	d_date,
    	d_moy,
    	d_year,
    	d_quarter_name,
    	ss_customer_sk,
    	ss_store_sk,
    	ss_sales_price,
    	ss_list_price,
    	ss_net_profit,
    	ss_quantity,
    	ss_coupon_amt
    from sales.store_sales ss, sales.item i, sales.date_dim d
    where ss.ss_item_sk=i.i_item_sk
    and ss.ss_sold_date_sk=d.d_date_sk;

Sharing data across Amazon Redshift clusters

Now, let’s share the bi_semantic schema in the etl_cluster with the bi_cluster.

  1. Create a data share in the etl_cluster using the following command when connected to the etl_cluster. The producer cluster superuser and database owners can create data share objects. By default, PUBLICACCESSIBLE is false. If the producer cluster is publicly accessible, you can add PUBLICACCESSIBLE = true to the following command:
    CREATE DATASHARE SalesDatashare;

  1. Add the BI semantic views to the data share. To add objects to the data share, add the schema before adding objects. Use ALTER DATASHARE to share the entire schema; to share tables, views, and functions in a given schema; and to share objects from multiple schemas:
    ALTER DATASHARE SalesDatashare ADD SCHEMA bi_semantic;
    ALTER DATASHARE SalesDatashare ADD ALL TABLES IN SCHEMA bi_semantic;

The next step requires a cluster namespace GUID from the bi_cluster. One way to find the namespace value of a cluster is to run the SQL statement select current_namespace when connected to the bi_cluster. Another way is on the Amazon Redshift console: choose your Amazon Redshift consumer cluster, and find the value under Namespace located in the General information section.

  1. Add consumers to the data share using the following command:
    GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE '1m137c4-1187-4bf3-8ce2-e710b7100eb2';

  1. View the list of the objects added to the share using the following command. The share type is outbound on the producer cluster.
    DESC DATASHARE salesdatashare;

The following screenshot shows our list of objects.

Consuming the data share from the consumer BI Amazon Redshift cluster

From the bi_cluster, let’s review, consume, and set permissions on the data share for end-user consumption.

  1. On the consumer BI cluster, view the data shares using the following command as any user:
    SHOW DATASHARES;

The following screenshot shows our results. Consumers should be able to see the objects within the incoming share but not the full list of consumers associated with the share. For more information about querying the metadata of shares, see DESC DATASHARE.

  1. Start the consumption by creating a local database from the salesdatashare. Cluster users with the permission to do so can create a database from the shares. We use the namespace from the etl_cluster.
    CREATE DATABASE Sales_semantic from DATASHARE SalesDatashare OF NAMESPACE '45b137c4-1287-4vf3-8cw2-e710b7138nd9'; 

Consumers should be able to see databases that they created from the share, along with the databases local to the cluster, at any point by querying SVV_REDSHIFT* tables. Data share objects aren’t available for queries until a local database reference is created using a create database statement.

  1. Run the following command to list the databases in bi_cluster:
    select * from svv_redshift_databases;

The following screenshot shows that both the local and shared databases are listed so that you can explore and navigate metadata for shared datasets.

  1. Grant usage on the database to bi_group, where bi_group is a local Amazon Redshift group with BI users added to that group:
    GRANT USAGE ON DATABASE sales_semantic TO bi_group;

Querying as the BI user

In this section, you connect as a user in the bi_group who got access to the shared data. The user is still connected to the local database on the bi_cluster but can query the shared data via the new cross-database query functionality in Amazon Redshift.

  1. Review the list of objects in the share by running the following SQL:
    SELECT schema_name, table_name, table_type FROM svv_redshift_tables
         where database_name = 'sales_semantic';

The following screenshot shows our results.

  1. Review the list of columns in the customer_denorm view:
    SELECT * FROM  svv_redshift_columns 
       where database_name = 'sales_semantic' and table_name = 'customer_denorm';

The following screenshot shows our results.

  1. Query the shared objects using three-part notation just like querying any other local database object, using a notation <database>.<schema>.<view/table>:
    select count(*) from sales_semantic.bi_semantic.customer_denorm;

Following is your result:

28950139

  1. Analyze the local product reviews data by joining the shared customer_denorm data to identify the top ratings by customer states for this BI report:
    SELECT PR.product_category, c.ca_state AS customer_state,
                  count(PR.star_rating) AS cnt
          FROM product_reviews.amazon_reviews PR,               --local data
               sales_semantic.bi_semantic.customer_denorm  C    --shared data
          WHERE  PR.customer_id = C.c_customer_sk
             AND PR.marketplace = 'US'
          GROUP BY 1, 2
          order by cnt desc
          Limit 10;

The following screenshot shows our results.

Adding a data science consumer

Now, let’s assume the company has decided to spin up a data science team to help with new sales strategies, and this team performs analytics on the sales data. The data science team is new and has very different access patterns and SLA requirements compared to the BI team. Thanks to the data sharing feature, onboarding new use cases such as this is easy.

We add a data science consumer cluster named ds_cluster. Because the data science users need access to data in salesdatashare, the superuser in the etl_cluster can simply grant access to the ds_cluster by adding them as another consumer for the share without moving any data:

GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE '1h137c4-1187-4w53-8de2-e710b7100es2';

The following diagram shows our updated architecture with the data science consumer (step 4).

This way, multiple clusters of different sizes can access the same dataset and isolate workloads to meet their SLA requirements. Users in these respective clusters are granted access to shared objects to meet their stringent security requirements. The producer keeps control of the data and at any point can remove certain objects from the share or remove access to the share for any of these clusters, and the consumers immediately lose access to the data. Also, as more data is ingested into the producer cluster, the consumer sees transactionally consistent data instantly.

Monitoring and security

Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and usage across all the consumers and revoke access instantly when necessary. The permissions are granted by the superusers from both the producer and the consumer clusters to define who gets access to what objects, similar to the grant commands used in the earlier scenario. You can use the following commands to audit the usage and activities for the data share.

Track all changes to the data share and the shared database imported from the data share with the following code:

Select username, share_name, recordtime, action, 
         share_object_type, share_object_name 
  from svl_datashare_change_log
   order by recordtime desc;

The following screenshot shows our results.

Track data share access activity (usage), which is relevant only on the producer, with the following code:

Select * from svl_datashare_usage;

The following screenshot shows our results.

Summary

Amazon Redshift data sharing provides workload isolation by allowing multiple consumers to share data seamlessly without the need to unload and load data. We also presented a step-by-step guide for securely sharing data from a producer to multiple consumer clusters.


About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

 

 

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.

Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-federated-querying-to-amazon-aurora-mysql-and-amazon-rds-for-mysql/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in April 2020, we announced general availability for federated querying to Amazon Aurora PostgreSQL and Amazon Relational Database Service (Amazon RDS) for PostgreSQL to enable you to query data across your operational databases, your data warehouse, and your data lake to gain faster and deeper insights not possible otherwise.

Today, we’re launching a new feature of Amazon Redshift federated query to Amazon Aurora MySQL and Amazon RDS for MySQL to help you expand your operational databases in the MySQL family. With this lake house architecture expansion to support more operational data stores, you can query and combine data more easily in real time and store data in open file formats in your Amazon Simple Storage Service (Amazon S3) data lake. Your data can then be more available to other analytics and machine learning (ML) tools, rather than siloed in disparate data stores.

In this post, we share information about how to get started with this new federated query feature to MySQL.

Prerequisites

To try this new feature, create a new Amazon Redshift cluster on the sql_preview maintenance track and an Aurora MySQL instance, and load the sample TPC data into both data stores. To make sure the Aurora MySQL DB instance can accept connections from the Amazon Redshift cluster, both your Amazon Redshift cluster and Aurora MySQL instance should be in the same Amazon Virtual Private Cloud (Amazon VPC) and subnet group. This way, you can add the security group for the Amazon Redshift cluster to the inbound rules of the security group for the Aurora MySQL DB instance.

If your Amazon Redshift cluster and Aurora MySQL instance are in different VPCs, you can set up VPC peering or other networking to allow Amazon Redshift to make connections to your Aurora MySQL instance. For more information about VPC networking, see Working with a DB instance in a VPC.
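
If you want to script the security group change described above, the following is a minimal boto3 sketch; both security group IDs are placeholders:

import boto3

ec2 = boto3.client('ec2')

# Placeholder security group IDs: replace with the Aurora MySQL DB instance's
# security group and the Amazon Redshift cluster's security group.
ec2.authorize_security_group_ingress(
    GroupId='sg-aurora-mysql-placeholder',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 3306,
        'ToPort': 3306,
        'UserIdGroupPairs': [{'GroupId': 'sg-redshift-cluster-placeholder'}],
    }],
)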

Configuring AWS Secrets Manager for remote database credentials

Amazon Redshift needs database credentials to issue a federated query to a MySQL database. AWS Secrets Manager provides a centralized service to manage secrets and can be used to store your MySQL database credentials. Because Amazon Redshift retrieves and uses these credentials, they are transient, not stored in any generated code, and discarded after the query runs.

Storing credentials in Secrets Manager takes only a few minutes. To store a new secret, complete the following steps:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. For Select secret type, select Credentials for RDS database.
  4. For User name, enter a name.
  5. For Password, enter a password.
  6. For Select the encryption key, choose DefaultEncryptionkey.
  7. For Select which RDS database this secret will access, choose your database.

  1. Optionally, copy the programmatic code for accessing your secret in your preferred programming language (not needed for this post).
  2. Choose Next.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console. The secret ARN is needed in the subsequent step.
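
If you’d rather look up the ARN programmatically, the following is a minimal boto3 sketch; the secret name is a placeholder:

import boto3

secrets = boto3.client('secretsmanager')

# Replace 'my-aurora-mysql-credentials' with the name you gave the secret.
response = secrets.describe_secret(SecretId='my-aurora-mysql-credentials')
print(response['ARN'])  # use this value as SECRET_ARN in the external schema DDL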

Setting up IAM role

You can now pull everything together by embedding the secret ARN into an AWS Identity and Access Management (IAM) policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "<SecretARN>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your cluster.
  3. On the Actions drop-down menu, choose Manage IAM roles.

  1. Choose and add the IAM role you just created.

Setting up external schema

The final step is to create an external schema to connect to your Aurora MySQL instance. The following example code creates the external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – Aurora MySQL cluster endpoint
  • IAM_Role – IAM role created from the previous step
  • Secret_ARN – Secret ARN

After you set up the external schema, you’re ready to run some queries to test different use cases.

Querying live operational data

You can now query real-time operational data in your Aurora MySQL instance from Amazon Redshift. Note that isolation level is read committed for MySQL. See the following code:

dev=# select top 10 ws_order_number from mysqlfq.web_sales;
 ws_order_number 
-----------------
        93628990
       157020207
         4338647
        41395871
        58468186
       171095867
        12514566
        74946143
         3418243
        67054239
(10 rows)

Querying mysqlfq.web_sales in Amazon Redshift routes the request to the MySQL tpc database and its web_sales table. If you examine the query plan, you can see the query runs at the MySQL instance, as shown by the step Remote MySQL Seq Scan:

dev=# explain select top 10 ws_order_number from mysqlfq.web_sales;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..0.20 rows=10 width=8)
   ->  XN MySQL Query Scan web_sales  (cost=0.00..6869.28 rows=343464 width=8)
         ->  Remote MySQL Seq Scan mysqlfq.web_sales  (cost=0.00..3434.64 rows=343464 width=8)
(3 rows)

Simplifying ELT and ETL

You can also extract operational data directly from your Aurora MySQL instance and load it into Amazon Redshift. See the following code:

dev=# create table staging_customer as select c_customer_id from mysqlfq.customer where c_customer_id not in (select c_customer_id from customer);
SELECT
dev=# select count(*) from staging_customer;
 count  
--------
 350000
(1 row)

The preceding code uses a CTAS (CREATE TABLE AS SELECT) statement to create and load incremental data from your operational MySQL instance into a staging table in Amazon Redshift. You can then perform transformation and merge operations from the staging table to the target table, as sketched below. For more information, see Updating and inserting new data.
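
As a rough illustration of that merge step, the following sketch applies the staged rows with a delete-and-insert pattern using the Amazon Redshift Data API from Python; the cluster identifier, database, user, and customer_target table are placeholders, and the column list must match your actual schema:

import boto3

rsd = boto3.client('redshift-data')

# Hypothetical identifiers -- replace with your cluster, database, and user.
common = dict(ClusterIdentifier='my-redshift-cluster', Database='dev', DbUser='awsuser')

# Delete rows that will be replaced, then insert the staged rows.
# Column lists are illustrative; align them with your actual target schema.
rsd.execute_statement(
    Sql="delete from customer_target using staging_customer "
        "where customer_target.c_customer_id = staging_customer.c_customer_id;",
    **common)
rsd.execute_statement(
    Sql="insert into customer_target (c_customer_id) "
        "select c_customer_id from staging_customer;",
    **common)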

Combining operational data with data from your data warehouse and data lake

You can combine live operational data from your Aurora MySQL instance with data from your Amazon Redshift data warehouse and S3 data lake by creating a late binding view.

To access your S3 data lake historical data via Amazon Redshift Spectrum, create an external table:

create external schema mysqlspectrum
from data catalog
database 'spectrumdb'
iam_role '<IAMRole>'
create external database if not exists;
 
create external table mysqlspectrum.customer 
stored as parquet 
location 's3://<yourS3bucket>/customer/'
as select * from customer where c_customer_sk <= 100000;

You can then run queries on the view to gain insight on data across the three sources:

drop view vwCustomer;
create view vwCustomer as
select c_customer_sk, 'redshift' as source from public.customer where c_customer_sk > 100000
union all
select c_customer_sk, 'mysql' as source from mysqlfq.customer
union all
select c_customer_sk, 's3' as source from mysqlspectrum.customer
with no schema binding;

select * from vwCustomer where c_customer_sk in (1, 149712,29033279);

You should see the following three records as output:

dev=# select * from vwCustomer where c_customer_sk in (1, 149712,29033279);
 c_customer_sk |  source  
---------------+----------
      29033279 | mysql
             1 | s3
        149712 | redshift
(3 rows)

If you examine the query plan, you can see that the predicates are pushed down to your MySQL instance to run:

dev=# explain select * from vwCustomer where c_customer_sk in (1,149712,29033279);
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan vwcustomer  (cost=0.00..48398.40 rows=6988 width=36)
   ->  XN Append  (cost=0.00..48328.52 rows=6988 width=4)
         ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..40000.03 rows=3 width=4)
               ->  XN Seq Scan on customer  (cost=0.00..40000.00 rows=3 width=4)
                     Filter: (((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279)) AND (c_customer_sk > 100000))
         ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..6548.63 rows=5492 width=4)
               ->  XN MySQL Query Scan customer  (cost=0.00..6493.71 rows=5492 width=4)
                     ->  Remote MySQL Seq Scan mysqlfq.customer  (cost=0.00..6438.79 rows=5492 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
         ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..1779.86 rows=1493 width=4)
               ->  XN S3 Query Scan customer  (cost=0.00..1764.93 rows=1493 width=4)
                     ->  S3 Seq Scan mysqlspectrum.customer location:"s3://<yourS3bucket>/customer" format:PARQUET  (cost=0.00..1750.00 rows=1493 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
(13 rows)

Available Now

Amazon Redshift federated querying to Aurora MySQL and Amazon RDS for MySQL is now available for public preview with Amazon Redshift release version 1.0.21591 or later. Refer to the AWS Region Table for Amazon Redshift availability and to check the version of your clusters.


About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next-generation big data analytics platform using AWS technologies.

 

Zhouyi Yang is a Software Development Engineer for the Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

 

 

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Use a single AWS Managed Microsoft AD for Amazon RDS for SQL Server instances in multiple Regions

Post Syndicated from Jeremy Girven original https://aws.amazon.com/blogs/security/use-a-single-aws-managed-microsoft-ad-for-amazon-rds-for-sql-server-instances-in-multiple-regions/

Many Amazon Web Services (AWS) customers use Active Directory to centralize user authentication and authorization for a variety of applications and services. For these customers, Active Directory is a critical piece of their IT infrastructure.

AWS offers AWS Directory Service for Microsoft Active Directory, also known as AWS Managed Microsoft AD, to provide a highly available and resilient Active Directory service that is built on Microsoft Active Directory.

AWS also offers Amazon Relational Database Service (Amazon RDS) for SQL Server. Amazon RDS enables you to prioritize application development by managing time-consuming database administration tasks including provisioning, backups, software patching, monitoring, and hardware scaling. If you require Windows authentication with Amazon RDS for SQL Server, Amazon RDS for SQL Server instances need to be integrated with AWS Managed Microsoft AD.

With the release of AWS Managed Microsoft AD cross-Region support, you only need one distinct AWS Managed Microsoft AD that spans multiple AWS Regions; this simplifies directory management and configuration. Additionally, it simplifies trusts between the AWS Managed Microsoft AD domain and your on-premises domain. Now, only a single trust between your on-premises domain and AWS Managed Microsoft AD domain is required, as compared to the previous pattern of only one AWS Managed Microsoft AD per Region—each of which would require a trust if you wanted to allow on-premises objects access to your AWS Managed Microsoft AD domain. Further, AWS Managed Microsoft AD cross-Region support provides an additional benefit when using your on-premises users and groups with Amazon RDS for SQL Server: You only need a single, one-way, outgoing trust between your multi-Region AWS Managed Microsoft AD and your on-premises domain.

As detailed in this post, to enable AWS Managed Microsoft AD cross-Region support, you create a new AWS Managed Microsoft AD and extend it to multiple Regions (as shown in Figure 1 below). Once you’ve extended your directory, you deploy an Amazon RDS SQL Server instance in each Region, integrating it to the same directory. Finally, you install SQL Server Management Studio (SSMS) on an instance joined to the AWS Managed Microsoft AD directory. You use that instance to connect to the RDS SQL Server instances using the same domain user account.

Figure 1: High level diagram of resources deployed in this post

The architecture in Figure 1 includes a network connection between the Regions. That connection isn’t required for the AWS Managed Microsoft AD to function; if you don’t require network connectivity between your Regions, you can disregard the network link in the diagram. However, because this walkthrough uses a single Amazon Elastic Compute Cloud (Amazon EC2) instance in one Region, the connection between the Amazon VPCs in the two Regions is needed to allow that instance to connect to a domain controller in each Region.

Prerequisites for AWS Managed Microsoft AD cross-Region Support

  1. An AWS Managed Microsoft AD deployed in a Region of your choice. If you don’t have one already deployed, you can follow the instructions in Create Your AWS Managed Microsoft AD directory to create one. For this post, I recommend that you use us-east-1.
  2. The VPC must be peered in order to complete the steps in this blog. Creating and accepting a VPC peering connection has information on how to create a peering connection between Regions. Be aware of unsupported VPC peering configurations.
  3. A Windows Server instance joined to your managed Active Directory domain. Join an EC2 Instance to Your AWS Managed Microsoft AD Directory has instructions if you need assistance.
  4. Install the Active Directory administration tools onto your domain-joined instance. Installing the Active Directory Administration Tools has detailed instructions.

Extend your AWS Managed Microsoft AD to another Region

We’ve made the process to extend your directory to another Region straightforward. There is no cost to add another Region; you pay only for the resources for your directory running in the new Region. See here for additional information on pricing changes with new Regions. For example, in this post you extend your directory into the us-east-2 Region, which adds a cost for two new domain controllers. Figure 3 shows the additional cost to extend the directory.

Let’s walk through the steps of setting up Windows Authentication with Amazon RDS for SQL Server instances in multiple Regions using a single cross-Region AWS Managed Microsoft AD.

To extend your directory to another Region:

  1. In the AWS Directory Service console navigation pane, choose Directories.

    Note: You should see a list of your AWS Managed Microsoft AD directories.

  2. Choose the Directory ID of the directory you want to expand to another Region.
  3. Go to the Directory details page. In the Multi-region replication section, select Add Region.

    Figure 2: Directory details and new multi-Region replication pane

  4. On the Add region page:
    1. For Region to add, select the Region you want to extend your directory to.
    2. For VPC, select the Amazon Virtual Private Cloud (Amazon VPC) for the new domain controllers to use.
    3. For Subnets, select two unique subnets in the Amazon VPC that you selected in the preceding step.
    4. Once you have everything to your liking, choose Add.
      Figure 3: Add a Region

      In the background, AWS is provisioning two new AWS managed domain controllers in the Region you selected. It could take up to 2 hours for your directory to become available in the Region.

Note: Your managed domain controllers in the home Region are fully functional during this process.

  • On the Directory details page, in Multi-Region replication, the status should be Active when the process has completed. Now you’re ready to deploy your Amazon RDS SQL Server instances.

Enable Amazon RDS for SQL Server

Integrating Amazon RDS into AWS Managed Microsoft AD is exactly the same process as it was before the cross-Region feature was released. This post goes through that original process with only one change, which is that you select the same directory ID for both Regions.

Create an Amazon RDS SQL Server instance in each Region using the same directory

The steps for creating an Amazon RDS SQL Server instance in each Region are the same. The following process will create the first instance. Once you’ve completed the process, you change the AWS Management Console Region to the Region you extended your directory to and repeat the process.

To create an Amazon RDS SQL Server instance:

  1. In the AWS Managed Microsoft AD directory primary Region, go to the Amazon RDS console navigation pane and choose Create database.
  2. Choose Microsoft SQL Server.
  3. You can leave the default values, except for the following settings:
    1. Under Settings select Master and Confirm password.
    2. Under Connectivity, expand Additional connectivity configuration:
      1. Choose Create new to create a new VPC security group.
      2. Enter a name in New VPC security group name.
      3. Select No preference for Availability Zone.
      4. Enter 1433 for Database port.
      Figure 4: Connectivity settings

  4. Select the Enable Microsoft SQL Server Windows authentication check box and then choose Browse Directory.

    Figure 5: Enable Microsoft SQL Server Windows authentication selected

  5. Select your directory and select Choose.

    Figure 6: Select a directory

  6. Choose Create database.
  7. Repeat these steps in your expanded Region. Note that the Directory ID will be the same for both Regions. You can complete the next section while your Amazon RDS SQL instances are provisioning.

Create an Active Directory user and group to delegate SQL administrative rights

The following steps walk you through the process of creating an Active Directory user and group for delegation. Following this process, you add the user to the group you just created and to the AWS Delegated Server Administrators group.

To create a user and group:

  1. Log in to the domain-joined instance with a domain user account that has permissions to create Active Directory users and groups.
  2. Choose Start, enter dsa.msc, and press Enter.
  3. In Active Directory Users and Computers, right-click on the Users OU, select New, and then Group. The New Object – Group window pops up.
    1. Fill in the Group name boxes with your choice of name.
    2. For Group Scope, select Domain local.
    3. For Group type, select Security.
    4. Choose OK.
  4. In Active Directory Users and Computers, right-click on your Users OU and select New and then User. The New Object – User window pops up.
    1. Fill in the boxes with your choice of information, and then choose Next.
    2. Enter your choice of password and clear User must change password at next logon, then choose Next.
    3. On the confirmation page, choose Finish.
  5. Double-click on the user you just created. The user account properties window appears.
    1. Select the Member of tab.
    2. Choose Add.
    3. Enter the name of the group that you previously created and choose Check Names. Next, enter AWS Delegated Server Administrators and choose Check Names again. If you do not receive any error, choose OK, and then OK again.
  6. The Member of tab for the user should include the two groups you just added. Choose OK to close the properties page.

Delegate SQL Server permissions in each Region using the Active Directory group you just created

The following steps guide you through the process of modifying the Amazon RDS SQL security group, installing SQL Server Management Studio (SSMS), and delegating permission in SQL to your Active Directory group.

Modify the Amazon RDS SQL security group

In these next steps, you modify the security group you created with your Amazon RDS instances, allowing your Windows Server instance to connect to the Amazon RDS SQL Server instances over port 1433.

To modify the security group:

  1. From the Amazon Elastic Compute Cloud (Amazon EC2) console, select Security Groups under the Network & Security navigation section.
  2. Select the new Amazon RDS SQL security group that was created with your Amazon RDS SQL instance and select Edit inbound rules.
  3. Choose Add rule and enter the following:
    1. Type – Select Custom TCP.
    2. Protocol – Select TCP.
    3. Port range – Enter 1433.
    4. Source – Select Custom.
    5. Enter the private IP of your instance with a /32. An example would be 10.0.0.10/32.
  4. Choose Save rules.

    Figure 7: Create a security group rule

  5. Repeat these steps on the security group of your other Amazon RDS SQL instance in the other Region.

Install SQL Server Management Studio

All of the steps after the first are done on the Windows Server instance from Prerequisite 3.

To install SSMS:

  1. On your local computer, download SQL Server Management Studio (SSMS).

    Note: If desired, you can disable IE Enhanced Security Configuration and download directly to the Windows Server instance using IE or any other browser, and skip to step 3.

  2. RDP into your Windows Server instance and copy SSMS-Setup-ENU.exe to your RDP session.
  3. Run the file on your Windows Server instance.
  4. Choose Install.

    Figure 8: Install SSMS

  5. It might take a few minutes to install. When complete, choose Close.

Delegate permissions in SSMS

All of the following steps are performed on the Windows Server instance from Prerequisite 3. Log in to the Amazon RDS SQL instance using the SQL master user account. Next, create a SQL login for the Active Directory group you created previously and give it elevated permission to the Amazon RDS SQL instance.

To delegate permissions:

  1. Start SSMS.
  2. On the Connect to Server window, enter or select:
    1. Server name – Your Amazon RDS SQL Server endpoint.
    2. Authentication – Select SQL Server Authentication.
    3. Login – Enter the master user name you used when you launched your Amazon RDS SQL instance. The default is admin.
    4. Password – Enter the password for the master user name.
    5. Choose Connect.
    Figure 9: Connect to server

  3. In SSMS, choose New Query at the top of the window.

    Figure 10: New query

  4. In the query window, enter the following query. Replace <CORP\SQL-Admins> with the name of the group you created earlier.
    CREATE LOGIN [<CORP\SQL-Admins>] FROM WINDOWS WITH DEFAULT_DATABASE = [master],
       DEFAULT_LANGUAGE = [us_english];
    

    Figure 11: Query SQL database

  5. Choose Execute on the menu bar. You should see a Commands completed successfully message.

    Figure 12: Commands completed successfully

  6. Next, navigate to the Logins directory on the navigation page. Right-click on the group you added with the SQL command in step 5 and select Properties.

    Figure 13: Open group properties

  7. Select Server Roles and select the processadmin and setupadmin checkboxes. Then choose OK.

    Figure 14: Configure server roles

  8. You can log off from the instance. For the next steps, you log in to the instance using the user account you created previously.
  9. Repeat these steps on the Amazon RDS SQL instance in the other Region.

Connect to the Amazon RDS SQL Server with the same Active Directory user in both Regions

All of the steps are performed on the Windows Server instance from Prerequisite 3. You must log in to the instance using the account you created earlier. You then log in to the Amazon RDS SQL instance using Windows authentication with that account.
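
If you also want to verify Windows authentication outside of SSMS, the following minimal pyodbc sketch, run from the same domain-joined instance while logged in as that user, assumes pyodbc and the Microsoft ODBC Driver for SQL Server are installed; the server endpoint is a placeholder:

import pyodbc

# Placeholder endpoint -- use your RDS for SQL Server endpoint.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myrdssql.xxxxxxxx.us-east-1.rds.amazonaws.com,1433;"
    "DATABASE=master;"
    "Trusted_Connection=yes;"  # uses the Windows identity of the logged-on domain user
)

with pyodbc.connect(conn_str) as conn:
    row = conn.cursor().execute("SELECT SUSER_SNAME();").fetchone()
    print(row[0])  # should print the domain account you logged in with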

  1. Log in to the instance with the user account you created earlier.
  2. Start SSMS.
  3. On the Connect to Server window, enter or select:
    1. Server name: Your Amazon RDS SQL Server endpoint.
    2. Authentication: Select Windows Authentication.
    3. Choose Connect.
    Figure 15: Connect to server

  4. You should be logged in to SSMS. If you aren’t logged in, make sure you added your user account to the group you created earlier and try again.
  5. Repeat these steps using the other Amazon RDS SQL instance endpoint for the server name. You should be able to connect to both Amazon RDS SQL instances using the same user account.

Summary

In this post, you extended your AWS Managed Microsoft AD into a new Region. You then deployed Amazon RDS for SQL Server in multiple Regions attached to the same AWS Managed Microsoft AD directory. You then tested authentication to both Amazon RDS SQL instances using the same Active Directory user.

To learn more about using AWS Managed Microsoft AD or AD Connector, visit the AWS Directory Service documentation. For general information and pricing, see the AWS Directory Service home page. If you have comments about this blog post, submit a comment in the Comments section below. If you have implementation or troubleshooting questions, start a new thread on the AWS Directory Service forum or contact AWS Support.

Author

Jeremy Girven

Jeremy is a Solutions Architect specializing in Microsoft workloads on AWS. He has over 15 years of experience with Microsoft Active Directory and over 23 years of industry experience. One of his fun projects is using SSM to automate the Active Directory build processes in AWS. To see more, please check out the Active Directory AWS QuickStart.