Tag Archives: Amazon RDS

Better performance for less: AWS continues to beat Azure on SQL Server price/performance

Post Syndicated from Fred Wurden original https://aws.amazon.com/blogs/compute/sql-server-runs-better-on-aws/

By Fred Wurden, General Manager, AWS Enterprise Engineering (Windows, VMware, RedHat, SAP, Benchmarking)

AWS R5b.8xlarge delivers better performance at lower cost than Azure E64_32s_v4 for a SQL Server workload

In this blog, we will review a recent benchmark that Principled Technologies published on 2/25. The benchmark found that an Amazon Elastic Compute Cloud (Amazon EC2) R5b.8xlarge instance delivered better performance for a SQL Server workload at a lower cost when directly tested against an Azure E64_32s_v4 VM.

Behind the study: Understanding how SQL Server performed better, for a lower cost with an AWS EC2 R5b instance

Principled Technologies tested an online transaction processing (OLTP) workload for SQL Server 2019 on both an R5b instance on Amazon EC2 with Amazon Elastic Block Store (EBS) as storage and Azure E64_32s_v4. This particular Azure VM was chosen as an equivalent to the R5b instance, as both instances have comparable specifications for input/output operations per second (IOPS) performance, use Intel Xeon processors from the same generation (Cascade Lake), and offer the same number of cores (32). For storage, Principled Technologies mirrored storage configurations across the Azure VM and the EC2 instance (which used Amazon Elastic Block Store (EBS)), maxing out the IOPS specs on each while offering a direct comparison between instances.

Test Configurations

Source: Principled Technologies

When benchmarking, Principled Technologies ran a TPC-C-like OLTP workload from HammerDB v3.3 on both instances, testing against new orders per minute (NOPM) performance. NOPM shows the number of new-order transactions completed in one minute as part of a serialized business workload. HammerDB claims that because NOPM is “independent of any particular database implementation [it] is the recommended primary metric to use.”

The results: SQL Server on AWS EC2 R5b delivered 2x performance than the Azure VM and 62% less expensive 

Graphs that show AWS instance outperformed the Azure instance

Source: Principled Technologies

These test results from the Principled Technologies report show the price/performance and performance comparisons. The performance metric is New Orders Per Minute (NOPM); faster is better. The price/performance calculations are based on the cost of on-demand, License Included SQL Server instances and storage to achieve 1,000 NOPM performance, smaller is better.

An EC2 r5b.8xlarge instance powered by an Intel Xeon Scalable processor delivered better SQL Server NOPM performance on the HammerDB benchmark and a lower price per 1,000 NOPM than an Azure E64_32s_v4 VM powered by similar Intel Xeon Scalable processors.

On top of that, AWS’s storage price-performance exceeded Azure’s. The Azure managed disks offered 53 percent more storage than the EBS storage, but the EC2 instance with EBS storage cost 24 percent less than the Azure VM with managed disks. Even by reducing Azure storage by the difference in storage, something customers cannot do, EBS would have cost 13 percent less per storage GB than the Azure managed disks.

Why AWS is the best cloud to run your Windows and SQL Server workloads

To us, these results aren’t surprising. In fact, they’re in line with the success that customers find running Windows on AWS for over 12 years. Customers like Pearson and Expedia have all found better performance and enhanced cost savings by moving their Windows, SQL Server, and .NET workloads to AWS. In fact, RepricerExpress migrated its Windows and SQL Server environments from Azure to AWS to slash outbound bandwidth costs while gaining agility and performance.

Not only do we offer better price-performance for your Windows workloads, but we also offer better ways to run Windows in the cloud. Whether you want to rehost your databases to EC2, move to managed with Amazon Relational Database for SQL Server (RDS), or even modernize to cloud-native databases, AWS stands ready to help you get the most out of the cloud.

 


To learn more on migrating Windows Server or SQL Server, visit Windows on AWS. For more stories about customers who have successfully migrated and modernized SQL Server workloads with AWS, visit our Customer Success page. Contact us to start your migration journey today.

Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS

Post Syndicated from Islam Ghanim original https://aws.amazon.com/blogs/architecture/field-notes-stopping-an-automatically-started-database-instance-with-amazon-rds/

Customers needing to keep an Amazon Relational Database Service (Amazon RDS) instance stopped for more than 7 days, look for ways to efficiently re-stop the database after being automatically started by Amazon RDS. If the database is started and there is no mechanism to stop it; customers start to pay for the instance’s hourly cost. Moreover, customers with database licensing agreements could incur penalties for running beyond their licensed cores/users.

Stopping and starting a DB instance is faster than creating a DB snapshot, and then restoring the snapshot. However, if you plan to keep the Amazon RDS instance stopped for an extended period of time, it is advised to terminate your Amazon RDS instance and recreate it from a snapshot when needed.

This blog provides a step-by-step approach to automatically stop an RDS instance once the auto-restart activity is complete. This saves any costs incurred once the instance is turned on. The proposed architecture is fully serverless and requires no management overhead. It relies on AWS Step Functions and a set of Lambda functions to monitor RDS instance state and stop the instance when required.

Architecture overview

Given the autonomous nature of the architecture and to avoid management overhead, the architecture leverages serverless components.

  • The architecture relies on RDS event notifications. Once a stopped RDS instance is started by AWS due to exceeding the maximum time in the stopped state; an event (RDS-EVENT-0154) is generated by RDS.
  • The RDS event is pushed to a dedicated SNS topic rds-event-notifications-topic.
  • The Lambda function start-statemachine-execution-lambda is subscribed to the SNS topic rds-event-notifications-topic.
    • The function filters messages with event code: RDS-EVENT-0154. In order to restrict the ‘force shutdown’ activity further, the function validates that the RDS instance is tagged with auto-restart-protection and that the tag value is set to ‘yes’.
    • Once all conditions are met, the Lambda function starts the AWS Step Functions state machine execution.
  • The AWS Step Functions state machine integrates with two Lambda functions in order to retrieve the instance state, as well as attempt to stop the RDS instance.
    • In case the instance state is not ‘available’, the state machine waits for 5 minutes and then re-checks the state.
    • Finally, when the Amazon RDS instance state is ‘available’; the state machine will attempt to stop the Amazon RDS instance.

Prerequisites

In order to implement the steps in this post, you need an AWS account as well as an IAM user with permissions to provision and delete resources of the following AWS services:

  • Amazon RDS
  • AWS Lambda
  • AWS Step Functions
  • AWS CloudFormation
  • AWS SNS
  • AWS IAM

Architecture implementation

You can implement the architecture using the AWS Management Console or AWS CLI.  For faster deployment, the architecture is available on GitHub. For more information on the repo, visit GitHub.

The steps below explain how to build the end-to-end architecture from within the AWS Management Console:

Create an SNS topic

  • Open the Amazon SNS console.
  • On the Amazon SNS dashboard, under Common actions, choose Create Topic.
  • In the Create new topic dialog box, for Topic name, enter a name for the topic (rds-event-notifications-topic).
  • Choose Create topic.
  • Note the Topic ARN for the next task (for example, arn:aws:sns:us-east-1:111122223333:my-topic).

Configure RDS event notifications

Amazon RDS uses Amazon Simple Notification Service (Amazon SNS) to provide notification when an Amazon RDS event occurs. These notifications can be in any notification form supported by Amazon SNS for an AWS Region, such as an email, a text message, or a call to an HTTP endpoint.

For this architecture, RDS generates an event indicating that instance has automatically restarted because it exceed the maximum duration to remain stopped. This specific RDS event (RDS-EVENT-0154) belongs to ‘notification’ category. For more information, visit Using Amazon RDS Event Notification.

To subscribe to an RDS event notification

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose Event subscriptions.
  • In the Event subscriptions pane, choose Create event subscription.
  • In the Create event subscription dialog box, do the following:
    • For Name, enter a name for the event notification subscription (RdsAutoRestartEventSubscription).
    • For Send notifications to, choose the SNS topic created in the previous step (rds-event-notifications-topic).
    • For Source type, choose ‘Instances’. Since our source will be RDS instances.
    • For Instances to include, choose ‘All instances’. Instances are included or excluded based on the tag, auto-restart-protection. This is to keep the architecture generic and to avoid regular configurations moving forward.
    • For Event categories to include, choose ‘Select specific event categories’.
    • For Specific event, choose ‘notification’. This is the category under which the RDS event of interest falls. For more information, review Using Amazon RDS Event Notification.
    •  Choose Create.
    • The Amazon RDS console indicates that the subscription is being created.

Create Lambda functions

Following are the three Lambda functions required for the architecture to work:

  1. start-statemachine-execution-lambda, the function will subscribe to the newly created SNS topic (rds-event-notifications-topic) and starts the AWS Step Functions state machine execution.
  2. retrieve-rds-instance-state-lambda, the function is triggered by AWS Step Functions state machine to retrieve an RDS instance state (example, available or stopped)
  3. stop-rds-instance-lambda, the function is triggered by AWS Step Functions state machine in order to attempt to stop an RDS instance.

First, create the Lambda functions’ execution role.

To create an execution role

  • Open the roles page in the IAM console.
  • Choose Create role.
  • Create a role with the following properties.
    • Trusted entity – Lambda.
    • Permissions – AWSLambdaBasicExecutionRole.
    • Role namerds-auto-restart-lambda-role.
    • The AWSLambdaBasicExecutionRole policy has the permissions that the function needs to write logs to CloudWatch Logs.

Now, create a new policy and attach to the role in order to allow the Lambda function to: start an AWS StepFunctions state machine execution, stop an Amazon RDS instance, retrieve RDS instance status, list tags and add tags.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "rds:AddTagsToResource",
                "rds:ListTagsForResource",
                "rds:DescribeDBInstances",
                "states:StartExecution",
                "rds:StopDBInstance"
            ],
            "Resource": "*"
        }
    ]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name (rds-auto-restart-lambda-policy) and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.

To link the new policy to the AWS Lambda execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role created for the three Lambda functions. After selecting the identities, choose Attach policy.

Given the principle of least privilege, it is recommended to create 3 different roles restricting a function’s access to the needed resources only. 

Repeat the following step 3 times to create 3 new Lambda functions. Differences between the 3 Lambda functions are: (1) code and (2) triggers:

  • Open the Lambda console.
  • Choose Create function.
  • Configure the following settings:
    • Name
      • start-statemachine-execution-lambda
      • retrieve-rds-instance-state-lambda
      • stop-rds-instance-lambda
    • Runtime – Python 3.8.
    • Role – Choose an existing role.
    • Existing role – rds-auto-restart-lambda-role.
    • Choose Create function.
    • To configure a test event, choose Test.
    • For Event name, enter test.
  • Choose Create.
  • For the Lambda function —  start-statemachine-execution-lambda, use the following Python 3.8 sample code:
import json
import boto3
import logging
import os

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):

    #log input event
    LOGGER.info("RdsAutoRestart Event Received, now checking if event is eligible. Event Details ==> ", event)

    #Input event from the SNS topic originated from RDS event notifications
    snsMessage = json.loads(event['Records'][0]['Sns']['Message'])
    rdsInstanceId = snsMessage['Source ID']
    stepFunctionInput = {"rdsInstanceId": rdsInstanceId}
    rdsEventId = snsMessage['Event ID']

    #Retrieve RDS instance ARN
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceArn = db_instance['DBInstanceArn']

    # Filter on the Auto Restart RDS Event. Event code: RDS-EVENT-0154. 

    if 'RDS-EVENT-0154' in rdsEventId:

        #log input event
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        #Verify that instance is tagged with auto-restart-protection tag. The tag is used to classify instances that are required to be terminated once started. 

        tagCheckPass = 'false'
        rdsInstanceTags = rdsClient.list_tags_for_resource(ResourceName=rdsInstanceArn)
        for rdsInstanceTag in rdsInstanceTags["TagList"]:
            if 'auto-restart-protection' in rdsInstanceTag["Key"]:
                if 'yes' in rdsInstanceTag["Value"]:
                    tagCheckPass = 'true'
                    #log instance tags
                    LOGGER.info("RdsAutoRestart verified that the instance is tagged auto-restart-protection = yes, now starting the Step Functions Flow")
                else:
                    tagCheckPass = 'false'


        #log instance tags
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        if 'true' in tagCheckPass:

            #Initialise StepFunctions Client
            stepFunctionsClient = boto3.client('stepfunctions')

            # Start StepFunctions WorkFlow
            # StepFunctionsArn is stored in an environment variable
            stepFunctionsArn = os.environ['STEPFUNCTION_ARN']
            stepFunctionsResponse = stepFunctionsClient.start_execution(
            stateMachineArn= stepFunctionsArn,
            name=event['Records'][0]['Sns']['MessageId'],
            input= json.dumps(stepFunctionInput)

        )

    else:

        LOGGER.info("RdsAutoRestart Event detected, and event is not eligible")

    return {
            'statusCode': 200
        }

And then, configure an SNS source trigger for the function start-statemachine-execution-lambda. RDS event notifications will be published to this SNS topic:

  • In the Designer pane, choose Add trigger.
  • In the Trigger configurations pane, select SNS as a trigger.
  • For SNS topic, choose the SNS topic previously created (rds-event-notifications-topic)
  • For Enable trigger, keep it checked.
  • Choose Add.
  • Choose Save.

For the Lambda function — retrieve-rds-instance-state-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    

    #log input event
    LOGGER.info(event)
    
    #rdsInstanceId is passed as input to the lambda function from the AWS StepFunctions state machine.  
    rdsInstanceId = event['rdsInstanceId']
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceState = db_instance['DBInstanceStatus']
    return {
        'statusCode': 200,
        'rdsInstanceState': rdsInstanceState,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

For the Lambda function, stop-rds-instance-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    
    #log input event
    LOGGER.info(event)
    
    rdsInstanceId = event['rdsInstanceId']
    
    #Stop RDS instance
    rdsClient.stop_db_instance(DBInstanceIdentifier=rdsInstanceId)
    
    #Tagging
    
    
    return {
        'statusCode': 200,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

Create a Step Function

AWS Step Functions will execute the following service logic:

  1. Retrieve RDS instance state by calling Lambda function, retrieve-rds-instance-state-lambda. The Lambda function then returns the parameter, rdsInstanceState.
  2. If the rdsInstanceState parameter value is ‘available’, then the state machine will step into the next action calling the Lambda function, stop-rds-instance-lambda. If the rdsInstanceState is not ‘available’, the state machine will then wait for 5 minutes and then re-check the RDS instance state again.
  3. Stopping an RDS instance is an asynchronous operation and accordingly the state machine will keep polling the instance state once every 5 minutes until the rdsInstanceState parameter value becomes ‘stopped’. Only then, the state machine execution will complete successfully.

  • An RDS instance path to ‘available’ state may vary depending on the various maintenance activities scheduled for the instance.
  • Once the RDS notification event is generated, the instance will go through multiple states till it becomes ‘available’.
  • The use of the 5 minutes timer is to make sure that the automation flow will keep attempting to stop the instance once it becomes available.
  • The second part will make sure that the flow doesn’t end till the instance status is changed to ‘stopped’ and hence notifying the system administrator.

To create an AWS Step Functions state machine

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose State machines.
  • In the State machines pane, choose Create state machine.
  • On the Define state machine page, choose Author with code snippets. For Type, choose Standard.
  • Enter a Name for your state machine, stop-rds-instance-statemachine.
  • In the State machine definition pane, add the following state machine definition using the ARNs of the two Lambda function created earlier, as shown in the following code sample:
{
  "Comment": "stop-rds-instance-statemachine: Automatically shutting down RDS instance after a forced Auto-Restart",
  "StartAt": "retrieveRdsInstanceState",
  "States": {
    "retrieveRdsInstanceState": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceAvailable"
    },
    "isInstanceAvailable": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "available",
          "Next": "stopRdsInstance"
        }
      ],
      "Default": "waitFiveMinutes"
    },
    "waitFiveMinutes": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRdsInstanceState"
    },
    "stopRdsInstance": {
      "Type": "Task",
      "Resource": "stop-rds-instance-lambda Arn",
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "retrieveRDSInstanceStateStopping": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceStopped"
    },
    "isInstanceStopped": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "stopped",
          "Next": "notifyDatabaseAdmin"
        }
      ],
      "Default": "waitFiveMinutesStopping"
    },
    "waitFiveMinutesStopping": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "notifyDatabaseAdmin": {
      "Type": "Pass",
      "Result": "World",
      "End": true
    }
  }
}

This is a definition of the state machine written in Amazon States Language which is used to describe the execution flow of an AWS Step Function.

Choose Next.

  • In the Name pane, enter a name for your state machine, stop-rds-instance-statemachine.
  • In the Permissions pane, choose Create new role. Take note of the the new role’s name displayed at the bottom of the page (example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd).
  • Choose Create state machine
  • By default, the created role only grants the state machine access to CloudWatch logs. Since the state machine will have to make Lambda calls, then another IAM policy has to be associated with the new role.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": "*"
}
]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name rds-auto-restart-stepfunctions-policy and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy.
  • Choose Create policy to save your work.

To link the new policy to the AWS Step Functions execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role create for the state machine (example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd). After selecting the identities, choose Attach policy.

 

Testing the architecture

In order to test the architecture, create a test RDS instance, tag it with auto-restart-protection tag and set the tag value to yes. While the RDS instance is still in creation process, test the Lambda function —  start-statemachine-execution-lambda with a sample event that simulates that the instance was started as it exceeded the maximum time to remain stopped (RDS-EVENT-0154).

To invoke a function

  • Sign in to the AWS Management Console and open the Lambda console.
  • In navigation pane, choose Functions.
  • In Functions pane, choose start-statemachine-execution-lambda.
  • In the upper right corner, choose Test.
  • In the Configure test event page, choose Create new test event and in Event template, leave the default Hello World option.
    {
    "Records": [
        {
        "EventSource": "aws:sns",
        "EventVersion": "1.0",
        "EventSubscriptionArn": "<RDS Event Subscription Arn>",
        "Sns": {
            "Type": "Notification",
            "MessageId": "10001-2d55da-9a73-5e42d46748c0",
            "TopicArn": "<SNS Topic Arn>",
            "Subject": "RDS Notification Message",
            "Message": "{\"Event Source\":\"db-instance\",\"Event Time\":\"2020-07-09 15:15:03.031\",\"Identifier Link\":\"https://console.aws.amazon.com/rds/home?region=<region>#dbinstance:id=<RDS instance id>\",\"Source ID\":\"<RDS instance id>\",\"Event ID\":\"http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154\",\"Event Message\":\"DB instance started\"}",
            "Timestamp": "2020-07-09T15:15:03.991Z",
            "SignatureVersion": "1",
            "Signature": "YsuM+L6N8rk+pBPBWoWeRcSuYqo/BN5v9D2lyoSg0B0uS46Q8NZZSoZWaIQi25TXfHY3RYXCXF9WbVGXiWa4dJs2Mjg46anM+2j6z9R7BDz0vt25qCrCyWhmWtc7yeETrlwa0jCtR/wxXFFexRwynqlZeDfvQpf/x+KNLrnJlT61WZ2FMTHYs124RwWU8NY3pm1Os0XOIvm8rfv3ywm1ccZfP4rF7Lfn+2EK6a0635Z/5aiyIlldNZxbgRYTODJYroO9INTlF7NPzVV1Y/K0E9aaL/wQgLZNquXQGCAxPFWy5lxJKeyUocOWcG48KJGIBUC36JJaqVdIilbZ9HvxTg==",
            "SigningCertUrl": "https://sns.<region>.amazonaws.com/SimpleNotificationService-a86cb10b4e1f29c941702d737128f7b6.pem",
            "UnsubscribeUrl": "https://sns.<region>.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=<arn>",
            "MessageAttributes": {}
        }
        }
    ]
    }
start-statemachine-execution-lambda uses the SNS MessageId parameter as name for the AWS Step Functions execution. The name field is unique for a certain period of time, accordingly, with every test run the MessageId parameter value must be changed. 
  • Choose Create and then choose Test. Each user can create up to 10 test events per function. Those test events are not available to other users.
  • AWS Lambda executes your function on your behalf. The handler in your Lambda function receives and then processes the sample event.
  • Upon successful execution, view results in the console.
  • The Execution result section shows the execution status as succeeded and also shows the function execution results, returned by the return statement. Following is a sample response of the test execution:

Now, verify the execution of the AWS Step Functions state machine:

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In navigation pane, choose State machines.
  • In the State machine pane, choose stop-rds-instance-statemachine.
  • In the Executions pane, choose the execution with the Name value passed in the test event MessageId parameter.
  • In the Visual workflow pane, the real-time execution status is displayed:

  • Under the Step details tab, all details related to inputs, outputs and exceptions are displayed:

Monitoring

It is recommended to use Amazon CloudWatch to monitor all the components in this architecture. You can use AWS Step Functions to log the state of the execution, inputs and outputs of each step in the flow. So when things go wrong, you can diagnose and debug problems quickly.

Cost

When you build the architecture using serverless components, you pay for what you use with no upfront infrastructure costs. Cost will depend on the number of RDS instances tagged to be protected against an automatic start.

Architectural considerations

This architecture has to be deployed per AWS Account per Region.

Conclusion

The blog post demonstrated how to build a fully serverless architecture that monitors and stops RDS instances restarted by AWS. This helps to avoid falling behind on any required maintenance updates. This architecture helps you save cost incurred by started instances’ running hours and licensing implications.  Feel free to submit enhancements to the GitHub repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers

Updating opt-in status for Amazon Pinpoint channels

Post Syndicated from Varinder Dhanota original https://aws.amazon.com/blogs/messaging-and-targeting/updating-opt-in-status-for-amazon-pinpoint-channels/

In many real-world scenarios, customers are using home-grown or 3rd party systems to manage their campaign related information. This includes user preferences, segmentation, targeting, interactions, and more. To create customer-centric engagement experiences with such existing systems, migrating or integrating into Amazon Pinpoint is needed. Luckily, many AWS services and mechanisms can help to streamline this integration in a resilient and cost-effective way.

In this blog post, we demonstrate a sample solution that captures changes from an on-premises application’s database by utilizing AWS Integration and Transfer Services and updates Amazon Pinpoint in real-time.

If you are looking for a serverless, mobile-optimized preference center allowing end users to manage their Pinpoint communication preferences and attributes, you can also check the Amazon Pinpoint Preference Center.

Architecture

Architecture

In this scenario, users’ SMS opt-in/opt-out preferences are managed by a home-grown customer application. Users interact with the application over its web interface. The application, saves the customer preferences on a MySQL database.

This solution’s flow of events is triggered with a change (insert / update / delete) happening in the database. The change event is then captured by AWS Database Migration Service (DMS) that is configured with an ongoing replication task. This task continuously monitors a specified database and forwards the change event to an Amazon Kinesis Data Streams stream. Raw events that are buffered in this stream are polled by an AWS Lambda function. This function transforms the event, and makes it ready to be passed to Amazon Pinpoint API. This API call will in turn, change the opt-in/opt-out subscription status of the channel for that user.

Ongoing replication tasks are created against multiple types of database engines, including Oracle, MS-SQL, Postgres, and more. In this blog post, we use a MySQL based RDS instance to demonstrate this architecture. The instance will have a database we name pinpoint_demo and one table we name optin_status. In this sample, we assume the table is holding details about a user and their opt-in preference for SMS messages.

userid phone optin lastupdate
user1 +12341111111 1 1593867404
user2 +12341111112 1 1593867404
user2 +12341111113 1 1593867404

Prerequisites

  1. AWS CLI is configured with an active AWS account and appropriate access.
  2. You have an understanding of Amazon Pinpoint concepts. You will be using Amazon Pinpoint to create a segment, populate endpoints, and validate phone numbers. For more details, see the Amazon Pinpoint product page and documentation.

Setup

First, you clone the repository that contains a stack of templates to your local environment. Make sure you have configured your AWS CLI with AWS credentials. Follow the steps below to deploy the CloudFormation stack:

  1. Clone the git repository containing the CloudFormation templates:
    git clone https://github.com/aws-samples/amazon-pinpoint-rds-integration.git
    cd amazon-pinpoint-rds-integration
  2. You need an S3 Bucket to hold the template:
    aws s3 create-bucket –bucket <YOUR-BUCKET-NAME>
  3. Run the following command to package the CloudFormation templates:
    aws cloudformation package --template-file template_stack.yaml --output-template-file template_out.yaml --s3-bucket <YOUR-BUCKET-NAME>
  4. Deploy the stack with the following command:
    aws cloudformation deploy --template-file template_out.yaml --stack-name pinpointblogstack --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM

The AWS CloudFormation stack will create and configure resources for you. Some of the resources it will create are:

  • Amazon RDS instance with MySQL
  • AWS Database Migration Service replication instance
  • AWS Database Migration Service source endpoint for MySQL
  • AWS Database Migration Service target endpoint for Amazon Kinesis Data Streams
  • Amazon Kinesis Data Streams stream
  • AWS Lambda Function
  • Amazon Pinpoint Application
  • A Cloud9 environment as a bastion host

The deployment can take up to 15 minutes. You can track its progress in the CloudFormation console’s Events tab.

Populate RDS data

A CloudFormation stack will output the DNS address of an RDS endpoint and Cloud9 environment upon completion. The Cloud9 environment acts as a bastion host and allows you to reach the RDS instance endpoint deployed into the private subnet by CloudFormation.

  1. Open the AWS Console and navigate to the Cloud9 service.
    Cloud9Console
  2. Click on the Open IDE button to reach your IDE environment.
    Cloud9Env
  3. At the console pane of your IDE, type the following to login to your RDS instance. You can find the RDS Endpoint address at the outputs section of the CloudFormation stack. It is under the key name RDSInstanceEndpoint.
    mysql -h <YOUR_RDS_ENDPOINT> -uadmin -pmypassword
    use blog_db;
  4. Issue the following command to create a table that holds the user’s opt-in status:
    create table optin_status (
      userid varchar(50) not null,
      phone varchar(50) not null,
      optin tinyint default 1,
      lastupdate TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
  5. Next, load sample data into the table. The following inserts nine users for this demo:
    
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user1', '+12341111111', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user2', '+12341111112', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user3', '+12341111113', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user4', '+12341111114', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user5', '+12341111115', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user6', '+12341111116', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user7', '+12341111117', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user8', '+12341111118', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user9', '+12341111119', 1);
  6. The table’s opt-in column holds the SMS opt-in status and phone number for a specific user.

Start the DMS Replication Task

Now that the environment is ready, you can start the DMS replication task and start watching the changes in this table.

  1. From the AWS DMS Console, go to the Database Migration Tasks section.
    DMSMigTask
  2. Select the Migration task named blogreplicationtask.
  3. From the Actions menu, click on Restart/Resume to start the migration task. Wait until the task’s Status transitions from Ready to Starting and Replication ongoing.
  4. At this point, all the changes on the source database are replicated into a Kinesis stream. Before introducing the AWS Lambda function that will be polling this stream, configure the Amazon Pinpoint application.

Inspect the AWS Lambda Function

An AWS Lambda function has been created to receive the events. The Lambda function uses Python and Boto3 to read the records delivered by Kinesis Data Streams. It then performs the update_endpoint API calls in order to add, update, or delete endpoints in the Amazon Pinpoint application.

Lambda code and configuration is accessible through the Lambda Functions Console. In order to inspect the Python code, click the Functions item on the left side. Select the function starting with pinpointblogstack-MainStack by clicking on the function name.

Note: The PINPOINT_APPID under the Environment variables section. This variable provides the Lambda function with the Amazon Pinpoint application ID to make the API call.

LambdaPPAPPID

Inspect Amazon Pinpoint Application in Amazon Pinpoint Console

A Pinpoint application is needed by the Lambda Function to update the endpoints. This application has been created with an SMS Channel by the CloudFormation template. Once the data from the RDS database has been imported into Pinpoint as SMS endpoints, you can validate this import by creating a segment in Pinpoint.

PinpointProject

Testing

With the Lambda function ready, you now test the whole solution.

  1. To initiate the end-to-end test, go to the Cloud9 terminal. Perform the following SQL statement on the optin_table:
    UPDATE optin_status SET optin=0 WHERE userid='user1';
    UPDATE optin_status SET optin=0 WHERE userid='user2';
    UPDATE optin_status SET optin=0 WHERE userid='user3';
    UPDATE optin_status SET optin=0 WHERE userid='user4';
  2. This statement will cause four changes in the database which is collected by DMS and passed to Kinesis Data Streams stream.
  3. This triggers the Lambda function that construct an update_endpoint API call to the Amazon Pinpoint application.
  4. The update_endpoint operation is an upsert operation. Therefore, if the endpoint does not exist on the Amazon Pinpoint application, it creates one. Otherwise, it updates the current endpoint.
  5. In the initial dataset, all the opt-in values are 1. Therefore, these endpoints will be created with an OptOut value of NONE in Amazon Pinpoint.
  6. All OptOut=NONE typed endpoints are considered as active endpoints. Therefore, they are available to be used within segments.

Create Amazon Pinpoint Segment

  1. In order to see these changes, go to the Pinpoint console. Click on PinpointBlogApp.
    PinpointConsole
  2. Click on Segments on the left side. Then click Create a segment.
    PinpointSegment
  3. For the segment name, enter US-Segment.
  4. Select Endpoint from the Filter dropdown.
  5. Under the Choose an endpoint attribute dropdown, select Country.
  6. For Choose values enter US.
    Note: As you do this, the right panel Segment estimate will refresh to show the number of endpoints eligible for this segment filter.
  7. Click Create segment at the bottom of the page.
    PinpointSegDetails
  8. Once the new segment is created, you are directed to the newly created segment with configuration details. You should see five eligible endpoints corresponding to database table rows.
    PinpointSegUpdate
  9. Now, change one row by issuing the following SQL statement. This simulates a user opting out from SMS communication for one of their numbers.
    UPDATE optin_status SET optin=0 WHERE userid='user5';
  10. After the update, go to the Amazon Pinpoint console. Check the eligible endpoints again. You should only see four eligible endpoints.

PinpointSegUpdate

Cleanup

If you no longer want to incur further charge, delete the Cloudformation stack named pinpointblogstack. Select it and click Delete.

PinpointCleanup

Conclusion

This solution walks you through how opt-in change events are delivered from Amazon RDS to Amazon Pinpoint. You can use this solution in other use cases as well. Some examples are importing segments from a 3rd party application like Salesforce and importing other types of channels like e-mail, push, and voice. To learn more about Amazon Pinpoint, visit our website.

Building AWS Glue Spark ETL jobs by bringing your own JDBC drivers for Amazon RDS

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/building-aws-glue-spark-etl-jobs-by-bringing-your-own-jdbc-drivers-for-amazon-rds/

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue has native connectors to connect to supported data sources either on AWS or elsewhere using JDBC drivers. Additionally, AWS Glue now enables you to bring your own JDBC drivers (BYOD) to your Glue Spark ETL jobs. This feature enables you to connect to data sources with custom drivers that aren’t natively supported in AWS Glue, such as MySQL 8 and Oracle 18. You can also use multiple JDBC driver versions in the same AWS Glue job, enabling you to migrate data between source and target databases with different versions. For more information, see Connection Types and Options for ETL in AWS Glue.

This post shows how to build AWS Glue ETL Spark jobs and set up connections with custom drivers with Oracle18 and MySQL8 databases.

Solution overview

We discuss three different use cases in this post, using AWS Glue, Amazon RDS for MySQL, and Amazon RDS for Oracle.

In the following architecture, we connect to Oracle 18 using an external ojdbc7.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

In the following architecture, we connect to Oracle 18 using an external ojdbc7.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

In the second scenario, we connect to MySQL 8 using an external mysql-connector-java-8.0.19.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to MySQL 8.

In the second scenario, we connect to MySQL 8 using an external mysql-connector-java-8.0.19.jar driver from AWS Glue ETL, extract the data, transform it, and load the transformed data to MySQL 8.

In the third scenario, we set up a connection where we connect to Oracle 18 and MySQL 8 using external drivers from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

In the third scenario, we set up a connection where we connect to Oracle 18 and MySQL 8 using external drivers from AWS Glue ETL, extract the data, transform it, and load the transformed data to Oracle 18.

Prerequisites

Before getting started, you must complete the following prerequisites:

  1. Create an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console. Your IAM permissions must also include access to create IAM roles and policies created by the AWS CloudFormation template provided in this post.
  2. Create am IAM policy for AWS Glue.
  3. Before setting up the AWS Glue job, you need to download drivers for Oracle and MySQL, which we discuss in the next section.

Downloading drivers for Oracle and MySQL

To download the required drivers for Oracle and MySQL, complete the following steps:

  1. Download the MySQL JDBC connector.
  2. Select the operating system as platform independent and download the .tar.gz or .zip file (for example, mysql-connector-java-8.0.19.tar.gz or mysql-connector-java-8.0.19.zip) and extract it.
  3. Pick MySQL connector .jar file (such as mysql-connector-java-8.0.19.jar) and upload it into your Amazon Simple Storage Service (Amazon S3) bucket.
  4. Make a note of that path because you use it later in the AWS Glue job to point to the JDBC driver.
  5. Similarly, download the Oracle JDBC connector (ojdbc7.jar).

This post is tested for mysql-connector-java-8.0.19.jar and ojdbc7.jar drivers, but based on your database types, you can download and use appropriate version of JDBC drivers supported by the database.

  1. Upload the Oracle JDBC 7 driver to (ojdbc7.jar) to your S3 bucket.
  2. Make a note of that path, because you use it in the AWS Glue job to establish the JDBC connection with the database.
  3. Make sure to upload the three scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) in an S3 bucket.
  4. Save the following code as py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_oracle18_options_source_emp = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "employee",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
        
    # Read DynamicFrame from Oracle 
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle 
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)

  1. Save the following code as MySQLBYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    connection_mysql8_options_source_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_mysql8_options_target_emp_dept = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "emp_dept",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
        
    
    # Read from JDBC databases with custom driver
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    
    #print "Applied mapping to the Glue DynamicFrame"
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="mysql", connection_options=connection_mysql8_options_target_emp_dept)
    

  1. Save the following code as CrossDB_BYOD.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext, SparkConf
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    connection_mysql8_options_source_emp = {
        "url": "jdbc:mysql://<MySQL RDS Endpoint>:3306/byod",
        "dbtable": "employee",
        "user": "MySQLadmin",
        "password": "MYSQLadmin123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/mysql-connector-java-8.0.19.jar",
        "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver"}
    
    connection_oracle18_options_source_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    connection_oracle18_options_target_emp_dept = {
        "url": "jdbc:oracle:thin://@<Oracle RDS Endpoint>:1521:orcl",
        "dbtable": "emp_dept",
        "user": "byod",
        "password": "Awsuser123",
        "customJdbcDriverS3Path": "s3://<Bucket>/<Folder>/ojdbc7.jar",
        "customJdbcDriverClassName": "oracle.jdbc.OracleDriver"}
    
    # Read DynamicFrame from Oracle
    df_emp = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql8_options_source_emp)
    df_emp = ApplyMapping.apply(frame = df_emp, mappings = [("employee_id", "integer", "employee_id", "integer"), ("first_name", "string", "first_name", "string"), ("last_name", "string", "last_name", "string"), ("email", "string", "email", "string"), ("phone_number", "string", "phone_number", "string"), ("hire_date", "string", "hire_date", "string"), ("job_id", "string", "job_id", "string"), ("salary", "long", "salary", "long"), ("commission_pct", "long", "commission_pct", "long"), ("manager_id", "long", "manager_id", "long"), ("department_id", "integer", "department_id", "integer")])
    df_emp = df_emp.drop_fields(['phone_number','hire_date','job_id','salary','commission_pct','manager_id'])
    df_dept = glueContext.create_dynamic_frame.from_options(connection_type="oracle", connection_options=connection_oracle18_options_source_dept)
    df_dept = ApplyMapping.apply(frame = df_dept, mappings = [("department_id", "integer", "dept_id", "integer"), ("dept_name", "string", "dept_name", "string")])
    
    
    
    df_emp.printSchema()
    df_dept.printSchema()
    
    df_emp_dept = Join.apply(df_emp, df_dept, 'department_id', 'dept_id')
    df_emp_dept = df_emp_dept.drop_fields(['department_id','dept_id'])
    df_emp_dept = DropNullFields.apply(frame = df_emp_dept)
    
    df_emp_dept.printSchema()
    
    # Write data to Oracle
    glueContext.write_from_options(frame_or_dfc=df_emp_dept, connection_type="oracle", connection_options=connection_oracle18_options_target_emp_dept)
    
    

Provisioning resources with AWS CloudFormation

The generic workflow of setting up a connection with your own custom JDBC drivers involves various steps. It’s a manual configuration that is error prone and adds overhead when repeating the steps between environments and accounts. With AWS CloudFormation, you can provision your application resources in a safe, repeatable manner, allowing you to build and rebuild your infrastructure and applications without having to perform manual actions or write custom scripts. The declarative code in the file captures the intended state of the resources to create, and allows you to automate the creation of AWS resources.

We provide this CloudFormation template for you to use. Review and customize it to suit your needs. Some of the resources deployed by this stack incur costs as long as they remain in use, like Amazon RDS for Oracle and Amazon RDS for MySQL.

This CloudFormation template creates the following resources:

  • A VPC
  • Two subnets
  • A route table
  • An internet gateway
  • A MySQL 8 database
  • An Oracle 18 database

To provision your resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Next.
  4. Choose Launch Stack:

This step automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console as required.

  1. For Stack name, enter a name.
  2. Change the other parameters as needed or keep the following default values:
    1. Oracle user nameoraadmin
    2. Oracle passwordoraadmin123
    3. MySQL usernameMySQLadmin
    4. MySQL passwordMYSQLadmin123

Change the other parameters as needed or keep the following default values:

  1. Choose Next.
  2. Choose Next
  3. Review the details and choose Create.

This stack creation can take up to 20 minutes.

After the stack creation is complete, go to the Outputs tab on the AWS CloudFormation console and note the following values (you use these in later steps):

  • MySQLJDBCConnectionString
  • OracleJDBCConnectionString

Configuring an AWS Glue ETL job using your own drivers

Before creating an AWS Glue ETL, run the SQL script (database_scripts.sql) on both the databases (Oracle and MySQL) to create tables and insert data. For more information about connecting to the RDS DB instance, see How can I troubleshoot connectivity to an Amazon RDS DB instance that uses a public or private subnet of a VPC?

To set up AWS Glue connections, complete the following steps:

  1. On the AWS Glue console, under Databases, choose Connections.
  2. Choose Add Connection.
  3. For Connection Name, enter a name for your connection.
  4. For Connection Type, choose JDBC.
  5. For JDBC URL, enter a URL, such as jdbc:oracle:thin://@<hostname>:1521/ORCL for Oracle or jdbc:mysql://<hostname>:3306/mysql for MySQL.
  6. Enter the user name and password for the database.
  7. Select the VPC in which you created the RDS instance (Oracle and MySQL).
  8. Choose the subnet within your VPC. Refer to the CloudFormation stack Outputs tab for the subnet name.
  9. Choose the security group of the database. Refer to the CloudFormation stack Outputs tab for security group name.
  10. Choose Next.
  11. Check the connection details and choose Finish.

Make sure to add a connection for both databases (Oracle and MySQL).

Creating endpoints and a security group

Before testing the connection, make sure you create an AWS Glue endpoint and S3 endpoint in the VPC in which databases are created. Complete the following steps for both Oracle and MySQL instances:

  1. To create your AWS Glue endpoint, on the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose AWS Glue.
  4. Choose amazonaws.<region>.glue (for example, com.amazonaws.us-west-2.glue).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL
  6. Choose the security group of the RDS instances.
  7. Choose Create endpoint.

To create your S3 endpoint, you use Amazon Virtual Private Cloud (Amazon VPC).

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Names, choose Amazon S3.
  4. Choose amazonaws.<region>.s3 (for example, com.amazonaws.us-west-2.s3).
  5. Choose the VPC of the RDS for Oracle or RDS for MySQL
  6. Choose the route table ID.
  7. Choose Create endpoint.

The RDS for Oracle or RDS for MySQL security group must include itself as a source in its inbound rules.

  1. On the Security Groups page, choose Edit Inbound Rules.
  2. Choose Add rule.
  3. For Type, choose All Traffic Type, for example glue-byod-stack1….
  4. For Source, choose the same security group.
  5. Choose Save Rules.

If both the databases are in the same VPC and subnet, you don’t need to create a connection for MySQL and Oracle databases separately. The reason for setting an AWS Glue connection to the databases is to establish a private connection between the RDS instances in the VPC and AWS Glue via S3 endpoint, AWS Glue endpoint, and Amazon RDS security group. It’s not required to test JDBC connection because that connection is established by the AWS Glue job when you run it. If you test the connection with MySQL8, it fails because the AWS Glue connection doesn’t support the MySQL 8.0 driver at the time of writing this post, therefore you need to bring your own driver.

Setting up AWS Glue ETL jobs

You’re now ready to set up your ETL job in AWS Glue. Complete the following steps for both connections:

  1. Edit the following parameters in the scripts (OracleBYOD.py, MySQLBYOD.py, and CrossDB_BYOD.py) and upload them in Amazon S3:
    1. url
    2. user
    3. password
    4. customJdbcDriverS3Path for sources and target tables

You can find the database endpoints (url) on the CloudFormation stack Outputs tab; the other parameters are mentioned earlier in this post. If you use another driver, make sure to change customJdbcDriverClassName to the corresponding class in the driver.

Alternatively, you can pass on this as AWS Glue job parameters and retrieve the arguments that are passed using the getResolvedOptions.

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Add Job.
  3. For Job Name, enter a name.
  4. For IAM role, choose the IAM role you created as a prerequisite.
  5. For Type, choose Spark.
  6. For Glue Version, choose Python (latest version).
  7. For This job runs, choose An existing script that you provide.
  8. Choose the Amazon S3 path where the script (OracleBYOD.py, MySQLBYOD.py, or CrossDB_BYOD.py) is stored.
  9. Under Advanced properties, enable Job bookmark.

Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.

  1. Keep the remaining settings as their defaults and choose
  2. For Connections, choose the Amazon RDS for Oracle connection for OracleBYOD.py, Amazon RDS for MySQL connection for MySQLBYOD.py, or Amazon RDS for Oracle and Amazon RDS for MySQL connection for CrossDB_BYOD.py.
  3. Choose Save job and edit scripts.
  4. Choose Run Job.
  5. When the job is complete, validate the data loaded in the target table.

Cleaning up

After you finish, don’t forget to delete the CloudFormation stack, because some of the AWS resources deployed by the stack in this post incur a cost as long as you continue to use them.

You can delete the CloudFormation stack to delete all AWS resources created by the stack.

  1. On the AWS CloudFormation console, on the Stacks page, select the stack to delete. The stack must be currently running.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted.

Summary

In this post, we showed you how to build AWS Glue ETL Spark jobs and set up connections with custom drivers with Oracle18 and MySQL8 databases using AWS CloudFormation. You can use this solution to use your custom drivers for databases not supported natively by AWS Glue.

If you have any questions or suggestions, please leave a comment.


About the Authors

Srikanth Sopirala is a Sr. Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family and road biking.

 

 

Naresh Gautam is a Sr. Analytics Specialist Solutions Architect at AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.

 

 

Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/sharing-amazon-redshift-data-securely-across-amazon-redshift-clusters-for-workload-isolation/

Amazon Redshift data sharing allows for a secure and easy way to share live data for read purposes across Amazon Redshift clusters. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query runs.

In this post, we discuss how to use Amazon Redshift data sharing to achieve workload isolation across diverse analytics use cases and achieve business-critical SLAs. For more information about this new feature, see Announcing Amazon Redshift data sharing (preview).

How to use Amazon Redshift data sharing

Amazon Redshift data sharing allows a producer cluster to share data objects to one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data.

Data sharing between Amazon Redshift clusters is a two-step process. First, the producer cluster administrator that wants to share data creates an Amazon Redshift data share, a new named object introduced with this release to serve as a unit of sharing. The producer cluster adds the needed database objects such as schemas, tables, and views to the data share and specifies a list of consumer clusters with which to share the data share. Following that, privileged users on consumer clusters create an Amazon Redshift local database reference from the data share made available to them and grant permissions on the database objects to appropriate users and groups. Users and groups can then list the shared objects as part of the standard metadata queries and start querying immediately.

Solution overview

For this post, we use a use case in which the producer cluster is a central ETL cluster hosting enterprise sales data, a 3 TB Cloud DW benchmark dataset based on the TPC-DS benchmark dataset. This cluster serves multiple BI and data science clusters purpose-built for distinct business groups within the organization. One such group is the sales BI team, who runs BI reports using customer sales data created in the central ETL cluster and joined with the product reviews dataset that they loaded into the BI cluster they manage.

This approach helps the sales BI team isolate data lifecycle management between the enterprise sales dataset in the ETL producer from the product reviews data that they fully manage in the BI consumer cluster to simplify data stewardship. It also allows for agility, allows sizing clusters independently to provide workload isolation, and creates a simple cost charge-back model.

As depicted in the following diagram, the central ETL cluster etl_cluster hosts the sales data in a schema named sales. We demonstrate how to build the semantic layer later in this post. A superuser in etl_cluster then creates a data share named salesdatashare, adds the bi_semantic schema and all objects in that schema to the data share, and grants usage permissions to the BI consumer cluster named bi_cluster. Keep in mind that a data share is simply a metadata container and represents what data is shared from producer to consumer. No data is actually moved.

As depicted in the following diagram, the central ETL cluster etl_cluster hosts the sales data in a schema named sales and performs transformations to create a semantic layer required for BI reports in a new schema named bi_semantic.

The superuser in the BI consumer cluster creates a local database reference named sales_semantic from the data share (step 2 in the preceding diagram). The BI users use the product reviews dataset in the local schema named product_reviews and join with bi_semantic data for reporting purposes (step 3).

You can find the script in the products review dataset, which we use in this post to load the dataset into bi_cluster. You can load the DW benchmark dataset into etl_cluster using this github link. Loading these datasets into the respective Amazon Redshift clusters is outside the scope of this post, and is a prerequisite to following the instructions we outline.

The following diagram depicts the cloud DW benchmark data model used.

The following diagram depicts the cloud DW benchmark data model used.

The following table summarizes the data.

Table Name Rows
STORE_SALES 8,639,936,081
CUSTOMER_ADDRESS 15,000,000
CUSTOMER 30,000,000
CUSTOMER_DEMOGRAPHICS 1,920,800
ITEM 360,000
DATE_DIM 73,049

Building a BI semantic layer

A BI semantic layer is a representation of enterprise data in a way that simplifies BI reporting requirements and offers better performance. In our use case, the BI semantic layer transforms sales data to create a customer denormalized dataset and another dataset for all store sales by product in a given year. The following queries are run on the etl_cluster to create the BI semantic layer.

  1. Create a new schema to host BI semantic tables with the following SQL:
    Create schema bi_semantic;

  2. Create a denormalized customer view with select columns required for sales BI team:
    create view bi_semantic.customer_denorm 
    as
    select
    	c_customer_sk,
    	c_customer_id,
    	c_birth_year,
    	c_birth_country,
    	c_last_review_date_sk,
    	ca_city,
    	ca_state,
    	ca_zip,
    	ca_country,
    	ca_gmt_offset,
    	cd_gender,
    	cd_marital_status,
    	cd_education_status
    from sales.customer c, sales.customer_address ca, sales.customer_demographics cd
    where
    c.c_current_addr_sk=ca.ca_address_sk
    and c.c_current_cdemo_sk=cd.cd_demo_sk;

  1. Create a second view for all product sales with columns required for BI team:
    create view bi_semantic.product_sales
    as 
    select 
    	i_item_id,
    	i_product_name,
    	i_current_price,
    	i_wholesale_cost,
    	i_brand_id,
    	i_brand,
    	i_category_id,
    	i_category,
    	i_manufact,
    	d_date,
    	d_moy,
    	d_year,
    	d_quarter_name,
    	ss_customer_sk,
    	ss_store_sk,
    	ss_sales_price,
    	ss_list_price,
    	ss_net_profit,
    	ss_quantity,
    	ss_coupon_amt
    from sales.store_sales ss, sales.item i, sales.date_dim d
    where ss.ss_item_sk=i.i_item_sk
    and ss.ss_sold_date_sk=d.d_date_sk;

Sharing data across Amazon Redshift clusters

Now, let’s share the bi_semantic schema in the etl_cluster with the bi _cluster.

  1. Create a data share in the etl_cluster using the following command when connected to the etl_cluster. The producer cluster superuser and database owners can create data share objects. By default, PUBLICACCESSIBLE is false. If the producer cluster is publicly accessible, you can add PUBLICACCESSIBLE = true to the following command:
    CREATE DATASHARE SalesDatashare;

  1. Add the BI semantic views to the data share. To add objects to the data share, add the schema before adding objects. Use ALTER DATASHARE to share the entire schema; to share tables, views, and functions in a given schema; and to share objects from multiple schemas:
    ALTER DATASHARE SalesDatashare ADD SCHEMA bi_semantic;
    ALTER DATASHARE SalesDatashare ADD ALL TABLES IN SCHEMA bi_semantic;

The next step requires a cluster namespace GUID from the bi_cluster. One way to find the namespace value of a cluster is to run the SQL statement select current_namespace when connected to the bi_cluster. Another way is on the Amazon Redshift console: choose your Amazon Redshift consumer cluster, and find the value under Namespace located in the General information section.

  1. Add consumers to the data share using the following command:
    GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE '1m137c4-1187-4bf3-8ce2-e710b7100eb2';

  1. View the list of the objects added to the share using the following command. The share type is outbound on the producer cluster.
    DESC DATASHARE salesdatashare;

The following screenshot shows our list of objects.

The following screenshot shows our list of objects.

Consuming the data share from the consumer BI Amazon Redshift cluster

From the bi_cluster, let’s review, consume, and set permissions on the data share for end-user consumption.

  1. On the consumer BI cluster, view the data shares using the following command as any user:
    SHOW DATASHARES;

The following screenshot shows our results. Consumers should be able to see the objects within the incoming share but not the full list of consumers associated with the share. For more information about querying the metadata of shares, see DESC DATASHARE.

The following screenshot shows our results.

  1. Start the consumption by creating a local database from the salesdatashare. Cluster users with the permission to do so can create a database from the shares. We use the namespace from the etl_cluster.
    CREATE DATABASE Sales_semantic from DATASHARE SalesDatashare OF NAMESPACE '45b137c4-1287-4vf3-8cw2-e710b7138nd9'; 

Consumers should be able to see databases that they created from the share, along with the databases local to the cluster, at any point by querying SVV_REDSHIFT* tables. Data share objects aren’t available for queries until a local database reference is created using a create database statement.

  1. Run the following command to list the databases in bi_cluster:
    select * from svv_redshift_databases;

The following screenshot shows that both the local and shared databases are listed so that you can explore and navigate metadata for shared datasets.

The following screenshot shows that both the local and shared databases are listed so that you can explore and navigate metadata for shared datasets.

  1. Grant usage on the database to bi_group, where bi_group is a local Amazon Redshift group with BI users added to that group:
    GRANT USAGE ON DATABASE sales_semantic TO bi_group;

Querying as the BI user

In this section, you connect as a user in the bi_group who got access to the shared data. The user is still connected to the local database on the bi_cluster but can query the shared data via the new cross-database query functionality in Amazon Redshift.

  1. Review the list of objects in the share by running the following SQL:
    SELECT schema_name, table_name, table_type FROM  svv_redshift_tables
         where database_name = 'sales_semantic'

The following screenshot shows our results.

The following screenshot shows our results.

  1. Review the list of columns in the customer_denorm view::
    SELECT * FROM  svv_redshift_columns 
       where database_name = 'sales_semantic' and table_name = 'customer_denorm';

The following screenshot shows our results.

The following screenshot shows our results.

  1. Query the shared objects using three-part notation just like querying any other local database object, using a notation <database>.<schema>.<view/table>:
    select count(*) from sales_semantic.bi_semantic.customer_denorm;

Following is your result:

28950139

  1. Analyze the local product reviews data by joining the shared customer_denorm data to identify the top ratings by customer states for this BI report:
    SELECT PR.product_category, c.ca_state AS customer_state,
                  count(PR.star_rating) AS cnt
          FROM product_reviews.amazon_reviews PR,               --local data
               sales_semantic.bi_semantic.customer_denorm  C    –-shared data
          WHERE  PR.customer_id = C.c_customer_sk
             AND PR.marketplace = 'US'
          GROUP BY 1, 2
          order by cnt desc
          Limit 10;

The following screenshot shows our results.

The following screenshot shows our results.

Adding a data science consumer

Now, let’s assume the company has decided to spin up a data science team to help with new sales strategies, and this team performs analytics on the sales data. The data science team is new and has very different access patterns and SLA requirements compared to the BI team. Thanks to the data sharing feature, onboarding new use cases such as this is easy.

We add a data science consumer cluster named ds_cluster. Because the data science users need access to data in salesdatashare, the superuser in the etl_cluster can simply grant access to the ds_cluster by adding them as another consumer for the share without moving any data:

GRANT USAGE ON DATASHARE SalesDatashare TO NAMESPACE ''1h137c4-1187-4w53-8de2-e710b7100es2';

The following diagram shows our updated architecture with the data science consumer (step 4).

The following diagram shows our updated architecture with the data science consumer (step 4).

This way, multiple clusters of different sizes can access the same dataset and isolate workloads to meet their SLA requirements. Users in these respective clusters are granted access to shared objects to meet their stringent security requirements. The producer keeps control of the data and at any point can remove certain objects from the share or remove access to the share for any of these clusters, and the consumers immediately lose access to the data. Also, as more data is ingested into the producer cluster, the consumer sees transactionally consistent data instantly.

Monitoring and security

Amazon Redshift offers comprehensive auditing capabilities using system tables and AWS CloudTrail to allow you to monitor the data sharing permissions and usage across all the consumers and revoke access instantly when necessary. The permissions are granted by the superusers from both the producer and the consumer clusters to define who gets access to what objects, similar to the grant commands used in the earlier scenario. You can use the following commands to audit the usage and activities for the data share.

Track all changes to the data share and the shared database imported from the data share with the following code:

Select username, share_name, recordtime, action, 
         share_object_type, share_object_name 
  from svl_datashare_change_log
   order by recordtime desc;

The following screenshot shows our results.

The following screenshot shows our results.

Track data share access activity (usage), which is relevant only on the producer, with the following code:

Select * from svl_datashare_usage;

The following screenshot shows our results.

The following screenshot shows our results.

Summary

Amazon Redshift data sharing provides workload isolation by allowing multiple consumers to share data seamlessly without the need to unload and load data. We also presented a step-by-step guide for securely sharing data from a producer to multiple consumer clusters.


About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

 

 

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.

Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-federated-querying-to-amazon-aurora-mysql-and-amazon-rds-for-mysql/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in April 2020, we announced general availability for federated querying to Amazon Aurora PostgreSQL and Amazon Relational Database Service (Amazon RDS) for PostgreSQL to enable you to query data across your operational databases, your data warehouse, and your data lake to gain faster and deeper insights not possible otherwise.

Today, we’re launching a new feature of Amazon Redshift federated query to Amazon Aurora MySQL and Amazon RDS for MySQL to help you expand your operational databases in the MySQL family. With this lake house architecture expansion to support more operational data stores, you can query and combine data more easily in real time and store data in open file formats in your Amazon Simple Storage Service (Amazon S3) data lake. Your data can then be more available to other analytics and machine learning (ML) tools, rather than siloed in disparate data stores.

In this post, we share information about how to get started with this new federated query feature to MySQL.

Prerequisites

To try this new feature, create a new Amazon Redshift cluster in a sql_preview maintenance track and Aurora MySQL instance and load sample TPC data into both data stores. To make sure both Aurora MySQL DB instances can accept connections from the Amazon Redshift cluster, you should make sure that both your Amazon Redshift cluster and Aurora MySQL instances are in the same Amazon Virtual Private Cloud (Amazon VPC) and subnet group. This way, you can add the security group for the Amazon Redshift cluster to the inbound rules of the security group for the Aurora MySQL DB instance.

If your Amazon Redshift cluster and Aurora MySQL instances are in the different VPC, you can set up VPC peering or other networking to allow Amazon Redshift to make connections to your Aurora MySQL instances. For more information about VPC networking, see Working with a DB instance in a VPC.

Configuring AWS Secrets Manager for remote database credentials

Amazon Redshift needs database credentials to issue a federated query to a MySQL database. AWS Secrets Manager provides a centralized service to manage secrets and can be used to store your MySQL database credentials. Because Amazon Redshift retrieves and uses these credentials, they are transient, not stored in any generated code, and discarded after the query runs.

Storing credentials in Secrets Manager takes only a few minutes. To store a new secret, complete the following steps:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. For Select secret type, select Credentials for RDS database.
  4. For User name, enter a name.
  5. For Password, enter a password.
  6. For Select the encryption key, choose DefaultEncryptionkey.
  7. For Select which RDS database this secret will access, choose your database.

Storing credentials in Secrets Manager takes only a few minutes.

  1. Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post).
  2. Choose Next.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console. The secret ARN is needed in the subsequent step.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console.

Setting up IAM role

You can now pull everything together by embedding the secret ARN into an AWS Identity and Access Management (IAM) policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "<SecretARN>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. On the Amazon Redshift console, choose Clusters.
  2. Choose your cluster.
  3. On the Actions drop-down menu, choose Manage IAM roles.

Finally, attach the same IAM role to your Amazon Redshift cluster.

  1. Choose and add the IAM role you just created.

Setting up external schema

The final step is to create an external schema to connect to your Aurora MySQL instance. The following example code creates the external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – Aurora MySQL cluster endpoint
  • IAM_Role – IAM role created from the previous step
  • Secret_ARN – Secret ARN

After you set up the external schema, you’re ready to run some queries to test different use cases.

Querying live operational data

You can now query real-time operational data in your Aurora MySQL instance from Amazon Redshift. Note that isolation level is read committed for MySQL. See the following code:

dev=# select top 10 ws_order_number from mysqlfq.web_sales;
 ws_order_number 
-----------------
        93628990
       157020207
         4338647
        41395871
        58468186
       171095867
        12514566
        74946143
         3418243
        67054239
(10 rows)

Querying mysqlfq.web_sales in Amazon Redshift routes the request to MySQL tpc database and web_sales table. If you examine the query plan, you can see the query runs at the MySQL instance as shown by the step Remote MySQL Seq Scan:

dev=# explain select top 10 ws_order_number from mysqlfq.web_sales;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..0.20 rows=10 width=8)
   ->  XN MySQL Query Scan web_sales  (cost=0.00..6869.28 rows=343464 width=8)
         ->  Remote MySQL Seq Scan mysqlfq.web_sales  (cost=0.00..3434.64 rows=343464 width=8)
(3 rows)

Simplifying ELT and ETL

You can also extract operational data directly from your Aurora MySQL instance and load it into Amazon Redshift. See the following code:

dev=# create table staging_customer as select c_customer_id from mysqlfq.customer where c_customer_id not in (select c_customer_id from customer);
SELECT
dev=# select count(*) from staging_customer;
 count  
--------
 350000
(1 row)

The preceding code uses CTAS to create and load incremental data from your operational MySQL instance into a staging table in Amazon Redshift. You can then perform transformation and merge operations from the staging table to the target table. For more information, see Updating and inserting new data.

Combining operational data with data from your data warehouse and data lake

You can combine live operational data from your Aurora MySQL instance with data from your Amazon Redshift data warehouse and S3 data lake by creating a late binding view.

To access your S3 data lake historical data via Amazon Redshift Spectrum, create an external table:

create external schema mysqlspectrum
from data catalog
database 'spectrumdb'
iam_role '<IAMRole>'
create external database if not exists;
 
create external table mysqlspectrum.customer 
stored as parquet 
location 's3://<yourS3bucket>/customer/'
as select * from customer where c_customer_sk <= 100000;

You can then run queries on the view to gain insight on data across the three sources:

drop view vwCustomer;
create view vwCustomer as
select c_customer_sk, 'redshift' as source from public.customer where c_customer_sk > 100000
union all
select c_customer_sk, 'mysql' as source from mysqlfq.customer
union all
select c_customer_sk, 's3' as source from mysqlspectrum.customer
with no schema binding;

select * from vwCustomer where c_customer_sk in (1, 149712,29033279);

You should the following three records as output:

dev=# select * from vwCustomer where c_customer_sk in (1, 149712,29033279);
 c_customer_sk |  source  
---------------+----------
      29033279 | mysql
             1 | s3
        149712 | redshift
(3 rows)

If you examine the query plan, you can see that the predicates are pushed down to your MySQL instance to run:

dev=# explain select * from vwCustomer where c_customer_sk in (1,149712,29033279);
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan vwcustomer  (cost=0.00..48398.40 rows=6988 width=36)
   ->  XN Append  (cost=0.00..48328.52 rows=6988 width=4)
         ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..40000.03 rows=3 width=4)
               ->  XN Seq Scan on customer  (cost=0.00..40000.00 rows=3 width=4)
                     Filter: (((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279)) AND (c_customer_sk > 100000))
         ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..6548.63 rows=5492 width=4)
               ->  XN MySQL Query Scan customer  (cost=0.00..6493.71 rows=5492 width=4)
                     ->  Remote MySQL Seq Scan mysqlfq.customer  (cost=0.00..6438.79 rows=5492 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
         ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..1779.86 rows=1493 width=4)
               ->  XN S3 Query Scan customer  (cost=0.00..1764.93 rows=1493 width=4)
                     ->  S3 Seq Scan mysqlspectrum.customer location:"s3://<yourS3bucket>/customer" format:PARQUET  (cost=0.00..1750.00 rows=1493 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
(13 rows)

Available Now

Amazon Redshift federated querying to Aurora MySQL and Amazon RDS for MySQL is now available for public preview with Amazon Redshift release version 1.0.21591 or later. Refer to the AWS Region Table for Amazon Redshift availability and to check the version of your clusters.


About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next-generation big data analytics platform using AWS technologies.

 

Zhouyi Yang is a Software Development Engineer for the Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

 

 

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Use a single AWS Managed Microsoft AD for Amazon RDS for SQL Server instances in multiple Regions

Post Syndicated from Jeremy Girven original https://aws.amazon.com/blogs/security/use-a-single-aws-managed-microsoft-ad-for-amazon-rds-for-sql-server-instances-in-multiple-regions/

Many Amazon Web Services (AWS) customers use Active Directory to centralize user authentication and authorization for a variety of applications and services. For these customers, Active Directory is a critical piece of their IT infrastructure.

AWS offers AWS Directory Service for Microsoft Active Directory, also known as AWS Managed Microsoft AD, to provide a highly accessible and resilient Active Directory service that is built on Microsoft Active Directory.

AWS also offers Amazon Relational Database Service (Amazon RDS) for SQL Server. Amazon RDS enables you to prioritize application development by managing time-consuming database administration tasks including provisioning, backups, software patching, monitoring, and hardware scaling. If you require Windows authentication with Amazon RDS for SQL Server, Amazon RDS for SQL Server instances need to be integrated with AWS Managed Microsoft AD.

With the release of AWS Managed Microsoft AD cross-Region support, you only need one distinct AWS Managed Microsoft AD that spans multiple AWS Regions; this simplifies directory management and configuration. Additionally, it simplifies trusts between the AWS Managed Microsoft AD domain and your on-premises domain. Now, only a single trust between your on-premises domain and AWS Managed Microsoft AD domain is required, as compared to the previous pattern of only one AWS Managed Microsoft AD per Region—each of which would require a trust if you wanted to allow on-premises objects access to your AWS Managed Microsoft AD domain. Further, AWS Managed Microsoft AD cross-Region support provides an additional benefit when using your on-premises users and groups with Amazon RDS for SQL Server: You only need a single, one-way, outgoing trust between your multi-Region AWS Managed Microsoft AD and your on-premises domain.

As detailed in this post, to enable AWS Managed Microsoft AD cross-Region support, you create a new AWS Managed Microsoft AD and extend it to multiple Regions (as shown in Figure 1 below). Once you’ve extended your directory, you deploy an Amazon RDS SQL Server instance in each Region, integrating it to the same directory. Finally, you install SQL Server Management Studio (SSMS) on an instance joined to the AWS Managed Microsoft AD directory. You use that instance to connect to the RDS SQL Server instances using the same domain user account.

Figure 1: High level diagram of resources deployed in this post

Figure 1: High level diagram of resources deployed in this post

The architecture in Figure 1 includes a network connection between the Regions. That connection isn’t required for the AWS Managed Microsoft AD to function. If you don’t require network connectivity between your regions, you can disregard the network link in the diagram. Since you will be using a single Amazon Elastic Compute Cloud (Amazon EC2) instance in one Region, the network connection is needed between Amazon VPCs in the two Regions to allow that instance to connect to a domain controller in each Region.

Prerequisites for AWS Managed Microsoft AD cross-Region Support

  1. An AWS Managed Microsoft AD deployed in a Region of your choice. If you don’t have one already deployed, you can follow the instructions in Create Your AWS Managed Microsoft AD directory to create one. For this post, I recommend that you use us-east-1.
  2. The VPC must be peered in order to complete the steps in this blog. Creating and accepting a VPC peering connection has information on how to create a peering connection between Regions. Be aware of unsupported VPC peering configurations.
  3. A Windows Server instance joined to your managed Active Directory domain. Join an EC2 Instance to Your AWS Managed Microsoft AD Directory has instructions if you need assistance.
  4. Install the Active Directory administration tools onto your domain-joined instance. Installing the Active Directory Administration Tools has detailed instructions.

Extend your AWS Managed Microsoft AD to another Region

We’ve made the process to extend your directory to another Region straightforward. There is no cost to add another Region; you only pay for the resources for your directory running in the new Region. See here for additional information on pricing changes with new Regions. For example, in this post you will be extending your directory into the us-east-2 region. There will be an additional cost for two new domain controllers. Figure 3 shows the additional cost to extend the directory.

Let’s walk through the steps of setting up Windows Authentication with Amazon RDS for SQL Server instances in multiple Regions using a single cross-Region AWS Managed Microsoft AD.

To extend your directory to another Region:

  1. In the AWS Directory Service console navigation pane, choose Directories.

    Note: You should see a list of your AWS Managed Microsoft AD directories.

  2. Choose the Directory ID of the directory you want to expand to another Region.
  3. Go to the Directory details page. In the Multi-region replication section, select Add Region.

    Figure 2: Directory details and new multi-Region replication pane

    Figure 2: Directory details and new multi-Region replication pane

  4. On the Add region page:
    1. For Region to add, select the Region you want to extend your directory to.
    2. For VPC, select the Amazon Virtual Private Cloud (Amazon VPC) for the new domain controllers to use.
    3. For Subnets, select two unique subnets in the Amazon VPC that you selected in the preceding step.
    4. Once you have everything to your liking, choose Add.
      Figure 3: Add a Region

      Figure 3: Add a Region

      In the background, AWS is provisioning two new AWS managed domain controllers in the Region you selected. It could take up to 2 hours for your directory to become available in the Region.

Note: Your managed domain controllers in the home Region are fully functional during this process.

  • On the Directory details page, in Multi-Region replication, the status should be Active when the process has completed. Now you’re ready to deploy your Amazon RDS SQL Server instances.

Enable Amazon RDS for SQL Server

Integrating Amazon RDS into AWS Managed Microsoft AD is exactly the same process as it was before the cross-Region feature was released. This post goes through that original process with only one change, which is that you select the same directory ID for both Regions.

Create an Amazon RDS SQL Server instance in each Region using the same directory

The steps for creating an Amazon RDS SQL Server instance in each Region are the same. The following process will create the first instance. Once you’ve completed the process, you change the AWS Management Console Region to the Region you extended your directory to and repeat the process.

To create an Amazon RDS SQL Server instance:

  1. In the AWS Managed Microsoft AD directory primary Region, go to the Amazon RDS console navigation pane and choose Create database.
  2. Choose Microsoft SQL Server.
  3. You can leave the default values, except for the following settings:
    1. Under Settings select Master and Confirm password.
    2. Under Connectivity, expand Additional connectivity configuration:
      1. Choose Create new to create a new VPC security group.
      2. Enter a name in New VPC security group name.
      3. Select No preference for Availability Zone.
      4. Enter 1433 for Database port.
      Figure 4: Connectivity settings

      Figure 4: Connectivity settings

  4. Select the Enable Microsoft SQL Server Windows authentication check box and then choose Browse Directory.

    Figure 5: Enable Microsoft SQL Server Windows authentication selected

    Figure 5: Enable Microsoft SQL Server Windows authentication selected

  5. Select your directory and select Choose.

    Figure 6: Select a directory

    Figure 6: Select a directory

  6. Choose Create database.
  7. Repeat these steps in your expanded Region. Note that the Directory ID will be the same for both Regions. You can complete the next section while your Amazon RDS SQL instances are provisioning.

Create an Active Directory user and group to delegate SQL administrative rights

The following steps walk you through the process of creating an Active Directory user and group for delegation. Following this process, you add the user to the group you just created and to the AWS Delegated Server Administrators group.

To create a user and group:

  1. Log in to the domain-joined instance with a domain user account that has permissions to create Active Directory users and groups.
  2. Choose Start, enter dsa.msc, and press Enter.
  3. In Active Directory Users and Computers, right-click on the Users OU, select New, and then Group. The New Object – Group window pops up.
    1. Fill in the Group name boxes with your choice of name.
    2. For Group Scope, select Domain local.
    3. For Group type, select Security.
    4. Choose OK.
  4. In Active Directory Users and Computers, right-click on your Users OU and select New and then User. The New Object – User window pops up.
    1. Fill in the boxes with your choice of information, and then choose Next.
    2. Enter your choice of password and clear User must change password at next logon, then choose Next.
    3. On the confirmation page, choose Finish.
  5. Double-click on the user you just created. The user account properties window appears.
    1. Select the Member of tab.
    2. Choose Add.
    3. Enter the name of the group that you previously created and choose Check Names. Next, enter AWS Delegated Server Administrators and choose Check Names again. If you do not receive any error, choose OK, and then OK again.
  6. The Member of tab for the user should include the two groups you just added. Choose OK to close the properties page.

Delegate SQL Server permissions in each Region using the Active Directory group you just created

The following steps guide you through the process of modifying the Amazon RDS SQL security group, installing SQL Server Management Studio (SSMS), and delegating permission in SQL to your Active Directory group.

Modify the Amazon RDS SQL security group

In these next steps, you modify the security group you created with your Amazon RDS instances, allowing your Windows Server instance to connect to the Amazon RDS SQL Server instances over port 1433.

To modify the security group:

  1. From the Amazon Elastic Compute Cloud (Amazon EC2) console, select Security Groups under the Network & Security navigation section.
  2. Select the new Amazon RDS SQL security group that was created with your Amazon RDS SQL instance and select Edit inbound rules.
  3. Choose Add rule and enter the following:
    1. Type – Select Custom TCP.
    2. Protocol – Select TCP.
    3. Port range – Enter 1433.
    4. Source – Select Custom.
    5. Enter the private IP of your instance with a /32. An example would be 10.0.0.10/32.
  4. Choose Save rules.

    Figure 7: Create a security group rule

    Figure 7: Create a security group rule

  5. Repeat these steps on the security group of your other Amazon RDS SQL instance in the other Region.

Install SQL Server Management Studio

All of the steps after the first are done on the Windows Server instance from Prerequisite 3.

To install SMMS:

  1. On your local computer, download SQL Server Management Studio (SSMS).

    Note: If desired, you can disable IE Enhanced Security Configuration and download directly to the Windows Server instance using IE or any other browser, and skip to step 3.

  2. RDP into your Windows Server instance and copy SSMS-Setup-ENU.exe to your RDP session.
  3. Run the file on your Windows Server instance.
  4. Choose Install.

    Figure 8: Install SMMS

    Figure 8: Install SMMS

  5. It might take a few minutes to install. When complete, choose Close.

Delegate permissions in SSMS

All of the following steps are performed on the Windows Server instance from Prerequisite 3. Log in to the Amazon RDS SQL instance using the SQL master user account. Next, create a SQL login for the Active Directory group you created previously and give it elevated permission to the Amazon RDS SQL instance.

To delegate permissions:

  1. Start SMMS.
  2. On the Connect to Server window, enter or select:
    1. Server name – Your Amazon RDS SQL Server endpoint.
    2. Authentication – Select SQL Server Authentication.
    3. Login – Enter the master user name you used when you launched your Amazon RDS SQL instance. The default is admin.
    4. Password – Enter the password for the master user name.
    5. Choose Connect.
    Figure 9: Connect to server

    Figure 9: Connect to server

  3. In SMMS, Choose New Query at the top of the window.

    Figure 10: New query

    Figure 10: New query

  4. In the query window, enter the following query. Replace <CORP\SQL-Admins> with the name of the group you created earlier.
    CREATE LOGIN [<CORP\SQL-Admins>] FROM WINDOWS WITH DEFAULT_DATABASE = [master],
       DEFAULT_LANGUAGE = [us_english];
    

    Figure 11: Query SQL database

    Figure 11: Query SQL database

  5. Choose Execute on the menu bar. You should see a Commands completed successfully message.

    Figure 12: Commands completed successfully

    Figure 12: Commands completed successfully

  6. Next, navigate to the Logins directory on the navigation page. Right-click on the group you added with the SQL command in step 5 and select Properties.

    Figure 13: Open group properties

    Figure 13: Open group properties

  7. Select Server Roles and select the processadmin and setupadmin checkboxes. Then choose OK.

    Figure 14: Configure server roles

    Figure 14: Configure server roles

  8. You can log off from the instance. For the next steps, you log in to the instance using the user account you created previously.
  9. Repeat these steps on the Amazon RDS SQL instance in the other Region.

Connect to the Amazon RDS SQL Server with the same Active Directory user in both Regions

All of the steps are performed on the Windows Server instance from Prerequisite 3. You must log in to the instance using the account you created earlier. You then log in to the Amazon RDS SQL instance using Windows authentication with that account.

  1. Log in to the instance with the user account you created earlier.
  2. Start SSMS.
  3. On the Connect to Server window, enter or select:
    1. Server name: Your Amazon RDS SQL Server endpoint.
    2. Authentication: Select Windows Authentication.
    3. Choose Connect.
    Figure 15: Connect to server

    Figure 15: Connect to server

  4. You should be logged in to SSMS. If you aren’t logged in, make sure you added your user account to the group you created earlier and try again.
  5. Repeat these steps using the other Amazon RDS SQL instance endpoint for the server name. You should be able to connect to both Amazon RDS SQL instances using the same user account.

Summary

In this post, you extended your AWS Managed Microsoft AD into a new Region. You then deployed Amazon RDS for SQL Server in multiple Regions attached to the same AWS Managed Microsoft AD directory. You then tested authentication to both Amazon RDS SQL instances using the same Active Directory user.

To learn more about using AWS Managed Microsoft AD or AD Connector, visit the AWS Directory Service documentation. For general information and pricing, see the AWS Directory Service home page. If you have comments about this blog post, submit a comment in the Comments section below. If you have implementation or troubleshooting questions, start a new thread on the AWS Directory Service forum or contact AWS Support.

Author

Jeremy Girven

Jeremy is a Solutions Architect specializing in Microsoft workloads on AWS. He has over 15 years of experience with Microsoft Active Directory and over 23 years of industry experience. One of his fun projects is using SSM to automate the Active Directory build processes in AWS. To see more please check out the Active Directory AWS QuickStart.

Accessing and visualizing external tables in an Apache Hive metastore with Amazon Athena and Amazon QuickSight

Post Syndicated from James Sun original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-external-tables-in-an-apache-hive-metastore-with-amazon-athena-and-amazon-quicksight/

Many organizations have an Apache Hive metastore that stores the schemas for their data lake. You can use Amazon Athena due to its serverless nature; Athena makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas hosted in an external Hive metastore. In response to customers’ requests, AWS announced Athena’s support for Hive metastore in early 2020. This extends the ability in Athena to query external data stores, with a combined benefit of better performance and lower cost.

In this post, we provide an AWS CloudFormation template that configures a remote Hive metastore based on Amazon Relational Database Service (Amazon RDS) and MySQL with Amazon EMR located in a private subnet to perform ETL tasks. We then demonstrate how you can use a Spark step to pull COVID-19 datasets from a public repository and transform the data into a performant Parquet columnar storage format. We also walk through the steps to query the data with Athena and visualize it with Amazon QuickSight. QuickSight is a fully managed data visualization service; it lets the you easily create and publish interactive dashboards by analyzing data from various data sources, including Athena.

Solution walkthrough

The following diagram shows the architecture for our solution.


As shown in the preceding architecture, we have an Availability Zone within a VPC in an AWS Region. The Availability Zone hosts subnets that are either public or private. A multi-master EMR cluster that has Hive software components installed is launched in a private subnet with egress internet traffic through a NAT gateway to download data from public sites for analysis. The multi-master feature also ensures that the primary nodes are highly available. The Hive metastore is backed by a remote RDS for MySQL instance located in the same private subnet.

We also have an Amazon Simple Storage Service (Amazon S3)-based data lake. A Spark step in Amazon EMR retrieves the data in CSV format, saves it in the provisioned S3 bucket, and transforms the data into Parquet format. The Spark step creates an external Hive table referencing the Parquet data and is ready for Athena to query.

With Athena, you can create a data source connector based on an AWS Lambda function to access the Hive metastore hosted on the EMR cluster by using the Apache Thrift interface.

The connector is called a catalog, which when invoked in a SQL statement with Athena, invokes the Lambda function. The function exits if the connector is not active for 15 minutes. For queries that run longer than 15 minutes, it’s recommended to let the query complete before retrieving the query results in an Amazon S3 location you can specify.

In Athena, you can compose a SQL statement against the Hive tables with predicates to further limit the size of the query result for faster visualization by QuickSight.

Deploying the resources with AWS CloudFormation

To demonstrate our solution, we provide a CloudFormation template that you can download to easily deploy the necessary AWS resources. The template creates the following resources to simulate the environment:

  • VPC and subnets – A VPC with one public and one private subnets. A NAT gateway is also created to allow outbound internet access from the EMR cluster to download public COVID-19 datasets from the Johns Hopkins GitHub repo.
  • EMR cluster – A multi-master EMR cluster with Hive, running on three primary nodes (m5.xlarge) and two core nodes (m5.xlarge), is launched to support the thrift connection required by the Athena Lambda connectors.
  • Amazon RDS for MySQL database – An RDS for MySQL primary instance is launched in the same subnet as the EMR cluster. The RDS instance serves as the Hive metastore backend data store.
  • S3 bucket – An S3 bucket stores files in Parquet format by Amazon EMR and is accessed later by Athena.
  • AWS IAM users – Two AWS Identity and Access Management (IAM) users belonging to different user groups. The first user, the data engineer, has permissions to access the Lambda-based Athena data source connector. The other user, the salesperson, does not have permissions to access the connector.

To get started, you need to have an AWS account. If you don’t have one, go to aws.amazon.com to sign up for one. Then complete the following steps:

  1. Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
  2. Choose Launch Stack to launch the CloudFormation template:

This template has been tested in the US East (N. Virginia) Region.

  1. Choose Next.

You’re prompted to enter a few launch parameters.

  1. For Stack name, enter a name for the stack (for example, athena-hms).
  2. For Hive Metastore Host Number, choose the Amazon EMR primary node to connect to (or use the default value).
  3. Continue to choose Next and leave other parameters at their default.
  4. On the review page, select the three check boxes to confirm that AWS CloudFormation might create resources.
  5. Choose Create stack.

The stack takes 15–20 minutes to complete.

  1. On the Outputs tab of the stack details, save the key-value pairs to use later.

When the EMR cluster is provisioned, it uses a bootstrap action to install the necessary Python libraries. It runs an Amazon EMR Spark step (a PySpark script) that downloads three COVID-19 datasets for confirmed, recovered, and death cases from the John Hopkins GitHub repo in CSV format and stores them in the csv subfolder of the S3 bucket created by the CloudFormation stack. Lastly, the final transformed data is converted to Parquet format and external Hive tables are created referencing the Parquet data located in the parquet subfolder of the S3 bucket.

The following are the source codes for the bootstrap and Spark step actions for your reference:

To validate the data, on the Amazon S3 console, choose the bucket name from the CloudFormation template outputs. You should see a covid_data folder in the bucket. The folder contains the two subfolders, csv and parquet, which store the raw CSV and transformed Parquet data, respectively.

Querying the data in Athena

The CloudFormation template creates two users belonging to two different AWS IAM groups. The de_user_us-east-1_athena-hms user is a data engineer with permissions to access the Lambda function to communicate with the Hive metastore using the Athena data source connector. They belong to the group athena-hms-DataEngineerGroup-xxxxxxx. The sales_user_us-east-1_athena-hms user is a salesperson with no access to the connector. They belong to the group athena-hms-SalesGroup-xxxxx. 

To query the data, first retrieve your secret values in AWS Secrets Manager:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose DataEngineerUserCreds.

  1. Choose Retrieve secret value.

 

  1. Save the username and password.

 

  1. Repeat these steps for SalesUserCreds. 

With the data in place, we can now use Athena to query it.

Accessing data as the data engineer

To query the data as the de_user_us-east-1_athena-hms user, complete the following steps:

  1. Sign in to the Athena console as the de_user_us-east-1_athena-hms user with the credentials retrieved from Secrets Manager.

After logging in, we need to create a data source for Hive metastore.

  1. On the navigation bar, choose Data sources.
  2. Choose Connect data source.

  1. For Choose where you data is located, select Query data in Amazon S3.
  2. For Choose a metadata catalog, select Apache Hive metastore.
  3. Chose Next.

  1. For Lambda function, choose the function the CloudFormation template created with the key HiveMetastoreFunctionName.
  2. For Catalog name, enter a name for the Athena catalog (for example, demo_hive_metastore).
  3. Choose Connect.

You should now have a catalog named demo_hive_metastore with the catalog type Hive metastore.

  1. On the navigation bar, choose Query editor.
  2. Enter the following SQL statement:
    SELECT *
    FROM demo_hive_metastore.default.covid_confirmed_cases
    WHERE country = 'US'
            OR country = 'Italy'
            OR country = 'India'
            OR country = 'Brazil'
            OR country = 'Spain'
            OR country = 'United Kingdom'

This SQL statement selects all the columns in the covid_confirmed_cases table with predicates to only include a few countries of interest. We use a table name with the pattern <catalog name>.<hive database name>.<hive table name in the database>, which for this post translates to demo_hive_metastore.default.covid_confirmed_cases.

  1. Choose Run query.

The following screenshot shows your query results.

Make sure you completely sign out of the console before moving on to the next steps.

Accessing data as the salesperson

Sign in to the console as sales_user_us-east-1_athena-hms user. Because the salesperson user doesn’t have the appropriate IAM policies to access the Hive metastore connection, you can’t see the tables.

 

Permissions policies

The data engineer has additional policies attached to their IAM group in addition to the managed AmazonAthenaFullAccess policy: the <stack-name>-DataBucketReadAccessPolicy-xxxxx and <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policies created by the CloudFormation template. Therefore, the data engineer can view the tables, but the salesperson can’t.

These policies are available on the IAM console, on the Permissions tab for the group <stack-name>-DataEngineerGroup-xxxxx.

The following sample JSON code is the <stack-name>-DataBucketReadWriteAccessPolicy-xxxxx policy to allow access to the provisioned S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::<provisioned bucket name>",
                "arn:aws:s3:::<provisioned bucket name>/",
                "arn:aws:s3:::<provisioned bucket name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

The following sample JSON code is the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy to allow access to the Lambda Hive metastore function:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "lambda:GetFunction",
                "lambda:GetLayerVersion",
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:us-east-1:<account id>:function:athena-hms-HiveMetastoreFunction"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
               "arn:aws:s3:::<provisioned bucket name>/hms_spill"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Next, we walk through using QuickSight to visualize the results. Make sure you completely sign out of the console as the salesperson user before proceeding to the next steps. 

Signing up for QuickSight

You can skip this section if you have already signed up for QuickSight previously.

  1. Sign in to the console as the IAM power user who deployed the CloudFormation template or any user with enough IAM privileges to set up QuickSight.
  2. On the QuickSight console, choose Sign up for QuickSight.
  3. Select either the Standard or Enterprise edition for QuickSight.
  4. Choose Continue.

  1. For QuickSight account name, enter your AWS account ID.
  2. For Notification email address, enter your email.
  3. Select Amazon S3.

  1. Select the provisioned S3 bucket to grant QuickSight permission to access.
  2. Choose Finish.

Your QuickSight account should now be set up.

Attaching the Hive metastore access policy

Before you can use QuickSight, you have to attach the Hive metastore access policy to the QuickSight service role.

  1. On the IAM console, search for the service role aws-quicksight-service-role-v0.
  2. Choose the role.

  1. Search for the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy.
  2. Select the policy and attach it to the QuickSight service role.

Creating your data source and performing data conversions

Before we can create visualizations, we need to set up our data source.

  1. Download the SQL script covid-19.sql.
  2. On the QuickSight console, choose Datasets in the navigation pane.

  1. Choose New dataset.

  1. Choose Athena.

  1. In the pop-up window, for Data source name, enter demo_hive_metastore.
  2. Choose Validate.
  3. When the connection is validated, choose Create data source.

  1. In the next window, choose Use custom SQL.

  1. Enter the content of the covid-19.sql script in the query window.
  2. Choose Confirm query.

  1. Leave Import to SPICE for quicker analytics
  2. Choose Visualize.

Now we perform a few data type conversions before visualizing the data.

  1. Choose the Edit icon next to Data set on the menu bar.

  1. Choose the … icon.
  2. Choose Edit.

  1. Expand date and choose Change data type.
  2. Choose Date.

  1. Enter yyyy-MM-dd to convert the date format.
  2. Choose Update.

Now we create a coordinate using the latitude and longitude values.r

  1. Expand lat and choose Ad to coordinates.

  1. Leave Create new geospatial coordinates
  2. Chose Add.

  1. In the pop-up window, for Name your coordinates, enter coordinate.
  2. For Field to use for longitude, choose lon.
  3. Choose Create coordinates.

  1. Choose Save and visualize on the menu bar.

Creating visualizations in QuickSight

Now we can visualize our data. For this post, we create a map visualization.

  1. For Visual types, choose the map

  1. For Geospatial, drag coordinates.
  2. For Size, drag confirmed.
  3. For Color, drag country.

This world map shows the accumulated confirmed cases for selected countries over time; you need to use a filter to look at confirmed cases on a specific date.

  1. In the navigation pane, choose Filter.
  2. Choose the + icon.
  3. For Filter type, choose Time range.
  4. Choose start and end dates, such as 2020-09-10 00:00 and 2020-09-11 00:00, respectively.
  5. Choose Apply.

This plots the confirmed cases on September 10, 2020, for these countries.

Similarly, you can choose other visual types, such as a line chart, and generate the mortality rate for selected countries over time.

Using highly available primary nodes of the Amazon EMR cluster

The EMR cluster has a multi-master configuration with three primary nodes running to meet high availability requirements. At any time, the Lambda function communicates with one of these three EMR primary nodes. In the rare event that this node goes down, you can quickly re-establish the Athena data source connector to the external Hive metastore by failing over to another active primary node.

To perform this failover, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose athena-hms.
  3. Choose update.
  4. Choose Use current update.
  5. Choose Next.
  6. For Hive Metastore Host Number, choose a host other than the current one you’re using.

  1. Choose Next.
  2. Acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Update stack.

In less than a minute, you should be able to access the Hive metastore and continue to query on the Athena console.

Cleaning up

You may want to clean up the demo environment when you’re done. To do so, on the AWS CloudFormation console, select the template and choose Delete.

This action also deletes the S3 bucket and any data in it. If you want to retain the data for future use, you should make a copy of the bucket before you delete it.

Summary

In this post, we walked through a solution using Athena to query external Hive tables with public COVID-19 datasets hosted in an S3 bucket and visualizing the data with QuickSight. We provided a CloudFormation template to automate the deployment of necessary AWS services for the demo. We encourage you to use these managed and scalable services for your specific use cases in production.


About the Authors

James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.

 

 

Chinmayi Narasimhadevara is a Solutions Architect with Amazon Web Services. Chinmayi has over 15 years of experience in information technology and has worked with organizations ranging from large enterprises to mid-sized startups. She helps AWS customers leverage the correct mix of AWS services to achieve success for their business goals.

 

 

Gagan Brahmi is a Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

New – Amazon RDS on Graviton2 Processors

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-amazon-rds-on-graviton2-processors/

I recently wrote a post to announce the availability of M6g, R6g and C6g families of instances on Amazon Elastic Compute Cloud (EC2). These instances offer better cost-performance ratio than their x86 counterparts. They are based on AWS-designed AWS Graviton2 processors, utilizing 64-bit Arm Neoverse N1 cores.

Starting today, you can also benefit from better cost-performance for your Amazon Relational Database Service (RDS) databases, compared to the previous M5 and R5 generation of database instance types, with the availability of AWS Graviton2 processors for RDS. You can choose between M6g and R6g instance families and three database engines (MySQL 8.0.17 and higher, MariaDB 10.4.13 and higher, and PostgreSQL 12.3 and higher).

M6g instances are ideal for general purpose workloads. R6g instances offer 50% more memory than their M6g counterparts and are ideal for memory intensive workloads, such as Big Data analytics.

Graviton2 instances provide up to 35% performance improvement and up to 52% price-performance improvement for RDS open source databases, based on internal testing of workloads with varying characteristics of compute and memory requirements.

Graviton2 instances family includes several new performance optimizations such as larger L1 and L2 caches per core, higher Amazon Elastic Block Store (EBS) throughput than comparable x86 instances, fully encrypted RAM, and many others as detailed on this page. You can benefit from these optimizations with minimal effort, by provisioning or migrating your RDS instances today.

RDS instances are available in multiple configurations, starting with 2 vCPUs, with 8 GiB memory for M6g, and 16 GiB memory for R6g with up to 10 Gbps of network bandwidth, giving you new entry-level general purpose and memory optimized instances. The table below shows the list of instance sizes available for you:

Instance Size vCPU Memory (GiB) Dedicated EBS Bandwidth (Mbps) Network Bandwidth
(Gbps)
M6g R6g
large 2 8 16 Up to 4750 Up to 10
xlarge 4 16 32 Up to 4750 Up to 10
2xlarge 8 32 64 Up to 4750 Up to 10
4xlarge 16 64 128 4750 Up to 10
8xlarge 32 128 256 9000 12
12xlarge 48 192 384 13500 20
16xlarge 64 256 512 19000 25

Let’s Start Your First Graviton2 Based Instance
To start a new RDS instance, I use the AWS Management Console or the AWS Command Line Interface (CLI), just like usual, and select one of the db.m6g or db.r6ginstance types (this page in the documentation has all the details).

RDS Launch Graviton2 instance

Using the CLI, it would be:

aws rds create-db-instance
 --region us-west-2 \
 --db-instance-identifier $DB_INSTANCE_NAME \
 --db-instance-class db.m6g.large \
 --engine postgres \
 --engine-version 12.3 \
 --allocated-storage 20 \
 --master-username $MASTER_USER \
 --master-user-password $MASTER_PASSWORD

The CLI confirms with:

{
    "DBInstance": {
        "DBInstanceIdentifier": "newsblog",
        "DBInstanceClass": "db.m6g.large",
        "Engine": "postgres",
        "DBInstanceStatus": "creating",
...
}

Migrating to Graviton2 instances is easy, in the AWS Management Console, I select my database and I click Modify.

Modify RDS database

The I select the new DB instance class:

modify db instance

Or, using the CLI, I can use the modify-db-instance API call.

There is a short service interruption happening when you switch instance type. By default, the modification will happen during your next maintenance window, unless you enable the ApplyImmediately option.

You can provision new or migrate to Graviton2 Amazon Relational Database Service (RDS) instances in all regions where EC2 M6g and R6g are available : US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Frankfurt) AWS Regions.

As usual, let us know your feedback on the AWS Forum or through your usual AWS contact.

— seb

How to delete user data in an AWS data lake

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure” or “right to be forgotten” which may require you to implement a solution to delete specific users’ personal data.

In the context of the AWS big data and analytics ecosystem, every architecture, regardless of the problem it targets, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Despite its versatility and feature completeness, Amazon S3 doesn’t come with an out-of-the-box way to map a user identifier to S3 keys of objects that contain user’s data.

This post walks you through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.

Reference architecture

To address the challenge of implementing a data purge framework, we reduced the problem to the straightforward use case of deleting a user’s data from a platform that uses AWS for its data pipeline. The following diagram illustrates this use case.

We’re introducing the idea of building and maintaining an index metastore that keeps track of the location of each user’s records and allows us locate to them efficiently, reducing the search space.

You can use the following architecture diagram to delete a specific user’s data within your organization’s AWS data lake.

For this initial version, we created three user flows that map each task to a fitting AWS service:

Flow 1: Real-time metastore update

The S3 ObjectCreated or ObjectDelete events trigger an AWS Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. You can implement a simple workflow for any other storage layer, such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). We use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but our approach is flexible to any other technology.

Flow 2: Purge data

When a user asks for their data to be deleted, we trigger an AWS Step Functions state machine through Amazon CloudWatch to orchestrate the workflow. Its first step triggers a Lambda function that queries the metadata index to identify the storage layers that contain user records and generates a report that’s saved to an S3 report bucket. A Step Functions activity is created and picked up by a Lambda Node JS based worker that sends an email to the approver through Amazon Simple Email Service (SES) with approve and reject links.

The following diagram shows a graphical representation of the Step Function state machine as seen on the AWS Management Console.

The approver selects one of the two links, which then calls an Amazon API Gateway endpoint that invokes Step Functions to resume the workflow. If you choose the approve link, Step Functions triggers a Lambda function that takes the report stored in the bucket as input, deletes the objects or records from the storage layer, and updates the index metastore. When the purging job is complete, Amazon Simple Notification Service (SNS) sends a success or fail email to the user.

The following diagram represents the Step Functions flow on the console if the purge flow completed successfully.

For the complete code base, see step-function-definition.json in the GitHub repo.

Flow 3: Batch metastore update

This flow refers to the use case of an existing data lake for which index metastore needs to be created. You can orchestrate the flow through AWS Step Functions, which takes historical data as input and updates metastore through a batch job. Our current implementation doesn’t include a sample script for this user flow.

Our framework

We now walk you through the two use cases we followed for our implementation:

  • You have multiple user records stored in each Amazon S3 file
  • A user has records stored in homogenous AWS storage layers

Within these two approaches, we demonstrate alternatives that you can use to store your index metastore.

Indexing by S3 URI and row number

For this use case, we use a free tier RDS Postgres instance to store our index. We created a simple table with the following code:

CREATE UNLOGGED TABLE IF NOT EXISTS user_objects (
				userid TEXT,
				s3path TEXT,
				recordline INTEGER
			);

You can index on user_id to optimize query performance. On object upload, for each row, you need to insert into the user_objects table a row that indicates the user ID, the URI of the target Amazon S3 object, and the row that corresponds to the record. For instance, when uploading the following JSON input, enter the following code:

{"user_id":"V34qejxNsCbcgD8C0HVk-Q","body":"…"}
{"user_id":"ofKDkJKXSKZXu5xJNGiiBQ","body":"…"}
{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body ":"…"}

We insert the tuples into user_objects in the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json. See the following code:

(“V34qejxNsCbcgD8C0HVk-Q”, “s3://gdpr-demo/year=2018/month=2/day=26/input.json”, 0)
(“ofKDkJKXSKZXu5xJNGiiBQ”, “s3://gdpr-demo/year=2018/month=2/day=26/input.json”, 1)
(“UgMW8bLE0QMJDCkQ1Ax5Mg”, “s3://gdpr-demo/year=2018/month=2/day=26/input.json”, 2)

You can implement the index update operation by using a Lambda function triggered on any Amazon S3 ObjectCreated event.

When we get a delete request from a user, we need to query our index to get some information about where we have stored the data to delete. See the following code:

SELECT s3path,
                ARRAY_AGG(recordline)
                FROM user_objects
                WHERE userid = ‘V34qejxNsCbcgD8C0HVk-Q’
                GROUP BY;

The preceding example SQL query returns rows like the following:

(“s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json“, {2102,529})

The output indicates that lines 529 and 2102 of S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user’s data and need to be purged. We then need to download the object, remove those rows, and overwrite the object. For a Python implementation of the Lambda function that implements this functionality, see deleteUserRecords.py in the GitHub repo.

Having the record line available allows you to perform the deletion efficiently in byte format. For implementation simplicity, we purge the rows by replacing the deleted rows with an empty JSON object. You pay a slight storage overhead, but you don’t need to update subsequent row metadata in your index, which would be costly. To eliminate empty JSON objects, we can implement an offline vacuum and index update process.

Indexing by file name and grouping by index key

For this use case, we created a DynamoDB table to store our index. We chose DynamoDB because of its ease of use and scalability; you can use its on-demand pricing model so you don’t need to guess how many capacity units you might need. When files are uploaded to the data lake, a Lambda function parses the file name (for example, 1001-.csv) to identify the user identifier and populates the DynamoDB metadata table. Userid is the partition key, and each different storage layer has its own attribute. For example, if user 1001 had data in Amazon S3 and Amazon RDS, their records look like the following code:

{"userid:": 1001, "s3":{"s3://path1", "s3://path2"}, "RDS":{"db1.table1.column1"}}

For a sample Python implementation of this functionality, see update-dynamo-metadata.py in the GitHub repo.

On delete request, we query the metastore table, which is DynamoDB, and generate a purge report that contains details on what storage layers contain user records, and storage layer specifics that can speed up locating the records. We store the purge report to Amazon S3. For a sample Lambda function that implements this logic, see generate-purge-report.py in the GitHub repo.

After the purging is approved, we use the report as input to delete the required resources. For a sample Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.

Implementation and technology alternatives

We explored and evaluated multiple implementation options, all of which present tradeoffs, such as implementation simplicity, efficiency, critical data compliance, and feature completeness:

  • Scan every record of the data file to create an index – Whenever a file is uploaded, we iterate through its records and generate tuples (userid, s3Uri, row_number) that are then inserted to our metadata storing layer. On delete request, we fetch the metadata records for requested user IDs, download the corresponding S3 objects, perform the delete in place, and re-upload the updated objects, overwriting the existing object. This is the most flexible approach because it supports a single object to store multiple users’ data, which is a very common practice. The flexibility comes at a cost because it requires downloading and re-uploading the object, which introduces a network bottleneck in delete operations. User activity datasets such as customer product reviews are a good fit for this approach, because it’s unexpected to have multiple records for the same user within each partition (such as a date partition), and it’s preferable to combine multiple users’ activity in a single file. It’s similar to what was described in the section “Indexing by S3 URI and row number” and sample code is available in the GitHub repo.
  • Store metadata as file name prefix – Adding the user ID as the prefix of the uploaded object under the different partitions that are defined based on query pattern enables you to reduce the required search operations on delete request. The metadata handling utility finds the user ID from the file name and maintains the index accordingly. This approach is efficient in locating the resources to purge but assumes a single user per object, and requires you to store user IDs within the filename, which might require InfoSec considerations. Clickstream data, where you would expect to have multiple click events for a single customer on a single date partition during a session, is a good fit. We covered this approach in the section “Indexing by file name and grouping by index key” and you can download the codebase from the GitHub repo.
  • Use a metadata file – Along with uploading a new object, we also upload a metadata file that’s picked up by an indexing utility to create and maintain the index up to date. On delete request, we query the index, which points us to the records to purge. A good fit for this approach is a use case that already involves uploading a metadata file whenever a new object is uploaded, such as uploading multimedia data, along with their metadata. Otherwise, uploading a metadata file on every object upload might introduce too much of an overhead.
  • Use the tagging feature of AWS services – Whenever a new file is uploaded to Amazon S3, we use the Put Object Tagging Amazon S3 operation to add a key-value pair for the user identifier. Whenever there is a user data delete request, it fetches objects with that tag and deletes them. This option is straightforward to implement using the existing Amazon S3 API and can therefore be a very initial version of your implementation. However, it involves significant limitations. It assumes a 1:1 cardinality between Amazon S3 objects and users (each object only contains data for a single user), searching objects based on a tag is limited and inefficient, and storing user identifiers as tags might not be compliant with your organization’s InfoSec policy.
  • Use Apache Hudi – Apache Hudi is becoming a very popular option to perform record-level data deletion on Amazon S3. Its current version is restricted to Amazon EMR, and you can use it if you start to build your data lake from scratch, because you need to store your as Hudi datasets. Hudi is a very active project and additional features and integrations with more AWS services are expected.

The key implementation decision of our approach is separating the storage layer we use for our data and the one we use for our metadata. As a result, our design is versatile and can be plugged in any existing data pipeline. Similar to deciding what storage layer to use for your data, there are many factors to consider when deciding how to store your index:

  • Concurrency of requests – If you don’t expect too many simultaneous inserts, even something as simple as Amazon S3 could be a starting point for your index. However, if you get multiple concurrent writes for multiple users, you need to look into a service that copes better with transactions.
  • Existing team knowledge and infrastructure – In this post, we demonstrated using DynamoDB and RDS Postgres for storing and querying the metadata index. If your team has no experience with either of those but are comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or any other storage layer, use those. Furthermore, if you’re already running (and paying for) a MySQL database that’s not used to capacity, you could use that for your index for no additional cost.
  • Size of index – The volume of your metadata is orders of magnitude lower than your actual data. However, if your dataset grows significantly, you might need to consider going for a scalable, distributed storage solution rather than, for instance, a relational database management system.

Conclusion

GDPR has transformed best practices and introduced several extra technical challenges in designing and implementing a data lake. The reference architecture and scripts in this post may help you delete data in a manner that’s compliant with GDPR.

Let us know your feedback in the comments and how you implemented this solution in your organization, so that others can learn from it.

 


About the Authors

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

 

 

 

 

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Understanding database options for your serverless web applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/understanding-database-options-for-your-serverless-web-applications/

Many web developers use relational databases to store and manage data in their web applications. As you migrate to a serverless approach for web development, there are also other options available. These can help improve the scale, performance, and cost-effectiveness of your workloads. In this blog post, I highlight use-cases for different serverless database services, and common patterns useful for web applications.

Using Amazon RDS with serverless web applications

You can access Amazon RDS directly from your AWS Lambda functions. The RDS database, such as Amazon Aurora, is configured within the customer VPC. The Lambda function must be configured with access to the same VPC:

Lambda connecting to RDS

There are special considerations for this design in busy serverless applications. It’s common for popular web applications to experience “spiky” usage, where traffic volumes shift rapidly and unpredictably. Serverless services such as AWS Lambda and Amazon API Gateway are designed to automatically scale to meet these traffic increases.

However, relational databases are connection-based, so they are intended to work with a few long-lived clients, such as web servers. By contrast, Lambda functions are ephemeral and short-lived, so their database connections are numerous and brief. If Lambda scales up to hundreds or thousands of instances, you may overwhelm downstream relational databases with connection requests. This is typically only an issue for moderately busy applications. If you are using a Lambda function for low-volume tasks, such as running daily SQL reports, you do not experience this behavior.

The Amazon RDS Proxy service is built to solve the high-volume use-case. It pools the connections between the Lambda service and the downstream RDS database. This means that a scaling Lambda function is able to reuse connections via the proxy. As a result, the relational database is not overwhelmed with connections requests from individual Lambda functions. This does not require code changes in many cases. You only need to replace the database endpoint with the proxy endpoint in your Lambda function.

Lambda to RDS Proxy to RDS diagram

As a result, if you need to use a relational database in a high-volume web application, you can use RDS Proxy with minimal changes required.

Using Amazon DynamoDB as a high-performance operational database

Amazon DynamoDB is a high-performance key-value and document database that operates with single-digit millisecond response times at any scale. This is a NoSQL database that is a natural fit for many serverless workloads, especially web applications. It can operate equally well for low and high usage workloads. Unlike relational databases, the performance of a well-architected DynamoDB table is not adversely affected by heavy usage or large amounts of data storage.

For web applications, DynamoDB tables are ideal for storing common user configuration and application data. When integrated with Amazon Cognito, you can restrict row-level access to the current user context. This makes it a frequent choice for multi-tenant web applications that host data for many users.

DynamoDB tables can be useful for lookups of key-based information, in addition to geo-spatial queries in many cases. DynamoDB is not connection-based, so this integration works even if a Lambda function scales up to hundreds or thousands of concurrent executions. You can query directly from Lambda with minimal code:

const AWS = require('aws-sdk')
AWS.config.region = process.env.AWS_REGION
const documentClient = new AWS.DynamoDB.DocumentClient()

// Construct params
const params = {
  TableName: 'myDDBtable',
  Item: {
    partitionKey: 'user-123',
    sortKey: Date.now(),
    name: 'Alice',
    cartItems: 3
  }
}

// Store in DynamoDB
const result = await documentClient.put(params).promise()

Using advanced patterns in DynamoDB, it’s possible to build equivalent features frequently found in relational schemas. For example, one-to-many tables, many-to-many tables, and ACID transactions can all be modeled in a single DynamoDB table.

Combining DynamoDB with RDS

While DynamoDB remains highly performant for high volumes of traffic, you need to understand data access patterns for your application before designing the schema. There are times where you need to perform ad hoc queries, or where downstream application users must use SQL-based tools to interact with databases.

In this case, combining both DynamoDB and RDS in your architecture can provide a resilient and flexible solution. For example, for a high-volume transactional web application, you can use DynamoDB to ingest data from your frontend application. For ad hoc SQL-based analytics, you could also use Amazon Aurora.

By using DynamoDB streams, you can process updates to a DynamoDB table using a Lambda function. In a simple case, this function can update tables in RDS, keeping the two databases synchronized. For example, when sales transactions are saved in DynamoDB, a Lambda function can post the sales information to transaction tables in Aurora.

DynamoDB to RDS architecture

Both the Lambda function and RDS database operate with the customer’s VPC, while DynamoDB is outside the VPC. DynamoDB Streams can invoke Lambda functions configured to access the VPC. In this model, RDS users can then run ad hoc SQL queries without impacting operational data managed by DynamoDB.

High-volume ETL processes between DynamoDB and RDS

For high-volume workloads capturing large numbers of transactions in DynamoDB, Lambda may still scale rapidly and exhaust the RDS connection pool. To process these flows, you may introduce Amazon Kinesis Data Firehose to help with data replication between DynamoDB and RDS.

ETL processing with with DynamoDB and RDS

  1. New and updated items in DynamoDB are sent to a DynamoDB stream. The stream invokes a stream processing Lambda function, sending batches of records to Kinesis Data Firehose.
  2. Kinesis buffers incoming messages and performs data transformations using a Lambda function. It then writes the output to Amazon S3, buffering by size (1–128 MB) or interval (60–900 seconds).
  3. The Kinesis Data Firehose transformation uses a custom Lambda function for processing records as needed.
  4. Amazon S3 is a durable store for these batches of transformed records. As objects are written, S3 invokes a Lambda function.
  5. The Lambda function loads the objects from S3, then connects to RDS and imports the data.

This approach supports high transaction volumes, enabling table item transformation before loading into RDS. The RDS concurrent connection pool is optimized by upstream batching and buffering, which reduces the number of concurrent Lambda functions and RDS connections.

Conclusion

Web developers commonly use relational databases in building their applications. When migrating to serverless architectures, a web developer can continue to use databases like RDS, or take advantage of other options available. RDS Proxy enables developers to pool database connections and use connection-based databases with ephemeral functions.

DynamoDB provides high-performance, low-latency NoSQL support, which is ideal for many busy web applications with spiky traffic volumes. However, it’s also possible to use both services to take advantage of the throughput of DynamoDB, together with the flexibility of ad hoc SQL queries in RDS.

For extremely high traffic volumes, you can introduce Kinesis Data Firehose to batch and transform data between DynamoDB and RDS. In this case, you separate the operational database from the analytics database. This solution uses multiple serverless services to handle scaling automatically.

To learn more about AWS serverless database solutions for web developers, visit https://aws.amazon.com/products/databases/.

New – Create Amazon RDS DB Instances on AWS Outposts

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-create-amazon-rds-db-instances-on-aws-outposts/

Late last year I told you about AWS Outposts and invited you to Order Yours Today. As I told you at the time, this is a comprehensive, single-vendor compute and storage offering that is designed to meet the needs of customers who need local processing and very low latency in their data centers and on factory floors. Outposts uses the hardware that we use in AWS public regions

I first told you about Amazon RDS back in 2009. This fully managed service makes it easy for you to launch, operate, and scale a relational database. Over the years we have added support for multiple open source and commercial databases, along with tons of features, all driven by customer requests.

DB Instances on AWS Outposts
Today I am happy to announce that you can now create RDS DB Instances on AWS Outposts. We are launching with support for MySQL and PostgreSQL, with plans to add other database engines in the future (as always, let us know what you need so that we can prioritize it).

You can make use of important RDS features including scheduled backups to Amazon Simple Storage Service (S3), built-in encryption at rest and in transit, and more.

Creating a DB Instance
I can create a DB Instance using the RDS Console, API (CreateDBInstance), CLI (create-db-instance), or CloudFormation (AWS::RDS::DBInstance).

I’ll use the Console, taking care to select the AWS Region that serves as “home base” for my Outpost. I open the Console and click Create database to get started:

I select On-premises for the Database location, and RDS on Outposts for the On-premises database option:

Next, I choose the Virtual Private Cloud (VPC). The VPC must already exist, and it must have a subnet for my Outpost. I also choose the Security Group and the Subnet:

Moving forward, I select the database engine, and version. We’re launching with support for MySQL 8.0.17 and PostgreSQL 12.2-R1, with plans to add more engines and versions based on your feedback:

I give my DB Instance a name (jb-database-2), and enter the credentials for the master user:

Then I choose the size of the instance. I can select between Standard classes (db.m5):

and Memory Optimized classes (db.r5):

Next, I configure the desired amount of SSD storage:

One thing to keep in mind is that each Outpost has a large, but finite amount of compute power and storage. If there’s not enough of either one free when I attempt to create the database, the request will fail.

Within the Additional configuration section I can set up several database options, customize my backups, and set up the maintenance window. Once everything is ready to go, I click Create database:

As usual when I use RDS, the state of my instance starts out as Creating and transitions to Available when my DB Instance is ready:

After the DB instance is ready, I simply configure my code (running in my VPC or in my Outpost) to use the new endpoint:

Things to Know
Here are a couple of things to keep in mind about this new way to use Amazon RDS:

Operations & Functions – Much of what you already know about RDS works as expected and is applicable. You can rename, reboot, stop, start, tag DB instances, and you can make use of point-in-time recovery; you can scale the instance up and down, and automatic minor version upgrades work as expected. You cannot make use of read replicas or create highly available clusters.

Backup & Recover – Automated backups work as expected, and are stored in S3. You can use them to create a fresh DB Instance in the cloud or in any of your Outposts. Manual snapshots also work, and are stored on the Outpost. They can be used to create a fresh DB Instance on the same Outpost.

Encryption – The storage associated with your DB instance is encrypted, as are your DB snapshots, both with KMS keys.

Pricing – RDS on Outposts pricing is based on a management fee that is charged on an hourly basis for each database that is managed. For more information, check out the RDS on Outposts pricing page.

Available Now
You can start creating RDS DB Instances on your Outposts today.

Jeff;

 

Amazon RDS Proxy – Now Generally Available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-rds-proxy-now-generally-available/

At AWS re:Invent 2019, we launched the preview of Amazon RDS Proxy, a fully managed, highly available database proxy for Amazon Relational Database Service (RDS) that makes applications more scalable, more resilient to database failures, and more secure. Following the preview of MySQL engine, we extended to the PostgreSQL compatibility. Today, I am pleased to announce that we are now generally available for both engines.

Many applications, including those built on modern serverless architectures using AWS Lambda, Fargate, Amazon ECS, or EKS can have a large number of open connections to the database server, and may open and close database connections at a high rate, exhausting database memory and compute resources.

Amazon RDS Proxy allows applications to pool and share connections established with the database, improving database efficiency, application scalability, and security. With RDS Proxy, failover times for Amazon Aurora and RDS databases are reduced by up to 66%, and database credentials, authentication, and access can be managed through integration with AWS Secrets Manager and AWS Identity and Access Management (IAM).

Amazon RDS Proxy can be enabled for most applications with no code change, and you don’t need to provision or manage any additional infrastructure and only pay per vCPU of the database instance for which the proxy is enabled.

Amazon RDS Proxy – Getting started
You can get started with Amazon RDS Proxy in just a few clicks by going to the AWS management console and creating an RDS Proxy endpoint for your RDS databases. In the navigation pane, choose Proxies and Create proxy. You can also see the proxy panel below.

To create your proxy, specify the Proxy identifier, a unique name of your choosing, and choose the database engine – either MySQL or PostgreSQL. Choose the encryption setting if you want the proxy to enforce TLS / SSL for all connection between application and proxy, and specify a time period that a client connection can be idle before the proxy can close it.

A client connection is considered idle when the application doesn’t submit a new request within the specified time after the previous request completed. The underlying connection between the proxy and database stays open and is returned to the connection pool. Thus, it’s available to be reused for new client connections.

Next, choose one RDS DB instance or Aurora DB cluster in Database to access through this proxy. The list only includes DB instances and clusters with compatible database engines, engine versions, and other settings.

Specify Connection pool maximum connections, a value between 1 and 100. This setting represents the percentage of the max_connections value that RDS Proxy can use for its connections. If you only intend to use one proxy with this DB instance or cluster, you can set it to 100. For details about how RDS Proxy uses this setting, see Connection Limits and Timeouts.

Choose at least one Secrets Manager secret associated with the RDS DB instance or Aurora DB cluster that you intend to access with this proxy, and select an IAM role that has permission to access the Secrets Manager secrets you chose. If you don’t have an existing secret, please click Create a new secret before setting up the RDS proxy.

After setting VPC Subnets and a security group, please click Create proxy. If you more settings in details, please refer to the documentation.

You can see the new RDS proxy after waiting a few minutes and then point your application to the RDS Proxy endpoint. That’s it!

You can also create an RDS proxy easily via AWS CLI command.

aws rds create-db-proxy \
    --db-proxy-name channy-proxy \
    --role-arn iam_role \
    --engine-family { MYSQL|POSTGRESQL } \
    --vpc-subnet-ids space_separated_list \
    [--vpc-security-group-ids space_separated_list] \
    [--auth ProxyAuthenticationConfig_JSON_string] \
    [--require-tls | --no-require-tls] \
    [--idle-client-timeout value] \
    [--debug-logging | --no-debug-logging] \
    [--tags comma_separated_list]

How RDS Proxy works
Let’s see an example that demonstrates how open connections continue working during a failover when you reboot a database or it becomes unavailable due to a problem. This example uses a proxy named channy-proxy and an Aurora DB cluster with DB instances instance-8898 and instance-9814. When the failover-db-cluster command is run from the Linux command line, the writer instance that the proxy is connected to changes to a different DB instance. You can see that the DB instance associated with the proxy changes while the connection remains open.

$ mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p
Enter password:
...
mysql> select @@aurora_server_id;
+--------------------+
| @@aurora_server_id |
+--------------------+
| instance-9814 |
+--------------------+
1 row in set (0.01 sec)

mysql>
[1]+ Stopped mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p
$ # Initially, instance-9814 is the writer.
$ aws rds failover-db-cluster --db-cluster-id cluster-56-2019-11-14-1399
JSON output
$ # After a short time, the console shows that the failover operation is complete.
$ # Now instance-8898 is the writer.
$ fg
mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p

mysql> select @@aurora_server_id;
+--------------------+
| @@aurora_server_id |
+--------------------+
| instance-8898 |
+--------------------+
1 row in set (0.01 sec)

mysql>
[1]+ Stopped mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p
$ aws rds failover-db-cluster --db-cluster-id cluster-56-2019-11-14-1399
JSON output
$ # After a short time, the console shows that the failover operation is complete.
$ # Now instance-9814 is the writer again.
$ fg
mysql -h channy-proxy.proxy-abcdef123.us-east-1.rds.amazonaws.com -u admin_user -p

mysql> select @@aurora_server_id;
+--------------------+
| @@aurora_server_id |
+--------------------+
| instance-9814 |
+--------------------+
1 row in set (0.01 sec)
+---------------+---------------+
| Variable_name | Value |
+---------------+---------------+
| hostname | ip-10-1-3-178 |
+---------------+---------------+
1 row in set (0.02 sec)

With RDS Proxy, you can build applications that can transparently tolerate database failures without needing to write complex failure handling code. RDS Proxy automatically routes traffic to a new database instance while preserving application connections.

You can review the demo for an overview of RDS Proxy and the steps you need take to access RDS Proxy from a Lambda function.

If you want to know how your serverless applications maintain excellent performance even at peak loads, please read this blog post. For a deeper dive into using RDS Proxy for MySQL with serverless, visit this post.

The following are a few things that you should be aware of:

  • Currently, RDS Proxy is available for the MySQL and PostgreSQL engine family. This engine family includes RDS for MySQL 5.6 and 5.7, PostgreSQL 10.11 and 11.5.
  • In an Aurora cluster, all of the connections in the connection pool are handled by the Aurora primary instance. To perform load balancing for read-intensive workloads, you still use the reader endpoint directly for the Aurora cluster.
  • Your RDS Proxy must be in the same VPC as the database. Although the database can be publicly accessible, the proxy can’t be.
  • Proxies don’t support compressed mode. For example, they don’t support the compression used by the --compress or -C options of the mysql command.

Now Available!
Amazon RDS Proxy is generally available in US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (London) , Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney) and Asia Pacific (Tokyo) regions for Aurora MySQL, RDS for MySQL, Aurora PostgreSQL, and RDS for PostgreSQL, and it includes support for Aurora Serverless and Aurora Multi-Master.

Take a look at the product page, pricing, and the documentation to learn more. Please send us feedback either in the AWS forum for Amazon RDS or through your usual AWS support contacts.

Channy;

Introducing the serverless LAMP stack – part 2 relational databases

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/introducing-the-serverless-lamp-stack-part-2-relational-databases/

In this post, you learn how to use an Amazon Aurora MySQL relational database in your serverless applications. I show how to pool and share connections to the database with Amazon RDS Proxy, and how to choose configurations. The code examples in this post are written in PHP and can be found in this GitHub repository. The concepts can be applied to any AWS Lambda supported runtime.

TThe serverless LAMP stack

The serverless LAMP stack

This serverless LAMP stack architecture is first discussed in this post. This architecture uses a PHP Lambda function (or multiple functions) to read and write to an Amazon Aurora MySQL database.

Amazon Aurora provides high performance and availability for MySQL and PostgreSQL databases. The underlying storage scales automatically to meet demand, up to 64 tebibytes (TiB). An Amazon Aurora DB instance is created inside a virtual private cloud (VPC) to prevent public access. To connect to the Aurora database instance from a Lambda function, that Lambda function must be configured to access the same VPC.

Database memory exhaustion can occur when connecting directly to an RDS database. This is caused by a surge in database connections or by a large number of connections opening and closing at a high rate. This can lead to slower queries and limited application scalability. Amazon RDS Proxy is implemented to solve this problem. RDS Proxy is a fully managed database proxy feature for Amazon RDS. It establishes a database connection pool that sits between your application and your relational database and reuses connections in this pool. This protects the database against oversubscription, without the memory and CPU overhead of opening a new database connection each time. Credentials for the database connection are securely stored in AWS Secrets Manager. They are accessed via an AWS Identity and Access Management (IAM) role. This enforces strong authentication requirements for database applications without a costly migration effort for the DB instances themselves.

The following steps show how to connect to an Amazon Aurora MySQL database running inside a VPC. The connection is made from a Lambda function running PHP. The Lambda function connects to the database via RDS Proxy. The database credentials that RDS Proxy uses are held in  Secrets Manager and accessed via IAM authentication.

RDS Proxy with IAM Authentication

RDS Proxy with IAM authentication

Getting started

RDS Proxy is currently in preview and not recommended for production workloads. For a full list of available Regions, refer to the RDS Proxy pricing page.

Creating an Amazon RDS Aurora MySQL database

Before creating an Aurora DB cluster, you must meet the prerequisites, such as creating a VPC and an RDS DB subnet group. For more information on how to set this up, see DB cluster prerequisites.

  1. Call the create-db-cluster AWS CLI command to create the Aurora MySQL DB cluster.
    aws rds create-db-cluster \
    --db-cluster-identifier sample-cluster \
    --engine aurora-mysql \
    --engine-version 5.7.12 \
    --master-username admin \
    --master-user-password secret99 \
    --db-subnet-group-name default-vpc-6cc1cf0a \
    --vpc-security-group-ids sg-d7cf52a3 \
    --enable-iam-database-authentication true
  2. Add a new DB instance to the cluster.
    aws rds create-db-instance \
        --db-instance-class db.r5.large \
        --db-instance-identifier sample-instance \
        --engine aurora-mysql  \
        --db-cluster-identifier sample-cluster
  3. Store the database credentials as a secret in AWS Secrets Manager.
    aws secretsmanager create-secret \
    --name MyTestDatabaseSecret \
    --description "My test database secret created with the CLI" \
    --secret-string '{"username":"admin","password":"secret99","engine":"mysql","host":"<REPLACE-WITH-YOUR-DB-WRITER-ENDPOINT>","port":"3306","dbClusterIdentifier":"<REPLACE-WITH-YOUR-DB-CLUSTER-NAME>"}'

    Make a note of the resulting ARN for later

    {
        "VersionId": "eb518920-4970-419f-b1c2-1c0b52062117", 
        "Name": "MySampleDatabaseSecret", 
        "ARN": "arn:aws:secretsmanager:eu-west-1:1234567890:secret:MySampleDatabaseSecret-JgEWv1"
    }

    This secret is used by RDS Proxy to maintain a connection pool to the database. To access the secret, the RDS Proxy service requires permissions to be explicitly granted.

  4. Create an IAM policy that provides secretsmanager permissions to the secret.
    aws iam create-policy \
    --policy-name my-rds-proxy-sample-policy \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetResourcePolicy",
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret",
            "secretsmanager:ListSecretVersionIds"
          ],
          "Resource": [
            "<the-arn-of-the-secret>”
          ]
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetRandomPassword",
            "secretsmanager:ListSecrets"
          ],
          "Resource": "*"
        }
      ]
    }'
    

    Make a note of the resulting policy ARN, which you need to attach to a new role.

    {
        "Policy": {
            "PolicyName": "my-rds-proxy-sample-policy", 
            "PermissionsBoundaryUsageCount": 0, 
            "CreateDate": "2020-06-04T12:21:25Z", 
            "AttachmentCount": 0, 
            "IsAttachable": true, 
            "PolicyId": "ANPA6JE2MLNK3Z4EFQ5KL", 
            "DefaultVersionId": "v1", 
            "Path": "/", 
            "Arn": "arn:aws:iam::1234567890112:policy/my-rds-proxy-sample-policy", 
            "UpdateDate": "2020-06-04T12:21:25Z"
         }
    }
    
  5. Create an IAM Role that has a trust relationship with the RDS Proxy service. This allows the RDS Proxy service to assume this role to retrieve the database credentials.

    aws iam create-role --role-name my-rds-proxy-sample-role --assume-role-policy-document '{
     "Version": "2012-10-17",
     "Statement": [
      {
       "Sid": "",
       "Effect": "Allow",
       "Principal": {
        "Service": "rds.amazonaws.com"
       },
       "Action": "sts:AssumeRole"
      }
     ]
    }'
    
  6. Attach the new policy to the role:
    aws iam attach-role-policy \
    --role-name my-rds-proxy-sample-role \
    --policy-arn arn:aws:iam::123456789:policy/my-rds-proxy-sample-policy
    

Create an RDS Proxy

  1. Use the AWS CLI to create a new RDS Proxy. Replace the – -role-arn and SecretArn value to those values created in the previous steps.
    aws rds create-db-proxy \
    --db-proxy-name sample-db-proxy \
    --engine-family MYSQL \
    --auth '{
            "AuthScheme": "SECRETS",
            "SecretArn": "arn:aws:secretsmanager:eu-west-1:123456789:secret:exampleAuroraRDSsecret1-DyCOcC",
             "IAMAuth": "REQUIRED"
          }' \
    --role-arn arn:aws:iam::123456789:role/my-rds-proxy-sample-role \
    --vpc-subnet-ids  subnet-c07efb9a subnet-2bc08b63 subnet-a9007bcf
    

    To enforce IAM authentication for users of the RDS Proxy, the IAMAuth value is set to REQUIRED. This is a more secure alternative to embedding database credentials in the application code base.

    The Aurora DB cluster and its associated instances are referred to as the targets of that proxy.

  2. Add the database cluster to the proxy with the register-db-proxy-targets command.
    aws rds register-db-proxy-targets \
    --db-proxy-name sample-db-proxy \
    --db-cluster-identifiers sample-cluster
    

Deploying a PHP Lambda function with VPC configuration

This GitHub repository contains a Lambda function with a PHP runtime provided by a Lambda layer. The function uses the MySQLi PHP extension to connect to the RDS Proxy. The extension has been installed and compiled along with a PHP executable using this command:

The PHP executable is packaged together with a Lambda bootstrap file to create a PHP custom runtime. More information on building your own custom runtime for PHP can be found in this post.

Deploy the application stack using the AWS Serverless Application Model (AWS SAM) CLI:

sam deploy -g

When prompted, enter the SecurityGroupIds and the SubnetIds for your Aurora DB cluster.

The SAM template attaches the SecurityGroupIds and SubnetIds parameters to the Lambda function using the VpcConfig sub-resource.

Lambda creates an elastic network interface for each combination of security group and subnet in the function’s VPC configuration. The function can only access resources (and the internet) through that VPC.

Adding RDS Proxy to a Lambda Function

  1. Go to the Lambda console.
  2. Choose the PHPHelloFunction that you just deployed.
  3. Choose Add database proxy at the bottom of the page.
  4. Choose existing database proxy then choose sample-db-proxy.
  5. Choose Add.

Using the RDS Proxy from within the Lambda function

The Lambda function imports three libraries from the AWS PHP SDK. These are used to generate a password token from the database credentials stored in Secrets Manager.

The AWS PHP SDK libraries are provided by the PHP-example-vendor layer. Using Lambda layers in this way creates a mechanism for incorporating additional libraries and dependencies as the application evolves.

The function’s handler named index, is the entry point of the function code. First, getenv() is called to retrieve the environment variables set by the SAM application’s deployment. These are saved as local variables and available for the duration of the Lambda function’s execution.

The AuthTokenGenerator class generates an RDS auth token for use with IAM authentication. This is initialized by passing in the credential provider to the SDK client constructor. The createToken() method is then invoked, with the Proxy endpoint, port number, Region, and database user name provided as method parameters. The resultant temporary token is then used to connect to the proxy.

The PHP mysqli class represents a connection between PHP and a MySQL database. The real_connect() method is used to open a connection to the database via RDS Proxy. Instead of providing the database host endpoint as the first parameter, the proxy endpoint is given. The database user name, temporary token, database name, and port number are also provided. The constant MYSQLI_CLIENT_SSL is set to ensure that the connection uses SSL encryption.

Once a connection has been established, the connection object can be used. In this example, a SHOW TABLES query is executed. The connection is then closed, and the result is encoded to JSON and returned from the Lambda function.

This is the output:

RDS Proxy monitoring and performance tuning

RDS Proxy allows you to monitor and adjust connection limits and timeout intervals without changing application code.

Limit the timeout wait period that is most suitable for your application with the connection borrow timeout option. This specifies how long to wait for a connection to become available in the connection pool before returning a timeout error.

Adjust the idle connection timeout interval to help your applications handle stale resources. This can save your application from mistakenly leaving open connections that hold important database resources.

Multiple applications using a single database can each use an RDS Proxy to divide the connection quotas across each application. Set the maximum proxy connections as a percentage of the max_connections configuration (for MySQL).

The following example shows how to change the MaxConnectionsPercent setting for a proxy target group.

aws rds modify-db-proxy-target-group \
--db-proxy-name sample-db-proxy \
--target-group-name default \
--connection-pool-config '{"MaxConnectionsPercent": 75 }'

Response:

{
    "TargetGroups": [
        {
            "DBProxyName": "sample-db-proxy",
            "TargetGroupName": "default",
            "TargetGroupArn": "arn:aws:rds:eu-west-1:####:target-group:prx-tg-03d7fe854604e0ed1",
            "IsDefault": true,
            "Status": "available",
            "ConnectionPoolConfig": {
            "MaxConnectionsPercent": 75,
            "MaxIdleConnectionsPercent": 50,
            "ConnectionBorrowTimeout": 120,
            "SessionPinningFilters": []
        	},            
"CreatedDate": "2020-06-04T16:14:35.858000+00:00",
            "UpdatedDate": "2020-06-09T09:08:50.889000+00:00"
        }
    ]
}

RDS Proxy may keep a session on the same connection until the session ends when it detects a session state change that isn’t appropriate for reuse. This behavior is called pinning. Performance tuning for RDS Proxy involves maximizing connection reuse by minimizing pinning.

The Amazon CloudWatch metric DatabaseConnectionsCurrentlySessionPinned can be monitored to see how frequently pinning occurs in your application.

Amazon CloudWatch collects and processes raw data from RDS Proxy into readable, near real-time metrics. Use these metrics to observe the number of connections and the memory associated with connection management. This can help identify if a database instance or cluster would benefit from using RDS Proxy. For example, if it is handling many short-lived connections, or opening and closing connections at a high rate.

Conclusion

In this post, you learn how to create and configure an RDS Proxy to manage connections from a PHP Lambda function to an Aurora MySQL database. You see how to enforce strong authentication requirements by using Secrets Manager and IAM authentication. You deploy a Lambda function that uses Lambda layers to store the AWS PHP SDK as a dependency.

You can create secure, scalable, and performant serverless applications with relational databases. Do this by placing the RDS Proxy service between your database and your Lambda functions. You can also migrate your existing MySQL database to an Aurora DB cluster without altering the database. Using RDS Proxy and Lambda, you can build serverless PHP applications faster, with less code.

Find more PHP examples with the Serverless LAMP stack.

Introducing the new Serverless LAMP stack

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/introducing-the-new-serverless-lamp-stack/

This is the first in a series of posts for PHP developers. The series will explain how to use serverless technologies with PHP. It covers the available tools, frameworks and strategies to build serverless applications, and why now is the right time to start.

In future posts, I demonstrate how to use AWS Lambda for web applications built with PHP frameworks such as Laravel and Symphony. I show how to move from using Lambda as a replacement for web hosting functionality to a decoupled, event-driven approach. I cover how to combine multiple Lambda functions of minimal scope with other serverless services to create performant scalable microservices.

In this post, you learn how to use PHP with Lambda via the custom runtime API. Visit this GitHub repository for the sample code.

The Serverless LAMP stack

The Serverless LAMP stack

The challenges with traditional PHP applications

Scalability is an inherent challenge with the traditional LAMP stack. A scalable application is one that can handle highly variable levels of traffic. PHP applications are often scaled horizontally, by adding more web servers as needed. This is managed via a load balancer, which directs requests to various web servers. Each additional server brings additional overhead with networking, administration, storage capacity, backup and restore systems, and an update to asset management inventories. Additionally, each horizontally scaled server runs independently. This can result in configuration synchronization challenges.

Horizontal scaling with traditional LAMP stack applications.

Horizontal scaling with traditional LAMP stack applications.

New storage challenges arise as each server has its own disks and filesystem, often requiring developers to add a mechanism to handle user sessions. Using serverless technologies, scalability is managed for the developer.

If traffic surges, the services scale to meet the demand without having to deploy additional servers. This allows applications to quickly transition from prototype to production.

The serverless LAMP architecture

A traditional web application can be split in to two components:

  • The static assets (media files, css, js)
  • The dynamic application (PHP, MySQL)

A serverless approach to serving these two components is illustrated below:

The serverless LAMP stack

The serverless LAMP stack

All requests for dynamic content (anything excluding /assets/*) are forwarded to Amazon API Gateway. This is a fully managed service for creating, publishing, and securing APIs at any scale. It acts as the “front door” to the PHP application, routing requests downstream to Lambda functions. The Lambda functions contain the business logic and interaction with the MySQL database. You can pass the input to the Lambda function as any combination of request headers, path variables, query string parameters, and body.

Notable AWS features for PHP developers

Amazon Aurora Serverless

During re:Invent 2017, AWS announced Aurora Serverless, an on-demand serverless relational database with a pay-per-use cost model. This manages the responsibility of relational database provisioning and scaling for the developer.

Lambda Layers and custom runtime API.

At re:Invent 2018, AWS announced two new Lambda features. These enable developers to build custom runtimes, and share and manage common code between functions.

Improved VPC networking for Lambda functions.

In September 2019, AWS announced significant improvements in cold starts for Lambda functions inside a VPC. This results in faster function startup performance and more efficient usage of elastic network interfaces, reducing VPC cold starts.

Amazon RDS Proxy

At re:Invent 2019, AWS announced the launch of a new service called Amazon RDS Proxy. A fully managed database proxy that sits between your application and your relational database. It efficiently pools and shares database connections to improve the scalability of your application.

 

Significant moments in the serverless LAMP stack timeline

Significant moments in the serverless LAMP stack timeline

Combining these services, it is now it is possible to build secure and performant scalable serverless applications with PHP and relational databases.

Custom runtime API

The custom runtime API is a simple interface to enable Lambda function execution in any programming language or a specific language version. The custom runtime API requires an executable text file called a bootstrap. The bootstrap file is responsible for the communication between your code and the Lambda environment.

To create a custom runtime, you must first compile the required version of PHP in an Amazon Linux environment compatible with the Lambda execution environment .To do this, follow these step-by-step instructions.

The bootstrap file

The file below is an example of a basic PHP bootstrap file. This example is for explanation purposes as there is no error handling or abstractions taking place. To ensure that you handle exceptions appropriately, consult the runtime API documentation as you build production custom runtimes.

#!/opt/bin/php
<?PHP

// This invokes Composer's autoloader so that we'll be able to use Guzzle and any other 3rd party libraries we need.
require __DIR__ . '/vendor/autoload.php;

// This is the request processing loop. Barring unrecoverable failure, this loop runs until the environment shuts down.
do {
    // Ask the runtime API for a request to handle.
    $request = getNextRequest();

    // Obtain the function name from the _HANDLER environment variable and ensure the function's code is available.
    $handlerFunction = array_slice(explode('.', $_ENV['_HANDLER']), -1)[0];
    require_once $_ENV['LAMBDA_TASK_ROOT'] . '/src/' . $handlerFunction . '.php;

    // Execute the desired function and obtain the response.
    $response = $handlerFunction($request['payload']);

    // Submit the response back to the runtime API.
    sendResponse($request['invocationId'], $response);
} while (true);

function getNextRequest()
{
    $client = new \GuzzleHttp\Client();
    $response = $client->get('http://' . $_ENV['AWS_LAMBDA_RUNTIME_API'] . '/2018-06-01/runtime/invocation/next');

    return [
      'invocationId' => $response->getHeader('Lambda-Runtime-Aws-Request-Id')[0],
      'payload' => json_decode((string) $response->getBody(), true)
    ];
}

function sendResponse($invocationId, $response)
{
    $client = new \GuzzleHttp\Client();
    $client->post(
    'http://' . $_ENV['AWS_LAMBDA_RUNTIME_API'] . '/2018-06-01/runtime/invocation/' . $invocationId . '/response',
       ['body' => $response]
    );
}

The #!/opt/bin/php declaration instructs the program loader to use the PHP binary compiled for Amazon Linux.

The bootstrap file performs the following tasks, in an operational loop:

  1. Obtains the next request.
  2. Executes the code to handle the request.
  3. Returns a response.

Follow these steps to package the bootstrap and compiled PHP binary together into a `runtime.zip`.

Libraries and dependencies

The runtime bootstrap uses an HTTP-based local interface. This retrieves the event payload for each Lambda function invocation and returns back the response from the function. This bootstrap file uses Guzzle, a popular PHP HTTP client, to make requests to the custom runtime API. The Guzzle package is installed using Composer package manager. Installing packages in this way creates a mechanism for incorporating additional libraries and dependencies as the application evolves.

Follow these steps to create and package the runtime dependencies into a `vendors.zip` binary.

Lambda Layers provides a mechanism to centrally manage code and data that is shared across multiple functions. When a Lambda function is configured with a layer, the layer’s contents are put into the /opt directory of the execution environment. You can include a custom runtime in your function’s deployment package, or as a layer. Lambda executes the bootstrap file in your deployment package, if available. If not, Lambda looks for a runtime in the function’s layers. There are several open source PHP runtime layers available today, most notably:

The following steps show how to publish the `runtime.zip` and `vendor.zip` binaries created earlier into Lambda layers and use them to build a Lambda function with a PHP runtime:

  1.  Use the AWS Command Line Interface (CLI) to publish layers from the binaries created earlier
    aws lambda publish-layer-version \
        --layer-name PHP-example-runtime \
        --zip-file fileb://runtime.zip \
        --region eu-west-1

    aws lambda publish-layer-version \
        --layer-name PHP-example-vendor \
        --zip-file fileb://vendors.zip \
        --region eu-west-1

  2. Make note of each command’s LayerVersionArn output value (for example arn:aws:lambda:eu-west-1:XXXXXXXXXXXX:layer:PHP-example-runtime:1), which you’ll need for the next steps.

Creating a PHP Lambda function

You can create a Lambda function via the AWS CLI, the AWS Serverless Application Model (SAM), or directly in the AWS Management Console. To do this using the console:

  1. Navigate to the Lambda section  of the AWS Management Console and choose Create function.
  2. Enter “PHPHello” into the Function name field, and choose Provide your own bootstrap in the Runtime field. Then choose Create function.
  3. Right click on bootstrap.sample and choose Delete.
  4. Choose the layers icon and choose Add a layer.
  5. Choose Provide a layer version ARN, then copy and paste the ARN of the custom runtime layer from in step 1 into the Layer version ARN field.
  6. Repeat steps 6 and 7 for the vendor ARN.
  7. In the Function Code section, create a new folder called src and inside it create a new file called index.php.
  8. Paste the following code into index.php:
    //index function
    function index($data)
    {
     return "Hello, ". $data['name'];
    }
    
  9. Insert “index” into the Handler input field. This instructs Lambda to run the index function when invoked.
  10. Choose Save at the top right of the page.
  11. Choose Test at the top right of the page, and  enter “PHPTest” into the Event name field. Enter the following into the event payload field and then choose Create:{ "name": "world"}
  12. Choose Test and Select the dropdown next to the execution result heading.

You can see that the event payload “name” value is used to return “hello world”. This is taken from the $data['name'] parameter provided to the Lambda function. The log output provides details about the actual duration, billed duration, and amount of memory used to execute the code.

Conclusion

This post explains how to create a Lambda function with a PHP runtime using Lambda Layers and the custom runtime API. It introduces the architecture for a serverless LAMP stack that scales with application traffic.

Lambda allows for functions with mixed runtimes to interact with each other. Now, PHP developers can join other serverless development teams focusing on shipping code. With serverless technologies, you no longer have to think about restarting webhosts, scaling or hosting.

Start building your own custom runtime for Lambda.

How to track changes to secrets stored in AWS Secrets Manager using AWS Config and AWS Config Rules

Post Syndicated from Jerry Hayes original https://aws.amazon.com/blogs/security/how-to-track-changes-to-secrets-stored-in-aws-secrets-manager-using-aws-config-and-aws-config-rules/

On April 20th, AWS Config announced support for AWS Secrets Manager, making it easier to track configuration changes to the secrets you manage in AWS Secrets Manager. You can now use AWS Config to track changes to secrets’ metadata — such as secret description and rotation configuration, relationship to other AWS sources such as the KMS Key used for secret encryption, Lambda function used for secret rotation, and attributes such as tags associated with the secrets.

You can also leverage two new AWS Managed Config Rules to evaluate if your secrets’ configuration is in compliance with your organization’s security and compliance requirements, identify secrets that don’t conform to these standards, and receive notifications about them via Amazon Simple Notification Service (SNS). Once enabled, these rules will trigger every time a secret’s configuration changes.

  • secretsmanager-rotation-enabled-check: Checks whether or not secrets present in AWS Secrets Manager are configured for rotation. This rule also supports the maximumAllowedRotationFrequency parameter which, if specified, will compare the secret’s configured rotation frequency to the value set in the parameter.
  • secretsmanager-scheduled-rotation-success-check: Checks whether or not secrets present in AWS Secrets Manager configured for rotation have been rotated within their rotation schedule.

In this blog post, I walk you through two ways to use AWS Config rules to determine if your organization’s secrets are being managed in compliance with your security requirements:

  • Example 1: Drive rotation adoption by identifying secrets in a single account that aren’t configured for rotation. This maps to the first managed rule listed above.
  • Example 2: Drive compliance with your security standards across multiple AWS accounts by creating an AWS Config Aggregator, which allows you to collect configuration and compliance data from multiple accounts across multiple regions.

Example 1: Drive rotation adoption by identifying secrets that aren’t configured for rotation in a single account and region

Many organizations require regular secret rotation. Use the new managed rule secretsmanager-rotation-check to verify whether your secrets are configured for automatic rotation.

  1. From the AWS Config console, navigate to Settings and ensure that Recording is on. Under Resource types to record, turn on recording for all resources by checking the All resources box next to Record all resources supported in this region, as shown in Figure 1 below.

    Figure 1: Enable Recording

    Figure 1: Enable Recording

  2. To set up the rule, go to the Rules page in the AWS Config console and select Add rule, as shown in Figure 2.

    Figure 2: Add Rule

    Figure 2: Add Rule

  3. Search for secretsmanager-rotation-enabled-check in the search bar and select the rule that appears, as shown in Figure 3.

    Figure 3: Search for rule

    Figure 3: Search for rule

  4. In Figure 4, I use the name secretsmanager-rotation-enabled-check for the name of my rule. Trigger type is set to run upon changes to the resource’s configuration. For Scope of changes, you can monitor all applicable resources for this rule type or resources with specific tags. In my example, I am monitoring all secrets where the ENVIRONMENT tag is set to PRODUCTION. And finally, under Rule Parameters, I set maximumAllowedRotationFrequency to 30 days.

    Figure 4: Add managed rule

    Figure 4: Add managed rule

  5. In my example, I specify AWS-PublishSNSNotification as my Remediation action and enter the parameters for AutomationAssumeRole, Message, and TopicArn topic so that I can receive notifications from an Amazon SNS topic regarding non-compliant secrets, as shown in Figure 5 below. Setting a Remediation action is optional. Once the rule is set up the way you want it, select Save.
    Figure 5: Choose remediation action

    Figure 5: Choose remediation action

    Once you’ve saved the rule, it will evaluate your secrets every time there’s a change in the secret metadata, and you’ll receive an Amazon SNS notification about non-compliant secrets.

  6. In the AWS Config console, view your results by selecting Rules from the menu on the left. In Figure 6, secretsmanager-rotation-enabled-check shows that it has detected 1 noncompliant resource.

    Figure 6: View rule evaluation results

    Figure 6: View rule evaluation results

  7. Select secretsmanager-rotation-enabled-check and it provides a link to the Resource ID of the non-compliant secret, as shown in Figure 7.

    Figure 7: Detailed view of rule with noncompliant secret

    Figure 7: Detailed view of rule with noncompliant secret

Example 2: Drive security compliance across multiple AWS accounts in your AWS Organization by creating an AWS Config Aggregator

Next, I’ll show you how to use the AWS Config Aggregator to review how secrets are configured across all accounts and regions in your AWS Organization so you can see whether they’re in compliance with your organization’s security and compliance requirements. AWS Organizations helps you centrally govern your environment as you grow and scale your workloads on AWS.

NOTE: You must enable AWS Config and the AWS Config managed rules specific to secrets in all accounts and regions that you want to monitor before creating the aggregator. You can use AWS CloudFormation StackSets to enable AWS Config and provision rules across accounts and regions as described here.

  1. In this example, I create the aggregator in my AWS Organization’s master account. From the AWS Config console, select Aggregators from the left menu, then select Add aggregator, as shown in Figure 8.

    Figure 8: Add aggregator

    Figure 8: Add aggregator

  2. Select the check box next to Allow data replication, as shown in Figure 9 below. This provides the permission for your AWS Organization’s master account to access the resource configuration and compliance details for all the accounts and regions in your AWS Organization.

    Figure 9: Allow data replication

    Figure 9: Allow data replication

  3. Provide a name for the aggregator. In Figure 10, I’ve named mine MyOrganizationsSecrets. Select Add my organization, then Choose IAM role. Select Create a Role and enter a role name and then select Choose IAM role. The IAM role allows AWS Config to get the list of accounts in your AWS Organization.
    Figure 10: Enable data replication and configure aggregator

    Figure 10: Enable data replication and configure aggregator

    NOTE: If you do not have an organization configured in AWS Organizations, you can select Add individual account IDs and then either add account IDs manually or update a comma separated list of accounts.

  4. Select Choose IAM role. Ensure Create a role is selected and enter a unique name. In Figure 11, I’ve named my role aws-config-aggregator-role. Select Choose IAM role again to create the role and again to continue.

    Figure 11: Choose IAM role

    Figure 11: Choose IAM role

  5. Select the Regions you want to aggregate data and select Save. In Figure 12, I’ve selected the two regions in which my AWS Organization uses Secrets Manager.
    Figure 12: Pick target regions for aggregation

    Figure 12: Select target regions for aggregation

    Once you’ve selected your regions, click Save.

  6. Select the aggregator you just created to see the Aggregated view. In Figure 13, I select MyOrganizationsSecrets.As noted on the console, an aggregator is an AWS Config resource type that collects AWS Config data from multiple accounts and regions, the data displayed in the dashboard is received from multiple aggregation sources and is refreshed at different intervals. Data might be delayed by a few minutes.

    Figure 13: Select aggregator

    Figure 13: Select aggregator

  7. In the Aggregated view shown in Figure 14 below, you can now see a dashboard view of all resources in your Organization, across all accounts and regions.On the top right, the Config rule compliance status shows that this organization has 11 compliant and 7 non-compliant rules. Below that is the Top 5 non-compliant rules which denotes the rule name, the region, the account number, and number of non-compliant resources.
    Figure 14: Aggregated view

    Figure 14: Aggregated view

    You can drill down into this data to view all compliant and non-compliant secrets in all your organization’s accounts and regions, and you can work with individual account or secret owners to drive security compliance — ensuring all secrets are configured for rotation, all secrets meet your organizations’ standard for rotation frequency, and secrets are rotated successfully.

  8. In Figure 15, I select secretsmanager-rotation-enabled-check for us-east-1 from the Top 5 non-complaint rules.

    Figure 15: Top 5 noncompliant rules

    Figure 15: Top 5 noncompliant rules

  9. The detail view in Figure 16 below shows the 5 non-compliant resources and their corresponding Resource IDs.

    Figure 16: Compliant and non-compliant secrets

    Figure 16: Compliant and non-compliant secrets

Summary

In this post, I showed you how to track and evaluate secret configuration using AWS Config and AWS Config Rules using the AWS Management Console. You can also do this using the AWS Config APIs or the AWS Command Line Interface (CLI).

This enables you to drive secrets management best practices in an individual account or across your AWS Organization. To get started managing secrets, open the Secrets Manager console. To learn more, read How to Store, Distribute, and Rotate Credentials Securely with Secret Manager or refer to the Secrets Manager documentation.

If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jerry Hayes

Jerry Hayes is a Solutions Architect Manager on the World Wide Public Sector (WWPS) Solutions Architect (SA) team where he manages a high-performing team of Specialist SAs supporting National Security customers. He holds a Master’s degree from George Washington University and a Bachelor’s degree from Virginia Tech (Go Hokies!). Outside of work, Jerry enjoys spending time with his family, watching football, running, and traveling to new and exciting places.

Enhancing automated database continuous integration with AWS CodeBuild and Amazon RDS Database Snapshot

Post Syndicated from bobyeh original https://aws.amazon.com/blogs/devops/enhancing-automated-database-continuous-integration-with-aws-codebuild-and-amazon-rds-database-snapshot/

In major integration merges, it’s sometimes necessary to verify the changes with existing online data. To inspect the changes with a cloned database can give us confidence to deploy to the production database. This post demonstrates how to use AWS CodeBuild and Amazon RDS Database Snapshot to verify your code revisions in both the application layer and the underlying layer, ensuring that your existing data works seamlessly with your revised code.

Making code revisions using continuous integration requires running periodic verification to ensure that your new deliverable works functionally and reliably. It’s easy to focus attention solely on the surface level changes made to the application layer. However, it’s important to remember to inspect the changes made to the underlying data layer too.

From the application layer, users modify the data model for different reasons. Any data model definition change in the application layer maps to a schema change in the database. For those services backed with a relational database (RDBMS), a user might perform data definition language (DDL) operations directly toward a database schema or rely on an object-relational mapping (ORM) library to migrate the schema to fit the application revision. These schema changes (CREATE, DROP, ALTER, TRUNCATE, etc.) can be very critical, especially for those services serving real customers.

Performing proper verification and simulation for these changes mitigates the risk of bringing down services. After the changes are applied, fundamental operation testing (CRUD – CREATE, READ, UPDATE, DELETE) toward data models is mandatory; this leads to data control language (DCL) operations (INSERT, SELECT, UPDATE, DELETE, etc.). After all the necessary steps, a user can move on to the deployment stage.

About this page

  • Time to read:6 minutes
  • Time to complete:30 minutes
  • Cost to complete (estimated):Less than $1 for 1-GB database snapshot and restored instance
  • Learning level:Advanced (300)
  • Services used:AWS CodeBuild, IAM, RDS

Solution overview

This example uses a buildspec file in CodeBuild. Set up a build project that points to a source control repository containing that buildspec file. The CodeBuild runtime environment restores the database server from an RDS snapshot.We restore snapshot to an Amazon Aurora cluster as example through AWS Command Line Interface (AWS CLI). After the database is restored, the build process starts to run your integration process, which is in mock code in the buildspec definition. After the verification stage, CodeBuild drops the restored database.

 

Architecture diagram showing an overview of how we use CodeBuild to restore a database snapshot to verify and validate the new database schema change.

Prerequisites

The following components are required to implement this example:

Walkthrough

Follow these steps to execute the solution.

Prepare your build specification file

Before you begin, prepare your CodeBuild Build Specification file with following information:

  • db-cluster-identifier-prefix
  • db-snapshot-identifier
  • region-ID
  • account-ID
  • vpc-security-group-id

The db-cluster-identifier-prefix creates a temporary database followed by a timestamp. Make sure that this value does not overlap with any other databases. The db-snapshot-identifier points to the snapshot you are calling to run with your application. Region-ID and account-ID describe the account on which you are running. The vpc-security-group-id indicates the security group you use in the CodeBuild environment and temporary database.

YAML
Version: 0.2
phases:
  install:
    runtime-versions:
      python: 3.7
pre_build:
  commands:
    - pip3 install awscli --upgrade --user
    - export DATE=`date +%Y%m%d%H%M`
    - export DBIDENTIFIER=db-cluster-identifier-prefix-$DATE
    - echo $DBIDENTIFIER
    - aws rds restore-db-cluster-from-snapshot --snapshot-identifier arn:aws:rds:region-ID:account-ID:cluster-snapshot:db-snapshot-identifier –vpc-security-group-ids vpc-security-group-id --db-cluster-identifier $DBIDENTIFIER --engine aurora
    - while [ $(aws rds describe-db-cluster-endpoints --db-cluster-identifier $DBNAME | grep -c available) -eq 0 ]; do echo "sleep 60s"; sleep 60; done
    - echo "Temp db ready"
    - export ENDPOINT=$(aws rds describe-db-cluster-endpoints --db-cluster-identifier $DBIDENTIFIER| grep "\"Endpoint\"" | grep -v "\-ro\-" | awk -F '\"' '{print $4}')
    - echo $ENDPOINT
build:
  commands:
    - echo Build started on `date`
    - echo proceed db connection to $ENDPOINT
    - echo proceed db migrate update, DDL proceed here
    - echo proceed application test, CRUD test run here
post_build:
  commands:
    - echo Build completed on `date`
    - echo $DBNAME
    - aws rds delete-db-cluster --db-cluster-identifier $DBIDENTIFIER --skip-final-snapshot &

 

After you finish editing the file, name it buildspec.yml. Save it in the root directory with which you plan to build, then commit the file into your code repository.

  1. Open the CodeBuild console.
  2. Choose Create build project.
  3. In Project Configuration, enter the name and description for the build project.
  4. In Source, select the source provider for your code repository.
  5. In Environment image, choose Managed image, Ubuntu, and the latest runtime version.
  6. Choose the appropriate service role for your project.
  7. In the Additional configuration menu, select the VPC with your Amazon RDS database snapshots, as shown in the following screenshot, and then select Validate VPC Settings. For more information, see Use CodeBuild with Amazon Virtual Private Cloud.
  8. In Security Groups, select the security group needed for the CodeBuild environment to access your temporary database.
  9. In Build Specifications, select Use a buildspec file.

CodeBuild Project Additional Configuration - VPC

Grant permission for the build project

Follow these steps to grant permission.

  1. Navigate to the AWS Management Console Policies.
  2. Choose Create a policy and select the JSON tab.To give CodeBuild access to the Amazon RDS resource in the pre_build stage, you must grant RestoreDBClusterFromSnapshot and DeleteDBCluster. Follow the least privilege guideline and limit the DeleteDBCluster action point to “arn:aws:rds:*:*:cluster: db-cluster-identifier-*”.
  3. Copy the following code and paste it into your policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": "rds:RestoreDBClusterFromSnapshot",
          "Resource": "*"
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": "rds:DeleteDBCluster*",
          "Resource": "arn:aws:rds:*:*:cluster:db-cluster-identifier-*"
        }
      ]
    }
  4. Choose Review Policy.
  5. Enter a Name and Description for this policy, then choose Create Policy.
  6. After the policy is ready, attach it to your CodeBuild service role, as shown in the following screenshot.

Attach created policy to IAM role

Use database snapshot restore to launch the build process

  1. Navigate back to CodeBuild and locate the project you just created.
  2. Give an appropriate timeout setting and make sure that you set it to the correct branch for your repository.
  3. Choose Start Build.
  4. Open the Build Log to view the database cluster from your snapshot in the pre_build stage, as shown in the following screenshot.CodeBuild ProjectBuild Log - pre_build stage
  5. In the build stage, use $ENDPOINT to point your application to this temporary database, as shown in the following screenshot.CodeBuild Project Build Log - build stage
  6. In the post_build, delete the cluster, as shown in the following screenshot.CodeBuild Project Build log - post build stage

Test your database schema change

After you set up this pipeline, you can begin to test your database schema change within your application code. This example defines several steps in the Build Specifications file to migrate the schema and run with the latest application code. In this example, you can verify that all the modifications fit from the application to the database.

YAML
build:
  commands:
    - echo Build started on `date`
    - echo proceed db connection to $ENDPOINT
    # run a script to apply your latest schema change
    - echo proceed db migrate update
    # start the latest code, and run your own testing
    - echo proceed application test

After validation

After we validated the database schema change in the above steps, a suitable strategy for deployment to production should be utilized that would align with the criteria to satisfy the business goals.

Cleaning up

To avoid incurring future charges, delete the resources as following steps:

  1. Open the CodeBuild console
  2. Click the project you created for this test.
  3. Click the delete build project and input delete to confirm deletion.

Conclusion

In this post, you created a mechanism to set up a temporary database and limit access into the build runtime. The temporary database stands alone and isolated. This mechanism can be applied to secure the permission control for the database snapshot, or not to break any existing environment. The database engine applies to all available RDS options, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. This provides options, without impacting any existing environments, for critical events triggered by major changes in the production database schema, or data format changes required by business decisions.

 

Integrating AWS Lake Formation with Amazon RDS for SQL Server

Post Syndicated from Marcio Morales original https://aws.amazon.com/blogs/big-data/integrating-aws-lake-formation-with-amazon-rds-for-sql-server/

To grow and develop your business, you must collect data from a myriad of sources (such as relational and NoSQL databases, clickstream, and IoT events) and combine it using analytics to extract insight. AWS Lake Formation enables you to ingest data from many different sources into a data lake based in Amazon S3. Furthermore, you can use Lake Formation to control access to this data from a single place.

This post shows how to ingest data from Amazon RDS into a data lake on Amazon S3 using Lake Formation blueprints and how to have column-level access controls for running SQL queries on the extracted data from Amazon Athena.

A blueprint is a data management template that enables you to ingest data into a data lake easily. Lake Formation provides several blueprints, each for a predefined source type, such as a relational database or AWS CloudTrail logs. From a blueprint, you can create a workflow. Workflows consist of AWS Glue crawlers, jobs, and triggers that orchestrate the loading and update of data. Blueprints take the data source, data target, and schedule as input to configure the workflow.

For more information about Lake Formation, see AWS Lake Formation: How It Works.

Walkthrough overview

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.

Lake Formation makes it easy to set up a secure data lake quickly. The walkthrough contains the following steps:

  1. Register an S3 bucket as a data lake storage
  2. Create a logical database to hold the tables that represent your data
  3. Create a workflow for ingesting data from Amazon RDS for SQL Server to the data lake
  4. Grant permissions to access the tables in the data lake

Prerequisites

To complete this walkthrough, you must have the following:

  • An Amazon RDS for SQL Server up and running
  • An IAM user with access to Lake Formation, Amazon S3, AWS Glue, and Athena

Registering your Amazon S3 storage

To register your S3 bucket, complete the following steps:

  1. On the AWS Lake Formation console, choose Get Started.
  2. Choose Register location.
  3. For Amazon S3 path, choose Browse.
  4. Choose the S3 bucket you designated for your data lake.
  5. Make sure that the S3 bucket you selected has the correct IAM permissions to work with Lake Formation.
  6. Choose Review location permissions.
  7. For IAM role, choose an IAM role designated to allow Lake Formation to read and write data in the S3 bucket.You have the option to let Lake Formation create an IAM role with the necessary permissions, or you can select an existing IAM role.
  8. Choose Register location.
  9. On the AWS Lake Formation console, under Register and ingest, choose Data lake locations.You can see your S3 bucket registered.

Creating a database

You are now ready to create a database to hold your data lake tables. Databases are logical and can be treated as namespaces. They are containers for the metadata tables that the AWS Glue Data Catalog stores. Databases can have an optional location property. This location needs to be within an Amazon S3 location that is registered with Lake Formation.

To create your database, complete the following steps:

  1. Choose Create database.
  2. For Name, enter a name for your database.
  3. For Location, enter the location of your S3 bucket.
  4. Choose Create database.

Creating a workflow to ingest data

Now that your data lake is set up, you are ready to ingest data. Lake Formation includes a feature called blueprints, which allows you to ingest data from MySQL, PostgreSQL, Oracle, and SQL Server databases into your data lake. Blueprints allow you to ingest data by copying a table or incrementally loading newly inserted data. When you use blueprints, they create an AWS Glue workflow and all relevant jobs and crawlers to enable the discovery and movement of data.

To create your workflow to ingest data, complete the following steps:

  1. On the Lake Formation console, under Register and ingest, choose Blueprints.
  2. In the Blueprint overview section, choose Use blueprints.
  3. For Blueprint type, select what type of data you want to ingest to your data lake.For this walkthrough, choose Database snapshot, because the source table is not very large. For large tables, an incremental database may be a better option.To ingest data to your data lake, you must define a database connection with parameters to allow AWS Glue to access the source database.
  4. In the Import source section, for Database connection, choose Create a connection in AWS Glue.This step redirects you to the AWS Glue console.
  5. On the Connections page, for Connection name, enter the name of your connection.
  6. For Connection type, choose your driver.You can choose between Amazon RDS, Amazon Redshift, and JDBC as the driver. For this walkthrough, choose JDBC, so your job can talk to both Amazon RDS and SQL Server on Amazon EC2 using the same connection configuration.
  7. Select if the connection requires SSL.
  8. Choose Next.
  9. For JDBC URL, you must follow the path structure mentioned in the AWS Glue documentation.Change the value MYRDSENDPOINT in the JDBC URL with the real Amazon RDS endpoint from your environment.
  10. Enter a username and password that has read permission to your database.
  11. Choose the VPC and Subnet where your database instance is running.
  12. Choose a Security Group that allows AWS Glue to connect to your database instance.
  13. Choose Next.
  14. On the Review all steps page, choose Finish.You return to the Lake Formation console.
  15. In the Import source section, choose the refresh button next to the Database connection drop-down menu.
  16. Choose the connection you just created.
  17. For Source data path, enter the database name.If you choose to only read a particular table, you can update this field to reflect the exact database, schema, and table you want. Wildcards are allowed.
  18. As an option, enter any exclude patterns to exclude any data you may not want to ingest into your data lake.For more information, see Include and Exclude Patterns.You are now ready to configure the target of the workflow.
  19. In the Import target section, for Target database, choose the database you just created.
  20. For Target storage location, enter the data lake location you created.
  21. For Data format, choose Parquet.For more information about Parquet, see the Apache Parquet website.
  22. For Import frequency, choose your import frequency.For this walkthrough, choose Run on demand.
  23. In the Import options section, for Workflow name, enter a name for the workflow.
  24. For IAM Role, choose an IAM Role that allows AWS Glue to call AWS services on your behalf.For instructions on setting up this IAM role, see documentation.
  25. For Table prefix, enter the table prefix the job uses when creating the final output table in the data lake.This post enters sqlserver as a way to discover the tables the workflow ingests quickly.
  26. For Maximum capacity, enter the number of DPUs to allocate when the job runs.This post enters 5.
  27. For Concurrency, enter the maximum number of concurrent runs that are allowed for this job.This post enters 5.
  28. Choose Create.
  29. On the Lake Formation console, choose Blueprints.You can see the workflow you created under Workflows.
  30. Choose the workflow.
  31. From the Actions drop-down menu, choose Start.The process of ingestion and transformation varies based on the database size.
  32. Choose the workflow.
  33. Choose the run ID of the current run.This step redirects you to the AWS Glue console, where you can dig deeper into each task of the workflow.

Granting permissions

Now that you have data in your data lake on Amazon S3 and metadata tables in your database on the Data Catalog, you can grant users permissions to access the data. Complete the following steps:

  1. Choose Grant permissions.
  2. For IAM users and roles, choose the IAM users or roles that are allowed to access the data.You have the option to grant access to a database, a set of tables within a database, or a specific set of columns within a table.

    For this post, the login is federated through Active Directory Federation Services (ADFS). Active Directory Domain Services Groups (ADDS) manages access privileges, and users can select which IAM Role to assume.

    This post assumes the IAM Role ADFS-Marketing. It has access to the database Lakeformationdatabase and table sqlserver_appdatabase01_dbo_customer. Furthermore, the role does not have permission to access the phone column.You can now verify that permissions are working as expected.

  3. Log in with a federated user assuming the ADFS-Marketing role.The following screenshot shows that the phone column isn’t available.
    However, if you assume a role that has access to all of the specific table, you can view the phone column. See the following screenshot.

Conclusion

This post demonstrated how to quickly register an Amazon S3 data lake in Lake Formation, ingest data from an operational SQL Server database using blueprints and grant permissions at the column level. Not only this but, also show customers that is possible to reduce third-party tools license costs using AWS Services.

 


About the Author

Marcio Morales is a Senior Microsoft Specialist SA at AWS.

 

 

 

Enabling job accounting for HPC with AWS ParallelCluster and Amazon RDS

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/enabling-job-accounting-for-hpc-with-aws-parallelcluster-and-amazon-rds/

This post is written by Nicola Venuti, HPC Specialist SA, and contributed to by Rex Chen, Software Development Engineer.

Introduction

Accounting, reporting, and advanced analytics used for data-driven planning and decision making are key areas of focus for High Performance Computing (HPC) Administrators. In the cloud, these areas are more relevant to the costs of the services, which directly impact budgeting and forecasting of expenses. With the growth of new HPC services that perform analyses and corrective actions, you can better optimize for performance, which reduces cost.

Solution Overview

In this blog post, we walk through an easy way to collect accounting information for evert job and step executed in a cluster with job scheduling. This post uses a new feature in the latest version (2.6.0) of AWS ParallelCluster, which makes this process easier than before, and Slurm.  Accounting records are saved into a relational database for both currently executing jobs and jobs, which have already terminated.

Prerequisites

This tutorial assumes you already have a cluster in AWS ParallelCluster. If  you don’t, refer to the AWS ParallelCluster documentation, a getting started blog post, or a how-to blog post.

Solution

Choose your architecture

There are two common architectures to save job accounting information into a database:

  1. Installing and directly managing a DBMS in the master node of your cluster (or in an additional EC2 instance dedicated to it)
  2. Using a fully managed service like Amazon Relational Database Service (RDS)

While the first option might appear to be the most economical solution, it requires heavy lifting. You must install and manage the database, which is not a core part of running your HPC workloads.  Alternatively, Amazon RDS reduces this burden of installing updates, managing security patches, and properly allocating resources.  Additionally, Amazon RDS Free Tier can get you started with a managed database service in the cloud for free. Refer to the hyperlink for a list of free resources.

Amazon RDS is my preferred choice, and the following sections implement this architecture. Bear in mind, however, that the settings and the customizations required in the AWS ParallelCluster environment are the same, regardless of which architecture you prefer.

 

Set up and configure your database

Now, with your architecture determined, let’s configure it.  First, go to Amazon RDS’ console.  Select the same Region where your AWS ParallelCluster is deployed, then click on Create Database.

There are two database instances to consider: Amazon Aurora and MySQL.

Amazon Aurora has many benefits compared to MySQL. However, in this blog post, I use MySQL to show how to build your HPC accounting database within the Free-tier offering.

The following steps are the same regardless of your database choice. So, if you’re interested in one of the many features that differentiate Amazon Aurora from MySQL, feel free to use. Check out Amazon Aurora’s landing page to learn more about its benefits, such as its faster performance and cost effectiveness.

To configure your database, you must complete the following steps:

  1. Name the database
  2. Establish credential settings
  3. Select the DB instance size
  4. Identify storage type
  5. Allocate amount of storage

The following images show the settings that I chose for storage options and the “Free tier” template.  Feel free to change it accordingly to the scope and the usage you expect.

Make sure you select the corresponding VPC to wherever your “compute fleet” is deployed by AWS ParallelCluster, and wherever the Security Group of your compute fleet is selected.  You can access information for your “compute fleet” in your AWS ParallelCluster config file. The Security Group should look something like this: “parallelcluster-XXX-ComputeSecurityGroup-XYZ”.

At this stage, you can click on Create database and wait until the Database status moves from the Creating to Available in the Amazon RDS Dashboard.

The last step for this section is to grant privileges on the database.

  1. Connect to your database. Use the master node of your AWS ParallelCluster as a client.
  2. Install the MySQL client by running sudo yum install mysql on AmazonLinux and CentOS and sudo apt-get install mysql-client on Ubuntu.
  3. Connect to your MySQL RDS database using the following code: mysql --host=<your_rds_endpoint> --port=3306 -u admin -p The following screenshot shows how to find your RDS endpoint and port.

 

4. Run GRANT ALL ON `%`.* TO [email protected]`%`; to grant the required privileges.

The following code demonstrates these steps together:

[[email protected]]$ mysql --host=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com --port=3306 -u admin -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 251
Server version: 8.0.16 Source distribution

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> GRANT ALL ON `%`.* TO [email protected]`%`;

Note: typically this command is run as GRANT ALL ON *.* TO 'admin'@'%';   With Amazon RDS , for security reasons, this is not possible as the master account does not have access to the MySQL database. Using *.*  triggers an error. To work around this, I use the _ and % wildcards that are permitted. To look at the actual grants, you can run the following: SHOW GRANTS;

Enable Slurm Database logging

Now, your database is fully configured. The next step is to enable Slurm as a workload manager.

A few steps must occur to let Slurm log its job accounting information on an external database. The following code demonstrates the steps you must make.

  1. Add the DB configuration file, slurmdbd.conf after /opt/slurm/etc/ 
  2. Slurm’s slurm.conf file requires a few modifications. These changes are noted after the following code examples.

 

Note: You do not need to configure each and every compute node because AWS ParallelCluster installs Slurm in a shared directory. All of these nodes share this directory, and, thus the same configuration files with the master node of your cluster.

Below, you can find two example configuration files that you can use just by modifying a few parameters accordingly to your setup.

For more information about all the possible settings of configuration parameters, please refer to the official Slurm documentation, and in particular to the accounting section.

Add the DB configuration file

#
## Sample /opt/slurm/etc/slurmdbd.conf
#
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
AuthType=auth/munge
DbdHost=ip-10-0-16-243  #YOUR_MASTER_IP_ADDRESS_OR_NAME
DbdPort=6819
DebugLevel=info
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
SlurmUser=slurm
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageUser=admin
StoragePass=password
StorageHost=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com # Endpoint from RDS console
StoragePort=3306                                                                # Port from RDS console

See below for key values that you should plug into the example configuration file:

  • DbdHost: the name of the machine where the Slurm Database Daemon is executed. This is typically the master node of your AWS ParallelCluster. You can run hostname -s on your master node to get this value.
  • DbdPort: The port number that the Slurm Database Daemon (slurmdbd) listens to for work. 6819 is the default value.
  • StorageUser: Define the user name used to connect to the database. This has been defined during the Amazon RDS configuration as shown in the second step of the previous section.
  • StoragePass: Define the password used to gain access to the database. Defined as the user name during the Amazon RDS configuration.
  • StorageHost: Define the name of the host running the database. You can find this value in the Amazon RDS console, under “Connectivity & security”.
  • StoragePort: Define the port on which the database is listening. You can find this value in the Amazon RDS console, under “Connectivity & security”. (see the screenshot below for more information).

Modify the file

Add the following lines at the end of the slurm configuration file:

#
## /opt/slurm/etc/slurm.conf
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ip-10-0-16-243
AccountingStorageUser=admin
AccountingStoragePort=6819

Modify the following:

  • AccountingStorageHost: The hostname or address of the host where SlurmDBD executes. In our case this is again the master node of our AWS ParallelCluster, you can get this value by running hostname -s again.
  • AccountingStoragePort: The network port that SlurmDBD accepts communication on. It must be the same as DbdPort specified in /opt/slurm/etc/slurmdbd.conf
  • AccountingStorageUser: it must be the same as in /opt/slurm/etc/slurmdbd.conf (specified in the “Credential Settings” of your Amazon RDS database).

 

Restart the Slurm service and start the SlurmDB demon on the master node

 

Depending on the operating system you are running, this would look like:

  • Amazon Linux / Amazon Linux 2
[[email protected]]$ sudo /etc/init.d/slurm restart                                                                                                                                                                                                                                             
stopping slurmctld:                                        [  OK  ]
slurmctld is stopped
slurmctld is stopped
starting slurmctld:                                        [  OK  ]
[[email protected]]$ 
[[email protected]]$ sudo /opt/slurm/sbin/slurmdbd
  • CentOS7 and Ubuntu 16/18
[[email protected] ~]$ sudo /opt/slurm/sbin/slurmdbd
[[email protected] ~]$ sudo systemctl restart slurmctld

Note: even if you have jobs running, restarting the daemons will not affect them.

Check to see if your cluster is already in the Slurm Database:

/opt/slurm/bin/sacctmgr list cluster

And if it is not (see below):

[[email protected]]$ /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
[[email protected]]$ 

You can add it as follows:

sudo /opt/slurm/bin/sacctmgr add cluster parallelcluster

You should now see something like the following:

[[email protected]]$ /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
parallelc+     10.0.16.243         6817  8704         1                                                                                           normal           
[[email protected]]$ 

At this stage, you should be all set with your AWS ParallelCluster accounting configured to be stored in the Amazon RDS database.

Replicate the process on multiple clusters

The same database instance can be easily used for multiple clusters to log its accounting data in. To do this, repeat the last configuration step for your clusters built using AWS ParallelCluster that you want to share the same database.

The additional steps to follow are:

  • Ensure that all the clusters are in the same VPC (or, if you prefer to use multiple VPCs, you can choose to set up VPC-Peering)
  • Add the SecurityGroup of your new compute fleets (“parallelcluster-XXX-ComputeSecurityGroup-XYZ”) to your RDS database
  • Change the cluster name parameter at the very top of the file. This is in addition to the slurm configuration file ( /opt/slurm/etc/slurm.conf) editing explained prior.  By default, your cluster is called “parallelcluster.” You may want to change that to clearly identify other clusters using the same database. For instance: ClusterName=parallelcluster2

Once these additional steps are complete, you can run /opt/slurm/bin/sacctmgr list cluster  again. Now, you should see two (or multiple) clusters:

[[email protected] ]# /opt/slurm/bin/sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
parallelc+                            0  8704         1                                                                                           normal           
parallelc+     10.0.16.129         6817  8704         1                                                                                           normal           
[[email protected] ]#   

If you want to see the full name of your clusters, run the following:

[[email protected] ]# /opt/slurm/bin/sacctmgr list cluster format=cluster%30
                       Cluster 
------------------------------ 
               parallelcluster 
              parallelcluster2 
[[email protected] ]# 

Note: If you check the Slurm logs (under /var/log/slurm*), you may see this error:

error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout

This error refers to default parameters that Amazon RDS sets for you on your MySQL database. You can change them by setting new “group parameters” as explained in the official documentation and in this support article. Please also note that the innodb_buffer_pool_size is related to the amount of memory available on your instance, so you may want to use a different instance type with higher memory to avoid this warning.

Run your first job and check the accounting

Now that the application is installed and configured, you can test it! Submit a job to Slurm, query your database, and check your job accounting information.

If you are using a brand new cluster, test it with a simple hostname job as follows:

[[email protected]]$ sbatch -N2 <<EOF
> #!/bin/sh
> srun hostname |sort
> srun sleep 10
> EOF
Submitted batch job 31
[[email protected]]$

Immediately after you have submitted the job, you should see it with a state of “pending”:

[[email protected]]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2    PENDING      0:0 
[[email protected]]$ 

And, after a while the job should be “completed”:

[[email protected]]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2  COMPLETED      0:0 
38.batch           batch                                1  COMPLETED      0:0 
38.0           hostname                                2  COMPLETED      0:0 
[[email protected]]$

Now that you know your cluster works, you can build complex queries using sacct. See few examples below, and refer to the official documentation for more details:

[[email protected]]$ sacct --format=jobid,elapsed,ncpus,ntasks,state
       JobID    Elapsed      NCPUS   NTasks      State 
------------ ---------- ---------- -------- ---------- 
38             00:00:00          2           COMPLETED 
38.batch       00:00:00          1        1  COMPLETED 
38.0           00:00:00          2        2  COMPLETED 
39             00:00:00          2           COMPLETED 
39.batch       00:00:00          1        1  COMPLETED 
39.0           00:00:00          2        2  COMPLETED 
40             00:00:10          2           COMPLETED 
40.batch       00:00:10          1        1  COMPLETED 
40.0           00:00:00          2        2  COMPLETED 
40.1           00:00:10          2        2  COMPLETED 
[[email protected]]$ sacct --allocations
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
38               sbatch    compute                     2  COMPLETED      0:0 
39               sbatch    compute                     2  COMPLETED      0:0 
40               sbatch    compute                     2  COMPLETED      0:0 
[[email protected]]$ sacct -S2020-02-17 -E2020-02-20 -X -ojobid,start,end,state
       JobID               Start                 End      State 
------------ ------------------- ------------------- ---------- 
38           2020-02-17T12:25:12 2020-02-17T12:25:12  COMPLETED 
39           2020-02-17T12:25:12 2020-02-17T12:25:12  COMPLETED 
40           2020-02-17T12:27:59 2020-02-17T12:28:09  COMPLETED 
[[email protected]]$ 

If you have configured your cluster(s) for multiple users, you may want to look at the accounting info for all of these.  If you want to configure your clusters with multiple users, follow this blog post. It demonstrates how to configure AWS ParallelCluster with AWS Directory Services to create a multiuser, POSIX-compliant system with centralized authentication.

Each and every user can only look at his own accounting data. However, Slurm admins (or root) can see accounting info for every user. The following code shows accounting data coming from two clusters (parallelcluster and parallelcluster2) and from two users (ec2-user and nicola):

[[email protected] ~]# sacct -S 2020-01-01 --clusters=parallelcluster,parallelcluster2 --format=jobid,elapsed,ncpus,ntasks,state,user,cluster%20                                                                                                                                                        
       JobID    Elapsed      NCPUS   NTasks      State      User              Cluster 
------------ ---------- ---------- -------- ---------- --------- -------------------- 
36             00:00:00          2              FAILED  ec2-user      parallelcluster 
36.batch       00:00:00          1        1     FAILED                parallelcluster 
37             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
37.batch       00:00:00          1        1  COMPLETED                parallelcluster 
37.0           00:00:00          2        2  COMPLETED                parallelcluster 
38             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
38.batch       00:00:00          1        1  COMPLETED                parallelcluster 
38.0           00:00:00          2        2  COMPLETED                parallelcluster 
39             00:00:00          2           COMPLETED  ec2-user      parallelcluster 
39.batch       00:00:00          1        1  COMPLETED                parallelcluster 
39.0           00:00:00          2        2  COMPLETED                parallelcluster 
40             00:00:10          2           COMPLETED  ec2-user      parallelcluster 
40.batch       00:00:10          1        1  COMPLETED                parallelcluster 
40.0           00:00:00          2        2  COMPLETED                parallelcluster 
40.1           00:00:10          2        2  COMPLETED                parallelcluster 
41             00:00:29        144              FAILED  ec2-user      parallelcluster 
41.batch       00:00:29         36        1     FAILED                parallelcluster 
41.0           00:00:00        144      144  COMPLETED                parallelcluster 
41.1           00:00:01        144      144  COMPLETED                parallelcluster 
41.2           00:00:00          3        3  COMPLETED                parallelcluster 
41.3           00:00:00          3        3  COMPLETED                parallelcluster 
42             01:22:03        144           COMPLETED  ec2-user      parallelcluster 
42.batch       01:22:03         36        1  COMPLETED                parallelcluster 
42.0           00:00:01        144      144  COMPLETED                parallelcluster 
42.1           00:00:00        144      144  COMPLETED                parallelcluster 
42.2           00:00:39          3        3  COMPLETED                parallelcluster 
42.3           00:34:55          3        3  COMPLETED                parallelcluster 
43             00:00:11          2           COMPLETED  ec2-user      parallelcluster 
43.batch       00:00:11          1        1  COMPLETED                parallelcluster 
43.0           00:00:01          2        2  COMPLETED                parallelcluster 
43.1           00:00:10          2        2  COMPLETED                parallelcluster 
44             00:00:11          2           COMPLETED    nicola      parallelcluster 
44.batch       00:00:11          1        1  COMPLETED                parallelcluster 
44.0           00:00:01          2        2  COMPLETED                parallelcluster 
44.1           00:00:10          2        2  COMPLETED                parallelcluster 
4              00:00:10          2           COMPLETED    nicola     parallelcluster2 
4.batch        00:00:10          1        1  COMPLETED               parallelcluster2 
4.0            00:00:00          2        2  COMPLETED               parallelcluster2 
4.1            00:00:10          2        2  COMPLETED               parallelcluster2 
[[email protected] ~]# 

You can also directly query your database, and look at the accounting information stored in it or link your preferred BI tool to get insights from your HPC cluster. To do so, run the following code:

[[email protected]]$ mysql --host=parallelcluster-accounting.c68dmmc6ycyr.us-east-1.rds.amazonaws.com --port=3306 -u admin -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 48
Server version: 8.0.16 Source distribution

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| slurm_acct_db      |
+--------------------+
4 rows in set (0.00 sec)

mysql> use slurm_acct_db;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------------------------+
| Tables_in_slurm_acct_db                 |
+-----------------------------------------+
| acct_coord_table                        |
| acct_table                              |
| clus_res_table                          |
| cluster_table                           |
| convert_version_table                   |
| federation_table                        |
| parallelcluster_assoc_table             |
| parallelcluster_assoc_usage_day_table   |
| parallelcluster_assoc_usage_hour_table  |
| parallelcluster_assoc_usage_month_table |
| parallelcluster_event_table             |
| parallelcluster_job_table               |
| parallelcluster_last_ran_table          |
| parallelcluster_resv_table              |
| parallelcluster_step_table              |
| parallelcluster_suspend_table           |
| parallelcluster_usage_day_table         |
| parallelcluster_usage_hour_table        |
| parallelcluster_usage_month_table       |
| parallelcluster_wckey_table             |
| parallelcluster_wckey_usage_day_table   |
| parallelcluster_wckey_usage_hour_table  |
| parallelcluster_wckey_usage_month_table |
| qos_table                               |
| res_table                               |
| table_defs_table                        |
| tres_table                              |
| txn_table                               |
| user_table                              |
+-----------------------------------------+
29 rows in set (0.00 sec)

mysql>

Conclusion

You’re finally all set! In this blog post you set up a database using Amazon RDS, configured AWS ParallelCluster and Slurm to enable job accounting with your database, and learned how to query your job accounting history from your database using the sacct command or by running SQL queries.

Deriving insights for your HPC workloads doesn’t end when your workloads finish running. Now, you can better understand and optimize your usage patterns and generate ideas about how to wring more price-performance out of your HPC clusters on AWS. For retrospective analysis, you can easily understand whether specific jobs, projects, or users are responsible for driving your HPC usage on AWS.  For forward-looking analysis, you can better forecast future usage to set budgets with appropriate insight into your costs and your resource consumption.

You can also use these accounting systems to identify users who may require additional training on how to make the most of cloud resources on AWS. Finally, together with your spending patterns, you can better capture and explain the return on investment from all of the valuable HPC work you do. And, this gives you the raw data to analyze how you can get even more value and price-performance out of the work you’re doing on AWS.

 

 

 

 

Amazon Lightsail Database Tips and Tricks

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/amazon-lightsail-database-tips-and-tricks/

This post is contributed by Mike Coleman | Developer Advocate for Lightsail | Twitter: @mikegcoleman

Managed Databases on Amazon Lightsail are affordably priced, and incredibly easy to run. Lightsail databases offer a solid foundation on which to build your application.  You can leverage attractive features like one-click high availability, automatic backups, and a choice of database engines to support your Lightsail apps.

While it’s super simple to do an initial deployment on Amazon Lightsail, I often get questions about how to perform some standard management tasks. Some examples of these tasks are scaling up a database or accessing that database with command line tools. I am also asked how to handle a scenario when you find that you need some of the advanced features found in Amazon Relation Database Service (RDS).

This blog answers these questions and offers general guidance on how to address these issues.

Scale Up Your Database

When I first deploy resources to the cloud, I always choose the least expensive option. Often times, that choice works out and everything runs fine. But sometimes, this results in under sizing resources, which necessitates a move to resources with more horsepower.

If this happens with your Lightsail databases, it’s straightforward to move your database to a larger size. Additionally, you can check the metrics page in the Amazon Lightsail console to see your database performance, and to determine if you need to upgrade.

Let’s walk through how to size up your database.

Start by creating a snapshot of your instance.

  1. Navigate to the Lightsail home page and click databases
  2. Click on the name of your database
  3. From the horizontal menu, click on Snapshots & restoreScreenshot of the snapshot and restore choice
  4. Under Manual Snapshot click + Create snapshotscreenshot of where to hit create snapshot
  5. Give the snapshot a name
  6. Click Create

It takes several minutes for the snapshot creation process to complete. Once the snapshot is available, you can create your new database instance choosing a larger size.

  1. Click the three-dot menu to the right of the snapshot you just created
  2. Choose Create new database
  3. Under Choose your database plan, select either a Standard or High Availability If you’re running a mission critical application, you definitely want to choose the high availability option. Standard is great for test environments or workloads where your application can withstand downtime in the event of a database failure.
  4. Choose the size for your new database instance
  5. Give your database instance a name
  6. Click Create database

The new database is created after several minutes.

Lightsail generates a new password when you create a new database from a snapshot. You can either use this newly generated password, or change it. You can change the password using the following steps:

  1. From the Lightsail home, page click Databases
  2. Scroll down to the Connection details section
  3. If you want to use the auto-generated password, click Show in the password box to display the password
    Otherwise complete steps 4 and 5 to specify a new password.
  4. Under Password, click Change password
  5. Enter a new password and click Save
    It will take a few minutes for the password to update

Now, go into your application. Configure the application to point the new database using the new endpoint, user name, and password values.

Note: It’s out of the scope for this blog to cover how to configure individual applications. Consult your application documentation to see how to do it for your specific application.

Command Line Access

There may be times when you need to work on your database using command line tools. You cannot connect directly to your Lightsail database instance. But, you can access the database remotely from another Lightsail instance.

You can also make your instance accessible via the public internet, and access it remotely from any internet-connected computer. However, I wouldn’t recommend this from a security perspective.

You first must create a new Lightsail instance to get started accessing your Lightsail database via the command line. I recommend basing your instance on Lightsail’s LAMP blueprint because there are MySQL command line tools already installed.

To create a new LAMP instance, do the following:

  1. From the Lightsail home page, click Create Instance
  2. Make sure you create the instance in the same Region as your Lightsail databaseinstance location image
  3. Under Select a blueprint, choose LAMP (PHP 7)blueprint selection
  4. Since you’re only using this instance to run MySQL command line tools, you can choose the smallest instance size
  5. Give your instance a name
  6. Click Create Instance

It takes a few minutes for your new instance to start up.

To check that everything is working correctly, use the MySQL command line interface.

Make sure you have the database user name, password, and endpoint. These can be found by clicking on the name of your database under the Connection details section.

  1. Use either your own SSH client or the built-sin web client to access the Lightsail instance you just created
  2. On the command line, enter the following command substituting the values for your database
mysql \
--host <lightsail database endpoint> \
--user <lightsail database username> \
--password

For example:

mysql \
--host ls-randomchars.us-east-2.rds.amazonaws.com \
--user dbmasteruser \
--password

Notice that you don’t actually put the password on the command line.

3. When prompted enter the password (note that the password will not show up when you enter it)

4. You should now be at the MySQL command prompt

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 87482
Server version: 5.7.26-log Source distribution

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

From here, you can use the command line as you normally would.

Migrating From a Managed Database to Amazon RDS

One of the great things about Lightsail is that it’s easy to get started quickly. It also gives you an easy migration path to more advanced AWS services, should you ever need them. For instance, you might se tup your database on Lightsail, and then realize that it could benefit from read replicas to handle growing traffic. Fortunately, it’s a pretty straightforward process to migrate your data from Lightsail to RDS.

 

Deploy an Amazon RDS database

First, make sure you have an RDS database running the same engine in the same Region as your Lightsail instance, and in your default Amazon VPC. For example, if your Lightsail database is running MySQL in the Oregon Region, RDS should also be running MySQL in the Oregon Region and in the default VPC. If you’re not sure how to create an RDS database, check out their documentation.

Make sure to note the username and password for your new database.

Create a Lightsail Instance

You also need a Lightsail instance with the MySQL command line tools installed. You can set one up by following the instructions in the previous section of this blog.

Enable VPC Peering

To get started, ensure that the Lightsail VPC can communicate. You do this by enabling VPC peering in Lightsail, and modifying the security group for RDS to allow traffic from the Lightsail VPC.

  1. Return to the Lightsail console home page and click Account in the top-right corner. Choose Account from the pop out menu.
  2. Click Advanced on the horizontal menu
  3. Under VPC peering, ensure that the Enable VPC peering box is checked for the region where your database is deployed.
    enable vpc peering screenshot

Adjust the RDS database security group

The next step is to edit the security group for the RDS instance to allow traffic from the Lightsail subnet.

  1. Return to the RDS console home page
  2. Under Resourcesclick on DB Instances
  3. Click on the name of the database you want to migrate data into
  4. Under Connectivity and securityclick on the security group nameconnectivity and security configuration

The security group dialog appears. From here you can add an entry for the Lightsail subnet.

  1. Click the Inbound tab near the bottom of the screen
  2. Click the Edit button
  3. Click Add rule in the pop-up box
  4. From the Type drop-down choose MySQL/Aurora
  5. In the source box, enter 172.26.0.0/16 (this is the CIDR address for the Lightsail subnet)inbound rules
  6. Click Save

Migrate the data from the Lightsail Database to RDS

Now that Lightsail resources can talk with your RDS database, you can do the actual migration.

The initial step is to use mysqldump to export your database information into a file that can be imported into RDS. mysqldump has many options. In this case, you export a database named tasks. Choose the appropriate database for your use case, as well as any other options that make sense.

  1. Use either your own SSH client or the built-in web client to access the Lightsail instance you just created.
  2. Use the following mysqldump command to create a backup of your database to a text file (dump.sql). Substitute the connection values for your Lightsail database. These values  are on the details page of your database under Connection details. The database name must be specific to your environment.
mysqldump \
--host <lightsail database endpoint> \
--user <lightsail database username> \
--databases <database name> \
--password \
> dump.sql

For example:

mysqldump \
--host ls-randomchars.us-west-2.rds.amazonaws.com \
--user dbmasteruser \
--databases tasks \
--password \
--set-gtid-purged=OFF \
> dump.sql

Now that you have a database backup, you can import that into your RDS instance. You need the connection details from your RDS database. Use the username and password from when you created the database. You can find the endpoint on the details page of your database under Connectivity and security (See the following screenshot for an example).

endpoint and port for connectivity and security

If you are not already, return to the terminal session for the Lightsail instance that has the MySQL tools installed.

To import the data into the RDS database you must provide the contents of the dump.sql file to the mysql command line, too. The cat command lists out the file, and by using | (referred to as a pipe) we can send the output directly from that command into mysql.

cat dump.sql | \
mysql \
--host <RDS database endpoint> \
--user <RDS user> \
--password

For example:

cat dump.sql | \
mysql \
--host database.randomchars.us-west-2.rds.amazonaws.com \
--user dbmasteruser \
--password

You can also use the mysql command to see if the database was created (this is similar to what we did when we passed in the file in the previous step. Instead, this time we’re using echo to pipe in the command show databases;)

echo "show databases;| \
mysql \
--host <RDS database endpoint> \
--user <RDS user> \
--password

For example:

echo "show databases;" | \
mysql \
--host database.randomchars.us-west-2.rds.amazonaws.com \
--user dbmasteruser \
--password

From here, you reconfigure your application to access your new RDS database.

Conclusion

In this post I reviewed some common tasks that you might want to do once you created your Amazon Lightsail database. You learned how to scale up the size of your database, how to access it with command line tools, and how to migrate to RDS.

If you’ve not yet deployed a Managed Database on Lightsail why not head over to the Lightsail console and create one now. If you need a bit of guidance to get started, we have a workshop at https://lightsailworkshop.com that will show you how to use Lightsail to deploy a two-tier web application using a MySQL database backend. Please feel free to leave comments and questions for future blog posts.