Tag Archives: Permissions

How to Patch Linux Workloads on AWS

Post Syndicated from Koen van Blijderveen original https://aws.amazon.com/blogs/security/how-to-patch-linux-workloads-on-aws/

Most malware tries to compromise your systems by using a known vulnerability that the operating system maker has already patched. As best practices to help prevent malware from affecting your systems, you should apply all operating system patches and actively monitor your systems for missing patches.

In this blog post, I show you how to patch Linux workloads using AWS Systems Manager. To accomplish this, I will show you how to use the AWS Command Line Interface (AWS CLI) to:

  1. Launch an Amazon EC2 instance for use with Systems Manager.
  2. Configure Systems Manager to patch your Amazon EC2 Linux instances.

In two previous blog posts (Part 1 and Part 2), I showed how to use the AWS Management Console to perform the necessary steps to patch, inspect, and protect Microsoft Windows workloads. You can implement those same processes for your Linux instances running in AWS by changing the instance tags and types shown in the previous blog posts.

Because most Linux system administrators are more familiar with using a command line, I show how to patch Linux workloads by using the AWS CLI in this blog post. The steps to use the Amazon EBS Snapshot Scheduler and Amazon Inspector are identical for both Microsoft Windows and Linux.

What you should know first

To follow along with the solution in this post, you need one or more Amazon EC2 instances. You may use existing instances or create new instances. For this post, I assume this is an Amazon EC2 for Amazon Linux instance installed from Amazon Machine Images (AMIs).

Systems Manager is a collection of capabilities that helps you automate management tasks for AWS-hosted instances on Amazon EC2 and your on-premises servers. In this post, I use Systems Manager for two purposes: to run remote commands and apply operating system patches. To learn about the full capabilities of Systems Manager, see What Is AWS Systems Manager?

As of Amazon Linux 2017.09, the AMI comes preinstalled with the Systems Manager agent. Systems Manager Patch Manager also supports Red Hat and Ubuntu. To install the agent on these Linux distributions or an older version of Amazon Linux, see Installing and Configuring SSM Agent on Linux Instances.

If you are not familiar with how to launch an Amazon EC2 instance, see Launching an Instance. I also assume you launched or will launch your instance in a private subnet. You must make sure that the Amazon EC2 instance can connect to the internet using a network address translation (NAT) instance or NAT gateway to communicate with Systems Manager. The following diagram shows how you should structure your VPC.

Diagram showing how to structure your VPC

Later in this post, you will assign tasks to a maintenance window to patch your instances with Systems Manager. To do this, the IAM user you are using for this post must have the iam:PassRole permission. This permission allows the IAM user assigning tasks to pass his own IAM permissions to the AWS service. In this example, when you assign a task to a maintenance window, IAM passes your credentials to Systems Manager. You also should authorize your IAM user to use Amazon EC2 and Systems Manager. As mentioned before, you will be using the AWS CLI for most of the steps in this blog post. Our documentation shows you how to get started with the AWS CLI. Make sure you have the AWS CLI installed and configured with an AWS access key and secret access key that belong to an IAM user that have the following AWS managed policies attached to the IAM user you are using for this example: AmazonEC2FullAccess and AmazonSSMFullAccess.

Step 1: Launch an Amazon EC2 Linux instance

In this section, I show you how to launch an Amazon EC2 instance so that you can use Systems Manager with the instance. This step requires you to do three things:

  1. Create an IAM role for Systems Manager before launching your Amazon EC2 instance.
  2. Launch your Amazon EC2 instance with Amazon EBS and the IAM role for Systems Manager.
  3. Add tags to the instances so that you can add your instances to a Systems Manager maintenance window based on tags.

A. Create an IAM role for Systems Manager

Before launching an Amazon EC2 instance, I recommend that you first create an IAM role for Systems Manager, which you will use to update the Amazon EC2 instance. AWS already provides a preconfigured policy that you can use for the new role and it is called AmazonEC2RoleforSSM.

  1. Create a JSON file named trustpolicy-ec2ssm.json that contains the following trust policy. This policy describes which principal (an entity that can take action on an AWS resource) is allowed to assume the role we are going to create. In this example, the principal is the Amazon EC2 service.
    {
      "Version": "2012-10-17",
      "Statement": {
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
      }
    }

  1. Use the following command to create a role named EC2SSM that has the AWS managed policy AmazonEC2RoleforSSM attached to it. This generates JSON-based output that describes the role and its parameters, if the command is successful.
    $ aws iam create-role --role-name EC2SSM --assume-role-policy-document file://trustpolicy-ec2ssm.json

  1. Use the following command to attach the AWS managed IAM policy (AmazonEC2RoleforSSM) to your newly created role.
    $ aws iam attach-role-policy --role-name EC2SSM --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM

  1. Use the following commands to create the IAM instance profile and add the role to the instance profile. The instance profile is needed to attach the role we created earlier to your Amazon EC2 instance.
    $ aws iam create-instance-profile --instance-profile-name EC2SSM-IP
    $ aws iam add-role-to-instance-profile --instance-profile-name EC2SSM-IP --role-name EC2SSM

B. Launch your Amazon EC2 instance

To follow along, you need an Amazon EC2 instance that is running Amazon Linux. You can use any existing instance you may have or create a new instance.

When launching a new Amazon EC2 instance, be sure that:

  1. Use the following command to launch a new Amazon EC2 instance using an Amazon Linux AMI available in the US East (N. Virginia) Region (also known as us-east-1). Replace YourKeyPair and YourSubnetId with your information. For more information about creating a key pair, see the create-key-pair documentation. Write down the InstanceId that is in the output because you will need it later in this post.
    $ aws ec2 run-instances --image-id ami-cb9ec1b1 --instance-type t2.micro --key-name YourKeyPair --subnet-id YourSubnetId --iam-instance-profile Name=EC2SSM-IP

  1. If you are using an existing Amazon EC2 instance, you can use the following command to attach the instance profile you created earlier to your instance.
    $ aws ec2 associate-iam-instance-profile --instance-id YourInstanceId --iam-instance-profile Name=EC2SSM-IP

C. Add tags

The final step of configuring your Amazon EC2 instances is to add tags. You will use these tags to configure Systems Manager in Step 2 of this post. For this example, I add a tag named Patch Group and set the value to Linux Servers. I could have other groups of Amazon EC2 instances that I treat differently by having the same tag name but a different tag value. For example, I might have a collection of other servers with the tag name Patch Group with a value of Web Servers.

  • Use the following command to add the Patch Group tag to your Amazon EC2 instance.
    $ aws ec2 create-tags --resources YourInstanceId --tags --tags Key="Patch Group",Value="Linux Servers"

Note: You must wait a few minutes until the Amazon EC2 instance is available before you can proceed to the next section. To make sure your Amazon EC2 instance is online and ready, you can use the following AWS CLI command:

$ aws ec2 describe-instance-status --instance-ids YourInstanceId

At this point, you now have at least one Amazon EC2 instance you can use to configure Systems Manager.

Step 2: Configure Systems Manager

In this section, I show you how to configure and use Systems Manager to apply operating system patches to your Amazon EC2 instances, and how to manage patch compliance.

To start, I provide some background information about Systems Manager. Then, I cover how to:

  1. Create the Systems Manager IAM role so that Systems Manager is able to perform patch operations.
  2. Create a Systems Manager patch baseline and associate it with your instance to define which patches Systems Manager should apply.
  3. Define a maintenance window to make sure Systems Manager patches your instance when you tell it to.
  4. Monitor patch compliance to verify the patch state of your instances.

You must meet two prerequisites to use Systems Manager to apply operating system patches. First, you must attach the IAM role you created in the previous section, EC2SSM, to your Amazon EC2 instance. Second, you must install the Systems Manager agent on your Amazon EC2 instance. If you have used a recent Amazon Linux AMI, Amazon has already installed the Systems Manager agent on your Amazon EC2 instance. You can confirm this by logging in to an Amazon EC2 instance and checking the Systems Manager agent log files that are located at /var/log/amazon/ssm/.

To install the Systems Manager agent on an instance that does not have the agent preinstalled or if you want to use the Systems Manager agent on your on-premises servers, see Installing and Configuring the Systems Manager Agent on Linux Instances. If you forgot to attach the newly created role when launching your Amazon EC2 instance or if you want to attach the role to already running Amazon EC2 instances, see Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI or use the AWS Management Console.

A. Create the Systems Manager IAM role

For a maintenance window to be able to run any tasks, you must create a new role for Systems Manager. This role is a different kind of role than the one you created earlier: this role will be used by Systems Manager instead of Amazon EC2. Earlier, you created the role, EC2SSM, with the policy, AmazonEC2RoleforSSM, which allowed the Systems Manager agent on your instance to communicate with Systems Manager. In this section, you need a new role with the policy, AmazonSSMMaintenanceWindowRole, so that the Systems Manager service can execute commands on your instance.

To create the new IAM role for Systems Manager:

  1. Create a JSON file named trustpolicy-maintenancewindowrole.json that contains the following trust policy. This policy describes which principal is allowed to assume the role you are going to create. This trust policy allows not only Amazon EC2 to assume this role, but also Systems Manager.
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Sid":"",
             "Effect":"Allow",
             "Principal":{
                "Service":[
                   "ec2.amazonaws.com",
                   "ssm.amazonaws.com"
               ]
             },
             "Action":"sts:AssumeRole"
          }
       ]
    }

  1. Use the following command to create a role named MaintenanceWindowRole that has the AWS managed policy, AmazonSSMMaintenanceWindowRole, attached to it. This command generates JSON-based output that describes the role and its parameters, if the command is successful.
    $ aws iam create-role --role-name MaintenanceWindowRole --assume-role-policy-document file://trustpolicy-maintenancewindowrole.json

  1. Use the following command to attach the AWS managed IAM policy (AmazonEC2RoleforSSM) to your newly created role.
    $ aws iam attach-role-policy --role-name MaintenanceWindowRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSSMMaintenanceWindowRole

B. Create a Systems Manager patch baseline and associate it with your instance

Next, you will create a Systems Manager patch baseline and associate it with your Amazon EC2 instance. A patch baseline defines which patches Systems Manager should apply to your instance. Before you can associate the patch baseline with your instance, though, you must determine if Systems Manager recognizes your Amazon EC2 instance. Use the following command to list all instances managed by Systems Manager. The --filters option ensures you look only for your newly created Amazon EC2 instance.

$ aws ssm describe-instance-information --filters Key=InstanceIds,Values= YourInstanceId

{
    "InstanceInformationList": [
        {
            "IsLatestVersion": true,
            "ComputerName": "ip-10-50-2-245",
            "PingStatus": "Online",
            "InstanceId": "YourInstanceId",
            "IPAddress": "10.50.2.245",
            "ResourceType": "EC2Instance",
            "AgentVersion": "2.2.120.0",
            "PlatformVersion": "2017.09",
            "PlatformName": "Amazon Linux AMI",
            "PlatformType": "Linux",
            "LastPingDateTime": 1515759143.826
        }
    ]
}

If your instance is missing from the list, verify that:

  1. Your instance is running.
  2. You attached the Systems Manager IAM role, EC2SSM.
  3. You deployed a NAT gateway in your public subnet to ensure your VPC reflects the diagram shown earlier in this post so that the Systems Manager agent can connect to the Systems Manager internet endpoint.
  4. The Systems Manager agent logs don’t include any unaddressed errors.

Now that you have checked that Systems Manager can manage your Amazon EC2 instance, it is time to create a patch baseline. With a patch baseline, you define which patches are approved to be installed on all Amazon EC2 instances associated with the patch baseline. The Patch Group resource tag you defined earlier will determine to which patch group an instance belongs. If you do not specifically define a patch baseline, the default AWS-managed patch baseline is used.

To create a patch baseline:

  1. Use the following command to create a patch baseline named AmazonLinuxServers. With approval rules, you can determine the approved patches that will be included in your patch baseline. In this example, you add all Critical severity patches to the patch baseline as soon as they are released, by setting the Auto approval delay to 0 days. By setting the Auto approval delay to 2 days, you add to this patch baseline the Important, Medium, and Low severity patches two days after they are released.
    $ aws ssm create-patch-baseline --name "AmazonLinuxServers" --description "Baseline containing all updates for Amazon Linux" --operating-system AMAZON_LINUX --approval-rules "PatchRules=[{PatchFilterGroup={PatchFilters=[{Values=[Critical],Key=SEVERITY}]},ApproveAfterDays=0,ComplianceLevel=CRITICAL},{PatchFilterGroup={PatchFilters=[{Values=[Important,Medium,Low],Key=SEVERITY}]},ApproveAfterDays=2,ComplianceLevel=HIGH}]"
    
    {
        "BaselineId": "YourBaselineId"
    }

  1. Use the following command to register the patch baseline you created with your instance. To do so, you use the Patch Group tag that you added to your Amazon EC2 instance.
    $ aws ssm register-patch-baseline-for-patch-group --baseline-id YourPatchBaselineId --patch-group "Linux Servers"
    
    {
        "PatchGroup": "Linux Servers",
        "BaselineId": "YourBaselineId"
    }

C.  Define a maintenance window

Now that you have successfully set up a role, created a patch baseline, and registered your Amazon EC2 instance with your patch baseline, you will define a maintenance window so that you can control when your Amazon EC2 instances will receive patches. By creating multiple maintenance windows and assigning them to different patch groups, you can make sure your Amazon EC2 instances do not all reboot at the same time.

To define a maintenance window:

  1. Use the following command to define a maintenance window. In this example command, the maintenance window will start every Saturday at 10:00 P.M. UTC. It will have a duration of 4 hours and will not start any new tasks 1 hour before the end of the maintenance window.
    $ aws ssm create-maintenance-window --name SaturdayNight --schedule "cron(0 0 22 ? * SAT *)" --duration 4 --cutoff 1 --allow-unassociated-targets
    
    {
        "WindowId": "YourMaintenanceWindowId"
    }

For more information about defining a cron-based schedule for maintenance windows, see Cron and Rate Expressions for Maintenance Windows.

  1. After defining the maintenance window, you must register the Amazon EC2 instance with the maintenance window so that Systems Manager knows which Amazon EC2 instance it should patch in this maintenance window. You can register the instance by using the same Patch Group tag you used to associate the Amazon EC2 instance with the AWS-provided patch baseline, as shown in the following command.
    $ aws ssm register-target-with-maintenance-window --window-id YourMaintenanceWindowId --resource-type INSTANCE --targets "Key=tag:Patch Group,Values=Linux Servers"
    
    {
        "WindowTargetId": "YourWindowTargetId"
    }

  1. Assign a task to the maintenance window that will install the operating system patches on your Amazon EC2 instance. The following command includes the following options.
    1. name is the name of your task and is optional. I named mine Patching.
    2. task-arn is the name of the task document you want to run.
    3. max-concurrency allows you to specify how many of your Amazon EC2 instances Systems Manager should patch at the same time. max-errors determines when Systems Manager should abort the task. For patching, this number should not be too low, because you do not want your entire patch task to stop on all instances if one instance fails. You can set this, for example, to 20%.
    4. service-role-arn is the Amazon Resource Name (ARN) of the AmazonSSMMaintenanceWindowRole role you created earlier in this blog post.
    5. task-invocation-parameters defines the parameters that are specific to the AWS-RunPatchBaseline task document and tells Systems Manager that you want to install patches with a timeout of 600 seconds (10 minutes).
      $ aws ssm register-task-with-maintenance-window --name "Patching" --window-id "YourMaintenanceWindowId" --targets "Key=WindowTargetIds,Values=YourWindowTargetId" --task-arn AWS-RunPatchBaseline --service-role-arn "arn:aws:iam::123456789012:role/MaintenanceWindowRole" --task-type "RUN_COMMAND" --task-invocation-parameters "RunCommand={Comment=,TimeoutSeconds=600,Parameters={SnapshotId=[''],Operation=[Install]}}" --max-concurrency "500" --max-errors "20%"
      
      {
          "WindowTaskId": "YourWindowTaskId"
      }

Now, you must wait for the maintenance window to run at least once according to the schedule you defined earlier. If your maintenance window has expired, you can check the status of any maintenance tasks Systems Manager has performed by using the following command.

$ aws ssm describe-maintenance-window-executions --window-id "YourMaintenanceWindowId"

{
    "WindowExecutions": [
        {
            "Status": "SUCCESS",
            "WindowId": "YourMaintenanceWindowId",
            "WindowExecutionId": "b594984b-430e-4ffa-a44c-a2e171de9dd3",
            "EndTime": 1515766467.487,
            "StartTime": 1515766457.691
        }
    ]
}

D.  Monitor patch compliance

You also can see the overall patch compliance of all Amazon EC2 instances using the following command in the AWS CLI.

$ aws ssm list-compliance-summaries

This command shows you the number of instances that are compliant with each category and the number of instances that are not in JSON format.

You also can see overall patch compliance by choosing Compliance under Insights in the navigation pane of the Systems Manager console. You will see a visual representation of how many Amazon EC2 instances are up to date, how many Amazon EC2 instances are noncompliant, and how many Amazon EC2 instances are compliant in relation to the earlier defined patch baseline.

Screenshot of the Compliance page of the Systems Manager console

In this section, you have set everything up for patch management on your instance. Now you know how to patch your Amazon EC2 instance in a controlled manner and how to check if your Amazon EC2 instance is compliant with the patch baseline you have defined. Of course, I recommend that you apply these steps to all Amazon EC2 instances you manage.

Summary

In this blog post, I showed how to use Systems Manager to create a patch baseline and maintenance window to keep your Amazon EC2 Linux instances up to date with the latest security patches. Remember that by creating multiple maintenance windows and assigning them to different patch groups, you can make sure your Amazon EC2 instances do not all reboot at the same time.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing any part of this solution, start a new thread on the Amazon EC2 forum or contact AWS Support.

– Koen

Troubleshooting event publishing issues in Amazon SES

Post Syndicated from Dustin Taylor original https://aws.amazon.com/blogs/ses/troubleshooting-event-publishing-issues-in-amazon-ses/

Over the past year, we’ve released several features that make it easier to track the metrics that are associated with your Amazon SES account. The first of these features, launched in November of last year, was event publishing.

Initially, event publishing let you capture basic metrics related to your email sending and publish them to other AWS services, such as Amazon CloudWatch and Amazon Kinesis Data Firehose. Some examples of these basic metrics include the number of emails that were sent and delivered, as well as the number that bounced or received complaints. A few months ago, we expanded this feature by adding engagement metrics—specifically, information about the number of emails that your customers opened or engaged with by clicking links.

As a former Cloud Support Engineer, I’ve seen Amazon SES customers do some amazing things with event publishing, but I’ve also seen some common issues. In this article, we look at some of these issues, and discuss the steps you can take to resolve them.

Before we begin

This post assumes that your Amazon SES account is already out of the sandbox, that you’ve verified an identity (such as an email address or domain), and that you have the necessary permissions to use Amazon SES and the service that you’ll publish event data to (such as Amazon SNS, CloudWatch, or Kinesis Data Firehose).

We also assume that you’re familiar with the process of creating configuration sets and specifying event destinations for those configuration sets. For more information, see Using Amazon SES Configuration Sets in the Amazon SES Developer Guide.

Amazon SNS event destinations

If you want to receive notifications when events occur—such as when recipients click a link in an email, or when they report an email as spam—you can use Amazon SNS as an event destination.

Occasionally, customers ask us why they’re not receiving notifications when they use an Amazon SNS topic as an event destination. One of the most common reasons for this issue is that they haven’t configured subscriptions for their Amazon SNS topic yet.

A single topic in Amazon SNS can have one or more subscriptions. When you subscribe to a topic, you tell that topic which endpoints (such as email addresses or mobile phone numbers) to contact when it receives a notification. If you haven’t set up any subscriptions, nothing will happen when an email event occurs.

For more information about setting up topics and subscriptions, see Getting Started in the Amazon SNS Developer Guide. For information about publishing Amazon SES events to Amazon SNS topics, see Set Up an Amazon SNS Event Destination for Amazon SES Event Publishing in the Amazon SES Developer Guide.

Kinesis Data Firehose event destinations

If you want to store your Amazon SES event data for the long term, choose Amazon Kinesis Data Firehose as a destination for Amazon SES events. With Kinesis Data Firehose, you can stream data to Amazon S3 or Amazon Redshift for storage and analysis.

The process of setting up Kinesis Data Firehose as an event destination is similar to the process for setting up Amazon SNS: you choose the types of events (such as deliveries, opens, clicks, or bounces) that you want to export, and the name of the Kinesis Data Firehose stream that you want to export to. However, there’s one important difference. When you set up a Kinesis Data Firehose event destination, you must also choose the IAM role that Amazon SES uses to send event data to Kinesis Data Firehose.

When you set up the Kinesis Data Firehose event destination, you can choose to have Amazon SES create the IAM role for you automatically. For many users, this is the best solution—it ensures that the IAM role has the appropriate permissions to move event data from Amazon SES to Kinesis Data Firehose.

Customers occasionally run into issues with the Kinesis Data Firehose event destination when they use an existing IAM role. If you use an existing IAM role, or create a new role for this purpose, make sure that the role includes the firehose:PutRecord and firehose:PutRecordBatch permissions. If the role doesn’t include these permissions, then the Amazon SES event data isn’t published to Kinesis Data Firehose. For more information, see Controlling Access with Amazon Kinesis Data Firehose in the Amazon Kinesis Data Firehose Developer Guide.

CloudWatch event destinations

By publishing your Amazon SES event data to Amazon CloudWatch, you can create dashboards that track your sending statistics in real time, as well as alarms that notify you when your event metrics reach certain thresholds.

The amount that you’re charged for using CloudWatch is based on several factors, including the number of metrics you use. In order to give you more control over the specific metrics you send to CloudWatch—and to help you avoid unexpected charges—you can limit the email sending events that are sent to CloudWatch.

When you choose CloudWatch as an event destination, you must choose a value source. The value source can be one of three options: a message tag, a link tag, or an email header. After you choose a value source, you then specify a name and a value. When you send an email using a configuration set that refers to a CloudWatch event destination, it only sends the metrics for that email to CloudWatch if the email contains the name and value that you specified as the value source. This requirement is commonly overlooked.

For example, assume that you chose Message Tag as the value source, and specified “CategoryId” as the dimension name and “31415” as the dimension value. When you want to send events for an email to CloudWatch, you must specify the name of the configuration set that uses the CloudWatch destination. You must also include a tag in your message. The name of the tag must be “CategoryId” and the value must be “31415”.

For more information about adding tags and email headers to your messages, see Send Email Using Amazon SES Event Publishing in the Amazon SES Developer Guide. For more information about adding tags to links, see Amazon SES Email Sending Metrics FAQs in the Amazon SES Developer Guide.

Troubleshooting event publishing for open and click data

Occasionally, customers ask why they’re not seeing open and click data for their emails. This issue most often occurs when the customer only sends text versions of their emails. Because of the way Amazon SES tracks open and click events, you can only see open and click data for emails that are sent as HTML. For more information about how Amazon SES modifies your emails when you enable open and click tracking, see Amazon SES Email Sending Metrics FAQs in the Amazon SES Developer Guide.

The process that you use to send HTML emails varies based on the email sending method you use. The Code Examples section of the Amazon SES Developer Guide contains examples of several methods of sending email by using the Amazon SES SMTP interface or an AWS SDK. All of the examples in this section include methods for sending HTML (as well as text-only) emails.

If you encounter any issues that weren’t covered in this post, please open a case in the Support Center and we’d be more than happy to assist.

Sharing Secrets with AWS Lambda Using AWS Systems Manager Parameter Store

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/sharing-secrets-with-aws-lambda-using-aws-systems-manager-parameter-store/

This post courtesy of Roberto Iturralde, Sr. Application Developer- AWS Professional Services

Application architects are faced with key decisions throughout the process of designing and implementing their systems. One decision common to nearly all solutions is how to manage the storage and access rights of application configuration. Shared configuration should be stored centrally and securely with each system component having access only to the properties that it needs for functioning.

With AWS Systems Manager Parameter Store, developers have access to central, secure, durable, and highly available storage for application configuration and secrets. Parameter Store also integrates with AWS Identity and Access Management (IAM), allowing fine-grained access control to individual parameters or branches of a hierarchical tree.

This post demonstrates how to create and access shared configurations in Parameter Store from AWS Lambda. Both encrypted and plaintext parameter values are stored with only the Lambda function having permissions to decrypt the secrets. You also use AWS X-Ray to profile the function.

Solution overview

This example is made up of the following components:

  • An AWS SAM template that defines:
    • A Lambda function and its permissions
    • An unencrypted Parameter Store parameter that the Lambda function loads
    • A KMS key that only the Lambda function can access. You use this key to create an encrypted parameter later.
  • Lambda function code in Python 3.6 that demonstrates how to load values from Parameter Store at function initialization for reuse across invocations.

Launch the AWS SAM template

To create the resources shown in this post, you can download the SAM template or choose the button to launch the stack. The template requires one parameter, an IAM user name, which is the name of the IAM user to be the admin of the KMS key that you create. In order to perform the steps listed in this post, this IAM user will need permissions to execute Lambda functions, create Parameter Store parameters, administer keys in KMS, and view the X-Ray console. If you have these privileges in your IAM user account you can use your own account to complete the walkthrough. You can not use the root user to administer the KMS keys.

SAM template resources

The following sections show the code for the resources defined in the template.
Lambda function

ParameterStoreBlogFunctionDev:
    Type: 'AWS::Serverless::Function'
    Properties:
      FunctionName: 'ParameterStoreBlogFunctionDev'
      Description: 'Integrating lambda with Parameter Store'
      Handler: 'lambda_function.lambda_handler'
      Role: !GetAtt ParameterStoreBlogFunctionRoleDev.Arn
      CodeUri: './code'
      Environment:
        Variables:
          ENV: 'dev'
          APP_CONFIG_PATH: 'parameterStoreBlog'
          AWS_XRAY_TRACING_NAME: 'ParameterStoreBlogFunctionDev'
      Runtime: 'python3.6'
      Timeout: 5
      Tracing: 'Active'

  ParameterStoreBlogFunctionRoleDev:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          -
            Effect: Allow
            Principal:
              Service:
                - 'lambda.amazonaws.com'
            Action:
              - 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole'
      Policies:
        -
          PolicyName: 'ParameterStoreBlogDevParameterAccess'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              -
                Effect: Allow
                Action:
                  - 'ssm:GetParameter*'
                Resource: !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/dev/parameterStoreBlog*'
        -
          PolicyName: 'ParameterStoreBlogDevXRayAccess'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              -
                Effect: Allow
                Action:
                  - 'xray:PutTraceSegments'
                  - 'xray:PutTelemetryRecords'
                Resource: '*'

In this YAML code, you define a Lambda function named ParameterStoreBlogFunctionDev using the SAM AWS::Serverless::Function type. The environment variables for this function include the ENV (dev) and the APP_CONFIG_PATH where you find the configuration for this app in Parameter Store. X-Ray tracing is also enabled for profiling later.

The IAM role for this function extends the AWSLambdaBasicExecutionRole by adding IAM policies that grant the function permissions to write to X-Ray and get parameters from Parameter Store, limited to paths under /dev/parameterStoreBlog*.
Parameter Store parameter

SimpleParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: '/dev/parameterStoreBlog/appConfig'
      Description: 'Sample dev config values for my app'
      Type: String
      Value: '{"key1": "value1","key2": "value2","key3": "value3"}'

This YAML code creates a plaintext string parameter in Parameter Store in a path that your Lambda function can access.
KMS encryption key

ParameterStoreBlogDevEncryptionKeyAlias:
    Type: AWS::KMS::Alias
    Properties:
      AliasName: 'alias/ParameterStoreBlogKeyDev'
      TargetKeyId: !Ref ParameterStoreBlogDevEncryptionKey

  ParameterStoreBlogDevEncryptionKey:
    Type: AWS::KMS::Key
    Properties:
      Description: 'Encryption key for secret config values for the Parameter Store blog post'
      Enabled: True
      EnableKeyRotation: False
      KeyPolicy:
        Version: '2012-10-17'
        Id: 'key-default-1'
        Statement:
          -
            Sid: 'Allow administration of the key & encryption of new values'
            Effect: Allow
            Principal:
              AWS:
                - !Sub 'arn:aws:iam::${AWS::AccountId}:user/${IAMUsername}'
            Action:
              - 'kms:Create*'
              - 'kms:Encrypt'
              - 'kms:Describe*'
              - 'kms:Enable*'
              - 'kms:List*'
              - 'kms:Put*'
              - 'kms:Update*'
              - 'kms:Revoke*'
              - 'kms:Disable*'
              - 'kms:Get*'
              - 'kms:Delete*'
              - 'kms:ScheduleKeyDeletion'
              - 'kms:CancelKeyDeletion'
            Resource: '*'
          -
            Sid: 'Allow use of the key'
            Effect: Allow
            Principal:
              AWS: !GetAtt ParameterStoreBlogFunctionRoleDev.Arn
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'

This YAML code creates an encryption key with a key policy with two statements.

The first statement allows a given user (${IAMUsername}) to administer the key. Importantly, this includes the ability to encrypt values using this key and disable or delete this key, but does not allow the administrator to decrypt values that were encrypted with this key.

The second statement grants your Lambda function permission to encrypt and decrypt values using this key. The alias for this key in KMS is ParameterStoreBlogKeyDev, which is how you reference it later.

Lambda function

Here I walk you through the Lambda function code.

import os, traceback, json, configparser, boto3
from aws_xray_sdk.core import patch_all
patch_all()

# Initialize boto3 client at global scope for connection reuse
client = boto3.client('ssm')
env = os.environ['ENV']
app_config_path = os.environ['APP_CONFIG_PATH']
full_config_path = '/' + env + '/' + app_config_path
# Initialize app at global scope for reuse across invocations
app = None

class MyApp:
    def __init__(self, config):
        """
        Construct new MyApp with configuration
        :param config: application configuration
        """
        self.config = config

    def get_config(self):
        return self.config

def load_config(ssm_parameter_path):
    """
    Load configparser from config stored in SSM Parameter Store
    :param ssm_parameter_path: Path to app config in SSM Parameter Store
    :return: ConfigParser holding loaded config
    """
    configuration = configparser.ConfigParser()
    try:
        # Get all parameters for this app
        param_details = client.get_parameters_by_path(
            Path=ssm_parameter_path,
            Recursive=False,
            WithDecryption=True
        )

        # Loop through the returned parameters and populate the ConfigParser
        if 'Parameters' in param_details and len(param_details.get('Parameters')) > 0:
            for param in param_details.get('Parameters'):
                param_path_array = param.get('Name').split("/")
                section_position = len(param_path_array) - 1
                section_name = param_path_array[section_position]
                config_values = json.loads(param.get('Value'))
                config_dict = {section_name: config_values}
                print("Found configuration: " + str(config_dict))
                configuration.read_dict(config_dict)

    except:
        print("Encountered an error loading config from SSM.")
        traceback.print_exc()
    finally:
        return configuration

def lambda_handler(event, context):
    global app
    # Initialize app if it doesn't yet exist
    if app is None:
        print("Loading config and creating new MyApp...")
        config = load_config(full_config_path)
        app = MyApp(config)

    return "MyApp config is " + str(app.get_config()._sections)

Beneath the import statements, you import the patch_all function from the AWS X-Ray library, which you use to patch boto3 to create X-Ray segments for all your boto3 operations.

Next, you create a boto3 SSM client at the global scope for reuse across function invocations, following Lambda best practices. Using the function environment variables, you assemble the path where you expect to find your configuration in Parameter Store. The class MyApp is meant to serve as an example of an application that would need its configuration injected at construction. In this example, you create an instance of ConfigParser, a class in Python’s standard library for handling basic configurations, to give to MyApp.

The load_config function loads the all the parameters from Parameter Store at the level immediately beneath the path provided in the Lambda function environment variables. Each parameter found is put into a new section in ConfigParser. The name of the section is the name of the parameter, less the base path. In this example, the full parameter name is /dev/parameterStoreBlog/appConfig, which is put in a section named appConfig.

Finally, the lambda_handler function initializes an instance of MyApp if it doesn’t already exist, constructing it with the loaded configuration from Parameter Store. Then it simply returns the currently loaded configuration in MyApp. The impact of this design is that the configuration is only loaded from Parameter Store the first time that the Lambda function execution environment is initialized. Subsequent invocations reuse the existing instance of MyApp, resulting in improved performance. You see this in the X-Ray traces later in this post. For more advanced use cases where configuration changes need to be received immediately, you could implement an expiry policy for your configuration entries or push notifications to your function.

To confirm that everything was created successfully, test the function in the Lambda console.

  1. Open the Lambda console.
  2. In the navigation pane, choose Functions.
  3. In the Functions pane, filter to ParameterStoreBlogFunctionDev to find the function created by the SAM template earlier. Open the function name to view its details.
  4. On the top right of the function detail page, choose Test. You may need to create a new test event. The input JSON doesn’t matter as this function ignores the input.

After running the test, you should see output similar to the following. This demonstrates that the function successfully fetched the unencrypted configuration from Parameter Store.

Create an encrypted parameter

You currently have a simple, unencrypted parameter and a Lambda function that can access it.

Next, you create an encrypted parameter that only your Lambda function has permission to use for decryption. This limits read access for this parameter to only this Lambda function.

To follow along with this section, deploy the SAM template for this post in your account and make your IAM user name the KMS key admin mentioned earlier.

  1. In the Systems Manager console, under Shared Resources, choose Parameter Store.
  2. Choose Create Parameter.
    • For Name, enter /dev/parameterStoreBlog/appSecrets.
    • For Type, select Secure String.
    • For KMS Key ID, choose alias/ParameterStoreBlogKeyDev, which is the key that your SAM template created.
    • For Value, enter {"secretKey": "secretValue"}.
    • Choose Create Parameter.
  3. If you now try to view the value of this parameter by choosing the name of the parameter in the parameters list and then choosing Show next to the Value field, you won’t see the value appear. This is because, even though you have permission to encrypt values using this KMS key, you do not have permissions to decrypt values.
  4. In the Lambda console, run another test of your function. You now also see the secret parameter that you created and its decrypted value.

If you do not see the new parameter in the Lambda output, this may be because the Lambda execution environment is still warm from the previous test. Because the parameters are loaded at Lambda startup, you need a fresh execution environment to refresh the values.

Adjust the function timeout to a different value in the Advanced Settings at the bottom of the Lambda Configuration tab. Choose Save and test to trigger the creation of a new Lambda execution environment.

Profiling the impact of querying Parameter Store using AWS X-Ray

By using the AWS X-Ray SDK to patch boto3 in your Lambda function code, each invocation of the function creates traces in X-Ray. In this example, you can use these traces to validate the performance impact of your design decision to only load configuration from Parameter Store on the first invocation of the function in a new execution environment.

From the Lambda function details page where you tested the function earlier, under the function name, choose Monitoring. Choose View traces in X-Ray.

This opens the X-Ray console in a new window filtered to your function. Be aware of the time range field next to the search bar if you don’t see any search results.
In this screenshot, I’ve invoked the Lambda function twice, one time 10.3 minutes ago with a response time of 1.1 seconds and again 9.8 minutes ago with a response time of 8 milliseconds.

Looking at the details of the longer running trace by clicking the trace ID, you can see that the Lambda function spent the first ~350 ms of the full 1.1 sec routing the request through Lambda and creating a new execution environment for this function, as this was the first invocation with this code. This is the portion of time before the initialization subsegment.

Next, it took 725 ms to initialize the function, which includes executing the code at the global scope (including creating the boto3 client). This is also a one-time cost for a fresh execution environment.

Finally, the function executed for 65 ms, of which 63.5 ms was the GetParametersByPath call to Parameter Store.

Looking at the trace for the second, much faster function invocation, you see that the majority of the 8 ms execution time was Lambda routing the request to the function and returning the response. Only 1 ms of the overall execution time was attributed to the execution of the function, which makes sense given that after the first invocation you’re simply returning the config stored in MyApp.

While the Traces screen allows you to view the details of individual traces, the X-Ray Service Map screen allows you to view aggregate performance data for all traced services over a period of time.

In the X-Ray console navigation pane, choose Service map. Selecting a service node shows the metrics for node-specific requests. Selecting an edge between two nodes shows the metrics for requests that traveled that connection. Again, be aware of the time range field next to the search bar if you don’t see any search results.

After invoking your Lambda function several more times by testing it from the Lambda console, you can view some aggregate performance metrics. Look at the following:

  • From the client perspective, requests to the Lambda service for the function are taking an average of 50 ms to respond. The function is generating ~1 trace per minute.
  • The function itself is responding in an average of 3 ms. In the following screenshot, I’ve clicked on this node, which reveals a latency histogram of the traced requests showing that over 95% of requests return in under 5 ms.
  • Parameter Store is responding to requests in an average of 64 ms, but note the much lower trace rate in the node. This is because you only fetch data from Parameter Store on the initialization of the Lambda execution environment.

Conclusion

Deduplication, encryption, and restricted access to shared configuration and secrets is a key component to any mature architecture. Serverless architectures designed using event-driven, on-demand, compute services like Lambda are no different.

In this post, I walked you through a sample application accessing unencrypted and encrypted values in Parameter Store. These values were created in a hierarchy by application environment and component name, with the permissions to decrypt secret values restricted to only the function needing access. The techniques used here can become the foundation of secure, robust configuration management in your enterprise serverless applications.

timeShift(GrafanaBuzz, 1w) Issue 33

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/02/09/timeshiftgrafanabuzz-1w-issue-33/

Welcome to TimeShift The big news this week of course, was the release of Grafana 5.0 beta. This is the largest Grafana release since the project’s launch and we’re extremely excited to receive feedback from the community. Some of the bigger features include teams, dashboard folders, and permissions, but we tried to look at even little things like cleaning up icons, taking a look at the default graph color palette, and reworking the light theme.

Migrating Your Amazon ECS Containers to AWS Fargate

Post Syndicated from Tiffany Jernigan original https://aws.amazon.com/blogs/compute/migrating-your-amazon-ecs-containers-to-aws-fargate/

AWS Fargate is a new technology that works with Amazon Elastic Container Service (ECS) to run containers without having to manage servers or clusters. What does this mean? With Fargate, you no longer need to provision or manage a single virtual machine; you can just create tasks and run them directly!

Fargate uses the same API actions as ECS, so you can use the ECS console, the AWS CLI, or the ECS CLI. I recommend running through the first-run experience for Fargate even if you’re familiar with ECS. It creates all of the one-time setup requirements, such as the necessary IAM roles. If you’re using a CLI, make sure to upgrade to the latest version

In this blog, you will see how to migrate ECS containers from running on Amazon EC2 to Fargate.

Getting started

Note: Anything with code blocks is a change in the task definition file. Screen captures are from the console. Additionally, Fargate is currently available in the us-east-1 (N. Virginia) region.

Launch type

When you create tasks (grouping of containers) and clusters (grouping of tasks), you now have two launch type options: EC2 and Fargate. The default launch type, EC2, is ECS as you knew it before the announcement of Fargate. You need to specify Fargate as the launch type when running a Fargate task.

Even though Fargate abstracts away virtual machines, tasks still must be launched into a cluster. With Fargate, clusters are a logical infrastructure and permissions boundary that allow you to isolate and manage groups of tasks. ECS also supports heterogeneous clusters that are made up of tasks running on both EC2 and Fargate launch types.

The optional, new requiresCompatibilities parameter with FARGATE in the field ensures that your task definition only passes validation if you include Fargate-compatible parameters. Tasks can be flagged as compatible with EC2, Fargate, or both.

"requiresCompatibilities": [
    "FARGATE"
]

Networking

"networkMode": "awsvpc"

In November, we announced the addition of task networking with the network mode awsvpc. By default, ECS uses the bridge network mode. Fargate requires using the awsvpc network mode.

In bridge mode, all of your tasks running on the same instance share the instance’s elastic network interface, which is a virtual network interface, IP address, and security groups.

The awsvpc mode provides this networking support to your tasks natively. You now get the same VPC networking and security controls at the task level that were previously only available with EC2 instances. Each task gets its own elastic networking interface and IP address so that multiple applications or copies of a single application can run on the same port number without any conflicts.

The awsvpc mode also provides a separation of responsibility for tasks. You can get complete control of task placement within your own VPCs, subnets, and the security policies associated with them, even though the underlying infrastructure is managed by Fargate. Also, you can assign different security groups to each task, which gives you more fine-grained security. You can give an application only the permissions it needs.

"portMappings": [
    {
        "containerPort": "3000"
    }
 ]

What else has to change? First, you only specify a containerPort value, not a hostPort value, as there is no host to manage. Your container port is the port that you access on your elastic network interface IP address. Therefore, your container ports in a single task definition file need to be unique.

"environment": [
    {
        "name": "WORDPRESS_DB_HOST",
        "value": "127.0.0.1:3306"
    }
 ]

Additionally, links are not allowed as they are a property of the “bridge” network mode (and are now a legacy feature of Docker). Instead, containers share a network namespace and communicate with each other over the localhost interface. They can be referenced using the following:

localhost/127.0.0.1:<some_port_number>

CPU and memory

"memory": "1024",
 "cpu": "256"

"memory": "1gb",
 "cpu": ".25vcpu"

When launching a task with the EC2 launch type, task performance is influenced by the instance types that you select for your cluster combined with your task definition. If you pick larger instances, your applications make use of the extra resources if there is no contention.

In Fargate, you needed a way to get additional resource information so we created task-level resources. Task-level resources define the maximum amount of memory and cpu that your task can consume.

  • memory can be defined in MB with just the number, or in GB, for example, “1024” or “1gb”.
  • cpu can be defined as the number or in vCPUs, for example, “256” or “.25vcpu”.
    • vCPUs are virtual CPUs. You can look at the memory and vCPUs for instance types to get an idea of what you may have used before.

The memory and CPU options available with Fargate are:

CPU Memory
256 (.25 vCPU) 0.5GB, 1GB, 2GB
512 (.5 vCPU) 1GB, 2GB, 3GB, 4GB
1024 (1 vCPU) 2GB, 3GB, 4GB, 5GB, 6GB, 7GB, 8GB
2048 (2 vCPU) Between 4GB and 16GB in 1GB increments
4096 (4 vCPU) Between 8GB and 30GB in 1GB increments

IAM roles

Because Fargate uses awsvpc mode, you need an Amazon ECS service-linked IAM role named AWSServiceRoleForECS. It provides Fargate with the needed permissions, such as the permission to attach an elastic network interface to your task. After you create your service-linked IAM role, you can delete the remaining roles in your services.

"executionRoleArn": "arn:aws:iam::<your_account_id>:role/ecsTaskExecutionRole"

With the EC2 launch type, an instance role gives the agent the ability to pull, publish, talk to ECS, and so on. With Fargate, the task execution IAM role is only needed if you’re pulling from Amazon ECR or publishing data to Amazon CloudWatch Logs.

The Fargate first-run experience tutorial in the console automatically creates these roles for you.

Volumes

Fargate currently supports non-persistent, empty data volumes for containers. When you define your container, you no longer use the host field and only specify a name.

Load balancers

For awsvpc mode, and therefore for Fargate, use the IP target type instead of the instance target type. You define this in the Amazon EC2 service when creating a load balancer.

If you’re using a Classic Load Balancer, change it to an Application Load Balancer or a Network Load Balancer.

Tip: If you are using an Application Load Balancer, make sure that your tasks are launched in the same VPC and Availability Zones as your load balancer.

Let’s migrate a task definition!

Here is an example NGINX task definition. This type of task definition is what you’re used to if you created one before Fargate was announced. It’s what you would run now with the EC2 launch type.

{
    "containerDefinitions": [
        {
            "name": "nginx",
            "image": "nginx",
            "memory": "512",
            "cpu": "100",
            "essential": true,
            "portMappings": [
                {
                    "hostPort": "80",
                    "containerPort": "80",
                    "protocol": "tcp"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "family": "nginx-ec2"
}

OK, so now what do you need to do to change it to run with the Fargate launch type?

  • Add FARGATE for requiredCompatibilities (not required, but a good safety check for your task definition).
  • Use awsvpc as the network mode.
  • Just specify the containerPort (the hostPortvalue is the same).
  • Add a task executionRoleARN value to allow logging to CloudWatch.
  • Provide cpu and memory limits for the task.
{
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "containerDefinitions": [
        {
            "name": "nginx",
            "image": "nginx",
            "memory": "512",
            "cpu": "100",
            "essential": true,
            "portMappings": [
                {
                    "containerPort": "80",
                    "protocol": "tcp"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "networkMode": "awsvpc",
    "executionRoleArn": "arn:aws:iam::<your_account_id>:role/ecsTaskExecutionRole",
    "family": "nginx-fargate",
    "memory": "512",
    "cpu": "256"
}

Are there more examples?

Yep! Head to the AWS Samples GitHub repo. We have several sample task definitions you can try for both the EC2 and Fargate launch types. Contributions are very welcome too :).

 

tiffany jernigan
@tiffanyfayj

Build a Multi-Tenant Amazon EMR Cluster with Kerberos, Microsoft Active Directory Integration and EMRFS Authorization

Post Syndicated from Songzhi Liu original https://aws.amazon.com/blogs/big-data/build-a-multi-tenant-amazon-emr-cluster-with-kerberos-microsoft-active-directory-integration-and-emrfs-authorization/

One of the challenges faced by our customers—especially those in highly regulated industries—is balancing the need for security with flexibility. In this post, we cover how to enable multi-tenancy and increase security by using EMRFS (EMR File System) authorization, the Amazon S3 storage-level authorization on Amazon EMR.

Amazon EMR is an easy, fast, and scalable analytics platform enabling large-scale data processing. EMRFS authorization provides Amazon S3 storage-level authorization by configuring EMRFS with multiple IAM roles. With this functionality enabled, different users and groups can share the same cluster and assume their own IAM roles respectively.

Simply put, on Amazon EMR, we can now have an Amazon EC2 role per user assumed at run time instead of one general EC2 role at the cluster level. When the user is trying to access Amazon S3 resources, Amazon EMR evaluates against a predefined mappings list in EMRFS configurations and picks up the right role for the user.

In this post, we will discuss what EMRFS authorization is (Amazon S3 storage-level access control) and show how to configure the role mappings with detailed examples. You will then have the desired permissions in a multi-tenant environment. We also demo Amazon S3 access from HDFS command line, Apache Hive on Hue, and Apache Spark.

EMRFS authorization for Amazon S3

There are two prerequisites for using this feature:

  1. Users must be authenticated, because EMRFS needs to map the current user/group/prefix to a predefined user/group/prefix. There are several authentication options. In this post, we launch a Kerberos-enabled cluster that manages the Key Distribution Center (KDC) on the master node, and enable a one-way trust from the KDC to a Microsoft Active Directory domain.
  2. The application must support accessing Amazon S3 via Applications that have their own S3FileSystem APIs (for example, Presto) are not supported at this time.

EMRFS supports three types of mapping entries: user, group, and Amazon S3 prefix. Let’s use an example to show how this works.

Assume that you have the following three identities in your organization, and they are defined in the Active Directory:

To enable all these groups and users to share the EMR cluster, you need to define the following IAM roles:

In this case, you create a separate Amazon EC2 role that doesn’t give any permission to Amazon S3. Let’s call the role the base role (the EC2 role attached to the EMR cluster), which in this example is named EMR_EC2_RestrictedRole. Then, you define all the Amazon S3 permissions for each specific user or group in their own roles. The restricted role serves as the fallback role when the user doesn’t belong to any user/group, nor does the user try to access any listed Amazon S3 prefixes defined on the list.

Important: For all other roles, like emrfs_auth_group_role_data_eng, you need to add the base role (EMR_EC2_RestrictedRole) as the trusted entity so that it can assume other roles. See the following example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::511586466501:role/EMR_EC2_RestrictedRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The following is an example policy for the admin user role (emrfs_auth_user_role_admin_user):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

We are assuming the admin user has access to all buckets in this example.

The following is an example policy for the data science group role (emrfs_auth_group_role_data_sci):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

This role grants all Amazon S3 permissions to the emrfs-auth-data-science-bucket-demo bucket and all the objects in it. Similarly, the policy for the role emrfs_auth_group_role_data_eng is shown below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

Example role mappings configuration

To configure EMRFS authorization, you use EMR security configuration. Here is the configuration we use in this post

Consider the following scenario.

First, the admin user admin1 tries to log in and run a command to access Amazon S3 data through EMRFS. The first role emrfs_auth_user_role_admin_user on the mapping list, which is a user role, is mapped and picked up. Then admin1 has access to the Amazon S3 locations that are defined in this role.

Then a user from the data engineer group (grp_data_engineering) tries to access a data bucket to run some jobs. When EMRFS sees that the user is a member of the grp_data_engineering group, the group role emrfs_auth_group_role_data_eng is assumed, and the user has proper access to Amazon S3 that is defined in the emrfs_auth_group_role_data_eng role.

Next, the third user comes, who is not an admin and doesn’t belong to any of the groups. After failing evaluation of the top three entries, EMRFS evaluates whether the user is trying to access a certain Amazon S3 prefix defined in the last mapping entry. This type of mapping entry is called the prefix type. If the user is trying to access s3://emrfs-auth-default-bucket-demo/, then the prefix mapping is in effect, and the prefix role emrfs_auth_prefix_role_default_s3_prefix is assumed.

If the user is not trying to access any of the Amazon S3 paths that are defined on the list—which means it failed the evaluation of all the entries—it only has the permissions defined in the EMR_EC2RestrictedRole. This role is assumed by the EC2 instances in the cluster.

In this process, all the mappings defined are evaluated in the defined order, and the first role that is mapped is assumed, and the rest of the list is skipped.

Setting up an EMR cluster and mapping Active Directory users and groups

Now that we know how EMRFS authorization role mapping works, the next thing we need to think about is how we can use this feature in an easy and manageable way.

Active Directory setup

Many customers manage their users and groups using Microsoft Active Directory or other tools like OpenLDAP. In this post, we create the Active Directory on an Amazon EC2 instance running Windows Server and create the users and groups we will be using in the example below. After setting up Active Directory, we use the Amazon EMR Kerberos auto-join capability to establish a one-way trust from the KDC running on the EMR master node to the Active Directory domain on the EC2 instance. You can use your own directory services as long as it talks to the LDAP (Lightweight Directory Access Protocol).

To create and join Active Directory to Amazon EMR, follow the steps in the blog post Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory.

After configuring Active Directory, you can create all the users and groups using the Active Directory tools and add users to appropriate groups. In this example, we created users like admin1, dataeng1, datascientist1, grp_data_engineering, and grp_data_science, and then add the users to the right groups.

Join the EMR cluster to an Active Directory domain

For clusters with Kerberos, Amazon EMR now supports automated Active Directory domain joins. You can use the security configuration to configure the one-way trust from the KDC to the Active Directory domain. You also configure the EMRFS role mappings in the same security configuration.

The following is an example of the EMR security configuration with a trusted Active Directory domain EMRKRB.TEST.COM and the EMRFS role mappings as we discussed earlier:

The EMRFS role mapping configuration is shown in this example:

We will also provide an example AWS CLI command that you can run.

Launching the EMR cluster and running the tests

Now you have configured Kerberos and EMRFS authorization for Amazon S3.

Additionally, you need to configure Hue with Active Directory using the Amazon EMR configuration API in order to log in using the AD users created before. The following is an example of Hue AD configuration.

[
  {
    "Classification":"hue-ini",
    "Properties":{

    },
    "Configurations":[
      {
        "Classification":"desktop",
        "Properties":{

        },
        "Configurations":[
          {
            "Classification":"ldap",
            "Properties":{

            },
            "Configurations":[
              {
                "Classification":"ldap_servers",
                "Properties":{

                },
                "Configurations":[
                  {
                    "Classification":"AWS",
                    "Properties":{
                      "base_dn":"DC=emrkrb,DC=test,DC=com",
                      "ldap_url":"ldap://emrkrb.test.com",
                      "search_bind_authentication":"false",
                      "bind_dn":"CN=adjoiner,CN=users,DC=emrkrb,DC=test,DC=com",
                      "bind_password":"Abc123456",
                      "create_users_on_login":"true",
                      "nt_domain":"emrkrb.test.com"
                    },
                    "Configurations":[

                    ]
                  }
                ]
              }
            ]
          },
          {
            "Classification":"auth",
            "Properties":{
              "backend":"desktop.auth.backend.LdapBackend"
            },
            "Configurations":[

            ]
          }
        ]
      }
    ]
  }

Note: In the preceding configuration JSON file, change the values as required before pasting it into the software setting section in the Amazon EMR console.

Now let’s use this configuration and the security configuration you created before to launch the cluster.

In the Amazon EMR console, choose Create cluster. Then choose Go to advanced options. On the Step1: Software and Steps page, under Edit software settings (optional), paste the configuration in the box.

The rest of the setup is the same as an ordinary cluster setup, except in the Security Options section. In Step 4: Security, under Permissions, choose Custom, and then choose the RestrictedRole that you created before.

Choose the appropriate subnets (these should meet the base requirement in order for a successful Active Directory join—see the Amazon EMR Management Guide for more details), and choose the appropriate security groups to make sure it talks to the Active Directory. Choose a key so that you can log in and configure the cluster.

Most importantly, choose the security configuration that you created earlier to enable Kerberos and EMRFS authorization for Amazon S3.

You can use the following AWS CLI command to create a cluster.

aws emr create-cluster --name "TestEMRFSAuthorization" \ 
--release-label emr-5.10.0 \ --instance-type m3.xlarge \ 
--instance-count 3 \ 
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=MyEC2KeyPair \ --service-role EMR_DefaultRole \ 
--security-configuration MyKerberosConfig \ 
--configurations file://hue-config.json \
--applications Name=Hadoop Name=Hive Name=Hue Name=Spark \ 
--kerberos-attributes Realm=EC2.INTERNAL, \ KdcAdminPassword=<YourClusterKDCAdminPassword>, \ ADDomainJoinUser=<YourADUserLogonName>,ADDomainJoinPassword=<YourADUserPassword>, \ 
CrossRealmTrustPrincipalPassword=<MatchADTrustPwd>

Note: If you create the cluster using CLI, you need to save the JSON configuration for Hue into a file named hue-config.json and place it on the server where you run the CLI command.

After the cluster gets into the Waiting state, try to connect by using SSH into the cluster using the Active Directory user name and password.

ssh -l [email protected] <EMR IP or DNS name>

Quickly run two commands to show that the Active Directory join is successful:

  1. id [user name] shows the mapped AD users and groups in Linux.
  2. hdfs groups [user name] shows the mapped group in Hadoop.

Both should return the current Active Directory user and group information if the setup is correct.

Now, you can test the user mapping first. Log in with the admin1 user, and run a Hadoop list directory command:

hadoop fs -ls s3://emrfs-auth-data-science-bucket-demo/

Now switch to a user from the data engineer group.

Retry the previous command to access the admin’s bucket. It should throw an Amazon S3 Access Denied exception.

When you try listing the Amazon S3 bucket that a data engineer group member has accessed, it triggers the group mapping.

hadoop fs -ls s3://emrfs-auth-data-engineering-bucket-demo/

It successfully returns the listing results. Next we will test Apache Hive and then Apache Spark.

 

To run jobs successfully, you need to create a home directory for every user in HDFS for staging data under /user/<username>. Users can configure a step to create a home directory at cluster launch time for every user who has access to the cluster. In this example, you use Hue since Hue will create the home directory in HDFS for the user at the first login. Here Hue also needs to be integrated with the same Active Directory as explained in the example configuration described earlier.

First, log in to Hue as a data engineer user, and open a Hive Notebook in Hue. Then run a query to create a new table pointing to the data engineer bucket, s3://emrfs-auth-data-engineering-bucket-demo/table1_data_eng/.

You can see that the table was created successfully. Now try to create another table pointing to the data science group’s bucket, where the data engineer group doesn’t have access.

It failed and threw an Amazon S3 Access Denied error.

Now insert one line of data into the successfully create table.

Next, log out, switch to a data science group user, and create another table, test2_datasci_tb.

The creation is successful.

The last task is to test Spark (it requires the user directory, but Hue created one in the previous step).

Now let’s come back to the command line and run some Spark commands.

Login to the master node using the datascientist1 user:

Start the SparkSQL interactive shell by typing spark-sql, and run the show tables command. It should list the tables that you created using Hive.

As a data science group user, try select on both tables. You will find that you can only select the table defined in the location that your group has access to.

Conclusion

EMRFS authorization for Amazon S3 enables you to have multiple roles on the same cluster, providing flexibility to configure a shared cluster for different teams to achieve better efficiency. The Active Directory integration and group mapping make it much easier for you to manage your users and groups, and provides better auditability in a multi-tenant environment.


Additional Reading

If you found this post useful, be sure to check out Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory and Launching and Running an Amazon EMR Cluster inside a VPC.


About the Authors

Songzhi Liu is a Big Data Consultant with AWS Professional Services. He works closely with AWS customers to provide them Big Data & Machine Learning solutions and best practices on the Amazon cloud.

 

 

 

 

Building Blocks of Amazon ECS

Post Syndicated from Tiffany Jernigan original https://aws.amazon.com/blogs/compute/building-blocks-of-amazon-ecs/

So, what’s Amazon Elastic Container Service (ECS)? ECS is a managed service for running containers on AWS, designed to make it easy to run applications in the cloud without worrying about configuring the environment for your code to run in. Using ECS, you can easily deploy containers to host a simple website or run complex distributed microservices using thousands of containers.

Getting started with ECS isn’t too difficult. To fully understand how it works and how you can use it, it helps to understand the basic building blocks of ECS and how they fit together!

Let’s begin with an analogy

Imagine you’re in a virtual reality game with blocks and portals, in which your task is to build kingdoms.

In your spaceship, you pull up a holographic map of your upcoming destination: Nozama, a golden-orange planet. Looking at its various regions, you see that the nearest one is za-southwest-1 (SW Nozama). You set your destination, and use your jump drive to jump to the outer atmosphere of za-southwest-1.

As you approach SW Nozama, you see three portals, 1a, 1b, and 1c. Each portal lets you transport directly to an isolated zone (Availability Zone), where you can start construction on your new kingdom (cluster), Royaume.

With your supply of blocks, you take the portal to 1b, and erect the surrounding walls of your first territory (instance)*.

Before you get ahead of yourself, there are some rules to keep in mind. For your territory to be a part of Royaume, the land ordinance requires construction of a building (container), specifically a castle, from which your territory’s lord (agent)* rules.

You can then create architectural plans (task definitions) to build your developments (tasks), consisting of up to 10 buildings per plan. A development can be built now within this or any territory, or multiple territories.

If you do decide to create more territories, you can either stay here in 1b or take a portal to another location in SW Nozama and start building there.

Amazon EC2 building blocks

We currently provide two launch types: EC2 and Fargate. With Fargate, the Amazon EC2 instances are abstracted away and managed for you. Instead of worrying about ECS container instances, you can just worry about tasks. In this post, the infrastructure components used by ECS that are handled by Fargate are marked with a *.

Instance*

EC2 instances are good ol’ virtual machines (VMs). And yes, don’t worry, you can connect to them (via SSH). Because customers have varying needs in memory, storage, and computing power, many different instance types are offered. Just want to run a small application or try a free trial? Try t2.micro. Want to run memory-optimized workloads? R3 and X1 instances are a couple options. There are many more instance types as well, which cater to various use cases.

AMI*

Sorry if you wanted to immediately march forward, but before you create your instance, you need to choose an AMI. An AMI stands for Amazon Machine Image. What does that mean? Basically, an AMI provides the information required to launch an instance: root volume, launch permissions, and volume-attachment specifications. You can find and choose a Linux or Windows AMI provided by AWS, the user community, the AWS Marketplace (for example, the Amazon ECS-Optimized AMI), or you can create your own.

Region

AWS is divided into regions that are geographic areas around the world (for now it’s just Earth, but maybe someday…). These regions have semi-evocative names such as us-east-1 (N. Virginia), us-west-2 (Oregon), eu-central-1 (Frankfurt), ap-northeast-1 (Tokyo), etc.

Each region is designed to be completely isolated from the others, and consists of multiple, distinct data centers. This creates a “blast radius” for failure so that even if an entire region goes down, the others aren’t affected. Like many AWS services, to start using ECS, you first need to decide the region in which to operate. Typically, this is the region nearest to you or your users.

Availability Zone

AWS regions are subdivided into Availability Zones. A region has at minimum two zones, and up to a handful. Zones are physically isolated from each other, spanning one or more different data centers, but are connected through low-latency, fiber-optic networking, and share some common facilities. EC2 is designed so that the most common failures only affect a single zone to prevent region-wide outages. This means you can achieve high availability in a region by spanning your services across multiple zones and distributing across hosts.

Amazon ECS building blocks

Container

Well, without containers, ECS wouldn’t exist!

Are containers virtual machines?
Nope! Virtual machines virtualize the hardware (benefits), while containers virtualize the operating system (even more benefits!). If you look inside a container, you would see that it is made by processes running on the host, and tied together by kernel constructs like namespaces, cgroups, etc. But you don’t need to bother about that level of detail, at least not in this post!

Why containers?
Containers give you the ability to build, ship, and run your code anywhere!

Before the cloud, you needed to self-host and therefore had to buy machines in addition to setting up and configuring the operating system (OS), and running your code. In the cloud, with virtualization, you can just skip to setting up the OS and running your code. Containers make the process even easier—you can just run your code.

Additionally, all of the dependencies travel in a package with the code, which is called an image. This allows containers to be deployed on any host machine. From the outside, it looks like a host is just holding a bunch of containers. They all look the same, in the sense that they are generic enough to be deployed on any host.

With ECS, you can easily run your containerized code and applications across a managed cluster of EC2 instances.

Are containers a fairly new technology?
The concept of containerization is not new. Its origins date back to 1979 with the creation of chroot. However, it wasn’t until the early 2000s that containers became a major technology. The most significant milestone to date was the release of Docker in 2013, which led to the popularization and widespread adoption of containers.

What does ECS use?
While other container technologies exist (LXC, rkt, etc.), because of its massive adoption and use by our customers, ECS was designed first to work natively with Docker containers.

Container instance*

Yep, you are back to instances. An instance is just slightly more complex in the ECS realm though. Here, it is an ECS container instance that is an EC2 instance running the agent, has a specifically defined IAM policy and role, and has been registered into your cluster.

And as you probably guessed, in these instances, you are running containers. 

AMI*

These container instances can use any AMI as long as it has the following specifications: a modern Linux distribution with the agent and the Docker Daemon with any Docker runtime dependencies running on it.

Want it more simplified? Well, AWS created the Amazon ECS-Optimized AMI for just that. Not only does that AMI come preconfigured with all of the previously mentioned specifications, it’s tested and includes the recommended ecs-init upstart process to run and monitor the agent.

Cluster

An ECS cluster is a grouping of (container) instances* (or tasks in Fargate) that lie within a single region, but can span multiple Availability Zones – it’s even a good idea for redundancy. When launching an instance (or tasks in Fargate), unless specified, it registers with the cluster named “default”. If “default” doesn’t exist, it is created. You can also scale and delete your clusters.

Agent*

The Amazon ECS container agent is a Go program that runs in its own container within each EC2 instance that you use with ECS. (It’s also available open source on GitHub!) The agent is the intermediary component that takes care of the communication between the scheduler and your instances. Want to register your instance into a cluster? (Why wouldn’t you? A cluster is both a logical boundary and provider of pool of resources!) Then you need to run the agent on it.

Task

When you want to start a container, it has to be part of a task. Therefore, you have to create a task first. Succinctly, tasks are a logical grouping of 1 to N containers that run together on the same instance, with N defined by you, up to 10. Let’s say you want to run a custom blog engine. You could put together a web server, an application server, and an in-memory cache, each in their own container. Together, they form a basic frontend unit.

Task definition

Ah, but you cannot create a task directly. You have to create a task definition that tells ECS that “task definition X is composed of this container (and maybe that other container and that other container too!).” It’s kind of like an architectural plan for a city. Some other details it can include are how the containers interact, container CPU and memory constraints, and task permissions using IAM roles.

Then you can tell ECS, “start one task using task definition X.” It might sound like unnecessary planning at first. As soon as you start to deal with multiple tasks, scaling, upgrades, and other “real life” scenarios, you’ll be glad that you have task definitions to keep track of things!

Scheduler*

So, the scheduler schedules… sorry, this should be more helpful, huh? The scheduler is part of the “hosted orchestration layer” provided by ECS. Wait a minute, what do I mean by “hosted orchestration”? Simply put, hosted means that it’s operated by ECS on your behalf, without you having to care about it. Your applications are deployed in containers running on your instances, but the managing of tasks is taken care of by ECS. One less thing to worry about!

Also, the scheduler is the component that decides what (which containers) gets to run where (on which instances), according to a number of constraints. Say that you have a custom blog engine to scale for high availability. You could create a service, which by default, spreads tasks across all zones in the chosen region. And if you want each task to be on a different instance, you can use the distinctInstance task placement constraint. ECS makes sure that not only this happens, but if a task fails, it starts again.

Service

To ensure that you always have your task running without managing it yourself, you can create a service based on the task that you defined and ECS ensures that it stays running. A service is a special construct that says, “at any given time, I want to make sure that N tasks using task definition X1 are running.” If N=1, it just means “make sure that this task is running, and restart it if needed!” And with N>1, you’re basically scaling your application until you hit N, while also ensuring each task is running.

So, what now?

Hopefully you, at the very least, learned a tiny something. All comments are very welcome!

Want to discuss ECS with others? Join the amazon-ecs slack group, which members of the community created and manage.

Also, if you’re interested in learning more about the core concepts of ECS and its relation to EC2, here are some resources:

Pages
Amazon ECS landing page
AWS Fargate landing page
Amazon ECS Getting Started
Nathan Peck’s AWSome ECS

Docs
Amazon EC2
Amazon ECS

Blogs
AWS Compute Blog
AWS Blog

GitHub code
Amazon ECS container agent
Amazon ECS CLI

AWS videos
Learn Amazon ECS
AWS videos
AWS webinars

 

— tiffany

 @tiffanyfayj

 

timeShift(GrafanaBuzz, 1w) Issue 30

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/01/19/timeshiftgrafanabuzz-1w-issue-30/

Welcome to TimeShift

We’re only 6 weeks away from the next GrafanaCon and here at Grafana Labs we’re buzzing with excitement. We have some great talks lined up that you won’t want to miss.

This week’s TimeShift covers Grafana’s annotation functionality, monitoring with Prometheus, integrating Grafana with NetFlow and a peek inside Stream’s monitoring stack. Enjoy!


Latest Stable Release

Grafana 4.6.3 is now available. Latest bugfixes include:

  • Gzip: Fixes bug Gravatar images when gzip was enabled #5952
  • Alert list: Now shows alert state changes even after adding manual annotations on dashboard #99513
  • Alerting: Fixes bug where rules evaluated as firing when all conditions was false and using OR operator. #93183
  • Cloudwatch: CloudWatch no longer display metrics’ default alias #101514, thx @mtanda

Download Grafana 4.6.3 Now


From the Blogosphere

Walkthrough: Watch your Ansible deployments in Grafana!: Your graphs start spiking and your platform begins behaving abnormally. Did the config change in a deployment, causing the problem? This article covers Grafana’s new annotation functionality, and specifically, how to create deployment annotations via Ansible playbooks.

Application Monitoring in OpenShift with Prometheus and Grafana: There are many article describing how to monitor OpenShift with Prometheus running in the same cluster, but what if you don’t have admin permissions to the cluster you need to monitor?

Spring Boot Metrics Monitoring Using Prometheus & Grafana: As the title suggests, this post walks you through how to configure Prometheus and Grafana to monitor you Spring Boot application metrics.

How to Integrate Grafana with NetFlow: Learn how to monitor NetFlow from Scrutinizer using Grafana’s SimpleJSON data source.

Stream & Go: News Feeds for Over 300 Million End Users: Stream lets you build scalable newsfeeds and activity streams via their API, which is used by more than 300 million end users. In this article, they discuss their monitoring stack and why they chose particular components and technologies.


GrafanaCon EU Tickets are Going Fast!

We’re six weeks from kicking off GrafanaCon EU! Join us for talks from Google, Bloomberg, Tinder, eBay and more! You won’t want to miss two great days of open source monitoring talks and fun in Amsterdam. Get your tickets before they sell out!

Get Your Ticket Now


Grafana Plugins

We have a couple of plugin updates to share this week that add some new features and improvements. Updating your plugins is easy. For on-prem Grafana, use the Grafana-cli tool, or update with 1 click on your Hosted Grafana.

UPDATED PLUGIN

Druid Data Source – This new update is packed with new features. Notable enhancement include:

  • Post Aggregation feature
  • Support for thetaSketch
  • Improvements to the Query editor

Update Now

UPDATED PLUGIN

Breadcrumb Panel – The Breadcrumb Panel is a small panel you can include in your dashboard that tracks other dashboards you have visited – making it easy to navigate back to a previously visited dashboard. The latest release adds support for dashboards loaded from a file.

Update Now


Upcoming Events

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

SnowCamp 2018: Yves Brissaud – Application metrics with Prometheus and Grafana | Grenoble, France – Jan 24, 2018:
We’ll take a look at how Prometheus, Grafana and a bit of code make it possible to obtain temporal data to visualize the state of our applications as well as to help with development and debugging.

Register Now

Women Who Go Berlin: Go Workshop – Monitoring and Troubleshooting using Prometheus and Grafana | Berlin, Germany – Jan 31, 2018: In this workshop we will learn about one of the most important topics in making apps production ready: Monitoring. We will learn how to use tools you’ve probably heard a lot about – Prometheus and Grafana, and using what we learn we will troubleshoot a particularly buggy Go app.

Register Now

FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. There is no need to register; all are welcome.

Jfokus | Stockholm, Sweden – Feb 5-7, 2018:
Carl Bergquist – Quickie: Monitoring? Not OPS Problem

Why should we monitor our system? Why can’t we just rely on the operations team anymore? They use to be able to do that. What’s currently changing? Presentation content: – Why do we monitor our system – How did it use to work? – Whats changing – Why do we need to shift focus – Everyone should be on call. – Resilience is the goal (Best way of having someone care about quality is to make them responsible).

Register Now

Jfokus | Stockholm, Sweden – Feb 5-7, 2018:
Leonard Gram – Presentation: DevOps Deconstructed

What’s a Site Reliability Engineer and how’s that role different from the DevOps engineer my boss wants to hire? I really don’t want to be on call, should I? Is Docker the right place for my code or am I better of just going straight to Serverless? And why should I care about any of it? I’ll try to answer some of these questions while looking at what DevOps really is about and how commodisation of servers through “the cloud” ties into it all. This session will be an opinionated piece from a developer who’s been on-call for the past 6 years and would like to convince you to do the same, at least once.

Register Now

Stockholm Metrics and Monitoring | Stockholm, Sweden – Feb 7, 2018:
Observability 3 ways – Logging, Metrics and Distributed Tracing

Let’s talk about often confused telemetry tools: Logging, Metrics and Distributed Tracing. We’ll show how you capture latency using each of the tools and how they work differently. Through examples and discussion, we’ll note edge cases where certain tools have advantages over others. By the end of this talk, we’ll better understand how each of Logging, Metrics and Distributed Tracing aids us in different ways to understand our applications.

Register Now

OpenNMS – Introduction to “Grafana” | Webinar – Feb 21, 2018:
IT monitoring helps detect emerging hardware damage and performance bottlenecks in the enterprise network before any consequential damage or disruption to business processes occurs. The powerful open-source OpenNMS software monitors a network, including all connected devices, and provides logging of a variety of data that can be used for analysis and planning purposes. In our next OpenNMS webinar on February 21, 2018, we introduce “Grafana” – a web-based tool for creating and displaying dashboards from various data sources, which can be perfectly combined with OpenNMS.

Register Now


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

As we say with pie charts, use emojis wisely 😉


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

That wraps up our 30th issue of TimeShift. What do you think? Are there other types of content you’d like to see here? Submit a comment on this issue below, or post something at our community forum.

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

AWS Glue Now Supports Scala Scripts

Post Syndicated from Mehul Shah original https://aws.amazon.com/blogs/big-data/aws-glue-now-supports-scala-scripts/

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

This post shows how to build an example Scala script that identifies highly negative issues in the timeline. It pulls out issue events in the timeline sample, analyzes their titles using the sentiment prediction functions from the Stanford CoreNLP libraries, and surfaces the most negative issues.

Getting started

Before we start writing scripts, we use AWS Glue crawlers to get a sense of the data—its structure and characteristics. We also set up a development endpoint and attach an Apache Zeppelin notebook, so we can interactively explore the data and author the script.

Crawl the data

The dataset used in this example was downloaded from the GitHub archive website into our sample dataset bucket in Amazon S3, and copied to the following locations:

s3://aws-glue-datasets-<region>/examples/scala-blog/githubarchive/data/

Choose the best folder by replacing <region> with the region that you’re working in, for example, us-east-1. Crawl this folder, and put the results into a database named githubarchive in the AWS Glue Data Catalog, as described in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, and is organized hierarchically (that is, partitioned) by year, month, and day.

When finished, use the AWS Glue console to navigate to the table named data in the githubarchive database. Notice that this data has eight top-level columns, which are common to each event type, and three partition columns that correspond to year, month, and day.

Choose the payload column, and you will notice that it has a complex schema—one that reflects the union of the payloads of event types that appear in the crawled data. Also note that the schema that crawlers generate is a subset of the true schema because they sample only a subset of the data.

Set up the library, development endpoint, and notebook

Next, you need to download and set up the libraries that estimate the sentiment in a snippet of text. The Stanford CoreNLP libraries contain a number of human language processing tools, including sentiment prediction.

Download the Stanford CoreNLP libraries. Unzip the .zip file, and you’ll see a directory full of jar files. For this example, the following jars are required:

  • stanford-corenlp-3.8.0.jar
  • stanford-corenlp-3.8.0-models.jar
  • ejml-0.23.jar

Upload these files to an Amazon S3 path that is accessible to AWS Glue so that it can load these libraries when needed. For this example, they are in s3://glue-sample-other/corenlp/.

Development endpoints are static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints to interactively send commands and explore and analyze your data. These endpoints have the same configuration as that of AWS Glue’s job execution system. So, commands and scripts that work there also work the same when registered and run as jobs in AWS Glue.

To set up an endpoint and a Zeppelin notebook to work with that endpoint, follow the instructions in the AWS Glue Developer Guide. When you are creating an endpoint, be sure to specify the locations of the previously mentioned jars in the Dependent jars path as a comma-separated list. Otherwise, the libraries will not be loaded.

After you set up the notebook server, go to the Zeppelin notebook by choosing Dev Endpoints in the left navigation pane on the AWS Glue console. Choose the endpoint that you created. Next, choose the Notebook Server URL, which takes you to the Zeppelin server. Log in using the notebook user name and password that you specified when creating the notebook. Finally, create a new note to try out this example.

Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and the output for that command. Moreover, each notebook includes a number of interpreters. If you set up the Zeppelin server using the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already connected to your new development endpoint, with pyspark as the default. Therefore, throughout this example, you need to prepend %spark at the top of your paragraphs. In this example, we omit these for brevity.

Working with the data

In this section, we use AWS Glue extensions to Spark to work with the dataset. We look at the actual schema of the data and filter out the interesting event types for our analysis.

Start with some boilerplate code to import libraries that you need:

%spark

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext

Then, create the Spark and AWS Glue contexts needed for working with the data:

@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)

You need the transient decorator on the SparkContext when working in Zeppelin; otherwise, you will run into a serialization error when executing commands.

Dynamic frames

This section shows how to create a dynamic frame that contains the GitHub records in the table that you crawled earlier. A dynamic frame is the basic data structure in AWS Glue scripts. It is like an Apache Spark data frame, except that it is designed and optimized for data cleaning and transformation workloads. A dynamic frame is well-suited for representing semi-structured datasets like the GitHub timeline.

A dynamic frame is a collection of dynamic records. In Spark lingo, it is an RDD (resilient distributed dataset) of DynamicRecords. A dynamic record is a self-describing record. Each record encodes its columns and types, so every record can have a schema that is unique from all others in the dynamic frame. This is convenient and often more efficient for datasets like the GitHub timeline, where payloads can vary drastically from one event type to another.

The following creates a dynamic frame, github_events, from your table:

val github_events = glueContext
                    .getCatalogSource(database = "githubarchive", tableName = "data")
                    .getDynamicFrame()

The getCatalogSource() method returns a DataSource, which represents a particular table in the Data Catalog. The getDynamicFrame() method returns a dynamic frame from the source.

Recall that the crawler created a schema from only a sample of the data. You can scan the entire dataset, count the rows, and print the complete schema as follows:

github_events.count
github_events.printSchema()

The result looks like the following:

The data has 414,826 records. As before, notice that there are eight top-level columns, and three partition columns. If you scroll down, you’ll also notice that the payload is the most complex column.

Run functions and filter records

This section describes how you can create your own functions and invoke them seamlessly to filter records. Unlike filtering with Python lambdas, Scala scripts do not need to convert records from one language representation to another, thereby reducing overhead and running much faster.

Let’s create a function that picks only the IssuesEvents from the GitHub timeline. These events are generated whenever someone posts an issue for a particular repository. Each GitHub event record has a field, “type”, that indicates the kind of event it is. The issueFilter() function returns true for records that are IssuesEvents.

def issueFilter(rec: DynamicRecord): Boolean = { 
    rec.getField("type").exists(_ == "IssuesEvent") 
}

Note that the getField() method returns an Option[Any] type, so you first need to check that it exists before checking the type.

You pass this function to the filter transformation, which applies the function on each record and returns a dynamic frame of those records that pass.

val issue_events =  github_events.filter(issueFilter)

Now, let’s look at the size and schema of issue_events.

issue_events.count
issue_events.printSchema()

It’s much smaller (14,063 records), and the payload schema is less complex, reflecting only the schema for issues. Keep a few essential columns for your analysis, and drop the rest using the ApplyMapping() transform:

val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                 ("actor.login", "string", "actor", "string"), 
                                                 ("repo.name", "string", "repo", "string"),
                                                 ("payload.action", "string", "action", "string"),
                                                 ("payload.issue.title", "string", "title", "string")))
issue_titles.show()

The ApplyMapping() transform is quite handy for renaming columns, casting types, and restructuring records. The preceding code snippet tells the transform to select the fields (or columns) that are enumerated in the left half of the tuples and map them to the fields and types in the right half.

Estimating sentiment using Stanford CoreNLP

To focus on the most pressing issues, you might want to isolate the records with the most negative sentiments. The Stanford CoreNLP libraries are Java-based and offer sentiment-prediction functions. Accessing these functions through Python is possible, but quite cumbersome. It requires creating Python surrogate classes and objects for those found on the Java side. Instead, with Scala support, you can use those classes and objects directly and invoke their methods. Let’s see how.

First, import the libraries needed for the analysis:

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

The Stanford CoreNLP libraries have a main driver that orchestrates all of their analysis. The driver setup is heavyweight, setting up threads and data structures that are shared across analyses. Apache Spark runs on a cluster with a main driver process and a collection of backend executor processes that do most of the heavy sifting of the data.

The Stanford CoreNLP shared objects are not serializable, so they cannot be distributed easily across a cluster. Instead, you need to initialize them once for every backend executor process that might need them. Here is how to accomplish that:

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")

object myNLP {
    lazy val coreNLP = new StanfordCoreNLP(props)
}

The properties tell the libraries which annotators to execute and how many words to process. The preceding code creates an object, myNLP, with a field coreNLP that is lazily evaluated. This field is initialized only when it is needed, and only once. So, when the backend executors start processing the records, each executor initializes the driver for the Stanford CoreNLP libraries only one time.

Next is a function that estimates the sentiment of a text string. It first calls Stanford CoreNLP to annotate the text. Then, it pulls out the sentences and takes the average sentiment across all the sentences. The sentiment is a double, from 0.0 as the most negative to 4.0 as the most positive.

def estimatedSentiment(text: String): Double = {
    if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
    val annotations = myNLP.coreNLP.process(text)
    val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
    sentences.foldLeft(0.0)( (csum, x) => { 
        csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
    }) / sentences.length
}

Now, let’s estimate the sentiment of the issue titles and add that computed field as part of the records. You can accomplish this with the map() method on dynamic frames:

val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
    val mbody = rec.getField("title")
    mbody match {
        case Some(mval: String) => { 
            rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
            rec }
        case _ => rec
    }
})

The map() method applies the user-provided function on every record. The function takes a DynamicRecord as an argument and returns a DynamicRecord. The code above computes the sentiment, adds it in a top-level field, sentiment, to the record, and returns the record.

Count the records with sentiment and show the schema. This takes a few minutes because Spark must initialize the library and run the sentiment analysis, which can be involved.

issue_sentiments.count
issue_sentiments.printSchema()

Notice that all records were processed (14,063), and the sentiment value was added to the schema.

Finally, let’s pick out the titles that have the lowest sentiment (less than 1.5). Count them and print out a sample to see what some of the titles look like.

val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
pressing_issues.count
pressing_issues.show(10)

Next, write them all to a file so that you can handle them later. (You’ll need to replace the output path with your own.)

glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions("""{"path": "s3://<bucket>/out/path/"}"""), 
                              format = "json")
            .writeDynamicFrame(pressing_issues)

Take a look in the output path, and you can see the output files.

Putting it all together

Now, let’s create a job from the preceding interactive session. The following script combines all the commands from earlier. It processes the GitHub archive files and writes out the highly negative issues:

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

object GlueApp {

    object myNLP {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
        props.setProperty("parse.maxlen", "70")

        lazy val coreNLP = new StanfordCoreNLP(props)
    }

    def estimatedSentiment(text: String): Double = {
        if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
        val annotations = myNLP.coreNLP.process(text)
        val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
        sentences.foldLeft(0.0)( (csum, x) => { 
            csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
        }) / sentences.length
    }

    def main(sysArgs: Array[String]) {
        val spark: SparkContext = SparkContext.getOrCreate()
        val glueContext: GlueContext = new GlueContext(spark)

        val dbname = "githubarchive"
        val tblname = "data"
        val outpath = "s3://<bucket>/out/path/"

        val github_events = glueContext
                            .getCatalogSource(database = dbname, tableName = tblname)
                            .getDynamicFrame()

        val issue_events =  github_events.filter((rec: DynamicRecord) => {
            rec.getField("type").exists(_ == "IssuesEvent")
        })

        val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                         ("actor.login", "string", "actor", "string"), 
                                                         ("repo.name", "string", "repo", "string"),
                                                         ("payload.action", "string", "action", "string"),
                                                         ("payload.issue.title", "string", "title", "string")))

        val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
            val mbody = rec.getField("title")
            mbody match {
                case Some(mval: String) => { 
                    rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
                    rec }
                case _ => rec
            }
        })

        val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))

        glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions(s"""{"path": "$outpath"}"""), 
                              format = "json")
                    .writeDynamicFrame(pressing_issues)
    }
}

Notice that the script is enclosed in a top-level object called GlueApp, which serves as the script’s entry point for the job. (You’ll need to replace the output path with your own.) Upload the script to an Amazon S3 location so that AWS Glue can load it when needed.

To create the job, open the AWS Glue console. Choose Jobs in the left navigation pane, and then choose Add job. Create a name for the job, and specify a role with permissions to access the data. Choose An existing script that you provide, and choose Scala as the language.

For the Scala class name, type GlueApp to indicate the script’s entry point. Specify the Amazon S3 location of the script.

Choose Script libraries and job parameters. In the Dependent jars path field, enter the Amazon S3 locations of the Stanford CoreNLP libraries from earlier as a comma-separated list (without spaces). Then choose Next.

No connections are needed for this job, so choose Next again. Review the job properties, and choose Finish. Finally, choose Run job to execute the job.

You can simply edit the script’s input table and output path to run this job on whatever GitHub timeline datasets that you might have.

Conclusion

In this post, we showed how to write AWS Glue ETL scripts in Scala via notebooks and how to run them as jobs. Scala has the advantage that it is the native language for the Spark runtime. With Scala, it is easier to call Scala or Java functions and third-party libraries for analyses. Moreover, data processing is faster in Scala because there’s no need to convert records from one language runtime to another.

You can find more example of Scala scripts in our GitHub examples repository: https://github.com/awslabs/aws-glue-samples. We encourage you to experiment with Scala scripts and let us know about any interesting ETL flows that you want to share.

Happy Glue-ing!

 


Additional Reading

If you found this post useful, be sure to check out Simplify Querying Nested JSON with the AWS Glue Relationalize Transform and Genomic Analysis with Hail on Amazon EMR and Amazon Athena.

 


About the Authors

Mehul Shah is a senior software manager for AWS Glue. His passion is leveraging the cloud to build smarter, more efficient, and easier to use data systems. He has three girls, and, therefore, he has no spare time.

 

 

 

Ben Sowell is a software development engineer at AWS Glue.

 

 

 

 
Vinay Vivili is a software development engineer for AWS Glue.

 

 

 

Combine Transactional and Analytical Data Using Amazon Aurora and Amazon Redshift

Post Syndicated from Re Alvarez-Parmar original https://aws.amazon.com/blogs/big-data/combine-transactional-and-analytical-data-using-amazon-aurora-and-amazon-redshift/

A few months ago, we published a blog post about capturing data changes in an Amazon Aurora database and sending it to Amazon Athena and Amazon QuickSight for fast analysis and visualization. In this post, I want to demonstrate how easy it can be to take the data in Aurora and combine it with data in Amazon Redshift using Amazon Redshift Spectrum.

With Amazon Redshift, you can build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Amazon Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat.

In this post, we describe how to combine data in Aurora in Amazon Redshift. Here’s an overview of the solution:

  • Use AWS Lambda functions with Amazon Aurora to capture data changes in a table.
  • Save data in an Amazon S3
  • Query data using Amazon Redshift Spectrum.

We use the following services:

Serverless architecture for capturing and analyzing Aurora data changes

Consider a scenario in which an e-commerce web application uses Amazon Aurora for a transactional database layer. The company has a sales table that captures every single sale, along with a few corresponding data items. This information is stored as immutable data in a table. Business users want to monitor the sales data and then analyze and visualize it.

In this example, you take the changes in data in an Aurora database table and save it in Amazon S3. After the data is captured in Amazon S3, you combine it with data in your existing Amazon Redshift cluster for analysis.

By the end of this post, you will understand how to capture data events in an Aurora table and push them out to other AWS services using AWS Lambda.

The following diagram shows the flow of data as it occurs in this tutorial:

The starting point in this architecture is a database insert operation in Amazon Aurora. When the insert statement is executed, a custom trigger calls a Lambda function and forwards the inserted data. Lambda writes the data that it received from Amazon Aurora to a Kinesis data delivery stream. Kinesis Data Firehose writes the data to an Amazon S3 bucket. Once the data is in an Amazon S3 bucket, it is queried in place using Amazon Redshift Spectrum.

Creating an Aurora database

First, create a database by following these steps in the Amazon RDS console:

  1. Sign in to the AWS Management Console, and open the Amazon RDS console.
  2. Choose Launch a DB instance, and choose Next.
  3. For Engine, choose Amazon Aurora.
  4. Choose a DB instance class. This example uses a small, since this is not a production database.
  5. In Multi-AZ deployment, choose No.
  6. Configure DB instance identifier, Master username, and Master password.
  7. Launch the DB instance.

After you create the database, use MySQL Workbench to connect to the database using the CNAME from the console. For information about connecting to an Aurora database, see Connecting to an Amazon Aurora DB Cluster.

The following screenshot shows the MySQL Workbench configuration:

Next, create a table in the database by running the following SQL statement:

Create Table
CREATE TABLE Sales (
InvoiceID int NOT NULL AUTO_INCREMENT,
ItemID int NOT NULL,
Category varchar(255),
Price double(10,2), 
Quantity int not NULL,
OrderDate timestamp,
DestinationState varchar(2),
ShippingType varchar(255),
Referral varchar(255),
PRIMARY KEY (InvoiceID)
)

You can now populate the table with some sample data. To generate sample data in your table, copy and run the following script. Ensure that the highlighted (bold) variables are replaced with appropriate values.

#!/usr/bin/python
import MySQLdb
import random
import datetime

db = MySQLdb.connect(host="AURORA_CNAME",
                     user="DBUSER",
                     passwd="DBPASSWORD",
                     db="DB")

states = ("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
"IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
"NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA",
"WA","WV","WI","WY")

shipping_types = ("Free", "3-Day", "2-Day")

product_categories = ("Garden", "Kitchen", "Office", "Household")
referrals = ("Other", "Friend/Colleague", "Repeat Customer", "Online Ad")

for i in range(0,10):
    item_id = random.randint(1,100)
    state = states[random.randint(0,len(states)-1)]
    shipping_type = shipping_types[random.randint(0,len(shipping_types)-1)]
    product_category = product_categories[random.randint(0,len(product_categories)-1)]
    quantity = random.randint(1,4)
    referral = referrals[random.randint(0,len(referrals)-1)]
    price = random.randint(1,100)
    order_date = datetime.date(2016,random.randint(1,12),random.randint(1,30)).isoformat()

    data_order = (item_id, product_category, price, quantity, order_date, state,
    shipping_type, referral)

    add_order = ("INSERT INTO Sales "
                   "(ItemID, Category, Price, Quantity, OrderDate, DestinationState, \
                   ShippingType, Referral) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

    cursor = db.cursor()
    cursor.execute(add_order, data_order)

    db.commit()

cursor.close()
db.close() 

The following screenshot shows how the table appears with the sample data:

Sending data from Amazon Aurora to Amazon S3

There are two methods available to send data from Amazon Aurora to Amazon S3:

  • Using a Lambda function
  • Using SELECT INTO OUTFILE S3

To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function to send data to Amazon S3 using Amazon Kinesis Data Firehose.

Alternatively, you can use a SELECT INTO OUTFILE S3 statement to query data from an Amazon Aurora DB cluster and save it directly in text files that are stored in an Amazon S3 bucket. However, with this method, there is a delay between the time that the database transaction occurs and the time that the data is exported to Amazon S3 because the default file size threshold is 6 GB.

Creating a Kinesis data delivery stream

The next step is to create a Kinesis data delivery stream, since it’s a dependency of the Lambda function.

To create a delivery stream:

  1. Open the Kinesis Data Firehose console
  2. Choose Create delivery stream.
  3. For Delivery stream name, type AuroraChangesToS3.
  4. For Source, choose Direct PUT.
  5. For Record transformation, choose Disabled.
  6. For Destination, choose Amazon S3.
  7. In the S3 bucket drop-down list, choose an existing bucket, or create a new one.
  8. Enter a prefix if needed, and choose Next.
  9. For Data compression, choose GZIP.
  10. In IAM role, choose either an existing role that has access to write to Amazon S3, or choose to generate one automatically. Choose Next.
  11. Review all the details on the screen, and choose Create delivery stream when you’re finished.

 

Creating a Lambda function

Now you can create a Lambda function that is called every time there is a change that needs to be tracked in the database table. This Lambda function passes the data to the Kinesis data delivery stream that you created earlier.

To create the Lambda function:

  1. Open the AWS Lambda console.
  2. Ensure that you are in the AWS Region where your Amazon Aurora database is located.
  3. If you have no Lambda functions yet, choose Get started now. Otherwise, choose Create function.
  4. Choose Author from scratch.
  5. Give your function a name and select Python 3.6 for Runtime
  6. Choose and existing or create a new Role, the role would need to have access to call firehose:PutRecord
  7. Choose Next on the trigger selection screen.
  8. Paste the following code in the code window. Change the stream_name variable to the Kinesis data delivery stream that you created in the previous step.
  9. Choose File -> Save in the code editor and then choose Save.
import boto3
import json

firehose = boto3.client('firehose')
stream_name = ‘AuroraChangesToS3’


def Kinesis_publish_message(event, context):
    
    firehose_data = (("%s,%s,%s,%s,%s,%s,%s,%s\n") %(event['ItemID'], 
    event['Category'], event['Price'], event['Quantity'],
    event['OrderDate'], event['DestinationState'], event['ShippingType'], 
    event['Referral']))
    
    firehose_data = {'Data': str(firehose_data)}
    print(firehose_data)
    
    firehose.put_record(DeliveryStreamName=stream_name,
    Record=firehose_data)

Note the Amazon Resource Name (ARN) of this Lambda function.

Giving Aurora permissions to invoke a Lambda function

To give Amazon Aurora permissions to invoke a Lambda function, you must attach an IAM role with appropriate permissions to the cluster. For more information, see Invoking a Lambda Function from an Amazon Aurora DB Cluster.

Once you are finished, the Amazon Aurora database has access to invoke a Lambda function.

Creating a stored procedure and a trigger in Amazon Aurora

Now, go back to MySQL Workbench, and run the following command to create a new stored procedure. When this stored procedure is called, it invokes the Lambda function you created. Change the ARN in the following code to your Lambda function’s ARN.

DROP PROCEDURE IF EXISTS CDC_TO_FIREHOSE;
DELIMITER ;;
CREATE PROCEDURE CDC_TO_FIREHOSE (IN ItemID VARCHAR(255), 
									IN Category varchar(255), 
									IN Price double(10,2),
                                    IN Quantity int(11),
                                    IN OrderDate timestamp,
                                    IN DestinationState varchar(2),
                                    IN ShippingType varchar(255),
                                    IN Referral  varchar(255)) LANGUAGE SQL 
BEGIN
  CALL mysql.lambda_async('arn:aws:lambda:us-east-1:XXXXXXXXXXXXX:function:CDCFromAuroraToKinesis', 
     CONCAT('{ "ItemID" : "', ItemID, 
            '", "Category" : "', Category,
            '", "Price" : "', Price,
            '", "Quantity" : "', Quantity, 
            '", "OrderDate" : "', OrderDate, 
            '", "DestinationState" : "', DestinationState, 
            '", "ShippingType" : "', ShippingType, 
            '", "Referral" : "', Referral, '"}')
     );
END
;;
DELIMITER ;

Create a trigger TR_Sales_CDC on the Sales table. When a new record is inserted, this trigger calls the CDC_TO_FIREHOSE stored procedure.

DROP TRIGGER IF EXISTS TR_Sales_CDC;
 
DELIMITER ;;
CREATE TRIGGER TR_Sales_CDC
  AFTER INSERT ON Sales
  FOR EACH ROW
BEGIN
  SELECT  NEW.ItemID , NEW.Category, New.Price, New.Quantity, New.OrderDate
  , New.DestinationState, New.ShippingType, New.Referral
  INTO @ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral;
  CALL  CDC_TO_FIREHOSE(@ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral);
END
;;
DELIMITER ;

If a new row is inserted in the Sales table, the Lambda function that is mentioned in the stored procedure is invoked.

Verify that data is being sent from the Lambda function to Kinesis Data Firehose to Amazon S3 successfully. You might have to insert a few records, depending on the size of your data, before new records appear in Amazon S3. This is due to Kinesis Data Firehose buffering. To learn more about Kinesis Data Firehose buffering, see the “Amazon S3” section in Amazon Kinesis Data Firehose Data Delivery.

Every time a new record is inserted in the sales table, a stored procedure is called, and it updates data in Amazon S3.

Querying data in Amazon Redshift

In this section, you use the data you produced from Amazon Aurora and consume it as-is in Amazon Redshift. In order to allow you to process your data as-is, where it is, while taking advantage of the power and flexibility of Amazon Redshift, you use Amazon Redshift Spectrum. You can use Redshift Spectrum to run complex queries on data stored in Amazon S3, with no need for loading or other data prep.

Just create a data source and issue your queries to your Amazon Redshift cluster as usual. Behind the scenes, Redshift Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your dataset grows to beyond an exabyte! Being able to query data that is stored in Amazon S3 means that you can scale your compute and your storage independently. You have the full power of the Amazon Redshift query model and all the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Amazon Redshift tables and in Amazon S3.

Redshift Spectrum supports open, common data types, including CSV/TSV, Apache Parquet, SequenceFile, and RCFile. Files can be compressed using gzip or Snappy, with other data types and compression methods in the works.

First, create an Amazon Redshift cluster. Follow the steps in Launch a Sample Amazon Redshift Cluster.

Next, create an IAM role that has access to Amazon S3 and Athena. By default, Amazon Redshift Spectrum uses the Amazon Athena data catalog. Your cluster needs authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon S3.

In the demo setup, I attached AmazonS3FullAccess and AmazonAthenaFullAccess. In a production environment, the IAM roles should follow the standard security of granting least privilege. For more information, see IAM Policies for Amazon Redshift Spectrum.

Attach the newly created role to the Amazon Redshift cluster. For more information, see Associate the IAM Role with Your Cluster.

Next, connect to the Amazon Redshift cluster, and create an external schema and database:

create external schema if not exists spectrum_schema
from data catalog 
database 'spectrum_db' 
region 'us-east-1'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/RedshiftSpectrumRole'
create external database if not exists;

Don’t forget to replace the IAM role in the statement.

Then create an external table within the database:

 CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_sales(
  ItemID int,
  Category varchar,
  Price DOUBLE PRECISION,
  Quantity int,
  OrderDate TIMESTAMP,
  DestinationState varchar,
  ShippingType varchar,
  Referral varchar)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://{BUCKET_NAME}/CDC/'

Query the table, and it should contain data. This is a fact table.

select top 10 * from spectrum_schema.ecommerce_sales

 

Next, create a dimension table. For this example, we create a date/time dimension table. Create the table:

CREATE TABLE date_dimension (
  d_datekey           integer       not null sortkey,
  d_dayofmonth        integer       not null,
  d_monthnum          integer       not null,
  d_dayofweek                varchar(10)   not null,
  d_prettydate        date       not null,
  d_quarter           integer       not null,
  d_half              integer       not null,
  d_year              integer       not null,
  d_season            varchar(10)   not null,
  d_fiscalyear        integer       not null)
diststyle all;

Populate the table with data:

copy date_dimension from 's3://reparmar-lab/2016dates' 
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/redshiftspectrum'
DELIMITER ','
dateformat 'auto';

The date dimension table should look like the following:

Querying data in local and external tables using Amazon Redshift

Now that you have the fact and dimension table populated with data, you can combine the two and run analysis. For example, if you want to query the total sales amount by weekday, you can run the following:

select sum(quantity*price) as total_sales, date_dimension.d_season
from spectrum_schema.ecommerce_sales 
join date_dimension on spectrum_schema.ecommerce_sales.orderdate = date_dimension.d_prettydate 
group by date_dimension.d_season

You get the following results:

Similarly, you can replace d_season with d_dayofweek to get sales figures by weekday:

With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to use file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost.

Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Amazon Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Amazon Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of the supported compression algorithms in Amazon Redshift Spectrum, less data is scanned.

Analyzing and visualizing Amazon Redshift data in Amazon QuickSight

Modify the Amazon Redshift security group to allow an Amazon QuickSight connection. For more information, see Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters.

After modifying the Amazon Redshift security group, go to Amazon QuickSight. Create a new analysis, and choose Amazon Redshift as the data source.

Enter the database connection details, validate the connection, and create the data source.

Choose the schema to be analyzed. In this case, choose spectrum_schema, and then choose the ecommerce_sales table.

Next, we add a custom field for Total Sales = Price*Quantity. In the drop-down list for the ecommerce_sales table, choose Edit analysis data sets.

On the next screen, choose Edit.

In the data prep screen, choose New Field. Add a new calculated field Total Sales $, which is the product of the Price*Quantity fields. Then choose Create. Save and visualize it.

Next, to visualize total sales figures by month, create a graph with Total Sales on the x-axis and Order Data formatted as month on the y-axis.

After you’ve finished, you can use Amazon QuickSight to add different columns from your Amazon Redshift tables and perform different types of visualizations. You can build operational dashboards that continuously monitor your transactional and analytical data. You can publish these dashboards and share them with others.

Final notes

Amazon QuickSight can also read data in Amazon S3 directly. However, with the method demonstrated in this post, you have the option to manipulate, filter, and combine data from multiple sources or Amazon Redshift tables before visualizing it in Amazon QuickSight.

In this example, we dealt with data being inserted, but triggers can be activated in response to an INSERT, UPDATE, or DELETE trigger.

Keep the following in mind:

  • Be careful when invoking a Lambda function from triggers on tables that experience high write traffic. This would result in a large number of calls to your Lambda function. Although calls to the lambda_async procedure are asynchronous, triggers are synchronous.
  • A statement that results in a large number of trigger activations does not wait for the call to the AWS Lambda function to complete. But it does wait for the triggers to complete before returning control to the client.
  • Similarly, you must account for Amazon Kinesis Data Firehose limits. By default, Kinesis Data Firehose is limited to a maximum of 5,000 records/second. For more information, see Monitoring Amazon Kinesis Data Firehose.

In certain cases, it may be optimal to use AWS Database Migration Service (AWS DMS) to capture data changes in Aurora and use Amazon S3 as a target. For example, AWS DMS might be a good option if you don’t need to transform data from Amazon Aurora. The method used in this post gives you the flexibility to transform data from Aurora using Lambda before sending it to Amazon S3. Additionally, the architecture has the benefits of being serverless, whereas AWS DMS requires an Amazon EC2 instance for replication.

For design considerations while using Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data.

If you have questions or suggestions, please comment below.


Additional Reading

If you found this post useful, be sure to check out Capturing Data Changes in Amazon Aurora Using AWS Lambda and 10 Best Practices for Amazon Redshift Spectrum


About the Authors

Re Alvarez-Parmar is a solutions architect for Amazon Web Services. He helps enterprises achieve success through technical guidance and thought leadership. In his spare time, he enjoys spending time with his two kids and exploring outdoors.