Tag Archives: hbase

How to Patch Linux Workloads on AWS

Post Syndicated from Koen van Blijderveen original https://aws.amazon.com/blogs/security/how-to-patch-linux-workloads-on-aws/

Most malware tries to compromise your systems by using a known vulnerability that the operating system maker has already patched. Two best practices that help prevent malware from affecting your systems are to apply all operating system patches and to actively monitor your systems for missing patches.

In this blog post, I show you how to patch Linux workloads using AWS Systems Manager. To accomplish this, I will show you how to use the AWS Command Line Interface (AWS CLI) to:

  1. Launch an Amazon EC2 instance for use with Systems Manager.
  2. Configure Systems Manager to patch your Amazon EC2 Linux instances.

In two previous blog posts (Part 1 and Part 2), I showed how to use the AWS Management Console to perform the necessary steps to patch, inspect, and protect Microsoft Windows workloads. You can implement those same processes for your Linux instances running in AWS by changing the instance tags and types shown in the previous blog posts.

Because most Linux system administrators are more familiar with using a command line, I show how to patch Linux workloads by using the AWS CLI in this blog post. The steps to use the Amazon EBS Snapshot Scheduler and Amazon Inspector are identical for both Microsoft Windows and Linux.

What you should know first

To follow along with the solution in this post, you need one or more Amazon EC2 instances. You may use existing instances or create new instances. For this post, I assume this is an Amazon EC2 instance running Amazon Linux, launched from an Amazon Machine Image (AMI).

Systems Manager is a collection of capabilities that helps you automate management tasks for AWS-hosted instances on Amazon EC2 and your on-premises servers. In this post, I use Systems Manager for two purposes: to run remote commands and apply operating system patches. To learn about the full capabilities of Systems Manager, see What Is AWS Systems Manager?

As of Amazon Linux 2017.09, the AMI comes preinstalled with the Systems Manager agent. Systems Manager Patch Manager also supports Red Hat and Ubuntu. To install the agent on these Linux distributions or an older version of Amazon Linux, see Installing and Configuring SSM Agent on Linux Instances.

If you are not familiar with how to launch an Amazon EC2 instance, see Launching an Instance. I also assume you launched or will launch your instance in a private subnet. You must make sure that the Amazon EC2 instance can connect to the internet using a network address translation (NAT) instance or NAT gateway to communicate with Systems Manager. The following diagram shows how you should structure your VPC.

Diagram showing how to structure your VPC

Later in this post, you will assign tasks to a maintenance window to patch your instances with Systems Manager. To do this, the IAM user you are using for this post must have the iam:PassRole permission. This permission allows the IAM user assigning tasks to pass their own IAM permissions to the AWS service. In this example, when you assign a task to a maintenance window, IAM passes your credentials to Systems Manager. You also should authorize your IAM user to use Amazon EC2 and Systems Manager. As mentioned before, you will be using the AWS CLI for most of the steps in this blog post. Our documentation shows you how to get started with the AWS CLI. Make sure the AWS CLI is installed and configured with an AWS access key and secret access key that belong to an IAM user that has the following AWS managed policies attached: AmazonEC2FullAccess and AmazonSSMFullAccess.
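If your IAM user does not already have iam:PassRole, you can grant it with an inline policy along the lines of the following sketch. The policy name and account ID are placeholders, and the Resource points at the MaintenanceWindowRole you will create later in Step 2; scope it to whichever role you actually pass to the service.

$ aws iam put-user-policy --user-name YourIAMUser --policy-name AllowPassRoleForSSM --policy-document '{
    "Version": "2012-10-17",
    "Statement": {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/MaintenanceWindowRole"
    }
  }'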

Step 1: Launch an Amazon EC2 Linux instance

In this section, I show you how to launch an Amazon EC2 instance so that you can use Systems Manager with the instance. This step requires you to do three things:

  1. Create an IAM role for Systems Manager before launching your Amazon EC2 instance.
  2. Launch your Amazon EC2 instance with Amazon EBS and the IAM role for Systems Manager.
  3. Add tags to the instances so that you can add your instances to a Systems Manager maintenance window based on tags.

A. Create an IAM role for Systems Manager

Before launching an Amazon EC2 instance, I recommend that you first create an IAM role for Systems Manager, which you will use to update the Amazon EC2 instance. AWS already provides a preconfigured policy that you can use for the new role, and it is called AmazonEC2RoleforSSM.

  1. Create a JSON file named trustpolicy-ec2ssm.json that contains the following trust policy. This policy describes which principal (an entity that can take action on an AWS resource) is allowed to assume the role we are going to create. In this example, the principal is the Amazon EC2 service.
    {
      "Version": "2012-10-17",
      "Statement": {
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole"
      }
    }

  2. Use the following command to create a role named EC2SSM that has the AWS managed policy AmazonEC2RoleforSSM attached to it. If the command succeeds, it returns JSON output that describes the role and its parameters.
    $ aws iam create-role --role-name EC2SSM --assume-role-policy-document file://trustpolicy-ec2ssm.json

  3. Use the following command to attach the AWS managed IAM policy (AmazonEC2RoleforSSM) to your newly created role.
    $ aws iam attach-role-policy --role-name EC2SSM --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM

  4. Use the following commands to create the IAM instance profile and add the role to the instance profile. The instance profile is needed to attach the role you created earlier to your Amazon EC2 instance.
    $ aws iam create-instance-profile --instance-profile-name EC2SSM-IP
    $ aws iam add-role-to-instance-profile --instance-profile-name EC2SSM-IP --role-name EC2SSM

B. Launch your Amazon EC2 instance

To follow along, you need an Amazon EC2 instance that is running Amazon Linux. You can use any existing instance you may have or create a new instance.

To launch a new Amazon EC2 instance with the instance profile attached, or to attach the profile to an existing instance:

  1. Use the following command to launch a new Amazon EC2 instance using an Amazon Linux AMI available in the US East (N. Virginia) Region (also known as us-east-1). Replace YourKeyPair and YourSubnetId with your information. For more information about creating a key pair, see the create-key-pair documentation. Write down the InstanceId that is in the output because you will need it later in this post.
    $ aws ec2 run-instances --image-id ami-cb9ec1b1 --instance-type t2.micro --key-name YourKeyPair --subnet-id YourSubnetId --iam-instance-profile Name=EC2SSM-IP

  2. If you are using an existing Amazon EC2 instance, you can use the following command to attach the instance profile you created earlier to your instance.
    $ aws ec2 associate-iam-instance-profile --instance-id YourInstanceId --iam-instance-profile Name=EC2SSM-IP

C. Add tags

The final step of configuring your Amazon EC2 instances is to add tags. You will use these tags to configure Systems Manager in Step 2 of this post. For this example, I add a tag named Patch Group and set the value to Linux Servers. I could have other groups of Amazon EC2 instances that I treat differently by having the same tag name but a different tag value. For example, I might have a collection of other servers with the tag name Patch Group with a value of Web Servers.

  • Use the following command to add the Patch Group tag to your Amazon EC2 instance.
    $ aws ec2 create-tags --resources YourInstanceId --tags Key="Patch Group",Value="Linux Servers"

Note: You must wait a few minutes until the Amazon EC2 instance is available before you can proceed to the next section. To make sure your Amazon EC2 instance is online and ready, you can use the following AWS CLI command:

$ aws ec2 describe-instance-status --instance-ids YourInstanceId
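If you prefer a command that blocks until the instance passes its status checks instead of polling, the AWS CLI also provides a waiter. This is only a convenience; replace YourInstanceId with the ID you wrote down earlier.

$ aws ec2 wait instance-status-ok --instance-ids YourInstanceId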

At this point, you now have at least one Amazon EC2 instance you can use to configure Systems Manager.

Step 2: Configure Systems Manager

In this section, I show you how to configure and use Systems Manager to apply operating system patches to your Amazon EC2 instances, and how to manage patch compliance.

To start, I provide some background information about Systems Manager. Then, I cover how to:

  1. Create the Systems Manager IAM role so that Systems Manager is able to perform patch operations.
  2. Create a Systems Manager patch baseline and associate it with your instance to define which patches Systems Manager should apply.
  3. Define a maintenance window to make sure Systems Manager patches your instance when you tell it to.
  4. Monitor patch compliance to verify the patch state of your instances.

You must meet two prerequisites to use Systems Manager to apply operating system patches. First, you must attach the IAM role you created in the previous section, EC2SSM, to your Amazon EC2 instance. Second, you must install the Systems Manager agent on your Amazon EC2 instance. If you have used a recent Amazon Linux AMI, Amazon has already installed the Systems Manager agent on your Amazon EC2 instance. You can confirm this by logging in to an Amazon EC2 instance and checking the Systems Manager agent log files that are located at /var/log/amazon/ssm/.
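For example, you can tail the agent log on the instance to confirm that the agent is running and reporting to the service. This is just a quick check; the exact log file names can vary between agent versions.

$ sudo tail -n 20 /var/log/amazon/ssm/amazon-ssm-agent.log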

To install the Systems Manager agent on an instance that does not have the agent preinstalled or if you want to use the Systems Manager agent on your on-premises servers, see Installing and Configuring the Systems Manager Agent on Linux Instances. If you forgot to attach the newly created role when launching your Amazon EC2 instance or if you want to attach the role to already running Amazon EC2 instances, see Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI or use the AWS Management Console.

A. Create the Systems Manager IAM role

For a maintenance window to be able to run any tasks, you must create a new role for Systems Manager. This role is a different kind of role than the one you created earlier: this role will be used by Systems Manager instead of Amazon EC2. Earlier, you created the role, EC2SSM, with the policy, AmazonEC2RoleforSSM, which allowed the Systems Manager agent on your instance to communicate with Systems Manager. In this section, you need a new role with the policy, AmazonSSMMaintenanceWindowRole, so that the Systems Manager service can execute commands on your instance.

To create the new IAM role for Systems Manager:

  1. Create a JSON file named trustpolicy-maintenancewindowrole.json that contains the following trust policy. This policy describes which principal is allowed to assume the role you are going to create. This trust policy allows not only Amazon EC2 to assume this role, but also Systems Manager.
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Sid":"",
             "Effect":"Allow",
             "Principal":{
                "Service":[
                   "ec2.amazonaws.com",
                   "ssm.amazonaws.com"
               ]
             },
             "Action":"sts:AssumeRole"
          }
       ]
    }

  2. Use the following command to create a role named MaintenanceWindowRole that has the AWS managed policy, AmazonSSMMaintenanceWindowRole, attached to it. If the command succeeds, it returns JSON output that describes the role and its parameters.
    $ aws iam create-role --role-name MaintenanceWindowRole --assume-role-policy-document file://trustpolicy-maintenancewindowrole.json

  3. Use the following command to attach the AWS managed IAM policy (AmazonSSMMaintenanceWindowRole) to your newly created role.
    $ aws iam attach-role-policy --role-name MaintenanceWindowRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSSMMaintenanceWindowRole
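Optionally, you can verify that the policy is attached before moving on. This check is not required; it simply lists the managed policies attached to the role.

$ aws iam list-attached-role-policies --role-name MaintenanceWindowRole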

B. Create a Systems Manager patch baseline and associate it with your instance

Next, you will create a Systems Manager patch baseline and associate it with your Amazon EC2 instance. A patch baseline defines which patches Systems Manager should apply to your instance. Before you can associate the patch baseline with your instance, though, you must determine if Systems Manager recognizes your Amazon EC2 instance. Use the following command to list all instances managed by Systems Manager. The --filters option ensures you look only for your newly created Amazon EC2 instance.

$ aws ssm describe-instance-information --filters Key=InstanceIds,Values=YourInstanceId

{
    "InstanceInformationList": [
        {
            "IsLatestVersion": true,
            "ComputerName": "ip-10-50-2-245",
            "PingStatus": "Online",
            "InstanceId": "YourInstanceId",
            "IPAddress": "10.50.2.245",
            "ResourceType": "EC2Instance",
            "AgentVersion": "2.2.120.0",
            "PlatformVersion": "2017.09",
            "PlatformName": "Amazon Linux AMI",
            "PlatformType": "Linux",
            "LastPingDateTime": 1515759143.826
        }
    ]
}

If your instance is missing from the list, verify that:

  1. Your instance is running.
  2. You attached the Systems Manager IAM role, EC2SSM.
  3. You deployed a NAT gateway in your public subnet to ensure your VPC reflects the diagram shown earlier in this post so that the Systems Manager agent can connect to the Systems Manager internet endpoint.
  4. The Systems Manager agent logs don’t include any unaddressed errors.

Now that you have checked that Systems Manager can manage your Amazon EC2 instance, it is time to create a patch baseline. With a patch baseline, you define which patches are approved to be installed on all Amazon EC2 instances associated with the patch baseline. The Patch Group resource tag you defined earlier will determine to which patch group an instance belongs. If you do not specifically define a patch baseline, the default AWS-managed patch baseline is used.

To create a patch baseline:

  1. Use the following command to create a patch baseline named AmazonLinuxServers. With approval rules, you can determine the approved patches that will be included in your patch baseline. In this example, you add all Critical severity patches to the patch baseline as soon as they are released, by setting the Auto approval delay to 0 days. By setting the Auto approval delay to 2 days, you add to this patch baseline the Important, Medium, and Low severity patches two days after they are released.
    $ aws ssm create-patch-baseline --name "AmazonLinuxServers" --description "Baseline containing all updates for Amazon Linux" --operating-system AMAZON_LINUX --approval-rules "PatchRules=[{PatchFilterGroup={PatchFilters=[{Values=[Critical],Key=SEVERITY}]},ApproveAfterDays=0,ComplianceLevel=CRITICAL},{PatchFilterGroup={PatchFilters=[{Values=[Important,Medium,Low],Key=SEVERITY}]},ApproveAfterDays=2,ComplianceLevel=HIGH}]"
    
    {
        "BaselineId": "YourBaselineId"
    }

  2. Use the following command to register the patch baseline you created with your instance. To do so, you use the Patch Group tag that you added to your Amazon EC2 instance.
    $ aws ssm register-patch-baseline-for-patch-group --baseline-id YourPatchBaselineId --patch-group "Linux Servers"
    
    {
        "PatchGroup": "Linux Servers",
        "BaselineId": "YourBaselineId"
    }
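To confirm that the patch group is now mapped to your baseline, you can list the registered patch groups. This optional check returns every patch group and the baseline it is associated with.

$ aws ssm describe-patch-groups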

C. Define a maintenance window

Now that you have successfully set up a role, created a patch baseline, and registered your Amazon EC2 instance with your patch baseline, you will define a maintenance window so that you can control when your Amazon EC2 instances will receive patches. By creating multiple maintenance windows and assigning them to different patch groups, you can make sure your Amazon EC2 instances do not all reboot at the same time.

To define a maintenance window:

  1. Use the following command to define a maintenance window. In this example command, the maintenance window will start every Saturday at 10:00 P.M. UTC. It will have a duration of 4 hours and will not start any new tasks 1 hour before the end of the maintenance window.
    $ aws ssm create-maintenance-window --name SaturdayNight --schedule "cron(0 0 22 ? * SAT *)" --duration 4 --cutoff 1 --allow-unassociated-targets
    
    {
        "WindowId": "YourMaintenanceWindowId"
    }

For more information about defining a cron-based schedule for maintenance windows, see Cron and Rate Expressions for Maintenance Windows.

  2. After defining the maintenance window, you must register the Amazon EC2 instance with the maintenance window so that Systems Manager knows which Amazon EC2 instance it should patch in this maintenance window. You can register the instance by using the same Patch Group tag you used to associate the Amazon EC2 instance with the AWS-provided patch baseline, as shown in the following command.
    $ aws ssm register-target-with-maintenance-window --window-id YourMaintenanceWindowId --resource-type INSTANCE --targets "Key=tag:Patch Group,Values=Linux Servers"
    
    {
        "WindowTargetId": "YourWindowTargetId"
    }

  3. Assign a task to the maintenance window that will install the operating system patches on your Amazon EC2 instance. The following command uses these options:
    1. name is the name of your task and is optional. I named mine Patching.
    2. task-arn is the name of the task document you want to run.
    3. max-concurrency allows you to specify how many of your Amazon EC2 instances Systems Manager should patch at the same time. max-errors determines when Systems Manager should abort the task. For patching, this number should not be too low, because you do not want your entire patch task to stop on all instances if one instance fails. You can set this, for example, to 20%.
    4. service-role-arn is the Amazon Resource Name (ARN) of the AmazonSSMMaintenanceWindowRole role you created earlier in this blog post.
    5. task-invocation-parameters defines the parameters that are specific to the AWS-RunPatchBaseline task document and tells Systems Manager that you want to install patches with a timeout of 600 seconds (10 minutes).
      $ aws ssm register-task-with-maintenance-window --name "Patching" --window-id "YourMaintenanceWindowId" --targets "Key=WindowTargetIds,Values=YourWindowTargetId" --task-arn AWS-RunPatchBaseline --service-role-arn "arn:aws:iam::123456789012:role/MaintenanceWindowRole" --task-type "RUN_COMMAND" --task-invocation-parameters "RunCommand={Comment=,TimeoutSeconds=600,Parameters={SnapshotId=[''],Operation=[Install]}}" --max-concurrency "500" --max-errors "20%"
      
      {
          "WindowTaskId": "YourWindowTaskId"
      }

Now, you must wait for the maintenance window to run at least once according to the schedule you defined earlier. If your maintenance window has expired, you can check the status of any maintenance tasks Systems Manager has performed by using the following command.

$ aws ssm describe-maintenance-window-executions --window-id "YourMaintenanceWindowId"

{
    "WindowExecutions": [
        {
            "Status": "SUCCESS",
            "WindowId": "YourMaintenanceWindowId",
            "WindowExecutionId": "b594984b-430e-4ffa-a44c-a2e171de9dd3",
            "EndTime": 1515766467.487,
            "StartTime": 1515766457.691
        }
    ]
}
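To drill down into the individual tasks that ran during an execution, you can pass the WindowExecutionId from the previous output to the following command. This is an optional follow-up; YourWindowExecutionId is the ID returned above.

$ aws ssm describe-maintenance-window-execution-tasks --window-execution-id YourWindowExecutionId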

D. Monitor patch compliance

You also can see the overall patch compliance of all Amazon EC2 instances using the following command in the AWS CLI.

$ aws ssm list-compliance-summaries

This command returns, in JSON format, the number of instances that are compliant and the number that are noncompliant for each compliance category.
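To see compliance details for a single instance, such as which patches are missing, you can also query the compliance items directly. In this sketch, replace YourInstanceId with your instance ID.

$ aws ssm list-compliance-items --resource-ids YourInstanceId --resource-types ManagedInstance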

You also can see overall patch compliance by choosing Compliance under Insights in the navigation pane of the Systems Manager console. You will see a visual representation of how many Amazon EC2 instances are up to date, how many Amazon EC2 instances are noncompliant, and how many Amazon EC2 instances are compliant in relation to the earlier defined patch baseline.

Screenshot of the Compliance page of the Systems Manager console

In this section, you have set everything up for patch management on your instance. Now you know how to patch your Amazon EC2 instance in a controlled manner and how to check if your Amazon EC2 instance is compliant with the patch baseline you have defined. Of course, I recommend that you apply these steps to all Amazon EC2 instances you manage.

Summary

In this blog post, I showed how to use Systems Manager to create a patch baseline and maintenance window to keep your Amazon EC2 Linux instances up to date with the latest security patches. Remember that by creating multiple maintenance windows and assigning them to different patch groups, you can make sure your Amazon EC2 instances do not all reboot at the same time.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing any part of this solution, start a new thread on the Amazon EC2 forum or contact AWS Support.

– Koen

How to Patch, Inspect, and Protect Microsoft Windows Workloads on AWS—Part 1

Post Syndicated from Koen van Blijderveen original https://aws.amazon.com/blogs/security/how-to-patch-inspect-and-protect-microsoft-windows-workloads-on-aws-part-1/

Most malware tries to compromise your systems by using a known vulnerability that the maker of the operating system has already patched. To help prevent malware from affecting your systems, two security best practices are to apply all operating system patches to your systems and actively monitor your systems for missing patches. In case you do need to recover from a malware attack, you should make regular backups of your data.

In today’s blog post (Part 1 of a two-part post), I show how to keep your Amazon EC2 instances that run Microsoft Windows up to date with the latest security patches by using Amazon EC2 Systems Manager. Tomorrow in Part 2, I show how to take regular snapshots of your data by using Amazon EBS Snapshot Scheduler and how to use Amazon Inspector to check if your EC2 instances running Microsoft Windows contain any common vulnerabilities and exposures (CVEs).

What you should know first

To follow along with the solution in this post, you need one or more EC2 instances. You may use existing instances or create new instances. For the blog post, I assume this is an EC2 for Microsoft Windows Server 2012 R2 instance installed from the Amazon Machine Images (AMIs). If you are not familiar with how to launch an EC2 instance, see Launching an Instance. I also assume you launched or will launch your instance in a private subnet. A private subnet is not directly accessible via the internet, and access to it requires either a VPN connection to your on-premises network or a jump host in a public subnet (a subnet with access to the internet). You must make sure that the EC2 instance can connect to the internet using a network address translation (NAT) instance or NAT gateway to communicate with Systems Manager and Amazon Inspector. The following diagram shows how you should structure your Amazon Virtual Private Cloud (VPC). You should also be familiar with Restoring an Amazon EBS Volume from a Snapshot and Attaching an Amazon EBS Volume to an Instance.

Later on, you will assign tasks to a maintenance window to patch your instances with Systems Manager. To do this, the AWS Identity and Access Management (IAM) user you are using for this post must have the iam:PassRole permission. This permission allows the IAM user assigning tasks to pass their own IAM permissions to the AWS service. In this example, when you assign a task to a maintenance window, IAM passes your credentials to Systems Manager. This safeguard ensures that the user cannot use the creation of tasks to elevate their IAM privileges because their own IAM privileges limit which tasks they can run against an EC2 instance. You should also authorize your IAM user to use EC2, Amazon Inspector, Amazon CloudWatch, and Systems Manager. You can achieve this by attaching the following AWS managed policies to the IAM user you are using for this example: AmazonInspectorFullAccess, AmazonEC2FullAccess, and AmazonSSMFullAccess.

Architectural overview

The following diagram illustrates the components of this solution’s architecture.

Diagram showing the components of this solution's architecture

For this blog post, Microsoft Windows EC2 is Amazon EC2 for Microsoft Windows Server 2012 R2 instances with attached Amazon Elastic Block Store (Amazon EBS) volumes, which are running in your VPC. These instances may be standalone Windows instances running your Windows workloads, or you may have joined them to an Active Directory domain controller. For instances joined to a domain, you can be using Active Directory running on an EC2 for Windows instance, or you can use AWS Directory Service for Microsoft Active Directory.

Amazon EC2 Systems Manager is a scalable tool for remote management of your EC2 instances. You will use the Systems Manager Run Command to install the Amazon Inspector agent. The agent enables EC2 instances to communicate with the Amazon Inspector service and run assessments, which I explain in detail later in this blog post. You also will create a Systems Manager association to keep your EC2 instances up to date with the latest security patches.

You can use the EBS Snapshot Scheduler to schedule automated snapshots at regular intervals. You will use it to set up regular snapshots of your Amazon EBS volumes. EBS Snapshot Scheduler is a prebuilt solution by AWS that you will deploy in your AWS account. With Amazon EBS snapshots, you pay only for the actual data you store. Snapshots save only the data that has changed since the previous snapshot, which minimizes your cost.

You will use Amazon Inspector to run security assessments on your EC2 for Windows Server instance. In this post, I show how to assess if your EC2 for Windows Server instance is vulnerable to any of the more than 50,000 CVEs registered with Amazon Inspector.

In today’s and tomorrow’s posts, I show you how to:

  1. Launch an EC2 instance with an IAM role, Amazon EBS volume, and tags that Systems Manager and Amazon Inspector will use.
  2. Configure Systems Manager to install the Amazon Inspector agent and patch your EC2 instances.
  3. Take EBS snapshots by using EBS Snapshot Scheduler to automate snapshots based on instance tags.
  4. Use Amazon Inspector to check if your EC2 instances running Microsoft Windows contain any common vulnerabilities and exposures (CVEs).

Step 1: Launch an EC2 instance

In this section, I show you how to launch your EC2 instances so that you can use Systems Manager with the instances and use instance tags with EBS Snapshot Scheduler to automate snapshots. This requires three things:

  • Create an IAM role for Systems Manager before launching your EC2 instance.
  • Launch your EC2 instance with Amazon EBS and the IAM role for Systems Manager.
  • Add tags to instances so that you can automate policies for which instances you take snapshots of and when.

Create an IAM role for Systems Manager

Before launching your EC2 instance, I recommend that you first create an IAM role for Systems Manager, which you will use to update the EC2 instance you will launch. AWS already provides a preconfigured policy that you can use for your new role, and it is called AmazonEC2RoleforSSM.

  1. Sign in to the IAM console and choose Roles in the navigation pane. Choose Create new role.
    Screenshot of choosing "Create role"
  2. In the role-creation workflow, choose AWS service > EC2 > EC2 to create a role for an EC2 instance.
    Screenshot of creating a role for an EC2 instance
  3. Choose the AmazonEC2RoleforSSM policy to attach it to the new role you are creating.
    Screenshot of attaching the AmazonEC2RoleforSSM policy to the new role you are creating
  4. Give the role a meaningful name (I chose EC2SSM) and description, and choose Create role.
    Screenshot of giving the role a name and description

Launch your EC2 instance

To follow along, you need an EC2 instance that is running Microsoft Windows Server 2012 R2 and that has an Amazon EBS volume attached. You can use any existing instance you may have or create a new instance.

When launching your new EC2 instance, be sure that:

  • The operating system is Microsoft Windows Server 2012 R2.
  • You attach at least one Amazon EBS volume to the EC2 instance.
  • You attach the newly created IAM role (EC2SSM).
  • The EC2 instance can connect to the internet through a network address translation (NAT) gateway or a NAT instance.
  • You create the tags shown in the following screenshot (you will use them later).

If you are using an already launched EC2 instance, you can attach the newly created role as described in Easily Replace or Attach an IAM Role to an Existing EC2 Instance by Using the EC2 Console.

Add tags

The final step of configuring your EC2 instances is to add tags. You will use these tags to configure Systems Manager in Step 2 of this blog post and to configure Amazon Inspector in Part 2. For this example, I add a tag key, Patch Group, and set the value to Windows Servers. I could have other groups of EC2 instances that I treat differently by having the same tag key but a different tag value. For example, I might have a collection of other servers with the Patch Group tag key with a value of IAS Servers.

Screenshot of adding tags

Note: You must wait a few minutes until the EC2 instance becomes available before you can proceed to the next section.

At this point, you now have at least one EC2 instance you can use to configure Systems Manager, use EBS Snapshot Scheduler, and use Amazon Inspector.

Note: If you have a large number of EC2 instances to tag, you may want to use the EC2 CreateTags API rather than manually apply tags to each instance.
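For example, the AWS CLI lets you tag several instances in one call. The instance IDs below are placeholders.

$ aws ec2 create-tags --resources i-0123456789abcdef0 i-0fedcba9876543210 --tags Key="Patch Group",Value="Windows Servers"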

Step 2: Configure Systems Manager

In this section, I show you how to use Systems Manager to apply operating system patches to your EC2 instances, and how to manage patch compliance.

To start, I will provide some background information about Systems Manager. Then, I will cover how to:

  • Create the Systems Manager IAM role so that Systems Manager is able to perform patch operations.
  • Associate a Systems Manager patch baseline with your instance to define which patches Systems Manager should apply.
  • Define a maintenance window to make sure Systems Manager patches your instance when you tell it to.
  • Monitor patch compliance to verify the patch state of your instances.

Systems Manager is a collection of capabilities that helps you automate management tasks for AWS-hosted instances on EC2 and your on-premises servers. In this post, I use Systems Manager for two purposes: to run remote commands and apply operating system patches. To learn about the full capabilities of Systems Manager, see What Is Amazon EC2 Systems Manager?

Patch management is an important measure to prevent malware from infecting your systems. Most malware attacks look for vulnerabilities that are publicly known and in most cases are already patched by the maker of the operating system. These publicly known vulnerabilities are well documented and therefore easier for an attacker to exploit than having to discover a new vulnerability.

Patches for these new vulnerabilities are available through Systems Manager within hours after Microsoft releases them. There are two prerequisites to use Systems Manager to apply operating system patches. First, you must attach the IAM role you created in the previous section, EC2SSM, to your EC2 instance. Second, you must install the Systems Manager agent on your EC2 instance. If you have used a recent Microsoft Windows Server 2012 R2 AMI published by AWS, Amazon has already installed the Systems Manager agent on your EC2 instance. You can confirm this by logging in to an EC2 instance and looking for Amazon SSM Agent under Programs and Features in Windows. To install the Systems Manager agent on an instance that does not have the agent preinstalled or if you want to use the Systems Manager agent on your on-premises servers, see the documentation about installing the Systems Manager agent. If you forgot to attach the newly created role when launching your EC2 instance or if you want to attach the role to already running EC2 instances, see Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI or use the AWS Management Console.

To make sure your EC2 instance receives operating system patches from Systems Manager, you will use the default patch baseline provided and maintained by AWS, and you will define a maintenance window so that you control when your EC2 instances should receive patches. For the maintenance window to be able to run any tasks, you also must create a new role for Systems Manager. This role is a different kind of role than the one you created earlier: Systems Manager will use this role instead of EC2. Earlier we created the EC2SSM role with the AmazonEC2RoleforSSM policy, which allowed the Systems Manager agent on our instance to communicate with the Systems Manager service. Here we need a new role with the policy AmazonSSMMaintenanceWindowRole to make sure the Systems Manager service is able to execute commands on our instance.

Create the Systems Manager IAM role

To create the new IAM role for Systems Manager, follow the same procedure as in the previous section, but in Step 3, choose the AmazonSSMMaintenanceWindowRole policy instead of the previously selected AmazonEC2RoleforSSM policy.

Screenshot of creating the new IAM role for Systems Manager

Finish the wizard and give your new role a recognizable name. For example, I named my role MaintenanceWindowRole.

Screenshot of finishing the wizard and giving your new role a recognizable name

By default, only EC2 instances can assume this new role. You must update the trust policy to enable Systems Manager to assume this role.

To update the trust policy associated with this new role:

  1. Navigate to the IAM console and choose Roles in the navigation pane.
  2. Choose MaintenanceWindowRole and choose the Trust relationships tab. Then choose Edit trust relationship.
  3. Update the policy document by copying the following policy and pasting it in the Policy Document box. As you can see, I have added the ssm.amazonaws.com service to the list of allowed Principals that can assume this role. Choose Update Trust Policy.
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Sid":"",
             "Effect":"Allow",
             "Principal":{
                "Service":[
                   "ec2.amazonaws.com",
                   "ssm.amazonaws.com"
               ]
             },
             "Action":"sts:AssumeRole"
          }
       ]
    }
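If you prefer the AWS CLI over the console for this step, you can apply the same trust policy with a single command, assuming you saved the policy document above to a local file named trustpolicy.json (a name chosen here for illustration).

$ aws iam update-assume-role-policy --role-name MaintenanceWindowRole --policy-document file://trustpolicy.json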

Associate a Systems Manager patch baseline with your instance

Next, you are going to associate a Systems Manager patch baseline with your EC2 instance. A patch baseline defines which patches Systems Manager should apply. You will use the default patch baseline that AWS manages and maintains. Before you can associate the patch baseline with your instance, though, you must determine if Systems Manager recognizes your EC2 instance.

Navigate to the EC2 console, scroll down to Systems Manager Shared Resources in the navigation pane, and choose Managed Instances. Your new EC2 instance should be available there. If your instance is missing from the list, verify the following:

  1. Go to the EC2 console and verify your instance is running.
  2. Select your instance and confirm you attached the Systems Manager IAM role, EC2SSM.
  3. Make sure that you deployed a NAT gateway in your public subnet to ensure your VPC reflects the diagram at the start of this post so that the Systems Manager agent can connect to the Systems Manager internet endpoint.
  4. Check the Systems Manager Agent logs for any errors.

Now that you have confirmed that Systems Manager can manage your EC2 instance, it is time to associate the AWS maintained patch baseline with your EC2 instance:

  1. Choose Patch Baselines under Systems Manager Services in the navigation pane of the EC2 console.
  2. Choose the default patch baseline as highlighted in the following screenshot, and choose Modify Patch Groups in the Actions drop-down.
    Screenshot of choosing Modify Patch Groups in the Actions drop-down
  3. In the Patch group box, enter the same value you entered under the Patch Group tag of your EC2 instance in “Step 1: Launch an EC2 instance.” In this example, the value I enter is Windows Servers. Choose the check mark icon next to the patch group and choose Close.
    Screenshot of modifying the patch group

Define a maintenance window

Now that you have successfully set up a role and have associated a patch baseline with your EC2 instance, you will define a maintenance window so that you can control when your EC2 instances should receive patches. By creating multiple maintenance windows and assigning them to different patch groups, you can make sure your EC2 instances do not all reboot at the same time. The Patch Group resource tag you defined earlier will determine to which patch group an instance belongs.

To define a maintenance window:

  1. Navigate to the EC2 console, scroll down to Systems Manager Shared Resources in the navigation pane, and choose Maintenance Windows. Choose Create a Maintenance Window.
    Screenshot of starting to create a maintenance window in the Systems Manager console
  2. Select the Cron schedule builder to define the schedule for the maintenance window. In the example in the following screenshot, the maintenance window will start every Saturday at 10:00 P.M. UTC.
  3. To specify when your maintenance window will end, specify the duration. In this example, the four-hour maintenance window will end on the following Sunday morning at 2:00 A.M. UTC (in other words, four hours after it started).
  4. Systems Manager completes all tasks that are in process, even if the maintenance window ends. In my example, I am choosing to prevent new tasks from starting within one hour of the end of my maintenance window because I estimated my patch operations might take longer than one hour to complete. Confirm the creation of the maintenance window by choosing Create maintenance window.
    Screenshot of completing all boxes in the maintenance window creation process
  5. After creating the maintenance window, you must register the EC2 instance to the maintenance window so that Systems Manager knows which EC2 instance it should patch in this maintenance window. To do so, choose Register new targets on the Targets tab of your newly created maintenance window. You can register your targets by using the same Patch Group tag you used before to associate the EC2 instance with the AWS-provided patch baseline.
    Screenshot of registering new targets
  6. Assign a task to the maintenance window that will install the operating system patches on your EC2 instance:
    1. Open Maintenance Windows in the EC2 console, select your previously created maintenance window, choose the Tasks tab, and choose Register run command task from the Register new task drop-down.
    2. Choose the AWS-RunPatchBaseline document from the list of available documents.
    3. For Parameters:
      1. For Role, choose the role you created previously (called MaintenanceWindowRole).
      2. For Execute on, specify how many EC2 instances Systems Manager should patch at the same time. If you have a large number of EC2 instances and want to patch all EC2 instances within the defined time, make sure this number is not too low. For example, if you have 1,000 EC2 instances, a maintenance window of 4 hours, and 2 hours’ time for patching, make this number at least 500.
      3. For Stop after, specify after how many errors Systems Manager should stop.
      4. For Operation, choose Install to make sure to install the patches.
        Screenshot of stipulating maintenance window parameters

Now, you must wait for the maintenance window to run at least once according to the schedule you defined earlier. Note that if you don’t want to wait, you can adjust the schedule to run sooner by choosing Edit maintenance window on the Maintenance Windows page of Systems Manager. If your maintenance window has expired, you can check the status of any maintenance tasks Systems Manager has performed on the Maintenance Windows page of Systems Manager and select your maintenance window.

Screenshot of the maintenance window successfully created

Monitor patch compliance

You also can see the overall patch compliance of all EC2 instances that are part of defined patch groups by choosing Patch Compliance under Systems Manager Services in the navigation pane of the EC2 console. You can filter by Patch Group to see how many EC2 instances within the selected patch group are up to date, how many EC2 instances are missing updates, and how many EC2 instances are in an error state.

Screenshot of monitoring patch compliance

In this section, you have set everything up for patch management on your instance. Now you know how to patch your EC2 instance in a controlled manner and how to check if your EC2 instance is compliant with the patch baseline you have defined. Of course, I recommend that you apply these steps to all EC2 instances you manage.

Summary

In Part 1 of this blog post, I have shown how to configure EC2 instances for use with Systems Manager, EBS Snapshot Scheduler, and Amazon Inspector. I also have shown how to use Systems Manager to keep your Microsoft Windows–based EC2 instances up to date. In Part 2 of this blog post tomorrow, I will show how to take regular snapshots of your data by using EBS Snapshot Scheduler and how to use Amazon Inspector to check if your EC2 instances running Microsoft Windows contain any CVEs.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the EC2 forum or the Amazon Inspector forum, or contact AWS Support.

– Koen

Amazon EC2 Systems Manager Patch Manager now supports Linux

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-ec2-systems-manager-patch-manager-now-supports-linux/

Hot on the heels of some other great Amazon EC2 Systems Manager (SSM) updates is another vital enhancement: the ability to use Patch Manager on Linux instances!

We launched Patch Manager with SSM at re:Invent in 2016, and Linux support was a commonly requested feature. Starting today, Patch Manager supports:

  • Amazon Linux 2014.03 and later (2015.03 and later for 64-bit)
  • Ubuntu Server 16.04 LTS, 14.04 LTS, and 12.04 LTS
  • RHEL 6.5 and later (7.x and later for 64-Bit)

When I think about patching a big group of heterogeneous systems I get a little anxious. Years ago, I administered my school’s computer lab. This involved a modest group of machines running a small number of VMs with an immodest number of distinct Linux distros. When there was a critical security patch, it was a lot of work to remember the constraints of each system. I remember having to switch back and forth between arcane invocations of various package managers – pinning and unpinning packages: sudo yum update -y, rpm -Uvh ..., apt-get, or even emerge (one of our professors loved Gentoo).

Even now, when I use configuration management systems like Chef or Puppet I still have to specify the package manager and remember a portion of the invocation – and I don’t always want to roll out a patch without some manual approval process. Based on these experiences I decided it was time for me to update my skillset and learn to use Patch Manager.

Patch Manager is a fully managed service (provided at no additional cost) that helps you simplify your operating system patching process, including defining the patches you want to approve for deployment, the method of patch deployment, the timing for patch roll-outs, and determining patch compliance status across your entire fleet of instances. It’s extremely configurable with some sensible defaults and helps you easily deal with patching heterogeneous clusters.

Since I’m not running that school computer lab anymore my fleet is a bit smaller these days:

a list of instances with amusing names

As you can see above I only have a few instances in this region but if you look at the launch times they range from 2014 to a few minutes ago. I’d be willing to bet I’ve missed a patch or two somewhere (luckily most of these have strict security groups). To get started I installed the SSM agent on all of my machines by following the documentation here. I also made sure I had the appropriate role and IAM profile attached to the instances to talk to SSM – I just used this managed policy: AmazonEC2RoleforSSM.

Now I need to define a Patch Baseline. We’ll make security updates critical and all other updates informational and subject to my approval.

 

Next, I can run the AWS-RunPatchBaseline SSM Run Command in “Scan” mode to generate my patch baseline data.
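If you would rather trigger the scan from the CLI than from the console, a command along these lines should work. YourInstanceId is a placeholder, and swapping Operation to Install later applies the approved patches instead of only scanning for them.

$ aws ssm send-command --document-name "AWS-RunPatchBaseline" --instance-ids YourInstanceId --parameters '{"Operation":["Scan"]}'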

Then, we can go to the Patch Compliance page in the EC2 console and check out how I’m doing.

Yikes, looks like I need some security updates! Now, I can use Maintenance Windows, Run Command, or State Manager in SSM to actually manage this patching process. One thing to note: when patching is completed, your machine reboots – so managing that rollout with Maintenance Windows or State Manager is a best practice. If I had a larger set of instances I could group them by creating a tag named “Patch Group”.

For now, I’ll just use the same AWS-RunPatchBaseline Run Command command from above with the “Install” operation to update these machines.

As always, the CLIs and APIs have been updated to support these new options. The documentation is here. I hope you’re all able to spend less time patching and more time coding!

Randall

Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15)

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/160966148886



We’re excited to co-host the 10th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place on June 13 – 15 at the San Jose Convention Center. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop – such as data science, cloud and operations, IoT and applications – and has been aptly renamed the DataWorks Summit. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event:

  • Familiarize yourself with the cutting edge in Apache project developments from the committers
  • Learn from your peers and industry experts about innovative and real-world use cases, development and administration tips and tricks, success stories and best practices to leverage all your data – on-premise and in the cloud – to drive predictive analytics, distributed deep-learning and artificial intelligence initiatives
  • Attend one of our more than 170 technical deep dive breakout sessions from nearly 200 speakers across eight tracks
  • Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data and more
  • Attend the community showcase where you can network with sponsors and industry experts, including a host of startups and large companies like Microsoft, IBM, Oracle, HP, Dell EMC and Teradata

Similar to previous years, we look forward to continuing Yahoo’s decade-long tradition of thought leadership at this year’s summit. Join us for an in-depth look at Yahoo’s Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse and Distributed Deep Learning at the breakout sessions below. Or, stop by Yahoo kiosk #700 at the community showcase.

Also, as a co-host of the event, Yahoo is pleased to offer a 20% discount for the summit with the code MSPO20. Register here for Hadoop Summit, San Jose, California!


DAY 1. TUESDAY June 13, 2017


12:20 – 1:00 P.M. TensorFlowOnSpark – Scalable TensorFlow Learning On Spark Clusters

Andy Feng – VP Architecture, Big Data and Machine Learning

Lee Yang – Sr. Principal Engineer

In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which was open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.

2:10 – 2:50 P.M. Handling Kernel Upgrades at Scale – The Dirty Cow Story

Samy Gawande – Sr. Operations Engineer

Savitha Ravikrishnan – Site Reliability Engineer

Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of 100s of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips and tricks to deal with large scale kernel upgrade on heterogeneous platforms within tight timeframes with 100% uptime and no service or data loss through the Dirty COW use case (privilege escalation vulnerability found in the Linux Kernel in late 2016).

5:00 – 5:40 P.M. Data Highway Rainbow –  Petabyte Scale Event Collection, Transport, and Delivery at Yahoo

Nilam Sharma – Sr. Software Engineer

Huibing Yin – Sr. Software Engineer

This talk presents the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport and aggregated delivery as a service. Data Highway supports collection from multiple data centers & aggregated delivery in primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm and Kafka; with Storm & Kafka endpoints tailored towards latency sensitive consumers.


DAY 2. WEDNESDAY June 14, 2017


9:05 – 9:15 A.M. Yahoo General Session – Shaping Data Platform for Lasting Value

Sumeet Singh  – Sr. Director, Products

With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.

12:20 – 1:00 P.M. CaffeOnSpark Update – Recent Enhancements and Use Cases

Mridul Jain – Sr. Principal Engineer

Jun Shi – Principal Engineer

By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update audiences about the recent development of CaffeOnSpark. We will highlight new features and capabilities: a unified data layer which supports multi-label datasets, distributed LSTM training, interleaved testing with training, a monitoring/profiling framework, and Docker deployment.

12:20 – 1:00 P.M. Tez Shuffle Handler – Shuffling at Scale with Apache Hadoop

Jon Eagles – Principal Engineer  

Kuhu Shukla – Software Engineer

In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch, which supports multi-partition fetch to mitigate performance slowdowns, and provides deletion APIs to reduce disk usage for long-running Tez sessions. We will also outline the future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real-world jobs at scale.

2:10 – 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes

Thiruvel Thirumoolan – Principal Engineer

Francis Liu – Sr. Principal Engineer

At Yahoo!, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (i.e., batch, near real-time, ad hoc, etc.). We will walk through multi-tenancy features, explaining our motivation, how they work, and our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.

2:10 – 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse

Nick Huang – Director, Data Engineering, Yahoo Mail  

Saurabh Dixit – Sr. Principal Engineer, Yahoo Mail

Since 2014, the Yahoo Mail Data Engineering team took on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the role data plays in Yahoo Mail. In this session we will share our experience from this 3-year journey, covering the system architecture, the analytics systems we built, and the lessons learned from development and the drive for adoption.

DAY 3. THURSDAY June 15, 2017


2:10 – 2:50 P.M. OracleStore – A Highly Performant RawStore Implementation for Hive Metastore

Chris Drome – Sr. Principal Engineer  

Jin Sun – Principal Engineer

Today, Yahoo uses Hive in many different spaces, from ETL pipelines to ad hoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. In order for Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof-of-concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. Then, we implemented OracleStore with specific goals built-in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains that were realized with this alternative implementation. These include a reduction of 97%+ in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL.

3:00 P.M. – 3:40 P.M. Bullet – A Real Time Data Query Engine

Akshai Sarma – Sr. Software Engineer

Michael Natkovich – Director, Engineering

Bullet is an open sourced, lightweight, pluggable querying system for streaming data without a persistence layer implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.

3:00 P.M. – 3:40 P.M. Yahoo – Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez

Rohini Palaniswamy – Sr. Principal Engineer

Last year at Yahoo, we put great effort into scaling, stabilizing, and making Pig on Tez production ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen, such as a custom YARN ShuffleHandler, reworking DAG scheduling order, serialization changes, etc. We will also cover exciting new features that were added to Pig for performance, such as bloom join and byte code generation.

4:10 P.M. – 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning

Evans Ye – Software Engineer

Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together as well. In this presentation, we’ll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. The content of this talk includes the containerized CI framework, technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.

Register here for Hadoop Summit, San Jose, California with a 20% discount code MSPO20

Questions? Feel free to reach out to us at [email protected]. Hope to see you there!

Tips for Migrating to Apache HBase on Amazon S3 from HDFS

Post Syndicated from Bruno Faria original https://aws.amazon.com/blogs/big-data/tips-for-migrating-to-apache-hbase-on-amazon-s3-from-hdfs/

Starting with Amazon EMR 5.2.0, you have the option to run Apache HBase on Amazon S3. Running HBase on S3 gives you several added benefits, including lower costs, data durability, and easier scalability.

HBase provides several options that you can use to migrate and back up HBase tables. The steps to migrate to HBase on S3 are similar to the steps for HBase on the Apache Hadoop Distributed File System (HDFS). However, the migration can be easier if you are aware of some minor differences and a few “gotchas.”

In this post, I describe how to use some of the common HBase migration options to get started with HBase on S3.

HBase migration options

Selecting the right migration method and tools is an important step in ensuring a successful HBase table migration. However, choosing the right ones is not always an easy task.

The following HBase feature and utilities help you migrate to HBase on S3:

  • snapshots
  • Export and Import
  • CopyTable

The following diagram summarizes the steps for each option.

Various factors determine the HBase migration method that you use. For example, EMR offers HBase version 1.2.3 as the earliest version that you can run on S3. Therefore, the HBase version that you’re migrating from can be an important factor in helping you decide. For more information about HBase versions and compatibility, see the HBase version number and compatibility documentation in the Apache HBase Reference Guide.

If you’re migrating from an older version of HBase (for example, HBase 0.94), you should test your application to make sure it’s compatible with newer HBase API versions. You don’t want to spend several hours migrating a large table only to find out that your application and API have issues with a different HBase version.

The good news is that HBase provides utilities that you can use to migrate only part of a table. This lets you test your existing HBase applications without having to fully migrate entire HBase tables. For example, you can use the Export, Import, or CopyTable utilities to migrate a small part of your table to HBase on S3. After you confirm that your application works with newer HBase versions, you can proceed with migrating the entire table using HBase snapshots.

Option 1: Migrate to HBase on S3 using snapshots

You can create table backups easily by using HBase snapshots. HBase also provides the ExportSnapshot utility, which lets you export snapshots to a different location, like S3. In this section, I discuss how you can combine snapshots with ExportSnapshot to migrate tables to HBase on S3.

For details about how you can use HBase snapshots to perform table backups, see Using HBase Snapshots in the Amazon EMR Release Guide and HBase Snapshots in the Apache HBase Reference Guide. These resources provide additional settings and configurations that you can use with snapshots and ExportSnapshot.

The following example shows how to use snapshots to migrate HBase tables to HBase on S3.

Note: Earlier HBase versions, like HBase 0.94, have a different snapshot structure than HBase 1.x, which is what you’re migrating to. If you’re migrating from HBase 0.94 using snapshots, you get a TableInfoMissingException error when you try to restore the table. For details about migrating from HBase 0.94 using snapshots, see the Migrating from HBase 0.94 section.

  1. From the source HBase cluster, create a snapshot of your table:
    $ echo "snapshot '<table_name>', '<snapshot_name>'" | hbase shell

  2. Export the snapshot to an S3 bucket:
    $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshot_name> -copy-to s3://<HBase_on_S3_root_dir>/

    For the -copy-to parameter in the ExportSnapshot utility, specify the S3 location that you are using for the HBase root directory of your EMR cluster. If your cluster is already up and running, you can find its S3 hbase.rootdir value by viewing the cluster’s Configurations in the EMR console, or by using the AWS CLI. Here’s the command to find that value:

    $ aws emr describe-cluster --cluster-id <cluster_id> | grep hbase.rootdir

  3. Launch an EMR cluster that uses the S3 storage option with HBase (skip this step if you already have one up and running). For detailed steps, see Creating a Cluster with HBase Using the Console in the Amazon EMR Release Guide. When launching the cluster, ensure that the HBase root directory is set to the same S3 location as your exported snapshots (that is, the location used in the -copy-to parameter in the previous step). A CLI sketch of launching such a cluster appears after this list.
  4. Restore or clone the HBase table from that snapshot.
    • To restore the table and keep the same table name as the source table, use restore_snapshot:
      $ echo "restore_snapshot '<SNAPSHOT_NAME>'"| hbase shell

    • To restore the table into a different table name, use clone_snapshot:
      $ echo "clone_snapshot '<snapshot_name>', '<table_name>'" | hbase shell

Migrating from HBase 0.94 using snapshots

If you’re migrating from HBase version 0.94 using the snapshot method, you get an error if you try to restore from the snapshot. This is because the structure of a snapshot in HBase 0.94 is different from the snapshot structure in HBase 1.x.

The following steps show how to fix an HBase 0.94 snapshot so that it can be restored to an HBase on S3 table.

  1. Complete steps 1–3 in the previous example to create and export a snapshot.
  2. From your destination cluster, follow these steps to repair the snapshot:
    • Use s3-dist-cp to copy the snapshot data (archive) directory into a new directory. The archive directory contains your snapshot data. Depending on your table size, it might be large. Use s3-dist-cp to make this step faster:
      $ s3-dist-cp --src s3://<HBase_on_S3_root_dir>/.archive/<table_name> --dest s3://<HBase_on_S3_root_dir>/archive/data/default/<table_name>

    • Create and fix the snapshot descriptor file:
      $ hdfs dfs -mkdir s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc
      
      $ hdfs dfs -mv s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tableinfo.<*> s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc

  3. Restore the snapshot:
    $ echo "restore_snapshot '<snapshot_name>'" | hbase shell

Option 2: Migrate to HBase on S3 using Export and Import

As I discussed in the earlier sections, HBase snapshots and ExportSnapshot are great options for migrating tables. But sometimes you want to migrate only part of a table, so you need a different tool. In this section, I describe how to use the HBase Export and Import utilities.

The steps to migrate a table to HBase on S3 using Export and Import are not much different from the steps provided in the HBase documentation. In those docs, you can also find detailed information, including how to use these utilities to migrate only part of a table.

The following steps show how you can use Export and Import to migrate a table to HBase on S3.

  1. From your source cluster, export the HBase table:
    $ hbase org.apache.hadoop.hbase.mapreduce.Export <table_name> s3://<table_s3_backup>/<location>/

  2. In the destination cluster, create the target table into which to import data. Ensure that the column families in the target table are identical to the exported/source table’s column families.
  3. From the destination cluster, import the table using the Import utility:
    $ hbase org.apache.hadoop.hbase.mapreduce.Import '<table_name>' s3://<table_s3_backup>/<location>/

HBase snapshots are usually the recommended method to migrate HBase tables. However, the Export and Import utilities can be useful for test use cases in which you migrate only a small part of your table and test your application. It’s also handy if you’re migrating from an HBase cluster that does not have the HBase snapshots feature.
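
For example, the Export utility accepts optional version-count, start-time, and end-time arguments, so you can copy just a slice of a table for testing. The following is only a sketch; the table name, bucket, and epoch-millisecond timestamps are placeholders:

$ hbase org.apache.hadoop.hbase.mapreduce.Export \
  <table_name> s3://<table_s3_backup>/<location>/ \
  1 1483228800000 1485907200000   # 1 version per cell, limited to data written in this time range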

Option 3: Migrate to HBase on S3 using CopyTable

Similar to the Export and Import utilities, CopyTable is an HBase utility that you can use to copy part of HBase tables. However, keep in mind that CopyTable doesn’t work if you’re copying or migrating tables between HBase versions that are not wire compatible (for example, copying from HBase 0.94 to HBase 1.x).

For more information and examples, see CopyTable in the HBase documentation.
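
To give a feel for the syntax, here is a rough sketch of copying a row range from a source cluster into a renamed table on a destination cluster. The table names, row keys, and the destination cluster’s ZooKeeper quorum are placeholders:

$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --startrow=<start_row_key> --stoprow=<stop_row_key> \
  --new.name=<destination_table_name> \
  --peer.adr=<destination_zk_quorum>:2181:/hbase \
  <source_table_name>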

Conclusion

In this post, I demonstrated how you can use common HBase backup utilities to migrate your tables easily to HBase on S3. By using HBase snapshots, you can migrate entire tables to HBase on S3. To test HBase on S3 by migrating or copying only part of your tables, you can use the HBase Export, Import, or CopyTable utilities.

If you have questions or suggestions, please comment below.

 


About the Author

Bruno Faria is an EMR Solution Architect with AWS. He works with our customers to provide them architectural guidance for running complex applications on Amazon EMR. In his spare time, he enjoys spending time with his family and learning about new big data solutions.

 


Related

Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3

Meet the Amazon EMR Team this Friday at a Tech Talk & Networking Event in Mountain View

Post Syndicated from Jonathan Fritz original https://aws.amazon.com/blogs/big-data/meet-the-amazon-emr-team-this-friday-at-a-tech-talk-networking-event-in-mountain-view/

Want to change the world with Big Data and Analytics? Come join us on the Amazon EMR team in Amazon Web Services!

Meet the Amazon EMR team this Friday, April 7th, from 5:00 – 7:30 PM at Michael’s at Shoreline in Mountain View. We’ll feature short tech talks by EMR leadership, who will talk about the past, present, and future of the Apache Hadoop and Spark ecosystem and EMR. You’ll also meet EMR engineers who are eager to discuss the challenges and opportunities involved in building the EMR service and running the latest open-source big data frameworks like Spark and Presto at massive scale. We’ll give out several door prizes, including an Amazon Echo with an Amazon Dot, Kindle, and Fire TV Stick!

Amazon EMR is a web service which enables customers to run massive clusters with distributed big data frameworks like Apache Hadoop, Hive, Tez, Flink, Spark, Presto, HBase and more, with the ability to effortlessly scale up and down as needed. We run a large number of customer clusters, enabling processing on vast datasets.

We are developing innovative new features including our next-generation cluster management system, improvements for real-time processing of big data, and ways to enable customers to more easily interact with their data. We’re looking for top engineers to build them from the ground up.

Here are sample features that we have recently delivered:

Interested? We hope you can make it! Please RSVP on Eventbrite.

AWS Achieves FedRAMP Authorization for New Services in the AWS GovCloud (US) Region

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/aws-achieves-fedramp-authorization-for-a-wide-array-of-services/

Today, we’re pleased to announce an array of AWS services that are available in the AWS GovCloud (US) Region and have achieved Federal Risk and Authorization Management Program (FedRAMP) High authorizations. The FedRAMP Joint Authorization Board (JAB) has issued Provisional Authority to Operate (P-ATO) approvals, which are effective immediately. If you are a federal or commercial customer, you can use these services to process and store your critical workloads in the AWS GovCloud (US) Region’s authorization boundary with data up to the high impact level.

The services newly available in the AWS GovCloud (US) Region include database, storage, data warehouse, security, and configuration automation solutions that will help you increase your ability to manage data in the cloud. For example, with AWS CloudFormation, you can deploy AWS resources by automating configuration processes. AWS Key Management Service (KMS) enables you to create and control the encryption keys used to secure your data. Amazon Redshift enables you to analyze all your data cost effectively by using existing business intelligence tools to automate common administrative tasks for managing, monitoring, and scaling your data warehouse.

Our federal and commercial customers can now leverage our FedRAMP P-ATO to access the following services:

  • CloudFormation – CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion. You can use sample templates in CloudFormation, or create your own templates to describe the AWS resources and any associated dependencies or run-time parameters required to run your application.
  • Amazon DynamoDB – Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit-millisecond latency at any scale. It is a fully managed cloud database and supports both document and key-value store models.
  • Amazon EMR – Amazon EMR provides a managed Hadoop framework that makes it efficient and cost effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR, and interact with data in other AWS data stores such as Amazon S3 and DynamoDB.
  • Amazon Glacier – Amazon Glacier is a secure, durable, and low-cost cloud storage service for data archiving and long-term backup. Customers can reliably store large or small amounts of data for as little as $0.004 per gigabyte per month, a significant savings compared to on-premises solutions.
  • KMS – KMS is a managed service that makes it easier for you to create and control the encryption keys used to encrypt your data, and uses Hardware Security Modules (HSMs) to protect the security of your keys. KMS is integrated with other AWS services to help you protect the data you store with these services. For example, KMS is integrated with CloudTrail to provide you with logs of all key usage and help you meet your regulatory and compliance needs.
  • Redshift – Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost effective to analyze all your data by using your existing business intelligence tools.
  • Amazon Simple Notification Service (SNS) – Amazon SNS is a fast, flexible, fully managed push notification service that lets you send individual messages or “fan out” messages to large numbers of recipients. SNS makes it simple and cost effective to send push notifications to mobile device users and email recipients or even send messages to other distributed services.
  • Amazon Simple Queue Service (SQS) – Amazon SQS is a fully managed message queuing service for reliably communicating among distributed software components and microservices—at any scale. Using SQS, you can send, store, and receive messages between software components at any volume, without losing messages or requiring other services to be always available.
  • Amazon Simple Workflow Service (SWF) – Amazon SWF helps developers build, run, and scale background jobs that have parallel or sequential steps. SWF is a fully managed state tracker and task coordinator in the cloud.

AWS works closely with the FedRAMP Program Management Office (PMO), National Institute of Standards and Technology (NIST), and other federal regulatory and compliance bodies to ensure that we provide you with the cutting-edge technology you need in a secure and compliant fashion. We are working with our authorizing officials to continue to expand the scope of our authorized services, and we are fully committed to ensuring that AWS GovCloud (US) continues to offer government customers the most comprehensive mix of functionality and security.

– Chad

Month in Review: December 2016

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/month-in-review-december-2016/

Another month of big data solutions on the Big Data Blog.

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. In this post, walk through the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.

Amazon Redshift Engineering’s Advanced Table Design Playbook
Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. In practice, the best way to improve query performance by orders of magnitude is by tuning Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings and configuring tables for data durability and recovery purposes.

Interactive Analysis of Genomic Datasets Using Amazon Athena
In this post, learn to prepare genomic data for analysis with Amazon Athena. We’ll demonstrate how Athena is well-adapted to address common genomics query paradigms using the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study. Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

Joining and Enriching Streaming Data on Amazon Kinesis
In this blog post, learn three approaches for joining and enriching streaming data on Amazon Kinesis Streams by using Amazon Kinesis Analytics, AWS Lambda, and Amazon DynamoDB.

Using SaltStack to Run Commands in Parallel on Amazon EMR
SaltStack is an open source project for automation and configuration management. It started as a remote execution engine designed to scale to many machines while delivering high-speed execution. You can now use the new bootstrap action that installs SaltStack on Amazon EMR. It provides a basic configuration that enables selective targeting of the nodes based on instance roles, instance groups, and other parameters.

Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway
Amazon Game Studios’ new title Breakaway is an online 4v4 team battle sport that delivers fast action, teamwork, and competition. In this post, learn the technical details of how the Breakaway team uses AWS to collect, process, and analyze gameplay telemetry to answer questions about arena design.

Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events
With new support for Amazon EMR in Amazon CloudWatch Events, you can be notified quickly and programmatically respond to state changes in your EMR clusters. Additionally, these events are also displayed in the Amazon EMR console. CloudWatch Events allows you to create filters and rules to match these events and route them to Amazon SNS topics, AWS Lambda functions, Amazon SQS queues, streams in Amazon Kinesis Streams, or built-in targets.

Run Jupyter Notebook and JupyterHub on Amazon EMR
Data scientists who run Jupyter and JupyterHub on Amazon EMR can use Python, R, Julia, and Scala to process, analyze, and visualize big data stored in Amazon S3. Jupyter notebooks can be saved to S3 automatically, so users can shut down and launch new EMR clusters, as needed. See how EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets.

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
In this post, see how you can build a business intelligence capability for streaming IoT device data using AWS serverless and managed services. You can be up and running in minutes―starting small, but able to easily grow to millions of devices and billions of messages.

Serving Real-Time Machine Learning Predictions on Amazon EMR
The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. In this post, we highlight one way to host those recommendations using Amazon EMR with JobServer.

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
In this post, learn to generate a predictive model for flight delays that can be used to help pick the flight least likely to add to your travel stress. To accomplish this, you’ll use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models.

FROM THE ARCHIVE

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
Sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. Sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API. This short post shows you how to run RStudio and sparklyr on EMR.


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

EC2 Systems Manager – Configure & Manage EC2 and On-Premises Systems

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-systems-manager-configure-manage-ec2-and-on-premises-systems/

Last year I introduced you to the EC2 Run Command and showed you how to use it to do remote instance management at scale, first for EC2 instances and then in hybrid and cross-cloud environments. Along the way we added support for Linux instances, making EC2 Run Command a widely applicable and incredibly useful administration tool.

Welcome to the Family
Werner announced the EC2 Systems Manager at AWS re:Invent and I’m finally getting around to telling you about it!

This is a new management service that includes an enhanced version of EC2 Run Command along with eight other equally useful functions. Like EC2 Run Command, it supports hybrid and cross-cloud environments composed of instances and services running Windows and Linux. You simply open up the AWS Management Console, select the instances that you want to manage, and define the tasks that you want to perform (API and CLI access is also available).

Here’s an overview of the improvements and new features:

Run Command – Now allows you to control the velocity of command executions, and to stop issuing commands if the error rate grows too high.

State Manager – Maintains a defined system configuration via policies that are applied at regular intervals.

Parameter Store – Provides centralized (and optionally encrypted) storage for license keys, passwords, user lists, and other values.

Maintenance Window – Specify a time window for installation of updates and other system maintenance.

Software Inventory – Gathers a detailed software and configuration inventory (with user-defined additions) from each instance.

AWS Config Integration – In conjunction with the new software inventory feature, AWS Config can record software inventory changes to your instances.

Patch Management – Simplify and automate the patching process for your instances.

Automation – Simplify AMI building and other recurring AMI-related tasks.

Let’s take a look at each one…

Run Command Improvements
You can now control the number of concurrent command executions. This can be useful in situations where the command references a shared, limited resource such as an internal update or patch server and you want to avoid overloading it with too many requests.

This feature is currently accessible from the CLI and from the API. Here’s a CLI example that limits the number of concurrent executions to 2:

$ aws ssm send-command \
  --instance-ids "i-023c301591e6651ea" "i-03cf0fc05ec82a30b" "i-09e4ed09e540caca0" "i-0f6d1fe27dc064099" \
  --document-name "AWS-RunShellScript" \
  --comment "Run a shell script or specify the commands to run." \
  --parameters commands="date" \
  --timeout-seconds 600 --output-s3-bucket-name "jbarr-data" \
  --region us-east-1 --max-concurrency 2

Here’s a more interesting variant that is driven by tags and tag values by specifying --targets instead of --instance-ids:

$ aws ssm send-command \
  --targets "Key=tag:Mode,Values=Production" ... 

You can also stop issuing commands if they are returning errors, with the option to specify either a maximum number of errors or a failure rate:

$ aws ssm send-command --max-errors 5 ... 
$ aws ssm send-command --max-errors 5% ...

State Manager
State Manager helps to keep your instances in a desired state, as defined by a document. You create the document, associate it with a set of target instances, and then create an association to specify when and how often the document should be applied. Here’s a document that updates the message of the day file:
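
The console screenshot of that document isn’t reproduced here; a minimal command document along these lines might look like the following sketch (the document name and the message text are placeholders, not the values from the screenshot):

# Write the document content to a file, then register it with Systems Manager
$ cat > update-motd.json <<'EOF'
{
  "schemaVersion": "1.2",
  "description": "Update the message of the day file",
  "runtimeConfig": {
    "aws:runShellScript": {
      "properties": [
        {
          "id": "0.aws:runShellScript",
          "runCommand": ["echo 'Managed by State Manager' > /etc/motd"]
        }
      ]
    }
  }
}
EOF

$ aws ssm create-document --name "UpdateMotd" --content file://update-motd.json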

And here’s the association (this one uses tags so that it applies to current instances and to others that are launched later and are tagged in the same way):
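
In CLI terms, a tag-based association like the one in the screenshot can be created roughly as follows (the document name, tag key, and schedule are placeholder values):

$ aws ssm create-association \
  --name "UpdateMotd" \
  --targets "Key=tag:Mode,Values=Production" \
  --schedule-expression "rate(30 minutes)"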

Specifying targets using tags makes the association future-proof, and allows it to work as expected in dynamic, auto-scaled environments. I can see all of my associations, and I can run the new one by selecting it and clicking on Apply Association Now:

Parameter Store
This feature simplifies storage and management for license keys, passwords, and other data that you want to distribute to your instances. Each parameter has a type (string, string list, or secure string), and can be stored in encrypted form. Here’s how I create a parameter:

And here’s how I reference the parameter in a command:
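
As a rough CLI sketch of those two steps (the parameter name and the install script are placeholders of my own; use --type SecureString for values that should be stored encrypted):

# Create a parameter
$ aws ssm put-parameter \
  --name "license-key" \
  --type "String" \
  --value "<your-license-key>"

# Reference the parameter from Run Command with the {{ssm:parameter-name}} syntax
$ aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:Mode,Values=Production" \
  --parameters 'commands=["/opt/example/install.sh --license {{ssm:license-key}}"]'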

Maintenance Window
This feature allows specification of a time window for installation of updates and other system maintenance. Here’s how I create a weekly time window that opens for four hours every Saturday:

After I create the window I need to assign a set of instances to it. I can do this by instance Id or by tag:

And then I need to register a task to perform during the maintenance window. For example, I can run a Linux shell script:
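
Substituting a CLI sketch for the console screenshots, the three steps look roughly like this (the window name, schedule, tags, IDs, and IAM role are placeholders):

# 1. Create a 4-hour window that opens every Saturday at 06:00 UTC
$ aws ssm create-maintenance-window \
  --name "WeeklySaturdayWindow" \
  --schedule "cron(0 6 ? * SAT *)" \
  --duration 4 --cutoff 1 \
  --allow-unassociated-targets

# 2. Assign instances to the window by tag (or by instance ID)
$ aws ssm register-target-with-maintenance-window \
  --window-id "mw-0123456789abcdef0" \
  --resource-type "INSTANCE" \
  --targets "Key=tag:Mode,Values=Production"

# 3. Register a Run Command task that runs a shell script during the window
$ aws ssm register-task-with-maintenance-window \
  --window-id "mw-0123456789abcdef0" \
  --targets "Key=WindowTargetIds,Values=<window-target-id>" \
  --task-arn "AWS-RunShellScript" \
  --task-type "RUN_COMMAND" \
  --service-role-arn "arn:aws:iam::<account-id>:role/<maintenance-window-role>" \
  --max-concurrency 2 --max-errors 1 \
  --task-parameters '{"commands":{"Values":["yum -y update"]}}'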

Software Inventory
This feature collects information about software and settings for a set of instances. To access it, I click on Managed Instances and Setup Inventory:

Setting up the inventory creates an association between an AWS-owned document and a set of instances. I simply choose the targets, set the schedule, and identify the types of items to be inventoried, then click on Setup Inventory:
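
Behind the console workflow, this is an association with the AWS-owned AWS-GatherSoftwareInventory document; a rough CLI equivalent (the tag and schedule are placeholders) is:

$ aws ssm create-association \
  --name "AWS-GatherSoftwareInventory" \
  --targets "Key=tag:Mode,Values=Production" \
  --schedule-expression "rate(1 day)"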

After the inventory runs, I can select an instance and then click on the Inventory tab in order to inspect the results:

The results can be filtered for further analysis. For example, I can narrow down the list of AWS Components to show only development tools and libraries:

I can also run inventory-powered queries across all of the managed instances. Here’s how I can find Windows Server 2012 R2 instances that are running a version of .NET older than 4.6:

AWS Config Integration
The results of the inventory can be routed to AWS Config, allowing you to track changes to the applications, AWS components, instance information, network configuration, and Windows Updates over time. To access this information, I click on Managed instance information above the Config timeline for the instance:

The three lines at the bottom lead to the inventory information. Here’s the network configuration:

Patch Management
This feature helps you to keep the operating system on your Windows instances up to date. Patches are applied during maintenance windows that you define, and are done with respect to a baseline. The baseline specifies rules for automatic approval of patches based on classification and severity, along with an explicit list of patches to approve or reject.

Here’s my baseline:

Each baseline can apply to one or more patch groups. Instances within a patch group have a Patch Group tag. I named my group Win2016:

Then I associated the value with the baseline:
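
For reference, here is a rough CLI sketch of the baseline and patch-group setup shown in the screenshots (the baseline name, approval rules, IDs, and instance ID are placeholders):

# Create a baseline that auto-approves critical/important security updates after 7 days
$ aws ssm create-patch-baseline \
  --name "Win2016-Baseline" \
  --approval-rules 'PatchRules=[{PatchFilterGroup={PatchFilters=[{Key=MSRC_SEVERITY,Values=[Critical,Important]},{Key=CLASSIFICATION,Values=[SecurityUpdates]}]},ApproveAfterDays=7}]'

# Associate the baseline with the Win2016 patch group
$ aws ssm register-patch-baseline-for-patch-group \
  --baseline-id "pb-0123456789abcdef0" \
  --patch-group "Win2016"

# Instances join the group via the Patch Group tag
$ aws ec2 create-tags \
  --resources "i-0123456789abcdef0" \
  --tags "Key=Patch Group,Value=Win2016"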

The next step is to arrange to apply the patches during a maintenance window using the AWS-ApplyPatchBaseline document:

I can return to the list of Managed Instances and use a pair of filters to find out which instances are in need of patches:

Automation
Last but definitely not least, the Automation feature simplifies common AMI-building and updating tasks. For example, you can build a fresh Amazon Linux AMI each month using the AWS-UpdateLinuxAmi document:
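
A rough CLI sketch for kicking off that automation (the AMI ID, instance profile, and IAM role are placeholders):

$ aws ssm start-automation-execution \
  --document-name "AWS-UpdateLinuxAmi" \
  --parameters "SourceAmiId=<base-ami-id>,InstanceIamRole=<instance-profile-name>,AutomationAssumeRole=arn:aws:iam::<account-id>:role/<automation-service-role>"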

Here’s what happens when this automation is run:

Available Now
All of the EC2 Systems Manager features and functions that I described above are available now and you can start using them today at no charge. You pay only for the resources that you manage.

Jeff;

 

Month in Review: November 2016

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/month-in-review-november-2016/

Another month of big data solutions on the Big Data Blog.

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

Use Apache Flink on Amazon EMR
It is even easier to run Flink on AWS as it is now natively supported in Amazon EMR 5.1.0. EMR supports running Flink-on-YARN so you can create either a long-running cluster that accepts multiple jobs or a short-running Flink session in a transient cluster that helps reduce your costs by only charging you for the time that you use.

Scale Your Amazon Kinesis Stream Capacity with UpdateShardCount
With the new Amazon Kinesis Streams UpdateShardCount API operation, you can automatically scale your stream shard capacity by using Amazon CloudWatch alarms, Amazon SNS, and AWS Lambda. In this post, walk through an example of how you can automatically scale your shards using a few lines of code.

Build a Community of Analysts with Amazon QuickSight
In this post, learn how Amazon QuickSight can be used to share dashboards, analyses, and stories. Although fictitious, CoffeeCo, like many companies, benefits from distributing information to people who understand its context and can act on the insights that it contains. 

Dynamically Scale Applications on Amazon EMR with Auto Scaling
With new support for Auto Scaling in Amazon EMR releases 4.x and 5.x, customers can now add (scale out) and remove (scale in) nodes on a cluster more easily. Scaling actions are triggered automatically by Amazon CloudWatch metrics provided by EMR at 5 minute intervals, including several YARN metrics related to memory utilization, applications pending, and HDFS utilization.

Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3
By migrating to HBase on EMR using S3 for storage, FINRA has lowered its costs by 60%, decreased operational complexity, increased durability and availability, and have created a more scalable architecture.

Introducing the Data Lake Solution on AWS
Learn why a data lake on AWS can increase the flexibility and agility of your analytics.

Analyzing Data in S3 using Amazon Athena
Learn how to use Athena on logs from Elastic Load Balancers, generated as text files in a pre-defined format. We show you how to create a table, partition the data in a format used by Athena, convert it to Parquet, and compare query performance.

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR

Post Syndicated from Varun Rao Bhamidimarri original https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/

Varun Rao is a Big Data Architect for AWS Professional Services

Role-based access control (RBAC) is an important security requirement for multi-tenant Hadoop clusters. Enforcing this across always-on and transient clusters can be hard to set up and maintain.

Imagine an organization that has an RBAC matrix using Active Directory users and groups. They would like to manage it on a central security policy server and enforce it on all Hadoop clusters that are spun up on AWS. This policy server should also store access and audit information for compliance needs.

In this post, I provide the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.

Apache Ranger

Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, like NameNode and HiveServer2.

Architecture

Using the setup in the following diagram, multiple EMR clusters can sync policies with a standalone security policy server. The idea is similar to a shared Hive metastore that can be used across EMR clusters.

EMRRanger_1

Walkthrough

In this walkthrough, three users—analyst1, analyst2, and admin1—are set up for the initial authorization, as shown in the following diagram. Using the Ranger Admin UI, I show how to modify these access permissions. These changes are propagated to the EMR cluster and validated through Hue.

o_EMRRanger_2

To manage users, groups, and credentials, we will use Simple AD, a managed directory service offered by AWS Directory Service. A Windows EC2 instance will be set up to join the SimpleAD domain and load users and groups using a PowerShell script. A standalone security policy server (Ranger) and an EMR cluster will then be set up and configured. Finally, we will update the security policies and test the changes.

Prerequisites

The following steps assume that you have a VPC with at least two subnets, with NAT configured for private subnets. Also, verify that DNS Resolution (enableDnsSupport) and DNS Hostnames (enableDnsHostnames) are set to Yes on the VPC. The EC2 instance created in the steps below can be used as a bastion host if launched in a public subnet. If no public subnets are selected, you will need a bastion host or a VPN connection to log in to the Windows instance and access the web UI links (Hue, Ranger).

I have created AWS CloudFormation templates for each step and a nested CloudFormation template for single-click deployment launch_stack. If you use this nested CloudFormation template, skip to the “Testing the cluster” step after the stack has been successfully created.

To create each component individually, follow the steps below.

IMPORTANT: The templates use hard-coded username and passwords, and open security groups. They are not intended for production use without modification.

Setting up a SimpleAD server

Using this CloudFormation template, set up a SimpleAD server. To launch the stack directly through the console, use launch_stack. It takes the following parameters:

EMRRanger_1_1

CloudFormation output:

EMRRanger_Grid2

NOTE: SimpleAD creates two servers for high availability. For the following steps, you can use either of the two IP addresses.

Creating a Windows EC2 instance

To manage the SimpleAD server, set up a Windows instance. It is used to load LDAP users required to test the access policies. On instance startup, a PowerShell script is executed automatically to load users (analyst1, analyst2, admin1).

Using this CloudFormation template, set up this Windows instance. Select a public subnet if you want to use this as a bastion host to access Web UI (Hue, Ranger). To launch the stack directly through the console, use launch_stack. It takes the following parameters:

EMRRanger_3_2

You can specify either of the two SimpleAD IP addresses.

CloudFormation output:

EMRRanger_Grid4

Once stack creation is complete, use Remote Desktop to connect to this instance with the SimpleAD username (EmrSimpleAD\Administrator) and password ([email protected]) before moving to the next step.

NOTE: The instance initialization is longer than usual because of the SimpleAD Join and PowerShell scripts that need to be executed after the join.

Setting up the Ranger server

Now that SimpleAD has been created and the users loaded, you are ready to set up the security policy server (Ranger). This runs on a standard Amazon Linux instance and Ranger is installed and configured on startup.

Using this CloudFormation template, set up the Ranger server. To launch the stack directly through the console, use launch_stack. It takes the following parameters:

EMRRanger_5_1

CloudFormation output:

EMRRanger_Grid6

NOTE: The Ranger server syncs users with SimpleAD and enables LDAP authentication for the Admin UI. The default Ranger Admin password is not changed.

Creating an EMR cluster

Finally, it’s time to create the EMR cluster and configure it with the required plugins. You can use the AWS CLI or CloudFormation to create and configure the cluster. EMR security configurations are not currently supported by CloudFormation.

Using the AWS CLI to create a cluster

aws emr create-cluster --applications Name=Hive Name=Spark Name=Hue --tags 'Name=EMR-Security' \
--release-label emr-5.0.0 \
--ec2-attributes 'SubnetId=<subnet-xxxxx>,InstanceProfile=EMR_EC2_DefaultRole,KeyName=<key name>' \
--service-role EMR_DefaultRole \
--instance-count 4 \
--instance-type m3.2xlarge \
--log-uri '<s3 location for logging>' \
--name 'SecurityPOCCluster' --region us-east-1 \
--bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/scripts/download-scripts.sh","Args":["s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger"],"Name":"Download scripts"}]' \
--steps '[{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/updateHueLdapUrl.sh","<ip address of simple ad server>"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"UpdateHueLdapServer"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/install-hive-hdfs-ranger-policies.sh","<ranger host ip>","s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/inputdata"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"InstallRangerPolicies"},{"Args":["spark-submit","--deploy-mode","cluster","--class","org.apache.spark.examples.SparkPi","/usr/lib/spark/examples/jars/spark-examples.jar","10"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"SparkStep"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/install-hive-hdfs-ranger-plugin.sh","<ranger host ip>","0.6","s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"InstallRangerPlugin"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/loadDataIntoHDFS.sh","us-east-1"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"LoadHDFSData"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/createHiveTables.sh","us-east-1"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"CreateHiveTables"}]' \
--configurations '[{"Classification":"hue-ini","Properties":{},"Configurations":[{"Classification":"desktop","Properties":{},"Configurations":[{"Classification":"auth","Properties":{"backend":"desktop.auth.backend.LdapBackend"},"Configurations":[]},{"Classification":"ldap","Properties":{"bind_dn":"binduser","trace_level":"0","search_bind_authentication":"false","debug":"true","base_dn":"dc=corp,dc=emr,dc=local","bind_password":"[email protected]","ignore_username_case":"true","create_users_on_login":"true","ldap_username_pattern":"uid=<username>,cn=users,dc=corp,dc=emr,dc=local","force_username_lowercase":"true","ldap_url":"ldap://<ip address of simple ad server>","nt_domain":"corp.emr.local"},"Configurations":[{"Classification":"groups","Properties":{"group_filter":"objectclass=*","group_name_attr":"cn"},"Configurations":[]},{"Classification":"users","Properties":{"user_name_attr":"sAMAccountName","user_filter":"objectclass=*"},"Configurations":[]}]}]}]}]'

The LDAP-related configuration for HUE is passed using the --configurations option. For more information, see Configure Hue for LDAP Users and the EMR create-cluster CLI reference.

Using a CloudFormation template to create a cluster

This step requires some Hue configuration changes in the CloudFormation template. The IP address of the LDAP server (SimpleAD) needs to be updated.

  1. Open the template in CloudFormation Designer. For more information about how to modify a CloudFormation template, see Walkthrough: Use AWS CloudFormation Designer to Modify a Stack’s Template.
  2. Choose EMRSampleCluster.
  3. On the Properties section, update the value of ldap_url with the IP address of the SimpleAD server:
    "ldap_url": "ldap://<change it to the SimpleAD IP address>",
  4. On the Designer toolbar, choose Validate template to check for syntax errors in your template.
  5. Choose Create Stack.
  6. Update Stack name and the stack parameters.

CloudFormation parameters:

EMRRanger_7_1

CloudFormation output:

EMRRanger_Grid8

EMR steps are used to perform the following:

  • Install and configure Ranger HDFS and Hive plugins
  • Use the Ranger REST API to update repository and authorization policies (a rough sketch of such a call follows this list).
    NOTE: This step only needs to be executed the first time. New clusters do not need to include this step action.
  • Create Hive tables (tblAnalyst1 and tblAnalyst2) and copy sample data.
  • Create HDFS folders (/user/analyst1 and /user/analyst2) and copy sample data.
  • Run a SparkPi job using the spark submit action to verify the cluster setup.
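
For illustration only, a call that creates a Hive policy through Ranger’s public v2 REST API might look like the sketch below. The endpoint path, policy name, and JSON body are assumptions about that API rather than the exact script used by the EMR step:

$ curl -u admin:admin -X POST \
  -H "Content-Type: application/json" \
  "http://<ranger host ip>:6080/service/public/v2/api/policy" \
  -d '{
        "service": "hivedev",
        "name": "Analyst1Policy",
        "resources": {
          "database": {"values": ["default"]},
          "table":    {"values": ["tblanalyst1"]},
          "column":   {"values": ["*"]}
        },
        "policyItems": [
          {"users": ["analyst1"], "accesses": [{"type": "select", "isAllowed": true}]}
        ]
      }'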

To validate that all the step actions were executed successfully, view the Step section for the EMR cluster.

o_EMRRanger_3

NOTE: Cluster creation can take anywhere between 10-15 minutes.

Testing the cluster

Congratulations! You have successfully configured the EMR cluster with the ability to manage authorization policies, using Ranger. How do you know if it actually works? You can test this by accessing HDFS files and running Hive queries.

Using HDFS

Log in to Hue (URL: http://<master DNS or IP>:8888) as “analyst1” and try to delete a file owned by “analyst2”. For more information about how to access Hue, see Launch the Hue Web Interface. The Windows EC2 instance created in the previous steps can be used to access this without having to set up an SSH tunnel.

  1. Log in as user “analyst1” (password: [email protected]).
  2. Browse to the /user/analyst2 HDFS directory and move the file “football_coach_position.tsv” to trash.
  3. You should see a “Permission denied” error, which is expected.
    o_EMRRanger_4

Using Hive queries

Using the HUE SQL Editor, execute the following query.

These queries use external tables, and Hive leverages EMRFS to access the data stored in S3. Because HiveServer2 (where Hue is submitting these queries) is checking with Ranger to grant or deny before accessing any data in S3, you can create fine-grained SQL-based permissions for users even though there is a single EC2 role specified for the cluster (which is used by all requests the cluster makes to S3). For more information, see Additional Features of Hive on Amazon EMR.

SELECT * FROM default.tblanalyst1

This should return the results as expected. Now, run the following query:

SELECT * FROM default.tblanalyst2

You should see the following error:

o_EMRRanger_5

This makes sense. User analyst1 does not have table SELECT permissions on table tblanalyst2.

User analyst2 (default password: [email protected]) should see a similar error when accessing table tblanalyst1. User admin1 (default password: [email protected]) should be able to run both queries.

Updating the security policies

You have verified that the policies are being enforced. Now, let’s try to update them.

  1. Log in to the Ranger Admin UI server
    • URL: http://<ip address of the ranger server>:6080/login.jsp
    • Default admin username/password: admin/admin.
  2. View all the Ranger Hive policies by selecting “hivedev”
    o_EMRRanger_6
  3. Select the policy named “Analyst2Policy”
  4. Edit the policy by adding “analyst1” user with “select” permissions for table “tblanalyst2”
    EMRRanger_7
  5. Save the changes.

This policy change is pulled in by the Hive plugin on the EMR cluster. Give it at least 60 seconds for the policy refresh to happen.

Go back to Hue to test if this change has been propagated.

  1. Log back in to the Hue UI as user “analyst1” (see earlier steps).
  2. In the Hive SQL Editor, run the query that failed earlier:
    SELECT * FROM default.tblanalyst2

This query should now run successfully.

o_EMRRanger_8

Audits

Can you now find those who tried to access the Hive tables and see if they were “denied” or “allowed”?

  1. Log back in to the Ranger UI as admin (see earlier steps).
    URL: http://<ip address of the ranger server>:6080/login.jsp
  2. Choose Audit and filter by “analyst1”.
    • Analyst1 was denied SELECT access to the tblanalyst2 table.
      o_EMRRanger_9
    • After the policy change, the access was granted and logged.
      o_EMRRanger_10

The same audit information is also stored in SOLR for performing more complex and full-text searches. The SOLR instance is installed on the same instance as the Ranger server.

  • Open Solr UI:
    http://<ip-address-of-ranger-server>:8983/solr/#/ranger_audits/query
  • Perform a document search
    o_EMRRanger_11

Direct URL: http://<ip-address-of-ranger-server>:8983/solr/ranger_audits/select?q=*%3A*&wt=json&indent=true

Conclusion

In this post, I walked through the steps required to enable authorization and audit capabilities on EMR using Apache Ranger, with a centrally managed security policy server. I also covered the steps to automate this using CloudFormation templates.

Stay tuned for more posts about security on EMR. If you have questions or suggestions, please comment below.

For information about other EMR security aspects, see Jeff Barr’s posts:


About the author


Varun Rao is a Big Data Architect for AWS Professional Services.
He works with enterprise customers to define data strategy in the cloud. In his spare time, he tries to keep up with his 2-year-old.

 

 


Related

Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations


 

Use Spark 2.0, Hive 2.1 on Tez, and the latest from the Hadoop ecosystem on Amazon EMR release 5.0

Post Syndicated from Jonathan Fritz original https://blogs.aws.amazon.com/bigdata/post/Tx3KG7STXIZV5QZ/Use-Spark-2-0-Hive-2-1-on-Tez-and-the-latest-from-the-Hadoop-ecosystem-on-Amazon

Jonathan Fritz is a Senior Product Manager for Amazon EMR

We are excited to launch Amazon EMR release 5.0 today, giving customers the latest versions of 16 supported open-source applications in the big data ecosystem, including new major versions of Spark and Hive.

Almost exactly a year ago, we shipped release 4.0, which brought significant improvements to EMR. We based our build and packaging system on Apache Bigtop, moved to standard ports and paths, and streamlined application configuration with configuration objects. Our initial 4.0 release consolidated our set of supported Apache big data applications to Apache Hadoop, Apache Spark, Apache Hive, Apache Pig, and Apache Mahout.

Over the subsequent months, EMR added support for additional open-source projects, unlocking various use cases such as low-latency SQL over datasets in Amazon S3 with Presto, real-time data access and SQL analytics with Apache HBase and Phoenix, collaborative analysis for data science with notebooks in Apache Zeppelin, and designing complex processing workflows with Apache Oozie.

Also, we kept versions of most major projects up-to-date with each EMR release, such as offering the latest version of Spark just a few weeks after the open source release. Each new version of a project had many performance improvements, new features, and bug fixes, and customers demanded these improvements quickly to support their big data architectures.

EMR release 5.0 is a milestone in delivering the most up-to-date, complete selection of open-source applications in the Hadoop ecosystem to our customers:

  • Upgrade to Spark 2.0 a week after the Apache release, giving customers access to improved SQL support, significant performance increases, the new Structured Streaming API, and enhanced SparkR support. We have also compiled it with Scala 2.11.
  • Upgrade from Hive 1.x to Hive 2.1, which includes a variety of performance enhancements, better Parquet file format support, and bug fixes.
  • Trade Hadoop MapReduce for Tez as the default execution engine for Hive and Pig, signaling a greater move from traditional Hadoop MapReduce to newer frameworks like Tez and Spark.
  • Add the newest versions of Hue and Zeppelin, notebook and query UIs for Hadoop ecosystem applications, which enable data scientists and business intelligence analysts to interact with data even more easily and efficiently.
  • Upgrade all former sandbox applications, which are now fully released on EMR.
  • Use the latest versions of all supported applications: Hadoop 2.7.2, Spark 2.0, Presto 0.150, Hive 2.1, Tez 0.8.4, Pig 0.16, HBase 1.2.2, Phoenix 4.7.0, Zeppelin 0.6.1 (Snapshot), Hue 3.10, Oozie 4.2.0, Sqoop 1.4.6, Ganglia 3.7.2, HCatalog 2.1.0, Mahout 0.12.2, and ZooKeeper 3.4.8.

EMR 5

If you have any questions about release 5.0, feedback, or would like to share an interesting use case that leverages these applications, please leave a comment below.

You can also join our live webinar, Introducing Amazon EMR Release 5.0, at 9AM PDT on Tuesday, August 23.

 

Amazon EMR 5.0.0 – Major App Updates, UI Improvements, Better Debugging, and More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-emr-5-0-0-major-app-updates-ui-improvements-better-debugging-and-more/

The Amazon EMR team has been cranking out new releases at a fast and furious pace! Here’s a quick recap of this year’s launches:

  • EMR 4.7.0 – Updates to Apache Tez, Apache Phoenix, Presto, HBase, and Mahout (June).
  • EMR 4.6.0 – HBase for realtime access to massive datasets (April).
  • EMR 4.5.0 – Updates to Hadoop, Presto; addition of Spark and EMRFS (April).
  • EMR 4.4.0 – Sqoop, HCatalog, Java 8, and more (March).
  • EMR 4.3.0 – Updates to Spark, Presto, and Ganglia (January).

Today the team is announcing and releasing EMR 5.0.0. This is a major release that includes support for 16 open source Hadoop ecosystem projects, major version upgrades for Spark and Hive, use of Tez by default for Hive and Pig, user interface improvements to Hue and Zeppelin, and enhanced debugging functionality.

Here’s a map that shows how EMR has progressed over the course of the past couple of releases:

Let’s check out the new features in EMR 5.0.0!

Support for 16 Open Source Hadoop Ecosystem Projects
We started using Apache Bigtop to manage the EMR build and packaging process during the development of EMR 4.0.0. The use of Bigtop helped us to accelerate the release cycle while we continued to add additional packages from the Hadoop ecosystem, with a goal of making the newest GA (generally available) open source versions accessible to you as quickly as possible.

In accord with our goal, EMR 5.0 includes support for 16 Hadoop ecosystem projects including Apache Hadoop, Apache Spark, Presto, Apache Hive, Apache HBase, and Apache Tez. You can choose the desired set of apps when you create a new EMR cluster:

Major Version Upgrade for Spark and Hive
This release of EMR updates Hive (a SQL-like interface for Tez and Hadoop MapReduce) from 1.0 to 2.1, accompanied by a move to Java 8. It also updates Spark (an engine for large-scale data processing) from 1.6.2 to 2.0, with a similar move to Scala 2.11. The Spark and Hive updates are both major releases and include new features, performance enhancements, and bug fixes. For example, Spark now includes a Structured Streaming API, better SQL support, and more. Be aware that the new versions of Spark and Hive are not 100% backward compatible with the old ones; check your code and upgrade to EMR 5.0.0 with care.

With this release, Tez is now the default execution engine for Hive 2.1 and Pig 0.16, replacing Hadoop MapReduce and resulting in better performance, including reduced query latency. With this update, EMR uses MapReduce only when running a Hadoop MapReduce job directly (Hive and Pig now use Tez; Spark has its own framework).

User Interface Improvements
EMR 5.0.0 also updates Apache Zeppelin (a notebook for interactive data analytics) from 0.5.6 to 0.6.1, and Hue (an interface for analyzing data with Hadoop) from 3.7.1 to 3.10. The new versions of both of these web-based tools include new features and lots of smaller improvements.

Zeppelin is often used with Spark; Hue works well with Hive, Pig, and HBase. The new version of Hue includes a notebooks feature that allows you to have multiple queries on the same page:

Hue can also help you to design Oozie workflows:

Enhanced Debugging Functionality
Finally, EMR 5.0.0 includes some better debugging functionality, making it easier for you to figure out why a particular step of your EMR job failed. The console now displays a partial stack trace and links to the log file (stored in Amazon S3) in order to help you to find, troubleshoot, and fix errors:

Launch a Cluster Today
You can launch an EMR 5.0.0 cluster today in any AWS Region! Open up the EMR Console, click on Create cluster, and choose emr-5.0.0 from the Release menu:
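
If you prefer the AWS CLI, a minimal equivalent might look like the following; the application list, instance type and count, and key pair are placeholders to adapt to your own setup:

$ aws emr create-cluster \
  --name "EMR-5.0.0-demo" \
  --release-label emr-5.0.0 \
  --applications Name=Hadoop Name=Spark Name=Hive Name=Hue Name=Zeppelin \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=<your-key-pair>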

Learn More
To learn more about this powerful new release of EMR, plan to attend our webinar on August 23rd, Introducing Amazon EMR Release 5.0: Faster, Easier, Hadoop, Spark, and Presto.


Jeff;