Tag Archives: Batch

New – Per-Second Billing for EC2 Instances and EBS Volumes

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-per-second-billing-for-ec2-instances-and-ebs-volumes/

Back in the old days, you needed to buy or lease a server if you needed access to compute power. When we launched EC2 back in 2006, the ability to use an instance for an hour, and to pay only for that hour, was big news. The pay-as-you-go model inspired our customers to think about new ways to develop, test, and run applications of all types.

Today, services like AWS Lambda prove that we can do a lot of useful work in a short time. Many of our customers are dreaming up applications for EC2 that can make good use of a large number of instances for shorter amounts of time, sometimes just a few minutes.

Per-Second Billing for EC2 and EBS
Effective October 2nd, usage of Linux instances that are launched in On-Demand, Reserved, and Spot form will be billed in one-second increments. Similarly, provisioned storage for EBS volumes will be billed in one-second increments.

Per-second billing also applies to Amazon EMR and AWS Batch:

Amazon EMR – Our customers add capacity to their EMR clusters in order to get their results more quickly. With per-second billing for the EC2 instances in the clusters, adding nodes is more cost-effective than ever.

AWS Batch – Many of the batch jobs that our customers run complete in less than an hour. AWS Batch already launches and terminates Spot Instances; with per-second billing batch processing will become even more economical.

Some of our more sophisticated customers have built systems to get the most value from EC2 by strategically choosing the most advantageous target instances when managing their gaming, ad tech, or 3D rendering fleets. Per-second billing obviates the need for this extra layer of instance management, and brings the cost savings to all customers and all workloads.

While this will result in a price reduction for many workloads (and you know we love price reductions), I don’t think that’s the most important aspect of this change. I believe that this change will inspire you to innovate and to think about your compute-bound problems in new ways. How can you use it to improve your support for continuous integration? Can it change the way that you provision transient environments for your dev and test workloads? What about your analytics, batch processing, and 3D rendering?

One of the many advantages of cloud computing is the elastic nature of provisioning or deprovisioning resources as you need them. By billing usage down to the second, we will enable customers to level up their elasticity, save money, and take advantage of continuing advances in computing.

Things to Know
This change takes effect on October 2 in all AWS Regions, for all Linux instances that are newly launched or already running. Per-second billing is not currently applicable to instances running Microsoft Windows or Linux distributions that have a separate hourly charge. There is a 1-minute minimum charge per instance.

List prices and Spot Market prices are still listed on a per-hour basis, but bills are calculated down to the second, as is Reserved Instance usage (you can launch, use, and terminate multiple instances within an hour and get the Reserved Instance Benefit for all of the instances). Also, bills will show times in decimal form.

The Dedicated Per Region Fee, EBS Snapshots, and products in AWS Marketplace are still billed on an hourly basis.
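To make the arithmetic concrete, here is a minimal Python sketch of how a per-hour list price could translate into a per-second charge with the 1-minute minimum described above. The hourly price is a hypothetical value chosen for illustration, and the function names are my own; actual bills may round differently.

```python
def billed_seconds(run_seconds: int, minimum_seconds: int = 60) -> int:
    """Seconds charged for one instance run, applying the 1-minute minimum."""
    return max(run_seconds, minimum_seconds)


def run_cost(hourly_price: float, run_seconds: int) -> float:
    """Cost of one run when a per-hour list price is billed per second."""
    return hourly_price * billed_seconds(run_seconds) / 3600


def decimal_hours(run_seconds: int) -> float:
    """Billed time expressed as a decimal number of hours."""
    return billed_seconds(run_seconds) / 3600


if __name__ == "__main__":
    price = 0.36  # hypothetical hourly list price in USD, not a real rate
    for seconds in (10, 90, 1800):
        print(seconds, decimal_hours(seconds), round(run_cost(price, seconds), 6))
```

A 10-second run is charged as 60 seconds because of the minimum, while a 90-second run is charged for exactly 90 seconds.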

Jeff;

 

AWS IAM Policy Summaries Now Help You Identify Errors and Correct Permissions in Your IAM Policies

Post Syndicated from Joy Chatterjee original https://aws.amazon.com/blogs/security/iam-policy-summaries-now-help-you-identify-errors-and-correct-permissions-in-your-iam-policies/

In March, we made it easier to view and understand the permissions in your AWS Identity and Access Management (IAM) policies by using IAM policy summaries. Today, we updated policy summaries to help you identify and correct errors in your IAM policies. When you set permissions using IAM policies, for each action you specify, you must match that action to supported resources or conditions. Now, you will see a warning if these policy elements (Actions, Resources, and Conditions) defined in your IAM policy do not match.

When working with policies, you may find that although the policy has valid JSON syntax, it does not grant or deny the desired permissions because the Action element does not have an applicable Resource element or Condition element defined in the policy. For example, you may want to create a policy that allows users to view a specific Amazon EC2 instance. To do this, you create a policy that specifies ec2:DescribeInstances for the Action element and the Amazon Resource Name (ARN) of the instance for the Resource element. When testing this policy, you find AWS denies this access because ec2:DescribeInstances does not support resource-level permissions and requires access to list all instances. Therefore, to grant access to this Action element, you need to specify a wildcard (*) in the Resource element of your policy for this Action element in order for the policy to function correctly.
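As a small illustration of the fix described above, the following Python sketch builds a policy document that pairs ec2:DescribeInstances with a wildcard resource. The function name and the Sid are hypothetical; only the action, the wildcard, and the policy version come from the text.

```python
import json


def describe_instances_policy() -> dict:
    """Build a policy that allows ec2:DescribeInstances on all resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowDescribeInstances",  # hypothetical Sid
                "Effect": "Allow",
                "Action": ["ec2:DescribeInstances"],
                # The action has no resource-level permissions, so an
                # instance ARN here would not match; use the wildcard.
                "Resource": ["*"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(describe_instances_policy(), indent=4))
```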

To help you identify and correct permissions, you will now see a warning in a policy summary if the policy has either of the following:

  • An action that does not support the resource specified in a policy.
  • An action that does not support the condition specified in a policy.

In this blog post, I walk through two examples of how you can use policy summaries to help identify and correct these types of errors in your IAM policies.

How to use IAM policy summaries to debug your policies

Example 1: An action does not support the resource specified in a policy

Let’s say a human resources (HR) representative, Casey, needs access to the personnel files stored in HR’s Amazon S3 bucket. To do this, I create the following policy to grant all actions that begin with s3:List. In addition, I grant access to s3:GetObject in the Action element of the policy. To ensure that Casey has access only to a specific bucket and not others, I specify the bucket ARN in the Resource element of the policy.

Note: This policy does not grant the desired permissions.

This policy does not work. Do not copy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ThisPolicyDoesNotGrantAllListandGetActions",
            "Effect": "Allow",
            "Action": ["s3:List*",
                       "s3:GetObject"],
            "Resource": ["arn:aws:s3:::HumanResources"]
        }
    ]
}

After I create the policy, HRBucketPermissions, I select this policy from the Policies page to view the policy summary. From here, I check to see if there are any warnings or typos in the policy. I see a warning at the top of the policy detail page because the policy does not grant some permissions specified in the policy, which is caused by a mismatch among the actions, resources, or conditions.

Screenshot showing the warning at the top of the policy

To view more details about the warning, I choose Show remaining so that I can understand why the permissions do not appear in the policy summary. As shown in the following screenshot, I see no access to the services that the IAM policy does not grant, which is expected. However, next to S3, I see a warning that one or more S3 actions do not have an applicable resource.

Screenshot showing that one or more S3 actions do not have an applicable resource

To understand why the specific actions do not have a supported resource, I choose S3 from the list of services and choose Show remaining. I type List in the filter to understand why some of the list actions are not granted by the policy. As shown in the following screenshot, I see these warnings:

  • This action does not support resource-level permissions. This means the action does not support resource-level permissions and requires a wildcard (*) in the Resource element of the policy.
  • This action does not have an applicable resource. This means the action supports resource-level permissions, but not the resource type defined in the policy. In this example, I specified an S3 bucket for an action that supports only an S3 object resource type.

From these warnings, I see that s3:ListAllMyBuckets, s3:ListMultipartUploadParts, s3:ListObjects, and s3:GetObject do not support an S3 bucket resource type, which results in Casey not having access to the S3 bucket. To correct the policy, I choose Edit policy and update the policy with three statements based on the resource that the S3 actions support. Because Casey needs access to view and read all of the objects in the HumanResources bucket, I add a wildcard (*) for the S3 object path in the Resource ARN.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TheseActionsSupportBucketResourceType",
            "Effect": "Allow",
            "Action": ["s3:ListBucket",
                       "s3:ListBucketByTags",
                       "s3:ListBucketMultipartUploads",
                       "s3:ListBucketVersions"],
            "Resource": ["arn:aws:s3:::HumanResources"]
        },{
            "Sid": "TheseActionsRequireAllResources",
            "Effect": "Allow",
            "Action": ["s3:ListAllMyBuckets",
                       "s3:ListMultipartUploadParts",
                       "s3:ListObjects"],
            "Resource": [ "*"]
        },{
            "Sid": "TheseActionsRequireSupportsObjectResourceType",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::HumanResources/*"]
        }
    ]
}

After I make these changes, I view the updated policy summary and see that warnings are no longer displayed.
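The idea behind the corrected policy above, one statement per resource that a group of actions supports, can be sketched in Python. The lookup table below is copied from that example policy purely for illustration and is not an authoritative list of what S3 supports; the names RESOURCE_FOR_ACTION and build_statements are hypothetical.

```python
from collections import defaultdict

BUCKET = "arn:aws:s3:::HumanResources"

# Resource that each action is paired with in the corrected example policy.
# Illustration only: this is not an authoritative S3 reference.
RESOURCE_FOR_ACTION = {
    "s3:ListBucket": BUCKET,
    "s3:ListBucketByTags": BUCKET,
    "s3:ListBucketMultipartUploads": BUCKET,
    "s3:ListBucketVersions": BUCKET,
    "s3:ListAllMyBuckets": "*",
    "s3:ListMultipartUploadParts": "*",
    "s3:ListObjects": "*",
    "s3:GetObject": BUCKET + "/*",
}


def build_statements(actions):
    """Group actions into one Allow statement per resource they are paired with."""
    grouped = defaultdict(list)
    for action in actions:
        grouped[RESOURCE_FOR_ACTION[action]].append(action)
    return [
        {"Effect": "Allow", "Action": acts, "Resource": [resource]}
        for resource, acts in grouped.items()
    ]


if __name__ == "__main__":
    for statement in build_statements(RESOURCE_FOR_ACTION):
        print(statement)
```

Passing every action in the table yields three statements: one for the bucket ARN, one for the wildcard, and one for the object path.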

Screenshot of the updated policy summary that no longer shows warnings

In the previous example, I showed how to identify and correct permissions errors that include actions that do not support a specified resource. In the next example, I show how to use policy summaries to identify and correct a policy that includes actions that do not support a specified condition.

Example 2: An action does not support the condition specified in a policy

For this example, let’s assume Bob is a project manager who requires view and read access to all the code builds for his team. To grant him this access, I create the following JSON policy that specifies all list and read actions to AWS CodeBuild and defines a condition to limit access to resources in the us-west-2 Region in which Bob’s team develops.

This policy does not work. Do not copy. 
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListReadAccesstoCodeServices",
            "Effect": "Allow",
            "Action": [
                "codebuild:List*",
                "codebuild:BatchGet*"
            ],
            "Resource": ["*"],
            "Condition": {
                "StringEquals": {
                    "ec2:Region": "us-west-2"
                }
            }
        }
    ]	
}

After I create the policy, PMCodeBuildAccess, I select this policy from the Policies page to view the policy summary in the IAM console. From here, I check to see if the policy has any warnings or typos. I see an error at the top of the policy detail page because the policy does not grant any permissions.

Screenshot with an error showing the policy does not grant any permissions

To view more details about the error, I choose Show remaining to understand why no permissions result from the policy. I see this warning: One or more conditions do not have an applicable action. This means that the condition is not supported by any of the actions defined in the policy.

From the warning message (see preceding screenshot), I realize that ec2:Region is not a supported condition for any actions in CodeBuild. To correct the policy, I separate the list actions that do not support resource-level permissions into a separate Statement element and specify * as the resource. For the remaining CodeBuild actions that support resource-level permissions, I use the ARN to specify the us-west-2 Region in the project resource type.
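The kind of mismatch in this example can be approximated with a rough heuristic: flag any service-specific condition key whose prefix does not match the service of any action in the same statement. This Python sketch is my own simplification; it is not how IAM evaluates policies or how the console produces its warnings. It assumes that Action is a list and treats keys that start with aws: as global.

```python
def mismatched_condition_keys(statement: dict) -> list:
    """Return condition keys whose service prefix matches no action's service."""
    action_services = {a.split(":", 1)[0] for a in statement.get("Action", [])}
    mismatched = []
    for operator_block in statement.get("Condition", {}).values():
        for key in operator_block:
            prefix = key.split(":", 1)[0]
            # Keys that start with "aws:" are treated as global and skipped.
            if prefix != "aws" and prefix not in action_services:
                mismatched.append(key)
    return mismatched


if __name__ == "__main__":
    statement = {
        "Effect": "Allow",
        "Action": ["codebuild:List*", "codebuild:BatchGet*"],
        "Resource": ["*"],
        "Condition": {"StringEquals": {"ec2:Region": "us-west-2"}},
    }
    print(mismatched_condition_keys(statement))
```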

CORRECT POLICY 
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TheseActionsSupportAllResources",
            "Effect": "Allow",
            "Action": [
                "codebuild:ListBuilds",
                "codebuild:ListProjects",
                "codebuild:ListRepositories",
                "codebuild:ListCuratedEnvironmentImages",
                "codebuild:ListConnectedOAuthAccounts"
            ],
            "Resource": ["*"] 
        }, {
            "Sid": "TheseActionsSupportAResource",
            "Effect": "Allow",
            "Action": [
                "codebuild:ListBuildsForProject",
                "codebuild:BatchGet*"
            ],
            "Resource": ["arn:aws:codebuild:us-west-2:123456789012:project/*"] 
        }
    ]
}

After I make the changes, I view the updated policy summary and see that no warnings are displayed.

Screenshot showing the updated policy summary with no warnings

When I choose CodeBuild from the list of services, I also see that for the actions that support resource-level permissions, the access is limited to the us-west-2 Region.

Screenshot showing that for the actions that support resource-level permissions, the access is limited to the us-west-2 Region.

Conclusion

Policy summaries make it easier to view and understand the permissions and resources in your IAM policies by displaying the permissions granted by the policies. As I’ve demonstrated in this post, you can also use policy summaries to help you identify and correct errors in your IAM policies. To understand the types of warnings that policy summaries support, you can visit Troubleshoot IAM Policies. To view policy summaries in your AWS account, sign in to the IAM console and navigate to any policy on the Policies page of the IAM console or the Permissions tab on a user’s page.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about or suggestions for this solution, start a new thread on the IAM forum or contact AWS Support.

– Joy

Delivering Graphics Apps with Amazon AppStream 2.0

Post Syndicated from Deepak Suryanarayanan original https://aws.amazon.com/blogs/compute/delivering-graphics-apps-with-amazon-appstream-2-0/

Sahil Bahri, Sr. Product Manager, Amazon AppStream 2.0

Do you need to provide a workstation class experience for users who run graphics apps? With Amazon AppStream 2.0, you can stream graphics apps from AWS to a web browser running on any supported device. AppStream 2.0 offers a choice of GPU instance types. The range includes the newly launched Graphics Design instance, which allows you to offer a fast, fluid user experience at a fraction of the cost of using a graphics workstation, without upfront investments or long-term commitments.

In this post, I discuss the Graphics Design instance type in detail, and how you can use it to deliver a graphics application such as Siemens NX, a popular CAD/CAM application that we have been testing on AppStream 2.0 with engineers from Siemens PLM.

Graphics Instance Types on AppStream 2.0

First, a quick recap on the GPU instance types available with AppStream 2.0. In July 2017, we launched graphics support for AppStream 2.0 with two new instance types that Jeff Barr discussed on the AWS Blog:

  • Graphics Desktop
  • Graphics Pro

Many customers in industries such as engineering, media, entertainment, and oil and gas are using these instances to deliver high-performance graphics applications to their users. These instance types are based on dedicated NVIDIA GPUs and can run the most demanding graphics applications, including those that rely on CUDA graphics API libraries.

Last week, we added a new lower-cost instance type: Graphics Design. This instance type is a great fit for engineers, 3D modelers, and designers who use graphics applications that rely on the hardware acceleration of DirectX, OpenGL, or OpenCL APIs, such as Siemens NX, Autodesk AutoCAD, or Adobe Photoshop. The Graphics Design instance is based on AMD’s FirePro S7150x2 Server GPUs and equipped with AMD Multiuser GPU technology. The instance type uses virtualized GPUs to achieve lower costs, and is available in four instance sizes to scale and match the requirements of your applications.

Instance | vCPUs | Instance RAM (GiB) | GPU Memory (GiB)
stream.graphics-design.large | 2 | 7.5 | 1
stream.graphics-design.xlarge | 4 | 15.3 | 2
stream.graphics-design.2xlarge | 8 | 30.5 | 4
stream.graphics-design.4xlarge | 16 | 61 | 8

The following table compares all three graphics instance types on AppStream 2.0, along with example applications you could use with each.

Instance type | Graphics Design | Graphics Desktop | Graphics Pro
Number of instance sizes | 4 | 1 | 3
GPU memory range | 1–8 GiB | 4 GiB | 8–32 GiB
vCPU range | 2–16 | 8 | 16–32
Memory range | 7.5–61 GiB | 15 GiB | 122–488 GiB
Graphics libraries supported | AMD FirePro S7150x2 | NVIDIA GRID K520 | NVIDIA Tesla M60
Price range (N. Virginia AWS Region) | $0.25 – $2.00/hour | $0.5/hour | $2.05 – $8.20/hour
Example applications | Adobe Premiere Pro, AutoDesk Revit, Siemens NX | AVEVA E3D, SOLIDWORKS | AutoDesk Maya, Landmark DecisionSpace, Schlumberger Petrel

Example graphics instance set up with Siemens NX

In this section, I walk through setting up Siemens NX with Graphics Design instances on AppStream 2.0. After setup is complete, users can access NX from within their browser and also access their design files from a file share. You can also use these steps to set up and test your own graphics applications on AppStream 2.0. Here’s the workflow:

  1. Create a file share to load and save design files.
  2. Create an AppStream 2.0 image with Siemens NX installed.
  3. Create an AppStream 2.0 fleet and stack.
  4. Invite users to access Siemens NX through a browser.
  5. Validate the setup.

To learn more about AppStream 2.0 concepts and setup, see the previous post Scaling Your Desktop Application Streams with Amazon AppStream 2.0. For a deeper review of all the setup and maintenance steps, see the Amazon AppStream 2.0 Developer Guide.

Step 1: Create a file share to load and save design files

To launch and configure the file server

  1. Open the EC2 console and choose Launch Instance.
  2. Scroll to the Microsoft Windows Server 2016 Base Image and choose Select.
  3. Choose an instance type and size for your file server (I chose the general purpose m4.large instance). Choose Next: Configure Instance Details.
  4. Select a VPC and subnet. You launch AppStream 2.0 resources in the same VPC. Choose Next: Add Storage.
  5. If necessary, adjust the size of your EBS volume. Choose Review and Launch, Launch.
  6. On the Instances page, give your file server a name, such as My File Server.
  7. Ensure that the security group associated with the file server instance allows for incoming traffic from the security group that you select for your AppStream 2.0 fleets or image builders. You can use the default security group and select the same group while creating the image builder and fleet in later steps.

Log in to the file server using a remote access client such as Microsoft Remote Desktop. For more information about connecting to an EC2 Windows instance, see Connect to Your Windows Instance.

To enable file sharing

  1. Create a new folder (such as C:\My Graphics Files) and upload the files that you want to make available to your users.
  2. From the Windows control panel, enable network discovery.
  3. Choose Server Manager, File and Storage Services, Volumes.
  4. Scroll to Shares and choose Start the Add Roles and Features Wizard. Go through the wizard to install the File Server and Share role.
  5. From the left navigation menu, choose Shares.
  6. Choose Start the New Share Wizard to set up your folder as a file share.
  7. Open the context (right-click) menu on the share and choose Properties, Permissions, Customize Permissions.
  8. Choose Permissions, Add. Add Read and Execute permissions for everyone on the network.

Step 2:  Create an AppStream 2.0 image with Siemens NX installed

To connect to the image builder and install applications

  1. Open the AppStream 2.0 management console and choose Images, Image Builder, Launch Image Builder.
  2. Create a graphics design image builder in the same VPC as your file server.
  3. From the Image builder tab, select your image builder and choose Connect. This opens a new browser tab and displays a desktop to log in to.
  4. Log in to your image builder as ImageBuilderAdmin.
  5. Launch the Image Assistant.
  6. Download and install Siemens NX and other applications on the image builder. I added Blender and Firefox, but you could replace these with your own applications.
  7. To verify the user experience, you can test the application performance on the instance.

Before you finish creating the image, you must mount the file share by enabling a few Microsoft Windows services.

To mount the file share

  1. Open services.msc and check the following services:
  • DNS Client
  • Function Discovery Resource Publication
  • SSDP Discovery
  • UPnP Device H
  2. If any of the preceding services have Startup Type set to Manual, open the context (right-click) menu on the service and choose Start. Otherwise, open the context (right-click) menu on the service and choose Properties. For Startup Type, choose Manual, Apply. To start the service, choose Start.
  3. From the Windows control panel, enable network discovery.
  4. Create a batch script that mounts a file share from the storage server set up earlier. The file share is mounted automatically when a user connects to the AppStream 2.0 environment.

Logon Script Location: C:\Users\Public\logon.bat

Script Contents:

REM Retry until the H: drive is mounted.
:loop
net use H: \\path\to\network\share
REM Wait about 30 seconds before checking again.
PING localhost -n 30 >NUL
IF NOT EXIST H:\ GOTO loop

  5. Open gpedit.msc and choose User Configuration, Windows Settings, Scripts. Set logon.bat as the user logon script.
  6. Next, create a batch script that makes the mounted drive visible to the user.

Startup Script Location: C:\Users\Public\startup.bat

Script Contents:
REG DELETE "HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer" /v "NoDrives" /f

  7. Open Task Scheduler and choose Create Task.
  8. Choose General, provide a task name, and then choose Change User or Group.
  9. For Enter the object name to select, enter SYSTEM and choose Check Names, OK.
  10. Choose Triggers, New. For Begin the task, choose At startup. Under Advanced Settings, change Delay task for to 5 minutes. Choose OK.
  11. Choose Actions, New. Under Settings, for Program/script, enter C:\Users\Public\startup.bat. Choose OK.
  12. Choose Conditions. Under Power, clear the Start the task only if the computer is on AC power check box. Choose OK.
  13. To view your scheduled task, choose Task Scheduler Library. Close Task Scheduler when you are done.

Step 3:  Create an AppStream 2.0 fleet and stack

To create a fleet and stack

  1. In the AppStream 2.0 management console, choose Fleets, Create Fleet.
  2. Give the fleet a name, such as Graphics-Demo-Fleet, and configure it to use the newly created image and the same VPC as your file server.
  3. Choose Stacks, Create Stack. Give the stack a name, such as Graphics-Demo-Stack.
  4. After the stack is created, select it and choose Actions, Associate Fleet. Associate the stack with the fleet you created in step 1.

Step 4:  Invite users to access Siemens NX through a browser

To invite users

  1. Choose User Pools, Create User to create users.
  2. Enter a name and email address for each user.
  3. Select the users just created, and choose Actions, Assign Stack to provide access to the stack that you created in Step 3. You can also provide access using SAML 2.0 and connect to your Active Directory if necessary. For more information, see the Enabling Identity Federation with AD FS 3.0 and Amazon AppStream 2.0 post.

Your user receives an email invitation to set up an account and use a web portal to access the applications that you have included in your stack.

Step 5:  Validate the setup

Time for a test drive with Siemens NX on AppStream 2.0!

  1. Open the link for the AppStream 2.0 web portal shared through the email invitation. The web portal opens in your default browser. You must sign in with the temporary password and set a new password. After that, you are taken to your app catalog.
  2. Launch Siemens NX and interact with it using the demo files available in the shared storage folder – My Graphics Files. 

After I launched NX, I captured the screenshot below. The Siemens PLM team also recorded a video with NX running on AppStream 2.0.

Summary

In this post, I discussed the GPU instances available for delivering rich graphics applications to users in a web browser. While I demonstrated a simple setup, you can scale this out to launch a production environment with users signing in using Active Directory credentials, accessing persistent storage with Amazon S3, and using other commonly requested features reviewed in the Amazon AppStream 2.0 Launch Recap – Domain Join, Simple Network Setup, and Lots More post.

To learn more about AppStream 2.0 and capabilities added this year, see Amazon AppStream 2.0 Resources.

Disabling Intel Hyper-Threading Technology on Amazon EC2 Windows Instances

Post Syndicated from Brian Beach original https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-ec2-windows-instances/

In a prior post, Disabling Intel Hyper-Threading on Amazon Linux, I investigated how the Linux kernel enumerates CPUs. I also discussed the options to disable Intel Hyper-Threading (HT Technology) in Amazon Linux running on Amazon EC2.

In this post, I do the same for Microsoft Windows Server 2016 running on EC2 instances. I begin with a quick review of HT Technology and the reasons you might want to disable it. I also recommend that you take a moment to review the prior post for a more thorough foundation.

HT Technology

HT Technology makes a single physical processor appear as multiple logical processors. Each core in an Intel Xeon processor has two threads of execution. Most of the time, these threads can progress independently; one thread executing while the other is waiting on a relatively slow operation (for example, reading from memory) to occur. However, the two threads do share resources and occasionally one thread is forced to wait while the other is executing.

There are a few unique situations where disabling HT Technology can improve performance. One example is high performance computing (HPC) workloads that rely heavily on floating point operations. However, these cases are rare, and for the overwhelming majority of workloads you should leave it enabled. I recommend that you test with and without HT Technology enabled, and only disable threads if you are sure it will improve performance.

Exploring HT Technology on Microsoft Windows

Here’s how Microsoft Windows enumerates CPUs. As before, I am running these examples on an m4.2xlarge. I also chose to run Windows Server 2016, but you can walk through these exercises on any version of Windows. Remember that the m4.2xlarge has eight vCPUs, and each vCPU is a thread of an Intel Xeon core. Therefore, the m4.2xlarge has four cores, each of which runs two threads, resulting in eight vCPUs.

Windows does not have a built-in utility to examine CPU configuration, but you can download the Sysinternals coreinfo utility from Microsoft’s website. This utility provides useful information about the system CPU and memory topology. For this walkthrough, you enumerate the individual CPUs, which you can do by running coreinfo -c. For example:

C:\Users\Administrator >coreinfo -c

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**------ Physical Processor 0 (Hyperthreaded)
--**---- Physical Processor 1 (Hyperthreaded)
----**-- Physical Processor 2 (Hyperthreaded)
------** Physical Processor 3 (Hyperthreaded)

As you can see from the output, the coreinfo utility displays a table where each row is a physical core and each column is a logical CPU. In other words, the two asterisks on the first line indicate that CPU 0 and CPU 1 are the two threads in the first physical core. Therefore, my m4.2xlarge has four physical processors, and each processor has two threads, resulting in eight total CPUs, just as expected.
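If you want to work with this map programmatically, the rows can be parsed with a few lines of Python. This is a minimal sketch that assumes the row format shown in the coreinfo output above (a mask of asterisks and dashes followed by "Physical Processor N"); other coreinfo versions may print something different, and the function name is hypothetical.

```python
def parse_processor_map(lines):
    """Map each physical processor number to the logical CPUs marked '*'."""
    cores = {}
    for line in lines:
        parts = line.split()
        # Expect rows like: "**------ Physical Processor 0 (Hyperthreaded)"
        if len(parts) >= 4 and parts[1:3] == ["Physical", "Processor"]:
            mask, core = parts[0], int(parts[3])
            cores[core] = [i for i, ch in enumerate(mask) if ch == "*"]
    return cores


if __name__ == "__main__":
    sample = [
        "Logical to Physical Processor Map:",
        "**------ Physical Processor 0 (Hyperthreaded)",
        "--**---- Physical Processor 1 (Hyperthreaded)",
        "----**-- Physical Processor 2 (Hyperthreaded)",
        "------** Physical Processor 3 (Hyperthreaded)",
    ]
    print(parse_processor_map(sample))
```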

It is interesting to note that Windows Server 2016 enumerates CPUs in a different order than Linux. Remember from the prior post that Linux enumerated the first thread in each core, followed by the second thread in each core. You can see from the output earlier that Windows Server 2016 enumerates both threads in the first core, then both threads in the second core, and so on. The diagram below shows the relationship of CPUs to cores and threads in both operating systems.

In the Linux post, I disabled CPUs 4–7, leaving one thread per core, and effectively disabling HT Technology. You can see from the diagram that you must disable the odd-numbered threads (that is, 1, 3, 5, and 7) to achieve the same result in Windows. Here’s how to do that.
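The two enumeration orders described above can be written down as a short Python sketch. It assumes exactly two threads per core and the orderings as described in this post; the function names are hypothetical, and on a real system you should confirm the layout with coreinfo.

```python
def second_threads_linux(cores: int) -> list:
    """Linux order: first threads are 0..cores-1, second threads follow."""
    return list(range(cores, 2 * cores))


def second_threads_windows(cores: int) -> list:
    """Windows order: both threads of core 0, then core 1, and so on."""
    return list(range(1, 2 * cores, 2))


if __name__ == "__main__":
    cores = 4  # an m4.2xlarge has four cores and eight vCPUs
    print("Linux second threads:  ", second_threads_linux(cores))
    print("Windows second threads:", second_threads_windows(cores))
```

Disabling the CPUs returned by either function leaves one thread per core.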

Disabling HT Technology on Microsoft Windows

In Linux, you can globally disable CPUs dynamically. In Windows, there is no direct equivalent that I could find, but there are a few alternatives.

First, you can disable CPUs using the msconfig.exe tool. If you choose Boot, Advanced Options, you have the option to set the number of processors. In the example below, I limit my m4.2xlarge to four CPUs. Restart for this change to take effect.

Unfortunately, Windows does not disable hyperthreaded CPUs first and then real cores, as Linux does. As you can see in the following output, coreinfo reports that my m4.2xlarge has two real cores and four hyperthreads after rebooting. Msconfig.exe is useful for disabling cores, but it does not allow you to disable HT Technology.

Note: If you have been following along, you can re-enable all your CPUs by unselecting the Number of processors check box and rebooting your system.

 

C:\Users\Administrator >coreinfo -c

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**-- Physical Processor 0 (Hyperthreaded)
--** Physical Processor 1 (Hyperthreaded)

While you cannot disable HT Technology systemwide, Windows does allow you to associate a particular process with one or more CPUs. Microsoft calls this “processor affinity.” To see an example, use the following steps.

  1. Launch an instance of Notepad.
  2. Open Windows Task Manager and choose Processes.
  3. Open the context (right click) menu on notepad.exe and choose Set Affinity….

This brings up the Processor Affinity dialog box.

As you can see, all the CPUs are allowed to run this instance of notepad.exe. You can uncheck a few CPUs to exclude them. Windows is smart enough to allow any scheduled operations to continue to completion on disabled CPUs. It then saves its state at the next scheduling event, and resumes those operations on another CPU. To ensure that only one thread in each core is able to run a process, you uncheck every other CPU. This effectively disables HT Technology for this process.

Of course, this can be tedious when you have a large number of cores. Remember that the x1.32xlarge has 128 CPUs. Luckily, you can set the affinity of a running process from PowerShell using the Get-Process cmdlet. For example:

PS C:\> (Get-Process -Name 'notepad').ProcessorAffinity = 0x55;

The ProcessorAffinity attribute takes a bitmask in hexadecimal format. 0x55 in hex is equivalent to 01010101 in binary. Think of the binary encoding as 1=enabled and 0=disabled. This is slightly confusing, but the bits are read right to left, so CPU 0 is the rightmost bit and CPU 7 is the leftmost bit. Therefore, 01010101 means that the first thread in each core is enabled, just as it was in the diagram earlier.

The calculator built into Windows includes a “programmer view” that helps you convert from hexadecimal to binary. In addition, the ProcessorAffinity attribute is a 64-bit number. Therefore, you can only configure the processor affinity on systems up to 64 CPUs. At the moment, only the x1.32xlarge has more than 64 vCPUs.
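The mask arithmetic can also be sketched in a few lines of Python. This is only an illustration, not part of Windows or PowerShell: the helper name every_other_cpu_mask is hypothetical, and it assumes (as in the coreinfo output above) that the two hyperthreads of a core are numbered consecutively.

```python
def every_other_cpu_mask(cpu_count: int) -> int:
    """Return a bitmask with bits 0, 2, 4, ... set (one logical CPU per core).

    Assumes sibling hyperthreads are numbered consecutively (0-1, 2-3, ...).
    """
    if not 1 <= cpu_count <= 64:
        # ProcessorAffinity is a 64-bit value, so at most 64 CPUs fit in it.
        raise ValueError("cpu_count must be between 1 and 64")
    mask = 0
    for cpu in range(0, cpu_count, 2):
        mask |= 1 << cpu  # CPU 0 is the least significant (rightmost) bit
    return mask


if __name__ == "__main__":
    mask = every_other_cpu_mask(8)
    print(hex(mask), format(mask, "08b"))  # 0x55 01010101
```

The printed hexadecimal value is what you would pass to ProcessorAffinity for an 8-CPU instance under those assumptions.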

In the preceding examples, you changed the processor affinity of a running process. Sometimes, you want to start a process with the affinity already configured. You can do this using the start command. The start command includes an affinity flag that takes a hexadecimal number like the PowerShell example earlier.

C:\Users\Administrator>start /affinity 55 notepad.exe

It is interesting to note that a child process inherits the affinity from its parent. For example, the following commands create a batch file that launches Notepad, and start the batch file with the affinity set. If you examine the instance of Notepad launched by the batch file, you see that the affinity has been applied to it as well.

C:\Users\Administrator>echo notepad.exe > test.bat
C:\Users\Administrator>start /affinity 55 test.bat

This means that you can set the affinity of your task scheduler, and any tasks that the scheduler starts inherit the affinity. So, you can disable every other thread when you launch the scheduler and effectively disable HT Technology for all of the tasks as well. Be sure to test this point, however, as some schedulers override the normal inheritance behavior and explicitly set processor affinity when starting a child process.

Conclusion

While the Windows operating system does not allow you to disable logical CPUs, you can set processor affinity on individual processes. You also learned that Windows Server 2016 enumerates CPUs in a different order than Linux. Therefore, you can effectively disable HT Technology by restricting a process to every other CPU. Finally, you learned how to set affinity of both new and running processes using Task Manager, PowerShell, and the start command.

Note: this technical approach has nothing to do with control over software licensing, or licensing rights, which are sometimes linked to the number of “CPUs” or “cores.” For licensing purposes, those are legal terms, not technical terms. This post did not cover anything about software licensing or licensing rights.

If you have questions or suggestions, please comment below.

From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum

Post Syndicated from Dylan Tong original https://aws.amazon.com/blogs/big-data/from-data-lake-to-data-warehouse-enhancing-customer-360-with-amazon-redshift-spectrum/

Achieving a 360° view of your customer has become increasingly challenging as companies embrace omni-channel strategies, engaging customers across websites, mobile, call centers, social media, physical sites, and beyond. The promise of a web where online and physical worlds blend makes understanding your customers more challenging, but also more important. Businesses that are successful in this medium have a significant competitive advantage.

The big data challenge requires the management of data at high velocity and volume. Many customers have identified Amazon S3 as a great data lake solution that removes the complexities of managing a highly durable, fault tolerant data lake infrastructure at scale and economically.

AWS data services substantially lessen the heavy lifting of adopting technologies, allowing you to spend more time on what matters most—gaining a better understanding of customers to elevate your business. In this post, I show how a recent Amazon Redshift innovation, Redshift Spectrum, can enhance a customer 360 initiative.

Customer 360 solution

A successful customer 360 view benefits from using a variety of technologies to deliver different forms of insights. These could range from real-time analysis of streaming data from wearable devices and mobile interactions to historical analysis that requires interactive, on demand queries on billions of transactions. In some cases, insights can only be inferred through AI via deep learning. Finally, the value of your customer data and insights can’t be fully realized until it is operationalized at scale—readily accessible by fleets of applications. Companies are leveraging AWS for the breadth of services that cover these domains, to drive their data strategy.

A number of AWS customers stream data from various sources into an S3 data lake through Amazon Kinesis. They use Kinesis and technologies in the Hadoop ecosystem like Spark running on Amazon EMR to enrich this data. High-value data is loaded into an Amazon Redshift data warehouse, which allows users to analyze and interact with data through a choice of client tools. Redshift Spectrum expands on this analytics platform by enabling Amazon Redshift to blend and analyze data beyond the data warehouse and across a data lake.

The following diagram illustrates the workflow for such a solution.

This solution delivers value by:

  • Reducing complexity and time to value for deeper insights. For instance, an existing data model in Amazon Redshift may provide insights across dimensions such as customer, geography, time, and product on metrics from sales and financial systems. Down the road, you may gain access to streaming data sources like customer-care call logs and website activity that you want to blend in with the sales data on the same dimensions to understand how web and call center experiences may be correlated with sales performance. Redshift Spectrum can join these dimensions in Amazon Redshift with data in S3 to allow you to quickly gain new insights, and avoid the slow and more expensive alternative of fully integrating these sources with your data warehouse.
  • Providing an additional avenue for optimizing costs and performance. In cases like call logs and clickstream data where volumes could be many TBs to PBs, storing the data exclusively in S3 yields significant cost savings. Interactive analysis on massive datasets may now be economically viable in cases where data was previously analyzed periodically through static reports generated by inexpensive batch processes. In some cases, you can improve the user experience while simultaneously lowering costs. Spectrum is powered by a large-scale infrastructure external to your Amazon Redshift cluster, and excels at scanning and aggregating large volumes of data. For instance, your analysts may be performing data discovery on customer interactions across millions of consumers over years of data across various channels. On this large dataset, certain queries could be slow if you didn’t have a large Amazon Redshift cluster. Alternatively, you could use Redshift Spectrum to achieve a better user experience with a smaller cluster.

Proof of concept walkthrough

To make evaluation easier for you, I’ve conducted a Redshift Spectrum proof-of-concept (PoC) for the customer 360 use case. For those who want to replicate the PoC, the instructions, AWS CloudFormation templates, and public data sets are available in the GitHub repository.

The remainder of this post is a journey through the project, observing best practices in action, and learning how you can achieve business value. The walkthrough involves:

  • An analysis of performance data from the PoC environment involving queries that demonstrate blending and analysis of data across Amazon Redshift and S3. Observe that great results are achievable at scale.
  • Guidance by example on query tuning, design, and data preparation to illustrate the optimization process. This includes tuning a query that combines clickstream data in S3 with customer and time dimensions in Amazon Redshift, and aggregates ~1.9 B out of 3.7 B+ records in under 10 seconds with a small cluster!
  • Guidance and measurements to help you decide between two options: accessing and analyzing data exclusively in Amazon Redshift, or using Redshift Spectrum to access data left in S3.

Stream ingestion and enrichment

The focus of this post isn’t stream ingestion and enrichment on Kinesis and EMR, but be mindful of performance best practices on S3 to ensure good streaming and query performance:

  • Use random object keys: The data files provided for this project are prefixed with SHA-256 hashes to prevent hot partitions. This is important to ensure optimal request rates, both for PUT requests from the incoming stream and for certain queries from large Amazon Redshift clusters that could send a large number of parallel GET requests.
  • Micro-batch your data stream: S3 isn’t optimized for small random write workloads. Your datasets should be micro-batched into large files. For instance, the “parquet-1” dataset provided batches >7 million records per file. The optimal file size for Redshift Spectrum is usually in the 100 MB to 1 GB range.
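As a rough illustration of the first point, a key prefix can be derived by hashing the object's logical name. The post only says that the files are prefixed with SHA-256 hashes; the exact layout used here (an 8-character hex prefix, the hypothetical hashed_key helper, and the sample file names) is an assumption made for the sketch.

```python
import hashlib


def hashed_key(logical_name: str, prefix_len: int = 8) -> str:
    """Prepend part of the SHA-256 hex digest so keys spread across the keyspace."""
    digest = hashlib.sha256(logical_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{logical_name}"


if __name__ == "__main__":
    # Hypothetical file names; neighbouring names get unrelated prefixes.
    for name in ("uservisits/part-00000.parquet", "uservisits/part-00001.parquet"):
        print(hashed_key(name))
```

Because the prefix is derived from the name, the same file always maps to the same key, so readers can recompute it without a lookup table.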

If you have an edge case that may pose scalability challenges, AWS would love to hear about it. For further guidance, talk to your solutions architect.

Environment

The project consists of the following environment:

  • Amazon Redshift cluster: 4 X dc1.large
  • Data:
    • Time and customer dimension tables are stored on all Amazon Redshift nodes (ALL distribution style):
      • The data originates from the DWDATE and CUSTOMER tables in the Star Schema Benchmark
      • The customer table contains attributes for 3 million customers.
      • The time data is at the day-level granularity, and spans 7 years, from the start of 1992 to the end of 1998.
    • The clickstream data is stored in an S3 bucket, and serves as a fact table.
      • Various copies of this dataset in CSV and Parquet format have been provided, for reasons to be discussed later.
      • The data is a modified version of the uservisits dataset from AMPLab’s Big Data Benchmark, which was generated by Intel’s Hadoop benchmark tools.
      • Changes were minimal, so that existing test harnesses for this test can be adapted:
        • Increased the 751,754,869-row dataset 5X to 3,758,774,345 rows.
        • Added surrogate keys to support joins with customer and time dimensions. These keys were distributed evenly across the entire dataset to represent user visits from six customers over seven years.
        • Values for the visitDate column were replaced to align with the 7-year timeframe, and the added time surrogate key.

Queries across the data lake and data warehouse 

Imagine a scenario where a business analyst plans to analyze clickstream metrics like ad revenue over time and by customer, market segment and more. The example below is a query that achieves this effect: 

The query part highlighted in red retrieves clickstream data in S3, and joins the data with the time and customer dimension tables in Amazon Redshift through the part highlighted in blue. The query returns the total ad revenue for three customers over the last three months, along with info on their respective market segment.

Unfortunately, this query takes around three minutes to run, and doesn’t enable the interactive experience that you want. However, there are a number of performance optimizations that you can implement to achieve the desired performance.

Performance analysis

Two key utilities provide visibility into Redshift Spectrum:

  • EXPLAIN
    Provides the query execution plan, which includes info around what processing is pushed down to Redshift Spectrum. Steps in the plan that include the prefix S3 are executed on Redshift Spectrum. For instance, the plan for the previous query has the step “S3 Seq Scan clickstream.uservisits_csv10”, indicating that Redshift Spectrum performs a scan on S3 as part of the query execution.
  • SVL_S3QUERY_SUMMARY
    Statistics for Redshift Spectrum queries are stored in this table. While the execution plan presents cost estimates, this table stores actual statistics for past query runs.

You can get the statistics of your last query by inspecting the SVL_S3QUERY_SUMMARY table with the condition (query = pg_last_query_id()). Inspecting the previous query reveals that the entire dataset of nearly 3.8 billion rows was scanned to retrieve fewer than 66.3 million rows. Improving scan selectivity in your query could yield substantial performance improvements.
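To put the scan inefficiency in numbers, the following short calculation uses the exact row counts quoted elsewhere in this post (3,758,774,345 rows in the dataset and 66,270,117 rows needed by the query). The function name scan_selectivity is a hypothetical helper, not a Redshift feature.

```python
ROWS_SCANNED = 3_758_774_345  # full clickstream dataset
ROWS_NEEDED = 66_270_117      # rows the query actually uses


def scan_selectivity(rows_needed: int, rows_scanned: int) -> float:
    """Fraction of scanned rows that the query actually needed."""
    return rows_needed / rows_scanned


if __name__ == "__main__":
    share = scan_selectivity(ROWS_NEEDED, ROWS_SCANNED)
    print(f"{share:.2%} of the scanned rows were useful")
```

Under 2% of the scanned rows contribute to the result, which is why narrowing the scan is the first optimization to try.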

Partitioning

Partitioning is a key means to improving scan efficiency. In your environment, the data and tables have already been organized, and configured to support partitions. For more information, see the PoC project setup instructions. The clickstream table was defined as:

CREATE EXTERNAL TABLE clickstream.uservisits_csv10
…
PARTITIONED BY(customer int4, visitYearMonth int4)

The entire 3.8 billion-row dataset is organized as a collection of large files where each file contains data exclusive to a particular customer and month in a year. This allows you to partition your data into logical subsets by customer and year/month. With partitions, the query engine can target a subset of files:

  • Only for specific customers
  • Only data for specific months
  • A combination of specific customers and year/months
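The effect of pruning can be sketched with plain Python tuples standing in for partitions. This is a conceptual model, not how Redshift Spectrum is implemented: the customer numbering (1 to 6), the choice of the last three months of 1998, and the helper names are assumptions for illustration.

```python
from itertools import product


def all_partitions(customers, year_months):
    """Every (customer, visitYearMonth) partition in the dataset."""
    return list(product(customers, year_months))


def prune(partitions, wanted_customers, wanted_months):
    """Keep only partitions whose keys match the query's filters."""
    return [
        (c, ym) for c, ym in partitions
        if c in wanted_customers and ym in wanted_months
    ]


if __name__ == "__main__":
    customers = range(1, 7)  # six customers
    months = [y * 100 + m for y in range(1992, 1999) for m in range(1, 13)]
    parts = all_partitions(customers, months)
    selected = prune(parts, {1, 2, 3}, {199810, 199811, 199812})
    print(len(parts), "partitions in total,", len(selected), "scanned")
```

With six customers and seven years of months, filtering on both keys narrows the scan from 504 partitions to 9 in this model.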

You can use partitions in your queries. Instead of joining your customer data on the surrogate customer key (that is, c.c_custkey = uv.custKey), the partition key “customer” should be used instead:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ORDER BY c.c_name, c.c_mktsegment, uv.yearMonthKey  ASC

This query should run approximately twice as fast as the previous query. If you look at the statistics for this query in SVL_S3QUERY_SUMMARY, you see that only half the dataset was scanned. This is expected because your query is on three out of six customers on an evenly distributed dataset. However, the scan is still inefficient, and you can benefit from using your year/month partition key as well:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ON uv.visitYearMonth = t.d_yearmonthnum
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

All joins between the tables are now using partitions. Upon reviewing the statistics for this query, you should observe that Redshift Spectrum scans and returns the exact number of rows, 66,270,117. If you run this query a few times, you should see execution time in the range of 8 seconds, which is a 22.5X improvement on your original query!

Predicate pushdown and storage optimizations 

Previously, I mentioned that Redshift Spectrum performs processing through large-scale infrastructure external to your Amazon Redshift cluster. It is optimized for performing large scans and aggregations on S3. In fact, Redshift Spectrum may even out-perform a medium size Amazon Redshift cluster on these types of workloads with the proper optimizations. There are two important variables to consider for optimizing large scans and aggregations:

  • File size and count. As a general rule, use files 100 MB-1 GB in size, as Redshift Spectrum and S3 are optimized for reading this object size. However, the number of files that a query operates on is directly correlated with the parallelism that the query can achieve. There is an inverse relationship between file size and count: the bigger the files, the fewer files there are for the same dataset. Consequently, there is a trade-off between optimizing for object read performance and the amount of parallelism achievable on a particular query. Large files are best for large scans, as the query likely operates on a sufficiently large number of files. For queries that are more selective and operate on fewer files, you may find that smaller files allow for more parallelism.
  • Data format. Redshift Spectrum supports various data formats. Columnar formats like Parquet can sometimes lead to substantial performance benefits by providing compression and more efficient I/O for certain workloads. Generally, format types like Parquet should be used for query workloads involving large scans, and high attribute selectivity. Again, there are trade-offs as formats like Parquet require more compute power to process than plaintext. For queries on smaller subsets of data, the I/O efficiency benefit of Parquet is diminished. At some point, Parquet may perform the same or slower than plaintext. Latency, compression rates, and the trade-off between user experience and cost should drive your decision.
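The file size versus file count trade-off is simple arithmetic, sketched below. The 10 GB figure is a made-up amount of data for a selective query, and file_count is a hypothetical helper; actual parallelism depends on Redshift Spectrum internals that this sketch does not model.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def file_count(dataset_bytes: int, file_bytes: int) -> int:
    """Number of files needed to hold the dataset at a given file size."""
    return math.ceil(dataset_bytes / file_bytes)


if __name__ == "__main__":
    touched_bytes = 10 * GB  # hypothetical amount of data a selective query reads
    for size in (100 * MB, 250 * MB, 1 * GB):
        files = file_count(touched_bytes, size)
        print(f"{size // MB:>5} MB files -> {files} files available to read in parallel")
```

The same data yields roughly ten times as many files at 100 MB as at 1 GB, which is the extra parallelism a selective query can draw on.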

To help illustrate how Redshift Spectrum performs on these large aggregation workloads, run a basic query that aggregates the entire ~3.7 billion record dataset on Redshift Spectrum, and compare that with running the query exclusively on Amazon Redshift:

SELECT uv.custKey, COUNT(uv.custKey)
FROM <your clickstream table> as uv
GROUP BY uv.custKey
ORDER BY uv.custKey ASC

For the Amazon Redshift test case, the clickstream data is loaded and distributed evenly across all nodes (even distribution style), with the column compression encodings recommended by Amazon Redshift’s ANALYZE COMPRESSION command.

The Redshift Spectrum test case uses a Parquet data format with each file containing all the data for a particular customer in a month. This results in files mostly in the range of 220-280 MB, and in effect, is the largest file size for this partitioning scheme. If you run tests with the other datasets provided, you see that this data format and size is optimal and out-performs others by ~60X. 

Performance differences will vary depending on the scenario. The important takeaway is to understand the testing strategy and the workload characteristics where Redshift Spectrum is likely to yield performance benefits. 

The following chart compares the query execution time for the two scenarios. The results indicate that you would have to pay for 12 X DC1.Large nodes to get performance comparable to using a small Amazon Redshift cluster that leverages Redshift Spectrum. 

Chart showing simple aggregation on ~3.7 billion records

So you’ve validated that Spectrum excels at performing large aggregations. Could you benefit by pushing more work down to Redshift Spectrum in your original query? It turns out that you can, by making the following modification:

The clickstream data is stored at a day-level granularity for each customer while your query rolls up the data to the month level per customer. In the earlier query that uses the year/month partition key, you optimized the query so that it only scans and retrieves the data required, but the day-level data is still sent back to your Amazon Redshift cluster for joining and aggregation. The query shown here pushes aggregation work down to Redshift Spectrum as indicated by the query plan:

In this query, Redshift Spectrum aggregates the clickstream data to the month level before it is returned to the Amazon Redshift cluster and joined with the dimension tables. This query should complete in about 4 seconds, which is roughly twice as fast as only using the partition key. The speed increase is evident upon reviewing the SVL_S3QUERY_SUMMARY table:

  • Bytes scanned is 21.6X less because of the Parquet data format.
  • Only 90 records are returned to the Amazon Redshift cluster as a result of the push-down, instead of ~66.2 million, leading to substantially less join overhead, and about 530 MB less data sent back to your cluster.
  • No adverse change in average parallelism.
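The idea behind the push-down can be modelled in plain Python: roll the day-level rows up to the month level first, then join the much smaller result with the dimension data. This is a conceptual sketch with made-up sample rows and hypothetical function names, not the query that Redshift Spectrum executes.

```python
from collections import defaultdict


def aggregate_by_month(daily_rows):
    """Roll (customer, yyyymmdd, revenue) rows up to (customer, yyyymm) totals."""
    totals = defaultdict(float)
    for customer, day, revenue in daily_rows:
        totals[(customer, day // 100)] += revenue
    return dict(totals)


def join_with_names(monthly_totals, customer_names):
    """Join the (small) aggregated result with a customer dimension."""
    return {
        (customer_names[customer], month): total
        for (customer, month), total in monthly_totals.items()
    }


if __name__ == "__main__":
    # Made-up sample data: 2 customers x 3 days in one month.
    daily = [(c, 19981201 + d, 1.0) for c in (1, 2) for d in range(3)]
    monthly = aggregate_by_month(daily)  # the step that is pushed down
    report = join_with_names(monthly, {1: "Customer#1", 2: "Customer#2"})
    print(len(daily), "daily rows reduced to", len(monthly), "rows before the join")
    print(report)
```

Only the aggregated rows cross the boundary to the join step, which mirrors why far fewer records are returned to the cluster.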

Assessing the value of Amazon Redshift vs. Redshift Spectrum

At this point, you might be asking yourself, why would I ever not use Redshift Spectrum? Well, you still get additional value for your money by loading data into Amazon Redshift, and querying in Amazon Redshift vs. querying S3.

In fact, it turns out that the last version of our query runs even faster when executed exclusively in native Amazon Redshift, as shown in the following chart:

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 3 months of data

As a general rule, queries that aren’t dominated by I/O and which involve multiple joins are better optimized in native Amazon Redshift. For instance, the performance difference between running the partition key query entirely in Amazon Redshift versus with Redshift Spectrum is twice as large as that of the pushdown aggregation query, partly because the former case benefits more from better join performance.

Furthermore, the variability in latency in native Amazon Redshift is lower. For use cases where you have tight performance SLAs on queries, you may want to consider using Amazon Redshift exclusively to support those queries.

On the other hand, when you perform large scans, you could benefit from the best of both worlds: higher performance at lower cost. For instance, imagine that you wanted to enable your business analysts to interactively discover insights across a vast amount of historical data. In the example below, the pushdown aggregation query is modified to analyze seven years of data instead of three months:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, uv.totalRevenue
…
WHERE customer <= 3 and visitYearMonth >= 199201
… 
FROM dwdate WHERE d_yearmonthnum >= 199201) as t
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

This query requires scanning and aggregating nearly 1.9 billion records. As shown in the chart below, Redshift Spectrum substantially speeds up this query. A large Amazon Redshift cluster would have to be provisioned to support this use case. With the aid of Redshift Spectrum, you could use an existing small cluster, keep a single copy of your data in S3, and benefit from economical, durable storage while only paying for what you use via the pay per query pricing model.

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 7 years of data

Summary

Redshift Spectrum lowers the time to value for deeper insights on customer data queries spanning the data lake and data warehouse. It can enable interactive analysis on datasets in cases that weren’t economically practical or technically feasible before.

There are cases where you can get the best of both worlds from Redshift Spectrum: higher performance at lower cost. However, there are still latency-sensitive use cases where you may want native Amazon Redshift performance. For more best practice tips, see the 10 Best Practices for Amazon Redshift post.

Please visit the Amazon Redshift Spectrum PoC Environment GitHub page. If you have questions or suggestions, please comment below.

 


Additional Reading

Learn more about how Amazon Redshift Spectrum extends data warehousing out to exabytes – no loading required.


About the Author

Dylan Tong is an Enterprise Solutions Architect at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

 

 

AWS Online Tech Talks – August 2017

Post Syndicated from Sara Rodas original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-august-2017/

Welcome to mid-August, everyone–the season of beach days, family road trips, and an inbox full of “out of office” emails from your coworkers. Just in case spending time indoors has you feeling a bit blue, we’ve got a piping hot batch of AWS Online Tech Talks for you to check out. Kick up your feet, grab a glass of ice cold lemonade, and dive into our latest Tech Talks on Compute and DevOps.

August 2017 – Schedule

Noted below are the upcoming scheduled live, online technical sessions being held during the month of August. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.

Webinars featured this month are:

Thursday, August 17 – Compute

9:00 – 9:40 AM PDT: Deep Dive on Lambda@Edge.

Monday, August 28 – DevOps

10:30 – 11:10 AM PDT: Building Python Serverless Applications with AWS Chalice.

12:00 – 12:40 PM PDT: How to Deploy .NET Code to AWS from Within Visual Studio.

The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations & customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.

– Sara (Hello everyone, I’m a co-op from Northeastern University joining the team until December.)

Curb Your Enthusiasm on Those HBO Leaks

Post Syndicated from Ernesto original https://torrentfreak.com/curb-your-enthusiasm-on-those-hbo-leaks-170814/

Late July, news broke that a hacker, or hackers, had compromised the network of the American cable and television network HBO.

Those responsible contacted reporters, informing them about the prominent breach, and leaked files surfaced on the dedicated website Winter-leak.com.

The website wasn’t around for long, but last week the hackers reached out to the press again with a curated batch of new leaks shared through Mega.nz. Among other things, it contained more Game of Thrones spoilers, marketing plans, and other confidential HBO files.

Fast forward another week and there’s yet another freshly curated batch of leaks. This time it includes episodes of the highly anticipated return of ‘Curb Your Enthusiasm,’ which officially airs in October, as well as episodes from “Barry,” “Insecure” and “The Deuce,” AP reports.

These shows are part of the treasure trove of 1.5 terabytes that was taken from HBO. These and several other titles were already teased last week in a screenshot the hackers released to the press.

There’s no reason to doubt that the leaks are real, but thus far they haven’t been widely distributed. It appears that the various journalists who received the latest batch of Mega.nz links are not very eager to post them in public.

TorrentFreak scoured popular torrent sites and streaming portals for public copies of the new Curb Your Enthusiasm episodes and came up empty-handed. And we’re certainly not the only ones having trouble spotting the leaks in public.

“I searched around a lot a few hours ago and couldn’t find anything,” one Curb Your Enthusiasm watcher commented on Reddit. “Why can’t these hackers be courteous and place links?” another added.

This is quite different from the leaked episode of Game of Thrones that came out before its official release two weeks ago. That leak was not related to the HBO hack, but before the news broke in the mainstream press, thousands of copies were already available on pirate sites.

HBO, meanwhile, appears to have had enough of the continued enthusiasm the hacker is managing to generate in the press.

“We are not in communication with the hacker and we’re not going to comment every time a new piece of information is released,” a company spokesperson said.

“It has been widely reported that there was a cyber incident at HBO. The hacker may continue to drop bits and pieces of stolen information in an attempt to generate media attention. That’s a game we’re not going to participate in.”

As for the Curb Your Enthusiasm fans who were hoping for an early preview of the new season, they may have to, well… you know. For now at least.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Hackers Leak More Confidential Game of Thrones Files

Post Syndicated from Ernesto original https://torrentfreak.com/hackers-leak-more-confidential-game-of-thrones-files-170808/

Last week, news broke that a hacker, or hackers, had compromised the network of the American cable and television network HBO.

Those responsible sent out an email to reporters, announcing the prominent breach, and leaked files surfaced on the dedicated website Winter-leak.com.

While the latter is no longer accessible, the hackers are not done yet. Another curated batch of leaked files has now appeared online, revealing more Game of Thrones spoilers, marketing plans, and other confidential HBO files.

The first leak put a preliminary outline of the fourth episode of the current Game of Thrones season in the spotlight, and the second batch follows up with the same for the upcoming fifth episode.

Although the outline was prepared over a year ago, it likely contains various accurate spoilers, which we won’t repeat here.

Preliminary outline S07E05

The new data dump, which is a subsection of the 1.5 terabytes of data the hackers claimed to have in their possession, also lists a variety of other Game of Thrones related files.

Among other items, there’s a confidential cast list for the current season, a highly confidential “Game of Ideas” brief, an outline of GoT marketing strategies, and a Game of Thrones roadmap. The information all appears to be a few months old.

The hackers took a screenshot of several folders, where the files may have been taken from, as seen below.

Folders screenshot

In addition, the hackers provided ‘proof’ that they have emails, which according to AP point to HBO’s vice president for film programming Leslie Cohen.

Finally, the new batch contains a video letter to HBO CEO Richard Plepler, titled “First letter to HBO,” where a certain Mr. Smith takes credit for the hack. The letter offered to keep the information away from the public, in exchange for a ransom payment.

First letter to HBO

For spoiler-eager Game of Thrones fans the hack is a true treasure trove. However, like the first batch, no leaked episodes are included. And, based on another screenshot, these are probably not on the way either.

A “Series Screenshot” includes a list of likely compromised titles, such as The Deviant Ones and the previously leaked Barry, Ballers, and Room 104, but no Game of Thrones.

A leak of the fourth GoT episode did appear online late last week, but this wasn’t linked to the breach of HBO’s network. Still, HBO is likely not amused and will do everything in its power to catch those responsible.


Turbocharge your Apache Hive queries on Amazon EMR using LLAP

Post Syndicated from Jigar Mistry original https://aws.amazon.com/blogs/big-data/turbocharge-your-apache-hive-queries-on-amazon-emr-using-llap/

Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop cluster using SQL. Data analysts and scientists use Hive to query, summarize, explore, and analyze big data.

With the introduction of Hive LLAP (Low Latency Analytical Processing), the notion of Hive being just a batch processing tool has changed. LLAP uses long-lived daemons with intelligent in-memory caching to circumvent batch-oriented latency and provide sub-second query response times.

This post provides an overview of Hive LLAP, including its architecture and common use cases for boosting query performance. You will learn how to install and configure Hive LLAP on an Amazon EMR cluster and run queries on LLAP daemons.

What is Hive LLAP?

Hive LLAP, introduced in Apache Hive 2.0, provides very fast processing of queries. It uses persistent daemons that are deployed on a Hadoop YARN cluster using Apache Slider. These daemons are long-running and provide functionality such as I/O with DataNode, in-memory caching, query processing, and fine-grained access control. Because the daemons are always running in the cluster, Hive avoids the substantial overhead of launching new YARN containers for every new Hive session, and with it the long startup times.

When Hive is configured in hybrid execution mode, small and short queries execute directly on LLAP daemons. Heavy lifting (like large shuffles in the reduce stage) is performed in YARN containers that belong to the application. Resources (CPU, memory, etc.) are obtained in a traditional fashion using YARN. After the resources are obtained, the execution engine can decide which resources are to be allocated to LLAP, or it can launch Apache Tez processors in separate YARN containers. You can also configure Hive to run all the processing workloads on LLAP daemons for querying small datasets at lightning fast speeds.

LLAP daemons are launched under YARN management so that the compute resources they consume don’t overload the nodes. You can use scheduling queues to make sure that there is enough compute capacity for other YARN applications to run.

Why use Hive LLAP?

With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:

  • For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs in to their existing architecture without a steep learning curve.
  • It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.
  • It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.
  • LLAP daemons are aware of the columns and records that are being processed, which enables you to enforce fine-grained access control.
  • It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.
  • It can take advantage of a number of Hive optimizations like merging multiple small files for query results, automatically determining the number of reducers for joins and groupbys, etc.
  • It’s optional and modular so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you run other YARN applications concurrently without reserving a cluster specifically for LLAP.

How do you install Hive LLAP in Amazon EMR?

To install and configure LLAP on an EMR cluster, use the following bootstrap action (BA):

s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh

This BA downloads and installs Apache Slider on the cluster and configures LLAP so that it works with EMR Hive. For LLAP to work, the EMR cluster must have Hive, Tez, and Apache Zookeeper installed.

You can pass the following arguments to the BA.

Argument      Definition                           Default value
--instances   Number of instances of LLAP daemon   Number of core/task nodes of the cluster
--cache       Cache size per instance              20% of physical memory of the node
--executors   Number of executors per instance     Number of CPU cores of the node
--iothreads   Number of IO threads per instance    Number of CPU cores of the node
--size        Container size per instance          50% of physical memory of the node
--xmx         Working memory size                  50% of container size
--log-level   Log levels for the LLAP instance     INFO
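
To pass any of these arguments at launch, they can be added to the bootstrap action definition. The following is a minimal sketch rather than documented usage of the script: it assumes the script accepts each flag followed by its value as a separate argument, and llap_bootstrap_json is a hypothetical helper name. It prints a JSON value for the --bootstrap-actions option of aws emr create-cluster, using the Args key to carry the arguments.

```shell
# Build the --bootstrap-actions JSON for the LLAP bootstrap action.
# Usage: llap_bootstrap_json INSTANCES EXECUTORS
# Assumption: the script takes "--flag value" pairs as separate arguments.
llap_bootstrap_json() {
  instances=$1
  executors=$2
  path="s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh"
  printf '[{"Path":"%s","Name":"Custom action","Args":["--instances","%s","--executors","%s"]}]\n' \
    "$path" "$instances" "$executors"
}
```

For example, llap_bootstrap_json 3 4 asks for three LLAP daemon instances with four executors each. Arguments that are not listed keep the defaults from the table.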

LLAP example

This section describes how you can try the faster Hive queries with LLAP using the TPC-DS testbench for Hive on Amazon EMR.

Use the following AWS command line interface (AWS CLI) command to launch a 1+3 nodes m4.xlarge EMR 5.6.0 cluster with the bootstrap action to install LLAP:

aws emr create-cluster --release-label emr-5.6.0 \
--applications Name=Hadoop Name=Hive Name=Hue Name=ZooKeeper Name=Tez \
--bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh","Name":"Custom action"}]' \
--ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx"}' \
--service-role EMR_DefaultRole \
--enable-debugging \
--log-uri 's3n://<YOUR-BUCKET>/' --name 'test-hive-llap' \
--instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"MASTER","InstanceType":"m4.xlarge","Name":"Master - 1"},{"InstanceCount":3,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"CORE","InstanceType":"m4.xlarge","Name":"Core - 2"}]' \
--region us-east-1

After the cluster is launched, log in to the master node using SSH, and do the following:

  1. Open the hive-tpcds folder:
    cd /home/hadoop/hive-tpcds/
  2. Start Hive CLI using the testbench configuration, create the required tables, and run the sample query:

    hive -i testbench.settings
    hive> source create_tables.sql;
    hive> source query55.sql;

    This sample query runs on a 40 GB dataset that is stored on Amazon S3. The dataset is generated using the data generation tool in the TPC-DS testbench for Hive.
  3. The query finished in about 47 seconds in LLAP mode. Now, to compare this to the execution time without LLAP, you can run the same workload using only Tez containers:
    hive> set hive.llap.execution.mode=none;
    hive> source query55.sql;


    This query finished in about 80 seconds.

Running the query with only YARN containers took almost 1.7 times as long (about 80 seconds versus about 47 seconds) as running it on LLAP daemons. And with every rerun of the query, the execution time decreases substantially by virtue of in-memory caching by LLAP daemons.
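
The comparison above can also be scripted. The following is a rough sketch, assuming the hive CLI is on the PATH (as on the EMR master node) and that it accepts a source command inside -e; time_query and ratio_x100 are hypothetical helper names. Passing an empty mode leaves the execution mode from testbench.settings in place. The measured time includes Hive CLI startup, so it will not match the query time that Hive itself reports.

```shell
# Time one run of a query file, optionally overriding the LLAP execution mode.
# Usage: time_query MODE QUERY_FILE   (MODE may be "" to keep the init-file setting)
# Prints the elapsed time in whole seconds.
time_query() {
  mode=$1
  query_file=$2
  if [ -n "$mode" ]; then
    prefix="set hive.llap.execution.mode=$mode; "
  else
    prefix=""
  fi
  start=$(date +%s)
  hive -i testbench.settings -e "${prefix}source $query_file;" >/dev/null 2>&1
  end=$(date +%s)
  echo $((end - start))
}

# Ratio of two durations, scaled by 100 to stay in integer arithmetic.
# With the timings quoted in this post: ratio_x100 80 47 prints 170 (about 1.7x).
ratio_x100() {
  echo $(( $1 * 100 / $2 ))
}
```

A typical use would be llap=$(time_query "" query55.sql) followed by tez=$(time_query none query55.sql) and ratio_x100 "$tez" "$llap".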

Conclusion

In this post, I introduced Hive LLAP as a way to boost Hive query performance. I discussed its architecture and described several use cases for the component. I showed how you can install and configure Hive LLAP on an Amazon EMR cluster and how you can run queries on LLAP daemons.

If you have questions about using Hive LLAP on Amazon EMR or would like to share your use cases, please leave a comment below.


Additional Reading

Learn how to automatically partition Hive external tables with AWS.


About the Author

Jigar Mistry is a Hadoop Systems Engineer with Amazon Web Services. He works with customers to provide them architectural guidance and technical support for processing large datasets in the cloud using open-source applications. In his spare time, he enjoys camping and exploring different restaurants in the Seattle area.

Create Multiple Builds from the Same Source Using Different AWS CodeBuild Build Specification Files

Post Syndicated from Prakash Palanisamy original https://aws.amazon.com/blogs/devops/create-multiple-builds-from-the-same-source-using-different-aws-codebuild-build-specification-files/

In June 2017, AWS CodeBuild announced you can now specify an alternate build specification file name or location in an AWS CodeBuild project.

In this post, I’ll show you how to use different build specification files in the same repository to create different builds. You’ll find the source code for this post in our GitHub repo.

Requirements

The AWS CLI must be installed and configured.

Solution Overview

I have created a C program (cbsamplelib.c) that will be used to create a shared library and another utility program (cbsampleutil.c) to use that library. I’ll use a Makefile to compile these files.

I need to put this sample application in RPM and DEB packages so end users can easily deploy them. I have created a build specification file for RPM. It will use make to compile this code and the RPM specification file (cbsample.rpmspec) configured in the build specification to create the RPM package. Similarly, I have created a build specification file for DEB. It will create the DEB package based on the control specification file (cbsample.control) configured in this build specification.

RPM Build Project:

The following build specification file (buildspec-rpm.yml) uses build specification version 0.2. As described in the documentation, this version has different syntax for environment variables. This build specification includes multiple phases:

  • As part of the install phase, the required packages are installed using yum.
  • During the pre_build phase, the required directories are created and the required files, including the RPM build specification file, are copied to the appropriate location.
  • During the build phase, the code is compiled, and then the RPM package is created based on the RPM specification.

As defined in the artifacts section, the RPM file will be uploaded as a build artifact.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - yum install rpm-build make gcc glibc -y
  pre_build:
    commands:
      - curr_working_dir=`pwd`
      - mkdir -p ./{RPMS,SRPMS,BUILD,SOURCES,SPECS,tmp}
      - filename="cbsample-$build_version"
      - echo $filename
      - mkdir -p $filename
      - cp ./*.c ./*.h Makefile $filename
      - tar -zcvf /root/$filename.tar.gz $filename
      - cp /root/$filename.tar.gz ./SOURCES/
      - cp cbsample.rpmspec ./SPECS/
  build:
    commands:
      - echo "Triggering RPM build"
      - rpmbuild --define "_topdir `pwd`" -ba SPECS/cbsample.rpmspec
      - cd $curr_working_dir

artifacts:
  files:
    - RPMS/x86_64/cbsample*.rpm
  discard-paths: yes
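
One detail in the pre_build phase is worth noting: the mkdir -p ./{RPMS,...} line relies on brace expansion, which is a bash feature and is not part of POSIX sh. The post does not say which shell runs the commands in this image, so the following is a hedged alternative sketch that creates the same layout with a plain loop; stage_rpm_tree is a hypothetical helper name.

```shell
# Create the rpmbuild directory layout without relying on brace expansion.
# Usage: stage_rpm_tree TOPDIR
stage_rpm_tree() {
  topdir=$1
  for d in RPMS SRPMS BUILD SOURCES SPECS tmp; do
    mkdir -p "$topdir/$d" || return 1
  done
}
```

Calling stage_rpm_tree . from the source directory produces the same six directories as the brace-expansion form does under bash.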

Using cb-centos-project.json as a reference, create the input JSON file for the CLI command. This project uses an AWS CodeCommit repository named codebuild-multispec and a file named buildspec-rpm.yml as the build specification file. To create the RPM package, we need to specify a custom image name. I’m using the latest CentOS 7 image available in the Docker Hub. I’m using a role named CodeBuildServiceRole. It contains permissions similar to those defined in CodeBuildServiceRole.json. (You need to change the resource fields in the policy, as appropriate.)

{
    "name": "rpm-build-project",
    "description": "Project which will build RPM from the source.",
    "source": {
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec",
        "buildspec": "buildspec-rpm.yml"
    },
    "artifacts": {
        "type": "S3",
        "location": "codebuild-demo-artifact-repository"
    },
    "environment": {
        "type": "LINUX_CONTAINER",
        "image": "centos:7",
        "computeType": "BUILD_GENERAL1_SMALL"
    },
    "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole",
    "timeoutInMinutes": 15,
    "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3",
    "tags": [
        {
            "key": "Name",
            "value": "RPM Demo Build"
        }
    ]
}

After the cli-input-json file is ready, execute the following command to create the build project.

$ aws codebuild create-project --name CodeBuild-RPM-Demo --cli-input-json file://cb-centos-project.json

{
    "project": {
        "name": "CodeBuild-RPM-Demo", 
        "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole", 
        "tags": [
            {
                "value": "RPM Demo Build", 
                "key": "Name"
            }
        ], 
        "artifacts": {
            "namespaceType": "NONE", 
            "packaging": "NONE", 
            "type": "S3", 
            "location": "codebuild-demo-artifact-repository", 
            "name": "CodeBuild-RPM-Demo"
        }, 
        "lastModified": 1500559811.13, 
        "timeoutInMinutes": 15, 
        "created": 1500559811.13, 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3", 
        "arn": "arn:aws:codebuild:eu-west-1:012345678912:project/CodeBuild-RPM-Demo", 
        "description": "Project which will build RPM from the source."
    }
}

When the project is created, run the following command to start the build. After the build has started, get the build ID. You can use the build ID to get the status of the build.

$ aws codebuild start-build --project-name CodeBuild-RPM-Demo
{
    "build": {
        "buildComplete": false, 
        "initiator": "prakash", 
        "artifacts": {
            "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
        }, 
        "projectName": "CodeBuild-RPM-Demo", 
        "timeoutInMinutes": 15, 
        "buildStatus": "IN_PROGRESS", 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "currentPhase": "SUBMITTED", 
        "startTime": 1500560156.761, 
        "id": "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc", 
        "arn": "arn:aws:codebuild:eu-west-1:012345678912:build/CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
    }
}

$ aws codebuild list-builds-for-project --project-name CodeBuild-RPM-Demo
{
    "ids": [
        "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
    ]
}

$ aws codebuild batch-get-builds --ids CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc
{
    "buildsNotFound": [], 
    "builds": [
        {
            "buildComplete": true, 
            "phases": [
                {
                    "phaseStatus": "SUCCEEDED", 
                    "endTime": 1500560157.164, 
                    "phaseType": "SUBMITTED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560156.761
                }, 
                {
                    "contexts": [], 
                    "phaseType": "PROVISIONING", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 24, 
                    "startTime": 1500560157.164, 
                    "endTime": 1500560182.066
                }, 
                {
                    "contexts": [], 
                    "phaseType": "DOWNLOAD_SOURCE", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 15, 
                    "startTime": 1500560182.066, 
                    "endTime": 1500560197.906
                }, 
                {
                    "contexts": [], 
                    "phaseType": "INSTALL", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 19, 
                    "startTime": 1500560197.906, 
                    "endTime": 1500560217.515
                }, 
                {
                    "contexts": [], 
                    "phaseType": "PRE_BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.515, 
                    "endTime": 1500560217.662
                }, 
                {
                    "contexts": [], 
                    "phaseType": "BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.662, 
                    "endTime": 1500560217.995
                }, 
                {
                    "contexts": [], 
                    "phaseType": "POST_BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.995, 
                    "endTime": 1500560218.074
                }, 
                {
                    "contexts": [], 
                    "phaseType": "UPLOAD_ARTIFACTS", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560218.074, 
                    "endTime": 1500560218.542
                }, 
                {
                    "contexts": [], 
                    "phaseType": "FINALIZING", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 4, 
                    "startTime": 1500560218.542, 
                    "endTime": 1500560223.128
                }, 
                {
                    "phaseType": "COMPLETED", 
                    "startTime": 1500560223.128
                }
            ], 
            "logs": {
                "groupName": "/aws/codebuild/CodeBuild-RPM-Demo", 
                "deepLink": "https://console.aws.amazon.com/cloudwatch/home?region=eu-west-1#logEvent:group=/aws/codebuild/CodeBuild-RPM-Demo;stream=57a36755-4d37-4b08-9c11-1468e1682abc", 
                "streamName": "57a36755-4d37-4b08-9c11-1468e1682abc"
            }, 
            "artifacts": {
                "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
            }, 
            "projectName": "CodeBuild-RPM-Demo", 
            "timeoutInMinutes": 15, 
            "initiator": "prakash", 
            "buildStatus": "SUCCEEDED", 
            "environment": {
                "computeType": "BUILD_GENERAL1_SMALL", 
                "privilegedMode": false, 
                "image": "centos:7", 
                "type": "LINUX_CONTAINER", 
                "environmentVariables": []
            }, 
            "source": {
                "buildspec": "buildspec-rpm.yml", 
                "type": "CODECOMMIT", 
                "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
            }, 
            "currentPhase": "COMPLETED", 
            "startTime": 1500560156.761, 
            "endTime": 1500560223.128, 
            "id": "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc", 
            "arn": "arn:aws:codebuild:eu-west-1:012345678912:build/CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
        }
    ]
}
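
Rather than rerunning batch-get-builds by hand, the status check can be wrapped in a polling loop. The following is a minimal sketch, assuming the AWS CLI is installed and configured as stated in the requirements; wait_for_build is a hypothetical helper name, and the 10-second default interval is an arbitrary choice. It uses the CLI's --query and --output options to extract only the buildStatus field shown in the output above.

```shell
# Poll a CodeBuild build until it leaves the IN_PROGRESS state.
# Usage: wait_for_build BUILD_ID [INTERVAL_SECONDS]
# Returns 0 if the build SUCCEEDED, 1 for any other final status or CLI error.
wait_for_build() {
  build_id=$1
  interval=${2:-10}
  while :; do
    status=$(aws codebuild batch-get-builds --ids "$build_id" \
               --query 'builds[0].buildStatus' --output text)
    echo "build $build_id: $status"
    case $status in
      IN_PROGRESS) sleep "$interval" ;;
      SUCCEEDED)   return 0 ;;
      *)           return 1 ;;
    esac
  done
}

# Example use (not run here):
#   BUILD_ID=$(aws codebuild start-build --project-name CodeBuild-RPM-Demo \
#                --query 'build.id' --output text)
#   wait_for_build "$BUILD_ID"
```

Treating every status other than IN_PROGRESS and SUCCEEDED as a failure keeps the loop from spinning forever if the CLI call itself fails and returns nothing.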

DEB Build Project:

In this project, we will use the build specification file named buildspec-deb.yml. Like the RPM build project, this specification includes multiple phases. Here I use a Debian control file to create the package in DEB format. After a successful build, the DEB package will be uploaded as a build artifact.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - apt-get install gcc make -y
  pre_build:
    commands:
      - mkdir -p ./cbsample-$build_version/DEBIAN
      - mkdir -p ./cbsample-$build_version/usr/lib
      - mkdir -p ./cbsample-$build_version/usr/include
      - mkdir -p ./cbsample-$build_version/usr/bin
      - cp -f cbsample.control ./cbsample-$build_version/DEBIAN/control
  build:
    commands:
      - echo "Building the application"
      - make
      - cp libcbsamplelib.so ./cbsample-$build_version/usr/lib
      - cp cbsamplelib.h ./cbsample-$build_version/usr/include
      - cp cbsampleutil ./cbsample-$build_version/usr/bin
      - chmod +x ./cbsample-$build_version/usr/bin/cbsampleutil
      - dpkg-deb --build ./cbsample-$build_version

artifacts:
  files:
    - cbsample-*.deb

Here we use cb-ubuntu-project.json as a reference to create the CLI input JSON file. This project uses the same AWS CodeCommit repository (codebuild-multispec) but a different buildspec file in the same repository (buildspec-deb.yml). We use the default CodeBuild image to create the DEB package. We use the same IAM role (CodeBuildServiceRole).

{
    "name": "deb-build-project",
    "description": "Project which will build DEB from the source.",
    "source": {
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec",
        "buildspec": "buildspec-deb.yml"
    },
    "artifacts": {
        "type": "S3",
        "location": "codebuild-demo-artifact-repository"
    },
    "environment": {
        "type": "LINUX_CONTAINER",
        "image": "aws/codebuild/ubuntu-base:14.04",
        "computeType": "BUILD_GENERAL1_SMALL"
    },
    "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole",
    "timeoutInMinutes": 15,
    "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3",
    "tags": [
        {
            "key": "Name",
            "value": "Debian Demo Build"
        }
    ]
}

Using the CLI input JSON file, create the project, start the build, and check the status of the project.

$ aws codebuild create-project --name CodeBuild-DEB-Demo --cli-input-json file://cb-ubuntu-project.json

$ aws codebuild start-build --project-name CodeBuild-DEB-Demo

$ aws codebuild list-builds-for-project --project-name CodeBuild-DEB-Demo

$ aws codebuild batch-get-builds --ids CodeBuild-DEB-Demo:e535c4b0-7067-4fbe-8060-9bb9de203789

After the RPM and DEB builds complete successfully, check the S3 bucket configured in the artifacts section for the build packages. Each build project creates a directory named after the project and copies its artifacts into it.

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-RPM-Demo/
2017-07-20 16:16:59       8108 cbsample-0.1-1.el7.centos.x86_64.rpm

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-DEB-Demo/
2017-07-20 16:37:22       5420 cbsample-0.1.deb

Override Buildspec During Build Start:

It’s also possible to override the build specification file of an existing project when starting a build. If we want to create the libs RPM package instead of the whole RPM, we will use the build specification file named buildspec-libs-rpm.yml. This build specification file is similar to the earlier RPM build. The only difference is that it uses a different RPM specification file to create libs RPM.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - yum install rpm-build make gcc glibc -y
  pre_build:
    commands:
      - curr_working_dir=`pwd`
      - mkdir -p ./{RPMS,SRPMS,BUILD,SOURCES,SPECS,tmp}
      - filename="cbsample-libs-$build_version"
      - echo $filename
      - mkdir -p $filename
      - cp ./*.c ./*.h Makefile $filename
      - tar -zcvf /root/$filename.tar.gz $filename
      - cp /root/$filename.tar.gz ./SOURCES/
      - cp cbsample-libs.rpmspec ./SPECS/
  build:
    commands:
      - echo "Triggering RPM build"
      - rpmbuild --define "_topdir `pwd`" -ba SPECS/cbsample-libs.rpmspec
      - cd $curr_working_dir

artifacts:
  files:
    - RPMS/x86_64/cbsample-libs*.rpm
  discard-paths: yes

Using the same RPM build project that we created earlier, start a new build and set the value of the `--buildspec-override` parameter to buildspec-libs-rpm.yml.

$ aws codebuild start-build --project-name CodeBuild-RPM-Demo --buildspec-override buildspec-libs-rpm.yml
{
    "build": {
        "buildComplete": false, 
        "initiator": "prakash", 
        "artifacts": {
            "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
        }, 
        "projectName": "CodeBuild-RPM-Demo", 
        "timeoutInMinutes": 15, 
        "buildStatus": "IN_PROGRESS", 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-libs-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "currentPhase": "SUBMITTED", 
        "startTime": 1500562366.239, 
        "id": "CodeBuild-RPM-Demo:82d05f8a-b161-401c-82f0-83cb41eba567", 
        "arn": "arn:aws:codebuild:eu-west-1:012345678912:build/CodeBuild-RPM-Demo:82d05f8a-b161-401c-82f0-83cb41eba567"
    }
}

After the build is completed successfully, check to see if the package appears in the artifact S3 bucket under the CodeBuild-RPM-Demo build project folder.

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-RPM-Demo/
2017-07-20 16:16:59       8108 cbsample-0.1-1.el7.centos.x86_64.rpm
2017-07-20 16:53:54       5320 cbsample-libs-0.1-1.el7.centos.x86_64.rpm

Conclusion

In this post, I have shown you how multiple buildspec files in the same source repository can be used to run multiple AWS CodeBuild build projects. I have also shown you how to provide a different buildspec file when starting the build.

For more information about AWS CodeBuild, see the AWS CodeBuild documentation. You can get started with AWS CodeBuild by using this step-by-step guide.


About the author

Prakash Palanisamy is a Solutions Architect for Amazon Web Services. When he is not working on Serverless, DevOps, or Alexa, he is solving problems on Project Euler. He also enjoys watching educational documentaries.

Zero-Day Vulnerabilities against Windows in the NSA Tools Released by the Shadow Brokers

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/07/zero-day_vulner.html

In April, the Shadow Brokers — presumably Russia — released a batch of Windows exploits from what is presumably the NSA. Included in that release were eight different Windows vulnerabilities. Given a presumed theft date of the data as sometime between 2012 and 2013 — based on timestamps of the documents and the limited Windows 8 support of the tools:

  • Three were already patched by Microsoft. That is, they were not zero days, and could only be used against unpatched targets. They are EMERALDTHREAD, EDUCATEDSCHOLAR, and ECLIPSEDWING.
  • One was discovered to have been used in the wild and patched in 2014: ESKIMOROLL.

  • Four were only patched when the NSA informed Microsoft about them in early 2017: ETERNALBLUE, ETERNALSYNERGY, ETERNALROMANCE, and ETERNALCHAMPION.

So of the five serious zero-day vulnerabilities against Windows in the NSA’s pocket, four were never independently discovered. This isn’t new news, but I haven’t seen this summary before.

How to Use Batch References in Amazon Cloud Directory to Refer to New Objects in a Batch Request

Post Syndicated from Vineeth Harikumar original https://aws.amazon.com/blogs/security/how-to-use-batch-references-in-amazon-cloud-directory-to-refer-to-new-objects-in-a-batch-request/

In Amazon Cloud Directory, it’s often necessary to add new objects or add relationships between new objects and existing objects to reflect changes in a real-world hierarchy. With Cloud Directory, you can make these changes efficiently by using batch references within batch operations.

Let’s say I want to take an existing child object in a hierarchy, detach it from its parent, and reattach it to another part of the hierarchy. A simple way to do this would be to make a call to get the object’s unique identifier, another call to detach the object from its parent using the unique identifier, and a third call to attach it to a new parent. However, if I use batch references within a batch write operation, I can perform all three of these actions in the same request, greatly simplifying my code and reducing the round trips required to make such changes.

In this post, I demonstrate how to use batch references in a single write request to simplify adding and restructuring a Cloud Directory hierarchy. I have used the AWS SDK for Java for all the sample code in this post, but you can use other language SDKs or the AWS CLI in a similar way.

Using batch references

In my previous post, I demonstrated how to add AnyCompany’s North American warehouses to a global network of warehouses. As time passes and demand grows, AnyCompany launches multiple warehouses in North American cities to fulfill customer orders with continued efficiency. This requires the company to restructure the network to group warehouses in the same region so that the company can apply similar standards to them, such as delivery times, delivery areas, and types of products sold.

For instance, in the NorthAmerica object (see the following diagram), AnyCompany has launched two new warehouses in the Phoenix (PHX) area: PHX_2 and PHX_3. AnyCompany wants to add these new warehouses to the network and regroup them with existing warehouse PHX_1 under the new node, PHX.

The state of the hierarchy before this regrouping is shown in the following diagram, where I added the NorthAmerica warehouses (also represented as NA in the diagram) to the larger network of AnyCompany’s warehouses.

Diagram showing the state of the hierarchy before this post's regrouping

Adding and grouping new warehouses in the NorthAmerica network

I want to add and group the new warehouses with a single request, and using batch references in a batch write lets me do that. A batch reference is simply an object reference whose name you are allowed to define arbitrarily. This allows you to chain operations, which means using the return value from one operation in a subsequent operation within the same batch write request.

Let’s say I have a batch write request with two batch operations: operation A and operation B. Both batch operations operate on the same object X. In operation A, I use the object X found at /NorthAmerica/Phoenix, and I assign it to a batch reference that I call referencePhoenix. In operation B, I want to modify the same object X, so I use referencePhoenix as the object reference that points to the same unique object X used in operation A. I also will use the same helper method implementation from my previous post for getBatchCreateOperation. To learn more about batch references, see the ObjectReference documentation.

To add and group the new warehouses, I will take advantage of batch references to sequentially:

  1. Detach PHX_1 from the NA node and maintain a reference to PHX_1.
  2. Create a new child node, PHX, and attach it to the NA node.
  3. Create PHX_2 and PHX_3 nodes for the new warehouses.
  4. Link all three nodes—PHX_1 (using the batch reference), PHX_2, and PHX_3—to the PHX node.

The following code example achieves these changes in a single batch by using references. First, the code sets up a createObjectPHX operation to create the PHX parent object and attach it to the parent NorthAmerica object. It then sets up createObjectPHX_2 and createObjectPHX_3 and attaches these new objects to the new PHX object. The code then sets up a detachObject to detach the current PHX_1 object from its parent and assign it to a batch reference. The last operation uses that same batch reference to attach the PHX_1 object to the newly created PHX object. The code example orders these steps sequentially in a batch write operation.

   BatchWriteOperation createObjectPHX = getBatchCreateOperation(
        "PHX",
        directorySchemaARN,
        "/NorthAmerica",
        "Phoenix");
   BatchWriteOperation createObjectPHX_2 = getBatchCreateOperation(
        "PHX_2",
        directorySchemaARN,
        "/NorthAmerica/Phoenix",
        "PHX_2");
   BatchWriteOperation createObjectPHX_3 = getBatchCreateOperation(
        "PHX_3",
        directorySchemaARN,
        "/NorthAmerica/Phoenix",
        "PHX_3");


   BatchDetachObject detachObject = new BatchDetachObject()
        .withBatchReferenceName("referenceToPHX_1")
        .withLinkName("Phoenix")
        .withParentReference(new ObjectReference()
             .withSelector("/NorthAmerica"));

   BatchAttachObject attachObject = new BatchAttachObject()
        .withChildReference(new ObjectReference().withSelector("#referenceToPHX_1"))
        .withLinkName("PHX_1")
        .withParentReference(new ObjectReference()
            .withSelector("/NorthAmerica/Phoenix"));

   BatchWriteOperation detachOperation = new BatchWriteOperation()
       .withDetachObject(detachObject);
   BatchWriteOperation attachOperation = new BatchWriteOperation()
       .withAttachObject(attachObject);


   BatchWriteRequest request = new BatchWriteRequest();
   request.setDirectoryArn(directoryARN);
   request.setOperations(Lists.newArrayList(
       detachOperation,
       createObjectPHX,
       createObjectPHX_2,
       createObjectPHX_3,
       attachOperation));

   client.batchWrite(request);

In the preceding code example, I can use the batch reference, referenceToPHX_1, within the same batch write request without knowing the object identifier of PHX_1. If I couldn’t use such a batch reference, I would have to use separate requests to get the PHX_1 identifier, detach it from the NA node, and then attach it to the new PHX node.
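For readers who work in Python, the detach-and-reattach pair can be sketched as plain request dictionaries. This is a minimal sketch, not the post's implementation: the helper name move_with_batch_reference is hypothetical, and the dictionary keys are assumed to mirror the Java builders above (parent reference, link name, batch reference name). The code only builds data and calls no AWS API.

```python
def move_with_batch_reference(old_parent, old_link, new_parent, new_link, ref_name):
    """Build detach and attach operations that share one batch reference.

    Hypothetical helper: it only builds dictionaries and calls no AWS API.
    """
    detach = {
        "DetachObject": {
            "ParentReference": {"Selector": old_parent},
            "LinkName": old_link,
            # Name the detached object so later operations can refer to it.
            "BatchReferenceName": ref_name,
        }
    }
    attach = {
        "AttachObject": {
            "ParentReference": {"Selector": new_parent},
            # A leading "#" marks the selector as a batch reference.
            "ChildReference": {"Selector": "#" + ref_name},
            "LinkName": new_link,
        }
    }
    return [detach, attach]


operations = move_with_batch_reference(
    "/NorthAmerica", "Phoenix", "/NorthAmerica/Phoenix", "PHX_1", "referenceToPHX_1"
)
```

In a full request, the create operations for PHX, PHX_2, and PHX_3 would sit between these two entries, in the same order as the Java example.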

I now have the network configuration I want, as shown in the following diagram. I have used a combination of batch operations with batch references to bring new warehouses into the network and regroup them within the same local group of warehouses.

Diagram showing the desired network configuration

Summary

In this post, I have shown how you can use batch references in a single batch write request to simplify adding and restructuring your existing hierarchies in Cloud Directory. You can use batch references in scenarios where you want to get an object identifier, but don’t want the overhead of using a read operation before a write operation. Instead, you can use a batch reference to refer to an object as part of the intermediate batch operation. To learn more about batch operations, see Batches, BatchWrite, and BatchRead.

If you have comments about this post, submit them in the “Comments” section below. If you have implementation questions, start a new thread on the Directory Service forum.

– Vineeth

Write and Read Multiple Objects in Amazon Cloud Directory by Using Batch Operations

Post Syndicated from Vineeth Harikumar original https://aws.amazon.com/blogs/security/write-and-read-multiple-objects-in-amazon-cloud-directory-by-using-batch-operations/

Amazon Cloud Directory is a hierarchical data store that enables you to build flexible, cloud-native directories for organizing hierarchies of data along multiple dimensions. For example, you can create an organizational structure that you can navigate through multiple hierarchies for reporting structure, location, and cost center.

In this blog post, I demonstrate how you can use Cloud Directory APIs to write and read multiple objects by using batch operations. With batch write operations, you can execute a sequence of operations atomically—meaning that all of the write operations must occur, or none of them do. You also can make your application efficient by reducing the number of required round trips to read and write objects to your directory. I have used the AWS SDK for Java for all the sample code in this blog post, but you can use other language SDKs or the AWS CLI in a similar way.

Using batch write operations

To demonstrate batch write operations, let’s say that AnyCompany’s warehouses are organized to determine the fastest methods to ship orders to its customers. In North America, AnyCompany plans to open new warehouses regularly so that the company can keep up with customer demand while continuing to meet the delivery times to which they are committed.

The following diagram shows part of AnyCompany’s global network, including Asian and European warehouse networks.

Let’s take a look at how I can use batch write operations to add NorthAmerica to AnyCompany’s global network of warehouses, with the first three warehouses in New York City (NYC), Las Vegas (LAS), and Phoenix (PHX).

Adding NorthAmerica to the global network

To add NorthAmerica to the global network, I can use a batch write operation to create and link all the objects in the existing network.

First, I set up a helper method, getBatchCreateOperation, that performs the repetitive work of building a create operation. The following lines of code help me create an NA object for NorthAmerica and then attach the three city-related nodes: NYC, LAS, and PHX. Because AnyCompany is planning to grow its network, I add a suffix of _1 to each city code (such as PHX_1), which will be helpful hierarchically when the company adds more warehouses within a city.

    private BatchWriteOperation getBatchCreateOperation(
            String warehouseName,
            String directorySchemaARN,
            String parentReference,
            String linkName) {

        SchemaFacet warehouse_facet = new SchemaFacet()
            .withFacetName("warehouse")
            .withSchemaArn(directorySchemaARN);

        AttributeKeyAndValue kv = new AttributeKeyAndValue()
            .withKey(new AttributeKey()
                .withFacetName("warehouse")
                .withName("name")
                .withSchemaArn(directorySchemaARN))
            .withValue(new TypedAttributeValue()
                .withStringValue(warehouseName));

        List<SchemaFacet> facets = Lists.newArrayList(warehouse_facet);
        List<AttributeKeyAndValue> kvs = Lists.newArrayList(kv);

        BatchCreateObject createObject = new BatchCreateObject();

        createObject.withParentReference(new ObjectReference()
            .withSelector(parentReference));
        createObject.withLinkName(linkName);

        createObject.withBatchReferenceName(UUID.randomUUID().toString());
        createObject.withSchemaFacet(facets);
        createObject.withObjectAttributeList(kvs);

        return new BatchWriteOperation().withCreateObject(createObject);
    }

The parameters of this helper method include:

  • warehouseName – The name of the warehouse to create in the getBatchCreateOperation object.
  • directorySchemaARN – The Amazon Resource Name (ARN) of the schema applied to the directory.
  • parentReference – The object reference of the parent object.
  • linkName – The unique child path from the parent reference where the object should be attached.

I then use this helper method to set up multiple create operations for NorthAmerica, NewYork, Phoenix, and LasVegas. For the sake of simplicity, I use airport codes to stand for the cities (for example, NYC stands for NewYork).

   BatchWriteOperation createObjectNA = getBatchCreateOperation(
                      "NA",
                      directorySchemaARN,
                      "/",
                      "NorthAmerica");
   BatchWriteOperation createObjectNYC = getBatchCreateOperation(
                      "NYC_1",
                      directorySchemaARN,
                      "/NorthAmerica",
                      "NewYork");
   BatchWriteOperation createObjectPHX = getBatchCreateOperation(
                       "PHX_1",
                       directorySchemaARN,
                       "/NorthAmerica",
                       "Phoenix");
   BatchWriteOperation createObjectLAS = getBatchCreateOperation(
                      "LAS_1",
                      directorySchemaARN,
                      "/NorthAmerica",
                      "LasVegas");

   BatchWriteRequest request = new BatchWriteRequest();
   request.setDirectoryArn(directoryARN);
   request.setOperations(Lists.newArrayList(
       createObjectNA,
       createObjectNYC,
       createObjectPHX,
       createObjectLAS));

   client.batchWrite(request);

Running the preceding code results in a hierarchy for the network with NA added to the network, as shown in the following diagram.

Using batch read operations

Now, let’s say that after I add NorthAmerica to AnyCompany’s global network, an analyst wants to see the updated view of the NorthAmerica warehouse network as well as some information about the newly introduced warehouse configurations for the Phoenix warehouses. To do this, I can use batch read operations to get the network of warehouses for NorthAmerica as well as specifically request the attributes and configurations of the Phoenix warehouses.

To list the children of the NorthAmerica warehouses, I use the BatchListObjectChildren API to get all the children at the path, /NorthAmerica. Next, I want to view the attributes of the Phoenix object, so I use the BatchListObjectAttributes API to read all the attributes of the object at /NorthAmerica/Phoenix, as shown in the following code example.

    BatchListObjectChildren listObjectChildrenRequest = new BatchListObjectChildren()
        .withObjectReference(new ObjectReference().withSelector("/NorthAmerica"));
    BatchListObjectAttributes listObjectAttributesRequest = new BatchListObjectAttributes()
        .withObjectReference(new ObjectReference()
            .withSelector("/NorthAmerica/Phoenix"));
    BatchReadRequest batchRead = new BatchReadRequest()
        .withConsistencyLevel(ConsistencyLevel.EVENTUAL)
        .withDirectoryArn(directoryARN)
        .withOperations(Lists.newArrayList(listObjectChildrenRequest, listObjectAttributesRequest));

    BatchReadResult result = client.batchRead(batchRead);

Exception handling

Batch operations in Cloud Directory might sometimes fail, and it is important to know how to handle such failures, which differ for write operations and read operations.

Batch write operation failures

If a batch write operation fails, Cloud Directory fails the entire batch operation and returns an exception. The exception contains the index of the operation that failed along with the exception type and message. If you see RetryableConflictException, you can try again with exponential backoff. A simple way to do this is to double the amount of time you wait each time you get an exception or failure. For example, if your first batch write operation fails, wait 100 milliseconds and try the request again. If the second request fails, wait 200 milliseconds and try again. If the third request fails, wait 400 milliseconds and try again.
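The doubling schedule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: RetryableConflict is a stand-in exception class defined here for illustration (not the SDK's own class), and call_with_backoff and its parameters are hypothetical names.

```python
import time


class RetryableConflict(Exception):
    """Stand-in for the service's RetryableConflictException (assumption)."""


def call_with_backoff(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), doubling the wait after each retryable failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableConflict:
            if attempt == max_attempts:
                raise  # Give up after the final attempt.
            sleep(delay)
            delay *= 2  # 100 ms, then 200 ms, then 400 ms.
```

Passing sleep as a parameter keeps the sketch easy to exercise without real waiting. In production code you might also add random jitter and a cap on the delay.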

Batch read operation failures

Batch read operations behave differently: the response for each individual operation contains either a successful response or an exception response. An individual failure does not cause the entire batch read operation to fail; Cloud Directory returns a separate success or failure response for each operation.
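One way to handle per-operation results is to separate successes from failures while keeping each operation's position in the batch. The following Python sketch assumes the result is a dictionary with a Responses list whose entries hold either a SuccessfulResponse or an ExceptionResponse; that shape, the helper name, and the sample data are assumptions for illustration.

```python
def split_batch_read_responses(result):
    """Separate per-operation successes from failures, keeping their index."""
    successes, failures = [], []
    for index, response in enumerate(result.get("Responses", [])):
        if "ExceptionResponse" in response:
            failures.append((index, response["ExceptionResponse"]))
        else:
            successes.append((index, response.get("SuccessfulResponse")))
    return successes, failures


# Made-up result for illustration only.
sample = {
    "Responses": [
        {"SuccessfulResponse": {"ListObjectChildren": {"Children": {"Phoenix": "id-1"}}}},
        {"ExceptionResponse": {"Type": "ResourceNotFoundException", "Message": "not found"}},
    ]
}
ok, failed = split_batch_read_responses(sample)
```

Keeping the index lets you map each failure back to the operation that produced it and retry only that operation.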

Limits of batch operations

Batch operations are still constrained by the same Cloud Directory limits as other Cloud Directory APIs. A single batch operation does not limit the number of operations, but the total number of nodes or objects being written or edited in a single batch operation has enforced limits. For example, a total of 20 objects can be written in a single batch operation request to Cloud Directory, regardless of how many individual operations there are within that batch. Similarly, a total of 200 objects can be read in a single batch operation request to Cloud Directory. For more information, see limits on batch operations.

Summary

In this post, I have demonstrated how you can use batch operations to operate on multiple objects and simplify making complicated changes across hierarchies. In my next post, I will demonstrate how to use batch references within batch write operations. To learn more about batch operations, see Batches, BatchWrite, and BatchRead.

If you have comments about this post, submit them in the “Comments” section below. If you have implementation questions, start a new thread on the Directory Service forum.

– Vineeth

Analysis of Top-N DynamoDB Objects using Amazon Athena and Amazon QuickSight

Post Syndicated from Rendy Oka original https://aws.amazon.com/blogs/big-data/analysis-of-top-n-dynamodb-objects-using-amazon-athena-and-amazon-quicksight/

If you run an operation that continuously generates a large amount of data, you may want to know what kind of data is being inserted by your application. The ability to analyze data intake quickly can be very valuable for business units, such as operations and marketing. For many operations, it’s important to see what is driving the business at any particular moment. For retail companies, for example, understanding which products are currently popular can aid in planning for future growth. Similarly, for PR companies, understanding the impact of an advertising campaign can help them market their products more effectively.

This post covers an architecture that helps you analyze your streaming data. You’ll build a solution using Amazon DynamoDB Streams, AWS Lambda, Amazon Kinesis Firehose, and Amazon Athena to analyze data intake at a frequency that you choose. And because this is a serverless architecture, you can use all of the services here without the need to provision or manage servers.

The data source

You’ll collect a random sampling of tweets via Twitter’s API and store a variety of attributes in your DynamoDB table, such as: Twitter handle, tweet ID, hashtags, location, and Time-To-Live (TTL) value.

In DynamoDB, the partition key is used as an input to an internal hash function. The output from this function determines the partition in which the data will be stored. When using a combination of partition key and sort key as a DynamoDB schema, you need to make sure that no single partition key contains many more objects than the other partition keys because this can cause partition-level throttling. For the demonstration in this blog, the Twitter handle will be the partition key and the tweet ID will be the sort key. This allows you to group and sort tweets from each user.

To help you get started, I have written a script that pulls a live Twitter stream that you can use to generate your data. All you need to do is provide your own Twitter Apps credentials, and it should generate the data immediately. Alternatively, I have also provided a script that you can use to generate random Tweets with little effort.

You can find both scripts in the GitHub repository:

https://github.com/awslabs/aws-blog-dynamodb-analysis

There are some modules that you may need to install to run these scripts. You can find them in Python’s module repository:

To get your own Twitter credentials, go to https://www.twitter.com/ and sign up for a free account, if you don’t already have one. After your account is set up, go to https://apps.twitter.com/. On the main landing page, choose the Create New App button. After the application is created, go to Keys and Access Tokens to get your credentials to use the Twitter API. You’ll need to generate a Consumer Key/Secret and an Access Token/Secret. All four values will be used to authenticate your request.

Architecture overview

Before we begin, let’s take a look at what the overall flow of information will look like, from data ingestion into DynamoDB to visualization of results in Amazon QuickSight.

As illustrated in the architecture diagram above, any changes made to the items in DynamoDB will be captured and processed using DynamoDB Streams. Next, a Lambda function will be invoked by a trigger that is configured to respond to events in DynamoDB Streams. The Lambda function processes the data prior to pushing to Amazon Kinesis Firehose, which will output to Amazon S3. Finally, you use Amazon Athena to analyze the streaming data landing in Amazon S3. The result can be explored and visualized in Amazon QuickSight for your company’s business analytics.

You’ll need to implement a custom Lambda function to transform the raw <key, value> data stored in DynamoDB into a JSON format that Athena can digest, but I provide sample code that you are free to modify.

Implementation

In the following sections, I’ll walk through how you can set up the architecture discussed earlier.

Create your DynamoDB table

First, let’s create a DynamoDB table and enable DynamoDB Streams. This will enable data to be copied out of this table. From the console, use the user_id as the partition key and tweet_id as the sort key:

After the table is ready, you can enable DynamoDB Streams. This process operates asynchronously, so there is no performance impact on the table when you enable this feature. The easiest way to manage DynamoDB Streams is also through the DynamoDB console.

In the Overview tab of your newly created table, click Manage Stream. In the window, choose the information that will be written to the stream whenever data in the table is added or modified. In this example, you can choose either New image or New and old images.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

Configure Kinesis Firehose

Before creating the Lambda function, you need to configure a Kinesis Firehose delivery stream so that it’s ready to accept data from Lambda. Open the Firehose console and choose Create Firehose Delivery Stream. From here, choose S3 as the destination and use the following information to configure the resource. Note the Delivery stream name because you will use it in the next step.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/firehose/latest/dev/basic-create.html#console-to-s3

Create your Lambda function

Now that Kinesis Firehose is ready to accept data, you can create your Lambda function.

From the AWS Lambda console, choose the Create a Lambda function button and use the Blank Function. Enter a name and description, and choose Python 2.7 as the Runtime. Note your Lambda function name because you’ll need it in the next step.

In the Lambda function code field, you can paste the script that I have written for this purpose. All this function needs is the name of your Firehose delivery stream, set as an environment variable named firehose_stream_name.

import json
import os

import boto3

# Create the Firehose client once so that warm invocations reuse it
firehose_client = boto3.client('firehose')

def lambda_handler(event, context):
    records = []
    for record in event['Records']:
        ddb = record['dynamodb']
        # One newline-terminated JSON document per stream record
        t_stats = {
            'table_name': record['eventSourceARN'].split('/')[1],
            'user_id': ddb['Keys']['user_id']['S'],
            'tweet_id': ddb['Keys']['tweet_id']['N'],
            'approx_post_time': str(int(ddb['ApproximateCreationDateTime']))
        }
        records.append({'Data': json.dumps(t_stats) + '\n'})
    try:
        res = firehose_client.put_record_batch(
            DeliveryStreamName=os.environ['firehose_stream_name'],
            Records=records
        )
    except Exception as e:
        # Log and re-raise so that the batch is retried instead of silently dropped
        print('Failed to deliver records to Firehose: {}'.format(e))
        raise
    # put_record_batch can fail for some records without raising; report the count
    if res['FailedPutCount']:
        print('{} records were not delivered.'.format(res['FailedPutCount']))
    return 'Successfully processed {} records.'.format(len(event['Records']))

The handler should be set to lambda_function.lambda_handler and you can use the existing lambda_dynamodb_streams role that’s been created by default.
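To make the handler's transformation concrete, the following sketch isolates the step that turns one stream record into a line of JSON and runs it against a sample record. The function name to_firehose_line is hypothetical, and the sample record (including the account ID and values) is made up for illustration; it contains only the fields that the handler reads.

```python
import json


def to_firehose_line(record):
    """Turn one DynamoDB Streams record into a newline-terminated JSON line."""
    keys = record["dynamodb"]["Keys"]
    stats = {
        # The table name is the second "/"-separated segment of the stream ARN.
        "table_name": record["eventSourceARN"].split("/")[1],
        "user_id": keys["user_id"]["S"],
        "tweet_id": keys["tweet_id"]["N"],
        "approx_post_time": str(int(record["dynamodb"]["ApproximateCreationDateTime"])),
    }
    return json.dumps(stats, sort_keys=True) + "\n"


# Made-up record for illustration; the account ID and values are placeholders.
sample_record = {
    "eventSourceARN": (
        "arn:aws:dynamodb:us-east-1:123456789012:"
        "table/TwitterFeed/stream/2017-05-17T00:00:00.000"
    ),
    "dynamodb": {
        "Keys": {"user_id": {"S": "example_user"}, "tweet_id": {"N": "42"}},
        "ApproximateCreationDateTime": 1494982800.0,
    },
}
line = to_firehose_line(sample_record)
```

The trailing newline matters because Firehose concatenates records as they arrive, and the JSON SerDe used later expects one document per line.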

Enable DynamoDB trigger and start collecting data

Everything is ready to go. Open your table using the DynamoDB console and go to the Triggers tab. Select the Create trigger drop down list and choose Existing Lambda function. In the pop-up window, select the function that you just created, and choose the Create button.

At this point, you can start collecting data with the Python scripts that I’ve provided. The first script pulls public Twitter data, and the other generates fake tweets using Lorem Ipsum text.

Configure Amazon Athena to read the data

Next, you will configure Amazon Athena so that it can read the data Kinesis Firehose outputs to Amazon S3 and allow you to analyze the data as needed. You can connect to Athena directly from the Athena console, and you can establish a connection using JDBC or the Athena API. In this example, I’m going to demonstrate what this looks like on the Athena console.

First, create a new database and a new table. You can do this by running the following two queries. The first query creates a new database:

CREATE DATABASE IF NOT EXISTS ddbtablestats

And the second query creates a new table:

CREATE EXTERNAL TABLE IF NOT EXISTS ddbtablestats.twitterfeed (
    `table_name` string,
    `user_id` string,
    `tweet_id` bigint,
    `approx_post_time` timestamp 
) PARTITIONED BY (
    year string,
    month string,
    day string,
    hour string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://myBucket/dynamodb/streams/transactions/'

Note that this table is created using partitions. Partitioning separates your data into logical parts based on certain criteria, such as date, location, language, etc. This allows Athena to selectively pull your data without needing to process the entire data set. This effectively minimizes the query execution time, and it also allows you to have greater control over the data that you want to query.

After the query has completed, you should be able to see the table in the left side pane of the Athena dashboard.

After the database and table have been created, execute the ALTER TABLE query to populate the partitions in your table. Replace the date with the current date when the script was executed.

ALTER TABLE ddbtablestats.TwitterFeed ADD IF NOT EXISTS
PARTITION (year='2017',month='05',day='17',hour='01') location 's3://myBucket/dynamodb/streams/transactions/2017/05/17/01/'

Using the Athena console, you need to manually add each additional partition that you’d like to analyze. However, you can automate this process programmatically by using the JDBC driver or any AWS SDK of your choice.
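As a starting point for that automation, the ALTER TABLE statement can be generated from a timestamp. The following Python sketch only builds the statement text; the function name add_partition_statement is hypothetical, and the default table name and S3 location are the placeholder values used in this post. Submitting the statement to Athena is left to whichever client you choose.

```python
from datetime import datetime


def add_partition_statement(when, table="ddbtablestats.twitterfeed",
                            location="s3://myBucket/dynamodb/streams/transactions/"):
    """Build the ALTER TABLE statement for the hour that contains `when`."""
    year, month, day, hour = when.strftime("%Y %m %d %H").split()
    prefix = "{}{}/{}/{}/{}/".format(location, year, month, day, hour)
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS "
        "PARTITION (year='{}',month='{}',day='{}',hour='{}') "
        "location '{}'"
    ).format(table, year, month, day, hour, prefix)


statement = add_partition_statement(datetime(2017, 5, 17, 1))
```

Because the statement uses IF NOT EXISTS, running it more than once for the same hour is harmless, so a scheduled job can call it every hour.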

For more information on partitioning in Athena, check out our documentation:

http://docs.aws.amazon.com/athena/latest/ug/partitions.html

Querying the data in Amazon Athena

This is it! Let’s run this query to see the top 10 most active Twitter users in the last 24 hours. You can do this from the Athena console:

SELECT user_id, COUNT(DISTINCT tweet_id) tweets FROM ddbTableStats.TwitterFeed
WHERE year='2017' AND month='05' AND day='17'
GROUP BY user_id
ORDER BY tweets DESC
LIMIT 10

The result should look similar to the following:

Linking Athena to Amazon QuickSight

Finally, to make this data available to a larger audience, let’s visualize this data in Amazon QuickSight. Amazon QuickSight provides native connectivity to AWS data sources such as Amazon Redshift, Amazon RDS, and Amazon Athena. Amazon QuickSight can also connect to on-premises databases, Excel, or CSV files, and it can connect to cloud data sources such as Salesforce.com. For this solution, we will connect Amazon QuickSight to the Athena table we just created.

Amazon QuickSight has a free tier that provides 1 user and 1 GB of SPICE (Super-fast, Parallel, In-memory Calculation Engine) capacity, so you can sign up and use QuickSight free of charge.

When you are signing up for Amazon QuickSight, ensure that you grant permissions for QuickSight to connect to Athena and the S3 bucket where the data is stored.

After you’ve signed up, navigate to the new analysis button, and choose new data set, and then select the Athena data source option. Create a new name for your data source and proceed to the next prompt. At this point, you should see the Athena table you created earlier.

Choose the option to import the data to SPICE for a quicker analysis. SPICE is an in-memory optimized calculation engine that is designed for quick data visualization through parallel processing. SPICE also enables you to refresh your data sets at a regular interval or on-demand as you want.

In the dialog box, confirm this data set creation, and you’ll arrive on the landing page where you can start building your graph. The X-axis will represent the user_id and the Value will be used to represent the SUM total of the tweets from each user.

The Amazon QuickSight report looks like this:

Through this visualization, I can easily see that there are 3 users that tweeted over 20 times that day and that the majority of the users have fewer than 10 tweets that day. I can also set up a scheduled refresh of my SPICE dataset so that I have a dashboard that is regularly updated with the latest data.

Closing thoughts

Here are the benefits that you can gain from using this architecture:

  1. You can optimize the design of your DynamoDB schema so that it follows AWS best practice recommendations.
  2. You can run analysis and data intelligence in order to understand the current customer demands for your business.
  3. You can store incremental backups for future auditing.

The flexibility of our AWS services invites you to create and design the ideal workflow for your production at any scale, and, as always, if you ever need some guidance, don’t hesitate to reach out to us. I hope this has been helpful to you! Please leave any questions and comments below.

 


Additional Reading

Learn how to analyze VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight.


About the Author

Rendy Oka is a Big Data Support Engineer for Amazon Web Services. He provides consultations and architectural designs and partners with the TAMs, Solution Architects, and AWS product teams to help develop solutions for our customers. He is also a team lead for the big data support team in Seattle. Rendy has traveled to dozens of countries around the world and takes every opportunity to experience the local culture wherever he goes.

Synchronizing Amazon S3 Buckets Using AWS Step Functions

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/synchronizing-amazon-s3-buckets-using-aws-step-functions/

Constantin Gonzalez is a Principal Solutions Architect at AWS

In my free time, I run a small blog that uses Amazon S3 to host static content and Amazon CloudFront to distribute it world-wide. I use a home-grown, static website generator to create and upload my blog content onto S3.

My blog uses two S3 buckets: one for staging and testing, and one for production. As a website owner, I want to update the production bucket with all changes from the staging bucket in a reliable and efficient way, without having to create and populate a new bucket from scratch. Therefore, to synchronize files between these two buckets, I use AWS Lambda and AWS Step Functions.

In this post, I show how you can use Step Functions to build a scalable synchronization engine for S3 buckets and learn some common patterns for designing Step Functions state machines while you do so.

Step Functions overview

Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly.

While this particular example focuses on synchronizing objects between two S3 buckets, it can be generalized to any other use case that involves coordinated processing of any number of objects in S3 buckets, or other, similar data processing patterns.

Bucket replication options

Before I dive into the details on how this particular example works, take a look at some alternatives for copying or replicating data between two Amazon S3 buckets:

  • The AWS CLI provides customers with a powerful aws s3 sync command that can synchronize the contents of one bucket with another.
  • S3DistCP is a powerful tool for users of Amazon EMR that can efficiently load, save, or copy large amounts of data between S3 buckets and HDFS.
  • The S3 cross-region replication functionality enables automatic, asynchronous copying of objects across buckets in different AWS regions.

In this use case, you are looking for a slightly different bucket synchronization solution that:

  • Works within the same region
  • Is more scalable than a CLI approach running on a single machine
  • Doesn’t require managing any servers
  • Uses a more finely grained cost model than the hourly based Amazon EMR approach

You need a scalable, serverless, and customizable bucket synchronization utility.

Solution architecture

Your solution needs to do three things:

  1. Copy all objects from a source bucket into a destination bucket, but leave out objects that are already present, for efficiency.
  2. Delete all "orphaned" objects from the destination bucket that aren’t present on the source bucket, because you don’t want obsolete objects lying around.
  3. Keep track of all objects for #1 and #2, regardless of how many objects there are.
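The first two requirements amount to two set differences over object keys. The following Python sketch shows that idea on small in-memory lists. It is an illustration only: plan_sync is a hypothetical name, it compares keys and not object contents, and holding every key in memory is exactly what the state machine avoids by working through the buckets in chunks.

```python
def plan_sync(source_keys, destination_keys):
    """Return (keys to copy, keys to delete) for a one-way sync."""
    source, destination = set(source_keys), set(destination_keys)
    to_copy = sorted(source - destination)    # Missing in the destination.
    to_delete = sorted(destination - source)  # Orphaned in the destination.
    return to_copy, to_delete


to_copy, to_delete = plan_sync(
    ["index.html", "about.html", "css/site.css"],
    ["index.html", "old-post.html"],
)
```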

In the beginning, you read in the source and destination buckets as parameters and perform basic parameter validation. Then, you operate two separate, independent loops, one for copying missing objects and one for deleting obsolete objects. Each loop is a sequence of Step Functions states that read in chunks of S3 object lists and use the continuation token to decide in a choice state whether to continue the loop or not.
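The looping pattern can be sketched outside of Step Functions as an ordinary loop that keeps requesting pages until no continuation token comes back. In this Python sketch, the page format (a dictionary with keys and an optional next_token) and the in-memory lister are assumptions made so that the example runs without AWS access; in the state machine, the token check is done by the choice state.

```python
def list_all_keys(list_page):
    """Collect keys by calling list_page(token) until no token is returned."""
    keys, token = [], None
    while True:
        page = list_page(token)
        keys.extend(page["keys"])
        token = page.get("next_token")
        if token is None:  # Plays the role of the choice state: stop looping.
            return keys


def make_fake_lister(all_keys, page_size):
    """In-memory stand-in for an S3 listing call (hypothetical)."""
    def list_page(token):
        start = token or 0
        end = start + page_size
        page = {"keys": all_keys[start:end]}
        if end < len(all_keys):
            page["next_token"] = end  # More keys remain after this page.
        return page
    return list_page


keys = list_all_keys(make_fake_lister(["a", "b", "c", "d", "e"], page_size=2))
```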

This solution is based on the following architecture that uses Step Functions, Lambda, and two S3 buckets:

As you can see, this setup involves no servers, just two main building blocks:

  • Step Functions manages the overall flow of synchronizing the objects from the source bucket with the destination bucket.
  • A set of Lambda functions carry out the individual steps necessary to perform the work, such as validating input, getting lists of objects from source and destination buckets, copying or deleting objects in batches, and so on.

To understand the synchronization flow in more detail, look at the Step Functions state machine diagram for this example.

Walkthrough

Here’s a detailed discussion of how this works.

To follow along, use the code in the sync-buckets-state-machine GitHub repo. The code comes with a ready-to-run deployment script in Python that takes care of all the IAM roles, policies, Lambda functions, and of course the Step Functions state machine deployment using AWS CloudFormation, as well as instructions on how to use it.

Fine print: Use at your own risk

Before I start, here are some disclaimers:

  • Educational purposes only.

    The following example and code are intended for educational purposes only. Make sure that you customize, test, and review it on your own before using any of this in production.

  • S3 object deletion.

    In particular, using the code included below may delete objects on S3 in order to perform synchronization. Make sure that you have backups of your data. In particular, consider using the Amazon S3 Versioning feature to protect yourself against unintended data modification or deletion.

Step Functions execution starts with an initial set of parameters that contain the source and destination bucket names in JSON:

{
    "source":       "my-source-bucket-name",
    "destination":  "my-destination-bucket-name"
}

Armed with this data, Step Functions execution proceeds as follows.

Step 1: Detect the bucket region

First, you need to know the regions where your buckets reside. In this case, take advantage of the Step Functions Parallel state. This allows you to use a Lambda function get_bucket_location.py inside two different, parallel branches of task states:

  • FindRegionForSourceBucket
  • FindRegionForDestinationBucket

Each task state receives one bucket name as an input parameter, then detects the region corresponding to its bucket. The output of these functions is collected in a result array containing one element per parallel function.

Step 2: Combine the parallel states

The output of a parallel state is a list with all the individual branches’ outputs. To combine them into a single structure, use a Lambda function called combine_dicts.py in its own CombineRegionOutputs task state. The function combines the two outputs from step 1 into a single JSON dict that provides you with the necessary region information for each bucket.
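Merging the branch outputs can be as simple as folding a list of dictionaries into one. The following Python sketch shows the idea; it is not the code from the repository, and the key names sourceRegion and destinationRegion are made up for illustration.

```python
def combine_dicts(branch_outputs):
    """Merge a list of dictionaries into one; later entries win on conflicts."""
    combined = {}
    for output in branch_outputs:
        combined.update(output)
    return combined


# Made-up parallel state output: one dictionary per branch.
parallel_output = [
    {"sourceRegion": "eu-west-1"},
    {"destinationRegion": "eu-west-1"},
]
state = combine_dicts(parallel_output)
```

With both regions in a single dictionary, a later step can compare them directly.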

Step 3: Validate the input

In this walkthrough, you only support buckets that reside in the same region, so you need to decide if the input is valid or if the user has given you two buckets in different regions. To find out, use a Lambda function called validate_input.py in the ValidateInput task state that tests if the two regions from the previous step are equal. The output is a Boolean.

Step 4: Branch the workflow

Use another type of Step Functions state, a Choice state, which branches into a Failure state if the comparison in step 3 yields false, or proceeds with the remaining steps if the comparison was successful.
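Steps 3 and 4 together might look roughly like the sketch below: a small comparison function plus a fragment of a state machine definition, written as a Python dict that can be serialized to JSON. The field names (sourceRegion, destinationRegion, regionsAreSame), the state names CheckRegions, RegionMismatch, and SyncBuckets, and the placeholder ARN are assumptions for illustration; only ValidateInput is named in this post.

```python
import json


def validate_input(event, context=None):
    """Return True when both buckets live in the same region."""
    return event["sourceRegion"] == event["destinationRegion"]


# Fragment of a state machine definition (not a complete one).
STATES = {
    "ValidateInput": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:validate_input",  # placeholder
        "ResultPath": "$.regionsAreSame",  # store the Boolean next to the input
        "Next": "CheckRegions",
    },
    "CheckRegions": {
        "Type": "Choice",
        "Choices": [
            {"Variable": "$.regionsAreSame", "BooleanEquals": True, "Next": "SyncBuckets"}
        ],
        "Default": "RegionMismatch",  # taken when the comparison is false
    },
    "RegionMismatch": {
        "Type": "Fail",
        "Error": "RegionMismatch",
        "Cause": "Source and destination buckets are in different regions.",
    },
}

if __name__ == "__main__":
    print(json.dumps(STATES, indent=2))
```

Writing the Boolean to its own ResultPath keeps the original bucket and region fields available to the states that follow.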

Step 5: Execute in parallel

The actual work happens in another Parallel state. The two branches of this state are very similar and re-use some of the same Lambda function code.

Each parallel branch implements a looping pattern across the following steps:

  1. Use a Pass state to inject either the string value "source" (InjectSourceBucket) or "destination" (InjectDestinationBucket) into the listBucket attribute of the state document.

    The next step uses either the source or the destination bucket, depending on the branch, while executing the same, generic Lambda function. You don’t need two Lambda functions that differ only slightly. This step illustrates how to use Pass states as a way of injecting constant parameters into your state machine and as a way of controlling step behavior while re-using common step execution code.

  2. The next step UpdateSourceKeyList/UpdateDestinationKeyList lists objects in the given bucket.

    Remember that the previous step injected either "source" or "destination" into the state document’s listBucket attribute. This step uses the same list_bucket.py Lambda function to list objects in an S3 bucket. The listBucket attribute of its input decides which bucket to list. In the left branch of the main parallel state, use the list of source objects to work through copying missing objects. The right branch uses the list of destination objects to check whether each one has a corresponding object in the source bucket, and to eliminate any orphaned objects. Orphans are destination objects that have no source object with the same S3 key.

  3. This step performs the actual work. In the left branch, the CopySourceKeys step uses the copy_keys.py Lambda function to go through the list of source objects provided by the previous step, then copies any missing object into the destination bucket. Its sister step in the other branch, DeleteOrphanedKeys, uses its destination bucket key list to test whether each object from the destination bucket has a corresponding source object, then deletes any orphaned objects.

  4. The S3 ListObjects API action is designed to be scalable across many objects in a bucket. Therefore, it returns object lists in chunks of configurable size, along with a continuation token. If the API result has a continuation token, it means that there are more objects in this list. You can work from token to token to continue getting object list chunks, until you get no more continuation tokens.
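The listing, copying, and deleting described in items 2 through 4 can be sketched as follows. This is not the code from the repo: it assumes a client that behaves like boto3.client("s3"), it uses the list_objects_v2 call with its ContinuationToken parameter, and all function names are hypothetical.

```python
def list_chunk(s3, bucket, token=None, max_keys=1000):
    """List one chunk of keys; return (keys, next_token or None)."""
    kwargs = {"Bucket": bucket, "MaxKeys": max_keys}
    if token:
        kwargs["ContinuationToken"] = token
    response = s3.list_objects_v2(**kwargs)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return keys, response.get("NextContinuationToken")


def key_exists(s3, bucket, key):
    """Check for an exact key by listing with the key as prefix."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


def copy_missing(s3, source, destination, keys):
    """Copy every listed source key that the destination lacks."""
    copied = 0
    for key in keys:
        if not key_exists(s3, destination, key):
            s3.copy_object(
                Bucket=destination,
                Key=key,
                CopySource={"Bucket": source, "Key": key},
            )
            copied += 1
    return copied


def delete_orphans(s3, source, destination, keys):
    """Delete every listed destination key that has no source object."""
    deleted = 0
    for key in keys:
        if not key_exists(s3, source, key):
            s3.delete_object(Bucket=destination, Key=key)
            deleted += 1
    return deleted
```

Each call to list_chunk handles exactly one chunk, which matches the state machine's loop: the returned token goes back into the state document and drives the next iteration.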

By breaking down large amounts of work into chunks, you can make sure each chunk is completed within the timeframe allocated for the Lambda function, and within the maximum input/output data size for a Step Functions state.

This approach comes with a slight tradeoff: the more objects you process at one time in a given chunk, the faster you are done. There’s less overhead for managing individual chunks. On the other hand, if you process too many objects within the same chunk, you risk going over time and space limits of the processing Lambda function or the Step Functions state so the work cannot be completed.

In this particular case, use a Lambda function that maximizes the number of objects listed from the S3 bucket that can be stored in the input/output state data. This is currently up to 32,768 bytes, assuming (based on some experimentation) that the execution of the COPY/DELETE requests in the processing states can always complete in time.
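One way to keep a key list under that size limit is to count serialized bytes while adding keys, as in the sketch below. The function name fit_keys and the "keys" field are hypothetical, and the sketch assumes that the size of the default json.dumps output is a reasonable stand-in for how the state data is measured.

```python
import json

MAX_STATE_BYTES = 32768  # limit quoted in this post


def fit_keys(keys, base_state, limit=MAX_STATE_BYTES):
    """Return the longest prefix of keys whose state document fits in limit bytes."""
    # Size of the state document with an empty key list.
    size = len(json.dumps(dict(base_state, keys=[])).encode("utf-8"))
    kept = []
    for key in keys:
        # Each key adds its quoted form, plus ", " after the first element.
        extra = len(json.dumps(key).encode("utf-8")) + (2 if kept else 0)
        if size + extra > limit:
            break
        kept.append(key)
        size += extra
    return kept
```

The keys that do not fit are simply left for the next chunk, so the listing function would also need to remember where it stopped.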

A more sophisticated approach would use the Step Functions retry/catch state attributes to detect when a time limit has been hit and adjust the list size accordingly.
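A sketch of what that could look like follows. None of it is in the repo: the ShrinkChunk state, the maxKeys field, the timeout value, and the placeholder ARN are assumptions, and the fragment omits the states it points to. It relies on the standard States.Timeout and States.TaskFailed error names.

```python
import json

# Task state fragment with a time limit, a retry policy, and a fallback.
COPY_TASK = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:copy_keys",  # placeholder
    "TimeoutSeconds": 300,
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.Timeout"],
            "ResultPath": "$.error",  # keep the original input, attach the error
            "Next": "ShrinkChunk",
        }
    ],
    "Next": "CheckForMoreKeys",
}


def shrink_chunk(event, context=None):
    """Halve the requested chunk size after a timeout, never below one key."""
    new_size = max(1, event.get("maxKeys", 1000) // 2)
    return dict(event, maxKeys=new_size)


if __name__ == "__main__":
    print(json.dumps(COPY_TASK, indent=2))
```

The listing function would then have to honor maxKeys when it fetches the next chunk, so that the smaller size actually takes effect.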

Step 6: Test for completion

Because the presence of a continuation token in the S3 ListObjects output signals that you are not done processing all objects yet, use a Choice state to test for its presence. If a continuation token exists, the state machine branches back into the UpdateSourceKeyList/UpdateDestinationKeyList step, which uses the token to get the next chunk of objects. If there is no token, you’re done. The state machine then branches into the FinishCopyBranch/FinishDeleteBranch state.
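Put together, the copy branch could be wired up roughly as shown below, again as a Python dict that serializes to a state machine definition. The state names InjectSourceBucket, UpdateSourceKeyList, CopySourceKeys, and FinishCopyBranch come from this walkthrough; the Choice state name MoreSourceKeys, the $.listResult.token path, and the placeholder ARNs are assumptions, and the real definition in the repo may test for the token differently.

```python
import json

BRANCH = {
    "StartAt": "InjectSourceBucket",
    "States": {
        "InjectSourceBucket": {
            "Type": "Pass",
            "Result": "source",            # constant injected into the state document
            "ResultPath": "$.listBucket",
            "Next": "UpdateSourceKeyList",
        },
        "UpdateSourceKeyList": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:list_bucket",  # placeholder
            "ResultPath": "$.listResult",
            "Next": "CopySourceKeys",
        },
        "CopySourceKeys": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:copy_keys",  # placeholder
            "ResultPath": None,            # discard the result, pass the input through
            "Next": "MoreSourceKeys",
        },
        "MoreSourceKeys": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.listResult.token",
                    "IsPresent": True,     # a token means more chunks remain
                    "Next": "UpdateSourceKeyList",
                }
            ],
            "Default": "FinishCopyBranch",
        },
        "FinishCopyBranch": {"Type": "Pass", "End": True},
    },
}

if __name__ == "__main__":
    print(json.dumps(BRANCH, indent=2))
```

The delete branch would have the same shape, with "destination" injected by the Pass state and the delete function in place of the copy function.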

By using Choice states like this, you can create loops exactly like the old times, when you didn’t have for statements and used branches in assembly code instead!

Step 7: Success!

Finally, you’re done, and can step into your final Success state.

Lessons learned

When implementing this use case with Step Functions and Lambda, I learned the following things:

  • Sometimes, it is necessary to manipulate the JSON state of a Step Functions state machine with just a few lines of code that hardly seem to warrant their own Lambda function. This is ok, and the cost is actually pretty low given Lambda’s 100 millisecond billing granularity. The upside is that functions like these can be helpful to make the data more palatable for the following steps or for facilitating Choice states. An example here would be the combine_dicts.py function.
  • Pass states can be useful beyond debugging and tracing. They can also inject arbitrary values into your state JSON and guide generic Lambda functions into doing specific things.
  • Choice states are your friend because you can build while-loops with them. This allows you to reliably grind through large amounts of data with the patience of an engine that currently supports execution times of up to 1 year.

    Currently, there is an execution history limit of 25,000 events. Each Lambda task state execution takes up 5 events, while each Choice state takes 2 events, for a total of 7 events per loop. This means you can loop about 3,500 times with this state machine. For even more scalability, you can split up work across multiple Step Functions executions through object key sharding or similar approaches.

  • It’s not necessary to spend a lot of time coding exception handling within your Lambda functions. You can delegate all exception handling to Step Functions and instead simplify your functions as much as possible.

  • Step Functions are great replacements for shell scripts. This could have been a shell script, but then I would have had to worry about where to execute it reliably, how to scale it if it went beyond a few thousand objects, etc. Think of Step Functions and Lambda as tools for scripting at a cloud level, beyond the boundaries of servers or containers. "Serverless" here also means "boundary-less".
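The loop budget mentioned in the execution history lesson above can be checked with a few lines of arithmetic. The numbers are the ones quoted in this post; the constant names are only for illustration.

```python
HISTORY_LIMIT = 25000      # maximum events in one execution history
EVENTS_PER_TASK = 5        # events per Lambda task state execution
EVENTS_PER_CHOICE = 2      # events per Choice state

events_per_loop = EVENTS_PER_TASK + EVENTS_PER_CHOICE
max_loops = HISTORY_LIMIT // events_per_loop

print(events_per_loop)     # 7
print(max_loops)           # 3571
```

The result sits slightly above the rounded figure of about 3,500 loops, and the states that run before and after the loop also consume part of the history.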

Summary

This approach gives you scalability by breaking down any number of S3 objects into chunks, then using Step Functions to drive the control logic that works through these objects in a scalable, serverless, and fully managed way.

To take a look at the code or tweak it for your own needs, use the code in the sync-buckets-state-machine GitHub repo.

To see more examples, please visit the Step Functions Getting Started page.

Enjoy!