Tag Archives: https

Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-aws-step-functions-state-machines-to-handle-workflow-driven-aws-codepipeline-actions/

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. It offers powerful integration with other AWS services, such as AWS CodeBuildAWS CodeDeployAWS CodeCommit, AWS CloudFormation and with third-party tools such as Jenkins and GitHub. These services make it possible for AWS customers to successfully automate various tasks, including infrastructure provisioning, blue/green deployments, serverless deployments, AMI baking, database provisioning, and release management.

Developers have been able to use CodePipeline to build sophisticated automation pipelines that often require a single CodePipeline action to perform multiple tasks, fork into different execution paths, and deal with asynchronous behavior. For example, to deploy a Lambda function, a CodePipeline action might first inspect the changes pushed to the code repository. If only the Lambda code has changed, the action can simply update the Lambda code package, create a new version, and point the Lambda alias to the new version. If the changes also affect infrastructure resources managed by AWS CloudFormation, the pipeline action might have to create a stack or update an existing one through the use of a change set. In addition, if an update is required, the pipeline action might enforce a safety policy to infrastructure resources that prevents the deletion and replacement of resources. You can do this by creating a change set and having the pipeline action inspect its changes before updating the stack. Change sets that do not conform to the policy are deleted.

This use case is a good illustration of workflow-driven pipeline actions. These are actions that run multiple tasks, deal with async behavior and loops, need to maintain and propagate state, and fork into different execution paths. Implementing workflow-driven actions directly in CodePipeline can lead to complex pipelines that are hard for developers to understand and maintain. Ideally, a pipeline action should perform a single task and delegate the complexity of dealing with workflow-driven behavior associated with that task to a state machine engine. This would make it possible for developers to build simpler, more intuitive pipelines and allow them to use state machine execution logs to visualize and troubleshoot their pipeline actions.

In this blog post, we discuss how AWS Step Functions state machines can be used to handle workflow-driven actions. We show how a CodePipeline action can trigger a Step Functions state machine and how the pipeline and the state machine are kept decoupled through a Lambda function. The advantages of using state machines include:

  • Simplified logic (complex tasks are broken into multiple smaller tasks).
  • Ease of handling asynchronous behavior (through state machine wait states).
  • Built-in support for choices and processing different execution paths (through state machine choices).
  • Built-in visualization and logging of the state machine execution.

The source code for the sample pipeline, pipeline actions, and state machine used in this post is available at https://github.com/awslabs/aws-codepipeline-stepfunctions.


This figure shows the components in the CodePipeline-Step Functions integration that will be described in this post. The pipeline contains two stages: a Source stage represented by a CodeCommit Git repository and a Prod stage with a single Deploy action that represents the workflow-driven action.

This action invokes a Lambda function (1) called the State Machine Trigger Lambda, which, in turn, triggers a Step Function state machine to process the request (2). The Lambda function sends a continuation token back to the pipeline (3) to continue its execution later and terminates. Seconds later, the pipeline invokes the Lambda function again (4), passing the continuation token received. The Lambda function checks the execution state of the state machine (5,6) and communicates the status to the pipeline. The process is repeated until the state machine execution is complete. Then the Lambda function notifies the pipeline that the corresponding pipeline action is complete (7). If the state machine has failed, the Lambda function will then fail the pipeline action and stop its execution (7). While running, the state machine triggers various Lambda functions to perform different tasks. The state machine and the pipeline are fully decoupled. Their interaction is handled by the Lambda function.

The Deploy State Machine

The sample state machine used in this post is a simplified version of the use case, with emphasis on infrastructure deployment. The state machine will follow distinct execution paths and thus have different outcomes, depending on:

  • The current state of the AWS CloudFormation stack.
  • The nature of the code changes made to the AWS CloudFormation template and pushed into the pipeline.

If the stack does not exist, it will be created. If the stack exists, a change set will be created and its resources inspected by the state machine. The inspection consists of parsing the change set results and detecting whether any resources will be deleted or replaced. If no resources are being deleted or replaced, the change set is allowed to be executed and the state machine completes successfully. Otherwise, the change set is deleted and the state machine completes execution with a failure as the terminal state.

Let’s dive into each of these execution paths.

Path 1: Create a Stack and Succeed Deployment

The Deploy state machine is shown here. It is triggered by the Lambda function using the following input parameters stored in an S3 bucket.

Create New Stack Execution Path

    "environmentName": "prod",
    "stackName": "sample-lambda-app",
    "templatePath": "infra/Lambda-template.yaml",
    "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
    "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ"

Note that some values used here are for the use case example only. Account-specific parameters like revisionS3Bucket and revisionS3Key will be different when you deploy this use case in your account.

These input parameters are used by various states in the state machine and passed to the corresponding Lambda functions to perform different tasks. For example, stackName is used to create a stack, check the status of stack creation, and create a change set. The environmentName represents the environment (for example, dev, test, prod) to which the code is being deployed. It is used to prefix the name of stacks and change sets.

With the exception of built-in states such as wait and choice, each state in the state machine invokes a specific Lambda function.  The results received from the Lambda invocations are appended to the state machine’s original input. When the state machine finishes its execution, several parameters will have been added to its original input.

The first stage in the state machine is “Check Stack Existence”. It checks whether a stack with the input name specified in the stackName input parameter already exists. The output of the state adds a Boolean value called doesStackExist to the original state machine input as follows:

  "doesStackExist": true,
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",

The following stage, “Does Stack Exist?”, is represented by Step Functions built-in choice state. It checks the value of doesStackExist to determine whether a new stack needs to be created (doesStackExist=true) or a change set needs to be created and inspected (doesStackExist=false).

If the stack does not exist, the states illustrated in green in the preceding figure are executed. This execution path creates the stack, waits until the stack is created, checks the status of the stack’s creation, and marks the deployment successful after the stack has been created. Except for “Stack Created?” and “Wait Stack Creation,” each of these stages invokes a Lambda function. “Stack Created?” and “Wait Stack Creation” are implemented by using the built-in choice state (to decide which path to follow) and the wait state (to wait a few seconds before proceeding), respectively. Each stage adds the results of their Lambda function executions to the initial input of the state machine, allowing future stages to process them.

Path 2: Safely Update a Stack and Mark Deployment as Successful

Safely Update a Stack and Mark Deployment as Successful Execution Path

If the stack indicated by the stackName parameter already exists, a different path is executed. (See the green states in the figure.) This path will create a change set and use wait and choice states to wait until the change set is created. Afterwards, a stage in the execution path will inspect  the resources affected before the change set is executed.

The inspection procedure represented by the “Inspect Change Set Changes” stage consists of parsing the resources affected by the change set and checking whether any of the existing resources are being deleted or replaced. The following is an excerpt of the algorithm, where changeSetChanges.Changes is the object representing the change set changes:

for (var i = 0; i < changeSetChanges.Changes.length; i++) {
    var change = changeSetChanges.Changes[i];
    if (change.Type == "Resource") {
        if (change.ResourceChange.Action == "Delete") {
        if (change.ResourceChange.Action == "Modify") {
            if (change.ResourceChange.Replacement == "True") {

The algorithm returns different values to indicate whether the change set can be safely executed (CAN_SAFELY_UPDATE_EXISTING_STACK or RESOURCES_BEING_DELETED_OR_REPLACED). This value is used later by the state machine to decide whether to execute the change set and update the stack or interrupt the deployment.

The output of the “Inspect Change Set” stage is shown here.

  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
  "doesStackExist": true,
  "changeSetName": "prod-sample-lambda-app-change-set-545",
  "changeSetCreationStatus": "complete",

At this point, these parameters have been added to the state machine’s original input:

  • changeSetName, which is added by the “Create Change Set” state.
  • changeSetCreationStatus, which is added by the “Get Change Set Creation Status” state.
  • changeSetAction, which is added by the “Inspect Change Set Changes” state.

The “Safe to Update Infra?” step is a choice state (its JSON spec follows) that simply checks the value of the changeSetAction parameter. If the value is equal to “CAN-SAFELY-UPDATE-EXISTING-STACK“, meaning that no resources will be deleted or replaced, the step will execute the change set by proceeding to the “Execute Change Set” state. The deployment is successful (the state machine completes its execution successfully).

"Safe to Update Infra?": {
      "Type": "Choice",
      "Choices": [
          "Variable": "$.taskParams.changeSetAction",
          "StringEquals": "CAN-SAFELY-UPDATE-EXISTING-STACK",
          "Next": "Execute Change Set"
      "Default": "Deployment Failed"

Path 3: Reject Stack Update and Fail Deployment

Reject Stack Update and Fail Deployment Execution Path

If the changeSetAction parameter is different from “CAN-SAFELY-UPDATE-EXISTING-STACK“, the state machine will interrupt the deployment by deleting the change set and proceeding to the “Deployment Fail” step, which is a built-in Fail state. (Its JSON spec follows.) This state causes the state machine to stop in a failed state and serves to indicate to the Lambda function that the pipeline deployment should be interrupted in a fail state as well.

 "Deployment Failed": {
      "Type": "Fail",
      "Cause": "Deployment Failed",
      "Error": "Deployment Failed"

In all three scenarios, there’s a state machine’s visual representation available in the AWS Step Functions console that makes it very easy for developers to identify what tasks have been executed or why a deployment has failed. Developers can also inspect the inputs and outputs of each state and look at the state machine Lambda function’s logs for details. Meanwhile, the corresponding CodePipeline action remains very simple and intuitive for developers who only need to know whether the deployment was successful or failed.

The State Machine Trigger Lambda Function

The Trigger Lambda function is invoked directly by the Deploy action in CodePipeline. The CodePipeline action must pass a JSON structure to the trigger function through the UserParameters attribute, as follows:

  "s3Bucket": "codepipeline-StepFunctions-sample",
  "stateMachineFile": "state_machine_input.json"

The s3Bucket parameter specifies the S3 bucket location for the state machine input parameters file. The stateMachineFile parameter specifies the file holding the input parameters. By being able to specify different input parameters to the state machine, we make the Trigger Lambda function and the state machine reusable across environments. For example, the same state machine could be called from a test and prod pipeline action by specifying a different S3 bucket or state machine input file for each environment.

The Trigger Lambda function performs two main tasks: triggering the state machine and checking the execution state of the state machine. Its core logic is shown here:

exports.index = function (event, context, callback) {
    try {
        console.log("Event: " + JSON.stringify(event));
        console.log("Context: " + JSON.stringify(context));
        console.log("Environment Variables: " + JSON.stringify(process.env));
        if (Util.isContinuingPipelineTask(event)) {
            monitorStateMachineExecution(event, context, callback);
        else {
            triggerStateMachine(event, context, callback);
    catch (err) {
        failure(Util.jobId(event), callback, context.invokeid, err.message);

Util.isContinuingPipelineTask(event) is a utility function that checks if the Trigger Lambda function is being called for the first time (that is, no continuation token is passed by CodePipeline) or as a continuation of a previous call. In its first execution, the Lambda function will trigger the state machine and send a continuation token to CodePipeline that contains the state machine execution ARN. The state machine ARN is exposed to the Lambda function through a Lambda environment variable called stateMachineArn. Here is the code that triggers the state machine:

function triggerStateMachine(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var s3Bucket = Util.actionUserParameter(event, "s3Bucket");
    var stateMachineFile = Util.actionUserParameter(event, "stateMachineFile");
    getStateMachineInputData(s3Bucket, stateMachineFile)
        .then(function (data) {
            var initialParameters = data.Body.toString();
            var stateMachineInputJSON = createStateMachineInitialInput(initialParameters, event);
            console.log("State machine input JSON: " + JSON.stringify(stateMachineInputJSON));
            return stateMachineInputJSON;
        .then(function (stateMachineInputJSON) {
            return triggerStateMachineExecution(stateMachineArn, stateMachineInputJSON);
        .then(function (triggerStateMachineOutput) {
            var continuationToken = { "stateMachineExecutionArn": triggerStateMachineOutput.executionArn };
            var message = "State machine has been triggered: " + JSON.stringify(triggerStateMachineOutput) + ", continuationToken: " + JSON.stringify(continuationToken);
            return continueExecution(Util.jobId(event), continuationToken, callback, message);
        .catch(function (err) {
            console.log("Error triggering state machine: " + stateMachineArn + ", Error: " + err.message);
            failure(Util.jobId(event), callback, context.invokeid, err.message);

The Trigger Lambda function fetches the state machine input parameters from an S3 file, triggers the execution of the state machine using the input parameters and the stateMachineArn environment variable, and signals to CodePipeline that the execution should continue later by passing a continuation token that contains the state machine execution ARN. In case any of these operations fail and an exception is thrown, the Trigger Lambda function will fail the pipeline immediately by signaling a pipeline failure through the putJobFailureResult CodePipeline API.

If the Lambda function is continuing a previous execution, it will extract the state machine execution ARN from the continuation token and check the status of the state machine, as shown here.

function monitorStateMachineExecution(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var continuationToken = JSON.parse(Util.continuationToken(event));
    var stateMachineExecutionArn = continuationToken.stateMachineExecutionArn;
        .then(function (response) {
            if (response.status === "RUNNING") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " is still " + response.status;
                return continueExecution(Util.jobId(event), continuationToken, callback, message);
            if (response.status === "SUCCEEDED") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
                return success(Util.jobId(event), callback, message);
            var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
            return failure(Util.jobId(event), callback, context.invokeid, message);
        .catch(function (err) {
            var message = "Error monitoring execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + ", Error: " + err.message;
            failure(Util.jobId(event), callback, context.invokeid, message);

If the state machine is in the RUNNING state, the Lambda function will send the continuation token back to the CodePipeline action. This will cause CodePipeline to call the Lambda function again a few seconds later. If the state machine has SUCCEEDED, then the Lambda function will notify the CodePipeline action that the action has succeeded. In any other case (FAILURE, TIMED-OUT, or ABORT), the Lambda function will fail the pipeline action.

This behavior is especially useful for developers who are building and debugging a new state machine because a bug in the state machine can potentially leave the pipeline action hanging for long periods of time until it times out. The Trigger Lambda function prevents this.

Also, by having the Trigger Lambda function as a means to decouple the pipeline and state machine, we make the state machine more reusable. It can be triggered from anywhere, not just from a CodePipeline action.

The Pipeline in CodePipeline

Our sample pipeline contains two simple stages: the Source stage represented by a CodeCommit Git repository and the Prod stage, which contains the Deploy action that invokes the Trigger Lambda function. When the state machine decides that the change set created must be rejected (because it replaces or deletes some the existing production resources), it fails the pipeline without performing any updates to the existing infrastructure. (See the failed Deploy action in red.) Otherwise, the pipeline action succeeds, indicating that the existing provisioned infrastructure was either created (first run) or updated without impacting any resources. (See the green Deploy stage in the pipeline on the left.)

The Pipeline in CodePipeline

The JSON spec for the pipeline’s Prod stage is shown here. We use the UserParameters attribute to pass the S3 bucket and state machine input file to the Lambda function. These parameters are action-specific, which means that we can reuse the state machine in another pipeline action.

  "name": "Prod",
  "actions": [
          "inputArtifacts": [
                  "name": "CodeCommitOutput"
          "name": "Deploy",
          "actionTypeId": {
              "category": "Invoke",
              "owner": "AWS",
              "version": "1",
              "provider": "Lambda"
          "outputArtifacts": [],
          "configuration": {
              "FunctionName": "StateMachineTriggerLambda",
              "UserParameters": "{\"s3Bucket\": \"codepipeline-StepFunctions-sample\", \"stateMachineFile\": \"state_machine_input.json\"}"
          "runOrder": 1


In this blog post, we discussed how state machines in AWS Step Functions can be used to handle workflow-driven actions. We showed how a Lambda function can be used to fully decouple the pipeline and the state machine and manage their interaction. The use of a state machine greatly simplified the associated CodePipeline action, allowing us to build a much simpler and cleaner pipeline while drilling down into the state machine’s execution for troubleshooting or debugging.

Here are two exercises you can complete by using the source code.

Exercise #1: Do not fail the state machine and pipeline action after inspecting a change set that deletes or replaces resources. Instead, create a stack with a different name (think of blue/green deployments). You can do this by creating a state machine transition between the “Safe to Update Infra?” and “Create Stack” stages and passing a new stack name as input to the “Create Stack” stage.

Exercise #2: Add wait logic to the state machine to wait until the change set completes its execution before allowing the state machine to proceed to the “Deployment Succeeded” stage. Use the stack creation case as an example. You’ll have to create a Lambda function (similar to the Lambda function that checks the creation status of a stack) to get the creation status of the change set.

Have fun and share your thoughts!

About the Author

Marcilio Mendonca is a Sr. Consultant in the Canadian Professional Services Team at Amazon Web Services. He has helped AWS customers design, build, and deploy best-in-class, cloud-native AWS applications using VMs, containers, and serverless architectures. Before he joined AWS, Marcilio was a Software Development Engineer at Amazon. Marcilio also holds a Ph.D. in Computer Science. In his spare time, he enjoys playing drums, riding his motorcycle in the Toronto GTA area, and spending quality time with his family.

Implementing Default Directory Indexes in Amazon S3-backed Amazon CloudFront Origins Using [email protected]

Post Syndicated from Ronnie Eichler original https://aws.amazon.com/blogs/compute/implementing-default-directory-indexes-in-amazon-s3-backed-amazon-cloudfront-origins-using-lambdaedge/

With the recent launch of [email protected], it’s now possible for you to provide even more robust functionality to your static websites. Amazon CloudFront is a content distribution network service. In this post, I show how you can use [email protected] along with the CloudFront origin access identity (OAI) for Amazon S3 and still provide simple URLs (such as www.example.com/about/ instead of www.example.com/about/index.html).


Amazon S3 is a great platform for hosting a static website. You don’t need to worry about managing servers or underlying infrastructure—you just publish your static to content to an S3 bucket. S3 provides a DNS name such as <bucket-name>.s3-website-<AWS-region>.amazonaws.com. Use this name for your website by creating a CNAME record in your domain’s DNS environment (or Amazon Route 53) as follows:

www.example.com -> <bucket-name>.s3-website-<AWS-region>.amazonaws.com

You can also put CloudFront in front of S3 to further scale the performance of your site and cache the content closer to your users. CloudFront can enable HTTPS-hosted sites, by either using a custom Secure Sockets Layer (SSL) certificate or a managed certificate from AWS Certificate Manager. In addition, CloudFront also offers integration with AWS WAF, a web application firewall. As you can see, it’s possible to achieve some robust functionality by using S3, CloudFront, and other managed services and not have to worry about maintaining underlying infrastructure.

One of the key concerns that you might have when implementing any type of WAF or CDN is that you want to force your users to go through the CDN. If you implement CloudFront in front of S3, you can achieve this by using an OAI. However, in order to do this, you cannot use the HTTP endpoint that is exposed by S3’s static website hosting feature. Instead, CloudFront must use the S3 REST endpoint to fetch content from your origin so that the request can be authenticated using the OAI. This presents some challenges in that the REST endpoint does not support redirection to a default index page.

CloudFront does allow you to specify a default root object (index.html), but it only works on the root of the website (such as http://www.example.com > http://www.example.com/index.html). It does not work on any subdirectory (such as http://www.example.com/about/). If you were to attempt to request this URL through CloudFront, CloudFront would do a S3 GetObject API call against a key that does not exist.

Of course, it is a bad user experience to expect users to always type index.html at the end of every URL (or even know that it should be there). Until now, there has not been an easy way to provide these simpler URLs (equivalent to the DirectoryIndex Directive in an Apache Web Server configuration) to users through CloudFront. Not if you still want to be able to restrict access to the S3 origin using an OAI. However, with the release of [email protected], you can use a JavaScript function running on the CloudFront edge nodes to look for these patterns and request the appropriate object key from the S3 origin.


In this example, you use the compute power at the CloudFront edge to inspect the request as it’s coming in from the client. Then re-write the request so that CloudFront requests a default index object (index.html in this case) for any request URI that ends in ‘/’.

When a request is made against a web server, the client specifies the object to obtain in the request. You can use this URI and apply a regular expression to it so that these URIs get resolved to a default index object before CloudFront requests the object from the origin. Use the following code:

'use strict';
exports.handler = (event, context, callback) => {
    // Extract the request from the CloudFront event that is sent to [email protected] 
    var request = event.Records[0].cf.request;

    // Extract the URI from the request
    var olduri = request.uri;

    // Match any '/' that occurs at the end of a URI. Replace it with a default index
    var newuri = olduri.replace(/\/$/, '\/index.html');
    // Log the URI as received by CloudFront and the new URI to be used to fetch from origin
    console.log("Old URI: " + olduri);
    console.log("New URI: " + newuri);
    // Replace the received URI with the URI that includes the index page
    request.uri = newuri;
    // Return to CloudFront
    return callback(null, request);


To get started, create an S3 bucket to be the origin for CloudFront:

Create bucket

On the other screens, you can just accept the defaults for the purposes of this walkthrough. If this were a production implementation, I would recommend enabling bucket logging and specifying an existing S3 bucket as the destination for access logs. These logs can be useful if you need to troubleshoot issues with your S3 access.

Now, put some content into your S3 bucket. For this walkthrough, create two simple webpages to demonstrate the functionality:  A page that resides at the website root, and another that is in a subdirectory.


<!doctype html>
        <meta charset="utf-8">
        <title>Root home page</title>
        <p>Hello, this page resides in the root directory.</p>


<!doctype html>
        <meta charset="utf-8">
        <title>Subdirectory home page</title>
        <p>Hello, this page resides in the /subdirectory/ directory.</p>

When uploading the files into S3, you can accept the defaults. You add a bucket policy as part of the CloudFront distribution creation that allows CloudFront to access the S3 origin. You should now have an S3 bucket that looks like the following:

Root of bucket

Subdirectory in bucket

Next, create a CloudFront distribution that your users will use to access the content. Open the CloudFront console, and choose Create Distribution. For Select a delivery method for your content, under Web, choose Get Started.

On the next screen, you set up the distribution. Below are the options to configure:

  • Origin Domain Name:  Select the S3 bucket that you created earlier.
  • Restrict Bucket Access: Choose Yes.
  • Origin Access Identity: Create a new identity.
  • Grant Read Permissions on Bucket: Choose Yes, Update Bucket Policy.
  • Object Caching: Choose Customize (I am changing the behavior to avoid having CloudFront cache objects, as this could affect your ability to troubleshoot while implementing the Lambda code).
    • Minimum TTL: 0
    • Maximum TTL: 0
    • Default TTL: 0

You can accept all of the other defaults. Again, this is a proof-of-concept exercise. After you are comfortable that the CloudFront distribution is working properly with the origin and Lambda code, you can re-visit the preceding values and make changes before implementing it in production.

CloudFront distributions can take several minutes to deploy (because the changes have to propagate out to all of the edge locations). After that’s done, test the functionality of the S3-backed static website. Looking at the distribution, you can see that CloudFront assigns a domain name:

CloudFront Distribution Settings

Try to access the website using a combination of various URLs:

http://<domainname>/:  Works

› curl -v http://d3gt20ea1hllb.cloudfront.net/
*   Trying
* Connected to d3gt20ea1hllb.cloudfront.net ( port 80 (#0)
> GET / HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
< HTTP/1.1 200 OK
< ETag: "cb7e2634fe66c1fd395cf868087dd3b9"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: -D2FSRwzfcwyKZKFZr6DqYFkIf4t7HdGw2MkUF5sE6YFDxRJgi0R1g==
< Content-Length: 209
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:16 GMT
< Via: 1.1 6419ba8f3bd94b651d416054d9416f1e.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<!doctype html>
        <meta charset="utf-8">
        <title>Root home page</title>
        <p>Hello, this page resides in the root directory.</p>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

This is because CloudFront is configured to request a default root object (index.html) from the origin.

http://<domainname>/subdirectory/:  Doesn’t work

› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/
*   Trying
* Connected to d3gt20ea1hllb.cloudfront.net ( port 80 (#0)
> GET /subdirectory/ HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
< HTTP/1.1 200 OK
< ETag: "d41d8cd98f00b204e9800998ecf8427e"
< x-amz-server-side-encryption: AES256
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: Iqf0Gy8hJLiW-9tOAdSFPkL7vCWBrgm3-1ly5tBeY_izU82ftipodA==
< Content-Length: 0
< Content-Type: application/x-directory
< Last-Modified: Wed, 19 Jul 2017 19:21:24 GMT
< Via: 1.1 6419ba8f3bd94b651d416054d9416f1e.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

If you use a tool such like cURL to test this, you notice that CloudFront and S3 are returning a blank response. The reason for this is that the subdirectory does exist, but it does not resolve to an S3 object. Keep in mind that S3 is an object store, so there are no real directories. User interfaces such as the S3 console present a hierarchical view of a bucket with folders based on the presence of forward slashes, but behind the scenes the bucket is just a collection of keys that represent stored objects.

http://<domainname>/subdirectory/index.html:  Works

› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/index.html
*   Trying
* Connected to d3gt20ea1hllb.cloudfront.net ( port 80 (#0)
> GET /subdirectory/index.html HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
< HTTP/1.1 200 OK
< Date: Thu, 20 Jul 2017 20:35:15 GMT
< ETag: "ddf87c487acf7cef9d50418f0f8f8dae"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: RefreshHit from cloudfront
< X-Amz-Cf-Id: bkh6opXdpw8pUomqG3Qr3UcjnZL8axxOH82Lh0OOcx48uJKc_Dc3Cg==
< Content-Length: 227
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:45 GMT
< Via: 1.1 3f2788d309d30f41de96da6f931d4ede.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<!doctype html>
        <meta charset="utf-8">
        <title>Subdirectory home page</title>
        <p>Hello, this page resides in the /subdirectory/ directory.</p>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

This request works as expected because you are referencing the object directly. Now, you implement the [email protected] function to return the default index.html page for any subdirectory. Looking at the example JavaScript code, here’s where the magic happens:

var newuri = olduri.replace(/\/$/, '\/index.html');

You are going to use a JavaScript regular expression to match any ‘/’ that occurs at the end of the URI and replace it with ‘/index.html’. This is the equivalent to what S3 does on its own with static website hosting. However, as I mentioned earlier, you can’t rely on this if you want to use a policy on the bucket to restrict it so that users must access the bucket through CloudFront. That way, all requests to the S3 bucket must be authenticated using the S3 REST API. Because of this, you implement a [email protected] function that takes any client request ending in ‘/’ and append a default ‘index.html’ to the request before requesting the object from the origin.

In the Lambda console, choose Create function. On the next screen, skip the blueprint selection and choose Author from scratch, as you’ll use the sample code provided.

Next, configure the trigger. Choosing the empty box shows a list of available triggers. Choose CloudFront and select your CloudFront distribution ID (created earlier). For this example, leave Cache Behavior as * and CloudFront Event as Origin Request. Select the Enable trigger and replicate box and choose Next.

Lambda Trigger

Next, give the function a name and a description. Then, copy and paste the following code:

'use strict';
exports.handler = (event, context, callback) => {
    // Extract the request from the CloudFront event that is sent to [email protected] 
    var request = event.Records[0].cf.request;

    // Extract the URI from the request
    var olduri = request.uri;

    // Match any '/' that occurs at the end of a URI. Replace it with a default index
    var newuri = olduri.replace(/\/$/, '\/index.html');
    // Log the URI as received by CloudFront and the new URI to be used to fetch from origin
    console.log("Old URI: " + olduri);
    console.log("New URI: " + newuri);
    // Replace the received URI with the URI that includes the index page
    request.uri = newuri;
    // Return to CloudFront
    return callback(null, request);


Next, define a role that grants permissions to the Lambda function. For this example, choose Create new role from template, Basic Edge Lambda permissions. This creates a new IAM role for the Lambda function and grants the following permissions:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [

In a nutshell, these are the permissions that the function needs to create the necessary CloudWatch log group and log stream, and to put the log events so that the function is able to write logs when it executes.

After the function has been created, you can go back to the browser (or cURL) and re-run the test for the subdirectory request that failed previously:

› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/
*   Trying
* Connected to d3gt20ea1hllb.cloudfront.net ( port 80 (#0)
> GET /subdirectory/ HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
< HTTP/1.1 200 OK
< Date: Thu, 20 Jul 2017 21:18:44 GMT
< ETag: "ddf87c487acf7cef9d50418f0f8f8dae"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: rwFN7yHE70bT9xckBpceTsAPcmaadqWB9omPBv2P6WkIfQqdjTk_4w==
< Content-Length: 227
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:45 GMT
< Via: 1.1 3572de112011f1b625bb77410b0c5cca.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<!doctype html>
        <meta charset="utf-8">
        <title>Subdirectory home page</title>
        <p>Hello, this page resides in the /subdirectory/ directory.</p>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

You have now configured a way for CloudFront to return a default index page for subdirectories in S3!


In this post, you used [email protected] to be able to use CloudFront with an S3 origin access identity and serve a default root object on subdirectory URLs. To find out some more about this use-case, see [email protected] integration with CloudFront in our documentation.

If you have questions or suggestions, feel free to comment below. For troubleshooting or implementation help, check out the Lambda forum.

ACME Support in Apache HTTP Server Project

Post Syndicated from ris original https://lwn.net/Articles/736668/rss

Let’s Encrypt has announced
that Automatic Certificate Management Environment (ACME) protocol support
is being integrated into the Apache HTTP Server (httpd). “ACME support being built in to one of the world’s most popular Web servers, Apache httpd, is great because it means that deploying HTTPS will be even easier for millions of websites. It’s a huge step towards delivering the ideal certificate issuance and management experience to as many people as possible.

ACME Support in Apache HTTP Server Project

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org//2017/10/17/acme-support-in-apache-httpd.html

We’re excited that support for getting and managing TLS certificates via the ACME protocol is coming to the Apache HTTP Server Project (httpd). ACME is the protocol used by Let’s Encrypt, and hopefully other Certificate Authorities in the future. We anticipate this feature will significantly aid the adoption of HTTPS for new and existing websites.

We created Let’s Encrypt in order to make getting and managing TLS certificates as simple as possible. For Let’s Encrypt subscribers, this usually means obtaining an ACME client and executing some simple commands. Ultimately though, we’d like for most Let’s Encrypt subscribers to have ACME clients built in to their server software so that obtaining an additional piece of software is not necessary. The less work people have to do to deploy HTTPS the better!

ACME support being built in to one of the world’s most popular Web servers, Apache httpd, is great because it means that deploying HTTPS will be even easier for millions of websites. It’s a huge step towards delivering the ideal certificate issuance and management experience to as many people as possible.

The Apache httpd ACME module is called mod_md. It’s currently in the development version of httpd and a plan is being formulated to backport it to an httpd 2.4.x stable release. The mod_md code is also available on GitHub.

It’s also worth mentioning that the development version of Apache httpd now includes support for an SSLPolicy directive. Properly configuring TLS has traditionally involved making a large number of complex choices. With the SSLPolicy directive, admins simply select a modern, intermediate, or old TLS configuration, and sensible choices will be made for them.

Development of mod_md and the SSLPolicy directive has been funded by Mozilla and carried out primarily by Stefan Eissing of greenbytes. Thank you Mozilla and Stefan!

Let’s Encrypt is currently providing certificates for more than 55 million websites. We look forward to being able to serve even more websites as efforts like this make deploying HTTPS with Let’s Encrypt even easier. If you’re as excited about the potential for a 100% HTTPS Web as we are, please consider getting involved, making a donation, or sponsoring Let’s Encrypt.

Backblaze Release 5.1 – RMM Compatibility for Mass Deployments

Post Syndicated from Yev original https://www.backblaze.com/blog/rmm-for-mass-deployments/

diagram of Backblaze remote monitoring and management

Introducing Backblaze Computer Backup Release 5.1

This is a relatively minor release in terms of the core Backblaze Computer Backup service functionality, but is a big deal for Backblaze for Business as we’ve updated our Mac and PC clients to be RMM (Remote Monitoring and Management) compatible.

What Is New?

  • Updated Mac and PC clients to better handle large file uploads
  • Updated PC downloader to improve stability
  • Added RMM support for PC and Mac clients

What Is RMM?

RMM stands for “Remote Monitoring and Management.” It’s a way to administer computers that might be distributed geographically, without having access to the actual machine. If you are a systems administrator working with anywhere from a few distributed computers to a few thousand, you’re familiar with RMM and how it makes life easier.

The new clients allow administrators to deploy Backblaze Computer Backup through most “silent” installation/mass deployment tools. Two popular RMM tools are Munki and Jamf. We’ve written up knowledge base articles for both of these.

munki logo jamf logo
Learn more about Munki Learn more about Jamf

Do I Need To Use RMM Tools?

No — unless you are a systems administrator or someone who is deploying Backblaze to a lot of people all at once, you do not have to worry about RMM support.

Release Version Number:

Mac:  5.1.0
PC:  5.1.0


October 12, 2017

Upgrade Methods:

  • “Check for Updates” on the Backblaze Client (right click on the Backblaze icon and then select “Check for Updates”)
  • Download from: https://secure.backblaze.com/update.htm
  • Auto-update will begin in a couple of weeks
Mac backup update PC backup update
Updating Backblaze on Mac Updating Backblaze on Windows


If you have any questions, please contact Backblaze Support at www.backblaze.com/help.

The post Backblaze Release 5.1 – RMM Compatibility for Mass Deployments appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.


Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
h2o.init(nthreads = -1)
##  Connection successful!
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version: 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {


Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',

Verify the connection. The results returned depend on your specific Athena setup.

## <JDBCConnection>
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
## [1] TRUE
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
## [1] TRUE
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
  |                                                                 |   0%
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
  |                                                                 |   0%
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
  |                                                                 |   0%
  |=====                                                            |   8%
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

## H2OBinomialMetrics: glm
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
  |                                                                 |   0%
  |=======                                                          |  10%
  |=================================================================| 100%

Measure the performance of Model 2.

## H2OBinomialMetrics: glm
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
  |                                                                 |   0%
  |========                                                         |  12%
  |=================================================================| 100%
## H2OBinomialMetrics: glm
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
  |                                                                 |   0%
  |=================================================================| 100%
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## [373 rows x 3 columns]
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## [373 rows x 1 column]
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
  |                                                                 |   0%
  |===                                                              |   5%
  |=====                                                            |   7%
  |======                                                           |   9%
  |=======                                                          |  10%
  |======================                                           |  33%
  |=====================================                            |  56%
  |====================================================             |  79%
  |================================================================ |  98%
  |=================================================================| 100%
## H2OBinomialMetrics: gbm
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
  |                                                                 |   0%
  |===                                                              |   4%
  |=====                                                            |   8%
  |========                                                         |  12%
  |==========                                                       |  16%
  |=============                                                    |  20%
  |================                                                 |  24%
  |==================                                               |  28%
  |=====================                                            |  32%
  |=======================                                          |  36%
  |==========================                                       |  40%
  |=============================                                    |  44%
  |===============================                                  |  48%
  |==================================                               |  52%
  |====================================                             |  56%
  |=======================================                          |  60%
  |==========================================                       |  64%
  |============================================                     |  68%
  |===============================================                  |  72%
  |=================================================                |  76%
  |====================================================             |  80%
  |=======================================================          |  84%
  |=========================================================        |  88%
  |============================================================     |  92%
  |==============================================================   |  96%
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
## H2OBinomialMetrics: deeplearning
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model



















1.0 1.0





1.0 1.0





0.2033898 0.1355932



AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.


In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.

Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.

About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.



Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.



timeShift(GrafanaBuzz, 1w) Issue 17

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/13/timeshiftgrafanabuzz-1w-issue-17/

It’s been a busy week here at Grafana Labs. While we’ve been working on GrafanaCon EU preparations here at the NYC office, the Stockholm office has been diligently working to release Grafana 4.6-beta-1. We’re really excited about this latest release and look forward to your feedback on the new features.

Latest Release

Grafana 4.6-beta-1 is now available! Grafana v4.6 brings many enhancements to Annotations, Cloudwatch and Prometheus. It also adds support for Postgres as a metric and table data source!

To see more details on what’s in the newest version, please see the release notes.

Download Grafana 4.6.0-beta-1 Now

From the Blogosphere

Using Kafka and Grafana to Monitor Meteorological Conditions: Oliver was looking for a way to track historical mountain conditions around the UK, but only had available data for the last 24 hours. It seemed like a perfect job for Kafka. This post discusses how to get going with Kafka very easily, store the data in Graphite and visualize the data in Grafana.

Web Interfaces for your Syslog Server – An Overview: System administrators often prefer to use the command line, but complex queries can be completed much faster with logs indexed in a database and a web interface. This article provides a run-down of various GUI-based tools available for your syslog server.

JEE Performance with JMeter, Prometheus and Grafana. Complete Project from Scratch: This comprehensive article walks you through the steps of monitoring JEE application performance from scratch. We start with making implementation decisions, then how to collect data, visualization and dashboarding configuration, and conclude with alerting. Buckle up; it’s a long article, with a ton of information.

Early Bird Tickets Now Available

Early bird tickets are going fast, so take advantage of the discounted price before they’re gone! We will be announcing the first block of speakers in the coming week.

There’s still time to submit a talk. We’ll accept submissions through the end of October. We’re accepting technical and non-technical talks of all sizes. Submit a CFP.

Get Your Early Bird Ticket Now

Grafana Plugins

This week we add the Prometheus Alertmanager Data Source to our growing list of plugins, lots of updates to the GLPI Data source, and have a urgent bugfix for the WorldMap Panel. To update plugins from on-prem Grafana, use the Grafana-cli tool, or with 1 click if you are using Hosted Grafana.


Prometheus Alertmanager Data Source – This new data source lets you show data from the Prometheus Alertmanager in Grafana. The Alertmanager handles alerts sent by client applications such as the Prometheus server. With this data source, you can show data in Table form or as a SingleStat.

Install Now


WorldMap Panel – A new version with an urgent bugfix for Elasticsearch users:

  • A fix for Geohash maps after a breaking change in Grafana 4.5.0.
  • Last Geohash as center for the map – it centers the map on the last geohash position received. Useful for real time tracking (with auto refresh on in Grafana).



GLPI App – Lots of fixes in the new version:

  • Compatibility with GLPI 9.2
  • Autofill the Timerange field based on the query
  • When adding new query, add by default a ticket query instead of undefined
  • Correct values in hover tooltip
  • Can have element count by hour of the day with the panel histogram


Contributions of the week:

Each week we highlight some of the important contributions from our amazing open source community. Thank you for helping make Grafana better!

Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions

New Annotation Function

In addition to being able to add annotations easily in the graph panel, you can also create ranges as shown above. Give 4.6.0-beta-1 a try and give us your feedback.

We Need Your Help!

Do you have a graph that you love because the data is beautiful or because the graph provides interesting information? Please get in touch. Tweet or send us an email with a screenshot, and we’ll tell you about this fun experiment.

Tell Me More

What do you think?

We want to keep these articles interesting and relevant, so please tell us how we’re doing. Submit a comment on this article below, or post something at our community forum. Help us make these weekly roundups better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

AWS Developer Tools Expands Integration to Include GitHub

Post Syndicated from Balaji Iyer original https://aws.amazon.com/blogs/devops/aws-developer-tools-expands-integration-to-include-github/

AWS Developer Tools is a set of services that include AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy. Together, these services help you securely store and maintain version control of your application’s source code and automatically build, test, and deploy your application to AWS or your on-premises environment. These services are designed to enable developers and IT professionals to rapidly and safely deliver software.

As part of our continued commitment to extend the AWS Developer Tools ecosystem to third-party tools and services, we’re pleased to announce AWS CodeStar and AWS CodeBuild now integrate with GitHub. This will make it easier for GitHub users to set up a continuous integration and continuous delivery toolchain as part of their release process using AWS Developer Tools.

In this post, I will walk through the following:


You’ll need an AWS account, a GitHub account, an Amazon EC2 key pair, and administrator-level permissions for AWS Identity and Access Management (IAM), AWS CodeStar, AWS CodeBuild, AWS CodePipeline, Amazon EC2, Amazon S3.


Integrating GitHub with AWS CodeStar

AWS CodeStar enables you to quickly develop, build, and deploy applications on AWS. Its unified user interface helps you easily manage your software development activities in one place. With AWS CodeStar, you can set up your entire continuous delivery toolchain in minutes, so you can start releasing code faster.

When AWS CodeStar launched in April of this year, it used AWS CodeCommit as the hosted source repository. You can now choose between AWS CodeCommit or GitHub as the source control service for your CodeStar projects. In addition, your CodeStar project dashboard lets you centrally track GitHub activities, including commits, issues, and pull requests. This makes it easy to manage project activity across the components of your CI/CD toolchain. Adding the GitHub dashboard view will simplify development of your AWS applications.

In this section, I will show you how to use GitHub as the source provider for your CodeStar projects. I’ll also show you how to work with recent commits, issues, and pull requests in the CodeStar dashboard.

Sign in to the AWS Management Console and from the Services menu, choose CodeStar. In the CodeStar console, choose Create a new project. You should see the Choose a project template page.

CodeStar Project

Choose an option by programming language, application category, or AWS service. I am going to choose the Ruby on Rails web application that will be running on Amazon EC2.

On the Project details page, you’ll now see the GitHub option. Type a name for your project, and then choose Connect to GitHub.

Project details

You’ll see a message requesting authorization to connect to your GitHub repository. When prompted, choose Authorize, and then type your GitHub account password.


This connects your GitHub identity to AWS CodeStar through OAuth. You can always review your settings by navigating to your GitHub application settings.

Installed GitHub Apps

You’ll see AWS CodeStar is now connected to GitHub:

Create project

You can choose a public or private repository. GitHub offers free accounts for users and organizations working on public and open source projects and paid accounts that offer unlimited private repositories and optional user management and security features.

In this example, I am going to choose the public repository option. Edit the repository description, if you like, and then choose Next.

Review your CodeStar project details, and then choose Create Project. On Choose an Amazon EC2 Key Pair, choose Create Project.

Key Pair

On the Review project details page, you’ll see Edit Amazon EC2 configuration. Choose this link to configure instance type, VPC, and subnet options. AWS CodeStar requires a service role to create and manage AWS resources and IAM permissions. This role will be created for you when you select the AWS CodeStar would like permission to administer AWS resources on your behalf check box.

Choose Create Project. It might take a few minutes to create your project and resources.

Review project details

When you create a CodeStar project, you’re added to the project team as an owner. If this is the first time you’ve used AWS CodeStar, you’ll be asked to provide the following information, which will be shown to others:

  • Your display name.
  • Your email address.

This information is used in your AWS CodeStar user profile. User profiles are not project-specific, but they are limited to a single AWS region. If you are a team member in projects in more than one region, you’ll have to create a user profile in each region.

User settings

User settings

Choose Next. AWS CodeStar will create a GitHub repository with your configuration settings (for example, https://github.com/biyer/ruby-on-rails-service).

When you integrate your integrated development environment (IDE) with AWS CodeStar, you can continue to write and develop code in your preferred environment. The changes you make will be included in the AWS CodeStar project each time you commit and push your code.


After setting up your IDE, choose Next to go to the CodeStar dashboard. Take a few minutes to familiarize yourself with the dashboard. You can easily track progress across your entire software development process, from your backlog of work items to recent code deployments.


After the application deployment is complete, choose the endpoint that will display the application.


This is what you’ll see when you open the application endpoint:

The Commit history section of the dashboard lists the commits made to the Git repository. If you choose the commit ID or the Open in GitHub option, you can use a hotlink to your GitHub repository.

Commit history

Your AWS CodeStar project dashboard is where you and your team view the status of your project resources, including the latest commits to your project, the state of your continuous delivery pipeline, and the performance of your instances. This information is displayed on tiles that are dedicated to a particular resource. To see more information about any of these resources, choose the details link on the tile. The console for that AWS service will open on the details page for that resource.


You can also filter issues based on their status and the assigned user.


AWS CodeBuild Now Supports Building GitHub Pull Requests

CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue. You can use prepackaged build environments to get started quickly or you can create custom build environments that use your own build tools.

We recently announced support for GitHub pull requests in AWS CodeBuild. This functionality makes it easier to collaborate across your team while editing and building your application code with CodeBuild. You can use the AWS CodeBuild or AWS CodePipeline consoles to run AWS CodeBuild. You can also automate the running of AWS CodeBuild by using the AWS Command Line Interface (AWS CLI), the AWS SDKs, or the AWS CodeBuild Plugin for Jenkins.

AWS CodeBuild

In this section, I will show you how to trigger a build in AWS CodeBuild with a pull request from GitHub through webhooks.

Open the AWS CodeBuild console at https://console.aws.amazon.com/codebuild/. Choose Create project. If you already have a CodeBuild project, you can choose Edit project, and then follow along. CodeBuild can connect to AWS CodeCommit, S3, BitBucket, and GitHub to pull source code for builds. For Source provider, choose GitHub, and then choose Connect to GitHub.


After you’ve successfully linked GitHub and your CodeBuild project, you can choose a repository in your GitHub account. CodeBuild also supports connections to any public repository. You can review your settings by navigating to your GitHub application settings.

GitHub Apps

On Source: What to Build, for Webhook, select the Rebuild every time a code change is pushed to this repository check box.

Note: You can select this option only if, under Repository, you chose Use a repository in my account.


In Environment: How to build, for Environment image, select Use an image managed by AWS CodeBuild. For Operating system, choose Ubuntu. For Runtime, choose Base. For Version, choose the latest available version. For Build specification, you can provide a collection of build commands and related settings, in YAML format (buildspec.yml) or you can override the build spec by inserting build commands directly in the console. AWS CodeBuild uses these commands to run a build. In this example, the output is the string “hello.”


On Artifacts: Where to put the artifacts from this build project, for Type, choose No artifacts. (This is also the type to choose if you are just running tests or pushing a Docker image to Amazon ECR.) You also need an AWS CodeBuild service role so that AWS CodeBuild can interact with dependent AWS services on your behalf. Unless you already have a role, choose Create a role, and for Role name, type a name for your role.


In this example, leave the advanced settings at their defaults.

If you expand Show advanced settings, you’ll see options for customizing your build, including:

  • A build timeout.
  • A KMS key to encrypt all the artifacts that the builds for this project will use.
  • Options for building a Docker image.
  • Elevated permissions during your build action (for example, accessing Docker inside your build container to build a Dockerfile).
  • Resource options for the build compute type.
  • Environment variables (built-in or custom). For more information, see Create a Build Project in the AWS CodeBuild User Guide.

Advanced settings

You can use the AWS CodeBuild console to create a parameter in Amazon EC2 Systems Manager. Choose Create a parameter, and then follow the instructions in the dialog box. (In that dialog box, for KMS key, you can optionally specify the ARN of an AWS KMS key in your account. Amazon EC2 Systems Manager uses this key to encrypt the parameter’s value during storage and decrypt during retrieval.)

Create parameter

Choose Continue. On the Review page, either choose Save and build or choose Save to run the build later.

Choose Start build. When the build is complete, the Build logs section should display detailed information about the build.


To demonstrate a pull request, I will fork the repository as a different GitHub user, make commits to the forked repo, check in the changes to a newly created branch, and then open a pull request.

Pull request

As soon as the pull request is submitted, you’ll see CodeBuild start executing the build.


GitHub sends an HTTP POST payload to the webhook’s configured URL (highlighted here), which CodeBuild uses to download the latest source code and execute the build phases.

Build project

If you expand the Show all checks option for the GitHub pull request, you’ll see that CodeBuild has completed the build, all checks have passed, and a deep link is provided in Details, which opens the build history in the CodeBuild console.

Pull request


In this post, I showed you how to use GitHub as the source provider for your CodeStar projects and how to work with recent commits, issues, and pull requests in the CodeStar dashboard. I also showed you how you can use GitHub pull requests to automatically trigger a build in AWS CodeBuild — specifically, how this functionality makes it easier to collaborate across your team while editing and building your application code with CodeBuild.

About the author:

Balaji Iyer is an Enterprise Consultant for the Professional Services Team at Amazon Web Services. In this role, he has helped several customers successfully navigate their journey to AWS. His specialties include architecting and implementing highly scalable distributed systems, serverless architectures, large scale migrations, operational security, and leading strategic AWS initiatives. Before he joined Amazon, Balaji spent more than a decade building operating systems, big data analytics solutions, mobile services, and web applications. In his spare time, he enjoys experiencing the great outdoors and spending time with his family.


How to Automatically Revert and Receive Notifications About Changes to Your Amazon VPC Security Groups

Post Syndicated from Rob Barnes original https://aws.amazon.com/blogs/security/how-to-automatically-revert-and-receive-notifications-about-changes-to-your-amazon-vpc-security-groups/

In a previous AWS Security Blog post, Jeff Levine showed how you can monitor changes to your Amazon EC2 security groups. The methods he describes in that post are examples of detective controls, which can help you determine when changes are made to security controls on your AWS resources.

In this post, I take that approach a step further by introducing an example of a responsive control, which you can use to automatically respond to a detected security event by applying a chosen security mitigation. I demonstrate a solution that continuously monitors changes made to an Amazon VPC security group, and if a new ingress rule (the same as an inbound rule) is added to that security group, the solution removes the rule and then sends you a notification after the changes have been automatically reverted.

The scenario

Let’s say you want to reduce your infrastructure complexity by replacing your Secure Shell (SSH) bastion hosts with Amazon EC2 Systems Manager (SSM). SSM allows you to run commands on your hosts remotely, removing the need to manage bastion hosts or rely on SSH to execute commands. To support this objective, you must prevent your staff members from opening SSH ports to your web server’s Amazon VPC security group. If one of your staff members does modify the VPC security group to allow SSH access, you want the change to be automatically reverted and then receive a notification that the change to the security group was automatically reverted. If you are not yet familiar with security groups, see Security Groups for Your VPC before reading the rest of this post.

Solution overview

This solution begins with a directive control to mandate that no web server should be accessible using SSH. The directive control is enforced using a preventive control, which is implemented using a security group rule that prevents ingress from port 22 (typically used for SSH). The detective control is a “listener” that identifies any changes made to your security group. Finally, the responsive control reverts changes made to the security group and then sends a notification of this security mitigation.

The detective control, in this case, is an Amazon CloudWatch event that detects changes to your security group and triggers the responsive control, which in this case is an AWS Lambda function. I use AWS CloudFormation to simplify the deployment.

The following diagram shows the architecture of this solution.

Solution architecture diagram

Here is how the process works:

  1. Someone on your staff adds a new ingress rule to your security group.
  2. A CloudWatch event that continually monitors changes to your security groups detects the new ingress rule and invokes a designated Lambda function (with Lambda, you can run code without provisioning or managing servers).
  3. The Lambda function evaluates the event to determine whether you are monitoring this security group and reverts the new security group ingress rule.
  4. Finally, the Lambda function sends you an email to let you know what the change was, who made it, and that the change was reverted.

Deploy the solution by using CloudFormation

In this section, you will click the Launch Stack button shown below to launch the CloudFormation stack and deploy the solution.


  • You must have AWS CloudTrail already enabled in the AWS Region where you will be deploying the solution. CloudTrail lets you log, continuously monitor, and retain events related to API calls across your AWS infrastructure. See Getting Started with CloudTrail for more information.
  • You must have a default VPC in the region in which you will be deploying the solution. AWS accounts have one default VPC per AWS Region. If you’ve deleted your VPC, see Creating a Default VPC to recreate it.

Resources that this solution creates

When you launch the CloudFormation stack, it creates the following resources:

  • A sample VPC security group in your default VPC, which is used as the target for reverting ingress rule changes.
  • A CloudWatch event rule that monitors changes to your AWS infrastructure.
  • A Lambda function that reverts changes to the security group and sends you email notifications.
  • A permission that allows CloudWatch to invoke your Lambda function.
  • An AWS Identity and Access Management (IAM) role with limited privileges that the Lambda function assumes when it is executed.
  • An Amazon SNS topic to which the Lambda function publishes notifications.

Launch the CloudFormation stack

The link in this section uses the us-east-1 Region (the US East [N. Virginia] Region). Change the region if you want to use this solution in a different region. See Selecting a Region for more information about changing the region.

To deploy the solution, click the following Launch Stack button to launch the stack. After you click the button, you must sign in to the AWS Management Console if you have not already done so.

Click this "Launch Stack" button


  1. Choose Next to proceed to the Specify Details page.
  2. On the Specify Details page, type your email address in the Send notifications to box. This is the email address to which change notifications will be sent. (After the stack is launched, you will receive a confirmation email that you must accept before you can receive notifications.)
  3. Choose Next until you get to the Review page, and then choose the I acknowledge that AWS CloudFormation might create IAM resources check box. This confirms that you are aware that the CloudFormation template includes an IAM resource.
  4. Choose Create. CloudFormation displays the stack status, CREATE_COMPLETE, when the stack has launched completely, which should take less than two minutes.Screenshot showing that the stack has launched completely

Testing the solution

  1. Check your email for the SNS confirmation email. You must confirm this subscription to receive future notification emails. If you don’t confirm the subscription, your security group ingress rules still will be automatically reverted, but you will not receive notification emails.
  2. Navigate to the EC2 console and choose Security Groups in the navigation pane.
  3. Choose the security group created by CloudFormation. Its name is Web Server Security Group.
  4. Choose the Inbound tab in the bottom pane of the page. Note that only one rule allows HTTPS ingress on port 443 from (from anywhere).Screenshot showing the "Inbound" tab in the bottom pane of the page
  1. Choose Edit to display the Edit inbound rules dialog box (again, an inbound rule and an ingress rule are the same thing).
  2. Choose Add Rule.
  3. Choose SSH from the Type drop-down list.
  4. Choose My IP from the Source drop-down list. Your IP address is populated for you. By adding this rule, you are simulating one of your staff members violating your organization’s policy (in this blog post’s hypothetical example) against allowing SSH access to your EC2 servers. You are testing the solution created when you launched the CloudFormation stack in the previous section. The solution should remove this newly created SSH rule automatically.
    Screenshot of editing inbound rules
  5. Choose Save.

Adding this rule creates an EC2 AuthorizeSecurityGroupIngress service event, which triggers the Lambda function created in the CloudFormation stack. After a few moments, choose the refresh button ( The "refresh" icon ) to see that the new SSH ingress rule that you just created has been removed by the solution you deployed earlier with the CloudFormation stack. If the rule is still there, wait a few more moments and choose the refresh button again.

Screenshot of refreshing the page to see that the SSH ingress rule has been removed

You should also receive an email to notify you that the ingress rule was added and subsequently reverted.

Screenshot of the notification email

Cleaning up

If you want to remove the resources created by this CloudFormation stack, you can delete the CloudFormation stack:

  1. Navigate to the CloudFormation console.
  2. Choose the stack that you created earlier.
  3. Choose the Actions drop-down list.
  4. Choose Delete Stack, and then choose Yes, Delete.
  5. CloudFormation will display a status of DELETE_IN_PROGRESS while it deletes the resources created with the stack. After a few moments, the stack should no longer appear in the list of completed stacks.
    Screenshot of stack "DELETE_IN_PROGRESS"

Other applications of this solution

I have shown one way to use multiple AWS services to help continuously ensure that your security controls haven’t deviated from your security baseline. However, you also could use the CIS Amazon Web Services Foundations Benchmarks, for example, to establish a governance baseline across your AWS accounts and then use the principles in this blog post to automatically mitigate changes to that baseline.

To scale this solution, you can create a framework that uses resource tags to identify particular resources for monitoring. You also can use a consolidated monitoring approach by using cross-account event delivery. See Sending and Receiving Events Between AWS Accounts for more information. You also can extend the principle of automatic mitigation to detect and revert changes to other resources such as IAM policies and Amazon S3 bucket policies.


In this blog post, I demonstrated how you can automatically revert changes to a VPC security group and have a notification sent about the changes. You can use this solution in your own AWS accounts to enforce your security requirements continuously.

If you have comments about this blog post or other ideas for ways to use this solution, submit a comment in the “Comments” section below. If you have implementation questions, start a new thread in the EC2 forum or contact AWS Support.

– Rob

Application Load Balancers Now Support Multiple TLS Certificates With Smart Selection Using SNI

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-application-load-balancer-sni/

Today we’re launching support for multiple TLS/SSL certificates on Application Load Balancers (ALB) using Server Name Indication (SNI). You can now host multiple TLS secured applications, each with its own TLS certificate, behind a single load balancer. In order to use SNI, all you need to do is bind multiple certificates to the same secure listener on your load balancer. ALB will automatically choose the optimal TLS certificate for each client. These new features are provided at no additional charge.

If you’re looking for a TL;DR on how to use this new feature just click here. If you’re like me and you’re a little rusty on the specifics of Transport Layer Security (TLS) then keep reading.


People tend to use the terms SSL and TLS interchangeably even though the two are technically different. SSL technically refers to a predecessor of the TLS protocol. To keep things simple I’ll be using the term TLS for the rest of this post.

TLS is a protocol for securely transmitting data like passwords, cookies, and credit card numbers. It enables privacy, authentication, and integrity of the data being transmitted. TLS uses certificate based authentication where certificates are like ID cards for your websites. You trust the person that signed and issued the certificate, the certificate authority (CA), so you trust that the data in the certificate is correct. When a browser connects to your TLS-enabled ALB, ALB presents a certificate that contains your site’s public key, which has been cryptographically signed by a CA. This way the client can be sure it’s getting the ‘real you’ and that it’s safe to use your site’s public key to establish a secure connection.

With SNI support we’re making it easy to use more than one certificate with the same ALB. The most common reason you might want to use multiple certificates is to handle different domains with the same load balancer. It’s always been possible to use wildcard and subject-alternate-name (SAN) certificates with ALB, but these come with limitations. Wildcard certificates only work for related subdomains that match a simple pattern and while SAN certificates can support many different domains, the same certificate authority has to authenticate each one. That means you have reauthenticate and reprovision your certificate everytime you add a new domain.

One of our most frequent requests on forums, reddit, and in my e-mail inbox has been to use the Server Name Indication (SNI) extension of TLS to choose a certificate for a client. Since TLS operates at the transport layer, below HTTP, it doesn’t see the hostname requested by a client. SNI works by having the client tell the server “This is the domain I expect to get a certificate for” when it first connects. The server can then choose the correct certificate to respond to the client. All modern web browsers and a large majority of other clients support SNI. In fact, today we see SNI supported by over 99.5% of clients connecting to CloudFront.

Smart Certificate Selection on ALB

ALB’s smart certificate selection goes beyond SNI. In addition to containing a list of valid domain names, certificates also describe the type of key exchange and cryptography that the server supports, as well as the signature algorithm (SHA2, SHA1, MD5) used to sign the certificate. To establish a TLS connection, a client starts a TLS handshake by sending a “ClientHello” message that outlines the capabilities of the client: the protocol versions, extensions, cipher suites, and compression methods. Based on what an individual client supports, ALB’s smart selection algorithm chooses a certificate for the connection and sends it to the client. ALB supports both the classic RSA algorithm and the newer, hipper, and faster Elliptic-curve based ECDSA algorithm. ECDSA support among clients isn’t as prevalent as SNI, but it is supported by all modern web browsers. Since it’s faster and requires less CPU, it can be particularly useful for ultra-low latency applications and for conserving the amount of battery used by mobile applications. Since ALB can see what each client supports from the TLS handshake, you can upload both RSA and ECDSA certificates for the same domains and ALB will automatically choose the best one for each client.

Using SNI with ALB

I’ll use a few example websites like VimIsBetterThanEmacs.com and VimIsTheBest.com. I’ve purchased and hosted these domains on Amazon Route 53, and provisioned two separate certificates for them in AWS Certificate Manager (ACM). If I want to securely serve both of these sites through a single ALB, I can quickly add both certificates in the console.

First, I’ll select my load balancer in the console, go to the listeners tab, and select “view/edit certificates”.

Next, I’ll use the “+” button in the top left corner to select some certificates then I’ll click the “Add” button.

There are no more steps. If you’re not really a GUI kind of person you’ll be pleased to know that it’s also simple to add new certificates via the AWS Command Line Interface (CLI) (or SDKs).

aws elbv2 add-listener-certificates --listener-arn <listener-arn> --certificates CertificateArn=<cert-arn>

Things to know

  • ALB Access Logs now include the client’s requested hostname and the certificate ARN used. If the “hostname” field is empty (represented by a “-“) the client did not use the SNI extension in their request.
  • You can use any of your certificates in ACM or IAM.
  • You can bind multiple certificates for the same domain(s) to a secure listener. Your ALB will choose the optimal certificate based on multiple factors including the capabilities of the client.
  • If the client does not support SNI your ALB will use the default certificate (the one you specified when you created the listener).
  • There are three new ELB API calls: AddListenerCertificates, RemoveListenerCertificates, and DescribeListenerCertificates.
  • You can bind up to 25 certificates per load balancer (not counting the default certificate).
  • These new features are supported by AWS CloudFormation at launch.

You can see an example of these new features in action with a set of websites created by my colleague Jon Zobrist: https://www.exampleloadbalancer.com/.

Overall, I will personally use this feature and I’m sure a ton of AWS users will benefit from it as well. I want to thank the Elastic Load Balancing team for all their hard work in getting this into the hands of our users.


Low-tech Raspberry Pi robot

Post Syndicated from Rachel Churcher original https://www.raspberrypi.org/blog/low-tech-raspberry-pi-robot/

Robot-builder extraordinaire Clément Didier is ushering in the era of our cybernetic overlords. Future generations will remember him as the creator of robots constructed from cardboard and conductive paint which are so easy to replicate that a robot could do it. Welcome to the singularity.

Bare Conductive on Twitter

This cool robot was made with the #PiCap, conductive paint and @Raspberry_Pi by @clementdidier. Full tutorial: https://t.co/AcQVTS4vr2 https://t.co/D04U5UGR0P

Simple interface

To assemble the robot, Clément made use of a Pi Cap board, a motor driver, and most importantly, a tube of Bare Conductive Electric Paint. He painted the control interface onto the cardboard surface of the robot, allowing a human, replicant, or superior robot to direct its movements simply by touching the paint.

Clever design

The Raspberry Pi 3, the motor control board, and the painted input buttons interface via the GPIO breakout pins on the Pi Cap. Crocodile clips connect the Pi Cap to the cardboard-and-paint control surface, while jumper wires connect it to the motor control board.

Raspberry Pi and bare conductive Pi Cap

Sing with me: ‘The Raspberry Pi’s connected to the Pi Cap, and the Pi Cap’s connected to the inputs, and…’

Two battery packs provide power to the Raspberry Pi, and to the four independently driven motors. Software, written in Python, allows the robot to respond to inputs from the conductive paint. The motors drive wheels attached to a plastic chassis, moving and turning the robot at the touch of a square of black paint.

Artistic circuit

Clément used masking tape and a paintbrush to create the control buttons. For a human, this is obviously a fiddly process which relies on the blocking properties of the masking tape and a steady hand. For a robot, however, the process would be a simple, freehand one, resulting in neatly painted circuits on every single one of countless robotic minions. Cybernetic domination is at (metallic) hand.

The control surface of the robot, painted with bare conductive paint

One fiddly job for a human, one easy task for robotkind

The instructions and code for Clément’s build can be found here.

Low-tech solutions

Here at Pi Towers, we love seeing the high-tech Raspberry Pi integrated so successfully with low-tech components. In addition to conductive paint, we’ve seen cardboard laptops, toilet roll robots, fruit drum kits, chocolate box robots, and hamster-wheel-triggered cameras. Have you integrated low-tech elements into your projects (and potentially accelerated the robot apocalypse in the process)? Tell us about it in the comments!


The post Low-tech Raspberry Pi robot appeared first on Raspberry Pi.

RaspiReader: build your own fingerprint reader

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/raspireader-fingerprint-scanner/

Three researchers from Michigan State University have developed a low-cost, open-source fingerprint reader which can detect fake prints. They call it RaspiReader, and they’ve built it using a Raspberry Pi 3 and two Camera Modules. Joshua and his colleagues have just uploaded all the info you need to build your own version — let’s go!

GIF of fingerprint match points being aligned on fingerprint, not real output of RaspiReader software

Sadly not the real output of the RaspiReader

Falsified fingerprints

We’ve probably all seen a movie in which a burglar crosses a room full of laser tripwires and then enters the safe full of loot by tricking the fingerprint-secured lock with a fake print. Turns out, the second part is not that unrealistic: you can fake fingerprints using a range of materials, such as glue or latex.

Examples of live and fake fingerprints collected by the RaspiReader team

The RaspiReader team collected live and fake fingerprints to test the device

If the spoof print layer capping the spoofer’s finger is thin enough, it can even fool readers that detect blood flow, pulse, or temperature. This is becoming a significant security risk, not least for anyone who unlocks their smartphone using a fingerprint.

The RaspiReader

This is where Anil K. Jain comes in: Professor Jain leads a biometrics research group. Under his guidance, Joshua J. Engelsma and Kai Cao set out to develop a fingerprint reader with improved spoof-print detection. Ultimately, they aim to help the development of more secure commercial technologies. With their project, the team has also created an amazing resource for anyone who wants to build their own fingerprint reader.

So that replicating their device would be easy, they wanted to make it using inexpensive, readily available components, which is why they turned to Raspberry Pi technology.

RaspiReader fingerprint scanner by PRIP lab

The Raspireader and its output

Inside the RaspiReader’s 3D-printed housing, LEDs shine light through an acrylic prism, on top of which the user rests their finger. The prism refracts the light so that the two Camera Modules can take images from different angles. The Pi receives these images via a Multi Camera Adapter Module feeding into the CSI port. Collecting two images means the researchers’ spoof detection algorithm has more information to work with.

Comparison of live and spoof fingerprints

Real on the left, fake on the right

RaspiReader software

The Camera Adaptor uses the RPi.GPIO Python package. The RaspiReader performs image processing, and its spoof detection takes image colour and 3D friction ridge patterns into account. The detection algorithm extracts colour local binary patterns … please don’t ask me to explain! You can have a look at the researchers’ manuscript if you want to get stuck into the fine details of their project.

Build your own fingerprint reader

I’ve had my eyes glued to my inbox waiting for Josh to send me links to instructions and files for this build, and here they are (thanks, Josh)! Check out the video tutorial, which walks you through how to assemble the RaspiReader:

RaspiReader: Cost-Effective Open-Source Fingerprint Reader

Building a cost-effective, open-source, and spoof-resilient fingerprint reader for $160* in under an hour. Code: https://github.com/engelsjo/RaspiReader Links to parts: 1. PRISM – https://www.amazon.com/gp/product/B00WL3OBK4/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1 (Better fit) https://www.thorlabs.com/thorproduct.cfm?partnumber=PS611 2. RaspiCams – https://www.amazon.com/gp/product/B012V1HEP4/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1 3. Camera Multiplexer https://www.amazon.com/gp/product/B012UQWOOQ/ref=oh_aui_detailpage_o04_s01?ie=UTF8&psc=1 4. Raspberry Pi Kit: https://www.amazon.com/CanaKit-Raspberry-Clear-Power-Supply/dp/B01C6EQNNK/ref=sr_1_6?ie=UTF8&qid=1507058509&sr=8-6&keywords=raspberry+pi+3b Whitepaper: https://arxiv.org/abs/1708.07887 * Prices can vary based on Amazon’s pricing. P.s.

You can find a parts list with links to suppliers in the video description — the whole build costs around $160. All the STL files for the housing and the Python scripts you need to run on the Pi are available on Josh’s GitHub.

Enhance your home security

The RaspiReader is a great resource for researchers, and it would also be a terrific project to build at home! Is there a more impressive way to protect a treasured possession, or secure access to your computer, than with a DIY fingerprint scanner?

Check out this James-Bond-themed blog post for Raspberry Pi resources to help you build a high-security lair. If you want even more inspiration, watch this video about a laser-secured cookie jar which Estefannie made for us. And be sure to share your successful fingerprint scanner builds with us via social media!

The post RaspiReader: build your own fingerprint reader appeared first on Raspberry Pi.

PlayerUnknown’s Battlegrounds on a Game Boy?!

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/playerunknowns-battlegrounds-game-boy/

My evenings spent watching the Polygon Awful Squad play PlayerUnknown’s Battlegrounds for hours on end have made me mildly obsessed with this record-breaking Steam game.

PlayerUnknown's Battlegrounds Raspberry Pi

So when Michael Darby’s latest PUBG-inspired Game Boy build appeared in my notifications last week, I squealed with excitement and quickly sent the link to my team…while drinking a cocktail by a pool in Turkey ☀️🍹


https://314reactor.com/ https://www.hackster.io/314reactor https://twitter.com/the_mikey_d

PlayerUnknown’s Battlegrounds

For those unfamiliar with the game: PlayerUnknown’s Battlegrounds, or PUBG for short, is a Battle-Royale-style multiplayer online video game in which individuals or teams fight to the death on an island map. As players collect weapons, ammo, and transport, their ‘safe zone’ shrinks, forcing a final face-off until only one character remains.

The game has been an astounding success on Steam, the digital distribution platform which brings PUBG to the masses. It records daily player counts of over a million!

PlayerUnknown's Battlegrounds Raspberry Pi

Yeah, I’d say one or two people seem to enjoy it!

PUBG on a Game Boy?!

As it’s a fairly complex game, let’s get this out of the way right now: no, Michael is not running the entire game on a Nintendo Game Boy. That would be magic silly impossible. Instead, he’s streaming the game from his home PC to a Raspberry Pi Zero W fitted within the hacked handheld console.

Michael removed the excess plastic inside an old Game Boy Color shell to make space for a Zero W, LiPo battery, and TFT screen. He then soldered the necessary buttons to GPIO pins, and wrote a Python script to control them.

PlayerUnknown's Battlegrounds Raspberry Pi

The maker battleground

The full script can be found here, along with a more detailed tutorial for the build.

In order to stream PUBG to the Zero W, Michael uses the open-source NVIDIA steaming service Moonlight. He set his PC’s screen resolution to 800×600 and its frame rate to 30, so that streaming the game to the TFT screen works perfectly, albeit with no sound.

PlayerUnknown's Battlegrounds Raspberry Pi

The end result is a rather impressive build that has confused YouTube commenters since he uploaded footage for it last week. The video has more than 60000 views to date, so it appears we’re not the only ones impressed with Michael’s make.


If you’re a regular reader of our blog, you may recognise Michael’s name from his recent Nerf blaster mod. And fans of Raspberry Pi may also have seen his Pi-powered Windows 98 wristwatch earlier in the year. He blogs at 314reactor, where you can read more about his digital making projects.

Windows 98 Wrist watch Raspberry Pi PlayerUnknown's Battlegrounds

Player Two has entered the game

Now it’s your turn. Have you used a Raspberry Pi to create a gaming system? I’m not just talking arcades and RetroPie here. We want to see everything, from Pi-powered board games to tech on the football field.

Share your builds in the comments below and while you’re at it, what game would you like to stream to a handheld device?

The post PlayerUnknown’s Battlegrounds on a Game Boy?! appeared first on Raspberry Pi.

Creating a Cost-Efficient Amazon ECS Cluster for Scheduled Tasks

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or other tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances.  Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.


The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose these logs to be delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.


To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

  • Amazon ECR repository for storing the Docker image to be used in the task definition
  • S3 bucket that holds the transferred logs
  • Task definition, with image name and S3 bucket as environment variables provided via input parameter
  • CloudWatch Events rule
  • Amazon ECS cluster
  • Amazon ECS container instances using Spot Fleet
  • IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

  1. Clone the code from GitHub that performs RDS API calls to retrieve the log files.
    git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
  2. Build and tag the image.
    cd aws-ecs-scheduled-tasks/container-code/src && ls

    Dockerfile		rdslogsshipper.py	requirements.txt

    docker build -t rdslogsshipper .

    Sending build context to Docker daemon 9.728 kB
    Step 1 : FROM python:3
     ---> 41397f4f2887
    Step 2 : WORKDIR /usr/src/app
     ---> Using cache
     ---> 59299c020e7e
    Step 3 : COPY requirements.txt ./
     ---> 8c017e931c3b
    Removing intermediate container df09e1bed9f2
    Step 4 : COPY rdslogsshipper.py /usr/src/app
     ---> 099a49ca4325
    Removing intermediate container 1b1da24a6699
    Step 5 : RUN pip install --no-cache-dir -r requirements.txt
     ---> Running in 3ed98b30901d
    Collecting boto3 (from -r requirements.txt (line 1))
      Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB)
    Collecting botocore (from -r requirements.txt (line 2))
      Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB)
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1))
      Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB)
    Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1))
      Downloading jmespath-0.9.3-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2))
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2))
      Downloading docutils-0.14-py3-none-any.whl (543kB)
    Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2))
      Downloading six-1.10.0-py2.py3-none-any.whl
    Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3
    Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0
     ---> f892d3cb7383
    Removing intermediate container 3ed98b30901d
    Step 6 : COPY . .
     ---> ea7550c04fea
    Removing intermediate container b558b3ebd406
    Successfully built ea7550c04fea
  3. Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
  4. Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
  5. Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
    aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule  --schedule-expression "rate(15 minutes)"

        "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule"
  6. CloudWatch requires IAM permissions to place a task on the Amazon ECS cluster when the CloudWatch event rule is executed, in addition to an IAM role that can be assumed by CloudWatch Events. This is done in three steps:
    1. Create the IAM role to be assumed by CloudWatch.
      aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json

          "Role": {
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17", 
                  "Statement": [
                          "Action": "sts:AssumeRole", 
                          "Effect": "Allow", 
                          "Principal": {
                              "Service": "events.amazonaws.com"
              "RoleId": "AROAIRYYLDCVZCUACT7FS", 
              "CreateDate": "2017-07-14T22:44:52.627Z", 
              "RoleName": "Test-Role", 
              "Path": "/", 
              "Arn": "arn:aws:iam::12345678901:role/Test-Role"

      The following is an example of the event-role.json file used earlier:

          "Version": "2012-10-17",
          "Statement": [
                  "Effect": "Allow",
                  "Principal": {
                    "Service": "events.amazonaws.com"
                  "Action": "sts:AssumeRole"
    2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
      aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json

          "Policy": {
              "PolicyName": "test-policy", 
              "CreateDate": "2017-07-14T22:51:20.293Z", 
              "AttachmentCount": 0, 
              "IsAttachable": true, 
              "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", 
              "DefaultVersionId": "v1", 
              "Path": "/", 
              "Arn": "arn:aws:iam::123455678901:policy/test-policy", 
              "UpdateDate": "2017-07-14T22:51:20.293Z"

      The following is an example of the event-policy.json file used earlier:

          "Version": "2012-10-17",
          "Statement": [
                "Effect": "Allow",
                "Action": [
                "Resource": [
                "Condition": {
                    "ArnLike": {
                        "ecs:cluster": "arn:aws:ecs:*::cluster/"
    3. Attach the IAM policy to the role.
      aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
  7. Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
    aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"

        "FailedEntries": [], 
        "FailedEntryCount": 0

That’s it. The logs now run based on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.


In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run outside of Lambda execution time limits or be not cost-effective for an EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.

Browser hacking for 280 character tweets

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/09/browser-hacking-for-280-character-tweets.html

Twitter has raised the limit to 280 characters for a select number of people. However, they left open a hole, allowing anybody to make large tweets with a little bit of hacking. The hacking skills needed are basic hacking skills, which I thought I’d write up in a blog post.

Specifically, the skills you will exercise are:

  • basic command-line shell
  • basic HTTP requests
  • basic browser DOM editing

The short instructions

The basic instructions were found in tweets like the following:
These instructions are clear to the average hacker, but of course, a bit difficult for those learning hacking, hence this post.

The command-line

The basics of most hacking start with knowledge of the command-line. This is the “Terminal” app under macOS or cmd.exe under Windows. Almost always when you see hacking dramatized in the movies, they are using the command-line.
In the beginning, the command-line is all computers had. To do anything on a computer, you had to type a “command” telling it what to do. What we see as the modern graphical screen is a layer on top of the command-line, one that translates clicks of the mouse into the raw commands.
On most systems, the command-line is known as “bash”. This is what you’ll find on Linux and macOS. Windows historically has had a different command-line that uses slightly different syntax, though in the last couple years, they’ve also supported “bash”. You’ll have to install it first, such as by following these instructions.
You’ll see me use command that may not be yet installed on your “bash” command-line, like nc and curl. You’ll need to run a command to install them, such as:
sudo apt-get install nc curl
The thing to remember about the command-line is that the mouse doesn’t work. You can’t click to move the cursor as you normally do in applications. That’s because the command-line predates the mouse by decades. Instead, you have to use arrow keys.
I’m not going to spend much effort discussing the command-line, as a complete explanation is beyond the scope of this document. Instead, I’m assuming the reader either already knows it, or will learn-from-example as we go along.

Web requests

The basics of how the web works are really simple. A request to a web server is just a small packet of text, such as the following, which does a search on Google for the search-term “penguin” (presumably, you are interested in knowing more about penguins):
GET /search?q=penguin HTTP/1.0
Host: www.google.com
User-Agent: human
The command we are sending to the server is GET, meaning get a page. We are accessing the URL /search, which on Google’s website, is how you do a search. We are then sending the parameter q with the value penguin. We also declare that we are using version 1.0 of the HTTP (hyper-text transfer protocol).
Following the first line there are a number of additional headers. In one header, we declare the Host name that we are accessing. Web servers can contain many different websites, with different names, so this header is usually imporant.
We also add the User-Agent header. The “user-agent” means the “browser” that you use, like Edge, Chrome, Firefox, or Safari. It allows servers to send content optimized for different browsers. Since we are sending web requests without a browser here, we are joking around saying human.
Here’s what happens when we use the nc program to send this to a google web server:
The first part is us typing, until we hit the [enter] key to create a blank line. After that point is the response from the Google server. We get back a result code (OK), followed by more headers from the server, and finally the contents of the webpage, which goes on from many screens. (We’ll talk about what web pages look like below).
Note that a lot of HTTP headers are optional and really have little influence on what’s going on. They are just junk added to web requests. For example, we see Google report a P3P header is some relic of 2002 that nobody uses anymore, as far as I can tell. Indeed, if you follow the URL in the P3P header, Google pretty much says exactly that.
I point this out because the request I show above is a simplified one. In practice, most requests contain a lot more headers, especially Cookie headers. We’ll see that later when making requests.

Using cURL instead

Sending the raw HTTP request to the server, and getting raw HTTP/HTML back, is annoying. The better way of doing this is with the tool known as cURL, or plainly, just curl. You may be familiar with the older command-line tools wget. cURL is similar, but more flexible.
To use curl for the experiment above, we’d do something like the following. We are saving the web page to “penguin.html” instead of just spewing it on the screen.
Underneath, cURL builds an HTTP header just like the one we showed above, and sends it to the server, getting the response back.


Now let’s talk about web pages. When you look at the web page we got back from Google while searching for “penguin”, you’ll see that it’s intimidatingly complex. I mean, it intimidates me. But it all starts from some basic principles, so we’ll look at some simpler examples.
The following is text of a simple web page:
<p>This is a simple web page</p>
This is HTML, “hyper-text markup language”. As it’s name implies, we “markup” text, such as declaring the first text as a level-1 header (H1), and the following text as a paragraph (P).
In a web browser, this gets rendered as something that looks like the following. Notice how a header is formatted differently from a paragraph. Also notice that web browsers can use local files as well as make remote requests to web servers:
You can right-mouse click on the page and do a “View Source”. This will show the raw source behind the web page:
Web pages don’t just contain marked-up text. They contain two other important features, style information that dictates how things appear, and script that does all the live things that web pages do, from which we build web apps.
So let’s add a little bit of style and scripting to our web page. First, let’s view the source we’ll be adding:
In our header (H1) field, we’ve added the attribute to the markup giving this an id of mytitle. In the style section above, we give that element a color of blue, and tell it to align to the center.
Then, in our script section, we’ve told it that when somebody clicks on the element “mytitle”, it should send an “alert” message of “hello”.
This is what our web page now looks like, with the center blue title:
When we click on the title, we get a popup alert:
Thus, we see an example of the three components of a webpage: markup, style, and scripting.

Chrome developer tools

Now we go off the deep end. Right-mouse click on “Test” (not normal click, but right-button click, to pull up a menu). Select “Inspect”.
You should now get a window that looks something like the following. Chrome splits the screen in half, showing the web page on the left, and it’s debug tools on the right.
This looks similar to what “View Source” shows, but it isn’t. Instead, it’s showing how Chrome interpreted the source HTML. For example, our style/script tags should’ve been marked up with a head (header) tag. We forgot it, but Chrome adds it in anyway.
What Google is showing us is called the DOM, or document object model. It shows us all the objects that make up a web page, and how they fit together.
For example, it shows us how the style information for #mytitle is created. It first starts with the default style information for an h1 tag, and then how we’ve changed it with our style specifications.
We can edit the DOM manually. Just double click on things you want to change. For example, in this screen shot, I’ve changed the style spec from blue to red, and I’ve changed the header and paragraph test. The original file on disk hasn’t changed, but I’ve changed the DOM in memory.
This is a classic hacking technique. If you don’t like things like paywalls, for example, just right-click on the element blocking your view of the text, “Inspect” it, then delete it. (This works for some paywalls).
This edits the markup and style info, but changing the scripting stuff is a bit more complicated. To do that, click on the [Console] tab. This is the scripting console, and allows you to run code directly as part of the webpage. We are going to run code that resets what happens when we click on the title. In this case, we are simply going to change the message to “goodbye”.
Now when we click on the title, we indeed get the message:
Again, a common way to get around paywalls is to run some code like that that change which functions will be called.

Putting it all together

Now let’s put this all together in order to hack Twitter to allow us (the non-chosen) to tweet 280 characters. Review Dildog’s instructions above.
The first step is to get to Chrome Developer Tools. Dildog suggests F12. I suggest right-clicking on the Tweet button (or Reply button, as I use in my example) and doing “Inspect”, as I describe above.
You’ll now see your screen split in half, with the DOM toward the right, similar to how I describe above. However, Twitter’s app is really complex. Well, not really complex, it’s all basic stuff when you come right down to it. It’s just so much stuff — it’s a large web app with lots of parts. So we have to dive in without understanding everything that’s going on.
The Tweet/Reply button we are inspecting is going to look like this in the DOM:
The Tweet/Reply button is currently greyed out because it has the “disabled” attribute. You need to double click on it and remove that attribute. Also, in the class attribute, there is also a “disabled” part. Double-click, then click on that and removed just that disabled as well, without impacting the stuff around it. This should change the button from disabled to enabled. It won’t be greyed out, and it’ll respond when you click on it.
Now click on it. You’ll get an error message, as shown below:
What we’ve done here is bypass what’s known as client-side validation. The script in the web page prevented sending Tweets longer than 140 characters. Our editing of the DOM changed that, allowing us to send a bad request to the server. Bypassing client-side validation this way is the source of a lot of hacking.
But Twitter still does server-side validation as well. They know any client-side validation can be bypassed, and are in on the joke. They tell us hackers “You’ll have to be more clever”. So let’s be more clever.
In order to make longer 280 characters tweets work for select customers, they had to change something on the server-side. The thing they added was adding a “weighted_character_count=true” to the HTTP request. We just need to repeat the request we generated above, adding this parameter.
In theory, we can do this by fiddling with the scripting. The way Dildog describes does it a different way. He copies the request out of the browser, edits it, then send it via the command-line using curl.
We’ve used the [Elements] and [Console] tabs in Chrome’s DevTools. Now we are going to use the [Network] tab. This lists all the requests the web page has made to the server. The twitter app is constantly making requests to refresh the content of the web page. The request we made trying to do a long tweet is called “create”, and is red, because it failed.
Google Chrome gives us a number of ways to duplicate the request. The most useful is that it copies it as a full cURL command we can just paste onto the command-line. We don’t even need to know cURL, it takes care of everything for us. On Windows, since you have two command-lines, it gives you a choice to use the older Windows cmd.exe, or the newer bash.exe. I use the bash version, since I don’t know where to get the Windows command-line version of cURL.exe.
There’s a lot of going on here. The first thing to notice is the long xxxxxx strings. That’s actually not in the original screenshot. I edited the picture. That’s because these are session-cookies. If inserted them into your browser, you’d hijack my Twitter session, and be able to tweet as me (such as making Carlos Danger style tweets). Therefore, I have to remove them from the example.
At the top of the screen is the URL that we are accessing, which is https://twitter.com/i/tweet/create. Much of the rest of the screen uses the cURL -H option to add a header. These are all the HTTP headers that I describe above. Finally, at the bottom, is the –data section, which contains the data bits related to the tweet, especially the tweet itself.
We need to edit either the URL above to read https://twitter.com/i/tweet/create?weighted_character_count=true, or we need to add &weighted_character_count=true to the –data section at the bottom (either works). Remember: mouse doesn’t work on command-line, so you have to use the cursor-keys to navigate backwards in the line. Also, since the line is larger than the screen, it’s on several visual lines, even though it’s all a single line as far as the command-line is concerned.
Now just hit [return] on your keyboard, and the tweet will be sent to the server, which at the moment, works. Presto!
Twitter will either enable or disable the feature for everyone in a few weeks, at which point, this post won’t work. But the reason I’m writing this is to demonstrate the basic hacking skills. We manipulate the web pages we receive from servers, and we manipulate what’s sent back from our browser back to the server.

Easier: hack the scripting

Instead of messing with the DOM and editing the HTTP request, the better solution would be to change the scripting that does both DOM client-side validation and HTTP request generation. The only reason Dildog above didn’t do that is that it’s a lot more work trying to find where all this happens.
Others have, though. @Zemnmez did just that, though his technique works for the alternate TweetDeck client (https://tweetdeck.twitter.com) instead of the default client. Go copy his code from here, then paste it into the DevTools scripting [Console]. It’ll go in an replace some scripting functions, such like my simpler example above.
The console is showing a stream of error messages, because TweetDeck has bugs, ignore those.
Now you can effortlessly do long tweets as normal, without all the messing around I’ve spent so much text in this blog post describing.
Now, as I’ve mentioned this before, you are only editing what’s going on in the current web page. If you refresh this page, or close it, everything will be lost. You’ll have to re-open the DevTools scripting console and repaste the code. The easier way of doing this is to use the [Sources] tab instead of [Console] and use the “Snippets” feature to save this bit of code in your browser, to make it easier next time.
The even easier way is to use Chrome extensions like TamperMonkey and GreaseMonkey that’ll take care of this for you. They’ll save the script, and automatically run it when they see you open the TweetDeck webpage again.
An even easier way is to use one of the several Chrome extensions written in the past day specifically designed to bypass the 140 character limit. Since the purpose of this blog post is to show you how to tamper with your browser yourself, rather than help you with Twitter, I won’t list them.


Tampering with the web-page the server gives you, and the data you send back, is a basic hacker skill. In truth, there is a lot to this. You have to get comfortable with the command-line, using tools like cURL. You have to learn how HTTP requests work. You have to understand how web pages are built from markup, style, and scripting. You have to be comfortable using Chrome’s DevTools for messing around with web page elements, network requests, scripting console, and scripting sources.
So it’s rather a lot, actually.
My hope with this page is to show you a practical application of all this, without getting too bogged down in fully explaining how every bit works.

Vinyl Shelf Finder

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/vinyl-shelf-finder/

It is a truth universally acknowledged that a person in possession of a large record collection must be in want of a good shelving system. Valentin Galea has solved this problem by developing the Vinyl Shelf Finder. In this build, a web-based app directs a pan-and-tilt laser to point out your record of choice among your collection.

Vinyl Shelf Finder demo by Valentin Galea


Collector’s issues

People love to collect stuff. Stamps; soap bars; Troll dolls; belly button fluff (no, really); if you can think of a tangible item, someone out there in the world is collecting it. Of course, every collector needs to solve two issues — which system to use for cataloguing and sorting their collection, and how to best retrieve items from it. This is where Valentin’s Vinyl Shelf Finder comes in. He says:

My vinyl collection is pretty modest — about 500 records in one vertical shelf and a couple of boxes. This is enough to get cumbersome when I’m searching for specific stuff, so I came up with the idea of a automated laser pointer finder.

The Vinyl Shelf Finder

Valentin keeps an online record of his vinyl collection using Discogs. He entered each LP’s shelf position into the record, and wrote a Node.js app to access the Discogs database. The mobile app has a GUI from which he chooses records based on their name and cover image. To build the hardware, he mounted a Pimoroni Pan-Tilt HAT on a Raspberry Pi, and affixed a laser pointer to the HAT. When he selects a record in the app, the pan-and-tilt laser moves to point out the LP’s location.

Valentin Galea on Twitter

my latest hobby prj: #vinyl finder – with lazers and #raspberrypi #iot and #nodejs – https://t.co/IGGzQDgUFI https://t.co/7YBE3svGyE

Not only does the app help Valentin find records – he has also set it up to collect listening statistics using the Last.fm API. He plans to add more sophisticated statistics, and is looking into how to automate the entry of the shelf positions into his database.

If you’re interested in the Vinyl Shelf Finder, head over to Valentin’s GitHub to learn more, and to find out about updates he is making to this work in progress.

GUI of Valentin Galea's Vinyl Shelf Finder app


Vinyl + Pi

We’ve previously blogged about Mike Smith’s kaleidoscopic Recordshelf build — maybe he and Valentin could team up to create the ultimate, beautiful, practical vinyl-shelving system!

If you listen to lots of LP records and would like to learn about digitising them, check out this Pi-powered project from Mozilla HQ. If, on the other hand, you have a vinyl player you never use, why not make amazing art with it by hacking it into a CNC Wood Burner?

Are you a collector of things common or unusual? Could Raspberry Pi technology help make your collection better? Share your ideas with us in the comments!

The post Vinyl Shelf Finder appeared first on Raspberry Pi.

Skill up on how to perform CI/CD with AWS Developer tools

Post Syndicated from Chirag Dhull original https://aws.amazon.com/blogs/devops/skill-up-on-how-to-perform-cicd-with-aws-devops-tools/

This is a guest post from Paul Duvall, CTO of Stelligent, a division of HOSTING.

I co-founded Stelligent, a technology services company that provides DevOps Automation on AWS as a result of my own frustration in implementing all the “behind the scenes” infrastructure (including builds, tests, deployments, etc.) on software projects on which I was developing software. At Stelligent, we have worked with numerous customers looking to get software delivered to users quicker and with greater confidence. This sounds simple but it often consists of properly configuring and integrating myriad tools including, but not limited to, version control, build, static analysis, testing, security, deployment, and software release orchestration. What some might not realize is that there’s a new breed of build, deploy, test, and release tools that help reduce much of the undifferentiated heavy lifting of deploying and releasing software to users.

I’ve been using AWS since 2009 and I, along with many at Stelligent – have worked with the AWS Service Teams as part of the AWS Developer Tools betas that are now generally available (including AWS CodePipeline, AWS CodeCommit, AWS CodeBuild, and AWS CodeDeploy). I’ve combined the experience we’ve had with customers along with this specialized knowledge of the AWS Developer and Management Tools to provide a unique course that shows multiple ways to use these services to deliver software to users quicker and with confidence.

In DevOps Essentials on AWS, you’ll learn how to accelerate software delivery and speed up feedback loops by learning how to use AWS Developer Tools to automate infrastructure and deployment pipelines for applications running on AWS. The course demonstrates solutions for various DevOps use cases for Amazon EC2, AWS OpsWorks, AWS Elastic Beanstalk, AWS Lambda (Serverless), Amazon ECS (Containers), while defining infrastructure as code and learning more about AWS Developer Tools including AWS CodeStar, AWS CodeCommit, AWS CodeBuild, AWS CodePipeline, and AWS CodeDeploy.

In this course, you see me use the AWS Developer and Management Tools to create comprehensive continuous delivery solutions for a sample application using many types of AWS service platforms. You can run the exact same sample and/or fork the GitHub repository (https://github.com/stelligent/devops-essentials) and extend or modify the solutions. I’m excited to share how you can use AWS Developer Tools to create these solutions for your customers as well. There’s also an accompanying website for the course (http://www.devopsessentialsaws.com/) that I use in the video to walk through the course examples which link to resources located in GitHub or Amazon S3. In this course, you will learn how to:

  • Use AWS Developer and Management Tools to create a full-lifecycle software delivery solution
  • Use AWS CloudFormation to automate the provisioning of all AWS resources
  • Use AWS CodePipeline to orchestrate the deployments of all applications
  • Use AWS CodeCommit while deploying an application onto EC2 instances using AWS CodeBuild and AWS CodeDeploy
  • Deploy applications using AWS OpsWorks and AWS Elastic Beanstalk
  • Deploy an application using Amazon EC2 Container Service (ECS) along with AWS CloudFormation
  • Deploy serverless applications that use AWS Lambda and API Gateway
  • Integrate all AWS Developer Tools into an end-to-end solution with AWS CodeStar

To learn more, see DevOps Essentials on AWS video course on Udemy. For a limited time, you can enroll in this course for $40 and save 80%, a $160 saving. Simply use the code AWSDEV17.

Stelligent, an AWS Partner Network Advanced Consulting Partner holds the AWS DevOps Competency and over 100 AWS technical certifications. To stay updated on DevOps best practices, visit www.stelligent.com.