Tag Archives: Mac

Getting Ready for AWS re:Invent 2017

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/getting-ready-for-aws-reinvent-2017/

With just 40 days remaining before AWS re:Invent begins, my colleagues and I want to share some tips that will help you to make the most of your time in Las Vegas. As always, our focus is on training and education, mixed in with some after-hours fun and recreation for balance.

Locations, Locations, Locations
The re:Invent Campus will span the length of the Las Vegas strip, with events taking place at the MGM Grand, Aria, Mirage, Venetian, Palazzo, the Sands Expo Hall, the Linq Lot, and the Encore. Each venue will host tracks devoted to specific topics:

MGM Grand – Business Apps, Enterprise, Security, Compliance, Identity, Windows.

Aria – Analytics & Big Data, Alexa, Container, IoT, AI & Machine Learning, and Serverless.

Mirage – Bootcamps, Certifications & Certification Exams.

Venetian / Palazzo / Sands Expo Hall – Architecture, AWS Marketplace & Service Catalog, Compute, Content Delivery, Database, DevOps, Mobile, Networking, and Storage.

Linq Lot – Alexa Hackathons, Gameday, Jam Sessions, re:Play Party, Speaker Meet & Greets.

EncoreBookable meeting space.

If your interests span more than one topic, plan to take advantage of the re:Invent shuttles that will be making the rounds between the venues.

Lots of Content
The re:Invent Session Catalog is now live and you should start to choose the sessions of interest to you now.

With more than 1100 sessions on the agenda, planning is essential! Some of the most popular “deep dive” sessions will be run more than once and others will be streamed to overflow rooms at other venues. We’ve analyzed a lot of data, run some simulations, and are doing our best to provide you with multiple opportunities to build an action-packed schedule.

We’re just about ready to let you reserve seats for your sessions (follow me and/or @awscloud on Twitter for a heads-up). Based on feedback from earlier years, we have fine-tuned our seat reservation model. This year, 75% of the seats for each session will be reserved and the other 25% are for walk-up attendees. We’ll start to admit walk-in attendees 10 minutes before the start of the session.

Las Vegas never sleeps and neither should you! This year we have a host of late-night sessions, workshops, chalk talks, and hands-on labs to keep you busy after dark.

To learn more about our plans for sessions and content, watch the Get Ready for re:Invent 2017 Content Overview video.

Have Fun
After you’ve had enough training and learning for the day, plan to attend the Pub Crawl, the re:Play party, the Tatonka Challenge (two locations this year), our Hands-On LEGO Activities, and the Harley Ride. Stay fit with our 4K Run, Spinning Challenge, Fitness Bootcamps, and Broomball (a longstanding Amazon tradition).

See You in Vegas
As always, I am looking forward to meeting as many AWS users and blog readers as possible. Never hesitate to stop me and to say hello!

Jeff;

 

 

Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-aws-step-functions-state-machines-to-handle-workflow-driven-aws-codepipeline-actions/

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. It offers powerful integration with other AWS services, such as AWS CodeBuildAWS CodeDeployAWS CodeCommit, AWS CloudFormation and with third-party tools such as Jenkins and GitHub. These services make it possible for AWS customers to successfully automate various tasks, including infrastructure provisioning, blue/green deployments, serverless deployments, AMI baking, database provisioning, and release management.

Developers have been able to use CodePipeline to build sophisticated automation pipelines that often require a single CodePipeline action to perform multiple tasks, fork into different execution paths, and deal with asynchronous behavior. For example, to deploy a Lambda function, a CodePipeline action might first inspect the changes pushed to the code repository. If only the Lambda code has changed, the action can simply update the Lambda code package, create a new version, and point the Lambda alias to the new version. If the changes also affect infrastructure resources managed by AWS CloudFormation, the pipeline action might have to create a stack or update an existing one through the use of a change set. In addition, if an update is required, the pipeline action might enforce a safety policy to infrastructure resources that prevents the deletion and replacement of resources. You can do this by creating a change set and having the pipeline action inspect its changes before updating the stack. Change sets that do not conform to the policy are deleted.

This use case is a good illustration of workflow-driven pipeline actions. These are actions that run multiple tasks, deal with async behavior and loops, need to maintain and propagate state, and fork into different execution paths. Implementing workflow-driven actions directly in CodePipeline can lead to complex pipelines that are hard for developers to understand and maintain. Ideally, a pipeline action should perform a single task and delegate the complexity of dealing with workflow-driven behavior associated with that task to a state machine engine. This would make it possible for developers to build simpler, more intuitive pipelines and allow them to use state machine execution logs to visualize and troubleshoot their pipeline actions.

In this blog post, we discuss how AWS Step Functions state machines can be used to handle workflow-driven actions. We show how a CodePipeline action can trigger a Step Functions state machine and how the pipeline and the state machine are kept decoupled through a Lambda function. The advantages of using state machines include:

  • Simplified logic (complex tasks are broken into multiple smaller tasks).
  • Ease of handling asynchronous behavior (through state machine wait states).
  • Built-in support for choices and processing different execution paths (through state machine choices).
  • Built-in visualization and logging of the state machine execution.

The source code for the sample pipeline, pipeline actions, and state machine used in this post is available at https://github.com/awslabs/aws-codepipeline-stepfunctions.

Overview

This figure shows the components in the CodePipeline-Step Functions integration that will be described in this post. The pipeline contains two stages: a Source stage represented by a CodeCommit Git repository and a Prod stage with a single Deploy action that represents the workflow-driven action.

This action invokes a Lambda function (1) called the State Machine Trigger Lambda, which, in turn, triggers a Step Function state machine to process the request (2). The Lambda function sends a continuation token back to the pipeline (3) to continue its execution later and terminates. Seconds later, the pipeline invokes the Lambda function again (4), passing the continuation token received. The Lambda function checks the execution state of the state machine (5,6) and communicates the status to the pipeline. The process is repeated until the state machine execution is complete. Then the Lambda function notifies the pipeline that the corresponding pipeline action is complete (7). If the state machine has failed, the Lambda function will then fail the pipeline action and stop its execution (7). While running, the state machine triggers various Lambda functions to perform different tasks. The state machine and the pipeline are fully decoupled. Their interaction is handled by the Lambda function.

The Deploy State Machine

The sample state machine used in this post is a simplified version of the use case, with emphasis on infrastructure deployment. The state machine will follow distinct execution paths and thus have different outcomes, depending on:

  • The current state of the AWS CloudFormation stack.
  • The nature of the code changes made to the AWS CloudFormation template and pushed into the pipeline.

If the stack does not exist, it will be created. If the stack exists, a change set will be created and its resources inspected by the state machine. The inspection consists of parsing the change set results and detecting whether any resources will be deleted or replaced. If no resources are being deleted or replaced, the change set is allowed to be executed and the state machine completes successfully. Otherwise, the change set is deleted and the state machine completes execution with a failure as the terminal state.

Let’s dive into each of these execution paths.

Path 1: Create a Stack and Succeed Deployment

The Deploy state machine is shown here. It is triggered by the Lambda function using the following input parameters stored in an S3 bucket.

Create New Stack Execution Path

{
    "environmentName": "prod",
    "stackName": "sample-lambda-app",
    "templatePath": "infra/Lambda-template.yaml",
    "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
    "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ"
}

Note that some values used here are for the use case example only. Account-specific parameters like revisionS3Bucket and revisionS3Key will be different when you deploy this use case in your account.

These input parameters are used by various states in the state machine and passed to the corresponding Lambda functions to perform different tasks. For example, stackName is used to create a stack, check the status of stack creation, and create a change set. The environmentName represents the environment (for example, dev, test, prod) to which the code is being deployed. It is used to prefix the name of stacks and change sets.

With the exception of built-in states such as wait and choice, each state in the state machine invokes a specific Lambda function.  The results received from the Lambda invocations are appended to the state machine’s original input. When the state machine finishes its execution, several parameters will have been added to its original input.

The first stage in the state machine is “Check Stack Existence”. It checks whether a stack with the input name specified in the stackName input parameter already exists. The output of the state adds a Boolean value called doesStackExist to the original state machine input as follows:

{
  "doesStackExist": true,
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
}

The following stage, “Does Stack Exist?”, is represented by Step Functions built-in choice state. It checks the value of doesStackExist to determine whether a new stack needs to be created (doesStackExist=true) or a change set needs to be created and inspected (doesStackExist=false).

If the stack does not exist, the states illustrated in green in the preceding figure are executed. This execution path creates the stack, waits until the stack is created, checks the status of the stack’s creation, and marks the deployment successful after the stack has been created. Except for “Stack Created?” and “Wait Stack Creation,” each of these stages invokes a Lambda function. “Stack Created?” and “Wait Stack Creation” are implemented by using the built-in choice state (to decide which path to follow) and the wait state (to wait a few seconds before proceeding), respectively. Each stage adds the results of their Lambda function executions to the initial input of the state machine, allowing future stages to process them.

Path 2: Safely Update a Stack and Mark Deployment as Successful

Safely Update a Stack and Mark Deployment as Successful Execution Path

If the stack indicated by the stackName parameter already exists, a different path is executed. (See the green states in the figure.) This path will create a change set and use wait and choice states to wait until the change set is created. Afterwards, a stage in the execution path will inspect  the resources affected before the change set is executed.

The inspection procedure represented by the “Inspect Change Set Changes” stage consists of parsing the resources affected by the change set and checking whether any of the existing resources are being deleted or replaced. The following is an excerpt of the algorithm, where changeSetChanges.Changes is the object representing the change set changes:

...
var RESOURCES_BEING_DELETED_OR_REPLACED = "RESOURCES-BEING-DELETED-OR-REPLACED";
var CAN_SAFELY_UPDATE_EXISTING_STACK = "CAN-SAFELY-UPDATE-EXISTING-STACK";
for (var i = 0; i < changeSetChanges.Changes.length; i++) {
    var change = changeSetChanges.Changes[i];
    if (change.Type == "Resource") {
        if (change.ResourceChange.Action == "Delete") {
            return RESOURCES_BEING_DELETED_OR_REPLACED;
        }
        if (change.ResourceChange.Action == "Modify") {
            if (change.ResourceChange.Replacement == "True") {
                return RESOURCES_BEING_DELETED_OR_REPLACED;
            }
        }
    }
}
return CAN_SAFELY_UPDATE_EXISTING_STACK;

The algorithm returns different values to indicate whether the change set can be safely executed (CAN_SAFELY_UPDATE_EXISTING_STACK or RESOURCES_BEING_DELETED_OR_REPLACED). This value is used later by the state machine to decide whether to execute the change set and update the stack or interrupt the deployment.

The output of the “Inspect Change Set” stage is shown here.

{
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
  "doesStackExist": true,
  "changeSetName": "prod-sample-lambda-app-change-set-545",
  "changeSetCreationStatus": "complete",
  "changeSetAction": "CAN-SAFELY-UPDATE-EXISTING-STACK"
}

At this point, these parameters have been added to the state machine’s original input:

  • changeSetName, which is added by the “Create Change Set” state.
  • changeSetCreationStatus, which is added by the “Get Change Set Creation Status” state.
  • changeSetAction, which is added by the “Inspect Change Set Changes” state.

The “Safe to Update Infra?” step is a choice state (its JSON spec follows) that simply checks the value of the changeSetAction parameter. If the value is equal to “CAN-SAFELY-UPDATE-EXISTING-STACK“, meaning that no resources will be deleted or replaced, the step will execute the change set by proceeding to the “Execute Change Set” state. The deployment is successful (the state machine completes its execution successfully).

"Safe to Update Infra?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.taskParams.changeSetAction",
          "StringEquals": "CAN-SAFELY-UPDATE-EXISTING-STACK",
          "Next": "Execute Change Set"
        }
      ],
      "Default": "Deployment Failed"
 }

Path 3: Reject Stack Update and Fail Deployment

Reject Stack Update and Fail Deployment Execution Path

If the changeSetAction parameter is different from “CAN-SAFELY-UPDATE-EXISTING-STACK“, the state machine will interrupt the deployment by deleting the change set and proceeding to the “Deployment Fail” step, which is a built-in Fail state. (Its JSON spec follows.) This state causes the state machine to stop in a failed state and serves to indicate to the Lambda function that the pipeline deployment should be interrupted in a fail state as well.

 "Deployment Failed": {
      "Type": "Fail",
      "Cause": "Deployment Failed",
      "Error": "Deployment Failed"
    }

In all three scenarios, there’s a state machine’s visual representation available in the AWS Step Functions console that makes it very easy for developers to identify what tasks have been executed or why a deployment has failed. Developers can also inspect the inputs and outputs of each state and look at the state machine Lambda function’s logs for details. Meanwhile, the corresponding CodePipeline action remains very simple and intuitive for developers who only need to know whether the deployment was successful or failed.

The State Machine Trigger Lambda Function

The Trigger Lambda function is invoked directly by the Deploy action in CodePipeline. The CodePipeline action must pass a JSON structure to the trigger function through the UserParameters attribute, as follows:

{
  "s3Bucket": "codepipeline-StepFunctions-sample",
  "stateMachineFile": "state_machine_input.json"
}

The s3Bucket parameter specifies the S3 bucket location for the state machine input parameters file. The stateMachineFile parameter specifies the file holding the input parameters. By being able to specify different input parameters to the state machine, we make the Trigger Lambda function and the state machine reusable across environments. For example, the same state machine could be called from a test and prod pipeline action by specifying a different S3 bucket or state machine input file for each environment.

The Trigger Lambda function performs two main tasks: triggering the state machine and checking the execution state of the state machine. Its core logic is shown here:

exports.index = function (event, context, callback) {
    try {
        console.log("Event: " + JSON.stringify(event));
        console.log("Context: " + JSON.stringify(context));
        console.log("Environment Variables: " + JSON.stringify(process.env));
        if (Util.isContinuingPipelineTask(event)) {
            monitorStateMachineExecution(event, context, callback);
        }
        else {
            triggerStateMachine(event, context, callback);
        }
    }
    catch (err) {
        failure(Util.jobId(event), callback, context.invokeid, err.message);
    }
}

Util.isContinuingPipelineTask(event) is a utility function that checks if the Trigger Lambda function is being called for the first time (that is, no continuation token is passed by CodePipeline) or as a continuation of a previous call. In its first execution, the Lambda function will trigger the state machine and send a continuation token to CodePipeline that contains the state machine execution ARN. The state machine ARN is exposed to the Lambda function through a Lambda environment variable called stateMachineArn. Here is the code that triggers the state machine:

function triggerStateMachine(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var s3Bucket = Util.actionUserParameter(event, "s3Bucket");
    var stateMachineFile = Util.actionUserParameter(event, "stateMachineFile");
    getStateMachineInputData(s3Bucket, stateMachineFile)
        .then(function (data) {
            var initialParameters = data.Body.toString();
            var stateMachineInputJSON = createStateMachineInitialInput(initialParameters, event);
            console.log("State machine input JSON: " + JSON.stringify(stateMachineInputJSON));
            return stateMachineInputJSON;
        })
        .then(function (stateMachineInputJSON) {
            return triggerStateMachineExecution(stateMachineArn, stateMachineInputJSON);
        })
        .then(function (triggerStateMachineOutput) {
            var continuationToken = { "stateMachineExecutionArn": triggerStateMachineOutput.executionArn };
            var message = "State machine has been triggered: " + JSON.stringify(triggerStateMachineOutput) + ", continuationToken: " + JSON.stringify(continuationToken);
            return continueExecution(Util.jobId(event), continuationToken, callback, message);
        })
        .catch(function (err) {
            console.log("Error triggering state machine: " + stateMachineArn + ", Error: " + err.message);
            failure(Util.jobId(event), callback, context.invokeid, err.message);
        })
}

The Trigger Lambda function fetches the state machine input parameters from an S3 file, triggers the execution of the state machine using the input parameters and the stateMachineArn environment variable, and signals to CodePipeline that the execution should continue later by passing a continuation token that contains the state machine execution ARN. In case any of these operations fail and an exception is thrown, the Trigger Lambda function will fail the pipeline immediately by signaling a pipeline failure through the putJobFailureResult CodePipeline API.

If the Lambda function is continuing a previous execution, it will extract the state machine execution ARN from the continuation token and check the status of the state machine, as shown here.

function monitorStateMachineExecution(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var continuationToken = JSON.parse(Util.continuationToken(event));
    var stateMachineExecutionArn = continuationToken.stateMachineExecutionArn;
    getStateMachineExecutionStatus(stateMachineExecutionArn)
        .then(function (response) {
            if (response.status === "RUNNING") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " is still " + response.status;
                return continueExecution(Util.jobId(event), continuationToken, callback, message);
            }
            if (response.status === "SUCCEEDED") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
                return success(Util.jobId(event), callback, message);
            }
            // FAILED, TIMED_OUT, ABORTED
            var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
            return failure(Util.jobId(event), callback, context.invokeid, message);
        })
        .catch(function (err) {
            var message = "Error monitoring execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + ", Error: " + err.message;
            failure(Util.jobId(event), callback, context.invokeid, message);
        });
}

If the state machine is in the RUNNING state, the Lambda function will send the continuation token back to the CodePipeline action. This will cause CodePipeline to call the Lambda function again a few seconds later. If the state machine has SUCCEEDED, then the Lambda function will notify the CodePipeline action that the action has succeeded. In any other case (FAILURE, TIMED-OUT, or ABORT), the Lambda function will fail the pipeline action.

This behavior is especially useful for developers who are building and debugging a new state machine because a bug in the state machine can potentially leave the pipeline action hanging for long periods of time until it times out. The Trigger Lambda function prevents this.

Also, by having the Trigger Lambda function as a means to decouple the pipeline and state machine, we make the state machine more reusable. It can be triggered from anywhere, not just from a CodePipeline action.

The Pipeline in CodePipeline

Our sample pipeline contains two simple stages: the Source stage represented by a CodeCommit Git repository and the Prod stage, which contains the Deploy action that invokes the Trigger Lambda function. When the state machine decides that the change set created must be rejected (because it replaces or deletes some the existing production resources), it fails the pipeline without performing any updates to the existing infrastructure. (See the failed Deploy action in red.) Otherwise, the pipeline action succeeds, indicating that the existing provisioned infrastructure was either created (first run) or updated without impacting any resources. (See the green Deploy stage in the pipeline on the left.)

The Pipeline in CodePipeline

The JSON spec for the pipeline’s Prod stage is shown here. We use the UserParameters attribute to pass the S3 bucket and state machine input file to the Lambda function. These parameters are action-specific, which means that we can reuse the state machine in another pipeline action.

{
  "name": "Prod",
  "actions": [
      {
          "inputArtifacts": [
              {
                  "name": "CodeCommitOutput"
              }
          ],
          "name": "Deploy",
          "actionTypeId": {
              "category": "Invoke",
              "owner": "AWS",
              "version": "1",
              "provider": "Lambda"
          },
          "outputArtifacts": [],
          "configuration": {
              "FunctionName": "StateMachineTriggerLambda",
              "UserParameters": "{\"s3Bucket\": \"codepipeline-StepFunctions-sample\", \"stateMachineFile\": \"state_machine_input.json\"}"
          },
          "runOrder": 1
      }
  ]
}

Conclusion

In this blog post, we discussed how state machines in AWS Step Functions can be used to handle workflow-driven actions. We showed how a Lambda function can be used to fully decouple the pipeline and the state machine and manage their interaction. The use of a state machine greatly simplified the associated CodePipeline action, allowing us to build a much simpler and cleaner pipeline while drilling down into the state machine’s execution for troubleshooting or debugging.

Here are two exercises you can complete by using the source code.

Exercise #1: Do not fail the state machine and pipeline action after inspecting a change set that deletes or replaces resources. Instead, create a stack with a different name (think of blue/green deployments). You can do this by creating a state machine transition between the “Safe to Update Infra?” and “Create Stack” stages and passing a new stack name as input to the “Create Stack” stage.

Exercise #2: Add wait logic to the state machine to wait until the change set completes its execution before allowing the state machine to proceed to the “Deployment Succeeded” stage. Use the stack creation case as an example. You’ll have to create a Lambda function (similar to the Lambda function that checks the creation status of a stack) to get the creation status of the change set.

Have fun and share your thoughts!

About the Author

Marcilio Mendonca is a Sr. Consultant in the Canadian Professional Services Team at Amazon Web Services. He has helped AWS customers design, build, and deploy best-in-class, cloud-native AWS applications using VMs, containers, and serverless architectures. Before he joined AWS, Marcilio was a Software Development Engineer at Amazon. Marcilio also holds a Ph.D. in Computer Science. In his spare time, he enjoys playing drums, riding his motorcycle in the Toronto GTA area, and spending quality time with his family.

DragonFly BSD 5.0

Post Syndicated from ris original https://lwn.net/Articles/736559/rss

DragonFly BSD 5.0 has been released. “Preliminary HAMMER2 support has been released into the wild as-of the 5.0 release. This support is considered EXPERIMENTAL and should generally not yet be used for production machines and important data. The boot loader will support both UFS and HAMMER2 /boot. The installer will still use a UFS /boot even for a HAMMER2 installation because the /boot partition is typically very small and HAMMER2, like HAMMER1, does not instantly free space when files are deleted or replaced.

DragonFly 5.0 has single-image HAMMER2 support, with live dedup (for cp’s), compression, fast recovery, snapshot, and boot support. HAMMER2 does not yet support multi-volume or clustering, though commands for it exist. Please use non-clustered single images for now.”

New KRACK Attack Against Wi-Fi Encryption

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/10/new_krack_attac.html

Mathy Vanhoef has just published a devastating attack against WPA2, the 14-year-old encryption protocol used by pretty much all wi-fi systems. Its an interesting attack, where the attacker forces the protocol to reuse a key. The authors call this attack KRACK, for Key Reinstallation Attacks

This is yet another of a series of marketed attacks; with a cool name, a website, and a logo. The Q&A on the website answers a lot of questions about the attack and its implications. And lots of good information in this ArsTechnica article.

There is an academic paper, too:

“Key Reinstallation Attacks: Forcing Nonce Reuse in WPA2,” by Mathy Vanhoef and Frank Piessens.

Abstract: We introduce the key reinstallation attack. This attack abuses design or implementation flaws in cryptographic protocols to reinstall an already-in-use key. This resets the key’s associated parameters such as transmit nonces and receive replay counters. Several types of cryptographic Wi-Fi handshakes are affected by the attack. All protected Wi-Fi networks use the 4-way handshake to generate a fresh session key. So far, this 14-year-old handshake has remained free from attacks, and is even proven secure. However, we show that the 4-way handshake is vulnerable to a key reinstallation attack. Here, the adversary tricks a victim into reinstalling an already-in-use key. This is achieved by manipulating and replaying handshake messages. When reinstalling the key, associated parameters such as the incremental transmit packet number (nonce) and receive packet number (replay counter) are reset to their initial value. Our key reinstallation attack also breaks the PeerKey, group key, and Fast BSS Transition (FT) handshake. The impact depends on the handshake being attacked, and the data-confidentiality protocol in use. Simplified, against AES-CCMP an adversary can replay and decrypt (but not forge) packets. This makes it possible to hijack TCP streams and inject malicious data into them. Against WPA-TKIP and GCMP the impact is catastrophic: packets can be replayed, decrypted, and forged. Because GCMP uses the same authentication key in both communication directions, it is especially affected.

Finally, we confirmed our findings in practice, and found that every Wi-Fi device is vulnerable to some variant of our attacks. Notably, our attack is exceptionally devastating against Android 6.0: it forces the client into using a predictable all-zero encryption key.

I’m just reading about this now, and will post more information
as I learn it.

EDITED TO ADD: More news.

EDITED TO ADD: This meets my definition of brilliant. The attack is blindingly obvious once it’s pointed out, but for over a decade no one noticed it.

EDITED TO ADD: Matthew Green has a blog post on what went wrong. The vulnerability is in the interaction between two protocols. At a meta level, he blames the opaque IEEE standards process:

One of the problems with IEEE is that the standards are highly complex and get made via a closed-door process of private meetings. More importantly, even after the fact, they’re hard for ordinary security researchers to access. Go ahead and google for the IETF TLS or IPSec specifications — you’ll find detailed protocol documentation at the top of your Google results. Now go try to Google for the 802.11i standards. I wish you luck.

The IEEE has been making a few small steps to ease this problem, but they’re hyper-timid incrementalist bullshit. There’s an IEEE program called GET that allows researchers to access certain standards (including 802.11) for free, but only after they’ve been public for six months — coincidentally, about the same time it takes for vendors to bake them irrevocably into their hardware and software.

This whole process is dumb and — in this specific case — probably just cost industry tens of millions of dollars. It should stop.

Nicholas Weaver explains why most people shouldn’t worry about this:

So unless your Wi-Fi password looks something like a cat’s hairball (e.g. “:SNEIufeli7rc” — which is not guessable with a few million tries by a computer), a local attacker had the capability to determine the password, decrypt all the traffic, and join the network before KRACK.

KRACK is, however, relevant for enterprise Wi-Fi networks: networks where you needed to accept a cryptographic certificate to join initially and have to provide both a username and password. KRACK represents a new vulnerability for these networks. Depending on some esoteric details, the attacker can decrypt encrypted traffic and, in some cases, inject traffic onto the network.

But in none of these cases can the attacker join the network completely. And the most significant of these attacks affects Linux devices and Android phones, they don’t affect Macs, iPhones, or Windows systems. Even when feasible, these attacks require physical proximity: An attacker on the other side of the planet can’t exploit KRACK, only an attacker in the parking lot can.

Manufacturing Astro Pi case replicas

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/astro-pi-case-guest-post/

Tim Rowledge produces and sells wonderful replicas of the cases which our Astro Pis live in aboard the International Space Station. Here is the story of how he came to do this. Over to you, Tim!

When the Astro Pi case was first revealed a couple of years ago, the collective outpouring of ‘Squee!’ it elicited may have been heard on board the ISS itself. People wanted to buy it or build it at home, and someone wanted to know whether it would blend. (There’s always one.)

The complete Astro Pi

The Sense HAT and its Pi tucked snugly in the original Astro Pi flight case — gorgeous, isn’t it?

Replicating the Astro Pi case

Some months later the STL files for printing your own Astro Pi case were released, and people jumped at the chance to use them. Soon reports appeared saying you had to make quite a few attempts before getting a good print — normal for any complex 3D-printing project. A fellow member of my local makerspace successfully made a couple of cases, but it took a lot of time, filament, and post-print finishing work. And of course, a plastic Astro Pi case simply doesn’t look or feel like the original made of machined aluminium — or ‘aluminum’, as they tend to say over here in North America.

Batch of tops of Astro Pi case replicas by Tim Rowledge

A batch of tops designed by Tim

I wanted to build an Astro Pi case which would more closely match the original. Fortunately, someone else at my makerspace happens to have some serious CNC machining equipment at his small manufacturing company. Therefore, I focused on creating a case design that could be produced with his three-axis device. This meant simplifying some parts to avoid expensive, slow, complex multi-fixture work. It took us a while, but we ended up with a design we can efficiently make using his machine.

Lasered Astro Pi case replica by Tim Rowledge

Tim’s first lasered case

And the resulting case looks really, really like the original — in fact, upon receiving one of the final prototypes, Eben commented:

“I have to say, at first glance they look spectacular: unless you hold them side by side with the originals, it’s hard to pinpoint what’s changed. I’m looking forward to seeing one built up and then seeing them in the wild.”

Inside the Astro Pi case

Making just the bare case is nice, but there are other parts required to recreate a complete Astro Pi unit. Thus I got my local electronics company to design a small HAT to provide much the same support the mezzanine board offers: an RTC and nice, clean connections to the six buttons. We also added well-labelled, grouped pads for all the other GPIO lines, along with space for an ADC. If you’re making your own Astro Pi replica, you might like the Switchboard.

The electronics supply industry just loves to offer *some* of what you need, so that one supplier never has everything: we had to obtain the required stand-offs, screws, spacers, and JST wires from assorted other sources. Jeff at my nearby Industrial Paint & Plastics took on the laser engraving of our cases, leaving out copyrighted logos etcetera.

Lasering the top of an Astro Pi case replica by Tim Rowledge

Lasering the top of a case

Get your own Astro Pi case

Should you like to buy one of our Astro Pi case kits, pop over to www.astropicase.com, and we’ll get it on its way to you pronto. If you’re an institutional or corporate customer, the fully built option might make more sense for you — ordering the Pi and other components, and having a staff member assemble it all, may well be more work than is sensible.

Astro Pi case replica Tim Rowledge

Tim’s first full Astro Pi case replica, complete with shiny APEM buttons

To put the kit together yourself, all you need to do is add a Pi, Sense HAT, Camera Module, and RTC battery, and choose your buttons. An illustrated manual explains the process step by step. Our version of the Astro Pi case uses the same APEM buttons as the units in orbit, and whilst they are expensive, just clicking them is a source of great joy. It comes in a nice travel case too.

Tim Rowledge holding up a PCB

This is Tim. Thanks, Tim!

Take part in Astro Pi

If having an Astro Pi replica is not enough for you, this is your chance: the 2017-18 Astro Pi challenge is open! Do you know a teenager who might be keen to design a experiment to run on the Astro Pis in space? Are you one yourself? You have until 29 October to send us your Mission Space Lab entry and become part of the next generation of space scientists? Head over to the Astro Pi website to find out more.

The post Manufacturing Astro Pi case replicas appeared first on Raspberry Pi.

AI in the Cloud Market: AWS & Microsoft Lend a Big Hand

Post Syndicated from Chris De Santis original https://www.anchor.com.au/blog/2017/10/aws-microsoft-launch-ai-platform/

Artificial intelligence (or AI) doesn’t necessarily play a big role in the current cloud hosting market, but Amazon Web Services (AWS) and Microsoft are looking to change that.

AI is starting to grow at an alarming rate and may be a significant role-player in the near future. According to Bernie Trudel, chairman of the Asia Cloud Computing Association (ACCA), AI “will become the killer application that will drive cloud computing forward”. He continues to mention that, although AI only accounts for 1% of the today’s global cloud computing market, its overall IT market share is growing at 52%, and its expected to rapidly grow to 10% of cloud revenue by 2025.

Trudel made notable that, although the big players in the cloud game are currently offering AI capabilities, the cloud-based AI market is still in its early stages. These big players include AWS, Microsoft, Google, and IBM. He also continues to state that AWS is certainly the leader in the cloud market, but they’re playing catch-up in terms of an AI perspective.

AWS 💘 Microsoft?

Here’s the funny bit–that a day or two after Trudel said all of this at Cloud Expo Asia, AWS announce (on their blog) their combined effort with Microsoft to create a new open-source deep-learning interface that “allows developers to more easily and quickly build machine learning models”. In other words, Gluon is an AI application for developers to create their own AI models, to the benefit of their own cloud applications and technical endeavours.

If you’d like to learn more about Gluon and the details of the project, head over to the AWS blog here.

AWS + Microsoft

 

The post AI in the Cloud Market: AWS & Microsoft Lend a Big Hand appeared first on AWS Managed Services by Anchor.

Bottomley: Using Elliptic Curve Cryptography with TPM2

Post Syndicated from corbet original https://lwn.net/Articles/736425/rss

James Bottomley describes
the use of the trusted platform module
with elliptic-curve
cryptography, with a substantial digression into how the elliptic-curve
algorithm itself works.
The initial attraction is the same as for RSA keys: making it
impossible to extract your private key from the system. However, the
mathematical calculations for EC keys are much simpler than for RSA keys
and don’t involve finding strong primes, so it’s much simpler for the TPM
(being a fairly weak calculation machine) to derive private and public EC
keys.

‘Pirate’ EBook Site Refuses Point Blank to Cooperate With BREIN

Post Syndicated from Andy original https://torrentfreak.com/pirate-ebook-site-refuses-point-blank-to-cooperate-with-brein-171015/

Dutch anti-piracy group BREIN is probably best known for its legal action against The Pirate Bay but the outfit also tackles many other forms of piracy.

A prime example is the case it pursued against a seller of fully-loaded Kodi boxes in the Netherlands. The subsequent landmark ruling from the European Court of Justice will reverberate around Europe for years to come.

Behind the scenes, however, BREIN persistently tries to take much smaller operations offline, and not without success. Earlier this year it revealed it had taken down 231 illegal sites and services includes 84 linking sites, 63 streaming portals, and 34 torrent sites. Some of these shut down completely and others were forced to leave their hosting providers.

Much of this work flies under the radar but some current action, against an eBook site, is now being thrust into the public eye.

For more than five years, EBoek.info (eBook) has serviced Internet users looking to obtain comic books in Dutch. The site informs TorrentFreak it provides a legitimate service, targeted at people who have purchased a hard copy but also want their comics in digital format.

“EBoek.info is a site about comic books in the Dutch language. Besides some general information about the books, people who have legally obtained a hard copy of the books can find a link to an NZB file which enables them to download a digital version of the books they already have,” site representative ‘Zala’ says.

For those out of the loop, NZB files are a bit like Usenet’s version of .torrent files. They contain no copyrighted content themselves but do provide software clients with information on where to find specific content, so it can be downloaded to a user’s machine.

“BREIN claims that this is illegal as it is impossible for us to verify if our visitor is telling the truth [about having purchased a copy],” Zala reveals.

Speaking with TorrentFreak, BREIN chief Tim Kuik says there’s no question that offering downloads like this is illegal.

“It is plain and simple: the site makes links to unauthorized digital copies available to the general public and therefore is infringing copyright. It is distribution of the content without authorization of the rights holder,” Kuik says.

“The unauthorized copies are not private copies. The private copy exception does not apply to this kind of distribution. The private copy has not been made by the owner of the book himself for his own use. Someone else made the digital copy and is making it available to anyone who wants to download it provided he makes the unverified claim that he has a legal copy. This harms the normal exploitation of the
content.”

Zala says that BREIN has been trying to take his site offline for many years but more recently, the platform has utilized the services of Cloudflare, partly as a form of shield. As readers may be aware, a site behind Cloudflare has its originating IP addresses hidden from the public, not to mention BREIN, who values that kind of information. According to the operator, however, BREIN managed to obtain the information from the CDN provider.

“BREIN has tried for years to take our site offline. Recently, however, Cloudflare was so friendly to give them our IP address,” Zala notes.

A text copy of an email reportedly sent by BREIN to EBoek’s web host and seen by TF appears to confirm that Cloudflare handed over the information as suggested. Among other things, the email has BREIN informing the host that “The IP we got back from Cloudflare is XXX.XXX.XX.33.”

This means that BREIN was able to place direct pressure on EBoek.info’s web host, so only time will tell if that bears any fruit for the anti-piracy group. In the meantime, however, EBoek has decided to go public over its battle with BREIN.

“We have received a request from Stichting BREIN via our hosting provider to take EBoek.info offline,” the site informed its users yesterday.

Interestingly, it also appears that BREIN doesn’t appreciate that the operators of EBoek have failed to make their identities publicly known on their platform.

“The site operates anonymously which also is unlawful. Consumer protection requires that the owner/operator of a site identifies himself,” Kuik says.

According to EBoek, the anti-piracy outfit told the site’s web host that as a “commercial online service”, EBoek is required under EU law to display its “correct and complete business information” including names, addresses, and other information. But perhaps unsurprisingly, the site doesn’t want to play ball.

“In my opinion, you are confusing us with Facebook. They are a foreign commercial company with a European branch in Ireland, and therefore are subject to Irish legislation,” Zala says in an open letter to BREIN.

“Eboek.info, on the other hand, is a foreign hobby club with no commercial purpose, whose administrators have no connection with any country in the European Union. As administrators, we follow the laws of our country of residence which do not oblige us to disclose our identity through our website.

“The fact that Eboek is visible in the Netherlands does not just mean that we are going to adapt to Dutch rules, just as we don’t adapt the site to the rules of Saudi Arabia or China or wherever we are available.”

In a further snub to the anti-piracy group, EBoek says that all visitors to the site have to communicate with its operators via its guestbook, which is publicly visible.

“We see no reason to make an exception for Stichting BREIN,” the site notes.

What makes the situation more complex is that EBoek isn’t refusing dialog completely. The site says it doesn’t want to talk to BREIN but will speak to BREIN’s customers – the publishers of the comic books in question – noting that to date no complaints from publishers have ever been received.

While the parties argue about lines of communication, BREIN insists that following this year’s European Court of Justice decision in the GS Media case, a link to a known infringing work represents copyright infringement. In this case, an NZB file – which links to a location on Usenet – would generally fit the bill.

But despite focusing on the Dutch market, the operators of EBoek say the ruling doesn’t apply to them as they’re outside of the ECJ’s jurisdiction and aren’t commercially motivated. Refusing point blank to take their site offline, EBoek’s operators say that BREIN can do its worst, nothing will have much effect.

“[W]hat’s the worst thing that can happen? That our web host hands [BREIN] our address and IP data. In that case, it will turn out that…we are actually far away,” Zala says.

“[In the case the site goes offline], we’ll just put a backup on another server and, in this case, won’t make use of the ‘services’ of Cloudflare, the provider that apparently put BREIN on the right track.”

The question of jurisdiction is indeed an interesting one, particularly given BREIN’s focus in the Netherlands. But Kuik is clear – it is the area where the content is made available that matters.

“The law of the country where the content is made available applies. In this case the EU and amongst others the Netherlands,” Kuik concludes.

To be continued…..

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Backblaze Release 5.1 – RMM Compatibility for Mass Deployments

Post Syndicated from Yev original https://www.backblaze.com/blog/rmm-for-mass-deployments/

diagram of Backblaze remote monitoring and management

Introducing Backblaze Computer Backup Release 5.1

This is a relatively minor release in terms of the core Backblaze Computer Backup service functionality, but is a big deal for Backblaze for Business as we’ve updated our Mac and PC clients to be RMM (Remote Monitoring and Management) compatible.

What Is New?

  • Updated Mac and PC clients to better handle large file uploads
  • Updated PC downloader to improve stability
  • Added RMM support for PC and Mac clients

What Is RMM?

RMM stands for “Remote Monitoring and Management.” It’s a way to administer computers that might be distributed geographically, without having access to the actual machine. If you are a systems administrator working with anywhere from a few distributed computers to a few thousand, you’re familiar with RMM and how it makes life easier.

The new clients allow administrators to deploy Backblaze Computer Backup through most “silent” installation/mass deployment tools. Two popular RMM tools are Munki and Jamf. We’ve written up knowledge base articles for both of these.

munki logo jamf logo
Learn more about Munki Learn more about Jamf

Do I Need To Use RMM Tools?

No — unless you are a systems administrator or someone who is deploying Backblaze to a lot of people all at once, you do not have to worry about RMM support.

Release Version Number:

Mac:  5.1.0
PC:  5.1.0

Availability:

October 12, 2017

Upgrade Methods:

  • “Check for Updates” on the Backblaze Client (right click on the Backblaze icon and then select “Check for Updates”)
  • Download from: https://secure.backblaze.com/update.htm
  • Auto-update will begin in a couple of weeks
Mac backup update PC backup update
Updating Backblaze on Mac Updating Backblaze on Windows

Questions:

If you have any questions, please contact Backblaze Support at www.backblaze.com/help.

The post Backblaze Release 5.1 – RMM Compatibility for Mass Deployments appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Technology to Out Sex Workers

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/10/technology_to_o.html

Two related stories:

PornHub is using machine learning algorithms to identify actors in different videos, so as to better index them. People are worried that it can really identify them, by linking their stage names to their real names.

Facebook somehow managed to link a sex worker’s clients under her fake name to her real profile.

Sometimes people have legitimate reasons for having two identities. That is becoming harder and harder.

Popcorn Time Creator Readies BitTorrent & Blockchain-Powered Video Platform

Post Syndicated from Andy original https://torrentfreak.com/popcorn-time-creator-readies-bittorrent-blockchain-powered-youtube-competitor-171012/

Without a doubt, YouTube is one of the most important websites available on the Internet today.

Its massive archive of videos brings pleasure to millions on a daily basis but its centralized nature means that owner Google always exercises control.

Over the years, people have looked to decentralize the YouTube concept and the latest project hoping to shake up the market has a particularly interesting player onboard.

Until 2015, only insiders knew that Argentinian designer Federico Abad was actually ‘Sebastian’, the shadowy figure behind notorious content sharing platform Popcorn Time.

Now he’s part of the team behind Flixxo, a BitTorrent and blockchain-powered startup hoping to wrestle a share of the video market from YouTube. Here’s how the team, which features blockchain startup RSK Labs, hope things will play out.

The Flixxo network will have no centralized storage of data, eliminating the need for expensive hosting along with associated costs. Instead, transfers will take place between peers using BitTorrent, meaning video content will be stored on the machines of Flixxo users. In practice, the content will be downloaded and uploaded in much the same way as users do on The Pirate Bay or indeed Abad’s baby, Popcorn Time.

However, there’s a twist to the system that envisions content creators, content consumers, and network participants (seeders) making revenue from their efforts.

At the heart of the Flixxo system are digital tokens (think virtual currency), called Flixx. These Flixx ‘coins’, which will go on sale in 12 days, can be used to buy access to content. Creators can also opt to pay consumers when those people help to distribute their content to others.

“Free from structural costs, producers can share the earnings from their content with the network that supports them,” the team explains.

“This way you get paid for helping us improve Flixxo, and you earn credits (in the form of digital tokens called Flixx) for watching higher quality content. Having no intermediaries means that the price you pay for watching the content that you actually want to watch is lower and fairer.”

The Flixxo team

In addition to earning tokens from helping to distribute content, people in the Flixxo ecosystem can also earn currency by watching sponsored content, i.e advertisements. While in a traditional system adverts are often considered a nuisance, Flixx tokens have real value, with a promise that users will be able to trade their Flixx not only for videos, but also for tangible and semi-tangible goods.

“Use your Flixx to reward the producers you follow, encouraging them to create more awesome content. Or keep your Flixx in your wallet and use them to buy a movie ticket, a pair of shoes from an online retailer, a chest of coins in your favourite game or even convert them to old-fashioned cash or up-and-coming digital assets, like Bitcoin,” the team explains.

The Flixxo team have big plans. After foundation in early 2016, the second quarter of 2017 saw the completion of a functional alpha release. In a little under two weeks, the project will begin its token generation event, with new offices in Los Angeles planned for the first half of 2018 alongside a premiere of the Flixxo platform.

“A total of 1,000,000,000 (one billion) Flixx tokens will be issued. A maximum of 300,000,000 (three hundred million) tokens will be sold. Some of these tokens (not more than 33% or 100,000,000 Flixx) may be sold with anticipation of the token allocation event to strategic investors,” Flixxo states.

Like all content platforms, Flixxo will live or die by the quality of the content it provides and whether, at least in the first instance, it can persuade people to part with their hard-earned cash. Only time will tell whether its content will be worth a premium over readily accessible YouTube content but with much-reduced costs, it may tempt creators seeking a bigger piece of the pie.

“Flixxo will also educate its community, teaching its users that in this new internet era value can be held and transferred online without intermediaries, a value that can be earned back by participating in a community, by contributing, being rewarded for every single social interaction,” the team explains.

Of course, the elephant in the room is what will happen when people begin sharing copyrighted content via Flixxo. Certainly, the fact that Popcorn Time’s founder is a key player and rival streaming platform Stremio is listed as a partner means that things could get a bit spicy later on.

Nevertheless, the team suggests that piracy and spam content distribution will be limited by mechanisms already built into the system.

“[A]uthors have to time-block tokens in a smart contract (set as a warranty) in order to upload content. This contract will also handle and block their earnings for a certain period of time, so that in the case of a dispute the unfair-uploader may lose those tokens,” they explain.

That being said, Flixxo also says that “there is no way” for third parties to censor content “which means that anyone has the chance of making any piece of media available on the network.” However, Flixxo says it will develop tools for filtering what it describes as “inappropriate content.”

At this point, things start to become a little unclear. On the one hand Flixxo says it could become a “revolutionary tool for uncensorable and untraceable media” yet on the other it says that it’s necessary to ensure that adult content, for example, isn’t seen by kids.

“We know there is a thin line between filtering or curating content and censorship, and it is a fact that we have an open network for everyone to upload any content. However, Flixxo as a platform will apply certain filtering based on clear rules – there should be a behavior-code for uploaders in order to offer the right content to the right user,” Flixxo explains.

To this end, Flixxo says it will deploy a centralized curation function, carried out by 101 delegates elected by the community, which will become progressively decentralized over time.

“This curation will have a cost, paid in Flixx, and will be collected from the warranty blocked by the content uploaders,” they add.

There can be little doubt that if Flixxo begins ‘curating’ unsuitable content, copyright holders will call on it to do the same for their content too. And, if the platform really takes off, 101 curators probably won’t scratch the surface. There’s also the not inconsiderable issue of what might happen to curators’ judgment when they’re incentivized to block curate content.

Finally, for those sick of “not available in your region” messages, there’s good and bad news. Flixxo insists there will be no geo-blocking of content on its part but individual creators will still have that feature available to them, should they choose.

The Flixx whitepaper can be downloaded here (pdf)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Introducing Gluon: a new library for machine learning from AWS and Microsoft

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-library-for-machine-learning-from-aws-and-microsoft/

Post by Dr. Matt Wood

Today, AWS and Microsoft announced Gluon, a new open source deep learning interface which allows developers to more easily and quickly build machine learning models, without compromising performance.

Gluon Logo

Gluon provides a clear, concise API for defining machine learning models using a collection of pre-built, optimized neural network components. Developers who are new to machine learning will find this interface more familiar to traditional code, since machine learning models can be defined and manipulated just like any other data structure. More seasoned data scientists and researchers will value the ability to build prototypes quickly and utilize dynamic neural network graphs for entirely new model architectures, all without sacrificing training speed.

Gluon is available in Apache MXNet today, a forthcoming Microsoft Cognitive Toolkit release, and in more frameworks over time.

Neural Networks vs Developers
Machine learning with neural networks (including ‘deep learning’) has three main components: data for training; a neural network model, and an algorithm which trains the neural network. You can think of the neural network in a similar way to a directed graph; it has a series of inputs (which represent the data), which connect to a series of outputs (the prediction), through a series of connected layers and weights. During training, the algorithm adjusts the weights in the network based on the error in the network output. This is the process by which the network learns; it is a memory and compute intensive process which can take days.

Deep learning frameworks such as Caffe2, Cognitive Toolkit, TensorFlow, and Apache MXNet are, in part, an answer to the question ‘how can we speed this process up? Just like query optimizers in databases, the more a training engine knows about the network and the algorithm, the more optimizations it can make to the training process (for example, it can infer what needs to be re-computed on the graph based on what else has changed, and skip the unaffected weights to speed things up). These frameworks also provide parallelization to distribute the computation process, and reduce the overall training time.

However, in order to achieve these optimizations, most frameworks require the developer to do some extra work: specifically, by providing a formal definition of the network graph, up-front, and then ‘freezing’ the graph, and just adjusting the weights.

The network definition, which can be large and complex with millions of connections, usually has to be constructed by hand. Not only are deep learning networks unwieldy, but they can be difficult to debug and it’s hard to re-use the code between projects.

The result of this complexity can be difficult for beginners and is a time-consuming task for more experienced researchers. At AWS, we’ve been experimenting with some ideas in MXNet around new, flexible, more approachable ways to define and train neural networks. Microsoft is also a contributor to the open source MXNet project, and were interested in some of these same ideas. Based on this, we got talking, and found we had a similar vision: to use these techniques to reduce the complexity of machine learning, making it accessible to more developers.

Enter Gluon: dynamic graphs, rapid iteration, scalable training
Gluon introduces four key innovations.

  1. Friendly API: Gluon networks can be defined using a simple, clear, concise code – this is easier for developers to learn, and much easier to understand than some of the more arcane and formal ways of defining networks and their associated weighted scoring functions.
  2. Dynamic networks: the network definition in Gluon is dynamic: it can bend and flex just like any other data structure. This is in contrast to the more common, formal, symbolic definition of a network which the deep learning framework has to effectively carve into stone in order to be able to effectively optimizing computation during training. Dynamic networks are easier to manage, and with Gluon, developers can easily ‘hybridize’ between these fast symbolic representations and the more friendly, dynamic ‘imperative’ definitions of the network and algorithms.
  3. The algorithm can define the network: the model and the training algorithm are brought much closer together. Instead of separate definitions, the algorithm can adjust the network dynamically during definition and training. Not only does this mean that developers can use standard programming loops, and conditionals to create these networks, but researchers can now define even more sophisticated algorithms and models which were not possible before. They are all easier to create, change, and debug.
  4. High performance operators for training: which makes it possible to have a friendly, concise API and dynamic graphs, without sacrificing training speed. This is a huge step forward in machine learning. Some frameworks bring a friendly API or dynamic graphs to deep learning, but these previous methods all incur a cost in terms of training speed. As with other areas of software, abstraction can slow down computation since it needs to be negotiated and interpreted at run time. Gluon can efficiently blend together a concise API with the formal definition under the hood, without the developer having to know about the specific details or to accommodate the compiler optimizations manually.

The team here at AWS, and our collaborators at Microsoft, couldn’t be more excited to bring these improvements to developers through Gluon. We’re already seeing quite a bit of excitement from developers and researchers alike.

Getting started with Gluon
Gluon is available today in Apache MXNet, with support coming for the Microsoft Cognitive Toolkit in a future release. We’re also publishing the front-end interface and the low-level API specifications so it can be included in other frameworks in the fullness of time.

You can get started with Gluon today. Fire up the AWS Deep Learning AMI with a single click and jump into one of 50 fully worked, notebook examples. If you’re a contributor to a machine learning framework, check out the interface specs on GitHub.

-Dr. Matt Wood

Cloudflare CEO Has to Explain Lack of Pirate Site Terminations

Post Syndicated from Ernesto original https://torrentfreak.com/cloudflare-ceo-has-to-explain-lack-of-pirate-site-terminations-171010/

In August, Cloudflare CEO Matthew Prince decided to terminate the account of controversial neo-Nazi site Daily Stormer.

“I woke up this morning in a bad mood and decided to kick them off the Internet,” he wrote.

The decision was meant as an intellectual exercise to start a conversation regarding censorship and free speech on the internet. In this respect it was a success but the discussion went much further than Prince had intended.

Cloudflare had a long-standing policy not to remove any accounts without a court order, so when this was exceeded, eyebrows were raised. In particular, copyright holders wondered why the company could terminate this account but not those of the most notorious pirate sites.

Adult entertainment publisher ALS Scan raised this question in its piracy liability case against Cloudflare, asking for a 7-hour long deposition of the company’s CEO, to find out more. Cloudflare opposed this request, saying it was overbroad and unneeded, while asking the court to weigh in.

After reviewing the matter, Magistrate Judge Alexander MacKinnon decided to allow the deposition, but in a limited form.

“An initial matter, the Court finds that ALS Scan has not made a showing that would justify a 7 hour deposition of Mr. Prince covering a wide range of topics,” the order (pdf) reads.

“On the other hand, a review of the record shows that ALS Scan has identified a narrow relevant issue for which it appears Mr. Prince has unique knowledge and for which less intrusive discovery has been exhausted.”

ALS Scan will be able to interrogate Cloudflare’s CEO but only for two hours. The deposition must be specifically tailored toward his motivation (not) to use his authority to terminate the accounts of ‘pirating’ customers.

“The specific topic is the use (or non-use) of Mr. Prince’s authority to terminate customers, as specifically applied to customers for whom Cloudflare has received notices of copyright infringement,” the order specifies.

Whether this deposition will help ALS Scan argue its case has yet to be seen. Based on earlier submissions, the CEO will likely argue that the Daily Stormer case was an exception to make a point and that it’s company policy to require a court order to respond to infringement claims.

Meanwhile, more questions are being raised. Just a few days ago Cloudflare suspended the account of a customer for using a cryptocurrency miner. Apparently, Cloudflare classifies these miners as malware, triggering a punishment without a court order.

ALS Scan and other copyright holders would like to see a similar policy against notorious pirate sites, but thus far Cloudflare is having none of it.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Application Load Balancers Now Support Multiple TLS Certificates With Smart Selection Using SNI

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-application-load-balancer-sni/

Today we’re launching support for multiple TLS/SSL certificates on Application Load Balancers (ALB) using Server Name Indication (SNI). You can now host multiple TLS secured applications, each with its own TLS certificate, behind a single load balancer. In order to use SNI, all you need to do is bind multiple certificates to the same secure listener on your load balancer. ALB will automatically choose the optimal TLS certificate for each client. These new features are provided at no additional charge.

If you’re looking for a TL;DR on how to use this new feature just click here. If you’re like me and you’re a little rusty on the specifics of Transport Layer Security (TLS) then keep reading.

TLS? SSL? SNI?

People tend to use the terms SSL and TLS interchangeably even though the two are technically different. SSL technically refers to a predecessor of the TLS protocol. To keep things simple I’ll be using the term TLS for the rest of this post.

TLS is a protocol for securely transmitting data like passwords, cookies, and credit card numbers. It enables privacy, authentication, and integrity of the data being transmitted. TLS uses certificate based authentication where certificates are like ID cards for your websites. You trust the person that signed and issued the certificate, the certificate authority (CA), so you trust that the data in the certificate is correct. When a browser connects to your TLS-enabled ALB, ALB presents a certificate that contains your site’s public key, which has been cryptographically signed by a CA. This way the client can be sure it’s getting the ‘real you’ and that it’s safe to use your site’s public key to establish a secure connection.

With SNI support we’re making it easy to use more than one certificate with the same ALB. The most common reason you might want to use multiple certificates is to handle different domains with the same load balancer. It’s always been possible to use wildcard and subject-alternate-name (SAN) certificates with ALB, but these come with limitations. Wildcard certificates only work for related subdomains that match a simple pattern and while SAN certificates can support many different domains, the same certificate authority has to authenticate each one. That means you have reauthenticate and reprovision your certificate everytime you add a new domain.

One of our most frequent requests on forums, reddit, and in my e-mail inbox has been to use the Server Name Indication (SNI) extension of TLS to choose a certificate for a client. Since TLS operates at the transport layer, below HTTP, it doesn’t see the hostname requested by a client. SNI works by having the client tell the server “This is the domain I expect to get a certificate for” when it first connects. The server can then choose the correct certificate to respond to the client. All modern web browsers and a large majority of other clients support SNI. In fact, today we see SNI supported by over 99.5% of clients connecting to CloudFront.

Smart Certificate Selection on ALB

ALB’s smart certificate selection goes beyond SNI. In addition to containing a list of valid domain names, certificates also describe the type of key exchange and cryptography that the server supports, as well as the signature algorithm (SHA2, SHA1, MD5) used to sign the certificate. To establish a TLS connection, a client starts a TLS handshake by sending a “ClientHello” message that outlines the capabilities of the client: the protocol versions, extensions, cipher suites, and compression methods. Based on what an individual client supports, ALB’s smart selection algorithm chooses a certificate for the connection and sends it to the client. ALB supports both the classic RSA algorithm and the newer, hipper, and faster Elliptic-curve based ECDSA algorithm. ECDSA support among clients isn’t as prevalent as SNI, but it is supported by all modern web browsers. Since it’s faster and requires less CPU, it can be particularly useful for ultra-low latency applications and for conserving the amount of battery used by mobile applications. Since ALB can see what each client supports from the TLS handshake, you can upload both RSA and ECDSA certificates for the same domains and ALB will automatically choose the best one for each client.

Using SNI with ALB

I’ll use a few example websites like VimIsBetterThanEmacs.com and VimIsTheBest.com. I’ve purchased and hosted these domains on Amazon Route 53, and provisioned two separate certificates for them in AWS Certificate Manager (ACM). If I want to securely serve both of these sites through a single ALB, I can quickly add both certificates in the console.

First, I’ll select my load balancer in the console, go to the listeners tab, and select “view/edit certificates”.

Next, I’ll use the “+” button in the top left corner to select some certificates then I’ll click the “Add” button.

There are no more steps. If you’re not really a GUI kind of person you’ll be pleased to know that it’s also simple to add new certificates via the AWS Command Line Interface (CLI) (or SDKs).

aws elbv2 add-listener-certificates --listener-arn <listener-arn> --certificates CertificateArn=<cert-arn>

Things to know

  • ALB Access Logs now include the client’s requested hostname and the certificate ARN used. If the “hostname” field is empty (represented by a “-“) the client did not use the SNI extension in their request.
  • You can use any of your certificates in ACM or IAM.
  • You can bind multiple certificates for the same domain(s) to a secure listener. Your ALB will choose the optimal certificate based on multiple factors including the capabilities of the client.
  • If the client does not support SNI your ALB will use the default certificate (the one you specified when you created the listener).
  • There are three new ELB API calls: AddListenerCertificates, RemoveListenerCertificates, and DescribeListenerCertificates.
  • You can bind up to 25 certificates per load balancer (not counting the default certificate).
  • These new features are supported by AWS CloudFormation at launch.

You can see an example of these new features in action with a set of websites created by my colleague Jon Zobrist: https://www.exampleloadbalancer.com/.

Overall, I will personally use this feature and I’m sure a ton of AWS users will benefit from it as well. I want to thank the Elastic Load Balancing team for all their hard work in getting this into the hands of our users.

Randall

Private Torrent Sites Allow Users to Mine Cryptocurrency for Upload Credit

Post Syndicated from Andy original https://torrentfreak.com/private-torrent-sites-allow-users-to-mine-cryptocurrency-for-upload-credit-171008/

Ever since The Pirate Bay crew added a cryptocurrency miner to their site last month, the debate over user mining has sizzled away in the background.

The basic premise is that a piece of software embedded in a website runs on a user’s machine, utilizing its CPU cycles in order to generate revenue for the site in question. But not everyone likes it.

The main problem has centered around consent. While some sites are giving users the option of whether to be involved or not, others simply run the miner without asking. This week, one site operator suggested to TF that since no one asks whether they can run “shitty” ads on a person’s machine, why should they ask permission to mine?

It’s a controversial point, but it would be hard to find users agreeing on either front. They almost universally insist on consent, wherever possible. That’s why when someone comes up with something innovative to solve a problem, it catches the eye.

Earlier this week a user on Reddit posted a screenshot of a fairly well known private tracker. The site had implemented a mining solution not dissimilar to that appearing on other similar platforms. This one, however, gives the user something back.

Mining for coins – with a twist

First of all, it’s important to note the implementation. The decision to mine is completely under the control of the user, with buttons to start or stop mining. There are even additional controls for how many CPU threads to commit alongside a percentage utilization selector. While still early days, that all sounds pretty fair.

Where this gets even more interesting is how this currency mining affects so-called “upload credit”, an important commodity on a private tracker without which users can be prevented from downloading any content at all.

Very quickly: when BitTorrent users download content, they simultaneously upload to other users too. The idea is that they download X megabytes and upload the same number (at least) to other users, to ensure that everyone in a torrent swarm (a number of users sharing together) gets a piece of the action, aka the content in question.

The amount of content downloaded and uploaded on a private tracker is monitored and documented by the site. If a user has 1TB downloaded and 2TB uploaded, for example, he has 1TB in credit. In basic terms, this means he can download at least 1TB of additional content before he goes into deficit, a position undesirable on a private tracker.

Now, getting more “upload credit” can be as simple as uploading more, but some users find that difficult, either due to the way a tracker’s economy works or simply due to not having resources. If this is the case, some sites allow people to donate real money to receive “upload credit”. On the tracker highlighted in the mining example above, however, it’s possible to virtually ‘trade-in’ some of the mining effort instead.

Tracker politics aside (some people believe this is simply a cash grab opportunity), from a technical standpoint the prospect is quite intriguing.

In a way, the current private tracker system allows users to “mine” upload credits by donating bandwidth to other users of the site. Now they have the opportunity to mine an actual cryptocurrency on the tracker and have some of it converted back into the tracker’s native ‘currency’ – upload credit – which can only be ‘spent’ on the site. Meanwhile, the site’s operator can make a few bucks towards site maintenance.

Another example showing how innovative these mining implementations can be was posted by a member of a second private tracker. Although it’s unclear whether mining is forced or optional, there appears to be complete transparency for the benefit of the user.

The mining ‘Top 10’ on a private tracker

In addition to displaying the total number of users mining and the hashes solved per second, the site publishes a ‘Top 10’ list of users mining the most currently, and overall. Again, some people might not like the concept of users mining at all, but psychologically this is a particularly clever implementation.

Utilizing the desire of many private tracker users to be recognizable among their peers due to their contribution to the platform, the charts give a user a measurable status in the community, at least among those who care about such things. Previously these charts would list top uploaders of content but the addition of a ‘Top miner’ category certainly adds some additional spice to the mix.

Mining is a controversial topic which isn’t likely to go away anytime soon. But, for all its faults, it’s still a way for sites to generate revenue, away from the pitfalls of increasingly hostile and easy-to-trace alternative payment systems. The Pirate Bay may have set the cat among the pigeons last month, but it also gave the old gray matter a boost too.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Now Available – Microsoft SQL Server 2017 for Amazon EC2

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-microsoft-sql-server-2017-for-amazon-ec2/

Microsoft SQL Server 2017 (launched just a few days ago) includes lots of powerful new features including support for graph databases, automatic database tuning, and the ability to create clusterless Always On Availability Groups. It can also be run on Linux and in Docker containers.

Run on EC2
I’m happy to announce that you can now launch EC2 instances that run Windows Server 2016 and four editions (Web, Express, Standard, and Enterprise) of SQL Server 2017. The AMIs (Amazon Machine Images) are available today in all AWS Regions and run on a wide variety of EC2 instance types, including the new x1e.32xlarge with 128 vCPUs and almost 4 TB of memory.

You can launch these instances from the AWS Management Console or through AWS Marketplace. Here’s what they look like in the console:

And in AWS Marketplace:

Licensing Options Galore
You have lots of licensing options for SQL Server:

Pay As You Go – This option works well if you would prefer to avoid buying licenses, are already running an older version of SQL Server, and want to upgrade. You don’t have to deal with true-ups, software compliance audits, or Software Assurance and you don’t need to make a long-term purchase. If you are running the Standard Edition of SQL Server, you also benefit from our recent price reduction, with savings of up to 52%.

License Mobility – This option lets your use your active Software Assurance agreement to bring your existing licenses to EC2, and allows you to run SQL Server on Windows or Linux instances.

Bring Your Own Licenses – This option lets you take advantage of your existing license investment while minimizing upgrade costs. You can run SQL Server on EC2 Dedicated Instances or EC2 Dedicated Hosts, with the potential to reduce operating costs by licensing SQL Server on a per-core basis. This option allows you to run SQL Server 2017 on EC2 Linux instances (SUSE, RHEL, and Ubuntu are supported) and also supports Docker-based environments running on EC2 Windows and Linux instances. To learn more about these options, read the Installation Guidance for SQL Server on Linux and Run SQL Server 2017 Container Image with Docker.

Learn More
To learn more about SQL Server 2017 and to explore your licensing options in depth, take a look at the SQL Server on AWS page.

If you need advice and guidance as you plan your migration effort, check out the AWS Partners who have qualified for the Microsoft Workloads competency and focus on database solutions.

Amazon RDS support for SQL Server 2017 is planned for November. This will give you a fully managed option.

Plan to join the AWS team at the PASS Summit (November 1-3 in Seattle) and at AWS re:Invent (November 27th to December 1st in Las Vegas).

Jeff;

PS – Special thanks to my colleague Tom Staab (Partner Solutions Architect) for his help with this post!

timeShift(GrafanaBuzz, 1w) Issue 16

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/06/timeshiftgrafanabuzz-1w-issue-16/

Welcome to another issue of TimeShift. In addition to the roundup of articles and plugin updates, we had a big announcement this week – Early Bird tickets to GrafanaCon EU are now available! We’re also accepting CFPs through the end of October, so if you have a topic in mind, don’t wait until the last minute, please send it our way. Speakers who are selected will receive a comped ticket to the conference.


Early Bird Tickets Now Available

We’ve released a limited number of Early Bird tickets before General Admission tickets are available. Take advantage of this discount before they’re sold out!

Get Your Early Bird Ticket Now

Interested in speaking at GrafanaCon? We’re looking for technical and non-tecnical talks of all sizes. Submit a CFP Now.


From the Blogosphere

Get insights into your Azure Cosmos DB: partition heatmaps, OMS, and More: Microsoft recently announced the ability to access a subset of Azure Cosmos DB metrics via Azure Monitor API. Grafana Labs built an Azure Monitor Plugin for Grafana 4.5 to visualize the data.

How to monitor Docker for Mac/Windows: Brian was tired of guessing about the performance of his development machines and test environment. Here, he shows how to monitor Docker with Prometheus to get a better understanding of a dev environment in his quest to monitor all the things.

Prometheus and Grafana to Monitor 10,000 servers: This article covers enokido’s process of choosing a monitoring platform. He identifies three possible solutions, outlines the pros and cons of each, and discusses why he chose Prometheus.

GitLab Monitoring: It’s fascinating to see Grafana dashboards with production data from companies around the world. For instance, we’ve previously highlighted the huge number of dashboards Wikimedia publicly shares. This week, we found that GitLab also has public dashboards to explore.

Monitoring a Docker Swarm Cluster with cAdvisor, InfluxDB and Grafana | The Laboratory: It’s important to know the state of your applications in a scalable environment such as Docker Swarm. This video covers an overview of Docker, VM’s vs. containers, orchestration and how to monitor Docker Swarm.

Introducing Telemetry: Actionable Time Series Data from Counters: Learn how to use counters from mulitple disparate sources, devices, operating systems, and applications to generate actionable time series data.

ofp_sniffer Branch 1.2 (docker/influxdb/grafana) Upcoming Features: This video demo shows off some of the upcoming features for OFP_Sniffer, an OpenFlow sniffer to help network troubleshooting in production networks.


Grafana Plugins

Plugin authors add new features and bugfixes all the time, so it’s important to always keep your plugins up to date. To update plugins from on-prem Grafana, use the Grafana-cli tool, if you are using Hosted Grafana, you can update with 1 click! If you have questions or need help, hit up our community site, where the Grafana team and members of the community are happy to help.

UPDATED PLUGIN

PNP for Nagios Data Source – The latest release for the PNP data source has some fixes and adds a mathematical factor option.

Update

UPDATED PLUGIN

Google Calendar Data Source – This week, there was a small bug fix for the Google Calendar annotations data source.

Update

UPDATED PLUGIN

BT Plugins – Our friends at BT have been busy. All of the BT plugins in our catalog received and update this week. The plugins are the Status Dot Panel, the Peak Report Panel, the Trend Box Panel and the Alarm Box Panel.

Changes include:

  • Custom dashboard links now work in Internet Explorer.
  • The Peak Report panel no longer supports click-to-sort.
  • The Status Dot panel tooltips now look like Grafana tooltips.


This week’s MVC (Most Valuable Contributor)

Each week we highlight some of the important contributions from our amazing open source community. This week, we’d like to recognize a contributor who did a lot of work to improve Prometheus support.

pdoan017
Thanks to Alin Sinpaleanfor his Prometheus PR – that aligns the step and interval parameters. Alin got a lot of feedback from the Prometheus community and spent a lot of time and energy explaining, debating and iterating before the PR was ready.
Thank you!


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

Wow – Excited to be a part of exploring data to find out how Mexico City is evolving.

We Need Your Help!

Do you have a graph that you love because the data is beautiful or because the graph provides interesting information? Please get in touch. Tweet or send us an email with a screenshot, and we’ll tell you about this fun experiment.

Tell Me More


What do you think?

That’s a wrap! How are we doing? Submit a comment on this article below, or post something at our community forum. Help us make these weekly roundups better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.