Tag Archives: Core

Getting Ready for AWS re:Invent 2017

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/getting-ready-for-aws-reinvent-2017/

With just 40 days remaining before AWS re:Invent begins, my colleagues and I want to share some tips that will help you to make the most of your time in Las Vegas. As always, our focus is on training and education, mixed in with some after-hours fun and recreation for balance.

Locations, Locations, Locations
The re:Invent Campus will span the length of the Las Vegas strip, with events taking place at the MGM Grand, Aria, Mirage, Venetian, Palazzo, the Sands Expo Hall, the Linq Lot, and the Encore. Each venue will host tracks devoted to specific topics:

MGM Grand – Business Apps, Enterprise, Security, Compliance, Identity, Windows.

Aria – Analytics & Big Data, Alexa, Container, IoT, AI & Machine Learning, and Serverless.

Mirage – Bootcamps, Certifications & Certification Exams.

Venetian / Palazzo / Sands Expo Hall – Architecture, AWS Marketplace & Service Catalog, Compute, Content Delivery, Database, DevOps, Mobile, Networking, and Storage.

Linq Lot – Alexa Hackathons, Gameday, Jam Sessions, re:Play Party, Speaker Meet & Greets.

EncoreBookable meeting space.

If your interests span more than one topic, plan to take advantage of the re:Invent shuttles that will be making the rounds between the venues.

Lots of Content
The re:Invent Session Catalog is now live and you should start to choose the sessions of interest to you now.

With more than 1100 sessions on the agenda, planning is essential! Some of the most popular “deep dive” sessions will be run more than once and others will be streamed to overflow rooms at other venues. We’ve analyzed a lot of data, run some simulations, and are doing our best to provide you with multiple opportunities to build an action-packed schedule.

We’re just about ready to let you reserve seats for your sessions (follow me and/or @awscloud on Twitter for a heads-up). Based on feedback from earlier years, we have fine-tuned our seat reservation model. This year, 75% of the seats for each session will be reserved and the other 25% are for walk-up attendees. We’ll start to admit walk-in attendees 10 minutes before the start of the session.

Las Vegas never sleeps and neither should you! This year we have a host of late-night sessions, workshops, chalk talks, and hands-on labs to keep you busy after dark.

To learn more about our plans for sessions and content, watch the Get Ready for re:Invent 2017 Content Overview video.

Have Fun
After you’ve had enough training and learning for the day, plan to attend the Pub Crawl, the re:Play party, the Tatonka Challenge (two locations this year), our Hands-On LEGO Activities, and the Harley Ride. Stay fit with our 4K Run, Spinning Challenge, Fitness Bootcamps, and Broomball (a longstanding Amazon tradition).

See You in Vegas
As always, I am looking forward to meeting as many AWS users and blog readers as possible. Never hesitate to stop me and to say hello!

Jeff;

 

 

Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-aws-step-functions-state-machines-to-handle-workflow-driven-aws-codepipeline-actions/

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. It offers powerful integration with other AWS services, such as AWS CodeBuildAWS CodeDeployAWS CodeCommit, AWS CloudFormation and with third-party tools such as Jenkins and GitHub. These services make it possible for AWS customers to successfully automate various tasks, including infrastructure provisioning, blue/green deployments, serverless deployments, AMI baking, database provisioning, and release management.

Developers have been able to use CodePipeline to build sophisticated automation pipelines that often require a single CodePipeline action to perform multiple tasks, fork into different execution paths, and deal with asynchronous behavior. For example, to deploy a Lambda function, a CodePipeline action might first inspect the changes pushed to the code repository. If only the Lambda code has changed, the action can simply update the Lambda code package, create a new version, and point the Lambda alias to the new version. If the changes also affect infrastructure resources managed by AWS CloudFormation, the pipeline action might have to create a stack or update an existing one through the use of a change set. In addition, if an update is required, the pipeline action might enforce a safety policy to infrastructure resources that prevents the deletion and replacement of resources. You can do this by creating a change set and having the pipeline action inspect its changes before updating the stack. Change sets that do not conform to the policy are deleted.

This use case is a good illustration of workflow-driven pipeline actions. These are actions that run multiple tasks, deal with async behavior and loops, need to maintain and propagate state, and fork into different execution paths. Implementing workflow-driven actions directly in CodePipeline can lead to complex pipelines that are hard for developers to understand and maintain. Ideally, a pipeline action should perform a single task and delegate the complexity of dealing with workflow-driven behavior associated with that task to a state machine engine. This would make it possible for developers to build simpler, more intuitive pipelines and allow them to use state machine execution logs to visualize and troubleshoot their pipeline actions.

In this blog post, we discuss how AWS Step Functions state machines can be used to handle workflow-driven actions. We show how a CodePipeline action can trigger a Step Functions state machine and how the pipeline and the state machine are kept decoupled through a Lambda function. The advantages of using state machines include:

  • Simplified logic (complex tasks are broken into multiple smaller tasks).
  • Ease of handling asynchronous behavior (through state machine wait states).
  • Built-in support for choices and processing different execution paths (through state machine choices).
  • Built-in visualization and logging of the state machine execution.

The source code for the sample pipeline, pipeline actions, and state machine used in this post is available at https://github.com/awslabs/aws-codepipeline-stepfunctions.

Overview

This figure shows the components in the CodePipeline-Step Functions integration that will be described in this post. The pipeline contains two stages: a Source stage represented by a CodeCommit Git repository and a Prod stage with a single Deploy action that represents the workflow-driven action.

This action invokes a Lambda function (1) called the State Machine Trigger Lambda, which, in turn, triggers a Step Function state machine to process the request (2). The Lambda function sends a continuation token back to the pipeline (3) to continue its execution later and terminates. Seconds later, the pipeline invokes the Lambda function again (4), passing the continuation token received. The Lambda function checks the execution state of the state machine (5,6) and communicates the status to the pipeline. The process is repeated until the state machine execution is complete. Then the Lambda function notifies the pipeline that the corresponding pipeline action is complete (7). If the state machine has failed, the Lambda function will then fail the pipeline action and stop its execution (7). While running, the state machine triggers various Lambda functions to perform different tasks. The state machine and the pipeline are fully decoupled. Their interaction is handled by the Lambda function.

The Deploy State Machine

The sample state machine used in this post is a simplified version of the use case, with emphasis on infrastructure deployment. The state machine will follow distinct execution paths and thus have different outcomes, depending on:

  • The current state of the AWS CloudFormation stack.
  • The nature of the code changes made to the AWS CloudFormation template and pushed into the pipeline.

If the stack does not exist, it will be created. If the stack exists, a change set will be created and its resources inspected by the state machine. The inspection consists of parsing the change set results and detecting whether any resources will be deleted or replaced. If no resources are being deleted or replaced, the change set is allowed to be executed and the state machine completes successfully. Otherwise, the change set is deleted and the state machine completes execution with a failure as the terminal state.

Let’s dive into each of these execution paths.

Path 1: Create a Stack and Succeed Deployment

The Deploy state machine is shown here. It is triggered by the Lambda function using the following input parameters stored in an S3 bucket.

Create New Stack Execution Path

{
    "environmentName": "prod",
    "stackName": "sample-lambda-app",
    "templatePath": "infra/Lambda-template.yaml",
    "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
    "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ"
}

Note that some values used here are for the use case example only. Account-specific parameters like revisionS3Bucket and revisionS3Key will be different when you deploy this use case in your account.

These input parameters are used by various states in the state machine and passed to the corresponding Lambda functions to perform different tasks. For example, stackName is used to create a stack, check the status of stack creation, and create a change set. The environmentName represents the environment (for example, dev, test, prod) to which the code is being deployed. It is used to prefix the name of stacks and change sets.

With the exception of built-in states such as wait and choice, each state in the state machine invokes a specific Lambda function.  The results received from the Lambda invocations are appended to the state machine’s original input. When the state machine finishes its execution, several parameters will have been added to its original input.

The first stage in the state machine is “Check Stack Existence”. It checks whether a stack with the input name specified in the stackName input parameter already exists. The output of the state adds a Boolean value called doesStackExist to the original state machine input as follows:

{
  "doesStackExist": true,
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
}

The following stage, “Does Stack Exist?”, is represented by Step Functions built-in choice state. It checks the value of doesStackExist to determine whether a new stack needs to be created (doesStackExist=true) or a change set needs to be created and inspected (doesStackExist=false).

If the stack does not exist, the states illustrated in green in the preceding figure are executed. This execution path creates the stack, waits until the stack is created, checks the status of the stack’s creation, and marks the deployment successful after the stack has been created. Except for “Stack Created?” and “Wait Stack Creation,” each of these stages invokes a Lambda function. “Stack Created?” and “Wait Stack Creation” are implemented by using the built-in choice state (to decide which path to follow) and the wait state (to wait a few seconds before proceeding), respectively. Each stage adds the results of their Lambda function executions to the initial input of the state machine, allowing future stages to process them.

Path 2: Safely Update a Stack and Mark Deployment as Successful

Safely Update a Stack and Mark Deployment as Successful Execution Path

If the stack indicated by the stackName parameter already exists, a different path is executed. (See the green states in the figure.) This path will create a change set and use wait and choice states to wait until the change set is created. Afterwards, a stage in the execution path will inspect  the resources affected before the change set is executed.

The inspection procedure represented by the “Inspect Change Set Changes” stage consists of parsing the resources affected by the change set and checking whether any of the existing resources are being deleted or replaced. The following is an excerpt of the algorithm, where changeSetChanges.Changes is the object representing the change set changes:

...
var RESOURCES_BEING_DELETED_OR_REPLACED = "RESOURCES-BEING-DELETED-OR-REPLACED";
var CAN_SAFELY_UPDATE_EXISTING_STACK = "CAN-SAFELY-UPDATE-EXISTING-STACK";
for (var i = 0; i < changeSetChanges.Changes.length; i++) {
    var change = changeSetChanges.Changes[i];
    if (change.Type == "Resource") {
        if (change.ResourceChange.Action == "Delete") {
            return RESOURCES_BEING_DELETED_OR_REPLACED;
        }
        if (change.ResourceChange.Action == "Modify") {
            if (change.ResourceChange.Replacement == "True") {
                return RESOURCES_BEING_DELETED_OR_REPLACED;
            }
        }
    }
}
return CAN_SAFELY_UPDATE_EXISTING_STACK;

The algorithm returns different values to indicate whether the change set can be safely executed (CAN_SAFELY_UPDATE_EXISTING_STACK or RESOURCES_BEING_DELETED_OR_REPLACED). This value is used later by the state machine to decide whether to execute the change set and update the stack or interrupt the deployment.

The output of the “Inspect Change Set” stage is shown here.

{
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
  "doesStackExist": true,
  "changeSetName": "prod-sample-lambda-app-change-set-545",
  "changeSetCreationStatus": "complete",
  "changeSetAction": "CAN-SAFELY-UPDATE-EXISTING-STACK"
}

At this point, these parameters have been added to the state machine’s original input:

  • changeSetName, which is added by the “Create Change Set” state.
  • changeSetCreationStatus, which is added by the “Get Change Set Creation Status” state.
  • changeSetAction, which is added by the “Inspect Change Set Changes” state.

The “Safe to Update Infra?” step is a choice state (its JSON spec follows) that simply checks the value of the changeSetAction parameter. If the value is equal to “CAN-SAFELY-UPDATE-EXISTING-STACK“, meaning that no resources will be deleted or replaced, the step will execute the change set by proceeding to the “Execute Change Set” state. The deployment is successful (the state machine completes its execution successfully).

"Safe to Update Infra?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.taskParams.changeSetAction",
          "StringEquals": "CAN-SAFELY-UPDATE-EXISTING-STACK",
          "Next": "Execute Change Set"
        }
      ],
      "Default": "Deployment Failed"
 }

Path 3: Reject Stack Update and Fail Deployment

Reject Stack Update and Fail Deployment Execution Path

If the changeSetAction parameter is different from “CAN-SAFELY-UPDATE-EXISTING-STACK“, the state machine will interrupt the deployment by deleting the change set and proceeding to the “Deployment Fail” step, which is a built-in Fail state. (Its JSON spec follows.) This state causes the state machine to stop in a failed state and serves to indicate to the Lambda function that the pipeline deployment should be interrupted in a fail state as well.

 "Deployment Failed": {
      "Type": "Fail",
      "Cause": "Deployment Failed",
      "Error": "Deployment Failed"
    }

In all three scenarios, there’s a state machine’s visual representation available in the AWS Step Functions console that makes it very easy for developers to identify what tasks have been executed or why a deployment has failed. Developers can also inspect the inputs and outputs of each state and look at the state machine Lambda function’s logs for details. Meanwhile, the corresponding CodePipeline action remains very simple and intuitive for developers who only need to know whether the deployment was successful or failed.

The State Machine Trigger Lambda Function

The Trigger Lambda function is invoked directly by the Deploy action in CodePipeline. The CodePipeline action must pass a JSON structure to the trigger function through the UserParameters attribute, as follows:

{
  "s3Bucket": "codepipeline-StepFunctions-sample",
  "stateMachineFile": "state_machine_input.json"
}

The s3Bucket parameter specifies the S3 bucket location for the state machine input parameters file. The stateMachineFile parameter specifies the file holding the input parameters. By being able to specify different input parameters to the state machine, we make the Trigger Lambda function and the state machine reusable across environments. For example, the same state machine could be called from a test and prod pipeline action by specifying a different S3 bucket or state machine input file for each environment.

The Trigger Lambda function performs two main tasks: triggering the state machine and checking the execution state of the state machine. Its core logic is shown here:

exports.index = function (event, context, callback) {
    try {
        console.log("Event: " + JSON.stringify(event));
        console.log("Context: " + JSON.stringify(context));
        console.log("Environment Variables: " + JSON.stringify(process.env));
        if (Util.isContinuingPipelineTask(event)) {
            monitorStateMachineExecution(event, context, callback);
        }
        else {
            triggerStateMachine(event, context, callback);
        }
    }
    catch (err) {
        failure(Util.jobId(event), callback, context.invokeid, err.message);
    }
}

Util.isContinuingPipelineTask(event) is a utility function that checks if the Trigger Lambda function is being called for the first time (that is, no continuation token is passed by CodePipeline) or as a continuation of a previous call. In its first execution, the Lambda function will trigger the state machine and send a continuation token to CodePipeline that contains the state machine execution ARN. The state machine ARN is exposed to the Lambda function through a Lambda environment variable called stateMachineArn. Here is the code that triggers the state machine:

function triggerStateMachine(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var s3Bucket = Util.actionUserParameter(event, "s3Bucket");
    var stateMachineFile = Util.actionUserParameter(event, "stateMachineFile");
    getStateMachineInputData(s3Bucket, stateMachineFile)
        .then(function (data) {
            var initialParameters = data.Body.toString();
            var stateMachineInputJSON = createStateMachineInitialInput(initialParameters, event);
            console.log("State machine input JSON: " + JSON.stringify(stateMachineInputJSON));
            return stateMachineInputJSON;
        })
        .then(function (stateMachineInputJSON) {
            return triggerStateMachineExecution(stateMachineArn, stateMachineInputJSON);
        })
        .then(function (triggerStateMachineOutput) {
            var continuationToken = { "stateMachineExecutionArn": triggerStateMachineOutput.executionArn };
            var message = "State machine has been triggered: " + JSON.stringify(triggerStateMachineOutput) + ", continuationToken: " + JSON.stringify(continuationToken);
            return continueExecution(Util.jobId(event), continuationToken, callback, message);
        })
        .catch(function (err) {
            console.log("Error triggering state machine: " + stateMachineArn + ", Error: " + err.message);
            failure(Util.jobId(event), callback, context.invokeid, err.message);
        })
}

The Trigger Lambda function fetches the state machine input parameters from an S3 file, triggers the execution of the state machine using the input parameters and the stateMachineArn environment variable, and signals to CodePipeline that the execution should continue later by passing a continuation token that contains the state machine execution ARN. In case any of these operations fail and an exception is thrown, the Trigger Lambda function will fail the pipeline immediately by signaling a pipeline failure through the putJobFailureResult CodePipeline API.

If the Lambda function is continuing a previous execution, it will extract the state machine execution ARN from the continuation token and check the status of the state machine, as shown here.

function monitorStateMachineExecution(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var continuationToken = JSON.parse(Util.continuationToken(event));
    var stateMachineExecutionArn = continuationToken.stateMachineExecutionArn;
    getStateMachineExecutionStatus(stateMachineExecutionArn)
        .then(function (response) {
            if (response.status === "RUNNING") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " is still " + response.status;
                return continueExecution(Util.jobId(event), continuationToken, callback, message);
            }
            if (response.status === "SUCCEEDED") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
                return success(Util.jobId(event), callback, message);
            }
            // FAILED, TIMED_OUT, ABORTED
            var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
            return failure(Util.jobId(event), callback, context.invokeid, message);
        })
        .catch(function (err) {
            var message = "Error monitoring execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + ", Error: " + err.message;
            failure(Util.jobId(event), callback, context.invokeid, message);
        });
}

If the state machine is in the RUNNING state, the Lambda function will send the continuation token back to the CodePipeline action. This will cause CodePipeline to call the Lambda function again a few seconds later. If the state machine has SUCCEEDED, then the Lambda function will notify the CodePipeline action that the action has succeeded. In any other case (FAILURE, TIMED-OUT, or ABORT), the Lambda function will fail the pipeline action.

This behavior is especially useful for developers who are building and debugging a new state machine because a bug in the state machine can potentially leave the pipeline action hanging for long periods of time until it times out. The Trigger Lambda function prevents this.

Also, by having the Trigger Lambda function as a means to decouple the pipeline and state machine, we make the state machine more reusable. It can be triggered from anywhere, not just from a CodePipeline action.

The Pipeline in CodePipeline

Our sample pipeline contains two simple stages: the Source stage represented by a CodeCommit Git repository and the Prod stage, which contains the Deploy action that invokes the Trigger Lambda function. When the state machine decides that the change set created must be rejected (because it replaces or deletes some the existing production resources), it fails the pipeline without performing any updates to the existing infrastructure. (See the failed Deploy action in red.) Otherwise, the pipeline action succeeds, indicating that the existing provisioned infrastructure was either created (first run) or updated without impacting any resources. (See the green Deploy stage in the pipeline on the left.)

The Pipeline in CodePipeline

The JSON spec for the pipeline’s Prod stage is shown here. We use the UserParameters attribute to pass the S3 bucket and state machine input file to the Lambda function. These parameters are action-specific, which means that we can reuse the state machine in another pipeline action.

{
  "name": "Prod",
  "actions": [
      {
          "inputArtifacts": [
              {
                  "name": "CodeCommitOutput"
              }
          ],
          "name": "Deploy",
          "actionTypeId": {
              "category": "Invoke",
              "owner": "AWS",
              "version": "1",
              "provider": "Lambda"
          },
          "outputArtifacts": [],
          "configuration": {
              "FunctionName": "StateMachineTriggerLambda",
              "UserParameters": "{\"s3Bucket\": \"codepipeline-StepFunctions-sample\", \"stateMachineFile\": \"state_machine_input.json\"}"
          },
          "runOrder": 1
      }
  ]
}

Conclusion

In this blog post, we discussed how state machines in AWS Step Functions can be used to handle workflow-driven actions. We showed how a Lambda function can be used to fully decouple the pipeline and the state machine and manage their interaction. The use of a state machine greatly simplified the associated CodePipeline action, allowing us to build a much simpler and cleaner pipeline while drilling down into the state machine’s execution for troubleshooting or debugging.

Here are two exercises you can complete by using the source code.

Exercise #1: Do not fail the state machine and pipeline action after inspecting a change set that deletes or replaces resources. Instead, create a stack with a different name (think of blue/green deployments). You can do this by creating a state machine transition between the “Safe to Update Infra?” and “Create Stack” stages and passing a new stack name as input to the “Create Stack” stage.

Exercise #2: Add wait logic to the state machine to wait until the change set completes its execution before allowing the state machine to proceed to the “Deployment Succeeded” stage. Use the stack creation case as an example. You’ll have to create a Lambda function (similar to the Lambda function that checks the creation status of a stack) to get the creation status of the change set.

Have fun and share your thoughts!

About the Author

Marcilio Mendonca is a Sr. Consultant in the Canadian Professional Services Team at Amazon Web Services. He has helped AWS customers design, build, and deploy best-in-class, cloud-native AWS applications using VMs, containers, and serverless architectures. Before he joined AWS, Marcilio was a Software Development Engineer at Amazon. Marcilio also holds a Ph.D. in Computer Science. In his spare time, he enjoys playing drums, riding his motorcycle in the Toronto GTA area, and spending quality time with his family.

Security updates for Wednesday

Post Syndicated from ris original https://lwn.net/Articles/736766/rss

Security updates have been issued by Arch Linux (kernel, linux-hardened, and linux-zen), CentOS (wpa_supplicant), Debian (xorg-server), Fedora (selinux-policy), Gentoo (libarchive, nagios-core, ruby, and xen), openSUSE (wpa_supplicant), Oracle (wpa_supplicant), Red Hat (Red Hat Single Sign-On, rh-nodejs6-nodejs, rh-sso7-keycloak, and wpa_supplicant), Scientific Linux (wpa_supplicant), SUSE (git, wpa_supplicant, and xen), and Ubuntu (xorg-server, xorg-server-hwe-16.04, xorg-server-lts-xenial).

Amazon Redshift Dense Compute (DC2) Nodes Deliver Twice the Performance as DC1 at the Same Price

Post Syndicated from Quaseer Mujawar original https://aws.amazon.com/blogs/big-data/amazon-redshift-dense-compute-dc2-nodes-deliver-twice-the-performance-as-dc1-at-the-same-price/

Amazon Redshift makes analyzing exabyte-scale data fast, simple, and cost-effective. It delivers advanced data warehousing capabilities, including parallel execution, compressed columnar storage, and end-to-end encryption as a fully managed service, for less than $1,000/TB/year. With Amazon Redshift Spectrum, you can run SQL queries directly against exabytes of unstructured data in Amazon S3 for $5/TB scanned.

Today, we are making our Dense Compute (DC) family faster and more cost-effective with new second-generation Dense Compute (DC2) nodes at the same price as our previous generation DC1. DC2 is designed for demanding data warehousing workloads that require low latency and high throughput. DC2 features powerful Intel E5-2686 v4 (Broadwell) CPUs, fast DDR4 memory, and NVMe-based solid state disks.

We’ve tuned Amazon Redshift to take advantage of the better CPU, network, and disk on DC2 nodes, providing up to twice the performance of DC1 at the same price. Our DC2.8xlarge instances now provide twice the memory per slice of data and an optimized storage layout with 30 percent better storage utilization.

Customer successes

Several flagship customers, ranging from fast growing startups to large Fortune 100 companies, previewed the new DC2 node type. In their tests, DC2 provided up to twice the performance as DC1. Our preview customers saw faster ETL (extract, transform, and load) jobs, higher query throughput, better concurrency, faster reports, and shorter data-to-insights—all at the same cost as DC1. DC2.8xlarge customers also noted that their databases used up to 30 percent less disk space due to our optimized storage format, reducing their costs.

4Cite Marketing, one of America’s fastest growing private companies, uses Amazon Redshift to analyze customer data and determine personalized product recommendations for retailers. “Amazon Redshift’s new DC2 node is giving us a 100 percent performance increase, allowing us to provide faster insights for our retailers, more cost-effectively, to drive incremental revenue,” said Jim Finnerty, 4Cite’s senior vice president of product.

BrandVerity, a Seattle-based brand protection and compliance‎ company, provides solutions to monitor, detect, and mitigate online brand, trademark, and compliance abuse. “We saw a 70 percent performance boost with the DC2 nodes for running Redshift Spectrum queries. As a result, we can analyze far more data for our customers and deliver results much faster,” said Hyung-Joon Kim, principal software engineer at BrandVerity.

“Amazon Redshift is at the core of our operations and our marketing automation tools,” said Jarno Kartela, head of analytics and chief data scientist at DNA Plc, one of the leading Finnish telecommunications groups and Finland’s largest cable operator and pay TV provider. “We saw a 52 percent performance gain in moving to Amazon Redshift’s DC2 nodes. We can now run queries in half the time, allowing us to provide more analytics power and reduce time-to-insight for our analytics and marketing automation users.”

You can read about their experiences on our Customer Success page.

Get started

You can try the new node type using our getting started guide. Just choose dc2.large or dc2.8xlarge in the Amazon Redshift console:

If you have a DC1.large Amazon Redshift cluster, you can restore to a new DC2.large cluster using an existing snapshot. To migrate from DS2.xlarge, DS2.8xlarge, or DC1.8xlarge Amazon Redshift clusters, you can use the resize operation to move data to your new DC2 cluster. For more information, see Clusters and Nodes in Amazon Redshift.

To get the latest Amazon Redshift feature announcements, check out our What’s New page, and subscribe to the RSS feed.

Abandon Proactive Copyright Filters, Huge Coalition Tells EU Heavyweights

Post Syndicated from Andy original https://torrentfreak.com/abandon-proactive-copyright-filters-huge-coalition-tells-eu-heavyweights-171017/

Last September, EU Commission President Jean-Claude Juncker announced plans to modernize copyright law in Europe.

The proposals (pdf) are part of the Digital Single Market reforms, which have been under development for the past several years.

One of the proposals is causing significant concern. Article 13 would require some online service providers to become ‘Internet police’, proactively detecting and filtering allegedly infringing copyright works, uploaded to their platforms by users.

Currently, users are generally able to share whatever they like but should a copyright holder take exception to their upload, mechanisms are available for that content to be taken down. It’s envisioned that proactive filtering, whereby user uploads are routinely scanned and compared to a database of existing protected content, will prevent content becoming available in the first place.

These proposals are of great concern to digital rights groups, who believe that such filters will not only undermine users’ rights but will also place unfair burdens on Internet platforms, many of which will struggle to fund such a program. Yesterday, in the latest wave of opposition to Article 13, a huge coalition of international rights groups came together to underline their concerns.

Headed up by Civil Liberties Union for Europe (Liberties) and European Digital Rights (EDRi), the coalition is formed of dozens of influential groups, including Electronic Frontier Foundation (EFF), Human Rights Watch, Reporters without Borders, and Open Rights Group (ORG), to name just a few.

In an open letter to European Commission President Jean-Claude Juncker, President of the European Parliament Antonio Tajani, President of the European Council Donald Tusk and a string of others, the groups warn that the proposals undermine the trust established between EU member states.

“Fundamental rights, justice and the rule of law are intrinsically linked and constitute
core values on which the EU is founded,” the letter begins.

“Any attempt to disregard these values undermines the mutual trust between member states required for the EU to function. Any such attempt would also undermine the commitments made by the European Union and national governments to their citizens.”

Those citizens, the letter warns, would have their basic rights undermined, should the new proposals be written into EU law.

“Article 13 of the proposal on Copyright in the Digital Single Market include obligations on internet companies that would be impossible to respect without the imposition of excessive restrictions on citizens’ fundamental rights,” it notes.

A major concern is that by placing new obligations on Internet service providers that allow users to upload content – think YouTube, Facebook, Twitter and Instagram – they will be forced to err on the side of caution. Should there be any concern whatsoever that content might be infringing, fair use considerations and exceptions will be abandoned in favor of staying on the right side of the law.

“Article 13 appears to provoke such legal uncertainty that online services will have no other option than to monitor, filter and block EU citizens’ communications if they are to have any chance of staying in business,” the letter warns.

But while the potential problems for service providers and users are numerous, the groups warn that Article 13 could also be illegal since it contradicts case law of the Court of Justice.

According to the E-Commerce Directive, platforms are already required to remove infringing content, once they have been advised it exists. The new proposal, should it go ahead, would force the monitoring of uploads, something which goes against the ‘no general obligation to monitor‘ rules present in the Directive.

“The requirement to install a system for filtering electronic communications has twice been rejected by the Court of Justice, in the cases Scarlet Extended (C70/10) and Netlog/Sabam (C 360/10),” the rights groups warn.

“Therefore, a legislative provision that requires internet companies to install a filtering system would almost certainly be rejected by the Court of Justice because it would contravene the requirement that a fair balance be struck between the right to intellectual property on the one hand, and the freedom to conduct business and the right to freedom of expression, such as to receive or impart information, on the other.”

Specifically, the groups note that the proactive filtering of content would violate freedom of expression set out in Article 11 of the Charter of Fundamental Rights. That being the case, the groups expect national courts to disapply it and the rule to be annulled by the Court of Justice.

The latest protests against Article 13 come in the wake of large-scale objections earlier in the year, voicing similar concerns. However, despite the groups’ fears, they have powerful adversaries, each determined to stop the flood of copyrighted content currently being uploaded to the Internet.

Front and center in support of Article 13 is the music industry and its current hot-topic, the so-called Value Gap(1,2,3). The industry feels that platforms like YouTube are able to avoid paying expensive licensing fees (for music in particular) by exploiting the safe harbor protections of the DMCA and similar legislation.

They believe that proactively filtering uploads would significantly help to diminish this problem, which may very well be the case. But at what cost to the general public and the platforms they rely upon? Citizens and scholars feel that freedoms will be affected and it’s likely the outcry will continue.

The ball is now with the EU, whose members will soon have to make what could be the most important decision in recent copyright history. The rights groups, who are urging for Article 13 to be deleted, are clear where they stand.

The full letter is available here (pdf)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Backblaze Release 5.1 – RMM Compatibility for Mass Deployments

Post Syndicated from Yev original https://www.backblaze.com/blog/rmm-for-mass-deployments/

diagram of Backblaze remote monitoring and management

Introducing Backblaze Computer Backup Release 5.1

This is a relatively minor release in terms of the core Backblaze Computer Backup service functionality, but is a big deal for Backblaze for Business as we’ve updated our Mac and PC clients to be RMM (Remote Monitoring and Management) compatible.

What Is New?

  • Updated Mac and PC clients to better handle large file uploads
  • Updated PC downloader to improve stability
  • Added RMM support for PC and Mac clients

What Is RMM?

RMM stands for “Remote Monitoring and Management.” It’s a way to administer computers that might be distributed geographically, without having access to the actual machine. If you are a systems administrator working with anywhere from a few distributed computers to a few thousand, you’re familiar with RMM and how it makes life easier.

The new clients allow administrators to deploy Backblaze Computer Backup through most “silent” installation/mass deployment tools. Two popular RMM tools are Munki and Jamf. We’ve written up knowledge base articles for both of these.

munki logo jamf logo
Learn more about Munki Learn more about Jamf

Do I Need To Use RMM Tools?

No — unless you are a systems administrator or someone who is deploying Backblaze to a lot of people all at once, you do not have to worry about RMM support.

Release Version Number:

Mac:  5.1.0
PC:  5.1.0

Availability:

October 12, 2017

Upgrade Methods:

  • “Check for Updates” on the Backblaze Client (right click on the Backblaze icon and then select “Check for Updates”)
  • Download from: https://secure.backblaze.com/update.htm
  • Auto-update will begin in a couple of weeks
Mac backup update PC backup update
Updating Backblaze on Mac Updating Backblaze on Windows

Questions:

If you have any questions, please contact Backblaze Support at www.backblaze.com/help.

The post Backblaze Release 5.1 – RMM Compatibility for Mass Deployments appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

More on Kaspersky and the Stolen NSA Attack Tools

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/10/more_on_kaspers.html

Both the New York Times and the Washington Post are reporting that Israel has penetrated Kaspersky’s network and detected the Russian operation.

From the New York Times:

Israeli intelligence officers informed the NSA that, in the course of their Kaspersky hack, they uncovered evidence that Russian government hackers were using Kaspersky’s access to aggressively scan for American government classified programs and pulling any findings back to Russian intelligence systems. [Israeli intelligence] provided their NSA counterparts with solid evidence of the Kremlin campaign in the form of screenshots and other documentation, according to the people briefed on the events.

Kaspersky first noticed the Israeli intelligence operation in 2015.

The Washington Post writes about the NSA tools being on the home computer in the first place:

The employee, whose name has not been made public and is under investigation by federal prosecutors, did not intend to pass the material to a foreign adversary. “There wasn’t any malice,” said one person familiar with the case, who, like others interviewed, spoke on the condition of anonymity to discuss an ongoing case. “It’s just that he was trying to complete the mission, and he needed the tools to do it.

I don’t buy this. People with clearances are told over and over not to take classified material home with them. It’s not just mentioned occasionally; it’s a core part of the job.

More news articles.

Pirate Bay is Mining Cryptocurrency Again, No Opt Out

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-bay-is-mining-cryptocurrency-again-no-opt-out-171011/

Last month The Pirate Bay caused some uproar by adding a Javascript-based cryptocurrency miner to its website.

The miner utilizes CPU power from visitors to generate Monero coins for the site, providing an extra source of revenue.

The Pirate Bay only tested the option briefly, but that was enough to inspire many others to follow suit. Now, a few weeks later, Pirate Bay has also turned on the miners again.

The miner is not directly embedded in the site’s core code but runs through an ad script. Many ad blockers and anti-malware tools are stopping these request, but people who don’t use any will see a clear spike in CPU usage when they access the site.

The Pirate Bay team previously said that they were testing the miner to see if it can replace ads. While there is some real revenue potential, for now, it’s running in addition to the regular banners. It’s unclear whether the current mining period is another test or if it will run permanently from now on.

The miner does appear to be throttled to a certain degree, so most users might not even notice that it’s running.

Pirate Bay load requests

Running a cryptocurrency miner such as the Coin-Hive script TPB is currently using is not without risk. Aside from user complaints, there is an issue that may make it harder for the site to operate in the future.

Last week we reported that CDN provider Cloudflare had suspended the account of torrent proxy site ProxyBunker, flagging its coin miner as malware. This means that The Pirate Bay now risks losing the Cloudflare service, which they rely on for DDoS protection, among other things.

Cloudflare’s suspension of ProxyBunker occurred even though the site provided users with an option to disable the miner. This functionality was implemented by Coinhive after the script was misused by some sites, which ran it without alerting their users.

The Pirate Bay currently has no opt-out option, nor has it informed users about the latest mining efforts. This could lead to another problem since Coinhive said it would crack down on customers who failed to keep users in the loop.

“We will verify this opt-in on our servers and will implement it in a way that it can not be circumvented. We will pledge to keep the opt-in intact at all times, without exceptions,” the Coinhive team previously noted.

The Pirate Bay team has not commented on the issue thus far. In theory, it’s possible that a rogue advertiser is responsible for the latest mining efforts. If that’s the case it will be disabled soon enough.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Now Available – Microsoft SQL Server 2017 for Amazon EC2

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-microsoft-sql-server-2017-for-amazon-ec2/

Microsoft SQL Server 2017 (launched just a few days ago) includes lots of powerful new features including support for graph databases, automatic database tuning, and the ability to create clusterless Always On Availability Groups. It can also be run on Linux and in Docker containers.

Run on EC2
I’m happy to announce that you can now launch EC2 instances that run Windows Server 2016 and four editions (Web, Express, Standard, and Enterprise) of SQL Server 2017. The AMIs (Amazon Machine Images) are available today in all AWS Regions and run on a wide variety of EC2 instance types, including the new x1e.32xlarge with 128 vCPUs and almost 4 TB of memory.

You can launch these instances from the AWS Management Console or through AWS Marketplace. Here’s what they look like in the console:

And in AWS Marketplace:

Licensing Options Galore
You have lots of licensing options for SQL Server:

Pay As You Go – This option works well if you would prefer to avoid buying licenses, are already running an older version of SQL Server, and want to upgrade. You don’t have to deal with true-ups, software compliance audits, or Software Assurance and you don’t need to make a long-term purchase. If you are running the Standard Edition of SQL Server, you also benefit from our recent price reduction, with savings of up to 52%.

License Mobility – This option lets your use your active Software Assurance agreement to bring your existing licenses to EC2, and allows you to run SQL Server on Windows or Linux instances.

Bring Your Own Licenses – This option lets you take advantage of your existing license investment while minimizing upgrade costs. You can run SQL Server on EC2 Dedicated Instances or EC2 Dedicated Hosts, with the potential to reduce operating costs by licensing SQL Server on a per-core basis. This option allows you to run SQL Server 2017 on EC2 Linux instances (SUSE, RHEL, and Ubuntu are supported) and also supports Docker-based environments running on EC2 Windows and Linux instances. To learn more about these options, read the Installation Guidance for SQL Server on Linux and Run SQL Server 2017 Container Image with Docker.

Learn More
To learn more about SQL Server 2017 and to explore your licensing options in depth, take a look at the SQL Server on AWS page.

If you need advice and guidance as you plan your migration effort, check out the AWS Partners who have qualified for the Microsoft Workloads competency and focus on database solutions.

Amazon RDS support for SQL Server 2017 is planned for November. This will give you a fully managed option.

Plan to join the AWS team at the PASS Summit (November 1-3 in Seattle) and at AWS re:Invent (November 27th to December 1st in Las Vegas).

Jeff;

PS – Special thanks to my colleague Tom Staab (Partner Solutions Architect) for his help with this post!

Dynamic Users with systemd

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/dynamic-users-with-systemd.html

TL;DR: you may now configure systemd to dynamically allocate a UNIX
user ID for service processes when it starts them and release it when
it stops them. It’s pretty secure, mixes well with transient services,
socket activated services and service templating.

Today we released systemd
235
. Among
other improvements this greatly extends the dynamic user logic of
systemd. Dynamic users are a powerful but little known concept,
supported in its basic form since systemd 232. With this blog story I
hope to make it a bit better known.

The UNIX user concept is the most basic and well-understood security
concept in POSIX operating systems. It is UNIX/POSIX’ primary security
concept, the one everybody can agree on, and most security concepts
that came after it (such as process capabilities, SELinux and other
MACs, user name-spaces, …) in some form or another build on it, extend
it or at least interface with it. If you build a Linux kernel with all
security features turned off, the user concept is pretty much the one
you’ll still retain.

Originally, the user concept was introduced to make multi-user systems
a reality, i.e. systems enabling multiple human users to share the
same system at the same time, cleanly separating their resources and
protecting them from each other. The majority of today’s UNIX systems
don’t really use the user concept like that anymore though. Most of
today’s systems probably have only one actual human user (or even
less!), but their user databases (/etc/passwd) list a good number
more entries than that. Today, the majority of UNIX users in most
environments are system users, i.e. users that are not the technical
representation of a human sitting in front of a PC anymore, but the
security identity a system service — an executable program — runs
as. Event though traditional, simultaneous multi-user systems slowly
became less relevant, their ground-breaking basic concept became the
cornerstone of UNIX security. The OS is nowadays partitioned into
isolated services — and each service runs as its own system user, and
thus within its own, minimal security context.

The people behind the Android OS realized the relevance of the UNIX
user concept as the primary security concept on UNIX, and took its use
even further: on Android not only system services take benefit of the
UNIX user concept, but each UI app gets its own, individual user
identity too — thus neatly separating app resources from each other,
and protecting app processes from each other, too.

Back in the more traditional Linux world things are a bit less
advanced in this area. Even though users are the quintessential UNIX
security concept, allocation and management of system users is still a
pretty limited, raw and static affair. In most cases, RPM or DEB
package installation scripts allocate a fixed number of (usually one)
system users when you install the package of a service that wants to
take benefit of the user concept, and from that point on the system
user remains allocated on the system and is never deallocated again,
even if the package is later removed again. Most Linux distributions
limit the number of system users to 1000 (which isn’t particularly a
lot). Allocating a system user is hence expensive: the number of
available users is limited, and there’s no defined way to dispose of
them after use. If you make use of system users too liberally, you are
very likely to run out of them sooner rather than later.

You may wonder why system users are generally not deallocated when the
package that registered them is uninstalled from a system (at least on
most distributions). The reason for that is one relevant property of
the user concept (you might even want to call this a design flaw):
user IDs are sticky to files (and other objects such as IPC
objects). If a service running as a specific system user creates a
file at some location, and is then terminated and its package and user
removed, then the created file still belongs to the numeric ID (“UID”)
the system user originally got assigned. When the next system user is
allocated and — due to ID recycling — happens to get assigned the same
numeric ID, then it will also gain access to the file, and that’s
generally considered a problem, given that the file belonged to a
potentially very different service once upon a time, and likely should
not be readable or changeable by anything coming after
it. Distributions hence tend to avoid UID recycling which means system
users remain registered forever on a system after they have been
allocated once.

The above is a description of the status quo ante. Let’s now focus on
what systemd’s dynamic user concept brings to the table, to improve
the situation.

Introducing Dynamic Users

With systemd dynamic users we hope to make make it easier and cheaper
to allocate system users on-the-fly, thus substantially increasing the
possible uses of this core UNIX security concept.

If you write a systemd service unit file, you may enable the dynamic
user logic for it by setting the
DynamicUser=
option in its [Service] section to yes. If you do a system user is
dynamically allocated the instant the service binary is invoked, and
released again when the service terminates. The user is automatically
allocated from the UID range 61184–65519, by looking for a so far
unused UID.

Now you may wonder, how does this concept deal with the sticky user
issue discussed above? In order to counter the problem, two strategies
easily come to mind:

  1. Prohibit the service from creating any files/directories or IPC objects

  2. Automatically removing the files/directories or IPC objects the
    service created when it shuts down.

In systemd we implemented both strategies, but for different parts of
the execution environment. Specifically:

  1. Setting DynamicUser=yes implies
    ProtectSystem=strict
    and
    ProtectHome=read-only. These
    sand-boxing options turn off write access to pretty much the whole OS
    directory tree, with a few relevant exceptions, such as the API file
    systems /proc, /sys and so on, as well as /tmp and
    /var/tmp. (BTW: setting these two options on your regular services
    that do not use DynamicUser= is a good idea too, as it drastically
    reduces the exposure of the system to exploited services.)

  2. Setting DynamicUser=yes implies
    PrivateTmp=yes. This
    option sets up /tmp and /var/tmp for the service in a way that it
    gets its own, disconnected version of these directories, that are not
    shared by other services, and whose life-cycle is bound to the
    service’s own life-cycle. Thus if the service goes down, the user is
    removed and all its temporary files and directories with it. (BTW: as
    above, consider setting this option for your regular services that do
    not use DynamicUser= too, it’s a great way to lock things down
    security-wise.)

  3. Setting DynamicUser=yes implies
    RemoveIPC=yes. This
    option ensures that when the service goes down all SysV and POSIX IPC
    objects (shared memory, message queues, semaphores) owned by the
    service’s user are removed. Thus, the life-cycle of the IPC objects is
    bound to the life-cycle of the dynamic user and service, too. (BTW:
    yes, here too, consider using this in your regular services, too!)

With these four settings in effect, services with dynamic users are
nicely sand-boxed. They cannot create files or directories, except in
/tmp and /var/tmp, where they will be removed automatically when
the service shuts down, as will any IPC objects created. Sticky
ownership of files/directories and IPC objects is hence dealt with
effectively.

The
RuntimeDirectory=
option may be used to open up a bit the sandbox to external
programs. If you set it to a directory name of your choice, it will be
created below /run when the service is started, and removed in its
entirety when it is terminated. The ownership of the directory is
assigned to the service’s dynamic user. This way, a dynamic user
service can expose API interfaces (AF_UNIX sockets, …) to other
services at a well-defined place and again bind the life-cycle of it to
the service’s own run-time. Example: set RuntimeDirectory=foobar in
your service, and watch how a directory /run/foobar appears at the
moment you start the service, and disappears the moment you stop
it again. (BTW: Much like the other settings discussed above,
RuntimeDirectory= may be used outside of the DynamicUser= context
too, and is a nice way to run any service with a properly owned,
life-cycle-managed run-time directory.)

Persistent Data

Of course, a service running in such an environment (although already
very useful for many cases!), has a major limitation: it cannot leave
persistent data around it can reuse on a later run. As pretty much the
whole OS directory tree is read-only to it, there’s simply no place it
could put the data that survives from one service invocation to the
next.

With systemd 235 this limitation is removed: there are now three new
settings:
StateDirectory=,
LogsDirectory= and CacheDirectory=. In many ways they operate like
RuntimeDirectory=, but create sub-directories below /var/lib,
/var/log and /var/cache, respectively. There’s one major
difference beyond that however: directories created that way are
persistent, they will survive the run-time cycle of a service, and
thus may be used to store data that is supposed to stay around between
invocations of the service.

Of course, the obvious question to ask now is: how do these three
settings deal with the sticky file ownership problem?

For that we lifted a concept from container managers. Container
managers have a very similar problem: each container and the host
typically end up using a very similar set of numeric UIDs, and unless
user name-spacing is deployed this means that host users might be able
to access the data of specific containers that also have a user by the
same numeric UID assigned, even though it actually refers to a very
different identity in a different context. (Actually, it’s even worse
than just getting access, due to the existence of setuid file bits,
access might translate to privilege elevation.) The way container
managers protect the container images from the host (and from each
other to some level) is by placing the container trees below a
boundary directory, with very restrictive access modes and ownership
(0700 and root:root or so). A host user hence cannot take advantage
of the files/directories of a container user of the same UID inside of
a local container tree, simply because the boundary directory makes it
impossible to even reference files in it. After all on UNIX, in order
to get access to a specific path you need access to every single
component of it.

How is that applied to dynamic user services? Let’s say
StateDirectory=foobar is set for a service that has DynamicUser=
turned off. The instant the service is started, /var/lib/foobar is
created as state directory, owned by the service’s user and remains in
existence when the service is stopped. If the same service now is run
with DynamicUser= turned on, the implementation is slightly
altered. Instead of a directory /var/lib/foobar a symbolic link by
the same path is created (owned by root), pointing to
/var/lib/private/foobar (the latter being owned by the service’s
dynamic user). The /var/lib/private directory is created as boundary
directory: it’s owned by root:root, and has a restrictive access
mode of 0700. Both the symlink and the service’s state directory will
survive the service’s life-cycle, but the state directory will remain,
and continues to be owned by the now disposed dynamic UID — however it
is protected from other host users (and other services which might get
the same dynamic UID assigned due to UID recycling) by the boundary
directory.

The obvious question to ask now is: but if the boundary directory
prohibits access to the directory from unprivileged processes, how can
the service itself which runs under its own dynamic UID access it
anyway? This is achieved by invoking the service process in a slightly
modified mount name-space: it will see most of the file hierarchy the
same way as everything else on the system (modulo /tmp and
/var/tmp as mentioned above), except for /var/lib/private, which
is over-mounted with a read-only tmpfs file system instance, with a
slightly more liberal access mode permitting the service read
access. Inside of this tmpfs file system instance another mount is
placed: a bind mount to the host’s real /var/lib/private/foobar
directory, onto the same name. Putting this together these means that
superficially everything looks the same and is available at the same
place on the host and from inside the service, but two important
changes have been made: the /var/lib/private boundary directory lost
its restrictive character inside the service, and has been emptied of
the state directories of any other service, thus making the protection
complete. Note that the symlink /var/lib/foobar hides the fact that
the boundary directory is used (making it little more than an
implementation detail), as the directory is available this way under
the same name as it would be if DynamicUser= was not used. Long
story short: for the daemon and from the view from the host the
indirection through /var/lib/private is mostly transparent.

This logic of course raises another question: what happens to the
state directory if a dynamic user service is started with a state
directory configured, gets UID X assigned on this first invocation,
then terminates and is restarted and now gets UID Y assigned on the
second invocation, with X ≠ Y? On the second invocation the directory
— and all the files and directories below it — will still be owned by
the original UID X so how could the second instance running as Y
access it? Our way out is simple: systemd will recursively change the
ownership of the directory and everything contained within it to UID Y
before invoking the service’s executable.

Of course, such recursive ownership changing (chown()ing) of whole
directory trees can become expensive (though according to my
experiences, IRL and for most services it’s much cheaper than you
might think), hence in order to optimize behavior in this regard, the
allocation of dynamic UIDs has been tweaked in two ways to avoid the
necessity to do this expensive operation in most cases: firstly, when
a dynamic UID is allocated for a service an allocation loop is
employed that starts out with a UID hashed from the service’s
name. This means a service by the same name is likely to always use
the same numeric UID. That means that a stable service name translates
into a stable dynamic UID, and that means recursive file ownership
adjustments can be skipped (of course, after validation). Secondly, if
the configured state directory already exists, and is owned by a
suitable currently unused dynamic UID, it’s preferably used above
everything else, thus maximizing the chance we can avoid the
chown()ing. (That all said, ultimately we have to face it, the
currently available UID space of 4K+ is very small still, and
conflicts are pretty likely sooner or later, thus a chown()ing has to
be expected every now and then when this feature is used extensively).

Note that CacheDirectory= and LogsDirectory= work very similar to
StateDirectory=. The only difference is that they manage directories
below the /var/cache and /var/logs directories, and their boundary
directory hence is /var/cache/private and /var/log/private,
respectively.

Examples

So, after all this introduction, let’s have a look how this all can be
put together. Here’s a trivial example:

# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
[Service]
ExecStart=/usr/bin/sleep 4711
DynamicUser=yes
EOF
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
   Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
 Main PID: 2967 (sleep)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/dynamic-user-test.service
           └─2967 /usr/bin/sleep 4711

Okt 06 13:12:25 sigma systemd[1]: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
 2967 sleep           dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user

In this example, we create a unit file with DynamicUser= turned on,
start it, check if it’s running correctly, have a look at the service
process’ user (which is named like the service; systemd does this
automatically if the service name is suitable as user name, and you
didn’t configure any user name to use explicitly), stop the service
and verify that the user ceased to exist too.

That’s already pretty cool. Let’s step it up a notch, by doing the
same in an interactive transient service (for those who don’t know
systemd well: a transient service is a service that is defined and
started dynamically at run-time, for example via the systemd-run
command from the shell. Think: run a service without having to write a
unit file first):

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root       root        60  6. Okt 13:21 .
drwxr-xr-x. 1 root       root       852  6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750   8  6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0  6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root  root   66  6. Okt 13:21 .
drwxr-xr-x. 1 root  root  852  6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122   8  6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8  6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test
hello

The above invokes an interactive shell as transient service
run-u15750.service (systemd-run picked that name automatically,
since we didn’t specify anything explicitly) with a dynamic user whose
name is derived automatically from the service name. Because
StateDirectory=wuff is used, a persistent state directory for the
service is made available as /var/lib/wuff. In the interactive shell
running inside the service, the ls commands show the
/var/lib/private boundary directory and its contents, as well as the
symlink that is placed for the service. Finally, before exiting the
shell, a file is created in the state directory. Back in the original
command shell we check if the user is still allocated: it is not, of
course, since the service ceased to exist when we exited the shell and
with it the dynamic user associated with it. From the host we check
the state directory of the service, with similar commands as we did
from inside of it. We see that things are set up pretty much the same
way in both cases, except for two things: first of all the user/group
of the files is now shown as raw numeric UIDs instead of the
user/group names derived from the unit name. That’s because the user
ceased to exist at this point, and “ls” shows the raw UID for files
owned by users that don’t exist. Secondly, the access mode of the
boundary directory is different: when we look at it from outside of
the service it is not readable by anyone but root, when we looked from
inside we saw it it being world readable.

Now, let’s see how things look if we start another transient service,
reusing the state directory from the first invocation:

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
hello
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087  8  6. Okt 13:22 .
drwxr-xr-x. 3 root       root       60  6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087  6  6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit
exit

Here, systemd-run picked a different auto-generated unit name, but
the used dynamic UID is still the same, as it was read from the
pre-existing state directory, and was otherwise unused. As we can see
the test file we generated earlier is accessible and still contains
the data we left in there. Do note that the user name is different
this time (as it is derived from the unit name, which is different),
but the UID it is assigned to is the same one as on the first
invocation. We can thus see that the mentioned optimization of the UID
allocation logic (i.e. that we start the allocation loop from the UID
owner of any existing state directory) took effect, so that no
recursive chown()ing was required.

And that’s the end of our example, which hopefully illustrated a bit
how this concept and implementation works.

Use-cases

Now that we had a look at how to enable this logic for a unit and how
it is implemented, let’s discuss where this actually could be useful
in real life.

  • One major benefit of dynamic user IDs is that running a
    privilege-separated service leaves no artifacts in the system. A
    system user is allocated and made use of, but it is discarded
    automatically in a safe and secure way after use, in a fashion that is
    safe for later recycling. Thus, quickly invoking a short-lived service
    for processing some job can be protected properly through a user ID
    without having to pre-allocate it and without this draining the
    available UID pool any longer than necessary.

  • In many cases, starting a service no longer requires
    package-specific preparation. Or in other words, quite often
    useradd/mkdir/chown/chmod invocations in “post-inst” package
    scripts, as well as
    sysusers.d
    and
    tmpfiles.d
    drop-ins become unnecessary, as the DynamicUser= and
    StateDirectory=/CacheDirectory=/LogsDirectory= logic can do the
    necessary work automatically, on-demand and with a well-defined
    life-cycle.

  • By combining dynamic user IDs with the transient unit concept, new
    creative ways of sand-boxing are made available. For example, let’s say
    you don’t trust the correct implementation of the sort command. You
    can now lock it into a simple, robust, dynamic UID sandbox with a
    simple systemd-run and still integrate it into a shell pipeline like
    any other command. Here’s an example, showcasing a shell pipeline
    whose middle element runs as a dynamically on-the-fly allocated UID,
    that is released when the pipelines ends.

    # cat some-file.txt | systemd-run ---pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
    
  • By combining dynamic user IDs with the systemd templating logic it
    is now possible to do much more fine-grained and fully automatic UID
    management. For example, let’s say you have a template unit file
    /etc/systemd/system/fooba[email protected]:

    [Service]
    ExecStart=/usr/bin/myfoobarserviced
    DynamicUser=1
    StateDirectory=foobar/%i
    

    Now, let’s say you want to start one instance of this service for
    each of your customers. All you need to do now for that is:

    # systemctl enable [email protected] --now
    

    And you are done. (Invoke this as many times as you like, each time
    replacing customerxyz by some customer identifier, you get the
    idea.)

  • By combining dynamic user IDs with socket activation you may easily
    implement a system where each incoming connection is served by a
    process instance running as a different, fresh, newly allocated UID
    within its own sandbox. Here’s an example waldo.socket:

    [Socket]
    ListenStream=2048
    Accept=yes
    

    With a matching [email protected]:

    [Service]
    ExecStart=-/usr/bin/myservicebinary
    DynamicUser=yes
    

    With the two unit files above, systemd will listen on TCP/IP port
    2048, and for each incoming connection invoke a fresh instance of
    [email protected], each time utilizing a different, new,
    dynamically allocated UID, neatly isolated from any other
    instance.

  • Dynamic user IDs combine very well with state-less systems,
    i.e. systems that come up with an unpopulated /etc and /var. A
    service using dynamic user IDs and the StateDirectory=,
    CacheDirectory=, LogsDirectory= and RuntimeDirectory= concepts
    will implicitly allocate the users and directories it needs for
    running, right at the moment where it needs it.

Dynamic users are a very generic concept, hence a multitude of other
uses are thinkable; the list above is just supposed to trigger your
imagination.

What does this mean for you as a packager?

I am pretty sure that a large number of services shipped with today’s
distributions could benefit from using DynamicUser= and
StateDirectory= (and related settings). It often allows removal of
post-inst packaging scripts altogether, as well as any sysusers.d
and tmpfiles.d drop-ins by unifying the needed declarations in the
unit file itself. Hence, as a packager please consider switching your
unit files over. That said, there are a number of conditions where
DynamicUser= and StateDirectory= (and friends) cannot or should
not be used. To name a few:

  1. Service that need to write to files outside of /run/<package>,
    /var/lib/<package>, /var/cache/<package>, /var/log/<package>,
    /var/tmp, /tmp, /dev/shm are generally incompatible with this
    scheme. This rules out daemons that upgrade the system as one example,
    as that involves writing to /usr.

  2. Services that maintain a herd of processes with different user
    IDs. Some SMTP services are like this. If your service has such a
    super-server design, UID management needs to be done by the
    super-server itself, which rules out systemd doing its dynamic UID
    magic for it.

  3. Services which run as root (obviously…) or are otherwise
    privileged.

  4. Services that need to live in the same mount name-space as the host
    system (for example, because they want to establish mount points
    visible system-wide). As mentioned DynamicUser= implies
    ProtectSystem=, PrivateTmp= and related options, which all require
    the service to run in its own mount name-space.

  5. Your focus is older distributions, i.e. distributions that do not
    have systemd 232 (for DynamicUser=) or systemd 235 (for
    StateDirectory= and friends) yet.

  6. If your distribution’s packaging guides don’t allow it. Consult
    your packaging guides, and possibly start a discussion on your
    distribution’s mailing list about this.

Notes

A couple of additional, random notes about the implementation and use
of these features:

  1. Do note that allocating or deallocating a dynamic user leaves
    /etc/passwd untouched. A dynamic user is added into the user
    database through the glibc NSS module
    nss-systemd,
    and this information never hits the disk.

  2. On traditional UNIX systems it was the job of the daemon process
    itself to drop privileges, while the DynamicUser= concept is
    designed around the service manager (i.e. systemd) being responsible
    for that. That said, since v235 there’s a way to marry DynamicUser=
    and such services which want to drop privileges on their own. For
    that, turn on DynamicUser= and set
    User=
    to the user name the service wants to setuid() to. This has the
    effect that systemd will allocate the dynamic user under the specified
    name when the service is started. Then, prefix the command line you
    specify in
    ExecStart=
    with a single ! character. If you do, the user is allocated for the
    service, but the daemon binary is is invoked as root instead of the
    allocated user, under the assumption that the daemon changes its UID
    on its own the right way. Not that after registration the user will
    show up instantly in the user database, and is hence resolvable like
    any other by the daemon process. Example:
    ExecStart=!/usr/bin/mydaemond

  3. You may wonder why systemd uses the UID range 61184–65519 for its
    dynamic user allocations (side note: in hexadecimal this reads as
    0xEF00–0xFFEF). That’s because distributions (specifically Fedora)
    tend to allocate regular users from below the 60000 range, and we
    don’t want to step into that. We also want to stay away from 65535 and
    a bit around it, as some of these UIDs have special meanings (65535 is
    often used as special value for “invalid” or “no” UID, as it is
    identical to the 16bit value -1; 65534 is generally mapped to the
    “nobody” user, and is where some kernel subsystems map unmappable
    UIDs). Finally, we want to stay within the 16bit range. In a user
    name-spacing world each container tends to have much less than the full
    32bit UID range available that Linux kernels theoretically
    provide. Everybody apparently can agree that a container should at
    least cover the 16bit range though — already to include a nobody
    user. (And quite frankly, I am pretty sure assigning 64K UIDs per
    container is nicely systematic, as the the higher 16bit of the 32bit
    UID values this way become a container ID, while the lower 16bit
    become the logical UID within each container, if you still follow what
    I am babbling here…). And before you ask: no this range cannot be
    changed right now, it’s compiled in. We might change that eventually
    however.

  4. You might wonder what happens if you already used UIDs from the
    61184–65519 range on your system for other purposes. systemd should
    handle that mostly fine, as long as that usage is properly registered
    in the user database: when allocating a dynamic user we pick a UID,
    see if it is currently used somehow, and if yes pick a different one,
    until we find a free one. Whether a UID is used right now or not is
    checked through NSS calls. Moreover the IPC object lists are checked to
    see if there are any objects owned by the UID we are about to
    pick. This means systemd will avoid using UIDs you have assigned
    otherwise. Note however that this of course makes the pool of
    available UIDs smaller, and in the worst cases this means that
    allocating a dynamic user might fail because there simply are no
    unused UIDs in the range.

  5. If not specified otherwise the name for a dynamically allocated
    user is derived from the service name. Not everything that’s valid in
    a service name is valid in a user-name however, and in some cases a
    randomized name is used instead to deal with this. Often it makes
    sense to pick the user names to register explicitly. For that use
    User= and choose whatever you like.

  6. If you pick a user name with User= and combine it with
    DynamicUser= and the user already exists statically it will be used
    for the service and the dynamic user logic is automatically
    disabled. This permits automatic up- and downgrades between static and
    dynamic UIDs. For example, it provides a nice way to move a system
    from static to dynamic UIDs in a compatible way: as long as you select
    the same User= value before and after switching DynamicUser= on,
    the service will continue to use the statically allocated user if it
    exists, and only operates in the dynamic mode if it does not. This is
    useful for other cases as well, for example to adapt a service that
    normally would use a dynamic user to concepts that require statically
    assigned UIDs, for example to marry classic UID-based file system
    quota with such services.

  7. systemd always allocates a pair of dynamic UID and GID at the same
    time, with the same numeric ID.

  8. If the Linux kernel had a “shiftfs” or similar functionality,
    i.e. a way to mount an existing directory to a second place, but map
    the exposed UIDs/GIDs in some way configurable at mount time, this
    would be excellent for the implementation of StateDirectory= in
    conjunction with DynamicUser=. It would make the recursive
    chown()ing step unnecessary, as the host version of the state
    directory could simply be mounted into a the service’s mount
    name-space, with a shift applied that maps the directory’s owner to the
    services’ UID/GID. But I don’t have high hopes in this regard, as all
    work being done in this area appears to be bound to user name-spacing
    — which is a concept not used here (and I guess one could say user
    name-spacing is probably more a source of problems than a solution to
    one, but you are welcome to disagree on that).

And that’s all for now. Enjoy your dynamic users!

Yes, Backblaze Just Ordered 100 Petabytes of Hard Drives

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/400-petabytes-cloud-storage/

10 Petabyt vault, 100 Petabytes ordered, 400 Petabytes stored

Backblaze just ordered a 100 petabytes’ worth of hard drives, and yes, we’ll use nearly all of them in Q4. In fact, we’ll begin the process of sourcing the Q1 hard drive order in the next few weeks.

What are we doing with all those hard drives? Let’s take a look.

Our First 10 Petabyte Backblaze Vault

Ken clicked the submit button and 10 Petabytes of Backblaze Cloud Storage came online ready to accept customer data. Ken (aka the Pod Whisperer), is one of our Datacenter Operations Managers at Backblaze and with that one click, he activated Backblaze Vault 1093, which was built with 1,200 Seagate 10 TB drives (model: ST10000NM0086). After formatting and configuration of the disks, there is 10.12 Petabytes of free space remaining for customer data. Back in 2011, when Ken started at Backblaze, he was amazed that we had amassed as much as 10 Petabytes of data storage.

The Seagate 10 TB drives we deployed in vault 1093 are helium-filled drives. We had previously deployed 45 HGST 8 TB helium-filled drives where we learned one of the benefits of using helium drives — they consume less power than traditional air-filled drives. Here’s a quick comparison of the power consumption of several high-density drive models we deploy:

MFR Model Fill Size Idle (1) Operating (2)
Seagate ST8000DM002 Air 8 TB 7.2 watts 9.0 watts
Seagate ST8000NM0055 Air 8 TB 7.6 watts 8.6 watts
HGST HUH728080ALE600 Helium 8 TB 5.1 watts 7.4 watts
Seagate ST10000NM0086 Helium 10 TB 4.8 watts 8.6 watts
(1) Idle: Average Idle in watts as reported by the manufacturer.
(2) Operating: The maximum operational consumption in watts as reported by the manufacturer — typically for read operations.

I’d like 100 Petabytes of Hard Drives To Go, Please

“100 Petabytes should get us through Q4.” — Tim Nufire, Chief Cloud Officer, Backblaze

The 1,200 Seagate 10 TB drives are just the beginning. The next Backblaze Vault will be configured with 12 TB drives which will give us 12.2 petabytes of storage in one vault. We are currently building and adding two to three Backblaze Vaults a month to our cloud storage system, so we are going to need more drives. When we did all of our “drive math,” we decided to place an order for 100 petabytes of hard drives comprised of 10 and 12 TB models. Gleb, our CEO and occasional blogger, exhaled mightily as he signed the biggest purchase order in company history. Wait until he sees the one for Q1.

Enough drives for a 10 petabyte vault

400 Petabytes of Cloud Storage

When we added Backblaze Vault 1093, we crossed over 400 Petabytes of total available storage. For those of you keeping score at home, we reached 350 Petabytes about 3 months ago as you can see in the chart below.

Petabytes of data stored by Backblaze

Backblaze Vault Primer

All of the storage capacity we’ve added in the last two years has been on our Backblaze Vault architecture, with vault 1093 being the 60th one we have placed into service. Each Backblaze Vault is comprised of 20 Backblaze Storage Pods logically grouped together into one storage system. Today, each Storage Pod contains sixty 3 ½” hard drives, giving each vault 1,200 drives. Early vaults were built on Storage Pods with 45 hard drives, for a total of 900 drives in a vault.

A Backblaze Vault accepts data directly from an authenticated user. Each data blob (object, file, group of files) is divided into 20 shards (17 data shards and 3 parity shards) using our erasure coding library. Each of the 20 shards is stored on a different Storage Pod in the vault. At any given time, several vaults stand ready to receive data storage requests.

Drive Stats for the New Drives

In our Q3 2017 Drive Stats report, due out in late October, we’ll start reporting on the 10 TB drives we are adding. It looks like the 12 TB drives will come online in Q4. We’ll also get a better look at the 8 TB consumer and enterprise drives we’ve been following. Stay tuned.

Other Big Data Clouds

We have always been transparent here at Backblaze, including about how much data we store, how we store it, even how much it costs to do so. Very few others do the same. But, if you have information on how much data a company or organization stores in the cloud, let us know in the comments. Please include the source and make sure the data is not considered proprietary. If we get enough tidbits we’ll publish a “big cloud” list.

The post Yes, Backblaze Just Ordered 100 Petabytes of Hard Drives appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

A security review of three NTP implementations

Post Syndicated from corbet original https://lwn.net/Articles/735211/rss

The Core Infrastructure Initiative commissioned security audits of three
network time protocol (NTP) implementations (ntpd, NTPSec, and Chrony) and
has released
the results
. “From a security standpoint (and here at the CII we
are security people), Chrony was the clear winner between these three NTP
implementations. Chrony does not have all of the bells and whistles that
ntpd does, and it doesn’t implement every single option listed in the NTP
specification, but for the vast majority of users this will not matter. If
all you need is an NTP client or server (with or without reference clock),
which is all that most people need, then its security benefits most likely
outweigh any missing features.

AWS Hot Startups – September 2017

Post Syndicated from Tina Barr original https://aws.amazon.com/blogs/aws/aws-hot-startups-september-2017/

As consumers continue to demand faster, simpler, and more on-the-go services, FinTech companies are responding with ever more innovative solutions to fit everyone’s needs and to improve customer experience. This month, we are excited to feature the following startups—all of whom are disrupting traditional financial services in unique ways:

  • Acorns – allowing customers to invest spare change automatically.
  • Bondlinc – improving the bond trading experience for clients, financial institutions, and private banks.
  • Lenda – reimagining homeownership with a secure and streamlined online service.

Acorns (Irvine, CA)

Driven by the belief that anyone can grow wealth, Acorns is relentlessly pursuing ways to help make that happen. Currently the fastest-growing micro-investing app in the U.S., Acorns takes mere minutes to get started and is currently helping over 2.2 million people grow their wealth. And unlike other FinTech apps, Acorns is focused on helping America’s middle class – namely the 182 million citizens who make less than $100,000 per year – and looking after their financial best interests.

Acorns is able to help their customers effortlessly invest their money, little by little, by offering ETF portfolios put together by Dr. Harry Markowitz, a Nobel Laureate in economic sciences. They also offer a range of services, including “Round-Ups,” whereby customers can automatically invest spare change from every day purchases, and “Recurring Investments,” through which customers can set up automatic transfers of just $5 per week into their portfolio. Additionally, Found Money, Acorns’ earning platform, can help anyone spend smarter as the company connects customers to brands like Lyft, Airbnb, and Skillshare, who then automatically invest in customers’ Acorns account.

The Acorns platform runs entirely on AWS, allowing them to deliver a secure and scalable cloud-based experience. By utilizing AWS, Acorns is able to offer an exceptional customer experience and fulfill its core mission. Acorns uses Terraform to manage services such as Amazon EC2 Container Service, Amazon CloudFront, and Amazon S3. They also use Amazon RDS and Amazon Redshift for data storage, and Amazon Glacier to manage document retention.

Acorns is hiring! Be sure to check out their careers page if you are interested.

Bondlinc (Singapore)

Eng Keong, Founder and CEO of Bondlinc, has long wanted to standardize, improve, and automate the traditional workflows that revolve around bond trading. As a former trader at BNP Paribas and Jefferies & Company, E.K. – as Keong is known – had personally seen how manual processes led to information bottlenecks in over-the-counter practices. This drove him, along with future Bondlinc CTO Vincent Caldeira, to start a new service that maximizes efficiency, information distribution, and accessibility for both clients and bankers in the bond market.

Currently, bond trading requires banks to spend a significant amount of resources retrieving data from expensive and restricted institutional sources, performing suitability checks, and attaching required documentation before presenting all relevant information to clients – usually by email. Bankers are often overwhelmed by these time-consuming tasks, which means clients don’t always get proper access to time-sensitive bond information and pricing. Bondlinc bridges this gap between banks and clients by providing a variety of solutions, including easy access to basic bond information and analytics, updates of new issues and relevant news, consolidated management of your portfolio, and a chat function between banker and client. By making the bond market much more accessible to clients, Bondlinc is taking private banking to the next level, while improving efficiency of the banks as well.

As a startup running on AWS since inception, Bondlinc has built and operated its SaaS product by leveraging Amazon EC2, Amazon S3, Elastic Load Balancing, and Amazon RDS across multiple Availability Zones to provide its customers (namely, financial institutions) a highly available and seamlessly scalable product distribution platform. Bondlinc also makes extensive use of Amazon CloudWatch, AWS CloudTrail, and Amazon SNS to meet the stringent operational monitoring, auditing, compliance, and governance requirements of its customers. Bondlinc is currently experimenting with Amazon Lex to build a conversational interface into its mobile application via a chat-bot that provides trading assistance services.

To see how Bondlinc works, request a demo at Bondlinc.com.

Lenda (San Francisco, CA)

Lenda is a digital mortgage company founded by seasoned FinTech entrepreneur Jason van den Brand. Jason wanted to create a smarter, simpler, and more streamlined system for people to either get a mortgage or refinance their homes. With Lenda, customers can find out if they are pre-approved for loans, and receive accurate, real-time mortgage rate quotes from industry-experienced home loan advisors. Lenda’s advisors support customers through the loan process by providing financial advice and guidance for a seamless experience.

Lenda’s innovative platform allows borrowers to complete their home loans online from start to finish. Through a savvy combination of being a direct lender with proprietary technology, Lenda has simplified the mortgage application process to save customers time and money. With an interactive dashboard, customers know exactly where they are in the mortgage process and can manage all of their documents in one place. The company recently received its Series A funding of $5.25 million, and van den Brand shared that most of the capital investment will be used to improve Lenda’s technology and fulfill the company’s mission, which is to reimagine homeownership, starting with home loans.

AWS allows Lenda to scale its business while providing a secure, easy-to-use system for a faster home loan approval process. Currently, Lenda uses Amazon S3, Amazon EC2, Amazon CloudFront, Amazon Redshift, and Amazon WorkSpaces.

Visit Lenda.com to find out more.

Thanks for reading and see you in October for another round of hot startups!

-Tina

‘Daily Stormer’ Termination Haunts Cloudflare in Online Piracy Case

Post Syndicated from Ernesto original https://torrentfreak.com/daily-stormer-termination-haunts-cloudflare-in-online-piracy-case-170929/

Last month Cloudflare CEO Matthew Prince decided to terminate the account of controversial neo-Nazi site Daily Stormer.

“I woke up this morning in a bad mood and decided to kick them off the Internet,” he announced.

While the decision is understandable from an emotional point of view, it’s quite a statement to make as the CEO of one of the largest Internet infrastructure companies. Not least because it goes directly against what many saw as Cloudflare’s core values.

For years on end, Cloudflare has been asked to remove terrorist propaganda, pirate sites, and other controversial content. Each time, Cloudflare replied that it doesn’t take action without a court order. No exceptions.

In addition, Cloudflare repeatedly stressed that it was impossible for them to remove a website from the Internet, at least not permanently. It would only require a simple DNS reconfiguration to get it back up and running.

While the Daily Stormer case has nothing to do with piracy or copyright infringement, it’s now being brought up as important evidence in an ongoing piracy liability case. Adult entertainment publisher ALS Scan views Prince as a “key witness” in the case and wants to depose Cloudflare’s CEO to find out more about his decision.

“Mr. Prince’s statement to the public that Cloudflare kicked neo-Nazis off the internet stand in sharp contrast to Cloudflare’s testimony in this case, where it claims it is powerless to remove content from the Internet,” ALS Scan writes.

The above is part of a recent submission where both parties argue over whether Prince can be deposed or not. Cloudflare wants to prevent this from happening and claims it’s unnecessary, but the adult publisher disagrees.

“By his own admissions, Mr. Prince’s decision to terminate certain users’ accounts was ‘arbitrary,’ the result of him waking up ‘in a bad mood,’ and a decision he made unilaterally as ‘CEO of a major Internet infrastructure corporation’.

“Mr. Prince has made it clear that he is the one who determines the circumstances under which Cloudflare will terminate a user’s account,” ALS Scan adds.

For its part, Cloudflare says that the CEO’s deposition is not needed. This is backed up by a declaration where Prince emphasizes that he has no unique knowledge on the company’s DMCA and repeat infringer policies, issues that directly relate to the case at hand.

“I have no unique personal knowledge regarding Cloudflare’s DMCA policy and procedure, including its repeat infringer policies, or Cloudflare’s published Terms of Service,” Prince informs the court

Prince’s declaration

The adult publisher, however, harps on the fact that the CEO arbitrarily decided to remove one site from the service, while requiring court orders in other instances. They quote from a Wall Street Journal (WSJ) article he wrote and highlight the ‘kick off the internet’ claim, which contradicts earlier statements.

Cloudflare’s lawyers contend that the WSJ article in question was meant to kick off a conversation and shouldn’t be taken literally.

“The WSJ Article was intended as an intellectual exercise to start a conversation regarding censorship and free speech on the internet. The WSJ Article had nothing to do with copyright infringement issues or Cloudflare’s DMCA policy and procedure.

“When Mr. Prince stated in the WSJ Article that ‘[he] helped kick a group of neo-Nazis off the internet last week,’ his comments were intended to illustrate a point – not to be taken literally,” Cloudflare’s legal team adds.

The deposition of Trey Guinn, a technical employee at Cloudflare, confirms that the company doesn’t have the power to cut a site off the Internet. It further suggests that the entire removal of Daily Stormer was in essence a provocation to start a conversation around freedom of speech.

From Guinn’s deposition

Still, since the lawsuit in question revolves around terminating customers, ALS Scan wants to depose Price to find out exactly when clients are terminated, and why he decided to go beyond Couldflare’s usual policy.

“No other employee can testify to Mr. Prince’s decision-making process when it comes to terminating a user’s access. No other employee can offer an explanation as to why The Daily Stormer’s account was terminated while repeat infringers’ accounts are allowed to remain.

“In a case where Mr. Prince’s personal judgment appears to govern even over Cloudflare’s own policies and procedures, Cloudflare cannot meet its heavy burden of demonstrating why he should not be deposed,” ALS Scan’s lawyers add.

To be continued.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Creating a Cost-Efficient Amazon ECS Cluster for Scheduled Tasks

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or other tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances.  Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose these logs to be delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

  • Amazon ECR repository for storing the Docker image to be used in the task definition
  • S3 bucket that holds the transferred logs
  • Task definition, with image name and S3 bucket as environment variables provided via input parameter
  • CloudWatch Events rule
  • Amazon ECS cluster
  • Amazon ECS container instances using Spot Fleet
  • IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

  1. Clone the code from GitHub that performs RDS API calls to retrieve the log files.
    git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
  2. Build and tag the image.
    cd aws-ecs-scheduled-tasks/container-code/src && ls

    Dockerfile		rdslogsshipper.py	requirements.txt

    docker build -t rdslogsshipper .

    Sending build context to Docker daemon 9.728 kB
    Step 1 : FROM python:3
     ---> 41397f4f2887
    Step 2 : WORKDIR /usr/src/app
     ---> Using cache
     ---> 59299c020e7e
    Step 3 : COPY requirements.txt ./
     ---> 8c017e931c3b
    Removing intermediate container df09e1bed9f2
    Step 4 : COPY rdslogsshipper.py /usr/src/app
     ---> 099a49ca4325
    Removing intermediate container 1b1da24a6699
    Step 5 : RUN pip install --no-cache-dir -r requirements.txt
     ---> Running in 3ed98b30901d
    Collecting boto3 (from -r requirements.txt (line 1))
      Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB)
    Collecting botocore (from -r requirements.txt (line 2))
      Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB)
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1))
      Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB)
    Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1))
      Downloading jmespath-0.9.3-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2))
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2))
      Downloading docutils-0.14-py3-none-any.whl (543kB)
    Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2))
      Downloading six-1.10.0-py2.py3-none-any.whl
    Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3
    Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0
     ---> f892d3cb7383
    Removing intermediate container 3ed98b30901d
    Step 6 : COPY . .
     ---> ea7550c04fea
    Removing intermediate container b558b3ebd406
    Successfully built ea7550c04fea
  3. Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
  4. Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
  5. Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
    aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule  --schedule-expression "rate(15 minutes)"

    {
        "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule"
    }
  6. CloudWatch requires IAM permissions to place a task on the Amazon ECS cluster when the CloudWatch event rule is executed, in addition to an IAM role that can be assumed by CloudWatch Events. This is done in three steps:
    1. Create the IAM role to be assumed by CloudWatch.
      aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json

      {
          "Role": {
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17", 
                  "Statement": [
                      {
                          "Action": "sts:AssumeRole", 
                          "Effect": "Allow", 
                          "Principal": {
                              "Service": "events.amazonaws.com"
                          }
                      }
                  ]
              }, 
              "RoleId": "AROAIRYYLDCVZCUACT7FS", 
              "CreateDate": "2017-07-14T22:44:52.627Z", 
              "RoleName": "Test-Role", 
              "Path": "/", 
              "Arn": "arn:aws:iam::12345678901:role/Test-Role"
          }
      }

      The following is an example of the event-role.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                    "Service": "events.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
      aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json

      {
          "Policy": {
              "PolicyName": "test-policy", 
              "CreateDate": "2017-07-14T22:51:20.293Z", 
              "AttachmentCount": 0, 
              "IsAttachable": true, 
              "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", 
              "DefaultVersionId": "v1", 
              "Path": "/", 
              "Arn": "arn:aws:iam::123455678901:policy/test-policy", 
              "UpdateDate": "2017-07-14T22:51:20.293Z"
          }
      }

      The following is an example of the event-policy.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecs:RunTask"
                ],
                "Resource": [
                    "arn:aws:ecs:*::task-definition/"
                ],
                "Condition": {
                    "ArnLike": {
                        "ecs:cluster": "arn:aws:ecs:*::cluster/"
                    }
                }
            }
          ]
      }
    3. Attach the IAM policy to the role.
      aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
  7. Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
    aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"

    {
        "FailedEntries": [], 
        "FailedEntryCount": 0
    }

That’s it. The logs now run based on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run outside of Lambda execution time limits or be not cost-effective for an EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.

Natural Language Processing at Clemson University – 1.1 Million vCPUs & EC2 Spot Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/natural-language-processing-at-clemson-university-1-1-million-vcpus-ec2-spot-instances/

My colleague Sanjay Padhi shared the guest post below in order to recognize an important milestone in the use of EC2 Spot Instances.

Jeff;


A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing, breaking the record for creating the largest high-performance cluster by using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS region. The researchers conducted nearly half a million topic modeling experiments to study how human language is processed by computers. Topic modeling helps in discovering the underlying themes that are present across a collection of documents. Topic models are important because they are used to forecast business trends and help in making policy or funding decisions. These topic models can be run with many different parameters and the goal of the experiments is to explore how these parameters affect the model outputs.

The Experiment
Professor Amy Apon, Co-Director of the Complex Systems, Analytics and Visualization Institute at Clemson University with Professor Alexander Herzog and graduate students Brandon Posey and Christopher Gropp in collaboration with members of the AWS team as well as AWS Partner Omnibond performed the experiments.  They used software infrastructure based on CloudyCluster that provisions high performance computing clusters on dynamically allocated AWS resources using Amazon EC2 Spot Fleet. Spot Fleet is a collection of biddable spot instances in EC2 responsible for maintaining a target capacity specified during the request. The SLURM scheduler was used as an overlay virtual workload manager for the data analytics workflows. The team developed additional provisioning and workflow automation software as shown below for the design and orchestration of the experiments. This setup allowed them to evaluate various topic models on different data sets with massively parallel parameter sweeps on dynamically allocated AWS resources. This framework can easily be used beyond the current study for other scientific applications that use parallel computing.

Ramping to 1.1 Million vCPUs
The figure below shows elastic, automatic expansion of resources as a function of time, in the US East (Northern Virginia) Region. At just after 21:40 (GMT-1) on Aug. 26, 2017, the number of vCPUs utilized was 1,119,196. Clemson researchers also took advantage of the new per-second billing for the EC2 instances that they launched. The vCPU count usage is comparable to the core count on the largest supercomputers in the world.

Here’s the breakdown of the EC2 instance types that they used:

Campus resources at Clemson funded by the National Science Foundation were used to determine an effective configuration for the AWS experiments as compared to campus resources, and the AWS cloud resources complement the campus resources for large-scale experiments.

Meet the Team
Here’s the team that ran the experiment (Professor Alexander Herzog, graduate students Christopher Gropp and Brandon Posey, and Professor Amy Apon):

Professor Apon said about the experiment:

I am absolutely thrilled with the outcome of this experiment. The graduate students on the project are amazing. They used resources from AWS and Omnibond and developed a new software infrastructure to perform research at a scale and time-to-completion not possible with only campus resources. Per-second billing was a key enabler of these experiments.

Boyd Wilson (CEO, Omnibond, member of the AWS Partner Network) told me:

Participating in this project was exciting, seeing how the Clemson team developed a provisioning and workflow automation tool that tied into CloudyCluster to build a huge Spot Fleet supercomputer in a single region in AWS was outstanding.

About the Experiment
The experiments test parameter combinations on a range of topics and other parameters used in the topic model. The topic model outputs are stored in Amazon S3 and are currently being analyzed. The models have been applied to 17 years of computer science journal abstracts (533,560 documents and 32,551,540 words) and full text papers from the NIPS (Neural Information Processing Systems) Conference (2,484 documents and 3,280,697 words). This study allows the research team to systematically measure and analyze the impact of parameters and model selection on model convergence, topic composition and quality.

Looking Forward
This study constitutes an interaction between computer science, artificial intelligence, and high performance computing. Papers describing the full study are being submitted for peer-reviewed publication. I hope that you enjoyed this brief insight into the ways in which AWS is helping to break the boundaries in the frontiers of natural language processing!

Sanjay Padhi, Ph.D, AWS Research and Technical Computing

 

Backing Up WordPress

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/backing-up-wordpress/

WordPress cloud backup
WordPress logo

WordPress is the most popular CMS (Content Management System) for websites, with almost 30% of all websites in the world using WordPress. That’s a lot of sites — over 350 million!

In this post we’ll talk about the different approaches to keeping the data on your WordPress website safe.


Stop the Presses! (Or the Internet!)

As we were getting ready to publish this post, we received news from UpdraftPlus, one of the biggest WordPress plugin developers, that they are supporting Backblaze B2 as a storage solution for their backup plugin. They shipped the update (1.13.9) this week. This is great news for Backblaze customers! UpdraftPlus is also offering a 20% discount to Backblaze customers wishing to purchase or upgrade to UpdraftPlus Premium. The complete information is below.

UpdraftPlus joins backup plugin developer XCloner — Backup and Restore in supporting Backblaze B2. A third developer, BlogVault, also announced their intent to support Backblaze B2. Contact your favorite WordPress backup plugin developer and urge them to support Backblaze B2, as well.

Now, back to our post…


Your WordPress website data is on a web server that’s most likely located in a large data center. You might wonder why it is necessary to have a backup of your website if it’s in a data center. Website data can be lost in a number of ways, including mistakes by the website owner (been there), hacking, or even domain ownership dispute (I’ve seen it happen more than once). A website backup also can provide a history of changes you’ve made to the website, which can be useful. As an overall strategy, it’s best to have a backup of any data that you can’t afford to lose for personal or business reasons.

Your web hosting company might provide backup services as part of your hosting plan. If you are using their service, you should know where and how often your data is being backed up. You don’t want to find out too late that your backup plan was not adequate.

Sites on WordPress.com are automatically backed up by VaultPress (Automattic), which also is available for self-hosted WordPress installations. If you don’t want the work or decisions involved in managing the hosting for your WordPress site, WordPress.com will handle it for you. You do, however, give up some customization abilities, such as the option to add plugins of your own choice.

Very large and active websites might consider WordPress VIP by Automattic, or another premium WordPress hosting service such as Pagely.com.

This post is about backing up self-hosted WordPress sites, so we’ll focus on those options.

WordPress Backup

Backup strategies for WordPress can be divided into broad categories depending on 1) what you back up, 2) when you back up, and 3) where the data is backed up.

With server data, such as with a WordPress installation, you should plan to have three copies of the data (the 3-2-1 backup strategy). The first is the active data on the WordPress web server, the second is a backup stored on the web server or downloaded to your local computer, and the third should be in another location, such as the cloud.

We’ll talk about the different approaches to backing up WordPress, but we recommend using a WordPress plugin to handle your backups. A backup plugin can automate the task, optimize your backup storage space, and alert you of problems with your backups or WordPress itself. We’ll cover plugins in more detail, below.

What to Back Up?

The main components of your WordPress installation are:

You should decide which of these elements you wish to back up. The database is the top priority, as it contains all your website posts and pages (exclusive of media). Your current theme is important, as it likely contains customizations you’ve made. Following those in priority are any other files you’ve customized or made changes to.

You can choose to back up the WordPress core installation and plugins, if you wish, but these files can be downloaded again if necessary from the source, so you might not wish to include them. You likely have all the media files you use on your website on your local computer (which should be backed up), so it is your choice whether to back these up from the server as well.

If you wish to be able to recreate your entire website easily in case of data loss or disaster, you might choose to back up everything, though on a large website this could be a lot of data.

Generally, you should 1) prioritize any file that you’ve customized that you can’t afford to lose, and 2) decide whether you need a copy of everything in order to get your site back up quickly. These choices will determine your backup method and the amount of storage you need.

A good backup plugin for WordPress enables you to specify which files you wish to back up, and even to create separate backups and schedules for different backup contents. That’s another good reason to use a plugin for backing up WordPress.

When to Back Up?

You can back up manually at any time by using the Export tool in WordPress. This is handy if you wish to do a quick backup of your site or parts of it. Since it is manual, however, it is not a part of a dependable backup plan that should be done regularly. If you wish to use this tool, go to Tools, Export, and select what you wish to back up. The output will be an XML file that uses the WordPress Extended RSS format, also known as WXR. You can create a WXR file that contains all of the information on your site or just portions of the site, such as posts or pages by selecting: All content, Posts, Pages, or Media.
Note: You can use WordPress’s Export tool for sites hosted on WordPress.com, as well.

Export instruction for WordPress

Many of the backup plugins we’ll be discussing later also let you do a manual backup on demand in addition to regularly scheduled or continuous backups.

Note:  Another use of the WordPress Export tool and the WXR file is to transfer or clone your website to another server. Once you have exported the WXR file from the website you wish to transfer from, you can import the WXR file from the Tools, Import menu on the new WordPress destination site. Be aware that there are file size limits depending on the settings on your web server. See the WordPress Codex entry for more information. To make this job easier, you may wish to use one of a number of WordPress plugins designed specifically for this task.

You also can manually back up the WordPress MySQL database using a number of tools or a plugin. The WordPress Codex has good information on this. All WordPress plugins will handle this for you and do it automatically. They also typically include tools for optimizing the database tables, which is just good housekeeping.

A dependable backup strategy doesn’t rely on manual backups, which means you should consider using one of the many backup plugins available either free or for purchase. We’ll talk more about them below.

Which Format To Back Up In?

In addition to the WordPress WXR format, plugins and server tools will use various file formats and compression algorithms to store and compress your backup. You may get to choose between zip, tar, tar.gz, tar.gz2, and others. See The Most Common Archive File Formats for more information on these formats.

Select a format that you know you can access and unarchive should you need access to your backup. All of these formats are standard and supported across operating systems, though you might need to download a utility to access the file.

Where To Back Up?

Once you have your data in a suitable format for backup, where do you back it up to?

We want to have multiple copies of our active website data, so we’ll choose more than one destination for our backup data. The backup plugins we’ll discuss below enable you to specify one or more possible destinations for your backup. The possible destinations for your backup include:

A backup folder on your web server
A backup folder on your web server is an OK solution if you also have a copy elsewhere. Depending on your hosting plan, the size of your site, and what you include in the backup, you may or may not have sufficient disk space on the web server. Some backup plugins allow you to configure the plugin to keep only a certain number of recent backups and delete older ones, saving you disk space on the server.
Email to you
Because email servers have size limitations, the email option is not the best one to use unless you use it to specifically back up just the database or your main theme files.
FTP, SFTP, SCP, WebDAV
FTP, SFTP, SCP, and WebDAV are all widely-supported protocols for transferring files over the internet and can be used if you have access credentials to another server or supported storage device that is suitable for storing a backup.
Sync service (Dropbox, SugarSync, Google Drive, OneDrive)
A sync service is another possible server storage location though it can be a pricier choice depending on the plan you have and how much you wish to store.
Cloud storage (Backblaze B2, Amazon S3, Google Cloud, Microsoft Azure, Rackspace)
A cloud storage service can be an inexpensive and flexible option with pay-as-you go pricing for storing backups and other data.

A good website backup strategy would be to have multiple backups of your website data: one in a backup folder on your web hosting server, one downloaded to your local computer, and one in the cloud, such as with Backblaze B2.

If I had to choose just one of these, I would choose backing up to the cloud because it is geographically separated from both your local computer and your web host, it uses fault-tolerant and redundant data storage technologies to protect your data, and it is available from anywhere if you need to restore your site.

Backup Plugins for WordPress

Probably the easiest and most common way to implement a solid backup strategy for WordPress is to use one of the many backup plugins available for WordPress. Fortunately, there are a number of good ones and are available free or in “freemium” plans in which you can use the free version and pay for more features and capabilities only if you need them. The premium options can give you more flexibility in configuring backups or have additional options for where you can store the backups.

How to Choose a WordPress Backup Plugin

screenshot of WordPress plugins search

When considering which plugin to use, you should take into account a number of factors in making your choice.

Is the plugin actively maintained and up-to-date? You can determine this from the listing in the WordPress Plugin Repository. You also can look at reviews and support comments to get an idea of user satisfaction and how well issues are resolved.

Does the plugin work with your web hosting provider? Generally, well-supported plugins do, but you might want to check to make sure there are no issues with your hosting provider.

Does it support the cloud service or protocol you wish to use? This can be determined from looking at the listing in the WordPress Plugin Repository or on the developer’s website. Developers often will add support for cloud services or other backup destinations based on user demand, so let the developer know if there is a feature or backup destination you’d like them to add to their plugin.

Other features and options to consider in choosing a backup plugin are:

  • Whether encryption of your backup data is available
  • What are the options for automatically deleting backups from the storage destination?
  • Can you globally exclude files, folders, and specific types of files from the backup?
  • Do the options for scheduling automatic backups meet your needs for frequency?
  • Can you exclude/include specific database tables (a good way to save space in your backup)?

WordPress Backup Plugins Review

Let’s review a few of the top choices for WordPress backup plugins.

UpdraftPlus

UpdraftPlus is one of the most popular backup plugins for WordPress with over one million active installations. It is available in both free and Premium versions.

UpdraftPlus just released support for Backblaze B2 Cloud Storage in their 1.13.9 update on September 25. According to the developer, support for Backblaze B2 was the most frequent request for a new storage option for their plugin. B2 support is available in their Premium plugin and as a stand-alone update to their standard product.

Note: The developers of UpdraftPlus are offering a special 20% discount to Backblaze customers on the purchase of UpdraftPlus Premium by using the coupon code backblaze20. The discount is valid until the end of Friday, October 6th, 2017.

screenshot of Backblaze B2 cloud backup for WordPress in UpdraftPlus

XCloner — Backup and Restore

XCloner — Backup and Restore is a useful open-source plugin with many options for backing up WordPress.

XCloner supports B2 Cloud Storage in their free plugin.

screenshot of XCloner WordPress Backblaze B2 backup settings

BlogVault

BlogVault describes themselves as a “complete WordPress backup solution.” They offer a free trial of their paid WordPress backup subscription service that features real-time backups of changes to your WordPress site, as well as many other features.

BlogVault has announced their intent to support Backblaze B2 Cloud Storage in a future update.

screenshot of BlogValut WordPress Backup settings

BackWPup

BackWPup is a popular and free option for backing up WordPress. It supports a number of options for storing your backup, including the cloud, FTP, email, or on your local computer.

screenshot of BackWPup WordPress backup settings

WPBackItUp

WPBackItUp has been around since 2012 and is highly rated. It has both free and paid versions.

screenshot of WPBackItUp WordPress backup settings

VaultPress

VaultPress is part of Automattic’s well-known WordPress product, JetPack. You will need a JetPack subscription plan to use VaultPress. There are different pricing plans with different sets of features.

screenshot of VaultPress backup settings

Backup by Supsystic

Backup by Supsystic supports a number of options for backup destinations, encryption, and scheduling.

screenshot of Backup by Supsystic backup settings

BackupWordPress

BackUpWordPress is an open-source project on Github that has a popular and active following and many positive reviews.

screenshot of BackupWordPress WordPress backup settings

BackupBuddy

BackupBuddy, from iThemes, is the old-timer of backup plugins, having been around since 2010. iThemes knows a lot about WordPress, as they develop plugins, themes, utilities, and provide training in WordPress.

BackupBuddy’s backup includes all WordPress files, all files in the WordPress Media library, WordPress themes, and plugins. BackupBuddy generates a downloadable zip file of the entire WordPress website. Remote storage destinations also are supported.

screenshot of BackupBuddy settings

WordPress and the Cloud

Do you use WordPress and back up to the cloud? We’d like to hear about it. We’d also like to hear whether you are interested in using B2 Cloud Storage for storing media files served by WordPress. If you are, we’ll write about it in a future post.

In the meantime, keep your eye out for new plugins supporting Backblaze B2, or better yet, urge them to support B2 if they’re not already.

The Best Backup Strategy is the One You Use

There are other approaches and tools for backing up WordPress that you might use. If you have an approach that works for you, we’d love to hear about it in the comments.

The post Backing Up WordPress appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Of Course Atlus Hit RPCS3’s Patreon Page Over Persona 5

Post Syndicated from Andy original https://torrentfreak.com/of-course-atlus-hit-rpcs3s-patreon-page-over-persona-5-170927/

For the uninitiated, RPCS3 is an open-source Sony PlayStation 3 emulator for PC. This growing and brilliant piece of code was publicly released in 2012 and since then has been under constant development thanks to a decent-sized team of programmers and other contributors.

While all emulation has its challenges, emulating a relatively recent piece of hardware such as Playstation 3 is a massive undertaking. As a result, RPCS3 needs funding. This it achieves through its Patreon page, which currently receives support from 675 patrons to the tune of $3,000 per month.

There’s little doubt that there are plenty of people out there who want the project to succeed. Yesterday, however, things took a turn for the worse when RPCS3 attracted the negative attention of Atlus, the developer behind the utterly beautiful RPG, Persona 5.

According to the RPCS3 team, Atlus filed a DMCA takedown notice with Patreon requesting the removal of the entire RPCS3 page after the team promoted the fact that Persona 5 would be compatible with the under-development emulator.

“The PS3 emulator itself is not infringing on our copyrights and trademarks; however, no version of the P5 game should be playable on this platform; and [the RPCS3] developers are infringing on our IP by making such games playable,” Atlus told Patreon.

Fortunately for everyone involved, Patreon did not storm in and remove the entire page, not least since the page itself didn’t infringe on Atlus’ IP rights. However, Atlus was not happy with the response and attempted to negotiate with the fund-raising platform, noting that in order for Persona 5 to work, the user would have to circumvent the game’s DRM protections.

The RPCS3 team, on the other hand, believe they’re on solid ground, noting that where their main developers live, it is legal to make personal copies of legally purchased games. They concede it may not be legal for everyone, but in any event, that would be irrelevant to the DMCA notice filed against their Patreon page. Indeed, trying to take down an entire fundraiser with a DMCA notice was a significant overreach under the circumstances

According to a statement from the team, ultimately a decision was taken to proceed with caution. In order to avoid a full takedown of their Patreon page, all mentions of Persona 5 were removed from both the fund-raiser and main RPSC3 site yesterday.

The RPSC3 team noted that they had no idea why Atlus targeted their project but an announcement from the developer later shone a little light on the issue.

“We believe that our fans best experience our titles (like Persona 5) on the actual platforms for which they are developed. We don’t want their first experiences to be framerate drops, or crashes, or other issues that can crop up in emulation that we have not personally overseen,” Atlus explained.

While some gamers expressed negative opinions over Atlus’ undoubtedly overbroad actions yesterday, it’s difficult to argue with the developer’s main point. Emulators can be beautiful things but there is no doubt that in many instances they don’t recreate the gaming experience perfectly. Indeed, in some cases when things don’t go to plan, the results can be pretty horrible.

That being said, for whatever reason Atlus has chosen not to release a PC version of this popular title so, as many hardcore emulator fans will tell you (this one included), that’s a bit of a red rag to a bull. The company suggests that it might remedy that situation in the future though, so maybe that’s some consolation.

In the meantime, there’s a significant backlash against Atlus and what it attempted to do to the RPCS3 project and its fund-raising efforts. Some people are threatening never to buy an Atlus game ever again, for example, and that’s their prerogative.

But really – is anyone truly surprised that Atlus reacted in the way it did?

While Persona 5 isn’t available on PC yet, this isn’t an out-of-print game from 1982 that’s about to disappear into the black hole of time because there’s no hardware to play it on. This is a game created for relatively current hardware (bang up to date if you include the PS4 version) that was released April 2017 in the United States, just a handful of months ago.

As such, none of the usual ‘moral’ motivations for emulating games on other platforms exist for Persona 5 and for that reason alone, the decision to heavily mention it in RPCS3 fund-raising efforts was bound to backfire. It doesn’t matter whether emulation or dumping of ROMs is legal in some regions, any company can be expected to wade in when someone threatens their business model.

The stark reality is that when they do, entire projects can be put at risk. In this case, Patreon stepped in to save the day but it could’ve been a lot worse. Martyring the whole project for one game would’ve been a disaster for the team and the public. All that being said, Atlus is unlikely to come out of this on top.

“Whatever people may wish, there’s no way to stop any playable game from being executed on the emulator,” the RPCS3 team note.

“Blacklisting the game? RPCS3 is open-source, any attempt would easily be reversed. Attempting to take down the project? At the time of this post, this and many other games were already playable to their full extent, and again, RPCS3 is and will always be an open-source project.”

The bottom line here is that Atlus’ actions may have left a bit of a bad taste in the mouths of some gamers, but even the most hardcore emulator fan shouldn’t be surprised the company went for the throat on a game so fresh. That being said, there are lessons to be learned.

Atlus could’ve spoken quietly to RPCS3 first, but chose not to. RPCS3, on the other hand, will probably be a little bit more strategic with future game compatibility announcements, given what’s just happened. In the long term, that will help them, since it will ensure longevity for the project.

RPCS3 is needed, there’s no doubt about that, but its true value will only be felt when the PS3 has been consigned to history. At that point people will understand why it was worth all the effort – and the occasional hiccup.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.