Tag Archives: team

Derek Woodroffe’s steampunk tentacle hat

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/steampunk-tentacle-hat/

Halloween: that glorious time of year when you’re officially allowed to make your friends jump out of their skin with your pranks. For those among us who enjoy dressing up, Halloween is also the occasion to go all out with costumes. And so, dear reader, we present to you: a steampunk tentacle hat, created by Derek Woodroffe.

Finished Tenticle hat

Finished Tenticle hat

Extreme Electronics

Derek is an engineer who loves all things electronics. He’s part of Extreme Kits, and he runs the website Extreme Electronics. Raspberry Pi Zero-controlled Tesla coils are Derek’s speciality — he’s even been on one of the Royal Institution’s Christmas Lectures with them! Skip ahead to 15:06 in this video to see Derek in action:

Let There Be Light! // 2016 CHRISTMAS LECTURES with Saiful Islam – Lecture 1

The first Lecture from Professor Saiful Islam’s 2016 series of CHRISTMAS LECTURES, ‘Supercharged: Fuelling the future’. Watch all three Lectures here: http://richannel.org/christmas-lectures 2016 marked the 80th anniversary since the BBC first broadcast the Christmas Lectures on TV. To celebrate, chemist Professor Saiful Islam explores a subject that the lectures’ founder – Michael Faraday – addressed in the very first Christmas Lectures – energy.

Wearables

Wearables are electronically augmented items you can wear. They might take the form of spy eyeglasses, clothes with integrated sensors, or, in this case, headgear adorned with mechanised tentacles.

Why did Derek make this? We’re not entirely sure, but we suspect he’s a fan of the Cthulu mythos. In any case, we were a little astounded by his project. This is how we reacted when Derek tweeted us about it:

Raspberry Pi on Twitter

@ExtElec @extkits This is beyond incredible and completely unexpected.

In fact, we had to recover from a fit of laughter before we actually managed to type this answer.

Making a steampunk tentacle hat

Derek made the ‘skeleton’ of each tentacle out of a net curtain spring, acrylic rings, and four lengths of fishing line. Two servomotors connect to two ends of fishing line each, and pull them to move the tentacle.

net curtain spring and acrylic rings forming a mechanic tentacle skeleton - steampunk tentacle hat by Derek Woodroffe
Two servos connecting to lengths of fishing line - steampunk tentacle hat by Derek Woodroffe

Then he covered the tentacles with nylon stockings and liquid latex, glued suckers cut out of MDF onto them, and mounted them on an acrylic base. The eight motors connect to a Raspberry Pi via an I2C 8-port PWM controller board.

artificial tentacles - steampunk tentacle hat by Derek Woodroffe
8 servomotors connected to a controller board and a raspberry pi- steampunk tentacle hat by Derek Woodroffe

The Pi makes the servos pull the tentacles so that they move in sine waves in both the x and y directions, seemingly of their own accord. Derek cut open the top of a hat to insert the mounted tentacles, and he used more liquid latex to give the whole thing a slimy-looking finish.

steampunk tentacle hat by Derek Woodroffe

Iä! Iä! Cthulhu fhtagn!

You can read more about Derek’s steampunk tentacle hat here. He will be at the Beeston Raspberry Jam in November to show off his build, so if you’re in the Nottingham area, why not drop by?

Wearables for Halloween

This build is already pretty creepy, but just imagine it with a sensor- or camera-powered upgrade that makes the tentacles reach for people nearby. You’d have nightmare fodder for weeks.

With the help of the Raspberry Pi, any Halloween costume can be taken to the next level. How could Pi technology help you to win that coveted ‘Scariest costume’ prize this year? Tell us your ideas in the comments, and be sure to share pictures of you in your get-up with us on Twitter, Facebook, or Instagram.

The post Derek Woodroffe’s steampunk tentacle hat appeared first on Raspberry Pi.

Federate Database User Authentication Easily with IAM and Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/federate-database-user-authentication-easily-with-iam-and-amazon-redshift/

Managing database users though federation allows you to manage authentication and authorization procedures centrally. Amazon Redshift now supports database authentication with IAM, enabling user authentication though enterprise federation. No need to manage separate database users and passwords to further ease the database administration. You can now manage users outside of AWS and authenticate them for access to an Amazon Redshift data warehouse. Do this by integrating IAM authentication and a third-party SAML-2.0 identity provider (IdP), such as AD FS, PingFederate, or Okta. In addition, database users can also be automatically created at their first login based on corporate permissions.

In this post, I demonstrate how you can extend the federation to enable single sign-on (SSO) to the Amazon Redshift data warehouse.

SAML and Amazon Redshift

AWS supports Security Assertion Markup Language (SAML) 2.0, which is an open standard for identity federation used by many IdPs. SAML enables federated SSO, which enables your users to sign in to the AWS Management Console. Users can also make programmatic calls to AWS API actions by using assertions from a SAML-compliant IdP. For example, if you use Microsoft Active Directory for corporate directories, you may be familiar with how Active Directory and AD FS work together to enable federation. For more information, see the Enabling Federation to AWS Using Windows Active Directory, AD FS, and SAML 2.0 AWS Security Blog post.

Amazon Redshift now provides the GetClusterCredentials API operation that allows you to generate temporary database user credentials for authentication. You can set up an IAM permissions policy that generates these credentials for connecting to Amazon Redshift. Extending the IAM authentication, you can configure the federation of AWS access though a SAML 2.0–compliant IdP. An IAM role can be configured to permit the federated users call the GetClusterCredentials action and generate temporary credentials to log in to Amazon Redshift databases. You can also set up policies to restrict access to Amazon Redshift clusters, databases, database user names, and user group.

Amazon Redshift federation workflow

In this post, I demonstrate how you can use a JDBC– or ODBC-based SQL client to log in to the Amazon Redshift cluster using this feature. The SQL clients used with Amazon Redshift JDBC or ODBC drivers automatically manage the process of calling the GetClusterCredentials action, retrieving the database user credentials, and establishing a connection to your Amazon Redshift database. You can also use your database application to programmatically call the GetClusterCredentials action, retrieve database user credentials, and connect to the database. I demonstrate these features using an example company to show how different database users accounts can be managed easily using federation.

The following diagram shows how the SSO process works:

  1. JDBC/ODBC
  2. Authenticate using Corp Username/Password
  3. IdP sends SAML assertion
  4. Call STS to assume role with SAML
  5. STS Returns Temp Credentials
  6. Use Temp Credentials to get Temp cluster credentials
  7. Connect to Amazon Redshift using temp credentials

Walkthrough

Example Corp. is using Active Directory (idp host:demo.examplecorp.com) to manage federated access for users in its organization. It has an AWS account: 123456789012 and currently manages an Amazon Redshift cluster with the cluster ID “examplecorp-dw”, database “analytics” in us-west-2 region for its Sales and Data Science teams. It wants the following access:

  • Sales users can access the examplecorp-dw cluster using the sales_grp database group
  • Sales users access examplecorp-dw through a JDBC-based SQL client
  • Sales users access examplecorp-dw through an ODBC connection, for their reporting tools
  • Data Science users access the examplecorp-dw cluster using the data_science_grp database group.
  • Partners access the examplecorp-dw cluster and query using the partner_grp database group.
  • Partners are not federated through Active Directory and are provided with separate IAM user credentials (with IAM user name examplecorpsalespartner).
  • Partners can connect to the examplecorp-dw cluster programmatically, using language such as Python.
  • All users are automatically created in Amazon Redshift when they log in for the first time.
  • (Optional) Internal users do not specify database user or group information in their connection string. It is automatically assigned.
  • Data warehouse users can use SSO for the Amazon Redshift data warehouse using the preceding permissions.

Step 1:  Set up IdPs and federation

The Enabling Federation to AWS Using Windows Active Directory post demonstrated how to prepare Active Directory and enable federation to AWS. Using those instructions, you can establish trust between your AWS account and the IdP and enable user access to AWS using SSO.  For more information, see Identity Providers and Federation.

For this walkthrough, assume that this company has already configured SSO to their AWS account: 123456789012 for their Active Directory domain demo.examplecorp.com. The Sales and Data Science teams are not required to specify database user and group information in the connection string. The connection string can be configured by adding SAML Attribute elements to your IdP. Configuring these optional attributes enables internal users to conveniently avoid providing the DbUser and DbGroup parameters when they log in to Amazon Redshift.

The user-name attribute can be set up as follows, with a user ID (for example, nancy) or an email address (for example. [email protected]):

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbUser">  
  <AttributeValue>user-name</AttributeValue>
</Attribute>

The AutoCreate attribute can be defined as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/AutoCreate">
    <AttributeValue>true</AttributeValue>
</Attribute>

The sales_grp database group can be included as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbGroups">
    <AttributeValue>sales_grp</AttributeValue>
</Attribute>

For more information about attribute element configuration, see Configure SAML Assertions for Your IdP.

Step 2: Create IAM roles for access to the Amazon Redshift cluster

The next step is to create IAM policies with permissions to call GetClusterCredentials and provide authorization for Amazon Redshift resources. To grant a SQL client the ability to retrieve the cluster endpoint, region, and port automatically, include the redshift:DescribeClusters action with the Amazon Redshift cluster resource in the IAM role.  For example, users can connect to the Amazon Redshift cluster using a JDBC URL without the need to hardcode the Amazon Redshift endpoint:

Previous:  jdbc:redshift://endpoint:port/database

Current:  jdbc:redshift:iam://clustername:region/dbname

Use IAM to create the following policies. You can also use an existing user or role and assign these policies. For example, if you already created an IAM role for IdP access, you can attach the necessary policies to that role. Here is the policy created for sales users for this example:

Sales_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
            ]
        }
    ]
}

The policy uses the following parameter values:

  • Region: us-west-2
  • AWS Account: 123456789012
  • Cluster name: examplecorp-dw
  • Database group: sales_grp
  • IAM role: AIDIODR4TAW7CSEXAMPLE
Policy Statement Description
{
"Effect":"Allow",
"Action":[
"redshift:DescribeClusters"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
]
}

Allow users to retrieve the cluster endpoint, region, and port automatically for the Amazon Redshift cluster examplecorp-dw. This specification uses the resource format arn:aws:redshift:region:account-id:cluster:clustername. For example, the SQL client JDBC can be specified in the format jdbc:redshift:iam://clustername:region/dbname.

For more information, see Amazon Resource Names.

{
"Effect":"Allow",
"Action":[
"redshift:GetClusterCredentials"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
"arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
],
"Condition":{
"StringEquals":{
"aws:userid":"AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
}
}
}

Generates a temporary token to authenticate into the examplecorp-dw cluster. “arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}” restricts the corporate user name to the database user name for that user. This resource is specified using the format: arn:aws:redshift:region:account-id:dbuser:clustername/dbusername.

The Condition block enforces that the AWS user ID should match “AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com”, so that individual users can authenticate only as themselves. The AIDIODR4TAW7CSEXAMPLE role has the Sales_DW_IAM_Policy policy attached.

{
"Effect":"Allow",
"Action":[
"redshift:CreateClusterUser"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
]
}
Automatically creates database users in examplecorp-dw, when they log in for the first time. Subsequent logins reuse the existing database user.
{
"Effect":"Allow",
"Action":[
"redshift:JoinGroup"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
]
}
Allows sales users to join the sales_grp database group through the resource “arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp” that is specified in the format arn:aws:redshift:region:account-id:dbgroup:clustername/dbgroupname.

Similar policies can be created for Data Science users with access to join the data_science_grp group in examplecorp-dw. You can now attach the Sales_DW_IAM_Policy policy to the role that is mapped to IdP application for SSO.
 For more information about how to define the claim rules, see Configuring SAML Assertions for the Authentication Response.

Because partners are not authorized using Active Directory, they are provided with IAM credentials and added to the partner_grp database group. The Partner_DW_IAM_Policy is attached to the IAM users for partners. The following policy allows partners to log in using the IAM user name as the database user name.

Partner_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "redshift:DbUser": "${aws:username}"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/partner_grp"
            ]
        }
    ]
}

redshift:DbUser“: “${aws:username}” forces an IAM user to use the IAM user name as the database user name.

With the previous steps configured, you can now establish the connection to Amazon Redshift through JDBC– or ODBC-supported clients.

Step 3: Set up database user access

Before you start connecting to Amazon Redshift using the SQL client, set up the database groups for appropriate data access. Log in to your Amazon Redshift database as superuser to create a database group, using CREATE GROUP.

Log in to examplecorp-dw/analytics as superuser and create the following groups and users:

CREATE GROUP sales_grp;
CREATE GROUP datascience_grp;
CREATE GROUP partner_grp;

Use the GRANT command to define access permissions to database objects (tables/views) for the preceding groups.

Step 4: Connect to Amazon Redshift using the JDBC SQL client

Assume that sales user “nancy” is using the SQL Workbench client and JDBC driver to log in to the Amazon Redshift data warehouse. The following steps help set up the client and establish the connection:

  1. Download the latest Amazon Redshift JDBC driver from the Configure a JDBC Connection page
  2. Build the JDBC URL with the IAM option in the following format:
    jdbc:redshift:iam://examplecorp-dw:us-west-2/sales_db

Because the redshift:DescribeClusters action is assigned to the preceding IAM roles, it automatically resolves the cluster endpoints and the port. Otherwise, you can specify the endpoint and port information in the JDBC URL, as described in Configure a JDBC Connection.

Identify the following JDBC options for providing the IAM credentials (see the “Prepare your environment” section) and configure in the SQL Workbench Connection Profile:

plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider 
idp_host=demo.examplecorp.com (The name of the corporate identity provider host)
idp_port=443  (The port of the corporate identity provider host)
user=examplecorp\nancy(corporate user name)
password=***(corporate user password)

The SQL workbench configuration looks similar to the following screenshot:

Now, “nancy” can connect to examplecorp-dw by authenticating using the corporate Active Directory. Because the SAML attributes elements are already configured for nancy, she logs in as database user nancy and is assigned the sales_grp. Similarly, other Sales and Data Science users can connect to the examplecorp-dw cluster. A custom Amazon Redshift ODBC driver can also be used to connect using a SQL client. For more information, see Configure an ODBC Connection.

Step 5: Connecting to Amazon Redshift using JDBC SQL Client and IAM Credentials

This optional step is necessary only when you want to enable users that are not authenticated with Active Directory. Partners are provided with IAM credentials that they can use to connect to the examplecorp-dw Amazon Redshift clusters. These IAM users are attached to Partner_DW_IAM_Policy that assigns them to be assigned to the public database group in Amazon Redshift. The following JDBC URLs enable them to connect to the Amazon Redshift cluster:

jdbc:redshift:iam//examplecorp-dw/analytics?AccessKeyID=XXX&SecretAccessKey=YYY&DbUser=examplecorpsalespartner&DbGroup= partner_grp&AutoCreate=true

The AutoCreate option automatically creates a new database user the first time the partner logs in. There are several other options available to conveniently specify the IAM user credentials. For more information, see Options for providing IAM credentials.

Step 6: Connecting to Amazon Redshift using an ODBC client for Microsoft Windows

Assume that another sales user “uma” is using an ODBC-based client to log in to the Amazon Redshift data warehouse using Example Corp Active Directory. The following steps help set up the ODBC client and establish the Amazon Redshift connection in a Microsoft Windows operating system connected to your corporate network:

  1. Download and install the latest Amazon Redshift ODBC driver.
  2. Create a system DSN entry.
    1. In the Start menu, locate the driver folder or folders:
      • Amazon Redshift ODBC Driver (32-bit)
      • Amazon Redshift ODBC Driver (64-bit)
      • If you installed both drivers, you have a folder for each driver.
    2. Choose ODBC Administrator, and then type your administrator credentials.
    3. To configure the driver for all users on the computer, choose System DSN. To configure the driver for your user account only, choose User DSN.
    4. Choose Add.
  3. Select the Amazon Redshift ODBC driver, and choose Finish. Configure the following attributes:
    Data Source Name =any friendly name to identify the ODBC connection 
    Database=analytics
    user=uma(corporate user name)
    Auth Type-Identity Provider: AD FS
    password=leave blank (Windows automatically authenticates)
    Cluster ID: examplecorp-dw
    idp_host=demo.examplecorp.com (The name of the corporate IdP host)

This configuration looks like the following:

  1. Choose OK to save the ODBC connection.
  2. Verify that uma is set up with the SAML attributes, as described in the “Set up IdPs and federation” section.

The user uma can now use this ODBC connection to establish the connection to the Amazon Redshift cluster using any ODBC-based tools or reporting tools such as Tableau. Internally, uma authenticates using the Sales_DW_IAM_Policy  IAM role and is assigned the sales_grp database group.

Step 7: Connecting to Amazon Redshift using Python and IAM credentials

To enable partners, connect to the examplecorp-dw cluster programmatically, using Python on a computer such as Amazon EC2 instance. Reuse the IAM users that are attached to the Partner_DW_IAM_Policy policy defined in Step 2.

The following steps show this set up on an EC2 instance:

  1. Launch a new EC2 instance with the Partner_DW_IAM_Policy role, as described in Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances. Alternatively, you can attach an existing IAM role to an EC2 instance.
  2. This example uses Python PostgreSQL Driver (PyGreSQL) to connect to your Amazon Redshift clusters. To install PyGreSQL on Amazon Linux, use the following command as the ec2-user:
    sudo easy_install pip
    sudo yum install postgresql postgresql-devel gcc python-devel
    sudo pip install PyGreSQL

  1. The following code snippet demonstrates programmatic access to Amazon Redshift for partner users:
    #!/usr/bin/env python
    """
    Usage:
    python redshift-unload-copy.py <config file> <region>
    
    * Copyright 2014, Amazon.com, Inc. or its affiliates. All Rights Reserved.
    *
    * Licensed under the Amazon Software License (the "License").
    * You may not use this file except in compliance with the License.
    * A copy of the License is located at
    *
    * http://aws.amazon.com/asl/
    *
    * or in the "license" file accompanying this file. This file is distributed
    * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
    * express or implied. See the License for the specific language governing
    * permissions and limitations under the License.
    """
    
    import sys
    import pg
    import boto3
    
    REGION = 'us-west-2'
    CLUSTER_IDENTIFIER = 'examplecorp-dw'
    DB_NAME = 'sales_db'
    DB_USER = 'examplecorpsalespartner'
    
    options = """keepalives=1 keepalives_idle=200 keepalives_interval=200
                 keepalives_count=6"""
    
    set_timeout_stmt = "set statement_timeout = 1200000"
    
    def conn_to_rs(host, port, db, usr, pwd, opt=options, timeout=set_timeout_stmt):
        rs_conn_string = """host=%s port=%s dbname=%s user=%s password=%s
                             %s""" % (host, port, db, usr, pwd, opt)
        print "Connecting to %s:%s:%s as %s" % (host, port, db, usr)
        rs_conn = pg.connect(dbname=rs_conn_string)
        rs_conn.query(timeout)
        return rs_conn
    
    def main():
        # describe the cluster and fetch the IAM temporary credentials
        global redshift_client
        redshift_client = boto3.client('redshift', region_name=REGION)
        response_cluster_details = redshift_client.describe_clusters(ClusterIdentifier=CLUSTER_IDENTIFIER)
        response_credentials = redshift_client.get_cluster_credentials(DbUser=DB_USER,DbName=DB_NAME,ClusterIdentifier=CLUSTER_IDENTIFIER,DurationSeconds=3600)
        rs_host = response_cluster_details['Clusters'][0]['Endpoint']['Address']
        rs_port = response_cluster_details['Clusters'][0]['Endpoint']['Port']
        rs_db = DB_NAME
        rs_iam_user = response_credentials['DbUser']
        rs_iam_pwd = response_credentials['DbPassword']
        # connect to the Amazon Redshift cluster
        conn = conn_to_rs(rs_host, rs_port, rs_db, rs_iam_user,rs_iam_pwd)
        # execute a query
        result = conn.query("SELECT sysdate as dt")
        # fetch results from the query
        for dt_val in result.getresult() :
            print dt_val
        # close the Amazon Redshift connection
        conn.close()
    
    if __name__ == "__main__":
        main()

You can save this Python program in a file (redshiftscript.py) and execute it at the command line as ec2-user:

python redshiftscript.py

Now partners can connect to the Amazon Redshift cluster using the Python script, and authentication is federated through the IAM user.

Summary

In this post, I demonstrated how to use federated access using Active Directory and IAM roles to enable single sign-on to an Amazon Redshift cluster. I also showed how partners outside an organization can be managed easily using IAM credentials.  Using the GetClusterCredentials API action, now supported by Amazon Redshift, lets you manage a large number of database users and have them use corporate credentials to log in. You don’t have to maintain separate database user accounts.

Although this post demonstrated the integration of IAM with AD FS and Active Directory, you can replicate this solution across with your choice of SAML 2.0 third-party identity providers (IdP), such as PingFederate or Okta. For the different supported federation options, see Configure SAML Assertions for Your IdP.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to establish federated access to your AWS resources by using Active Directory user attributes.


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-aws-step-functions-state-machines-to-handle-workflow-driven-aws-codepipeline-actions/

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. It offers powerful integration with other AWS services, such as AWS CodeBuildAWS CodeDeployAWS CodeCommit, AWS CloudFormation and with third-party tools such as Jenkins and GitHub. These services make it possible for AWS customers to successfully automate various tasks, including infrastructure provisioning, blue/green deployments, serverless deployments, AMI baking, database provisioning, and release management.

Developers have been able to use CodePipeline to build sophisticated automation pipelines that often require a single CodePipeline action to perform multiple tasks, fork into different execution paths, and deal with asynchronous behavior. For example, to deploy a Lambda function, a CodePipeline action might first inspect the changes pushed to the code repository. If only the Lambda code has changed, the action can simply update the Lambda code package, create a new version, and point the Lambda alias to the new version. If the changes also affect infrastructure resources managed by AWS CloudFormation, the pipeline action might have to create a stack or update an existing one through the use of a change set. In addition, if an update is required, the pipeline action might enforce a safety policy to infrastructure resources that prevents the deletion and replacement of resources. You can do this by creating a change set and having the pipeline action inspect its changes before updating the stack. Change sets that do not conform to the policy are deleted.

This use case is a good illustration of workflow-driven pipeline actions. These are actions that run multiple tasks, deal with async behavior and loops, need to maintain and propagate state, and fork into different execution paths. Implementing workflow-driven actions directly in CodePipeline can lead to complex pipelines that are hard for developers to understand and maintain. Ideally, a pipeline action should perform a single task and delegate the complexity of dealing with workflow-driven behavior associated with that task to a state machine engine. This would make it possible for developers to build simpler, more intuitive pipelines and allow them to use state machine execution logs to visualize and troubleshoot their pipeline actions.

In this blog post, we discuss how AWS Step Functions state machines can be used to handle workflow-driven actions. We show how a CodePipeline action can trigger a Step Functions state machine and how the pipeline and the state machine are kept decoupled through a Lambda function. The advantages of using state machines include:

  • Simplified logic (complex tasks are broken into multiple smaller tasks).
  • Ease of handling asynchronous behavior (through state machine wait states).
  • Built-in support for choices and processing different execution paths (through state machine choices).
  • Built-in visualization and logging of the state machine execution.

The source code for the sample pipeline, pipeline actions, and state machine used in this post is available at https://github.com/awslabs/aws-codepipeline-stepfunctions.

Overview

This figure shows the components in the CodePipeline-Step Functions integration that will be described in this post. The pipeline contains two stages: a Source stage represented by a CodeCommit Git repository and a Prod stage with a single Deploy action that represents the workflow-driven action.

This action invokes a Lambda function (1) called the State Machine Trigger Lambda, which, in turn, triggers a Step Function state machine to process the request (2). The Lambda function sends a continuation token back to the pipeline (3) to continue its execution later and terminates. Seconds later, the pipeline invokes the Lambda function again (4), passing the continuation token received. The Lambda function checks the execution state of the state machine (5,6) and communicates the status to the pipeline. The process is repeated until the state machine execution is complete. Then the Lambda function notifies the pipeline that the corresponding pipeline action is complete (7). If the state machine has failed, the Lambda function will then fail the pipeline action and stop its execution (7). While running, the state machine triggers various Lambda functions to perform different tasks. The state machine and the pipeline are fully decoupled. Their interaction is handled by the Lambda function.

The Deploy State Machine

The sample state machine used in this post is a simplified version of the use case, with emphasis on infrastructure deployment. The state machine will follow distinct execution paths and thus have different outcomes, depending on:

  • The current state of the AWS CloudFormation stack.
  • The nature of the code changes made to the AWS CloudFormation template and pushed into the pipeline.

If the stack does not exist, it will be created. If the stack exists, a change set will be created and its resources inspected by the state machine. The inspection consists of parsing the change set results and detecting whether any resources will be deleted or replaced. If no resources are being deleted or replaced, the change set is allowed to be executed and the state machine completes successfully. Otherwise, the change set is deleted and the state machine completes execution with a failure as the terminal state.

Let’s dive into each of these execution paths.

Path 1: Create a Stack and Succeed Deployment

The Deploy state machine is shown here. It is triggered by the Lambda function using the following input parameters stored in an S3 bucket.

Create New Stack Execution Path

{
    "environmentName": "prod",
    "stackName": "sample-lambda-app",
    "templatePath": "infra/Lambda-template.yaml",
    "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
    "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ"
}

Note that some values used here are for the use case example only. Account-specific parameters like revisionS3Bucket and revisionS3Key will be different when you deploy this use case in your account.

These input parameters are used by various states in the state machine and passed to the corresponding Lambda functions to perform different tasks. For example, stackName is used to create a stack, check the status of stack creation, and create a change set. The environmentName represents the environment (for example, dev, test, prod) to which the code is being deployed. It is used to prefix the name of stacks and change sets.

With the exception of built-in states such as wait and choice, each state in the state machine invokes a specific Lambda function.  The results received from the Lambda invocations are appended to the state machine’s original input. When the state machine finishes its execution, several parameters will have been added to its original input.

The first stage in the state machine is “Check Stack Existence”. It checks whether a stack with the input name specified in the stackName input parameter already exists. The output of the state adds a Boolean value called doesStackExist to the original state machine input as follows:

{
  "doesStackExist": true,
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
}

The following stage, “Does Stack Exist?”, is represented by Step Functions built-in choice state. It checks the value of doesStackExist to determine whether a new stack needs to be created (doesStackExist=true) or a change set needs to be created and inspected (doesStackExist=false).

If the stack does not exist, the states illustrated in green in the preceding figure are executed. This execution path creates the stack, waits until the stack is created, checks the status of the stack’s creation, and marks the deployment successful after the stack has been created. Except for “Stack Created?” and “Wait Stack Creation,” each of these stages invokes a Lambda function. “Stack Created?” and “Wait Stack Creation” are implemented by using the built-in choice state (to decide which path to follow) and the wait state (to wait a few seconds before proceeding), respectively. Each stage adds the results of their Lambda function executions to the initial input of the state machine, allowing future stages to process them.

Path 2: Safely Update a Stack and Mark Deployment as Successful

Safely Update a Stack and Mark Deployment as Successful Execution Path

If the stack indicated by the stackName parameter already exists, a different path is executed. (See the green states in the figure.) This path will create a change set and use wait and choice states to wait until the change set is created. Afterwards, a stage in the execution path will inspect  the resources affected before the change set is executed.

The inspection procedure represented by the “Inspect Change Set Changes” stage consists of parsing the resources affected by the change set and checking whether any of the existing resources are being deleted or replaced. The following is an excerpt of the algorithm, where changeSetChanges.Changes is the object representing the change set changes:

...
var RESOURCES_BEING_DELETED_OR_REPLACED = "RESOURCES-BEING-DELETED-OR-REPLACED";
var CAN_SAFELY_UPDATE_EXISTING_STACK = "CAN-SAFELY-UPDATE-EXISTING-STACK";
for (var i = 0; i < changeSetChanges.Changes.length; i++) {
    var change = changeSetChanges.Changes[i];
    if (change.Type == "Resource") {
        if (change.ResourceChange.Action == "Delete") {
            return RESOURCES_BEING_DELETED_OR_REPLACED;
        }
        if (change.ResourceChange.Action == "Modify") {
            if (change.ResourceChange.Replacement == "True") {
                return RESOURCES_BEING_DELETED_OR_REPLACED;
            }
        }
    }
}
return CAN_SAFELY_UPDATE_EXISTING_STACK;

The algorithm returns different values to indicate whether the change set can be safely executed (CAN_SAFELY_UPDATE_EXISTING_STACK or RESOURCES_BEING_DELETED_OR_REPLACED). This value is used later by the state machine to decide whether to execute the change set and update the stack or interrupt the deployment.

The output of the “Inspect Change Set” stage is shown here.

{
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
  "doesStackExist": true,
  "changeSetName": "prod-sample-lambda-app-change-set-545",
  "changeSetCreationStatus": "complete",
  "changeSetAction": "CAN-SAFELY-UPDATE-EXISTING-STACK"
}

At this point, these parameters have been added to the state machine’s original input:

  • changeSetName, which is added by the “Create Change Set” state.
  • changeSetCreationStatus, which is added by the “Get Change Set Creation Status” state.
  • changeSetAction, which is added by the “Inspect Change Set Changes” state.

The “Safe to Update Infra?” step is a choice state (its JSON spec follows) that simply checks the value of the changeSetAction parameter. If the value is equal to “CAN-SAFELY-UPDATE-EXISTING-STACK“, meaning that no resources will be deleted or replaced, the step will execute the change set by proceeding to the “Execute Change Set” state. The deployment is successful (the state machine completes its execution successfully).

"Safe to Update Infra?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.taskParams.changeSetAction",
          "StringEquals": "CAN-SAFELY-UPDATE-EXISTING-STACK",
          "Next": "Execute Change Set"
        }
      ],
      "Default": "Deployment Failed"
 }

Path 3: Reject Stack Update and Fail Deployment

Reject Stack Update and Fail Deployment Execution Path

If the changeSetAction parameter is different from “CAN-SAFELY-UPDATE-EXISTING-STACK“, the state machine will interrupt the deployment by deleting the change set and proceeding to the “Deployment Fail” step, which is a built-in Fail state. (Its JSON spec follows.) This state causes the state machine to stop in a failed state and serves to indicate to the Lambda function that the pipeline deployment should be interrupted in a fail state as well.

 "Deployment Failed": {
      "Type": "Fail",
      "Cause": "Deployment Failed",
      "Error": "Deployment Failed"
    }

In all three scenarios, there’s a state machine’s visual representation available in the AWS Step Functions console that makes it very easy for developers to identify what tasks have been executed or why a deployment has failed. Developers can also inspect the inputs and outputs of each state and look at the state machine Lambda function’s logs for details. Meanwhile, the corresponding CodePipeline action remains very simple and intuitive for developers who only need to know whether the deployment was successful or failed.

The State Machine Trigger Lambda Function

The Trigger Lambda function is invoked directly by the Deploy action in CodePipeline. The CodePipeline action must pass a JSON structure to the trigger function through the UserParameters attribute, as follows:

{
  "s3Bucket": "codepipeline-StepFunctions-sample",
  "stateMachineFile": "state_machine_input.json"
}

The s3Bucket parameter specifies the S3 bucket location for the state machine input parameters file. The stateMachineFile parameter specifies the file holding the input parameters. By being able to specify different input parameters to the state machine, we make the Trigger Lambda function and the state machine reusable across environments. For example, the same state machine could be called from a test and prod pipeline action by specifying a different S3 bucket or state machine input file for each environment.

The Trigger Lambda function performs two main tasks: triggering the state machine and checking the execution state of the state machine. Its core logic is shown here:

exports.index = function (event, context, callback) {
    try {
        console.log("Event: " + JSON.stringify(event));
        console.log("Context: " + JSON.stringify(context));
        console.log("Environment Variables: " + JSON.stringify(process.env));
        if (Util.isContinuingPipelineTask(event)) {
            monitorStateMachineExecution(event, context, callback);
        }
        else {
            triggerStateMachine(event, context, callback);
        }
    }
    catch (err) {
        failure(Util.jobId(event), callback, context.invokeid, err.message);
    }
}

Util.isContinuingPipelineTask(event) is a utility function that checks if the Trigger Lambda function is being called for the first time (that is, no continuation token is passed by CodePipeline) or as a continuation of a previous call. In its first execution, the Lambda function will trigger the state machine and send a continuation token to CodePipeline that contains the state machine execution ARN. The state machine ARN is exposed to the Lambda function through a Lambda environment variable called stateMachineArn. Here is the code that triggers the state machine:

function triggerStateMachine(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var s3Bucket = Util.actionUserParameter(event, "s3Bucket");
    var stateMachineFile = Util.actionUserParameter(event, "stateMachineFile");
    getStateMachineInputData(s3Bucket, stateMachineFile)
        .then(function (data) {
            var initialParameters = data.Body.toString();
            var stateMachineInputJSON = createStateMachineInitialInput(initialParameters, event);
            console.log("State machine input JSON: " + JSON.stringify(stateMachineInputJSON));
            return stateMachineInputJSON;
        })
        .then(function (stateMachineInputJSON) {
            return triggerStateMachineExecution(stateMachineArn, stateMachineInputJSON);
        })
        .then(function (triggerStateMachineOutput) {
            var continuationToken = { "stateMachineExecutionArn": triggerStateMachineOutput.executionArn };
            var message = "State machine has been triggered: " + JSON.stringify(triggerStateMachineOutput) + ", continuationToken: " + JSON.stringify(continuationToken);
            return continueExecution(Util.jobId(event), continuationToken, callback, message);
        })
        .catch(function (err) {
            console.log("Error triggering state machine: " + stateMachineArn + ", Error: " + err.message);
            failure(Util.jobId(event), callback, context.invokeid, err.message);
        })
}

The Trigger Lambda function fetches the state machine input parameters from an S3 file, triggers the execution of the state machine using the input parameters and the stateMachineArn environment variable, and signals to CodePipeline that the execution should continue later by passing a continuation token that contains the state machine execution ARN. In case any of these operations fail and an exception is thrown, the Trigger Lambda function will fail the pipeline immediately by signaling a pipeline failure through the putJobFailureResult CodePipeline API.

If the Lambda function is continuing a previous execution, it will extract the state machine execution ARN from the continuation token and check the status of the state machine, as shown here.

function monitorStateMachineExecution(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var continuationToken = JSON.parse(Util.continuationToken(event));
    var stateMachineExecutionArn = continuationToken.stateMachineExecutionArn;
    getStateMachineExecutionStatus(stateMachineExecutionArn)
        .then(function (response) {
            if (response.status === "RUNNING") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " is still " + response.status;
                return continueExecution(Util.jobId(event), continuationToken, callback, message);
            }
            if (response.status === "SUCCEEDED") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
                return success(Util.jobId(event), callback, message);
            }
            // FAILED, TIMED_OUT, ABORTED
            var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
            return failure(Util.jobId(event), callback, context.invokeid, message);
        })
        .catch(function (err) {
            var message = "Error monitoring execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + ", Error: " + err.message;
            failure(Util.jobId(event), callback, context.invokeid, message);
        });
}

If the state machine is in the RUNNING state, the Lambda function will send the continuation token back to the CodePipeline action. This will cause CodePipeline to call the Lambda function again a few seconds later. If the state machine has SUCCEEDED, then the Lambda function will notify the CodePipeline action that the action has succeeded. In any other case (FAILURE, TIMED-OUT, or ABORT), the Lambda function will fail the pipeline action.

This behavior is especially useful for developers who are building and debugging a new state machine because a bug in the state machine can potentially leave the pipeline action hanging for long periods of time until it times out. The Trigger Lambda function prevents this.

Also, by having the Trigger Lambda function as a means to decouple the pipeline and state machine, we make the state machine more reusable. It can be triggered from anywhere, not just from a CodePipeline action.

The Pipeline in CodePipeline

Our sample pipeline contains two simple stages: the Source stage represented by a CodeCommit Git repository and the Prod stage, which contains the Deploy action that invokes the Trigger Lambda function. When the state machine decides that the change set created must be rejected (because it replaces or deletes some the existing production resources), it fails the pipeline without performing any updates to the existing infrastructure. (See the failed Deploy action in red.) Otherwise, the pipeline action succeeds, indicating that the existing provisioned infrastructure was either created (first run) or updated without impacting any resources. (See the green Deploy stage in the pipeline on the left.)

The Pipeline in CodePipeline

The JSON spec for the pipeline’s Prod stage is shown here. We use the UserParameters attribute to pass the S3 bucket and state machine input file to the Lambda function. These parameters are action-specific, which means that we can reuse the state machine in another pipeline action.

{
  "name": "Prod",
  "actions": [
      {
          "inputArtifacts": [
              {
                  "name": "CodeCommitOutput"
              }
          ],
          "name": "Deploy",
          "actionTypeId": {
              "category": "Invoke",
              "owner": "AWS",
              "version": "1",
              "provider": "Lambda"
          },
          "outputArtifacts": [],
          "configuration": {
              "FunctionName": "StateMachineTriggerLambda",
              "UserParameters": "{\"s3Bucket\": \"codepipeline-StepFunctions-sample\", \"stateMachineFile\": \"state_machine_input.json\"}"
          },
          "runOrder": 1
      }
  ]
}

Conclusion

In this blog post, we discussed how state machines in AWS Step Functions can be used to handle workflow-driven actions. We showed how a Lambda function can be used to fully decouple the pipeline and the state machine and manage their interaction. The use of a state machine greatly simplified the associated CodePipeline action, allowing us to build a much simpler and cleaner pipeline while drilling down into the state machine’s execution for troubleshooting or debugging.

Here are two exercises you can complete by using the source code.

Exercise #1: Do not fail the state machine and pipeline action after inspecting a change set that deletes or replaces resources. Instead, create a stack with a different name (think of blue/green deployments). You can do this by creating a state machine transition between the “Safe to Update Infra?” and “Create Stack” stages and passing a new stack name as input to the “Create Stack” stage.

Exercise #2: Add wait logic to the state machine to wait until the change set completes its execution before allowing the state machine to proceed to the “Deployment Succeeded” stage. Use the stack creation case as an example. You’ll have to create a Lambda function (similar to the Lambda function that checks the creation status of a stack) to get the creation status of the change set.

Have fun and share your thoughts!

About the Author

Marcilio Mendonca is a Sr. Consultant in the Canadian Professional Services Team at Amazon Web Services. He has helped AWS customers design, build, and deploy best-in-class, cloud-native AWS applications using VMs, containers, and serverless architectures. Before he joined AWS, Marcilio was a Software Development Engineer at Amazon. Marcilio also holds a Ph.D. in Computer Science. In his spare time, he enjoys playing drums, riding his motorcycle in the Toronto GTA area, and spending quality time with his family.

New ‘Coalition Against Piracy’ Will Crack Down on Pirate Streaming Boxes

Post Syndicated from Ernesto original https://torrentfreak.com/new-coalition-against-piracy-will-crack-down-on-pirate-streaming-boxes-171017/

Traditionally there have only been a handful of well-known industry groups fighting online piracy, but this appears to be changing.

Increasingly, major entertainment industry companies are teaming up in various regions to bundle their enforcement efforts against copyright infringement.

Earlier this year the Alliance for Creativity and Entertainment (ACE) was formed by major players including Disney, HBO, and NBCUniversal, and several of the same media giants are also involved in the newly founded Coalition Against Piracy (CAP).

CAP will coordinate anti-piracy efforts in Asia and is backed by CASBAA, Disney, Fox, HBO Asia, NBCUniversal, Premier League, Turner Asia-Pacific, A&E Networks, Astro, BBC Worldwide, National Basketball Association, TV5MONDE, Viacom International, and others.

The coalition has hired Neil Gane as its general manager. Gane is no stranger to anti-piracy work, as he previously served as the MPAA’s regional director in Australasia and was chief of the Australian Federation Against Copyright Theft.

The goal of CAP will be to assist in local enforcement actions against piracy, including the disruption and dismantling of local businesses that facilitate it. Pirate streaming boxes and apps will be among the main targets.

These boxes, which often use the legal Kodi player paired with infringing add-ons, are referred to as illicit streaming devices (ISDs) by industry insiders. They have grown in popularity all around the world and Asia is no exception.

“The prevalence of ISDs across Asia is staggering. The criminals who operate the ISD networks and the pirate websites are profiting from the hard work of talented creators, seriously damaging the legitimate content ecosystem as well as exposing consumers to dangerous malware”, Gane said, quoted by Indian Television.

Gane knows the region well and started his career working for the Hong Kong Police. He sees the pirate streaming box ecosystem as a criminal network which presents a major threat to the entertainment industries.

“This is a highly organized transnational crime with criminal syndicates profiting enormously at the expense of consumers as well as content creators,” Gane noted.

The Asian creative industry is a major growth market as more and more legal content is made available. However, the growth of these legal services is threatened by pirate boxes and apps. The Coalition Against Piracy hopes to curb this.

The launch of CAP, which will be formalized at the upcoming CASBAA anti-piracy convention in November, confirms the trend of localized anti-piracy coalitions which are backed by major industry players. We can expect to hear more from these during the years to come.

Just a few days ago the founding members of the aforementioned ACE anti-piracy initiative filed their first joint lawsuit in the US which, unsurprisingly, targets a seller of streaming boxes.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

How to Compete with Giants

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/how-to-compete-with-giants/

How to Compete with Giants

This post by Backblaze’s CEO and co-founder Gleb Budman is the sixth in a series about entrepreneurship. You can choose posts in the series from the list below:

  1. How Backblaze got Started: The Problem, The Solution, and the Stuff In-Between
  2. Building a Competitive Moat: Turning Challenges Into Advantages
  3. From Idea to Launch: Getting Your First Customers
  4. How to Get Your First 1,000 Customers
  5. Surviving Your First Year
  6. How to Compete with Giants

Use the Join button above to receive notification of new posts in this series.

Perhaps your business is competing in a brand new space free from established competitors. Most of us, though, start companies that compete with existing offerings from large, established companies. You need to come up with a better mousetrap — not the first mousetrap.

That’s the challenge Backblaze faced. In this post, I’d like to share some of the lessons I learned from that experience.

Backblaze vs. Giants

Competing with established companies that are orders of magnitude larger can be daunting. How can you succeed?

I’ll set the stage by offering a few sets of giants we compete with:

  • When we started Backblaze, we offered online backup in a market where companies had been offering “online backup” for at least a decade, and even the newer entrants had raised tens of millions of dollars.
  • When we built our storage servers, the alternatives were EMC, NetApp, and Dell — each of which had a market cap of over $10 billion.
  • When we introduced our cloud storage offering, B2, our direct competitors were Amazon, Google, and Microsoft. You might have heard of them.

What did we learn by competing with these giants on a bootstrapped budget? Let’s take a look.

Determine What Success Means

For a long time Apple considered Apple TV to be a hobby, not a real product worth focusing on, because it did not generate a billion in revenue. For a $10 billion per year revenue company, a new business that generates $50 million won’t move the needle and often isn’t worth putting focus on. However, for a startup, getting to $50 million in revenue can be the start of a wildly successful business.

Lesson Learned: Don’t let the giants set your success metrics.

The Advantages Startups Have

The giants have a lot of advantages: more money, people, scale, resources, access, etc. Following their playbook and attacking head-on means you’re simply outgunned. Common paths to failure are trying to build more features, enter more markets, outspend on marketing, and other similar approaches where scale and resources are the primary determinants of success.

But being a startup affords many advantages most giants would salivate over. As a nimble startup you can leverage those to succeed. Let’s breakdown nine competitive advantages we’ve used that you can too.

1. Drive Focus

It’s hard to build a $10 billion revenue business doing just one thing, and most giants have a broad portfolio of businesses, numerous products for each, and targeting a variety of customer segments in multiple markets. That adds complexity and distributes management attention.

Startups get the benefit of having everyone in the company be extremely focused, often on a singular mission, product, customer segment, and market. While our competitors sell everything from advertising to Zantac, and are investing in groceries and shipping, Backblaze has focused exclusively on cloud storage. This means all of our best people (i.e. everyone) is focused on our cloud storage business. Where is all of your focus going?

Lesson Learned: Align everyone in your company to a singular focus to dramatically out-perform larger teams.

2. Use Lack-of-Scale as an Advantage

You may have heard Paul Graham say “Do things that don’t scale.” There are a host of things you can do specifically because you don’t have the same scale as the giants. Use that as an advantage.

When we look for data center space, we have more options than our largest competitors because there are simply more spaces available with room for 100 cabinets than for 1,000 cabinets. With some searching, we can find data center space that is better/cheaper.

When a flood in Thailand destroyed factories, causing the world’s supply of hard drives to plummet and prices to triple, we started drive farming. The giants certainly couldn’t. It was a bit crazy, but it let us keep prices unchanged for our customers.

Our Chief Cloud Officer, Tim, used to work at Adobe. Because of their size, any new product needed to always launch in a multitude of languages and in global markets. Once launched, they had scale. But getting any new product launched was incredibly challenging.

Lesson Learned: Use lack-of-scale to exploit opportunities that are closed to giants.

3. Build a Better Product

This one is probably obvious. If you’re going to provide the same product, at the same price, to the same customers — why do it? Remember that better does not always mean more features. Here’s one way we built a better product that didn’t require being a bigger company.

All online backup services required customers to choose what to include in their backup. We found that this was complicated for users since they often didn’t know what needed to be backed up. We flipped the model to back up everything and allow users to exclude if they wanted to, but it was not required. This reduced the number of features/options, while making it easier and better for the user.

This didn’t require the resources of a huge company; it just required understanding customers a bit deeper and thinking about the solution differently. Building a better product is the most classic startup competitive advantage.

Lesson Learned: Dig deep with your customers to understand and deliver a better mousetrap.

4. Provide Better Service

How can you provide better service? Use your advantages. Escalations from your customer care folks to engineering can go through fewer hoops. Fixing an issue and shipping can be quicker. Access to real answers on Twitter or Facebook can be more effective.

A strategic decision we made was to have all customer support people as full-time employees in our headquarters. This ensures they are in close contact to the whole company for feedback to quickly go both ways.

Having a smaller team and fewer layers enables faster internal communication, which increases customer happiness. And the option to do things that don’t scale — such as help a customer in a unique situation — can go a long way in building customer loyalty.

Lesson Learned: Service your customers better by establishing clear internal communications.

5. Remove The Unnecessary

After determining that the industry standard EMC/NetApp/Dell storage servers would be too expensive to build our own cloud storage upon, we decided to build our own infrastructure. Many said we were crazy to compete with these multi-billion dollar companies and that it would be impossible to build a lower cost storage server. However, not only did it prove to not be impossible — it wasn’t even that hard.

One key trick? Remove the unnecessary. While EMC and others built servers to sell to other companies for a wide variety of use cases, Backblaze needed servers that only Backblaze would run, and for a single use case. As a result we could tailor the servers for our needs by removing redundancy from each server (since we would run redundant servers), and using lower-performance components (since we would get high-performance by running parallel servers).

What do your customers and use cases not need? This can trim costs and complexity while often improving the product for your use case.

Lesson Learned: Don’t think “what can we add” to what the giants offer — think “what can we remove.”

6. Be Easy

How many times have you visited a large company website, particularly one that’s not consumer-focused, only to leave saying, “Huh? I don’t understand what you do.” Keeping your website clear, and your product and pricing simple, will dramatically increase conversion and customer satisfaction. If you’re able to make it 2x easier and thus increasing your conversion by 2x, you’ve just allowed yourself to spend ½ as much acquiring a customer.

Providing unlimited data backup wasn’t specifically about providing more storage — it was about making it easier. Since users didn’t know how much data they needed to back up, charging per gigabyte meant they wouldn’t know the cost. Providing unlimited data backup meant they could just relax.

Customers love easy — and being smaller makes easy easier to deliver. Use that as an advantage in your website, marketing materials, pricing, product, and in every other customer interaction.

Lesson Learned: Ease-of-use isn’t a slogan: it’s a competitive advantage. Treat it as seriously as any other feature of your product

7. Don’t Be Afraid of Risk

Obviously unnecessary risks are unnecessary, and some risks aren’t worth taking. However, large companies that have given guidance to Wall Street with a $0.01 range on their earning-per-share are inherently going to be very risk-averse. Use risk-tolerance to open up opportunities, and adjust your tolerance level as you scale. In your first year, there are likely an infinite number of ways your business may vaporize; don’t be too worried about taking a risk that might have a 20% downside when the upside is hockey stick growth.

Using consumer-grade hard drives in our servers may have caused pain and suffering for us years down-the-line, but they were priced at approximately 50% of enterprise drives. Giants wouldn’t have considered the option. Turns out, the consumer drives performed great for us.

Lesson Learned: Use calculated risks as an advantage.

8. Be Open

The larger a company grows, the more it wants to hide information. Some of this is driven by regulatory requirements as a public company. But most of this is cultural. Sharing something might cause a problem, so let’s not. All external communication is treated as a critical press release, with rounds and rounds of editing by multiple teams and approvals. However, customers are often desperate for information. Moreover, sharing information builds trust, understanding, and advocates.

I started blogging at Backblaze before we launched. When we blogged about our Storage Pod and open-sourced the design, many thought we were crazy to share this information. But it was transformative for us, establishing Backblaze as a tech thought leader in storage and giving people a sense of how we were able to provide our service at such a low cost.

Over the years we’ve developed a culture of being open internally and externally, on our blog and with the press, and in communities such as Hacker News and Reddit. Often we’ve been asked, “why would you share that!?” — but it’s the continual openness that builds trust. And that culture of openness is incredibly challenging for the giants.

Lesson Learned: Overshare to build trust and brand where giants won’t.

9. Be Human

As companies scale, typically a smaller percent of founders and executives interact with customers. The people who build the company become more hidden, the language feels “corporate,” and customers start to feel they’re interacting with the cliche “faceless, nameless corporation.” Use your humanity to your advantage. From day one the Backblaze About page listed all the founders, and my email address. While contacting us shouldn’t be the first path for a customer support question, I wanted it to be clear that we stand behind the service we offer; if we’re doing something wrong — I want to know it.

To scale it’s important to have processes and procedures, but sometimes a situation falls outside of a well-established process. While we want our employees to follow processes, they’re still encouraged to be human and “try to do the right thing.” How to you strike this balance? Simon Sinek gives a good talk about it: make your employees feel safe. If employees feel safe they’ll be human.

If your customer is a consumer, they’ll appreciate being treated as a human. Even if your customer is a corporation, the purchasing decision-makers are still people.

Lesson Learned: Being human is the ultimate antithesis to the faceless corporation.

Build Culture to Sustain Your Advantages at Scale

Presumably the goal is not to always be competing with giants, but to one day become a giant. Does this mean you’ll lose all of these advantages? Some, yes — but not all. Some of these advantages are cultural, and if you build these into the culture from the beginning, and fight to keep them as you scale, you can keep them as you become a giant.

Tesla still comes across as human, with Elon Musk frequently interacting with people on Twitter. Apple continues to provide great service through their Genius Bar. And, worst case, if you lose these at scale, you’ll still have the other advantages of being a giant such as money, people, scale, resources, and access.

Of course, some new startup will be gunning for you with grand ambitions, so just be sure not to get complacent. 😉

The post How to Compete with Giants appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

More Raspberry Pi labs in West Africa

Post Syndicated from Rachel Churcher original https://www.raspberrypi.org/blog/pi-based-ict-west-africa/

Back in May 2013, we heard from Dominique Laloux about an exciting project to bring Raspberry Pi labs to schools in rural West Africa. Until 2012, 75 percent of teachers there had never used a computer. The project has been very successful, and Dominique has been in touch again to bring us the latest news.

A view of the inside of the new Pi lab building

Preparing the new Pi labs building in Kuma Tokpli, Togo

Growing the project

Thanks to the continuing efforts of a dedicated team of teachers, parents and other supporters, the Centre Informatique de Kuma, now known as INITIC (from the French ‘INItiation aux TIC’), runs two Raspberry Pi labs in schools in Togo, and plans to open a third in December. The second lab was opened last year in Kpalimé, a town in the Plateaux Region in the west of the country.

Student using a Raspberry Pi computer

Using the new Raspberry Pi labs in Kpalimé, Togo

More than 400 students used the new lab intensively during the last school year. Dominique tells us more:

“The report made in early July by the seven teachers who accompanied the students was nothing short of amazing: the young people covered a very impressive number of concepts and skills, from the GUI and the file system, to a solid introduction to word processing and spreadsheets, and many other skills. The lab worked exactly as expected. Its 21 Raspberry Pis worked flawlessly, with the exception of a couple of SD cards that needed re-cloning, and a couple of old screens that needed to be replaced. All the Raspberry Pis worked without a glitch. They are so reliable!”

The teachers and students have enjoyed access to a range of software and resources, all running on Raspberry Pi 2s and 3s.

“Our current aim is to introduce the students to ICT using the Raspberry Pis, rather than introducing them to programming and electronics (a step that will certainly be considered later). We use Ubuntu Mate along with a large selection of applications, from LibreOffice, Firefox, GIMP, Audacity, and Calibre, to special maths, science, and geography applications. There are also special applications such as GnuCash and GanttProject, as well as logic games including PyChess. Since December, students also have access to a local server hosting Kiwix, Wiktionary (a local copy of Wikipedia in four languages), several hundred videos, and several thousand books. They really love it!”

Pi lab upgrade

This summer, INITIC upgraded the equipment in their Pi lab in Kuma Adamé, which has been running since 2014. 21 older model Raspberry Pis were replaced with Pi 2s and 3s, to bring this lab into line with the others, and encourage co-operation between the different locations.

“All 21 first-generation Raspberry Pis worked flawlessly for three years, despite the less-than-ideal conditions in which they were used — tropical conditions, dust, frequent power outages, etc. I brought them all back to Brussels, and they all still work fine. The rationale behind the upgrade was to bring more computing power to the lab, and also to have the same equipment in our two Raspberry Pi labs (and in other planned installations).”

Students and teachers using the upgraded Pi labs in Kuma Adamé

Students and teachers using the upgraded Pi lab in Kuma Adamé

An upgrade of the organisation’s first lab, installed in 2012 in Kuma Tokpli, will be completed in December. This lab currently uses ‘retired’ laptops, which will be replaced with Raspberry Pis and peripherals. INITIC, in partnership with the local community, is also constructing a new building to house the upgraded technology, and the organisation’s third Raspberry Pi lab.

Reliable tech

Dominique has been very impressed with the performance of the Raspberry Pis since 2014.

“Our experience of three years, in two very different contexts, clearly demonstrates that the Raspberry Pi is a very convincing alternative to more ‘conventional’ computers for introducing young students to ICT where resources are scarce. I wish I could convince more communities in the world to invest in such ‘low cost, low consumption, low maintenance’ infrastructure. It really works!”

He goes on to explain that:

“Our goal now is to build at least one new Raspberry Pi lab in another Togolese school each year. That will, of course, depend on how successful we are at gathering the funds necessary for each installation, but we are confident we can convince enough friends to give us the financial support needed for our action.”

A desk with Raspberry Pis and peripherals

Reliable Raspberry Pis in the labs at Kpalimé

Get involved

We are delighted to see the Raspberry Pi being used to bring information technology to new teachers, students, and communities in Togo – it’s wonderful to see this project becoming established and building on its achievements. The mission of the Raspberry Pi Foundation is to put the power of digital making into the hands of people all over the world. Therefore, projects like this, in which people use our tech to fulfil this mission in places with few resources, are wonderful to us.

More information about INITIC and its projects can be found on its website. If you are interested in helping the organisation to meet its goals, visit the How to help page. And if you are involved with a project like this, bringing ICT, computer science, and coding to new places, please tell us about it in the comments below.

The post More Raspberry Pi labs in West Africa appeared first on Raspberry Pi.

Pirate Bay’s Iconic .SE Domain has Expired (Updated)

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-bays-iconic-se-domain-has-expired-and-is-for-sale-171016/

When The Pirate Bay first came online during the summer of 2003, its main point of access was thepiratebay.org.

Since then the site has burnt through more than a dozen domains, trying to evade seizures or other legal threats.

For many years thepiratebay.se operated as the site’s main domain name. Earlier this year the site moved back to the good old .org again, and from the looks of it, TPB is ready to say farewell to the Swedish domain.

Thepiratebay.se expired last week and, if nothing happens, it will be de-activated tomorrow. This means that the site might lose control over a piece of its history.

The torrent site moved from the ORG to the SE domain in 2012, fearing that US authorities would seize the former. Around that time the Department of Homeland Security took hundreds of sites offline and the Pirate Bay team feared that they would be next.

Thepiratebay.se has expired

Ironically, however, the next big threat came from Sweden, the Scandinavian country where the site once started.

In 2013, a local anti-piracy group filed a motion targeting two of The Pirate Bay’s domains, ThePirateBay.se and PirateBay.se. This case that has been dragging on for years now.

During this time TPB moved back and forth between domains but the .se domain turned out to be a safer haven than most alternatives, despite the legal issues. Many other domains were simply seized or suspended without prior notice.

When the Swedish Court of Appeal eventually ruled that The Pirate Bay’s domain had to be confiscated and forfeited to the state, the site’s operators moved back to the .org domain, where it all started.

Although a Supreme Court appeal is still pending, according to a report from IDG earlier this year the court has placed a lock on the domain. This prevents the owner from changing or transferring it, which may explain why it has expired.

The lock is relevant, as the domain not only expired but has also been put of for sale again in the SEDO marketplace, with a minimum bid of $90. This sale would be impossible, if the domain is locked.

Thepiratebay.se for sale

Perhaps the most ironic of all is the fact that TPB moved to .se because it feared that the US controlled .org domain was easy prey.

Fast forward half a decade and over a dozen domains have come and gone while thepiratebay.org still stands strong, despite entertainment industry pressure.

Update: We updated the article to mention that the domain name is locked by the Swedish Supreme Court. This means that it can’t be updated and would explain why it has expired.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Netflix Expands Content Protection Team to Reduce Piracy

Post Syndicated from Ernesto original https://torrentfreak.com/netflix-expands-content-protection-team-to-reduce-piracy-171015/

There is little doubt that, in the United States and many other countries, Netflix has become the standard for watching movies on the Internet.

Despite the widespread availability, however, Netflix originals are widely pirated. Episodes from House of Cards, Narcos, and Orange is the New Black are downloaded and streamed millions of times through unauthorized platforms.

The streaming giant is obviously not happy with this situation and has ramped up its anti-piracy efforts in recent years. Since last year the company has sent out over a million takedown requests to Google alone and this volume continues to expand.

This growth coincides with an expansion of the company’s internal anti-piracy division. A new job posting shows that Netflix is expanding this team with a Copyright and Content Protection Coordinator. The ultimate goal is to reduce piracy to a fringe activity.

“The growing Global Copyright & Content Protection Group is looking to expand its team with the addition of a coordinator,” the job listing reads.

“He or she will be tasked with supporting the Netflix Global Copyright & Content Protection Group in its internal tactical take down efforts with the goal of reducing online piracy to a socially unacceptable fringe activity.”

Among other things, the new coordinator will evaluate new technological solutions to tackle piracy online.

More old-fashioned takedown efforts are also part of the job. This includes monitoring well-known content platforms, search engines and social network sites for pirated content.

“Day to day scanning of Facebook, YouTube, Twitter, Periscope, Google Search, Bing Search, VK, DailyMotion and all other platforms (including live platforms) used for piracy,” is listed as one of the main responsibilities.

Netflix’ Copyright and Content Protection Coordinator Job

The coordinator is further tasked with managing Facebook’s Rights Manager and YouTube’s Content-ID system, to prevent circumvention of these piracy filters. Experience with fingerprinting technologies and other anti-piracy tools will be helpful in this regard.

Netflix doesn’t do all the copyright enforcement on its own though. The company works together with other media giants in the recently launched “Alliance for Creativity and Entertainment” that is spearheaded by the MPAA.

In addition, the company also uses the takedown services of external anti-piracy outfits to target more traditional infringement sources, such as cyberlockers and piracy streaming sites. The coordinator has to keep an eye on these as well.

“Liaise with our vendors on manual takedown requests on linking sites and hosting sites and gathering data on pirate streaming sites, cyberlockers and usenet platforms.”

The above shows that Netflix is doing its best to prevent piracy from getting out of hand. It’s definitely taking the issue more seriously than a few years ago when the company didn’t have much original content.

The switch from being merely a distribution platform to becoming a major content producer and copyright holder has changed the stakes. Netflix hasn’t won the war on piracy, it’s just getting started.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

The Evil Within 2 Used Denuvo, Then Dumped it Before Launch

Post Syndicated from Andy original https://torrentfreak.com/the-evil-within-2-used-denuvo-then-dumped-it-before-launch-171013/

At the end of September we reported on a nightmare scenario for videogame anti-tamper technology Denuvo.

With cracking groups chipping away at the system for the past few months, progressing in leaps and bounds, the race to the bottom was almost complete. After aiming to hold off pirates for the first few lucrative weeks and months after launch, the Denuvo-protected Total War: Warhammer 2 fell to pirates in a matter of hours.

In the less than two weeks that have passed since, things haven’t improved much. By most measurements, in fact, the situation appears to have gotten worse.

On Wednesday, action role-playing game Middle Earth: Shadow of War was cracked a day after launch. While this didn’t beat the record set by Warhammer 2, the scene was given an unexpected gift.

Instead of the crack appearing courtesy of scene groups STEAMPUNKS or CPY, which has largely been the tradition thus far this year, old favorite CODEX stepped up to the mark with their own efforts. This means there are now close to half a dozen entities with the ability to defeat Denuvo, which isn’t a good look for the anti-piracy outfit.

A CODEX crack for Denuvo, from nowhere

Needless to say, this development was met with absolute glee by pirates, who forgave the additional day taken to crack the game in order to welcome CODEX into the anti-Denuvo club. But while this is bad news for the anti-tamper technology, there could be a worse enemy crossing the horizon – no confidence.

This Tuesday, DSO Gaming reported that it had received a review copy of Bethesda’s then-upcoming survival horror game, The Evil Within 2. The site, which is often a reliable source for Denuvo-related news, confirmed that the code was indeed protected by Denuvo.

“Another upcoming title that will be using Denuvo is The Evil Within 2,” the site reported. “Bethesda has provided us with a review code for The Evil Within 2. As such, we can confirm that Denuvo is present in it.”

As you read this, October 13, 2017, The Evil Within 2 is enjoying its official worldwide launch. Early yesterday afternoon, however, the title leaked early onto the Internet, courtesy of cracking group CODEX.

At first view, it looked like CODEX had cracked Denuvo before the game’s official launch but the reality was somewhat different after the dust had settled. For reasons best known to developer Bethesda, Denuvo was completely absent from the title. As shown by the title’s NFO (information) file, the only protection present was that provided by Steam.

Denuvo? What Denuvo?

This raises a number of scenarios, none of them good for Denuvo.

One possibility is that all along Bethesda never intended to use Denuvo on the final release. Exactly why we’ll likely never know, but the theory doesn’t really gel with them including it in the review code reviewed by DSO Gaming earlier this week.

The other proposition is that Bethesda witnessed the fiasco around Denuvo’s ‘protection’ in recent days and decided not to invest in something that wasn’t going to provide value for money.

Of course, these theories are going to be pretty difficult to confirm. Denuvo are a pretty confident bunch when things are going their way but they go suspiciously quiet when the tide is turning. Equally, developers tend to keep quiet about their anti-piracy strategies too.

The bottom line though is that if the protection really works and turns in valuable cash, why wouldn’t Bethesda use it as they have done on previous titles including Doom and Prey?

With that question apparently answering itself at the moment, all eyes now turn to Denuvo. Although it has a history of being one of the most successful anti-piracy systems overall, it has taken a massive battering in recent times. Will it recover? Only time will tell but at the moment things couldn’t get much worse.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Popcorn Time Creator Readies BitTorrent & Blockchain-Powered Video Platform

Post Syndicated from Andy original https://torrentfreak.com/popcorn-time-creator-readies-bittorrent-blockchain-powered-youtube-competitor-171012/

Without a doubt, YouTube is one of the most important websites available on the Internet today.

Its massive archive of videos brings pleasure to millions on a daily basis but its centralized nature means that owner Google always exercises control.

Over the years, people have looked to decentralize the YouTube concept and the latest project hoping to shake up the market has a particularly interesting player onboard.

Until 2015, only insiders knew that Argentinian designer Federico Abad was actually ‘Sebastian’, the shadowy figure behind notorious content sharing platform Popcorn Time.

Now he’s part of the team behind Flixxo, a BitTorrent and blockchain-powered startup hoping to wrestle a share of the video market from YouTube. Here’s how the team, which features blockchain startup RSK Labs, hope things will play out.

The Flixxo network will have no centralized storage of data, eliminating the need for expensive hosting along with associated costs. Instead, transfers will take place between peers using BitTorrent, meaning video content will be stored on the machines of Flixxo users. In practice, the content will be downloaded and uploaded in much the same way as users do on The Pirate Bay or indeed Abad’s baby, Popcorn Time.

However, there’s a twist to the system that envisions content creators, content consumers, and network participants (seeders) making revenue from their efforts.

At the heart of the Flixxo system are digital tokens (think virtual currency), called Flixx. These Flixx ‘coins’, which will go on sale in 12 days, can be used to buy access to content. Creators can also opt to pay consumers when those people help to distribute their content to others.

“Free from structural costs, producers can share the earnings from their content with the network that supports them,” the team explains.

“This way you get paid for helping us improve Flixxo, and you earn credits (in the form of digital tokens called Flixx) for watching higher quality content. Having no intermediaries means that the price you pay for watching the content that you actually want to watch is lower and fairer.”

The Flixxo team

In addition to earning tokens from helping to distribute content, people in the Flixxo ecosystem can also earn currency by watching sponsored content, i.e advertisements. While in a traditional system adverts are often considered a nuisance, Flixx tokens have real value, with a promise that users will be able to trade their Flixx not only for videos, but also for tangible and semi-tangible goods.

“Use your Flixx to reward the producers you follow, encouraging them to create more awesome content. Or keep your Flixx in your wallet and use them to buy a movie ticket, a pair of shoes from an online retailer, a chest of coins in your favourite game or even convert them to old-fashioned cash or up-and-coming digital assets, like Bitcoin,” the team explains.

The Flixxo team have big plans. After foundation in early 2016, the second quarter of 2017 saw the completion of a functional alpha release. In a little under two weeks, the project will begin its token generation event, with new offices in Los Angeles planned for the first half of 2018 alongside a premiere of the Flixxo platform.

“A total of 1,000,000,000 (one billion) Flixx tokens will be issued. A maximum of 300,000,000 (three hundred million) tokens will be sold. Some of these tokens (not more than 33% or 100,000,000 Flixx) may be sold with anticipation of the token allocation event to strategic investors,” Flixxo states.

Like all content platforms, Flixxo will live or die by the quality of the content it provides and whether, at least in the first instance, it can persuade people to part with their hard-earned cash. Only time will tell whether its content will be worth a premium over readily accessible YouTube content but with much-reduced costs, it may tempt creators seeking a bigger piece of the pie.

“Flixxo will also educate its community, teaching its users that in this new internet era value can be held and transferred online without intermediaries, a value that can be earned back by participating in a community, by contributing, being rewarded for every single social interaction,” the team explains.

Of course, the elephant in the room is what will happen when people begin sharing copyrighted content via Flixxo. Certainly, the fact that Popcorn Time’s founder is a key player and rival streaming platform Stremio is listed as a partner means that things could get a bit spicy later on.

Nevertheless, the team suggests that piracy and spam content distribution will be limited by mechanisms already built into the system.

“[A]uthors have to time-block tokens in a smart contract (set as a warranty) in order to upload content. This contract will also handle and block their earnings for a certain period of time, so that in the case of a dispute the unfair-uploader may lose those tokens,” they explain.

That being said, Flixxo also says that “there is no way” for third parties to censor content “which means that anyone has the chance of making any piece of media available on the network.” However, Flixxo says it will develop tools for filtering what it describes as “inappropriate content.”

At this point, things start to become a little unclear. On the one hand Flixxo says it could become a “revolutionary tool for uncensorable and untraceable media” yet on the other it says that it’s necessary to ensure that adult content, for example, isn’t seen by kids.

“We know there is a thin line between filtering or curating content and censorship, and it is a fact that we have an open network for everyone to upload any content. However, Flixxo as a platform will apply certain filtering based on clear rules – there should be a behavior-code for uploaders in order to offer the right content to the right user,” Flixxo explains.

To this end, Flixxo says it will deploy a centralized curation function, carried out by 101 delegates elected by the community, which will become progressively decentralized over time.

“This curation will have a cost, paid in Flixx, and will be collected from the warranty blocked by the content uploaders,” they add.

There can be little doubt that if Flixxo begins ‘curating’ unsuitable content, copyright holders will call on it to do the same for their content too. And, if the platform really takes off, 101 curators probably won’t scratch the surface. There’s also the not inconsiderable issue of what might happen to curators’ judgment when they’re incentivized to block curate content.

Finally, for those sick of “not available in your region” messages, there’s good and bad news. Flixxo insists there will be no geo-blocking of content on its part but individual creators will still have that feature available to them, should they choose.

The Flixx whitepaper can be downloaded here (pdf)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Introducing Gluon: a new library for machine learning from AWS and Microsoft

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-library-for-machine-learning-from-aws-and-microsoft/

Post by Dr. Matt Wood

Today, AWS and Microsoft announced Gluon, a new open source deep learning interface which allows developers to more easily and quickly build machine learning models, without compromising performance.

Gluon Logo

Gluon provides a clear, concise API for defining machine learning models using a collection of pre-built, optimized neural network components. Developers who are new to machine learning will find this interface more familiar to traditional code, since machine learning models can be defined and manipulated just like any other data structure. More seasoned data scientists and researchers will value the ability to build prototypes quickly and utilize dynamic neural network graphs for entirely new model architectures, all without sacrificing training speed.

Gluon is available in Apache MXNet today, a forthcoming Microsoft Cognitive Toolkit release, and in more frameworks over time.

Neural Networks vs Developers
Machine learning with neural networks (including ‘deep learning’) has three main components: data for training; a neural network model, and an algorithm which trains the neural network. You can think of the neural network in a similar way to a directed graph; it has a series of inputs (which represent the data), which connect to a series of outputs (the prediction), through a series of connected layers and weights. During training, the algorithm adjusts the weights in the network based on the error in the network output. This is the process by which the network learns; it is a memory and compute intensive process which can take days.

Deep learning frameworks such as Caffe2, Cognitive Toolkit, TensorFlow, and Apache MXNet are, in part, an answer to the question ‘how can we speed this process up? Just like query optimizers in databases, the more a training engine knows about the network and the algorithm, the more optimizations it can make to the training process (for example, it can infer what needs to be re-computed on the graph based on what else has changed, and skip the unaffected weights to speed things up). These frameworks also provide parallelization to distribute the computation process, and reduce the overall training time.

However, in order to achieve these optimizations, most frameworks require the developer to do some extra work: specifically, by providing a formal definition of the network graph, up-front, and then ‘freezing’ the graph, and just adjusting the weights.

The network definition, which can be large and complex with millions of connections, usually has to be constructed by hand. Not only are deep learning networks unwieldy, but they can be difficult to debug and it’s hard to re-use the code between projects.

The result of this complexity can be difficult for beginners and is a time-consuming task for more experienced researchers. At AWS, we’ve been experimenting with some ideas in MXNet around new, flexible, more approachable ways to define and train neural networks. Microsoft is also a contributor to the open source MXNet project, and were interested in some of these same ideas. Based on this, we got talking, and found we had a similar vision: to use these techniques to reduce the complexity of machine learning, making it accessible to more developers.

Enter Gluon: dynamic graphs, rapid iteration, scalable training
Gluon introduces four key innovations.

  1. Friendly API: Gluon networks can be defined using a simple, clear, concise code – this is easier for developers to learn, and much easier to understand than some of the more arcane and formal ways of defining networks and their associated weighted scoring functions.
  2. Dynamic networks: the network definition in Gluon is dynamic: it can bend and flex just like any other data structure. This is in contrast to the more common, formal, symbolic definition of a network which the deep learning framework has to effectively carve into stone in order to be able to effectively optimizing computation during training. Dynamic networks are easier to manage, and with Gluon, developers can easily ‘hybridize’ between these fast symbolic representations and the more friendly, dynamic ‘imperative’ definitions of the network and algorithms.
  3. The algorithm can define the network: the model and the training algorithm are brought much closer together. Instead of separate definitions, the algorithm can adjust the network dynamically during definition and training. Not only does this mean that developers can use standard programming loops, and conditionals to create these networks, but researchers can now define even more sophisticated algorithms and models which were not possible before. They are all easier to create, change, and debug.
  4. High performance operators for training: which makes it possible to have a friendly, concise API and dynamic graphs, without sacrificing training speed. This is a huge step forward in machine learning. Some frameworks bring a friendly API or dynamic graphs to deep learning, but these previous methods all incur a cost in terms of training speed. As with other areas of software, abstraction can slow down computation since it needs to be negotiated and interpreted at run time. Gluon can efficiently blend together a concise API with the formal definition under the hood, without the developer having to know about the specific details or to accommodate the compiler optimizations manually.

The team here at AWS, and our collaborators at Microsoft, couldn’t be more excited to bring these improvements to developers through Gluon. We’re already seeing quite a bit of excitement from developers and researchers alike.

Getting started with Gluon
Gluon is available today in Apache MXNet, with support coming for the Microsoft Cognitive Toolkit in a future release. We’re also publishing the front-end interface and the low-level API specifications so it can be included in other frameworks in the fullness of time.

You can get started with Gluon today. Fire up the AWS Deep Learning AMI with a single click and jump into one of 50 fully worked, notebook examples. If you’re a contributor to a machine learning framework, check out the interface specs on GitHub.

-Dr. Matt Wood

DevOps Cafe Episode 76 – Randy Shoup

Post Syndicated from DevOpsCafeAdmin original http://devopscafe.org/show/2017/10/11/devops-cafe-episode-76-randy-shoup.html

Technical talent is obviously in his jeans (pun intended) 

John and Damon chat with Randy Shoup (Stitch Fix) about what he’s learned building high-scale systems and teams through multiple generations of technology and practices… and how he is doing it again today.

  

Direct download

Follow John Willis on Twitter: @botchagalupe
Follow Damon Edwards on Twitter: @damonedwards 
Follow Randy Shoup on Twitter: @randyshoup

Notes:

 

Please tweet or leave comments or questions below and we’ll read them on the show!

Clean up Your Container Images with Amazon ECR Lifecycle Policies

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/clean-up-your-container-images-with-amazon-ecr-lifecycle-policies/

This post comes from the desk of Brent Langston.

Starting today, customers can keep their container image repositories tidy by automatically removing old or unused images using lifecycle policies, now available as part of Amazon E2 Container Repository (Amazon ECR).

Amazon ECR is a fully managed Docker container registry that makes it easy to store manage and deploy Docker container images without worrying about the typical challenges of scaling a service to handle pulling hundreds of images at one time. This scale means that development teams using Amazon ECR actively often find that their repositories fill up with many container image versions. This makes it difficult to find the code changes that matter and incurs unnecessary storage costs. Previously, cleaning up your repository meant spending time to manually delete old images, or writing and executing scripts.

Now, lifecycle policies allow you to define a set of rules to remove old container images automatically. You can also preview rules to see exactly which container images are affected when the rule runs. This allows repositories to be better organized, makes it easier to find the code revisions that matter, and lowers storage costs.

Look at how lifecycle policies work.

Ground Rules

One of the biggest benefits of deploying code in containers is the ability to quickly and easily roll back to a previous version. You can deploy with less risk because, if something goes wrong, it is easy to revert back to the previous container version and know that your application will run like it did before the failed deployment. Most people probably never roll back past a few versions. If your situation is similar, then one simple lifecycle rule might be to just keep the last 30 images.

Last 30 Images

In your ECR registry, choose Dry-Run Lifecycle Rules, Add.

  • For Image Status, select Untagged.
  • Under Match criteria, for Count Type, enter Image Count More Than.
  • For Count Number, enter 30.
  • For Rule action, choose expire.

Choose Save. To see which images would be cleaned up, Save and dry-run rules.

Of course, there are teams who, for compliance reasons, might prefer to keep certain images for a period of time, rather than keeping by count. For that situation, you can choose to clean up images older than 90 days.

Last 90 Days

Select the rule that you just created and choose Edit. Change the parameters to keep only 90 days of untagged images:

  • Under Match criteria, for Count Type, enter Since Image Pushed
  • For Count Number, enter 90.
  • For Count Unit, enter days.

Tags

Certainly 90 days is an arbitrary timeframe, and your team might have policies in place that would require a longer timeframe for certain kinds of images. If that’s the case, but you still want to continue with the spring cleaning, you can consider getting rid of images that are tag prefixed.

Here is the list of rules I came up with to groom untagged, development, staging, and production images:

  • Remove untagged images over 90 days old
  • Remove development tagged images over 90 days old
  • Remove staging tagged images over 180 days old
  • Remove production tagged images over 1 year old

As you can see, the new Amazon ECR lifecycle policies are powerful, and help you easily keep the images you need, while cleaning out images you may never use again. This feature is available starting today, in all regions where Amazon ECR is available, at no extra charge. For more information, see Amazon ECR Lifecycle Policies in the AWS technical documentation.

— Brent
@brentContained

Introducing Email Templates and Bulk Sending

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/ses/introducing-email-templates-and-bulk-sending/

The Amazon SES team is excited to announce our latest update, which includes two related features that help you send personalized emails to large groups of customers. This post discusses these features, and provides examples that you can follow to start using these features right away.

Email templates

You can use email templates to create the structure of an email that you plan to send to multiple recipients, or that you will use again in the future. Each template contains a subject line, a text part, and an HTML part. Both the subject and the email body can contain variables that are automatically replaced with values specific to each recipient. For example, you can include a {{name}} variable in the body of your email. When you send the email, you specify the value of {{name}} for each recipient. Amazon SES then automatically replaces the {{name}} variable with the recipient’s first name.

Creating a template

To create a template, you use the CreateTemplate API operation. To use this operation, pass a JSON object with four properties: a template name (TemplateName), a subject line (SubjectPart), a plain text version of the email body (TextPart), and an HTML version of the email body (HtmlPart). You can include variables in the subject line or message body by enclosing the variable names in two sets of curly braces. The following example shows the structure of this JSON object.

{
  "TemplateName": "MyTemplate",
  "SubjectPart": "Greetings, {{name}}!",
  "TextPart": "Dear {{name}},\r\nYour favorite animal is {{favoriteanimal}}.",
  "HtmlPart": "<h1>Hello {{name}}</h1><p>Your favorite animal is {{favoriteanimal}}.</p>"
}

Use this example to create your own template, and save the resulting file as mytemplate.json. You can then use the AWS Command Line Interface (AWS CLI) to create your template by running the following command: aws ses create-template --cli-input-json mytemplate.json

Sending an email created with a template

Now that you have created a template, you’re ready to send email that uses the template. You can use the SendTemplatedEmail API operation to send email to a single destination using a template. Like the CreateTemplate operation, this operation accepts a JSON object with four properties. For this operation, the properties are the sender’s email address (Source), the name of an existing template (Template), an object called Destination that contains the recipient addresses (and, optionally, any CC or BCC addresses) that will receive the email, and a property that refers to the values that will be replaced in the email (TemplateData). The following example shows the structure of the JSON object used by the SendTemplatedEmail operation.

{
  "Source": "[email protected]",
  "Template": "MyTemplate",
  "Destination": {
    "ToAddresses": [ "[email protected]" ]
  },
  "TemplateData": "{ \"name\":\"Alejandro\", \"favoriteanimal\": \"zebra\" }"
}

Customize this example to fit your needs, and then save the resulting file as myemail.json. One important note: in the TemplateData property, you must use a blackslash (\) character to escape the quotes within this object, as shown in the preceding example.

When you’re ready to send the email, run the following command: aws ses send-templated-email --cli-input-json myemail.json

Bulk email sending

In most cases, you should use email templates to send personalized emails to several customers at the same time. The SendBulkTemplatedEmail API operation helps you do that. This operation also accepts a JSON object. At a minimum, you must supply a sender email address (Source), a reference to an existing template (Template), a list of recipients in an array called Destinations (within which you specify the recipient’s email address, and the variable values for that recipient), and a list of fallback values for the variables in the template (DefaultTemplateData). The following example shows the structure of this JSON object.

{
  "Source":"[email protected]",
  "ConfigurationSetName":"ConfigSet",
  "Template":"MyTemplate",
  "Destinations":[
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Anaya\", \"favoriteanimal\":\"yak\" }"
    },
    {
      "Destination":{ 
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Liu\", \"favoriteanimal\":\"water buffalo\" }"
    },
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Shirley\", \"favoriteanimal\":\"vulture\" }"
    },
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{}"
    }
  ],
  "DefaultTemplateData":"{ \"name\":\"friend\", \"favoriteanimal\":\"unknown\" }"
}

This example sends unique emails to Anaya ([email protected]), Liu ([email protected]), Shirley ([email protected]), and a fourth recipient ([email protected]), whose name and favorite animal we didn’t specify. Anaya, Liu, and Shirley will see their names in place of the {{name}} tag in the template (which, in this example, is present in both the subject line and message body), as well as their favorite animals in place of the {{favoriteanimal}} tag in the message body. The DefaultTemplateData property determines what happens if you do not specify the ReplacementTemplateData property for a recipient. In this case, the fourth recipient will see the word “friend” in place of the {{name}} tag, and “unknown” in place of the {{favoriteanimal}} tag.

Use the example to create your own list of recipients, and save the resulting file as mybulkemail.json. When you’re ready to send the email, run the following command: aws ses send-bulk-templated-email --cli-input-json mybulkemail.json

Other considerations

There are a few limits and other considerations when using these features:

  • You can create up to 10,000 email templates per Amazon SES account.
  • Each template can be up to 10 MB in size.
  • You can include an unlimited number of replacement variables in each template.
  • You can send email to up to 50 destinations in each call to the SendBulkTemplatedEmail operation. A destination includes a list of recipients, as well as CC and BCC recipients. Note that the number of destinations you can contact in a single call to the API may be limited by your account’s maximum sending rate. For more information, see Managing Your Amazon SES Sending Limits in the Amazon SES Developer Guide.

We look forward to seeing the amazing things you create with these new features. If you have any questions, please leave a comment on this post, or let us know in the Amazon SES forum.

AWS Developer Tools Expands Integration to Include GitHub

Post Syndicated from Balaji Iyer original https://aws.amazon.com/blogs/devops/aws-developer-tools-expands-integration-to-include-github/

AWS Developer Tools is a set of services that include AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy. Together, these services help you securely store and maintain version control of your application’s source code and automatically build, test, and deploy your application to AWS or your on-premises environment. These services are designed to enable developers and IT professionals to rapidly and safely deliver software.

As part of our continued commitment to extend the AWS Developer Tools ecosystem to third-party tools and services, we’re pleased to announce AWS CodeStar and AWS CodeBuild now integrate with GitHub. This will make it easier for GitHub users to set up a continuous integration and continuous delivery toolchain as part of their release process using AWS Developer Tools.

In this post, I will walk through the following:

Prerequisites:

You’ll need an AWS account, a GitHub account, an Amazon EC2 key pair, and administrator-level permissions for AWS Identity and Access Management (IAM), AWS CodeStar, AWS CodeBuild, AWS CodePipeline, Amazon EC2, Amazon S3.

 

Integrating GitHub with AWS CodeStar

AWS CodeStar enables you to quickly develop, build, and deploy applications on AWS. Its unified user interface helps you easily manage your software development activities in one place. With AWS CodeStar, you can set up your entire continuous delivery toolchain in minutes, so you can start releasing code faster.

When AWS CodeStar launched in April of this year, it used AWS CodeCommit as the hosted source repository. You can now choose between AWS CodeCommit or GitHub as the source control service for your CodeStar projects. In addition, your CodeStar project dashboard lets you centrally track GitHub activities, including commits, issues, and pull requests. This makes it easy to manage project activity across the components of your CI/CD toolchain. Adding the GitHub dashboard view will simplify development of your AWS applications.

In this section, I will show you how to use GitHub as the source provider for your CodeStar projects. I’ll also show you how to work with recent commits, issues, and pull requests in the CodeStar dashboard.

Sign in to the AWS Management Console and from the Services menu, choose CodeStar. In the CodeStar console, choose Create a new project. You should see the Choose a project template page.

CodeStar Project

Choose an option by programming language, application category, or AWS service. I am going to choose the Ruby on Rails web application that will be running on Amazon EC2.

On the Project details page, you’ll now see the GitHub option. Type a name for your project, and then choose Connect to GitHub.

Project details

You’ll see a message requesting authorization to connect to your GitHub repository. When prompted, choose Authorize, and then type your GitHub account password.

Authorize

This connects your GitHub identity to AWS CodeStar through OAuth. You can always review your settings by navigating to your GitHub application settings.

Installed GitHub Apps

You’ll see AWS CodeStar is now connected to GitHub:

Create project

You can choose a public or private repository. GitHub offers free accounts for users and organizations working on public and open source projects and paid accounts that offer unlimited private repositories and optional user management and security features.

In this example, I am going to choose the public repository option. Edit the repository description, if you like, and then choose Next.

Review your CodeStar project details, and then choose Create Project. On Choose an Amazon EC2 Key Pair, choose Create Project.

Key Pair

On the Review project details page, you’ll see Edit Amazon EC2 configuration. Choose this link to configure instance type, VPC, and subnet options. AWS CodeStar requires a service role to create and manage AWS resources and IAM permissions. This role will be created for you when you select the AWS CodeStar would like permission to administer AWS resources on your behalf check box.

Choose Create Project. It might take a few minutes to create your project and resources.

Review project details

When you create a CodeStar project, you’re added to the project team as an owner. If this is the first time you’ve used AWS CodeStar, you’ll be asked to provide the following information, which will be shown to others:

  • Your display name.
  • Your email address.

This information is used in your AWS CodeStar user profile. User profiles are not project-specific, but they are limited to a single AWS region. If you are a team member in projects in more than one region, you’ll have to create a user profile in each region.

User settings

User settings

Choose Next. AWS CodeStar will create a GitHub repository with your configuration settings (for example, https://github.com/biyer/ruby-on-rails-service).

When you integrate your integrated development environment (IDE) with AWS CodeStar, you can continue to write and develop code in your preferred environment. The changes you make will be included in the AWS CodeStar project each time you commit and push your code.

IDE

After setting up your IDE, choose Next to go to the CodeStar dashboard. Take a few minutes to familiarize yourself with the dashboard. You can easily track progress across your entire software development process, from your backlog of work items to recent code deployments.

Dashboard

After the application deployment is complete, choose the endpoint that will display the application.

Pipeline

This is what you’ll see when you open the application endpoint:

The Commit history section of the dashboard lists the commits made to the Git repository. If you choose the commit ID or the Open in GitHub option, you can use a hotlink to your GitHub repository.

Commit history

Your AWS CodeStar project dashboard is where you and your team view the status of your project resources, including the latest commits to your project, the state of your continuous delivery pipeline, and the performance of your instances. This information is displayed on tiles that are dedicated to a particular resource. To see more information about any of these resources, choose the details link on the tile. The console for that AWS service will open on the details page for that resource.

Issues

You can also filter issues based on their status and the assigned user.

Filter

AWS CodeBuild Now Supports Building GitHub Pull Requests

CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue. You can use prepackaged build environments to get started quickly or you can create custom build environments that use your own build tools.

We recently announced support for GitHub pull requests in AWS CodeBuild. This functionality makes it easier to collaborate across your team while editing and building your application code with CodeBuild. You can use the AWS CodeBuild or AWS CodePipeline consoles to run AWS CodeBuild. You can also automate the running of AWS CodeBuild by using the AWS Command Line Interface (AWS CLI), the AWS SDKs, or the AWS CodeBuild Plugin for Jenkins.

AWS CodeBuild

In this section, I will show you how to trigger a build in AWS CodeBuild with a pull request from GitHub through webhooks.

Open the AWS CodeBuild console at https://console.aws.amazon.com/codebuild/. Choose Create project. If you already have a CodeBuild project, you can choose Edit project, and then follow along. CodeBuild can connect to AWS CodeCommit, S3, BitBucket, and GitHub to pull source code for builds. For Source provider, choose GitHub, and then choose Connect to GitHub.

Configure

After you’ve successfully linked GitHub and your CodeBuild project, you can choose a repository in your GitHub account. CodeBuild also supports connections to any public repository. You can review your settings by navigating to your GitHub application settings.

GitHub Apps

On Source: What to Build, for Webhook, select the Rebuild every time a code change is pushed to this repository check box.

Note: You can select this option only if, under Repository, you chose Use a repository in my account.

Source

In Environment: How to build, for Environment image, select Use an image managed by AWS CodeBuild. For Operating system, choose Ubuntu. For Runtime, choose Base. For Version, choose the latest available version. For Build specification, you can provide a collection of build commands and related settings, in YAML format (buildspec.yml) or you can override the build spec by inserting build commands directly in the console. AWS CodeBuild uses these commands to run a build. In this example, the output is the string “hello.”

Environment

On Artifacts: Where to put the artifacts from this build project, for Type, choose No artifacts. (This is also the type to choose if you are just running tests or pushing a Docker image to Amazon ECR.) You also need an AWS CodeBuild service role so that AWS CodeBuild can interact with dependent AWS services on your behalf. Unless you already have a role, choose Create a role, and for Role name, type a name for your role.

Artifacts

In this example, leave the advanced settings at their defaults.

If you expand Show advanced settings, you’ll see options for customizing your build, including:

  • A build timeout.
  • A KMS key to encrypt all the artifacts that the builds for this project will use.
  • Options for building a Docker image.
  • Elevated permissions during your build action (for example, accessing Docker inside your build container to build a Dockerfile).
  • Resource options for the build compute type.
  • Environment variables (built-in or custom). For more information, see Create a Build Project in the AWS CodeBuild User Guide.

Advanced settings

You can use the AWS CodeBuild console to create a parameter in Amazon EC2 Systems Manager. Choose Create a parameter, and then follow the instructions in the dialog box. (In that dialog box, for KMS key, you can optionally specify the ARN of an AWS KMS key in your account. Amazon EC2 Systems Manager uses this key to encrypt the parameter’s value during storage and decrypt during retrieval.)

Create parameter

Choose Continue. On the Review page, either choose Save and build or choose Save to run the build later.

Choose Start build. When the build is complete, the Build logs section should display detailed information about the build.

Logs

To demonstrate a pull request, I will fork the repository as a different GitHub user, make commits to the forked repo, check in the changes to a newly created branch, and then open a pull request.

Pull request

As soon as the pull request is submitted, you’ll see CodeBuild start executing the build.

Build

GitHub sends an HTTP POST payload to the webhook’s configured URL (highlighted here), which CodeBuild uses to download the latest source code and execute the build phases.

Build project

If you expand the Show all checks option for the GitHub pull request, you’ll see that CodeBuild has completed the build, all checks have passed, and a deep link is provided in Details, which opens the build history in the CodeBuild console.

Pull request

Summary:

In this post, I showed you how to use GitHub as the source provider for your CodeStar projects and how to work with recent commits, issues, and pull requests in the CodeStar dashboard. I also showed you how you can use GitHub pull requests to automatically trigger a build in AWS CodeBuild — specifically, how this functionality makes it easier to collaborate across your team while editing and building your application code with CodeBuild.


About the author:

Balaji Iyer is an Enterprise Consultant for the Professional Services Team at Amazon Web Services. In this role, he has helped several customers successfully navigate their journey to AWS. His specialties include architecting and implementing highly scalable distributed systems, serverless architectures, large scale migrations, operational security, and leading strategic AWS initiatives. Before he joined Amazon, Balaji spent more than a decade building operating systems, big data analytics solutions, mobile services, and web applications. In his spare time, he enjoys experiencing the great outdoors and spending time with his family.

 

Pirate Bay is Mining Cryptocurrency Again, No Opt Out

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-bay-is-mining-cryptocurrency-again-no-opt-out-171011/

Last month The Pirate Bay caused some uproar by adding a Javascript-based cryptocurrency miner to its website.

The miner utilizes CPU power from visitors to generate Monero coins for the site, providing an extra source of revenue.

The Pirate Bay only tested the option briefly, but that was enough to inspire many others to follow suit. Now, a few weeks later, Pirate Bay has also turned on the miners again.

The miner is not directly embedded in the site’s core code but runs through an ad script. Many ad blockers and anti-malware tools are stopping these request, but people who don’t use any will see a clear spike in CPU usage when they access the site.

The Pirate Bay team previously said that they were testing the miner to see if it can replace ads. While there is some real revenue potential, for now, it’s running in addition to the regular banners. It’s unclear whether the current mining period is another test or if it will run permanently from now on.

The miner does appear to be throttled to a certain degree, so most users might not even notice that it’s running.

Pirate Bay load requests

Running a cryptocurrency miner such as the Coin-Hive script TPB is currently using is not without risk. Aside from user complaints, there is an issue that may make it harder for the site to operate in the future.

Last week we reported that CDN provider Cloudflare had suspended the account of torrent proxy site ProxyBunker, flagging its coin miner as malware. This means that The Pirate Bay now risks losing the Cloudflare service, which they rely on for DDoS protection, among other things.

Cloudflare’s suspension of ProxyBunker occurred even though the site provided users with an option to disable the miner. This functionality was implemented by Coinhive after the script was misused by some sites, which ran it without alerting their users.

The Pirate Bay currently has no opt-out option, nor has it informed users about the latest mining efforts. This could lead to another problem since Coinhive said it would crack down on customers who failed to keep users in the loop.

“We will verify this opt-in on our servers and will implement it in a way that it can not be circumvented. We will pledge to keep the opt-in intact at all times, without exceptions,” the Coinhive team previously noted.

The Pirate Bay team has not commented on the issue thus far. In theory, it’s possible that a rogue advertiser is responsible for the latest mining efforts. If that’s the case it will be disabled soon enough.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

[$] Continuous-integration testing for Intel graphics

Post Syndicated from jake original https://lwn.net/Articles/735468/rss

Two separate talks, at two different venues, give us a look into the
kinds of testing that the Intel graphics team is
doing. Daniel Vetter had a
short presentation as part of the Testing and Fuzzing microconference at
the Linux Plumbers Conference (LPC). His colleague, Martin Peres, gave a
somewhat longer talk, complete with demos, at the X.Org Developers Conference
(XDC). The picture they paint is a pleasing one: there is lots of testing
going on there. But there are problems as well; that amount of testing
runs afoul of bugs elsewhere in the kernel, which makes the job
harder.

Application Load Balancers Now Support Multiple TLS Certificates With Smart Selection Using SNI

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-application-load-balancer-sni/

Today we’re launching support for multiple TLS/SSL certificates on Application Load Balancers (ALB) using Server Name Indication (SNI). You can now host multiple TLS secured applications, each with its own TLS certificate, behind a single load balancer. In order to use SNI, all you need to do is bind multiple certificates to the same secure listener on your load balancer. ALB will automatically choose the optimal TLS certificate for each client. These new features are provided at no additional charge.

If you’re looking for a TL;DR on how to use this new feature just click here. If you’re like me and you’re a little rusty on the specifics of Transport Layer Security (TLS) then keep reading.

TLS? SSL? SNI?

People tend to use the terms SSL and TLS interchangeably even though the two are technically different. SSL technically refers to a predecessor of the TLS protocol. To keep things simple I’ll be using the term TLS for the rest of this post.

TLS is a protocol for securely transmitting data like passwords, cookies, and credit card numbers. It enables privacy, authentication, and integrity of the data being transmitted. TLS uses certificate based authentication where certificates are like ID cards for your websites. You trust the person that signed and issued the certificate, the certificate authority (CA), so you trust that the data in the certificate is correct. When a browser connects to your TLS-enabled ALB, ALB presents a certificate that contains your site’s public key, which has been cryptographically signed by a CA. This way the client can be sure it’s getting the ‘real you’ and that it’s safe to use your site’s public key to establish a secure connection.

With SNI support we’re making it easy to use more than one certificate with the same ALB. The most common reason you might want to use multiple certificates is to handle different domains with the same load balancer. It’s always been possible to use wildcard and subject-alternate-name (SAN) certificates with ALB, but these come with limitations. Wildcard certificates only work for related subdomains that match a simple pattern and while SAN certificates can support many different domains, the same certificate authority has to authenticate each one. That means you have reauthenticate and reprovision your certificate everytime you add a new domain.

One of our most frequent requests on forums, reddit, and in my e-mail inbox has been to use the Server Name Indication (SNI) extension of TLS to choose a certificate for a client. Since TLS operates at the transport layer, below HTTP, it doesn’t see the hostname requested by a client. SNI works by having the client tell the server “This is the domain I expect to get a certificate for” when it first connects. The server can then choose the correct certificate to respond to the client. All modern web browsers and a large majority of other clients support SNI. In fact, today we see SNI supported by over 99.5% of clients connecting to CloudFront.

Smart Certificate Selection on ALB

ALB’s smart certificate selection goes beyond SNI. In addition to containing a list of valid domain names, certificates also describe the type of key exchange and cryptography that the server supports, as well as the signature algorithm (SHA2, SHA1, MD5) used to sign the certificate. To establish a TLS connection, a client starts a TLS handshake by sending a “ClientHello” message that outlines the capabilities of the client: the protocol versions, extensions, cipher suites, and compression methods. Based on what an individual client supports, ALB’s smart selection algorithm chooses a certificate for the connection and sends it to the client. ALB supports both the classic RSA algorithm and the newer, hipper, and faster Elliptic-curve based ECDSA algorithm. ECDSA support among clients isn’t as prevalent as SNI, but it is supported by all modern web browsers. Since it’s faster and requires less CPU, it can be particularly useful for ultra-low latency applications and for conserving the amount of battery used by mobile applications. Since ALB can see what each client supports from the TLS handshake, you can upload both RSA and ECDSA certificates for the same domains and ALB will automatically choose the best one for each client.

Using SNI with ALB

I’ll use a few example websites like VimIsBetterThanEmacs.com and VimIsTheBest.com. I’ve purchased and hosted these domains on Amazon Route 53, and provisioned two separate certificates for them in AWS Certificate Manager (ACM). If I want to securely serve both of these sites through a single ALB, I can quickly add both certificates in the console.

First, I’ll select my load balancer in the console, go to the listeners tab, and select “view/edit certificates”.

Next, I’ll use the “+” button in the top left corner to select some certificates then I’ll click the “Add” button.

There are no more steps. If you’re not really a GUI kind of person you’ll be pleased to know that it’s also simple to add new certificates via the AWS Command Line Interface (CLI) (or SDKs).

aws elbv2 add-listener-certificates --listener-arn <listener-arn> --certificates CertificateArn=<cert-arn>

Things to know

  • ALB Access Logs now include the client’s requested hostname and the certificate ARN used. If the “hostname” field is empty (represented by a “-“) the client did not use the SNI extension in their request.
  • You can use any of your certificates in ACM or IAM.
  • You can bind multiple certificates for the same domain(s) to a secure listener. Your ALB will choose the optimal certificate based on multiple factors including the capabilities of the client.
  • If the client does not support SNI your ALB will use the default certificate (the one you specified when you created the listener).
  • There are three new ELB API calls: AddListenerCertificates, RemoveListenerCertificates, and DescribeListenerCertificates.
  • You can bind up to 25 certificates per load balancer (not counting the default certificate).
  • These new features are supported by AWS CloudFormation at launch.

You can see an example of these new features in action with a set of websites created by my colleague Jon Zobrist: https://www.exampleloadbalancer.com/.

Overall, I will personally use this feature and I’m sure a ton of AWS users will benefit from it as well. I want to thank the Elastic Load Balancing team for all their hard work in getting this into the hands of our users.

Randall

Bringing Clean and Safe Drinking Water to Developing Countries

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/keeping-charity-water-data-safe/

image of a cup filling with water

If you’d like to read more about charity: water‘s use of Backblaze for Business, visit backblaze.com/charitywater/

charity: water  + Backblaze for Business

Considering that charity: water sends workers with laptop computers to rural communities in 24 countries around the world, it’s not surprising that computer backup is needed on every computer they have. It’s so essential that Matt Ward, System Administrator for charity: water, says it’s a standard part of employee on-boarding.

charity: water, based in New York City, is a non-profit organization that is working to bring clean water to the nearly one in ten people around the world who live without it — a situation that affects not only health, but education and income.

“We have people constantly traveling all over the world, so a cloud-based service makes sense whether the user is in New York or Malawi. Most of our projects and beneficiaries are in Sub Saharan Africa and Southern/Southeast Asia,” explains Matt. “Water scarcity and poor water quality are a problem here, and in so many countries around the world.”

charity: water in Rwanda

To achieve their mission, charity: water works through implementing organizations on the ground within the targeted communities. The people in these communities must spend hours every day walking to collect water for their families. It’s a losing proposition, as the time they spend walking takes away from education, earning money, and generally limits the opportunities for improving their lives.

charity: water began using Backblaze for Business before Matt came on a year ago. They started with a few licenses, but quickly decided to deploy Backblaze to every computer in the organization.

“We’ve lost computers plenty of times,” he says, “but, because of Backblaze, there’s never been a case where we lost the computer’s data.”

charity: water has about 80 staff computer users, and adds ten to twenty interns each season. Each staff member or intern has at least one computer. “Our IT department is two people, me and my director,” explains Matt, “and we have to support everyone, so being super simple to deploy is valuable to us.”

“When a new person joins us, we just send them an invitation to join the Group on Backblaze, and they’re all set. Their data is automatically backed up whenever they’re connected to the internet, and I can see their current status on the management console. [Backblaze] really nailed the user interface. You can show anyone the interface, even on their first day, and they get it because it’s simple and easy to understand.”

young girl drinkng clean water

One of the frequent uses for Backblaze for Business is when Matt off-boards users, such as all the interns at the end of the season. He starts a restore through the Backblaze admin console even before he has the actual computer. “I know I have a reliable archive in the restore from Backblaze, and it’s easier than doing it directly from the laptop.”

Matt is an enthusiastic user of the features designed for business users, especially Backblaze’s Groups feature, which has enabled charity: water to centralize billing and computer management for their worldwide team. Businesses can create groups to cluster job functions, employee locations, or any other criteria.

charity: water delivery clean water to children

“It saves me time to be able to see the status of any user’s backups, such as the last time the data was backed up” explains Matt. Before Backblaze, charity: water was writing documentation for workers, hoping they would follow backup protocols. Now, Matt knows what’s going on in real time — a valuable feature when the laptops are dispersed around the world.

“Backblaze for Business is an essential element in any organization’s IT continuity plan,” says Matt. “You need to be sure that there is a backup solution for your data should anything go wrong.”

To learn more about how charity: water uses Backblaze for Business, visit backblaze.com/charitywater/.

Matt Ward of charity: water

Matt Ward, System Administrator for charity: water

The post Bringing Clean and Safe Drinking Water to Developing Countries appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.