Tag Archives: iq

Implementing Default Directory Indexes in Amazon S3-backed Amazon CloudFront Origins Using [email protected]

Post Syndicated from Ronnie Eichler original https://aws.amazon.com/blogs/compute/implementing-default-directory-indexes-in-amazon-s3-backed-amazon-cloudfront-origins-using-lambdaedge/

With the recent launch of [email protected], it’s now possible for you to provide even more robust functionality to your static websites. Amazon CloudFront is a content distribution network service. In this post, I show how you can use [email protected] along with the CloudFront origin access identity (OAI) for Amazon S3 and still provide simple URLs (such as www.example.com/about/ instead of www.example.com/about/index.html).

Background

Amazon S3 is a great platform for hosting a static website. You don’t need to worry about managing servers or underlying infrastructure—you just publish your static to content to an S3 bucket. S3 provides a DNS name such as <bucket-name>.s3-website-<AWS-region>.amazonaws.com. Use this name for your website by creating a CNAME record in your domain’s DNS environment (or Amazon Route 53) as follows:

www.example.com -> <bucket-name>.s3-website-<AWS-region>.amazonaws.com

You can also put CloudFront in front of S3 to further scale the performance of your site and cache the content closer to your users. CloudFront can enable HTTPS-hosted sites, by either using a custom Secure Sockets Layer (SSL) certificate or a managed certificate from AWS Certificate Manager. In addition, CloudFront also offers integration with AWS WAF, a web application firewall. As you can see, it’s possible to achieve some robust functionality by using S3, CloudFront, and other managed services and not have to worry about maintaining underlying infrastructure.

One of the key concerns that you might have when implementing any type of WAF or CDN is that you want to force your users to go through the CDN. If you implement CloudFront in front of S3, you can achieve this by using an OAI. However, in order to do this, you cannot use the HTTP endpoint that is exposed by S3’s static website hosting feature. Instead, CloudFront must use the S3 REST endpoint to fetch content from your origin so that the request can be authenticated using the OAI. This presents some challenges in that the REST endpoint does not support redirection to a default index page.

CloudFront does allow you to specify a default root object (index.html), but it only works on the root of the website (such as http://www.example.com > http://www.example.com/index.html). It does not work on any subdirectory (such as http://www.example.com/about/). If you were to attempt to request this URL through CloudFront, CloudFront would do a S3 GetObject API call against a key that does not exist.

Of course, it is a bad user experience to expect users to always type index.html at the end of every URL (or even know that it should be there). Until now, there has not been an easy way to provide these simpler URLs (equivalent to the DirectoryIndex Directive in an Apache Web Server configuration) to users through CloudFront. Not if you still want to be able to restrict access to the S3 origin using an OAI. However, with the release of [email protected], you can use a JavaScript function running on the CloudFront edge nodes to look for these patterns and request the appropriate object key from the S3 origin.

Solution

In this example, you use the compute power at the CloudFront edge to inspect the request as it’s coming in from the client. Then re-write the request so that CloudFront requests a default index object (index.html in this case) for any request URI that ends in ‘/’.

When a request is made against a web server, the client specifies the object to obtain in the request. You can use this URI and apply a regular expression to it so that these URIs get resolved to a default index object before CloudFront requests the object from the origin. Use the following code:

'use strict';
exports.handler = (event, context, callback) => {
    
    // Extract the request from the CloudFront event that is sent to [email protected] 
    var request = event.Records[0].cf.request;

    // Extract the URI from the request
    var olduri = request.uri;

    // Match any '/' that occurs at the end of a URI. Replace it with a default index
    var newuri = olduri.replace(/\/$/, '\/index.html');
    
    // Log the URI as received by CloudFront and the new URI to be used to fetch from origin
    console.log("Old URI: " + olduri);
    console.log("New URI: " + newuri);
    
    // Replace the received URI with the URI that includes the index page
    request.uri = newuri;
    
    // Return to CloudFront
    return callback(null, request);

};

To get started, create an S3 bucket to be the origin for CloudFront:

Create bucket

On the other screens, you can just accept the defaults for the purposes of this walkthrough. If this were a production implementation, I would recommend enabling bucket logging and specifying an existing S3 bucket as the destination for access logs. These logs can be useful if you need to troubleshoot issues with your S3 access.

Now, put some content into your S3 bucket. For this walkthrough, create two simple webpages to demonstrate the functionality:  A page that resides at the website root, and another that is in a subdirectory.

<s3bucketname>/index.html

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Root home page</title>
    </head>
    <body>
        <p>Hello, this page resides in the root directory.</p>
    </body>
</html>

<s3bucketname>/subdirectory/index.html

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Subdirectory home page</title>
    </head>
    <body>
        <p>Hello, this page resides in the /subdirectory/ directory.</p>
    </body>
</html>

When uploading the files into S3, you can accept the defaults. You add a bucket policy as part of the CloudFront distribution creation that allows CloudFront to access the S3 origin. You should now have an S3 bucket that looks like the following:

Root of bucket

Subdirectory in bucket

Next, create a CloudFront distribution that your users will use to access the content. Open the CloudFront console, and choose Create Distribution. For Select a delivery method for your content, under Web, choose Get Started.

On the next screen, you set up the distribution. Below are the options to configure:

  • Origin Domain Name:  Select the S3 bucket that you created earlier.
  • Restrict Bucket Access: Choose Yes.
  • Origin Access Identity: Create a new identity.
  • Grant Read Permissions on Bucket: Choose Yes, Update Bucket Policy.
  • Object Caching: Choose Customize (I am changing the behavior to avoid having CloudFront cache objects, as this could affect your ability to troubleshoot while implementing the Lambda code).
    • Minimum TTL: 0
    • Maximum TTL: 0
    • Default TTL: 0

You can accept all of the other defaults. Again, this is a proof-of-concept exercise. After you are comfortable that the CloudFront distribution is working properly with the origin and Lambda code, you can re-visit the preceding values and make changes before implementing it in production.

CloudFront distributions can take several minutes to deploy (because the changes have to propagate out to all of the edge locations). After that’s done, test the functionality of the S3-backed static website. Looking at the distribution, you can see that CloudFront assigns a domain name:

CloudFront Distribution Settings

Try to access the website using a combination of various URLs:

http://<domainname>/:  Works

› curl -v http://d3gt20ea1hllb.cloudfront.net/
*   Trying 54.192.192.214...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.214) port 80 (#0)
> GET / HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< ETag: "cb7e2634fe66c1fd395cf868087dd3b9"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: -D2FSRwzfcwyKZKFZr6DqYFkIf4t7HdGw2MkUF5sE6YFDxRJgi0R1g==
< Content-Length: 209
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:16 GMT
< Via: 1.1 6419ba8f3bd94b651d416054d9416f1e.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Root home page</title>
    </head>
    <body>
        <p>Hello, this page resides in the root directory.</p>
    </body>
</html>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

This is because CloudFront is configured to request a default root object (index.html) from the origin.

http://<domainname>/subdirectory/:  Doesn’t work

› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/
*   Trying 54.192.192.214...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.214) port 80 (#0)
> GET /subdirectory/ HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< ETag: "d41d8cd98f00b204e9800998ecf8427e"
< x-amz-server-side-encryption: AES256
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: Iqf0Gy8hJLiW-9tOAdSFPkL7vCWBrgm3-1ly5tBeY_izU82ftipodA==
< Content-Length: 0
< Content-Type: application/x-directory
< Last-Modified: Wed, 19 Jul 2017 19:21:24 GMT
< Via: 1.1 6419ba8f3bd94b651d416054d9416f1e.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

If you use a tool such like cURL to test this, you notice that CloudFront and S3 are returning a blank response. The reason for this is that the subdirectory does exist, but it does not resolve to an S3 object. Keep in mind that S3 is an object store, so there are no real directories. User interfaces such as the S3 console present a hierarchical view of a bucket with folders based on the presence of forward slashes, but behind the scenes the bucket is just a collection of keys that represent stored objects.

http://<domainname>/subdirectory/index.html:  Works

› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/index.html
*   Trying 54.192.192.130...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.130) port 80 (#0)
> GET /subdirectory/index.html HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 20 Jul 2017 20:35:15 GMT
< ETag: "ddf87c487acf7cef9d50418f0f8f8dae"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: RefreshHit from cloudfront
< X-Amz-Cf-Id: bkh6opXdpw8pUomqG3Qr3UcjnZL8axxOH82Lh0OOcx48uJKc_Dc3Cg==
< Content-Length: 227
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:45 GMT
< Via: 1.1 3f2788d309d30f41de96da6f931d4ede.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Subdirectory home page</title>
    </head>
    <body>
        <p>Hello, this page resides in the /subdirectory/ directory.</p>
    </body>
</html>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

This request works as expected because you are referencing the object directly. Now, you implement the [email protected] function to return the default index.html page for any subdirectory. Looking at the example JavaScript code, here’s where the magic happens:

var newuri = olduri.replace(/\/$/, '\/index.html');

You are going to use a JavaScript regular expression to match any ‘/’ that occurs at the end of the URI and replace it with ‘/index.html’. This is the equivalent to what S3 does on its own with static website hosting. However, as I mentioned earlier, you can’t rely on this if you want to use a policy on the bucket to restrict it so that users must access the bucket through CloudFront. That way, all requests to the S3 bucket must be authenticated using the S3 REST API. Because of this, you implement a [email protected] function that takes any client request ending in ‘/’ and append a default ‘index.html’ to the request before requesting the object from the origin.

In the Lambda console, choose Create function. On the next screen, skip the blueprint selection and choose Author from scratch, as you’ll use the sample code provided.

Next, configure the trigger. Choosing the empty box shows a list of available triggers. Choose CloudFront and select your CloudFront distribution ID (created earlier). For this example, leave Cache Behavior as * and CloudFront Event as Origin Request. Select the Enable trigger and replicate box and choose Next.

Lambda Trigger

Next, give the function a name and a description. Then, copy and paste the following code:

'use strict';
exports.handler = (event, context, callback) => {
    
    // Extract the request from the CloudFront event that is sent to [email protected] 
    var request = event.Records[0].cf.request;

    // Extract the URI from the request
    var olduri = request.uri;

    // Match any '/' that occurs at the end of a URI. Replace it with a default index
    var newuri = olduri.replace(/\/$/, '\/index.html');
    
    // Log the URI as received by CloudFront and the new URI to be used to fetch from origin
    console.log("Old URI: " + olduri);
    console.log("New URI: " + newuri);
    
    // Replace the received URI with the URI that includes the index page
    request.uri = newuri;
    
    // Return to CloudFront
    return callback(null, request);

};

Next, define a role that grants permissions to the Lambda function. For this example, choose Create new role from template, Basic Edge Lambda permissions. This creates a new IAM role for the Lambda function and grants the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        }
    ]
}

In a nutshell, these are the permissions that the function needs to create the necessary CloudWatch log group and log stream, and to put the log events so that the function is able to write logs when it executes.

After the function has been created, you can go back to the browser (or cURL) and re-run the test for the subdirectory request that failed previously:

› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/
*   Trying 54.192.192.202...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.202) port 80 (#0)
> GET /subdirectory/ HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 20 Jul 2017 21:18:44 GMT
< ETag: "ddf87c487acf7cef9d50418f0f8f8dae"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: rwFN7yHE70bT9xckBpceTsAPcmaadqWB9omPBv2P6WkIfQqdjTk_4w==
< Content-Length: 227
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:45 GMT
< Via: 1.1 3572de112011f1b625bb77410b0c5cca.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Subdirectory home page</title>
    </head>
    <body>
        <p>Hello, this page resides in the /subdirectory/ directory.</p>
    </body>
</html>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact

You have now configured a way for CloudFront to return a default index page for subdirectories in S3!

Summary

In this post, you used [email protected] to be able to use CloudFront with an S3 origin access identity and serve a default root object on subdirectory URLs. To find out some more about this use-case, see [email protected] integration with CloudFront in our documentation.

If you have questions or suggestions, feel free to comment below. For troubleshooting or implementation help, check out the Lambda forum.

How to Compete with Giants

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/how-to-compete-with-giants/

How to Compete with Giants

This post by Backblaze’s CEO and co-founder Gleb Budman is the sixth in a series about entrepreneurship. You can choose posts in the series from the list below:

  1. How Backblaze got Started: The Problem, The Solution, and the Stuff In-Between
  2. Building a Competitive Moat: Turning Challenges Into Advantages
  3. From Idea to Launch: Getting Your First Customers
  4. How to Get Your First 1,000 Customers
  5. Surviving Your First Year
  6. How to Compete with Giants

Use the Join button above to receive notification of new posts in this series.

Perhaps your business is competing in a brand new space free from established competitors. Most of us, though, start companies that compete with existing offerings from large, established companies. You need to come up with a better mousetrap — not the first mousetrap.

That’s the challenge Backblaze faced. In this post, I’d like to share some of the lessons I learned from that experience.

Backblaze vs. Giants

Competing with established companies that are orders of magnitude larger can be daunting. How can you succeed?

I’ll set the stage by offering a few sets of giants we compete with:

  • When we started Backblaze, we offered online backup in a market where companies had been offering “online backup” for at least a decade, and even the newer entrants had raised tens of millions of dollars.
  • When we built our storage servers, the alternatives were EMC, NetApp, and Dell — each of which had a market cap of over $10 billion.
  • When we introduced our cloud storage offering, B2, our direct competitors were Amazon, Google, and Microsoft. You might have heard of them.

What did we learn by competing with these giants on a bootstrapped budget? Let’s take a look.

Determine What Success Means

For a long time Apple considered Apple TV to be a hobby, not a real product worth focusing on, because it did not generate a billion in revenue. For a $10 billion per year revenue company, a new business that generates $50 million won’t move the needle and often isn’t worth putting focus on. However, for a startup, getting to $50 million in revenue can be the start of a wildly successful business.

Lesson Learned: Don’t let the giants set your success metrics.

The Advantages Startups Have

The giants have a lot of advantages: more money, people, scale, resources, access, etc. Following their playbook and attacking head-on means you’re simply outgunned. Common paths to failure are trying to build more features, enter more markets, outspend on marketing, and other similar approaches where scale and resources are the primary determinants of success.

But being a startup affords many advantages most giants would salivate over. As a nimble startup you can leverage those to succeed. Let’s breakdown nine competitive advantages we’ve used that you can too.

1. Drive Focus

It’s hard to build a $10 billion revenue business doing just one thing, and most giants have a broad portfolio of businesses, numerous products for each, and targeting a variety of customer segments in multiple markets. That adds complexity and distributes management attention.

Startups get the benefit of having everyone in the company be extremely focused, often on a singular mission, product, customer segment, and market. While our competitors sell everything from advertising to Zantac, and are investing in groceries and shipping, Backblaze has focused exclusively on cloud storage. This means all of our best people (i.e. everyone) is focused on our cloud storage business. Where is all of your focus going?

Lesson Learned: Align everyone in your company to a singular focus to dramatically out-perform larger teams.

2. Use Lack-of-Scale as an Advantage

You may have heard Paul Graham say “Do things that don’t scale.” There are a host of things you can do specifically because you don’t have the same scale as the giants. Use that as an advantage.

When we look for data center space, we have more options than our largest competitors because there are simply more spaces available with room for 100 cabinets than for 1,000 cabinets. With some searching, we can find data center space that is better/cheaper.

When a flood in Thailand destroyed factories, causing the world’s supply of hard drives to plummet and prices to triple, we started drive farming. The giants certainly couldn’t. It was a bit crazy, but it let us keep prices unchanged for our customers.

Our Chief Cloud Officer, Tim, used to work at Adobe. Because of their size, any new product needed to always launch in a multitude of languages and in global markets. Once launched, they had scale. But getting any new product launched was incredibly challenging.

Lesson Learned: Use lack-of-scale to exploit opportunities that are closed to giants.

3. Build a Better Product

This one is probably obvious. If you’re going to provide the same product, at the same price, to the same customers — why do it? Remember that better does not always mean more features. Here’s one way we built a better product that didn’t require being a bigger company.

All online backup services required customers to choose what to include in their backup. We found that this was complicated for users since they often didn’t know what needed to be backed up. We flipped the model to back up everything and allow users to exclude if they wanted to, but it was not required. This reduced the number of features/options, while making it easier and better for the user.

This didn’t require the resources of a huge company; it just required understanding customers a bit deeper and thinking about the solution differently. Building a better product is the most classic startup competitive advantage.

Lesson Learned: Dig deep with your customers to understand and deliver a better mousetrap.

4. Provide Better Service

How can you provide better service? Use your advantages. Escalations from your customer care folks to engineering can go through fewer hoops. Fixing an issue and shipping can be quicker. Access to real answers on Twitter or Facebook can be more effective.

A strategic decision we made was to have all customer support people as full-time employees in our headquarters. This ensures they are in close contact to the whole company for feedback to quickly go both ways.

Having a smaller team and fewer layers enables faster internal communication, which increases customer happiness. And the option to do things that don’t scale — such as help a customer in a unique situation — can go a long way in building customer loyalty.

Lesson Learned: Service your customers better by establishing clear internal communications.

5. Remove The Unnecessary

After determining that the industry standard EMC/NetApp/Dell storage servers would be too expensive to build our own cloud storage upon, we decided to build our own infrastructure. Many said we were crazy to compete with these multi-billion dollar companies and that it would be impossible to build a lower cost storage server. However, not only did it prove to not be impossible — it wasn’t even that hard.

One key trick? Remove the unnecessary. While EMC and others built servers to sell to other companies for a wide variety of use cases, Backblaze needed servers that only Backblaze would run, and for a single use case. As a result we could tailor the servers for our needs by removing redundancy from each server (since we would run redundant servers), and using lower-performance components (since we would get high-performance by running parallel servers).

What do your customers and use cases not need? This can trim costs and complexity while often improving the product for your use case.

Lesson Learned: Don’t think “what can we add” to what the giants offer — think “what can we remove.”

6. Be Easy

How many times have you visited a large company website, particularly one that’s not consumer-focused, only to leave saying, “Huh? I don’t understand what you do.” Keeping your website clear, and your product and pricing simple, will dramatically increase conversion and customer satisfaction. If you’re able to make it 2x easier and thus increasing your conversion by 2x, you’ve just allowed yourself to spend ½ as much acquiring a customer.

Providing unlimited data backup wasn’t specifically about providing more storage — it was about making it easier. Since users didn’t know how much data they needed to back up, charging per gigabyte meant they wouldn’t know the cost. Providing unlimited data backup meant they could just relax.

Customers love easy — and being smaller makes easy easier to deliver. Use that as an advantage in your website, marketing materials, pricing, product, and in every other customer interaction.

Lesson Learned: Ease-of-use isn’t a slogan: it’s a competitive advantage. Treat it as seriously as any other feature of your product

7. Don’t Be Afraid of Risk

Obviously unnecessary risks are unnecessary, and some risks aren’t worth taking. However, large companies that have given guidance to Wall Street with a $0.01 range on their earning-per-share are inherently going to be very risk-averse. Use risk-tolerance to open up opportunities, and adjust your tolerance level as you scale. In your first year, there are likely an infinite number of ways your business may vaporize; don’t be too worried about taking a risk that might have a 20% downside when the upside is hockey stick growth.

Using consumer-grade hard drives in our servers may have caused pain and suffering for us years down-the-line, but they were priced at approximately 50% of enterprise drives. Giants wouldn’t have considered the option. Turns out, the consumer drives performed great for us.

Lesson Learned: Use calculated risks as an advantage.

8. Be Open

The larger a company grows, the more it wants to hide information. Some of this is driven by regulatory requirements as a public company. But most of this is cultural. Sharing something might cause a problem, so let’s not. All external communication is treated as a critical press release, with rounds and rounds of editing by multiple teams and approvals. However, customers are often desperate for information. Moreover, sharing information builds trust, understanding, and advocates.

I started blogging at Backblaze before we launched. When we blogged about our Storage Pod and open-sourced the design, many thought we were crazy to share this information. But it was transformative for us, establishing Backblaze as a tech thought leader in storage and giving people a sense of how we were able to provide our service at such a low cost.

Over the years we’ve developed a culture of being open internally and externally, on our blog and with the press, and in communities such as Hacker News and Reddit. Often we’ve been asked, “why would you share that!?” — but it’s the continual openness that builds trust. And that culture of openness is incredibly challenging for the giants.

Lesson Learned: Overshare to build trust and brand where giants won’t.

9. Be Human

As companies scale, typically a smaller percent of founders and executives interact with customers. The people who build the company become more hidden, the language feels “corporate,” and customers start to feel they’re interacting with the cliche “faceless, nameless corporation.” Use your humanity to your advantage. From day one the Backblaze About page listed all the founders, and my email address. While contacting us shouldn’t be the first path for a customer support question, I wanted it to be clear that we stand behind the service we offer; if we’re doing something wrong — I want to know it.

To scale it’s important to have processes and procedures, but sometimes a situation falls outside of a well-established process. While we want our employees to follow processes, they’re still encouraged to be human and “try to do the right thing.” How to you strike this balance? Simon Sinek gives a good talk about it: make your employees feel safe. If employees feel safe they’ll be human.

If your customer is a consumer, they’ll appreciate being treated as a human. Even if your customer is a corporation, the purchasing decision-makers are still people.

Lesson Learned: Being human is the ultimate antithesis to the faceless corporation.

Build Culture to Sustain Your Advantages at Scale

Presumably the goal is not to always be competing with giants, but to one day become a giant. Does this mean you’ll lose all of these advantages? Some, yes — but not all. Some of these advantages are cultural, and if you build these into the culture from the beginning, and fight to keep them as you scale, you can keep them as you become a giant.

Tesla still comes across as human, with Elon Musk frequently interacting with people on Twitter. Apple continues to provide great service through their Genius Bar. And, worst case, if you lose these at scale, you’ll still have the other advantages of being a giant such as money, people, scale, resources, and access.

Of course, some new startup will be gunning for you with grand ambitions, so just be sure not to get complacent. 😉

The post How to Compete with Giants appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

More Raspberry Pi labs in West Africa

Post Syndicated from Rachel Churcher original https://www.raspberrypi.org/blog/pi-based-ict-west-africa/

Back in May 2013, we heard from Dominique Laloux about an exciting project to bring Raspberry Pi labs to schools in rural West Africa. Until 2012, 75 percent of teachers there had never used a computer. The project has been very successful, and Dominique has been in touch again to bring us the latest news.

A view of the inside of the new Pi lab building

Preparing the new Pi labs building in Kuma Tokpli, Togo

Growing the project

Thanks to the continuing efforts of a dedicated team of teachers, parents and other supporters, the Centre Informatique de Kuma, now known as INITIC (from the French ‘INItiation aux TIC’), runs two Raspberry Pi labs in schools in Togo, and plans to open a third in December. The second lab was opened last year in Kpalimé, a town in the Plateaux Region in the west of the country.

Student using a Raspberry Pi computer

Using the new Raspberry Pi labs in Kpalimé, Togo

More than 400 students used the new lab intensively during the last school year. Dominique tells us more:

“The report made in early July by the seven teachers who accompanied the students was nothing short of amazing: the young people covered a very impressive number of concepts and skills, from the GUI and the file system, to a solid introduction to word processing and spreadsheets, and many other skills. The lab worked exactly as expected. Its 21 Raspberry Pis worked flawlessly, with the exception of a couple of SD cards that needed re-cloning, and a couple of old screens that needed to be replaced. All the Raspberry Pis worked without a glitch. They are so reliable!”

The teachers and students have enjoyed access to a range of software and resources, all running on Raspberry Pi 2s and 3s.

“Our current aim is to introduce the students to ICT using the Raspberry Pis, rather than introducing them to programming and electronics (a step that will certainly be considered later). We use Ubuntu Mate along with a large selection of applications, from LibreOffice, Firefox, GIMP, Audacity, and Calibre, to special maths, science, and geography applications. There are also special applications such as GnuCash and GanttProject, as well as logic games including PyChess. Since December, students also have access to a local server hosting Kiwix, Wiktionary (a local copy of Wikipedia in four languages), several hundred videos, and several thousand books. They really love it!”

Pi lab upgrade

This summer, INITIC upgraded the equipment in their Pi lab in Kuma Adamé, which has been running since 2014. 21 older model Raspberry Pis were replaced with Pi 2s and 3s, to bring this lab into line with the others, and encourage co-operation between the different locations.

“All 21 first-generation Raspberry Pis worked flawlessly for three years, despite the less-than-ideal conditions in which they were used — tropical conditions, dust, frequent power outages, etc. I brought them all back to Brussels, and they all still work fine. The rationale behind the upgrade was to bring more computing power to the lab, and also to have the same equipment in our two Raspberry Pi labs (and in other planned installations).”

Students and teachers using the upgraded Pi labs in Kuma Adamé

Students and teachers using the upgraded Pi lab in Kuma Adamé

An upgrade of the organisation’s first lab, installed in 2012 in Kuma Tokpli, will be completed in December. This lab currently uses ‘retired’ laptops, which will be replaced with Raspberry Pis and peripherals. INITIC, in partnership with the local community, is also constructing a new building to house the upgraded technology, and the organisation’s third Raspberry Pi lab.

Reliable tech

Dominique has been very impressed with the performance of the Raspberry Pis since 2014.

“Our experience of three years, in two very different contexts, clearly demonstrates that the Raspberry Pi is a very convincing alternative to more ‘conventional’ computers for introducing young students to ICT where resources are scarce. I wish I could convince more communities in the world to invest in such ‘low cost, low consumption, low maintenance’ infrastructure. It really works!”

He goes on to explain that:

“Our goal now is to build at least one new Raspberry Pi lab in another Togolese school each year. That will, of course, depend on how successful we are at gathering the funds necessary for each installation, but we are confident we can convince enough friends to give us the financial support needed for our action.”

A desk with Raspberry Pis and peripherals

Reliable Raspberry Pis in the labs at Kpalimé

Get involved

We are delighted to see the Raspberry Pi being used to bring information technology to new teachers, students, and communities in Togo – it’s wonderful to see this project becoming established and building on its achievements. The mission of the Raspberry Pi Foundation is to put the power of digital making into the hands of people all over the world. Therefore, projects like this, in which people use our tech to fulfil this mission in places with few resources, are wonderful to us.

More information about INITIC and its projects can be found on its website. If you are interested in helping the organisation to meet its goals, visit the How to help page. And if you are involved with a project like this, bringing ICT, computer science, and coding to new places, please tell us about it in the comments below.

The post More Raspberry Pi labs in West Africa appeared first on Raspberry Pi.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Introducing Gluon: a new library for machine learning from AWS and Microsoft

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-library-for-machine-learning-from-aws-and-microsoft/

Post by Dr. Matt Wood

Today, AWS and Microsoft announced Gluon, a new open source deep learning interface which allows developers to more easily and quickly build machine learning models, without compromising performance.

Gluon Logo

Gluon provides a clear, concise API for defining machine learning models using a collection of pre-built, optimized neural network components. Developers who are new to machine learning will find this interface more familiar to traditional code, since machine learning models can be defined and manipulated just like any other data structure. More seasoned data scientists and researchers will value the ability to build prototypes quickly and utilize dynamic neural network graphs for entirely new model architectures, all without sacrificing training speed.

Gluon is available in Apache MXNet today, a forthcoming Microsoft Cognitive Toolkit release, and in more frameworks over time.

Neural Networks vs Developers
Machine learning with neural networks (including ‘deep learning’) has three main components: data for training; a neural network model, and an algorithm which trains the neural network. You can think of the neural network in a similar way to a directed graph; it has a series of inputs (which represent the data), which connect to a series of outputs (the prediction), through a series of connected layers and weights. During training, the algorithm adjusts the weights in the network based on the error in the network output. This is the process by which the network learns; it is a memory and compute intensive process which can take days.

Deep learning frameworks such as Caffe2, Cognitive Toolkit, TensorFlow, and Apache MXNet are, in part, an answer to the question ‘how can we speed this process up? Just like query optimizers in databases, the more a training engine knows about the network and the algorithm, the more optimizations it can make to the training process (for example, it can infer what needs to be re-computed on the graph based on what else has changed, and skip the unaffected weights to speed things up). These frameworks also provide parallelization to distribute the computation process, and reduce the overall training time.

However, in order to achieve these optimizations, most frameworks require the developer to do some extra work: specifically, by providing a formal definition of the network graph, up-front, and then ‘freezing’ the graph, and just adjusting the weights.

The network definition, which can be large and complex with millions of connections, usually has to be constructed by hand. Not only are deep learning networks unwieldy, but they can be difficult to debug and it’s hard to re-use the code between projects.

The result of this complexity can be difficult for beginners and is a time-consuming task for more experienced researchers. At AWS, we’ve been experimenting with some ideas in MXNet around new, flexible, more approachable ways to define and train neural networks. Microsoft is also a contributor to the open source MXNet project, and were interested in some of these same ideas. Based on this, we got talking, and found we had a similar vision: to use these techniques to reduce the complexity of machine learning, making it accessible to more developers.

Enter Gluon: dynamic graphs, rapid iteration, scalable training
Gluon introduces four key innovations.

  1. Friendly API: Gluon networks can be defined using a simple, clear, concise code – this is easier for developers to learn, and much easier to understand than some of the more arcane and formal ways of defining networks and their associated weighted scoring functions.
  2. Dynamic networks: the network definition in Gluon is dynamic: it can bend and flex just like any other data structure. This is in contrast to the more common, formal, symbolic definition of a network which the deep learning framework has to effectively carve into stone in order to be able to effectively optimizing computation during training. Dynamic networks are easier to manage, and with Gluon, developers can easily ‘hybridize’ between these fast symbolic representations and the more friendly, dynamic ‘imperative’ definitions of the network and algorithms.
  3. The algorithm can define the network: the model and the training algorithm are brought much closer together. Instead of separate definitions, the algorithm can adjust the network dynamically during definition and training. Not only does this mean that developers can use standard programming loops, and conditionals to create these networks, but researchers can now define even more sophisticated algorithms and models which were not possible before. They are all easier to create, change, and debug.
  4. High performance operators for training: which makes it possible to have a friendly, concise API and dynamic graphs, without sacrificing training speed. This is a huge step forward in machine learning. Some frameworks bring a friendly API or dynamic graphs to deep learning, but these previous methods all incur a cost in terms of training speed. As with other areas of software, abstraction can slow down computation since it needs to be negotiated and interpreted at run time. Gluon can efficiently blend together a concise API with the formal definition under the hood, without the developer having to know about the specific details or to accommodate the compiler optimizations manually.

The team here at AWS, and our collaborators at Microsoft, couldn’t be more excited to bring these improvements to developers through Gluon. We’re already seeing quite a bit of excitement from developers and researchers alike.

Getting started with Gluon
Gluon is available today in Apache MXNet, with support coming for the Microsoft Cognitive Toolkit in a future release. We’re also publishing the front-end interface and the low-level API specifications so it can be included in other frameworks in the fullness of time.

You can get started with Gluon today. Fire up the AWS Deep Learning AMI with a single click and jump into one of 50 fully worked, notebook examples. If you’re a contributor to a machine learning framework, check out the interface specs on GitHub.

-Dr. Matt Wood

Introducing Email Templates and Bulk Sending

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/ses/introducing-email-templates-and-bulk-sending/

The Amazon SES team is excited to announce our latest update, which includes two related features that help you send personalized emails to large groups of customers. This post discusses these features, and provides examples that you can follow to start using these features right away.

Email templates

You can use email templates to create the structure of an email that you plan to send to multiple recipients, or that you will use again in the future. Each template contains a subject line, a text part, and an HTML part. Both the subject and the email body can contain variables that are automatically replaced with values specific to each recipient. For example, you can include a {{name}} variable in the body of your email. When you send the email, you specify the value of {{name}} for each recipient. Amazon SES then automatically replaces the {{name}} variable with the recipient’s first name.

Creating a template

To create a template, you use the CreateTemplate API operation. To use this operation, pass a JSON object with four properties: a template name (TemplateName), a subject line (SubjectPart), a plain text version of the email body (TextPart), and an HTML version of the email body (HtmlPart). You can include variables in the subject line or message body by enclosing the variable names in two sets of curly braces. The following example shows the structure of this JSON object.

{
  "TemplateName": "MyTemplate",
  "SubjectPart": "Greetings, {{name}}!",
  "TextPart": "Dear {{name}},\r\nYour favorite animal is {{favoriteanimal}}.",
  "HtmlPart": "<h1>Hello {{name}}</h1><p>Your favorite animal is {{favoriteanimal}}.</p>"
}

Use this example to create your own template, and save the resulting file as mytemplate.json. You can then use the AWS Command Line Interface (AWS CLI) to create your template by running the following command: aws ses create-template --cli-input-json mytemplate.json

Sending an email created with a template

Now that you have created a template, you’re ready to send email that uses the template. You can use the SendTemplatedEmail API operation to send email to a single destination using a template. Like the CreateTemplate operation, this operation accepts a JSON object with four properties. For this operation, the properties are the sender’s email address (Source), the name of an existing template (Template), an object called Destination that contains the recipient addresses (and, optionally, any CC or BCC addresses) that will receive the email, and a property that refers to the values that will be replaced in the email (TemplateData). The following example shows the structure of the JSON object used by the SendTemplatedEmail operation.

{
  "Source": "[email protected]",
  "Template": "MyTemplate",
  "Destination": {
    "ToAddresses": [ "[email protected]" ]
  },
  "TemplateData": "{ \"name\":\"Alejandro\", \"favoriteanimal\": \"zebra\" }"
}

Customize this example to fit your needs, and then save the resulting file as myemail.json. One important note: in the TemplateData property, you must use a blackslash (\) character to escape the quotes within this object, as shown in the preceding example.

When you’re ready to send the email, run the following command: aws ses send-templated-email --cli-input-json myemail.json

Bulk email sending

In most cases, you should use email templates to send personalized emails to several customers at the same time. The SendBulkTemplatedEmail API operation helps you do that. This operation also accepts a JSON object. At a minimum, you must supply a sender email address (Source), a reference to an existing template (Template), a list of recipients in an array called Destinations (within which you specify the recipient’s email address, and the variable values for that recipient), and a list of fallback values for the variables in the template (DefaultTemplateData). The following example shows the structure of this JSON object.

{
  "Source":"[email protected]",
  "ConfigurationSetName":"ConfigSet",
  "Template":"MyTemplate",
  "Destinations":[
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Anaya\", \"favoriteanimal\":\"yak\" }"
    },
    {
      "Destination":{ 
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Liu\", \"favoriteanimal\":\"water buffalo\" }"
    },
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Shirley\", \"favoriteanimal\":\"vulture\" }"
    },
    {
      "Destination":{
        "ToAddresses":[
          "richard.roe[email protected]"
        ]
      },
      "ReplacementTemplateData":"{}"
    }
  ],
  "DefaultTemplateData":"{ \"name\":\"friend\", \"favoriteanimal\":\"unknown\" }"
}

This example sends unique emails to Anaya ([email protected]), Liu ([email protected]), Shirley ([email protected]), and a fourth recipient ([email protected]), whose name and favorite animal we didn’t specify. Anaya, Liu, and Shirley will see their names in place of the {{name}} tag in the template (which, in this example, is present in both the subject line and message body), as well as their favorite animals in place of the {{favoriteanimal}} tag in the message body. The DefaultTemplateData property determines what happens if you do not specify the ReplacementTemplateData property for a recipient. In this case, the fourth recipient will see the word “friend” in place of the {{name}} tag, and “unknown” in place of the {{favoriteanimal}} tag.

Use the example to create your own list of recipients, and save the resulting file as mybulkemail.json. When you’re ready to send the email, run the following command: aws ses send-bulk-templated-email --cli-input-json mybulkemail.json

Other considerations

There are a few limits and other considerations when using these features:

  • You can create up to 10,000 email templates per Amazon SES account.
  • Each template can be up to 10 MB in size.
  • You can include an unlimited number of replacement variables in each template.
  • You can send email to up to 50 destinations in each call to the SendBulkTemplatedEmail operation. A destination includes a list of recipients, as well as CC and BCC recipients. Note that the number of destinations you can contact in a single call to the API may be limited by your account’s maximum sending rate. For more information, see Managing Your Amazon SES Sending Limits in the Amazon SES Developer Guide.

We look forward to seeing the amazing things you create with these new features. If you have any questions, please leave a comment on this post, or let us know in the Amazon SES forum.

"Responsible encryption" fallacies

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/10/responsible-encryption-fallacies.html

Deputy Attorney General Rod Rosenstein gave a speech recently calling for “Responsible Encryption” (aka. “Crypto Backdoors”). It’s full of dangerous ideas that need to be debunked.

The importance of law enforcement

The first third of the speech talks about the importance of law enforcement, as if it’s the only thing standing between us and chaos. It cites the 2016 Mirai attacks as an example of the chaos that will only get worse without stricter law enforcement.

But the Mira case demonstrated the opposite, how law enforcement is not needed. They made no arrests in the case. A year later, they still haven’t a clue who did it.

Conversely, we technologists have fixed the major infrastructure issues. Specifically, those affected by the DNS outage have moved to multiple DNS providers, including a high-capacity DNS provider like Google and Amazon who can handle such large attacks easily.

In other words, we the people fixed the major Mirai problem, and law-enforcement didn’t.

Moreover, instead being a solution to cyber threats, law enforcement has become a threat itself. The DNC didn’t have the FBI investigate the attacks from Russia likely because they didn’t want the FBI reading all their files, finding wrongdoing by the DNC. It’s not that they did anything actually wrong, but it’s more like that famous quote from Richelieu “Give me six words written by the most honest of men and I’ll find something to hang him by”. Give all your internal emails over to the FBI and I’m certain they’ll find something to hang you by, if they want.
Or consider the case of Andrew Auernheimer. He found AT&T’s website made public user accounts of the first iPad, so he copied some down and posted them to a news site. AT&T had denied the problem, so making the problem public was the only way to force them to fix it. Such access to the website was legal, because AT&T had made the data public. However, prosecutors disagreed. In order to protect the powerful, they twisted and perverted the law to put Auernheimer in jail.

It’s not that law enforcement is bad, it’s that it’s not the unalloyed good Rosenstein imagines. When law enforcement becomes the thing Rosenstein describes, it means we live in a police state.

Where law enforcement can’t go

Rosenstein repeats the frequent claim in the encryption debate:

Our society has never had a system where evidence of criminal wrongdoing was totally impervious to detection

Of course our society has places “impervious to detection”, protected by both legal and natural barriers.

An example of a legal barrier is how spouses can’t be forced to testify against each other. This barrier is impervious.

A better example, though, is how so much of government, intelligence, the military, and law enforcement itself is impervious. If prosecutors could gather evidence everywhere, then why isn’t Rosenstein prosecuting those guilty of CIA torture?

Oh, you say, government is a special exception. If that were the case, then why did Rosenstein dedicate a precious third of his speech discussing the “rule of law” and how it applies to everyone, “protecting people from abuse by the government”. It obviously doesn’t, there’s one rule of government and a different rule for the people, and the rule for government means there’s lots of places law enforcement can’t go to gather evidence.

Likewise, the crypto backdoor Rosenstein is demanding for citizens doesn’t apply to the President, Congress, the NSA, the Army, or Rosenstein himself.

Then there are the natural barriers. The police can’t read your mind. They can only get the evidence that is there, like partial fingerprints, which are far less reliable than full fingerprints. They can’t go backwards in time.

I mention this because encryption is a natural barrier. It’s their job to overcome this barrier if they can, to crack crypto and so forth. It’s not our job to do it for them.

It’s like the camera that increasingly comes with TVs for video conferencing, or the microphone on Alexa-style devices that are always recording. This suddenly creates evidence that the police want our help in gathering, such as having the camera turned on all the time, recording to disk, in case the police later gets a warrant, to peer backward in time what happened in our living rooms. The “nothing is impervious” argument applies here as well. And it’s equally bogus here. By not helping police by not recording our activities, we aren’t somehow breaking some long standing tradit

And this is the scary part. It’s not that we are breaking some ancient tradition that there’s no place the police can’t go (with a warrant). Instead, crypto backdoors breaking the tradition that never before have I been forced to help them eavesdrop on me, even before I’m a suspect, even before any crime has been committed. Sure, laws like CALEA force the phone companies to help the police against wrongdoers — but here Rosenstein is insisting I help the police against myself.

Balance between privacy and public safety

Rosenstein repeats the frequent claim that encryption upsets the balance between privacy/safety:

Warrant-proof encryption defeats the constitutional balance by elevating privacy above public safety.

This is laughable, because technology has swung the balance alarmingly in favor of law enforcement. Far from “Going Dark” as his side claims, the problem we are confronted with is “Going Light”, where the police state monitors our every action.

You are surrounded by recording devices. If you walk down the street in town, outdoor surveillance cameras feed police facial recognition systems. If you drive, automated license plate readers can track your route. If you make a phone call or use a credit card, the police get a record of the transaction. If you stay in a hotel, they demand your ID, for law enforcement purposes.

And that’s their stuff, which is nothing compared to your stuff. You are never far from a recording device you own, such as your mobile phone, TV, Alexa/Siri/OkGoogle device, laptop. Modern cars from the last few years increasingly have always-on cell connections and data recorders that record your every action (and location).

Even if you hike out into the country, when you get back, the FBI can subpoena your GPS device to track down your hidden weapon’s cache, or grab the photos from your camera.

And this is all offline. So much of what we do is now online. Of the photographs you own, fewer than 1% are printed out, the rest are on your computer or backed up to the cloud.

Your phone is also a GPS recorder of your exact position all the time, which if the government wins the Carpenter case, they police can grab without a warrant. Tagging all citizens with a recording device of their position is not “balance” but the premise for a novel more dystopic than 1984.

If suspected of a crime, which would you rather the police searched? Your person, houses, papers, and physical effects? Or your mobile phone, computer, email, and online/cloud accounts?

The balance of privacy and safety has swung so far in favor of law enforcement that rather than debating whether they should have crypto backdoors, we should be debating how to add more privacy protections.

“But it’s not conclusive”

Rosenstein defends the “going light” (“Golden Age of Surveillance”) by pointing out it’s not always enough for conviction. Nothing gives a conviction better than a person’s own words admitting to the crime that were captured by surveillance. This other data, while copious, often fails to convince a jury beyond a reasonable doubt.
This is nonsense. Police got along well enough before the digital age, before such widespread messaging. They solved terrorist and child abduction cases just fine in the 1980s. Sure, somebody’s GPS location isn’t by itself enough — until you go there and find all the buried bodies, which leads to a conviction. “Going dark” imagines that somehow, the evidence they’ve been gathering for centuries is going away. It isn’t. It’s still here, and matches up with even more digital evidence.
Conversely, a person’s own words are not as conclusive as you think. There’s always missing context. We quickly get back to the Richelieu “six words” problem, where captured communications are twisted to convict people, with defense lawyers trying to untwist them.

Rosenstein’s claim may be true, that a lot of criminals will go free because the other electronic data isn’t convincing enough. But I’d need to see that claim backed up with hard studies, not thrown out for emotional impact.

Terrorists and child molesters

You can always tell the lack of seriousness of law enforcement when they bring up terrorists and child molesters.
To be fair, sometimes we do need to talk about terrorists. There are things unique to terrorism where me may need to give government explicit powers to address those unique concerns. For example, the NSA buys mobile phone 0day exploits in order to hack terrorist leaders in tribal areas. This is a good thing.
But when terrorists use encryption the same way everyone else does, then it’s not a unique reason to sacrifice our freedoms to give the police extra powers. Either it’s a good idea for all crimes or no crimes — there’s nothing particular about terrorism that makes it an exceptional crime. Dead people are dead. Any rational view of the problem relegates terrorism to be a minor problem. More citizens have died since September 8, 2001 from their own furniture than from terrorism. According to studies, the hot water from the tap is more of a threat to you than terrorists.
Yes, government should do what they can to protect us from terrorists, but no, it’s not so bad of a threat that requires the imposition of a military/police state. When people use terrorism to justify their actions, it’s because they trying to form a military/police state.
A similar argument works with child porn. Here’s the thing: the pervs aren’t exchanging child porn using the services Rosenstein wants to backdoor, like Apple’s Facetime or Facebook’s WhatsApp. Instead, they are exchanging child porn using custom services they build themselves.
Again, I’m (mostly) on the side of the FBI. I support their idea of buying 0day exploits in order to hack the web browsers of visitors to the secret “PlayPen” site. This is something that’s narrow to this problem and doesn’t endanger the innocent. On the other hand, their calls for crypto backdoors endangers the innocent while doing effectively nothing to address child porn.
Terrorists and child molesters are a clichéd, non-serious excuse to appeal to our emotions to give up our rights. We should not give in to such emotions.

Definition of “backdoor”

Rosenstein claims that we shouldn’t call backdoors “backdoors”:

No one calls any of those functions [like key recovery] a “back door.”  In fact, those capabilities are marketed and sought out by many users.

He’s partly right in that we rarely refer to PGP’s key escrow feature as a “backdoor”.

But that’s because the term “backdoor” refers less to how it’s done and more to who is doing it. If I set up a recovery password with Apple, I’m the one doing it to myself, so we don’t call it a backdoor. If it’s the police, spies, hackers, or criminals, then we call it a “backdoor” — even it’s identical technology.

Wikipedia uses the key escrow feature of the 1990s Clipper Chip as a prime example of what everyone means by “backdoor“. By “no one”, Rosenstein is including Wikipedia, which is obviously incorrect.

Though in truth, it’s not going to be the same technology. The needs of law enforcement are different than my personal key escrow/backup needs. In particular, there are unsolvable problems, such as a backdoor that works for the “legitimate” law enforcement in the United States but not for the “illegitimate” police states like Russia and China.

I feel for Rosenstein, because the term “backdoor” does have a pejorative connotation, which can be considered unfair. But that’s like saying the word “murder” is a pejorative term for killing people, or “torture” is a pejorative term for torture. The bad connotation exists because we don’t like government surveillance. I mean, honestly calling this feature “government surveillance feature” is likewise pejorative, and likewise exactly what it is that we are talking about.

Providers

Rosenstein focuses his arguments on “providers”, like Snapchat or Apple. But this isn’t the question.

The question is whether a “provider” like Telegram, a Russian company beyond US law, provides this feature. Or, by extension, whether individuals should be free to install whatever software they want, regardless of provider.

Telegram is a Russian company that provides end-to-end encryption. Anybody can download their software in order to communicate so that American law enforcement can’t eavesdrop. They aren’t going to put in a backdoor for the U.S. If we succeed in putting backdoors in Apple and WhatsApp, all this means is that criminals are going to install Telegram.

If the, for some reason, the US is able to convince all such providers (including Telegram) to install a backdoor, then it still doesn’t solve the problem, as uses can just build their own end-to-end encryption app that has no provider. It’s like email: some use the major providers like GMail, others setup their own email server.

Ultimately, this means that any law mandating “crypto backdoors” is going to target users not providers. Rosenstein tries to make a comparison with what plain-old telephone companies have to do under old laws like CALEA, but that’s not what’s happening here. Instead, for such rules to have any effect, they have to punish users for what they install, not providers.

This continues the argument I made above. Government backdoors is not something that forces Internet services to eavesdrop on us — it forces us to help the government spy on ourselves.
Rosenstein tries to address this by pointing out that it’s still a win if major providers like Apple and Facetime are forced to add backdoors, because they are the most popular, and some terrorists/criminals won’t move to alternate platforms. This is false. People with good intentions, who are unfairly targeted by a police state, the ones where police abuse is rampant, are the ones who use the backdoored products. Those with bad intentions, who know they are guilty, will move to the safe products. Indeed, Telegram is already popular among terrorists because they believe American services are already all backdoored. 
Rosenstein is essentially demanding the innocent get backdoored while the guilty don’t. This seems backwards. This is backwards.

Apple is morally weak

The reason I’m writing this post is because Rosenstein makes a few claims that cannot be ignored. One of them is how he describes Apple’s response to government insistence on weakening encryption doing the opposite, strengthening encryption. He reasons this happens because:

Of course they [Apple] do. They are in the business of selling products and making money. 

We [the DoJ] use a different measure of success. We are in the business of preventing crime and saving lives. 

He swells in importance. His condescending tone ennobles himself while debasing others. But this isn’t how things work. He’s not some white knight above the peasantry, protecting us. He’s a beat cop, a civil servant, who serves us.

A better phrasing would have been:

They are in the business of giving customers what they want.

We are in the business of giving voters what they want.

Both sides are doing the same, giving people what they want. Yes, voters want safety, but they also want privacy. Rosenstein imagines that he’s free to ignore our demands for privacy as long has he’s fulfilling his duty to protect us. He has explicitly rejected what people want, “we use a different measure of success”. He imagines it’s his job to tell us where the balance between privacy and safety lies. That’s not his job, that’s our job. We, the people (and our representatives), make that decision, and it’s his job is to do what he’s told. His measure of success is how well he fulfills our wishes, not how well he satisfies his imagined criteria.

That’s why those of us on this side of the debate doubt the good intentions of those like Rosenstein. He criticizes Apple for wanting to protect our rights/freedoms, and declare they measure success differently.

They are willing to be vile

Rosenstein makes this argument:

Companies are willing to make accommodations when required by the government. Recent media reports suggest that a major American technology company developed a tool to suppress online posts in certain geographic areas in order to embrace a foreign government’s censorship policies. 

Let me translate this for you:

Companies are willing to acquiesce to vile requests made by police-states. Therefore, they should acquiesce to our vile police-state requests.

It’s Rosenstein who is admitting here is that his requests are those of a police-state.

Constitutional Rights

Rosenstein says:

There is no constitutional right to sell warrant-proof encryption.

Maybe. It’s something the courts will have to decide. There are many 1st, 2nd, 3rd, 4th, and 5th Amendment issues here.
The reason we have the Bill of Rights is because of the abuses of the British Government. For example, they quartered troops in our homes, as a way of punishing us, and as a way of forcing us to help in our own oppression. The troops weren’t there to defend us against the French, but to defend us against ourselves, to shoot us if we got out of line.

And that’s what crypto backdoors do. We are forced to be agents of our own oppression. The principles enumerated by Rosenstein apply to a wide range of even additional surveillance. With little change to his speech, it can equally argue why the constant TV video surveillance from 1984 should be made law.

Let’s go back and look at Apple. It is not some base company exploiting consumers for profit. Apple doesn’t have guns, they cannot make people buy their product. If Apple doesn’t provide customers what they want, then customers vote with their feet, and go buy an Android phone. Apple isn’t providing encryption/security in order to make a profit — it’s giving customers what they want in order to stay in business.
Conversely, if we citizens don’t like what the government does, tough luck, they’ve got the guns to enforce their edicts. We can’t easily vote with our feet and walk to another country. A “democracy” is far less democratic than capitalism. Apple is a minority, selling phones to 45% of the population, and that’s fine, the minority get the phones they want. In a Democracy, where citizens vote on the issue, those 45% are screwed, as the 55% impose their will unwanted onto the remainder.

That’s why we have the Bill of Rights, to protect the 49% against abuse by the 51%. Regardless whether the Supreme Court agrees the current Constitution, it is the sort right that might exist regardless of what the Constitution says. 

Obliged to speak the truth

Here is the another part of his speech that I feel cannot be ignored. We have to discuss this:

Those of us who swear to protect the rule of law have a different motivation.  We are obliged to speak the truth.

The truth is that “going dark” threatens to disable law enforcement and enable criminals and terrorists to operate with impunity.

This is not true. Sure, he’s obliged to say the absolute truth, in court. He’s also obliged to be truthful in general about facts in his personal life, such as not lying on his tax return (the sort of thing that can get lawyers disbarred).

But he’s not obliged to tell his spouse his honest opinion whether that new outfit makes them look fat. Likewise, Rosenstein knows his opinion on public policy doesn’t fall into this category. He can say with impunity that either global warming doesn’t exist, or that it’ll cause a biblical deluge within 5 years. Both are factually untrue, but it’s not going to get him fired.

And this particular claim is also exaggerated bunk. While everyone agrees encryption makes law enforcement’s job harder than with backdoors, nobody honestly believes it can “disable” law enforcement. While everyone agrees that encryption helps terrorists, nobody believes it can enable them to act with “impunity”.

I feel bad here. It’s a terrible thing to question your opponent’s character this way. But Rosenstein made this unavoidable when he clearly, with no ambiguity, put his integrity as Deputy Attorney General on the line behind the statement that “going dark threatens to disable law enforcement and enable criminals and terrorists to operate with impunity”. I feel it’s a bald face lie, but you don’t need to take my word for it. Read his own words yourself and judge his integrity.

Conclusion

Rosenstein’s speech includes repeated references to ideas like “oath”, “honor”, and “duty”. It reminds me of Col. Jessup’s speech in the movie “A Few Good Men”.

If you’ll recall, it was rousing speech, “you want me on that wall” and “you use words like honor as a punchline”. Of course, since he was violating his oath and sending two privates to death row in order to avoid being held accountable, it was Jessup himself who was crapping on the concepts of “honor”, “oath”, and “duty”.

And so is Rosenstein. He imagines himself on that wall, doing albeit terrible things, justified by his duty to protect citizens. He imagines that it’s he who is honorable, while the rest of us not, even has he utters bald faced lies to further his own power and authority.

We activists oppose crypto backdoors not because we lack honor, or because we are criminals, or because we support terrorists and child molesters. It’s because we value privacy and government officials who get corrupted by power. It’s not that we fear Trump becoming a dictator, it’s that we fear bureaucrats at Rosenstein’s level becoming drunk on authority — which Rosenstein demonstrably has. His speech is a long train of corrupt ideas pursuing the same object of despotism — a despotism we oppose.

In other words, we oppose crypto backdoors because it’s not a tool of law enforcement, but a tool of despotism.

Cloudflare CEO Has to Explain Lack of Pirate Site Terminations

Post Syndicated from Ernesto original https://torrentfreak.com/cloudflare-ceo-has-to-explain-lack-of-pirate-site-terminations-171010/

In August, Cloudflare CEO Matthew Prince decided to terminate the account of controversial neo-Nazi site Daily Stormer.

“I woke up this morning in a bad mood and decided to kick them off the Internet,” he wrote.

The decision was meant as an intellectual exercise to start a conversation regarding censorship and free speech on the internet. In this respect it was a success but the discussion went much further than Prince had intended.

Cloudflare had a long-standing policy not to remove any accounts without a court order, so when this was exceeded, eyebrows were raised. In particular, copyright holders wondered why the company could terminate this account but not those of the most notorious pirate sites.

Adult entertainment publisher ALS Scan raised this question in its piracy liability case against Cloudflare, asking for a 7-hour long deposition of the company’s CEO, to find out more. Cloudflare opposed this request, saying it was overbroad and unneeded, while asking the court to weigh in.

After reviewing the matter, Magistrate Judge Alexander MacKinnon decided to allow the deposition, but in a limited form.

“An initial matter, the Court finds that ALS Scan has not made a showing that would justify a 7 hour deposition of Mr. Prince covering a wide range of topics,” the order (pdf) reads.

“On the other hand, a review of the record shows that ALS Scan has identified a narrow relevant issue for which it appears Mr. Prince has unique knowledge and for which less intrusive discovery has been exhausted.”

ALS Scan will be able to interrogate Cloudflare’s CEO but only for two hours. The deposition must be specifically tailored toward his motivation (not) to use his authority to terminate the accounts of ‘pirating’ customers.

“The specific topic is the use (or non-use) of Mr. Prince’s authority to terminate customers, as specifically applied to customers for whom Cloudflare has received notices of copyright infringement,” the order specifies.

Whether this deposition will help ALS Scan argue its case has yet to be seen. Based on earlier submissions, the CEO will likely argue that the Daily Stormer case was an exception to make a point and that it’s company policy to require a court order to respond to infringement claims.

Meanwhile, more questions are being raised. Just a few days ago Cloudflare suspended the account of a customer for using a cryptocurrency miner. Apparently, Cloudflare classifies these miners as malware, triggering a punishment without a court order.

ALS Scan and other copyright holders would like to see a similar policy against notorious pirate sites, but thus far Cloudflare is having none of it.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

VHostScan – Virtual Host Scanner With Alias & Catch-All Detection

Post Syndicated from Darknet original https://www.darknet.org.uk/2017/10/vhostscan-virtual-host-scanner-with-alias-catch-all-detection/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

VHostScan – Virtual Host Scanner With Alias & Catch-All Detection

VHostScan is a Python-based virtual host scanner that can be used with pivot tools, detect catch-all scenarios, aliases and dynamic default pages.

Features of VHostScan Virtual Host Scanner

  • Quickly highlight unique content in catch-all scenarios
  • Locate the outliers in catch-all scenarios where results have dynamic content on the page (such as the time)
  • Identify aliases by tweaking the unique depth of matches
  • Wordlist supports standard words and a variable to input a base hostname (for e.g.

Read the rest of VHostScan – Virtual Host Scanner With Alias & Catch-All Detection now! Only available at Darknet.

JavaScript got better while I wasn’t looking

Post Syndicated from Eevee original https://eev.ee/blog/2017/10/07/javascript-got-better-while-i-wasnt-looking/

IndustrialRobot has generously donated in order to inquire:

In the last few years there seems to have been a lot of activity with adding emojis to Unicode. Has there been an equal effort to add ‘real’ languages/glyph systems/etc?

And as always, if you don’t have anything to say on that topic, feel free to choose your own. :p

Yes.

I mean, each release of Unicode lists major new additions right at the top — Unicode 10, Unicode 9, Unicode 8, etc. They also keep fastidious notes, so you can also dig into how and why these new scripts came from, by reading e.g. the proposal for the addition of Zanabazar Square. I don’t think I have much to add here; I’m not a real linguist, I only play one on TV.

So with that out of the way, here’s something completely different!

A brief history of JavaScript

JavaScript was created in seven days, about eight thousand years ago. It was pretty rough, and it stayed rough for most of its life. But that was fine, because no one used it for anything besides having a trail of sparkles follow your mouse on their Xanga profile.

Then people discovered you could actually do a handful of useful things with JavaScript, and it saw a sharp uptick in usage. Alas, it stayed pretty rough. So we came up with polyfills and jQuerys and all kinds of miscellaneous things that tried to smooth over the rough parts, to varying degrees of success.

And… that’s it. That’s pretty much how things stayed for a while.


I have complicated feelings about JavaScript. I don’t hate it… but I certainly don’t enjoy it, either. It has some pretty neat ideas, like prototypical inheritance and “everything is a value”, but it buries them under a pile of annoying quirks and a woefully inadequate standard library. The DOM APIs don’t make things much better — they seem to be designed as though the target language were Java, rarely taking advantage of any interesting JavaScript features. And the places where the APIs overlap with the language are a hilarious mess: I have to check documentation every single time I use any API that returns a set of things, because there are at least three totally different conventions for handling that and I can’t keep them straight.

The funny thing is that I’ve been fairly happy to work with Lua, even though it shares most of the same obvious quirks as JavaScript. Both languages are weakly typed; both treat nonexistent variables and keys as simply false values, rather than errors; both have a single data structure that doubles as both a list and a map; both use 64-bit floating-point as their only numeric type (though Lua added integers very recently); both lack a standard object model; both have very tiny standard libraries. Hell, Lua doesn’t even have exceptions, not really — you have to fake them in much the same style as Perl.

And yet none of this bothers me nearly as much in Lua. The differences between the languages are very subtle, but combined they make a huge impact.

  • Lua has separate operators for addition and concatenation, so + is never ambiguous. It also has printf-style string formatting in the standard library.

  • Lua’s method calls are syntactic sugar: foo:bar() just means foo.bar(foo). Lua doesn’t even have a special this or self value; the invocant just becomes the first argument. In contrast, JavaScript invokes some hand-waved magic to set its contextual this variable, which has led to no end of confusion.

  • Lua has an iteration protocol, as well as built-in iterators for dealing with list-style or map-style data. JavaScript has a special dedicated Array type and clumsy built-in iteration syntax.

  • Lua has operator overloading and (surprisingly flexible) module importing.

  • Lua allows the keys of a map to be any value (though non-scalars are always compared by identity). JavaScript implicitly converts keys to strings — and since there’s no operator overloading, there’s no way to natively fix this.

These are fairly minor differences, in the grand scheme of language design. And almost every feature in Lua is implemented in a ridiculously simple way; in fact the entire language is described in complete detail in a single web page. So writing JavaScript is always frustrating for me: the language is so close to being much more ergonomic, and yet, it isn’t.

Or, so I thought. As it turns out, while I’ve been off doing other stuff for a few years, browser vendors have been implementing all this pie-in-the-sky stuff from “ES5” and “ES6”, whatever those are. People even upgrade their browsers now. Lo and behold, the last time I went to write JavaScript, I found out that a number of papercuts had actually been solved, and the solutions were sufficiently widely available that I could actually use them in web code.

The weird thing is that I do hear a lot about JavaScript, but the feature I’ve seen raved the most about by far is probably… built-in types for working with arrays of bytes? That’s cool and all, but not exactly the most pressing concern for me.

Anyway, if you also haven’t been keeping tabs on the world of JavaScript, here are some things we missed.

let

MDN docs — supported in Firefox 44, Chrome 41, IE 11, Safari 10

I’m pretty sure I first saw let over a decade ago. Firefox has supported it for ages, but you actually had to opt in by specifying JavaScript version 1.7. Remember JavaScript versions? You know, from back in the days when people actually suggested you write stuff like this:

1
<SCRIPT LANGUAGE="JavaScript1.2" TYPE="text/javascript">

Yikes.

Anyway, so, let declares a variable — but scoped to the immediately containing block, unlike var, which scopes to the innermost function. The trouble with var was that it was very easy to make misleading:

1
2
3
4
5
6
// foo exists here
while (true) {
    var foo = ...;
    ...
}
// foo exists here too

If you reused the same temporary variable name in a different block, or if you expected to be shadowing an outer foo, or if you were trying to do something with creating closures in a loop, this would cause you some trouble.

But no more, because let actually scopes the way it looks like it should, the way variable declarations do in C and friends. As an added bonus, if you refer to a variable declared with let outside of where it’s valid, you’ll get a ReferenceError instead of a silent undefined value. Hooray!

There’s one other interesting quirk to let that I can’t find explicitly documented. Consider:

1
2
3
4
5
6
7
let closures = [];
for (let i = 0; i < 4; i++) {
    closures.push(function() { console.log(i); });
}
for (let j = 0; j < closures.length; j++) {
    closures[j]();
}

If this code had used var i, then it would print 4 four times, because the function-scoped var i means each closure is sharing the same i, whose final value is 4. With let, the output is 0 1 2 3, as you might expect, because each run through the loop gets its own i.

But wait, hang on.

The semantics of a C-style for are that the first expression is only evaluated once, at the very beginning. So there’s only one let i. In fact, it makes no sense for each run through the loop to have a distinct i, because the whole idea of the loop is to modify i each time with i++.

I assume this is simply a special case, since it’s what everyone expects. We expect it so much that I can’t find anyone pointing out that the usual explanation for why it works makes no sense. It has the interesting side effect that for no longer de-sugars perfectly to a while, since this will print all 4s:

1
2
3
4
5
6
7
8
9
closures = [];
let i = 0;
while (i < 4) {
    closures.push(function() { console.log(i); });
    i++;
}
for (let j = 0; j < closures.length; j++) {
    closures[j]();
}

This isn’t a problem — I’m glad let works this way! — it just stands out to me as interesting. Lua doesn’t need a special case here, since it uses an iterator protocol that produces values rather than mutating a visible state variable, so there’s no problem with having the loop variable be truly distinct on each run through the loop.

Classes

MDN docs — supported in Firefox 45, Chrome 42, Safari 9, Edge 13

Prototypical inheritance is pretty cool. The way JavaScript presents it is a little bit opaque, unfortunately, which seems to confuse a lot of people. JavaScript gives you enough functionality to make it work, and even makes it sound like a first-class feature with a property outright called prototype… but to actually use it, you have to do a bunch of weird stuff that doesn’t much look like constructing an object or type.

The funny thing is, people with almost any background get along with Python just fine, and Python uses prototypical inheritance! Nobody ever seems to notice this, because Python tucks it neatly behind a class block that works enough like a Java-style class. (Python also handles inheritance without using the prototype, so it’s a little different… but I digress. Maybe in another post.)

The point is, there’s nothing fundamentally wrong with how JavaScript handles objects; the ergonomics are just terrible.

Lo! They finally added a class keyword. Or, rather, they finally made the class keyword do something; it’s been reserved this entire time.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
class Vector {
    constructor(x, y) {
        this.x = x;
        this.y = y;
    }

    get magnitude() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    }

    dot(other) {
        return this.x * other.x + this.y * other.y;
    }
}

This is all just sugar for existing features: creating a Vector function to act as the constructor, assigning a function to Vector.prototype.dot, and whatever it is you do to make a property. (Oh, there are properties. I’ll get to that in a bit.)

The class block can be used as an expression, with or without a name. It also supports prototypical inheritance with an extends clause and has a super pseudo-value for superclass calls.

It’s a little weird that the inside of the class block has its own special syntax, with function omitted and whatnot, but honestly you’d have a hard time making a class block without special syntax.

One severe omission here is that you can’t declare values inside the block, i.e. you can’t just drop a bar = 3; in there if you want all your objects to share a default attribute. The workaround is to just do this.bar = 3; inside the constructor, but I find that unsatisfying, since it defeats half the point of using prototypes.

Properties

MDN docs — supported in Firefox 4, Chrome 5, IE 9, Safari 5.1

JavaScript historically didn’t have a way to intercept attribute access, which is a travesty. And by “intercept attribute access”, I mean that you couldn’t design a value foo such that evaluating foo.bar runs some code you wrote.

Exciting news: now it does. Or, rather, you can intercept specific attributes, like in the class example above. The above magnitude definition is equivalent to:

1
2
3
4
5
6
7
Object.defineProperty(Vector.prototype, 'magnitude', {
    configurable: true,
    enumerable: true,
    get: function() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    },
});

Beautiful.

And what even are these configurable and enumerable things? It seems that every single key on every single object now has its own set of three Boolean twiddles:

  • configurable means the property itself can be reconfigured with another call to Object.defineProperty.
  • enumerable means the property appears in for..in or Object.keys().
  • writable means the property value can be changed, which only applies to properties with real values rather than accessor functions.

The incredibly wild thing is that for properties defined by Object.defineProperty, configurable and enumerable default to false, meaning that by default accessor properties are immutable and invisible. Super weird.

Nice to have, though. And luckily, it turns out the same syntax as in class also works in object literals.

1
2
3
4
5
6
Vector.prototype = {
    get magnitude() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    },
    ...
};

Alas, I’m not aware of a way to intercept arbitrary attribute access.

Another feature along the same lines is Object.seal(), which marks all of an object’s properties as non-configurable and prevents any new properties from being added to the object. The object is still mutable, but its “shape” can’t be changed. And of course you can just make the object completely immutable if you want, via setting all its properties non-writable, or just using Object.freeze().

I have mixed feelings about the ability to irrevocably change something about a dynamic runtime. It would certainly solve some gripes of former Haskell-minded colleagues, and I don’t have any compelling argument against it, but it feels like it violates some unwritten contract about dynamic languages — surely any structural change made by user code should also be able to be undone by user code?

Slurpy arguments

MDN docs — supported in Firefox 15, Chrome 47, Edge 12, Safari 10

Officially this feature is called “rest parameters”, but that’s a terrible name, no one cares about “arguments” vs “parameters”, and “slurpy” is a good word. Bless you, Perl.

1
2
3
function foo(a, b, ...args) {
    // ...
}

Now you can call foo with as many arguments as you want, and every argument after the second will be collected in args as a regular array.

You can also do the reverse with the spread operator:

1
2
3
4
5
let args = [];
args.push(1);
args.push(2);
args.push(3);
foo(...args);

It even works in array literals, even multiple times:

1
2
let args2 = [...args, ...args];
console.log(args2);  // [1, 2, 3, 1, 2, 3]

Apparently there’s also a proposal for allowing the same thing with objects inside object literals.

Default arguments

MDN docs — supported in Firefox 15, Chrome 49, Edge 14, Safari 10

Yes, arguments can have defaults now. It’s more like Sass than Python — default expressions are evaluated once per call, and later default expressions can refer to earlier arguments. I don’t know how I feel about that but whatever.

1
2
3
function foo(n = 1, m = n + 1, list = []) {
    ...
}

Also, unlike Python, you can have an argument with a default and follow it with an argument without a default, since the default default (!) is and always has been defined as undefined. Er, let me just write it out.

1
2
3
function bar(a = 5, b) {
    ...
}

Arrow functions

MDN docs — supported in Firefox 22, Chrome 45, Edge 12, Safari 10

Perhaps the most humble improvement is the arrow function. It’s a slightly shorter way to write an anonymous function.

1
2
3
(a, b, c) => { ... }
a => { ... }
() => { ... }

An arrow function does not set this or some other magical values, so you can safely use an arrow function as a quick closure inside a method without having to rebind this. Hooray!

Otherwise, arrow functions act pretty much like regular functions; you can even use all the features of regular function signatures.

Arrow functions are particularly nice in combination with all the combinator-style array functions that were added a while ago, like Array.forEach.

1
2
3
[7, 8, 9].forEach(value => {
    console.log(value);
});

Symbol

MDN docs — supported in Firefox 36, Chrome 38, Edge 12, Safari 9

This isn’t quite what I’d call an exciting feature, but it’s necessary for explaining the next one. It’s actually… extremely weird.

symbol is a new kind of primitive (like number and string), not an object (like, er, Number and String). A symbol is created with Symbol('foo'). No, not new Symbol('foo'); that throws a TypeError, for, uh, some reason.

The only point of a symbol is as a unique key. You see, symbols have one very special property: they can be used as object keys, and will not be stringified. Remember, only strings can be keys in JavaScript — even the indices of an array are, semantically speaking, still strings. Symbols are a new exception to this rule.

Also, like other objects, two symbols don’t compare equal to each other: Symbol('foo') != Symbol('foo').

The result is that symbols solve one of the problems that plauges most object systems, something I’ve talked about before: interfaces. Since an interface might be implemented by any arbitrary type, and any arbitrary type might want to implement any number of arbitrary interfaces, all the method names on an interface are effectively part of a single global namespace.

I think I need to take a moment to justify that. If you have IFoo and IBar, both with a method called method, and you want to implement both on the same type… you have a problem. Because most object systems consider “interface” to mean “I have a method called method, with no way to say which interface’s method you mean. This is a hard problem to avoid, because IFoo and IBar might not even come from the same library. Occasionally languages offer a clumsy way to “rename” one method or the other, but the most common approach seems to be for interface designers to avoid names that sound “too common”. You end up with redundant mouthfuls like IFoo.foo_method.

This incredibly sucks, and the only languages I’m aware of that avoid the problem are the ML family and Rust. In Rust, you define all the methods for a particular trait (interface) in a separate block, away from the type’s “own” methods. It’s pretty slick. You can still do obj.method(), and as long as there’s only one method among all the available traits, you’ll get that one. If not, there’s syntax for explicitly saying which trait you mean, which I can’t remember because I’ve never had to use it.

Symbols are JavaScript’s answer to this problem. If you want to define some interface, you can name its methods with symbols, which are guaranteed to be unique. You just have to make sure you keep the symbol around somewhere accessible so other people can actually use it. (Or… not?)

The interesting thing is that JavaScript now has several of its own symbols built in, allowing user objects to implement features that were previously reserved for built-in types. For example, you can use the Symbol.hasInstance symbol — which is simply where the language is storing an existing symbol and is not the same as Symbol('hasInstance')! — to override instanceof:

1
2
3
4
5
6
7
8
// oh my god don't do this though
class EvenNumber {
    static [Symbol.hasInstance](obj) {
        return obj % 2 == 0;
    }
}
console.log(2 instanceof EvenNumber);  // true
console.log(3 instanceof EvenNumber);  // false

Oh, and those brackets around Symbol.hasInstance are a sort of reverse-quoting — they indicate an expression to use where the language would normally expect a literal identifier. I think they work as object keys, too, and maybe some other places.

The equivalent in Python is to implement a method called __instancecheck__, a name which is not special in any way except that Python has reserved all method names of the form __foo__. That’s great for Python, but doesn’t really help user code. JavaScript has actually outclassed (ho ho) Python here.

Of course, obj[BobNamespace.some_method]() is not the prettiest way to call an interface method, so it’s not perfect. I imagine this would be best implemented in user code by exposing a polymorphic function, similar to how Python’s len(obj) pretty much just calls obj.__len__().

I only bring this up because it’s the plumbing behind one of the most incredible things in JavaScript that I didn’t even know about until I started writing this post. I’m so excited oh my gosh. Are you ready? It’s:

Iteration protocol

MDN docs — supported in Firefox 27, Chrome 39, Safari 10; still experimental in Edge

Yes! Amazing! JavaScript has first-class support for iteration! I can’t even believe this.

It works pretty much how you’d expect, or at least, how I’d expect. You give your object a method called Symbol.iterator, and that returns an iterator.

What’s an iterator? It’s an object with a next() method that returns the next value and whether the iterator is exhausted.

Wait, wait, wait a second. Hang on. The method is called next? Really? You didn’t go for Symbol.next? Python 2 did exactly the same thing, then realized its mistake and changed it to __next__ in Python 3. Why did you do this?

Well, anyway. My go-to test of an iterator protocol is how hard it is to write an equivalent to Python’s enumerate(), which takes a list and iterates over its values and their indices. In Python it looks like this:

1
2
3
4
5
for i, value in enumerate(['one', 'two', 'three']):
    print(i, value)
# 0 one
# 1 two
# 2 three

It’s super nice to have, and I’m always amazed when languages with “strong” “support” for iteration don’t have it. Like, C# doesn’t. So if you want to iterate over a list but also need indices, you need to fall back to a C-style for loop. And if you want to iterate over a lazy or arbitrary iterable but also need indices, you need to track it yourself with a counter. Ridiculous.

Here’s my attempt at building it in JavaScript.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
function enumerate(iterable) {
    // Return a new iter*able* object with a Symbol.iterator method that
    // returns an iterator.
    return {
        [Symbol.iterator]: function() {
            let iterator = iterable[Symbol.iterator]();
            let i = 0;

            return {
                next: function() {
                    let nextval = iterator.next();
                    if (! nextval.done) {
                        nextval.value = [i, nextval.value];
                        i++;
                    }
                    return nextval;
                },
            };
        },
    };
}
for (let [i, value] of enumerate(['one', 'two', 'three'])) {
    console.log(i, value);
}
// 0 one
// 1 two
// 2 three

Incidentally, for..of (which iterates over a sequence, unlike for..in which iterates over keys — obviously) is finally supported in Edge 12. Hallelujah.

Oh, and let [i, value] is destructuring assignment, which is also a thing now and works with objects as well. You can even use the splat operator with it! Like Python! (And you can use it in function signatures! Like Python! Wait, no, Python decided that was terrible and removed it in 3…)

1
let [x, y, ...others] = ['apple', 'orange', 'cherry', 'banana'];

It’s a Halloween miracle. 🎃

Generators

MDN docs — supported in Firefox 26, Chrome 39, Edge 13, Safari 10

That’s right, JavaScript has goddamn generators now. It’s basically just copying Python and adding a lot of superfluous punctuation everywhere. Not that I’m complaining.

Also, generators are themselves iterable, so I’m going to cut to the chase and rewrite my enumerate() with a generator.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
function enumerate(iterable) {
    return {
        [Symbol.iterator]: function*() {
            let i = 0;
            for (let value of iterable) {
                yield [i, value];
                i++;
            }
        },
    };
}
for (let [i, value] of enumerate(['one', 'two', 'three'])) {
    console.log(i, value);
}
// 0 one
// 1 two
// 2 three

Amazing. function* is a pretty strange choice of syntax, but whatever? I guess it also lets them make yield only act as a keyword inside a generator, for ultimate backwards compatibility.

JavaScript generators support everything Python generators do: yield* yields every item from a subsequence, like Python’s yield from; generators can return final values; you can pass values back into the generator if you iterate it by hand. No, really, I wasn’t kidding, it’s basically just copying Python. It’s great. You could now built asyncio in JavaScript!

In fact, they did that! JavaScript now has async and await. An async function returns a Promise, which is also a built-in type now. Amazing.

Sets and maps

MDN docs for MapMDN docs for Set — supported in Firefox 13, Chrome 38, IE 11, Safari 7.1

I did not save the best for last. This is much less exciting than generators. But still exciting.

The only data structure in JavaScript is the object, a map where the strings are keys. (Or now, also symbols, I guess.) That means you can’t readily use custom values as keys, nor simulate a set of arbitrary objects. And you have to worry about people mucking with Object.prototype, yikes.

But now, there’s Map and Set! Wow.

Unfortunately, because JavaScript, Map couldn’t use the indexing operators without losing the ability to have methods, so you have to use a boring old method-based API. But Map has convenient methods that plain objects don’t, like entries() to iterate over pairs of keys and values. In fact, you can use a map with for..of to get key/value pairs. So that’s nice.

Perhaps more interesting, there’s also now a WeakMap and WeakSet, where the keys are weak references. I don’t think JavaScript had any way to do weak references before this, so that’s pretty slick. There’s no obvious way to hold a weak value, but I guess you could substitute a WeakSet with only one item.

Template literals

MDN docs — supported in Firefox 34, Chrome 41, Edge 12, Safari 9

Template literals are JavaScript’s answer to string interpolation, which has historically been a huge pain in the ass because it doesn’t even have string formatting in the standard library.

They’re just strings delimited by backticks instead of quotes. They can span multiple lines and contain expressions.

1
2
console.log(`one plus
two is ${1 + 2}`);

Someone decided it would be a good idea to allow nesting more sets of backticks inside a ${} expression, so, good luck to syntax highlighters.

However, someone also had the most incredible idea ever, which was to add syntax allowing user code to do the interpolation — so you can do custom escaping, when absolutely necessary, which is virtually never, because “escaping” means you’re building a structured format by slopping strings together willy-nilly instead of using some API that works with the structure.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// OF COURSE, YOU SHOULDN'T BE DOING THIS ANYWAY; YOU SHOULD BUILD HTML WITH
// THE DOM API AND USE .textContent FOR LITERAL TEXT.  BUT AS AN EXAMPLE:
function html(literals, ...values) {
    let ret = [];
    literals.forEach((literal, i) => {
        if (i > 0) {
            // Is there seriously still not a built-in function for doing this?
            // Well, probably because you SHOULDN'T BE DOING IT
            ret.push(values[i - 1]
                .replace(/&/g, '&amp;')
                .replace(/</g, '&lt;')
                .replace(/>/g, '&gt;')
                .replace(/"/g, '&quot;')
                .replace(/'/g, '&apos;'));
        }
        ret.push(literal);
    });
    return ret.join('');
}
let username = 'Bob<script>';
let result = html`<b>Hello, ${username}!</b>`;
console.log(result);
// <b>Hello, Bob&lt;script&gt;!</b>

It’s a shame this feature is in JavaScript, the language where you are least likely to need it.

Trailing commas

Remember how you couldn’t do this for ages, because ass-old IE considered it a syntax error and would reject the entire script?

1
2
3
4
5
{
    a: 'one',
    b: 'two',
    c: 'three',  // <- THIS GUY RIGHT HERE
}

Well now it’s part of the goddamn spec and if there’s anything in this post you can rely on, it’s this. In fact you can use AS MANY GODDAMN TRAILING COMMAS AS YOU WANT. But only in arrays.

1
[1, 2, 3,,,,,,,,,,,,,,,,,,,,,,,,,]

Apparently that has the bizarre side effect of reserving extra space at the end of the array, without putting values there.

And more, probably

Like strict mode, which makes a few silent “errors” be actual errors, forces you to declare variables (no implicit globals!), and forbids the completely bozotic with block.

Or String.trim(), which trims whitespace off of strings.

Or… Math.sign()? That’s new? Seriously? Well, okay.

Or the Proxy type, which lets you customize indexing and assignment and calling. Oh. I guess that is possible, though this is a pretty weird way to do it; why not just use symbol-named methods?

You can write Unicode escapes for astral plane characters in strings (or identifiers!), as \u{XXXXXXXX}.

There’s a const now? I extremely don’t care, just name it in all caps and don’t reassign it, come on.

There’s also a mountain of other minor things, which you can peruse at your leisure via MDN or the ECMAScript compatibility tables (note the links at the top, too).

That’s all I’ve got. I still wouldn’t say I’m a big fan of JavaScript, but it’s definitely making an effort to clean up some goofy inconsistencies and solve common problems. I think I could even write some without yelling on Twitter about it now.

On the other hand, if you’re still stuck supporting IE 10 for some reason… well, er, my condolences.

[$] What’s the best way to prevent kernel pointer leaks?

Post Syndicated from corbet original https://lwn.net/Articles/735589/rss

An attacker who seeks to compromise a running kernel by overwriting
kernel data structures or forcing a jump to specific kernel code must, in
either case, have some idea of where the target objects are in memory.
Techniques like kernel address-space layout randomization have been created
in the
hope of denying that knowledge, but that effort is wasted if the kernel
leaks information about where it has been placed in memory. Developers
have been plugging pointer leaks for years but, as a recent discussion
shows, there is still some disagreement over the best way to prevent
attackers from learning about the kernel’s address-space layout.

MPAA Reports Pirate Sites, Hosts and Ad-Networks to US Government

Post Syndicated from Ernesto original https://torrentfreak.com/mpaa-reports-pirate-sites-hosts-and-ad-networks-to-us-government-171004/

Responding to a request from the Office of the US Trade Representative (USTR), the MPAA has submitted an updated list of “notorious markets” that it says promote the illegal distribution of movies and TV-shows.

These annual submissions help to guide the U.S. Government’s position towards foreign countries when it comes to copyright enforcement.

What stands out in the MPAA’s latest overview is that it no longer includes offline markets, only sites and services that are available on the Internet. This suggests that online copyright infringement is seen as a priority.

The MPAA’s report includes more than two dozen alleged pirate sites in various categories. While this is not an exhaustive list, the movie industry specifically highlights some of the worst offenders in various categories.

“Content thieves take advantage of a wide constellation of easy-to-use online technologies, such as direct download and streaming, to create infringing sites and applications, often with the look and feel of legitimate content distributors, luring unsuspecting consumers into piracy,” the MPAA writes.

According to the MPAA, torrent sites remain popular, serving millions of torrents to tens of millions of users at any given time.

The Pirate Bay has traditionally been one of the main targets. Based on data from Alexa and SimilarWeb, the MPAA says that TPB has about 62 million unique visitors per month. The other torrent sites mentioned are 1337x.to, Rarbg.to, Rutracker.org, and Torrentz2.eu.

MPAA calls out torrent sites

The second highlighted category covers various linking and streaming sites. This includes the likes of Fmovies.is, Gostream.is, Primewire.ag, Kinogo.club, MeWatchSeries.to, Movie4k.tv and Repelis.tv.

Direct download sites and video hosting services also get a mention. Nowvideo.sx, Openload.co, Rapidgator.net, Uploaded.net and the Russian social network VK.com. Many of these services refuse to properly process takedown notices, the MPAA claims.

The last category is new and centers around piracy apps. These sites offer mobile applications that allow users to stream pirated content, such as IpPlayBox.tv, MoreTV, 3DBoBoVR, TVBrowser, and KuaiKa, which are particularly popular in Asia.

Aside from listing specific sites, the MPAA also draws the US Government’s attention to the streaming box problem. The report specifically mentions that Kodi-powered boxes are regularly abused for infringing purposes.

“An emerging global threat is streaming piracy which is enabled by piracy devices preloaded with software to illicitly stream movies and television programming and a burgeoning ecosystem of infringing add-ons,” the MPAA notes.

“The most popular software is an open source media player software, Kodi. Although Kodi is not itself unlawful, and does not host or link to unlicensed content, it can be easily configured to direct consumers toward unlicensed films and television shows.”

Pirate streaming boxes

There are more than 750 websites offering infringing devices, the Hollywood group notes, adding that the rapid growth of this problem is startling. Interestingly, the report mentions TVAddons.ag as a “piracy add-on repository,” noting that it’s currently offline. Whether the new TVAddons is also seen a problematic is unclear.

The MPAA also continues its trend of calling out third-party intermediaries, including hosting providers. These companies refuse to take pirate sites offline following complaints, even when the MPAA views them as blatantly violating the law.

“Hosting companies provide the essential infrastructure required to operate a website,” the MPAA writes. “Given the central role of hosting providers in the online ecosystem, it is very concerning that many refuse to take action upon being notified…”

The Hollywood group specifically mentions Private Layer and Netbrella as notorious markets. CDN provider CloudFlare is also named. As a US-based company, the latter can’t be included in the list. However, the MPAA explains that it is often used as an anonymization tool by sites and services that are mentioned in the report.

Another group of intermediaries that play a role in fueling piracy (mentioned for the first time) are advertising networks. The MPAA specifically calls out the Canadian company WWWPromoter, which works with sites such as Primewire.ag, Projectfreetv.at and 123movies.to

“The companies connecting advertisers to infringing websites and inadvertently contribute to the prevalence and prosperity of infringing sites by providing funding to the operators of these sites through advertising revenue,” the MPAA writes.

The MPAA’s full report is available here (pdf). The USTR will use this input above to make up its own list of notorious markets. This will help to identify current threats and call on foreign governments to take appropriate action.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

[$] Improvements in the block layer

Post Syndicated from jake original https://lwn.net/Articles/735275/rss

Jens Axboe is the
maintainer of the block layer of the kernel. In this capacity,
he spoke at Kernel Recipes
2017
on what’s new in the storage world for
Linux, with a particular focus on the new block-multiqueue subsystem:
the degree to which it’s been adopted, a number of optimizations that
have recently been made, and a bit of speculation about
how it will further improve in the future.

Subscribers can click below for a report from the Kernel Recipes talk by
guest author Tom Yates.

PS4 Piracy Now Exists – If Gamers Want to Jump Through Hoops

Post Syndicated from Andy original https://torrentfreak.com/ps4-piracy-now-exists-if-gamers-want-to-jump-through-hoops-170930/

During the reign of the first few generations of consoles, gamers became accustomed to their machines being compromised by hacking groups and enthusiasts, to enable the execution of third-party software.

Often carried out under the banner of running “homebrew” code, so-called jailbroken consoles also brought with them the prospect of running pirate copies of officially produced games. Once the floodgates were opened, not much could hold things back.

With the advent of mass online gaming, however, things became more complex. Regular firmware updates mean that security holes could be fixed remotely whenever a user went online, rendering the jailbreaking process a cat-and-mouse game with continually moving targets.

This, coupled with massively improved overall security, has meant that the current generation of consoles has remained largely piracy free, at least on a do-it-at-home basis. Now, however, that position is set to change after the first decrypted PS4 game dumps began to hit the web this week.

Thanks to release group KOTF (Knights of the Fallen), Grand Theft Auto V, Far Cry 4, and Assassins Creed IV are all available for download from the usual places. As expected they are pretty meaty downloads, with GTAV weighing in via 90 x 500MB files, Far Cry4 via 54 of the same size, and ACIV sporting 84 x 250MB.

Partial NFO file for PS4 GTA V

While undoubtedly large, it’s not the filesize that will prove most prohibitive when it comes to getting these beasts to run on a PlayStation 4. Indeed, a potential pirate will need to jump through a number of hoops to enjoy any of these titles or others that may appear in the near future.

KOTF explains as much in the NFO (information) files it includes with its releases. The list of requirements is long.

First up, a gamer needs to possess a PS4 with an extremely old firmware version – v1.76 – which was released way back in August 2014. The fact this firmware is required doesn’t come as a surprise since it was successfully jailbroken back in December 2015.

The age of the firmware raises several issues, not least where people can obtain a PS4 that’s so old it still has this firmware intact. Also, newer games require later firmware, so most games released during the past two to three years won’t be compatible with v1.76. That limits the pool of games considerably.

Finally, forget going online with such an old software version. Sony will be all over it like a cheap suit, plotting to do something unpleasant to that cheeky antique code, given half a chance. And, for anyone wondering, downgrading a higher firmware version to v1.76 isn’t possible – yet.

But for gamers who want a little bit of recent PS4 nostalgia on the cheap, ‘all’ they have to do is gather the necessary tools together and follow the instructions below.

Easy – when you know how

While this is a landmark moment for PS4 piracy (which to date has mainly centered around much hocus pocus), the limitations listed above mean that it isn’t going to hit the mainstream just yet.

That being said, all things are possible when given the right people, determination, and enough time. Whether that will be anytime soon is anyone’s guess but there are rumors that firmware v4.55 has already been exploited, so you never know.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

AWS Hot Startups – September 2017

Post Syndicated from Tina Barr original https://aws.amazon.com/blogs/aws/aws-hot-startups-september-2017/

As consumers continue to demand faster, simpler, and more on-the-go services, FinTech companies are responding with ever more innovative solutions to fit everyone’s needs and to improve customer experience. This month, we are excited to feature the following startups—all of whom are disrupting traditional financial services in unique ways:

  • Acorns – allowing customers to invest spare change automatically.
  • Bondlinc – improving the bond trading experience for clients, financial institutions, and private banks.
  • Lenda – reimagining homeownership with a secure and streamlined online service.

Acorns (Irvine, CA)

Driven by the belief that anyone can grow wealth, Acorns is relentlessly pursuing ways to help make that happen. Currently the fastest-growing micro-investing app in the U.S., Acorns takes mere minutes to get started and is currently helping over 2.2 million people grow their wealth. And unlike other FinTech apps, Acorns is focused on helping America’s middle class – namely the 182 million citizens who make less than $100,000 per year – and looking after their financial best interests.

Acorns is able to help their customers effortlessly invest their money, little by little, by offering ETF portfolios put together by Dr. Harry Markowitz, a Nobel Laureate in economic sciences. They also offer a range of services, including “Round-Ups,” whereby customers can automatically invest spare change from every day purchases, and “Recurring Investments,” through which customers can set up automatic transfers of just $5 per week into their portfolio. Additionally, Found Money, Acorns’ earning platform, can help anyone spend smarter as the company connects customers to brands like Lyft, Airbnb, and Skillshare, who then automatically invest in customers’ Acorns account.

The Acorns platform runs entirely on AWS, allowing them to deliver a secure and scalable cloud-based experience. By utilizing AWS, Acorns is able to offer an exceptional customer experience and fulfill its core mission. Acorns uses Terraform to manage services such as Amazon EC2 Container Service, Amazon CloudFront, and Amazon S3. They also use Amazon RDS and Amazon Redshift for data storage, and Amazon Glacier to manage document retention.

Acorns is hiring! Be sure to check out their careers page if you are interested.

Bondlinc (Singapore)

Eng Keong, Founder and CEO of Bondlinc, has long wanted to standardize, improve, and automate the traditional workflows that revolve around bond trading. As a former trader at BNP Paribas and Jefferies & Company, E.K. – as Keong is known – had personally seen how manual processes led to information bottlenecks in over-the-counter practices. This drove him, along with future Bondlinc CTO Vincent Caldeira, to start a new service that maximizes efficiency, information distribution, and accessibility for both clients and bankers in the bond market.

Currently, bond trading requires banks to spend a significant amount of resources retrieving data from expensive and restricted institutional sources, performing suitability checks, and attaching required documentation before presenting all relevant information to clients – usually by email. Bankers are often overwhelmed by these time-consuming tasks, which means clients don’t always get proper access to time-sensitive bond information and pricing. Bondlinc bridges this gap between banks and clients by providing a variety of solutions, including easy access to basic bond information and analytics, updates of new issues and relevant news, consolidated management of your portfolio, and a chat function between banker and client. By making the bond market much more accessible to clients, Bondlinc is taking private banking to the next level, while improving efficiency of the banks as well.

As a startup running on AWS since inception, Bondlinc has built and operated its SaaS product by leveraging Amazon EC2, Amazon S3, Elastic Load Balancing, and Amazon RDS across multiple Availability Zones to provide its customers (namely, financial institutions) a highly available and seamlessly scalable product distribution platform. Bondlinc also makes extensive use of Amazon CloudWatch, AWS CloudTrail, and Amazon SNS to meet the stringent operational monitoring, auditing, compliance, and governance requirements of its customers. Bondlinc is currently experimenting with Amazon Lex to build a conversational interface into its mobile application via a chat-bot that provides trading assistance services.

To see how Bondlinc works, request a demo at Bondlinc.com.

Lenda (San Francisco, CA)

Lenda is a digital mortgage company founded by seasoned FinTech entrepreneur Jason van den Brand. Jason wanted to create a smarter, simpler, and more streamlined system for people to either get a mortgage or refinance their homes. With Lenda, customers can find out if they are pre-approved for loans, and receive accurate, real-time mortgage rate quotes from industry-experienced home loan advisors. Lenda’s advisors support customers through the loan process by providing financial advice and guidance for a seamless experience.

Lenda’s innovative platform allows borrowers to complete their home loans online from start to finish. Through a savvy combination of being a direct lender with proprietary technology, Lenda has simplified the mortgage application process to save customers time and money. With an interactive dashboard, customers know exactly where they are in the mortgage process and can manage all of their documents in one place. The company recently received its Series A funding of $5.25 million, and van den Brand shared that most of the capital investment will be used to improve Lenda’s technology and fulfill the company’s mission, which is to reimagine homeownership, starting with home loans.

AWS allows Lenda to scale its business while providing a secure, easy-to-use system for a faster home loan approval process. Currently, Lenda uses Amazon S3, Amazon EC2, Amazon CloudFront, Amazon Redshift, and Amazon WorkSpaces.

Visit Lenda.com to find out more.

Thanks for reading and see you in October for another round of hot startups!

-Tina

‘Daily Stormer’ Termination Haunts Cloudflare in Online Piracy Case

Post Syndicated from Ernesto original https://torrentfreak.com/daily-stormer-termination-haunts-cloudflare-in-online-piracy-case-170929/

Last month Cloudflare CEO Matthew Prince decided to terminate the account of controversial neo-Nazi site Daily Stormer.

“I woke up this morning in a bad mood and decided to kick them off the Internet,” he announced.

While the decision is understandable from an emotional point of view, it’s quite a statement to make as the CEO of one of the largest Internet infrastructure companies. Not least because it goes directly against what many saw as Cloudflare’s core values.

For years on end, Cloudflare has been asked to remove terrorist propaganda, pirate sites, and other controversial content. Each time, Cloudflare replied that it doesn’t take action without a court order. No exceptions.

In addition, Cloudflare repeatedly stressed that it was impossible for them to remove a website from the Internet, at least not permanently. It would only require a simple DNS reconfiguration to get it back up and running.

While the Daily Stormer case has nothing to do with piracy or copyright infringement, it’s now being brought up as important evidence in an ongoing piracy liability case. Adult entertainment publisher ALS Scan views Prince as a “key witness” in the case and wants to depose Cloudflare’s CEO to find out more about his decision.

“Mr. Prince’s statement to the public that Cloudflare kicked neo-Nazis off the internet stand in sharp contrast to Cloudflare’s testimony in this case, where it claims it is powerless to remove content from the Internet,” ALS Scan writes.

The above is part of a recent submission where both parties argue over whether Prince can be deposed or not. Cloudflare wants to prevent this from happening and claims it’s unnecessary, but the adult publisher disagrees.

“By his own admissions, Mr. Prince’s decision to terminate certain users’ accounts was ‘arbitrary,’ the result of him waking up ‘in a bad mood,’ and a decision he made unilaterally as ‘CEO of a major Internet infrastructure corporation’.

“Mr. Prince has made it clear that he is the one who determines the circumstances under which Cloudflare will terminate a user’s account,” ALS Scan adds.

For its part, Cloudflare says that the CEO’s deposition is not needed. This is backed up by a declaration where Prince emphasizes that he has no unique knowledge on the company’s DMCA and repeat infringer policies, issues that directly relate to the case at hand.

“I have no unique personal knowledge regarding Cloudflare’s DMCA policy and procedure, including its repeat infringer policies, or Cloudflare’s published Terms of Service,” Prince informs the court

Prince’s declaration

The adult publisher, however, harps on the fact that the CEO arbitrarily decided to remove one site from the service, while requiring court orders in other instances. They quote from a Wall Street Journal (WSJ) article he wrote and highlight the ‘kick off the internet’ claim, which contradicts earlier statements.

Cloudflare’s lawyers contend that the WSJ article in question was meant to kick off a conversation and shouldn’t be taken literally.

“The WSJ Article was intended as an intellectual exercise to start a conversation regarding censorship and free speech on the internet. The WSJ Article had nothing to do with copyright infringement issues or Cloudflare’s DMCA policy and procedure.

“When Mr. Prince stated in the WSJ Article that ‘[he] helped kick a group of neo-Nazis off the internet last week,’ his comments were intended to illustrate a point – not to be taken literally,” Cloudflare’s legal team adds.

The deposition of Trey Guinn, a technical employee at Cloudflare, confirms that the company doesn’t have the power to cut a site off the Internet. It further suggests that the entire removal of Daily Stormer was in essence a provocation to start a conversation around freedom of speech.

From Guinn’s deposition

Still, since the lawsuit in question revolves around terminating customers, ALS Scan wants to depose Price to find out exactly when clients are terminated, and why he decided to go beyond Couldflare’s usual policy.

“No other employee can testify to Mr. Prince’s decision-making process when it comes to terminating a user’s access. No other employee can offer an explanation as to why The Daily Stormer’s account was terminated while repeat infringers’ accounts are allowed to remain.

“In a case where Mr. Prince’s personal judgment appears to govern even over Cloudflare’s own policies and procedures, Cloudflare cannot meet its heavy burden of demonstrating why he should not be deposed,” ALS Scan’s lawyers add.

To be continued.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Creating a Cost-Efficient Amazon ECS Cluster for Scheduled Tasks

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or other tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances.  Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose these logs to be delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

  • Amazon ECR repository for storing the Docker image to be used in the task definition
  • S3 bucket that holds the transferred logs
  • Task definition, with image name and S3 bucket as environment variables provided via input parameter
  • CloudWatch Events rule
  • Amazon ECS cluster
  • Amazon ECS container instances using Spot Fleet
  • IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

  1. Clone the code from GitHub that performs RDS API calls to retrieve the log files.
    git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
  2. Build and tag the image.
    cd aws-ecs-scheduled-tasks/container-code/src && ls

    Dockerfile		rdslogsshipper.py	requirements.txt

    docker build -t rdslogsshipper .

    Sending build context to Docker daemon 9.728 kB
    Step 1 : FROM python:3
     ---> 41397f4f2887
    Step 2 : WORKDIR /usr/src/app
     ---> Using cache
     ---> 59299c020e7e
    Step 3 : COPY requirements.txt ./
     ---> 8c017e931c3b
    Removing intermediate container df09e1bed9f2
    Step 4 : COPY rdslogsshipper.py /usr/src/app
     ---> 099a49ca4325
    Removing intermediate container 1b1da24a6699
    Step 5 : RUN pip install --no-cache-dir -r requirements.txt
     ---> Running in 3ed98b30901d
    Collecting boto3 (from -r requirements.txt (line 1))
      Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB)
    Collecting botocore (from -r requirements.txt (line 2))
      Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB)
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1))
      Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB)
    Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1))
      Downloading jmespath-0.9.3-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2))
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2))
      Downloading docutils-0.14-py3-none-any.whl (543kB)
    Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2))
      Downloading six-1.10.0-py2.py3-none-any.whl
    Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3
    Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0
     ---> f892d3cb7383
    Removing intermediate container 3ed98b30901d
    Step 6 : COPY . .
     ---> ea7550c04fea
    Removing intermediate container b558b3ebd406
    Successfully built ea7550c04fea
  3. Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
  4. Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
  5. Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
    aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule  --schedule-expression "rate(15 minutes)"

    {
        "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule"
    }
  6. CloudWatch requires IAM permissions to place a task on the Amazon ECS cluster when the CloudWatch event rule is executed, in addition to an IAM role that can be assumed by CloudWatch Events. This is done in three steps:
    1. Create the IAM role to be assumed by CloudWatch.
      aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json

      {
          "Role": {
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17", 
                  "Statement": [
                      {
                          "Action": "sts:AssumeRole", 
                          "Effect": "Allow", 
                          "Principal": {
                              "Service": "events.amazonaws.com"
                          }
                      }
                  ]
              }, 
              "RoleId": "AROAIRYYLDCVZCUACT7FS", 
              "CreateDate": "2017-07-14T22:44:52.627Z", 
              "RoleName": "Test-Role", 
              "Path": "/", 
              "Arn": "arn:aws:iam::12345678901:role/Test-Role"
          }
      }

      The following is an example of the event-role.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                    "Service": "events.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
      aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json

      {
          "Policy": {
              "PolicyName": "test-policy", 
              "CreateDate": "2017-07-14T22:51:20.293Z", 
              "AttachmentCount": 0, 
              "IsAttachable": true, 
              "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", 
              "DefaultVersionId": "v1", 
              "Path": "/", 
              "Arn": "arn:aws:iam::123455678901:policy/test-policy", 
              "UpdateDate": "2017-07-14T22:51:20.293Z"
          }
      }

      The following is an example of the event-policy.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecs:RunTask"
                ],
                "Resource": [
                    "arn:aws:ecs:*::task-definition/"
                ],
                "Condition": {
                    "ArnLike": {
                        "ecs:cluster": "arn:aws:ecs:*::cluster/"
                    }
                }
            }
          ]
      }
    3. Attach the IAM policy to the role.
      aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
  7. Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
    aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"

    {
        "FailedEntries": [], 
        "FailedEntryCount": 0
    }

That’s it. The logs now run based on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run outside of Lambda execution time limits or be not cost-effective for an EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.