Tag Archives: 3.0

Managing AWS Lambda Function Concurrency

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/managing-aws-lambda-function-concurrency/

One of the key benefits of serverless applications is the ease in which they can scale to meet traffic demands or requests, with little to no need for capacity planning. In AWS Lambda, which is the core of the serverless platform at AWS, the unit of scale is a concurrent execution. This refers to the number of executions of your function code that are happening at any given time.

Thinking about concurrent executions as a unit of scale is a fairly unique concept. In this post, I dive deeper into this and talk about how you can make use of per function concurrency limits in Lambda.

Understanding concurrency in Lambda

Instead of diving right into the guts of how Lambda works, here’s an appetizing analogy: a magical pizza.
Yes, a magical pizza!

This magical pizza has some unique properties:

  • It has a fixed maximum number of slices, such as 8.
  • Slices automatically re-appear after they are consumed.
  • When you take a slice from the pizza, it does not re-appear until it has been completely consumed.
  • One person can take multiple slices at a time.
  • You can easily ask to have the number of slices increased, but they remain fixed at any point in time otherwise.

Now that the magical pizza’s properties are defined, here’s a hypothetical situation of some friends sharing this pizza.

Shawn, Kate, Daniela, Chuck, Ian and Avleen get together every Friday to share a pizza and catch up on their week. As there is just six of them, they can easily all enjoy a slice of pizza at a time. As they finish each slice, it re-appears in the pizza pan and they can take another slice again. Given the magical properties of their pizza, they can continue to eat all they want, but with two very important constraints:

  • If any of them take too many slices at once, the others may not get as much as they want.
  • If they take too many slices, they might also eat too much and get sick.

One particular week, some of the friends are hungrier than the rest, taking two slices at a time instead of just one. If more than two of them try to take two pieces at a time, this can cause contention for pizza slices. Some of them would wait hungry for the slices to re-appear. They could ask for a pizza with more slices, but then run the same risk again later if more hungry friends join than planned for.

What can they do?

If the friends agreed to accept a limit for the maximum number of slices they each eat concurrently, both of these issues are avoided. Some could have a maximum of 2 of the 8 slices, or other concurrency limits that were more or less. Just so long as they kept it at or under eight total slices to be eaten at one time. This would keep any from going hungry or eating too much. The six friends can happily enjoy their magical pizza without worry!

Concurrency in Lambda

Concurrency in Lambda actually works similarly to the magical pizza model. Each AWS Account has an overall AccountLimit value that is fixed at any point in time, but can be easily increased as needed, just like the count of slices in the pizza. As of May 2017, the default limit is 1000 “slices” of concurrency per AWS Region.

Also like the magical pizza, each concurrency “slice” can only be consumed individually one at a time. After consumption, it becomes available to be consumed again. Services invoking Lambda functions can consume multiple slices of concurrency at the same time, just like the group of friends can take multiple slices of the pizza.

Let’s take our example of the six friends and bring it back to AWS services that commonly invoke Lambda:

  • Amazon S3
  • Amazon Kinesis
  • Amazon DynamoDB
  • Amazon Cognito

In a single account with the default concurrency limit of 1000 concurrent executions, any of these four services could invoke enough functions to consume the entire limit or some part of it. Just like with the pizza example, there is the possibility for two issues to pop up:

  • One or more of these services could invoke enough functions to consume a majority of the available concurrency capacity. This could cause others to be starved for it, causing failed invocations.
  • A service could consume too much concurrent capacity and cause a downstream service or database to be overwhelmed, which could cause failed executions.

For Lambda functions that are launched in a VPC, you have the potential to consume the available IP addresses in a subnet or the maximum number of elastic network interfaces to which your account has access. For more information, see Configuring a Lambda Function to Access Resources in an Amazon VPC. For information about elastic network interface limits, see Network Interfaces section in the Amazon VPC Limits topic.

One way to solve both of these problems is applying a concurrency limit to the Lambda functions in an account.

Configuring per function concurrency limits

You can now set a concurrency limit on individual Lambda functions in an account. The concurrency limit that you set reserves a portion of your account level concurrency for a given function. All of your functions’ concurrent executions count against this account-level limit by default.

If you set a concurrency limit for a specific function, then that function’s concurrency limit allocation is deducted from the shared pool and assigned to that specific function. AWS also reserves 100 units of concurrency for all functions that don’t have a specified concurrency limit set. This helps to make sure that future functions have capacity to be consumed.

Going back to the example of the consuming services, you could set throttles for the functions as follows:

Amazon S3 function = 350
Amazon Kinesis function = 200
Amazon DynamoDB function = 200
Amazon Cognito function = 150
Total = 900

With the 100 reserved for all non-concurrency reserved functions, this totals the account limit of 1000.

Here’s how this works. To start, create a basic Lambda function that is invoked via Amazon API Gateway. This Lambda function returns a single “Hello World” statement with an added sleep time between 2 and 5 seconds. The sleep time simulates an API providing some sort of capability that can take a varied amount of time. The goal here is to show how an API that is underloaded can reach its concurrency limit, and what happens when it does.
To create the example function

  1. Open the Lambda console.
  2. Choose Create Function.
  3. For Author from scratch, enter the following values:
    1. For Name, enter a value (such as concurrencyBlog01).
    2. For Runtime, choose Python 3.6.
    3. For Role, choose Create new role from template and enter a name aligned with this function, such as concurrencyBlogRole.
  4. Choose Create function.
  5. The function is created with some basic example code. Replace that code with the following:

import time
from random import randint
seconds = randint(2, 5)

def lambda_handler(event, context):
time.sleep(seconds)
return {"statusCode": 200,
"body": ("Hello world, slept " + str(seconds) + " seconds"),
"headers":
{
"Access-Control-Allow-Headers": "Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token",
"Access-Control-Allow-Methods": "GET,OPTIONS",
}}

  1. Under Basic settings, set Timeout to 10 seconds. While this function should only ever take up to 5-6 seconds (with the 5-second max sleep), this gives you a little bit of room if it takes longer.

  1. Choose Save at the top right.

At this point, your function is configured for this example. Test it and confirm this in the console:

  1. Choose Test.
  2. Enter a name (it doesn’t matter for this example).
  3. Choose Create.
  4. In the console, choose Test again.
  5. You should see output similar to the following:

Now configure API Gateway so that you have an HTTPS endpoint to test against.

  1. In the Lambda console, choose Configuration.
  2. Under Triggers, choose API Gateway.
  3. Open the API Gateway icon now shown as attached to your Lambda function:

  1. Under Configure triggers, leave the default values for API Name and Deployment stage. For Security, choose Open.
  2. Choose Add, Save.

API Gateway is now configured to invoke Lambda at the Invoke URL shown under its configuration. You can take this URL and test it in any browser or command line, using tools such as “curl”:


$ curl https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01
Hello world, slept 2 seconds

Throwing load at the function

Now start throwing some load against your API Gateway + Lambda function combo. Right now, your function is only limited by the total amount of concurrency available in an account. For this example account, you might have 850 unreserved concurrency out of a full account limit of 1000 due to having configured a few concurrency limits already (also the 100 concurrency saved for all functions without configured limits). You can find all of this information on the main Dashboard page of the Lambda console:

For generating load in this example, use an open source tool called “hey” (https://github.com/rakyll/hey), which works similarly to ApacheBench (ab). You test from an Amazon EC2 instance running the default Amazon Linux AMI from the EC2 console. For more help with configuring an EC2 instance, follow the steps in the Launch Instance Wizard.

After the EC2 instance is running, SSH into the host and run the following:


sudo yum install go
go get -u github.com/rakyll/hey

“hey” is easy to use. For these tests, specify a total number of tests (5,000) and a concurrency of 50 against the API Gateway URL as follows(replace the URL here with your own):


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01

The output from “hey” tells you interesting bits of information:


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01

Summary:
Total: 381.9978 secs
Slowest: 9.4765 secs
Fastest: 0.0438 secs
Average: 3.2153 secs
Requests/sec: 13.0891
Total data: 140024 bytes
Size/request: 28 bytes

Response time histogram:
0.044 [1] |
0.987 [2] |
1.930 [0] |
2.874 [1803] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
3.817 [1518] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
4.760 [719] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
5.703 [917] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
6.647 [13] |
7.590 [14] |
8.533 [9] |
9.477 [4] |

Latency distribution:
10% in 2.0224 secs
25% in 2.0267 secs
50% in 3.0251 secs
75% in 4.0269 secs
90% in 5.0279 secs
95% in 5.0414 secs
99% in 5.1871 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0003 secs, 0.0000 secs, 0.0332 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0046 secs
req write: 0.0000 secs, 0.0000 secs, 0.0005 secs
resp wait: 3.2149 secs, 0.0438 secs, 9.4472 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0004 secs

Status code distribution:
[200] 4997 responses
[502] 3 responses

You can see a helpful histogram and latency distribution. Remember that this Lambda function has a random sleep period in it and so isn’t entirely representational of a real-life workload. Those three 502s warrant digging deeper, but could be due to Lambda cold-start timing and the “second” variable being the maximum of 5, causing the Lambda functions to time out. AWS X-Ray and the Amazon CloudWatch logs generated by both API Gateway and Lambda could help you troubleshoot this.

Configuring a concurrency reservation

Now that you’ve established that you can generate this load against the function, I show you how to limit it and protect a backend resource from being overloaded by all of these requests.

  1. In the console, choose Configure.
  2. Under Concurrency, for Reserve concurrency, enter 25.

  1. Click on Save in the top right corner.

You could also set this with the AWS CLI using the Lambda put-function-concurrency command or see your current concurrency configuration via Lambda get-function. Here’s an example command:


$ aws lambda get-function --function-name concurrencyBlog01 --output json --query Concurrency
{
"ReservedConcurrentExecutions": 25
}

Either way, you’ve set the Concurrency Reservation to 25 for this function. This acts as both a limit and a reservation in terms of making sure that you can execute 25 concurrent functions at all times. Going above this results in the throttling of the Lambda function. Depending on the invoking service, throttling can result in a number of different outcomes, as shown in the documentation on Throttling Behavior. This change has also reduced your unreserved account concurrency for other functions by 25.

Rerun the same load generation as before and see what happens. Previously, you tested at 50 concurrency, which worked just fine. By limiting the Lambda functions to 25 concurrency, you should see rate limiting kick in. Run the same test again:


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01

While this test runs, refresh the Monitoring tab on your function detail page. You see the following warning message:

This is great! It means that your throttle is working as configured and you are now protecting your downstream resources from too much load from your Lambda function.

Here is the output from a new “hey” command:


$ ./go/bin/hey -n 5000 -c 50 https://ofixul557l.execute-api.us-east-1.amazonaws.com/prod/concurrencyBlog01
Summary:
Total: 379.9922 secs
Slowest: 7.1486 secs
Fastest: 0.0102 secs
Average: 1.1897 secs
Requests/sec: 13.1582
Total data: 164608 bytes
Size/request: 32 bytes

Response time histogram:
0.010 [1] |
0.724 [3075] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.438 [0] |
2.152 [811] |∎∎∎∎∎∎∎∎∎∎∎
2.866 [11] |
3.579 [566] |∎∎∎∎∎∎∎
4.293 [214] |∎∎∎
5.007 [1] |
5.721 [315] |∎∎∎∎
6.435 [4] |
7.149 [2] |

Latency distribution:
10% in 0.0130 secs
25% in 0.0147 secs
50% in 0.0205 secs
75% in 2.0344 secs
90% in 4.0229 secs
95% in 5.0248 secs
99% in 5.0629 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0004 secs, 0.0000 secs, 0.0537 secs
DNS-lookup: 0.0002 secs, 0.0000 secs, 0.0184 secs
req write: 0.0000 secs, 0.0000 secs, 0.0016 secs
resp wait: 1.1892 secs, 0.0101 secs, 7.1038 secs
resp read: 0.0000 secs, 0.0000 secs, 0.0005 secs

Status code distribution:
[502] 3076 responses
[200] 1924 responses

This looks fairly different from the last load test run. A large percentage of these requests failed fast due to the concurrency throttle failing them (those with the 0.724 seconds line). The timing shown here in the histogram represents the entire time it took to get a response between the EC2 instance and API Gateway calling Lambda and being rejected. It’s also important to note that this example was configured with an edge-optimized endpoint in API Gateway. You see under Status code distribution that 3076 of the 5000 requests failed with a 502, showing that the backend service from API Gateway and Lambda failed the request.

Other uses

Managing function concurrency can be useful in a few other ways beyond just limiting the impact on downstream services and providing a reservation of concurrency capacity. Here are two other uses:

  • Emergency kill switch
  • Cost controls

Emergency kill switch

On occasion, due to issues with applications I’ve managed in the past, I’ve had a need to disable a certain function or capability of an application. By setting the concurrency reservation and limit of a Lambda function to zero, you can do just that.

With the reservation set to zero every invocation of a Lambda function results in being throttled. You could then work on the related parts of the infrastructure or application that aren’t working, and then reconfigure the concurrency limit to allow invocations again.

Cost controls

While I mentioned how you might want to use concurrency limits to control the downstream impact to services or databases that your Lambda function might call, another resource that you might be cautious about is money. Setting the concurrency throttle is another way to help control costs during development and testing of your application.

You might want to prevent against a function performing a recursive action too quickly or a development workload generating too high of a concurrency. You might also want to protect development resources connected to this function from generating too much cost, such as APIs that your Lambda function calls.

Conclusion

Concurrent executions as a unit of scale are a fairly unique characteristic about Lambda functions. Placing limits on how many concurrency “slices” that your function can consume can prevent a single function from consuming all of the available concurrency in an account. Limits can also prevent a function from overwhelming a backend resource that isn’t as scalable.

Unlike monolithic applications or even microservices where there are mixed capabilities in a single service, Lambda functions encourage a sort of “nano-service” of small business logic directly related to the integration model connected to the function. I hope you’ve enjoyed this post and configure your concurrency limits today!

Looking Forward to 2018

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org//2017/12/07/looking-forward-to-2018.html

Let’s Encrypt had a great year in 2017. We more than doubled the number of active (unexpired) certificates we service to 46 million, we just about tripled the number of unique domains we service to 61 million, and we did it all while maintaining a stellar security and compliance track record. Most importantly though, the Web went from 46% encrypted page loads to 67% according to statistics from Mozilla – a gain of 21% in a single year – incredible. We’re proud to have contributed to that, and we’d like to thank all of the other people and organizations who also worked hard to create a more secure and privacy-respecting Web.

While we’re proud of what we accomplished in 2017, we are spending most of the final quarter of the year looking forward rather than back. As we wrap up our own planning process for 2018, I’d like to share some of our plans with you, including both the things we’re excited about and the challenges we’ll face. We’ll cover service growth, new features, infrastructure, and finances.

Service Growth

We are planning to double the number of active certificates and unique domains we service in 2018, to 90 million and 120 million, respectively. This anticipated growth is due to continuing high expectations for HTTPS growth in general in 2018.

Let’s Encrypt helps to drive HTTPS adoption by offering a free, easy to use, and globally available option for obtaining the certificates required to enable HTTPS. HTTPS adoption on the Web took off at an unprecedented rate from the day Let’s Encrypt launched to the public.

One of the reasons Let’s Encrypt is so easy to use is that our community has done great work making client software that works well for a wide variety of platforms. We’d like to thank everyone involved in the development of over 60 client software options for Let’s Encrypt. We’re particularly excited that support for the ACME protocol and Let’s Encrypt is being added to the Apache httpd server.

Other organizations and communities are also doing great work to promote HTTPS adoption, and thus stimulate demand for our services. For example, browsers are starting to make their users more aware of the risks associated with unencrypted HTTP (e.g. Firefox, Chrome). Many hosting providers and CDNs are making it easier than ever for all of their customers to use HTTPS. Government agencies are waking up to the need for stronger security to protect constituents. The media community is working to Secure the News.

New Features

We’ve got some exciting features planned for 2018.

First, we’re planning to introduce an ACME v2 protocol API endpoint and support for wildcard certificates along with it. Wildcard certificates will be free and available globally just like our other certificates. We are planning to have a public test API endpoint up by January 4, and we’ve set a date for the full launch: Tuesday, February 27.

Later in 2018 we plan to introduce ECDSA root and intermediate certificates. ECDSA is generally considered to be the future of digital signature algorithms on the Web due to the fact that it is more efficient than RSA. Let’s Encrypt will currently sign ECDSA keys from subscribers, but we sign with the RSA key from one of our intermediate certificates. Once we have an ECDSA root and intermediates, our subscribers will be able to deploy certificate chains which are entirely ECDSA.

Infrastructure

Our CA infrastructure is capable of issuing millions of certificates per day with multiple redundancy for stability and a wide variety of security safeguards, both physical and logical. Our infrastructure also generates and signs nearly 20 million OCSP responses daily, and serves those responses nearly 2 billion times per day. We expect issuance and OCSP numbers to double in 2018.

Our physical CA infrastructure currently occupies approximately 70 units of rack space, split between two datacenters, consisting primarily of compute servers, storage, HSMs, switches, and firewalls.

When we issue more certificates it puts the most stress on storage for our databases. We regularly invest in more and faster storage for our database servers, and that will continue in 2018.

We’ll need to add a few additional compute servers in 2018, and we’ll also start aging out hardware in 2018 for the first time since we launched. We’ll age out about ten 2u compute servers and replace them with new 1u servers, which will save space and be more energy efficient while providing better reliability and performance.

We’ll also add another infrastructure operations staff member, bringing that team to a total of six people. This is necessary in order to make sure we can keep up with demand while maintaining a high standard for security and compliance. Infrastructure operations staff are systems administrators responsible for building and maintaining all physical and logical CA infrastructure. The team also manages a 24/7/365 on-call schedule and they are primary participants in both security and compliance audits.

Finances

We pride ourselves on being an efficient organization. In 2018 Let’s Encrypt will secure a large portion of the Web with a budget of only $3.0M. For an overall increase in our budget of only 13%, we will be able to issue and service twice as many certificates as we did in 2017. We believe this represents an incredible value and that contributing to Let’s Encrypt is one of the most effective ways to help create a more secure and privacy-respecting Web.

Our 2018 fundraising efforts are off to a strong start with Platinum sponsorships from Mozilla, Akamai, OVH, Cisco, Google Chrome and the Electronic Frontier Foundation. The Ford Foundation has renewed their grant to Let’s Encrypt as well. We are seeking additional sponsorship and grant assistance to meet our full needs for 2018.

We had originally budgeted $2.91M for 2017 but we’ll likely come in under budget for the year at around $2.65M. The difference between our 2017 expenses of $2.65M and the 2018 budget of $3.0M consists primarily of the additional infrastructure operations costs previously mentioned.

Support Let’s Encrypt

We depend on contributions from our community of users and supporters in order to provide our services. If your company or organization would like to sponsor Let’s Encrypt please email us at [email protected]. We ask that you make an individual contribution if it is within your means.

We’re grateful for the industry and community support that we receive, and we look forward to continuing to create a more secure and privacy-respecting Web!

Stretch for PCs and Macs, and a Raspbian update

Post Syndicated from Simon Long original https://www.raspberrypi.org/blog/stretch-pcs-macs-raspbian-update/

Today, we are launching the first Debian Stretch release of the Raspberry Pi Desktop for PCs and Macs, and we’re also releasing the latest version of Raspbian Stretch for your Pi.

Raspberry Pi Desktop Stretch splash screen

For PCs and Macs

When we released our custom desktop environment on Debian for PCs and Macs last year, we were slightly taken aback by how popular it turned out to be. We really only created it as a result of one of those “Wouldn’t it be cool if…” conversations we sometimes have in the office, so we were delighted by the Pi community’s reaction.

Seeing how keen people were on the x86 version, we decided that we were going to try to keep releasing it alongside Raspbian, with the ultimate aim being to make simultaneous releases of both. This proved to be tricky, particularly with the move from the Jessie version of Debian to the Stretch version this year. However, we have now finished the job of porting all the custom code in Raspbian Stretch to Debian, and so the first Debian Stretch release of the Raspberry Pi Desktop for your PC or Mac is available from today.

The new Stretch releases

As with the Jessie release, you can either run this as a live image from a DVD, USB stick, or SD card or install it as the native operating system on the hard drive of an old laptop or desktop computer. Please note that installing this software will erase anything else on the hard drive — do not install this over a machine running Windows or macOS that you still need to use for its original purpose! It is, however, safe to boot a live image on such a machine, since your hard drive will not be touched by this.

We’re also pleased to announce that we are releasing the latest version of Raspbian Stretch for your Pi today. The Pi and PC versions are largely identical: as before, there are a few applications (such as Mathematica) which are exclusive to the Pi, but the user interface, desktop, and most applications will be exactly the same.

For Raspbian, this new release is mostly bug fixes and tweaks over the previous Stretch release, but there are one or two changes you might notice.

File manager

The file manager included as part of the LXDE desktop (on which our desktop is based) is a program called PCManFM, and it’s very feature-rich; there’s not much you can’t do in it. However, having used it for a few years, we felt that it was perhaps more complex than it needed to be — the sheer number of menu options and choices made some common operations more awkward than they needed to be. So to try to make file management easier, we have implemented a cut-down mode for the file manager.

Raspberry Pi Desktop Stretch - file manager

Most of the changes are to do with the menus. We’ve removed a lot of options that most people are unlikely to change, and moved some other options into the Preferences screen rather than the menus. The two most common settings people tend to change — how icons are displayed and sorted — are now options on the toolbar and in a top-level menu rather than hidden away in submenus.

The sidebar now only shows a single hierarchical view of the file system, and we’ve tidied the toolbar and updated the icons to make them match our house style. We’ve removed the option for a tabbed interface, and we’ve stomped a few bugs as well.

One final change was to make it possible to rename a file just by clicking on its icon to highlight it, and then clicking on its name. This is the way renaming works on both Windows and macOS, and it’s always seemed slightly awkward that Unix desktop environments tend not to support it.

As with most of the other changes we’ve made to the desktop over the last few years, the intention is to make it simpler to use, and to ease the transition from non-Unix environments. But if you really don’t like what we’ve done and long for the old file manager, just untick the box for Display simplified user interface and menus in the Layout page of Preferences, and everything will be back the way it was!

Raspberry Pi Desktop Stretch - preferences GUI

Battery indicator for laptops

One important feature missing from the previous release was an indication of the amount of battery life. Eben runs our desktop on his Mac, and he was becoming slightly irritated by having to keep rebooting into macOS just to check whether his battery was about to die — so fixing this was a priority!

We’ve added a battery status icon to the taskbar; this shows current percentage charge, along with whether the battery is charging, discharging, or connected to the mains. When you hover over the icon with the mouse pointer, a tooltip with more details appears, including the time remaining if the battery can provide this information.

Raspberry Pi Desktop Stretch - battery indicator

While this battery monitor is mainly intended for the PC version, it also supports the first-generation pi-top — to see it, you’ll only need to make sure that I2C is enabled in Configuration. A future release will support the new second-generation pi-top.

New PC applications

We have included a couple of new applications in the PC version. One is called PiServer — this allows you to set up an operating system, such as Raspbian, on the PC which can then be shared by a number of Pi clients networked to it. It is intended to make it easy for classrooms to have multiple Pis all running exactly the same software, and for the teacher to have control over how the software is installed and used. PiServer is quite a clever piece of software, and it’ll be covered in more detail in another blog post in December.

We’ve also added an application which allows you to easily use the GPIO pins of a Pi Zero connected via USB to a PC in applications using Scratch or Python. This makes it possible to run the same physical computing projects on the PC as you do on a Pi! Again, we’ll tell you more in a separate blog post this month.

Both of these applications are included as standard on the PC image, but not on the Raspbian image. You can run them on a Pi if you want — both can be installed from apt.

How to get the new versions

New images for both Raspbian and Debian versions are available from the Downloads page.

It is possible to update existing installations of both Raspbian and Debian versions. For Raspbian, this is easy: just open a terminal window and enter

sudo apt-get update
sudo apt-get dist-upgrade

Updating Raspbian on your Raspberry Pi

How to update to the latest version of Raspbian on your Raspberry Pi. Download Raspbian here: More information on the latest version of Raspbian: Buy a Raspberry Pi:

It is slightly more complex for the PC version, as the previous release was based around Debian Jessie. You will need to edit the files /etc/apt/sources.list and /etc/apt/sources.list.d/raspi.list, using sudo to do so. In both files, change every occurrence of the word “jessie” to “stretch”. When that’s done, do the following:

sudo apt-get update 
sudo dpkg --force-depends -r libwebkitgtk-3.0-common
sudo apt-get -f install
sudo apt-get dist-upgrade
sudo apt-get install python3-thonny
sudo apt-get install sonic-pi=2.10.0~repack-rpt1+2
sudo apt-get install piserver
sudo apt-get install usbbootgui

At several points during the upgrade process, you will be asked if you want to keep the current version of a configuration file or to install the package maintainer’s version. In every case, keep the existing version, which is the default option. The update may take an hour or so, depending on your network connection.

As with all software updates, there is the possibility that something may go wrong during the process, which could lead to your operating system becoming corrupted. Therefore, we always recommend making a backup first.

Enjoy the new versions, and do let us know any feedback you have in the comments or on the forums!

The post Stretch for PCs and Macs, and a Raspbian update appeared first on Raspberry Pi.

H1 Instances – Fast, Dense Storage for Big Data Applications

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-h1-instances-fast-dense-storage-for-big-data-applications/

The scale of AWS and the diversity of our customer base gives us the opportunity to create EC2 instance types that are purpose-built for many different types of workloads. For example, a number of popular big data use cases depend on high-speed, sequential access to multiple terabytes of data. Our customers want to build and run very large MapReduce clusters, host distributed file systems, use Apache Kafka to process voluminous log files, and so forth.

New H1 Instances
The new H1 instances are designed specifically for this use case. In comparison to the existing D2 (dense storage) instances, the H1 instances provide more vCPUs and more memory per terabyte of local magnetic storage, along with increased network bandwidth, giving you the power to address more complex challenges with a nicely balanced mix of resources.

The instances are based on Intel Xeon E5-2686 v4 processors running at a base clock frequency of 2.3 GHz and come in four instance sizes (all VPC-only and HVM-only):

Instance Name vCPUs
RAM
Local Storage Network Bandwidth
h1.2xlarge 8 32 GiB 2 TB Up to 10 Gbps
h1.4xlarge 16 64 GiB 4 TB Up to 10 Gbps
h1.8xlarge 32 128 GiB 8 TB 10 Gbps
h1.16xlarge 64 256 GiB 16 TB 25 Gbps

The two largest sizes support Intel Turbo and CPU power management, with all-core Turbo at 2.7 GHz and single-core Turbo at 3.0 GHz.

Local storage is optimized to deliver high throughput for sequential I/O; you can expect to transfer up to 1.15 gigabytes per second if you use a 2 megabyte block size. The storage is encrypted at rest using 256-bit XTS-AES and one-time keys.

Moving large amounts of data on and off of these instances is facilitated by the use of Enhanced Networking, giving you up to 25 Gbps of network bandwith within Placement Groups.

Launch One Today
H1 instances are available today in the US East (Northern Virginia), US West (Oregon), US East (Ohio), and EU (Ireland) Regions. You can launch them in On-Demand or Spot Form. Dedicated Hosts, Dedicated Instances, and Reserved Instances (both 1-year and 3-year) are also available.

Jeff;

Don Jr.: I’ll bite

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/11/don-jr-ill-bite.html

So Don Jr. tweets the following, which is an excellent troll. So I thought I’d bite. The reason is I just got through debunk Democrat claims about NetNeutrality, so it seems like a good time to balance things out and debunk Trump nonsense.

The issue here is not which side is right. The issue here is whether you stand for truth, or whether you’ll seize any factoid that appears to support your side, regardless of the truthfulness of it. The ACLU obviously chose falsehoods, as I documented. In the following tweet, Don Jr. does the same.

It’s a preview of the hyperpartisan debates are you are likely to have across the dinner table tomorrow, which each side trying to outdo the other in the false-hoods they’ll claim.

What we see in this number is a steady trend of these statistics since the Great Recession, with no evidence in the graphs showing how Trump has influenced these numbers, one way or the other.

Stock markets at all time highs

This is true, but it’s obviously not due to Trump. The stock markers have been steadily rising since the Great Recession. Trump has done nothing substantive to change the market trajectory. Also, he hasn’t inspired the market to change it’s direction.
To be fair to Don Jr., we’ve all been crediting (or blaming) presidents for changes in the stock market despite the fact they have almost no influence over it. Presidents don’t run the economy, it’s an inappropriate conceit. The most influence they’ve had is in harming it.

Lowest jobless claims since 73

Again, let’s graph this:

As we can see, jobless claims have been on a smooth downward trajectory since the Great Recession. It’s difficult to see here how President Trump has influenced these numbers.

6 Trillion added to the economy

What he’s referring to is that assets have risen in value, like the stock market, homes, gold, and even Bitcoin.
But this is a well known fallacy known as Mercantilism, believing the “economy” is measured by the value of its assets. This was debunked by Adam Smith in his book “The Wealth of Nations“, where he showed instead the the “economy” is measured by how much it produces (GDP – Gross Domestic Product) and not assets.
GDP has grown at 3.0%, which is pretty good compared to the long term trend, and is better than Europe or Japan (though not as good as China). But Trump doesn’t deserve any credit for this — today’s rise in GDP is the result of stuff that happened years ago.
Assets have risen by $6 trillion, but that’s not a good thing. After all, when you sell your home for more money, the buyer has to pay more. So one person is better off and one is worse off, so the net effect is zero.
Actually, such asset price increase is a worrisome indicator — we are entering into bubble territory. It’s the result of a loose monetary policy, low interest rates and “quantitative easing” that was designed under the Obama administration to stimulate the economy. That’s why all assets are rising in value. Normally, a rise in one asset means a fall in another, like selling gold to pay for houses. But because of loose monetary policy, all assets are increasing in price. The amazing rise in Bitcoin over the last year is as much a result of this bubble growing in all assets as it is to an exuberant belief in Bitcoin.
When this bubble collapses, which may happen during Trump’s term, it’ll really be the Obama administration who is to blame. I mean, if Trump is willing to take credit for the asset price bubble now, I’m willing to give it to him, as long as he accepts the blame when it crashes.

1.5 million fewer people on food stamps

As you’d expect, I’m going to debunk this with a graph: the numbers have been falling since the great recession. Indeed, in the previous period under Obama, 1.9 fewer people got off food stamps, so Trump’s performance is slight ahead rather than behind Obama. Of course, neither president is really responsible.

Consumer confidence through the roof

Again we are going to graph this number:

Again we find nothing in the graph that suggests President Trump is responsible for any change — it’s been improving steadily since the Great Recession.

One thing to note is that, technically, it’s not “through the roof” — it still quite a bit below the roof set during the dot-com era.

Lowest Unemployment rate in 17 years

Again, let’s simply graph it over time and look for Trump’s contribution. as we can see, there doesn’t appear to be anything special Trump has done — unemployment has steadily been improving since the Great Recession.
But here’s the thing, the “unemployment rate” only measures those looking for work, not those who have given up. The number that concerns people more is the “labor force participation rate”. The Great Recession kicked a lot of workers out of the economy.
Mostly this is because Baby Boomer are now retiring an leaving the workforce, and some have chosen to retire early rather than look for another job. But there are still some other problems in our economy that cause this. President Trump has nothing particular in order to solve these problems.

Conclusion

As we see, Don Jr’s tweet is a troll. When we look at the graphs of these indicators going back to the Great Recession, we don’t see how President Trump has influenced anything. The improvements this year are in line with the improvements last year, which are in turn inline with the improvements in the previous year.
To be fair, all parties credit their President with improvements during their term. President Obama’s supporters did the same thing. But at least right now, with these numbers, we can see that there’s no merit to anything in Don Jr’s tweet.
The hyperpartisan rancor in this country is because neither side cares about the facts. We should care. We should care that these numbers suck, even if we are Republicans. Conversely, we should care that those NetNeutrality claims by Democrats suck, even if we are Democrats.

170 ‘Pirate’ IPTV Vendors Throw in the Towel Facing Legal Pressure

Post Syndicated from Ernesto original https://torrentfreak.com/170-pirate-iptv-vendors-throw-the-in-the-towel-facing-legal-pressure-171121/

Pirate streaming boxes are all the rage this year. Not just among the dozens of millions of users, they are on top of the anti-piracy agenda as well.

Dubbed Piracy 3.0 by the MPAA, copyright holders are trying their best to curb this worrisome trend. In the Netherlands local anti-piracy group BREIN is leading the charge.

Backed by the major film studios, the organization booked a significant victory earlier this year against Filmspeler. In this case, the European Court of Justice ruled that selling or using devices pre-configured to obtain copyright-infringing content is illegal.

Paired with the earlier GS Media ruling, which held that companies with a for-profit motive can’t knowingly link to copyright-infringing material, this provides a powerful enforcement tool.

With these decisions in hand, BREIN previously pressured hundreds of streaming box vendors to halt sales of hardware with pirate addons, but it didn’t stop there. This week the group also highlighted its successes against vendors of unauthorized IPTV services.

“BREIN has already stopped 170 illegal providers of illegal media players and/or IPTV subscriptions. Even providers that only offer illegal IPTV subscriptions are being dealt with,” BREIN reports.

In addition to shutting down the trade in IPTV services, the anti-piracy group also removed 375 advertisements for such services from various marketplaces.

“This is illegal commerce. If you wait until you are warned, you are too late,” BREIN director Tim Kuik says.

“You can be held personally liable. You can also be charged and criminally prosecuted. Willingly committing commercial copyright infringement can lead to a 82,000 euro fine and 4 years imprisonment,” he adds.

While most pirate IPTV vendors threw in the towel voluntarily, some received an extra incentive. Twenty signed a settlement with BREIN for varying amounts, up to tens of thousands of euros. They all face further penalties if they continue to sell pirate subscriptions.

In some cases, the courts were involved. This includes the recent lawsuit against MovieStreamer, that was ordered to stop its IPTV hyperlinking activities immediately. Failure to do so will result in a 5,000 euro per day fine. In addition, the vendor was also ordered to pay legal costs of 17,527 euros.

While BREIN has booked plenty of successes already, as exampled here, the pirate streaming box problem is far from solved. The anti-piracy group currently has one case pending in court, but more are likely to follow in the near future.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

MPAA Lobbies US Congress on Streaming Piracy Boxes

Post Syndicated from Ernesto original https://torrentfreak.com/mpaa-lobbies-us-congress-on-streaming-piracy-boxes-171112/

As part of its quest to reduce piracy, the MPAA continues to spend money on its lobbying activities, hoping to sway lawmakers in its direction.

While the lobbying talks take place behind closed doors, quarterly disclosure reports provide some insight into the items under discussion.

The MPAA’s most recent lobbying disclosure form features several new topics that weren’t on the agenda last year.

Among other issues, the Hollywood group lobbied the U.S. Senate and the U.S. House of Representatives on set-top boxes, preloaded streaming piracy devices, and streaming piracy in general.

The details of these discussions remain behind closed doors. The only thing we know for sure is what Hollywood is lobbying on, but it doesn’t take much imagination to take an educated guess on the ‘why’ part.

Just over a year ago streaming piracy boxes were hardly mentioned in anti-piracy circles, but today they are on the top of the enforcement list. The MPAA is reporting these concerns to lawmakers, to see whether they can be of assistance in curbing this growing threat.

Some of the lobbying topics

It’s clear that pirate streaming players are a prime concern for Hollywood. MPA boss Stan McCoy recently characterized the use of these devices as “Piracy 3.0” and a coalition of industry players sued a US-based seller of streaming boxes earlier this month.

The lobbying efforts themselves are nothing new of course. Every year the MPAA spends around $4 million to influence the decisions of lawmakers, both directly and through external lobbying firms such as Covington & Burling, Capitol Tax Partners, and Sentinel Worldwide.

While piracy streaming boxes are new on the agenda this year, they are not the only topics under discussion. Other items include trade deals such as the TPP, TTIP, and NAFTA, voluntary domain name initiatives, EU digital single market proposals, and cybersecurity.

TorrentFreak reached out to the MPAA for more information on the streaming box lobbying efforts, but we have yet to hear back.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Sky: People Can’t Pirate Live Soccer in the UK Anymore

Post Syndicated from Andy original https://torrentfreak.com/sky-people-cant-pirate-live-soccer-in-the-uk-anymore-171108/

The commotion over the set-top box streaming phenomenon is showing no signs of dying down and if day one at the Cable and Satellite Broadcasting Association of Asia (CASBAA) Conference 2017 was anything to go by, things are only heating up.

Held at Studio City in Macau, the conference has a strong anti-piracy element and was opened by Joe Welch, CASBAA Board Chairman and SVP Public Affairs Asia, 21st Century Fox. He began Tuesday by noting the important recent launch of a brand new anti-piracy initiative.

“CASBAA recently launched the Coalition Against Piracy, funded by 18 of the region’s content players and distribution partners,” he said.

TF reported on the formation of the coalition mid-October. It includes heavyweights such as Disney, Fox, HBO, NBCUniversal and BBC Worldwide, and will have a strong focus on the illicit set-top box market.

Illegal streaming devices (or ISDs, as the industry calls them), were directly addressed in a segment yesterday afternoon titled Face To Face. Led by Dr. Ros Lynch, Director of Copyright & IP Enforcement at the UK Intellectual Property Office, the session detailed the “onslaught of online piracy” and the rise of ISDs that is apparently “shaking the market”.

Given the apparent gravity of those statements, the following will probably come as a surprise. According to Lynch, the UK IPO sought the opinion of UK-based rightsholders about the pirate box phenomenon a while back after being informed of their popularity in the East. The response was that pirate boxes weren’t an issue. It didn’t take long, however, for things to blow up.

“The UKIPO provides intelligence and evidence to industry and the Police Intellectual Property Crime Unit (PIPCU) in London who then take enforcement actions,” Lynch explained.

“We first heard about the issues with ISDs from [broadcaster] TVB in Hong Kong and we then consulted the UK rights holders who responded that it wasn’t a problem. Two years later the issue just exploded.”

The evidence of that in the UK isn’t difficult to find. In addition to millions of devices with both free Kodi addon and subscription-based systems deployed, the app market has bloomed too, offering free or near to free content to all.

This caught the eye of the Premier League who this year obtained two pioneering injunctions (1,2) to tackle live streams of football games. Streams are blocked by local ISPs in real-time, making illicit online viewing a more painful experience than it ever has been. No doubt progress has been made on this front, with thousands of streams blocked, but according to broadcaster Sky, the results are unprecedented.

“Site-blocking has moved the goalposts significantly,” said Matthew Hibbert, head of litigation at Sky UK.

“In the UK you cannot watch pirated live Premier League content anymore,” he said.

While progress has been good, the statement is overly enthusiastic. TF sources have been monitoring the availability of pirate streams on around dozen illicit sites and services every Saturday (when it is actually illegal to broadcast matches in the UK) and service has been steady on around half of them and intermittent at worst on the rest.

There are hundreds of other platforms available so while many are definitely affected by Premier League blocking, it’s safe to assume that live football piracy hasn’t been wiped out. Nevertheless, it would be wrong to suggest that no progress has been made, in this and other related areas.

Kevin Plumb, Director of Legal Services at The Premier League, said that pubs showing football from illegal streams had also massively dwindled in numbers.

“In the past 18 months the illegal broadcasting of live Premier League matches in pubs in the UK has been decimated,” he said.

This result is almost certainly down to prosecutions taken in tandem with the Federation Against Copyright Theft (FACT), that have seen several landlords landed with large fines. Indeed, both sides of the market have been tackled, with both licensed premises and IPTV device sellers being targeted.

“The most successful thing we’ve done to combat piracy has been to undertake criminal prosecutions against ISD piracy,” said FACT chief Kieron Sharp yesterday. “Everyone is pleading guilty to these offenses.”

Most if not all of FACT-led prosecutions target device and subscription sellers under fraud legislation but that could change in the future, Lynch of the Intellectual Property Office said.

“While the UK works to update its legislation, we can’t wait for the new legislation to take enforcement actions and we rely heavily on ‘conspiracy to defraud’ charges, and have successfully prosecuted a number of ISD retailers,” she said.

Finally, information provided yesterday by network company CISCO shine light on what it costs to run a subscription-based pirate IPTV operation.

Director of Intelligence & Security Operations Avigail Gutman said a pirate IPTV server offering 1,000 channels to around 1,000 subscribers can cost as little as 2,000 euros per month to run but can generate 12,000 euros in revenue during the same period.

“In April of 2017, ten major paid TV and content providers had relinquished 3.09 million euros per month to 285 ISD-based streaming pirate syndicates,” she said.

There’s little doubt that IPTV piracy, both paid and free, is here to stay. The big question is how it will be tackled short and long-term and whether any changes in legislation will have any unintended knock-on effects.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Now Available – Compute-Intensive C5 Instances for Amazon EC2

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-compute-intensive-c5-instances-for-amazon-ec2/

I’m thrilled to announce that the new compute-intensive C5 instances are available today in six sizes for launch in three AWS regions!

These instances designed for compute-heavy applications like batch processing, distributed analytics, high-performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding. The new instances offer a 25% price/performance improvement over the C4 instances, with over 50% for some workloads. They also have additional memory per vCPU, and (for code that can make use of the new AVX-512 instructions), twice the performance for vector and floating point workloads.

Over the years we have been working non-stop to provide our customers with the best possible networking, storage, and compute performance, with a long-term focus on offloading many types of work to dedicated hardware designed and built by AWS. The C5 instance type incorporates the latest generation of our hardware offloads, and also takes another big step forward with the addition of a new hypervisor that runs hand-in-glove with our hardware. The new hypervisor allows us to give you access to all of the processing power provided by the host hardware, while also making performance even more consistent and further raising the bar on security. We’ll be sharing many technical details about it at AWS re:Invent.

The New Instances
The C5 instances are available in six sizes:

Instance Name vCPUs
RAM
EBS Bandwidth Network Bandwidth
c5.large 2 4 GiB Up to 2.25 Gbps Up to 10 Gbps
c5.xlarge 4 8 GiB Up to 2.25 Gbps Up to 10 Gbps
c5.2xlarge 8 16 GiB Up to 2.25 Gbps Up to 10 Gbps
c5.4xlarge 16 32 GiB 2.25 Gbps Up to 10 Gbps
c5.9xlarge 36 72 GiB 4.5 Gbps 10 Gbps
c5.18xlarge 72 144 GiB 9 Gbps 25 Gbps

Each vCPU is a hardware hyperthread on a 3.0 GHz Intel Xeon Platinum 8000-series processor. This custom processor, optimized for EC2, allows you have full control over the C-states on the two largest sizes, allowing you to run a single core at up to 3.5 GHz using Intel Turbo Boost Technology.

As you can see from the table, the four smallest instance sizes offer substantially more EBS and network bandwidth than the previous generation of compute-intensive instances.

Because all networking and storage functionality is implemented in hardware, C5 instances require HVM AMIs that include drivers for the Elastic Network Adapter (ENA) and NVMe. The latest Amazon Linux, Microsoft Windows, Ubuntu, RHEL, CentOS, SLES, Debian, and FreeBSD AMIs all support C5 instances. If you are doing machine learning inferencing, or other compute-intensive work, be sure to check out the most recent version of the Intel Math Kernel Library. It has been optimized for the Intel® Xeon® Platinum processor and has the potential to greatly accelerate your work.

In order to remain compatible with instances that use the Xen hypervisor, the device names for EBS volumes will continue to use the existing /dev/sd and /dev/xvd prefixes. The device name that you provide when you attach a volume to an instance is not used because the NVMe driver assigns its own device name (read Amazon EBS and NVMe to learn more):

The nvme command displays additional information about each volume (install it using sudo yum -y install nvme-cli if necessary):

The SN field in the output can be mapped to an EBS volume ID by inserting a “-” after the “vol” prefix (sadly, the NVMe SN field is not long enough to store the entire ID). Here’s a simple script that uses this information to create an EBS snapshot of each attached volume:

$ sudo nvme list | \
  awk '/dev/ {print(gensub("vol", "vol-", 1, $2))}' | \
  xargs -n 1 aws ec2 create-snapshot --volume-id

With a little more work (and a lot of testing), you could create a script that expands EBS volumes that are getting full.

Getting to C5
As I mentioned earlier, our effort to offload work to hardware accelerators has been underway for quite some time. Here’s a recap:

CC1 – Launched in 2010, the CC1 was designed to support scale-out HPC applications. It was the first EC2 instance to support 10 Gbps networking and one of the first to support HVM virtualization. The network fabric that we designed for the CC1 (based on our own switch hardware) has become the standard for all AWS data centers.

C3 – Launched in 2013, the C3 introduced Enhanced Networking and uses dedicated hardware accelerators to support the software defined network inside of each Virtual Private Cloud (VPC). Hardware virtualization removes the I/O stack from the hypervisor in favor of direct access by the guest OS, resulting in higher performance and reduced variability.

C4 – Launched in 2015, the C4 instances are EBS Optimized by default via a dedicated network connection, and also offload EBS processing (including CPU-intensive crypto operations for encrypted EBS volumes) to a hardware accelerator.

C5 – Launched today, the hypervisor that powers the C5 instances allow practically all of the resources of the host CPU to be devoted to customer instances. The ENA networking and the NVMe interface to EBS are both powered by hardware accelerators. The instances do not require (or support) the Xen paravirtual networking or block device drivers, both of which have been removed in order to increase efficiency.

Going forward, we’ll use this hypervisor to power other instance types and plan to share additional technical details in a set of AWS re:Invent sessions.

Launch a C5 Today
You can launch C5 instances today in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions in On-Demand and Spot form (Reserved Instances are also available), with additional Regions in the works.

One quick note before I go: The current NVMe driver is not optimized for high-performance sequential workloads and we don’t recommend the use of C5 instances in conjunction with sc1 or st1 volumes. We are aware of this issue and have been working to optimize the driver for this important use case.

Jeff;

MPAA: Almost 70% of 38 Million Kodi Users Are Pirates

Post Syndicated from Andy original https://torrentfreak.com/mpaa-almost-70-of-38-million-kodi-users-are-pirates-171104/

As torrents and other forms of file-sharing resolutely simmer away in the background, it is the streaming phenomenon that’s taking the Internet by storm.

This Tuesday, in a report by Canadian broadband management company Sandvine, it was revealed that IPTV traffic has grown to massive proportions.

Sandvine found that 6.5% of households in North American are now communicating with known TV piracy services. This translates to seven million subscribers and many more potential viewers. There’s little doubt that IPTV and all its variants, Kodi streaming included, are definitely here to stay.

The topic was raised again Wednesday during a panel discussion hosted by the Copyright Alliance in conjunction with the Creative Rights Caucus. Titled “Copyright Pirates’ New Strategies”, the discussion’s promotional graphic indicates some of the industry heavyweights in attendance.

The Copyright Alliance tweeted points from the discussion throughout the day and soon the conversation turned to the streaming phenomenon that has transformed piracy in recent times.

Previously dubbed Piracy 3.0 by the MPAA, Senior Vice President, Government and Regulatory Affairs Neil Fried was present to describe streaming devices and apps as the latest development in TV and movie piracy.

Like many before him, Fried explained that the Kodi platform in its basic form is legal. However, he noted that many of the add-ons for the media player provide access to pirated content, a point proven in a big screen demo.

Kodi demo by the MPAA via Copyright Alliance

According to the Copyright Alliance, Fried then delivered some interesting stats. The MPAA believes that there are around 38 million users of Kodi in the world, which sounds like a reasonable figure given that the system has been around for 15 years in various guises, including during its XBMC branding.

However, he also claimed that of those 38 million, a substantial 26 million users have piracy addons installed. That suggests around 68.5% or seven out of ten of all Kodi users are pirates of movies, TV shows, and other media. Taking the MPAA statement to its conclusion, only 12 million Kodi users are operating the software legitimately.

TorrentFreak contacted XBMC Foundation President Nathan Betzen for his stance on the figures but he couldn’t shine much light on usage.

“Unfortunately I do not have an up to date number on users, and because we don’t watch what our users are doing, we have no way of knowing how many do what with regards to streaming. [The MPAA’s] numbers could be completely correct or totally made up. We have no real way to know,” Betzen said.

That being said, the team does have the capability to monitor overall Kodi usage, even if they don’t publish the stats. This was revealed back in June 2011 when Kodi was still called XBMC.

“The addon system gives us the opportunity to measure the popularity of addons, measure user base, estimate the frequency that people update their systems, and even, ultimately, help users find the more popular addons,” the team wrote.

“Most interestingly, for the purposes of this post, is that we can get a pretty good picture of how many active XBMC installs there are without having to track what each individual user does.”

Using this system, the team concluded there were roughly 435,000 active XBMC instances around the globe in April 2011, but that figure was to swell dramatically. Just three months later, 789,000 XBMC installations had been active in the previous six weeks.

What’s staggering is that in 2017, the MPAA claims that there are now 38 million users of Kodi, of which 26 million are pirates. In the absence of any figures from the Kodi team, TF asked Kodi addon repository TVAddons what they thought of the MPAA’s stats.

“We’ve always banned the use of analytics within Kodi addons, so it’s really impossible to make such an estimate. It seems like the MPAA is throwing around numbers without much statistical evidence while mislabelling Kodi users as ‘pirate’ in the same way that they have mislabelled legitimate services like CloudFlare,” a spokesperson said.

“As far as general addon use goes, before our repository server (which contained hundreds of legitimate addons) was unlawfully seized, it had about 39 million active users per month, but even we don’t know how many users downloaded which addons. We never allowed for addon statistics for users because they are invasive to privacy and breed unhealthy competition.”

So, it seems that while there is some dispute over the number of potential pirates, there does at least appear to be some consensus on the number of users overall. The big question, however, is how groups like the MPAA will deal with this kind of unauthorized infringement in future.

At the moment the big push is to paint pirate platforms as dangerous places to be. Indeed, during the discussion this week, Copyright Alliance CEO Keith Kupferschmid claimed that users of pirate services are “28 times more likely” to be infected with malware.

Whether that strategy will pay off remains unclear but it’s obvious that at least for now, Piracy 3.0 is a massive deal, one that few people saw coming half a decade ago but is destined to keep growing.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

‘Pirate’ IPTV Provider Loses Case, Despite Not Offering Content Itself

Post Syndicated from Andy original https://torrentfreak.com/pirate-iptv-provider-loses-case-despite-not-offering-content-itself-171031/

In 2017, there can be little doubt that streaming is the big piracy engine of the moment. Dubbed Piracy 3.0 by the MPAA, the movement is causing tremendous headaches for rightsholders on a global scale.

One of the interesting things about this phenomenon is the distributed nature of the content on offer. Sourced from thousands of online locations, from traditional file-hosters to Google Drive, the big challenge is to aggregate it all into one place, to make it easy to find. This is often achieved via third-party addons for the legal Kodi software.

One company offering such a service was MovieStreamer.nl in the Netherlands. Via its website MovieStreamer the company offered its Easy Use Interface 2.0, a piece of software that made Kodi easy to use and other streams easy to find for 79 euros. It also sold ‘VIP’ access to thousands of otherwise premium channels for around 20 euros per month.

MovieStreamer Easy Interface 2.0

“Thanks to the unique Easy Use Interface, we have the unique 3-step process,” the company’s marketing read.

“Click tile of choice, activate subtitles, and play! Fully automated and instantly the most optimal settings. Our youngest user is 4 years old and the ‘oldest’ 86 years. Ideal for young and old, beginner and expert.”

Of course, being based in the Netherlands it wasn’t long before MovieStreamer caught the attention of BREIN. The anti-piracy outfit says it tried to get the company to stop offering the illegal product but after getting no joy, took the case to court.

From BREIN’s perspective, the case was cut and dried. MovieStreamer had no right to provide access to the infringing content so it was in breach of copyright law (unauthorized communication to the public) and should stop its activities immediately. MovieStreamer, however, saw things somewhat differently.

At the core of its defense was the claim that did it not provide content itself and was merely a kind of middleman. MovieStreamer said it provided only a referral service in the form of a hyperlink formatted as a shortened URL, which in turn brought together supply and demand.

In effect, MovieStreamer claimed that it was several steps away from any infringement and that only the users themselves could activate the shortener hyperlink and subsequent process (including a corresponding M3U playlist file, which linked to other hyperlinks) to access any pirated content. Due to this disconnect, MovieStreamer said that there was no infringement, for-profit or otherwise.

A judge at the District Court in Utrecht disagreed, ruling that by providing a unique hyperlink to customers which in turn lead to protected works was indeed a “communication to the public” based on the earlier Filmspeler case.

The Court also noted that MovieStreamer knew or indeed ought to have known the illegal nature of the content being linked to, not least since BREIN had already informed them of that fact. Since the company was aware, the for-profit element of the GS Media decision handed down by the European Court of Justice came into play.

In an order handed down October 27, the Court ordered MovieStreamer to stop its IPTV hyperlinking activities immediately, whether via its Kodi Easy Use Interface or other means. Failure to do so will result in a 5,000 euro per day fine, payable to BREIN, up to a maximum of 500,000 euros. MovieStreamer was also ordered to pay legal costs of 17,527 euros.

“Moviestreamer sold a link to illegal content. Then you are required to check if that content is legally on the internet,” BREIN Director Tim Kuik said in a statement.

“You can not claim that you have nothing to do with the content if you sell a link to that content.”

Speaking with Tweakers, MovieStreamer owner Bernhard Ohler said that the packages in question were removed from his website on Saturday night. He also warned that other similar companies could experience the same issues with BREIN.

“With this judgment in hand, BREIN has, of course, a powerful weapon to force them offline,” he said.

Ohler said that the margins on hardware were so small that the IPTV subscriptions were the heart of his company. Contacted by TorrentFreak on what this means for his business, he had just two words.

“The end,” he said.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Roku Shows FBI Warning to Pirate Channel Users

Post Syndicated from Ernesto original https://torrentfreak.com/roku-shows-fbi-warning-to-pirate-channel-users-171009/

In recent years it has become much easier to stream movies and TV-shows over the Internet.

Legal services such as Netflix and HBO are flourishing, but at the same time millions of people are streaming from unauthorized sources, often paired with perfectly legal streaming platforms and devices.

Hollywood insiders have dubbed this trend “Piracy 3.0” and are actively working with stakeholders to address the threat. One of the companies rightsholders are working with is Roku, known for its easy-to-use media players.

Earlier this year a Mexican court ordered retailers to take the Roku media player off the shelves. This legal battle is still ongoing, but it was a clear signal to the company, which now has its own anti-piracy team.

Several third-party “private” channels have been removed from the player in recent weeks as they violate Roku’s terms and conditions. These include the hugely popular streaming channel XTV, which offered access to infringing content.

After its removal, XTV briefly returned as XTV 2, but that didn’t last for long. The infringing channel was soon removed again, this time showing the FBI’s anti-piracy seal followed by a rather ominous message.

“FBI Anti-Piracy Warning: Unauthorized copying is punishable under federal law,” it reads. “Roku has removed this unauthorized service due to repeated claims of copyright infringement.”

FBI Warning (via Cordcuttersnews)

The unusual warning was picked up by Cordcuttersnews and states that Roku itself removed the channel.

To some it may seem that the FBI is cracking down on Roku channels, but this is not the case. The anti-piracy seal and associated warning are often used in cases where the organization is not actively involved, to add extra weight. The FBI supports this, as long as certain standards are met.

A Roku spokesperson confirmed to TorrentFreak that they’re using it on their own accord here.

“We want to send a clear message to Roku customers and to publishers that any publication of pirated content on our platform is a violation of law and our platform rules,” the company says.

“We have recently expanded the messaging that we display to customers that install non-certified channels to alert them to the associated risks, and we display the FBI’s publicly available warning when we remove channels for copyright violations.”

The strong language shows that Roku is taking its efforts to crack down on infringing channels very seriously. A few weeks ago the company started to warn users that pirate channels may be removed without prior notice.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Now Available – Amazon Linux AMI 2017.09

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-amazon-linux-ami-2017-09/

I’m happy to announce that the latest version of the Amazon Linux AMI (2017.09) is now available in all AWS Regions for all current-generation EC2 instances. The AMI contains a supported and maintained Linux image that is designed to provide a stable, secure, high performance environment for applications running on EC2.

Easy Upgrade
You can upgrade your existing instances by running two commands and then rebooting:

$ sudo yum clean all
$ sudo yum update

Lots of Goodies
The AMI contains many new features, many of which were added in response to requests from our customers. Here’s a summary:

Kernel 4.9.51 – Based on the 4.9 stable kernel series, this kernel includes the ENA 1.3.0 driver along with support for TCP Bottleneck Bandwidth and RTT (BBR). Read my post, Elastic Network Adapter – High-Performance Network Interface for Amazon EC2 to learn more about ENA. Read the Release Notes to learn how to enable BBR.

Amazon SSM Agent – The Amazon SSM Agent is now installed by default. This means that you can now use EC2 Run Command to configure and run scripts on your instances with no further setup. To learn more, read Executing Commands Using Systems Manager Run Command or Manage Instances at Scale Without SSH Access Using EC2 Run Command.

Python 3.6 – The newest version of Python is now included and can be managed via virtualenv and alternatives. You can install Python 3.6 like this:

$ sudo yum install python36 python36-virtualenv python36-pip

Ruby 2.4 – The latest version of Ruby in the 2.4 series is now available. Install it like this:

$ sudo yum install ruby24

OpenSSL – The AMI now uses OpenSSL 1.0.2k.

HTTP/2 – The HTTP/2 protocol is now supported by the AMI’s httpd24, nginx, and curl packages.

Relational DatabasesPostgres 9.6 and MySQL 5.7 are now available, and can be installed like this:

$ sudo yum install postgresql96
$ sudo yum install mysql57

OpenMPI – The OpenMPI package has been upgraded from 1.6.4 to 2.1.1. OpenMPI compatibility packages are available and can be used to build and run older OpenMPI applications.

And More – Other updated packages include Squid 3.5, Nginx 1.12, Tomcat 8.5, and GCC 6.4.

Launch it Today
You can use this AMI to launch EC2 instances in all AWS Regions today. It is available for EBS-backed and Instance Store-backed instances and supports HVM and PV modes.

Jeff;

Evergreen 3.0.0 released

Post Syndicated from ris original https://lwn.net/Articles/735379/rss

The Evergreen community has announced the
release
of Evergreen 3.0.0, software for libraries. This release
includes community support of the web staff client for production use,
serials and offline circulation modules for the web staff client,
improvements to the display of headings in the public catalog browse list,
and more.

‘China Should Crack Down on Pirate Streaming Box Distributors’

Post Syndicated from Ernesto original https://torrentfreak.com/china-should-crack-down-on-pirate-streaming-box-distributors-171001/

The International Intellectual Property Alliance (IIPA) has informed the U.S. Government that China must step up its game to better protect the interests of copyright holders.

The US Trade Representative is reviewing whether China has done enough to comply with its WTO obligations, but IIPA members including RIAA and MPAA believe there is still work to be done.

One of the areas to which the Chinese Government should pay more attention is enforcement. Although a lot of progress has been made in recent years, especially in combating music piracy, new threats have emerged.

One of the areas highlighted by IIPA is the streaming box ecosystem, aptly dubbed as “piracy 3.0” by the Motion Picture Association. This appeals to a new breed of pirates who rely on set-top boxes which are filled with pirate add-ons.

Industry groups often refer to these boxes as Illicit Streaming Devices (ISDs) and they see China as a major hub through which these are shipped around the world.

“ISDs are media boxes, set-top boxes or other devices that allow users, through the use of piracy apps, to stream, download, or otherwise access unauthorized content from the Internet,” IIPA writes.

“These devices have emerged as a significant means through which pirated motion picture and television content is accessed on televisions in homes in China as well as elsewhere in Asia and increasingly around the world. China is a hub for the manufacture of these devices.”

Although the hardware and media players are perfectly legal, things get problematic when they’re loaded with pirate add-ons and promoted as tools to facilitate copyright infringement.

IIPA states that the Chinese Government should do more to stop these devices from being sold. Cracking down on the main distribution points would be a good start, they say.

“However it is done, the Chinese government must increase enforcement efforts, including cracking down on piracy apps and on device retailers and/or distributors who preload the devices with apps that facilitate infringement.

“Moreover, because China is the main source of this problem spreading across Asia, the Chinese government should take immediate actions against key distribution points for devices that are being used illegally,” IIPA adds.

In addition to pirate boxes, the industry groups also want China to beef up its enforcement against online journal piracy, pirate apps, unauthorized camcording, and unlicensed streaming platforms.

IIPA intends to explain the above and several other shortcomings in detail during a hearing in Washington, DC, next Wednesday. The group has submitted an overview of its testimony to the Trade Representative, which is available here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Creating a Cost-Efficient Amazon ECS Cluster for Scheduled Tasks

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or other tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances.  Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose these logs to be delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

  • Amazon ECR repository for storing the Docker image to be used in the task definition
  • S3 bucket that holds the transferred logs
  • Task definition, with image name and S3 bucket as environment variables provided via input parameter
  • CloudWatch Events rule
  • Amazon ECS cluster
  • Amazon ECS container instances using Spot Fleet
  • IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

  1. Clone the code from GitHub that performs RDS API calls to retrieve the log files.
    git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
  2. Build and tag the image.
    cd aws-ecs-scheduled-tasks/container-code/src && ls

    Dockerfile		rdslogsshipper.py	requirements.txt

    docker build -t rdslogsshipper .

    Sending build context to Docker daemon 9.728 kB
    Step 1 : FROM python:3
     ---> 41397f4f2887
    Step 2 : WORKDIR /usr/src/app
     ---> Using cache
     ---> 59299c020e7e
    Step 3 : COPY requirements.txt ./
     ---> 8c017e931c3b
    Removing intermediate container df09e1bed9f2
    Step 4 : COPY rdslogsshipper.py /usr/src/app
     ---> 099a49ca4325
    Removing intermediate container 1b1da24a6699
    Step 5 : RUN pip install --no-cache-dir -r requirements.txt
     ---> Running in 3ed98b30901d
    Collecting boto3 (from -r requirements.txt (line 1))
      Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB)
    Collecting botocore (from -r requirements.txt (line 2))
      Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB)
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1))
      Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB)
    Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1))
      Downloading jmespath-0.9.3-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2))
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2))
      Downloading docutils-0.14-py3-none-any.whl (543kB)
    Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2))
      Downloading six-1.10.0-py2.py3-none-any.whl
    Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3
    Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0
     ---> f892d3cb7383
    Removing intermediate container 3ed98b30901d
    Step 6 : COPY . .
     ---> ea7550c04fea
    Removing intermediate container b558b3ebd406
    Successfully built ea7550c04fea
  3. Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
  4. Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
  5. Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
    aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule  --schedule-expression "rate(15 minutes)"

    {
        "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule"
    }
  6. CloudWatch requires IAM permissions to place a task on the Amazon ECS cluster when the CloudWatch event rule is executed, in addition to an IAM role that can be assumed by CloudWatch Events. This is done in three steps:
    1. Create the IAM role to be assumed by CloudWatch.
      aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json

      {
          "Role": {
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17", 
                  "Statement": [
                      {
                          "Action": "sts:AssumeRole", 
                          "Effect": "Allow", 
                          "Principal": {
                              "Service": "events.amazonaws.com"
                          }
                      }
                  ]
              }, 
              "RoleId": "AROAIRYYLDCVZCUACT7FS", 
              "CreateDate": "2017-07-14T22:44:52.627Z", 
              "RoleName": "Test-Role", 
              "Path": "/", 
              "Arn": "arn:aws:iam::12345678901:role/Test-Role"
          }
      }

      The following is an example of the event-role.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                    "Service": "events.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
      aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json

      {
          "Policy": {
              "PolicyName": "test-policy", 
              "CreateDate": "2017-07-14T22:51:20.293Z", 
              "AttachmentCount": 0, 
              "IsAttachable": true, 
              "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", 
              "DefaultVersionId": "v1", 
              "Path": "/", 
              "Arn": "arn:aws:iam::123455678901:policy/test-policy", 
              "UpdateDate": "2017-07-14T22:51:20.293Z"
          }
      }

      The following is an example of the event-policy.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecs:RunTask"
                ],
                "Resource": [
                    "arn:aws:ecs:*::task-definition/"
                ],
                "Condition": {
                    "ArnLike": {
                        "ecs:cluster": "arn:aws:ecs:*::cluster/"
                    }
                }
            }
          ]
      }
    3. Attach the IAM policy to the role.
      aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
  7. Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
    aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"

    {
        "FailedEntries": [], 
        "FailedEntryCount": 0
    }

That’s it. The logs now run based on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run outside of Lambda execution time limits or be not cost-effective for an EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.