Tag Archives: Spark

timeShift(GrafanaBuzz, 1w) Issue 25

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/12/08/timeshiftgrafanabuzz-1w-issue-25/

Welcome to TimeShift

This week, a few of us from Grafana Labs, along with 4,000 of our closest friends, headed down to chilly Austin, TX for KubeCon + CloudNativeCon North America 2017. We got to see a number of great talks and were thrilled to see Grafana make appearances in some of the presentations. We were also a sponsor of the conference and handed out a ton of swag (we overnighted some of our custom Grafana scarves, which came in handy for Thursday’s snow).

We also announced Grafana Labs has joined the Cloud Native Computing Foundation as a Silver member! We’re excited to share our expertise in time series data visualization and open source software with the CNCF community.


Latest Release

Grafana 4.6.2 is available and includes some bug fixes:

  • Prometheus: Fixes bug with new Prometheus alerts in Grafana. Make sure to download this version if you’re using Prometheus for alerting. More details in the issue. #9777
  • Color picker: Bug after using textbox input field to change/paste color string #9769
  • Cloudwatch: build using golang 1.9.2 #9667, thanks @mtanda
  • Heatmap: Fixed tooltip for “time series buckets” mode #9332
  • InfluxDB: Fixed query editor issue when using > or < operators in WHERE clause #9871

Download Grafana 4.6.2 Now


From the Blogosphere

Grafana Labs Joins the CNCF: Grafana Labs has officially joined the Cloud Native Computing Foundation (CNCF). We look forward to working with the CNCF community to democratize metrics and help unify traditionally disparate information.

Automating Web Performance Regression Alerts: Peter and his team needed a faster and easier way to find web performance regressions at the Wikimedia Foundation. Grafana 4’s alerting features were exactly what they needed. This post covers their journey on setting up alerts for both RUM and synthetic testing and shares the alerts they’ve set up on their dashboards.

How To Install Grafana on Ubuntu 17.10: As you probably guessed from the title, this article walks you through installing and configuring Grafana in the latest version of Ubuntu (or earlier releases). It also covers installing plugins using the Grafana CLI tool.

Prometheus: Starting the Server with Alertmanager, cAdvisor and Grafana: Learn how to monitor Docker from scratch using cAdvisor, Prometheus and Grafana in this detailed, step-by-step walkthrough.

Monitoring Java EE Servers with Prometheus and Payara: In this screencast, Adam uses firehose; a Java EE 7+ metrics gateway for Prometheus, to convert the JSON output into Prometheus statistics and visualizes the data in Grafana.

Monitoring Spark Streaming with InfluxDB and Grafana: This article focuses on how to monitor Apache Spark Streaming applications with InfluxDB and Grafana at scale.


GrafanaCon EU, March 1-2, 2018

We are currently reaching out to everyone who submitted a talk to GrafanaCon and will soon publish the final schedule at grafanacon.org.

Join us March 1-2, 2018 in Amsterdam for 2 days of talks centered around Grafana and the surrounding monitoring ecosystem including Graphite, Prometheus, InfluxData, Elasticsearch, Kubernetes, and more.

Get Your Ticket Now


Grafana Plugins

Lots of plugin updates and a new OpenNMS Helm App plugin to announce! To install or update any plugin in an on-prem Grafana instance, use the Grafana-cli tool, or install and update with 1 click on Hosted Grafana.

NEW PLUGIN

OpenNMS Helm App – The new OpenNMS Helm App plugin replaces the old OpenNMS data source. Helm allows users to create flexible dashboards using both fault management (FM) and performance management (PM) data from OpenNMS® Horizon™ and/or OpenNMS® Meridian™. The old data source is now deprecated.


Install Now

UPDATED PLUGIN

PNP Data Source – This data source plugin (that uses PNP4Nagios to access RRD files) received a small, but important update that fixes template query parsing.


Update

UPDATED PLUGIN

Vonage Status Panel – The latest version of the Status Panel comes with a number of small fixes and changes. Below are a few of the enhancements:

  • Threshold settings – removed Show Always option, and replaced it with 2 options:
    • Display Alias – Select when to show the metric alias.
    • Display Value – Select when to show the metric value.
  • Text format configuration (bold / italic) for warning / critical / disabled states.
  • Option to change the corner radius of the panel. Now you can change the panel’s shape to have rounded corners.

Update

UPDATED PLUGIN

Google Calendar Plugin – This plugin received a small update, so be sure to install version 1.0.4.


Update

UPDATED PLUGIN

Carpet Plot Panel – The Carpet Plot Panel received a fix for IE 11, and also added the ability to choose custom colors.


Update


Upcoming Events:

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

Docker Meetup @ Tuenti | Madrid, Spain – Dec 12, 2017: Javier Provecho: Intro to Metrics with Swarm, Prometheus and Grafana

Learn how to gain visibility in real time for your micro services. We’ll cover how to deploy a Prometheus server with persistence and Grafana, how to enable metrics endpoints for various service types (docker daemon, traefik proxy and postgres) and how to scrape, visualize and set up alarms based on those metrics.

RSVP

Grafana Lyon Meetup n ° 2 | Lyon, France – Dec 14, 2017: This meetup will cover some of the latest innovations in Grafana and discussion about automation. Also, free beer and chips, so – of course you’re going!

RSVP

FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. Carl Bergquist is managing the Cloud and Monitoring Devroom, and we’ve heard there were some great talks submitted. There is no need to register; all are welcome.


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

We were thrilled to see our dashboards bigger than life at KubeCon + CloudNativeCon this week. Thanks for snapping a photo and sharing!


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

Hard to believe this is the 25th issue of Timeshift! I have a blast writing these roundups, but Let me know what you think. Submit a comment on this article below, or post something at our community forum. Find an article I haven’t included? Send it my way. Help us make timeShift better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

Sky’s Pirate Site-Blocking Move is Something For North Korea, ISPs Say

Post Syndicated from Andy original https://torrentfreak.com/skys-pirate-site-blocking-move-is-something-for-north-korea-isps-say-171129/

Entertainment companies have been taking legal action to have pirate sites blocked for more than a decade so it was only a matter of time before New Zealand had a taste of the action.

It’s now been revealed that Sky Network Television, the country’s biggest pay-TV service, filed a complaint with the High Court in September, demanding that four local Internet service providers block subscriber access to several ‘pirate’ sites.

At this point, the sites haven’t been named, but it seems almost inevitable that the likes of The Pirate Bay will be present. The ISPs are known, however. Spark, Vodafone, Vocus and Two Degrees control around 90% of the Kiwi market so any injunction handed down will affect almost the entire country.

In its application, Sky states that pirate sites make available unauthorized copies of its entertainment works, something which not only infringes its copyrights but also undermines its business model. But while this is standard fare in such complaints, the Internet industry backlash today is something out of the ordinary.

ISPs in other jurisdictions have fought back against blocking efforts but few have deployed the kind of language being heard in New Zealand this morning.

Vocus Group – which runs the Orcon, Slingshot and Flip brands – is labeling Sky’s efforts as “gross censorship and a breach of net neutrality”, adding that they’re in direct opposition to the idea of a free and open Internet.

“SKY’s call that sites be blacklisted on their say so is dinosaur behavior, something you would expect in North Korea, not in New Zealand. It isn’t our job to police the Internet and it sure as hell isn’t SKY’s either, all sites should be equal and open,” says Vocus Consumer General Manager Taryn Hamilton.

But in response, Sky said Vocus “has got it wrong”, highlighting that site-blocking is now common practice in places such as Australia and the UK.

“Pirate sites like Pirate Bay make no contribution to the development of content, but rather just steal it. Over 40 countries around the world have put in place laws to block such sites, and we’re just looking to do the same,” the company said.

The broadcaster says it will only go to court to have dedicated pirate sites blocked, ones that “pay nothing to the creators” while stealing content for their own gain.

“We’re doing this because illegal streaming and content piracy is a major threat to the entertainment, creative and sporting industries in New Zealand and abroad. With piracy, not only is the sport and entertainment content that we love at risk, but so are the livelihoods of the thousands of people employed by these industries,” the company said.

“Illegally sharing or viewing content impacts a vast number of people and jobs including athletes, actors, artists, production crew, customer service representatives, event planners, caterers and many, many more.”

ISP Spark, which is also being targeted by Sky, was less visibly outraged than some of its competitors. However, the company still feels that controlling what people can see on the Internet is a slippery slope.

“We have some sympathy for this given we invest tens of millions of dollars into content ourselves through Lightbox. However, we don’t think it should be the role of ISPs to become the ‘police of the internet’ on behalf of other parties,” a Spark spokesperson said.

Perhaps unsurprisingly, Sky’s blocking efforts haven’t been well received by InternetNZ, the non-profit organization which protects and promotes Internet use in New Zealand.

Describing the company’s application for an injunction as an “extreme step”, InternetNZ Chief Executive Jordan Carter said that site-blocking works against the “very nature” of the Internet and is a measure that’s unlikely to achieve its goals.

“Site blocking is very easily evaded by people with the right skills or tools. Those who are deliberate pirates will be able to get around site blocking without difficulty,” Carter said.

“If blocking is ordered, it risks driving content piracy further underground, with the help of easily-deployed and common Internet tools. This could well end up making the issues that Sky are facing even harder to police in the future.”

What most of the ISPs and InternetNZ are also agreed on is the need to fight piracy with competitive, attractive legal offerings. Vocus says that local interest in The Pirate Bay has halved since Netflix launched in New Zealand, with traffic to the torrent site sitting at just 23% of its peak 2013 levels.

“The success of Netflix, iTunes and Spotify proves that people are willing to pay to access good-quality content. It’s pretty clear that SKY doesn’t understand the internet, and is trying a Hail Mary to turnaround its sunset business,” Vocus Consumer General Manager Taryn Hamilton said.

The big question now is whether the High Court has the ability to order these kinds of blocks. InternetNZ has its doubts, noting that it should only happen following a parliamentary mandate.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production

Post Syndicated from Rafi Ton original https://aws.amazon.com/blogs/big-data/using-amazon-redshift-spectrum-amazon-athena-and-aws-glue-with-node-js-in-production/

This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”

At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.

We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.

Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.

In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.

Amazon Redshift as our foundation

The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.

The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.

However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. In our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident, and we saw our CPU utilization grow to 90%.

Why we extended Amazon Redshift to Redshift Spectrum

Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.

Seamless scalability, high performance, and unlimited concurrency

Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.

Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.

Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.

Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.

Keeping it SQL

Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.

An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.

Leveraging Parquet for higher performance

Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.

Lower cost

From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.

What we learned about Amazon Redshift vs. Redshift Spectrum performance

When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:

  1. What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
  2. Does the data format impact performance?

During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:

  • Amazon Redshift cluster with 28 DC1.large nodes
  • Redshift Spectrum using CSV/GZIP
  • Redshift Spectrum using Parquet

We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.

Simple query

First, we tested a simple query aggregating billing data across a month:

SELECT 
  user_id, 
  count(*) AS impressions, 
  SUM(billing)::decimal /1000000 AS billing 
FROM <table_name> 
WHERE 
  date >= '2017-08-01' AND 
  date <= '2017-08-31'  
GROUP BY 
  user_id;

We ran the same query seven times and measured the response times (red marking the longest time and green the shortest time):

Execution Time (seconds)
  Amazon Redshift Redshift Spectrum
CSV
Redshift Spectrum Parquet
Run #1 39.65 45.11 11.92
Run #2 15.26 43.13 12.05
Run #3 15.27 46.47 13.38
Run #4 21.22 51.02 12.74
Run #5 17.27 43.35 11.76
Run #6 16.67 44.23 13.67
Run #7 25.37 40.39 12.75
Average 21.53  44.82 12.61

For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.

What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.

Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:

Data Scanned (GB)
CSV (Gzip) 135.49
Parquet 2.83

Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.

Complex query

Next, we compared the same three configurations with a complex query.

Execution Time (seconds)
  Amazon Redshift Redshift Spectrum CSV Redshift Spectrum Parquet
Run #1 329.80 84.20 42.40
Run #2 167.60 65.30 35.10
Run #3 165.20 62.20 23.90
Run #4 273.90 74.90 55.90
Run #5 167.70 69.00 58.40
Average 220.84 71.12 43.14

This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!

Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.

Optimizing the data structure for different workloads

Because the cost of S3 is relatively inexpensive and we pay only for the data scanned by each query, we believe that it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3. It all depends on how we partition the data and update the table partitions.

Data permutations

For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running the query on the table with something as follows:

SELECT 
  user, 
  COUNT(*) 
FROM 
  events_table 
WHERE 
  ts BETWEEN ‘2017-08-01 14:00:00’ AND ‘2017-08-01 14:00:59’ 
GROUP BY 
  user;

(Assuming ‘ts’ is your column storing the time stamp for each event.)

With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.

Creating Parquet data efficiently

On the average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic copy command from S3 to Amazon Redshift.

Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.

This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:

  • Collect the event data from the instances.
  • Save the data in Parquet format.
  • Partition the data effectively.

To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:

  1. Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
  2. Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue.
  3. Add the Parquet data to S3 by updating the table partitions.

With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.

Data validation

To store our click data in a table, we considered the following SQL create table command:

create external TABLE spectrum.blog_clicks (
    user_id varchar(50),
    campaign_id varchar(50),
    os varchar(50),
    ua varchar(255),
    ts bigint,
    billing float
)
partitioned by (date date, hour smallint)  
stored as parquet
location 's3://nuviad-temp/blog/clicks/';

The above statement defines a new external table (all Redshift Spectrum tables are external tables) with a few attributes. We stored ‘ts’ as a Unix time stamp and not as Timestamp, and billing data is stored as float and not decimal (more on that later). We also said that the data is partitioned by date and hour, and then stored as Parquet on S3.

First, we need to get the table definitions. This can be achieved by running the following query:

SELECT 
  * 
FROM 
  svv_external_columns 
WHERE 
  tablename = 'blog_clicks';

This query lists all the columns in the table with their respective definitions:

schemaname tablename columnname external_type columnnum part_key
spectrum blog_clicks user_id varchar(50) 1 0
spectrum blog_clicks campaign_id varchar(50) 2 0
spectrum blog_clicks os varchar(50) 3 0
spectrum blog_clicks ua varchar(255) 4 0
spectrum blog_clicks ts bigint 5 0
spectrum blog_clicks billing double 6 0
spectrum blog_clicks date date 7 1
spectrum blog_clicks hour smallint 8 2

Now we can use this data to create a validation schema for our data:

const rtb_request_schema = {
    "name": "clicks",
    "items": {
        "user_id": {
            "type": "string",
            "max_length": 100
        },
        "campaign_id": {
            "type": "string",
            "max_length": 50
        },
        "os": {
            "type": "string",
            "max_length": 50            
        },
        "ua": {
            "type": "string",
            "max_length": 255            
        },
        "ts": {
            "type": "integer",
            "min_value": 0,
            "max_value": 9999999999999
        },
        "billing": {
            "type": "float",
            "min_value": 0,
            "max_value": 9999999999999
        }
    }
};

Next, we create a function that uses this schema to validate data:

function valueIsValid(value, item_schema) {
    if (schema.type == 'string') {
        return (typeof value == 'string' && value.length <= schema.max_length);
    }
    else if (schema.type == 'integer') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'float' || schema.type == 'double') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'boolean') {
        return typeof value == 'boolean';
    }
    else if (schema.type == 'timestamp') {
        return (new Date(value)).getTime() > 0;
    }
    else {
        return true;
    }
}

Near real-time data loading with Kinesis Firehose

On Kinesis Firehose, we created a new delivery stream to handle the events as follows:

Delivery stream name: events
Source: Direct PUT
S3 bucket: nuviad-events
S3 prefix: rtb/
IAM role: firehose_delivery_role_1
Data transformation: Disabled
Source record backup: Disabled
S3 buffer size (MB): 100
S3 buffer interval (sec): 60
S3 Compression: GZIP
S3 Encryption: No Encryption
Status: ACTIVE
Error logging: Enabled

This delivery stream aggregates event data every minute, or up to 100 MB, and writes the data to an S3 bucket as a CSV/GZIP compressed file. Next, after we have the data validated, we can safely send it to our Kinesis Firehose API:

if (validated) {
    let itemString = item.join('|')+'\n'; //Sending csv delimited by pipe and adding new line

    let params = {
        DeliveryStreamName: 'events',
        Record: {
            Data: itemString
        }
    };

    firehose.putRecord(params, function(err, data) {
        if (err) {
            console.error(err, err.stack);        
        }
        else {
            // Continue to your next step 
        }
    });
}

Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.

Automating data distribution using AWS Lambda

We created a simple Lambda function triggered by an S3 put event that copies the file to a different location (or locations), while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose are structured in a pre-defined hierarchy, such as:

S3://your-bucket/your-prefix/2017/08/01/20/events-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz

All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):

/*
	object key structure in the event object:
your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz
	*/

let key_parts = event.Records[0].s3.object.key.split('/'); 

let event_type = key_parts[0];
let date = key_parts[1] + '-' + key_parts[2] + '-' + key_parts[3];
let hour = key_parts[4];
if (hour.indexOf('0') == 0) {
 		hour = parseInt(hour, 10) + '';
}
    
let parts1 = key_parts[5].split('-');
let minute = parts1[7];
if (minute.indexOf('0') == 0) {
        minute = parseInt(minute, 10) + '';
}

Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:

    copyObjectToHourlyFolder(event, date, hour, minute)
        .then(copyObjectToMinuteFolder.bind(null, event, date, hour, minute))
        .then(addPartitionToSpectrum.bind(null, event, date, hour, minute))
        .then(deleteOldMinuteObjects.bind(null, event))
        .then(deleteStreamObject.bind(null, event))        
        .then(result => {
            callback(null, { message: 'done' });            
        })
        .catch(err => {
            console.error(err);
            callback(null, { message: err });            
        }); 

Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.

Because we partition the data by date and hour, we created a new partition on the Redshift Spectrum table if the processed minute is the first minute in the hour (that is, minute 0). We ran the following:

ALTER TABLE 
  spectrum.events 
ADD partition
  (date='2017-08-01', hour=0) 
  LOCATION 's3://nuviad-temp/events/2017-08-01/0/';

After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.

Migrating CSV to Parquet using AWS Glue and Amazon EMR

The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).

Creating AWS Glue jobs

What this simple AWS Glue script does:

  • Gets parameters for the job, date, and hour to be processed
  • Creates a Spark EMR context allowing us to run Spark code
  • Reads CSV data into a DataFrame
  • Writes the data as Parquet to the destination S3 bucket
  • Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table
import sys
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','day_partition_key', 'hour_partition_key', 'day_partition_value', 'hour_partition_value' ])

#day_partition_key = "partition_0"
#hour_partition_key = "partition_1"
#day_partition_value = "2017-08-01"
#hour_partition_value = "0"

day_partition_key = args['day_partition_key']
hour_partition_key = args['hour_partition_key']
day_partition_value = args['day_partition_value']
hour_partition_value = args['hour_partition_value']

print("Running for " + day_partition_value + "/" + hour_partition_value)

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

df = spark.read.option("delimiter","|").csv("s3://nuviad-temp/events/"+day_partition_value+"/"+hour_partition_value)
df.registerTempTable("data")

df1 = spark.sql("select _c0 as user_id, _c1 as campaign_id, _c2 as os, _c3 as ua, cast(_c4 as bigint) as ts, cast(_c5 as double) as billing from data")

df1.repartition(1).write.mode("overwrite").parquet("s3://nuviad-temp/parquet/"+day_partition_value+"/hour="+hour_partition_value)

client = boto3.client('athena', region_name='us-east-1')

response = client.start_query_execution(
    QueryString='alter table parquet_events add if not exists partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ')  location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

response = client.start_query_execution(
    QueryString='alter table parquet_events partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ') set location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

job.commit()

Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.

Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:

S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq

We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.

Creating a Lambda function to trigger conversion

Next, we created a simple Lambda function to trigger the AWS Glue script hourly using a simple Python code:

import boto3
import json
from datetime import datetime, timedelta
 
client = boto3.client('glue')
 
def lambda_handler(event, context):
    last_hour_date_time = datetime.now() - timedelta(hours = 1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d") 
    hour_partition_value = last_hour_date_time.strftime("%-H") 
    response = client.start_job_run(
    JobName='convertEventsParquetHourly',
    Arguments={
         '--day_partition_key': 'date',
         '--hour_partition_key': 'hour',
         '--day_partition_value': day_partition_value,
         '--hour_partition_value': hour_partition_value
         }
    )

Using Amazon CloudWatch Events, we trigger this function hourly. This function triggers an AWS Glue job named ‘convertEventsParquetHourly’ and runs it for the previous hour, passing job names and values of the partitions to process to AWS Glue.

Redshift Spectrum and Node.js

Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.

Node.js and Parquet

The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.

One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.

Timestamp data type

According to the Parquet documentation, Timestamp data are stored in Parquet as 64-bit integers. However, JavaScript does not support 64-bit integers, because the native number type is a 64-bit double, giving only 53 bits of integer range.

The result is that you cannot store Timestamp correctly in Parquet using Node.js. The solution is to store Timestamp as string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.

Lessons learned

You can benefit from our trial-and-error experience.

Lesson #1: Data validation is critical

As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.

Lesson #2: Structure and partition data effectively

One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.

Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.

Storing data in the right format

Use Parquet whenever you can. The benefits of Parquet are substantial. Faster performance, less data to scan, and much more efficient columnar format. However, it is not supported out-of-the-box by Kinesis Firehose, so you need to implement your own ETL. AWS Glue is a great option.

Creating small tables for frequent tasks

When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.

Lesson #3: Combine Athena and Redshift Spectrum for optimal performance

Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.

Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.

Lesson #4: Sort your Parquet data within the partition

We achieved another performance improvement by sorting data within the partition using sortWithinPartitions(sort_field). For example:

df.repartition(1).sortWithinPartitions("campaign_id")…

Conclusion

We were extremely pleased with using Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift to take advantage of scalability, performance, and cost with Redshift Spectrum.

Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.


About the Author

With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting edge products and services, in the real world generating real money. Being an experienced entrepreneur, Rafi believes in practical-programming and fast adaptation of new technologies to achieve a significant market advantage.

 

 

Looming Net Neutrality Repeal Sparks BitTorrent Throttling Fears

Post Syndicated from Ernesto original https://torrentfreak.com/looming-net-neutrality-repeal-sparks-bittorrent-throttling-fears-171123/

Ten years ago we uncovered that Comcast was systematically slowing down BitTorrent traffic to ease the load on its network.

The Comcast case ignited a broad discussion about net neutrality and provided the setup for the FCC’s Open Internet Order, which came into effect three years later.

This Open Internet Order then became the foundation of the net neutrality regulation that was adopted in 2015 and still applies today. The big change compared to the earlier attempt was that ISPs can be regulated as carriers under Title II.

These rules provide a clear standard that prevents ISPs from blocking, throttling, and paid prioritization of “lawful” traffic. However, this may soon be over as the FCC is determined to repeal it.

FCC head Ajit Pai recently told Reuters that the current rules are too restrictive and hinder competition and innovation, which is ultimately not in the best interests of consumers

“The FCC will no longer be in the business of micromanaging business models and preemptively prohibiting services and applications and products that could be pro-competitive,” Pai said. “We should simply set rules of the road that let companies of all kinds in every sector compete and let consumers decide who wins and loses.”

This week the FCC released its final repeal draft (pdf), which was met with fierce resistance from the public and various large tech companies. They fear that, if the current net neutrality rules disappear, throttling and ‘fast lanes’ for some services will become commonplace.

This could also mean that BitTorrent traffic could become a target once again, with it being blocked or throttled across many networks, as The Verge just pointed out.

Blocking BitTorrent traffic would indeed become much easier if current net neutrality safeguards were removed. However, the FCC believes that the current “no-throttling rules are unnecessary to prevent the harms that they were intended to thwart,” such as blocking entire file transfer protocols.

Instead, the FCC notes that antitrust law, FTC enforcement of ISP commitments, and consumer expectations will prevent any unwelcome blocking. This is also the reason why ISPs adopted no-blocking policies even when they were not required to, they point out.

Indeed, when the DC Circuit Court of Appeals decimated the Open Internet Order in 2014, Comcast was quick to assure subscribers that it had no plans to start throttling torrents again. Yes, that offers no guarantees for the future.

The FCC goes on to mention that the current net neutrality rules don’t prevent selective blocking. They can already be bypassed by ISPs if they offer “curated services,” which allows them to filter content on viewpoint grounds. And Edge providers also block content because it violates their “viewpoints,” citing the Cloudflare termination of The Daily Stormer.

Net neutrality supporters see these explanations as weak excuses and have less trust in the self-regulating capacity of the ISP industry that the FCC, calling for last minute protests to stop the repeal.

For now it appears, however, that the FCC is unlikely to change its course, as Ars Technica reports.

While net neutrality concerns are legitimate, for BitTorrent users not that much will change.

As we’ve highlighted in the past, blocking pirate sites is already an option under the current rules. The massive copyright loophole made sure of that. Targeting all torrent traffic is even an option, in theory.

If net neutrality is indeed repealed next month, blocking or throttling BitTorrent traffic across the entire network will become easier, no doubt. For now, however, there are no signs that any ISPs plan to do so.

If it does, we will know soon enough. The FCC will require ISPs to be transparent under the new plan. They have to disclose network management practices, blocking efforts, commercial prioritization, and the like.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Amazon QuickSight Update – Geospatial Visualization, Private VPC Access, and More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-quicksight-update-geospatial-visualization-private-vpc-access-and-more/

We don’t often recognize or celebrate anniversaries at AWS. With nearly 100 services on our list, we’d be eating cake and drinking champagne several times a week. While that might sound like fun, we’d rather spend our working hours listening to customers and innovating. With that said, Amazon QuickSight has now been generally available for a little over a year and I would like to give you a quick update!

QuickSight in Action
Today, tens of thousands of customers (from startups to enterprises, in industries as varied as transportation, legal, mining, and healthcare) are using QuickSight to analyze and report on their business data.

Here are a couple of examples:

Gemini provides legal evidence procurement for California attorneys who represent injured workers. They have gone from creating custom reports and running one-off queries to creating and sharing dynamic QuickSight dashboards with drill-downs and filtering. QuickSight is used to track sales pipeline, measure order throughput, and to locate bottlenecks in the order processing pipeline.

Jivochat provides a real-time messaging platform to connect visitors to website owners. QuickSight lets them create and share interactive dashboards while also providing access to the underlying datasets. This has allowed them to move beyond the sharing of static spreadsheets, ensuring that everyone is looking at the same and is empowered to make timely decisions based on current data.

Transfix is a tech-powered freight marketplace that matches loads and increases visibility into logistics for Fortune 500 shippers in retail, food and beverage, manufacturing, and other industries. QuickSight has made analytics accessible to both BI engineers and non-technical business users. They scrutinize key business and operational metrics including shipping routes, carrier efficient, and process automation.

Looking Back / Looking Ahead
The feedback on QuickSight has been incredibly helpful. Customers tell us that their employees are using QuickSight to connect to their data, perform analytics, and make high-velocity, data-driven decisions, all without setting up or running their own BI infrastructure. We love all of the feedback that we get, and use it to drive our roadmap, leading to the introduction of over 40 new features in just a year. Here’s a summary:

Looking forward, we are watching an interesting trend develop within our customer base. As these customers take a close look at how they analyze and report on data, they are realizing that a serverless approach offers some tangible benefits. They use Amazon Simple Storage Service (S3) as a data lake and query it using a combination of QuickSight and Amazon Athena, giving them agility and flexibility without static infrastructure. They also make great use of QuickSight’s dashboards feature, monitoring business results and operational metrics, then sharing their insights with hundreds of users. You can read Building a Serverless Analytics Solution for Cleaner Cities and review Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight if you are interested in this approach.

New Features and Enhancements
We’re still doing our best to listen and to learn, and to make sure that QuickSight continues to meet your needs. I’m happy to announce that we are making seven big additions today:

Geospatial Visualization – You can now create geospatial visuals on geographical data sets.

Private VPC Access – You can now sign up to access a preview of a new feature that allows you to securely connect to data within VPCs or on-premises, without the need for public endpoints.

Flat Table Support – In addition to pivot tables, you can now use flat tables for tabular reporting. To learn more, read about Using Tabular Reports.

Calculated SPICE Fields – You can now perform run-time calculations on SPICE data as part of your analysis. Read Adding a Calculated Field to an Analysis for more information.

Wide Table Support – You can now use tables with up to 1000 columns.

Other Buckets – You can summarize the long tail of high-cardinality data into buckets, as described in Working with Visual Types in Amazon QuickSight.

HIPAA Compliance – You can now run HIPAA-compliant workloads on QuickSight.

Geospatial Visualization
Everyone seems to want this feature! You can now take data that contains a geographic identifier (country, city, state, or zip code) and create beautiful visualizations with just a few clicks. QuickSight will geocode the identifier that you supply, and can also accept lat/long map coordinates. You can use this feature to visualize sales by state, map stores to shipping destinations, and so forth. Here’s a sample visualization:

To learn more about this feature, read Using Geospatial Charts (Maps), and Adding Geospatial Data.

Private VPC Access Preview
If you have data in AWS (perhaps in Amazon Redshift, Amazon Relational Database Service (RDS), or on EC2) or on-premises in Teradata or SQL Server on servers without public connectivity, this feature is for you. Private VPC Access for QuickSight uses an Elastic Network Interface (ENI) for secure, private communication with data sources in a VPC. It also allows you to use AWS Direct Connect to create a secure, private link with your on-premises resources. Here’s what it looks like:

If you are ready to join the preview, you can sign up today.

Jeff;

 

New White House Announcement on the Vulnerability Equities Process

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/11/new_white_house_1.html

The White House has released a new version of the Vulnerabilities Equities Process (VEP). This is the inter-agency process by which the US government decides whether to inform the software vendor of a vulnerability it finds, or keep it secret and use it to eavesdrop on or attack other systems. You can read the new policy or the fact sheet, but the best place to start is Cybersecurity Coordinator Rob Joyce’s blog post.

In considering a way forward, there are some key tenets on which we can build a better process.

Improved transparency is critical. The American people should have confidence in the integrity of the process that underpins decision making about discovered vulnerabilities. Since I took my post as Cybersecurity Coordinator, improving the VEP and ensuring its transparency have been key priorities, and we have spent the last few months reviewing our existing policy in order to improve the process and make key details about the VEP available to the public. Through these efforts, we have validated much of the existing process and ensured a rigorous standard that considers many potential equities.

The interests of all stakeholders must be fairly represented. At a high level we consider four major groups of equities: defensive equities; intelligence / law enforcement / operational equities; commercial equities; and international partnership equities. Additionally, ordinary people want to know the systems they use are resilient, safe, and sound. These core considerations, which have been incorporated into the VEP Charter, help to standardize the process by which decision makers weigh the benefit to national security and the national interest when deciding whether to disclose or restrict knowledge of a vulnerability.

Accountability of the process and those who operate it is important to establish confidence in those served by it. Our public release of the unclassified portions Charter will shed light on aspects of the VEP that were previously shielded from public review, including who participates in the VEP’s governing body, known as the Equities Review Board. We make it clear that departments and agencies with protective missions participate in VEP discussions, as well as other departments and agencies that have broader equities, like the Department of State and the Department of Commerce. We also clarify what categories of vulnerabilities are submitted to the process and ensure that any decision not to disclose a vulnerability will be reevaluated regularly. There are still important reasons to keep many of the specific vulnerabilities evaluated in the process classified, but we will release an annual report that provides metrics about the process to further inform the public about the VEP and its outcomes.

Our system of government depends on informed and vigorous dialogue to discover and make available the best ideas that our diverse society can generate. This publication of the VEP Charter will likely spark discussion and debate. This discourse is important. I also predict that articles will make breathless claims of “massive stockpiles” of exploits while describing the issue. That simply isn’t true. The annual reports and transparency of this effort will reinforce that fact.

Mozilla is pleased with the new charter. I am less so; it looks to me like the same old policy with some new transparency measures — which I’m not sure I trust. The devil is in the details, and we don’t know the details — and it has giant loopholes that pretty much anything can fall through:

The United States Government’s decision to disclose or restrict vulnerability information could be subject to restrictions by partner agreements and sensitive operations. Vulnerabilities that fall within these categories will be cataloged by the originating Department/Agency internally and reported directly to the Chair of the ERB. The details of these categories are outlined in Annex C, which is classified. Quantities of excepted vulnerabilities from each department and agency will be provided in ERB meetings to all members.

This is me from last June:

There’s a lot we don’t know about the VEP. The Washington Post says that the NSA used EternalBlue “for more than five years,” which implies that it was discovered after the 2010 process was put in place. It’s not clear if all vulnerabilities are given such consideration, or if bugs are periodically reviewed to determine if they should be disclosed. That said, any VEP that allows something as dangerous as EternalBlue — or the Cisco vulnerabilities that the Shadow Brokers leaked last August — to remain unpatched for years isn’t serving national security very well. As a former NSA employee said, the quality of intelligence that could be gathered was “unreal.” But so was the potential damage. The NSA must avoid hoarding vulnerabilities.

I stand by that, and am not sure the new policy changes anything.

More commentary.

Here’s more about the Windows vulnerabilities hoarded by the NSA and released by the Shadow Brokers.

EDITED TO ADD (11/18): More news.

EDITED TO ADD (11/22): Adam Shostack points out that the process does not cover design flaws or trade-offs, and that those need to be covered:

…we need the VEP to expand to cover those issues. I’m not going to claim that will be easy, that the current approach will translate, or that they should have waited to handle those before publishing. One obvious place it gets harder is the sources and methods tradeoff. But we need the internet to be a resilient and trustworthy infrastructure.

Sci-Hub Won’t Be Blocked by US ISPs Anytime Soon

Post Syndicated from Ernesto original https://torrentfreak.com/sci-hub-wont-be-blocked-by-us-isps-anytime-soon-171111/

Sci-Hub, often referred to as the “Pirate Bay of Science,” hasn’t had a particularly good run in US courts so far.

Following a $15 million defeat against Elsevier in June, the American Chemical Society won a default judgment of $4.8 million in copyright damages late last week.

In addition, the publisher was granted an unprecedented injunction, requiring various third-party services to stop providing access to the site.

The order specifically mentions domain registrars and hosting companies, but also search engines and ISPs, although only those who are in “active concert or participation” with the site. This order sparked fears that Google, Comcast, and others would be ordered to take action, but that’s not the case.

After the news broke ACS issued a press release clarifying that it would not go after search engines and ISPs when they are not in “active participation” with Sci-Hub. The problem is that this can be interpreted quite broadly, leaving plenty of room for uncertainty.

Luckily, ACS Director Glenn Ruskin was willing to provide more clarity. He stressed that search engines and ISPs won’t be targeted for simply linking users to Sci-Hub. Companies that host the content are a target though.

“The court’s affirmative ruling does not apply to search engines writ large, but only to those entities who have been in active concert or participation with Sci-Hub, such as websites that host ACS content stolen by Sci-Hub,” Ruskin said.

When we asked whether this means that ISPs such as Comcast are not likely to be targeted, the answer was affirmative.

“That is correct, unless the internet service provider has been in active concert or participation with SciHub. Simply linking to SciHub does not rise to be in active concert or participation,” Ruskin clarified.

The above suggests that ACS will go after domain name registrars, hosting companies, and perhaps Cloudflare, but not any further. Still, even if that’s the case there is cause for worry among several digital rights activists.

The Electronic Frontier Foundation believes that these type of orders set a dangerous precedent. The concept of “active concert or participation” should only cover close associates and co-conspirators, not everyone who provides a service to the defendant. Domain registrars and registries have often been compelled to take action in similar cases, but EFF says this goes too far.

“The courts need to limit who can be bound by orders like this one, to prevent them from being abused,” EFF Senior Staff Attorney Mitch Stoltz informs TorrentFreak.

“In particular, domain name registrars and registries shouldn’t be ordered to help take down a website because of a dispute over the site’s contents. That invites others to use the domain name system as a tool for censorship.”

News of the Sci-Hub injunction has sparked controversy and confusion in recent days, not least because Sci-hub.cc became unavailable soon after. Instead of showing the usual search box, visitors now see a “403 Forbidden” error message. On top of that, the bulletproof Tor version of the site also went offline.

The error message indicates that there’s a hosting issue. While it’s easy to conclude that the court’s injunction has something to do with this, that might not necessarily be the case. Sci-Hub’s hosting company isn’t tied to the US and has a history of protecting sites from takedown efforts.

We reached out to Sci-Hub founder Alexandra Elbakyan for comment but we’re yet to receive a response. The site hasn’t posted any relevant updates on its social media pages either.

That said, the site is far from done. In addition to the Tor domain, Sci-Hub has several other backups in place such as Sci-Hub.io and Sci-Hub.ac, which are up and running as usual.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Visualize AWS Cloudtrail Logs using AWS Glue and Amazon Quicksight

Post Syndicated from Luis Caro Perez original https://aws.amazon.com/blogs/big-data/streamline-aws-cloudtrail-log-visualization-using-aws-glue-and-amazon-quicksight/

Being able to easily visualize AWS CloudTrail logs gives you a better understanding of how your AWS infrastructure is being used. It can also help you audit and review AWS API calls and detect security anomalies inside your AWS account. To do this, you must be able to perform analytics based on your CloudTrail logs.

In this post, I walk through using AWS Glue and AWS Lambda to convert AWS CloudTrail logs from JSON to a query-optimized format dataset in Amazon S3. I then use Amazon Athena and Amazon QuickSight to query and visualize the data.

Solution overview

To process CloudTrail logs, you must implement the following architecture:

CloudTrail delivers log files in an Amazon S3 bucket folder. To correctly crawl these logs, you modify the file contents and folder structure using an Amazon S3-triggered Lambda function that stores the transformed files in an S3 bucket single folder. When the files are in a single folder, AWS Glue scans the data, converts it into Apache Parquet format, and catalogs it to allow for querying and visualization using Amazon Athena and Amazon QuickSight.

Walkthrough

Let’s look at the steps that are required to build the solution.

Set up CloudTrail logs

First, you need to set up a trail that delivers log files to an S3 bucket. To create a trail in CloudTrail, follow the instructions in Creating a Trail.

When you finish, the trail settings page should look like the following screenshot:

In this example, I set up log files to be delivered to the cloudtraillfcaro bucket.

Consolidate CloudTrail reports into a single folder using Lambda

AWS CloudTrail delivers log files using the following folder structure inside the configured Amazon S3 bucket:

AWSLogs/ACCOUNTID/CloudTrail/REGION/YEAR/MONTH/HOUR/filename.json.gz

Additionally, log files have the following structure:

{
    "Records": [{
        "eventVersion": "1.01",
        "userIdentity": {
            "type": "IAMUser",
            "principalId": "AIDAJDPLRKLG7UEXAMPLE",
            "arn": "arn:aws:iam::123456789012:user/Alice",
            "accountId": "123456789012",
            "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
            "userName": "Alice",
            "sessionContext": {
                "attributes": {
                    "mfaAuthenticated": "false",
                    "creationDate": "2014-03-18T14:29:23Z"
                }
            }
        },
        "eventTime": "2014-03-18T14:30:07Z",
        "eventSource": "cloudtrail.amazonaws.com",
        "eventName": "StartLogging",
        "awsRegion": "us-west-2",
        "sourceIPAddress": "72.21.198.64",
        "userAgent": "signin.amazonaws.com",
        "requestParameters": {
            "name": "Default"
        },
        "responseElements": null,
        "requestID": "cdc73f9d-aea9-11e3-9d5a-835b769c0d9c",
        "eventID": "3074414d-c626-42aa-984b-68ff152d6ab7"
    },
    ... additional entries ...
    ]

If AWS Glue crawlers are used to catalog these files as they are written, the following obstacles arise:

  1. AWS Glue identifies different tables per different folders because they don’t follow a traditional partition format.
  2. Based on the structure of the file content, AWS Glue identifies the tables as having a single column of type array.
  3. CloudTrail logs have JSON attributes that use uppercase letters. According to the Best Practices When Using Athena with AWS Glue, it is recommended that you convert these to lowercase.

To have AWS Glue catalog all log files in a single table with all the columns describing each event, implement the following Lambda function:

from __future__ import print_function
import json
import urllib
import boto3
import gzip

s3 = boto3.resource('s3')
client = boto3.client('s3')

def convertColumntoLowwerCaps(obj):
    for key in obj.keys():
        new_key = key.lower()
        if new_key != key:
            obj[new_key] = obj[key]
            del obj[key]
    return obj


def lambda_handler(event, context):

    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    print(bucket)
    print(key)
    try:
        newKey = 'flatfiles/' + key.replace("/", "")
        client.download_file(bucket, key, '/tmp/file.json.gz')
        with gzip.open('/tmp/out.json.gz', 'w') as output, gzip.open('/tmp/file.json.gz', 'rb') as file:
            i = 0
            for line in file: 
                for record in json.loads(line,object_hook=convertColumntoLowwerCaps)['records']:
            		if i != 0:
            		    output.write("\n")
            		output.write(json.dumps(record))
            		i += 1
        client.upload_file('/tmp/out.json.gz', bucket,newKey)
        return "success"
    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

The function goes over each element of the records array, changes uppercase letters to lowercase in column names, and inserts each element of the array as a single line of a new file. The new file is saved inside a flatfiles folder created by the function without any subfolders in the S3 bucket.

The function should have a role containing a policy with at least the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::cloudtraillfcaro/*",
                "arn:aws:s3:::cloudtraillfcaro"
            ],
            "Effect": "Allow"
        }
    ]
}

In this example, CloudTrail delivers logs to the cloudtraillfcaro bucket. Make sure that you replace this name with your bucket name in the policy. For more information about how to work with inline policies, see Working with Inline Policies.

After the Lambda function is created, you can set up the following trigger using the Triggers tab on the AWS Lambda console.

Choose Add trigger, and choose S3 as a source of the trigger.

After choosing the source, configure the following settings:

In the trigger, any file that is written to the path for the log files—which in this case is AWSLogs/119582755581/CloudTrail/—is processed. Make sure that the Enable trigger check box is selected and that the bucket and prefix parameters match your use case.

After you set up the function and receive log files, the bucket (in this case cloudtraillfcaro) should contain the processed files inside the flatfiles folder.

Catalog source data

Once the files are processed by the Lambda function, set up a crawler named cloudtrail to catalog them.

The crawler must point to the flatfiles folder.

All the crawlers and AWS Glue jobs created for this solution must have a role with the AWSGlueServiceRole managed policy and an inline policy with permissions to modify the S3 buckets used on the Lambda function. For more information, see Working with Managed Policies.

The role should look like the following:

In this example, the inline policy named s3perms contains the permissions to modify the S3 buckets.

After you choose the role, you can schedule the crawler to run on demand.

A new database is created, and the crawler is set to use it. In this case, the cloudtrail database is used for all the tables.

After the crawler runs, a single table should be created in the catalog with the following structure:

The table should contain the following columns:

Create and run the AWS Glue job

To convert all the CloudTrail logs to a columnar store in Parquet, set up an AWS Glue job by following these steps.

Upload the following script into a bucket in Amazon S3:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
import time

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "cloudtrail", table_name = "flatfiles", transformation_ctx = "datasource0")
resolvechoice1 = ResolveChoice.apply(frame = datasource0, choice = "make_struct", transformation_ctx = "resolvechoice1")
relationalized1 = resolvechoice1.relationalize("trail", args["TempDir"]).select("trail")
datasink = glueContext.write_dynamic_frame.from_options(frame = relationalized1, connection_type = "s3", connection_options = {"path": "s3://cloudtraillfcaro/parquettrails"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

In the example, you load the script as a file named cloudtrailtoparquet.py. Make sure that you modify the script and update the “{"path": "s3://cloudtraillfcaro/parquettrails"}” with the destination in which you want to store your results.

After uploading the script, add a new AWS Glue job. Choose a name and role for the job, and choose the option of running the job from An existing script that you provide.

To avoid processing the same data twice, enable the Job bookmark setting in the Advanced properties section of the job properties.

Choose Next twice, and then choose Finish.

If logs are already in the flatfiles folder, you can run the job on demand to generate the first set of results.

Once the job starts running, wait for it to complete.

When the job is finished, its Run status should be Succeeded. After that, you can verify that the Parquet files are written to the Amazon S3 location.

Catalog results

To be able to process results from Athena, you can use an AWS Glue crawler to catalog the results of the AWS Glue job.

In this example, the crawler is set to use the same database as the source named cloudtrail.

You can run the crawler using the console. When the crawler finishes running and has processed the Parquet results, a new table should be created in the AWS Glue Data Catalog. In this example, it’s named parquettrails.

The table should have the classification set to parquet.

It should have the same columns as the flatfiles table, with the exception of the struct type columns, which should be relationalized into several columns:

In this example, notice how the requestparameters column, which was a struct in the original table (flatfiles), was transformed to several columns—one for each key value inside it. This is done using a transformation native to AWS Glue called relationalize.

Query results with Athena

After crawling the results, you can query them using Athena. For example, to query what events took place in the time frame between 2017-10-23t12:00:00 and 2017-10-23t13:00, use the following select statement:

select *
from cloudtrail.parquettrails
where eventtime > '2017-10-23T12:00:00Z' AND eventtime < '2017-10-23T13:00:00Z'
order by eventtime asc;

Be sure to replace cloudtrail.parquettrails with the names of your database and table that references the Parquet results. Replace the datetimes with an hour when your account had activity and was processed by the AWS Glue job.

Visualize results using Amazon QuickSight

Once you can query the data using Athena, you can visualize it using Amazon QuickSight. Before connecting Amazon QuickSight to Athena, be sure to grant QuickSight access to Athena and the associated S3 buckets in your account. For more information, see Managing Amazon QuickSight Permissions to AWS Resources. You can then create a new data set in Amazon QuickSight based on the Athena table that you created.

After setting up permissions, you can create a new analysis in Amazon QuickSight by choosing New analysis.

Then add a new data set.

Choose Athena as the source.

Give the data source a name (in this case, I named it cloudtrail).

Choose the name of the database and the table referencing the Parquet results.

Then choose Visualize.

After that, you should see the following screen:

Now you can create some visualizations. First, search for the sourceipaddress column, and drag it to the AutoGraph section.

You can see a list of the IP addresses that you have used to interact with AWS. To review whether these IP addresses have been used from IAM users, internal AWS services, or roles, use the type value that is inside the useridentity field of the original log files. Thanks to the relationalize transformation, this value is available as the useridentity.type column. After the column is added into the Group/Color box, the visualization should look like the following:

You can now see and distinguish the most used IPs and whether they are used from roles, AWS services, or IAM users.

After following all these steps, you can use Amazon QuickSight to add different columns from CloudTrail and perform different types of visualizations. You can build operational dashboards that continuously monitor AWS infrastructure usage and access. You can share those dashboards with others in your organization who might need to see this data.

Summary

In this post, you saw how you can use a simple Lambda function and an AWS Glue script to convert text files into Parquet to improve Athena query performance and data compression. The post also demonstrated how to use AWS Lambda to preprocess files in Amazon S3 and transform them into a format that is recognizable by AWS Glue crawlers.

This example, used AWS CloudTrail logs, but you can apply the proposed solution to any set of files that after preprocessing, can be cataloged by AWS Glue.


Additional Reading

Learn how to Harmonize, Query, and Visualize Data from Various Providers using AWS Glue, Amazon Athena, and Amazon QuickSight.


About the Authors

Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improving the value of their solutions when using AWS.

 

 

 

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

JavaScript got better while I wasn’t looking

Post Syndicated from Eevee original https://eev.ee/blog/2017/10/07/javascript-got-better-while-i-wasnt-looking/

IndustrialRobot has generously donated in order to inquire:

In the last few years there seems to have been a lot of activity with adding emojis to Unicode. Has there been an equal effort to add ‘real’ languages/glyph systems/etc?

And as always, if you don’t have anything to say on that topic, feel free to choose your own. :p

Yes.

I mean, each release of Unicode lists major new additions right at the top — Unicode 10, Unicode 9, Unicode 8, etc. They also keep fastidious notes, so you can also dig into how and why these new scripts came from, by reading e.g. the proposal for the addition of Zanabazar Square. I don’t think I have much to add here; I’m not a real linguist, I only play one on TV.

So with that out of the way, here’s something completely different!

A brief history of JavaScript

JavaScript was created in seven days, about eight thousand years ago. It was pretty rough, and it stayed rough for most of its life. But that was fine, because no one used it for anything besides having a trail of sparkles follow your mouse on their Xanga profile.

Then people discovered you could actually do a handful of useful things with JavaScript, and it saw a sharp uptick in usage. Alas, it stayed pretty rough. So we came up with polyfills and jQuerys and all kinds of miscellaneous things that tried to smooth over the rough parts, to varying degrees of success.

And… that’s it. That’s pretty much how things stayed for a while.


I have complicated feelings about JavaScript. I don’t hate it… but I certainly don’t enjoy it, either. It has some pretty neat ideas, like prototypical inheritance and “everything is a value”, but it buries them under a pile of annoying quirks and a woefully inadequate standard library. The DOM APIs don’t make things much better — they seem to be designed as though the target language were Java, rarely taking advantage of any interesting JavaScript features. And the places where the APIs overlap with the language are a hilarious mess: I have to check documentation every single time I use any API that returns a set of things, because there are at least three totally different conventions for handling that and I can’t keep them straight.

The funny thing is that I’ve been fairly happy to work with Lua, even though it shares most of the same obvious quirks as JavaScript. Both languages are weakly typed; both treat nonexistent variables and keys as simply false values, rather than errors; both have a single data structure that doubles as both a list and a map; both use 64-bit floating-point as their only numeric type (though Lua added integers very recently); both lack a standard object model; both have very tiny standard libraries. Hell, Lua doesn’t even have exceptions, not really — you have to fake them in much the same style as Perl.

And yet none of this bothers me nearly as much in Lua. The differences between the languages are very subtle, but combined they make a huge impact.

  • Lua has separate operators for addition and concatenation, so + is never ambiguous. It also has printf-style string formatting in the standard library.

  • Lua’s method calls are syntactic sugar: foo:bar() just means foo.bar(foo). Lua doesn’t even have a special this or self value; the invocant just becomes the first argument. In contrast, JavaScript invokes some hand-waved magic to set its contextual this variable, which has led to no end of confusion.

  • Lua has an iteration protocol, as well as built-in iterators for dealing with list-style or map-style data. JavaScript has a special dedicated Array type and clumsy built-in iteration syntax.

  • Lua has operator overloading and (surprisingly flexible) module importing.

  • Lua allows the keys of a map to be any value (though non-scalars are always compared by identity). JavaScript implicitly converts keys to strings — and since there’s no operator overloading, there’s no way to natively fix this.

These are fairly minor differences, in the grand scheme of language design. And almost every feature in Lua is implemented in a ridiculously simple way; in fact the entire language is described in complete detail in a single web page. So writing JavaScript is always frustrating for me: the language is so close to being much more ergonomic, and yet, it isn’t.

Or, so I thought. As it turns out, while I’ve been off doing other stuff for a few years, browser vendors have been implementing all this pie-in-the-sky stuff from “ES5” and “ES6”, whatever those are. People even upgrade their browsers now. Lo and behold, the last time I went to write JavaScript, I found out that a number of papercuts had actually been solved, and the solutions were sufficiently widely available that I could actually use them in web code.

The weird thing is that I do hear a lot about JavaScript, but the feature I’ve seen raved the most about by far is probably… built-in types for working with arrays of bytes? That’s cool and all, but not exactly the most pressing concern for me.

Anyway, if you also haven’t been keeping tabs on the world of JavaScript, here are some things we missed.

let

MDN docs — supported in Firefox 44, Chrome 41, IE 11, Safari 10

I’m pretty sure I first saw let over a decade ago. Firefox has supported it for ages, but you actually had to opt in by specifying JavaScript version 1.7. Remember JavaScript versions? You know, from back in the days when people actually suggested you write stuff like this:

1
<SCRIPT LANGUAGE="JavaScript1.2" TYPE="text/javascript">

Yikes.

Anyway, so, let declares a variable — but scoped to the immediately containing block, unlike var, which scopes to the innermost function. The trouble with var was that it was very easy to make misleading:

1
2
3
4
5
6
// foo exists here
while (true) {
    var foo = ...;
    ...
}
// foo exists here too

If you reused the same temporary variable name in a different block, or if you expected to be shadowing an outer foo, or if you were trying to do something with creating closures in a loop, this would cause you some trouble.

But no more, because let actually scopes the way it looks like it should, the way variable declarations do in C and friends. As an added bonus, if you refer to a variable declared with let outside of where it’s valid, you’ll get a ReferenceError instead of a silent undefined value. Hooray!

There’s one other interesting quirk to let that I can’t find explicitly documented. Consider:

1
2
3
4
5
6
7
let closures = [];
for (let i = 0; i < 4; i++) {
    closures.push(function() { console.log(i); });
}
for (let j = 0; j < closures.length; j++) {
    closures[j]();
}

If this code had used var i, then it would print 4 four times, because the function-scoped var i means each closure is sharing the same i, whose final value is 4. With let, the output is 0 1 2 3, as you might expect, because each run through the loop gets its own i.

But wait, hang on.

The semantics of a C-style for are that the first expression is only evaluated once, at the very beginning. So there’s only one let i. In fact, it makes no sense for each run through the loop to have a distinct i, because the whole idea of the loop is to modify i each time with i++.

I assume this is simply a special case, since it’s what everyone expects. We expect it so much that I can’t find anyone pointing out that the usual explanation for why it works makes no sense. It has the interesting side effect that for no longer de-sugars perfectly to a while, since this will print all 4s:

1
2
3
4
5
6
7
8
9
closures = [];
let i = 0;
while (i < 4) {
    closures.push(function() { console.log(i); });
    i++;
}
for (let j = 0; j < closures.length; j++) {
    closures[j]();
}

This isn’t a problem — I’m glad let works this way! — it just stands out to me as interesting. Lua doesn’t need a special case here, since it uses an iterator protocol that produces values rather than mutating a visible state variable, so there’s no problem with having the loop variable be truly distinct on each run through the loop.

Classes

MDN docs — supported in Firefox 45, Chrome 42, Safari 9, Edge 13

Prototypical inheritance is pretty cool. The way JavaScript presents it is a little bit opaque, unfortunately, which seems to confuse a lot of people. JavaScript gives you enough functionality to make it work, and even makes it sound like a first-class feature with a property outright called prototype… but to actually use it, you have to do a bunch of weird stuff that doesn’t much look like constructing an object or type.

The funny thing is, people with almost any background get along with Python just fine, and Python uses prototypical inheritance! Nobody ever seems to notice this, because Python tucks it neatly behind a class block that works enough like a Java-style class. (Python also handles inheritance without using the prototype, so it’s a little different… but I digress. Maybe in another post.)

The point is, there’s nothing fundamentally wrong with how JavaScript handles objects; the ergonomics are just terrible.

Lo! They finally added a class keyword. Or, rather, they finally made the class keyword do something; it’s been reserved this entire time.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
class Vector {
    constructor(x, y) {
        this.x = x;
        this.y = y;
    }

    get magnitude() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    }

    dot(other) {
        return this.x * other.x + this.y * other.y;
    }
}

This is all just sugar for existing features: creating a Vector function to act as the constructor, assigning a function to Vector.prototype.dot, and whatever it is you do to make a property. (Oh, there are properties. I’ll get to that in a bit.)

The class block can be used as an expression, with or without a name. It also supports prototypical inheritance with an extends clause and has a super pseudo-value for superclass calls.

It’s a little weird that the inside of the class block has its own special syntax, with function omitted and whatnot, but honestly you’d have a hard time making a class block without special syntax.

One severe omission here is that you can’t declare values inside the block, i.e. you can’t just drop a bar = 3; in there if you want all your objects to share a default attribute. The workaround is to just do this.bar = 3; inside the constructor, but I find that unsatisfying, since it defeats half the point of using prototypes.

Properties

MDN docs — supported in Firefox 4, Chrome 5, IE 9, Safari 5.1

JavaScript historically didn’t have a way to intercept attribute access, which is a travesty. And by “intercept attribute access”, I mean that you couldn’t design a value foo such that evaluating foo.bar runs some code you wrote.

Exciting news: now it does. Or, rather, you can intercept specific attributes, like in the class example above. The above magnitude definition is equivalent to:

1
2
3
4
5
6
7
Object.defineProperty(Vector.prototype, 'magnitude', {
    configurable: true,
    enumerable: true,
    get: function() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    },
});

Beautiful.

And what even are these configurable and enumerable things? It seems that every single key on every single object now has its own set of three Boolean twiddles:

  • configurable means the property itself can be reconfigured with another call to Object.defineProperty.
  • enumerable means the property appears in for..in or Object.keys().
  • writable means the property value can be changed, which only applies to properties with real values rather than accessor functions.

The incredibly wild thing is that for properties defined by Object.defineProperty, configurable and enumerable default to false, meaning that by default accessor properties are immutable and invisible. Super weird.

Nice to have, though. And luckily, it turns out the same syntax as in class also works in object literals.

1
2
3
4
5
6
Vector.prototype = {
    get magnitude() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    },
    ...
};

Alas, I’m not aware of a way to intercept arbitrary attribute access.

Another feature along the same lines is Object.seal(), which marks all of an object’s properties as non-configurable and prevents any new properties from being added to the object. The object is still mutable, but its “shape” can’t be changed. And of course you can just make the object completely immutable if you want, via setting all its properties non-writable, or just using Object.freeze().

I have mixed feelings about the ability to irrevocably change something about a dynamic runtime. It would certainly solve some gripes of former Haskell-minded colleagues, and I don’t have any compelling argument against it, but it feels like it violates some unwritten contract about dynamic languages — surely any structural change made by user code should also be able to be undone by user code?

Slurpy arguments

MDN docs — supported in Firefox 15, Chrome 47, Edge 12, Safari 10

Officially this feature is called “rest parameters”, but that’s a terrible name, no one cares about “arguments” vs “parameters”, and “slurpy” is a good word. Bless you, Perl.

1
2
3
function foo(a, b, ...args) {
    // ...
}

Now you can call foo with as many arguments as you want, and every argument after the second will be collected in args as a regular array.

You can also do the reverse with the spread operator:

1
2
3
4
5
let args = [];
args.push(1);
args.push(2);
args.push(3);
foo(...args);

It even works in array literals, even multiple times:

1
2
let args2 = [...args, ...args];
console.log(args2);  // [1, 2, 3, 1, 2, 3]

Apparently there’s also a proposal for allowing the same thing with objects inside object literals.

Default arguments

MDN docs — supported in Firefox 15, Chrome 49, Edge 14, Safari 10

Yes, arguments can have defaults now. It’s more like Sass than Python — default expressions are evaluated once per call, and later default expressions can refer to earlier arguments. I don’t know how I feel about that but whatever.

1
2
3
function foo(n = 1, m = n + 1, list = []) {
    ...
}

Also, unlike Python, you can have an argument with a default and follow it with an argument without a default, since the default default (!) is and always has been defined as undefined. Er, let me just write it out.

1
2
3
function bar(a = 5, b) {
    ...
}

Arrow functions

MDN docs — supported in Firefox 22, Chrome 45, Edge 12, Safari 10

Perhaps the most humble improvement is the arrow function. It’s a slightly shorter way to write an anonymous function.

1
2
3
(a, b, c) => { ... }
a => { ... }
() => { ... }

An arrow function does not set this or some other magical values, so you can safely use an arrow function as a quick closure inside a method without having to rebind this. Hooray!

Otherwise, arrow functions act pretty much like regular functions; you can even use all the features of regular function signatures.

Arrow functions are particularly nice in combination with all the combinator-style array functions that were added a while ago, like Array.forEach.

1
2
3
[7, 8, 9].forEach(value => {
    console.log(value);
});

Symbol

MDN docs — supported in Firefox 36, Chrome 38, Edge 12, Safari 9

This isn’t quite what I’d call an exciting feature, but it’s necessary for explaining the next one. It’s actually… extremely weird.

symbol is a new kind of primitive (like number and string), not an object (like, er, Number and String). A symbol is created with Symbol('foo'). No, not new Symbol('foo'); that throws a TypeError, for, uh, some reason.

The only point of a symbol is as a unique key. You see, symbols have one very special property: they can be used as object keys, and will not be stringified. Remember, only strings can be keys in JavaScript — even the indices of an array are, semantically speaking, still strings. Symbols are a new exception to this rule.

Also, like other objects, two symbols don’t compare equal to each other: Symbol('foo') != Symbol('foo').

The result is that symbols solve one of the problems that plauges most object systems, something I’ve talked about before: interfaces. Since an interface might be implemented by any arbitrary type, and any arbitrary type might want to implement any number of arbitrary interfaces, all the method names on an interface are effectively part of a single global namespace.

I think I need to take a moment to justify that. If you have IFoo and IBar, both with a method called method, and you want to implement both on the same type… you have a problem. Because most object systems consider “interface” to mean “I have a method called method, with no way to say which interface’s method you mean. This is a hard problem to avoid, because IFoo and IBar might not even come from the same library. Occasionally languages offer a clumsy way to “rename” one method or the other, but the most common approach seems to be for interface designers to avoid names that sound “too common”. You end up with redundant mouthfuls like IFoo.foo_method.

This incredibly sucks, and the only languages I’m aware of that avoid the problem are the ML family and Rust. In Rust, you define all the methods for a particular trait (interface) in a separate block, away from the type’s “own” methods. It’s pretty slick. You can still do obj.method(), and as long as there’s only one method among all the available traits, you’ll get that one. If not, there’s syntax for explicitly saying which trait you mean, which I can’t remember because I’ve never had to use it.

Symbols are JavaScript’s answer to this problem. If you want to define some interface, you can name its methods with symbols, which are guaranteed to be unique. You just have to make sure you keep the symbol around somewhere accessible so other people can actually use it. (Or… not?)

The interesting thing is that JavaScript now has several of its own symbols built in, allowing user objects to implement features that were previously reserved for built-in types. For example, you can use the Symbol.hasInstance symbol — which is simply where the language is storing an existing symbol and is not the same as Symbol('hasInstance')! — to override instanceof:

1
2
3
4
5
6
7
8
// oh my god don't do this though
class EvenNumber {
    static [Symbol.hasInstance](obj) {
        return obj % 2 == 0;
    }
}
console.log(2 instanceof EvenNumber);  // true
console.log(3 instanceof EvenNumber);  // false

Oh, and those brackets around Symbol.hasInstance are a sort of reverse-quoting — they indicate an expression to use where the language would normally expect a literal identifier. I think they work as object keys, too, and maybe some other places.

The equivalent in Python is to implement a method called __instancecheck__, a name which is not special in any way except that Python has reserved all method names of the form __foo__. That’s great for Python, but doesn’t really help user code. JavaScript has actually outclassed (ho ho) Python here.

Of course, obj[BobNamespace.some_method]() is not the prettiest way to call an interface method, so it’s not perfect. I imagine this would be best implemented in user code by exposing a polymorphic function, similar to how Python’s len(obj) pretty much just calls obj.__len__().

I only bring this up because it’s the plumbing behind one of the most incredible things in JavaScript that I didn’t even know about until I started writing this post. I’m so excited oh my gosh. Are you ready? It’s:

Iteration protocol

MDN docs — supported in Firefox 27, Chrome 39, Safari 10; still experimental in Edge

Yes! Amazing! JavaScript has first-class support for iteration! I can’t even believe this.

It works pretty much how you’d expect, or at least, how I’d expect. You give your object a method called Symbol.iterator, and that returns an iterator.

What’s an iterator? It’s an object with a next() method that returns the next value and whether the iterator is exhausted.

Wait, wait, wait a second. Hang on. The method is called next? Really? You didn’t go for Symbol.next? Python 2 did exactly the same thing, then realized its mistake and changed it to __next__ in Python 3. Why did you do this?

Well, anyway. My go-to test of an iterator protocol is how hard it is to write an equivalent to Python’s enumerate(), which takes a list and iterates over its values and their indices. In Python it looks like this:

1
2
3
4
5
for i, value in enumerate(['one', 'two', 'three']):
    print(i, value)
# 0 one
# 1 two
# 2 three

It’s super nice to have, and I’m always amazed when languages with “strong” “support” for iteration don’t have it. Like, C# doesn’t. So if you want to iterate over a list but also need indices, you need to fall back to a C-style for loop. And if you want to iterate over a lazy or arbitrary iterable but also need indices, you need to track it yourself with a counter. Ridiculous.

Here’s my attempt at building it in JavaScript.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
function enumerate(iterable) {
    // Return a new iter*able* object with a Symbol.iterator method that
    // returns an iterator.
    return {
        [Symbol.iterator]: function() {
            let iterator = iterable[Symbol.iterator]();
            let i = 0;

            return {
                next: function() {
                    let nextval = iterator.next();
                    if (! nextval.done) {
                        nextval.value = [i, nextval.value];
                        i++;
                    }
                    return nextval;
                },
            };
        },
    };
}
for (let [i, value] of enumerate(['one', 'two', 'three'])) {
    console.log(i, value);
}
// 0 one
// 1 two
// 2 three

Incidentally, for..of (which iterates over a sequence, unlike for..in which iterates over keys — obviously) is finally supported in Edge 12. Hallelujah.

Oh, and let [i, value] is destructuring assignment, which is also a thing now and works with objects as well. You can even use the splat operator with it! Like Python! (And you can use it in function signatures! Like Python! Wait, no, Python decided that was terrible and removed it in 3…)

1
let [x, y, ...others] = ['apple', 'orange', 'cherry', 'banana'];

It’s a Halloween miracle. 🎃

Generators

MDN docs — supported in Firefox 26, Chrome 39, Edge 13, Safari 10

That’s right, JavaScript has goddamn generators now. It’s basically just copying Python and adding a lot of superfluous punctuation everywhere. Not that I’m complaining.

Also, generators are themselves iterable, so I’m going to cut to the chase and rewrite my enumerate() with a generator.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
function enumerate(iterable) {
    return {
        [Symbol.iterator]: function*() {
            let i = 0;
            for (let value of iterable) {
                yield [i, value];
                i++;
            }
        },
    };
}
for (let [i, value] of enumerate(['one', 'two', 'three'])) {
    console.log(i, value);
}
// 0 one
// 1 two
// 2 three

Amazing. function* is a pretty strange choice of syntax, but whatever? I guess it also lets them make yield only act as a keyword inside a generator, for ultimate backwards compatibility.

JavaScript generators support everything Python generators do: yield* yields every item from a subsequence, like Python’s yield from; generators can return final values; you can pass values back into the generator if you iterate it by hand. No, really, I wasn’t kidding, it’s basically just copying Python. It’s great. You could now built asyncio in JavaScript!

In fact, they did that! JavaScript now has async and await. An async function returns a Promise, which is also a built-in type now. Amazing.

Sets and maps

MDN docs for MapMDN docs for Set — supported in Firefox 13, Chrome 38, IE 11, Safari 7.1

I did not save the best for last. This is much less exciting than generators. But still exciting.

The only data structure in JavaScript is the object, a map where the strings are keys. (Or now, also symbols, I guess.) That means you can’t readily use custom values as keys, nor simulate a set of arbitrary objects. And you have to worry about people mucking with Object.prototype, yikes.

But now, there’s Map and Set! Wow.

Unfortunately, because JavaScript, Map couldn’t use the indexing operators without losing the ability to have methods, so you have to use a boring old method-based API. But Map has convenient methods that plain objects don’t, like entries() to iterate over pairs of keys and values. In fact, you can use a map with for..of to get key/value pairs. So that’s nice.

Perhaps more interesting, there’s also now a WeakMap and WeakSet, where the keys are weak references. I don’t think JavaScript had any way to do weak references before this, so that’s pretty slick. There’s no obvious way to hold a weak value, but I guess you could substitute a WeakSet with only one item.

Template literals

MDN docs — supported in Firefox 34, Chrome 41, Edge 12, Safari 9

Template literals are JavaScript’s answer to string interpolation, which has historically been a huge pain in the ass because it doesn’t even have string formatting in the standard library.

They’re just strings delimited by backticks instead of quotes. They can span multiple lines and contain expressions.

1
2
console.log(`one plus
two is ${1 + 2}`);

Someone decided it would be a good idea to allow nesting more sets of backticks inside a ${} expression, so, good luck to syntax highlighters.

However, someone also had the most incredible idea ever, which was to add syntax allowing user code to do the interpolation — so you can do custom escaping, when absolutely necessary, which is virtually never, because “escaping” means you’re building a structured format by slopping strings together willy-nilly instead of using some API that works with the structure.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// OF COURSE, YOU SHOULDN'T BE DOING THIS ANYWAY; YOU SHOULD BUILD HTML WITH
// THE DOM API AND USE .textContent FOR LITERAL TEXT.  BUT AS AN EXAMPLE:
function html(literals, ...values) {
    let ret = [];
    literals.forEach((literal, i) => {
        if (i > 0) {
            // Is there seriously still not a built-in function for doing this?
            // Well, probably because you SHOULDN'T BE DOING IT
            ret.push(values[i - 1]
                .replace(/&/g, '&amp;')
                .replace(/</g, '&lt;')
                .replace(/>/g, '&gt;')
                .replace(/"/g, '&quot;')
                .replace(/'/g, '&apos;'));
        }
        ret.push(literal);
    });
    return ret.join('');
}
let username = 'Bob<script>';
let result = html`<b>Hello, ${username}!</b>`;
console.log(result);
// <b>Hello, Bob&lt;script&gt;!</b>

It’s a shame this feature is in JavaScript, the language where you are least likely to need it.

Trailing commas

Remember how you couldn’t do this for ages, because ass-old IE considered it a syntax error and would reject the entire script?

1
2
3
4
5
{
    a: 'one',
    b: 'two',
    c: 'three',  // <- THIS GUY RIGHT HERE
}

Well now it’s part of the goddamn spec and if there’s anything in this post you can rely on, it’s this. In fact you can use AS MANY GODDAMN TRAILING COMMAS AS YOU WANT. But only in arrays.

1
[1, 2, 3,,,,,,,,,,,,,,,,,,,,,,,,,]

Apparently that has the bizarre side effect of reserving extra space at the end of the array, without putting values there.

And more, probably

Like strict mode, which makes a few silent “errors” be actual errors, forces you to declare variables (no implicit globals!), and forbids the completely bozotic with block.

Or String.trim(), which trims whitespace off of strings.

Or… Math.sign()? That’s new? Seriously? Well, okay.

Or the Proxy type, which lets you customize indexing and assignment and calling. Oh. I guess that is possible, though this is a pretty weird way to do it; why not just use symbol-named methods?

You can write Unicode escapes for astral plane characters in strings (or identifiers!), as \u{XXXXXXXX}.

There’s a const now? I extremely don’t care, just name it in all caps and don’t reassign it, come on.

There’s also a mountain of other minor things, which you can peruse at your leisure via MDN or the ECMAScript compatibility tables (note the links at the top, too).

That’s all I’ve got. I still wouldn’t say I’m a big fan of JavaScript, but it’s definitely making an effort to clean up some goofy inconsistencies and solve common problems. I think I could even write some without yelling on Twitter about it now.

On the other hand, if you’re still stuck supporting IE 10 for some reason… well, er, my condolences.

Football Coach Retweets, Gets Sued for Copyright Infringement

Post Syndicated from Andy original https://torrentfreak.com/football-coach-retweets-gets-sued-for-copyright-infringement-170928/

When copyright infringement lawsuits hit the US courts, there’s often a serious case at hand. Whether that’s the sharing of a leaked movie online or indeed the mass infringement that allegedly took place on Megaupload, there’s usually something quite meaty to discuss.

A lawsuit filed this week in a Pennsylvania federal court certainly provides the later, but without managing to be much more than a fairly trivial matter in the first instance.

The case was filed by sports psychologist and author Dr. Keith Bell. It begins by describing Bell as an “internationally recognized performance consultant” who has worked with 500 teams, including the Olympic and national teams for the United States, Canada, Australia, New Zealand, Hong Kong, Fiji, and the Cayman Islands.

Bell is further described as a successful speaker, athlete and coach; “A four-time
collegiate All-American swimmer, a holder of numerous world and national masters swim records, and has coached several collegiate, high school, and private swim teams to competitive success.”

At the heart of the lawsuit is a book that Bell published in 1982, entitled Winning Isn’t Normal.

“The book has enjoyed substantial acclaim, distribution and publicity. Dr. Bell is the sole author of this work, and continues to own all rights in the work,” the lawsuit (pdf) reads.

Bell claims that on or about November 6, 2015, King’s College head football coach Jeffery Knarr retweeted a tweet that was initially posted from @NSUBaseball32, a Twitter account operated by Northeastern State University’s RiverHawks baseball team. The retweet, as shown in the lawsuit, can be seen below.

The retweet that sparked the lawsuit

“The post was made without authorization from Dr. Bell and without attribution
to Dr. Bell,” the lawsuit reads.

“Neither Defendant King’s College nor Defendant Jeffery Knarr contacted Dr.
Bell to request permission to use Dr. Bell’s copyrighted work. As of November 14, 2015, the post had received 206 ‘Retweets’ and 189 ‘Likes.’ Due to the globally accessible nature of Twitter, the post was accessible by Internet users across the world.”

Bell says he sent a cease and desist letter to NSU in September 2016 and shortly thereafter NSU removed the post, which removed the retweets. However, this meant that Knarr’s retweet had been online for “at least” 10 months and 21 days.

To put the icing on the cake, Bell also holds the trademark to the phrase “Winning Isn’t Normal”, so he’s suing Knarr and his King’s College employer for trademark infringement too.

“The Defendants included Plaintiff’s trademark twice in the Twitter post. The first instance was as the title of the post, with the mark shown in letters which
were emphasized by being capitalized, bold, and underlined,” the lawsuit notes.

“The second instance was at the end of the post, with the mark shown in letters which were emphasized by being capitalized, bold, underlined, and followed by three
exclamation points.”

Describing what appears to be a casual retweet as “willful, intentional and purposeful” infringement carried out “in disregard of and with indifference to Plaintiff’s rights,” Bell demands damages and attorneys fees from Knarr and his employer.

“As a direct and proximate result of said infringement by Defendants, Plaintiff is
entitled to damages in an amount to be proven at trial,” the lawsuit concludes.

Since the page from the book retweeted by Knarr is a small portion of the overall work, there may be a fair use defense. Nevertheless, defending this kind of suit is never cheap, so it’s probably fair to say there will already be a considerable amount of regret among the defendants at ever having set eyes on Bell’s 35-year-old book.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

The possibilities of the Sense HAT

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/sense-hat-projects/

Did you realise the Sense HAT has been available for over two years now? Used by astronauts on the International Space Station, the exact same hardware is available to you on Earth. With a new Astro Pi challenge just launched, it’s time for a retrospective/roundup/inspiration post about this marvellous bit of kit.

Sense HAT attached to Pi and power cord

The Sense HAT on a Pi in full glory

The Sense HAT explained

We developed our scientific add-on board to be part of the Astro Pi computers we sent to the International Space Station with ESA astronaut Tim Peake. For a play-by-play of Astro Pi’s history, head to the blog archive.

Astro Pi logo with starry background

Just to remind you, this is all the cool stuff our engineers have managed to fit onto the HAT:

  • A gyroscope (sensing pitch, roll, and yaw)
  • An accelerometer
  • A magnetometer
  • Sensors for temperature, humidity, and barometric pressure
  • A joystick
  • An 8×8 LED matrix

You can find a roundup of the technical specs here on the blog.

How to Sense HAT

It’s easy to begin exploring this device: take a look at our free Getting started with the Sense HAT resource, or use one of our Code Club Sense HAT projects. You can also try out the emulator, available offline on Raspbian and online on Trinket.

Sense HAT emulator on Trinket

The Sense HAT emulator on trinket.io

Fun and games with the Sense HAT

Use the LED matrix and joystick to recreate games such as Pong or Flappy Bird. Of course, you could also add sensor input to your game: code an egg drop game or a Magic 8 Ball that reacts to how the device moves.

Sense HAT Random Sparkles

Create random sparkles on the Sense HAT

Once December rolls around, you could brighten up your home with a voice-controlled Christmas tree or an advent calendar on your Sense HAT.

If you like the great outdoors, you could also use your Sense HAT to recreate this Hiking Companion by Marcus Johnson. Take it with you on your next hike!

Art with the Sense HAT

The LED matrix is perfect for getting creative. To draw something basic without having to squint at a Python list, use this app by our very own Richard Hayler. Feeling more ambitious? The MagPi will teach you how to create magnificent pixel art. Ben Nuttall has created this neat little Python script for displaying a photo taken by the Raspberry Pi Camera Module on the Sense HAT.

Brett Haines Mathematica on the Sense HAT

It’s also possible to incorporate Sense HAT data into your digital art! The Python Turtle module and the Processing language are both useful tools for creating beautiful animations based on real-world information.

A Sense HAT project that also uses this principle is Giorgio Sancristoforo’s Tableau, a ‘generative music album’. This device creates music according to the sensor data:

Tableau Generative Album

“There is no doubt that, as music is removed by the phonographrecord from the realm of live production and from the imperative of artistic activity and becomes petrified, it absorbs into itself, in this process of petrification, the very life that would otherwise vanish.”

Science with the Sense HAT

This free Essentials book from The MagPi team covers all the Sense HAT science basics. You can, for example, learn how to measure gravity.

Cropped cover of Experiment with the Sense HAT book

Our online resource shows you how to record the information your HAT picks up. Next you can analyse and graph your data using Mathematica, which is included for free on Raspbian. This resource walks you through how this software works.

If you’re seeking inspiration for experiments you can do on our Astro Pis Izzy and Ed on the ISS, check out the winning entries of previous rounds of the Astro Pi challenge.

Thomas Pesquet with Ed and Izzy

Thomas Pesquet with Ed and Izzy

But you can also stick to terrestrial scientific investigations. For example, why not build a weather station and share its data on your own web server or via Weather Underground?

Your code in space!

If you’re a student or an educator in one of the 22 ESA member states, you can get a team together to enter our 2017-18 Astro Pi challenge. There are two missions to choose from, including Mission Zero: follow a few guidelines, and your code is guaranteed to run in space!

The post The possibilities of the Sense HAT appeared first on Raspberry Pi.

EU Prepares Guidelines to Force Google & Facebook to Police Piracy

Post Syndicated from Andy original https://torrentfreak.com/eu-prepares-guidelines-to-force-google-facebook-to-police-piracy-170915/

In the current climate, creators and distributors are forced to play a giant game of whac-a-mole to limit the unlicensed spread of their content on the Internet.

The way the law stands today in the United States, EU, and most other developed countries, copyright holders must wait for content to appear online before sending targeted takedown notices to hosts, service providers, and online platforms.

After sending several billion of these notices, patience is wearing thin, so a new plan is beginning to emerge. Rather than taking down content after it appears, major entertainment industry groups would prefer companies to take proactive action. The upload filters currently under discussion in Europe are a prime example but are already causing controversy.

Continuing the momentum in this direction, Reuters reports that the European Union will publish draft guidelines at the end of this month, urging platforms such as Google and Facebook to take a more proactive approach to illegal content of all kinds.

“Online platforms need to significantly step up their actions to address this problem,” the draft EU guidelines say.

“They need to be proactive in weeding out illegal content, put effective notice-and-action procedures in place, and establish well-functioning interfaces with third parties (such as trusted flaggers) and give a particular priority to notifications from national law enforcement authorities.”

On the copyright front, Google already operates interfaces designed to take down infringing content. And, as the recent agreement in the UK with copyright holders shows, is also prepared to make infringing content harder to find. Nevertheless, it will remain to be seen if Google is prepared to give even ‘trusted’ third-parties a veto on what content can appear online, without having oversight itself.

The guidelines are reportedly non-binding but further legislation in this area isn’t being ruled out for Spring 2018, if companies fail to make significant progress.

Interestingly, however, a Commission source told Reuters that any new legislation would not “change the liability exemption for online platforms.” Maintaining these so-called ‘safe harbors’ is a priority for online giants such as Google and Facebook – anything less would almost certainly be a deal-breaker.

The guidelines, due to be published at the end of September, will also encourage online platforms to publish transparency reports. These should detail the volume of notices received and actions subsequently taken. Again, Google is way ahead of the game here, having published this kind of data for the past several years.

“The guidelines also contain safeguards against excessive removal of content, such as giving its owners a right to contest such a decision,” Reuters adds.

More will be known about the proposals in a couple of weeks but it’s quite likely they’ll spark another round of debate on whether legislation is the best route to tackle illegal content or whether voluntary agreements – which have a tendency to be rather less open – should be the way to go.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Code Club reaches 1 in 5 UK secondary schools

Post Syndicated from Maria Quevedo original https://www.raspberrypi.org/blog/code-club-9-to-13/

Today, we’re excited to announce the expansion of Code Club to secondary school ages up to 13. When we made our plans known last May, we were beginning work with a pilot group of 50 UK secondary schools to discover how we could best support them, and how we could make Code Club work as well for children aged 12 and 13 as it does for its original age range of 9 to 11 years. Now, new projects are available for secondary-aged children, and we will continue to create more resources to build on the support we offer this age group.

An animated gif with happy Code Club robots and text showing that Code Club is extending to 9- to 13-year-olds

One in five UK secondary schools

In extending Code Club’s age range to 9-13, we’re responding to huge demand. One in five UK state-sector secondary schools has already registered with the programme, and most of these – almost 600 of them – are already running Code Clubs.

By giving secondaries access to the Code Club support network and providing new, more advanced programming projects, we will help schools better to meet the needs of their students, and offer many thousands more children the opportunity to develop essential skills in programming and computing. Libraries and other non-school venues will also be able to welcome children of a wider range of ages to their clubs.

New Code Club resources

Our first five projects for older children offer a variety of ways for more advanced coders to build on their skills and explore further programming concepts.

From ‘Flappy Parrot’ and Where’s Wally-inspired ‘Lineup’, to ‘Binary Hero’ and quiz-tastic ‘Guess the Flag’, there’s something to spark everyone’s imagination. You can read more about these new resources in today’s Code Club UK blog post.

Help Code Club in your local school

Around 300 secondary schools across the UK have registered with Code Club but have not yet started their club, because they’re still looking for volunteers to support them. Can you help these keen teachers and students get up and running? If you can volunteer an hour each week, either on your own or by taking turns with friends or colleagues, you could make all the difference to a Code Club near you.

A Code Club in every community

We want every 9- to 13-year-old to have the opportunity to join a Code Club, and we will continue working hard to deliver our goal of putting a Code Club in every community. Make sure your local school, youth club, or library knows how to get involved.

The post Code Club reaches 1 in 5 UK secondary schools appeared first on Raspberry Pi.