Tag Archives: comics

Weekly roundup: Potpourri 2

Post Syndicated from Eevee original https://eev.ee/dev/2018/01/23/weekly-roundup-potpourri-2/

  • blog: I wrote a birthday post, as is tradition. I finally finished writing Game Night 2, a full month after we actually played those games.

  • art: I put together an art improvement chart for last year, after skipping doing it in July, tut tut. Kind of a weird rollercoaster!

    I worked a teeny bit on two one-off comics I guess but they aren’t reeeally getting anywhere fast. Comics are hard.

    I made a banner for Strawberry Jam 2 which I think came out fantastically!

  • games: I launched Strawberry Jam 2, a month-long February game jam about making horny games. I will probably be making a horny game for it.

  • idchoppers: I took another crack at dilation. Some meager progress, maybe. I think I’m now porting bad academic C++ to Rust to get the algorithm I want, and I can’t help but wonder if I could just make up something of my own faster than this.

  • fox flux: I did a bunch of brainstorming and consolidated a bunch of notes from like four different places, which feels like work but also feels like it doesn’t actually move the project forward.

  • anise!!: Ah, yes, this fell a bit by the wayside. Some map work, some attempts at a 3D effect for a particular thing without much luck (though I found a workaround in the last couple days).

  • computers: I relieved myself of some 200 browser tabs, which feels fantastic, though I’ve since opened like 80 more. Alas. I also tried to put together a firejail profile for running mystery games from the internet, and I got like 90% of the way there, but it turns out there’s basically no way to stop an X application from reading all keyboard input.

    (Yes, I know about that, and I tried it. Yes, that too.)

I’ve got a small pile of little projects that are vaguely urgent, so as much as I’d love to bash my head against idchoppers for a solid week, I’m gonna try to focus on getting a couple half-done things full-done. And maybe try to find time for art regularly so I don’t fall out of practice? Huff puff.

Eevee gained 2791 experience points

Post Syndicated from Eevee original https://eev.ee/blog/2018/01/15/eevee-gained-2791-experience-points/

Eevee grew to level 31!

A year strongly defined by mixed success! Also, a lot of video games.

I ran three game jams, resulting in a total of 157 games existing that may not have otherwise, which is totally mindblowing?!

For GAMES MADE QUICK???, glip and I made NEON PHASE, a short little exploratory platformer. Honestly, I should give myself more credit for this and the rest of the LÖVE games I’ve based on the same codebase — I wove a physics engine (and everything else!) from scratch and it has held up remarkably well for a variety of different uses.

I successfully finished an HD version of Isaac’s Descent using my LÖVE engine, though it doesn’t have anything new over the original and I’ve only released it as a tech demo on Patreon.

For Strawberry Jam (NSFW!) we made fox flux (slightly NSFW!), which felt like a huge milestone: the first game where I made all the art! I mean, not counting Isaac’s Descent, which was for a very limited platform. It’s a pretty arbitrary milestone, yes, but it feels significant. I’ve been working on expanding the game into a longer and slightly less buggy experience, but the art is taking the longest by far. I must’ve spent weeks on player sprites alone.

We then set about working on Bolthaven, a sequel of sorts to NEON PHASE, and got decently far, and then abandond it. Oops.

We then started a cute little PICO-8 game, and forgot about it. Oops.

I was recruited to help with Chaos Composer, a more ambitious game glip started with someone else in Unity. I had to get used to Unity, and we squabbled a bit, but the game is finally about at the point where it’s “playable” and “maps” can be designed? It’s slightly on hold at the moment while we all finish up some other stuff, though.

We made a birthday game for two of our friends whose birthdays were very close together! Only they got to see it.

For Ludum Dare 38, we made Lunar Depot 38, a little “wave shooter” or whatever you call those? The AI is pretty rough, seeing as this was the first time I’d really made enemies and I had 72 hours to figure out how to do it, but I still think it’s pretty fun to play and I love the circular world.

I made Roguelike Simulator as an experiment with making something small and quick with a simple tool, and I had a lot of fun! I definitely want to do more stuff like this in the future.

And now we’re working on a game about Star Anise, my cat’s self-insert, which is looking to have more polish and depth than anything we’ve done so far! We’ve definitely come a long way in a year.

Somewhere along the line, I put out a call for a “potluck” project, where everyone would give me sprites of a given size without knowing what anyone else had contributed, and I would then make a game using only those sprites. Unfortunately, that stalled a few times: I tried using the Phaser JS library, but we didn’t get along; I tried LÖVE, but didn’t know where to go with the game; and then I decided to use this as an experiment with procedural generation, and didn’t get around to it. I still feel bad that everyone did work for me and I didn’t follow through, but I don’t know whether this will ever become a game.

veekun, alas, consumed months of my life. I finally got Sun and Moon loaded, but it took weeks of work since I was basically reinventing all the tooling we’d ever had from scratch, without even having most of that tooling available as a reference. It was worth it in the end, at least: Ultra Sun and Ultra Moon only took a few days to get loaded. But veekun itself is still missing some obvious Sun/Moon features, and the whole site needs an overhaul, and I just don’t know if I want to dedicate that much time to it when I have so much other stuff going on that’s much more interesting to me right now.

I finally turned my blog into more of a website, giving it a neat front page that lists a bunch of stuff I’ve done. I made a release category at last, though I’m still not quite in the habit of using it.

I wrote some blog posts, of course! I think the most interesting were JavaScript got better while I wasn’t looking and Object models. I was also asked to write a couple pieces for money for a column that then promptly shut down.

On a whim, I made a set of Eevee mugshots for Doom, which I think is a decent indication of my (pixel) art progress over the year?

I started idchoppers, a Doom parsing and manipulation library written in Rust, though it didn’t get very far and I’ve spent most of the time fighting with Rust because it won’t let me implement all my extremely bad ideas. It can do a couple things, at least, like flip maps very quickly and render maps to SVG.

I did toy around with music a little, but not a lot.

I wrote two short twines for Flora. They’re okay. I’m working on another; I think it’ll be better.

I didn’t do a lot of art overall, at least compared to the two previous years; most of my art effort over the year has gone into fox flux, which requires me to learn a whole lot of things. I did dip my toes into 3D modelling, most notably producing my current Twitter banner as well as this cool Star Anise animation. I wouldn’t mind doing more of that; maybe I’ll even try to make a low-poly pixel-textured 3D game sometime.

I restarted my book with a much better concept, though so far I’ve only written about half a chapter. Argh. I see that the vast majority of the work was done within the span of a single week, which is bad since that means I only worked on it for a week, but good since that means I can actually do a pretty good amount of work in only a week. I also did a lot of squabbling with tooling, which is hopefully mostly out of the way now.

My computer broke? That was an exciting week.


A lot of stuff, but the year as a whole still feels hit or miss. All the time I spent on veekun feels like a black void in the middle of the year, which seems like a good sign that I maybe don’t want to pour even more weeks into it in the near future.

Mostly, I want to do: more games, more art, more writing, more music.

I want to try out some tiny game making tools and make some tiny games with them — partly to get exposure to different things, partly to get more little ideas out into the world regularly, and partly to get more practice at letting myself have ideas. I have a couple tools in mind and I guess I’ll aim at a microgame every two months or so? I’d also like to finish the expanded fox flux by the end of the year, of course, though at the moment I can’t even gauge how long it might take.

I seriously lapsed on drawing last year, largely because fox flux pixel art took me so much time. So I want to draw more, and I want to get much faster at pixel art. It would probably help if I had a more concrete goal for drawing, so I might try to draw some short comics and write a little visual novel or something, which would also force me to aim for consistency.

I want to work on my book more, of course, but I also want to try my hand at a bit more fiction. I’ve had a blast writing dialogue for our games! I just shy away from longer-form writing for some reason — which seems ridiculous when a large part of my audience found me through my blog. I do think I’ve had some sort of breakthrough in the last month or two; I suddenly feel a good bit more confident about writing in general and figuring out what I want to say? One recent post I know I wrote in a single afternoon, which virtually never happens because I keep rewriting and rearranging stuff. Again, a visual novel would be a good excuse to practice writing fiction without getting too bogged down in details.

And, ah, music. I shy heavily away from music, since I have no idea what I’m doing, and also I seem to spend a lot of time fighting with tools. (Surprise.) I tried out SunVox for the first time just a few days ago and have been enjoying it quite a bit for making sound effects, so I might try it for music as well. And once again, visual novel background music is a pretty low-pressure thing to compose for. Hell, visual novels are small games, too, so that checks all the boxes. I guess I’ll go make a visual novel.

Here’s to twenty gayteen!

MagPi 65: Newbies Guide, and something brand new!

Post Syndicated from Rob Zwetsloot original https://www.raspberrypi.org/blog/magpi-65/

Hey folks, Rob from The MagPi here! We know many people might be getting their very first Raspberry Pi this Christmas, and excitedly wondering “what do I do with it?” While we can’t tell you exactly what to do with your Pi, we can show you how to immerse yourself in the world of Raspberry Pi and be inspired by our incredible community, and that’s the topic of The MagPi 65, out today tomorrow (we’re a day early because we’re simply TOO excited about the special announcement below!).

The one, the only…issue 65!

Raspberry Pi for Newbies

Raspberry Pi for Newbies covers some of the very basics you should know about the world of Raspberry Pi. After a quick set-up tutorial, we introduce you to the Raspberry Pi’s free online resources, including Scratch and Python projects from Code Club, before guiding you through the wider Raspberry Pi and maker community.

Raspberry Pi MagPi 65 Newbie Guide

Pages and pages of useful advice and starter projects

The online community is an amazing place to learn about all the incredible things you can do with the Raspberry Pi. We’ve included some information on good places to look for tutorials, advice and ideas.

And that’s not all

Want to do more after learning about the world of Pi? The rest of the issue has our usual selection of expert guides to help you build some amazing projects: you can make a Christmas memory game, build a tower of bells to ring in the New Year, and even take your first steps towards making a game using C++.

Raspberry Pi MagPi 65

Midimutant, the synthesizer “that boinks endless strange sounds”

All this along with inspiring projects, definitive reviews, and tales from around the community.

Raspberry Pi Annual

Issue 65 isn’t the only new release to look out for. We’re excited to bring you the first ever Raspberry Pi Annual, and it’s free for MagPi subscribers – in fact, subscribers should be receiving it the same day as their issue 65 delivery!

If you’re not yet a subscriber of The MagPi, don’t panic: you can still bag yourself a copy of the Raspberry Pi Annual by signing up to a 12-month subscription of The MagPi before 24 January. You’ll also receive the usual subscriber gift of a free Raspberry Pi Zero W (with case and cable).  Click here to subscribe to The MagPi – The Official Raspberry Pi magazine.

Ooooooo…aaaaaahhhhh…

The Raspberry Pi Annual is aimed at young folk wanting to learn to code, with a variety of awesome step-by-step Scratch tutorials, games, puzzles, and comics, including a robotic Babbage.

Get your copy

You can get The MagPi 65 and the Raspberry Pi Annual 2018 from our online store, and the magazine can be found in the wild at WHSmith, Tesco, Sainsbury’s, and Asda. You’ll be able to get it in the US at Barnes & Noble and Micro Center in a few days’ time. The MagPi 65 is also available digitally on our Android and iOS apps. Finally, you can also download a free PDF of The MagPi 65 and The Raspberry Pi Annual 2018.

We hope you have a merry Christmas! We’re off until the New Year. Bye!

The post MagPi 65: Newbies Guide, and something brand new! appeared first on Raspberry Pi.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Introspection

Post Syndicated from Eevee original https://eev.ee/blog/2017/05/28/introspection/

This month, IndustrialRobot has generously donated in order to ask:

How do you go about learning about yourself? Has your view of yourself changed recently? How did you handle it?

Whoof. That’s incredibly abstract and open-ended — there’s a lot I could say, but most of it is hard to turn into words.


The first example to come to mind — and the most conspicuous, at least from where I’m sitting — has been the transition from technical to creative since quitting my tech job. I think I touched on this a year ago, but it’s become all the more pronounced since then.

I quit in part because I wanted more time to work on my own projects. Two years ago, those projects included such things as: giving the Python ecosystem a better imaging library, designing an alternative to regular expressions, building a Very Correct IRC bot framework, and a few more things along similar lines. The goals were all to solve problems — not hugely important ones, but mildly inconvenient ones that I thought I could bring something novel to. Problem-solving for its own sake.

Now that I had all the time in the world to work on these things, I… didn’t. It turned out they were almost as much of a slog as my job had been!

The problem, I think, was that there was no point.

This was really weird to realize and come to terms with. I do like solving problems for its own sake; it’s interesting and educational. And most of the programming folks I know and surround myself with have that same drive and use it to create interesting tools like Twisted. So besides taking for granted that this was the kind of stuff I wanted to do, it seemed like the kind of stuff I should want to do.

But even if I create a really interesting tool, what do I have? I don’t have a thing; I have a tool that can be used to build things. If I want a thing, I have to either now build it myself — starting from nearly zero despite all the work on the tool, because it can only do so much in isolation — or convince a bunch of other people to use my tool to build things. Then they’d be depending on my tool, which means I have to maintain and support it, which is even more time and effort poured into this non-thing.

Despite frequently being drawn to think about solving abstract tooling problems, it seems I truly want to make things. This is probably why I have a lot of abandoned projects boldly described as “let’s solve X problem forever!” — I go to scratch the itch, I do just enough work that it doesn’t itch any more, and then I lose interest.

I spent a few months quietly flailing over this minor existential crisis. I’d spent years daydreaming about making tools; what did I have if not that drive? I was having to force myself to work on what I thought were my passion projects.

Meanwhile, I’d vaguely intended to do some game development, but for some reason dragged my feet forever and then took my sweet time dipping my toes in the water. I did work on a text adventure, Runed Awakening, on and off… but it was a fractal of creative decisions and I had a hard time making all of them. It might’ve been too ambitious, despite feeling small, and that might’ve discouraged me from pursuing other kinds of games earlier.

A big part of it might have been the same reason I took so long to even give art a serious try. I thought of myself as a technical person, and art is a thing for creative people, so I’m simply disqualified, right? Maybe the same thing applies to games.

Lord knows I had enough trouble when I tried. I’d orbited the Doom community for years but never released a single finished level. I did finally give it a shot again, now that I had the time. Six months into my funemployment, I wrote a three-part guide on making Doom levels. Three months after that, I finally released one of my own.

I suppose that opened the floodgates; a couple weeks later, glip and I decided to try making something for the PICO-8, and then we did that (almost exactly a year ago!). Then kept doing it.

It’s been incredibly rewarding — far moreso than any “pure” tooling problem I’ve ever approached. Moreso than even something like veekun, which is a useful thing. People have thoughts and opinions on games. Games give people feelings, which they then tell you about. Most of the commentary on a reference website is that something is missing or incorrect.

I like doing creative work. There was never a singular moment when this dawned on me; it was a slow process over the course of a year or more. I probably should’ve had an inkling when I started drawing, half a year before I quit; even my early (and very rough) daily comics made people laugh, and I liked that a lot. Even the most well-crafted software doesn’t tend to bring joy to people, but amateur art can.

I still like doing technical work, but I prefer when it’s a means to a creative end. And, just as important, I prefer when it has a clear and constrained scope. “Make a library/tool for X” is a nebulous problem that could go in a great many directions; “make a bot that tweets Perlin noise” has a pretty definitive finish line. It was interesting to write a little physics engine, but I would’ve hated doing it if it weren’t for a game I were making and didn’t have the clear scope of “do what I need for this game”.


It feels like creative work is something I’ve been wanting to do for a long time. If this were a made-for-TV movie, I would’ve discovered this impulse one day and immediately revealed myself as a natural-born artistic genius of immense unrealized talent.

That didn’t happen. Instead I’ve found that even something as mundane as having ideas is a skill, and while it’s one I enjoy, I’ve barely ever exercised it at all. I have plenty of ideas with technical work, but I run into brick walls all the time with creative stuff.

How do I theme this area? Well, I don’t know. How do I think of something? I don’t know that either. It’s a strange paradox to have an urge to create things but not quite know what those things are.

It’s such a new and completely different kind of problem. There’s no right answer, or even an answer I can check for “correctness”. I can do anything. With no landmarks to start from, it’s easy to feel completely lost and just draw blanks.

I’ve essentially recalibrated the texture of stuff I work on, and I have to find some completely new ways to approach problems. I haven’t found them yet. I don’t think they’re anything that can be told or taught. But I’m starting to get there, and part of it is just accepting that I can’t treat these like problems with clear best solutions and clear algorithms to find those solutions.

A particularly glaring irony is that I’ve had a really tough problem designing abstract spaces, even though that’s exactly the kind of architecture I praise in Doom. It’s much trickier than it looks — a good abstract design is reminiscent of something without quite being that something.

I suppose it’s similar to a struggle I’ve had with art. I’m drawn to a cartoony style, and cartooning is also a mild form of abstraction, of whittling away details to leave only what’s most important. I’m reminded in particular of the forest background in fox flux — I was completely lost on how to make something reminiscent of a tree line. I knew enough to know that drawing trees would’ve made the background far too busy, but trees are naturally busy, so how do you represent that?

The answer glip gave me was to make big chunky leaf shapes around the edges and where light levels change. Merely overlapping those shapes implies depth well enough to convey the overall shape of the tree. The result works very well and looks very simple — yet it took a lot of effort just to get to the idea.

It reminds me of mathematical research, in a way? You know the general outcome you want, and you know the tools at your disposal, and it’s up to you to make some creative leaps. I don’t think there’s a way to directly learn how to approach that kind of problem; all you can do is look at what others have done and let it fuel your imagination.


I think I’m getting a little distracted here, but this is stuff that’s been rattling around lately.

If there’s a more personal meaning to the tree story, it’s that this is a thing I can do. I can learn it, and it makes sense to me, despite being a huge nerd.

Two and a half years ago, I never would’ve thought I’d ever make an entire game from scratch and do all the art for it. It was completely unfathomable. Maybe we can do a lot of things we don’t expect we’re capable of, if only we give them a serious shot.

And ask for help, of course. I have a hell of a time doing that. I did a painting recently that factored in mountains of glip’s advice, and on some level I feel like I didn’t quite do it myself, even though every stroke was made by my hand. Hell, I don’t even look at references nearly as much as I should. It feels like cheating, somehow? I know that’s ridiculous, but my natural impulse is to put my head down and figure it out myself. Maybe I’ve been doing that for too long with programming. Trust me, it doesn’t work quite so well in a brand new field.


I’m getting distracted again!

To answer your actual questions: how do I go about learning about myself? I don’t! It happens completely by accident. I’ll consciously examine my surface-level thoughts or behaviors or whatever, sure, but the serious fundamental revelations have all caught me completely by surprise — sometimes slowly, sometimes suddenly.

Most of them also came from listening to the people who observe me from the outside: I only started drawing in the first place because of some ridiculous deal I made with glip. At the time I thought they just wanted everyone to draw because art is their thing, but now I’m starting to suspect they’d caught on after eight years of watching me lament that I couldn’t draw.

I don’t know how I handle such discoveries, either. What is handling? I imagine someone discovering something and trying to come to grips with it, but I don’t know that I have quite that experience — my grappling usually comes earlier, when I’m still trying to figure the thing out despite not knowing that there’s a thing to find out. Once I know it, it’s on the table; I can’t un-know it or reject it meaningfully. All I can do is figure out what to do with it, and I approach that the same way I approach every other problem: by flailing at it and hoping for the best.

This isn’t quite 2000 words. Sorry. I’ve run out of things to say about me. This paragraph is very conspicuous filler. Banana. Atmosphere. Vocation.

Running R on Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/

Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL on R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.

In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.

Prerequisites

Before you get started, complete the following steps.

    1. Have your AWS account administrator give your AWS account the required permissions to access Athena via Amazon’s Identity and Access Management (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.

 

RAthena_1

  1. Provide a staging directory in the form of an Amazon S3 bucket. Athena will use this to query datasets and store results. We’ll call this staging bucket s3://athenauser-athena-r in the instructions that follow.

NOTE: In this blog post, I create all AWS resources in the US-East region. Use the Region Table to check the availability of Athena in other regions.

Set up R and RStudio on EC2 

  1. Follow the instructions in the blog post “Running R on AWS” to set up R on an EC2 instance (t2.medium or greater) running Amazon Linux . Read the step below before you begin.
  1. In that blog post under “Advanced Details,” when you reach step 3 use the following bash script to install the latest version of RStudio. Modify the password for RStudio as needed.
#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.136-x86_64.rpm
yum install -y --nogpgcheck rstudio-server-rhel-1.0.136-x86_64.rpm
#add user(s)
useradd rstudio
echo rstudio:rstudio | chpasswd

Install Java 8 

  1. SSH into this EC2 instance.
  2. Remove older versions of Java.
  3. Install Java 8. This is required to work with Athena.
  4. Run the following commands on the command line.
#install Java 8, select ‘y’ from options presented to proceed with installation
sudo yum install java-1.8.0-openjdk-devel
#remove version 7 of Java, select ‘y’ from options to proceed with removal
sudo yum remove java-1.7.0-openjdk
#configure java, choose 1 as your selection option for java 8 configuration
sudo /usr/sbin/alternatives --config java
#run command below to add Java support to R
sudo R CMD javareconf

#following libraries are required for the interactive application we build later
sudo yum install -y libpng-devel
sudo yum install -y libjpeg-turbo-devel

Set up .Renviron

You need to configure the R environment variable .Renviron with the required Athena credentials.

  1. Get the required credentials from your AWS Administrator in the form of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  1. Type the following command from the Linux command prompt and bring up the vi editor.
sudo vim /home/rstudio/.Renviron

Provide your Athena credentials in the following form into the editor:
ATHENA_USER=< AWS_ACCESS_KEY_ID >
ATHENA_PASSWORD=< AWS_SECRET_ACCESS_KEY>
  1. Save this file and exit from the editor. 

Log in to RStudio

Next, you’ll log in to RStudio on your EC2 instance.

  1. Get the public IP address of your instance from the EC2 dashboard and paste it followed by :8787 (port number for RStudio) into your browser window.
  1. Confirm that your IP address has been whitelisted for inbound access to port 8787 as part of the configuration for the security group associated with your EC2 instance.
  1. Log in to RStudio with the username and password you provided previously.

Install R packages

Next, you’ll install and load the required R packages.

#--following R packages are required for connecting R with Athena
install.packages("rJava")
install.packages("RJDBC")
library(rJava)
library(RJDBC)

#--following R packages are required for the interactive application we build later
#--steps below might take several minutes to complete
install.packages(c("plyr","dplyr","png","RgoogleMaps","ggmap"))
library(plyr)
library(dplyr)
library(png)
library(RgoogleMaps)
library(ggmap)

Connect to Athena

The following steps in R download the Athena driver and set up the required connection. Use the JDBC URL associated with your region.

#verify Athena credentials by inspecting results from command below
Sys.getenv()
#set up URL to download Athena JDBC driver
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
fil
#set up driver connection to JDBC
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
#connect to Athena using the driver, S3 working directory and credentials for Athena 
#replace ‘athenauser’ below with prefix you have set up for your S3 bucket
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir="s3://athenauser-athena-r",
user=Sys.getenv("ATHENA_USER"),
password=Sys.getenv("ATHENA_PASSWORD"))
#in case of error or warning from step above ensure rJava and RJDBC packages have #been loaded 
#also ensure you have Java 8 running and configured for R as outlined earlier

Now you’re ready to start querying Athena from RStudio. 

Sample Queries to test

# get a list of all tables currently in Athena 
dbListTables(con)
# run a sample query
dfelb=dbGetQuery(con, "SELECT * FROM sampledb.elb_logs limit 10")
head(dfelb,2)

RAthena_2

Interactive Use Case

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

#---sql  create table statement in Athena
dbSendQuery(con, 
"
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.gdeltmaster (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT,
SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt'
;
"
)



dbListTables(con)

You should see this newly created table named ‘gdeltmaster’ appear in your RStudio console after executing the statement above.

RAthena_3

Query this Athena table to get a count of all CAMEO events that took place in the US in 2015.

#--get count of all CAMEO events that took place in US in year 2015 
#--save results in R dataframe
dfg<-dbGetQuery(con,"SELECT eventcode,count(*) as count
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
group by eventcode
order by eventcode desc"
)
str(dfg)
head(dfg,2)

RAthena_4

#--get list of top 5 most frequently occurring events in US in 2015
dfs=head(arrange(dfg,desc(count)),5)
dfs

RAthena_5

From the R output shown above, you can see that CAMEO event “042” has the highest count. From the CAMEO manual, you can determine that this event has the description “Travel to another location for a meeting or other event.”

Next, you’ll use the knowledge gained from this analysis to get a list of all geo-coordinates associated with this specific event from the Athena table.

#--get a list of latitude and longitude associated with event “042” 
#--save results in R dataframe
dfgeo<-dbGetQuery(con,"SELECT actiongeo_lat,actiongeo_long
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
and eventcode = '042'
"
)
#--duration of above query will depend on factors like size of chosen EC2 instance
#--now rename columns in dataframe for brevity
names(dfgeo)[names(dfgeo)=="actiongeo_lat"]="lat"
names(dfgeo)[names(dfgeo)=="actiongeo_long"]="long"
names(dfgeo)
#let us inspect this R dataframe
str(dfgeo)
head(dfgeo,5)

RAthena_6
Next, generate a map for the United States. 

#--generate map for the US using the ggmap package
map=qmap('USA',zoom=3)
map

RAthena_7

Now you’ll plot the geodata obtained from your Athena table onto this map. This will help you visualize all US locations where these events had occurred in 2015. 

#--plot our geo-coordinates on the US map
map + geom_point(data = dfgeo, aes(x = dfgeo$long, y = dfgeo$lat), color="blue", size=0.5, alpha=0.5)

RAthena_8

By visually inspecting the results, you can determine that this specific event was heavily concentrated in the Northeastern part of the US.

Conclusion

You’ve learned how to build a simple interactive application with Athena and R. Athena can be used to store and query the underlying data for your big data applications using standard SQL, while R can be used to interactively query Athena and generate analytical insights using the powerful set of libraries that R provides.

If you have questions or suggestions, please leave your feedback in the comments.

 


About the Author

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 


Related

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

o_IoT_Minutes_1

 

Utopia

Post Syndicated from Eevee original https://eev.ee/blog/2017/03/08/utopia/

It’s been a while, but someone’s back on the Patreon blog topic tier! IndustrialRobot asks:

What does your personal utopia look like? Do you think we (as mankind) can achieve it? Why/why not?

Hm.

I spent the month up to my eyeballs in a jam game, but this question was in the back of my mind a lot. I could use it as a springboard to opine about anything, especially in the current climate: politics, religion, nationalism, war, economics, etc., etc. But all of that has been done to death by people who actually know what they’re talking about.

The question does say “personal”. So in a less abstract sense… what do I want the world to look like?

Mostly, I want everyone to have the freedom to make things.

I’ve been having a surprisingly hard time writing the rest of this without veering directly into the ravines of “basic income is good” and “maybe capitalism is suboptimal”. Those are true, but not really the tone I want here, and anyway they’ve been done to death by better writers than I. I’ve talked this out with Mel a few times, and it sounds much better aloud, so I’m going to try to drop my Blog Voice and just… talk.

*ahem*

Art versus business

So, art. Art is good.

I’m construing “art” very broadly here. More broadly than “media”, too. I’m including shitty robots, weird Twitter almost-bots, weird Twitter non-bots, even a great deal of open source software. Anything that even remotely resembles creative work — driven perhaps by curiosity, perhaps by practicality, but always by a soul bursting with ideas and a palpable need to get them out.

Western culture thrives on art. Most culture thrives on art. I’m not remotely qualified to defend this, but I suspect you could define culture in terms of art. It’s pretty important.

You’d think this would be reflected in how we discuss art, but often… it’s not. Tell me how often you’ve heard some of these gems.

  • I could do that.”
  • My eight-year-old kid could do that.”
  • Jokes about the worthlessness of liberal arts degrees.
  • Jokes about people trying to write novels in their spare time, the subtext being that only dreamy losers try to write novels, or something.
  • The caricature of a hippie working on a screenplay at Starbucks.

Oh, and then there was the guy who made a bot to scrape tons of art from artists who were using Patreon as a paywall — and a primary source of income. The justification was that artists shouldn’t expect to make a living off of, er, doing art, and should instead get “real jobs”.

I do wonder. How many of the people repeating these sentiments listen to music, or go to movies, or bought an iPhone because it’s prettier? Are those things not art that took real work to create? Is creating those things not a “real job”?

Perhaps a “real job” has to be one that’s not enjoyable, not a passion? And yet I can’t recall ever hearing anyone say that Taylor Swift should get a “real job”. Or that, say, pro football players should get “real jobs”. What do pro football players even do? They play a game a few times a year, and somehow this drives the flow of unimaginable amounts of money. We dress it up in the more serious-sounding “sport”, but it’s a game in the same general genre as hopscotch. There’s nothing wrong with that, but somehow it gets virtually none of the scorn that art does.

Another possible explanation is America’s partly-Christian, partly-capitalist attitude that you deserve exactly whatever you happen to have at the moment. (Whereas I deserve much more and will be getting it any day now.) Rich people are rich because they earned it, and we don’t question that further. Poor people are poor because they failed to earn it, and we don’t question that further, either. To do so would suggest that the system is somehow unfair, and hard work does not perfectly correlate with any particular measure of success.

I’m sure that factors in, but it’s not quite satisfying: I’ve also seen a good deal of spite aimed at people who are making a fairly decent chunk through Patreon or similar. Something is missing.

I thought, at first, that the key might be the American worship of work. Work is an inherent virtue. Politicians run entire campaigns based on how many jobs they’re going to create. Notably, no one seems too bothered about whether the work is useful, as long as someone decided to pay you for it.

Finally I stumbled upon the key. America doesn’t actually worship work. America worships business. Business means a company is deciding to pay you. Business means legitimacy. Business is what separates a hobby from a career.

And this presents a problem for art.

If you want to provide a service or sell a product, that’ll be hard, but America will at least try to look like it supports you. People are impressed that you’re an entrepreneur, a small business owner. Politicians will brag about policies made in your favor, whether or not they’re stabbing you in the back.

Small businesses have a particular structure they can develop into. You can divide work up. You can have someone in sales, someone in accounting. You can provide specifications and pay a factory to make your product. You can defer all of the non-creative work to someone else, whether that means experts in a particular field or unskilled labor.

But if your work is inherently creative, you can’t do that. The very thing you’re making is your idea in your style, driven by your experience. This is not work that’s readily parallelizable. Even if you sell physical merchandise and register as an LLC and have a dedicated workspace and do various other formal business-y things, the basic structure will still look the same: a single person doing the thing they enjoy. A hobbyist.

Consider the bulleted list from above. Those are all individual painters or artists or authors or screenwriters. The kinds of artists who earn respect without question are generally those managed by a business, those with branding: musical artists signed to labels, actors working for a studio. Even football players are part of a tangle of business.

(This doesn’t mean that business automatically confers respect, of course; tech in particular is full of anecdotes about nerds’ disdain for people whose jobs are design or UI or documentation or whathaveyou. But a businessy look seems to be a significant advantage.)

It seems that although art is a large part of what informs culture, we have a culture that defines “serious” endeavors in such a way that independent art cannot possibly be “serious”.

Art versus money

Which wouldn’t really matter at all, except that we also have a culture that expects you to pay for food and whatnot.

The reasoning isn’t too outlandish. Food is produced from a combination of work and resources. In exchange for getting the food, you should give back some of your own work and resources.

Obviously this is riddled with subtle flaws, but let’s roll with it for now and look at a case study. Like, uh, me!

Mel and I built and released two games together in the six weeks between mid-January and the end of February. Together, those games have made $1,000 in sales. The sales trail off fairly quickly within a few days of release, so we’ll call that the total gross for our effort.

I, dumb, having never actually sold anything before, thought this was phenomenal. Then I had the misfortune of doing some math.

Itch takes at least 10%, so we’re down to $900 net. Divided over six weeks, that’s $150 per week, before taxes — or $3.75 per hour if we’d been working full time.

Ah, but wait! There are two of us. And we hadn’t been working full time — we’d been working nearly every waking hour, which is at least twice “full time” hours. So we really made less than a dollar an hour. Even less than that, if you assume overtime pay.

From the perspective of capitalism, what is our incentive to do this? Between us, we easily have over thirty years of experience doing the things we do, and we spent weeks in crunch mode working on something, all to earn a small fraction of minimum wage. Did we not contribute back our own work and resources? Was our work worth so much less than waiting tables?

Waiting tables is a perfectly respectable way to earn a living, mind you. Ah, but wait! I’ve accidentally done something clever here. It is generally expected that you tip your waiter, because waiters are underpaid by the business, because the business assumes they’ll be tipped. Not tipping is actually, almost impressively, one of the rudest things you can do. And yet it’s not expected that you tip an artist whose work you enjoy, even though many such artists aren’t being paid at all.

Now, to be perfectly fair, both games were released for free. Even a dollar an hour is infinitely more than the zero dollars I was expecting — and I’m amazed and thankful we got as much as we did! Thank you so much. I bring it up not as a complaint, but as an armchair analysis of our systems of incentives.

People can take art for granted and whatever, yes, but there are several other factors at play here that hamper the ability for art to make money.

For one, I don’t want to sell my work. I suspect a great deal of independent artists and writers and open source developers (!) feel the same way. I create things because I want to, because I have to, because I feel so compelled to create that having a non-creative full-time job was making me miserable. I create things for the sake of expressing an idea. Attaching a price tag to something reduces the number of people who’ll experience it. In other words, selling my work would make it less valuable in my eyes, in much the same way that adding banner ads to my writing would make it less valuable.

And yet, I’m forced to sell something in some way, or else I’ll have to find someone who wants me to do bland mechanical work on their ideas in exchange for money… at the cost of producing sharply less work of my own. Thank goodness for Patreon, at least.

There’s also the reverse problem, in that people often don’t want to buy creative work. Everyone does sometimes, but only sometimes. It’s kind of a weird situation, and the internet has exacerbated it considerably.

Consider that if I write a book and print it on paper, that costs something. I have to pay for the paper and the ink and the use of someone else’s printer. If I want one more book, I have to pay a little more. I can cut those costs pretty considerable by printing a lot of books at once, but each copy still has a price, a marginal cost. If I then gave those books away, I would be actively losing money. So I can pretty well justify charging for a book.

Along comes the internet. Suddenly, copying costs nothing. Not only does it cost nothing, but it’s the fundamental operation. When you download a file or receive an email or visit a web site, you’re really getting a copy! Even the process which ultimately shows it on your screen involves a number of copies. This is so natural that we don’t even call it copying, don’t even think of it as copying.

True, bandwidth does cost something, but the rate is virtually nothing until you start looking at very big numbers indeed. I pay $60/mo for hosting this blog and a half dozen other sites — even that’s way more than I need, honestly, but downgrading would be a hassle — and I get 6TB of bandwidth. Even the longest of my posts haven’t exceeded 100KB. A post could be read by 64 million people before I’d start having a problem. If that were the population of a country, it’d be the 23rd largest in the world, between Italy and the UK.

How, then, do I justify charging for my writing? (Yes, I realize the irony in using my blog as an example in a post I’m being paid $88 to write.)

Well, I do pour effort and expertise and a fraction of my finite lifetime into it. But it doesn’t cost me anything tangible — I already had this hosting for something else! — and it’s easier all around to just put it online.

The same idea applies to a vast bulk of what’s online, and now suddenly we have a bit of a problem. Not only are we used to getting everything for free online, but we never bothered to build any sensible payment infrastructure. You still have to pay for everything by typing in a cryptic sequence of numbers from a little physical plastic card, which will then give you a small loan and charge the seller 30¢ plus 2.9% for the “convenience”.

If a website could say “pay 5¢ to read this” and you clicked a button in your browser and that was that, we might be onto something. But with our current setup, it costs far more than 5¢ to transfer 5¢, even though it’s just a number in a computer somewhere. The only people with the power and resources to fix this don’t want to fix it — they’d rather be the ones charging you the 30¢ plus 2.9%.

That leads to another factor of platforms and publishers, which are more than happy to eat a chunk of your earnings even when you do sell stuff. Google Play, the App Store, Steam, and anecdotally many other big-name comparative platforms all take 30% of your sales. A third! And that’s good! It seems common among book publishers to take 85% to 90%. For ebook sales — i.e., ones that don’t actually cost anything — they may generously lower that to a mere 75% to 85%.

Bless Patreon for only taking 5%. Itch.io is even better: it defaults to 10%, but gives you a slider, which you can set to anything from 0% to 100%.

I’ve mentioned all this before, so here’s a more novel thought: finite disposable income. Your audience only has so much money to spend on media right now. You can try to be more compelling to encourage them to spend more of it, rather than saving it, but ultimately everyone has a limit before they just plain run out of money.

Now, popularity is heavily influenced by social and network effects, so it tends to create a power law distribution: a few things are ridiculously hyperpopular, and then there’s a steep drop to a long tail of more modestly popular things.

If a new hyperpopular thing comes out, everyone is likely to want to buy it… but then that eats away a significant chunk of that finite pool of money that could’ve gone to less popular things.

This isn’t bad, and buying a popular thing doesn’t make you a bad person; it’s just what happens. I don’t think there’s any satisfying alternative that doesn’t involve radically changing the way we think about our economy.

Taylor Swift, who I’m only picking on because her infosec account follows me on Twitter, has sold tens of millions of albums and is worth something like a quarter of a billion dollars. Does she need more? If not, should she make all her albums free from now on?

Maybe she does, and maybe she shouldn’t. The alternative is for someone to somehow prevent her from making more money, which doesn’t sit well. Yet it feels almost heretical to even ask if someone “needs” more money, because we take for granted that she’s earned it — in part by being invested in by a record label and heavily advertised. The virtue is work, right? Don’t a lot of people work just as hard? (“But you have to be talented too!” Then please explain how wildly incompetent CEOs still make millions, and leave burning businesses only to be immediately hired by new ones? Anyway, are we really willing to bet there is no one equally talented but not as popular by sheer happenstance?)

It’s kind of a moot question anyway, since she’s probably under contract with billionaires and it’s not up to her.

Where the hell was I going with this.


Right, so. Money. Everyone needs some. But making it off art can be tricky, unless you’re one of the lucky handful who strike gold.

And I’m still pretty goddamn lucky to be able to even try this! I doubt I would’ve even gotten into game development by now if I were still working for an SF tech company — it just drained so much of my creative energy, and it’s enough of an uphill battle for me to get stuff done in the first place.

How many people do I know who are bursting with ideas, but have to work a tedious job to keep the lights on, and are too tired at the end of the day to get those ideas out? Make no mistake, making stuff takes work — a lot of it. And that’s if you’re already pretty good at the artform. If you want to learn to draw or paint or write or code, you have to do just as much work first, with much more frustration, and not as much to show for it.

Utopia

So there’s my utopia. I want to see a world where people have the breathing room to create the things they dream about and share them with the rest of us.

Can it happen? Maybe. I think the cultural issues are a fairly big blocker; we’d be much better off if we treated independent art with the same reverence as, say, people who play with a ball for twelve hours a year. Or if we treated liberal arts degrees as just as good as computer science degrees. (“But STEM can change the world!” Okay. How many people with computer science degrees would you estimate are changing the world, and how many are making a website 1% faster or keeping a lumbering COBOL beast running or trying to trick 1% more people into clicking on ads?)

I don’t really mean stuff like piracy, either. Piracy is a thing, but it’s… complicated. In my experience it’s not even artists who care the most about piracy; it’s massive publishers, the sort who see artists as a sponge to squeeze money out of. You know, the same people who make everything difficult to actually buy, infest it with DRM so it doesn’t work on half the stuff you own, and don’t even sell it in half the world.

I mean treating art as a free-floating commodity, detached from anyone who created it. I mean neo-Nazis adopting a comic book character as their mascot, against the creator’s wishes. I mean politicians and even media conglomerates using someone else’s music in well-funded videos and ads without even asking. I mean assuming Google Image Search, wonder that it is, is some kind of magical free art machine. I mean the snotty Reddit post I found while looking up Patreon’s fee structure, where some doofus was insisting that Patreon couldn’t possibly pay for a full-time YouTuber’s time, because not having a job meant they had lots of time to spare.

Maybe I should go one step further: everyone should create at least once or twice. Everyone should know what it’s like to have crafted something out of nothing, to be a fucking god within the microcosm of a computer screen or a sewing machine or a pottery table. Everyone should know that spark of inspiration that we don’t seem to know how to teach in math or science classes, even though it’s the entire basis of those as well. Everyone should know that there’s a good goddamn reason I listed open source software as a kind of art at the beginning of this post.

Basic income and more arts funding for public schools. If Uber can get billions of dollars for putting little car icons on top of Google Maps and not actually doing any of their own goddamn service themselves, I think we can afford to pump more cash into webcomics and indie games and, yes, even underwater basket weaving.