Tag Archives: training

Getting Ready for AWS re:Invent 2017

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/getting-ready-for-aws-reinvent-2017/

With just 40 days remaining before AWS re:Invent begins, my colleagues and I want to share some tips that will help you to make the most of your time in Las Vegas. As always, our focus is on training and education, mixed in with some after-hours fun and recreation for balance.

Locations, Locations, Locations
The re:Invent Campus will span the length of the Las Vegas strip, with events taking place at the MGM Grand, Aria, Mirage, Venetian, Palazzo, the Sands Expo Hall, the Linq Lot, and the Encore. Each venue will host tracks devoted to specific topics:

MGM Grand – Business Apps, Enterprise, Security, Compliance, Identity, Windows.

Aria – Analytics & Big Data, Alexa, Container, IoT, AI & Machine Learning, and Serverless.

Mirage – Bootcamps, Certifications & Certification Exams.

Venetian / Palazzo / Sands Expo Hall – Architecture, AWS Marketplace & Service Catalog, Compute, Content Delivery, Database, DevOps, Mobile, Networking, and Storage.

Linq Lot – Alexa Hackathons, Gameday, Jam Sessions, re:Play Party, Speaker Meet & Greets.

EncoreBookable meeting space.

If your interests span more than one topic, plan to take advantage of the re:Invent shuttles that will be making the rounds between the venues.

Lots of Content
The re:Invent Session Catalog is now live and you should start to choose the sessions of interest to you now.

With more than 1100 sessions on the agenda, planning is essential! Some of the most popular “deep dive” sessions will be run more than once and others will be streamed to overflow rooms at other venues. We’ve analyzed a lot of data, run some simulations, and are doing our best to provide you with multiple opportunities to build an action-packed schedule.

We’re just about ready to let you reserve seats for your sessions (follow me and/or @awscloud on Twitter for a heads-up). Based on feedback from earlier years, we have fine-tuned our seat reservation model. This year, 75% of the seats for each session will be reserved and the other 25% are for walk-up attendees. We’ll start to admit walk-in attendees 10 minutes before the start of the session.

Las Vegas never sleeps and neither should you! This year we have a host of late-night sessions, workshops, chalk talks, and hands-on labs to keep you busy after dark.

To learn more about our plans for sessions and content, watch the Get Ready for re:Invent 2017 Content Overview video.

Have Fun
After you’ve had enough training and learning for the day, plan to attend the Pub Crawl, the re:Play party, the Tatonka Challenge (two locations this year), our Hands-On LEGO Activities, and the Harley Ride. Stay fit with our 4K Run, Spinning Challenge, Fitness Bootcamps, and Broomball (a longstanding Amazon tradition).

See You in Vegas
As always, I am looking forward to meeting as many AWS users and blog readers as possible. Never hesitate to stop me and to say hello!

Jeff;

 

 

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Introducing Gluon: a new library for machine learning from AWS and Microsoft

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-library-for-machine-learning-from-aws-and-microsoft/

Post by Dr. Matt Wood

Today, AWS and Microsoft announced Gluon, a new open source deep learning interface which allows developers to more easily and quickly build machine learning models, without compromising performance.

Gluon Logo

Gluon provides a clear, concise API for defining machine learning models using a collection of pre-built, optimized neural network components. Developers who are new to machine learning will find this interface more familiar to traditional code, since machine learning models can be defined and manipulated just like any other data structure. More seasoned data scientists and researchers will value the ability to build prototypes quickly and utilize dynamic neural network graphs for entirely new model architectures, all without sacrificing training speed.

Gluon is available in Apache MXNet today, a forthcoming Microsoft Cognitive Toolkit release, and in more frameworks over time.

Neural Networks vs Developers
Machine learning with neural networks (including ‘deep learning’) has three main components: data for training; a neural network model, and an algorithm which trains the neural network. You can think of the neural network in a similar way to a directed graph; it has a series of inputs (which represent the data), which connect to a series of outputs (the prediction), through a series of connected layers and weights. During training, the algorithm adjusts the weights in the network based on the error in the network output. This is the process by which the network learns; it is a memory and compute intensive process which can take days.

Deep learning frameworks such as Caffe2, Cognitive Toolkit, TensorFlow, and Apache MXNet are, in part, an answer to the question ‘how can we speed this process up? Just like query optimizers in databases, the more a training engine knows about the network and the algorithm, the more optimizations it can make to the training process (for example, it can infer what needs to be re-computed on the graph based on what else has changed, and skip the unaffected weights to speed things up). These frameworks also provide parallelization to distribute the computation process, and reduce the overall training time.

However, in order to achieve these optimizations, most frameworks require the developer to do some extra work: specifically, by providing a formal definition of the network graph, up-front, and then ‘freezing’ the graph, and just adjusting the weights.

The network definition, which can be large and complex with millions of connections, usually has to be constructed by hand. Not only are deep learning networks unwieldy, but they can be difficult to debug and it’s hard to re-use the code between projects.

The result of this complexity can be difficult for beginners and is a time-consuming task for more experienced researchers. At AWS, we’ve been experimenting with some ideas in MXNet around new, flexible, more approachable ways to define and train neural networks. Microsoft is also a contributor to the open source MXNet project, and were interested in some of these same ideas. Based on this, we got talking, and found we had a similar vision: to use these techniques to reduce the complexity of machine learning, making it accessible to more developers.

Enter Gluon: dynamic graphs, rapid iteration, scalable training
Gluon introduces four key innovations.

  1. Friendly API: Gluon networks can be defined using a simple, clear, concise code – this is easier for developers to learn, and much easier to understand than some of the more arcane and formal ways of defining networks and their associated weighted scoring functions.
  2. Dynamic networks: the network definition in Gluon is dynamic: it can bend and flex just like any other data structure. This is in contrast to the more common, formal, symbolic definition of a network which the deep learning framework has to effectively carve into stone in order to be able to effectively optimizing computation during training. Dynamic networks are easier to manage, and with Gluon, developers can easily ‘hybridize’ between these fast symbolic representations and the more friendly, dynamic ‘imperative’ definitions of the network and algorithms.
  3. The algorithm can define the network: the model and the training algorithm are brought much closer together. Instead of separate definitions, the algorithm can adjust the network dynamically during definition and training. Not only does this mean that developers can use standard programming loops, and conditionals to create these networks, but researchers can now define even more sophisticated algorithms and models which were not possible before. They are all easier to create, change, and debug.
  4. High performance operators for training: which makes it possible to have a friendly, concise API and dynamic graphs, without sacrificing training speed. This is a huge step forward in machine learning. Some frameworks bring a friendly API or dynamic graphs to deep learning, but these previous methods all incur a cost in terms of training speed. As with other areas of software, abstraction can slow down computation since it needs to be negotiated and interpreted at run time. Gluon can efficiently blend together a concise API with the formal definition under the hood, without the developer having to know about the specific details or to accommodate the compiler optimizations manually.

The team here at AWS, and our collaborators at Microsoft, couldn’t be more excited to bring these improvements to developers through Gluon. We’re already seeing quite a bit of excitement from developers and researchers alike.

Getting started with Gluon
Gluon is available today in Apache MXNet, with support coming for the Microsoft Cognitive Toolkit in a future release. We’re also publishing the front-end interface and the low-level API specifications so it can be included in other frameworks in the fullness of time.

You can get started with Gluon today. Fire up the AWS Deep Learning AMI with a single click and jump into one of 50 fully worked, notebook examples. If you’re a contributor to a machine learning framework, check out the interface specs on GitHub.

-Dr. Matt Wood

Backing Up WordPress

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/backing-up-wordpress/

WordPress cloud backup
WordPress logo

WordPress is the most popular CMS (Content Management System) for websites, with almost 30% of all websites in the world using WordPress. That’s a lot of sites — over 350 million!

In this post we’ll talk about the different approaches to keeping the data on your WordPress website safe.


Stop the Presses! (Or the Internet!)

As we were getting ready to publish this post, we received news from UpdraftPlus, one of the biggest WordPress plugin developers, that they are supporting Backblaze B2 as a storage solution for their backup plugin. They shipped the update (1.13.9) this week. This is great news for Backblaze customers! UpdraftPlus is also offering a 20% discount to Backblaze customers wishing to purchase or upgrade to UpdraftPlus Premium. The complete information is below.

UpdraftPlus joins backup plugin developer XCloner — Backup and Restore in supporting Backblaze B2. A third developer, BlogVault, also announced their intent to support Backblaze B2. Contact your favorite WordPress backup plugin developer and urge them to support Backblaze B2, as well.

Now, back to our post…


Your WordPress website data is on a web server that’s most likely located in a large data center. You might wonder why it is necessary to have a backup of your website if it’s in a data center. Website data can be lost in a number of ways, including mistakes by the website owner (been there), hacking, or even domain ownership dispute (I’ve seen it happen more than once). A website backup also can provide a history of changes you’ve made to the website, which can be useful. As an overall strategy, it’s best to have a backup of any data that you can’t afford to lose for personal or business reasons.

Your web hosting company might provide backup services as part of your hosting plan. If you are using their service, you should know where and how often your data is being backed up. You don’t want to find out too late that your backup plan was not adequate.

Sites on WordPress.com are automatically backed up by VaultPress (Automattic), which also is available for self-hosted WordPress installations. If you don’t want the work or decisions involved in managing the hosting for your WordPress site, WordPress.com will handle it for you. You do, however, give up some customization abilities, such as the option to add plugins of your own choice.

Very large and active websites might consider WordPress VIP by Automattic, or another premium WordPress hosting service such as Pagely.com.

This post is about backing up self-hosted WordPress sites, so we’ll focus on those options.

WordPress Backup

Backup strategies for WordPress can be divided into broad categories depending on 1) what you back up, 2) when you back up, and 3) where the data is backed up.

With server data, such as with a WordPress installation, you should plan to have three copies of the data (the 3-2-1 backup strategy). The first is the active data on the WordPress web server, the second is a backup stored on the web server or downloaded to your local computer, and the third should be in another location, such as the cloud.

We’ll talk about the different approaches to backing up WordPress, but we recommend using a WordPress plugin to handle your backups. A backup plugin can automate the task, optimize your backup storage space, and alert you of problems with your backups or WordPress itself. We’ll cover plugins in more detail, below.

What to Back Up?

The main components of your WordPress installation are:

You should decide which of these elements you wish to back up. The database is the top priority, as it contains all your website posts and pages (exclusive of media). Your current theme is important, as it likely contains customizations you’ve made. Following those in priority are any other files you’ve customized or made changes to.

You can choose to back up the WordPress core installation and plugins, if you wish, but these files can be downloaded again if necessary from the source, so you might not wish to include them. You likely have all the media files you use on your website on your local computer (which should be backed up), so it is your choice whether to back these up from the server as well.

If you wish to be able to recreate your entire website easily in case of data loss or disaster, you might choose to back up everything, though on a large website this could be a lot of data.

Generally, you should 1) prioritize any file that you’ve customized that you can’t afford to lose, and 2) decide whether you need a copy of everything in order to get your site back up quickly. These choices will determine your backup method and the amount of storage you need.

A good backup plugin for WordPress enables you to specify which files you wish to back up, and even to create separate backups and schedules for different backup contents. That’s another good reason to use a plugin for backing up WordPress.

When to Back Up?

You can back up manually at any time by using the Export tool in WordPress. This is handy if you wish to do a quick backup of your site or parts of it. Since it is manual, however, it is not a part of a dependable backup plan that should be done regularly. If you wish to use this tool, go to Tools, Export, and select what you wish to back up. The output will be an XML file that uses the WordPress Extended RSS format, also known as WXR. You can create a WXR file that contains all of the information on your site or just portions of the site, such as posts or pages by selecting: All content, Posts, Pages, or Media.
Note: You can use WordPress’s Export tool for sites hosted on WordPress.com, as well.

Export instruction for WordPress

Many of the backup plugins we’ll be discussing later also let you do a manual backup on demand in addition to regularly scheduled or continuous backups.

Note:  Another use of the WordPress Export tool and the WXR file is to transfer or clone your website to another server. Once you have exported the WXR file from the website you wish to transfer from, you can import the WXR file from the Tools, Import menu on the new WordPress destination site. Be aware that there are file size limits depending on the settings on your web server. See the WordPress Codex entry for more information. To make this job easier, you may wish to use one of a number of WordPress plugins designed specifically for this task.

You also can manually back up the WordPress MySQL database using a number of tools or a plugin. The WordPress Codex has good information on this. All WordPress plugins will handle this for you and do it automatically. They also typically include tools for optimizing the database tables, which is just good housekeeping.

A dependable backup strategy doesn’t rely on manual backups, which means you should consider using one of the many backup plugins available either free or for purchase. We’ll talk more about them below.

Which Format To Back Up In?

In addition to the WordPress WXR format, plugins and server tools will use various file formats and compression algorithms to store and compress your backup. You may get to choose between zip, tar, tar.gz, tar.gz2, and others. See The Most Common Archive File Formats for more information on these formats.

Select a format that you know you can access and unarchive should you need access to your backup. All of these formats are standard and supported across operating systems, though you might need to download a utility to access the file.

Where To Back Up?

Once you have your data in a suitable format for backup, where do you back it up to?

We want to have multiple copies of our active website data, so we’ll choose more than one destination for our backup data. The backup plugins we’ll discuss below enable you to specify one or more possible destinations for your backup. The possible destinations for your backup include:

A backup folder on your web server
A backup folder on your web server is an OK solution if you also have a copy elsewhere. Depending on your hosting plan, the size of your site, and what you include in the backup, you may or may not have sufficient disk space on the web server. Some backup plugins allow you to configure the plugin to keep only a certain number of recent backups and delete older ones, saving you disk space on the server.
Email to you
Because email servers have size limitations, the email option is not the best one to use unless you use it to specifically back up just the database or your main theme files.
FTP, SFTP, SCP, WebDAV
FTP, SFTP, SCP, and WebDAV are all widely-supported protocols for transferring files over the internet and can be used if you have access credentials to another server or supported storage device that is suitable for storing a backup.
Sync service (Dropbox, SugarSync, Google Drive, OneDrive)
A sync service is another possible server storage location though it can be a pricier choice depending on the plan you have and how much you wish to store.
Cloud storage (Backblaze B2, Amazon S3, Google Cloud, Microsoft Azure, Rackspace)
A cloud storage service can be an inexpensive and flexible option with pay-as-you go pricing for storing backups and other data.

A good website backup strategy would be to have multiple backups of your website data: one in a backup folder on your web hosting server, one downloaded to your local computer, and one in the cloud, such as with Backblaze B2.

If I had to choose just one of these, I would choose backing up to the cloud because it is geographically separated from both your local computer and your web host, it uses fault-tolerant and redundant data storage technologies to protect your data, and it is available from anywhere if you need to restore your site.

Backup Plugins for WordPress

Probably the easiest and most common way to implement a solid backup strategy for WordPress is to use one of the many backup plugins available for WordPress. Fortunately, there are a number of good ones and are available free or in “freemium” plans in which you can use the free version and pay for more features and capabilities only if you need them. The premium options can give you more flexibility in configuring backups or have additional options for where you can store the backups.

How to Choose a WordPress Backup Plugin

screenshot of WordPress plugins search

When considering which plugin to use, you should take into account a number of factors in making your choice.

Is the plugin actively maintained and up-to-date? You can determine this from the listing in the WordPress Plugin Repository. You also can look at reviews and support comments to get an idea of user satisfaction and how well issues are resolved.

Does the plugin work with your web hosting provider? Generally, well-supported plugins do, but you might want to check to make sure there are no issues with your hosting provider.

Does it support the cloud service or protocol you wish to use? This can be determined from looking at the listing in the WordPress Plugin Repository or on the developer’s website. Developers often will add support for cloud services or other backup destinations based on user demand, so let the developer know if there is a feature or backup destination you’d like them to add to their plugin.

Other features and options to consider in choosing a backup plugin are:

  • Whether encryption of your backup data is available
  • What are the options for automatically deleting backups from the storage destination?
  • Can you globally exclude files, folders, and specific types of files from the backup?
  • Do the options for scheduling automatic backups meet your needs for frequency?
  • Can you exclude/include specific database tables (a good way to save space in your backup)?

WordPress Backup Plugins Review

Let’s review a few of the top choices for WordPress backup plugins.

UpdraftPlus

UpdraftPlus is one of the most popular backup plugins for WordPress with over one million active installations. It is available in both free and Premium versions.

UpdraftPlus just released support for Backblaze B2 Cloud Storage in their 1.13.9 update on September 25. According to the developer, support for Backblaze B2 was the most frequent request for a new storage option for their plugin. B2 support is available in their Premium plugin and as a stand-alone update to their standard product.

Note: The developers of UpdraftPlus are offering a special 20% discount to Backblaze customers on the purchase of UpdraftPlus Premium by using the coupon code backblaze20. The discount is valid until the end of Friday, October 6th, 2017.

screenshot of Backblaze B2 cloud backup for WordPress in UpdraftPlus

XCloner — Backup and Restore

XCloner — Backup and Restore is a useful open-source plugin with many options for backing up WordPress.

XCloner supports B2 Cloud Storage in their free plugin.

screenshot of XCloner WordPress Backblaze B2 backup settings

BlogVault

BlogVault describes themselves as a “complete WordPress backup solution.” They offer a free trial of their paid WordPress backup subscription service that features real-time backups of changes to your WordPress site, as well as many other features.

BlogVault has announced their intent to support Backblaze B2 Cloud Storage in a future update.

screenshot of BlogValut WordPress Backup settings

BackWPup

BackWPup is a popular and free option for backing up WordPress. It supports a number of options for storing your backup, including the cloud, FTP, email, or on your local computer.

screenshot of BackWPup WordPress backup settings

WPBackItUp

WPBackItUp has been around since 2012 and is highly rated. It has both free and paid versions.

screenshot of WPBackItUp WordPress backup settings

VaultPress

VaultPress is part of Automattic’s well-known WordPress product, JetPack. You will need a JetPack subscription plan to use VaultPress. There are different pricing plans with different sets of features.

screenshot of VaultPress backup settings

Backup by Supsystic

Backup by Supsystic supports a number of options for backup destinations, encryption, and scheduling.

screenshot of Backup by Supsystic backup settings

BackupWordPress

BackUpWordPress is an open-source project on Github that has a popular and active following and many positive reviews.

screenshot of BackupWordPress WordPress backup settings

BackupBuddy

BackupBuddy, from iThemes, is the old-timer of backup plugins, having been around since 2010. iThemes knows a lot about WordPress, as they develop plugins, themes, utilities, and provide training in WordPress.

BackupBuddy’s backup includes all WordPress files, all files in the WordPress Media library, WordPress themes, and plugins. BackupBuddy generates a downloadable zip file of the entire WordPress website. Remote storage destinations also are supported.

screenshot of BackupBuddy settings

WordPress and the Cloud

Do you use WordPress and back up to the cloud? We’d like to hear about it. We’d also like to hear whether you are interested in using B2 Cloud Storage for storing media files served by WordPress. If you are, we’ll write about it in a future post.

In the meantime, keep your eye out for new plugins supporting Backblaze B2, or better yet, urge them to support B2 if they’re not already.

The Best Backup Strategy is the One You Use

There are other approaches and tools for backing up WordPress that you might use. If you have an approach that works for you, we’d love to hear about it in the comments.

The post Backing Up WordPress appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Dialekt-o-maten vending machine

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/dialekt-o-maten-vending-machine/

At some point, many of you will have become exasperated with your AI personal assistant for not understanding you due to your accent – or worse, your fantastic regional dialect! A vending machine from Coca-Cola Sweden turns this issue inside out: the Dialekt-o-maten rewards users with a free soft drink for speaking in a Swedish regional dialect.

The world’s first vending machine where you pay with a dialect!

Thirsty fans along with journalists were invited to try Dialekt-o-maten at Stureplan in central Stockholm. Depending on how well they could pronounce the different phrases in assorted Swedish dialects – they were rewarded an ice cold Coke with that destination on the label.

The Dialekt-o-maten

The machine, which uses a Raspberry Pi, was set up in Stureplan Square in Stockholm. A person presses one of six buttons to choose the regional dialect they want to try out. They then hit ‘record’, and speak into the microphone. The recording is compared to a library of dialect samples, and, if it matches closely enough, voila! — the Dialekt-o-maten dispenses a soft drink for free.

Dialekt-o-maten on the highstreet in Stockholm

Code for the Dialekt-o-maten

The team of developers used the dejavu Python library, as well as custom-written code which responded to new recordings. Carl-Anders Svedberg, one of the developers, said:

Testing the voices and fine-tuning the right level of difficulty for the users was quite tricky. And we really should have had more voice samples. Filtering out noise from the surroundings, like cars and music, was also a small hurdle.

While they wrote the initial software on macOS, the team transferred it to a Raspberry Pi so they could install the hardware inside the Dialekt-o-maten.

Regional dialects

Even though Sweden has only ten million inhabitants, there are more than 100 Swedish dialects. In some areas of Sweden, the local language even still resembles Old Norse. The Dialekt-o-maten recorded how well people spoke the six dialects it used. Apparently, the hardest one to imitate is spoken in Vadstena, and the easiest is spoken in Smögen.

Dialekt-o-maten on Stockholm highstreet

Speech recognition with the Pi

Because of its audio input capabilities, the Raspberry Pi is very useful for building devices that use speech recognition software. One of our favourite projects in this vein is of course Allen Pan’s Real-Life Wizard Duel. We also think this pronunciation training machine by Japanese makers HomeMadeGarbage is really neat. Ideas from these projects and the Dialekt-o-maten could potentially be combined to make a fully fledged language-learning tool!

How about you? Have you used a Raspberry Pi to help you become multilingual? If so, do share your project with us in the comments or via social media.

The post Dialekt-o-maten vending machine appeared first on Raspberry Pi.

Google Signs Agreement to Tackle YouTube Piracy

Post Syndicated from Andy original https://torrentfreak.com/google-signs-unprecedented-agreement-to-tackle-youtube-piracy-170921/

Once upon a time, people complaining about piracy would point to the hundreds of piracy sites around the Internet. These days, criticism is just as likely to be leveled at Google-owned services.

YouTube, in particular, has come in for intense criticism, with the music industry complaining of exploitation of the DMCA in order to obtain unfair streaming rates from record labels. Along with streaming-ripping, this so-called Value Gap is one of the industry’s hottest topics.

With rightsholders seemingly at war with Google to varying degrees, news from France suggests that progress can be made if people sit down and negotiate.

According to local reports, Google and local anti-piracy outfit ALPA (l’Association de Lutte Contre la Piraterie Audiovisuelle) under the auspices of the CNC have signed an agreement to grant rightsholders direct access to content takedown mechanisms on YouTube.

YouTube has granted access to its Content ID systems to companies elsewhere for years but the new deal will see the system utilized by French content owners for the first time. It’s hoped that the access will result in infringing content being taken down or monetized more quickly than before.

“We do not want fraudsters to use our platforms to the detriment of creators,” said Carlo D’Asaro Biondo, Google’s President of Strategic Relationships in Europe, the Middle East and Africa.

The agreement, overseen by the Ministry of Culture, will see Google provide ALPA with financial support and rightsholders with essential training.

ALPA president Nicolas Seydoux welcomed the deal, noting that it symbolizes the “collapse of the wall of incomprehension” that previously existed between France’s rightsholders and the Internet search giant.

The deal forms part of the French government’s “Plan of Action Against Piracy”, in which it hopes to crack down on infringement in various ways, including tackling the threat of pirate sites, better promotion of services offering legitimate content, and educating children “from an early age” on the need to respect copyright.

“The fight against piracy is the great challenge of the new century in the cultural sphere,” said France’s Minister of Culture, Françoise Nyssen.

“I hope this is just the beginning of a process. It will require other agreements with rights holders and other platforms, as well as at the European level.”

According to NextInpact, the Google agreement will eventually encompass the downgrading of infringing content in search results as part of the Trusted Copyright Removal Program. A similar system is already in place in the UK.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

New Techniques in Fake Reviews

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/09/new_techniques_.html

Research paper: “Automated Crowdturfing Attacks and Defenses in Online Review Systems.”

Abstract: Malicious crowdsourcing forums are gaining traction as sources of spreading misinformation online, but are limited by the costs of hiring and managing human workers. In this paper, we identify a new class of attacks that leverage deep learning language models (Recurrent Neural Networks or RNNs) to automate the generation of fake online reviews for products and services. Not only are these attacks cheap and therefore more scalable, but they can control rate of content output to eliminate the signature burstiness that makes crowdsourced campaigns easy to detect.

Using Yelp reviews as an example platform, we show how a two phased review generation and customization attack can produce reviews that are indistinguishable by state-of-the-art statistical detectors. We conduct a survey-based user study to show these reviews not only evade human detection, but also score high on “usefulness” metrics by users. Finally, we develop novel automated defenses against these attacks, by leveraging the lossy transformation introduced by the RNN training and generation cycle. We consider countermeasures against our mechanisms, show that they produce unattractive cost-benefit tradeoffs for attackers, and that they can be further curtailed by simple constraints imposed by online service providers.

Create a text-based adventure game with FutureLearn

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/text-based-futurelearn/

Learning with Raspberry Pi has never been so easy! We’re adding a new course to FutureLearn today, and you can take part anywhere in the world.

FutureLearn: the story so far…

In February 2017, we were delighted to launch two free online CPD training courses on the FutureLearn platform, available anywhere in the world. Since the launch, more than 30,000 educators have joined these courses: Teaching Programming in Primary Schools, and Teaching Physical Computing with Raspberry Pi and Python.

Futurelearn Raspberry Pi

Thousands of educators have been building their skills – completing tasks such as writing a program in Python to make an LED blink, or building a voting app in Scratch. The two courses are scaffolded to build skills, week by week. Learners are supported by videos, screencasts, and articles, and they have the chance to apply what they have learned in as many different practical projects as possible.

We have had some excellent feedback from learners on the courses, such as Kyle Wilke who commented: “Fantastic course. Nice integration of text-based and video instruction. Was very impressed how much support was provided by fellow students, kudos to us. Can’t wait to share this with fellow educators.”

Brand new course

We are launching a new course this autumn. You can join lead educator Laura Sachs to learn object-oriented programming principles by creating your own text-based adventure game in Python. The course is aimed at educators who have programming experience, but have never programmed in the object-oriented style.

Future Learn: Object-oriented Programming in Python trailer

Our newest FutureLearn course in now live. You can join lead educator Laura Sachs to learn object-oriented programming principles by creating your own text-based adventure game in Python. The course is aimed at educators who have programming experience, but have never programmed in the object-oriented style.

The course will introduce you to the principles of object-oriented programming in Python, showing you how to create objects, functions, methods, and classes. You’ll use what you learn to create your own text-based adventure game. You will have the chance to share your code with other learners, and to see theirs. If you’re an educator, you’ll also be able to develop ideas for using object-oriented programming in your classroom.

Take part

Sign up now to join us on the course, starting today, September 4. Our courses are free to join online – so you can learn wherever you are, and whenever you want.

The post Create a text-based adventure game with FutureLearn appeared first on Raspberry Pi.

VMware Cloud on AWS – Now Available

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/vmware-cloud-on-aws-now-available/

Last year I told you about the work that we are doing with our friends at VMware to build the VMware Cloud on AWS. As I shared at the time, this is a native, fully-managed offering that runs the VMware SDDC stack directly on bare-metal AWS infrastructure that maintains the elasticity and security customers have come to expect. This allows you to benefit from the scalability and resiliency of AWS, along with the networking and system-level hardware features that are fundamental parts of our security-first architecture.

VMware Cloud on AWS allows you take advantage of what you already know and own. Your existing skills, your investment in training, your operational practices, and your investment in software licenses remain relevant and applicable when you move to the public cloud. As part of that move you can forget about building & running data centers, modernizing hardware, and scaling to meet transient or short-term demand. You can also take advantage of a long list of AWS compute, database, analytics, IoT, AI, security, mobile, deployment and application services.

Initial Availability
After incorporating feedback from many customers and partners in our Early Access beta program, today at VMworld, VMware and Amazon announced the initial availability of VMware Cloud on AWS. This service is initially available in the US West (Oregon) region through VMware and members of the VMware Partner Network. It is designed to support popular use cases such as data center extension, application development & testing, and application migration.

This offering is sold, delivered, supported, and billed by VMware. It supports custom-sized VMs, runs any OS that is supported by VMware, and makes use of single-tenant bare-metal AWS infrastructure so that you can bring your Windows Server licenses to the cloud. Each SDDC (Software-Defined Data Center) consists of 4 to 16 instances, each with 36 cores, 512 GB of memory, and 15.2 TB of NVMe storage. Clusters currently run in a single AWS Availability Zone (AZ) with support in the works for clusters that span AZs. You can spin up an entire VMware SDDC in a couple of hours, and scale host capacity up and down in minutes.

The NSX networking platform (powered by the AWS Elastic Networking Adapter running at up to 25 Gbps) supports multicast traffic, separate networks for management and compute, and IPSec VPN tunnels to on-premises firewalls, routers, and so forth.

Here’s an overview to show you how all of the parts fit together:

The VMware and third-party management tools (vCenter Server, PowerCLI, the vRealize Suite, and code that calls the vSphere API) that you use today will work just fine when you build a hybrid VMware environment that combines your existing on-premises resources and those that you launch in AWS. This hybrid environment will use a new VMware Hybrid Linked Mode to create a single, unified view of your on-premises and cloud resources. You can use familiar VMware tools to manage your applications, without having to purchase any new or custom hardware, rewrite applications, or modify your operating model.

Your applications and your code can access the full range of AWS services (the database, analytical, and AI services are a good place to start). Use for these services is billed separately and you’ll need to create an AWS account.

Learn More at VMworld
If you are attending VMworld in Las Vegas, please be sure to check out some of the 90+ AWS sessions:

Also, be sure to stop by booth #300 and say hello to my colleagues from the AWS team.

In the Works
Our teams have come a long way since last year, but things are just getting revved up!

VMware and AWS are continuing to invest to enable support for new capabilities and use cases, such as application migration, data center expansion, and application test and development. Work is under way to add additional AWS regions, support more use cases such as disaster recovery and data center consolidation, add certifications, and enable even deeper integration with AWS services.

Jeff;

 

The Pronunciation Training Machine

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/pronunciation-training-machine/

Using a Raspberry Pi, an Arduino, an Adafruit NeoPixel Ring and a servomotor, Japanese makers HomeMadeGarbage produced this Pronunciation Training Machine to help their parents distinguish ‘L’s and ‘R’s when speaking English.

L R 発音矯正ギブス お母ちゃん編 Pronunciation training machine #right #light #raspberrypi #arduino #neopixel

23 Likes, 1 Comments – Home Made Garbage (@homemadegarbage) on Instagram: “L R 発音矯正ギブス お母ちゃん編 Pronunciation training machine #right #light #raspberrypi #arduino #neopixel”

How does an Pronunciation Training Machine work?

As you can see in the video above, the machine utilises the Google Cloud Speech API to recognise their parents’ pronunciation of the words ‘right’ and ‘light’. Correctly pronounce the former, and the servo-mounted arrow points to the right. Pronounce the later and the NeoPixel Ring illuminates because, well, you just said “light”.

An image showing how the project works - English Pronunciation TrainingYou can find the full code for the project on its hackster page here.

Variations on the idea

It’s a super-cute project with great potential, and the concept could easily be amended for other training purposes. How about using motion sensors to help someone learn their left from their right?

A photo of hands with left and right written on them - English Pronunciation Training

Wait…your left or my left?
image c/o tattly

Or use random.choice to switch on LEDs over certain images, and speech recognition to reward a correct answer? Light up a picture of a cat, for example, and when the player says “cat”, they receive a ‘purr’ or a treat?

A photo of a kitten - English Pronunciation Training

Obligatory kitten picture
image c/o somewhere on the internet!

Raspberry Pi-based educational aids do not have to be elaborate builds. They can use components as simple as a servo and an LED, and still have the potential to make great improvements in people’s day-to-day lives.

Your own projects

If you’ve created an educational tool using a Raspberry Pi, we’d love to see it. The Raspberry Pi itself is an educational tool, so you’re helping it to fulfil its destiny! Make sure you share your projects with us on social media, or pop a link in the comments below. We’d also love to see people using the Pronunciation Training Machine (or similar projects), so make sure you share those too!

A massive shout out to Artie at hackster.io for this heads-up, and for all the other Raspberry Pi projects he sends my way. What a star!

The post The Pronunciation Training Machine appeared first on Raspberry Pi.

Analyzing AWS Cost and Usage Reports with Looker and Amazon Athena

Post Syndicated from Dillon Morrison original https://aws.amazon.com/blogs/big-data/analyzing-aws-cost-and-usage-reports-with-looker-and-amazon-athena/

This is a guest post by Dillon Morrison at Looker. Looker is, in their own words, “a new kind of analytics platform–letting everyone in your business make better decisions by getting reliable answers from a tool they can use.” 

As the breadth of AWS products and services continues to grow, customers are able to more easily move their technology stack and core infrastructure to AWS. One of the attractive benefits of AWS is the cost savings. Rather than paying upfront capital expenses for large on-premises systems, customers can instead pay variables expenses for on-demand services. To further reduce expenses AWS users can reserve resources for specific periods of time, and automatically scale resources as needed.

The AWS Cost Explorer is great for aggregated reporting. However, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and can be the better choice for the long term. Thankfully, with the introduction of Amazon Athena, monitoring and managing these costs is now easier than ever.

In the post, I walk through setting up the data pipeline for cost and usage reports, Amazon S3, and Athena, and discuss some of the most common levers for cost savings. I surface tables through Looker, which comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive.

Analysis with Athena

With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Athena uses Apache Hive’s DDL to create tables, and the Presto querying engine to process queries. Analysis can be performed directly on raw data in S3. Conveniently, AWS exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.

After the data pipeline is established, cost and usage data (the recommended billing data, per AWS documentation) provides a plethora of comprehensive information around usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, this report can be cut-and-sliced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, purchase option (on-demand, spot, or reserved), and so on.

Walkthrough

By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which are discussed below.

Prerequisites

If you want to follow along, you need the following resources:

Enable the cost and usage reports

First, enable the Cost and Usage report. For Time unit, select Hourly. For Include, select Resource IDs. All options are prompted in the report-creation window.

The Cost and Usage report dumps CSV files into the specified S3 bucket. Please note that it can take up to 24 hours for the first file to be delivered after enabling the report.

Configure the S3 bucket and files for Athena querying

In addition to the CSV file, AWS also creates a JSON manifest file for each cost and usage report. Athena requires that all of the files in the S3 bucket are in the same format, so we need to get rid of all these manifest files. If you’re looking to get started with Athena quickly, you can simply go into your S3 bucket and delete the manifest file manually, skip the automation described below, and move on to the next section.

To automate the process of removing the manifest file each time a new report is dumped into S3, which I recommend as you scale, there are a few additional steps. The folks at Concurrency labs wrote a great overview and set of scripts for this, which you can find in their GitHub repo.

These scripts take the data from an input bucket, remove anything unnecessary, and dump it into a new output bucket. We can utilize AWS Lambda to trigger this process whenever new data is dropped into S3, or on a nightly basis, or whatever makes most sense for your use-case, depending on how often you’re querying the data. Please note that enabling the “hourly” report means that data is reported at the hour-level of granularity, not that a new file is generated every hour.

Following these scripts, you’ll notice that we’re adding a date partition field, which isn’t necessary but improves query performance. In addition, converting data from CSV to a columnar format like ORC or Parquet also improves performance. We can automate this process using Lambda whenever new data is dropped in our S3 bucket. Amazon Web Services discusses columnar conversion at length, and provides walkthrough examples, in their documentation.

As a long-term solution, best practice is to use compression, partitioning, and conversion. However, for purposes of this walkthrough, we’re not going to worry about them so we can get up-and-running quicker.

Set up the Athena query engine

In your AWS console, navigate to the Athena service, and click “Get Started”. Follow the tutorial and set up a new database (we’ve called ours “AWS Optimizer” in this example). Don’t worry about configuring your initial table, per the tutorial instructions. We’ll be creating a new table for cost and usage analysis. Once you walked through the tutorial steps, you’ll be able to access the Athena interface, and can begin running Hive DDL statements to create new tables.

One thing that’s important to note, is that the Cost and Usage CSVs also contain the column headers in their first row, meaning that the column headers would be included in the dataset and any queries. For testing and quick set-up, you can remove this line manually from your first few CSV files. Long-term, you’ll want to use a script to programmatically remove this row each time a new file is dropped in S3 (every few hours typically). We’ve drafted up a sample script for ease of reference, which we run on Lambda. We utilize Lambda’s native ability to invoke the script whenever a new object is dropped in S3.

For cost and usage, we recommend using the DDL statement below. Since our data is in CSV format, we don’t need to use a SerDe, we can simply specify the “separatorChar, quoteChar, and escapeChar”, and the structure of the files (“TEXTFILE”). Note that AWS does have an OpenCSV SerDe as well, if you prefer to use that.

 

CREATE EXTERNAL TABLE IF NOT EXISTS cost_and_usage	 (
identity_LineItemId String,
identity_TimeInterval String,
bill_InvoiceId String,
bill_BillingEntity String,
bill_BillType String,
bill_PayerAccountId String,
bill_BillingPeriodStartDate String,
bill_BillingPeriodEndDate String,
lineItem_UsageAccountId String,
lineItem_LineItemType String,
lineItem_UsageStartDate String,
lineItem_UsageEndDate String,
lineItem_ProductCode String,
lineItem_UsageType String,
lineItem_Operation String,
lineItem_AvailabilityZone String,
lineItem_ResourceId String,
lineItem_UsageAmount String,
lineItem_NormalizationFactor String,
lineItem_NormalizedUsageAmount String,
lineItem_CurrencyCode String,
lineItem_UnblendedRate String,
lineItem_UnblendedCost String,
lineItem_BlendedRate String,
lineItem_BlendedCost String,
lineItem_LineItemDescription String,
lineItem_TaxType String,
product_ProductName String,
product_accountAssistance String,
product_architecturalReview String,
product_architectureSupport String,
product_availability String,
product_bestPractices String,
product_cacheEngine String,
product_caseSeverityresponseTimes String,
product_clockSpeed String,
product_currentGeneration String,
product_customerServiceAndCommunities String,
product_databaseEdition String,
product_databaseEngine String,
product_dedicatedEbsThroughput String,
product_deploymentOption String,
product_description String,
product_durability String,
product_ebsOptimized String,
product_ecu String,
product_endpointType String,
product_engineCode String,
product_enhancedNetworkingSupported String,
product_executionFrequency String,
product_executionLocation String,
product_feeCode String,
product_feeDescription String,
product_freeQueryTypes String,
product_freeTrial String,
product_frequencyMode String,
product_fromLocation String,
product_fromLocationType String,
product_group String,
product_groupDescription String,
product_includedServices String,
product_instanceFamily String,
product_instanceType String,
product_io String,
product_launchSupport String,
product_licenseModel String,
product_location String,
product_locationType String,
product_maxIopsBurstPerformance String,
product_maxIopsvolume String,
product_maxThroughputvolume String,
product_maxVolumeSize String,
product_maximumStorageVolume String,
product_memory String,
product_messageDeliveryFrequency String,
product_messageDeliveryOrder String,
product_minVolumeSize String,
product_minimumStorageVolume String,
product_networkPerformance String,
product_operatingSystem String,
product_operation String,
product_operationsSupport String,
product_physicalProcessor String,
product_preInstalledSw String,
product_proactiveGuidance String,
product_processorArchitecture String,
product_processorFeatures String,
product_productFamily String,
product_programmaticCaseManagement String,
product_provisioned String,
product_queueType String,
product_requestDescription String,
product_requestType String,
product_routingTarget String,
product_routingType String,
product_servicecode String,
product_sku String,
product_softwareType String,
product_storage String,
product_storageClass String,
product_storageMedia String,
product_technicalSupport String,
product_tenancy String,
product_thirdpartySoftwareSupport String,
product_toLocation String,
product_toLocationType String,
product_training String,
product_transferType String,
product_usageFamily String,
product_usagetype String,
product_vcpu String,
product_version String,
product_volumeType String,
product_whoCanOpenCases String,
pricing_LeaseContractLength String,
pricing_OfferingClass String,
pricing_PurchaseOption String,
pricing_publicOnDemandCost String,
pricing_publicOnDemandRate String,
pricing_term String,
pricing_unit String,
reservation_AvailabilityZone String,
reservation_NormalizedUnitsPerReservation String,
reservation_NumberOfReservations String,
reservation_ReservationARN String,
reservation_TotalReservedNormalizedUnits String,
reservation_TotalReservedUnits String,
reservation_UnitsPerReservation String,
resourceTags_userName String,
resourceTags_usercostcategory String  


)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'

STORED AS TEXTFILE
    LOCATION 's3://<<your bucket name>>';

Once you’ve successfully executed the command, you should see a new table named “cost_and_usage” with the below properties. Now we’re ready to start executing queries and running analysis!

Start with Looker and connect to Athena

Setting up Looker is a quick process, and you can try it out for free here (or download from Amazon Marketplace). It takes just a few seconds to connect Looker to your Athena database, and Looker comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive. After you’re connected, you can use the Looker UI to run whatever analysis you’d like. Looker translates this UI to optimized SQL, so any user can execute and visualize queries for true self-service analytics.

Major cost saving levers

Now that the data pipeline is configured, you can dive into the most popular use cases for cost savings. In this post, I focus on:

  • Purchasing Reserved Instances vs. On-Demand Instances
  • Data transfer costs
  • Allocating costs over users or other Attributes (denoted with resource tags)

On-Demand, Spot, and Reserved Instances

Purchasing Reserved Instances vs On-Demand Instances is arguably going to be the biggest cost lever for heavy AWS users (Reserved Instances run up to 75% cheaper!). AWS offers three options for purchasing instances:

  • On-Demand—Pay as you use.
  • Spot (variable cost)—Bid on spare Amazon EC2 computing capacity.
  • Reserved Instances—Pay for an instance for a specific, allotted period of time.

When purchasing a Reserved Instance, you can also choose to pay all-upfront, partial-upfront, or monthly. The more you pay upfront, the greater the discount.

If your company has been using AWS for some time now, you should have a good sense of your overall instance usage on a per-month or per-day basis. Rather than paying for these instances On-Demand, you should try to forecast the number of instances you’ll need, and reserve them with upfront payments.

The total amount of usage with Reserved Instances versus overall usage with all instances is called your coverage ratio. It’s important not to confuse your coverage ratio with your Reserved Instance utilization. Utilization represents the amount of reserved hours that were actually used. Don’t worry about exceeding capacity, you can still set up Auto Scaling preferences so that more instances get added whenever your coverage or utilization crosses a certain threshold (we often see a target of 80% for both coverage and utilization among savvy customers).

Calculating the reserved costs and coverage can be a bit tricky with the level of granularity provided by the cost and usage report. The following query shows your total cost over the last 6 months, broken out by Reserved Instance vs other instance usage. You can substitute the cost field for usage if you’d prefer. Please note that you should only have data for the time period after the cost and usage report has been enabled (though you can opt for up to 3 months of historical data by contacting your AWS Account Executive). If you’re just getting started, this query will only show a few days.

 

SELECT 
	DATE_FORMAT(from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate),'%Y-%m') AS "cost_and_usage.usage_start_month",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_ris",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_non_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_non_ris"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

The resulting table should look something like the image below (I’m surfacing tables through Looker, though the same table would result from querying via command line or any other interface).

With a BI tool, you can create dashboards for easy reference and monitoring. New data is dumped into S3 every few hours, so your dashboards can update several times per day.

It’s an iterative process to understand the appropriate number of Reserved Instances needed to meet your business needs. After you’ve properly integrated Reserved Instances into your purchasing patterns, the savings can be significant. If your coverage is consistently below 70%, you should seriously consider adjusting your purchase types and opting for more Reserved instances.

Data transfer costs

One of the great things about AWS data storage is that it’s incredibly cheap. Most charges often come from moving and processing that data. There are several different prices for transferring data, broken out largely by transfers between regions and availability zones. Transfers between regions are the most costly, followed by transfers between Availability Zones. Transfers within the same region and same availability zone are free unless using elastic or public IP addresses, in which case there is a cost. You can find more detailed information in the AWS Pricing Docs. With this in mind, there are several simple strategies for helping reduce costs.

First, since costs increase when transferring data between regions, it’s wise to ensure that as many services as possible reside within the same region. The more you can localize services to one specific region, the lower your costs will be.

Second, you should maximize the data you’re routing directly within AWS services and IP addresses. Transfers out to the open internet are the most costly and least performant mechanisms of data transfers, so it’s best to keep transfers within AWS services.

Lastly, data transfers between private IP addresses are cheaper than between elastic or public IP addresses, so utilizing private IP addresses as much as possible is the most cost-effective strategy.

The following query provides a table depicting the total costs for each AWS product, broken out transfer cost type. Substitute the “lineitem_productcode” field in the query to segment the costs by any other attribute. If you notice any unusually high spikes in cost, you’ll need to dig deeper to understand what’s driving that spike: location, volume, and so on. Drill down into specific costs by including “product_usagetype” and “product_transfertype” in your query to identify the types of transfer costs that are driving up your bill.

SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-In')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_inbound_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-Out')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_outbound_data_transfer_cost"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

When moving between regions or over the open web, many data transfer costs also include the origin and destination location of the data movement. Using a BI tool with mapping capabilities, you can get a nice visual of data flows. The point at the center of the map is used to represent external data flows over the open internet.

Analysis by tags

AWS provides the option to apply custom tags to individual resources, so you can allocate costs over whatever customized segment makes the most sense for your business. For a SaaS company that hosts software for customers on AWS, maybe you’d want to tag the size of each customer. The following query uses custom tags to display the reserved, data transfer, and total cost for each AWS service, broken out by tag categories, over the last 6 months. You’ll want to substitute the cost_and_usage.resourcetags_customersegment and cost_and_usage.customer_segment with the name of your customer field.

 

SELECT * FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY z___min_rank) as z___pivot_row_rank, RANK() OVER (PARTITION BY z__pivot_col_rank ORDER BY z___min_rank) as z__pivot_col_ordering FROM (
SELECT *, MIN(z___rank) OVER (PARTITION BY "cost_and_usage.product_code") as z___min_rank FROM (
SELECT *, RANK() OVER (ORDER BY CASE WHEN z__pivot_col_rank=1 THEN (CASE WHEN "cost_and_usage.total_unblended_cost" IS NOT NULL THEN 0 ELSE 1 END) ELSE 2 END, CASE WHEN z__pivot_col_rank=1 THEN "cost_and_usage.total_unblended_cost" ELSE NULL END DESC, "cost_and_usage.total_unblended_cost" DESC, z__pivot_col_rank, "cost_and_usage.product_code") AS z___rank FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY CASE WHEN "cost_and_usage.customer_segment" IS NULL THEN 1 ELSE 0 END, "cost_and_usage.customer_segment") AS z__pivot_col_rank FROM (
SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	cost_and_usage.resourcetags_customersegment  AS "cost_and_usage.customer_segment",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_data_transfers_unblended",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.unblended_percent_spend_on_ris"
FROM aws_optimizer.cost_and_usage_raw  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1,2) ww
) bb WHERE z__pivot_col_rank <= 16384
) aa
) xx
) zz
 WHERE z___pivot_row_rank <= 500 OR z__pivot_col_ordering = 1 ORDER BY z___pivot_row_rank

The resulting table in this example looks like the results below. In this example, you can tell that we’re making poor use of Reserved Instances because they represent such a small portion of our overall costs.

Again, using a BI tool to visualize these costs and trends over time makes the analysis much easier to consume and take action on.

Summary

Saving costs on your AWS spend is always an iterative, ongoing process. Hopefully with these queries alone, you can start to understand your spending patterns and identify opportunities for savings. However, this is just a peek into the many opportunities available through analysis of the Cost and Usage report. Each company is different, with unique needs and usage patterns. To achieve maximum cost savings, we encourage you to set up an analytics environment that enables your team to explore all potential cuts and slices of your usage data, whenever it’s necessary. Exploring different trends and spikes across regions, services, user types, etc. helps you gain comprehensive understanding of your major cost levers and consistently implement new cost reduction strategies.

Note that all of the queries and analysis provided in this post were generated using the Looker data platform. If you’re already a Looker customer, you can get all of this analysis, additional pre-configured dashboards, and much more using Looker Blocks for AWS.


About the Author

Dillon Morrison leads the Platform Ecosystem at Looker. He enjoys exploring new technologies and architecting the most efficient data solutions for the business needs of his company and their customers. In his spare time, you’ll find Dillon rock climbing in the Bay Area or nose deep in the docs of the latest AWS product release at his favorite cafe (“Arlequin in SF is unbeatable!”).

 

 

 

Showtime Seeks Injunction to Stop Mayweather v McGregor Piracy

Post Syndicated from Andy original https://torrentfreak.com/showtime-seeks-injunction-to-stop-mayweather-v-mcgregor-piracy-170816/

It’s the fight that few believed would become reality but on August 26, at the T-Mobile Arena in Las Vegas, Floyd Mayweather Jr. will duke it out with UFC lightweight champion Conor McGregor.

Despite being labeled a freak show by boxing purists, it is set to become the biggest combat sports event of all time. Mayweather, undefeated in his professional career, will face brash Irishman McGregor, who has gained a reputation for accepting fights with anyone – as long as there’s a lot of money involved. Big money is definitely the theme of the Mayweather bout.

Dubbed “The Money Fight”, some predict it could pull in a billion dollars, with McGregor pocketing $100m and Mayweather almost certainly more. Many of those lucky enough to gain entrance on the night will have spent thousands on their tickets but for the millions watching around the world….iiiiiiiit’s Showtimmme….with hefty PPV prices attached.

Of course, not everyone will be handing over $89.95 to $99.99 to watch the event officially on Showtime. Large numbers will turn to the many hundreds of websites set to stream the fight for free online, which has the potential to reduce revenues for all involved. With that in mind, Showtime Networks has filed a lawsuit in California which attempts to preemptively tackle this piracy threat.

The suit targets a number of John Does said to be behind a network of dozens of sites planning to stream the fight online for free. Defendant 1, using the alias “Kopa Mayweather”, is allegedly the operator of LiveStreamHDQ, a site that Showtime has grappled with previously.

“Plaintiff has had extensive experience trying to prevent live streaming websites from engaging in the unauthorized reproduction and distribution of Plaintiff’s copyrighted works in the past,” the lawsuit reads.

“In addition to bringing litigation, this experience includes sending cease and desist demands to LiveStreamHDQ in response to its unauthorized live streaming of the record-breaking fight between Floyd Mayweather, Jr. and Manny Pacquiao.”

Showtime says that LiveStreamHDQ is involved in the operations of at least 41 other sites that have been set up to specifically target people seeking to watch the fight without paying. Each site uses a .US ccTLD domain name.

Sample of the sites targeted by the lawsuit

Showtime informs the court that the registrant email and IP addresses of the domains overlap, which provides further proof that they’re all part of the same operation. The TV network also highlights various statements on the sites in question which demonstrate intent to show the fight without permission, including the highly dubious “Watch From Here Mayweather vs Mcgregor Live with 4k Display.”

In addition, the lawsuit is highly critical of efforts by the sites’ operator(s) to stuff the pages with fight-related keywords in order to draw in as much search engine traffic as they can.

“Plaintiff alleges that Defendants have engaged in such keyword stuffing as a form of search engine optimization in an effort to attract as much web traffic as possible in the form of Internet users searching for a way to access a live stream of the Fight,” it reads.

While site operators are expected to engage in such behavior, Showtime says that these SEO efforts have been particularly successful, obtaining high-ranking positions in major search engines for the would-be pirate sites.

For instance, Showtime says that a Google search for “Mayweather McGregor Live” results in four of the target websites appearing in the first 100 results, i.e the first 10 pages. Interestingly, however, to get that result searchers would need to put the search in quotes as shown above, since a plain search fails to turn anything up in hundreds of results.

At this stage, the important thing to note is that none of the sites are currently carrying links to the fight, because the fight is yet to happen. Nevertheless, Showtime is convinced that come fight night, all of the target websites will be populated with pirate links, accessible for free or after paying a fee. This needs to be stopped, it argues.

“Defendants’ anticipated unlawful distribution will impair the marketability and profitability of the Coverage, and interfere with Plaintiff’s own authorized distribution of the Coverage, because Defendants will provide consumers with an opportunity to view the Coverage in its entirety for free, rather than paying for the Coverage provided through Plaintiff’s authorized channels.

“This is especially true where, as here, the work at issue is live coverage of a one-time live sporting event whose outcome is unknown,” the network writes.

Showtime informs the court that it made efforts to contact the sites in question but had just a single response from an individual who claimed to be sports blogger who doesn’t offer streaming services. The undertone is one of disbelief.

In closing, Showtime demands a temporary restraining order, preliminary injunction, and permanent injunction, prohibiting the defendants from making the fight available in any way, and/or “forming new entities” in order to circumvent any subsequent court order. Compensation for suspected damages is also requested.

Showtime previously applied for and obtained a similar injunction to cover the (hugely disappointing) Mayweather v Pacquiao fight in 2015. In that case, websites were ordered to be taken down on the day before the fight.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

AWS Partner Webinar Series – August 2017

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-partner-webinar-series-august-2017/

We love bringing our customers helpful information and we have another cool series we are excited to tell you about. The AWS Partner Webinar Series is a selection of live and recorded presentations covering a broad range of topics at varying technical levels and scale. A little different from our AWS Online TechTalks, each AWS Partner Webinar is hosted by an AWS solutions architect and an AWS Competency Partner who has successfully helped customers evaluate and implement the tools, techniques, and technologies of AWS.

Check out this month’s webinars and let us know which ones you found the most helpful! All schedule times are shown in the Pacific Time (PDT) time zone.

Security Webinars

Sophos
Seeing More Clearly: ATLO Software Secures Online Training Solutions for Correctional Facilities with SophosUTM on AWS Link.
August 17th, 2017 | 10:00 AM PDT

F5
F5 on AWS: How MailControl Improved their Application Visibility and Security
August 23, 2017 | 10:00 AM PDT

Big Data Webinars

Tableau, Matillion, 47Lining, NorthBay
Unlock Insights and Reduce Costs by Modernizing Your Data Warehouse on AWS
August 22, 2017 | 10:00 AM PDT

Storage Webinars

StorReduce
How Globe Telecom does Primary Backups via StorReduce to the AWS Cloud
August 29, 2017 | 8:00 AM PDT

Commvault
Moving Forward Faster: How Monash University Automated Data Movement for 3500 Virtual Machines to AWS with Commvault
August 29, 2017 | 1:00 PM PDT

Dell EMC
Moving Forward Faster: Protect Your Workloads on AWS With Increased Scale and Performance
August 30, 2017 | 11:00 AM PDT

Druva
How Hatco Protects Against Ransomware with Druva on AWS
September 13, 2017 | 10:00 AM PDT

Thomas and Ed become a RealLifeDoodle on the ISS

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/astro-pi-reallifedoodle/

Thanks to the very talented sooperdavid, creator of some of the wonderful animations known as RealLifeDoodles, Thomas Pesquet and Astro Pi Ed have been turned into one of the cutest videos on the internet.

space pi – Create, Discover and Share Awesome GIFs on Gfycat

Watch space pi GIF by sooperdave on Gfycat. Discover more GIFS online on Gfycat

And RealLifeDoodles aaaaare?

Thanks to the power of viral video, many will be aware of the ongoing Real Life Doodle phenomenon. Wait, you’re not aware?

Oh. Well, let me explain it to you.

Taking often comical video clips, those with a know-how and skill level that outweighs my own in spades add faces and emotions to inanimate objects, creating what the social media world refers to as a Real Life Doodle. From disappointed exercise balls to cannibalistic piles of leaves, these video clips are both cute and sometimes, though thankfully not always, a little heartbreaking.

letmegofree – Create, Discover and Share Awesome GIFs on Gfycat

Watch letmegofree GIF by sooperdave on Gfycat. Discover more reallifedoodles GIFs on Gfycat

Our own RealLifeDoodle

A few months back, when Programme Manager Dave Honess, better known to many as SpaceDave, sent me these Astro Pi videos for me to upload to YouTube, a small plan hatched in my brain. For in the midst of the video, and pointed out to me by SpaceDave – “I kind of love the way he just lets the unit drop out of shot” – was the most adorable sight as poor Ed drifted off into the great unknown of the ISS. Finding that I have this odd ability to consider many inanimate objects as ‘cute’, I wanted to see whether we could turn poor Ed into a RealLifeDoodle.

Heading to the Reddit RealLifeDoodle subreddit, I sent moderator sooperdavid a private message, asking if he’d be so kind as to bring our beloved Ed to life.

Yesterday, our dream came true!

Astro Pi

Unless you’re new to the world of the Raspberry Pi blog (in which case, welcome!), you’ll probably know about the Astro Pi Challenge. But for those who are unaware, let me break it down for you.

Raspberry Pi RealLifeDoodle

In 2015, two weeks before British ESA Astronaut Tim Peake journeyed to the International Space Station, two Raspberry Pis were sent up to await his arrival. Clad in 6063-grade aluminium flight cases and fitted with their own Sense HATs and camera modules, the Astro Pis Ed and Izzy were ready to receive the winning codes from school children in the UK. The following year, this time maintained by French ESA Astronaut Thomas Pesquet, children from every ESA member country got involved to send even more code to the ISS.

Get involved

Will there be another Astro Pi Challenge? Well, I just asked SpaceDave and he didn’t say no! So why not get yourself into training now and try out some of our space-themed free resources, including our 3D-print your own Astro Pi case tutorial? You can also follow the adventures of Ed and Izzy in our brilliant Story of Astro Pi cartoons.

Raspberry Pi RealLifeDoodle

And if you’re quick, there’s still time to take part in tomorrow’s Moonhack! Check out their website for more information and help the team at Code Club Australia beat their own world record!

The post Thomas and Ed become a RealLifeDoodle on the ISS appeared first on Raspberry Pi.

Updates to GPIO Zero, the physical computing API

Post Syndicated from Ben Nuttall original https://www.raspberrypi.org/blog/gpio-zero-update/

GPIO Zero v1.4 is out now! It comes with a set of new features, including a handy pinout command line tool. To start using this newest version of the API, update your Raspbian OS now:

sudo apt update && sudo apt upgrade

Some of the things we’ve added will make it easier for you try your hand on different programming styles. In doing so you’ll build your coding skills, and will improve as a programmer. As a consequence, you’ll learn to write more complex code, which will enable you to take on advanced electronics builds. And on top of that, you can use the skills you’ll acquire in other computing projects.

GPIO Zero pinout tool

The new pinout tool

Developing GPIO Zero

Nearly two years ago, I started the GPIO Zero project as a simple wrapper around the low-level RPi.GPIO library. I wanted to create a simpler way to control GPIO-connected devices in Python, based on three years’ experience of training teachers, running workshops, and building projects. The idea grew over time, and the more we built for our Python library, the more sophisticated and powerful it became.

One of the great things about Python is that it’s a multi-paradigm programming language. You can write code in a number of different styles, according to your needs. You don’t have to write classes, but you can if you need them. There are functional programming tools available, but beginners get by without them. Importantly, the more advanced features of the language are not a barrier to entry.

Become a more advanced programmer

As a beginner to programming, you usually start by writing procedural programs, in which the flow moves from top to bottom. Then you’ll probably add loops and create your own functions. Your next step might be to start using libraries which introduce new patterns that operate in a different manner to what you’ve written before, for example threaded callbacks (event-driven programming). You might move on to object-oriented programming, extending the functionality of classes provided by other libraries, and starting to write your own classes. Occasionally, you may make use of tools created with functional programming techniques.

Five buttons in different colours

Take control of the buttons in your life

It’s much the same with GPIO Zero: you can start using it very easily, and we’ve made it simple to progress along the learning curve towards more advanced programming techniques. For example, if you want to make a push button control an LED, the easiest way to do this is via procedural programming using a while loop:

from gpiozero import LED, Button

led = LED(17)
button = Button(2)

while True:
    if button.is_pressed:
        led.on()
    else:
        led.off()

But another way to achieve the same thing is to use events:

from gpiozero import LED, Button
from signal import pause

led = LED(17)
button = Button(2)

button.when_pressed = led.on
button.when_released = led.off

pause()

You could even use a declarative approach, and set the LED’s behaviour in a single line:

from gpiozero import LED, Button
from signal import pause

led = LED(17)
button = Button(2)

led.source = button.values

pause()

You will find that using the procedural approach is a great start, but at some point you’ll hit a limit, and will have to try a different approach. The example above can be approach in several programming styles. However, if you’d like to control a wider range of devices or a more complex system, you need to carefully consider which style works best for what you want to achieve. Being able to choose the right programming style for a task is a skill in itself.

Source/values properties

So how does the led.source = button.values thing actually work?

Every GPIO Zero device has a .value property. For example, you can read a button’s state (True or False), and read or set an LED’s state (so led.value = True is the same as led.on()). Since LEDs and buttons operate with the same value set (True and False), you could say led.value = button.value. However, this only sets the LED to match the button once. If you wanted it to always match the button’s state, you’d have to use a while loop. To make things easier, we came up with a way of telling devices they’re connected: we added a .values property to all devices, and a .source to output devices. Now, a loop is no longer necessary, because this will do the job:

led.source = button.values

This is a simple approach to connecting devices using a declarative style of programming. In one single line, we declare that the LED should get its values from the button, i.e. when the button is pressed, the LED should be on. You can even mix the procedural with the declarative style: at one stage of the program, the LED could be set to match the button, while in the next stage it could just be blinking, and finally it might return back to its original state.

These additions are useful for connecting other devices as well. For example, a PWMLED (LED with variable brightness) has a value between 0 and 1, and so does a potentiometer connected via an ADC (analogue-digital converter) such as the MCP3008. The new GPIO Zero update allows you to say led.source = pot.values, and then twist the potentiometer to control the brightness of the LED.

But what if you want to do something more complex, like connect two devices with different value sets or combine multiple inputs?

We provide a set of device source tools, which allow you to process values as they flow from one device to another. They also let you send in artificial values such as random data, and you can even write your own functions to generate values to pass to a device’s source. For example, to control a motor’s speed with a potentiometer, you could use this code:

from gpiozero import Motor, MCP3008
from signal import pause

motor = Motor(20, 21)
pot = MCP3008()

motor.source = pot.values

pause()

This works, but it will only drive the motor forwards. If you wanted the potentiometer to drive it forwards and backwards, you’d use the scaled tool to scale its values to a range of -1 to 1:

from gpiozero import Motor, MCP3008
from gpiozero.tools import scaled
from signal import pause

motor = Motor(20, 21)
pot = MCP3008()

motor.source = scaled(pot.values, -1, 1)

pause()

And to separately control a robot’s left and right motor speeds with two potentiometers, you could do this:

from gpiozero import Robot, MCP3008
from signal import pause

robot = Robot(left=(2, 3), right=(4, 5))
left = MCP3008(0)
right = MCP3008(1)

robot.source = zip(left.values, right.values)

pause()

GPIO Zero and Blue Dot

Martin O’Hanlon created a Python library called Blue Dot which allows you to use your Android device to remotely control things on their Raspberry Pi. The API is very similar to GPIO Zero, and it even incorporates the value/values properties, which means you can hook it up to GPIO devices easily:

from bluedot import BlueDot
from gpiozero import LED
from signal import pause

bd = BlueDot()
led = LED(17)

led.source = bd.values

pause()

We even included a couple of Blue Dot examples in our recipes.

Make a series of binary logic gates using source/values

Read more in this source/values tutorial from The MagPi, and on the source/values documentation page.

Remote GPIO control

GPIO Zero supports multiple low-level GPIO libraries. We use RPi.GPIO by default, but you can choose to use RPIO or pigpio instead. The pigpio library supports remote connections, so you can run GPIO Zero on one Raspberry Pi to control the GPIO pins of another, or run code on a PC (running Windows, Mac, or Linux) to remotely control the pins of a Pi on the same network. You can even control two or more Pis at once!

If you’re using Raspbian on a Raspberry Pi (or a PC running our x86 Raspbian OS), you have everything you need to remotely control GPIO. If you’re on a PC running Windows, Mac, or Linux, you just need to install gpiozero and pigpio using pip. See our guide on configuring remote GPIO.

I road-tested the new pin_factory syntax at the Raspberry Jam @ Pi Towers

There are a number of different ways to use remote pins:

  • Set the default pin factory and remote IP address with environment variables:
$ GPIOZERO_PIN_FACTORY=pigpio PIGPIO_ADDR=192.168.1.2 python3 blink.py
  • Set the default pin factory in your script:
import gpiozero
from gpiozero import LED
from gpiozero.pins.pigpio import PiGPIOFactory

gpiozero.Device.pin_factory = PiGPIOFactory(host='192.168.1.2')

led = LED(17)
  • The pin_factory keyword argument allows you to use multiple Pis in the same script:
from gpiozero import LED
from gpiozero.pins.pigpio import PiGPIOFactory

factory2 = PiGPIOFactory(host='192.168.1.2')
factory3 = PiGPIOFactory(host='192.168.1.3')

local_hat = TrafficHat()
remote_hat2 = TrafficHat(pin_factory=factory2)
remote_hat3 = TrafficHat(pin_factory=factory3)

This is a really powerful feature! For more, read this remote GPIO tutorial in The MagPi, and check out the remote GPIO recipes in our documentation.

GPIO Zero on your PC

GPIO Zero doesn’t have any dependencies, so you can install it on your PC using pip. In addition to the API’s remote GPIO control, you can use its ‘mock’ pin factory on your PC. We originally created the mock pin feature for the GPIO Zero test suite, but we found that it’s really useful to be able to test GPIO Zero code works without running it on real hardware:

$ GPIOZERO_PIN_FACTORY=mock python3
>>> from gpiozero import LED
>>> led = LED(22)
>>> led.blink()
>>> led.value
True
>>> led.value
False

You can even tell pins to change state (e.g. to simulate a button being pressed) by accessing an object’s pin property:

>>> from gpiozero import LED
>>> led = LED(22)
>>> button = Button(23)
>>> led.source = button.values
>>> led.value
False
>>> button.pin.drive_low()
>>> led.value
True

You can also use the pinout command line tool if you set your pin factory to ‘mock’. It gives you a Pi 3 diagram by default, but you can supply a revision code to see information about other Pi models. For example, to use the pinout tool for the original 256MB Model B, just type pinout -r 2.

GPIO Zero documentation and resources

On the API’s website, we provide beginner recipes and advanced recipes, and we have added remote GPIO configuration including PC/Mac/Linux and Pi Zero OTG, and a section of GPIO recipes. There are also new sections on source/values, command-line tools, FAQs, Pi information and library development.

You’ll find plenty of cool projects using GPIO Zero in our learning resources. For example, you could check out the one that introduces physical computing with Python and get stuck in! We even provide a GPIO Zero cheat sheet you can download and print.

There are great GPIO Zero tutorials and projects in The MagPi magazine every month. Moreover, they also publish Simple Electronics with GPIO Zero, a book which collects a series of tutorials useful for building your knowledge of physical computing. And the best thing is, you can download it, and all magazine issues, for free!

Check out the API documentation and read more about what’s new in GPIO Zero on my blog. We have lots planned for the next release. Watch this space.

Get building!

The world of physical computing is at your fingertips! Are you feeling inspired?

If you’ve never tried your hand on physical computing, our Build a robot buggy learning resource is the perfect place to start! It’s your step-by-step guide for building a simple robot controlled with the help of GPIO Zero.

If you have a gee-whizz idea for an electronics project, do share it with us below. And if you’re currently working on a cool build and would like to show us how it’s going, pop a link to it in the comments.

The post Updates to GPIO Zero, the physical computing API appeared first on Raspberry Pi.

Introducing the GameDay Essentials Show on AWS Twitch Channel

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/game-day-essentials-show-on-twitch/

Imagine if you will, you have obtained a new position at Unicorn.Rentals, a company that specializes in LARM, Legendary Animal Rental Market. Given the chance, what child wouldn’t happily exchange anything for the temporary use of a unicorn? What parent could refuse the opportunity to make their children happy? Let’s estimate the year to be 2017 and Unicorn.Rentals continues to dominate in the animal rental market.

You are about to enter another dimension, a dimension as vast as space and as timeless as infinity. It is the middle ground between light and shadow, between science and superstition, and lies at the beginning of man’s cloud knowledge. This is a journey into a wondrous land of imagination, a land of both shadow and substance. You are crossing over into the GameDay Essentials Zone.

Well, maybe not another dimension but almost as cool. Maybe, kinda? Either way, I am very excited to introduce the newest show on the AWS Twitch Channel named GameDay Essentials. The GameDay Essentials show is a  “new hire training program” for the aforementioned Unicorn.Rentals company scenario. You will step into the shoes of a new employee being ramped up and trained on cloud computing in order to work successfully for a company using Amazon Web Services.

 

With the GameDay Essentials show, you will get hands-on computing experience to help with the growth of the Unicorn.Rentals startup. The first episode, Recon, premiered on July 25th and provided information on logging services with CloudTrail and Cloudwatch, as well as, how to assess the configuration and identify existing inventory resources in an AWS Account. You can check out the recording of Episode 1–Recon here. The rest of season one for this six-part series airs on Tuesdays at 11:30 AM PT, the next three episodes discussing the following topics:

  • Episode 2 – Scaling: Learn how to scale your application infrastructure by diving into the how to of implementing scaling techniques and auto scaling groups. Airing on August 1 
  • Episode 3 – Changes: Winston Churchill is quoted saying “To improve is to change; to be perfect is to change often”. This GameDay episode is all about managing change as a key component to success. You will learn how to use native AWS security and deployment tools to track and manage change and discuss how to handle changes in team dynamics. Airing on August 8th
  • Episode 4 – Decoupling: Most people in the technology industry understand that you should avoid creating tightly coupled systems. Therefore, you will discover how loosely coupled systems operate and gain knowledge on how to diagnose any failures that may occur with these systems. Airing on August 15th 

Summary

Our latest show, GameDay Essentials is designed to help you “get into the game” and learn more about cloud computing and the AWS Platform. GameDay Essentials joins our other live coding shows already featured each week on the AWS Twitch Channel: Live Coding with AWS and AWS Maker Studio.

Tune in each week to the AWS Twitch channel to visit another dimension: a dimension of sound, a dimension of sight, a dimension of cloud. This is the dimension of imagination. It is an area, which we call the GameDay Essentials Zone. Get it, like the Twilight Zone, still no? Oh well, check out the GameDay Essentials show on Twitch on the AWS Channel, it is a great resource for interactive learning about cloud computing with AWS, so enjoy the ride.

Tara

TVAddons Returns, But in Ugly War With Canadian Telcos Over Kodi Addons

Post Syndicated from Andy original https://torrentfreak.com/tvaddons-returns-ugly-war-canadian-telcos-kodi-addons-170801/

After Dish Network filed a lawsuit against TVAddons in Texas, several high-profile Kodi addons took the decision to shut down. Soon after, TVAddons itself went offline.

In the weeks that followed, several TVAddons-related domains were signed over (1,2) to a Canadian law firm, a mysterious situation that didn’t dovetail well with the US-based legal action.

TorrentFreak can now reveal that the shutdown of TVAddons had nothing to do with the US action and everything to do with a separate lawsuit filed in Canada.

The complaint against TVAddons

Two months ago on June 2, a collection of Canadian telecoms giants including Bell Canada, Bell ExpressVu, Bell Media, Videotron, Groupe TVA, Rogers Communications and Rogers Media, filed a complaint in Federal Court against Montreal resident, Adam Lackman, the man behind TVAddons.

The 18-page complaint details the plaintiffs’ case against Lackman, claiming that he communicated copyrighted TV shows including Game of Thrones, Prison Break, The Big Bang Theory, America’s Got Talent, Keeping Up With The Kardashians and dozens more, to the public in breach of copyright.

The key claim is that Lackman achieved this by developing, hosting, distributing or promoting Kodi add-ons.

Adam Lackman, the man behind TVAddons (@adam.lackman on Instagram)

A total of 18 major add-ons are detailed in the complaint including 1Channel, Exodus, Phoenix, Stream All The Sources, SportsDevil, cCloudTV and Alluc, to name a few. Also under the spotlight is the ‘FreeTelly’ custom Kodi build distributed by TVAddons alongside its Kodi configuration tool, Indigo.

“[The defendant] has made the [TV shows] available to the public by telecommunication in a way that allows members of the public to have access to them from a place and at a time individually chosen by them…consequently infringing the Plaintiffs’ copyright…in contravention of sections 2.4(1.1), 3(1)(f) and 27(1) of the Copyright Act,” the complaint reads.

The complaint alleges that Lackman “induced and/or authorized users” of the FreeTelly and Indigo tools to carry out infringement by his handling and promotion of infringing add-ons, including through TVAddons.ag and Offshoregit.com, in contravention of sections 3(1)(f) and 27(1) of the Copyright Act.

“Approximately 40 million unique users located around the world are actively using Infringing Addons hosted by TVAddons every month, and approximately 900,000 Canadian households use Infringing Add-ons to access television content. The amount of users of Infringing add-ons hosted TVAddons is constantly increasing,” the complaint adds.

To limit the harm allegedly caused by TVAddons, the complaint asked for interim, interlocutory, and permanent injunctions restraining Lackman and associates from developing, promoting or distributing any of the allegedly infringing add-ons or software. On top, the plaintiffs requested punitive and exemplary damages, plus costs.

The interim injunction and Anton Piller Order

Following the filing of the complaint, on June 9 the Federal Court handed down a time-limited interim injunction against Lackman which restrained him from various activities in respect of TVAddons. The process took place ex parte, meaning in secret, without Lackman being able to mount a defense.

The Court also authorized a bailiff and computer forensics experts to take control of Internet domains including TVAddons.ag and Offshoregit.com plus social media and hosting provider accounts for a period of 14 days. These were transferred to Daniel Drapeau at DrapeauLex, an independent court-appointed supervising counsel.

The order also contained an Anton Piller order, a civil search warrant that grants plaintiffs no-notice permission to enter a defendant’s premises in order to secure and copy evidence to support their case, before it can be destroyed or tampered with.

The order covered not only data related to the TVAddons platform, such as operating and financial details, revenues, and banking information, but everything in Lackman’s possession.

The Court ordered the telecoms companies to inform Lackman that the case against him is a civil proceeding and that he could deny entry to his property if he wished. However, that option would put him in breach of the order and would place him at risk of being fined or even imprisoned. Catch 22 springs to mind.

The Court did, however, put limits on the number of people that could be present during the execution of the Anton Piller order (ostensibly to avoid intimidation) and ordered the plaintiffs to deposit CAD$50,000 with the Court, in case the order was improperly executed. That decision would later prove an important one.

The search and interrogation of TVAddons’ operator

On June 12, the order was executed and Lackman’s premises were searched for more than 16 hours. For nine hours he was interrogated and effectively denied his right to remain silent since non-cooperation with an Anton Piller order amounts to contempt of court. The Court’s stated aim of not intimidating Lackman failed.

The TVAddons operator informs TorrentFreak that he heard a disturbance in the hallway outside and spotted several men hiding on the other side of the door. Fearing for his life, Lackman called the police and when they arrived he opened the door. At this point, the police were told by those in attendance to leave, despite Lackman’s protests.

Once inside, Lackman was told he had an hour to find a lawyer, but couldn’t use any electronic device to get one. Throughout the entire day, Lackman says he was reminded by the plaintiffs’ lawyer that he could be held in contempt of court and jailed, even though he was always cooperating.

“I had to sit there and not leave their sight. I was denied access to medication,” Lackman told TorrentFreak. “I had a doctor’s appointment I was forced to miss. I wasn’t even allowed to call and cancel.”

In papers later filed with the court by Lackman’s team, the Anton Piller order was described as a “bombe atomique” since TVAddons had never been served with so much as a copyright takedown notice in advance of this action.

The Anton Piller controversy

Anton Piller orders are only valid when passing a three-step test: when there is a strong prima facie case against the respondent, the damage – potential or actual – is serious for the applicant, and when there is a real possibility that evidence could be destroyed.

For Bell Canada, Bell ExpressVu, Bell Media, Videotron, Groupe TVA, Rogers Communications and Rogers Media, serious problems emerged on at least two of these points after the execution of the order.

For example, TVAddons carried more than 1,500 add-ons yet only 1% of those add-ons were considered to be infringing, a tiny number in the overall picture. Then there was the not insignificant problem with the exchange that took place during the hearing to obtain the order, during which Lackman was not present.

Clearly, the securing of existing evidence wasn’t the number one priority.

Plaintiffs: We want to destroy TVAddons

And the problems continued.

No right to remain silent, no right to consult a lawyer

The Anton Piller search should have been carried out between 8am and 8pm but actually carried on until midnight. As previously mentioned, Adam Lackman was effectively denied his right to remain silent and was forbidden from getting advice from his lawyer.

None of this sat well with the Honourable B. Richard Bell during a subsequent Federal Court hearing to consider the execution of the Anton Piller order.

“It is important to note that the Defendant was not permitted to refuse to answer questions under fear of contempt proceedings, and his counsel was not permitted to clarify the answers to questions. I conclude unhesitatingly that the Defendant was subjected to an examination for discovery without any of the protections normally afforded to litigants in such circumstances,” the Judge said.

“Here, I would add that the ‘questions’ were not really questions at all. They took the form of orders or directions. For example, the Defendant was told to ‘provide to the bailiff’ or ‘disclose to the Plaintiffs’ solicitors’.”

Evidence preservation? More like a fishing trip

But shockingly, the interrogation of Lackman went much, much further. TorrentFreak understands that the TVAddons operator was given a list of 30 names of people that might be operating sites or services similar to TVAddons. He was then ordered to provide all of the information he had on those individuals.

Of course, people tend to guard their online identities so it’s possible that the information provided by Lackman will be of limited use, but Judge Bell was not happy that the Anton Piller order was abused by the plaintiffs in this way.

“I conclude that those questions, posed by Plaintiffs’ counsel, were solely made in furtherance of their investigation and constituted a hunt for further evidence, as opposed to the preservation of then existing evidence,” he wrote in a June 29 order.

But he was only just getting started.

Plaintiffs unlawfully tried to destroy TVAddons before trial

The Judge went on to note that from their own mouths, the Anton Piller order was purposely designed by the plaintiffs to completely shut down TVAddons, despite the fact that only a tiny proportion of the add-ons available on the site were allegedly used to infringe copyright.

“I am of the view that [the order’s] true purpose was to destroy the livelihood of the Defendant, deny him the financial resources to finance a defense to the claim made against him, and to provide an opportunity for discovery of the Defendant in circumstances where none of the procedural safeguards of our civil justice system could be engaged,” Judge Bell wrote.

As noted, plaintiffs must also have a “strong prima facie case” to obtain an Anton Piller order but Judge Bell says he’s not convinced that one exists. Instead, he praised the “forthright manner” of Lackman, who successfully compared the ability of Kodi addons to find content in the same way as Google search can.

So why the big turn around?

Judge Bell said that while the prima facie case may have appeared strong before the judge who heard the matter ex parte (without Lackman being present to defend himself), the subsequent adversarial hearing undermined it, to the point that it no longer met the threshold.

As a result of these failings, Judge Bell declared the Anton Piller order unlawful. Things didn’t improve for the plaintiffs on the injunction front either.

The Judge said that he believes that Lackman has “an arguable case” that he is not violating the Copyright Act by merely providing addons and that TVAddons is his only source of income. So, if an injunction to close the site was granted, the litigation would effectively be over, since the plaintiffs already admitted that their aim was to neutralize the platform.

If the platform was neutralized, Lackman could no longer earn money from the site, which would harm his ability to mount a defense.

“In considering the balance of convenience, I also repeat that the plaintiffs admit that the vast majority of add-ons are non-infringing. Whether the remaining approximately 1% are infringing is very much up for debate. For these reasons, I find the balance of convenience favors the defendant, and no interlocutory injunction will be issued,” the Judge declared.

With the Anton Piller order declared unlawful and no interlocutory injunction (one effective until the final determination of the case) handed down, things were about to get worse for the telecoms companies.

They had paid CAD$50,000 to the court in security in case things went wrong with the Anton Piller order, so TVAddons was entitled to compensation from that amount. That would be helpful, since at this point TVAddons had already run up CAD$75,000 in legal expenses.

On top, the Judge told independent counsel to give everything seized during the Anton Piller search back to Lackman.

The order to return items previously seized

But things were far from over. Within days, the telecoms companies took the decision to the Court of Appeal, asking for a stay of execution (a delay in carrying out a court order) to retain possession of items seized, including physical property, domains, and social media accounts.

Mid-July the appeal was granted and certain confidentiality clauses affecting independent counsel (including Daniel Drapeau, who holds the TVAddons’ domains) were ordered to be continued. However, considering the problems with the execution of the Anton Piller order, Bell Canada, TVA, Videotron and Rogers et al, were ordered to submit an additional security bond of CAD$140,000, on top of the CAD$50,000 already deposited.

So the battle continues, and continue it will

Speaking with TorrentFreak, Adam Lackman says that he has no choice but to fight the telcoms companies since not doing so would result in a loss by default judgment. Interestingly, both he and one of the judges involved in the case thus far believe he has an arguable case.

Lackman says that his activities are protected under the Canadian Copyright Act, specifically subparagraph 2.4(1)(b) which states as follows:

A person whose only act in respect of the communication of a work or other subject-matter to the public consists of providing the means of telecommunication necessary for another person to so communicate the work or other subject-matter does not communicate that work or other subject-matter to the public;

Of course, finding out whether that’s indeed the case will be a costly endeavor.

“It all comes down to whether we will have the financial resources necessary to mount our defense and go to trial. We won’t have ad revenue coming in, since losing our domain names means that we’ll lose the majority of our traffic for quite some time into the future,” Lackman told TF in a statement.

“We’re hoping that others will be as concerned as us about big companies manipulating the law in order to shut down what they see as competition. We desperately need help in financially supporting our legal defense, we cannot do it alone.

“We’ve run up a legal bill of over $100,000 to date. We’re David, and they are four Goliaths with practically unlimited resources. If we lose, it will mean that new case law is made, case law that could mean increased censorship of the internet.”

In the hope of getting support, TVAddons has launched a fundraiser campaign and in the meantime, a new version of the site is back on a new domain, TVAddons.co.

Given TVAddons’ line of defense, the nature of both the platform and Kodi addons, and the fact that there has already been a serious abuse of process during evidence preservation, this is now one of the most interesting and potentially influential copyright cases underway anywhere today.

TVAddons is being represented by Éva Richard , Hilal Ayoubi and Karim Renno in Canada, plus Erin Russell and Jason Sweet in the United States.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

AWS Hot Startups – July 2017

Post Syndicated from Tina Barr original https://aws.amazon.com/blogs/aws/aws-hot-startups-july-2017/

Welcome back to another month of Hot Startups! Every day, startups are creating innovative and exciting businesses, applications, and products around the world. Each month we feature a handful of startups doing cool things using AWS.

July is all about learning! These companies are focused on providing access to tools and resources to expand knowledge and skills in different ways.

This month’s startups:

  • CodeHS – provides fun and accessible computer science curriculum for middle and high schools.
  • Insight – offers intensive fellowships to grow technical talent in Data Science.
  • iTranslate – enables people to read, write, and speak in over 90 languages, anywhere in the world.

CodeHS (San Francisco, CA)

In 2012, Stanford students Zach Galant and Jeremy Keeshin were computer science majors and TAs for introductory classes when they noticed a trend among their peers. Many wished that they had been exposed to computer science earlier in life. In their senior year, Zach and Jeremy launched CodeHS to give middle and high schools the opportunity to provide a fun, accessible computer science education to students everywhere. CodeHS is a web-based curriculum pathway complete with teacher resources, lesson plans, and professional development opportunities. The curriculum is supplemented with time-saving teacher tools to help with lesson planning, grading and reviewing student code, and managing their classroom.

CodeHS aspires to empower all students to meaningfully impact the future, and believe that coding is becoming a new foundational skill, along with reading and writing, that allows students to further explore any interest or area of study. At the time CodeHS was founded in 2012, only 10% of high schools in America offered a computer science course. Zach and Jeremy set out to change that by providing a solution that made it easy for schools and districts to get started. With CodeHS, thousands of teachers have been trained and are teaching hundreds of thousands of students all over the world. To use CodeHS, all that’s needed is the internet and a web browser. Students can write and run their code online, and teachers can immediately see what the students are working on and how they are doing.

Amazon EC2, Amazon RDS, Amazon ElastiCache, Amazon CloudFront, and Amazon S3 make it possible for CodeHS to scale their site to meet the needs of schools all over the world. CodeHS also relies on AWS to compile and run student code in the browser, which is extremely important when teaching server-side languages like Java that powers the AP course. Since usage rises and falls based on school schedules, Amazon CloudWatch and ELBs are used to easily scale up when students are running code so they have a seamless experience.

Be sure to visit the CodeHS website, and to learn more about bringing computer science to your school, click here!

Insight (Palo Alto, CA)

Insight was founded in 2012 to create a new educational model, optimize hiring for data teams, and facilitate successful career transitions among data professionals. Over the last 5 years, Insight has kept ahead of market trends and launched a series of professional training fellowships including Data Science, Health Data Science, Data Engineering, and Artificial Intelligence. Finding individuals with the right skill set, background, and culture fit is a challenge for big companies and startups alike, and Insight is focused on developing top talent through intensive 7-week fellowships. To date, Insight has over 1,000 alumni at over 350 companies including Amazon, Google, Netflix, Twitter, and The New York Times.

The Data Engineering team at Insight is well-versed in the current ecosystem of open source tools and technologies and provides mentorship on the best practices in this space. The technical teams are continually working with external groups in a variety of data advisory and mentorship capacities, but the majority of Insight partners participate in professional sessions. Companies visit the Insight office to speak with fellows in an informal setting and provide details on the type of work they are doing and how their teams are growing. These sessions have proved invaluable as fellows experience a significantly better interview process and companies yield engaged and enthusiastic new team members.

An important aspect of Insight’s fellowships is the opportunity for hands-on work, focusing on everything from building big-data pipelines to contributing novel features to industry-standard open source efforts. Insight provides free AWS resources for all fellows to use, in addition to mentorships from the Data Engineering team. Fellows regularly utilize Amazon S3, Amazon EC2, Amazon Kinesis, Amazon EMR, AWS Lambda, Amazon Redshift, Amazon RDS, among other services. The experience with AWS gives fellows a solid skill set as they transition into the industry. Fellowships are currently being offered in Boston, New York, Seattle, and the Bay Area.

Check out the Insight blog for more information on trends in data infrastructure, artificial intelligence, and cutting-edge data products.

 

iTranslate (Austria)

When the App Store was introduced in 2008, the founders of iTranslate saw an opportunity to be part of something big. The group of four fully believed that the iPhone and apps were going to change the world, and together they brainstormed ideas for their own app. The combination of translation and mobile devices seemed a natural fit, and by 2009 iTranslate was born. iTranslate’s mission is to enable travelers, students, business professionals, employers, and medical staff to read, write, and speak in all languages, anywhere in the world. The app allows users to translate text, voice, websites and more into nearly 100 languages on various platforms. Today, iTranslate is the leading player for conversational translation and dictionary apps, with more than 60 million downloads and 6 million monthly active users.

iTranslate is breaking language barriers through disruptive technology and innovation, enabling people to translate in real time. The app has a variety of features designed to optimize productivity including offline translation, website and voice translation, and language auto detection. iTranslate also recently launched the world’s first ear translation device in collaboration with Bragi, a company focused on smart earphones. The Dash Pro allows people to communicate freely, while having a personal translator right in their ear.

iTranslate started using Amazon Polly soon after it was announced. CEO Alexander Marktl said, “As the leading translation and dictionary app, it is our mission at iTranslate to provide our users with the best possible tools to read, write, and speak in all languages across the globe. Amazon Polly provides us with the ability to efficiently produce and use high quality, natural sounding synthesized speech.” The stable and simple-to-use API, low latency, and free caching allow iTranslate to scale as they continue adding features to their app. Customers also enjoy the option to change speech rate and change between male and female voices. To assure quality, speed, and reliability of their products, iTranslate also uses Amazon EC2, Amazon S3, and Amazon Route 53.

To get started with iTranslate, visit their website here.

—–

Thanks for reading!

-Tina

Now Available: Three New AWS Specialty Training Courses

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-available-three-new-aws-specialty-training-courses/

AWS Training allows you to learn from the experts so you can advance your knowledge with practical skills and get more out of the AWS Cloud. Today I am happy to announce that three of our most popular training bootcamps (a staple at AWS re:Invent and AWS Global Summits) are becoming part of our permanent instructor-led training portfolio:

These one-day courses are intended for individuals who would like to dive deeper into a specialized topic with an expert trainer.

You can explore our complete course catalog, and you can search for a public class near you within the AWS Training and Certification Portal. You can also request a private onsite training session for your team by contacting us.

Jeff;