Tag Archives: bias

timeShift(GrafanaBuzz, 1w) Issue 18

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/20/timeshiftgrafanabuzz-1w-issue-18/

Welcome to another issue of timeShift. This week we released Grafana 4.6.0-beta2, which includes some fixes for alerts, annotations, the Cloudwatch data source, and a few panel updates. We’re also gearing up for Oredev, one of the biggest tech conferences in Scandinavia, November 7-10. In addition to sponsoring, our very own Carl Bergquist will be presenting “Monitoring for everyone.” Hope to see you there – swing by our booth and say hi!


Latest Release

Grafana 4.6-beta-2 is now available! Grafana 4.6.0-beta2 adds fixes for:

  • ColorPicker display
  • Alerting test
  • Cloudwatch improvements
  • CSV export
  • Text panel enhancements
  • Annotation fix for MySQL

To see more details on what’s in the newest version, please see the release notes.

Download Grafana 4.6.0-beta-2 Now


From the Blogosphere

Screeps and Grafana: Graphing your AI: If you’re unfamiliar with Screeps, it’s a MMO RTS game for programmers, where the objective is to grow your colony through programming your units’ AI. You control your colony by writing JavaScript, which operates 247 in the single persistent real-time world filled by other players. This article walks you through graphing all your game stats with Grafana.

ntopng Grafana Integration: The Beauty of Data Visualization: Our friends at ntop created a tutorial so that you can graph ntop monitoring data in Grafana. He goes through the metrics exposed, configuring the ntopng Data Source plugin, and building your first dashboard. They’ve also created a nice video tutorial of the process.

Installing Graphite and Grafana to Display the Graphs of Centreon: This article, provides a step-by-step guide to getting your Centreon data into Graphite and visualizing the data in Grafana.

Bit v. Byte Episode 3 – Metrics for the Win: Bit v. Byte is a new weekly Podcast about the web industry, tools and techniques upcoming and in use today. This episode dives into metrics, and discusses Grafana, Prometheus and NGINX Amplify.

Code-Quickie: Visualize heating with Grafana: With the winter weather coming, Reinhard wanted to monitor the stats in his boiler room. This article covers not only the visualization of the data, but the different devices and sensors you can use to can use in your own home.

RuuviTag with C.H.I.P – BLE – Node-RED: Following the temperature-monitoring theme from the last article, Tobias writes about his journey of hooking up his new RuuviTag to Grafana to measure temperature, relative humidity, air pressure and more.


Early Bird will be Ending Soon

Early bird discounts will be ending soon, but you still have a few days to lock in the lower price. We will be closing early bird on October 31, so don’t wait until the last minute to take advantage of the discounted tickets!

Also, there’s still time to submit your talk. We’ll accept submissions through the end of October. We’re looking for technical and non-technical talks of all sizes. Submit a CFP now.

Get Your Early Bird Ticket Now


Grafana Plugins

This week we have updates to two panels and a brand new panel that can add some animation to your dashboards. Installing plugins in Grafana is easy; for on-prem Grafana, use the Grafana-cli tool, or with 1 click if you are using Hosted Grafana.

NEW PLUGIN

Geoloop Panel – The Geoloop panel is a simple visualizer for joining GeoJSON to Time Series data, and animating the geo features in a loop. An example of using the panel would be showing the rate of rainfall during a 5-hour storm.

Install Now

UPDATED PLUGIN

Breadcrumb Panel – This plugin keeps track of dashboards you have visited within one session and displays them as a breadcrumb. The latest update fixes some issues with back navigation and url query params.

Update

UPDATED PLUGIN

Influx Admin Panel – The Influx Admin panel duplicates features from the now deprecated Web Admin Interface for InfluxDB and has lots of features like letting you see the currently running queries, which can also be easily killed.

Changes in the latest release:

  • Converted to typescript project based on typescript-template-datasource
  • Select Databases. This only works with PR#8096
  • Added time format options
  • Show tags from response
  • Support template variables in the query

Update


Contribution of the week:

Each week we highlight some of the important contributions from our amazing open source community. Thank you for helping make Grafana better!

The Stockholm Go Meetup had a hackathon this week and sent a PR for letting whitelisted cookies pass through the Grafana proxy. Thanks to everyone who worked on this PR!


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

This is awesome – we can’t get enough of these public dashboards!

We Need Your Help!

Do you have a graph that you love because the data is beautiful or because the graph provides interesting information? Please get in touch. Tweet or send us an email with a screenshot, and we’ll tell you about this fun experiment.

Tell Me More


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

Please tell us how we’re doing. Submit a comment on this article below, or post something at our community forum. Help us make these weekly roundups better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

We Are Not Having a Productive Debate About Women in Tech

Post Syndicated from Bozho original https://techblog.bozho.net/not-productive-debate-women-tech/

Yes, it’s about the “anti-diversity memo”. But I won’t go into particular details of the memo, the firing, who’s right and wrong, who’s liberal and who’s conservative. Actually, I don’t need to repeat this post, which states almost exactly what I think about the particular issue. Just in case, and before someone decided to label me as “sexist white male” that knows nothing, I guess should clearly state that I acknowledge that biases against women are real and that I strongly support equal opportunity, and I think there must be more women in technology. I also have to state that I think the author of “the memo” was well-meaning, had some well argued, research-backed points and should not be ostracized.

But I want to “rant” about the quality of the debate. On one side we have conservatives who are throwing themselves in defense of the fired googler, insisting that liberals are banning conservative points of view, that it is normal to have so few woman in tech and that everything is actually okay, or even that women are inferior. On the other side we have triggered liberals that are ready to shout “discrimination” and “harassment” at anything that resembles an attempt to claim anything different than total and absolute equality, in many cases using a classical “strawman” argument (e.g. “he’s saying women should not work in tech, he’s obviously wrong”).

Everyone seems to be too eager to take side and issue a verdict on who’s right and who’s wrong, to blame the other side for all related and unrelated woes and while doing that, exhibit a huge amount of biases. If the debate is about that, we’d better shut it down as soon as possible, as it’s not going to lead anywhere. No matter how much conservatives want “a debate”, and no matter how much liberals want to advance equality. Oh, and by the way – this “conservatives” vs “liberals” is a false dichotomy. Most people hold a somewhat sensible stance in between. But let’s get to the actual issue:

Women are underrepresented in STEM (Science, technology, engineering, mathematics). That is a fact everyone agrees on and is blatantly obvious when you walk in any software company office.

Why is that the case? The whole debate revolved around biological and social differences, some of which are probably even true – that women value job flexibility more than being promoted or getting higher salary, that they are more neurotic (on average), that they are less confident, that they are more empathic and so on. These difference have been studied and documented, and as much as I have my reservations about psychology studies (so much so, that even meta-analysis are shown by meta-meta-analysis to be flawed) and social science in general, there seems to be a consensus there (by the way, it’s a shame that Gizmodo removed all the scientific references when they first published “the memo”). But that is not the issue. As it has been pointed out, there’s equal applicability of male and female “inherent” traits when working with technology.

Why are we talking about “techonology”, and why not “mining and construction”, as many will point out. Let’s cut that argument once and for all – mining and construction are blue collar jobs that have a high chance of being automated in the near future and are in decline. The problem that we’re trying to solve is – how to make the dominant profession of the future – information technology – one of equal opportunity. Yes, it’s a a bold claim, but software is going to be everywhere and the industry will grow. This is why it’s so important to discuss it, not because we are developers and we are somewhat affected by that.

So, there has been extended research on the matter, and the reasons are – surprise – complex and intertwined and there is no simple issue that, once resolved, will unlock the path of women to tech jobs.

What would diversity give us and why should we care? Let’s assume for a moment we don’t care about equal opportunity and we are right-leaning, conservative people. Well, imagine you have a growing business and you need to hire developers. What would you prefer – having fewer or more people of whom to choose from? Having fewer or more diverse skills (technical and social) on the job market? The answer is obvious. The more people, regardless of their gender, race, whatever, are on the job market, the better for businesses.

So I guess we’ve agreed on the two points so far – that women are underrepresented, and that it’s better for everyone if there are more people with technical skills on the job market, which includes more women.

The “final” questions is – how?

And this questions seems to not be anywhere in the discussion. Instead, we are going in circles with irrelevant arguments trying to either show that we’ve read more scientific papers than others, that we are more liberal than others or that we are more pro free speech.

Back to “how” – in Bulgaria we have a social meme: “I don’t know what is the right way, but the way you are doing it is NOT the right way”. And much of the underlying sentiment of “the memo” is similar – that google should stop doing some of the stuff it is doing about diversity, or do them differently (but doesn’t tell us how exactly). Hiring biases, internal programs, whatever, seem to bother him. But this is just talking about the surface of the problem. These programs are correcting something that remains hidden in “the memo”.

Google, on their diversity page, say that 20% of their tech employees are women. At the same time, in another diversity section, they claim “18% of CS graduates are women”. So, I guess, job done – they’ve reached the maximum possible diversity. They’ve hired as many women in tech as CS graduates there are. Anything more than that, even if it doesn’t mean they’ll hire worse developers, will leave the rest of the industry with less women. So, sure, 50/50 in Google would sound cool, but the industry average will still be bad.

And that’s the actual, underlying reason that we should have already arrived at, and we should’ve started discussing the “how”. Girls do not see STEM as a thing for them. Our biases are projected on younger girls which culminate at a “this is not for girls” mantra. No matter how diverse hiring policies we have, if we don’t address the issue at a way earlier stage, we aren’t getting anywhere.

In schools and even kindergartens we need to have an inclusive environment where “this is not for girls” is frowned upon. We should not discourage girls from liking math, or making math sound uncool and “hard for girls” (in my biased world I actually know more women mathematicians than men). This comic seems like on a different topic (gender-specific toys), but it’s actually not about toys – it’s about what is considered (stereo)typical of a girl to do. And most of these biases are unconscious, and come from all around us (school, TV, outdoor ads, people on the street, relatives, etc.), and it takes effort to confront them.

To do that, we need policy decisions. We need lobbying education departments / ministries to encourage girls more in the STEM direction (and don’t worry, they’ll be good at it). By the way, guess what – Google’s diversity program is not just about hiring more women, it actually includes education policies with stuff like “influencing perception about computer science”, “getting more girls to code” and scholarships.

Let’s discuss the education policies, the path to getting 40-50% of CS graduates to be female, and before that – more girls in schools with technical focus, and ultimately – how to get society to not perceive technology and science as “not for girls”. Let each girl decide on her own. All the other debates are short-sighted and not to the point at all. Will biological differences matter then? They probably will – but not significantly to justify a high gender imbalance.

I am no expert in education policies and I don’t know what will work and what won’t. There is research on the matter that we should look at, and maybe argue about it. Everything else is wasted keystrokes.

The post We Are Not Having a Productive Debate About Women in Tech appeared first on Bozho's tech blog.

Piracy Narrative Isn’t About Ethics Anymore, It’s About “Danger”

Post Syndicated from Andy original https://torrentfreak.com/piracy-narrative-isnt-about-ethics-anymore-its-about-danger-170812/

Over the years there have been almost endless attempts to stop people from accessing copyright-infringing content online. Campaigns have come and gone and almost two decades later the battle is still ongoing.

Early on, when panic enveloped the music industry, the campaigns centered around people getting sued. Grabbing music online for free could be costly, the industry warned, while parading the heads of a few victims on pikes for the world to see.

Periodically, however, the aim has been to appeal to the public’s better nature. The idea is that people essentially want to do the ‘right thing’, so once they understand that largely hard-working Americans are losing their livelihoods, people will stop downloading from The Pirate Bay. For some, this probably had the desired effect but millions of people are still getting their fixes for free, so the job isn’t finished yet.

In more recent years, notably since the MPAA and RIAA had their eyes blacked in the wake of SOPA, the tone has shifted. In addition to educating the public, torrent and streaming sites are increasingly being painted as enemies of the public they claim to serve.

Several studies, largely carried out on behalf of the Digital Citizens Alliance (DCA), have claimed that pirate sites are hotbeds of malware, baiting consumers in with tasty pirate booty only to offload trojans, viruses, and God-knows-what. These reports have been ostensibly published as independent public interest documents but this week an advisor to the DCA suggested a deeper interest for the industry.

Hemanshu Nigam is a former federal prosecutor, ex-Chief Security Officer for News Corp and Fox Interactive Media, and former VP Worldwide Internet Enforcement at the MPAA. In an interview with Deadline this week, he spoke about alleged links between pirate sites and malware distributors. He also indicated that warning people about the dangers of pirate sites has become Hollywood’s latest anti-piracy strategy.

“The industry narrative has changed. When I was at the MPAA, we would tell people that stealing content is wrong and young people would say, yeah, whatever, you guys make a lot of money, too bad,” he told the publication.

“It has gone from an ethical discussion to a dangerous one. Now, your parents’ bank account can be raided, your teenage daughter can be spied on in her bedroom and extorted with the footage, or your computer can be locked up along with everything in it and held for ransom.”

Nigam’s stance isn’t really a surprise since he’s currently working for the Digital Citizens Alliance as an advisor. In turn, the Alliance is at least partly financed by the MPAA. There’s no suggestion whatsoever that Nigam is involved in any propaganda effort, but recent signs suggest that the DCA’s work in malware awareness is more about directing people away from pirate sites than protecting them from the alleged dangers within.

That being said and despite the bias, it’s still worth giving experts like Nigam an opportunity to speak. Largely thanks to industry efforts with brands, pirate sites are increasingly being forced to display lower-tier ads, which can be problematic. On top, some sites’ policies mean they don’t deserve any visitors at all.

In the Deadline piece, however, Nigam alleges that hackers have previously reached out to pirate websites offering $200 to $5000 per day “depending on the size of the pirate website” to have the site infect users with malware. If true, that’s a serious situation and people who would ordinarily use ‘pirate’ sites would definitely appreciate the details.

For example, to which sites did hackers make this offer and, crucially, which sites turned down the offer and which ones accepted?

It’s important to remember that pirates are just another type of consumer and they would boycott sites in a heartbeat if they discovered they’d been paid to infect them with malware. But, as usual, the claims are extremely light in detail. Instead, there’s simply a blanket warning to stay away from all unauthorized sites, which isn’t particularly helpful.

In some cases, of course, operational security will prevent some details coming to light but without these, people who don’t get infected on a ‘pirate’ site (the vast majority) simply won’t believe the allegations. As the author of the Deadline piece pointed out, it’s a bit like Reefer Madness all over again.

The point here is that without hard independent evidence to back up these claims, with reports listing sites alongside the malware they’ve supposed to have spread and when, few people will respond to perceived scaremongering. Free content trumps a few distant worries almost every time, whether that involves malware or the threat of a lawsuit.

It’ll be up to the DCA and their MPAA paymasters to consider whether the approach is working but thus far, not even having government heavyweights on board has helped.

Earlier this year the DCA launched a video campaign, enrolling 15 attorney generals to publish their own anti-piracy PSAs on YouTube. Thus far, interest has been minimal, to say the least.

At the time of writing the 15 PSAs have 3,986 views in total, with 2,441 of those contributed by a single video contributed by Wisconsin Attorney General Brad Schimel. Despite the relative success, even that got slammed with 2 upvotes and 127 downvotes.

A few of the other videos have a couple of hundred views each but more than half have less than 70. Perhaps most worryingly for the DCA, apart from the Schimel PSA, none have any upvotes at all, only down. It’s unclear who the viewers were but it seems reasonable to conclude they weren’t entertained.

The bottom line is nobody likes malware or having their banking details stolen but yet again, people who claim to have the public interest at heart aren’t actually making a difference on the ground. It could be argued that groups advocating online safety should be publishing guides on how to stay protected on the Internet period, not merely advising people to stay away from certain sites.

But of course, that wouldn’t achieve the goals of the MPAA Digital Citizens Alliance.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Yet more reasons to disagree with experts on nPetya

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/yet-more-reasons-to-disagree-with.html

In WW II, they looked at planes returning from bombing missions that were shot full of holes. Their natural conclusion was to add more armor to the sections that were damaged, to protect them in the future. But wait, said the statisticians. The original damage is likely spread evenly across the plane. Damage on returning planes indicates where they could damage and still return. The undamaged areas are where they were hit and couldn’t return. Thus, it’s the undamaged areas you need to protect.

This is called survivorship bias.
Many experts are making the same mistake with regards to the nPetya ransomware. 
I hate to point this out, because they are all experts I admire and respect, especially @MalwareJake, but it’s still an error. An example is this tweet:
The context of this tweet is the discussion of why nPetya was well written with regards to spreading, but full of bugs with regards to collecting on the ransom. The conclusion therefore that it wasn’t intended to be ransomware, but was intended to simply be a “wiper”, to cause destruction.
But this is just survivorship bias. If nPetya had been written the other way, with excellent ransomware features and poor spreading, we would not now be talking about it. Even that initial seeding with the trojaned MeDoc update wouldn’t have spread it far enough.
In other words, all malware samples we get are good at spreading, either on their own, or because the creator did a good job seeding them. It’s because we never see the ones that didn’t spread.
With regards to nPetya, a lot of experts are making this claim. Since it spread so well, but had hopelessly crippled ransomware features, that must have been the intent all along. Yet, as we see from survivorship bias, none of us would’ve seen nPetya had it not been for the spreading feature.

NonPetya: no evidence it was a "smokescreen"

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/06/nonpetya-no-evidence-it-was-smokescreen.html

Many well-regarded experts claim that the not-Petya ransomware wasn’t “ransomware” at all, but a “wiper” whose goal was to destroy files, without any intent at letting victims recover their files. I want to point out that there is no real evidence of this.

Certainly, things look suspicious. For one thing, it certainly targeted the Ukraine. For another thing, it made several mistakes that prevent them from ever decrypting drives. Their email account was shutdown, and it corrupts the boot sector.

But these things aren’t evidence, they are problems. They are things needing explanation, not things that support our preferred conspiracy theory.

The simplest, Occam’s Razor explanation explanation is that they were simple mistakes. Such mistakes are common among ransomware. We think of virus writers as professional software developers who thoroughly test their code. Decades of evidence show the opposite, that such software is of poor quality with shockingly bad bugs.

It’s true that effectively, nPetya is a wiper. Matthieu Suiche‏ does a great job describing one flaw that prevents it working. @hasherezade does a great job explaining another flaw.  But best explanation isn’t that this is intentional. Even if these bugs didn’t exist, it’d still be a wiper if the perpetrators simply ignored the decryption requests. They need not intentionally make the decryption fail.

Thus, the simpler explanation is that it’s simply a bug. Ransomware authors test the bits they care about, and test less well the bits they don’t. It’s quite plausible to believe that just before shipping the code, they’d add a few extra features, and forget to regression test the entire suite. I mean, I do that all the time with my code.

Some have pointed to the sophistication of the code as proof that such simple errors are unlikely. This isn’t true. While it’s more sophisticated than WannaCry, it’s about average for the current state-of-the-art for ransomware in general. What people think of, such the Petya base, or using PsExec to spread throughout a Windows domain, is already at least a year old.

Indeed, the use of PsExec itself is a bit clumsy, when the code for doing the same thing is already public. It’s just a few calls to basic Windows networking APIs. A sophisticated virus would do this itself, rather than clumsily use PsExec.

Infamy doesn’t mean skill. People keep making the mistake that the more widespread something is in the news, the more skill, the more of a “conspiracy” there must be behind it. This is not true. Virus/worm writers often do newsworthy things by accident. Indeed, the history of worms, starting with the Morris Worm, has been things running out of control more than the author’s expectations.

What makes nPetya newsworthy isn’t the EternalBlue exploit or the wiper feature. Instead, the creators got lucky with MeDoc. The software is used by every major organization in the Ukraine, and at the same time, their website was horribly insecure — laughably insecure. Furthermore, it’s autoupdate feature didn’t check cryptographic signatures. No hacker can plan for this level of widespread incompetence — it’s just extreme luck.

Thus, the effect of bumbling around is something that hit the Ukraine pretty hard, but it’s not necessarily the intent of the creators. It’s like how the Slammer worm hit South Korea pretty hard, or how the Witty worm hit the DoD pretty hard. These things look “targeted”, especially to the victims, but it was by pure chance (provably so, in the case of Witty).

Certainly, MeDoc was targeted. But then, targeting a single organization is the norm for ransomware. They have to do it that way, giving each target a different Bitcoin address for payment. That it then spread to the entire Ukraine, and further, is the sort of thing that typically surprises worm writers.

Finally, there’s little reason to believe that there needs to be a “smokescreen”. Russian hackers are targeting the Ukraine all the time. Whether Russian hackers are to blame for “ransomware” vs. “wiper” makes little difference.

Conclusion

We know that Russian hackers are constantly targeting the Ukraine. Therefore, the theory that this was nPetya’s goal all along, to destroy Ukraines computers, is a good one.

Yet, there’s no actual “evidence” of this. nPetya’s issues are just as easily explained by normal software bugs. The smokescreen isn’t needed. The boot record bug isn’t needed. The single email address that was shutdown isn’t significant, since half of all ransomware uses the same technique.

The experts who disagree with me are really smart/experienced people who you should generally trust. It’s just that I can’t see their evidence.

Update: I wrote another blogpost about “survivorship bias“, refuting the claim by many experts talking about the sophistication of the spreading feature.


Update: comment asks “why is there no Internet spreading code?”. The answer is “I don’t know”, but unanswerable questions aren’t evidence of a conspiracy. “What aren’t there any stars in the background?” isn’t proof the moon landings are fake, such because you can’t answer the question. One guess is that you never want ransomware to spread that far, until you’ve figured out how to get payment from so many people.

From Idea to Launch: Getting Your First Customers

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/how-to-get-your-first-customers/

line outside of Apple

After deciding to build an unlimited backup service and developing our own storage platform, the next step was to get customers and feedback. Not all customers are created equal. Let’s talk about the types, and when and how to attract them.

How to Get Your First Customers

First Step – Don’t Launch Publicly
Launch when you’re ready for the judgments of people who don’t know you at all. Until then, don’t launch. Sign up users and customers either that you know, those you can trust to cut you some slack (while providing you feedback), or at minimum those for whom you can set expectations. For months the Backblaze website was a single page with no ability to get the product and minimal info on what it would be. This is not to counter the Lean Startup ‘iterate quickly with customer feedback’ advice. Rather, this is an acknowledgement that there are different types of feedback required based on your development stage.

Sign Up Your Friends
We knew all of our first customers; they were friends, family, and previous co-workers. Many knew what we were up to and were excited to help us. No magic marketing or tech savviness was required to reach them – we just asked that they try the service. We asked them to provide us feedback on their experience and collected it through email and conversations. While the feedback wasn’t unbiased, it was nonetheless wide-ranging, real, and often insightful. These people were willing to spend time carefully thinking about their feedback and delving deeper into the conversations.

Broaden to Beta
Unless you’re famous or your service costs $1 million per customer, you’ll probably need to expand quickly beyond your friends to build a business – and to get broader feedback. Our next step was to broaden the customer base to beta users.

Opening up the service in beta provides three benefits:

  1. Air cover for the early warts. There are going to be issues, bugs, unnecessarily complicated user flows, and poorly worded text. Beta tells people, “We don’t consider the product ‘done’ and you should expect some of these issues. Please be patient with us.”
  2. A request for feedback. Some people always provide feedback, but beta communicates that you want it.
  3. An awareness opportunity. Opening up in beta provides an early (but not only) opportunity to have an announcement and build awareness.

Pitching Beta to Press
Not all press cares about, or is even willing to cover, beta products. Much of the mainstream press wants to write about services that are fully live, have scale, and are important in the marketplace. However, there are a number of sites that like to cover the leading edge – and that means covering betas. Techcrunch, Ars Technica, and SimpleHelp covered our initial private beta launch. I’ll go into the details of how to work with the press to cover your announcements in a post next month.

Private vs. Public Beta
Both private and public beta provide all three of the benefits above. The difference between the two is that private betas are much more controlled, whereas public ones bring in more users. But this isn’t an either/or – I recommend doing both.

Private Beta
For our original beta in 2008, we decided that we were comfortable with about 1,000 users subscribing to our service. That would provide us with a healthy amount of feedback and get some early adoption, while not overwhelming us or our server capacity, and equally important not causing cash flow issues from having to buy more equipment. So we decided to limit the sign-up to only the first 1,000 people who signed up; then we would shut off sign-ups for a while.

But how do you even get 1,000 people to sign up for your service? In our case, get some major publications to write about our beta. (Note: In a future post I’ll explain exactly how to find and reach out to writers. Sign up to receive all of the entrepreneurial posts in this series.)

Public Beta
For our original service (computer backup), we did not have a public beta; but when we launched Backblaze B2, we had a private and then a public beta. The private beta allowed us to work out early kinks, while the public beta brought us a more varied set of use cases. In public beta, there is no cap on the number of users that may try the service.

While this is a first-class problem to have, if your service is flooded and stops working, it’s still a problem. Think through what you will do if that happens. In our early days, when our system could get overwhelmed by volume, we had a static web page hosted with a different registrar that wouldn’t let customers sign up but would tell them when our service would be open again. When we reached a critical volume level we would redirect to it in order to at least provide status for when we could accept more customers.

Collect Feedback
Since one of the goals of betas is to get feedback, we made sure that we had our email addresses clearly presented on the site so users could send us thoughts. We were most interested in broad qualitative feedback on users’ experience, so all emails went to an internal mailing list that would be read by everyone at Backblaze.

For our B2 public and private betas, we also added an optional short survey to the sign-up process. In order to be considered for the private beta you had to fill the survey out, though we found that 80% of users continued to fill out the survey even when it was not required. This survey had both closed-end questions (“how much data do you have”) and open-ended ones (“what do you want to use cloud storage for?”).

BTW, despite us getting a lot of feedback now via our support team, Twitter, and marketing surveys, we are always open to more – you can email me directly at gleb.budman {at} backblaze.com.

Don’t Throw Away Users
Initially our backup service was available only on Windows, but we had an email sign-up list for people who wanted it for their Mac. This provided us with a sense of market demand and a ready list of folks who could be beta users and early adopters when we had a Mac version. Have a service targeted at doctors but lawyers are expressing interest? Capture that.

Product Launch

When
The first question is “when” to launch. Presuming your service is in ‘public beta’, what is the advantage of moving out of beta and into a “version 1.0”, “gold”, or “public availability”? That depends on your service and customer base. Some services fly through public beta. Gmail, on the other hand, was (in)famous for being in beta for 5 years, despite having over 100 million users.

The term beta says to users, “give us some leeway, but feel free to use the service”. That’s fine for many consumer apps and will have near zero impact on them. However, services aimed at businesses and government will often not be adopted with a beta label as the enterprise customers want to know the company feels the service is ‘ready’. While Backblaze started out as a purely consumer service, because it was a data backup service, it was important for customers to trust that the service was ready.

No product is bug-free. But from a product readiness perspective, the nomenclature should also be a reflection of the quality of the product. You can launch a product with one feature that works well out of beta. But a product with fifty features on which half the users will bump into problems should likely stay in beta. The customer feedback, surveys, and your own internal testing should guide you in determining this quality during the beta. Be careful about “we’ve only seen that one time” or “I haven’t been able to reproduce that on my machine”; those issues are likely to scale with customers when you launch.

How
Launching out of beta can be as simple as removing the beta label from the website/product. However, this can be a great time to reach out to press, write a blog post, and send an email announcement to your customers.

Consider thanking your beta testers somehow; can they get some feature turned out for free, an extension of their trial, or premium support? If nothing else, remember to thank them for their feedback. Users that signed up during your beta are likely the ones who will propel your service. They had the need and interest to both be early adopters and deal with bugs. They are likely the key to getting 1,000 true fans.

The Beginning
The title of this post was “Getting your first customers”, because getting to launch may feel like the peak of your journey when you’re pre-launch, but it really is just the beginning. It’s a step along the journey of building your business. If your launch is wildly successful, enjoy it, work to build on the momentum, but don’t lose track of building your business. If your launch is a dud, go out for a coffee with your team, say “well that sucks”, and then get back to building your business. You can learn a tremendous amount from your early customers, and they can become your biggest fans, but the success of your business will depend on what you continue to do the months and years after your launch.

The post From Idea to Launch: Getting Your First Customers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

2017-05-09 bias-и и дебъгване

Post Syndicated from Vasil Kolev original https://vasil.ludost.net/blog/?p=3352

Нещо странично.

Тия дни в офиса около някакви занимания обсъждахме следната задача:

“Имаме банда пирати (N на брой, капитан и N-1 останали членове), които искат да си разделят съкровище от 100 пари. Пиратите имат строга линейна йерархия (знае се кой след кой е). Разделянето става по следния начин – текущият капитан предлага разпределение, гласува се и ако събере половината или повече от гласовете се приема, ако не – убиват го и следващия по веригата предлага разпределение. Въпросът е какво трябва да предложи капитанът, така че всички да се съгласят, ако приемем, че всички в екипажа са перфектни логици. Също така пиратите са кръвожадни и ако при гласуване против има шанс да спечели и същите пари, пак ще предпочете да убие капитана. Също така всички са алчни и целта е капитанът да запази най-много за себе си.”
(задачата не идва от икономиката, въпреки че и там всички са перфектни логици и за това толкова много им се дънят теориите)

Решението на задачата е интересно (за него – по-долу), но е доста по-интересно колко трудно се оказа да я реша. Първоначалната ми идея беше просто на горната половина от пиратите да се разделят намалящи суми, понеже това е стандартния начин, по който се случват нещата. Това се оказа неефективно. После ми напомниха (което сам трябваше да се сетя), че такива задачи се решават отзад-напред и по индукция, и като за начало започнахме с въпроса, какво става ако са само двама?

Първият ми отговор беше – ами другия член на екипажа ще иска винаги да убие капитана, щото така ще вземе всичко. Обаче се оказа, че и капитана има глас, т.е. ако останат само двама, капитанът взима всичко и разпределението е 100 за него и нищо за другия.

Какво следва, ако са трима? Казах – добре, тогава даваш на единия 1, на другия 2, и останалото за капитана, понеже ако останат само двама, последния няма да вземе нищо, капитанът гласува за себе си и втория и да е за и против, няма значение. Само че няма нужда да даваме нищо на средния, щото не ни пука за мнението му, така всъщност правилното разпределение идва 1, 0, 99. Тук пак си пролича bias-а, пак очаквах да има някаква пропорция.

Long story short, следващата итерация е 0, 1, 0, 99, понеже така ако не се съгласят, на следващия ход предпоследния ако не се съгласи няма да вземе нищо, и на другите двама мнението няма значение. Pattern-а мисля, че си личи 🙂

Лошото е колко много влияеше bias-а, който съм натрупал от четене за разпределения в реалния живот – какво са пиратите, как няма перфектни логици (и реално никой няма да смята по тоя начин, а ще се стремят към нещо, което им се вижда честно), как това тотално изключва политическата възможност N/2+1 от долната част да гласуват винаги против, докато не дойде всичкото до тях и после да си го разделят по равно и всякакви подобни варианти от реалния живот. Ако примерът беше с каквото и да е друго (например не включваше хора), вероятно щеше да е доста по-лесно да гледам абстрактно.

Което е още един довод в подкрепа на идеята ми, че много по-лесно се дебъгва нещо чуждо (често и което никога не си виждал), отколкото нещо, с което почти постоянно се занимаваш. Над 90% от проблемите (това не се базира на никаква статистика, а на усещане) са достатъчно прости, че да могат да се решат със стандартни методи и да не изискват много задълбочено познаване на системата (половината ми живот е минал в дебъгване на неща, които не разбирам, доста по-често успешно, отколкото не) и вероятно като/ако правя debug workshop-а (за който много хора ми натякват), ще е с проблеми, с които и аз не съм запознат, да е наистина забавно …

Get wordy with our free resources

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/get-wordy-with-our-free-resources/

Here at the Raspberry Pi Foundation, we take great pride in the wonderful free resources we produce for you to use in classes, at home and in coding clubs. We publish them under a Creative Commons licence, and they’re an excellent way to develop your digital-making skills.

With yesterday being World Poetry Day (I’m a day late to the party. Shhh), I thought I’d share some wordy-themed [wordy-themed? Are you sure? – Ed] resources for you all to have a play with.

Shakespearean Insult Generator

Raspberry Pi Free Resources Shakespearean Insult Generator

Have you ever found yourself lost for words just when the moment calls for your best comeback? With the Shakespearean Insult Generator, your mumbled retorts to life’s awkward situations will have the lyrical flow of our nation’s most beloved bard.

Thou sodden-witted lord! Thou hast no more brain than I have in mine elbows!

Not only will the generator provide you with hours of potty-mouthed fun, it’ll also teach you how to read and write data in CSV format using Python, how to manipulate lists, and how to choose a random item from a list.

Talk like a Pirate

Raspberry Pi Free Resources Talk Like a Pirate

Ye’ll never be forced t’walk the plank once ye learn how to talk like a scurvy ol’ pirate… yaaaarrrgh!

The Talk like a Pirate speech generator teaches you how to use jQuery to cause live updates on a web page, how to write regular expressions to match patterns and words, and how to create a web page to input text and output results.

Once you’ve mastered those skills, you can use them to create other speech generators. How about a speech generator that turns certain words into their slang counterparts? Or one that changes words into txt speak – laugh into LOL, and see you into CU?

Secret Agent Chat

Raspberry Pi Free Resources Secret Agent Chat

So you’ve already mastered insults via list manipulation and random choice, and you’ve converted words into hilarious variations through matching word patterns and input/output. What’s next?

The Secret Agent Chat resource shows you how random numbers can be used to encrypt messages, how iteration can be used to encrypt individual characters, and, to make sure nobody cracks your codes, the importance of keeping your keys secret. And with these new skills under your belt, you can write and encrypt messages between you and your friends, ensuring that nobody will be able to read your secrets.

Unlocking your transferable skill set

One of the great things about building projects like these is the way it expands your transferable skill set. When you complete a project using one of our resources, you gain abilities that can be transferred to other projects and situations. You might never need to use a ‘Talk like a Pirate’ speech generator, but you might need to create a way to detect and alter certain word patterns in a document. And while you might be able to coin your own colourful insults, making the Shakespearean Insult Generator gives you the ability to select words from lists at random, allowing you to write a program that picks names to create sports or quiz teams without bias.

All of our resources are available for free on our website, and we continually update them to offer you more opportunities to work on your skills, whatever your age and experience.

Have you built anything from our resources? Let us know in the comments.

The post Get wordy with our free resources appeared first on Raspberry Pi.

Canada Rejects Flawed and One-Sided “Piracy” Claims From US Govt.

Post Syndicated from Ernesto original https://torrentfreak.com/canada-rejects-flawed-and-one-sided-piracy-claims-from-us-govt-170310/

Every year the Office of the US Trade Representative (USTR) releases an updated version of its Special 301 Report, calling out other nations for failing to live up to U.S. IP enforcement standards.

In recent years Canada has been placed on this “watch list” many times, for a variety of reasons. The country fails to properly deter piracy, is one of the prime complaints circulated by the U.S. Government.

Even after Canada revamped its copyright law, including a mandatory piracy notice scheme and extending the copyright term to 70 years after publication, the allegations didn’t go away in 2016.

Now, a year later new hearings are underway to discuss the 2017 version of the report. Fearing repercussions, several countries have joined stakeholders to defend their positions. However, Canada was notably absent.

While the Canadian Government hasn’t made a lot of fuss in the media, a confidential memo, obtained by University of Ottawa professor Michael Geist, shows that they have little faith in the USTR report.

“Canada does not recognize the validity of the Special 301 and considers the process and the Report to be flawed,” the Government memo reads.

“The Report fails to employ a clear methodology and the findings tend to rely on industry allegations rather than empirical evidence and objective analysis.”

The document in question was prepared for Minister Mélanie Joly last year after the 2016 report was published. It points out, in no uncertain terms, that Canada doesn’t recognize the validity of the 301 process and includes several talking points for the media.

Excerpt from the note

This year, rightsholders have once again labeled Canada a “piracy haven” so it wouldn’t be a big surprise if it’s listed again. Based on the Canadian Government’s lack of response, it is likely that the Northern neighbor still has little faith in the report.

TorrentFreak spoke with law professor Micheal Geist, who has been very critical of the USTR’s 301-process in the past. He believes that Canada is doing the right thing and characterizes the yearly 301 report as biased.

“I think the Canadian government is exactly right in its assessment of the Special 301 report process. It is little more than a lobbying document and the content largely reflects biased submissions from lobby groups,” Geist tells TorrentFreak.

In a recent article the professor explains that, contrary to claims from entertainment industry groups, Canada now has some of the toughest anti-piracy laws in the world. But, these rightsholder groups want more.

Some of the requests, such as those put forward by the industry group IIPA, even go beyond what the United States itself is doing, or far beyond internationally agreed standards.

“[T]he submissions frequently engage in a double standard with the IIPA lobbying against fair use in other countries even though the U.S. has had fair use for decades,” Geist says.

“It also often calls on countries to implement rules that go far beyond their international obligations such as the demands that countries adopt a DMCA-style approach for the WIPO Internet treaties even though those treaties are far more flexible in their requirements.”

This critique of the USTR’s annual report is not new as its alleged biased nature has been discussed by various experts in the past. However, as a country, Canada’s rejection will have an impact, and Professor Geist hopes that other nations will follow suit.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Some notes on the RAND 0day report

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/03/some-notes-on-rand-0day-report.html

The RAND Corporation has a research report on the 0day market [ * ]. It’s pretty good. They talked to all the right people. It should be considered the seminal work on the issue. They’ve got the pricing about right ($1 million for full chain iPhone exploit, but closer to $100k for others). They’ve got the stats about right (5% chance somebody else will discover an exploit).

Yet, they’ve got some problems, namely phrasing the debate as activists want, rather than a neutral view of the debate.

The report frequently uses the word “stockpile”. This is a biased term used by activists. According to the dictionary, it means:

a large accumulated stock of goods or materials, especially one held in reserve for use at a time of shortage or other emergency.

Activists paint the picture that the government (NSA, CIA, DoD, FBI) buys 0day to hold in reserve in case they later need them. If that’s the case, then it seems reasonable that it’s better to disclose/patch the vuln then let it grow moldy in a cyberwarehouse somewhere.

But that’s not how things work. The government buys vulns it has immediate use for (primarily). Almost all vulns it buys are used within 6 months. Most vulns in its “stockpile” have been used in the previous year. These cyberweapons are not in a warehouse, but in active use on the front lines.

This is top secret, of course, so people assume it’s not happening. They hear about no cyber operations (except Stuxnet), so they assume such operations aren’t occurring. Thus, they build up the stockpiling assumption rather than the active use assumption.

If the RAND wanted to create an even more useful survey, they should figure out how many thousands of times per day our government (NSA, CIA, DoD, FBI) exploits 0days. They should characterize who they target (e.g. terrorists, child pornographers), success rate, and how many people they’ve killed based on 0days. It’s this data, not patching, that is at the root of the policy debate.

That 0days are actively used determines pricing. If the government doesn’t have immediate need for a vuln, it won’t pay much for it, if anything at all. Conversely, if the government has urgent need for a vuln, it’ll pay a lot.

Let’s say you have a remote vuln for Samsung TVs. You go to the NSA and offer it to them. They tell you they aren’t interested, because they see no near term need for it. Then a year later, spies reveal ISIS has stolen a truckload of Samsung TVs, put them in all the meeting rooms, and hooked them to Internet for video conferencing. The NSA then comes back to you and offers $500k for the vuln.

Likewise, the number of sellers affects the price. If you know they desperately need the Samsung TV 0day, but they are only offering $100k, then it likely means that there’s another seller also offering such a vuln.

That’s why iPhone vulns are worth $1 million for a full chain exploit, from browser to persistence. They use it a lot, it’s a major part of ongoing cyber operations. Each time Apple upgrades iOS, the change breaks part of the existing chain, and the government is keen on getting a new exploit to fix it. They’ll pay a lot to the first vuln seller who can give them a new exploit.

Thus, there are three prices the government is willing to pay for an 0day (the value it provides to the government):

  • the price for an 0day they will actively use right now (high)
  • the price for an 0day they’ll stockpile for possible use in the future (low)
  • the price for an 0day they’ll disclose to the vendor to patch (very low)

That these are different prices is important to the policy debate. When activists claim the government should disclose the 0day they acquire, they are ignoring the price the 0day was acquired for. Since the government actively uses the 0day, they are acquired for a high-price, with their “use” value far higher than their “patch” value. It’s an absurd argument to make that they government should then immediately discard that money, to pay “use value” prices for “patch” results.

If the policy becomes that the NSA/CIA should disclose/patch the 0day they buy, it doesn’t mean business as usual acquiring vulns. It instead means they’ll stop buying 0day.

In other words, “patching 0day” is not an outcome on either side of the debate. Either the government buys 0day to use, or it stops buying 0day. In neither case does patching happen.

The real argument is whether the government (NSA, CIA, DoD, FBI) should be acquiring, weaponizing, and using 0day in the first place. It demands that we unilaterally disarm our military, intelligence, and law enforcement, preventing them from using 0days against our adversaries while our adversaries continue to use 0days against us.

That’s the gaping hole in both the RAND paper and most news reporting of this controversy. They characterize the debate the way activists want, as if the only question is the value of patching. They avoid talking about unilateral cyberdisarmament, even though that’s the consequence of the policy they are advocating. They avoid comparing the value of 0days to our country for active use (high) compared to the value to to our country for patching (very low).

Conclusion

It’s nice that the RAND paper studied the value of patching and confirmed it’s low, that only around 5% of our cyber-arsenal is likely to be found by others. But it’d be nice if they also looked at the point of view of those actively using 0days on a daily basis, rather than phrasing the debate the way activists want.

A note about "false flag" operations

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/03/a-note-about-false-flag-operations.html

There’s nothing in the CIA #Vault7 leaks that calls into question strong attribution, like Russia being responsible for the DNC hacks. On the other hand, it does call into question weak attribution, like North Korea being responsible for the Sony hacks.

There are really two types of attribution. Strong attribution is a preponderance of evidence that would convince an unbiased, skeptical expert. Weak attribution is flimsy evidence that confirms what people are predisposed to believe.

The DNC hacks have strong evidence pointing to Russia. Not only does all the malware check out, but also other, harder to “false flag” bits, like active command-and-control servers. A serious operator could still false-flag this in theory, if only by bribing people in Russia, but nothing in the CIA dump hints at this.

The Sony hacks have weak evidence pointing to North Korea. One of the items was the use of the RawDisk driver, used both in malware attributed to North Korea and the Sony attacks. This was described as “flimsy” at the time [*]. The CIA dump [*] demonstrates that indeed it’s flimsy — as apparently CIA malware also uses the RawDisk code.

In the coming days, biased partisans are going to seize on the CIA leaks as proof of “false flag” operations, calling into question Russian hacks. No, this isn’t valid. We experts in the industry criticized “malware techniques” as flimsy attribution, long before the Sony attack, and long before the DNC hacks. All the CIA leaks do is prove we were right. On the other hand, the DNC hack attribution is based on more than just this, so nothing in the CIA leaks calls into question that attribution.

WikiLeaks Releases CIA Hacking Tools

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/03/wikileaks_relea.html

WikiLeaks just released a cache of 8,761 classified CIA documents from 2012 to 2016, including details of its offensive Internet operations.

I have not read through any of them yet. If you see something interesting, tell us in the comments.

EDITED TO ADD: There’s a lot in here. Many of the hacking tools are redacted, with the tar files and zip archives replaced with messages like:

::: THIS ARCHIVE FILE IS STILL BEING EXAMINED BY WIKILEAKS. :::

::: IT MAY BE RELEASED IN THE NEAR FUTURE. WHAT FOLLOWS IS :::
::: AN AUTOMATICALLY GENERATED LIST OF ITS CONTENTS: :::

Hopefully we’ll get them eventually. The documents say that the CIA — and other intelligence services — can bypass Signal, WhatsApp and Telegram. It seems to be by hacking the end-user devices and grabbing the traffic before and after encryption, not by breaking the encryption.

New York Times article.

EDITED TO ADD: Some details from The Guardian:

According to the documents:

  • CIA hackers targeted smartphones and computers.
  • The Center for Cyber Intelligence is based at the CIA headquarters in Virginia but it has a second covert base in the US consulate in Frankfurt which covers Europe, the Middle East and Africa.
  • A programme called Weeping Angel describes how to attack a Samsung F8000 TV set so that it appears to be off but can still be used for monitoring.

I just noticed this from the WikiLeaks page:

Recently, the CIA lost control of the majority of its hacking arsenal including malware, viruses, trojans, weaponized “zero day” exploits, malware remote control systems and associated documentation. This extraordinary collection, which amounts to more than several hundred million lines of code, gives its possessor the entire hacking capacity of the CIA. The archive appears to have been circulated among former U.S. government hackers and contractors in an unauthorized manner, one of whom has provided WikiLeaks with portions of the archive.

So it sounds like this cache of documents wasn’t taken from the CIA and given to WikiLeaks for publication, but has been passed around the community for a while — and incidentally some part of the cache was passed to WikiLeaks. So there are more documents out there, and others may release them in unredacted form.

Wired article. Slashdot thread. Two articles from the Washington Post.

EDITED TO ADD: This document talks about Comodo version 5.X and version 6.X. Version 6 was released in Feb 2013. Version 7 was released in Apr 2014. This gives us a time window of that page, and the cache in general. (WikiLeaks says that the documents cover 2013 to 2016.)

If these tools are a few years out of date, it’s similar to the NSA tools released by the “Shadow Brokers.” Most of us thought the Shadow Brokers were the Russians, specifically releasing older NSA tools that had diminished value as secrets. Could this be the Russians as well?

EDITED TO ADD: Nicholas Weaver comments.

EDITED TO ADD (3/8): These documents are interesting:

The CIA’s hand crafted hacking techniques pose a problem for the agency. Each technique it has created forms a “fingerprint” that can be used by forensic investigators to attribute multiple different attacks to the same entity.

This is analogous to finding the same distinctive knife wound on multiple separate murder victims. The unique wounding style creates suspicion that a single murderer is responsible. As soon one murder in the set is solved then the other murders also find likely attribution.

The CIA’s Remote Devices Branch‘s UMBRAGE group collects and maintains a substantial library of attack techniques ‘stolen’ from malware produced in other states including the Russian Federation.

With UMBRAGE and related projects the CIA cannot only increase its total number of attack types but also misdirect attribution by leaving behind the “fingerprints” of the groups that the attack techniques were stolen from.

UMBRAGE components cover keyloggers, password collection, webcam capture, data destruction, persistence, privilege escalation, stealth, anti-virus (PSP) avoidance and survey techniques.

This is being spun in the press as the CIA is pretending to be Russia. I’m not convinced that the documents support these allegations. Can someone else look at the documents. I don’t like my conclusion that WikiLeaks is using this document dump as a way to push their own bias.

Security and the Internet of Things

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/02/security_and_th.html

Last year, on October 21, your digital video recorder ­- or at least a DVR like yours ­- knocked Twitter off the internet. Someone used your DVR, along with millions of insecure webcams, routers, and other connected devices, to launch an attack that started a chain reaction, resulting in Twitter, Reddit, Netflix, and many sites going off the internet. You probably didn’t realize that your DVR had that kind of power. But it does.

All computers are hackable. This has as much to do with the computer market as it does with the technologies. We prefer our software full of features and inexpensive, at the expense of security and reliability. That your computer can affect the security of Twitter is a market failure. The industry is filled with market failures that, until now, have been largely ignorable. As computers continue to permeate our homes, cars, businesses, these market failures will no longer be tolerable. Our only solution will be regulation, and that regulation will be foisted on us by a government desperate to “do something” in the face of disaster.

In this article I want to outline the problems, both technical and political, and point to some regulatory solutions. Regulation might be a dirty word in today’s political climate, but security is the exception to our small-government bias. And as the threats posed by computers become greater and more catastrophic, regulation will be inevitable. So now’s the time to start thinking about it.

We also need to reverse the trend to connect everything to the internet. And if we risk harm and even death, we need to think twice about what we connect and what we deliberately leave uncomputerized.

If we get this wrong, the computer industry will look like the pharmaceutical industry, or the aircraft industry. But if we get this right, we can maintain the innovative environment of the internet that has given us so much.

**********

We no longer have things with computers embedded in them. We have computers with things attached to them.

Your modern refrigerator is a computer that keeps things cold. Your oven, similarly, is a computer that makes things hot. An ATM is a computer with money inside. Your car is no longer a mechanical device with some computers inside; it’s a computer with four wheels and an engine. Actually, it’s a distributed system of over 100 computers with four wheels and an engine. And, of course, your phones became full-power general-purpose computers in 2007, when the iPhone was introduced.

We wear computers: fitness trackers and computer-enabled medical devices ­- and, of course, we carry our smartphones everywhere. Our homes have smart thermostats, smart appliances, smart door locks, even smart light bulbs. At work, many of those same smart devices are networked together with CCTV cameras, sensors that detect customer movements, and everything else. Cities are starting to embed smart sensors in roads, streetlights, and sidewalk squares, also smart energy grids and smart transportation networks. A nuclear power plant is really just a computer that produces electricity, and ­- like everything else we’ve just listed -­ it’s on the internet.

The internet is no longer a web that we connect to. Instead, it’s a computerized, networked, and interconnected world that we live in. This is the future, and what we’re calling the Internet of Things.

Broadly speaking, the Internet of Things has three parts. There are the sensors that collect data about us and our environment: smart thermostats, street and highway sensors, and those ubiquitous smartphones with their motion sensors and GPS location receivers. Then there are the “smarts” that figure out what the data means and what to do about it. This includes all the computer processors on these devices and ­- increasingly ­- in the cloud, as well as the memory that stores all of this information. And finally, there are the actuators that affect our environment. The point of a smart thermostat isn’t to record the temperature; it’s to control the furnace and the air conditioner. Driverless cars collect data about the road and the environment to steer themselves safely to their destinations.

You can think of the sensors as the eyes and ears of the internet. You can think of the actuators as the hands and feet of the internet. And you can think of the stuff in the middle as the brain. We are building an internet that senses, thinks, and acts.

This is the classic definition of a robot. We’re building a world-size robot, and we don’t even realize it.

To be sure, it’s not a robot in the classical sense. We think of robots as discrete autonomous entities, with sensors, brain, and actuators all together in a metal shell. The world-size robot is distributed. It doesn’t have a singular body, and parts of it are controlled in different ways by different people. It doesn’t have a central brain, and it has nothing even remotely resembling a consciousness. It doesn’t have a single goal or focus. It’s not even something we deliberately designed. It’s something we have inadvertently built out of the everyday objects we live with and take for granted. It is the extension of our computers and networks into the real world.

This world-size robot is actually more than the Internet of Things. It’s a combination of several decades-old computing trends: mobile computing, cloud computing, always-on computing, huge databases of personal information, the Internet of Things ­- or, more precisely, cyber-physical systems ­- autonomy, and artificial intelligence. And while it’s still not very smart, it’ll get smarter. It’ll get more powerful and more capable through all the interconnections we’re building.

It’ll also get much more dangerous.

**********

Computer security has been around for almost as long as computers have been. And while it’s true that security wasn’t part of the design of the original internet, it’s something we have been trying to achieve since its beginning.

I have been working in computer security for over 30 years: first in cryptography, then more generally in computer and network security, and now in general security technology. I have watched computers become ubiquitous, and have seen firsthand the problems ­- and solutions ­- of securing these complex machines and systems. I’m telling you all this because what used to be a specialized area of expertise now affects everything. Computer security is now everything security. There’s one critical difference, though: The threats have become greater.

Traditionally, computer security is divided into three categories: confidentiality, integrity, and availability. For the most part, our security concerns have largely centered around confidentiality. We’re concerned about our data and who has access to it ­- the world of privacy and surveillance, of data theft and misuse.

But threats come in many forms. Availability threats: computer viruses that delete our data, or ransomware that encrypts our data and demands payment for the unlock key. Integrity threats: hackers who can manipulate data entries can do things ranging from changing grades in a class to changing the amount of money in bank accounts. Some of these threats are pretty bad. Hospitals have paid tens of thousands of dollars to criminals whose ransomware encrypted critical medical files. JPMorgan Chase spends half a billion on cybersecurity a year.

Today, the integrity and availability threats are much worse than the confidentiality threats. Once computers start affecting the world in a direct and physical manner, there are real risks to life and property. There is a fundamental difference between crashing your computer and losing your spreadsheet data, and crashing your pacemaker and losing your life. This isn’t hyperbole; recently researchers found serious security vulnerabilities in St. Jude Medical’s implantable heart devices. Give the internet hands and feet, and it will have the ability to punch and kick.

Take a concrete example: modern cars, those computers on wheels. The steering wheel no longer turns the axles, nor does the accelerator pedal change the speed. Every move you make in a car is processed by a computer, which does the actual controlling. A central computer controls the dashboard. There’s another in the radio. The engine has 20 or so computers. These are all networked, and increasingly autonomous.

Now, let’s start listing the security threats. We don’t want car navigation systems to be used for mass surveillance, or the microphone for mass eavesdropping. We might want it to be used to determine a car’s location in the event of a 911 call, and possibly to collect information about highway congestion. We don’t want people to hack their own cars to bypass emissions-control limitations. We don’t want manufacturers or dealers to be able to do that, either, as Volkswagen did for years. We can imagine wanting to give police the ability to remotely and safely disable a moving car; that would make high-speed chases a thing of the past. But we definitely don’t want hackers to be able to do that. We definitely don’t want them disabling the brakes in every car without warning, at speed. As we make the transition from driver-controlled cars to cars with various driver-assist capabilities to fully driverless cars, we don’t want any of those critical components subverted. We don’t want someone to be able to accidentally crash your car, let alone do it on purpose. And equally, we don’t want them to be able to manipulate the navigation software to change your route, or the door-lock controls to prevent you from opening the door. I could go on.

That’s a lot of different security requirements, and the effects of getting them wrong range from illegal surveillance to extortion by ransomware to mass death.

**********

Our computers and smartphones are as secure as they are because companies like Microsoft, Apple, and Google spend a lot of time testing their code before it’s released, and quickly patch vulnerabilities when they’re discovered. Those companies can support large, dedicated teams because those companies make a huge amount of money, either directly or indirectly, from their software ­ and, in part, compete on its security. Unfortunately, this isn’t true of embedded systems like digital video recorders or home routers. Those systems are sold at a much lower margin, and are often built by offshore third parties. The companies involved simply don’t have the expertise to make them secure.

At a recent hacker conference, a security researcher analyzed 30 home routers and was able to break into half of them, including some of the most popular and common brands. The denial-of-service attacks that forced popular websites like Reddit and Twitter off the internet last October were enabled by vulnerabilities in devices like webcams and digital video recorders. In August, two security researchers demonstrated a ransomware attack on a smart thermostat.

Even worse, most of these devices don’t have any way to be patched. Companies like Microsoft and Apple continuously deliver security patches to your computers. Some home routers are technically patchable, but in a complicated way that only an expert would attempt. And the only way for you to update the firmware in your hackable DVR is to throw it away and buy a new one.

The market can’t fix this because neither the buyer nor the seller cares. The owners of the webcams and DVRs used in the denial-of-service attacks don’t care. Their devices were cheap to buy, they still work, and they don’t know any of the victims of the attacks. The sellers of those devices don’t care: They’re now selling newer and better models, and the original buyers only cared about price and features. There is no market solution, because the insecurity is what economists call an externality: It’s an effect of the purchasing decision that affects other people. Think of it kind of like invisible pollution.

**********

Security is an arms race between attacker and defender. Technology perturbs that arms race by changing the balance between attacker and defender. Understanding how this arms race has unfolded on the internet is essential to understanding why the world-size robot we’re building is so insecure, and how we might secure it. To that end, I have five truisms, born from what we’ve already learned about computer and internet security. They will soon affect the security arms race everywhere.

Truism No. 1: On the internet, attack is easier than defense.

There are many reasons for this, but the most important is the complexity of these systems. More complexity means more people involved, more parts, more interactions, more mistakes in the design and development process, more of everything where hidden insecurities can be found. Computer-security experts like to speak about the attack surface of a system: all the possible points an attacker might target and that must be secured. A complex system means a large attack surface. The defender has to secure the entire attack surface. The attacker just has to find one vulnerability ­- one unsecured avenue for attack -­ and gets to choose how and when to attack. It’s simply not a fair battle.

There are other, more general, reasons why attack is easier than defense. Attackers have a natural agility that defenders often lack. They don’t have to worry about laws, and often not about morals or ethics. They don’t have a bureaucracy to contend with, and can more quickly make use of technical innovations. Attackers also have a first-mover advantage. As a society, we’re generally terrible at proactive security; we rarely take preventive security measures until an attack actually happens. So more advantages go to the attacker.

Truism No. 2: Most software is poorly written and insecure.

If complexity isn’t enough, we compound the problem by producing lousy software. Well-written software, like the kind found in airplane avionics, is both expensive and time-consuming to produce. We don’t want that. For the most part, poorly written software has been good enough. We’d all rather live with buggy software than pay the prices good software would require. We don’t mind if our games crash regularly, or our business applications act weird once in a while. Because software has been largely benign, it hasn’t mattered. This has permeated the industry at all levels. At universities, we don’t teach how to code well. Companies don’t reward quality code in the same way they reward fast and cheap. And we consumers don’t demand it.

But poorly written software is riddled with bugs, sometimes as many as one per 1,000 lines of code. Some of them are inherent in the complexity of the software, but most are programming mistakes. Not all bugs are vulnerabilities, but some are.

Truism No. 3: Connecting everything to each other via the internet will expose new vulnerabilities.

The more we network things together, the more vulnerabilities on one thing will affect other things. On October 21, vulnerabilities in a wide variety of embedded devices were all harnessed together to create what hackers call a botnet. This botnet was used to launch a distributed denial-of-service attack against a company called Dyn. Dyn provided a critical internet function for many major internet sites. So when Dyn went down, so did all those popular websites.

These chains of vulnerabilities are everywhere. In 2012, journalist Mat Honan suffered a massive personal hack because of one of them. A vulnerability in his Amazon account allowed hackers to get into his Apple account, which allowed them to get into his Gmail account. And in 2013, the Target Corporation was hacked by someone stealing credentials from its HVAC contractor.

Vulnerabilities like these are particularly hard to fix, because no one system might actually be at fault. It might be the insecure interaction of two individually secure systems.

Truism No. 4: Everybody has to stop the best attackers in the world.

One of the most powerful properties of the internet is that it allows things to scale. This is true for our ability to access data or control systems or do any of the cool things we use the internet for, but it’s also true for attacks. In general, fewer attackers can do more damage because of better technology. It’s not just that these modern attackers are more efficient, it’s that the internet allows attacks to scale to a degree impossible without computers and networks.

This is fundamentally different from what we’re used to. When securing my home against burglars, I am only worried about the burglars who live close enough to my home to consider robbing me. The internet is different. When I think about the security of my network, I have to be concerned about the best attacker possible, because he’s the one who’s going to create the attack tool that everyone else will use. The attacker that discovered the vulnerability used to attack Dyn released the code to the world, and within a week there were a dozen attack tools using it.

Truism No. 5: Laws inhibit security research.

The Digital Millennium Copyright Act is a terrible law that fails at its purpose of preventing widespread piracy of movies and music. To make matters worse, it contains a provision that has critical side effects. According to the law, it is a crime to bypass security mechanisms that protect copyrighted work, even if that bypassing would otherwise be legal. Since all software can be copyrighted, it is arguably illegal to do security research on these devices and to publish the result.

Although the exact contours of the law are arguable, many companies are using this provision of the DMCA to threaten researchers who expose vulnerabilities in their embedded systems. This instills fear in researchers, and has a chilling effect on research, which means two things: (1) Vendors of these devices are more likely to leave them insecure, because no one will notice and they won’t be penalized in the market, and (2) security engineers don’t learn how to do security better.
Unfortunately, companies generally like the DMCA. The provisions against reverse-engineering spare them the embarrassment of having their shoddy security exposed. It also allows them to build proprietary systems that lock out competition. (This is an important one. Right now, your toaster cannot force you to only buy a particular brand of bread. But because of this law and an embedded computer, your Keurig coffee maker can force you to buy a particular brand of coffee.)

**********
In general, there are two basic paradigms of security. We can either try to secure something well the first time, or we can make our security agile. The first paradigm comes from the world of dangerous things: from planes, medical devices, buildings. It’s the paradigm that gives us secure design and secure engineering, security testing and certifications, professional licensing, detailed preplanning and complex government approvals, and long times-to-market. It’s security for a world where getting it right is paramount because getting it wrong means people dying.

The second paradigm comes from the fast-moving and heretofore largely benign world of software. In this paradigm, we have rapid prototyping, on-the-fly updates, and continual improvement. In this paradigm, new vulnerabilities are discovered all the time and security disasters regularly happen. Here, we stress survivability, recoverability, mitigation, adaptability, and muddling through. This is security for a world where getting it wrong is okay, as long as you can respond fast enough.

These two worlds are colliding. They’re colliding in our cars -­ literally -­ in our medical devices, our building control systems, our traffic control systems, and our voting machines. And although these paradigms are wildly different and largely incompatible, we need to figure out how to make them work together.

So far, we haven’t done very well. We still largely rely on the first paradigm for the dangerous computers in cars, airplanes, and medical devices. As a result, there are medical systems that can’t have security patches installed because that would invalidate their government approval. In 2015, Chrysler recalled 1.4 million cars to fix a software vulnerability. In September 2016, Tesla remotely sent a security patch to all of its Model S cars overnight. Tesla sure sounds like it’s doing things right, but what vulnerabilities does this remote patch feature open up?

**********
Until now we’ve largely left computer security to the market. Because the computer and network products we buy and use are so lousy, an enormous after-market industry in computer security has emerged. Governments, companies, and people buy the security they think they need to secure themselves. We’ve muddled through well enough, but the market failures inherent in trying to secure this world-size robot will soon become too big to ignore.

Markets alone can’t solve our security problems. Markets are motivated by profit and short-term goals at the expense of society. They can’t solve collective-action problems. They won’t be able to deal with economic externalities, like the vulnerabilities in DVRs that resulted in Twitter going offline. And we need a counterbalancing force to corporate power.

This all points to policy. While the details of any computer-security system are technical, getting the technologies broadly deployed is a problem that spans law, economics, psychology, and sociology. And getting the policy right is just as important as getting the technology right because, for internet security to work, law and technology have to work together. This is probably the most important lesson of Edward Snowden’s NSA disclosures. We already knew that technology can subvert law. Snowden demonstrated that law can also subvert technology. Both fail unless each work. It’s not enough to just let technology do its thing.

Any policy changes to secure this world-size robot will mean significant government regulation. I know it’s a sullied concept in today’s world, but I don’t see any other possible solution. It’s going to be especially difficult on the internet, where its permissionless nature is one of the best things about it and the underpinning of its most world-changing innovations. But I don’t see how that can continue when the internet can affect the world in a direct and physical manner.

**********

I have a proposal: a new government regulatory agency. Before dismissing it out of hand, please hear me out.

We have a practical problem when it comes to internet regulation. There’s no government structure to tackle this at a systemic level. Instead, there’s a fundamental mismatch between the way government works and the way this technology works that makes dealing with this problem impossible at the moment.

Government operates in silos. In the U.S., the FAA regulates aircraft. The NHTSA regulates cars. The FDA regulates medical devices. The FCC regulates communications devices. The FTC protects consumers in the face of “unfair” or “deceptive” trade practices. Even worse, who regulates data can depend on how it is used. If data is used to influence a voter, it’s the Federal Election Commission’s jurisdiction. If that same data is used to influence a consumer, it’s the FTC’s. Use those same technologies in a school, and the Department of Education is now in charge. Robotics will have its own set of problems, and no one is sure how that is going to be regulated. Each agency has a different approach and different rules. They have no expertise in these new issues, and they are not quick to expand their authority for all sorts of reasons.

Compare that with the internet. The internet is a freewheeling system of integrated objects and networks. It grows horizontally, demolishing old technological barriers so that people and systems that never previously communicated now can. Already, apps on a smartphone can log health information, control your energy use, and communicate with your car. That’s a set of functions that crosses jurisdictions of at least four different government agencies, and it’s only going to get worse.

Our world-size robot needs to be viewed as a single entity with millions of components interacting with each other. Any solutions here need to be holistic. They need to work everywhere, for everything. Whether we’re talking about cars, drones, or phones, they’re all computers.

This has lots of precedent. Many new technologies have led to the formation of new government regulatory agencies. Trains did, cars did, airplanes did. Radio led to the formation of the Federal Radio Commission, which became the FCC. Nuclear power led to the formation of the Atomic Energy Commission, which eventually became the Department of Energy. The reasons were the same in every case. New technologies need new expertise because they bring with them new challenges. Governments need a single agency to house that new expertise, because its applications cut across several preexisting agencies. It’s less that the new agency needs to regulate -­ although that’s often a big part of it -­ and more that governments recognize the importance of the new technologies.

The internet has famously eschewed formal regulation, instead adopting a multi-stakeholder model of academics, businesses, governments, and other interested parties. My hope is that we can keep the best of this approach in any regulatory agency, looking more at the new U.S. Digital Service or the 18F office inside the General Services Administration. Both of those organizations are dedicated to providing digital government services, and both have collected significant expertise by bringing people in from outside of government, and both have learned how to work closely with existing agencies. Any internet regulatory agency will similarly need to engage in a high level of collaborate regulation -­ both a challenge and an opportunity.

I don’t think any of us can predict the totality of the regulations we need to ensure the safety of this world, but here’s a few. We need government to ensure companies follow good security practices: testing, patching, secure defaults -­ and we need to be able to hold companies liable when they fail to do these things. We need government to mandate strong personal data protections, and limitations on data collection and use. We need to ensure that responsible security research is legal and well-funded. We need to enforce transparency in design, some sort of code escrow in case a company goes out of business, and interoperability between devices of different manufacturers, to counterbalance the monopolistic effects of interconnected technologies. Individuals need the right to take their data with them. And internet-enabled devices should retain some minimal functionality if disconnected from the internet

I’m not the only one talking about this. I’ve seen proposals for a National Institutes of Health analog for cybersecurity. University of Washington law professor Ryan Calo has proposed a Federal Robotics Commission. I think it needs to be broader: maybe a Department of Technology Policy.

Of course there will be problems. There’s a lack of expertise in these issues inside government. There’s a lack of willingness in government to do the hard regulatory work. Industry is worried about any new bureaucracy: both that it will stifle innovation by regulating too much and that it will be captured by industry and regulate too little. A domestic regulatory agency will have to deal with the fundamentally international nature of the problem.

But government is the entity we use to solve problems like this. Governments have the scope, scale, and balance of interests to address the problems. It’s the institution we’ve built to adjudicate competing social interests and internalize market externalities. Left to their own devices, the market simply can’t. That we’re currently in the middle of an era of low government trust, where many of us can’t imagine government doing anything positive in an area like this, is to our detriment.

Here’s the thing: Governments will get involved, regardless. The risks are too great, and the stakes are too high. Government already regulates dangerous physical systems like cars and medical devices. And nothing motivates the U.S. government like fear. Remember 2001? A nominally small-government Republican president created the Office of Homeland Security 11 days after the terrorist attacks: a rushed and ill-thought-out decision that we’ve been trying to fix for over a decade. A fatal disaster will similarly spur our government into action, and it’s unlikely to be well-considered and thoughtful action. Our choice isn’t between government involvement and no government involvement. Our choice is between smarter government involvement and stupider government involvement. We have to start thinking about this now. Regulations are necessary, important, and complex; and they’re coming. We can’t afford to ignore these issues until it’s too late.

We also need to start disconnecting systems. If we cannot secure complex systems to the level required by their real-world capabilities, then we must not build a world where everything is computerized and interconnected.

There are other models. We can enable local communications only. We can set limits on collected and stored data. We can deliberately design systems that don’t interoperate with each other. We can deliberately fetter devices, reversing the current trend of turning everything into a general-purpose computer. And, most important, we can move toward less centralization and more distributed systems, which is how the internet was first envisioned.

This might be a heresy in today’s race to network everything, but large, centralized systems are not inevitable. The technical elites are pushing us in that direction, but they really don’t have any good supporting arguments other than the profits of their ever-growing multinational corporations.

But this will change. It will change not only because of security concerns, it will also change because of political concerns. We’re starting to chafe under the worldview of everything producing data about us and what we do, and that data being available to both governments and corporations. Surveillance capitalism won’t be the business model of the internet forever. We need to change the fabric of the internet so that evil governments don’t have the tools to create a horrific totalitarian state. And while good laws and regulations in Western democracies are a great second line of defense, they can’t be our only line of defense.

My guess is that we will soon reach a high-water mark of computerization and connectivity, and that afterward we will make conscious decisions about what and how we decide to interconnect. But we’re still in the honeymoon phase of connectivity. Governments and corporations are punch-drunk on our data, and the rush to connect everything is driven by an even greater desire for power and market share. One of the presentations released by Edward Snowden contained the NSA mantra: “Collect it all.” A similar mantra for the internet today might be: “Connect it all.”

The inevitable backlash will not be driven by the market. It will be deliberate policy decisions that put the safety and welfare of society above individual corporations and industries. It will be deliberate policy decisions that prioritize the security of our systems over the demands of the FBI to weaken them in order to make their law-enforcement jobs easier. It’ll be hard policy for many to swallow, but our safety will depend on it.

**********

The scenarios I’ve outlined, both the technological and economic trends that are causing them and the political changes we need to make to start to fix them, come from my years of working in internet-security technology and policy. All of this is informed by an understanding of both technology and policy. That turns out to be critical, and there aren’t enough people who understand both.

This brings me to my final plea: We need more public-interest technologists.

Over the past couple of decades, we’ve seen examples of getting internet-security policy badly wrong. I’m thinking of the FBI’s “going dark” debate about its insistence that computer devices be designed to facilitate government access, the “vulnerability equities process” about when the government should disclose and fix a vulnerability versus when it should use it to attack other systems, the debacle over paperless touch-screen voting machines, and the DMCA that I discussed above. If you watched any of these policy debates unfold, you saw policy-makers and technologists talking past each other.

Our world-size robot will exacerbate these problems. The historical divide between Washington and Silicon Valley -­ the mistrust of governments by tech companies and the mistrust of tech companies by governments ­- is dangerous.

We have to fix this. Getting IoT security right depends on the two sides working together and, even more important, having people who are experts in each working on both. We need technologists to get involved in policy, and we need policy-makers to get involved in technology. We need people who are experts in making both technology and technological policy. We need technologists on congressional staffs, inside federal agencies, working for NGOs, and as part of the press. We need to create a viable career path for public-interest technologists, much as there already is one for public-interest attorneys. We need courses, and degree programs in colleges, for people interested in careers in public-interest technology. We need fellowships in organizations that need these people. We need technology companies to offer sabbaticals for technologists wanting to go down this path. We need an entire ecosystem that supports people bridging the gap between technology and law. We need a viable career path that ensures that even though people in this field won’t make as much as they would in a high-tech start-up, they will have viable careers. The security of our computerized and networked future ­ meaning the security of ourselves, families, homes, businesses, and communities ­ depends on it.

This plea is bigger than security, actually. Pretty much all of the major policy debates of this century will have a major technological component. Whether it’s weapons of mass destruction, robots drastically affecting employment, climate change, food safety, or the increasing ubiquity of ever-shrinking drones, understanding the policy means understanding the technology. Our society desperately needs technologists working on the policy. The alternative is bad policy.

**********

The world-size robot is less designed than created. It’s coming without any forethought or architecting or planning; most of us are completely unaware of what we’re building. In fact, I am not convinced we can actually design any of this. When we try to design complex sociotechnical systems like this, we are regularly surprised by their emergent properties. The best we can do is observe and channel these properties as best we can.

Market thinking sometimes makes us lose sight of the human choices and autonomy at stake. Before we get controlled ­ or killed ­ by the world-size robot, we need to rebuild confidence in our collective governance institutions. Law and policy may not seem as cool as digital tech, but they’re also places of critical innovation. They’re where we collectively bring about the world we want to live in.

While I might sound like a Cassandra, I’m actually optimistic about our future. Our society has tackled bigger problems than this one. It takes work and it’s not easy, but we eventually find our way clear to make the hard choices necessary to solve our real problems.

The world-size robot we’re building can only be managed responsibly if we start making real choices about the interconnected world we live in. Yes, we need security systems as robust as the threat landscape. But we also need laws that effectively regulate these dangerous technologies. And, more generally, we need to make moral, ethical, and political decisions on how those systems should work. Until now, we’ve largely left the internet alone. We gave programmers a special right to code cyberspace as they saw fit. This was okay because cyberspace was separate and relatively unimportant: That is, it didn’t matter. Now that that’s changed, we can no longer give programmers and the companies they work for this power. Those moral, ethical, and political decisions need, somehow, to be made by everybody. We need to link people with the same zeal that we are currently linking machines. “Connect it all” must be countered with “connect us all.”

This essay previously appeared in New York Magazine.

Canadian Stock Exchange Blocked Megaupload 2.0 Plans

Post Syndicated from Ernesto original https://torrentfreak.com/canadian-stock-exchange-blocked-megaupload-2-0-plans-170124/

megaupload-mu2Last Friday it was exactly five years ago that the original Megaupload service was taken offline as part of a U.S. criminal investigation.

Kim Dotcom wanted to use this special date to announce new details about its successor Megaupload 2.0 and the associated Bitcache service. However, minutes before the announcement, something got in the way.

Today, Kim Dotcom, chief “evangelist” of the service, explains what happened. The original idea was to announce a prominent merger deal with a Canadian company that would bring in an additional $12 million in capital.

Megaupload 2.0 and Bitcache already secured its initial investment round last October. Through Max Keiser’s crowdfunding platform Bank to the Future, it raised well over a million dollars from 354 investors in just two weeks.

To bring in more capital, the startup had quietly struck a stock and cash merger deal with a publicly listed company on the Canadian stock exchange, at a $100 million valuation.

This news was supposed to break last Friday, but just minutes before going public the Canadian Securities Exchange got in the way, according to Dotcom.

The Canadian company sent a draft press release of its merger plans to the exchange, which swiftly came back with some objections, effectively blocking the announcement.

“Trading of the stock was halted while waiting for a response. The Exchange demonstrated a bias against the merger and requested further detailed and intrusive information,” a statement released by Dotcom says.

Dotcom doesn’t reveal what the concerns of the Exchange were, but it’s not unlikely that the links to a pending criminal Megaupload case in the United States may play a role.

Megaupload 2.0 and Bitcache put their lawyers on the case, but the company eventually decided to back away from the planned merger.

“Bitcache feels it is important as a technology startup to stay nimble and reduce corporate complexity in favor of technology development. The experience of dealing with the Exchange has only served to encourage that view,” Dotcom’s announcement reads.

While the original plan has been scuppered, Dotcom and his team will now focus on getting the service ready for a first beta release. A proof of concept is scheduled to come out during the second quarter of the year, soon followed by a closed beta.

The first open release is penned for the end of the year according to the current planning, Dotcom informs us.

From what has been revealed thus far, Megaupload 2.0 and the associated Bitcache platform will allow people to share and store files, linking every file-transfer to a bitcoin transaction.

Unlike the original Megaupload, the new version isn’t going to store all files itself. Instead, it plans to use third-party providers such as Maidsafe and Storj.

“Megaupload 2 will be a caching provider for popular files on special high-speed servers that serve the files from ram. Long term storage will mostly be provided by numerous third-party sites that we are partnering with. You can expect more details on January 20,” Dotcom previously told us.

Prospective users who are eager to see what the service has in store have to be patient for a little longer, but Dotcom is confident that it will be a game-changer on multiple fronts.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

WhatsApp Security Vulnerability

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/01/whatsapp_securi.html

Back in March, Rolf Weber wrote about a potential vulnerability in the WhatsApp protocol that would allow Facebook to defeat perfect forward secrecy by forcibly change users’ keys, allowing it — or more likely, the government — to eavesdrop on encrypted messages.

It seems that this vulnerability is real:

WhatsApp has the ability to force the generation of new encryption keys for offline users, unbeknown to the sender and recipient of the messages, and to make the sender re-encrypt messages with new keys and send them again for any messages that have not been marked as delivered.

The recipient is not made aware of this change in encryption, while the sender is only notified if they have opted-in to encryption warnings in settings, and only after the messages have been re-sent. This re-encryption and rebroadcasting effectively allows WhatsApp to intercept and read users’ messages.

The security loophole was discovered by Tobias Boelter, a cryptography and security researcher at the University of California, Berkeley. He told the Guardian: “If WhatsApp is asked by a government agency to disclose its messaging records, it can effectively grant access due to the change in keys.”

The vulnerability is not inherent to the Signal protocol. Open Whisper Systems’ messaging app, Signal, the app used and recommended by whistleblower Edward Snowden, does not suffer from the same vulnerability. If a recipient changes the security key while offline, for instance, a sent message will fail to be delivered and the sender will be notified of the change in security keys without automatically resending the message.

WhatsApp’s implementation automatically resends an undelivered message with a new key without warning the user in advance or giving them the ability to prevent it.

Note that it’s an attack against current and future messages, and not something that would allow the government to reach into the past. In that way, it is no more troubling than the government hacking your mobile phone and reading your WhatsApp conversations that way.

An unnamed “WhatsApp spokesperson” said that they implemented the encryption this way for usability:

In WhatsApp’s implementation of the Signal protocol, we have a “Show Security Notifications” setting (option under Settings > Account > Security) that notifies you when a contact’s security code has changed. We know the most common reasons this happens are because someone has switched phones or reinstalled WhatsApp. This is because in many parts of the world, people frequently change devices and Sim cards. In these situations, we want to make sure people’s messages are delivered, not lost in transit.

He’s technically correct. This is not a backdoor. This really isn’t even a flaw. It’s a design decision that put usability ahead of security in this particular instance. Moxie Marlinspike, creator of Signal and the code base underlying WhatsApp’s encryption, said as much:

Under normal circumstances, when communicating with a contact who has recently changed devices or reinstalled WhatsApp, it might be possible to send a message before the sending client discovers that the receiving client has new keys. The recipient’s device immediately responds, and asks the sender to reencrypt the message with the recipient’s new identity key pair. The sender displays the “safety number has changed” notification, reencrypts the message, and delivers it.

The WhatsApp clients have been carefully designed so that they will not re-encrypt messages that have already been delivered. Once the sending client displays a “double check mark,” it can no longer be asked to re-send that message. This prevents anyone who compromises the server from being able to selectively target previously delivered messages for re-encryption.

The fact that WhatsApp handles key changes is not a “backdoor,” it is how cryptography works. Any attempt to intercept messages in transmit by the server is detectable by the sender, just like with Signal, PGP, or any other end-to-end encrypted communication system.

The only question it might be reasonable to ask is whether these safety number change notifications should be “blocking” or “non-blocking.” In other words, when a contact’s key changes, should WhatsApp require the user to manually verify the new key before continuing, or should WhatsApp display an advisory notification and continue without blocking the user.

Given the size and scope of WhatsApp’s user base, we feel that their choice to display a non-blocking notification is appropriate. It provides transparent and cryptographically guaranteed confidence in the privacy of a user’s communication, along with a simple user experience. The choice to make these notifications “blocking” would in some ways make things worse. That would leak information to the server about who has enabled safety number change notifications and who hasn’t, effectively telling the server who it could MITM transparently and who it couldn’t; something that WhatsApp considered very carefully.

How serious this is depends on your threat model. If you are worried about the US government — or any other government that can pressure Facebook — snooping on your messages, then this is a small vulnerability. If not, then it’s nothing to worry about.

Slashdot thread. Hacker News thread. BoingBoing post. More here.

EDITED TO ADD (1/24): Zeynep Tufekci takes the Guardian to task for their reporting on this vulnerability. (Note: I signed on to her letter.)

GrafanaCon 2016 Videos Available

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/01/13/grafanacon-2016-videos-available/

Last November we held the second GrafanaCon. This time it was actually a real conference and not a glorified meetup (like GrafanaCon 2015).
It was a full two day conference, with over 200 attendees, 10 sponsors, 30 talks covering a diverse set
of topics from the Grafana ecosystem & user community. We also had an awesome and unique venue aboard the Intrepid Sea, Air & Space Museum
which is an aircraft carrier that has been converted into a museum highlighting US naval and air force history (it also has the spaceship Enterprise on board).

The event was big success, attendees and sponsors seemed happy, talks were interesting, and the after conference party was a hit.
Personally it was really amazing seeing how big the Grafana community has become and getting a chance to talk to so many users and companies
that are leveraging Grafana in unique ways. It was also great to meet some of the big Grafana contributers
like Utkarsh Bhatnagar and Mitsuhiro Tanda.

Videos of Talks

Videos from all the talks are available on the official Grafana Youtube Channel.
There is a new playlist that contains all the talks.

Photos from the event

My transportation to and from the conference

The theater was a great place to have talks on day 1.

Day 2 talks were held in two rooms with a great view of the Hudson.

Every attendee got a Grafana scarf. The most beautiful and useful piece of swag I have ever seen (but I may be a bit biased).

It was great getting the Grafana Labs team together for the event.

Thanks

A big thanks to all the conference Sponsors, presenters and attendees. Hope to see you all again at GrafanaCon 2017.

Pirate Bay Offered to Help Catch Criminals But Copyright Got in the Way

Post Syndicated from Andy original https://torrentfreak.com/pirate-bay-offered-to-help-catch-criminals-but-copyright-got-in-the-way-170109/

thepirateIf The Pirate Bay manages to navigate the stormy waters of the Internet for another couple of years, it will have spent an unprecedented decade-and-a-half thumbing its nose at the authorities. Of course, that has come at a price.

The authorities’ interest in The Pirate Bay remains at a high and, given the chance, police in some countries would happily take down the world’s most prominent copyright scofflaw. However, painting the site as having no respect for any law would be doing it a disservice. In fact, at one point it even offered to work with the police.

The revelations follow the publication of a shocking article by Aftonbladet (Swedish) which details how, over an extended period, its reporters monitored dozens of people sharing images of child abuse online. The publication even met up with some of its targets and conducted interviews in person.

One of the people to comment on the extraordinary piece is Tobias Andersson, an early spokesperson of free-sharing advocacy group Piratbyrån (Pirate Bureau) and The Pirate Bay. Interestingly, Andersson reveals how The Pirate Bay offered to help police catch these kinds of offenders many years ago.

“A ‘fun’ thing about my time at the Pirate Bureau and The Pirate Bay was when the National Police, during the middle of the trial against us, called and wanted to consult about [abuse images] and TPB,” Andersson says.

The former site spokesperson, who also had more recent responsibility at The Promo Bay project, says he went to meet the police where he spoke with an officer and a technician. They had a specific request – to implement a filter to stop certain content appearing on the site.

“They wanted us to block certain [abuse-related] keywords,” Andersson explains.

Of course, keyword filters are notoriously weak and easily circumvented. So, instead, Andersson suggested another route the authorities might take which, due to the very public nature of torrent sharing (especially more than a decade ago when people were less privacy-conscious), might make actual perpetrators more easy to catch.

“I told [the police] how they could see the IP addresses in a [BitTorrent] client belonging to those who were sharing the content,” Andersson explains.

“I showed them how to start a torrent at 0.1kb/s download to be able to see the client list but without sharing anything. Which is not really rocket science,” the TPB and Piratbyrån veteran informs TorrentFreak.

Somewhat disappointingly, however, the police were unresponsive.

“They were not at all interested,” he says.

“Our skilled moderators [on The Pirate Bay] routinely deleted everything that could be suspected to be child porn, but still people tried to post it again and again. I wanted to explain to the police that we could easily identify most of stuff being posted but they were totally uninterested.”

Meanwhile, however, Hollywood and the recording industries were working with Swedish police on a highly expensive and complex technical case to bring down The Pirate Bay on copyright grounds. Sadly, it was to be further copyright-related demands that would bring negotiations on catching more serious offenders to an end.

“Because we refused to censor [The Pirate Bay’s] search to remove, for example, a crappy Stanley Kubrick movie, our ‘cooperation’ with the police ended there. Too bad, because we could have easily provided them with lists [of offenders] like those Aftonbladet reported today,” Andersson concludes.

Today’s revelations mark the second time The Pirate Bay has been shown to work with authorities to trap serious criminals. In 2013, the site provided evidence to TorrentFreak which showed notorious copyright troll outfit Prenda Law uploaded “honey-pot” torrents to the site. The principals of that organization are now facing charges of extortion and fraud.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Your absurd story doesn’t make me a Snowden apologist

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/12/your-absurd-story-doesnt-make-me.html

Defending truth in the Snowden Affair doesn’t make one an “apologist”, for either side. There plenty of ardent supporters on either side that need to be debunked. The latest (anti-Snowden) example is the HPSCI committee report on Snowden [*], and stories like this one in the Wall Street Journal [*]. Pointing out the obvious holes doesn’t make us “apologists”.

As Edward Epstein documents in the WSJ story, one of the lies Snowden told was telling his employer (Booz-Allen) that he was being treated for epilepsy when in fact he was fleeing to Hong Kong in order to give documents to Greenwald and Poitras.

Well, of course he did. If you are going to leak a bunch of documents to the press, you can’t do that without deceiving your employer. That’s the very definition of this sort of “whistleblowing”. Snowden has been quite open to the public about the lies he told his employer, including this one.

Rather than evidence that there’s something wrong with Snowden, the way Snowden-haters (is that the opposite of “apologist”?) seize on this is evidence that they are a bit unhinged.

The next “lie” is the difference between the number of documents Greenwald says he received (10,000) and the number investigators claim were stolen (1.5 million). This is not the discrepancy that it seems. A “document” counted by the NSA is not the same as the number of “files” you might get on a thumb drive, which was shown the various ways of counting the size of the Chelsea/Bradley Manning leaks. Also, the NSA can only see which files Snowden accessed, not which ones were then subsequently copied to a thumb drive.

Finally, there is the more practical issue that Snowden cannot review the documents while at work. He’d have to instead download databases and copy whole directories to his thumb drives. Only away from work would he have the chance to winnow down which documents he wanted to take to Hong Kong, deleting the rest. Nothing Snowden has said conflicts with him deleting lots of stuff he never gave journalists, that he never took with him to Hong Kong, or took with him to Moscow.

The next “lie” is that Snowden claims the US revoked his passport after he got on the plane from Hong Kong and before he landed in Moscow.

This is factually wrong, in so far as the US had revoked his passport (and issued an arrest warrant) and notified Hong Kong of the revocation a day before the plane took off. However, as numerous news reports of the time reported, the US information [in the arrest warrant] was contradictory and incomplete, and thus Hong Kong did nothing to stop Snowden from leaving [*]. The Guardian [*] quotes a Hong Kong official as saying Snowden left “through a lawful and normal channel”. Seriously, countries are much less concerned about checking passports of passenger leaving than those arriving.

It’s the WSJ article that’s clearly prevaricating here, quoting a news article where a Hong Kong official admits being notified, but not quoting the officials saying that the information was bad, that they took no action, and that Snowden left in the normal way.

The next item is Snowden’s claim he destroyed all his copies of US secrets before going to Moscow. To debunk this, the WSJ refers to an NPR interview [*] with Frants Klintsevich, deputy chairman of the defense and security committee within the Duma at the time. Klintsevich is quoted as saying “Let’s be frank, Snowden did share intelligence”.

But Snowden himself debunks this:

The WSJ piece was written a week after this tweet. It’s hard to imagine why they ignored it. Either it itself is a lie (in which case, it should’ve been added to the article), or it totally debunks the statement. If Klintsevich is “only speculating”, then nothing after that point can be used to show Snowden is lying.

Thus, again we have proof that Epstein cannot be trusted. He clearly has an angle and bends evidence to service that angle, rather than being a reliable source of information.

I am no Snowden apologist. Most of my blogposts regarding Snowden have gone the other way, criticizing the way those like The Intercept distort Snowden disclosures in an anti-NSA/anti-USA manner. In areas of my experience (network stuff), I’ve blogged showing that those reporting on Snowden are clearly technically deficient.

But in this post, I show how Edward Epstein is clearly biased/untrustworthy, and how he adjusts the facts into a character attack on Snowden. I’ve documented it in a clear way that you can easily refute if I’m not correct. This is not because I’m a biased toward Snowden, but because I’m biased toward the truth.