Tag Archives: 684

Съд на ЕС: финансиране на обществена телевизия, приходите от реклама, казусът TV2 Danmark

Post Syndicated from nellyo original https://nellyo.wordpress.com/2017/11/11/tv2-danmark/

I

Публикувано е решение по делото C‑649/15 P TV2/Danmark A/S  срещу Европейска комисия.

С жалбата си TV2/Danmark   иска частична отмяна на решението на Общия съд, с което той отменя Решение 2011/839/ЕС на Комисията относно мерките, приведени в действие от Дания (C 2/03) за TV2/Danmark в частта, в която Комисията е приела, че приходите от реклама  са държавни помощи  и  отхвърля жалбата му в останалата ѝ част.

Държавни помощи — Член 107, параграф 1 ДФЕС — Обществена услуга по разпространение на радио- и телевизионни програми — Мерки, взети от датските органи за датския радио- и телевизионен оператор TV2/Danmark — Понятие за помощ, предоставена от държава членка или чрез ресурси на държава членка — Решение Altmark

TV2/Danmark е втората обществена телевизия в Дания. Дейността на обществената телевизия  се финансира от  такса, плащана от всички телевизионни зрители в Дания,  и от рекламна дейност.

В резултат на жалба, подадена  от SBS Broadcasting, ЕК анализира финансирането и приема, че това са държавни помощи, съвместими с вътрешния пазар  – с изключение на една сума, която представлява свръхкомпенсация.

Срещу решението на ЕК постъпват четири жалби. Общият съд отменя посоченото решение. Комисията преразглежда въпросните мерки. Тя запазва позицията си относно квалификацията на разглежданите мерки като „държавни помощи“ по смисъла на член 107, параграф 1 ДФЕС и приема, че приходите от реклама  представляват ресурси на държавата.

И второто решение на ЕК се обжалва. TV2/Danmark иска от Общия съд да отмени спорното решение в частта, в която Комисията е приела, че:

–        всички разглеждани мерки представляват нови помощи,

–        приходите от  такси, прехвърлени на регионалните станции на TV2/Danmark, представляват държавни помощи за TV2/Danmark,

–        приходите от реклама представляват държавни помощи за TV2/Danmark.

Общият съд отменя спорното решение в частта, в която Комисията е приела, че приходите от реклама представляват държавни помощи, и отхвърля жалбата в останалата ѝ част.

TV2/Danmark обжалва.

Жалбата  е отхвърлена в нейната цялост,  пълният текст на решението

II

В отделно решение Съдът се произнася и по дело C‑656/15 P   , в което Комисията иска от Съда  да отмени обжалваното съдебно решение в частта му, в която е отменено   решението на Комисията, че приходите от реклама, платени на TV2/Danmark чрез фонд TV2, представляват държавни помощи

Според ЕК:

Общият съд е тълкувал неправилно понятието „ресурси на държава членка“ по смисъла на член 107, параграф 1 ДФЕС.  Комисията поддържа, че тъй като TV2 Reklame е публично дружество, чийто единствен акционер е датската държава, това дружество е изцяло под контрола и на разположение на последната, така че средствата му е трябвало да се считат за ресурси на държава членка по смисъла на тази разпоредба. По-конкретно Комисията твърди, че произходът на съответните ресурси не е релевантен за такава квалификация и че първоначално частният характер на тези ресурси не отнема характера им на ресурси на държавата членка.

Според решението:

40      Съгласно постоянната практика на Съда, за да бъде квалифицирана дадена мярка като „помощ“ по смисъла на член 107, параграф 1 ДФЕС, трябва да са изпълнени всички посочени в тази разпоредба условия.

41      Посочената разпоредба предвижда четири условия. Първо, трябва да става въпрос за намеса на държавата или чрез ресурси на държавата. Второ, тази намеса трябва да може да засегне търговията между държавите членки. Трето, тя трябва да предоставя предимство на ползващия се от нея. Четвърто, тя трябва да нарушава или да създава опасност от нарушаване на конкуренцията (вж. решение  Altmark Trans и др.)

45      Всъщност правото на Съюза не може да допусне правилата в областта на държавните помощи да бъдат заобикаляни единствено със създаването на самостоятелни институции, отговарящи за разпределянето на помощите.

46      Освен това следва да се припомни, че съгласно постоянната практика на Съда член 107, параграф 1 ДФЕС обхваща всички имуществени средства, които публичните органи могат ефективно да използват, за да подкрепят предприятия, като е без значение дали тези средства спадат постоянно или не към патримониума на държавата. Ето защо, макар съответстващите на разглежданата мярка суми да не се притежават постоянно от държавата, обстоятелството, че те остават непрекъснато под публичен контрол и следователно на разположение на компетентните национални власти, е достатъчно, за да бъдат квалифицирани като „държавни ресурси“

47      Следователно, при положение че ресурсите на публичните предприятия попадат под контрола на държавата, така че са на нейно разположение, тези ресурси се обхващат от понятието „ресурси на държава членка“ по смисъла на член 107, параграф 1 ДФЕС. Всъщност посредством упражняването на своето доминиращо влияние върху тези предприятия държавата е напълно в състояние да насочи използването на техните средства, за да финансира евентуално специфични предимства в полза на други предприятия

52      Вследствие на това въпросните приходи се намират под публичен контрол и на разположение на държавата, която може да взема решения относно тяхното разпределяне.

53      При това положение съгласно припомнената в точки 43—48 от настоящото решение практика на Съда, разглежданите приходи представляват „ресурси на държава членка“ по смисъла на член 107, параграф 1 ДФЕС.

54      Следователно, като е приел, че приходите от продажбата от TV2 Reklame на рекламните блокове на TV2/Danmark и са преведени на последното посредством фонд TV2, не представляват ресурси на държава членка, поради което Комисията неправилно ги е квалифицирала като „държавна помощ“, Общият съд е допуснал грешка при прилагане на правото.

63      Впрочем, както вече бе отбелязано, настоящото дело се отнася до публични предприятия, в случая TV2 Reklame и фонд TV2, създадени, притежавани и упълномощени от датската държава, за да управляват приходите от продажбата на рекламните блокове на друго публично предприятие, а именно TV2/Danmark, така че тези приходи се намират под контрола и на разположение на датската държава.

По изложените съображения Съдът (първи състав) отменя решението на Общия съд TV2/Danmark/Комисия (T‑674/11, EU:T:2015:684), в частта му, в която е отменено Решение 2011/839/ЕС на Комисията относно мерките, приведени в действие от Дания (C 2/03) за TV2/Danmark, в частта, в която Европейската комисия е приела, че приходите от реклама, платени на TV2/Danmark чрез фонд TV2, представляват държавни помощи.

III

По дело C‑657/15 P  с предмет жалба на Viasat Broadcasting UK Ltd, установено в Уест Драйтън (Обединеното кралство) Съдът приема следното –

27      Според Viasat прехвърлянето на съответните средства чрез фонд TV2 не променя факта, че те са ресурси на държава членка, тъй като фонд TV2 също е публично предприятие, контролирано от датската държава.

28      Предвид тези обстоятелства Viasat счита, че настоящото дело се различава съществено от делата, по които са постановени решения от 13 март 2001 г., PreussenElektra (C‑379/98, EU:C:2001:160), и от 5 март 2009 г., UTECA (C‑222/07, EU:C:2009:124), отнасящи се до случаи, при които съответните ресурси в нито един момент не са напускали частния сектор.

Съдът уважава първото основание, изтъкнато от Viasat,  в подкрепа на жалбата му.

*

За заключенията на ГА

На вниманието на всички, които имат намерение да се произнасят по реформата на финансирането на обществените медии.

Filed under: EU Law, Media Law Tagged: съд на ес

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Getting Your Data into the Cloud is Just the Beginning

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/cost-data-of-transfer-cloud-storage/

Total Cloud Storage Cost

Organizations should consider not just the cost of getting their data into the cloud, but also long-term costs for storage and retrieval when deciding which cloud storage solution meets their needs.

As cloud storage has become ubiquitous, organizations large and small are joining in. For larger organizations the lure of reducing capital expenses and their associated operational costs is enticing. For smaller organizations, cloud storage often replaces an unmanageable closet full of external hard drives, thumb drives, SD cards, and other devices. With terabytes or even petabytes of data, the common challenge facing organizations, large and small, is how to get their data up to the cloud.

Transferring Data to the Cloud

The obvious solution for getting your data to the cloud is to upload your data from your internal network through the internet to the cloud storage vendor you’ve selected. Cloud storage vendors don’t charge you for uploading your data to their cloud, but you, of course, have to pay your network provider and that’s where things start to get interesting. Here are a few things to consider.

  • The initial upload: Unless you are just starting out, you will have a large amount of data you want to upload to the cloud. This could be data you wish to archive or have had archived previously, for example data stored on LTO tapes or kept stored on external hard drives.
  • Pipe size: This is the amount of upload bandwidth of your network connection. This is measured in Mbps (megabits per second). Remember, your data is stored in MB (megabytes), so an upload connection of 80 Mbps will transfer no more than 10 MB of data per second and most likely a lot less.
  • Cost and caps: In some places, organizations pay a flat monthly rate for a specified level of service (speed) for internet access. In other locations, internet access is metered, or pay as you go. In either case, there can be internet service caps that limit or completely stop data transfer once you reach your contracted threshold.

One or more of these challenges has the potential to make the initial upload of your data expensive and potentially impossible. You could wait until cloud storage companies start buying up internet providers and make data upload cheap (or free with Amazon Prime!), but there is another option.

Data Transfer Devices

Given the potential challenges of using your network for the initial upload of your data to the cloud, a handful of cloud storage companies have introduced data transfer or data ingest services. Backblaze has the B2 Fireball, Amazon has Snowball (and other similar devices), and Google recently introduced their Transfer Appliance.

KLRU-TV Austin PBS uploaded their Austin City Limits musical anthology series to Backblaze using a B2 Fireball.

These services work as follows:

  • The provider sends you a portable (or somewhat portable) storage device.
  • You connect the device to your network and load some amount of data on the device over your internal network connection.
  • You return the device, loaded with your data, to the provider, who uploads your data to your cloud storage account from inside their own data center.

Data Transfer Devices Save Time

Assuming your Internet connection is a flat rate service that has no caps or limits and your organizational operations can withstand the traffic, you still may want to opt to use a data transfer service to move your data to the cloud. Why? Time. For example, if your initial data upload is 100 TB here’s how long it would take using different network upload connection speeds:

Network Speed Upload Time
10 Mbps 3 years
100 Mbps 124 days
500 Mbps 25 days
1 Gbps 12 days

This assumes you are using most of your upload connection to upload your data, which is probably not realistic if you want to stay in business. You could potentially rent a better connection or upgrade your connection permanently, both of which add to the cost of running your business.

Speaking of cost, there is of course a charge for the data transfer service that can be summarized as follows:

  • Backblaze B2 Fireball — Up to 40 TB of data per trip for $550.00 for 30 days in use at your site.
  • Amazon Snowball — up to 50 TB of data per trip for $200.00 for 10 days use at your site, plus $15/day each day in use at your site thereafter.
  • Google Transfer Appliance — up to 100 TB of data per trip for $300.00 for 10 days use at your site, plus $10/day each day in use at your site thereafter.

These prices do not include shipping, which can range from $100 to $900 depending on shipping method, location, etc.

Both Amazon and Google have transfer devices that are larger and cost more. For comparison purposes below we’ll use the three device versions listed above.

The Real Cost of Uploading Your Data

If we stopped our review at the previous paragraph and we were prepared to load up our transfer device in 10 days or less, the clear winner would be Google. But, this leaves out two very important components of any cloud storage project; the cost of storing your data and the cost of downloading your data.

Let’s look at two examples:

Example 1 — Archive 100 TB of data:

  • Use the data transfer service move 100 TB of data to the cloud storage service.
  • Accomplish the transfer within 10 days.
  • Store that 100 TB of data for 1 year.
Service Transfer Cost Cloud Storage Total
Backblaze B2 $1,650 (3 trips) $6,000 $7,650
Google Cloud $300 (1 trip) $24,000 $24,300
Amazon S3 $400 (2 trips) $25,200 $25,600

Results:

  • Using the B2 Fireball to store data in Backblaze B2 saves you $16,650 over a one-year period versus the Google solution.
  • The payback period for using a Backblaze B2 FireBall versus a Google Transfer Appliance is less than 1 month.

Example 2 — Store and use 100 TB of data:

  • Use the data transfer service to move 100 TB of data to the cloud storage service.
  • Accomplish the transfer within 10 days.
  • Store that 100 TB of data for 1 year.
  • Add 5 TB a month (on average) to the total stored.
  • Delete 2 TB a month (on average) from the total stored.
  • Download 10 TB a month (on average) from the total stored.
Service Transfer Cost Cloud Storage Total
Backblaze B2 $1,650 (3 trips) $9,570 $11,220
Google Cloud $300 (1 trip) $39,684 $39,984
Amazon S3 $400 (2 trips) $36,114 $36,514

Results:

  • Using the B2 Fireball to store data in Backblaze B2 saves you $28,764 over a one-year period versus the Google solution.
  • The payback period for using a Backblaze B2 FireBall versus a Google Transfer Appliance is less than 1 month.

Notes:

  • All prices listed are based on list prices from the vendor websites as of the date of this blog post.
  • We are accomplishing the transfer of your data to the device within the 10 day “free” period specified by Amazon and Google.
  • We are comparing cloud storage services that have similar performance. For example, once the data is uploaded, it is readily available for download. The data is also available for access via a Web GUI, CLI, API, and/or various applications integrated with the cloud storage service. Multiple versions of files can be kept as desired. Files can be deleted any time.

To be fair, it requires Backblaze three trips to move 100 TB while it only takes 1 trip for the Google Transfer Appliance. This adds some cost to prepare, monitor, and ship three B2 Fireballs versus one Transfer Appliance. Even with that added cost, the Backblaze B2 solution will still be significantly less expensive over the one year period and beyond.

Have a Data Transfer Device Owner

Before you run out and order a transfer device, make sure the transfer process is someone’s job once the device arrives at your organization. Filling a transfer device should only take a few days, but if it is forgotten, you’ll find you’ve had the device for 2 or 3 weeks. While that’s not much of a problem with a B2 Fireball, it could start to get expensive otherwise.

Just the Beginning

As with most “new” technologies and services, you can expect other companies to jump in and provide various data ingest services. The cost will get cheaper or even free as cloud storage companies race to capture and lock up the data you have kept locally all these years. When you are evaluating cloud storage solutions, it’s best to look past the data ingest loss-leader price, and spend a few minutes to calculate the long-term cost of storing and using your data.

The post Getting Your Data into the Cloud is Just the Beginning appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

The new sd-bus API of systemd

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/the-new-sd-bus-api-of-systemd.html

With the new v221 release of
systemd

we are declaring the
sd-bus
API shipped with
systemd
stable. sd-bus is our minimal D-Bus
IPC
C library, supporting as
back-ends both classic socket-based D-Bus and
kdbus. The library has been been
part of systemd for a while, but has only been used internally, since
we wanted to have the liberty to still make API changes without
affecting external consumers of the library. However, now we are
confident to commit to a stable API for it, starting with v221.

In this blog story I hope to provide you with a quick overview on
sd-bus, a short reiteration on D-Bus and its concepts, as well as a
few simple examples how to write D-Bus clients and services with it.

What is D-Bus again?

Let’s start with a quick reminder what
D-Bus actually is: it’s a
powerful, generic IPC system for Linux and other operating systems. It
knows concepts like buses, objects, interfaces, methods, signals,
properties. It provides you with fine-grained access control, a rich
type system, discoverability, introspection, monitoring, reliable
multicasting, service activation, file descriptor passing, and
more. There are bindings for numerous programming languages that are
used on Linux.

D-Bus has been a core component of Linux systems since more than 10
years. It is certainly the most widely established high-level local
IPC system on Linux. Since systemd’s inception it has been the IPC
system it exposes its interfaces on. And even before systemd, it was
the IPC system Upstart used to expose its interfaces. It is used by
GNOME, by KDE and by a variety of system components.

D-Bus refers to both a
specification
,
and a reference
implementation
. The
reference implementation provides both a bus server component, as well
as a client library. While there are multiple other, popular
reimplementations of the client library – for both C and other
programming languages –, the only commonly used server side is the
one from the reference implementation. (However, the kdbus project is
working on providing an alternative to this server implementation as a
kernel component.)

D-Bus is mostly used as local IPC, on top of AF_UNIX sockets. However,
the protocol may be used on top of TCP/IP as well. It does not
natively support encryption, hence using D-Bus directly on TCP is
usually not a good idea. It is possible to combine D-Bus with a
transport like ssh in order to secure it. systemd uses this to make
many of its APIs accessible remotely.

A frequently asked question about D-Bus is why it exists at all,
given that AF_UNIX sockets and FIFOs already exist on UNIX and have
been used for a long time successfully. To answer this question let’s
make a comparison with popular web technology of today: what
AF_UNIX/FIFOs are to D-Bus, TCP is to HTTP/REST. While AF_UNIX
sockets/FIFOs only shovel raw bytes between processes, D-Bus defines
actual message encoding and adds concepts like method call
transactions, an object system, security mechanisms, multicasting and
more.

From our 10year+ experience with D-Bus we know today that while there
are some areas where we can improve things (and we are working on
that, both with kdbus and sd-bus), it generally appears to be a very
well designed system, that stood the test of time, aged well and is
widely established. Today, if we’d sit down and design a completely
new IPC system incorporating all the experience and knowledge we
gained with D-Bus, I am sure the result would be very close to what
D-Bus already is.

Or in short: D-Bus is great. If you hack on a Linux project and need a
local IPC, it should be your first choice. Not only because D-Bus is
well designed, but also because there aren’t many alternatives that
can cover similar functionality.

Where does sd-bus fit in?

Let’s discuss why sd-bus exists, how it compares with the other
existing C D-Bus libraries and why it might be a library to consider
for your project.

For C, there are two established, popular D-Bus libraries: libdbus, as
it is shipped in the reference implementation of D-Bus, as well as
GDBus, a component of GLib, the low-level tool library of GNOME.

Of the two libdbus is the much older one, as it was written at the
time the specification was put together. The library was written with
a focus on being portable and to be useful as back-end for higher-level
language bindings. Both of these goals required the API to be very
generic, resulting in a relatively baroque, hard-to-use API that lacks
the bits that make it easy and fun to use from C. It provides the
building blocks, but few tools to actually make it straightforward to
build a house from them. On the other hand, the library is suitable
for most use-cases (for example, it is OOM-safe making it suitable for
writing lowest level system software), and is portable to operating
systems like Windows or more exotic UNIXes.

GDBus
is a much newer implementation. It has been written after considerable
experience with using a GLib/GObject wrapper around libdbus. GDBus is
implemented from scratch, shares no code with libdbus. Its design
differs substantially from libdbus, it contains code generators to
make it specifically easy to expose GObject objects on the bus, or
talking to D-Bus objects as GObject objects. It translates D-Bus data
types to GVariant, which is GLib’s powerful data serialization
format. If you are used to GLib-style programming then you’ll feel
right at home, hacking D-Bus services and clients with it is a lot
simpler than using libdbus.

With sd-bus we now provide a third implementation, sharing no code
with either libdbus or GDBus. For us, the focus was on providing kind
of a middle ground between libdbus and GDBus: a low-level C library
that actually is fun to work with, that has enough syntactic sugar to
make it easy to write clients and services with, but on the other hand
is more low-level than GDBus/GLib/GObject/GVariant. To be able to use
it in systemd’s various system-level components it needed to be
OOM-safe and minimal. Another major point we wanted to focus on was
supporting a kdbus back-end right from the beginning, in addition to
the socket transport of the original D-Bus specification (“dbus1”). In
fact, we wanted to design the library closer to kdbus’ semantics than
to dbus1’s, wherever they are different, but still cover both
transports nicely. In contrast to libdbus or GDBus portability is not
a priority for sd-bus, instead we try to make the best of the Linux
platform and expose specific Linux concepts wherever that is
beneficial. Finally, performance was also an issue (though a secondary
one): neither libdbus nor GDBus will win any speed records. We wanted
to improve on performance (throughput and latency) — but simplicity
and correctness are more important to us. We believe the result of our
work delivers our goals quite nicely: the library is fun to use,
supports kdbus and sockets as back-end, is relatively minimal, and the
performance is substantially
better

than both libdbus and GDBus.

To decide which of the three APIs to use for you C project, here are
short guidelines:

  • If you hack on a GLib/GObject project, GDBus is definitely your
    first choice.

  • If portability to non-Linux kernels — including Windows, Mac OS and
    other UNIXes — is important to you, use either GDBus (which more or
    less means buying into GLib/GObject) or libdbus (which requires a
    lot of manual work).

  • Otherwise, sd-bus would be my recommended choice.

(I am not covering C++ specifically here, this is all about plain C
only. But do note: if you use Qt, then QtDBus is the D-Bus API of
choice, being a wrapper around libdbus.)

Introduction to D-Bus Concepts

To the uninitiated D-Bus usually appears to be a relatively opaque
technology. It uses lots of concepts that appear unnecessarily complex
and redundant on first sight. But actually, they make a lot of
sense. Let’s have a look:

  • A bus is where you look for IPC services. There are usually two
    kinds of buses: a system bus, of which there’s exactly one per
    system, and which is where you’d look for system services; and a
    user bus, of which there’s one per user, and which is where you’d
    look for user services, like the address book service or the mail
    program. (Originally, the user bus was actually a session bus — so
    that you get multiple of them if you log in many times as the same
    user –, and on most setups it still is, but we are working on
    moving things to a true user bus, of which there is only one per
    user on a system, regardless how many times that user happens to
    log in.)

  • A service is a program that offers some IPC API on a bus. A
    service is identified by a name in reverse domain name
    notation. Thus, the org.freedesktop.NetworkManager service on the
    system bus is where NetworkManager’s APIs are available and
    org.freedesktop.login1 on the system bus is where
    systemd-logind‘s APIs are exposed.

  • A client is a program that makes use of some IPC API on a bus. It
    talks to a service, monitors it and generally doesn’t provide any
    services on its own. That said, lines are blurry and many services
    are also clients to other services. Frequently the term peer is
    used as a generalization to refer to either a service or a client.

  • An object path is an identifier for an object on a specific
    service. In a way this is comparable to a C pointer, since that’s
    how you generally reference a C object, if you hack object-oriented
    programs in C. However, C pointers are just memory addresses, and
    passing memory addresses around to other processes would make
    little sense, since they of course refer to the address space of
    the service, the client couldn’t make sense of it. Thus, the D-Bus
    designers came up with the object path concept, which is just a
    string that looks like a file system path. Example:
    /org/freedesktop/login1 is the object path of the ‘manager’
    object of the org.freedesktop.login1 service (which, as we
    remember from above, is still the service systemd-logind
    exposes). Because object paths are structured like file system
    paths they can be neatly arranged in a tree, so that you end up
    with a venerable tree of objects. For example, you’ll find all user
    sessions systemd-logind manages below the
    /org/freedesktop/login1/session sub-tree, for example called
    /org/freedesktop/login1/session/_7,
    /org/freedesktop/login1/session/_55 and so on. How services
    precisely label their objects and arrange them in a tree is
    completely up to the developers of the services.

  • Each object that is identified by an object path has one or more
    interfaces. An interface is a collection of signals, methods, and
    properties (collectively called members), that belong
    together. The concept of a D-Bus interface is actually pretty
    much identical to what you know from programming languages such as
    Java, which also know an interface concept. Which interfaces an
    object implements are up the developers of the service. Interface
    names are in reverse domain name notation, much like service
    names. (Yes, that’s admittedly confusing, in particular since it’s
    pretty common for simpler services to reuse the service name string
    also as an interface name.) A couple of interfaces are standardized
    though and you’ll find them available on many of the objects
    offered by the various services. Specifically, those are
    org.freedesktop.DBus.Introspectable, org.freedesktop.DBus.Peer
    and org.freedesktop.DBus.Properties.

  • An interface can contain methods. The word “method” is more or
    less just a fancy word for “function”, and is a term used pretty
    much the same way in object-oriented languages such as Java. The
    most common interaction between D-Bus peers is that one peer
    invokes one of these methods on another peer and gets a reply. A
    D-Bus method takes a couple of parameters, and returns others. The
    parameters are transmitted in a type-safe way, and the type
    information is included in the introspection data you can query
    from each object. Usually, method names (and the other member
    types) follow a CamelCase syntax. For example, systemd-logind
    exposes an ActivateSession method on the
    org.freedesktop.login1.Manager interface that is available on the
    /org/freedesktop/login1 object of the org.freedesktop.login1
    service.

  • A signature describes a set of parameters a function (or signal,
    property, see below) takes or returns. It’s a series of characters
    that each encode one parameter by its type. The set of types
    available is pretty powerful. For example, there are simpler types
    like s for string, or u for 32bit integer, but also complex
    types such as as for an array of strings or a(sb) for an array
    of structures consisting of one string and one boolean each. See
    the D-Bus specification
    for the full explanation of the type system. The
    ActivateSession method mentioned above takes a single string as
    parameter (the parameter signature is hence s), and returns
    nothing (the return signature is hence the empty string). Of
    course, the signature can get a lot more complex, see below for
    more examples.

  • A signal is another member type that the D-Bus object system
    knows. Much like a method it has a signature. However, they serve
    different purposes. While in a method call a single client issues a
    request on a single service, and that service sends back a response
    to the client, signals are for general notification of
    peers. Services send them out when they want to tell one or more
    peers on the bus that something happened or changed. In contrast to
    method calls and their replies they are hence usually broadcast
    over a bus. While method calls/replies are used for duplex
    one-to-one communication, signals are usually used for simplex
    one-to-many communication (note however that that’s not a
    requirement, they can also be used one-to-one). Example:
    systemd-logind broadcasts a SessionNew signal from its manager
    object each time a user logs in, and a SessionRemoved signal
    every time a user logs out.

  • A property is the third member type that the D-Bus object system
    knows. It’s similar to the property concept known by languages like
    C#. Properties also have a signature, and are more or less just
    variables that an object exposes, that can be read or altered by
    clients. Example: systemd-logind exposes a property Docked of
    the signature b (a boolean). It reflects whether systemd-logind
    thinks the system is currently in a docking station of some form
    (only applies to laptops …).

So much for the various concepts D-Bus knows. Of course, all these new
concepts might be overwhelming. Let’s look at them from a different
perspective. I assume many of the readers have an understanding of
today’s web technology, specifically HTTP and REST. Let’s try to
compare the concept of a HTTP request with the concept of a D-Bus
method call:

  • A HTTP request you issue on a specific network. It could be the
    Internet, or it could be your local LAN, or a company
    VPN. Depending on which network you issue the request on, you’ll be
    able to talk to a different set of servers. This is not unlike the
    “bus” concept of D-Bus.

  • On the network you then pick a specific HTTP server to talk
    to. That’s roughly comparable to picking a service on a specific bus.

  • On the HTTP server you then ask for a specific URL. The “path” part
    of the URL (by which I mean everything after the host name of the
    server, up to the last “/”) is pretty similar to a D-Bus object path.

  • The “file” part of the URL (by which I mean everything after the
    last slash, following the path, as described above), then defines
    the actual call to make. In D-Bus this could be mapped to an
    interface and method name.

  • Finally, the parameters of a HTTP call follow the path after the
    “?”, they map to the signature of the D-Bus call.

Of course, comparing an HTTP request to a D-Bus method call is a bit
comparing apples and oranges. However, I think it’s still useful to
get a bit of a feeling of what maps to what.

From the shell

So much about the concepts and the gray theory behind them. Let’s make
this exciting, let’s actually see how this feels on a real system.

Since a while systemd has included a tool busctl that is useful to
explore and interact with the D-Bus object system. When invoked
without parameters, it will show you a list of all peers connected to
the system bus. (Use --user to see the peers of your user bus
instead):

$ busctl
NAME                                       PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCRIPTION
:1.1                                         1 systemd         root             :1.1          -                         -          -
:1.11                                      705 NetworkManager  root             :1.11         NetworkManager.service    -          -
:1.14                                      744 gdm             root             :1.14         gdm.service               -          -
:1.4                                       708 systemd-logind  root             :1.4          systemd-logind.service    -          -
:1.7200                                  17563 busctl          lennart          :1.7200       session-1.scope           1          -
[…]
org.freedesktop.NetworkManager             705 NetworkManager  root             :1.11         NetworkManager.service    -          -
org.freedesktop.login1                     708 systemd-logind  root             :1.4          systemd-logind.service    -          -
org.freedesktop.systemd1                     1 systemd         root             :1.1          -                         -          -
org.gnome.DisplayManager                   744 gdm             root             :1.14         gdm.service               -          -
[…]

(I have shortened the output a bit, to make keep things brief).

The list begins with a list of all peers currently connected to the
bus. They are identified by peer names like “:1.11”. These are called
unique names in D-Bus nomenclature. Basically, every peer has a
unique name, and they are assigned automatically when a peer connects
to the bus. They are much like an IP address if you so will. You’ll
notice that a couple of peers are already connected, including our
little busctl tool itself as well as a number of system services. The
list then shows all actual services on the bus, identified by their
service names (as discussed above; to discern them from the unique
names these are also called well-known names). In many ways
well-known names are similar to DNS host names, i.e. they are a
friendlier way to reference a peer, but on the lower level they just
map to an IP address, or in this comparison the unique name. Much like
you can connect to a host on the Internet by either its host name or
its IP address, you can also connect to a bus peer either by its
unique or its well-known name. (Note that each peer can have as many
well-known names as it likes, much like an IP address can have
multiple host names referring to it).

OK, that’s already kinda cool. Try it for yourself, on your local
machine (all you need is a recent, systemd-based distribution).

Let’s now go the next step. Let’s see which objects the
org.freedesktop.login1 service actually offers:

$ busctl tree org.freedesktop.login1
└─/org/freedesktop/login1
  ├─/org/freedesktop/login1/seat
  │ ├─/org/freedesktop/login1/seat/seat0
  │ └─/org/freedesktop/login1/seat/self
  ├─/org/freedesktop/login1/session
  │ ├─/org/freedesktop/login1/session/_31
  │ └─/org/freedesktop/login1/session/self
  └─/org/freedesktop/login1/user
    ├─/org/freedesktop/login1/user/_1000
    └─/org/freedesktop/login1/user/self

Pretty, isn’t it? What’s actually even nicer, and which the output
does not show is that there’s full command line completion
available: as you press TAB the shell will auto-complete the service
names for you. It’s a real pleasure to explore your D-Bus objects that
way!

The output shows some objects that you might recognize from the
explanations above. Now, let’s go further. Let’s see what interfaces,
methods, signals and properties one of these objects actually exposes:

$ busctl introspect org.freedesktop.login1 /org/freedesktop/login1/session/_31
NAME                                TYPE      SIGNATURE RESULT/VALUE                             FLAGS
org.freedesktop.DBus.Introspectable interface -         -                                        -
.Introspect                         method    -         s                                        -
org.freedesktop.DBus.Peer           interface -         -                                        -
.GetMachineId                       method    -         s                                        -
.Ping                               method    -         -                                        -
org.freedesktop.DBus.Properties     interface -         -                                        -
.Get                                method    ss        v                                        -
.GetAll                             method    s         a{sv}                                    -
.Set                                method    ssv       -                                        -
.PropertiesChanged                  signal    sa{sv}as  -                                        -
org.freedesktop.login1.Session      interface -         -                                        -
.Activate                           method    -         -                                        -
.Kill                               method    si        -                                        -
.Lock                               method    -         -                                        -
.PauseDeviceComplete                method    uu        -                                        -
.ReleaseControl                     method    -         -                                        -
.ReleaseDevice                      method    uu        -                                        -
.SetIdleHint                        method    b         -                                        -
.TakeControl                        method    b         -                                        -
.TakeDevice                         method    uu        hb                                       -
.Terminate                          method    -         -                                        -
.Unlock                             method    -         -                                        -
.Active                             property  b         true                                     emits-change
.Audit                              property  u         1                                        const
.Class                              property  s         "user"                                   const
.Desktop                            property  s         ""                                       const
.Display                            property  s         ""                                       const
.Id                                 property  s         "1"                                      const
.IdleHint                           property  b         true                                     emits-change
.IdleSinceHint                      property  t         1434494624206001                         emits-change
.IdleSinceHintMonotonic             property  t         0                                        emits-change
.Leader                             property  u         762                                      const
.Name                               property  s         "lennart"                                const
.Remote                             property  b         false                                    const
.RemoteHost                         property  s         ""                                       const
.RemoteUser                         property  s         ""                                       const
.Scope                              property  s         "session-1.scope"                        const
.Seat                               property  (so)      "seat0" "/org/freedesktop/login1/seat... const
.Service                            property  s         "gdm-autologin"                          const
.State                              property  s         "active"                                 -
.TTY                                property  s         "/dev/tty1"                              const
.Timestamp                          property  t         1434494630344367                         const
.TimestampMonotonic                 property  t         34814579                                 const
.Type                               property  s         "x11"                                    const
.User                               property  (uo)      1000 "/org/freedesktop/login1/user/_1... const
.VTNr                               property  u         1                                        const
.Lock                               signal    -         -                                        -
.PauseDevice                        signal    uus       -                                        -
.ResumeDevice                       signal    uuh       -                                        -
.Unlock                             signal    -         -                                        -

As before, the busctl command supports command line completion, hence
both the service name and the object path used are easily put together
on the shell simply by pressing TAB. The output shows the methods,
properties, signals of one of the session objects that are currently
made available by systemd-logind. There’s a section for each
interface the object knows. The second column tells you what kind of
member is shown in the line. The third column shows the signature of
the member. In case of method calls that’s the input parameters, the
fourth column shows what is returned. For properties, the fourth
column encodes the current value of them.

So far, we just explored. Let’s take the next step now: let’s become
active – let’s call a method:

# busctl call org.freedesktop.login1 /org/freedesktop/login1/session/_31 org.freedesktop.login1.Session Lock

I don’t think I need to mention this anymore, but anyway: again
there’s full command line completion available. The third argument is
the interface name, the fourth the method name, both can be easily
completed by pressing TAB. In this case we picked the Lock method,
which activates the screen lock for the specific session. And yupp,
the instant I pressed enter on this line my screen lock turned on
(this only works on DEs that correctly hook into systemd-logind for
this to work. GNOME works fine, and KDE should work too).

The Lock method call we picked is very simple, as it takes no
parameters and returns none. Of course, it can get more complicated
for some calls. Here’s another example, this time using one of
systemd’s own bus calls, to start an arbitrary system unit:

# busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss "cups.service" "replace"
o "/org/freedesktop/systemd1/job/42684"

This call takes two strings as input parameters, as we denote in the
signature string that follows the method name (as usual, command line
completion helps you getting this right). Following the signature the
next two parameters are simply the two strings to pass. The specified
signature string hence indicates what comes next. systemd’s StartUnit
method call takes the unit name to start as first parameter, and the
mode in which to start it as second. The call returned a single object
path value. It is encoded the same way as the input parameter: a
signature (just o for the object path) followed by the actual value.

Of course, some method call parameters can get a ton more complex, but
with busctl it’s relatively easy to encode them all. See the man
page
for
details.

busctl knows a number of other operations. For example, you can use
it to monitor D-Bus traffic as it happens (including generating a
.cap file for use with Wireshark!) or you can set or get specific
properties. However, this blog story was supposed to be about sd-bus,
not busctl, hence let’s cut this short here, and let me direct you
to the man page in case you want to know more about the tool.

busctl (like the rest of system) is implemented using the sd-bus
API. Thus it exposes many of the features of sd-bus itself. For
example, you can use to connect to remote or container buses. It
understands both kdbus and classic D-Bus, and more!

sd-bus

But enough! Let’s get back on topic, let’s talk about sd-bus itself.

The sd-bus set of APIs is mostly contained in the header file
sd-bus.h.

Here’s a random selection of features of the library, that make it
compare well with the other implementations available.

  • Supports both kdbus and dbus1 as back-end.

  • Has high-level support for connecting to remote buses via ssh, and
    to buses of local OS containers.

  • Powerful credential model, to implement authentication of clients
    in services. Currently 34 individual fields are supported, from the
    PID of the client to the cgroup or capability sets.

  • Support for tracking the life-cycle of peers in order to release
    local objects automatically when all peers referencing them
    disconnected.

  • The client builds an efficient decision tree to determine which
    handlers to deliver an incoming bus message to.

  • Automatically translates D-Bus errors into UNIX style errors and
    back (this is lossy though), to ensure best integration of D-Bus
    into low-level Linux programs.

  • Powerful but lightweight object model for exposing local objects on
    the bus. Automatically generates introspection as necessary.

The API is currently not fully documented, but we are working on
completing the set of manual pages. For details
see all pages starting with sd_bus_.

Invoking a Method, from C, with sd-bus

So much about the library in general. Here’s an example for connecting
to the bus and issuing a method call:

#include <stdio.h>
#include <stdlib.h>
#include <systemd/sd-bus.h>

int main(int argc, char *argv[]) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        sd_bus_message *m = NULL;
        sd_bus *bus = NULL;
        const char *path;
        int r;

        /* Connect to the system bus */
        r = sd_bus_open_system(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Issue the method call and store the respons message in m */
        r = sd_bus_call_method(bus,
                               "org.freedesktop.systemd1",           /* service to contact */
                               "/org/freedesktop/systemd1",          /* object path */
                               "org.freedesktop.systemd1.Manager",   /* interface name */
                               "StartUnit",                          /* method name */
                               &error,                               /* object to return error in */
                               &m,                                   /* return message on success */
                               "ss",                                 /* input signature */
                               "cups.service",                       /* first argument */
                               "replace");                           /* second argument */
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", error.message);
                goto finish;
        }

        /* Parse the response message */
        r = sd_bus_message_read(m, "o", &path);
        if (r < 0) {
                fprintf(stderr, "Failed to parse response message: %s\n", strerror(-r));
                goto finish;
        }

        printf("Queued service job as %s.\n", path);

finish:
        sd_bus_error_free(&error);
        sd_bus_message_unref(m);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-client.c, then build it with:

$ gcc bus-client.c -o bus-client `pkg-config --cflags --libs libsystemd`

This will generate a binary bus-client you can now run. Make sure to
run it as root though, since access to the StartUnit method is
privileged:

# ./bus-client
Queued service job as /org/freedesktop/systemd1/job/3586.

And that’s it already, our first example. It showed how we invoked a
method call on the bus. The actual function call of the method is very
close to the busctl command line we used before. I hope the code
excerpt needs little further explanation. It’s supposed to give you a
taste how to write D-Bus clients with sd-bus. For more more
information please have a look at the header file, the man page or
even the sd-bus sources.

Implementing a Service, in C, with sd-bus

Of course, just calling a single method is a rather simplistic
example. Let’s have a look on how to write a bus service. We’ll write
a small calculator service, that exposes a single object, which
implements an interface that exposes two methods: one to multiply two
64bit signed integers, and one to divide one 64bit signed integer by
another.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <systemd/sd-bus.h>

static int method_multiply(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Reply with the response */
        return sd_bus_reply_method_return(m, "x", x * y);
}

static int method_divide(sd_bus_message *m, void *userdata, sd_bus_error *ret_error) {
        int64_t x, y;
        int r;

        /* Read the parameters */
        r = sd_bus_message_read(m, "xx", &x, &y);
        if (r < 0) {
                fprintf(stderr, "Failed to parse parameters: %s\n", strerror(-r));
                return r;
        }

        /* Return an error on division by zero */
        if (y == 0) {
                sd_bus_error_set_const(ret_error, "net.poettering.DivisionByZero", "Sorry, can't allow division by zero.");
                return -EINVAL;
        }

        return sd_bus_reply_method_return(m, "x", x / y);
}

/* The vtable of our little object, implements the net.poettering.Calculator interface */
static const sd_bus_vtable calculator_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Multiply", "xx", "x", method_multiply, SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_METHOD("Divide",   "xx", "x", method_divide,   SD_BUS_VTABLE_UNPRIVILEGED),
        SD_BUS_VTABLE_END
};

int main(int argc, char *argv[]) {
        sd_bus_slot *slot = NULL;
        sd_bus *bus = NULL;
        int r;

        /* Connect to the user bus this time */
        r = sd_bus_open_user(&bus);
        if (r < 0) {
                fprintf(stderr, "Failed to connect to system bus: %s\n", strerror(-r));
                goto finish;
        }

        /* Install the object */
        r = sd_bus_add_object_vtable(bus,
                                     &slot,
                                     "/net/poettering/Calculator",  /* object path */
                                     "net.poettering.Calculator",   /* interface name */
                                     calculator_vtable,
                                     NULL);
        if (r < 0) {
                fprintf(stderr, "Failed to issue method call: %s\n", strerror(-r));
                goto finish;
        }

        /* Take a well-known service name so that clients can find us */
        r = sd_bus_request_name(bus, "net.poettering.Calculator", 0);
        if (r < 0) {
                fprintf(stderr, "Failed to acquire service name: %s\n", strerror(-r));
                goto finish;
        }

        for (;;) {
                /* Process requests */
                r = sd_bus_process(bus, NULL);
                if (r < 0) {
                        fprintf(stderr, "Failed to process bus: %s\n", strerror(-r));
                        goto finish;
                }
                if (r > 0) /* we processed a request, try to process another one, right-away */
                        continue;

                /* Wait for the next request to process */
                r = sd_bus_wait(bus, (uint64_t) -1);
                if (r < 0) {
                        fprintf(stderr, "Failed to wait on bus: %s\n", strerror(-r));
                        goto finish;
                }
        }

finish:
        sd_bus_slot_unref(slot);
        sd_bus_unref(bus);

        return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}

Save this example as bus-service.c, then build it with:

$ gcc bus-service.c -o bus-service `pkg-config --cflags --libs libsystemd`

Now, let’s run it:

$ ./bus-service

In another terminal, let’s try to talk to it. Note that this service
is now on the user bus, not on the system bus as before. We do this
for simplicity reasons: on the system bus access to services is
tightly controlled so unprivileged clients cannot request privileged
operations. On the user bus however things are simpler: as only
processes of the user owning the bus can connect no further policy
enforcement will complicate this example. Because the service is on
the user bus, we have to pass the --user switch on the busctl
command line. Let’s start with looking at the service’s object tree.

$ busctl --user tree net.poettering.Calculator
└─/net/poettering/Calculator

As we can see, there’s only a single object on the service, which is
not surprising, given that our code above only registered one. Let’s
see the interfaces and the members this object exposes:

$ busctl --user introspect net.poettering.Calculator /net/poettering/Calculator
NAME                                TYPE      SIGNATURE RESULT/VALUE FLAGS
net.poettering.Calculator           interface -         -            -
.Divide                             method    xx        x            -
.Multiply                           method    xx        x            -
org.freedesktop.DBus.Introspectable interface -         -            -
.Introspect                         method    -         s            -
org.freedesktop.DBus.Peer           interface -         -            -
.GetMachineId                       method    -         s            -
.Ping                               method    -         -            -
org.freedesktop.DBus.Properties     interface -         -            -
.Get                                method    ss        v            -
.GetAll                             method    s         a{sv}        -
.Set                                method    ssv       -            -
.PropertiesChanged                  signal    sa{sv}as  -            -

The sd-bus library automatically added a couple of generic interfaces,
as mentioned above. But the first interface we see is actually the one
we added! It shows our two methods, and both take “xx” (two 64bit
signed integers) as input parameters, and return one “x”. Great! But
does it work?

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Multiply xx 5 7
x 35

Woohoo! We passed the two integers 5 and 7, and the service actually
multiplied them for us and returned a single integer 35! Let’s try the
other method:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 99 17
x 5

Oh, wow! It can even do integer division! Fantastic! But let’s trick
it into dividing by zero:

$ busctl --user call net.poettering.Calculator /net/poettering/Calculator net.poettering.Calculator Divide xx 43 0
Sorry, can't allow division by zero.

Nice! It detected this nicely and returned a clean error about it. If
you look in the source code example above you’ll see how precisely we
generated the error.

And that’s really all I have for today. Of course, the examples I
showed are short, and I don’t get into detail here on what precisely
each line does. However, this is supposed to be a short introduction
into D-Bus and sd-bus, and it’s already way too long for that …

I hope this blog story was useful to you. If you are interested in
using sd-bus for your own programs, I hope this gets you started. If
you have further questions, check the (incomplete) man pages, and
inquire us on IRC or the systemd mailing list. If you need more
examples, have a look at the systemd source tree, all of systemd’s
many bus services use sd-bus extensively.