Tag Archives: music

Steal This Show S03E09: Learning To Love Your Panopticon

Post Syndicated from Ernesto original https://torrentfreak.com/steal-show-s03e09-learning-love-panopticon/

stslogo180If you enjoy this episode, consider becoming a patron and getting involved with the show. Check out Steal This Show’s Patreon campaign: support us and get all kinds of fantastic benefits!

In this episode we meet Diani Barreto from the Berlin Bureau of ExposeFacs. Launched in June 2014, ExposeFacts.org supports and encourages whistleblowers to disclose information that citizens need to make truly informed decisions in a democracy.

ExposeFacts aims to shed light on concealed activities that are relevant to human rights, corporate malfeasance, the environment, civil liberties and war.

Steal This Show aims to release bi-weekly episodes featuring insiders discussing copyright and file-sharing news. It complements our regular reporting by adding more room for opinion, commentary, and analysis.

The guests for our news discussions will vary, and we’ll aim to introduce voices from different backgrounds and persuasions. In addition to news, STS will also produce features interviewing some of the great innovators and minds.

Host: Jamie King

Guest: Diani Barreto

Produced by Jamie King
Edited & Mixed by Riley Byrne
Original Music by David Triana
Web Production by Siraje Amarniss

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Anti-Piracy Group Joins Internet Organization That Controls Top-Level Domain

Post Syndicated from Andy original https://torrentfreak.com/anti-piracy-group-joins-internet-organization-that-controls-top-level-domain-171019/

All around the world, content creators and rightsholders continue to protest against the unauthorized online distribution of copyrighted content.

While pirating end-users obviously share some of the burden, the main emphasis has traditionally been placed on the shuttering of illicit sites, whether torrent, streaming, or hosting based.

Over time, however, sites have become more prevalent and increasingly resilient, leaving the music, movie and publishing industries to play a frustrating game of whac-a-mole. With this in mind, their focus has increasingly shifted towards Internet gatekeepers, including ISPs and bodies with influence over domain availability.

While most of these efforts take place via cooperation or legal action, there’s regularly conflict when Hollywood, for example, wants a particular domain rendered inaccessible or the music industry wants pirates kicked off the Internet.

As a result, there’s nearly always a disconnect, with copyright holders on one side and Internet technology companies worried about mission creep on the other. In Denmark, however, those lines have just been blurred in the most intriguing way possible after an infamous anti-piracy outfit joined an organization with significant control over the Internet in the country.

RettighedsAlliancen (or Rights Alliance as it’s more commonly known) is an anti-piracy group which counts some of the most powerful local and international movie companies among its members. It also operates on behalf of IFPI and by extension, most of the world’s major recording labels.

The group has been involved in dozens of legal processes over the years against file-sharers and file-sharing sites, most recently fighting for and winning ISP blockades against most major pirate portals including The Pirate Bay, RARBG, Torrentz, and many more.

In a somewhat surprising new announcement, the group has revealed it’s become the latest member of DIFO, the Danish Internet Forum (DIFO) which “works for a secure and accessible Internet” under the top-level .DK domain. Indeed, DIFO has overall responsibility for Danish internet infrastructure.

“For DIFO it is important to have a strong link to the Danish internet community. Therefore, we are very pleased that the Alliance wishes to be part of the association,” DIFO said in a statement.

Rights Alliance will be DIFO’s third new member this year but uniquely it will get the opportunity to represent the interests of more than 100,000 Danish and international rightholders from inside an influential Internet-focused organization.

Looking at DIFO’s membership, Rights Alliance certainly stands out as unusual. The majority of the members are made up of IT-based organizations, such as the Internet Industry Association, The Association of Open Source Suppliers and DKRegistrar, the industry association for Danish domain registrars.

A meeting around a table with these players and their often conflicting interests is likely to be an experience for all involved. However, all parties seem more than happy with the new partnership.

“We want to help create a more secure internet for companies that invest in doing business online, and for users to be safe, so combating digital crime is a key and shared goal,” says Rights Alliance chief, Maria Fredenslund. “I am therefore looking forward to the future cooperation with DIFO.”

Only time will tell how this partnership will play out but if common ground can be found, it’s certainly possible that the anti-piracy scene in Denmark could step up a couple of gears in the future.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Google Asked to Remove 3 Billion “Pirate” Search Results

Post Syndicated from Ernesto original https://torrentfreak.com/google-asked-to-remove-3-billion-pirate-search-results-171018/

Copyright holders continue to flood Google with DMCA takedown requests, asking the company to remove “pirate links” from its search results.

In recent years the number of reported URLs has exploded, surging to unprecedented heights.

Since Google first started to report the volume of takedown requests in its Transparency Report, the company has been asked to remove more than three billion allegedly infringing search results.

The frequency at which these URLs are reported has increased over the years and at the moment roughly three million ‘pirate’ URLs are submitted per day.

The URLs are sent in by major rightsholders including members of the BPI, RIAA, and various major Hollywood studios. They target a wide variety of sites, over 1.3 million, but a few dozen ‘repeat offenders’ are causing the most trouble.

File-hosting service 4shared.com currently tops the list of most-targeted domains with 66 million URLs, followed by the now-defunct MP3 download site MP3toys.xyz and Rapidgator.net, with 51 and 28 million URLs respectively.

3 billion URLs

Interestingly, the high volume of takedown notices is used as an argument for and against the DMCA process.

While Google believes that the millions of reported URLs per day are a sign that the DMCA takedown process is working correctly, rightsholders believe the volumes are indicative of an unbeatable game of whack-a-mole.

According to some copyright holders, the takedown efforts do little to seriously combat piracy. Various industry groups have therefore asked governments and lawmakers for broad revisions.

Among other things they want advanced technologies and processes to ensure that infringing content doesn’t reappear elsewhere once it’s removed, a so-called “notice and stay down” approach. In addition, Google has often been asked to demote pirate links in search results.

UK music industry group BPI, who are responsible for more than 10% of all the takedown requests on Google, sees the new milestone as an indicator of how much effort its anti-piracy activities take.

“This 3 billion figure shows how hard the creative sector has to work to police its content online and how much time and resource this takes. The BPI is the world’s largest remover of illegal music links from Google, one third of which are on behalf of independent record labels,” Geoff Taylor, BPI’s Chief Executive, informs TF.

However, there is also some progress to report. Earlier this year BPI announced a voluntary partnership with Google and Bing to demote pirate content faster and more effectively for US visitors.

“We now have a voluntary code of practice in place in the UK, facilitated by Government, that requires Google and Bing to work together with the BPI and other creator organizations to develop lasting solutions to the problem of illegal sites gaining popularity in search listings,” Taylor notes.

According to BPI, both Google and Bing have shown that changes to their algorithms can be effective in demoting the worst pirate sites from the top search results and they hope others will follow suit.

“Other intermediaries should follow this lead and take more responsibility to work with creators to reduce the proliferation of illegal links and disrupt the ability of illegal sites to capture consumers and build black market businesses that take money away from creators.”

Agreement or not, there are still plenty of pirate links in search results, so the BPI is still sending out millions of takedown requests per month.

We asked Google for a comment on the new milestone but at the time of writing, we have yet to hear back. In any event, the issue is bound to remain a hot topic during the months and years to come.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Google Asked to Delist Pirate Movie Sites, ISPs Asked to Block Them

Post Syndicated from Andy original https://torrentfreak.com/google-asked-to-delist-pirate-movie-sites-isps-asked-to-block-them-171018/

After seizing several servers operated by popular private music tracker What.cd, last November French police went after a much bigger target.

Boasting millions of regular visitors, Zone-Telechargement (Zone-Download) was ranked the 11th most-visited website in the whole of the country. The site offered direct downloads of a wide variety of pirated content, including films, series, games, and music. Until the French Gendarmerie shut it down, that is.

After being founded in 2011 and enjoying huge growth following the 2012 raids against Megaupload, the Zone-Telechargement ‘brand’ was still popular with French users, despite the closure of the platform. It, therefore, came as no surprise that the site was quickly cloned by an unknown party and relaunched as Zone-Telechargement.ws.

The site has been doing extremely well following its makeover. To the annoyance of copyright holders, SimilarWeb reports the platform as France’s 37th most popular site with around 58 million visitors per month. That’s a huge achievement in less than 12 months.

Now, however, the site is receiving more unwanted attention. PCInpact says it has received information that several movie-focused organizations including the French National Film Center are requesting tough action against the site.

The National Federation of Film Distributors, the Video Publishing Union, the Association of Independent Producers and the Producers Union are all demanding the blocking of Zone-Telechargement by several local ISPs, alongside its delisting from search results.

The publication mentions four Internet service providers – Free, Numericable, Bouygues Telecom, and Orange – plus Google on the search engine front. At this stage, other search companies, such as Microsoft’s Bing, are not reported as part of the action.

In addition to Zone-Telechargement, several other ‘pirate’ sites (Papystreaming.org, Sokrostream.cc and Zonetelechargement.su, another site playing on the popular brand) are included in the legal process. All are described as “structurally infringing” by the complaining movie outfits, PCInpact notes.

The legal proceedings against the sites are based in Article 336-2 of the Intellectual Property Code. It’s ground already trodden by movie companies who following a 2011 complaint, achieved victory in 2013 against several Allostreaming-linked sites.

In that case, the High Court of Paris ordered ISPs, several of which appear in the current action, to “implement all appropriate means including blocking” to prevent access to the infringing sites.

The Court also ordered Google, Microsoft, and Yahoo to “take all necessary measures to prevent the occurrence on their services of any results referring to any of the sites” on their platforms.

Also of interest is that the action targets a service called DL-Protecte.com, which according to local anti-piracy agency HADOPI, makes it difficult for rightsholders to locate infringing content while at the same time generates more revenue for pirate sites.

A judgment is expected in “several months.”

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Abandon Proactive Copyright Filters, Huge Coalition Tells EU Heavyweights

Post Syndicated from Andy original https://torrentfreak.com/abandon-proactive-copyright-filters-huge-coalition-tells-eu-heavyweights-171017/

Last September, EU Commission President Jean-Claude Juncker announced plans to modernize copyright law in Europe.

The proposals (pdf) are part of the Digital Single Market reforms, which have been under development for the past several years.

One of the proposals is causing significant concern. Article 13 would require some online service providers to become ‘Internet police’, proactively detecting and filtering allegedly infringing copyright works, uploaded to their platforms by users.

Currently, users are generally able to share whatever they like but should a copyright holder take exception to their upload, mechanisms are available for that content to be taken down. It’s envisioned that proactive filtering, whereby user uploads are routinely scanned and compared to a database of existing protected content, will prevent content becoming available in the first place.

These proposals are of great concern to digital rights groups, who believe that such filters will not only undermine users’ rights but will also place unfair burdens on Internet platforms, many of which will struggle to fund such a program. Yesterday, in the latest wave of opposition to Article 13, a huge coalition of international rights groups came together to underline their concerns.

Headed up by Civil Liberties Union for Europe (Liberties) and European Digital Rights (EDRi), the coalition is formed of dozens of influential groups, including Electronic Frontier Foundation (EFF), Human Rights Watch, Reporters without Borders, and Open Rights Group (ORG), to name just a few.

In an open letter to European Commission President Jean-Claude Juncker, President of the European Parliament Antonio Tajani, President of the European Council Donald Tusk and a string of others, the groups warn that the proposals undermine the trust established between EU member states.

“Fundamental rights, justice and the rule of law are intrinsically linked and constitute
core values on which the EU is founded,” the letter begins.

“Any attempt to disregard these values undermines the mutual trust between member states required for the EU to function. Any such attempt would also undermine the commitments made by the European Union and national governments to their citizens.”

Those citizens, the letter warns, would have their basic rights undermined, should the new proposals be written into EU law.

“Article 13 of the proposal on Copyright in the Digital Single Market include obligations on internet companies that would be impossible to respect without the imposition of excessive restrictions on citizens’ fundamental rights,” it notes.

A major concern is that by placing new obligations on Internet service providers that allow users to upload content – think YouTube, Facebook, Twitter and Instagram – they will be forced to err on the side of caution. Should there be any concern whatsoever that content might be infringing, fair use considerations and exceptions will be abandoned in favor of staying on the right side of the law.

“Article 13 appears to provoke such legal uncertainty that online services will have no other option than to monitor, filter and block EU citizens’ communications if they are to have any chance of staying in business,” the letter warns.

But while the potential problems for service providers and users are numerous, the groups warn that Article 13 could also be illegal since it contradicts case law of the Court of Justice.

According to the E-Commerce Directive, platforms are already required to remove infringing content, once they have been advised it exists. The new proposal, should it go ahead, would force the monitoring of uploads, something which goes against the ‘no general obligation to monitor‘ rules present in the Directive.

“The requirement to install a system for filtering electronic communications has twice been rejected by the Court of Justice, in the cases Scarlet Extended (C70/10) and Netlog/Sabam (C 360/10),” the rights groups warn.

“Therefore, a legislative provision that requires internet companies to install a filtering system would almost certainly be rejected by the Court of Justice because it would contravene the requirement that a fair balance be struck between the right to intellectual property on the one hand, and the freedom to conduct business and the right to freedom of expression, such as to receive or impart information, on the other.”

Specifically, the groups note that the proactive filtering of content would violate freedom of expression set out in Article 11 of the Charter of Fundamental Rights. That being the case, the groups expect national courts to disapply it and the rule to be annulled by the Court of Justice.

The latest protests against Article 13 come in the wake of large-scale objections earlier in the year, voicing similar concerns. However, despite the groups’ fears, they have powerful adversaries, each determined to stop the flood of copyrighted content currently being uploaded to the Internet.

Front and center in support of Article 13 is the music industry and its current hot-topic, the so-called Value Gap(1,2,3). The industry feels that platforms like YouTube are able to avoid paying expensive licensing fees (for music in particular) by exploiting the safe harbor protections of the DMCA and similar legislation.

They believe that proactively filtering uploads would significantly help to diminish this problem, which may very well be the case. But at what cost to the general public and the platforms they rely upon? Citizens and scholars feel that freedoms will be affected and it’s likely the outcry will continue.

The ball is now with the EU, whose members will soon have to make what could be the most important decision in recent copyright history. The rights groups, who are urging for Article 13 to be deleted, are clear where they stand.

The full letter is available here (pdf)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Spinrilla Wants RIAA Case Thrown Out Over ‘Lies’ About ‘Hidden’ Piracy Data

Post Syndicated from Ernesto original https://torrentfreak.com/spinrilla-wants-riaa-case-thrown-out-over-lies-about-hidden-piracy-data-171016/

Earlier this year, a group of well-known labels targeted Spinrilla, a popular hip-hop mixtape site and app which serves millions of users.

The coalition of record labels, including Sony Music, Warner Bros. Records, and Universal Music Group, filed a lawsuit against the service over alleged copyright infringements.

While the discovery process is still ongoing, Spinrilla recently informed the court that the record labels have “just about derailed” the entire case. The company has submitted a motion for sanctions, which is currently sealed, but additional information submitted to the court this week reveals what’s going on.

When the labels filed their original complaint they listed 210 tracks, without providing the allegedly infringing URLs. These weren’t shared during the early stages of the discovery process either, forcing the site to manually search for potentially infringing links.

Then, early October, Spinrilla received a massive spreadsheet with over 2,000 tracks, including the infringing URLs. This data came from the RIAA and supported the long list of infringements in the amended complaint submitted around the same time.

The spreadsheet would have made the discovery process much easier for Spinrilla. In a supplemental brief supporting a motion for sanctions, Spinrilla accuses the labels of hiding the piracy data from them and lying about it, “derailing” the case in the process.

“Significantly, Plaintiffs used that lie to convince the Court they should be allowed to add about 1,900 allegedly infringed sound recordings to their original list of 210. Later, Plaintiffs repeated that lie to convince the Court to give them time to add even more sound recordings to their list.”

vbcn

Spinrilla says they were forced to go down an expensive and unnecessary rabbit hole to find the infringing files, even though the RIAA data was available all along.

“By hiding and lying about the RIAA data, Plaintiffs forced Defendants to spend precious time and money fumbling through discovery. Not knowing that Plaintiffs had the RIAA data,” the company writes.

The hip-hop mixtape site argues that the alleged wrongdoing is severe enough to have the entire complaint dismissed, as the ultimate sanction.

“It is without exaggeration to say that by hiding the RIAA spreadsheets and that underlying data, Defendants have been severely prejudiced. The Complaint should be dismissed with prejudice and, if it is, Plaintiffs can only blame themselves,” Spinrilla concludes.

The stakes are certainly high in this case. With well over 2,000 infringing tracks listed in the amended complaint, the hip-hop mixtape site faces statutory damages as high as $300 million, at least in theory.

Spinrilla’s supplement brief in further support of the motion for sanctions is available here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Popular Zer0day Torrent Tracker Taken Offline By Mass Copyright Complaint

Post Syndicated from Andy original https://torrentfreak.com/popular-zer0day-torrent-tracker-taken-offline-by-mass-copyright-complaint-171014/

In January 2016, a BitTorrent enthusiast decided to launch a stand-alone tracker, purely for fun.

The Zer0day platform, which hosts no torrents, is a tracker in the purest sense, directing traffic between peers, no matter what content is involved and no matter where people are in the world.

With this type of tracker in short supply, it was soon utilized by The Pirate Bay and the now-defunct ExtraTorrent. By August 2016, it was tracking almost four million peers and a million torrents, a considerable contribution to the BitTorrent ecosystem.

After handling many ups and downs associated with a service of this type, the tracker eventually made it to the end of 2016 intact. This year it grew further still and by the end of September was tracking an impressive 5.5 million peers spread over 1.2 million torrents. Soon after, however, the tracker disappeared from the Internet without warning.

In an effort to find out what had happened, TorrentFreak contacted Zer0day’s operator who told us a familiar story. Without any warning at all, the site’s host pulled the plug on the service, despite having been paid 180 euros for hosting just a week earlier.

“We’re hereby informing you of the termination of your dedicated server due to a breach of our terms of service,” the host informed Zer0day.

“Hosting trackers on our servers that distribute infringing and copyrighted content is prohibited. This server was found to distribute such content. Should we identify additional similar activity in your services, we will be forced to close your account.”

While hosts tend not to worry too much about what their customers are doing, this one had just received a particularly lengthy complaint. Sent by the head of anti-piracy at French collecting society SCPP, it laid out the group’s problems with the Zer0day tracker.

“SCPP has been responsible for the collective management and protection of sound recordings and music videos producers’ rights since 1985. SCPP counts more than 2,600 members including the majority of independent French producers, in addition to independent European producers, and the major international companies: Sony, Universal and Warner,” the complaints reads.

“SCPP administers a catalog of 7,200,000 sound tracks and 77,000 music videos. SCPP is empowered by its members to take legal action in order to put an end to any infringements of the producers’ rights set out in Article L335-4 of the French Intellectual Property Code…..punishable by a three-year prison sentence or a fine of €300,000.”

Noting that it works on behalf of a number of labels and distributors including BMG, Sony Music, Universal Music, Warner Music and others, SCPP listed countless dozens of albums under its protection, each allegedly tracked by the Zer0day platform.

“It has come to our attention that these music albums are illegally being communicated to the public (made available for download) by various users of the BitTorrent-Network,” the complaint reads.

Noting that Zer0day is involved in the process, the anti-piracy outfit presented dozens of hash codes relating to protected works, demanding that the site stop facilitation of infringement on each and every one of them.

“We have proof that your tracker udp://tracker.zer0day.to:1337/announce provided peers of the BitTorrent-Network with information regarding these torrents, to be specific IP Addresses of peers that were offering without authorization the full albums for download, and that this information enabled peers to download files that contain the sound recordings to which our members producers have the exclusive rights.

“These sound recordings are thus being illegally communicated to the public, and your tracker is enabling the seeders to do so.”

Rather than take the hashes down from the tracker, SCPP actually demanded that Zer0day create a permanent blacklist within 24 hours, to ensure the corresponding torrents wouldn’t be tracked again.

“You should understand that this letter constitutes a notice to you that you may be liable for the infringing activity occurring on your service. In addition, if you ignore this notice, you may also be liable for any resulting infringement,” the complaint added.

But despite all the threats, SCPP didn’t receive the response they’d demanded since the operator of the site refused to take any action.

“Obviously, ‘info hashes’ are not copyrightable nor point to specific copyrighted content, or even have any meaning. Further, I cannot verify that request strings parameters (‘info hashes’) you sent me contain copyrighted material,” he told SCPP.

“Like the website says; for content removal kindly ask the indexing site to remove the listing and the .torrent file. Also, tracker software does not have an option to block request strings parameters (‘info hashes’).”

The net effect of non-compliance with SCPP was fairly dramatic and swift. Zer0day’s host took down the whole tracker instead and currently it remains offline. Whether it reappears depends on the site’s operator finding a suitable web host, but at the moment he says he has no idea where one will appear from.

“Currently I’m searching for some virtual private server as a temporary home for the tracker,” he concludes.

As mentioned in an earlier article detailing the problems sites like Zer0day.to face, trackers aren’t absolutely essential for the functioning of BitTorrent transfers. Nevertheless, their existence certainly improves matters for file-sharers so when they go down, millions can be affected.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

SOPA Ghosts Hinder U.S. Pirate Site Blocking Efforts

Post Syndicated from Ernesto original https://torrentfreak.com/sopa-ghosts-hinder-u-s-pirate-site-blocking-efforts-171008/

Website blocking has become one of the entertainment industries’ favorite anti-piracy tools.

All over the world, major movie and music industry players have gone to court demanding that ISPs take action, often with great success.

Internal MPAA research showed that website blockades help to deter piracy and former boss Chris Dodd said that they are one of the most effective anti-tools available.

While not everyone is in agreement on this, the numbers are used to lobby politicians and convince courts. Interestingly, however, nothing is happening in the United States, which is where most pirate site visitors come from.

This is baffling to many people. Why would US-based companies go out of their way to demand ISP blocking in the most exotic locations, but fail to do the same at home?

We posed this question to Neil Turkewitz, RIAA’s former Executive Vice President International, who currently runs his own consulting group.

The main reason why pirate site blocking requests have not yet been made in the United States is down to SOPA. When the proposed SOPA legislation made headlines five years ago there was a massive backlash against website blocking, which isn’t something copyright groups want to reignite.

“The legacy of SOPA is that copyright industries want to avoid resurrecting the ghosts of SOPA past, and principally focus on ways to creatively encourage cooperation with platforms, and to use existing remedies,” Turkewitz tells us.

Instead of taking the likes of Comcast and Verizon to court, the entertainment industries focused on voluntary agreements, such as the now-defunct Copyright Alerts System. However, that doesn’t mean that website blocking and domain seizures are not an option.

“SOPA made ‘website blocking’ as such a four-letter word. But this is actually fairly misleading,” Turkewitz says.

“There have been a variety of civil and criminal actions addressing the conduct of entities subject to US jurisdiction facilitating piracy, regardless of the source, including hundreds of domain seizures by DHS/ICE.”

Indeed, there are plenty of legal options already available to do much of what SOPA promised. ABS-CBN has taken over dozens of pirate site domain names through the US court system. Most recently even through an ex-parte order, meaning that the site owners had no option to defend themselves before they lost their domains.

ISP and search engine blocking is also around the corner. As we reported earlier this week, a Virginia magistrate judge recently recommended an injunction which would require search engines and Internet providers to prevent users from accessing Sci-Hub.

Still, the major movie and music companies are not yet using these tools to take on The Pirate Bay or other major pirate sites. If it’s so easy, then why not? Apparently, SOPA may still be in the back of their minds.

Interestingly, the RIAA’s former top executive wasn’t a fan of SOPA when it was first announced, as it wouldn’t do much to extend the legal remedies that were already available.

“I actually didn’t like SOPA very much since it mostly reflected existing law and maintained a paradigm that didn’t involve ISP’s in creative interdiction, and simply preserved passivity. To see it characterized as ‘copyright gone wild’ was certainly jarring and incongruous,” Turkewitz says.

Ironically, it looks like a bill that failed to pass, and didn’t impress some copyright holders to begin with, is still holding them back after five years. They’re certainly not using all the legal options available to avoid SOPA comparison. The question is, for how long?

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

RIAA Identifies Top YouTube MP3 Rippers and Other Pirate Sites

Post Syndicated from Ernesto original https://torrentfreak.com/riaa-identifies-top-youtube-mp3-rippers-and-other-pirate-sites-171006/

Around the same time as Hollywood’s MPAA, the RIAA has also submitted its overview of “notorious markets” to the Office of the US Trade Representative (USTR).

These submissions help to guide the U.S. Government’s position toward foreign countries when it comes to copyright enforcement.

The RIAA’s overview begins positively, announcing two major successes achieved over the past year.

The first is the shutdown of sites such as Emp3world, AudioCastle, Viperial, Album Kings, and im1music. These sites all used the now-defunct Sharebeast platform, whose operator pleaded guilty to criminal copyright infringement.

Another victory followed a few weeks ago when YouTube-MP3.org shut down its services after being sued by the RIAA.

“The most popular YouTube ripping site, youtube-mp3.org, based in Germany and included in last year’s list of notorious markes [sic], recently shut down in response to a civil action brought by major record labels,” the RIAA writes.

This case also had an effect on similar services. Some stream ripping services that were reported to the USTR last year no longer permit the conversion and download of music videos on YouTube, the RIAA reports. However, they add that the problem is far from over.

“Unfortunately, several other stream-ripping sites have ‘doubled down’ and carry on in this illegal behavior, continuing to make this form of theft a major concern for the music industry,” the music group writes.

“The overall popularity of these sites and the staggering volume of traffic it attracts evidences the enormous damage being inflicted on the U.S. record industry.”

The music industry group is tracking more than 70 of these stream ripping sites and the most popular ones are listed in the overview of notorious markets. These are Mp3juices.cc, Convert2mp3.net, Savefrom.net, Ytmp3.cc, Convertmp3.io, Flvto.biz, and 2conv.com.

Youtube2mp3’s listing

The RIAA notes that many sites use domain privacy services to hide their identities, as well as Cloudflare to obscure the sites’ true hosting locations. This frustrates efforts to take action against these sites, they say.

Popular torrent sites are also highlighted, including The Pirate Bay. These sites regularly change domain names to avoid ISP blockades and domain seizures, and also use Cloudflare to hide their hosting location.

“BitTorrent sites, like many other pirate sites, are increasing [sic] turning to Cloudflare because routing their site through Cloudflare obfuscates the IP address of the actual hosting provider, masking the location of the site.”

Finally, the RIAA reports several emerging threats reported to the Government. Third party app stores, such as DownloadAtoZ.com, reportedly offer a slew of infringing apps. In addition, there’s a boom of Nigerian pirate sites that flood the market with free music.

“The number of such infringing sites with a Nigerian operator stands at over 200. Their primary method of promotion is via Twitter, and most sites make use of the Nigerian operated ISP speedhost247.com,” the report notes

The full list of RIAA’s “notorious” pirate sites, which also includes several cyberlockers, MP3 search and download sites, as well as unlicensed pay services, can be found below. The full report is available here (pdf).

Stream-Ripping Sites

– Mp3juices.cc
– Convert2mp3.net
– Savefrom.net
– Ytmp3.cc
– Convertmp3.io
– Flvto.biz
– 2conv.com.

Search-and-Download Sites

– Newalbumreleases.net
– Rnbxclusive.top
– DNJ.to

BitTorrent Indexing and Tracker Sites

– Thepiratebay.org
– Torrentdownloads.me
– Rarbg.to
– 1337x.to

Cyberlockers

– 4shared.com
– Uploaded.net
– Zippyshare.com
– Rapidgator.net
– Dopefile.pk
– Chomikuj.pl

Unlicensed Pay-for-Download Sites

– Mp3va.com
– Mp3fiesta.com

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Spotify Threatened Researchers Who Revealed ‘Pirate’ History

Post Syndicated from Andy original https://torrentfreak.com/spotify-threatened-researchers-who-revealed-pirate-history-171006/

As one of the members of Sweden’s infamous Piratbyrån (Piracy Bureau), Rasmus Fleischer was also one of early key figures at The Pirate Bay. Over the years he’s been a writer, researcher, debater, and musician, and in 2012 he finished his PhD thesis on “music’s political economy.”

As part of a five-person research team (Pelle Snickars, Patrick Vonderau, Anna Johansson, Rasmus Fleischer, Maria Eriksson) funded by the Swedish Research Council, Fleischer has co-written a book about the history of Spotify.

Titled ‘Spotify Teardown – Inside the Black Box of Streaming Music’, the publication is set to shine light on the history of the now famous music service while revealing quite a few past secrets.

With its release scheduled for 2018, Fleischer has already teased a few interesting nuggets, not least that Spotify’s early beta version used ‘pirate’ MP3 files, some of them sourced from The Pirate Bay.

Fleischer says that following an interview earlier this year with DI.se, in which he revealed that Spotify distributed unlicensed music between May 2007 to October 2008, Spotify looked at ways to try and stop his team’s research. However, the ‘pirate’ angle wasn’t the clear target, another facet of the team’s research was.

“Building on the tradition of ‘breaching experiments’ in ethnomethodology, the research group sought to break into the hidden infrastructures of digital music distribution in order to study its underlying norms and structures,” project leader Pelle Snickars previously revealed.

With this goal, the team conducted experiments to see if the system was open to abuse or could be manipulated, as Fleischer now explains.

“For example, some hundreds of robot users were created to study whether the same listening behavior results in different recommendations depending on whether the user was registered as male or female,” he says.

“We have also investigated on a small scale the possibilities of manipulating the system. However, we have not collected any data about real users. Our proposed methods appeared several years ago in our research funding application, which was approved by the Swedish Research Council, which was already noted in 2013.”

Fleischer says that Spotify had been aware of the project for several years but it wasn’t until this year, after he spoke of Spotify’s past as a ‘pirate’ service, that pressure began to mount.

“On May 19, our project manager received a letter from Benjamin Helldén-Hegelund, a lawyer at Spotify. The timing was hardly a coincidence. Spotify demanded that we ‘confirm in writing’ that we had ‘ceased activities contrary to their Terms of Use’,” Fleischer reveals.

A corresponding letter to the Swedish Research Council detailed Spotify’s problems with the project.

“Spotify is particularly concerned about the information that has emerged regarding the research group’s methods in the project. The data indicate that the research team has deliberately taken action that is explicitly in violation of Spotify’s Terms of Use and by means of technical methods they sought to conceal these breaches of conditions,” the letter read.

“The research group has worked, among other things, to artificially increase the number of plays and manipulate Spotify’s services using scripts or other automated processes.

“Spotify assumes that the systematic breach of its conditions has not been known to the Swedish Research Council and is convinced that the Swedish Research Council is convinced that the research undertaken with the support of the Swedish Research Council in all respects meets ethical guidelines and is carried out reasonably and in accordance with applicable law.”

Fleischer admits that part of the research was concerned with the possibility of artificially increasing the number of plays, but he says that was carried out on a small scale without any commercial gain.

“The purpose was simply to test if it is true that Spotify could be manipulated on a larger scale, as claimed by journalists who did similar experiments. It is also true that we ‘sought to hide these crimes’ by using a VPN connection,” he says.

Fleischer says that Spotify’s lawyer blended complaints together, such as correlating terms of service violations with violation of research ethics, while presenting the same as grounds for legal action.

“The argument was quite ridiculous. Nevertheless, the letter could not be interpreted as anything other than an attempt by Spotify to prevent us from pursuing the research project,” he notes.

This week, however, it appears the dispute has reached some kind of conclusion. In a posting on his Copyriot blog (Swedish), Fleischer reveals that Spotify has informed the Swedish Research Council that the case has been closed, meaning that the research into the streaming service can continue.

“It must be acknowledged that Spotify’s threats have taken both time and power from the project. This seems to be the purpose when big companies go after researchers who they perceive as uncomfortable. It may not be possible to stop the research but it can be delayed,” Fleischer says.

“Sure [Spotify] dislikes people being reminded of how the service started as a pirate service. But instead of inviting an open dialogue, lawyers are sent out for the purpose of slowing down researchers.”

Spotify Teardown. Inside the Black Box of Streaming Music is to be published by MIT Press in 2018.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Yarrrr! Dutch ISPs Block The Pirate Bay But It’s Bad Timing for Trolls

Post Syndicated from Andy original https://torrentfreak.com/yarrrr-dutch-isps-block-the-pirate-bay-but-its-bad-timing-for-trolls-171005/

While many EU countries have millions of Internet pirates, few have given citizens the freedom to plunder like the Netherlands. For many years, Dutch Internet users actually went about their illegal downloading with government blessing.

Just over three years ago, downloading and copying movies and music for personal use was not punishable by law. Instead, the Dutch compensated rightsholders through a “piracy levy” on writable media, hard drives and electronic devices with storage capacity, including smartphones.

Following a ruling from the European Court of Justice in 2014, however, all that came to an end. Along with uploading (think BitTorrent sharing), downloading was also outlawed.

Around the same time, The Court of The Hague handed down a decision in a long-running case which had previously forced two Dutch ISPs, Ziggo and XS4ALL, to block The Pirate Bay.

Ruling against local anti-piracy outfit BREIN, it was decided that the ISPs wouldn’t have to block The Pirate Bay after all. After a long and tortuous battle, however, the ISPs learned last month that they would have to block the site, pending a decision from the Supreme Court.

On September 22, both ISPs were given 10 business days to prevent subscriber access to the notorious torrent site, or face fines of 2,000 euros per day, up to a maximum of one million euros.

With that time nearly up, yesterday Ziggo broke cover to become the first of the pair to block the site. On a dedicated diversion page, somewhat humorously titled ziggo.nl/yarrr, the ISP explained the situation to now-blocked users.

“You are trying to visit a page of The Pirate Bay. On September 22, the Hague Court obliged us to block access to this site. The pirate flag is thus handled by us. The case is currently at the Supreme Court which judges the basic questions in this case,” the notice reads.

Ziggo Pirate Bay message (translated)

Customers of XS4ALL currently have no problem visiting The Pirate Bay but according to a statement handed to Tweakers by a spokesperson, the blockade will be implemented today.

In addition to the site’s main domains, the injunction will force the ISPs to block 155 URLs and IP addresses in total, a list that has been drawn up by BREIN to include various mirrors, proxies, and alternate access points. XS4All says it will publish a list of all the blocked items on its notification page.

While the re-introduction of a Pirate Bay blockade in the Netherlands is an achievement for BREIN, it’s potentially bad timing for the copyright trolls waiting in the wings to snare Dutch file-sharers.

As recently reported, movie outfit Dutch Filmworks (DFW) is preparing a wave of cash-settlement copyright-trolling letters to mimic those sent by companies elsewhere.

There’s little doubt that users of The Pirate Bay would’ve been DFW’s targets but it seems likely that given the introduction of blockades, many Dutch users will start to educate themselves on the use of VPNs to protect their privacy, or at least become more aware of the risks.

Of course, there will be no real shortage of people who’ll continue to download without protection, but DFW are getting into this game just as it’s likely to get more difficult for them. As more and more sites get blocked (and that is definitely BREIN’s overall plan) the low hanging fruit will sit higher and higher up the tree – and the cash with it.

Like all methods of censorship, site-blocking eventually drives communication underground. While anti-piracy outfits all say blocking is necessary, obfuscation and encryption isn’t welcomed by any of them.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Six Strikes Piracy Scheme May Be Dead But Those Warnings Keep on Coming

Post Syndicated from Andy original https://torrentfreak.com/six-strikes-piracy-scheme-may-be-dead-but-those-warnings-keep-on-coming-171001/

After at least 15 years of Internet pirates being monitored by copyright holders, one might think that the message would’ve sunk in by now. For many, it definitely hasn’t.

Bottom line: when people use P2P networks and protocols (such as BitTorrent) to share files including movies and music, copyright holders are often right there, taking notes about what is going on, perhaps in preparation for further action.

That can take a couple of forms, including suing users or, more probably, firing off a warning notice to their Internet service providers. Those notices are a little like a speeding ticket, telling the subscriber off for sharing copyrighted material but letting them off the hook if they promise to be good in future.

In 2013, the warning notice process in the US was formalized into what was known as the Copyright Alert System, a program through which most Internet users could receive at least six piracy warning notices without having any serious action taken against them. In January 2017, without having made much visible progress, it was shut down.

In some corners of the web there are still users under the impression that since the “six strikes” scheme has been shut down, all of a sudden US Internet users can forget about receiving a warning notice. In reality, the complete opposite is true.

While it’s impossible to put figures on how many notices get sent out (ISPs are reluctant to share the data), monitoring of various piracy-focused sites and forums indicates that plenty of notices are still being sent to ISPs, who are cheerfully sending them on to subscribers.

Also, over the past couple of months, there appears to have been an uptick in subscribers seeking advice after receiving warnings. Many report basic notices but there seems to be a bit of a trend of Internet connections being suspended or otherwise interrupted, apparently as a result of an infringement notice being received.

“So, over the weekend my internet got interrupted by my ISP (internet service provider) stating that someone on my network has violated some copyright laws. I had to complete a survey and they brought back the internet to me,” one subscriber wrote a few weeks ago. He added that his (unnamed) ISP advised him that seven warnings would get his account disconnected.

Another user, who named his ISP as Comcast, reported receiving a notice after downloading a game using BitTorrent. He was warned that the alleged infringement “may result in the suspension or termination of your Service account” but what remains unclear is how many warnings people can receive before this happens.

For example, a separate report from another Comcast user stated that one night of careless torrenting led to his mother receiving 40 copyright infringement notices the next day. He didn’t state which company the notices came from but 40 is clearly a lot in such a short space of time. That being said and as far as the report went, it didn’t lead to a suspension.

Of course, it’s possible that Comcast doesn’t take action if a single company sends many notices relating to the same content in a small time frame (Rightscorp is known to do this) but the risk is still there. Verizon, it seems, can suspend accounts quite easily.

“So lately I’ve been getting more and more annoyed with pirating because I get blasted with a webpage telling me my internet is disconnected and that I need to delete the file to reconnect, with the latest one having me actually call Verizon to reconnect,” a subscriber to the service reported earlier this month.

A few days ago, a Time Warner Cable customer reported having to take action after receiving his third warning notice from the ISP.

“So I’ve gotten three notices and after the third one I just went online to my computer and TWC had this page up that told me to stop downloading illegally and I had to click an ‘acknowledge’ button at the bottom of the page to be able to continue to use my internet,” he said.

Also posting this week, another subscriber of an unnamed ISP revealed he’d been disconnected twice in the past year. His comments raise a few questions that keep on coming up in these conversations.

“The first time [I was disconnected] was about a year ago and the next was a few weeks ago. When it happened I was downloading some fairly new movies so I was wondering if they monitor these new movie releases since they are more popular. Also are they monitoring what I am doing since I have been caught?” he asked.

While there is plenty of evidence to suggest that old content is also monitored, there’s little doubt that the fresher the content, the more likely it is to be monitored by copyright holders. If people are downloading a brand new movie, they should expect it to be monitored by someone, somewhere.

The second point, about whether risk increases after being caught already, is an interesting one, for a number of reasons.

Following the BMG v Cox Communication case, there is now a big emphasis on ISPs’ responsibility towards dealing with subscribers who are alleged to be repeat infringers. Anti-piracy outfit Rightscorp was deeply involved in that case and the company has a patent for detecting repeat infringers.

It’s becoming clear that the company actively targets such people in order to assist copyright holders (which now includes the RIAA) in strategic litigation against ISPs, such as Grande Communications, who are claimed to be going soft on repeat infringers.

Overall, however, there’s no evidence that “getting caught” once increases the chances of being caught again, but subscribers should be aware that the Cox case changed the position on the ground. If anecdotal evidence is anything to go by, it now seems that ISPs are tightening the leash on suspected pirates and are more likely to suspend or disconnect them in the face of repeated complaints.

The final question asked by the subscriber who was disconnected twice is a common one among people receiving notices.

“What can I do to continue what we all love doing?” he asked.

Time and time again, on sites like Reddit and other platforms attracting sharers, the response is the same.

“Get a paid VPN. I’m amazed you kept torrenting without protection after having your internet shut off, especially when downloading recent movies,” one such response reads.

Nevertheless, this still fails to help some people fully understand the notices they receive, leaving them worried about what might happen after receiving one. However, the answer is nearly always straightforward.

If the notice says “stop sharing content X”, then recipients should do so, period. And, if the notice doesn’t mention specific legal action, then it’s almost certain that no action is underway. They are called warning notices for a reason.

Also, notice recipients should consider the part where their ISP assures them that their details haven’t been shared with third parties. That is the truth and will remain that way unless subscribers keep ignoring notices. Then there’s a slim chance that a rightsholder will step in to make a noise via a lawyer. At that point, people shouldn’t say they haven’t been warned.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

‘China Should Crack Down on Pirate Streaming Box Distributors’

Post Syndicated from Ernesto original https://torrentfreak.com/china-should-crack-down-on-pirate-streaming-box-distributors-171001/

The International Intellectual Property Alliance (IIPA) has informed the U.S. Government that China must step up its game to better protect the interests of copyright holders.

The US Trade Representative is reviewing whether China has done enough to comply with its WTO obligations, but IIPA members including RIAA and MPAA believe there is still work to be done.

One of the areas to which the Chinese Government should pay more attention is enforcement. Although a lot of progress has been made in recent years, especially in combating music piracy, new threats have emerged.

One of the areas highlighted by IIPA is the streaming box ecosystem, aptly dubbed as “piracy 3.0” by the Motion Picture Association. This appeals to a new breed of pirates who rely on set-top boxes which are filled with pirate add-ons.

Industry groups often refer to these boxes as Illicit Streaming Devices (ISDs) and they see China as a major hub through which these are shipped around the world.

“ISDs are media boxes, set-top boxes or other devices that allow users, through the use of piracy apps, to stream, download, or otherwise access unauthorized content from the Internet,” IIPA writes.

“These devices have emerged as a significant means through which pirated motion picture and television content is accessed on televisions in homes in China as well as elsewhere in Asia and increasingly around the world. China is a hub for the manufacture of these devices.”

Although the hardware and media players are perfectly legal, things get problematic when they’re loaded with pirate add-ons and promoted as tools to facilitate copyright infringement.

IIPA states that the Chinese Government should do more to stop these devices from being sold. Cracking down on the main distribution points would be a good start, they say.

“However it is done, the Chinese government must increase enforcement efforts, including cracking down on piracy apps and on device retailers and/or distributors who preload the devices with apps that facilitate infringement.

“Moreover, because China is the main source of this problem spreading across Asia, the Chinese government should take immediate actions against key distribution points for devices that are being used illegally,” IIPA adds.

In addition to pirate boxes, the industry groups also want China to beef up its enforcement against online journal piracy, pirate apps, unauthorized camcording, and unlicensed streaming platforms.

IIPA intends to explain the above and several other shortcomings in detail during a hearing in Washington, DC, next Wednesday. The group has submitted an overview of its testimony to the Trade Representative, which is available here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

HDClub, Russia’s Leading HD-Only Torrent Site, Returns as EliteHD

Post Syndicated from Ernesto original https://torrentfreak.com/hdclub-russias-leading-hd-only-torrent-site-returns-as-elitehd-170930/

With around 170,000 users, HDClub was known for high-quality releases that often leaked to public sites like The Pirate Bay.

Describing itself as “The HighDefinition BitTorrent Community”, HDClub specialized in HD productions including Blu-ray and 3D content, covering movies, TV shows, music videos, and animation.

The site was the largest of its kind in Russia and had been around for a long time. It celebrated its tenth anniversary a few months ago and during this time it amassed over 170,000 members, which is quite significant for a private community.

However, last month the fun was over. As a total surprise to most of the members, HDTorrents’ operators decided to shut down the site. A Russian language announcement now present on its main page explains the reasons for the site’s demise.

“Recently, we received several dozens of complaints from rightsholders weekly, and our community is subjected to attacks and espionage. In parallel, there is a tightening of Internet legislation in Russia, Ukraine and EU countries,” the announcement explained.

This grim outlook was, however, paired with a glimmer of hope. “There are talks on preserving the heritage of the club,” the site teased.

This was not a false promise, it turned out this week. The former foundation of HDClub now forms the basis of a new tracker. EliteHD takes over where HDClub left off with a working copy of the code, torrents and user database.

“Welcome to the closed tracker elitehd.org. We will try to increase the best HD collection and ensure your safety and confidentiality,” EliteHD’s operators posted in a Russian announcement earlier this week.

“The new site received a full copy of the database and the code of the closed HDClub. The user base has been thoroughly cleaned, there will be no free registration,” it adds.

EliteHD’s torrents

“Thoroughly cleaned” means that around 80,000 accounts were removed and the new maximum is currently set at 100,000 registered users. The torrent database is intact though. There are over 26,000 HD torrents in the database totaling more than 500 terabytes of data.

The site’s operators note that members can continue to seed old torrents as well. All they have to do is change the torrent’s announce URL in their client, and uploads should pick up again.

In recent weeks there have been other private trackers which tried to get former HDClub users on board, but it will be hard to compete with a site that has the real database and code.

EliteHD specifically warns people not to fall for fakes and ‘unofficial’ incarnations of its predecessor. “We strongly recommend that you beware of numerous fake projects and “successors,” the site operators stress.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Algo-rhythmic PianoAI

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/pianoai/

It’s no secret that we love music projects at Pi Towers. On the contrary, we often shout it from the rooftops like we’re in Moulin Rouge! But the PianoAI project by Zack left us slack-jawed: he built an AI on a Raspberry Pi that listens to his piano playing, and then produces improvised, real-time accompaniment.

Jamming with PIanoAI (clip #1) (Version 1.0)

Another example of a short teaching and then jamming with piano with a version I’m more happy with. I have to play for the Pi for a little while before the Pi has enough data to make its own music.

The PianoAI

Inspired by a story about jazz musician Dan Tepfer, Zack set out to create an AI able to imitate his piano-playing style in real time. He began programming the AI in Python, before starting over in the open-source programming language Go.

The Go language gopher mascot with headphones and a MIDI keyboard

The Go mascot is a gopher. Why not?

Zack has published an excellent write-up of how he built PianoAI. It’s a very readable account of the progress he made and the obstacles he had to overcome while writing PianoAI, and it includes more example videos. It’s hard to add anything to Zack’s own words, so I shan’t try.

Paper notes for PianoAI algorithm

Some of Zack’s notes for his AI

If you just want to try out PianoAI, head over to his GitHub. He provides a detailed guide that talks you through how to implement and use it.

Music to our ears

The Raspberry Pi community never fails to amaze us with their wonderful builds, not least when it comes to musical ones. Check out this cool-looking synth by Toby Hendricks, this geometric instrument by David Sharples, and this pyrite-disc-reading music player by Dmitry Morozov. Aren’t they all splendid? And the list goes on and on

Which instrument do you play? The recorder? The ocarina? The jaw harp? Could you create an AI like Zack’s for it? Let us know in the comments below, and share your builds with us via social media.

The post Algo-rhythmic PianoAI appeared first on Raspberry Pi.

The possibilities of the Sense HAT

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/sense-hat-projects/

Did you realise the Sense HAT has been available for over two years now? Used by astronauts on the International Space Station, the exact same hardware is available to you on Earth. With a new Astro Pi challenge just launched, it’s time for a retrospective/roundup/inspiration post about this marvellous bit of kit.

Sense HAT attached to Pi and power cord

The Sense HAT on a Pi in full glory

The Sense HAT explained

We developed our scientific add-on board to be part of the Astro Pi computers we sent to the International Space Station with ESA astronaut Tim Peake. For a play-by-play of Astro Pi’s history, head to the blog archive.

Astro Pi logo with starry background

Just to remind you, this is all the cool stuff our engineers have managed to fit onto the HAT:

  • A gyroscope (sensing pitch, roll, and yaw)
  • An accelerometer
  • A magnetometer
  • Sensors for temperature, humidity, and barometric pressure
  • A joystick
  • An 8×8 LED matrix

You can find a roundup of the technical specs here on the blog.

How to Sense HAT

It’s easy to begin exploring this device: take a look at our free Getting started with the Sense HAT resource, or use one of our Code Club Sense HAT projects. You can also try out the emulator, available offline on Raspbian and online on Trinket.

Sense HAT emulator on Trinket

The Sense HAT emulator on trinket.io

Fun and games with the Sense HAT

Use the LED matrix and joystick to recreate games such as Pong or Flappy Bird. Of course, you could also add sensor input to your game: code an egg drop game or a Magic 8 Ball that reacts to how the device moves.

Sense HAT Random Sparkles

Create random sparkles on the Sense HAT

Once December rolls around, you could brighten up your home with a voice-controlled Christmas tree or an advent calendar on your Sense HAT.

If you like the great outdoors, you could also use your Sense HAT to recreate this Hiking Companion by Marcus Johnson. Take it with you on your next hike!

Art with the Sense HAT

The LED matrix is perfect for getting creative. To draw something basic without having to squint at a Python list, use this app by our very own Richard Hayler. Feeling more ambitious? The MagPi will teach you how to create magnificent pixel art. Ben Nuttall has created this neat little Python script for displaying a photo taken by the Raspberry Pi Camera Module on the Sense HAT.

Brett Haines Mathematica on the Sense HAT

It’s also possible to incorporate Sense HAT data into your digital art! The Python Turtle module and the Processing language are both useful tools for creating beautiful animations based on real-world information.

A Sense HAT project that also uses this principle is Giorgio Sancristoforo’s Tableau, a ‘generative music album’. This device creates music according to the sensor data:

Tableau Generative Album

“There is no doubt that, as music is removed by the phonographrecord from the realm of live production and from the imperative of artistic activity and becomes petrified, it absorbs into itself, in this process of petrification, the very life that would otherwise vanish.”

Science with the Sense HAT

This free Essentials book from The MagPi team covers all the Sense HAT science basics. You can, for example, learn how to measure gravity.

Cropped cover of Experiment with the Sense HAT book

Our online resource shows you how to record the information your HAT picks up. Next you can analyse and graph your data using Mathematica, which is included for free on Raspbian. This resource walks you through how this software works.

If you’re seeking inspiration for experiments you can do on our Astro Pis Izzy and Ed on the ISS, check out the winning entries of previous rounds of the Astro Pi challenge.

Thomas Pesquet with Ed and Izzy

Thomas Pesquet with Ed and Izzy

But you can also stick to terrestrial scientific investigations. For example, why not build a weather station and share its data on your own web server or via Weather Underground?

Your code in space!

If you’re a student or an educator in one of the 22 ESA member states, you can get a team together to enter our 2017-18 Astro Pi challenge. There are two missions to choose from, including Mission Zero: follow a few guidelines, and your code is guaranteed to run in space!

The post The possibilities of the Sense HAT appeared first on Raspberry Pi.

5 years with home NAS/RAID

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/09/5-years-with-home-nasraid.html

I have lots of data-sets (packet-caps, internet-scans), so I need a large RAID system to hole it all. As I described in 2012, I bought a home “NAS” system. I thought I’d give the 5 year perspective.

Reliability. I had two drives fail, which is about to be expected. Buying a new drive, swapping it in, and rebuilding the RAID went painless, though that’s because I used RAID6 (two drive redundancy). RAID5 (one drive redundancy) is for chumps.

Speed. I’ve been unhappy with the speed, but there’s not much I can do about it. Mechanical drives access times are slow, and I don’t see any way of fixing that.

Cost. It’s been $3000 over 5 years (including the two replacement drives). That comes out to $50/month. Amazon’s “Glacier” service is $108/month. Since we all have the same hardware costs, it’s unlikely that any online cloud storage can do better than doing it yourself.

Moore’s Law. For the same price as I spent 5 years ago, I can now get three times the storage, including faster processors in the NAS box. From that perspective, I’ve only spent $33/month on storage, as the remaining third still has value.

Ease-of-use: The reason to go with a NAS is ease-of-use, so I don’t have to mess with it. Yes, I’m a Linux sysadmin, but I have more than enough Linux boxen needing my attention. The NAS has been extremely easy to use, even dealing with the two disk failures.

Battery backup. The cheap $50 CyberPower UPS I bought never worked well and completely failed recently, so I’ve ordered a $150 APC unit to replace it.

Vendor. I chose Synology, and have no reason to complain. Of course they’ve had security vulnerabilities, but then, so have all their competition.

DLNA. This is a standard for streaming music among home devices. It never worked well. I suspect partly it’s Synology’s fault that they can’t transcode well. I suspect it’s also the apps I tried on the iPad which have obvious problems. I end up streaming to the iPad by simply using the SMB protocol to serve files rather than a video protocol.

Consumer vs. enterprise drives. I chose consumer rather than enterprise drives. I think this is always the best choice (RAID means inexpensive drives). But very smart people with experience in recovering data disagree with me.

If you are in the market. If you are building your own NAS, get a 4 or 5 bay device and RAID6. Two-drive redundancy is really important.

The Data Tinder Collects, Saves, and Uses

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/09/the_data_tinder.html

Under European law, service providers like Tinder are required to show users what information they have on them when requested. This author requested, and this is what she received:

Some 800 pages came back containing information such as my Facebook “likes,” my photos from Instagram (even after I deleted the associated account), my education, the age-rank of men I was interested in, how many times I connected, when and where every online conversation with every single one of my matches happened…the list goes on.

“I am horrified but absolutely not surprised by this amount of data,” said Olivier Keyes, a data scientist at the University of Washington. “Every app you use regularly on your phone owns the same [kinds of information]. Facebook has thousands of pages about you!”

As I flicked through page after page of my data I felt guilty. I was amazed by how much information I was voluntarily disclosing: from locations, interests and jobs, to pictures, music tastes and what I liked to eat. But I quickly realised I wasn’t the only one. A July 2017 study revealed Tinder users are excessively willing to disclose information without realising it.

“You are lured into giving away all this information,” says Luke Stark, a digital technology sociologist at Dartmouth University. “Apps such as Tinder are taking advantage of a simple emotional phenomenon; we can’t feel data. This is why seeing everything printed strikes you. We are physical creatures. We need materiality.”

Reading through the 1,700 Tinder messages I’ve sent since 2013, I took a trip into my hopes, fears, sexual preferences and deepest secrets. Tinder knows me so well. It knows the real, inglorious version of me who copy-pasted the same joke to match 567, 568, and 569; who exchanged compulsively with 16 different people simultaneously one New Year’s Day, and then ghosted 16 of them.

“What you are describing is called secondary implicit disclosed information,” explains Alessandro Acquisti, professor of information technology at Carnegie Mellon University. “Tinder knows much more about you when studying your behaviour on the app. It knows how often you connect and at which times; the percentage of white men, black men, Asian men you have matched; which kinds of people are interested in you; which words you use the most; how much time people spend on your picture before swiping you, and so on. Personal data is the fuel of the economy. Consumers’ data is being traded and transacted for the purpose of advertising.”

Tinder’s privacy policy clearly states your data may be used to deliver “targeted advertising.”

It’s not Tinder. Surveillance is the business model of the Internet. Everyone does this.