Tag Archives: sports

UK Government Publishes Advice on ‘Illicit Streaming Devices’

Post Syndicated from Andy original https://torrentfreak.com/uk-government-publishes-advice-on-illicit-streaming-devices-171120/

With torrents and other methods of obtaining content simmering away in the background, unauthorized streaming is now the method of choice for millions of pirates around the globe.

Previously accessible only via a desktop browser, streaming is now available on a wide range of devices, from tablets and phones through to dedicated set-top boxes. Collectively, these are now being branded Illicit Streaming Devices (ISDs) by the entertainment industries.

It’s terminology the UK government’s Intellectual Property Office has adopted this morning. In a new public advisory, the IPO notes that illicit streaming is the watching of content without the copyright owner’s permission using a variety of devices.

“Illicit streaming devices are physical boxes that are connected to your TV or USB sticks that plug into the TV such as adapted Amazon Fire sticks and so called ‘Kodi’ boxes or Android TV boxes,” the IPO reports.

“These devices are legal when used to watch legitimate, free to air, content. They become illegal once they are adapted to stream illicit content, for example TV programmes, films and subscription sports channels without paying the appropriate subscriptions.”

The IPO notes that streaming devices usually need to be loaded with special software add-ons in order to view copyright-infringing content. However, there are now dedicated apps available to view movies and TV shows which can be loaded straight on to smartphones and tablets.

But how can people know whether the device they have is an ISD or not? According to the IPO, it’s all down to common sense: if content that people would normally pay for is being received for free, it’s likely to be illegal.

“If you are watching television programmes, films or sporting events where you would normally be paying to view them and you have not paid, you are likely to be using an illicit streaming device (ISD) or app. This could include a film recently released in the cinema, a sporting event that is being broadcast by BT Sport or a television programme, like Game of Thrones, that is only available on Sky,” the IPO says.

In an effort to familiarize the public with some of the terminology used by ISD sellers on eBay, Amazon or Gumtree, for example, the IPO then wanders into a bit of a minefield that really needs much greater clarification.

First up, the government states that ISDs are often described online as being “Fully loaded”, a colloquial term for a device with addons already installed. Although not all of those addons will be infringing, it’s very often the case that the majority are intended to be, so no problem with the guidance here.

However, the IPO then says that people should keep an eye out for the term ‘jail broken’, which many readers will understand as the process some hardware devices, such as Apple products, are put through so that third-party software can run on them. Some ISD sellers do apply the term to Android devices, for example, but it’s technically incorrect, used by only a tiny minority, and of course misleading.

The IPO also warns people against devices marketed as “Plug and Play” but again this is a dual-use term and shouldn’t put consumers off a purchase without a proper investigation. A search on eBay this morning for that exact term didn’t yield any ISDs at all, only games consoles that can be plugged in and played with a minimum of fuss.

“Subscription Gift”, on the other hand, almost certainly references an illicit IPTV or satellite card-sharing subscription and is rarely used for anything else. 100% illegal, no doubt.

The government continues by giving reasons why people should avoid ISDs, not least since their use deprives the content industries of valuable revenue.

“[The creative industries] provide employment for more than 1.9 million people and contributes £84.1 billion to our economy. Using illicit streaming devices is illegal,” the IPO writes.

“If you are not paying for this content you are depriving industry of the revenue it needs to fund the next generation of TV programmes, films and sporting events we all enjoy. Instead it provides funds for the organized criminals who sell or adapt these illicit devices.”

Then, in keeping with the danger-based narrative employed by the entertainment industries recently, the government also warns that ISDs can have a negative effect on child welfare, not to mention on physical safety in the home.

“These devices often lack parental controls. Using them could expose children or young people to explicit or age inappropriate content,” the IPO warns.

“Another important reason for consumers to avoid purchasing these streaming devices is from an electrical safety point of view. Where devices and their power cables have been tested, some have failed EU safety standards and have the potential to present a real danger to the public, causing a fire in your home or premises.”

While there can be no doubt whatsoever that failing EU electrical standards in any way is unacceptable for any device, the recent headlines stating that “Kodi Boxes Can Kill Their Owners” are sensational at best and don’t present the full picture.

As reported this weekend, simply not carrying recognized branding means that such devices fail electrical standards, while non-genuine phone chargers present a greater risk around the UK.

Finally, the government offers some advice for people who either want to get off the ISD gravy train or ensure that others don’t benefit from it.

“These devices can be used legally by removing the software. If you are unsure get advice to help you use the device legally. If you wish to watch content that’s only available via subscription, such as sports, you should approach the relevant provider to find out about legal ways to watch,” the IPO advises.

“Get it Right from a Genuine Site helps you get the music, TV, films, games, books, newspapers, magazines and sport that you love from genuine services.”

And, if the public thinks that people selling such devices deserve a visit from the authorities, they are asked to report sellers to the Crimestoppers charity via an anonymous hotline.

The government’s guidance is exactly what one might expect, given that the advisory is likely to have been strongly assisted by companies including the Federation Against Copyright Theft, Premier League, and Sky, who have taken the lead in this area during the past year or so.

The big question, however, is whether many people using these devices really believe that obtaining subscription TV, movies, and sports for next to nothing is 100% legal. If there are people out there who do, they must be in the minority, but at least the government itself is now putting them on the right path.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Spanish Police Arrest Seven in Pirate Sports Streaming Crackdown

Post Syndicated from Andy original https://torrentfreak.com/spanish-police-arrest-seven-in-pirate-sports-streaming-crackdown-171111/

While most large broadcasters around the world now offer comprehensive sports packages to their customers, subscriptions are often quite expensive.

This has led to the proliferation of pirate services, each dedicated to bringing live sports to the masses at massively reduced prices or even completely free.

As a result, it’s now possible to watch almost any sport from a pirate source, whether that’s via a website, an augmented Kodi setup, or a premium IPTV provider. Today, however, there’s one less pirate service available after a series of raids in Spain.

According to the National Police, raids took place this week in Madrid, Alicante, Albacete, Gandía, Xátiva, and Antequera. In total, seven people were arrested for illegally broadcasting football matches.

Unusually in such cases, the suspects are alleged to have offered matches via a number of mechanisms, including direct download, streaming, subscription streaming, and peer-to-peer distribution. This, the police say, allowed them to have the broadest possible access to the market.

The group’s servers were scattered around the world; some located in Spain, others in France, with the remainder in the United States and Canada.

The investigation began in 2016 following a complaint from La Liga, the top professional league in Spanish football. The organization alleged that a total of 13 websites were illegally offering lists of links that enabled visitors to access content to which it holds the exclusive rights.

Police say the operation was well organized, with matches presented to Internet users with schedules ordered by championships. Revenue was generated via advertising which appeared on the various pages viewed by visitors.

It’s claimed that the sites’ operators also attempted to make their scattered servers harder to find by utilizing intermediary companies, including those that offer server location anonymization services.

Across the country, eight house searches reportedly yielded a trove of evidence, both digital and physical, detailing the pirate operation and the profit obtained from it.

At this early stage, police estimate the “economic benefit” to the defendants from subscriptions and advertising to be in the region of 1.4 million euros, although it’s unclear whether those are actual historic or projected gains.

Following the raids, seven websites were ordered to be blocked and three bank accounts, said to be linked to the pirate operation, were frozen. Police say that the investigation continues so further arrests and website blockades can’t be ruled out.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Aussie ‘Pirate’ Blocking Efforts Switch to Premium IPTV

Post Syndicated from Andy original https://torrentfreak.com/aussie-pirate-blocking-efforts-switch-to-premium-iptv-171106/

Website blocking has become one of the leading anti-piracy mechanisms in recent years and is particularly prevalent across Europe, where thousands of sites are now off-limits by regular means.

More recently the practice spread to Australia, where movie and music industry bodies have filed several applications at the Federal Court. This has rendered dozens of major torrent and streaming sites inaccessible in the region, after local ISPs complied with orders compelling them to prevent subscriber access.

While such blocking is now commonplace, Village Roadshow and a coalition of movie studios have now switched tack, targeting an operation offering subscription-based IPTV services.

The action targets HDSubs+, a fairly well-known service that provides hundreds of otherwise premium live channels, movies, and sports for a relatively small monthly fee, at least versus the real deal.

A small selection of channels in the HDSubs+ package

ComputerWorld reports that the application for the injunction was filed last month. In common with earlier requests, it targets Australia’s largest ISPs, including Telstra, Optus, TPG, and Vocus, plus their subsidiaries.

Access to HDSubs.com appears to be limited, possibly by the platform’s operators, so that visitors from desktop machines are redirected back to Google. However, access to the platform is available by other means and that reveals a fairly pricey IPTV offering.

As seen in the image below, the top package (HD Subs+), which includes all the TV anyone could need plus movies and TV shows on demand, weighs in at US$239.99 per year, around double the price of similar packages available elsewhere.

Broad selection of channels but quite pricey

If the court chooses to grant the injunction, ISPs will not only have to block the service’s main domain (HDSubs.com) but also a range of others which provide the infrastructure for the platform.

Unlike torrent and streaming sites which tend to be in one place (if we discount proxies and mirrors), IPTV services like HD Subs often rely on a number of domains to provide a sales platform, EPG (electronic program guide), software (such as an Android app), updates, and sundry other services.

As per CW, in the HD Subs case they are: ois001wfr.update-apk.com, ois005yfs.update-apk.com, ois003slp.update-apk.com, update002zmt.hiddeniptv.com, apk.hiddeniptv.com, crossepg003uix.hiddeniptv.com, crossepg002gwj.hiddeniptv.com, mpbs001utb.hiddeniptv.com, soft001rqv.update-apk.com and hdsubs.com.

This switch in tactics by Village Roadshow and the other studios involved is subtle but significant. While torrent and streaming sites provide a largely free but fragmented experience, premium IPTV services are direct commercial competitors, often providing a more comprehensive range of channels and services than the broadcasters themselves.

While quality may not always be comparable with that of their licensed counterparts, presentation is often first class, giving the impression of an official product comfortably accessed via a living room TV. This is clearly a concern to commercial broadcasters.

As reported last week, global IPTV traffic is both huge and growing, so expect more of these requests Down Under.

Previous efforts to block IPTV services include those in the UK, where the Premier League takes targeted action against providers offering live soccer. These measures only target live streams while matches are underway and, as far as we’re aware, there are no broader measures in place against any provider.

This could mean that the action in Australia, to permanently block a provider in its entirety, is the first of its kind anywhere.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

IPTV Piracy Generates More Internet Traffic Than Torrents

Post Syndicated from Ernesto original https://torrentfreak.com/iptv-piracy-generates-more-internet-traffic-than-torrents-171101/

Increasingly, people are trading in their expensive cable subscriptions, opting to use cheaper or free Internet TV instead.

This is made easy and convenient with help from a variety of easy-to-use set-top boxes, many of which are specifically configured to receive pirated content.

Following this trend, there has also been an uptick in the availability of unlicensed TV subscriptions, with dozens of vendors offering virtually any channel imaginable, either for free or in exchange for a small fee.

Until now the true scope of this piracy ecosystem was largely unknown, but a new report published by Canadian broadband management company Sandvine reveals that it’s massive.

The company monitored traffic across multiple fixed access tier-1 networks in North America and found that 6.5% of households are communicating with known TV piracy services. This translates to seven million subscribers and many more potential viewers.

One of the interesting aspects of IPTV piracy is that most services charge money, around $10 per month. This means that there’s a lot of money involved. If the seven million figure is indeed accurate, these IPTV vendors would generate roughly $800 million in North America alone.
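As a rough check on that figure: 7 million subscriptions at $10 per month comes to about $70 million per month, or roughly $840 million a year, which squares with the “almost a billion dollar a year” characterization quoted below.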

“TV piracy could quickly become almost a billion dollar a year industry for pirates,” Sandvine writes in its report, noting that the real rightsholders are being substantially harmed.

Pirate subscription TV ecosystem

According to Sandvine, roughly 95% of the IPTV subscriptions run off custom set-top boxes. Kodi-powered devices and Roku boxes follow at a respectable distance with 3% and 2% of the market, respectively.

With millions of subscribers, there’s undoubtedly a large audience for pirate subscription TV. This is also reflected in the bandwidth these services consume. During peak hours, 6.5% of all downstream traffic on fixed networks is generated by TV piracy services.

To put this into perspective; this is more than all BitTorrent traffic during the peak hours, which was “only” 1.73% last year, and dropping.

The pirate IPTV numbers are impressive even when compared to Netflix and YouTube. While the two video giants still have a larger share of overall Internet traffic on fixed networks, pirate TV subscriptions are not that far behind.

Internet traffic share throughout the day

The graph above points out another issue. It highlights that many IPTV services continue to stream data even when they’re not actively used (tuned into a channel with the TV off). As a result, they have a larger share of the overall traffic during the night when most people don’t use Netflix or YouTube.

This wasted traffic is referred to as “phantom bandwidth” and can be as high as one terabyte per month for a single connection. Physically powering off the box is often the only way to prevent this.

Needless to say, “phantom bandwidth” increases IPTV traffic numbers, so it doesn’t necessarily mean that all this traffic is actively consumed.

Finally, Sandvine looked at the different types of content people are streaming with these pirate subscriptions. Live sporting events are popular, as we’ve seen with the megafight between Floyd Mayweather and Conor McGregor. The same is true for news channels and premium TV such as HBO and international broadcasts.

The most-viewed channel of all in North America, with 4.6% of all pirated TV traffic, is India’s Star Plus HD.

All in all, it is safe to conclude that IPTV piracy makes up a large part of the pirate ecosystem. This hasn’t gone unnoticed by copyright holders, of course. In recent months we have seen enforcement actions against several providers, and if this trend continues, more are likely to follow.

Looking ahead, it would be interesting to see some numbers for the “on demand” piracy streaming websites and devices as well. IPTV subscriptions are substantial, but it would be no surprise if pirate streaming boxes and sites generate even more traffic.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

‘Pirate’ IPTV Provider Loses Case, Despite Not Offering Content Itself

Post Syndicated from Andy original https://torrentfreak.com/pirate-iptv-provider-loses-case-despite-not-offering-content-itself-171031/

In 2017, there can be little doubt that streaming is the big piracy engine of the moment. Dubbed Piracy 3.0 by the MPAA, the movement is causing tremendous headaches for rightsholders on a global scale.

One of the interesting things about this phenomenon is the distributed nature of the content on offer. Sourced from thousands of online locations, from traditional file-hosters to Google Drive, the big challenge is to aggregate it all into one place, to make it easy to find. This is often achieved via third-party addons for the legal Kodi software.

One company offering such a service was MovieStreamer.nl in the Netherlands. Via its website, the company offered its Easy Use Interface 2.0, a piece of software that made Kodi easy to use and streams easy to find, for 79 euros. It also sold ‘VIP’ access to thousands of otherwise premium channels for around 20 euros per month.

MovieStreamer Easy Interface 2.0

“Thanks to the unique Easy Use Interface, we have the unique 3-step process,” the company’s marketing read.

“Click tile of choice, activate subtitles, and play! Fully automated and instantly the most optimal settings. Our youngest user is 4 years old and the ‘oldest’ 86 years. Ideal for young and old, beginner and expert.”

Of course, being based in the Netherlands it wasn’t long before MovieStreamer caught the attention of BREIN. The anti-piracy outfit says it tried to get the company to stop offering the illegal product but after getting no joy, took the case to court.

From BREIN’s perspective, the case was cut and dried. MovieStreamer had no right to provide access to the infringing content so it was in breach of copyright law (unauthorized communication to the public) and should stop its activities immediately. MovieStreamer, however, saw things somewhat differently.

At the core of its defense was the claim that it did not provide content itself and was merely a kind of middleman. MovieStreamer said it provided only a referral service in the form of a hyperlink formatted as a shortened URL, which in turn brought together supply and demand.

In effect, MovieStreamer claimed that it was several steps away from any infringement and that only the users themselves could activate the shortener hyperlink and subsequent process (including a corresponding M3U playlist file, which linked to other hyperlinks) to access any pirated content. Due to this disconnect, MovieStreamer said that there was no infringement, for-profit or otherwise.

A judge at the District Court in Utrecht disagreed, ruling that providing customers with a unique hyperlink that in turn led to protected works was indeed a “communication to the public”, based on the earlier Filmspeler case.

The Court also noted that MovieStreamer knew, or indeed ought to have known, the illegal nature of the content being linked to, not least since BREIN had already informed the company of that fact. Since the company was aware, the for-profit element of the GS Media decision handed down by the European Court of Justice came into play.

In an order handed down October 27, the Court ordered MovieStreamer to stop its IPTV hyperlinking activities immediately, whether via its Kodi Easy Use Interface or other means. Failure to do so will result in a 5,000 euro per day fine, payable to BREIN, up to a maximum of 500,000 euros. MovieStreamer was also ordered to pay legal costs of 17,527 euros.

“Moviestreamer sold a link to illegal content. Then you are required to check if that content is legally on the internet,” BREIN Director Tim Kuik said in a statement.

“You can not claim that you have nothing to do with the content if you sell a link to that content.”

Speaking with Tweakers, MovieStreamer owner Bernhard Ohler said that the packages in question were removed from his website on Saturday night. He also warned that other similar companies could experience the same issues with BREIN.

“With this judgment in hand, BREIN has, of course, a powerful weapon to force them offline,” he said.

Ohler said that the margins on hardware were so small that the IPTV subscriptions were the heart of his company. Contacted by TorrentFreak on what this means for his business, he had just two words.

“The end,” he said.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

CBS Sues Man for Posting Image of a 59-Year-Old TV Show on Social Media

Post Syndicated from Ernesto original https://torrentfreak.com/cbs-sues-man-for-posting-image-of-a-59-year-tv-show-on-social-media-171030/

Just a few days ago we posted an article about the dozens of cases independent photographers file against mainstream media outlets.

These lawsuits accuse companies such as CBS, NBC, and Warner Bros of using copyrighted photos without the owners’ permission and have resulted in hundreds of settlements this year alone.

While the evidence in these cases is often strong, going up against such powerful companies is not without risk. They have the money and resources to fight back, by any means necessary. This is what New York photographer Jon Tannen has witnessed as well.

In February, Tannen sued CBS Interactive because it used one of his copyrighted photos on the website 247sports.com without permission. While he hoped this would result in a decent settlement, CBS found a smoking gun of its own and decided to fire back.

A few days ago the photographer was hit with a rather unusual lawsuit. In a four-page complaint, CBS Broadcasting accuses him of posting a copyright-infringing image of the classic TV-show Gunsmoke on social media.

Gunsmoke is one of the longest-running drama TV-shows in US history and aired on CBS from 1955 through 1975. In the complaint, the media giant brands Tannen a hypocrite for posting the infringing CBS screenshots online.

“This copyright infringement action arises out of Defendant’s unauthorized use of Plaintiff’s valuable intellectual property. Tannen hypocritically engaged in this act of infringement while simultaneously bringing suit against Plaintiff’s sister company, CBS Interactive Inc., claiming it had violated his own copyright.”

The complaint

The screenshot(s), which we were unable to locate at the time of writing, are taken from the “Dooley Surrenders” episode, first aired in 1958. While many would classify a screenshot from a full episode as fair use, CBS is having none of it.

“Without any license or authorization from Plaintiff, Defendant has copied and published via social media platforms images copied from the ‘Dooley Surrenders’ episode of GUNSMOKE,” CBS writes, adding that the “infringement of Plaintiff’s copyright is willful within the meaning of the Copyright Act.”

CBS says that it’s been harmed by the infringing action but it can’t determine any actual damages. Instead, it requests statutory damages for willful copyright infringement which can reach $150,000 per work.

Of course, it’s unlikely that Tannen will have to pay that. CBS’s lawsuit is a clear retaliatory move through which the company hopes to lessen the potential damages for their own alleged infringement.

Posting a screenshot of a TV-episode is on a completely different level than using a full photo in a publication, of course. Still, CBS has shown that it’s willing to put up a fight, and with a powerful team of lawyers they are a tricky adversary.

—-

The full CBS complaint is available here (pdf). A copy of Tannen’s original suit against CBS Interactive can be found here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

High Court Passes Judgment in Illegal Sky Sports Streaming Case

Post Syndicated from Andy original https://torrentfreak.com/high-court-passes-judgment-in-illegal-sky-sports-streaming-case-171026/

Without doubt, streaming is the hot topic in piracy right now, with thousands of illicit channels, TV shows and movies just a few clicks away.

As widely reported, the legal Kodi software augmented with illicit third-party addons is the preferred way to watch for millions of users. However, if people don’t mind sitting at a desktop machine, there’s also a thriving underbelly of indexing sites and similar platforms offering unauthorized access to infringing content.

According to information released by the Federation Against Copyright Theft, an individual in the UK has just felt the wrath of the High Court for providing content to one such platform.

“On Monday 23 October 2017 a judgment was obtained in the High Court against a Sky customer who had been streaming Sky Sports content illegally online,” FACT reports.

“Mr Yusuf Mohammed, of Bristol, has been ordered to pay legal costs of over £16,000, and to disclose details about the money he made and people he colluded with.”

With FACT releasing no more information, TorrentFreak contacted the anti-piracy group for more details on the case.

“Mohammed shared the Sky Sports stream via a piracy blog,” FACT Director of Communications Alice Skeats told TF.

Although FACT didn’t directly answer our question on the topic, their statement that Mohammed was a Sky customer seems to suggest that he might’ve re-streamed content he previously paid for. When we can clarify this point, we will.

FACT didn’t name the ‘piracy blog’ either, nor did it respond to questions about how many people may have viewed Mohammed’s illegal streams. However, FACT did confirm that he streamed Sky Sports channels, so potentially a wide range of sports was made available.

The other interesting factor is the claim that Mohammed made money from his streams. Again, FACT didn’t reveal how that revenue was generated (understandable since the case is ongoing) but it seems likely that advertising played a part, as it often does on pirate platforms.

Whether Mohammed will comply with the High Court’s orders to reveal who he colluded with is something for the future but even if he does, Sky isn’t finished with him yet. According to FACT, Mohammed’s already sizeable costs bill will be augmented with a claim for damages from the satellite broadcaster.

While providing and profiting from illegal streams could easily be considered criminal in the UK, FACT confirmed that the case against Mohammed was brought by Sky, supported by FACT, as a civil proceeding alone. That was also the case last week, when an individual who shared the Joshua vs Klitschko fight on Facebook apologized and agreed to pay Sky’s legal costs.

That’s an option Middlesbrough businessman Brian Thompson didn’t enjoy when he was arrested for selling infringing ‘Kodi boxes’ two years ago. He was handed an 18-month suspended prison sentence last Friday, after being prosecuted by his local council.

Thompson won’t have to pay compensation but he still gets a criminal record, which can be a major hindrance when trying to get a job or even something as simple as cost-effective insurance cover. Whether these details will have any effect on other commercial pirates in the UK remains to be seen, but it’s certainly possible that some will begin to think twice.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Federate Database User Authentication Easily with IAM and Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/federate-database-user-authentication-easily-with-iam-and-amazon-redshift/

Managing database users through federation allows you to manage authentication and authorization procedures centrally. Amazon Redshift now supports database authentication with IAM, enabling user authentication through enterprise federation. You no longer need to manage separate database users and passwords, which further eases database administration. You can now manage users outside of AWS and authenticate them for access to an Amazon Redshift data warehouse. Do this by integrating IAM authentication and a third-party SAML-2.0 identity provider (IdP), such as AD FS, PingFederate, or Okta. In addition, database users can also be automatically created at their first login based on corporate permissions.

In this post, I demonstrate how you can extend the federation to enable single sign-on (SSO) to the Amazon Redshift data warehouse.

SAML and Amazon Redshift

AWS supports Security Assertion Markup Language (SAML) 2.0, which is an open standard for identity federation used by many IdPs. SAML enables federated SSO, which enables your users to sign in to the AWS Management Console. Users can also make programmatic calls to AWS API actions by using assertions from a SAML-compliant IdP. For example, if you use Microsoft Active Directory for corporate directories, you may be familiar with how Active Directory and AD FS work together to enable federation. For more information, see the Enabling Federation to AWS Using Windows Active Directory, AD FS, and SAML 2.0 AWS Security Blog post.

Amazon Redshift now provides the GetClusterCredentials API operation that allows you to generate temporary database user credentials for authentication. You can set up an IAM permissions policy that generates these credentials for connecting to Amazon Redshift. Extending the IAM authentication, you can configure federation of AWS access through a SAML 2.0–compliant IdP. An IAM role can be configured to permit federated users to call the GetClusterCredentials action and generate temporary credentials to log in to Amazon Redshift databases. You can also set up policies to restrict access to Amazon Redshift clusters, databases, database user names, and user groups.
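For a quick sense of the mechanics, temporary database credentials can also be requested directly with the AWS CLI (a sketch using the example cluster from the walkthrough below; the IAM policy that authorizes the call is created in Step 2):

aws redshift get-cluster-credentials \
    --cluster-identifier examplecorp-dw \
    --db-user nancy \
    --db-name analytics \
    --duration-seconds 3600 \
    --auto-create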

Amazon Redshift federation workflow

In this post, I demonstrate how you can use a JDBC- or ODBC-based SQL client to log in to the Amazon Redshift cluster using this feature. The SQL clients used with Amazon Redshift JDBC or ODBC drivers automatically manage the process of calling the GetClusterCredentials action, retrieving the database user credentials, and establishing a connection to your Amazon Redshift database. You can also use your database application to programmatically call the GetClusterCredentials action, retrieve database user credentials, and connect to the database. I demonstrate these features using an example company to show how different database user accounts can be managed easily using federation.

The following diagram shows how the SSO process works (a programmatic sketch of steps 4-6 follows the list):

  1. JDBC/ODBC
  2. Authenticate using Corp Username/Password
  3. IdP sends SAML assertion
  4. Call STS to assume role with SAML
  5. STS Returns Temp Credentials
  6. Use Temp Credentials to get Temp cluster credentials
  7. Connect to Amazon Redshift using temp credentials
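Programmatically, steps 4-6 of this flow map onto two API calls. The following minimal Python (boto3) sketch assumes a base64-encoded SAML assertion has already been obtained from the IdP in step 3; the role and principal ARNs are hypothetical placeholders:

import boto3

# Placeholder: the base64-encoded SAML response obtained from the IdP (step 3)
saml_assertion_b64 = '...base64-encoded SAML assertion...'

# Step 4: exchange the SAML assertion for temporary AWS credentials via STS
sts = boto3.client('sts')
assumed = sts.assume_role_with_saml(
    RoleArn='arn:aws:iam::123456789012:role/ExampleCorpADFSRole',           # hypothetical role
    PrincipalArn='arn:aws:iam::123456789012:saml-provider/ExampleCorpIdP',  # hypothetical IdP
    SAMLAssertion=saml_assertion_b64)
creds = assumed['Credentials']  # step 5: temporary AWS credentials from STS

# Step 6: use the temporary AWS credentials to request temporary database credentials
redshift = boto3.client('redshift', region_name='us-west-2',
                        aws_access_key_id=creds['AccessKeyId'],
                        aws_secret_access_key=creds['SecretAccessKey'],
                        aws_session_token=creds['SessionToken'])
db_creds = redshift.get_cluster_credentials(DbUser='nancy', DbName='analytics',
                                            ClusterIdentifier='examplecorp-dw',
                                            AutoCreate=True)
# db_creds['DbUser'] and db_creds['DbPassword'] are then used in step 7 to connect

The JDBC and ODBC drivers perform this exchange automatically; the sketch only makes the moving parts visible.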

Walkthrough

Example Corp. is using Active Directory (IdP host: demo.examplecorp.com) to manage federated access for users in its organization. It has an AWS account, 123456789012, and currently manages an Amazon Redshift cluster with the cluster ID “examplecorp-dw” and database “analytics” in the us-west-2 region for its Sales and Data Science teams. It wants the following access:

  • Sales users can access the examplecorp-dw cluster using the sales_grp database group
  • Sales users access examplecorp-dw through a JDBC-based SQL client
  • Sales users access examplecorp-dw through an ODBC connection, for their reporting tools
  • Data Science users access the examplecorp-dw cluster using the data_science_grp database group.
  • Partners access the examplecorp-dw cluster and query using the partner_grp database group.
  • Partners are not federated through Active Directory and are provided with separate IAM user credentials (with IAM user name examplecorpsalespartner).
  • Partners can connect to the examplecorp-dw cluster programmatically, using languages such as Python.
  • All users are automatically created in Amazon Redshift when they log in for the first time.
  • (Optional) Internal users do not specify database user or group information in their connection string. It is automatically assigned.
  • Data warehouse users can use SSO for the Amazon Redshift data warehouse using the preceding permissions.

Step 1:  Set up IdPs and federation

The Enabling Federation to AWS Using Windows Active Directory post demonstrated how to prepare Active Directory and enable federation to AWS. Using those instructions, you can establish trust between your AWS account and the IdP and enable user access to AWS using SSO.  For more information, see Identity Providers and Federation.

For this walkthrough, assume that this company has already configured SSO to their AWS account: 123456789012 for their Active Directory domain demo.examplecorp.com. The Sales and Data Science teams are not required to specify database user and group information in the connection string. The connection string can be configured by adding SAML Attribute elements to your IdP. Configuring these optional attributes enables internal users to conveniently avoid providing the DbUser and DbGroup parameters when they log in to Amazon Redshift.

The user-name attribute can be set up as follows, with a user ID (for example, nancy) or an email address (for example, nancy@examplecorp.com):

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbUser">  
  <AttributeValue>user-name</AttributeValue>
</Attribute>

The AutoCreate attribute can be defined as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/AutoCreate">
    <AttributeValue>true</AttributeValue>
</Attribute>

The sales_grp database group can be included as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbGroups">
    <AttributeValue>sales_grp</AttributeValue>
</Attribute>

For more information about attribute element configuration, see Configure SAML Assertions for Your IdP.

Step 2: Create IAM roles for access to the Amazon Redshift cluster

The next step is to create IAM policies with permissions to call GetClusterCredentials and provide authorization for Amazon Redshift resources. To grant a SQL client the ability to retrieve the cluster endpoint, region, and port automatically, include the redshift:DescribeClusters action with the Amazon Redshift cluster resource in the IAM role.  For example, users can connect to the Amazon Redshift cluster using a JDBC URL without the need to hardcode the Amazon Redshift endpoint:

Previous:  jdbc:redshift://endpoint:port/database

Current:  jdbc:redshift:iam://clustername:region/dbname

Use IAM to create the following policies. You can also use an existing user or role and assign these policies. For example, if you already created an IAM role for IdP access, you can attach the necessary policies to that role. Here is the policy created for sales users for this example:

Sales_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
            ]
        }
    ]
}

The policy uses the following parameter values:

  • Region: us-west-2
  • AWS Account: 123456789012
  • Cluster name: examplecorp-dw
  • Database group: sales_grp
  • IAM role: AIDIODR4TAW7CSEXAMPLE
Each statement in this policy is explained below.

{
    "Effect": "Allow",
    "Action": [
        "redshift:DescribeClusters"
    ],
    "Resource": [
        "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
    ]
}

Allow users to retrieve the cluster endpoint, region, and port automatically for the Amazon Redshift cluster examplecorp-dw. This specification uses the resource format arn:aws:redshift:region:account-id:cluster:clustername. For example, the SQL client JDBC can be specified in the format jdbc:redshift:iam://clustername:region/dbname.

For more information, see Amazon Resource Names.

{
    "Effect": "Allow",
    "Action": [
        "redshift:GetClusterCredentials"
    ],
    "Resource": [
        "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
        "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
    ],
    "Condition": {
        "StringEquals": {
            "aws:userid": "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
        }
    }
}

Generates a temporary token to authenticate into the examplecorp-dw cluster. “arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}” restricts the corporate user name to the database user name for that user. This resource is specified using the format: arn:aws:redshift:region:account-id:dbuser:clustername/dbusername.

The Condition block enforces that the AWS user ID should match “AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com”, so that individual users can authenticate only as themselves. The AIDIODR4TAW7CSEXAMPLE role has the Sales_DW_IAM_Policy policy attached.

{
    "Effect": "Allow",
    "Action": [
        "redshift:CreateClusterUser"
    ],
    "Resource": [
        "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
    ]
}

Automatically creates database users in examplecorp-dw when they log in for the first time. Subsequent logins reuse the existing database user.
{
    "Effect": "Allow",
    "Action": [
        "redshift:JoinGroup"
    ],
    "Resource": [
        "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
    ]
}

Allows sales users to join the sales_grp database group, through the resource “arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp”, which is specified in the format arn:aws:redshift:region:account-id:dbgroup:clustername/dbgroupname.

Similar policies can be created for Data Science users with access to join the data_science_grp group in examplecorp-dw. You can now attach the Sales_DW_IAM_Policy policy to the role that is mapped to the IdP application for SSO. For more information about how to define the claim rules, see Configuring SAML Assertions for the Authentication Response.
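For illustration, creating this policy and attaching it with the AWS CLI might look like the following (a sketch; ExampleCorpADFSRole is a hypothetical name for the role mapped to the IdP application):

aws iam create-policy \
    --policy-name Sales_DW_IAM_Policy \
    --policy-document file://sales_dw_iam_policy.json
aws iam attach-role-policy \
    --role-name ExampleCorpADFSRole \
    --policy-arn arn:aws:iam::123456789012:policy/Sales_DW_IAM_Policy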

Because partners are not authorized using Active Directory, they are provided with IAM credentials and added to the partner_grp database group. The Partner_DW_IAM_Policy is attached to the IAM users for partners. The following policy allows partners to log in using the IAM user name as the database user name.

Partner_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "redshift:DbUser": "${aws:username}"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/partner_grp"
            ]
        }
    ]
}

The condition "redshift:DbUser": "${aws:username}" forces an IAM user to use the IAM user name as the database user name.

With the previous steps configured, you can now establish the connection to Amazon Redshift through JDBC- or ODBC-supported clients.

Step 3: Set up database user access

Before you start connecting to Amazon Redshift using the SQL client, set up the database groups for appropriate data access. Log in to your Amazon Redshift database as superuser to create a database group, using CREATE GROUP.

Log in to examplecorp-dw/analytics as a superuser and create the following groups:

CREATE GROUP sales_grp;
CREATE GROUP data_science_grp;
CREATE GROUP partner_grp;

Use the GRANT command to define access permissions to database objects (tables/views) for the preceding groups.
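For example, read access might be granted as follows (a sketch; the sales schema is a hypothetical placeholder):

GRANT USAGE ON SCHEMA sales TO GROUP sales_grp;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP sales_grp;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO GROUP data_science_grp;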

Step 4: Connect to Amazon Redshift using the JDBC SQL client

Assume that sales user “nancy” is using the SQL Workbench client and JDBC driver to log in to the Amazon Redshift data warehouse. The following steps help set up the client and establish the connection:

  1. Download the latest Amazon Redshift JDBC driver from the Configure a JDBC Connection page
  2. Build the JDBC URL with the IAM option in the following format:
    jdbc:redshift:iam://examplecorp-dw:us-west-2/sales_db

Because the redshift:DescribeClusters action is assigned to the preceding IAM roles, it automatically resolves the cluster endpoints and the port. Otherwise, you can specify the endpoint and port information in the JDBC URL, as described in Configure a JDBC Connection.

Identify the following JDBC options for providing the IAM credentials (see the “Prepare your environment” section) and configure them in the SQL Workbench Connection Profile:

plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider
idp_host=demo.examplecorp.com (the name of the corporate identity provider host)
idp_port=443 (the port of the corporate identity provider host)
user=examplecorp\nancy (corporate user name)
password=*** (corporate user password)

The SQL workbench configuration looks similar to the following screenshot:

Now, “nancy” can connect to examplecorp-dw by authenticating with the corporate Active Directory. Because the SAML attribute elements are already configured for nancy, she logs in as database user nancy and is assigned to the sales_grp database group. Similarly, other Sales and Data Science users can connect to the examplecorp-dw cluster. A custom Amazon Redshift ODBC driver can also be used to connect with a SQL client. For more information, see Configure an ODBC Connection.

Step 5: Connect to Amazon Redshift using a JDBC SQL client and IAM credentials

This optional step is necessary only when you want to enable users that are not authenticated with Active Directory. Partners are provided with IAM credentials that they can use to connect to the examplecorp-dw Amazon Redshift cluster. These IAM users are attached to the Partner_DW_IAM_Policy, which permits them to join the partner_grp database group in Amazon Redshift. The following JDBC URL enables them to connect to the Amazon Redshift cluster:

jdbc:redshift:iam://examplecorp-dw:us-west-2/analytics?AccessKeyID=XXX&SecretAccessKey=YYY&DbUser=examplecorpsalespartner&DbGroup=partner_grp&AutoCreate=true

The AutoCreate option automatically creates a new database user the first time the partner logs in. There are several other options available to conveniently specify the IAM user credentials. For more information, see Options for providing IAM credentials.
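As one example of those options, a named credentials profile on the client machine can stand in for the inline keys (assuming the driver's Profile option and a hypothetical profile called partner):

jdbc:redshift:iam://examplecorp-dw:us-west-2/analytics?Profile=partner&DbUser=examplecorpsalespartner&DbGroup=partner_grp&AutoCreate=true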

Step 6: Connect to Amazon Redshift using an ODBC client for Microsoft Windows

Assume that another sales user “uma” is using an ODBC-based client to log in to the Amazon Redshift data warehouse using Example Corp Active Directory. The following steps help set up the ODBC client and establish the Amazon Redshift connection in a Microsoft Windows operating system connected to your corporate network:

  1. Download and install the latest Amazon Redshift ODBC driver.
  2. Create a system DSN entry.
    1. In the Start menu, locate the driver folder or folders:
      • Amazon Redshift ODBC Driver (32-bit)
      • Amazon Redshift ODBC Driver (64-bit)
      • If you installed both drivers, you have a folder for each driver.
    2. Choose ODBC Administrator, and then type your administrator credentials.
    3. To configure the driver for all users on the computer, choose System DSN. To configure the driver for your user account only, choose User DSN.
    4. Choose Add.
  3. Select the Amazon Redshift ODBC driver, and choose Finish. Configure the following attributes:
    Data Source Name = any friendly name to identify the ODBC connection
    Database = analytics
    User = uma (corporate user name)
    Auth Type = Identity Provider: AD FS
    Password = leave blank (Windows automatically authenticates)
    Cluster ID = examplecorp-dw
    idp_host = demo.examplecorp.com (the name of the corporate IdP host)

This configuration looks like the following:

  4. Choose OK to save the ODBC connection.
  5. Verify that uma is set up with the SAML attributes, as described in the “Set up IdPs and federation” section.

The user uma can now use this ODBC connection to connect to the Amazon Redshift cluster from any ODBC-based tool or reporting tool such as Tableau. Internally, uma authenticates using the IAM role with the Sales_DW_IAM_Policy attached and is assigned to the sales_grp database group.

Step 7: Connect to Amazon Redshift using Python and IAM credentials

To enable partners to connect to the examplecorp-dw cluster programmatically, use Python on a computer such as an Amazon EC2 instance. Reuse the IAM users that are attached to the Partner_DW_IAM_Policy policy defined in Step 2.

The following steps show this set up on an EC2 instance:

  1. Launch a new EC2 instance with an IAM role that has the Partner_DW_IAM_Policy attached, as described in Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances. Alternatively, you can attach an existing IAM role to the instance.
  2. This example uses the Python PostgreSQL driver (PyGreSQL) to connect to your Amazon Redshift cluster. To install PyGreSQL on Amazon Linux, use the following commands as the ec2-user:
    sudo easy_install pip
    sudo yum install postgresql postgresql-devel gcc python-devel
    sudo pip install PyGreSQL

  3. The following code snippet demonstrates programmatic access to Amazon Redshift for partner users:
    #!/usr/bin/env python
    """
    Usage:
    python redshiftscript.py
    
    * Copyright 2014, Amazon.com, Inc. or its affiliates. All Rights Reserved.
    *
    * Licensed under the Amazon Software License (the "License").
    * You may not use this file except in compliance with the License.
    * A copy of the License is located at
    *
    * http://aws.amazon.com/asl/
    *
    * or in the "license" file accompanying this file. This file is distributed
    * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
    * express or implied. See the License for the specific language governing
    * permissions and limitations under the License.
    """
    
    import sys
    import pg
    import boto3
    
    REGION = 'us-west-2'
    CLUSTER_IDENTIFIER = 'examplecorp-dw'
    DB_NAME = 'sales_db'
    DB_USER = 'examplecorpsalespartner'
    
    options = """keepalives=1 keepalives_idle=200 keepalives_interval=200
                 keepalives_count=6"""
    
    set_timeout_stmt = "set statement_timeout = 1200000"
    
    def conn_to_rs(host, port, db, usr, pwd, opt=options, timeout=set_timeout_stmt):
        rs_conn_string = """host=%s port=%s dbname=%s user=%s password=%s
                             %s""" % (host, port, db, usr, pwd, opt)
        print "Connecting to %s:%s:%s as %s" % (host, port, db, usr)
        rs_conn = pg.connect(dbname=rs_conn_string)
        rs_conn.query(timeout)
        return rs_conn
    
    def main():
        # describe the cluster and fetch the IAM temporary credentials
        global redshift_client
        redshift_client = boto3.client('redshift', region_name=REGION)
        response_cluster_details = redshift_client.describe_clusters(ClusterIdentifier=CLUSTER_IDENTIFIER)
        response_credentials = redshift_client.get_cluster_credentials(DbUser=DB_USER,DbName=DB_NAME,ClusterIdentifier=CLUSTER_IDENTIFIER,DurationSeconds=3600)
        rs_host = response_cluster_details['Clusters'][0]['Endpoint']['Address']
        rs_port = response_cluster_details['Clusters'][0]['Endpoint']['Port']
        rs_db = DB_NAME
        rs_iam_user = response_credentials['DbUser']
        rs_iam_pwd = response_credentials['DbPassword']
        # connect to the Amazon Redshift cluster
        conn = conn_to_rs(rs_host, rs_port, rs_db, rs_iam_user,rs_iam_pwd)
        # execute a query
        result = conn.query("SELECT sysdate as dt")
        # fetch results from the query
        for dt_val in result.getresult() :
            print dt_val
        # close the Amazon Redshift connection
        conn.close()
    
    if __name__ == "__main__":
        main()

You can save this Python program in a file (redshiftscript.py) and execute it at the command line as ec2-user:

python redshiftscript.py

Now partners can connect to the Amazon Redshift cluster using the Python script, and authentication is federated through the IAM user.

Summary

In this post, I demonstrated how to use federated access with Active Directory and IAM roles to enable single sign-on to an Amazon Redshift cluster. I also showed how partners outside an organization can be managed easily using IAM credentials. The GetClusterCredentials API action, now supported by Amazon Redshift, lets you manage a large number of database users and have them use corporate credentials to log in. You don’t have to maintain separate database user accounts.

Although this post demonstrated the integration of IAM with AD FS and Active Directory, you can replicate this solution with your choice of SAML 2.0–compliant third-party identity providers (IdPs), such as PingFederate or Okta. For the different supported federation options, see Configure SAML Assertions for Your IdP.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to establish federated access to your AWS resources by using Active Directory user attributes.


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

Join us for an evening of League of Legends

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/league-of-legends-evening/

Last month, we shared the news that Riot Games is supporting digital literacy by matching 25% of sales of Championship Ashe and Championship Ward to create a charity fund that will benefit the Raspberry Pi Foundation and two other charities.

Vote for the Raspberry Pi Foundation

Riot Games is now calling for all League of Legends players to vote for their favourite charity — the winning organisation will receive 50% of the total fund.

By visiting the ‘Vote for charity’ tab in-client, you’ll be able to choose between the Raspberry Pi Foundation, BasicNeeds, and Learning Equality.

Players can vote only once, and your vote will be multiplied based on your honour level. Voting ends on 5 November 2017 at 11:59pm PT.

League of Legends with Riot Gaming

In honour of the Riot Games Charity Fund vote, and to support the work of the Raspberry Pi Foundation, KimmieRiot and M0RGZ of top female eSports organisation Riot Gaming (no relation to Riot Games) will run a four-hour League of Legends live-stream this Saturday, 21 October, from 6pm to 10pm BST.

Playing as Championship Ashe, they’ll be streaming live to Twitch, and you’re all invited to join in the fun. I’ll be making an appearance in the chat box as RaspberryPiFoundation, and we’ll be giving away some free T-shirts and stickers during the event — make sure to tune in to the conversation.

In a wonderful gesture, Riot Gaming will pass on all donations made to their channel during the live-stream to us. These funds will directly aid the ongoing charitable work of Raspberry Pi and our computing education programmes like CoderDojo.

Make sure to follow Riot Gaming, and activate notifications so you don’t miss the event!

We’re blushing

Thank you to everyone who buys Championship Ashe and Championship Ward, and to all of you who vote for us. We’re honoured to be one of the three charities selected to benefit from the Riot Games Charity Fund.

And a huge thank you to Riot Gaming for organising an evening of Raspberry Pi and League of Legends. We can’t wait!

The post Join us for an evening of League of Legends appeared first on Raspberry Pi.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch the instance

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLs or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the username and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

year: Year that the song was released
songtitle: Title of the song
artistname: Name of the song artist
songid: Unique identifier for the song
artistid: Unique identifier for the song artist
timesignature: Variable estimating the time signature of the song
timesignature_confidence: Confidence in the estimate for timesignature
loudness: Continuous variable indicating the average amplitude of the audio in decibels
tempo: Variable indicating the estimated beats per minute of the song
tempo_confidence: Confidence in the estimate for tempo
key: Variable with twelve levels indicating the estimated key of the song (C, C#, …, B)
key_confidence: Confidence in the estimate for key
energy: Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch: Continuous variable that indicates the pitch of the song
timbre_0_min through timbre_11_min: Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max through timbre_11_max: Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10: Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)
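
If you prefer to stage the file in your own S3 bucket, as mentioned above, one option is the cloudyr aws.s3 package. This is a minimal sketch only; the package is not used elsewhere in this walkthrough, and the file name and bucket name are placeholders.

#upload the dataset file to your own bucket (file and bucket names are placeholders)
library(aws.s3)
put_object(file = "billboard.csv",
           object = "data/billboard.csv",
           bucket = "your-bucket-name")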

Create an Athena table based on the dataset

In the Athena console, select the default database, sampledb, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge their impact on the song’s overall popularity. Look at one such property, timesignature, and determine the value that is most frequent among songs in the database. The time signature is a measure of the number of beats per measure and the note value that counts as one beat.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.
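
To express these counts as shares of the dataset, you can convert them to proportions; a minimal sketch using base R:

#convert the counts to proportions; a timesignature of 4 covers roughly 90% of songs (6787/7574)
round(prop.table(table(dftimesignature)), 3)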

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38
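
If you prefer to select predictors by name rather than by column position, H2O’s modeling functions also accept column names. Here is a minimal sketch that reproduces the same selection; the drop_cols helper is an illustration, not part of the original code.

#drop the identifier columns and the target, keeping the 33 numeric predictors
drop_cols <- c("year", "songtitle", "artistname", "songid", "artistid", "top10")
x.indep.names <- setdiff(colnames(train.h2o), drop_cols)
length(x.indep.names)   #33, matching the column indices 6:38 above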

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, a correlation between -1.0 and -0.5 or between 0.5 and 1.0 indicates strong correlation, while a value between -0.1 and 0.1 indicates weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.
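
As a quick check for other potential collinearity, you can compute pairwise correlations for a handful of predictors directly on the R data frame; a minimal sketch:

#pairwise correlations on the training data; the loudness-energy pair stands out
round(cor(BillboardTrain[, c("loudness", "energy", "pitch", "tempo")]), 3)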

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • H2O orders variables by importance, and energy ranks near the bottom of the list, indicating that it is not significant in this model.

You can conclude that Model 2 is not ideal for this use case, as energy is not significant.

Create Model 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be Top 10 hits (true positives). However, it has 28 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits) and 26 false negatives (actual Top 10 hits that the model missed).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely not to be a Top 10 hit, and 3.4526052% likely to be one (predict=1 indicates a Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results lists the actual predictions made.  From the third set of results, this model predicts that 61 songs will be Top 10 hits.
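
The earlier warning messages suggested applying your own threshold to a probability column, and you can do exactly that with the dpr data frame; a minimal sketch (the 0.3 cutoff is purely illustrative):

#flag a song as a predicted hit whenever p1 exceeds a chosen cutoff
dpr$predict_custom <- ifelse(dpr$p1 > 0.3, 1, 0)
table(dpr$predict_custom)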

Compute the baseline accuracy by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552 (319 of the 373 test observations classified correctly at the F1-optimal threshold), provides a small improvement over the baseline model. But is this model useful for record labels?
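
Both accuracy figures follow directly from the confusion matrices; a minimal sketch of the arithmetic:

#baseline: always predict 'not a Top 10 hit'
314 / (314 + 59)
## [1] 0.8418231
#Model 3 at the F1-optimal threshold: 54 errors out of 373 test observations
(373 - 54) / 373
## [1] 0.8552279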

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 Top 10 hits correctly at the optimal threshold, which is more than half of the 59 songs that actually made the Top 10 that year.
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10 (a sketch of how to build such a list follows below).
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.
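
To produce the kind of shortlist described above, you can rank the 2010 test songs by their predicted probability of reaching the Top 10. A minimal sketch, reusing the dpr predictions from earlier; this assumes row order is preserved between BillboardTest and the H2O predictions, which holds for as.h2o and h2o.predict.

#attach the predicted probabilities to song titles and rank in descending order
shortlist <- cbind(BillboardTest[, c("songtitle", "artistname")], p1 = dpr$p1)
head(shortlist[order(-shortlist$p1), ], 10)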

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor so that H2O treats this as a classification problem.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.
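
One convenient way to run such experiments is H2O’s built-in grid search. Here is a minimal sketch, reusing the same x.indep, y.dep, and train.h2o objects; the specific hyperparameter values are illustrative, not tuned.

#try a few max_depth/learn_rate combinations and rank the resulting models by AUC
gbm.grid <- h2o.grid("gbm",
                     x = x.indep, y = y.dep,
                     training_frame = train.h2o,
                     ntrees = 500, seed = 1122,
                     hyper_params = list(max_depth = c(3, 4, 5),
                                         learn_rate = c(0.01, 0.05)))
h2o.getGrid(gbm.grid@grid_id, sort_by = "auc", decreasing = TRUE)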

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyperparameters, such as the learning rate, the number of epochs, or the number and size of the hidden layers.
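
As a starting point for that experimentation, here is a minimal sketch of one variation that adds dropout regularization and a third hidden layer; the specific values are illustrative, not tuned.

#a deeper network with dropout; compare its AUC against the first attempt
dlearning.modelh2 <- h2o.deeplearning(y = y.dep, x = x.indep,
                                      training_frame = train.h2o,
                                      epochs = 100,
                                      hidden = c(128, 128, 128),
                                      activation = "RectifierWithDropout",
                                      input_dropout_ratio = 0.1,
                                      seed = 1122)
h2o.auc(h2o.performance(dlearning.modelh2, test.h2o))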

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold The cutoff point that balances specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of predicted positives that are actually positive; for example, how many of the songs classified as hits really were hits.
AUC Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.9 – 1.0 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

0.6 – 0.7 = poor (D)

0.5 – 0.6 = fail (F)
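
As a worked example of the first two metrics, take Model 3’s confusion matrix at the F1-optimal threshold (TP = 33, FN = 26, TN = 286, FP = 28). Note that the sensitivity of 0.2033898 reported earlier was measured at a threshold near 0.5, not at the F1-optimal one.

tp <- 33; fn <- 26; tn <- 286; fp <- 28
tp / (tp + fn)   #sensitivity (recall): 33/59 is roughly 0.559
tn / (tn + fp)   #specificity: 286/314 is roughly 0.911
tp / (tp + fp)   #precision: 33/61 is roughly 0.541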

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric              Model 3                 GBM Model               Deep Learning Model
Accuracy (max)      0.882038 (t=0.435479)   0.876676 (t=0.442757)   0.865952 (t=0.999999)
Precision (max)     1.0 (t=0.821606)        1.0 (t=0.802184)        1.0 (t=1.0)
Recall (max)        1.0                     1.0                     1.0 (t=0)
Specificity (max)   1.0                     1.0                     1.0 (t=1)
Sensitivity (t=0.5) 0.2033898               0.1355932               0.3898305
AUC                 0.8492389               0.8630573               0.7568822

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple GEOINT application using SparkR.


About the Authors

Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies, and is fond of old classics like the Asterix and Obelix comics and Hitchcock movies.


Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.


Football Coach Retweets, Gets Sued for Copyright Infringement

Post Syndicated from Andy original https://torrentfreak.com/football-coach-retweets-gets-sued-for-copyright-infringement-170928/

When copyright infringement lawsuits hit the US courts, there’s often a serious case at hand. Whether that’s the sharing of a leaked movie online or indeed the mass infringement that allegedly took place on Megaupload, there’s usually something quite meaty to discuss.

A lawsuit filed this week in a Pennsylvania federal court certainly provides the latter, while remaining a fairly trivial matter at first glance.

The case was filed by sports psychologist and author Dr. Keith Bell. It begins by describing Bell as an “internationally recognized performance consultant” who has worked with 500 teams, including the Olympic and national teams for the United States, Canada, Australia, New Zealand, Hong Kong, Fiji, and the Cayman Islands.

Bell is further described as a successful speaker, athlete and coach; “A four-time
collegiate All-American swimmer, a holder of numerous world and national masters swim records, and has coached several collegiate, high school, and private swim teams to competitive success.”

At the heart of the lawsuit is a book that Bell published in 1982, entitled Winning Isn’t Normal.

“The book has enjoyed substantial acclaim, distribution and publicity. Dr. Bell is the sole author of this work, and continues to own all rights in the work,” the lawsuit (pdf) reads.

Bell claims that on or about November 6, 2015, King’s College head football coach Jeffery Knarr retweeted a tweet that was initially posted from @NSUBaseball32, a Twitter account operated by Northeastern State University’s RiverHawks baseball team. The retweet, as shown in the lawsuit, can be seen below.

The retweet that sparked the lawsuit

“The post was made without authorization from Dr. Bell and without attribution
to Dr. Bell,” the lawsuit reads.

“Neither Defendant King’s College nor Defendant Jeffery Knarr contacted Dr.
Bell to request permission to use Dr. Bell’s copyrighted work. As of November 14, 2015, the post had received 206 ‘Retweets’ and 189 ‘Likes.’ Due to the globally accessible nature of Twitter, the post was accessible by Internet users across the world.”

Bell says he sent a cease and desist letter to NSU in September 2016 and shortly thereafter NSU removed the post, which removed the retweets. However, this meant that Knarr’s retweet had been online for “at least” 10 months and 21 days.

To put the icing on the cake, Bell also holds the trademark to the phrase “Winning Isn’t Normal”, so he’s suing Knarr and his King’s College employer for trademark infringement too.

“The Defendants included Plaintiff’s trademark twice in the Twitter post. The first instance was as the title of the post, with the mark shown in letters which were emphasized by being capitalized, bold, and underlined,” the lawsuit notes.

“The second instance was at the end of the post, with the mark shown in letters which were emphasized by being capitalized, bold, underlined, and followed by three exclamation points.”

Describing what appears to be a casual retweet as “willful, intentional and purposeful” infringement carried out “in disregard of and with indifference to Plaintiff’s rights,” Bell demands damages and attorneys’ fees from Knarr and his employer.

“As a direct and proximate result of said infringement by Defendants, Plaintiff is entitled to damages in an amount to be proven at trial,” the lawsuit concludes.

Since the page from the book retweeted by Knarr is a small portion of the overall work, there may be a fair use defense. Nevertheless, defending this kind of suit is never cheap, so it’s probably fair to say there will already be a considerable amount of regret among the defendants at ever having set eyes on Bell’s 35-year-old book.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Boston Red Sox Caught Using Technology to Steal Signs

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/09/boston_red_sox_.html

The Boston Red Sox admitted to eavesdropping on the communications channel between catcher and pitcher.

Stealing signs is believed to be particularly effective when there is a runner on second base who can both watch what hand signals the catcher is using to communicate with the pitcher and can easily relay to the batter any clues about what type of pitch may be coming. Such tactics are allowed as long as teams do not use any methods beyond their eyes. Binoculars and electronic devices are both prohibited.

In recent years, as cameras have proliferated in major league ballparks, teams have begun using the abundance of video to help them discern opponents’ signs, including the catcher’s signals to the pitcher. Some clubs have had clubhouse attendants quickly relay information to the dugout from the personnel monitoring video feeds.

But such information has to be rushed to the dugout on foot so it can be relayed to players on the field — a runner on second, the batter at the plate — while the information is still relevant. The Red Sox admitted to league investigators that they were able to significantly shorten this communications chain by using electronics. In what mimicked the rhythm of a double play, the information would rapidly go from video personnel to a trainer to the players.

This is ridiculous. The rules about what sorts of sign stealing are allowed and what sorts are not are arbitrary and unenforceable. My guess is that the only reason there aren’t more complaints is because everyone does it.

The Red Sox responded in kind on Tuesday, filing a complaint against the Yankees claiming that the team uses a camera from its YES television network exclusively to steal signs during games, an assertion the Yankees denied.

Boston’s mistake here was using a very conspicuous Apple Watch as a communications device. They need to learn to be more subtle, like everyone else.

Pirates Are Not Easily Deterred by Viruses and Malware, Study Finds

Post Syndicated from Ernesto original https://torrentfreak.com/pirates-are-not-easily-deterred-by-viruses-and-malware-study-finds-170913/

Despite the widespread availability of legal streaming services, piracy remains rampant around the world.

This is the situation in Singapore where a new study commissioned by the Cable and Satellite Broadcasting Association of Asia (CASBAA) found that 39% of all Singaporeans download or stream movies, TV shows, or live sports illegally.

The survey, conducted by Sycamore Research, polled the opinions and behaviors of a weighted sample of 1,000 respondents. The research concludes that nearly half of the population regularly pirates, and that these people are not easily deterred.

Although the vast majority of the population knows that piracy is against the law, the lure of free content is often hard to ignore. Many simply see it as socially acceptable behavior.

“The notion that piracy is something that everybody does nowadays turns it into a socially acceptable behavior”, Sycamore Research Director Anna Meadows says, commenting on the findings.

“Numerous studies have shown that what we perceive others to be doing has a far stronger influence on our behavior than what we know we ‘ought’ to do. People know that they shouldn’t really pirate, but they continue to do so because they believe those around them do as well.”

One of the main threats pirates face is the malware and malicious ads present on some sites. This risk is recognized by 74% of active pirates, but they continue nonetheless.

The dangers of malware and viruses, a key talking point among industry groups nowadays, do have some effect. Among those who stopped pirating, 40% cited it as their primary reason. That’s more than the availability of legal services, which is mentioned in 37% of cases.

Aside from traditional download and streaming sites, the growing popularity of pirate media boxes is clearly present in Singapore as well. A total of 14% of Singaporeans admit to having such a device in their home.

So why do people continue to pirate despite the risks?

The answer is simple; because it’s free. The vast majority (63%) mention the lack of financial costs as their main motivation to use pirate sites. The ability to watch something whenever they want and a lack of legal options follow at a distance, both at 31%.

“There are few perceived downsides to piracy,” Meadows notes.

“Whilst the risk of devices being infected with viruses or malware is understood, it is underweighted. In the face of the benefit of free content, people appear to discount the risks, as the idea of getting something for nothing is so psychologically powerful.”

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

The Things Pirates Do To Hinder Anti-Piracy Investigations

Post Syndicated from Andy original https://torrentfreak.com/the-things-pirates-do-to-hinder-anti-piracy-outfits-170909/

Dedicated Internet pirates dealing in fresh content or operating at any significant scale can be pretty sure that rightsholders and their anti-piracy colleagues are interested in their activities at some level.

With this in mind, most pirates these days are aware of things they can do to enhance their security, with products like VPNs often discussed on the consumer side.

This week, in a report detailing the challenges social media poses to intellectual property rights, UK anti-piracy outfit Federation Against Copyright Theft published a list of techniques deployed by pirates that hinder their investigations.

Fake/hidden website registration details

“Website registration details are often fake or hidden, which provides no further links to the person controlling the domain and its illegal activities,” the group reveals.

Protected WHOIS records are nothing new and can sometimes be uncloaked by a determined adversary via court procedures. However, in the early stages of an investigation, open records provide leads that can be extremely useful in building an early picture about who might be involved in the operation of a website.

Having them hidden is a definite plus for pirate site operators, especially when the underlying details are also fake, which is particularly common practice. And, with companies like Peter Sunde’s Njalla entering the market, hiding registrations is easier than ever.

Overseas servers

“Investigating servers located offshore cause some specific problems for FACT’s law-enforcement partners. In order to complete a full investigation into an offshore server, a law-enforcement agency must liaise with its counterpart in the country where the server is located. The difficulties of obtaining evidence from other countries are well known,” FACT notes.

While FACT no doubt corresponds with entities overseas, the anti-piracy outfit has a history of targeting UK citizens who are reportedly infringing copyright. It regularly involves UK police in its investigations (FACT itself employs former police officers) but jurisdiction is necessarily limited to the UK.

It is possible to get overseas law enforcement entities involved to seize a server, for example, but they have to be convinced of the need to do so by the police, which isn’t easy and is usually reserved for more serious cases. The bottom line is that by placing a server a long way away from a pirate’s home territory, things can be made much more difficult for local investigators.

Torrent websites and DMCA compliance

“Some torrent website operators who maintain a high DMCA compliance rate will often use this to try to appease the law, while continuing to provide infringing links,” FACT says.

This is an interesting one. Under law in both the United States and Europe, service providers are required to remove infringing content from their systems when they are notified of its existence by a rightsholder or its agent. Not doing so can render them liable, if the content is indeed infringing.

What FACT appears to be saying is that sites that comply with the law, by removing infringing content when asked to, become more difficult targets for legal action. It sounds very obvious but the underlying suggestion is that compliance on the surface is used as a protective mechanism. No example sites are mentioned but the strategy has clearly hindered FACT.

Current legislation too vague to remove infringing live sports streams

“Current legislation is insufficient to effectively tackle the issue of websites illegally offering coverage of live sports events. Section 512 (c) of the Digital Millennium Copyright Act (DMCA) states that: upon notification of claimed infringement, the service provider should ‘respond expeditiously’ to remove or disable access to the copyright-infringing material. Most live sports events are under two hours long, so such non-specific timeframes for required action are inadequate,” FACT complains.

Since government reports like these can take a long time to prepare, it appears that FACT and its partners may have already found a solution to this particular problem. Major FACT client the Premier League now has a High Court injunction in place which allows it to block infringing streams on a real-time basis. It doesn’t remove the content at its source, but it still renders it largely inaccessible in the UK.

Nevertheless, FACT calls for takedowns to be actioned more swiftly, noting that “the law needs to reflect this narrow timeframe with a specified required response period for websites offering such live feeds.”

Camming content directly from cinema screen to the cloud

“Recent advancements in technology have made this a viable option to ‘cammers’ to avoid detection. Attempts to curtail and delete illicitly recorded film footage may become increasingly difficult with the emergence of streaming apps that automatically upload recorded video to cloud services,” FACT reports.

Over the years, FACT has been involved in numerous operations to hinder those who record movies with cameras in theaters and then upload them to the Internet. Once the perpetrator has exited the theater, FACT has effectively lost the battle, but the possibility that a live upload can now take place is certainly an interesting proposition.

“While enforcing officers may delete the footage held on the device, the footage has potentially already been stored remotely on a cloud system,” FACT warns.

Equally, this could also prove a problem for those seeking to secure evidence. With a cloud upload, the person doing the recording could safely delete the footage from the local device. That could be an obstacle to proving that an offense had even been committed when a suspect is confronted in situ.

Virtual currencies

“There is great potential in virtual currencies for money launderers and illicit traders. Government and law enforcement have raised concerns on how virtual currencies can be sent anonymously, leaving little or no trail for regulators or law-enforcement agencies,” FACT writes.

For many years, pirates of all kinds have relied on systems like PayPal, Mastercard, and Visa to shift money around. However, these payment systems are now more difficult to deploy on pirate services and are more easily traced, even when operators manage to squeeze them through the gaps.

The same cannot be said of bitcoin and similar currencies that are gaining in popularity all the time. They are harder to use, of course, but there’s little doubt accessibility issues will be innovated out of the equation at some point. Once that happens, these currencies will be a force to be reckoned with.

The UK government’s Share and Share Alike report, which examines the challenges social media poses to intellectual property rights, can be downloaded here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

New UK IP Crime Report Reveals Continued Focus on ‘Pirate’ Kodi Boxes

Post Syndicated from Andy original https://torrentfreak.com/new-uk-ip-crime-report-reveals-continued-focus-on-pirate-kodi-boxes-170908/

The UK’s Intellectual Property Office has published its annual IP Crime Report, spanning the period 2016 to 2017.

It covers key events in the copyright and trademark arenas and is presented with input from the police and trading standards, plus private entities such as the BPI, Premier League, and Federation Against Copyright Theft, to name a few.

The report begins with an interesting statistic. Despite claims that many millions of UK citizens regularly engage in some kind of infringement, figures from the Ministry of Justice indicate that just 47 people were found guilty of offenses under the Copyright, Designs and Patents Act during 2016. That’s down from the 69 found guilty in the previous year.

Despite this low conviction rate, 15% of all internet users aged 12+ are reported to have consumed at least one item of illegal content between March and May 2017. Figures supplied by the Industry Trust for IP indicate that 19% of adults watch content via various IPTV devices – often referred to as set-top, streaming, Android, or Kodi boxes.

“At its cutting edge IP crime is innovative. It exploits technological loopholes before they become apparent. IP crime involves sophisticated hackers, criminal financial experts, international gangs and service delivery networks. Keeping pace with criminal innovation places a burden on IP crime prevention resources,” the report notes.

The report covers a broad range of IP crime, from counterfeit sportswear to foodstuffs, but our focus is obviously on Internet-based infringement. Various contributors cover the aspects of online activity that affect them most, including music industry group BPI.

“The main online piracy threats to the UK recorded music industry at present are from BitTorrent networks, linking/aggregator sites, stream-ripping sites, unauthorized streaming sites and cyberlockers,” the BPI notes.

The BPI’s website blocking efforts have been closely reported, with 63 infringing sites blocked to date via various court orders. However, the BPI reports that more than 700 related URLs, IP addresses, and proxy sites/proxy aggregators have also been rendered inaccessible as part of the same action.

“Site blocking has proven to be a successful strategy as the longer the blocks are in place, the more effective they are. We have seen traffic to these sites reduce by an average of 70% or more,” the BPI reports.

While prosecutions against music pirates are a fairly rare event in the UK, the Crown Prosecution Service (CPS) Specialist Fraud Division highlights that its most significant prosecution of the past 12 months involved a prolific music uploader.

As first revealed here on TF, Wayne Evans was an uploader not only on KickassTorrents and The Pirate Bay, but also some of his own sites. Known online as OldSkoolScouse, Evans reportedly cost the UK’s Performing Rights Society more than £1m in a single year. He was sentenced in December 2016 to 12 months in prison.

While Evans has been free for some time already, the CPS places particular emphasis on the importance of the case, “since it provided sentencing guidance for the Copyright, Designs and Patents Act 1988, where before there was no definitive guideline.”

The CPS says the case was useful on a number of fronts. Despite illegal distribution of content being difficult to investigate and piracy losses proving tricky to quantify, the court found that deterrent sentences are appropriate for the kinds of offenses Evans was accused of.

The CPS notes that various factors affect the severity of such sentences, not least the length of time the unlawful activity has persisted and particularly if it has done so after the service of a cease and desist notice. Other factors include the profit made by defendants and/or the loss caused to copyright holders “so far as it can accurately be calculated.”

Importantly, however, the CPS says that beyond issues of personal mitigation and timely guilty pleas, a jail sentence is probably going to be the outcome for others engaging in this kind of activity in future. That’s something for torrent and streaming site operators and their content uploaders to consider.

“[U]nless the unlawful activity of this kind is very amateur, minor or short-lived, or in the absence of particularly compelling mitigation or other exceptional circumstances, an immediate custodial sentence is likely to be appropriate in cases of illegal distribution of copyright infringing articles,” the CPS concludes.

But while a music-related trial provided the highlight of the year for the CPS, the online infringement world is still dominated by the rise of streaming sites and the now omnipresent “fully-loaded Kodi Box” – set-top devices configured to receive copyright-infringing live TV and VOD.

In the IP Crime Report, the Intellectual Property Office references a former US Secretary of Defense to describe the emergence of the threat.

“The echoes of Donald Rumsfeld’s famous aphorism concerning ‘known knowns’ and ‘known unknowns’ reverberate across our landscape perhaps more than any other. The certainty we all share is that we must be ready to confront both ‘known unknowns’ and ‘unknown unknowns’,” the IPO writes.

“Not long ago illegal streaming through Kodi Boxes was an ‘unknown’. Now, this technology updates copyright infringement by empowering TV viewers with the technology they need to subvert copyright law at the flick of a remote control.”

While the set-top box threat has grown in recent times, the report highlights the important legal clarifications that emerged from the BREIN v Filmspeler case, which found itself before the European Court of Justice.

As widely reported, the ECJ determined that the selling of piracy-configured devices amounts to a communication to the public, something which renders their sale illegal. However, in a submission by PIPCU, the Police Intellectual Property Crime Unit, box sellers are said to cast a keen eye on the legal situation.

“Organised criminals, especially those in the UK who distribute set-top boxes, are aware of recent developments in the law and routinely exploit loopholes in it,” PIPCU reports.

“Given recent judgments on the sale of pre-programmed set-top boxes, it is now unlikely criminals would advertise the devices in a way which is clearly infringing by offering them pre-loaded or ‘fully loaded’ with apps and addons specifically designed to access subscription services for free.”

With sellers beginning to clean up their advertising, it seems likely that detection will become more difficult than when selling was considered a gray area. While that will present its own issues, PIPCU still sees problems on two fronts – a lack of clear legislation and a perception of support for ‘pirate’ devices among the public.

“There is no specific legislation currently in place for the prosecution of end users or sellers of set-top boxes. Indeed, the general public do not see the usage of these devices as potentially breaking the law,” the unit reports.

“PIPCU are currently having to try and ‘shoehorn’ existing legislation to fit the type of criminality being observed, such as conspiracy to defraud (common law) to tackle this problem. Cases are yet to be charged and results will be known by late 2017.”

Whether these prosecutions will be effective remains to be seen, but PIPCU’s comments suggest an air of caution set to a backdrop of box-sellers’ tendency to adapt to legal challenges.

“Due to the complexity of these cases it is difficult to substantiate charges under the Fraud Act (2006). PIPCU have convicted one person under the Serious Crime Act (2015) (encouraging or assisting s11 of the Fraud Act). However, this would not be applicable unless the suspect had made obvious attempts to encourage users to use the boxes to watch subscription only content,” PIPCU notes, adding;

“The selling community is close knit and adapts constantly to allow itself to operate in the gray area where current legislation is unclear and where they feel they can continue to sell ‘under the radar’.”

More generally, pirate sites as a whole are still seen as a threat. As reported last month, the current anti-piracy narrative is that pirate sites represent a danger to their users. As a result, efforts are underway to paint torrent and streaming sites as risky places to visit, with users allegedly exposed to malware and other malicious content. The scare strategy is supported by PIPCU.

“Unlike the purchase of counterfeit physical goods, consumers who buy unlicensed content online are not taking a risk. Faulty copyright doesn’t explode, burn or break. For this reason the message as to why the public should avoid copyright fraud needs to be re-focused.

“A more concerted attempt to push out a message relating to malware on pirate websites, the clear criminality and the links to organized crime of those behind the sites are crucial if public opinion is to be changed,” the unit advises.

But while the changing of attitudes is desirable for pro-copyright entities, PIPCU says that winning over the public may not prove to be an easy battle. It was given a small taste of backlash itself, after taking action against the operator of a pirate site.

“The scale of the problem regarding public opinion of online copyright crime is evidenced by our own experience. After PIPCU executed a warrant against the owner of a streaming website, a tweet about the event (read by 200,000 people) produced a reaction heavily weighted against PIPCU’s legitimate enforcement action,” PIPCU concludes.

In summary, it seems likely that more effort will be expended during the next 12 months to target the set-top box threat, but there doesn’t appear to be an abundance of confidence in existing legislation to tackle all but the most egregious offenders. That being said, a line has now been drawn in the sand – if the public is prepared to respect it.

The full IP Crime Report 2016-2017 is available here (pdf)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

How Much Does ‘Free’ Premier League Piracy Cost These Days?

Post Syndicated from Andy original https://torrentfreak.com/how-much-does-free-premier-league-piracy-cost-these-days-170902/

Right now, the English Premier League is engaged in perhaps the most aggressively innovative anti-piracy operation the Internet has ever seen. After obtaining a new High Court order, it now has the ability to block ‘pirate’ streams of matches, in real-time, with no immediate legal oversight.

If the Premier League believes a server is streaming one of its matches, it can ask ISPs in the UK to block it, immediately. That’s unprecedented anywhere on the planet.

As previously reported, this campaign caused a lot of problems for people trying to access free and premium streams at the start of the season. Many IPTV services were blocked in the UK within minutes of matches starting, with free streams also dropping like flies. According to information obtained by TF, more than 600 illicit streams were blocked during that weekend.

While some IPTV providers and free streams continued without problems, it seems likely that it’s only a matter of time before the EPL begins to pick off more and more suppliers. To be clear, the EPL isn’t taking services or streams down, it’s only blocking them, which means that people using circumvention technologies like VPNs can get around the problem.

However, this raises the big issue again – that of continuously increasing costs. While piracy is often painted as free, it is not, and as setups get fancier, costs increase too.

Below, we take a very general view of a handful of the many ‘pirate’ configurations currently available, to work out how much ‘free’ piracy costs these days. The list is not comprehensive by any means (and excludes more obscure methods such as streaming torrents, which are always free and rarely blocked), but it gives an idea of costs and how the balance of power might eventually tip.

Basic beginner setup

On a base level, people who pirate online need at least some equipment. That could be an Android smartphone and easily installed free software such as Mobdro or Kodi. An Internet connection is a necessity and, if the EPL blocks those all-important streams, a VPN provider is required to circumvent the bans.

Assuming people already have a phone and an Internet connection, a VPN can be bought for less than £5 per month. This basic setup is certainly cheap, but overall it’s an entry-level experience that provides quality equal to the effort and money expended.

Equipment: Phone, tablet, PC
Comms: Fast Internet connection, decent VPN provider
Overall performance: Low quality, unpredictable, often unreliable
Cost: £5pm approx for VPN, plus Internet costs

Big screen, basic

For those who like their matches on the big screen, stepping up the chain costs more money. People need a TV with an HDMI input and a fast Internet connection as a minimum, alongside some kind of set-top device to run the necessary software.

Android devices are the most popular and are roughly split into two groups – the small standalone box type and the plug-in ‘stick’ variant such as Amazon’s Firestick.

A cheap Android set-top box

These cost upwards of £30 to £40, but the software to install on them is free. As on a phone, Mobdro is an option, but most people look to a Kodi setup with third-party addons. That said, all streams received on these setups are now vulnerable to EPL blocking, so in the long term, users will need to run a paid VPN.

The problem here is that some devices (including the 1st gen Firestick) aren’t ideal for running a VPN on top of a stream, so people will need to dump their old device and buy something more capable. That could cost another £30 to £40 or more, depending on requirements.

Importantly, none of this investment guarantees a decent stream – that’s down to what’s available on the day – but invariably the quality is low and/or intermittent, at best.

Equipment: TV, decent Android set-top box or equivalent
Comms: Fast Internet connection, decent VPN provider
Overall performance: Low to acceptable quality, unpredictable, often unreliable
Cost: £30 to £50 for set-top box, £5pm approx for VPN, plus Internet

Premium IPTV – PC or Android based

At this point, premium IPTV services come into play. People have a choice of spending varying amounts of money, depending on the quality of experience they require.

First of all, a monthly IPTV subscription with an established provider that isn’t going to disappear overnight is required, and finding one can be a challenge in itself. We’re not here to review or recommend services but needless to say, like official TV packages they come in different flavors to suit varying wallet sizes. Some stick around, many don’t.

A decent one with a Sky-like EPG costs between £7 and £15 per month, depending on the quality and depth of streams, and how far in advance users are prepared to commit.

Fairly typical IPTV with EPG (VOD shown)

Paying for a year in advance tends to yield better prices but with providers regularly disappearing and faltering in their service levels, people are often reluctant to do so. That said, some providers experience few problems so it’s a bit like gambling – research can improve the odds but there’s never a guarantee.

However, even when a provider, price, and payment period are decided upon, the process of paying for an IPTV service can be less than straightforward.

While some providers are happy to accept PayPal, many will only deal in credit cards, bitcoin, or other obscure payment methods. That sets up more barriers to entry that might deter the less determined customer. And, if time is indeed money, fussing around with new payment processors can be pricey, at least to begin with.

Once subscribed though, watching these streams is pretty straightforward. On a base level, people can use a phone, tablet, or set-top device to receive them, using software such as Perfect Player IPTV, for example. Currently available in free (ad-supported) and premium (£2) variants, this software can be set up in a few clicks and will provide a decent user experience, complete with EPG.

Perfect Player IPTV

Those wanting to go down the PC route have more options, but by far the most popular is receiving IPTV via a Kodi setup. For the complete novice it’s not always easy to set up, but some IPTV providers supply their own free addons, which streamline the process massively. These can also be used on Android-based Kodi setups, of course.

Nevertheless, if the EPL blocks the provider, a VPN is still going to be needed to access the IPTV service.

An Android tablet running Kodi

So, even if we ignore the cost of the PC and Internet connection, users could still find themselves paying between £10 and £20 per month for an IPTV service and a decent VPN. While more channels than simply football will be available from most providers, this is getting dangerously close to the £18 Sky are asking for its latest football package.

Equipment: TV, PC, or decent Android set-top box or equivalent
Comms: Fast Internet connection, IPTV subscription, decent VPN provider
Overall performance: High quality, mostly reliable, user-friendly (once set up)
Cost: PC or £30/£50 for set-top box, IPTV subscription £7 to £15pm, £5pm approx for VPN, plus Internet, plus time and patience for obscure payment methods.
Note: There are zero refunds when IPTV providers disappoint or disappear

Premium IPTV – Deluxe setup

Moving up to the top of the range, things get even more costly. Those looking to give themselves the full home entertainment-like experience will often move away from the PC and into the living room in front of the TV, armed with a dedicated set-top box. Weapon of choice: the Mag254.

Like Amazon’s Firestick, PC or Android tablet, the Mag254 is an entirely legal, content-agnostic device. However, enter the credentials provided by many illicit IPTV suppliers and users are presented with a slick Sky-like experience, far removed from anything available elsewhere. The device is operated by remote control and integrates seamlessly with any HDMI-capable TV.

Mag254 IPTV box

Something like this costs around £70 in the UK, plus the cost of a WiFi adaptor on top, if needed. The cost of the IPTV provider needs to be figured in too, plus a VPN subscription if the provider gets blocked by EPL, which is likely. However, in this respect the Mag254 has a problem – it can’t run a VPN natively. This means that if streams get blocked and people need to use a VPN, they’ll need to find an external solution.

Needless to say, this costs more money. People can either do all the necessary research and buy a VPN-capable router/modem that’s also compatible with their provider (this can stretch to a couple of hundred pounds) or they’ll need to invest in a small ‘travel’ router with VPN client features built in.

‘Travel’ router (with tablet running Mobdro for scale)

These devices are available on Amazon for around £25 and sit in between the Mag254 (or indeed any other wireless device) and the user’s own regular router. Once the details of the VPN subscription are entered into the router, all traffic passing through is encrypted and will tunnel through web blocking measures. They usually solve the problem (ymmv) but of course, this is another cost.
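For those who want to confirm the tunnel is actually doing its job, a quick sanity check is to compare the public IP address your traffic appears from with and without the travel router in the chain. Here is a minimal sketch in Python, assuming the requests library and the public ipify service (any reputable "what's my IP" endpoint would do):

import requests

# Ask a public "what's my IP" service which address our traffic appears from.
# Behind a working VPN router this should be the VPN exit node's address,
# not the one assigned by the home ISP.
def apparent_ip() -> str:
    return requests.get("https://api.ipify.org", timeout=10).text

print("Apparent public IP:", apparent_ip())

If the address printed matches the one your ISP assigns, the traffic is not being tunneled and the blocking measures will still bite.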

Equipment: Mag254 or similar, with WiFi
Comms: Fast Internet connection, IPTV subscription, decent VPN provider
Overall performance: High quality, mostly reliable, very user-friendly
Cost: Mag254 around £75 with WiFi, IPTV subscription £7 to £15pm, £5pm for VPN (plus £25 for mini router), plus Internet, plus patience for obscure payment methods.
Note: There are zero refunds when IPTV providers disappoint or disappear

Conclusion

On the whole, people who want a reliable and high-quality Premier League streaming experience cannot get one for free, no matter where they source the content. There are many costs involved, some of which cannot be avoided.

If people aren’t screwing around with annoying and unreliable Kodi streams, they’ll be paying for an IPTV provider, VPN and other equipment. Or, if they want an easy life, they’ll be paying Sky, BT or Virgin Media. That might sound harsh to many pirates but it’s the only truly reliable solution.

However, for those looking for something that’s merely adequate, costs drop significantly. Indeed, if people don’t mind the hassle of wondering whether a sub-VHS quality stream will appear before the big match and stay on throughout, it can all be done on a shoestring.

But perhaps the most important thing to note in respect of costs is the recent changes to the pricing of Premier League content in the UK. As mentioned earlier, Sky now delivers a sports package for £18pm, which sounds like the best deal offered to football fans in recent years. It will be tempting for sure and has all the hallmarks of a price point carefully calculated by Sky.

The big question is whether it will be low enough to tip significant numbers of people away from piracy. The reality is that if another couple of thousand streams get hit hard again this weekend – and the next – and the next – many pirating fans will be watching the season drift away for yet another month, unviewed. That’s got to be frustrating.

The bottom line is that high-quality streaming piracy is becoming a little bit pricey just for football so if it becomes unreliable too – and that’s the Premier League’s goal – the balance of power could tip. At this point, the EPL will need to treat its new customers with respect, in order to keep them feeling both entertained and unexploited.

Fail on those counts – especially the latter – and the cycle will start again.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

‘Pirate’ Site Uses DMCA to Remove Pirated Copy from Github

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-site-uses-dmca-to-remove-pirated-copy-from-github-170902/

Every day, copyright holders send out millions of takedown notices to various services, hoping to protect their works.

Pirate sites are usually at the receiving end of these requests but apparently, they can use it to their advantage as well.

A few days ago the operators of sports streaming site soccerstreams.net informed the developer platform GitHub that a copy of their code was being made available without permission.

The targeted repository was created by “mmstart007,” who allegedly copied the code from Bitbucket without permission. The operator of the streaming site wasn’t happy with this and sent a DMCA takedown notice to GitHub, asking it to take the infringing code offline.

“It’s not an open source work its [a] private project we [are] using on our site and that was a private repo on bitbucket and that guy got unauthorized access to it,” Soccerstreams writes.

The operators stress that the repository “must be taken down as soon as possible,” adding the mandatory ‘good faith’ statement.

“I have a good faith belief that use of the copyrighted materials described above on the infringing web pages is not authorized by the copyright owner, or its agent, or the law. I have taken fair use into consideration,” the complaint reads.

GitHub responded swiftly to the request and pulled the repository offline. Those who try to access it today see the following notification instead.

The people running the Soccer Streams site, which is linked with a similarly named Reddit community, are certainly no strangers to takedown requests themselves. The website and the Reddit community were recently targeted by the Premier League, for example, which accused them of providing links to copyrighted streams.

While soccerstreams.net regularly links to unauthorized streams and is seen as a pirate site by rightsholders, the site doesn’t believe that it’s doing anything wrong.

It has a dedicated DMCA page on its site stating that all streams are submitted by its users and that it cannot be held liable for any infringements.

While it’s a bit unusual for sites and tools with a “pirate” stigma to issue takedown requests, it’s not unique. Just a few weeks ago one of the popular Sickrage forks was removed from GitHub, following a complaint from another fork.

This episode caused a bit of a stir, but the owner of the targeted Sickrage repository eventually managed to get the project restored after a successful counter-notice.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.