Tag Archives: filtering

AWS Glue Now Supports Scala Scripts

Post Syndicated from Mehul Shah original https://aws.amazon.com/blogs/big-data/aws-glue-now-supports-scala-scripts/

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

This post shows how to build an example Scala script that identifies highly negative issues in the timeline. It pulls out issue events in the timeline sample, analyzes their titles using the sentiment prediction functions from the Stanford CoreNLP libraries, and surfaces the most negative issues.

Getting started

Before we start writing scripts, we use AWS Glue crawlers to get a sense of the data—its structure and characteristics. We also set up a development endpoint and attach an Apache Zeppelin notebook, so we can interactively explore the data and author the script.

Crawl the data

The dataset used in this example was downloaded from the GitHub archive website into our sample dataset bucket in Amazon S3, and copied to the following locations:

s3://aws-glue-datasets-<region>/examples/scala-blog/githubarchive/data/

Choose the best folder by replacing <region> with the region that you’re working in, for example, us-east-1. Crawl this folder, and put the results into a database named githubarchive in the AWS Glue Data Catalog, as described in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, and is organized hierarchically (that is, partitioned) by year, month, and day.

When finished, use the AWS Glue console to navigate to the table named data in the githubarchive database. Notice that this data has eight top-level columns, which are common to each event type, and three partition columns that correspond to year, month, and day.

Choose the payload column, and you will notice that it has a complex schema—one that reflects the union of the payloads of event types that appear in the crawled data. Also note that the schema that crawlers generate is a subset of the true schema because they sample only a subset of the data.

Set up the library, development endpoint, and notebook

Next, you need to download and set up the libraries that estimate the sentiment in a snippet of text. The Stanford CoreNLP libraries contain a number of human language processing tools, including sentiment prediction.

Download the Stanford CoreNLP libraries. Unzip the .zip file, and you’ll see a directory full of jar files. For this example, the following jars are required:

  • stanford-corenlp-3.8.0.jar
  • stanford-corenlp-3.8.0-models.jar
  • ejml-0.23.jar

Upload these files to an Amazon S3 path that is accessible to AWS Glue so that it can load these libraries when needed. For this example, they are in s3://glue-sample-other/corenlp/.

Development endpoints are static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints to interactively send commands and explore and analyze your data. These endpoints have the same configuration as that of AWS Glue’s job execution system. So, commands and scripts that work there also work the same when registered and run as jobs in AWS Glue.

To set up an endpoint and a Zeppelin notebook to work with that endpoint, follow the instructions in the AWS Glue Developer Guide. When you are creating an endpoint, be sure to specify the locations of the previously mentioned jars in the Dependent jars path as a comma-separated list. Otherwise, the libraries will not be loaded.

After you set up the notebook server, go to the Zeppelin notebook by choosing Dev Endpoints in the left navigation pane on the AWS Glue console. Choose the endpoint that you created. Next, choose the Notebook Server URL, which takes you to the Zeppelin server. Log in using the notebook user name and password that you specified when creating the notebook. Finally, create a new note to try out this example.

Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and the output for that command. Moreover, each notebook includes a number of interpreters. If you set up the Zeppelin server using the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already connected to your new development endpoint, with pyspark as the default. Therefore, throughout this example, you need to prepend %spark at the top of your paragraphs. In this example, we omit these for brevity.

Working with the data

In this section, we use AWS Glue extensions to Spark to work with the dataset. We look at the actual schema of the data and filter out the interesting event types for our analysis.

Start with some boilerplate code to import libraries that you need:

%spark

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext

Then, create the Spark and AWS Glue contexts needed for working with the data:

@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)

You need the transient decorator on the SparkContext when working in Zeppelin; otherwise, you will run into a serialization error when executing commands.

Dynamic frames

This section shows how to create a dynamic frame that contains the GitHub records in the table that you crawled earlier. A dynamic frame is the basic data structure in AWS Glue scripts. It is like an Apache Spark data frame, except that it is designed and optimized for data cleaning and transformation workloads. A dynamic frame is well-suited for representing semi-structured datasets like the GitHub timeline.

A dynamic frame is a collection of dynamic records. In Spark lingo, it is an RDD (resilient distributed dataset) of DynamicRecords. A dynamic record is a self-describing record. Each record encodes its columns and types, so every record can have a schema that is unique from all others in the dynamic frame. This is convenient and often more efficient for datasets like the GitHub timeline, where payloads can vary drastically from one event type to another.

The following creates a dynamic frame, github_events, from your table:

val github_events = glueContext
                    .getCatalogSource(database = "githubarchive", tableName = "data")
                    .getDynamicFrame()

The getCatalogSource() method returns a DataSource, which represents a particular table in the Data Catalog. The getDynamicFrame() method returns a dynamic frame from the source.

Recall that the crawler created a schema from only a sample of the data. You can scan the entire dataset, count the rows, and print the complete schema as follows:

github_events.count
github_events.printSchema()

The result looks like the following:

The data has 414,826 records. As before, notice that there are eight top-level columns, and three partition columns. If you scroll down, you’ll also notice that the payload is the most complex column.

Run functions and filter records

This section describes how you can create your own functions and invoke them seamlessly to filter records. Unlike filtering with Python lambdas, Scala scripts do not need to convert records from one language representation to another, thereby reducing overhead and running much faster.

Let’s create a function that picks only the IssuesEvents from the GitHub timeline. These events are generated whenever someone posts an issue for a particular repository. Each GitHub event record has a field, “type”, that indicates the kind of event it is. The issueFilter() function returns true for records that are IssuesEvents.

def issueFilter(rec: DynamicRecord): Boolean = { 
    rec.getField("type").exists(_ == "IssuesEvent") 
}

Note that the getField() method returns an Option[Any] type, so you first need to check that it exists before checking the type.

You pass this function to the filter transformation, which applies the function on each record and returns a dynamic frame of those records that pass.

val issue_events =  github_events.filter(issueFilter)

Now, let’s look at the size and schema of issue_events.

issue_events.count
issue_events.printSchema()

It’s much smaller (14,063 records), and the payload schema is less complex, reflecting only the schema for issues. Keep a few essential columns for your analysis, and drop the rest using the ApplyMapping() transform:

val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                 ("actor.login", "string", "actor", "string"), 
                                                 ("repo.name", "string", "repo", "string"),
                                                 ("payload.action", "string", "action", "string"),
                                                 ("payload.issue.title", "string", "title", "string")))
issue_titles.show()

The ApplyMapping() transform is quite handy for renaming columns, casting types, and restructuring records. The preceding code snippet tells the transform to select the fields (or columns) that are enumerated in the left half of the tuples and map them to the fields and types in the right half.

Estimating sentiment using Stanford CoreNLP

To focus on the most pressing issues, you might want to isolate the records with the most negative sentiments. The Stanford CoreNLP libraries are Java-based and offer sentiment-prediction functions. Accessing these functions through Python is possible, but quite cumbersome. It requires creating Python surrogate classes and objects for those found on the Java side. Instead, with Scala support, you can use those classes and objects directly and invoke their methods. Let’s see how.

First, import the libraries needed for the analysis:

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

The Stanford CoreNLP libraries have a main driver that orchestrates all of their analysis. The driver setup is heavyweight, setting up threads and data structures that are shared across analyses. Apache Spark runs on a cluster with a main driver process and a collection of backend executor processes that do most of the heavy sifting of the data.

The Stanford CoreNLP shared objects are not serializable, so they cannot be distributed easily across a cluster. Instead, you need to initialize them once for every backend executor process that might need them. Here is how to accomplish that:

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")

object myNLP {
    lazy val coreNLP = new StanfordCoreNLP(props)
}

The properties tell the libraries which annotators to execute and how many words to process. The preceding code creates an object, myNLP, with a field coreNLP that is lazily evaluated. This field is initialized only when it is needed, and only once. So, when the backend executors start processing the records, each executor initializes the driver for the Stanford CoreNLP libraries only one time.

Next is a function that estimates the sentiment of a text string. It first calls Stanford CoreNLP to annotate the text. Then, it pulls out the sentences and takes the average sentiment across all the sentences. The sentiment is a double, from 0.0 as the most negative to 4.0 as the most positive.

def estimatedSentiment(text: String): Double = {
    if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
    val annotations = myNLP.coreNLP.process(text)
    val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
    sentences.foldLeft(0.0)( (csum, x) => { 
        csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
    }) / sentences.length
}

Now, let’s estimate the sentiment of the issue titles and add that computed field as part of the records. You can accomplish this with the map() method on dynamic frames:

val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
    val mbody = rec.getField("title")
    mbody match {
        case Some(mval: String) => { 
            rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
            rec }
        case _ => rec
    }
})

The map() method applies the user-provided function on every record. The function takes a DynamicRecord as an argument and returns a DynamicRecord. The code above computes the sentiment, adds it in a top-level field, sentiment, to the record, and returns the record.

Count the records with sentiment and show the schema. This takes a few minutes because Spark must initialize the library and run the sentiment analysis, which can be involved.

issue_sentiments.count
issue_sentiments.printSchema()

Notice that all records were processed (14,063), and the sentiment value was added to the schema.

Finally, let’s pick out the titles that have the lowest sentiment (less than 1.5). Count them and print out a sample to see what some of the titles look like.

val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
pressing_issues.count
pressing_issues.show(10)

Next, write them all to a file so that you can handle them later. (You’ll need to replace the output path with your own.)

glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions("""{"path": "s3://<bucket>/out/path/"}"""), 
                              format = "json")
            .writeDynamicFrame(pressing_issues)

Take a look in the output path, and you can see the output files.

Putting it all together

Now, let’s create a job from the preceding interactive session. The following script combines all the commands from earlier. It processes the GitHub archive files and writes out the highly negative issues:

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

object GlueApp {

    object myNLP {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
        props.setProperty("parse.maxlen", "70")

        lazy val coreNLP = new StanfordCoreNLP(props)
    }

    def estimatedSentiment(text: String): Double = {
        if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
        val annotations = myNLP.coreNLP.process(text)
        val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
        sentences.foldLeft(0.0)( (csum, x) => { 
            csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
        }) / sentences.length
    }

    def main(sysArgs: Array[String]) {
        val spark: SparkContext = SparkContext.getOrCreate()
        val glueContext: GlueContext = new GlueContext(spark)

        val dbname = "githubarchive"
        val tblname = "data"
        val outpath = "s3://<bucket>/out/path/"

        val github_events = glueContext
                            .getCatalogSource(database = dbname, tableName = tblname)
                            .getDynamicFrame()

        val issue_events =  github_events.filter((rec: DynamicRecord) => {
            rec.getField("type").exists(_ == "IssuesEvent")
        })

        val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                         ("actor.login", "string", "actor", "string"), 
                                                         ("repo.name", "string", "repo", "string"),
                                                         ("payload.action", "string", "action", "string"),
                                                         ("payload.issue.title", "string", "title", "string")))

        val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
            val mbody = rec.getField("title")
            mbody match {
                case Some(mval: String) => { 
                    rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
                    rec }
                case _ => rec
            }
        })

        val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))

        glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions(s"""{"path": "$outpath"}"""), 
                              format = "json")
                    .writeDynamicFrame(pressing_issues)
    }
}

Notice that the script is enclosed in a top-level object called GlueApp, which serves as the script’s entry point for the job. (You’ll need to replace the output path with your own.) Upload the script to an Amazon S3 location so that AWS Glue can load it when needed.

To create the job, open the AWS Glue console. Choose Jobs in the left navigation pane, and then choose Add job. Create a name for the job, and specify a role with permissions to access the data. Choose An existing script that you provide, and choose Scala as the language.

For the Scala class name, type GlueApp to indicate the script’s entry point. Specify the Amazon S3 location of the script.

Choose Script libraries and job parameters. In the Dependent jars path field, enter the Amazon S3 locations of the Stanford CoreNLP libraries from earlier as a comma-separated list (without spaces). Then choose Next.

No connections are needed for this job, so choose Next again. Review the job properties, and choose Finish. Finally, choose Run job to execute the job.

You can simply edit the script’s input table and output path to run this job on whatever GitHub timeline datasets that you might have.

Conclusion

In this post, we showed how to write AWS Glue ETL scripts in Scala via notebooks and how to run them as jobs. Scala has the advantage that it is the native language for the Spark runtime. With Scala, it is easier to call Scala or Java functions and third-party libraries for analyses. Moreover, data processing is faster in Scala because there’s no need to convert records from one language runtime to another.

You can find more example of Scala scripts in our GitHub examples repository: https://github.com/awslabs/aws-glue-samples. We encourage you to experiment with Scala scripts and let us know about any interesting ETL flows that you want to share.

Happy Glue-ing!

 


Additional Reading

If you found this post useful, be sure to check out Simplify Querying Nested JSON with the AWS Glue Relationalize Transform and Genomic Analysis with Hail on Amazon EMR and Amazon Athena.

 


About the Authors

Mehul Shah is a senior software manager for AWS Glue. His passion is leveraging the cloud to build smarter, more efficient, and easier to use data systems. He has three girls, and, therefore, he has no spare time.

 

 

 

Ben Sowell is a software development engineer at AWS Glue.

 

 

 

 
Vinay Vivili is a software development engineer for AWS Glue.

 

 

 

No Level of Copyright Enforcement Will Ever Be Enough For Big Media

Post Syndicated from Andy original https://torrentfreak.com/no-level-of-copyright-enforcement-will-ever-be-enough-for-big-media-180107/

For more than ten years TorrentFreak has documented a continuous stream of piracy battles so it’s natural that, every now and then, we pause to consider when this war might stop. The answer is always “no time soon” and certainly not in 2018.

When swapping files over the Internet first began it wasn’t a particularly widespread activity. A reasonable amount of content was available, but it was relatively inaccessible. Then peer-to-peer came along and it sparked a revolution.

From the beginning, copyright holders felt that the law would answer their problems, whether that was by suing Napster, Kazaa, or even end users. Some industry players genuinely believed this strategy was just a few steps away from achieving its goals. Just a little bit more pressure and all would be under control.

Then, when the landmark MGM Studios v. Grokster decision was handed down in the studios’ favor during 2005, the excitement online was palpable. As copyright holders rejoiced in this body blow for the pirating masses, file-sharing communities literally shook under the weight of the ruling. For a day, maybe two.

For the majority of file-sharers, the ruling meant absolutely nothing. So what if some company could be held responsible for other people’s infringements? Another will come along, outside of the US if need be, people said. They were right not to be concerned – that’s exactly what happened.

Ever since, this cycle has continued. Eager to stem the tide of content being shared without their permission, rightsholders have advocated stronger anti-piracy enforcement and lobbied for more restrictive interpretations of copyright law. Thus far, however, literally nothing has provided a solution.

One would have thought that given the military-style raid on Kim Dotcom’s Megaupload, a huge void would’ve appeared in the sharing landscape. Instead, the file-locker business took itself apart and reinvented itself in jurisdictions outside the United States. Meanwhile, the BitTorrent scene continued in the background, somewhat obliviously.

With the SOPA debacle still fresh in relatively recent memory, copyright holders are still doggedly pursuing their aims. Site-blocking is rampant, advertisers are being pressured into compliance, and ISPs like Cox Communications now find themselves responsible for the infringements of their users. But has any of this caused any fatal damage to the sharing landscape? Not really.

Instead, we’re seeing a rise in the use of streaming sites, each far more accessible to the newcomer than their predecessors and vastly more difficult for copyright holders to police.

Systems built into Kodi are transforming these platforms into a plug-and-play piracy playground, one in which sites skirt US law and users can consume both at will and in complete privacy. Meanwhile, commercial and unauthorized IPTV offerings are gathering momentum, even as rightsholders try to pull them back.

Faced with problems like these we are now seeing calls for even tougher legislation. While groups like the RIAA dream of filtering the Internet, over in the UK a 2017 consultation had copyright holders excited that end users could be criminalized for simply consuming infringing content, let alone distributing it.

While the introduction of both or either of these measures would cause uproar (and rightly so), history tells us that each would fail in its stated aim of stopping piracy. With that eventuality all but guaranteed, calls for even tougher legislation are being readied for later down the line.

In short, there is no law that can stop piracy and therefore no law that will stop the entertainment industries coming back for harsher measures, pursuing the dream. This much we’ve established from close to two decades of litigation and little to no progress.

But really, is anyone genuinely surprised that they’re still taking this route? Draconian efforts to maintain control over the distribution of content predate the file-sharing wars by a couple of hundred years, at the very least. Why would rightsholders stop now, when the prize is even more valuable?

No one wants a minefield of copyright law. No one wants a restricted Internet. No one wants extended liability for innovators, service providers, or the public. But this is what we’ll get if this problem isn’t solved soon. Something drastic needs to happen, but who will be brave enough to admit it, let alone do something about it?

During a discussion about piracy last year on the BBC, the interviewer challenged a caller who freely admitted to pirating sports content online. The caller’s response was clear:

For far too long, broadcasters and rightsholders have abused their monopoly position, charging ever-increasing amounts for popular content, even while making billions. Piracy is a natural response to that, and effectively a chance for the little guy to get back some control, he argued.

Exactly the same happened in the music market during the late 1990s and 2000s. In response to artificial restriction of the market and the unrealistic hiking of prices, people turned to peer-to-peer networks for their fix. Thanks to this pressure but after years of turmoil, services like Spotify emerged, converting millions of former pirates in the process. Netflix, it appears, is attempting to do the same thing with video.

When people feel that they aren’t getting ripped off and that they have no further use for sub-standard piracy services in the face of stunning legal alternatives, things will change. But be under no illusion, people won’t be bullied there.

If we end up with an Internet stifled in favor of rightsholders, one in which service providers are too scared to innovate, the next generation of consumers will never forget. This will be a major problem for two key reasons. Not only will consumers become enemies but piracy will still exist. We will have come full circle, fueled only by division and hatred.

It’s a natural response to reject monopolistic behavior and it’s a natural response, for most, to be fair when treated with fairness. Destroying freedom is far from fair and will not create a better future – for anyone.

Laws have their place, no sane person will argue against that, but when the entertainment industries are making billions yet still want more, they’ll have to decide whether this will go on forever with building resentment, or if making a bit less profit now makes more sense longer term.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Google Blocks Pirate Search Results Prophylactically

Post Syndicated from Ernesto original https://torrentfreak.com/google-blocks-pirated-search-results-prophylactically-180103/

On an average day, Google processes more than three million takedown notices from copyright holders, and that’s for its search engine alone.

Under the current DMCA legislation, US-based Internet service providers are expected to remove infringing links, if a copyright holder complains.

This process shields these services from direct liability. In recent years there has been a lot of discussion about the effectiveness of the system, but Google has always maintained that it works well.

This was also highlighted by Google’s copyright counsel Caleb Donaldson, in an article he wrote for the American Bar Association’s publication Landslide.

“The DMCA provided Google and other online service providers the legal certainty they needed to grow,” Donaldson writes.

“And the DMCA’s takedown notices help us fight piracy in other ways as well. Indeed, the Web Search notice-and-takedown process provides the cornerstone of Google’s fight against piracy.”

The search engine does indeed go beyond ‘just’ removing links. The takedown notices are also used as a signal to demote domains. Websites for which it receives a lot of takedown notices will be placed lower in search results, for example.

These measures can be expanded and complemented by artificial intelligence in the future, Google’s copyright counsel envisions.

“As we move into a world where artificial intelligence can learn from vast troves of data like these, we will only get better at using the information to better fight against piracy,” Donaldson writes.

Artificial intelligence (AI) is a buzz-term that has a pretty broad meaning nowadays. Donaldson doesn’t go into detail on how AI can fight piracy. It could help to spot erroneous notices, on the one hand, but can also be applied to filter content proactively.

The latter is something Google is slowly opening up to.

Over the past year, we’ve noticed on a few occasions that Google is processing takedown notices for non-indexed links. While we assumed that this was an ‘error’ on the sender’s part, it appears to be a new policy.

“Google has critically expanded notice and takedown in another important way: We accept notices for URLs that are not even in our index in the first place. That way, we can collect information even about pages and domains we have not yet crawled,” Donaldson writes.

In other words, Google blocks URLs before they appear in the search results, as some sort of piracy vaccine.

“We process these URLs as we do the others. Once one of these not-in-index URLs is approved for takedown, we prophylactically block it from appearing in our Search results, and we take all the additional deterrent measures listed above.”

Some submitters are heavily relying on the new feature, Google found. In some cases, the majority of the submitted URLs in a notice are not indexed yet.

The search engine will keep a close eye on these developments. At TorrentFreak, we also found that copyright holders sometimes target links that don’t even exist. Whether Google will also accept these takedown requests in the future, is unknown.

It’s clear that artificial intelligence and proactive filtering are becoming more and more common, but Google says that the company will also keep an eye on possible abuse of the system.

“Google will push back if we suspect a notice is mistaken, fraudulent, or abusive, or if we think fair use or another defense excuses that particular use of copyrighted content,” Donaldson notes.

Artificial intelligence and prophylactic blocking surely add a new dimension to the standard DMCA takedown procedure, but whether it will be enough to convince copyright holders that it works, has yet to be seen.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Our ‘Kodi Box’ Is Legal & Our Users Don’t Break the Law, TickBox Tells Hollywood

Post Syndicated from Andy original https://torrentfreak.com/our-kodi-box-is-legal-our-users-dont-break-the-law-tickbox-tells-hollywood-171229/

Georgia-based TickBox TV is a provider of set-top boxes that allow users to stream all kinds of popular content. Like other similar devices, Tickboxes use the popular Kodi media player alongside instructions how to find and use third-party addons.

Of course, these types of add-ons are considered a thorn in the side of the entertainment industries and as a result, Tickbox found itself on the receiving end of a lawsuit in the United States.

Filed in a California federal court in October, Universal, Columbia Pictures, Disney, 20th Century Fox, Paramount Pictures, Warner Bros, Amazon, and Netflix accused Tickbox of inducing and contributing to copyright infringement.

“TickBox sells ‘TickBox TV,’ a computer hardware device that TickBox urges its customers to use as a tool for the mass infringement of Plaintiffs’ copyrighted motion pictures and television shows,” the complaint reads.

“TickBox promotes the use of TickBox TV for overwhelmingly, if not exclusively, infringing purposes, and that is how its customers use TickBox TV. TickBox advertises TickBox TV as a substitute for authorized and legitimate distribution channels such as cable television or video-on-demand services like Amazon Prime and Netflix.”

The copyright holders reference a TickBox TV video which informs customers how to install ‘themes’, more commonly known as ‘builds’. These ‘builds’ are custom Kodi-setups which contain many popular add-ons that specialize in supplying pirate content. Is that illegal? TickBox TV believes not.

In a response filed yesterday, TickBox underlined its position that its device is not sold with any unauthorized or illegal content and complains that just because users may choose to download and install third-party programs through which they can search for and view unauthorized content, that’s not its fault. It goes on to attack the lawsuit on several fronts.

TickBox argues that plaintiffs’ claims, that TickBox can be held secondarily liable under the theory of contributory infringement or inducement liability as described in the famous Grokster and isoHunt cases, is unlikely to succeed. TickBox says the studios need to show four elements – distribution of a device or product, acts of infringement by users of Tickbox, an object of promoting its use to infringe copyright, and causation.

“Plaintiffs have failed to establish any of these four elements,” TickBox’s lawyers write.

Firstly, TickBox says that while its device can be programmed to infringe, it’s the third party software (the builds/themes containing addons) that do all the dirty work, and TickBox has nothing to do with them.

“The Motion spends a great deal of time describing these third-party ‘Themes’ and how they operate to search for and stream videos. But the ‘Themes’ on which Plaintiffs so heavily focus are not the [TickBox], and they have absolutely nothing to do with Defendant. Rather, they are third-party modifications of the open-source media player software [Kodi] which the Box utilizes,” the response reads.

TickBox says its device is merely a small computer, not unlike a smartphone or tablet. Indeed, when it comes to running the ‘pirate’ builds listed in the lawsuit, a device supplied by one of the plaintiffs can accomplish the same task.

“Plaintiffs have identified certain of these thirdparty ‘builds’ or ‘Themes’ which are available on the internet and which can be downloaded by users to view content streamed by third-party websites; however, this same software can be installed on many different types of devices, even one distributed by affiliates of Plaintiff Amazon Content Services, LLC,” the company adds.

Referencing the Grokster case, TickBox states that particular company was held liable for distributing a device (the Grokster software) “with the object of promoting its use to infringe copyright.” In the isoHunt case, it argues that the provision of torrent files satisfied the first element of inducement liability.

“In contrast, Defendant’s product – the Box – is not software through which users can access unauthorized content, as in Grokster, or even a necessary component of accessing unauthorized content, as in Fung [isoHunt],” TickBox writes.

“Defendant offers a computer, onto which users can voluntarily install legitimate or illegitimate software. The product about which Plaintiffs complain is third-party software which can be downloaded onto a myriad of devices, and which Defendant neither created nor supplies.”

From defending itself, TickBox switches track to highlight weaknesses in the studios’ case against users of its TickBox device. The company states that the plaintiffs have not presented any evidence that buyers of the TickBox streaming unit have actually accessed any copyrighted material.

Interestingly, however, the company also notes that even if people had streamed ‘pirate’ content, that might not constitute infringement.

First up, the company notes that there are no allegations that anyone – from TickBox itself to TickBox device owners – ever violated the plaintiffs’ exclusive right to perform its copyrighted works.

TickBox then further argues that copyright law does not impose liability for viewing streaming content, stating that an infringer is one who violates any of the exclusive rights of the copyright holder, in this case, the right to “perform the copyrighted work publicly.”

“Plaintiffs do not allege that Defendant, Defendant’s product, or the users of Defendant’s product ‘transmit or otherwise communicate a performance’ to the public; instead, Plaintiffs allege that users view streaming material on the Box.

“It is clear precedent [Perfect 10 v Google] in this Circuit that merely viewing copyrighted material online, without downloading, copying, or retransmitting such material, is not actionable.”

Taking this argument to its logical conclusion, TickBox insists that if its users aren’t infringing copyright, it’s impossible to argue that TickBox induced its customers to violate the plaintiffs’ rights. In that respect, plaintiffs’ complaints that TickBox failed to develop “filtering tools” to diminish its customers’ infringing activity are moot, since in TickBox’s eyes no infringement took place.

TickBox also argues that unlike in Grokster, where the defendant profited when users’ accessed infringing content, it does not. And, just to underline the earlier point, it claims that its place in the market is not to compete with entertainment companies, it’s actually to compete with devices such as Amazon’s Firestick – another similar Android-powered device.

Finally, TickBox notes that it has zero connection with any third-party sites that transmit copyrighted works in violation of the plaintiffs’ rights.

“Plaintiff has not alleged any element of contributory infringement vis-à-vis these unknown third-parties. Plaintiff has not alleged that Defendant has distributed any product to those third parties, that Defendant has committed any act which encourages those third parties’ infringement, or that any act of Defendant has, in fact, caused those third parties to infringe,” its response adds.

But even given the above defenses, TickBox says that it “voluntarily took steps” to remove links to the allegedly infringing Kodi builds from its device, following the plaintiffs’ lawsuit. It also claims to have modified its advertising and webpage “to attempt to appease Plaintiffs and resolve their complaint amicably.”

Given the above, TickBox says that the plaintiffs’ application for injunction is both vague and overly broad and would impose “imperssible hardship” on the company by effectively shutting it down while requiring it to “hack into and delete content” which TickBox users may have downloaded to their boxes.

TickBox raises some very interesting points around some obvious weaknesses so it will be intriguing to see how the Court handles its claims and what effect that has on the market for these devices in the US. In particular, the thorny issue of how they are advertised and promoted, which is nearly always the final stumbling block.

A copy of Tickbox’s response is available here (pdf), via Variety

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Might Google Class “Torrent” a Dirty Word? France is About to Find Out

Post Syndicated from Andy original https://torrentfreak.com/might-google-class-torrent-a-dirty-word-france-is-about-to-find-out-171223/

Like most countries, France is struggling to find ways to stop online piracy running rampant. A number of options have been tested thus far, with varying results.

One of the more interesting cases has been running since 2015, when music industry group SNEP took Google and Microsoft to court demanding automated filtering of ‘pirate’ search results featuring three local artists.

Before the High Court of Paris, SNEP argued that searches for the artists’ names plus the word “torrent” returned mainly infringing results on Google and Bing. Filtering out results with both sets of terms would reduce the impact of people finding pirate content through search, they said.

While SNEP claimed that its request was in line with Article L336-2 of France’s intellectual property code, which allows for “all appropriate measures” to prevent infringement, both Google and Microsoft fought back, arguing that such filtering would be disproportionate and could restrict freedom of expression.

The Court eventually sided with the search engines, noting that torrent is a common noun that refers to a neutral communication protocol.

“The requested measures are thus tantamount to general monitoring and may block access to lawful websites,” the High Court said.

Despite being told that its demands were too broad, SNEP decided to appeal. The case was heard in November where concerns were expressed over potential false positives.

Since SNEP even wants sites with “torrent” in their URL filtered out via a “fully automated procedures that do not require human intervention”, this very site – TorrentFreak.com – could be sucked in. To counter that eventuality, SNEP proposed some kind of whitelist, NextInpact reports.

With no real consensus on how to move forward, the parties were advised to enter discussions on how to get closer to the aim of reducing piracy but without causing collateral damage. Last week the parties agreed to enter negotiations so the details will now have to be hammered out between their respective law firms. Failing that, they will face a ruling from the court.

If this last scenario plays out, the situation appears to favor the search engines, who have a High Court ruling in their favor and already offer comprehensive takedown tools for copyright holders to combat the exploitation of their content online.

Meanwhile, other elements of the French recording industry have booked a notable success against several pirate sites.

SCPP, which represents Warner, Universal, Sony and thousands of others, went to court in February this year demanding that local ISPs Bouygues, Free, Orange, SFR and Numéricable prevent their subscribers from accessing ExtraTorrent, isoHunt, Torrent9 and Cpasbien.

Like SNEP in the filtering case, SCPP also cited Article L336-2 of France’s intellectual property code, demanding that the sites plus their variants, mirrors and proxies should be blocked by the ISPs so that their subscribers can no longer gain access.

This week the Paris Court of First Instance sided with the industry group, ordering the ISPs to block the sites. The service providers were also told to pick up the bill for costs.

These latest cases are yet more examples of France’s determination to crack down on piracy.

Early December it was revealed that since its inception, nine million piracy warnings have been sent to citizens via the Hadopi anti-piracy agency. Since the launch of its graduated response regime in 2010, more than 2,000 cases have been referred to prosecutors, resulting in 189 criminal convictions.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Google & Facebook Excluded From Aussie Safe Harbor Copyright Amendments

Post Syndicated from Andy original https://torrentfreak.com/google-facebook-excluded-from-aussie-safe-harbor-copyright-amendments-171205/

Due to a supposed drafting error in Australia’s implementation of the Australia – US Free Trade Agreement (AUSFTA), copyright safe harbor provisions currently only apply to commercial Internet service providers.

This means that while local ISPs such as Telstra receive protection from copyright infringement complaints, services such as Google, Facebook and YouTube face legal uncertainty.

Proposed amendments to the Copyright Act earlier this year would’ve seen enhanced safe harbor protections for such platforms but they were withdrawn at the eleventh hour so that the government could consider “further feedback” from interested parties.

Shortly after the government embarked on a detailed consultation with entertainment industry groups. They accuse platforms like YouTube of exploiting safe harbor provisions in the US and Europe, which forces copyright holders into an expensive battle to have infringing content taken down. They do not want that in Australia and at least for now, they appear to have achieved their aims.

According to a report from AFR (paywall), the Australian government is set to introduce new legislation Wednesday which will expand safe harbors for some organizations but will exclude companies such as Google, Facebook, and similar platforms.

Communications Minister Mitch Fifield confirmed the exclusions while noting that additional safeguards will be available to institutions, libraries, and organizations in the disability, archive and culture sectors.

“The measures in the bill will ensure these sectors are protected from legal liability where they can demonstrate that they have taken reasonable steps to deal with copyright infringement by users of their online platforms,” Senator Fifield told AFR.

“Extending the safe harbor scheme in this way will provide greater certainty to institutions in these sectors and enhance their ability to provide more innovative and creative services for all Australians.”

According to the Senator, the government will continue its work with stakeholders to further reform safe harbor provisions, before applying them to other service providers.

The news that Google, Facebook, and similar platforms are to be denied access to the new safe harbor rules will be seen as a victory for rightsholders. They’re desperately trying to tighten up legislation in other regions where such safeguards are already in place, arguing that platforms utilizing user-generated content for profit should obtain appropriate licensing first.

This so-called ‘Value Gap’ (1,2,3) and associated proactive filtering proposals are among the hottest copyright topics right now, generating intense debate across Europe and the United States.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Google Says It Can’t Filter Pirated Content Proactively

Post Syndicated from Ernesto original https://torrentfreak.com/google-says-it-cant-filter-pirated-content-proactively-171202/

Over the past few years the entertainment industries have repeatedly asked Google to step up its game when it comes to its anti-piracy efforts.

These calls haven’t fallen on deaf ears and Google has steadily implemented various anti-piracy measures in response.

Still, that is not enough. At least, according to several prominent music industry groups who are advocating a ‘Take Down, Stay Down’ approach.

Currently, Google mostly responds to takedown requests that are sent in by copyright holders. The search engine deletes the infringing results and demotes the domains of frequent infringers. However, the same content often reappears on other sites, or in another location on the same site.

Earlier this year a group of prominent music groups stated that the present situation forces rightsholders to participate in a never-ending game of whack-a-mole which doesn’t fix the underlying problem. Instead, it results in a “frustrating, burdensome and ultimately ineffective takedown process.”

While Google understands the rationale behind the complaints, the company doesn’t believe in a more proactive solution. This was reiterated by Matt Brittin, President of EMEA Business & Operations at Google, during the Royal Television Society Event in London this week.

“The music industry has been quite tough with us on this. They’d like us proactively to know this stuff. It’s just not possible in this industry,” Brittin said.

That doesn’t mean that Google is sitting still. Brittin stresses that the company has invested millions in anti-piracy tools. That said, there can always be room for improvement.

“What we’ve tried to do is build tools that allow them to do that at scale easily and that work all together … I’m sure there are places where we could do better. There are teams and millions of dollars invested in this.

“Combatting bad acts and piracy is obviously very important to us,” Brittin added.

While Google sees no room for proactive filtering in search results, music industry insiders believe it’s possible.

Ideally, they want some type of automated algorithm or technology that removes infringing results without a targeted DMCA notice. This could be similar to YouTube’s Content-ID system, or the hash filtering mechanisms Google Drive employs, for example.

For now, however, there’s no sign that Google will go beyond the current takedown notice approach, at least for search. A ‘Take Down, Stay Down’ mechanism wouldn’t “understand” when content is authorized or not, the company previously noted.

And so, the status quo is likely to remain, at least for now.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

[$] A thorough introduction to eBPF

Post Syndicated from corbet original https://lwn.net/Articles/740157/rss

In his linux.conf.au
2017 talk [YouTube]
on the eBPF in-kernel virtual machine, Brendan Gregg
proclaimed that “super powers have finally come to Linux”. Getting
eBPF to that point has been a long road of evolution and design. While
eBPF was originally used for network packet filtering, it turns out
that running user-space code inside a sanity-checking virtual machine
is a powerful tool for kernel developers and production engineers.

Over time, new eBPF users have appeared to take advantage of its
performance and convenience. This article explains how eBPF evolved
how it works, and how it is used in the kernel.

European Commission Steps Up Fight Against Online Piracy

Post Syndicated from Ernesto original https://torrentfreak.com/european-commission-steps-up-fight-against-online-piracy-171130/

The European Commission has had copyright issues at the top of its agenda for a while, resulting in several controversial proposals.

This week it presented a series of new measures to ensure that copyright holders are well protected, targeting both online piracy and counterfeit goods.

“Today we boost our collective ability to catch the ‘big fish’ behind fake goods and pirated content which harm our companies and our jobs – as well as our health and safety in areas such as medicines or toys,” Commissioner Elżbieta Bieńkowska announced.

The Commission notes that it’s stepping up the fight against counterfeiting and piracy. However, many of the proposals are not entirely new for those who follow anti-piracy issues around the globe.

One of the main goals is to focus on the people who facilitate copyright infringement, such as pirate site operators, and try to cut their revenue streams.

“The Commission seeks to deprive commercial-scale IP infringers of the revenue flows that make their criminal activity lucrative – this is the so-called ‘follow the money’ approach which focuses on the ‘big fish’ rather than individuals,” they write.

Instead of using legislation to reach this goal, the Commission prefers to continue its support for voluntary agreements between copyright holders and third-party services. This includes deals with advertising and payment services to cut their ties with pirate sites.

“Such agreements can lead to faster action against counterfeiting and piracy than court actions,” the Commission writes.

Another tool to fight piracy appears on the agenda for the first time. The European Commission notes that it will also support the quest for new anti-piracy initiatives, including the use of blockchain technology.

“Supporting industry-led initiatives to combat IP infringements, including work on Memoranda of Understanding and exploring the potential of new technologies such as blockchain to combat IP infringements in supply chains,” the suggestion reads.

No concrete examples were given but earlier this week, European Parliament member Brando Benifei wrote an article on the issue in Euractiv.

Benifei mentions that blockchain technology can help independent artists collect royalty payments without the need for middlemen. In a similar vein, blockchains can also be used to track the unauthorized distribution of works.

In addition to broadening the anti-piracy horizon, the European Commission also released a new guidance on how the current IPR Enforcement Directive (IPRED) should be interpreted, taking into account various recent developments, including landmark EU Court of Justice rulings.

The guidance explains how and when it’s appropriate to issue website blocking orders, for example. In general, blocking injunctions are warranted when they are proportional and aimed at preventing concrete infringements.

The comprehensive guidance also covers the issue of filtering. Interestingly, the Commission clarifies that third-party services can’t be required to “install and operate excessively broad, unspecific and expensive filtering systems.”

This appears to run counter to the mandatory piracy filters that were suggested as part of the copyright reform proposal.

However, the Commission notes that in some specific cases, hosting providers (e.g. YouTube) can be ordered to monitor uploads. This is in line with a recent communication which recommended that online services should implement measures to automatically detect and remove suspected illegal content.

While the new plans continue down the path of stronger copyright protections, not all rightsholders are happy. IFPI is glad that the main problems are highlighted, but would have liked to have seen more concrete plans.

“We are disappointed that despite the European Commission recognizing the need to modernize IPRED and years of evidence gathering, today’s result is merely guidance to EU Member State governments. Soft law does not give right holders the tools they need to take effective action against pirate services,” IFPI writes.

On the other side of the divide, opposition to the previously announced EU copyright reform plans continues as well. Earlier today a group of over 80 organizations urged EU member states to speak out against several controversial copyright proposals, including the upload filter.

“The signatories warn the Member states that the discussion around the Copyright Directive are on the verge of causing irreparable damage to our fundamental rights and freedoms, our economy and competitiveness, our education and research, our innovation and competition, our creativity and our culture,” they say.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

Presenting AWS IoT Analytics: Delivering IoT Analytics at Scale and Faster than Ever Before

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/launch-presenting-aws-iot-analytics/

One of the technology areas I thoroughly enjoy is the Internet of Things (IoT). Even as a child I used to infuriate my parents by taking apart the toys they would purchase for me to see how they worked and if I could somehow put them back together. It seems somehow I was destined to end up the tough and ever-changing world of technology. Therefore, it’s no wonder that I am really enjoying learning and tinkering with IoT devices and technologies. It combines my love of development and software engineering with my curiosity around circuits, controllers, and other facets of the electrical engineering discipline; even though an electrical engineer I can not claim to be.

Despite all of the information that is collected by the deployment of IoT devices and solutions, I honestly never really thought about the need to analyze, search, and process this data until I came up against a scenario where it became of the utmost importance to be able to search and query through loads of sensory data for an anomaly occurrence. Of course, I understood the importance of analytics for businesses to make accurate decisions and predictions to drive the organization’s direction. But it didn’t occur to me initially, how important it was to make analytics an integral part of my IoT solutions. Well, I learned my lesson just in time because this re:Invent a service is launching to make it easier for anyone to process and analyze IoT messages and device data.

 

Hello, AWS IoT Analytics!  AWS IoT Analytics is a fully managed service of AWS IoT that provides advanced data analysis of data collected from your IoT devices.  With the AWS IoT Analytics service, you can process messages, gather and store large amounts of device data, as well as, query your data. Also, the new AWS IoT Analytics service feature integrates with Amazon Quicksight for visualization of your data and brings the power of machine learning through integration with Jupyter Notebooks.

Benefits of AWS IoT Analytics

  • Helps with predictive analysis of data by providing access to pre-built analytical functions
  • Provides ability to visualize analytical output from service
  • Provides tools to clean up data
  • Can help identify patterns in the gathered data

Be In the Know: IoT Analytics Concepts

  • Channel: archives the raw, unprocessed messages and collects data from MQTT topics.
  • Pipeline: consumes messages from channels and allows message processing.
    • Activities: perform transformations on your messages including filtering attributes and invoking lambda functions advanced processing.
  • Data Store: Used as a queryable repository for processed messages. Provide ability to have multiple datastores for messages coming from different devices or locations or filtered by message attributes.
  • Data Set: Data retrieval view from a data store, can be generated by a recurring schedule. 

Getting Started with AWS IoT Analytics

First, I’ll create a channel to receive incoming messages.  This channel can be used to ingest data sent to the channel via MQTT or messages directed from the Rules Engine. To create a channel, I’ll select the Channels menu option and then click the Create a channel button.

I’ll name my channel, TaraIoTAnalyticsID and give the Channel a MQTT topic filter of Temperature. To complete the creation of my channel, I will click the Create Channel button.

Now that I have my Channel created, I need to create a Data Store to receive and store the messages received on the Channel from my IoT device. Remember you can set up multiple Data Stores for more complex solution needs, but I’ll just create one Data Store for my example. I’ll select Data Stores from menu panel and click Create a data store.

 

I’ll name my Data Store, TaraDataStoreID, and once I click the Create the data store button and I would have successfully set up a Data Store to house messages coming from my Channel.

Now that I have my Channel and my Data Store, I will need to connect the two using a Pipeline. I’ll create a simple pipeline that just connects my Channel and Data Store, but you can create a more robust pipeline to process and filter messages by adding Pipeline activities like a Lambda activity.

To create a pipeline, I’ll select the Pipelines menu option and then click the Create a pipeline button.

I will not add an Attribute for this pipeline. So I will click Next button.

As we discussed there are additional pipeline activities that I can add to my pipeline for the processing and transformation of messages but I will keep my first pipeline simple and hit the Next button.

The final step in creating my pipeline is for me to select my previously created Data Store and click Create Pipeline.

All that is left for me to take advantage of the AWS IoT Analytics service is to create an IoT rule that sends data to an AWS IoT Analytics channel.  Wow, that was a super easy process to set up analytics for IoT devices.

If I wanted to create a Data Set as a result of queries run against my data for visualization with Amazon Quicksight or integrate with Jupyter Notebooks to perform more advanced analytical functions, I can choose the Analyze menu option to bring up the screens to create data sets and access the Juypter Notebook instances.

Summary

As you can see, it was a very simple process to set up the advanced data analysis for AWS IoT. With AWS IoT Analytics, you have the ability to collect, visualize, process, query and store large amounts of data generated from your AWS IoT connected device. Additionally, you can access the AWS IoT Analytics service in a myriad of different ways; the AWS Command Line Interface (AWS CLI), the AWS IoT API, language-specific AWS SDKs, and AWS IoT Device SDKs.

AWS IoT Analytics is available today for you to dig into the analysis of your IoT data. To learn more about AWS IoT and AWS IoT Analytics go to the AWS IoT Analytics product page and/or the AWS IoT documentation.

Tara

Amazon MQ – Managed Message Broker Service for ActiveMQ

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-mq-managed-message-broker-service-for-activemq/

Messaging holds the parts of a distributed application together, while also adding resiliency and enabling the implementation of highly scalable architectures. For example, earlier this year, Amazon Simple Queue Service (SQS) and Amazon Simple Notification Service (SNS) supported the processing of customer orders on Prime Day, collectively processing 40 billion messages at a rate of 10 million per second, with no customer-visible issues.

SQS and SNS have been used extensively for applications that were born in the cloud. However, many of our larger customers are already making use of open-sourced or commercially-licensed message brokers. Their applications are mission-critical, and so is the messaging that powers them. Our customers describe the setup and on-going maintenance of their messaging infrastructure as “painful” and report that they spend at least 10 staff-hours per week on this chore.

New Amazon MQ
Today we are launching Amazon MQ – a managed message broker service for Apache ActiveMQ that lets you get started in minutes with just three clicks! As you may know, ActiveMQ is a popular open-source message broker that is fast & feature-rich. It offers queues and topics, durable and non-durable subscriptions, push-based and poll-based messaging, and filtering.

As a managed service, Amazon MQ takes care of the administration and maintenance of ActiveMQ. This includes responsibility for broker provisioning, patching, failure detection & recovery for high availability, and message durability. With Amazon MQ, you get direct access to the ActiveMQ console and industry standard APIs and protocols for messaging, including JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. This allows you to move from any message broker that uses these standards to Amazon MQ–along with the supported applications–without rewriting code.

You can create a single-instance Amazon MQ broker for development and testing, or an active/standby pair that spans AZs, with quick, automatic failover. Either way, you get data replication across AZs and a pay-as-you-go model for the broker instance and message storage.

Amazon MQ is a full-fledged part of the AWS family, including the use of AWS Identity and Access Management (IAM) for authentication and authorization to use the service API. You can use Amazon CloudWatch metrics to keep a watchful eye metrics such as queue depth and initiate Auto Scaling of your consumer fleet as needed.

Launching an Amazon MQ Broker
To get started, I open up the Amazon MQ Console, select the desired AWS Region, enter a name for my broker, and click on Next step:

Then I choose the instance type, indicate that I want to create a standby , and click on Create broker (I can select a VPC and fine-tune other settings in the Advanced settings section):

My broker will be created and ready to use in 5-10 minutes:

The URLs and endpoints that I use to access my broker are all available at a click:

I can access the ActiveMQ Web Console at the link provided:

The broker publishes instance, topic, and queue metrics to CloudWatch. Here are the instance metrics:

Available Now
Amazon MQ is available now and you can start using it today in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), and Asia Pacific (Sydney) Regions.

The AWS Free Tier lets you use a single-AZ micro instance for up to 750 hours and to store up to 1 gigabyte each month, for one year. After that, billing is based on instance-hours and message storage, plus charges Internet data transfer if the broker is accessed from outside of AWS.

Jeff;

Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production

Post Syndicated from Rafi Ton original https://aws.amazon.com/blogs/big-data/using-amazon-redshift-spectrum-amazon-athena-and-aws-glue-with-node-js-in-production/

This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”

At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.

We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.

Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.

In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.

Amazon Redshift as our foundation

The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.

The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.

However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. In our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident, and we saw our CPU utilization grow to 90%.

Why we extended Amazon Redshift to Redshift Spectrum

Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.

Seamless scalability, high performance, and unlimited concurrency

Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.

Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.

Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.

Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.

Keeping it SQL

Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.

An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.

Leveraging Parquet for higher performance

Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.

Lower cost

From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.

What we learned about Amazon Redshift vs. Redshift Spectrum performance

When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:

  1. What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
  2. Does the data format impact performance?

During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:

  • Amazon Redshift cluster with 28 DC1.large nodes
  • Redshift Spectrum using CSV/GZIP
  • Redshift Spectrum using Parquet

We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.

Simple query

First, we tested a simple query aggregating billing data across a month:

SELECT 
  user_id, 
  count(*) AS impressions, 
  SUM(billing)::decimal /1000000 AS billing 
FROM <table_name> 
WHERE 
  date >= '2017-08-01' AND 
  date <= '2017-08-31'  
GROUP BY 
  user_id;

We ran the same query seven times and measured the response times (red marking the longest time and green the shortest time):

Execution Time (seconds)
  Amazon Redshift Redshift Spectrum
CSV
Redshift Spectrum Parquet
Run #1 39.65 45.11 11.92
Run #2 15.26 43.13 12.05
Run #3 15.27 46.47 13.38
Run #4 21.22 51.02 12.74
Run #5 17.27 43.35 11.76
Run #6 16.67 44.23 13.67
Run #7 25.37 40.39 12.75
Average 21.53  44.82 12.61

For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.

What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.

Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:

Data Scanned (GB)
CSV (Gzip) 135.49
Parquet 2.83

Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.

Complex query

Next, we compared the same three configurations with a complex query.

Execution Time (seconds)
  Amazon Redshift Redshift Spectrum CSV Redshift Spectrum Parquet
Run #1 329.80 84.20 42.40
Run #2 167.60 65.30 35.10
Run #3 165.20 62.20 23.90
Run #4 273.90 74.90 55.90
Run #5 167.70 69.00 58.40
Average 220.84 71.12 43.14

This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!

Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.

Optimizing the data structure for different workloads

Because the cost of S3 is relatively inexpensive and we pay only for the data scanned by each query, we believe that it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3. It all depends on how we partition the data and update the table partitions.

Data permutations

For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running the query on the table with something as follows:

SELECT 
  user, 
  COUNT(*) 
FROM 
  events_table 
WHERE 
  ts BETWEEN ‘2017-08-01 14:00:00’ AND ‘2017-08-01 14:00:59’ 
GROUP BY 
  user;

(Assuming ‘ts’ is your column storing the time stamp for each event.)

With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.

Creating Parquet data efficiently

On the average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic copy command from S3 to Amazon Redshift.

Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.

This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:

  • Collect the event data from the instances.
  • Save the data in Parquet format.
  • Partition the data effectively.

To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:

  1. Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
  2. Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue.
  3. Add the Parquet data to S3 by updating the table partitions.

With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.

Data validation

To store our click data in a table, we considered the following SQL create table command:

create external TABLE spectrum.blog_clicks (
    user_id varchar(50),
    campaign_id varchar(50),
    os varchar(50),
    ua varchar(255),
    ts bigint,
    billing float
)
partitioned by (date date, hour smallint)  
stored as parquet
location 's3://nuviad-temp/blog/clicks/';

The above statement defines a new external table (all Redshift Spectrum tables are external tables) with a few attributes. We stored ‘ts’ as a Unix time stamp and not as Timestamp, and billing data is stored as float and not decimal (more on that later). We also said that the data is partitioned by date and hour, and then stored as Parquet on S3.

First, we need to get the table definitions. This can be achieved by running the following query:

SELECT 
  * 
FROM 
  svv_external_columns 
WHERE 
  tablename = 'blog_clicks';

This query lists all the columns in the table with their respective definitions:

schemaname tablename columnname external_type columnnum part_key
spectrum blog_clicks user_id varchar(50) 1 0
spectrum blog_clicks campaign_id varchar(50) 2 0
spectrum blog_clicks os varchar(50) 3 0
spectrum blog_clicks ua varchar(255) 4 0
spectrum blog_clicks ts bigint 5 0
spectrum blog_clicks billing double 6 0
spectrum blog_clicks date date 7 1
spectrum blog_clicks hour smallint 8 2

Now we can use this data to create a validation schema for our data:

const rtb_request_schema = {
    "name": "clicks",
    "items": {
        "user_id": {
            "type": "string",
            "max_length": 100
        },
        "campaign_id": {
            "type": "string",
            "max_length": 50
        },
        "os": {
            "type": "string",
            "max_length": 50            
        },
        "ua": {
            "type": "string",
            "max_length": 255            
        },
        "ts": {
            "type": "integer",
            "min_value": 0,
            "max_value": 9999999999999
        },
        "billing": {
            "type": "float",
            "min_value": 0,
            "max_value": 9999999999999
        }
    }
};

Next, we create a function that uses this schema to validate data:

function valueIsValid(value, item_schema) {
    if (schema.type == 'string') {
        return (typeof value == 'string' && value.length <= schema.max_length);
    }
    else if (schema.type == 'integer') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'float' || schema.type == 'double') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'boolean') {
        return typeof value == 'boolean';
    }
    else if (schema.type == 'timestamp') {
        return (new Date(value)).getTime() > 0;
    }
    else {
        return true;
    }
}

Near real-time data loading with Kinesis Firehose

On Kinesis Firehose, we created a new delivery stream to handle the events as follows:

Delivery stream name: events
Source: Direct PUT
S3 bucket: nuviad-events
S3 prefix: rtb/
IAM role: firehose_delivery_role_1
Data transformation: Disabled
Source record backup: Disabled
S3 buffer size (MB): 100
S3 buffer interval (sec): 60
S3 Compression: GZIP
S3 Encryption: No Encryption
Status: ACTIVE
Error logging: Enabled

This delivery stream aggregates event data every minute, or up to 100 MB, and writes the data to an S3 bucket as a CSV/GZIP compressed file. Next, after we have the data validated, we can safely send it to our Kinesis Firehose API:

if (validated) {
    let itemString = item.join('|')+'\n'; //Sending csv delimited by pipe and adding new line

    let params = {
        DeliveryStreamName: 'events',
        Record: {
            Data: itemString
        }
    };

    firehose.putRecord(params, function(err, data) {
        if (err) {
            console.error(err, err.stack);        
        }
        else {
            // Continue to your next step 
        }
    });
}

Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.

Automating data distribution using AWS Lambda

We created a simple Lambda function triggered by an S3 put event that copies the file to a different location (or locations), while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose are structured in a pre-defined hierarchy, such as:

S3://your-bucket/your-prefix/2017/08/01/20/events-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz

All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):

/*
	object key structure in the event object:
your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz
	*/

let key_parts = event.Records[0].s3.object.key.split('/'); 

let event_type = key_parts[0];
let date = key_parts[1] + '-' + key_parts[2] + '-' + key_parts[3];
let hour = key_parts[4];
if (hour.indexOf('0') == 0) {
 		hour = parseInt(hour, 10) + '';
}
    
let parts1 = key_parts[5].split('-');
let minute = parts1[7];
if (minute.indexOf('0') == 0) {
        minute = parseInt(minute, 10) + '';
}

Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:

    copyObjectToHourlyFolder(event, date, hour, minute)
        .then(copyObjectToMinuteFolder.bind(null, event, date, hour, minute))
        .then(addPartitionToSpectrum.bind(null, event, date, hour, minute))
        .then(deleteOldMinuteObjects.bind(null, event))
        .then(deleteStreamObject.bind(null, event))        
        .then(result => {
            callback(null, { message: 'done' });            
        })
        .catch(err => {
            console.error(err);
            callback(null, { message: err });            
        }); 

Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.

Because we partition the data by date and hour, we created a new partition on the Redshift Spectrum table if the processed minute is the first minute in the hour (that is, minute 0). We ran the following:

ALTER TABLE 
  spectrum.events 
ADD partition
  (date='2017-08-01', hour=0) 
  LOCATION 's3://nuviad-temp/events/2017-08-01/0/';

After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.

Migrating CSV to Parquet using AWS Glue and Amazon EMR

The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).

Creating AWS Glue jobs

What this simple AWS Glue script does:

  • Gets parameters for the job, date, and hour to be processed
  • Creates a Spark EMR context allowing us to run Spark code
  • Reads CSV data into a DataFrame
  • Writes the data as Parquet to the destination S3 bucket
  • Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table
import sys
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','day_partition_key', 'hour_partition_key', 'day_partition_value', 'hour_partition_value' ])

#day_partition_key = "partition_0"
#hour_partition_key = "partition_1"
#day_partition_value = "2017-08-01"
#hour_partition_value = "0"

day_partition_key = args['day_partition_key']
hour_partition_key = args['hour_partition_key']
day_partition_value = args['day_partition_value']
hour_partition_value = args['hour_partition_value']

print("Running for " + day_partition_value + "/" + hour_partition_value)

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

df = spark.read.option("delimiter","|").csv("s3://nuviad-temp/events/"+day_partition_value+"/"+hour_partition_value)
df.registerTempTable("data")

df1 = spark.sql("select _c0 as user_id, _c1 as campaign_id, _c2 as os, _c3 as ua, cast(_c4 as bigint) as ts, cast(_c5 as double) as billing from data")

df1.repartition(1).write.mode("overwrite").parquet("s3://nuviad-temp/parquet/"+day_partition_value+"/hour="+hour_partition_value)

client = boto3.client('athena', region_name='us-east-1')

response = client.start_query_execution(
    QueryString='alter table parquet_events add if not exists partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ')  location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

response = client.start_query_execution(
    QueryString='alter table parquet_events partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ') set location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

job.commit()

Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.

Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:

S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq

We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.

Creating a Lambda function to trigger conversion

Next, we created a simple Lambda function to trigger the AWS Glue script hourly using a simple Python code:

import boto3
import json
from datetime import datetime, timedelta
 
client = boto3.client('glue')
 
def lambda_handler(event, context):
    last_hour_date_time = datetime.now() - timedelta(hours = 1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d") 
    hour_partition_value = last_hour_date_time.strftime("%-H") 
    response = client.start_job_run(
    JobName='convertEventsParquetHourly',
    Arguments={
         '--day_partition_key': 'date',
         '--hour_partition_key': 'hour',
         '--day_partition_value': day_partition_value,
         '--hour_partition_value': hour_partition_value
         }
    )

Using Amazon CloudWatch Events, we trigger this function hourly. This function triggers an AWS Glue job named ‘convertEventsParquetHourly’ and runs it for the previous hour, passing job names and values of the partitions to process to AWS Glue.

Redshift Spectrum and Node.js

Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.

Node.js and Parquet

The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.

One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.

Timestamp data type

According to the Parquet documentation, Timestamp data are stored in Parquet as 64-bit integers. However, JavaScript does not support 64-bit integers, because the native number type is a 64-bit double, giving only 53 bits of integer range.

The result is that you cannot store Timestamp correctly in Parquet using Node.js. The solution is to store Timestamp as string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.

Lessons learned

You can benefit from our trial-and-error experience.

Lesson #1: Data validation is critical

As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.

Lesson #2: Structure and partition data effectively

One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.

Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.

Storing data in the right format

Use Parquet whenever you can. The benefits of Parquet are substantial. Faster performance, less data to scan, and much more efficient columnar format. However, it is not supported out-of-the-box by Kinesis Firehose, so you need to implement your own ETL. AWS Glue is a great option.

Creating small tables for frequent tasks

When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.

Lesson #3: Combine Athena and Redshift Spectrum for optimal performance

Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.

Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.

Lesson #4: Sort your Parquet data within the partition

We achieved another performance improvement by sorting data within the partition using sortWithinPartitions(sort_field). For example:

df.repartition(1).sortWithinPartitions("campaign_id")…

Conclusion

We were extremely pleased with using Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift to take advantage of scalability, performance, and cost with Redshift Spectrum.

Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.


About the Author

With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting edge products and services, in the real world generating real money. Being an experienced entrepreneur, Rafi believes in practical-programming and fast adaptation of new technologies to achieve a significant market advantage.

 

 

Serverless Automated Cost Controls, Part1

Post Syndicated from Shankar Ramachandran original https://aws.amazon.com/blogs/compute/serverless-automated-cost-controls-part1/

This post courtesy of Shankar Ramachandran, Pubali Sen, and George Mao

In line with AWS’s continual efforts to reduce costs for customers, this series focuses on how customers can build serverless automated cost controls. This post provides an architecture blueprint and a sample implementation to prevent budget overruns.

This solution uses the following AWS products:

  • AWS Budgets – An AWS Cost Management tool that helps customers define and track budgets for AWS costs, and forecast for up to three months.
  • Amazon SNS – An AWS service that makes it easy to set up, operate, and send notifications from the cloud.
  • AWS Lambda – An AWS service that lets you run code without provisioning or managing servers.

You can fine-tune a budget for various parameters, for example filtering by service or tag. The Budgets tool lets you post notifications on an SNS topic. A Lambda function that subscribes to the SNS topic can act on the notification. Any programmatically implementable action can be taken.

The diagram below describes the architecture blueprint.

In this post, we describe how to use this blueprint with AWS Step Functions and IAM to effectively revoke the ability of a user to start new Amazon EC2 instances, after a budget amount is exceeded.

Freedom with guardrails

AWS lets you quickly spin up resources as you need them, deploying hundreds or even thousands of servers in minutes. This means you can quickly develop and roll out new applications. Teams can experiment and innovate more quickly and frequently. If an experiment fails, you can always de-provision those servers without risk.

This improved agility also brings in the need for effective cost controls. Your Finance and Accounting department must budget, monitor, and control the AWS spend. For example, this could be a budget per project. Further, Finance and Accounting must take appropriate actions if the budget for the project has been exceeded, for example. Call it “freedom with guardrails” – where Finance wants to give developers freedom, but with financial constraints.

Architecture

This section describes how to use the blueprint introduced earlier to implement a “freedom with guardrails” solution.

  1. The budget for “Project Beta” is set up in Budgets. In this example, we focus on EC2 usage and identify the instances that belong to this project by filtering on the tag Project with the value Beta. For more information, see Creating a Budget.
  2. The budget configuration also includes settings to send a notification on an SNS topic when the usage exceeds 100% of the budgeted amount. For more information, see Creating an Amazon SNS Topic for Budget Notifications.
  3. The master Lambda function receives the SNS notification.
  4. It triggers execution of a Step Functions state machine with the parameters for completing the configured action.
  5. The action Lambda function is triggered as a task in the state machine. The function interacts with IAM to effectively remove the user’s permissions to create an EC2 instance.

This decoupled modular design allows for extensibility.  New actions (serially or in parallel) can be added by simply adding new steps.

Implementing the solution

All the instructions and code needed to implement the architecture have been posted on the Serverless Automated Cost Controls GitHub repo. We recommend that you try this first in a Dev/Test environment.

This implementation description can be broken down into two parts:

  1. Create a solution stack for serverless automated cost controls.
  2. Verify the solution by testing the EC2 fleet.

To tie this back to the “freedom with guardrails” scenario, the Finance department performs a one-time implementation of the solution stack. To simulate resources for Project Beta, the developers spin up the test EC2 fleet.

Prerequisites

There are two prerequisites:

  • Make sure that you have the necessary IAM permissions. For more information, see the section titled “Required IAM permissions” in the README.
  • Define and activate a cost allocation tag with the key Project. For more information, see Using Cost Allocation Tags. It can take up to 12 hours for the tags to propagate to Budgets.

Create resources

The solution stack includes creating the following resources:

  • Three Lambda functions
  • One Step Functions state machine
  • One SNS topic
  • One IAM group
  • One IAM user
  • IAM policies as needed
  • One budget

Two of the Lambda functions were described in the previous section, to a) receive the SNS notification and b) trigger the Step Functions state machine. Another Lambda function is used to create the budget, as a custom AWS CloudFormation resource. The SNS topic connects Budgets with Lambda function A. Lambda function B is configured as a task in Step Functions. A budget for $2 is created which is filtered by Service: EC2 and Tag: Project, Beta. A test IAM group and user is created to enable you to validate this Cost Control Solution.

To create the serverless automated cost control solution stack, choose the button below. It takes few minutes to spin up the stack. You can monitor the progress in the CloudFormation console.

When you see the CREATE_COMPLETE status for the stack you had created, choose Outputs. Copy the following four values that you need later:

  • TemplateURL
  • UserName
  • SignInURL
  • Password

Verify the stack

The next step is to verify the serverless automated cost controls solution stack that you just created. To do this, spin up an EC2 fleet of t2.micro instances, representative of the resources needed for Project Beta, and tag them with Project, Beta.

  1. Browse to the SignInURL, and log in using the UserName and Password values copied on from the stack output.
  2. In the CloudFormation console, choose Create Stack.
  3. For Choose a template, select Choose an Amazon S3 template URL and paste the TemplateURL value from the preceding section. Choose Next.
  4. Give this stack a name, such as “testEc2FleetForProjectBeta”. Choose Next.
  5. On the Specify Details page, enter parameters such as the UserName and Password copied in the previous section. Choose Next.
  6. Ignore any errors related to listing IAM roles. The test user has a minimal set of permissions that is just sufficient to spin up this test stack (in line with security best practices).
  7. On the Options page, choose Next.
  8. On the Review page, choose Create. It takes a few minutes to spin up the stack, and you can monitor the progress in the CloudFormation console. 
  9. When you see the status “CREATE_COMPLETE”, open the EC2 console to verify that four t2.micro instances have been spun up, with the tag of Project, Beta.

The hourly cost for these instances depends on the region in which they are running. On the average (irrespective of the region), you can expect the aggregate cost for this EC2 fleet to exceed the set $2 budget in 48 hours.

Verify the solution

The first step is to identify the test IAM group that was created in the previous section. The group should have “projectBeta” in the name, prepended with the CloudFormation stack name and appended with an alphanumeric string. Verify that the managed policy associated is: “EC2FullAccess”, which indicates that the users in this group have unrestricted access to EC2.

There are two stages of verification for this serverless automated cost controls solution: simulating a notification and waiting for a breach.

Simulated notification

Because it takes at least a few hours for the aggregate cost of the EC2 fleet to breach the set budget, you can verify the solution by simulating the notification from Budgets.

  1. Log in to the SNS console (using your regular AWS credentials).
  2. Publish a message on the SNS topic that has “budgetNotificationTopic” in the name. The complete name is appended by the CloudFormation stack identifier.  
  3. Copy the following text as the body of the notification: “This is a mock notification”.
  4. Choose Publish.
  5. Open the IAM console to verify that the policy for the test group has been switched to “EC2ReadOnly”. This prevents users in this group from creating new instances.
  6. Verify that the test user created in the previous section cannot spin up new EC2 instances.  You can log in as the test user and try creating a new EC2 instance (via the same CloudFormation stack or the EC2 console). You should get an error message indicating that you do not have the necessary permissions.
  7. If you are proceeding to stage 2 of the verification, then you must switch the permissions back to “EC2FullAccess” for the test group, which can be done in the IAM console.

Automatic notification

Within 48 hours, the aggregate cost of the EC2 fleet spun up in the earlier section breaches the budget rule and triggers an automatic notification. This results in the permissions getting switched out, just as in the simulated notification.

Clean up

Use the following steps to delete your resources and stop incurring costs.

  1. Open the CloudFormation console.
  2. Delete the EC2 fleet by deleting the appropriate stack (for example, delete the stack named “testEc2FleetForProjectBeta”).                                               
  3. Next, delete the “costControlStack” stack.                                                                                                                                                    

Conclusion

Using Lambda in tandem with Budgets, you can build Serverless automated cost controls on AWS. Find all the resources (instructions, code) for implementing the solution discussed in this post on the Serverless Automated Cost Controls GitHub repo.

Stay tuned to this series for more tips about building serverless automated cost controls. In the next post, we discuss using smart lighting to influence developer behavior and describe a solution to encourage cost-aware development practices.

If you have questions or suggestions, please comment below.

 

Your Holiday Cybersecurity Guide

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/11/your-holiday-cybersecurity-guide.html

Many of us are visiting parents/relatives this Thanksgiving/Christmas, and will have an opportunity to help our them with cybersecurity issues. I thought I’d write up a quick guide of the most important things.

1. Stop them from reusing passwords

By far the biggest threat to average people is that they re-use the same password across many websites, so that when one website gets hacked, all their accounts get hacked.
To demonstrate the problem, go to haveibeenpwned.com and enter the email address of your relatives. This will show them a number of sites where their password has already been stolen, like LinkedIn, Adobe, etc. That should convince them of the severity of the problem.

They don’t need a separate password for every site. You don’t care about the majority of website whether you get hacked. Use a common password for all the meaningless sites. You only need unique passwords for important accounts, like email, Facebook, and Twitter.

Write down passwords and store them in a safe place. Sure, it’s a common joke that people in offices write passwords on Post-It notes stuck on their monitors or under their keyboards. This is a common security mistake, but that’s only because the office environment is widely accessible. Your home isn’t, and there’s plenty of places to store written passwords securely, such as in a home safe. Even if it’s just a desk drawer, such passwords are safe from hackers, because they aren’t on a computer.

Write them down, with pen and paper. Don’t put them in a MyPasswords.doc, because when a hacker breaks in, they’ll easily find that document and easily hack your accounts.

You might help them out with getting a password manager, or two-factor authentication (2FA). Good 2FA like YubiKey will stop a lot of phishing threats. But this is difficult technology to learn, and of course, you’ll be on the hook for support issues, such as when they lose the device. Thus, while 2FA is best, I’m only recommending pen-and-paper to store passwords. (AccessNow has a guide, though I think YubiKey/U2F keys for Facebook and GMail are the best).

2. Lock their phone (passcode, fingerprint, faceprint)
You’ll lose your phone at some point. It has the keys all all your accounts, like email and so on. With your email, phones thieves can then reset passwords on all your other accounts. Thus, it’s incredibly important to lock the phone.

Apple has made this especially easy with fingerprints (and now faceprints), so there’s little excuse not to lock the phone.

Note that Apple iPhones are the most secure. I give my mother my old iPhones so that they will have something secure.

My mom demonstrates a problem you’ll have with the older generation: she doesn’t reliably have her phone with her, and charged. She’s the opposite of my dad who religiously slaved to his phone. Even a small change to make her lock her phone means it’ll be even more likely she won’t have it with her when you need to call her.

3. WiFi (WPA)
Make sure their home WiFi is WPA encrypted. It probably already is, but it’s worthwhile checking.

The password should be written down on the same piece of paper as all the other passwords. This is importance. My parents just moved, Comcast installed a WiFi access point for them, and they promptly lost the piece of paper. When I wanted to debug some thing on their network today, they didn’t know the password, and couldn’t find the paper. Get that password written down in a place it won’t get lost!

Discourage them from extra security features like “SSID hiding” and/or “MAC address filtering”. They provide no security benefit, and actually make security worse. It means a phone has to advertise the SSID when away from home, and it makes MAC address randomization harder, both of which allows your privacy to be tracked.

If they have a really old home router, you should probably replace it, or at least update the firmware. A lot of old routers have hacks that allow hackers (like me masscaning the Internet) to easily break in.

4. Ad blockers or Brave

Most of the online tricks that will confuse your older parents will come via advertising, such as popups claiming “You are infected with a virus, click here to clean it”. Installing an ad blocker in the browser, such as uBlock Origin, stops most all this nonsense.

For example, here’s a screenshot of going to the “Speedtest” website to test the speed of my connection (I took this on the plane on the way home for Thanksgiving). Ignore the error (plane’s firewall Speedtest) — but instead look at the advertising banner across the top of the page insisting you need to download a browser extension. This is tricking you into installing malware — the ad appears as if it’s a message from Speedtest, it’s not. Speedtest is just selling advertising and has no clue what the banner says. This sort of thing needs to be blocked — it fools even the technologically competent.

uBlock Origin for Chrome is the one I use. Another option is to replace their browser with Brave, a browser that blocks ads, but at the same time, allows micropayments to support websites you want to support. I use Brave on my iPhone.
A side benefit of ad blockers or Brave is that web surfing becomes much faster, since you aren’t downloading all this advertising. The smallest NYtimes story is 15 megabytes in size due to all the advertisements, for example.

5. Cloud Backups
Do backups, in the cloud. It’s a good idea in general, especially with the threat of ransomware these days.

In particular, consider your photos. Over time, they will be lost, because people make no effort to keep track of them. All hard drives will eventually crash, deleting your photos. Sure, a few key ones are backed up on Facebook for life, but the rest aren’t.
There are so many excellent online backup services out there, like DropBox and Backblaze. Or, you can use the iCloud feature that Apple provides. My favorite is Microsoft’s: I already pay $99 a year for Office 365 subscription, and it comes with 1-terabyte of online storage.

6. Separate email accounts
You should have three email accounts: work, personal, and financial.

First, you really need to separate your work account from personal. The IT department is already getting misdirected emails with your spouse/lover that they don’t want to see. Any conflict with your work, such as getting fired, gives your private correspondence to their lawyers.

Second, you need a wholly separate account for financial stuff, like Amazon.com, your bank, PayPal, and so on. That prevents confusion with phishing attacks.

Consider this warning today:

If you had split accounts, you could safely ignore this. The USPS would only know your financial email account, which gets no phishing attacks, because it’s not widely known. When your receive the phishing attack on your personal email, you ignore it, because you know the USPS doesn’t know your personal email account.

Phishing emails are so sophisticated that even experts can’t tell the difference. Splitting financial from personal emails makes it so you don’t have to tell the difference — anything financial sent to personal email can safely be ignored.

7. Deauth those apps!

Twitter user @tompcoleman comments that we also need deauth apps.
Social media sites like Facebook, Twitter, and Google encourage you to enable “apps” that work their platforms, often demanding privileges to generate messages on your behalf. The typical scenario is that you use them only once or twice and forget about them.
A lot of them are hostile. For example, my niece’s twitter account would occasional send out advertisements, and she didn’t know why. It’s because a long time ago, she enabled an app with the permission to send tweets for her. I had to sit down and get rid of most of her apps.
Now would be a good time to go through your relatives Facebook, Twitter, and Google/GMail and disable those apps. Don’t be a afraid to be ruthless — they probably weren’t using them anyway. Some will still be necessary. For example, Twitter for iPhone shows up in the list of Twitter apps. The URL for editing these apps for Twitter is https://twitter.com/settings/applications. Google link is here (thanks @spextr). I don’t know of simple URLs for Facebook, but you should find it somewhere under privacy/security settings.
Update: Here’s a more complete guide for a even more social media services.
https://www.permissions.review/

8. Up-to-date software? maybe

I put this last because it can be so much work.

You should install the latest OS (Windows 10, macOS High Sierra), and also turn on automatic patching.

But remember it may not be worth the huge effort involved. I want my parents to be secure — but no so secure I have to deal with issues.

For example, when my parents updated their HP Print software, the icon on the desktop my mom usually uses to scan things in from the printer disappeared, and needed me to spend 15 minutes with her helping find the new way to access the software.
However, I did get my mom a new netbook to travel with instead of the old WinXP one. I want to get her a Chromebook, but she doesn’t want one.
For iOS, you can probably make sure their phones have the latest version without having these usability problems.

Conclusion

You can’t solve every problem for your relatives, but these are the more critical ones.

New – Interactive AWS Cost Explorer API

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-interactive-aws-cost-explorer-api/

We launched the AWS Cost Explorer a couple of years ago in order to allow you to track, allocate, and manage your AWS costs. The response to that launch, and to additions that we have made since then, has been very positive. However our customers are, as Jeff Bezos has said, “beautifully, wonderfully, dissatisfied.”

I see this first-hand every day. We launch something and that launch inspires our customers to ask for even more. For example, with many customers going all-in and moving large parts of their IT infrastructure to the AWS Cloud, we’ve had many requests for the raw data that feeds into the Cost Explorer. These customers want to programmatically explore their AWS costs, update ledgers and accounting systems with per-application and per-department costs, and to build high-level dashboards that summarize spending. Some of these customers have been going to the trouble of extracting the data from the charts and reports provided by Cost Explorer!

New Cost Explorer API
Today we are making the underlying data that feeds into Cost Explorer available programmatically. The new Cost Explorer API gives you a set of functions that allow you do everything that I described above. You can retrieve cost and usage data that is filtered and grouped across multiple dimensions (Service, Linked Account, tag, Availability Zone, and so forth), aggregated by day or by month. This gives you the power to start simple (total monthly costs) and to refine your requests to any desired level of detail (writes to DynamoDB tables that have been tagged as production) while getting responses in seconds.

Here are the operations:

GetCostAndUsage – Retrieve cost and usage metrics for a single account or all accounts (master accounts in an organization have access to all member accounts) with filtering and grouping.

GetDimensionValues – Retrieve available filter values for a specified filter over a specified period of time.

GetTags – Retrieve available tag keys and tag values over a specified period of time.

GetReservationUtilization – Retrieve EC2 Reserved Instance utilization over a specified period of time, with daily or monthly granularity plus filtering and grouping.

I believe that these functions, and the data that they return, will give you the ability to do some really interesting things that will give you better insights into your business. For example, you could tag the resources used to support individual marketing campaigns or development projects and then deep-dive into the costs to measure business value. You how have the potential to know, down to the penny, how much you spend on infrastructure for important events like Cyber Monday or Black Friday.

Things to Know
Here are a couple of things to keep in mind as you start to think about ways to make use of the API:

Grouping – The Cost Explorer web application provides you with one level of grouping; the APIs give you two. For example you could group costs or RI utilization by Service and then by Region.

Pagination – The functions can return very large amounts of data and follow the AWS-wide model for pagination by including a nextPageToken if additional data is available. You simply call the same function again, supplying the token, to move forward.

Regions – The service endpoint is in the US East (Northern Virginia) Region and returns usage data for all public AWS Regions.

Pricing – Each API call costs $0.01. To put this into perspective, let’s say you use this API to build a dashboard and it gets 1000 hits per month from your users. Your operating cost for the dashboard should be $10 or so; this is far less expensive than setting up your own systems to extract & ingest the data and respond to interactive queries.

The Cost Explorer API is available now and you can start using it today. To learn more, read about the Cost Explorer API.

Jeff;

Amazon QuickSight Update – Geospatial Visualization, Private VPC Access, and More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-quicksight-update-geospatial-visualization-private-vpc-access-and-more/

We don’t often recognize or celebrate anniversaries at AWS. With nearly 100 services on our list, we’d be eating cake and drinking champagne several times a week. While that might sound like fun, we’d rather spend our working hours listening to customers and innovating. With that said, Amazon QuickSight has now been generally available for a little over a year and I would like to give you a quick update!

QuickSight in Action
Today, tens of thousands of customers (from startups to enterprises, in industries as varied as transportation, legal, mining, and healthcare) are using QuickSight to analyze and report on their business data.

Here are a couple of examples:

Gemini provides legal evidence procurement for California attorneys who represent injured workers. They have gone from creating custom reports and running one-off queries to creating and sharing dynamic QuickSight dashboards with drill-downs and filtering. QuickSight is used to track sales pipeline, measure order throughput, and to locate bottlenecks in the order processing pipeline.

Jivochat provides a real-time messaging platform to connect visitors to website owners. QuickSight lets them create and share interactive dashboards while also providing access to the underlying datasets. This has allowed them to move beyond the sharing of static spreadsheets, ensuring that everyone is looking at the same and is empowered to make timely decisions based on current data.

Transfix is a tech-powered freight marketplace that matches loads and increases visibility into logistics for Fortune 500 shippers in retail, food and beverage, manufacturing, and other industries. QuickSight has made analytics accessible to both BI engineers and non-technical business users. They scrutinize key business and operational metrics including shipping routes, carrier efficient, and process automation.

Looking Back / Looking Ahead
The feedback on QuickSight has been incredibly helpful. Customers tell us that their employees are using QuickSight to connect to their data, perform analytics, and make high-velocity, data-driven decisions, all without setting up or running their own BI infrastructure. We love all of the feedback that we get, and use it to drive our roadmap, leading to the introduction of over 40 new features in just a year. Here’s a summary:

Looking forward, we are watching an interesting trend develop within our customer base. As these customers take a close look at how they analyze and report on data, they are realizing that a serverless approach offers some tangible benefits. They use Amazon Simple Storage Service (S3) as a data lake and query it using a combination of QuickSight and Amazon Athena, giving them agility and flexibility without static infrastructure. They also make great use of QuickSight’s dashboards feature, monitoring business results and operational metrics, then sharing their insights with hundreds of users. You can read Building a Serverless Analytics Solution for Cleaner Cities and review Serverless Big Data Analytics using Amazon Athena and Amazon QuickSight if you are interested in this approach.

New Features and Enhancements
We’re still doing our best to listen and to learn, and to make sure that QuickSight continues to meet your needs. I’m happy to announce that we are making seven big additions today:

Geospatial Visualization – You can now create geospatial visuals on geographical data sets.

Private VPC Access – You can now sign up to access a preview of a new feature that allows you to securely connect to data within VPCs or on-premises, without the need for public endpoints.

Flat Table Support – In addition to pivot tables, you can now use flat tables for tabular reporting. To learn more, read about Using Tabular Reports.

Calculated SPICE Fields – You can now perform run-time calculations on SPICE data as part of your analysis. Read Adding a Calculated Field to an Analysis for more information.

Wide Table Support – You can now use tables with up to 1000 columns.

Other Buckets – You can summarize the long tail of high-cardinality data into buckets, as described in Working with Visual Types in Amazon QuickSight.

HIPAA Compliance – You can now run HIPAA-compliant workloads on QuickSight.

Geospatial Visualization
Everyone seems to want this feature! You can now take data that contains a geographic identifier (country, city, state, or zip code) and create beautiful visualizations with just a few clicks. QuickSight will geocode the identifier that you supply, and can also accept lat/long map coordinates. You can use this feature to visualize sales by state, map stores to shipping destinations, and so forth. Here’s a sample visualization:

To learn more about this feature, read Using Geospatial Charts (Maps), and Adding Geospatial Data.

Private VPC Access Preview
If you have data in AWS (perhaps in Amazon Redshift, Amazon Relational Database Service (RDS), or on EC2) or on-premises in Teradata or SQL Server on servers without public connectivity, this feature is for you. Private VPC Access for QuickSight uses an Elastic Network Interface (ENI) for secure, private communication with data sources in a VPC. It also allows you to use AWS Direct Connect to create a secure, private link with your on-premises resources. Here’s what it looks like:

If you are ready to join the preview, you can sign up today.

Jeff;

 

Grafana 4.6 Released

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/26/grafana-4.6-released/

Release Highlights

The Grafana 4.6 release contains some exciting and much anticipated new additions:

This is a big release so check out the other features and fixes in the Changelog section below.

Annotations

Annotations provide a way to mark points on the graph with rich events. You can now add annotation events and regions right from the graph panel! Just hold CTRL/CMD + click or drag region to open the Add Annotation view. The
Annotations documentation is updated to include details on this new exciting feature.

Cloudwatch

Cloudwatch now supports alerting. You can now setup alert rules for any Cloudwatch metric!

Postgres

Grafana v4.6 now ships with a built-in datasource plugin for PostgreSQL. Have logs or metric data in Postgres? You can now visualize that data and
define alert rules on it like any of our other data sources.

Prometheus

New enhancements include support for instant queries (for a single point in time instead of a time range) and improvements to query editor in the form of autocomplete for label names and label values.

This makes exploring and filtering Prometheus data much easier.

Changelog

Here are just a few highlights from the Changelog.

New Features

  • Annotations: Add support for creating annotations from graph panel #8197
  • GCS: Adds support for Google Cloud Storage #8370 thx @chuhlomin
  • Prometheus: Adds /metrics endpoint for exposing Grafana metrics. #9187
  • Jaeger: Add support for open tracing using jaeger in Grafana. #9213
  • Unit types: New date & time unit types added, useful in singlestat to show dates & times. #3678, #6710, #2764
  • Prometheus: Add support for instant queries #5765, thx @mtanda
  • Cloudwatch: Add support for alerting using the cloudwatch datasource #8050, thx @mtanda
  • Pagerduty: Include triggering series in pagerduty notification #8479, thx @rickymoorhouse
  • Prometheus: Align $__interval with the step parameters. #9226, thx @alin-amana
  • Prometheus: Autocomplete for label name and label value #9208, thx @mtanda
  • Postgres: New Postgres data source #9209, thx @svenklemm
  • Datasources: Make datasource HTTP requests verify TLS by default. closes #9371, #5334, #8812, thx @mattbostock

Minor

  • SMTP: Make it possible to set specific HELO for smtp client. #9319
  • Alerting: Add diff and percent diff as series reducers #9386, thx @shanhuhai5739
  • Slack: Allow images to be uploaded to slack when Token is present #7175, thx @xginn8
  • Table: Add support for displaying the timestamp with milliseconds #9429, thx @s1061123
  • Hipchat: Add metrics, message and image to hipchat notifications #9110, thx @eloo
  • Kafka: Add support for sending alert notifications to kafka #7104, thx @utkarshcmu

Tech

  • Webpack: Changed from systemjs to webpack (see readme or building from source guide for new build instructions). Systemjs is still used to load plugins but now plugins can only import a limited set of dependencies. See PLUGIN_DEV.md for more details on how this can effect some plugins.

Download

Head to the v4.6 download page for download links & instructions.

Thanks

A big thanks to all the Grafana users who contribute by submitting PRs, bug reports, helping out on our community site and providing feedback!

Abandon Proactive Copyright Filters, Huge Coalition Tells EU Heavyweights

Post Syndicated from Andy original https://torrentfreak.com/abandon-proactive-copyright-filters-huge-coalition-tells-eu-heavyweights-171017/

Last September, EU Commission President Jean-Claude Juncker announced plans to modernize copyright law in Europe.

The proposals (pdf) are part of the Digital Single Market reforms, which have been under development for the past several years.

One of the proposals is causing significant concern. Article 13 would require some online service providers to become ‘Internet police’, proactively detecting and filtering allegedly infringing copyright works, uploaded to their platforms by users.

Currently, users are generally able to share whatever they like but should a copyright holder take exception to their upload, mechanisms are available for that content to be taken down. It’s envisioned that proactive filtering, whereby user uploads are routinely scanned and compared to a database of existing protected content, will prevent content becoming available in the first place.

These proposals are of great concern to digital rights groups, who believe that such filters will not only undermine users’ rights but will also place unfair burdens on Internet platforms, many of which will struggle to fund such a program. Yesterday, in the latest wave of opposition to Article 13, a huge coalition of international rights groups came together to underline their concerns.

Headed up by Civil Liberties Union for Europe (Liberties) and European Digital Rights (EDRi), the coalition is formed of dozens of influential groups, including Electronic Frontier Foundation (EFF), Human Rights Watch, Reporters without Borders, and Open Rights Group (ORG), to name just a few.

In an open letter to European Commission President Jean-Claude Juncker, President of the European Parliament Antonio Tajani, President of the European Council Donald Tusk and a string of others, the groups warn that the proposals undermine the trust established between EU member states.

“Fundamental rights, justice and the rule of law are intrinsically linked and constitute
core values on which the EU is founded,” the letter begins.

“Any attempt to disregard these values undermines the mutual trust between member states required for the EU to function. Any such attempt would also undermine the commitments made by the European Union and national governments to their citizens.”

Those citizens, the letter warns, would have their basic rights undermined, should the new proposals be written into EU law.

“Article 13 of the proposal on Copyright in the Digital Single Market include obligations on internet companies that would be impossible to respect without the imposition of excessive restrictions on citizens’ fundamental rights,” it notes.

A major concern is that by placing new obligations on Internet service providers that allow users to upload content – think YouTube, Facebook, Twitter and Instagram – they will be forced to err on the side of caution. Should there be any concern whatsoever that content might be infringing, fair use considerations and exceptions will be abandoned in favor of staying on the right side of the law.

“Article 13 appears to provoke such legal uncertainty that online services will have no other option than to monitor, filter and block EU citizens’ communications if they are to have any chance of staying in business,” the letter warns.

But while the potential problems for service providers and users are numerous, the groups warn that Article 13 could also be illegal since it contradicts case law of the Court of Justice.

According to the E-Commerce Directive, platforms are already required to remove infringing content, once they have been advised it exists. The new proposal, should it go ahead, would force the monitoring of uploads, something which goes against the ‘no general obligation to monitor‘ rules present in the Directive.

“The requirement to install a system for filtering electronic communications has twice been rejected by the Court of Justice, in the cases Scarlet Extended (C70/10) and Netlog/Sabam (C 360/10),” the rights groups warn.

“Therefore, a legislative provision that requires internet companies to install a filtering system would almost certainly be rejected by the Court of Justice because it would contravene the requirement that a fair balance be struck between the right to intellectual property on the one hand, and the freedom to conduct business and the right to freedom of expression, such as to receive or impart information, on the other.”

Specifically, the groups note that the proactive filtering of content would violate freedom of expression set out in Article 11 of the Charter of Fundamental Rights. That being the case, the groups expect national courts to disapply it and the rule to be annulled by the Court of Justice.

The latest protests against Article 13 come in the wake of large-scale objections earlier in the year, voicing similar concerns. However, despite the groups’ fears, they have powerful adversaries, each determined to stop the flood of copyrighted content currently being uploaded to the Internet.

Front and center in support of Article 13 is the music industry and its current hot-topic, the so-called Value Gap(1,2,3). The industry feels that platforms like YouTube are able to avoid paying expensive licensing fees (for music in particular) by exploiting the safe harbor protections of the DMCA and similar legislation.

They believe that proactively filtering uploads would significantly help to diminish this problem, which may very well be the case. But at what cost to the general public and the platforms they rely upon? Citizens and scholars feel that freedoms will be affected and it’s likely the outcry will continue.

The ball is now with the EU, whose members will soon have to make what could be the most important decision in recent copyright history. The rights groups, who are urging for Article 13 to be deleted, are clear where they stand.

The full letter is available here (pdf)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.