Tag Archives: cluster

AWS Glue Now Supports Scala Scripts

Post Syndicated from Mehul Shah original https://aws.amazon.com/blogs/big-data/aws-glue-now-supports-scala-scripts/

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

This post shows how to build an example Scala script that identifies highly negative issues in the timeline. It pulls out issue events in the timeline sample, analyzes their titles using the sentiment prediction functions from the Stanford CoreNLP libraries, and surfaces the most negative issues.

Getting started

Before we start writing scripts, we use AWS Glue crawlers to get a sense of the data—its structure and characteristics. We also set up a development endpoint and attach an Apache Zeppelin notebook, so we can interactively explore the data and author the script.

Crawl the data

The dataset used in this example was downloaded from the GitHub archive website into our sample dataset bucket in Amazon S3, and copied to the following locations:

s3://aws-glue-datasets-<region>/examples/scala-blog/githubarchive/data/

Choose the best folder by replacing <region> with the region that you’re working in, for example, us-east-1. Crawl this folder, and put the results into a database named githubarchive in the AWS Glue Data Catalog, as described in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, and is organized hierarchically (that is, partitioned) by year, month, and day.

When finished, use the AWS Glue console to navigate to the table named data in the githubarchive database. Notice that this data has eight top-level columns, which are common to each event type, and three partition columns that correspond to year, month, and day.

Choose the payload column, and you will notice that it has a complex schema—one that reflects the union of the payloads of event types that appear in the crawled data. Also note that the schema that crawlers generate is a subset of the true schema because they sample only a subset of the data.

Set up the library, development endpoint, and notebook

Next, you need to download and set up the libraries that estimate the sentiment in a snippet of text. The Stanford CoreNLP libraries contain a number of human language processing tools, including sentiment prediction.

Download the Stanford CoreNLP libraries. Unzip the .zip file, and you’ll see a directory full of jar files. For this example, the following jars are required:

  • stanford-corenlp-3.8.0.jar
  • stanford-corenlp-3.8.0-models.jar
  • ejml-0.23.jar

Upload these files to an Amazon S3 path that is accessible to AWS Glue so that it can load these libraries when needed. For this example, they are in s3://glue-sample-other/corenlp/.

Development endpoints are static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints to interactively send commands and explore and analyze your data. These endpoints have the same configuration as that of AWS Glue’s job execution system. So, commands and scripts that work there also work the same when registered and run as jobs in AWS Glue.

To set up an endpoint and a Zeppelin notebook to work with that endpoint, follow the instructions in the AWS Glue Developer Guide. When you are creating an endpoint, be sure to specify the locations of the previously mentioned jars in the Dependent jars path as a comma-separated list. Otherwise, the libraries will not be loaded.

After you set up the notebook server, go to the Zeppelin notebook by choosing Dev Endpoints in the left navigation pane on the AWS Glue console. Choose the endpoint that you created. Next, choose the Notebook Server URL, which takes you to the Zeppelin server. Log in using the notebook user name and password that you specified when creating the notebook. Finally, create a new note to try out this example.

Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and the output for that command. Moreover, each notebook includes a number of interpreters. If you set up the Zeppelin server using the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already connected to your new development endpoint, with pyspark as the default. Therefore, throughout this example, you need to prepend %spark at the top of your paragraphs. In this example, we omit these for brevity.

Working with the data

In this section, we use AWS Glue extensions to Spark to work with the dataset. We look at the actual schema of the data and filter out the interesting event types for our analysis.

Start with some boilerplate code to import libraries that you need:

%spark

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext

Then, create the Spark and AWS Glue contexts needed for working with the data:

@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)

You need the transient decorator on the SparkContext when working in Zeppelin; otherwise, you will run into a serialization error when executing commands.

Dynamic frames

This section shows how to create a dynamic frame that contains the GitHub records in the table that you crawled earlier. A dynamic frame is the basic data structure in AWS Glue scripts. It is like an Apache Spark data frame, except that it is designed and optimized for data cleaning and transformation workloads. A dynamic frame is well-suited for representing semi-structured datasets like the GitHub timeline.

A dynamic frame is a collection of dynamic records. In Spark lingo, it is an RDD (resilient distributed dataset) of DynamicRecords. A dynamic record is a self-describing record. Each record encodes its columns and types, so every record can have a schema that is unique from all others in the dynamic frame. This is convenient and often more efficient for datasets like the GitHub timeline, where payloads can vary drastically from one event type to another.

The following creates a dynamic frame, github_events, from your table:

val github_events = glueContext
                    .getCatalogSource(database = "githubarchive", tableName = "data")
                    .getDynamicFrame()

The getCatalogSource() method returns a DataSource, which represents a particular table in the Data Catalog. The getDynamicFrame() method returns a dynamic frame from the source.

Recall that the crawler created a schema from only a sample of the data. You can scan the entire dataset, count the rows, and print the complete schema as follows:

github_events.count
github_events.printSchema()

The result looks like the following:

The data has 414,826 records. As before, notice that there are eight top-level columns, and three partition columns. If you scroll down, you’ll also notice that the payload is the most complex column.

Run functions and filter records

This section describes how you can create your own functions and invoke them seamlessly to filter records. Unlike filtering with Python lambdas, Scala scripts do not need to convert records from one language representation to another, thereby reducing overhead and running much faster.

Let’s create a function that picks only the IssuesEvents from the GitHub timeline. These events are generated whenever someone posts an issue for a particular repository. Each GitHub event record has a field, “type”, that indicates the kind of event it is. The issueFilter() function returns true for records that are IssuesEvents.

def issueFilter(rec: DynamicRecord): Boolean = { 
    rec.getField("type").exists(_ == "IssuesEvent") 
}

Note that the getField() method returns an Option[Any] type, so you first need to check that it exists before checking the type.

You pass this function to the filter transformation, which applies the function on each record and returns a dynamic frame of those records that pass.

val issue_events =  github_events.filter(issueFilter)

Now, let’s look at the size and schema of issue_events.

issue_events.count
issue_events.printSchema()

It’s much smaller (14,063 records), and the payload schema is less complex, reflecting only the schema for issues. Keep a few essential columns for your analysis, and drop the rest using the ApplyMapping() transform:

val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                 ("actor.login", "string", "actor", "string"), 
                                                 ("repo.name", "string", "repo", "string"),
                                                 ("payload.action", "string", "action", "string"),
                                                 ("payload.issue.title", "string", "title", "string")))
issue_titles.show()

The ApplyMapping() transform is quite handy for renaming columns, casting types, and restructuring records. The preceding code snippet tells the transform to select the fields (or columns) that are enumerated in the left half of the tuples and map them to the fields and types in the right half.

Estimating sentiment using Stanford CoreNLP

To focus on the most pressing issues, you might want to isolate the records with the most negative sentiments. The Stanford CoreNLP libraries are Java-based and offer sentiment-prediction functions. Accessing these functions through Python is possible, but quite cumbersome. It requires creating Python surrogate classes and objects for those found on the Java side. Instead, with Scala support, you can use those classes and objects directly and invoke their methods. Let’s see how.

First, import the libraries needed for the analysis:

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

The Stanford CoreNLP libraries have a main driver that orchestrates all of their analysis. The driver setup is heavyweight, setting up threads and data structures that are shared across analyses. Apache Spark runs on a cluster with a main driver process and a collection of backend executor processes that do most of the heavy sifting of the data.

The Stanford CoreNLP shared objects are not serializable, so they cannot be distributed easily across a cluster. Instead, you need to initialize them once for every backend executor process that might need them. Here is how to accomplish that:

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")

object myNLP {
    lazy val coreNLP = new StanfordCoreNLP(props)
}

The properties tell the libraries which annotators to execute and how many words to process. The preceding code creates an object, myNLP, with a field coreNLP that is lazily evaluated. This field is initialized only when it is needed, and only once. So, when the backend executors start processing the records, each executor initializes the driver for the Stanford CoreNLP libraries only one time.

Next is a function that estimates the sentiment of a text string. It first calls Stanford CoreNLP to annotate the text. Then, it pulls out the sentences and takes the average sentiment across all the sentences. The sentiment is a double, from 0.0 as the most negative to 4.0 as the most positive.

def estimatedSentiment(text: String): Double = {
    if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
    val annotations = myNLP.coreNLP.process(text)
    val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
    sentences.foldLeft(0.0)( (csum, x) => { 
        csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
    }) / sentences.length
}

Now, let’s estimate the sentiment of the issue titles and add that computed field as part of the records. You can accomplish this with the map() method on dynamic frames:

val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
    val mbody = rec.getField("title")
    mbody match {
        case Some(mval: String) => { 
            rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
            rec }
        case _ => rec
    }
})

The map() method applies the user-provided function on every record. The function takes a DynamicRecord as an argument and returns a DynamicRecord. The code above computes the sentiment, adds it in a top-level field, sentiment, to the record, and returns the record.

Count the records with sentiment and show the schema. This takes a few minutes because Spark must initialize the library and run the sentiment analysis, which can be involved.

issue_sentiments.count
issue_sentiments.printSchema()

Notice that all records were processed (14,063), and the sentiment value was added to the schema.

Finally, let’s pick out the titles that have the lowest sentiment (less than 1.5). Count them and print out a sample to see what some of the titles look like.

val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
pressing_issues.count
pressing_issues.show(10)

Next, write them all to a file so that you can handle them later. (You’ll need to replace the output path with your own.)

glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions("""{"path": "s3://<bucket>/out/path/"}"""), 
                              format = "json")
            .writeDynamicFrame(pressing_issues)

Take a look in the output path, and you can see the output files.

Putting it all together

Now, let’s create a job from the preceding interactive session. The following script combines all the commands from earlier. It processes the GitHub archive files and writes out the highly negative issues:

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

object GlueApp {

    object myNLP {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
        props.setProperty("parse.maxlen", "70")

        lazy val coreNLP = new StanfordCoreNLP(props)
    }

    def estimatedSentiment(text: String): Double = {
        if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
        val annotations = myNLP.coreNLP.process(text)
        val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
        sentences.foldLeft(0.0)( (csum, x) => { 
            csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
        }) / sentences.length
    }

    def main(sysArgs: Array[String]) {
        val spark: SparkContext = SparkContext.getOrCreate()
        val glueContext: GlueContext = new GlueContext(spark)

        val dbname = "githubarchive"
        val tblname = "data"
        val outpath = "s3://<bucket>/out/path/"

        val github_events = glueContext
                            .getCatalogSource(database = dbname, tableName = tblname)
                            .getDynamicFrame()

        val issue_events =  github_events.filter((rec: DynamicRecord) => {
            rec.getField("type").exists(_ == "IssuesEvent")
        })

        val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                         ("actor.login", "string", "actor", "string"), 
                                                         ("repo.name", "string", "repo", "string"),
                                                         ("payload.action", "string", "action", "string"),
                                                         ("payload.issue.title", "string", "title", "string")))

        val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
            val mbody = rec.getField("title")
            mbody match {
                case Some(mval: String) => { 
                    rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
                    rec }
                case _ => rec
            }
        })

        val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))

        glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions(s"""{"path": "$outpath"}"""), 
                              format = "json")
                    .writeDynamicFrame(pressing_issues)
    }
}

Notice that the script is enclosed in a top-level object called GlueApp, which serves as the script’s entry point for the job. (You’ll need to replace the output path with your own.) Upload the script to an Amazon S3 location so that AWS Glue can load it when needed.

To create the job, open the AWS Glue console. Choose Jobs in the left navigation pane, and then choose Add job. Create a name for the job, and specify a role with permissions to access the data. Choose An existing script that you provide, and choose Scala as the language.

For the Scala class name, type GlueApp to indicate the script’s entry point. Specify the Amazon S3 location of the script.

Choose Script libraries and job parameters. In the Dependent jars path field, enter the Amazon S3 locations of the Stanford CoreNLP libraries from earlier as a comma-separated list (without spaces). Then choose Next.

No connections are needed for this job, so choose Next again. Review the job properties, and choose Finish. Finally, choose Run job to execute the job.

You can simply edit the script’s input table and output path to run this job on whatever GitHub timeline datasets that you might have.

Conclusion

In this post, we showed how to write AWS Glue ETL scripts in Scala via notebooks and how to run them as jobs. Scala has the advantage that it is the native language for the Spark runtime. With Scala, it is easier to call Scala or Java functions and third-party libraries for analyses. Moreover, data processing is faster in Scala because there’s no need to convert records from one language runtime to another.

You can find more example of Scala scripts in our GitHub examples repository: https://github.com/awslabs/aws-glue-samples. We encourage you to experiment with Scala scripts and let us know about any interesting ETL flows that you want to share.

Happy Glue-ing!

 


Additional Reading

If you found this post useful, be sure to check out Simplify Querying Nested JSON with the AWS Glue Relationalize Transform and Genomic Analysis with Hail on Amazon EMR and Amazon Athena.

 


About the Authors

Mehul Shah is a senior software manager for AWS Glue. His passion is leveraging the cloud to build smarter, more efficient, and easier to use data systems. He has three girls, and, therefore, he has no spare time.

 

 

 

Ben Sowell is a software development engineer at AWS Glue.

 

 

 

 
Vinay Vivili is a software development engineer for AWS Glue.

 

 

 

Continuous Deployment to Kubernetes using AWS CodePipeline, AWS CodeCommit, AWS CodeBuild, Amazon ECR and AWS Lambda

Post Syndicated from Chris Barclay original https://aws.amazon.com/blogs/devops/continuous-deployment-to-kubernetes-using-aws-codepipeline-aws-codecommit-aws-codebuild-amazon-ecr-and-aws-lambda/

Thank you to my colleague Omar Lari for this blog on how to create a continuous deployment pipeline for Kubernetes!


You can use Kubernetes and AWS together to create a fully managed, continuous deployment pipeline for container based applications. This approach takes advantage of Kubernetes’ open-source system to manage your containerized applications, and the AWS developer tools to manage your source code, builds, and pipelines.

This post describes how to create a continuous deployment architecture for containerized applications. It uses AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, and AWS Lambda to deploy containerized applications into a Kubernetes cluster. In this environment, developers can remain focused on developing code without worrying about how it will be deployed, and development managers can be satisfied that the latest changes are always deployed.

What is Continuous Deployment?

There are many articles, posts and even conferences dedicated to the practice of continuous deployment. For the purposes of this post, I will summarize continuous delivery into the following points:

  • Code is more frequently released into production environments
  • More frequent releases allow for smaller, incremental changes reducing risk and enabling simplified roll backs if needed
  • Deployment is automated and requires minimal user intervention

For a more information, see “Practicing Continuous Integration and Continuous Delivery on AWS”.

How can you use continuous deployment with AWS and Kubernetes?

You can leverage AWS services that support continuous deployment to automatically take your code from a source code repository to production in a Kubernetes cluster with minimal user intervention. To do this, you can create a pipeline that will build and deploy committed code changes as long as they meet the requirements of each stage of the pipeline.

To create the pipeline, you will use the following services:

  • AWS CodePipeline. AWS CodePipeline is a continuous delivery service that models, visualizes, and automates the steps required to release software. You define stages in a pipeline to retrieve code from a source code repository, build that source code into a releasable artifact, test the artifact, and deploy it to production. Only code that successfully passes through all these stages will be deployed. In addition, you can optionally add other requirements to your pipeline, such as manual approvals, to help ensure that only approved changes are deployed to production.
  • AWS CodeCommit. AWS CodeCommit is a secure, scalable, and managed source control service that hosts private Git repositories. You can privately store and manage assets such as your source code in the cloud and configure your pipeline to automatically retrieve and process changes committed to your repository.
  • AWS CodeBuild. AWS CodeBuild is a fully managed build service that compiles source code, runs tests, and produces artifacts that are ready to deploy. You can use AWS CodeBuild to both build your artifacts, and to test those artifacts before they are deployed.
  • AWS Lambda. AWS Lambda is a compute service that lets you run code without provisioning or managing servers. You can invoke a Lambda function in your pipeline to prepare the built and tested artifact for deployment by Kubernetes to the Kubernetes cluster.
  • Kubernetes. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It provides a platform for running, deploying, and managing containers at scale.

An Example of Continuous Deployment to Kubernetes:

The following example illustrates leveraging AWS developer tools to continuously deploy to a Kubernetes cluster:

  1. Developers commit code to an AWS CodeCommit repository and create pull requests to review proposed changes to the production code. When the pull request is merged into the master branch in the AWS CodeCommit repository, AWS CodePipeline automatically detects the changes to the branch and starts processing the code changes through the pipeline.
  2. AWS CodeBuild packages the code changes as well as any dependencies and builds a Docker image. Optionally, another pipeline stage tests the code and the package, also using AWS CodeBuild.
  3. The Docker image is pushed to Amazon ECR after a successful build and/or test stage.
  4. AWS CodePipeline invokes an AWS Lambda function that includes the Kubernetes Python client as part of the function’s resources. The Lambda function performs a string replacement on the tag used for the Docker image in the Kubernetes deployment file to match the Docker image tag applied in the build, one that matches the image in Amazon ECR.
  5. After the deployment manifest update is completed, AWS Lambda invokes the Kubernetes API to update the image in the Kubernetes application deployment.
  6. Kubernetes performs a rolling update of the pods in the application deployment to match the docker image specified in Amazon ECR.
    The pipeline is now live and responds to changes to the master branch of the CodeCommit repository. This pipeline is also fully extensible, you can add steps for performing testing or adding a step to deploy into a staging environment before the code ships into the production cluster.

An example pipeline in AWS CodePipeline that supports this architecture can be seen below:

Conclusion

We are excited to see how you leverage this pipeline to help ease your developer experience as you develop applications in Kubernetes.

You’ll find an AWS CloudFormation template with everything necessary to spin up your own continuous deployment pipeline at the CodeSuite – Continuous Deployment Reference Architecture for Kubernetes repo on GitHub. The repository details exactly how the pipeline is provisioned and how you can use it to deploy your own applications. If you have any questions, feedback, or suggestions, please let us know!

Graphite 1.1: Teaching an Old Dog New Tricks

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/01/11/graphite-1.1-teaching-an-old-dog-new-tricks/

The Road to Graphite 1.1

I started working on Graphite just over a year ago, when @obfuscurity asked me to help out with some issues blocking the Graphite 1.0 release. Little did I know that a year later, that would have resulted in 262 commits (and counting), and that with the help of the other Graphite maintainers (especially @deniszh, @iksaif & @cbowman0) we would have added a huge amount of new functionality to Graphite.

There are a huge number of new additions and updates in this release, in this post I’ll give a tour of some of the highlights including tag support, syntax and function updates, custom function plugins, and python 3.x support.

Tagging!

The single biggest feature in this release is the addition of tag support, which brings the ability to describe metrics in a much richer way and to write more flexible and expressive queries.

Traditionally series in Graphite are identified using a hierarchical naming scheme based on dot-separated segments called nodes. This works very well and is simple to map into a hierarchical structure like the whisper filesystem tree, but it means that the user has to know what each segment represents, and makes it very difficult to modify or extend the naming scheme since everything is based on the positions of the segments within the hierarchy.

The tagging system gives users the ability to encode information about the series in a collection of tag=value pairs which are used together with the series name to uniquely identify each series, and the ability to query series by specifying tag-based matching expressions rather than constructing glob-style selectors based on the positions of specific segments within the hierarchy. This is broadly similar to the system used by Prometheus and makes it possible to use Graphite as a long-term storage backend for metrics gathered by Prometheus with full tag support.

When using tags, series names are specified using the new tagged carbon format: name;tag1=value1;tag2=value2. This format is backward compatible with most existing carbon tooling, and makes it easy to adapt existing tools to produce tagged metrics simply by changing the metric names. The OpenMetrics format is also supported for ingestion, and is normalized into the standard Graphite format internally.

At its core, the tagging system is implemented as a tag database (TagDB) alongside the metrics that allows them to be efficiently queried by individual tag values rather than having to traverse the metrics tree looking for series that match the specified query. Internally the tag index is stored in one of a number of pluggable tag databases, currently supported options are the internal graphite-web database, redis, or an external system that implements the Graphite tagging HTTP API. Carbon automatically keeps the index up to date with any tagged series seen.

The new seriesByTag function is used to query the TagDB and will return a list of all the series that match the expressions passed to it. seriesByTag supports both exact and regular expression matches, and can be used anywhere you would previously have specified a metric name or glob expression.

There are new dedicated functions for grouping and aliasing series by tag (groupByTags and aliasByTags), and you can also use tags interchangeably with node numbers in the standard Graphite functions like aliasByNode, groupByNodes, asPercent, mapSeries, etc.

Piping Syntax & Function Updates

One of the huge strengths of the Graphite render API is the ability to chain together multiple functions to process data, but until now (unless you were using a tool like Grafana) writing chained queries could be painful as each function had to be wrapped around the previous one. With this release it is now possible to “pipe” the output of one processing function into the next, and to combine piped and nested functions.

For example:

alias(movingAverage(scaleToSeconds(sumSeries(stats_global.production.counters.api.requests.*.count),60),30),'api.avg')

Can now be written as:

sumSeries(stats_global.production.counters.api.requests.*.count)|scaleToSeconds(60)|movingAverage(30)|alias('api.avg')

OR

stats_global.production.counters.api.requests.*.count|sumSeries()|scaleToSeconds(60)|movingAverage(30)|alias('api.avg')

Another source of frustration with the old function API was the inconsistent implementation of aggregations, with different functions being used in different parts of the API, and some functions simply not being available. In 1.1 all functions that perform aggregation (whether across series or across time intervals) now support a consistent set of aggregations; average, median, sum, min, max, diff, stddev, count, range, multiply and last. This is part of a new approach to implementing functions that emphasises using shared building blocks to ensure consistency across the API and solve the problem of a particular function not working with the aggregation needed for a given task.

To that end a number of new functions have been added that each provide the same functionality as an entire family of “old” functions; aggregate, aggregateWithWildcards, movingWindow, filterSeries, highest, lowest and sortBy.

Each of these functions accepts an aggregation method parameter, for example aggregate(some.metric.*, 'sum') implements the same functionality as sumSeries(some.metric.*).

It can also be used with different aggregation methods to replace averageSeries, stddevSeries, multiplySeries, diffSeries, rangeOfSeries, minSeries, maxSeries and countSeries. All those functions are now implemented as aliases for aggregate, and it supports the previously-missing median and last aggregations.

The same is true for the other functions, and the summarize, smartSummarize, groupByNode, groupByNodes and the new groupByTags functions now all support the standard set of aggregations. Gone are the days of wishing that sortByMedian or highestRange were available!

For more information on the functions available check the function documentation.

Custom Functions

No matter how many functions are available there are always going to be specific use-cases where a custom function can perform analysis that wouldn’t otherwise be possible, or provide a convenient alias for a complicated function chain or specific set of parameters.

In Graphite 1.1 we added support for easily adding one-off custom functions, as well as for creating and sharing plugins that can provide one or more functions.

Each function plugin is packaged as a simple python module, and will be automatically loaded by Graphite when placed into the functions/custom folder.

An example of a simple function plugin that translates the name of every series passed to it into UPPERCASE:

from graphite.functions.params import Param, ParamTypes

def toUpperCase(requestContext, seriesList):
  """Custom function that changes series names to UPPERCASE"""
  for series in seriesList:
    series.name = series.name.upper()
  return seriesList

toUpperCase.group = 'Custom'
toUpperCase.params = [
  Param('seriesList', ParamTypes.seriesList, required=True),
]

SeriesFunctions = {
  'upper': toUpperCase,
}

Once installed the function is not only available for use within Grpahite, but is also exposed via the new Function API which allows the function definition and documentation to be automatically loaded by tools like Grafana. This means that users will be able to select and use the new function in exactly the same way as the internal functions.

More information on writing and using custom functions is available in the documentation.

Clustering Updates

One of the biggest changes from the 0.9 to 1.0 releases was the overhaul of the clustering code, and with 1.1.1 that process has been taken even further to optimize performance when using Graphite in a clustered deployment. In the past it was common for a request to require the frontend node to make multiple requests to the backend nodes to identify matching series and to fetch data, and the code for handling remote vs local series was overly complicated. In 1.1.1 we took a new approach where all render data requests pass through the same path internally, and multiple backend nodes are handled individually rather than grouped together into a single finder. This has greatly simplified the codebase, making it much easier to understand and reason about, while allowing much more flexibility in design of the finders. After these changes, render requests can now be answered with a single internal request to each backend node, and all requests for both remote and local data are executed in parallel.

To maintain the ability of graphite to scale out horizontally, the tagging system works seamlessly within a clustered environment, with each node responsible for the series stored on that node. Calls to load tagged series via seriesByTag are fanned out to the backend nodes and results are merged on the query node just like they are for non-tagged series.

Python 3 & Django 1.11 Support

Graphite 1.1 finally brings support for Python 3.x, both graphite-web and carbon are now tested against Python 2.7, 3.4, 3.5, 3.6 and PyPy. Django releases 1.8 through 1.11 are also supported. The work involved in sorting out the compatibility issues between Python 2.x and 3.x was quite involved, but it is a huge step forward for the long term support of the project! With the new Django 2.x series supporting only Python 3.x we will need to evaluate our long-term support for Python 2.x, but the Django 1.11 series is supported through 2020 so there is time to consider the options there.

Watch This Space

Efforts are underway to add support for the new functionality across the ecosystem of tools that work with Graphite, adding collectd tagging support, prometheus remote read & write with tags (and native Prometheus remote read/write support in Graphite) and last but not least Graphite tag support in Grafana.

We’re excited about the possibilities that the new capabilities in 1.1.x open up, and can’t wait to see how the community puts them to work.

Download the 1.1.1 release and check out the release notes here.

Combine Transactional and Analytical Data Using Amazon Aurora and Amazon Redshift

Post Syndicated from Re Alvarez-Parmar original https://aws.amazon.com/blogs/big-data/combine-transactional-and-analytical-data-using-amazon-aurora-and-amazon-redshift/

A few months ago, we published a blog post about capturing data changes in an Amazon Aurora database and sending it to Amazon Athena and Amazon QuickSight for fast analysis and visualization. In this post, I want to demonstrate how easy it can be to take the data in Aurora and combine it with data in Amazon Redshift using Amazon Redshift Spectrum.

With Amazon Redshift, you can build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Amazon Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat.

In this post, we describe how to combine data in Aurora in Amazon Redshift. Here’s an overview of the solution:

  • Use AWS Lambda functions with Amazon Aurora to capture data changes in a table.
  • Save data in an Amazon S3
  • Query data using Amazon Redshift Spectrum.

We use the following services:

Serverless architecture for capturing and analyzing Aurora data changes

Consider a scenario in which an e-commerce web application uses Amazon Aurora for a transactional database layer. The company has a sales table that captures every single sale, along with a few corresponding data items. This information is stored as immutable data in a table. Business users want to monitor the sales data and then analyze and visualize it.

In this example, you take the changes in data in an Aurora database table and save it in Amazon S3. After the data is captured in Amazon S3, you combine it with data in your existing Amazon Redshift cluster for analysis.

By the end of this post, you will understand how to capture data events in an Aurora table and push them out to other AWS services using AWS Lambda.

The following diagram shows the flow of data as it occurs in this tutorial:

The starting point in this architecture is a database insert operation in Amazon Aurora. When the insert statement is executed, a custom trigger calls a Lambda function and forwards the inserted data. Lambda writes the data that it received from Amazon Aurora to a Kinesis data delivery stream. Kinesis Data Firehose writes the data to an Amazon S3 bucket. Once the data is in an Amazon S3 bucket, it is queried in place using Amazon Redshift Spectrum.

Creating an Aurora database

First, create a database by following these steps in the Amazon RDS console:

  1. Sign in to the AWS Management Console, and open the Amazon RDS console.
  2. Choose Launch a DB instance, and choose Next.
  3. For Engine, choose Amazon Aurora.
  4. Choose a DB instance class. This example uses a small, since this is not a production database.
  5. In Multi-AZ deployment, choose No.
  6. Configure DB instance identifier, Master username, and Master password.
  7. Launch the DB instance.

After you create the database, use MySQL Workbench to connect to the database using the CNAME from the console. For information about connecting to an Aurora database, see Connecting to an Amazon Aurora DB Cluster.

The following screenshot shows the MySQL Workbench configuration:

Next, create a table in the database by running the following SQL statement:

Create Table
CREATE TABLE Sales (
InvoiceID int NOT NULL AUTO_INCREMENT,
ItemID int NOT NULL,
Category varchar(255),
Price double(10,2), 
Quantity int not NULL,
OrderDate timestamp,
DestinationState varchar(2),
ShippingType varchar(255),
Referral varchar(255),
PRIMARY KEY (InvoiceID)
)

You can now populate the table with some sample data. To generate sample data in your table, copy and run the following script. Ensure that the highlighted (bold) variables are replaced with appropriate values.

#!/usr/bin/python
import MySQLdb
import random
import datetime

db = MySQLdb.connect(host="AURORA_CNAME",
                     user="DBUSER",
                     passwd="DBPASSWORD",
                     db="DB")

states = ("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
"IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
"NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA",
"WA","WV","WI","WY")

shipping_types = ("Free", "3-Day", "2-Day")

product_categories = ("Garden", "Kitchen", "Office", "Household")
referrals = ("Other", "Friend/Colleague", "Repeat Customer", "Online Ad")

for i in range(0,10):
    item_id = random.randint(1,100)
    state = states[random.randint(0,len(states)-1)]
    shipping_type = shipping_types[random.randint(0,len(shipping_types)-1)]
    product_category = product_categories[random.randint(0,len(product_categories)-1)]
    quantity = random.randint(1,4)
    referral = referrals[random.randint(0,len(referrals)-1)]
    price = random.randint(1,100)
    order_date = datetime.date(2016,random.randint(1,12),random.randint(1,30)).isoformat()

    data_order = (item_id, product_category, price, quantity, order_date, state,
    shipping_type, referral)

    add_order = ("INSERT INTO Sales "
                   "(ItemID, Category, Price, Quantity, OrderDate, DestinationState, \
                   ShippingType, Referral) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

    cursor = db.cursor()
    cursor.execute(add_order, data_order)

    db.commit()

cursor.close()
db.close() 

The following screenshot shows how the table appears with the sample data:

Sending data from Amazon Aurora to Amazon S3

There are two methods available to send data from Amazon Aurora to Amazon S3:

  • Using a Lambda function
  • Using SELECT INTO OUTFILE S3

To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function to send data to Amazon S3 using Amazon Kinesis Data Firehose.

Alternatively, you can use a SELECT INTO OUTFILE S3 statement to query data from an Amazon Aurora DB cluster and save it directly in text files that are stored in an Amazon S3 bucket. However, with this method, there is a delay between the time that the database transaction occurs and the time that the data is exported to Amazon S3 because the default file size threshold is 6 GB.

Creating a Kinesis data delivery stream

The next step is to create a Kinesis data delivery stream, since it’s a dependency of the Lambda function.

To create a delivery stream:

  1. Open the Kinesis Data Firehose console
  2. Choose Create delivery stream.
  3. For Delivery stream name, type AuroraChangesToS3.
  4. For Source, choose Direct PUT.
  5. For Record transformation, choose Disabled.
  6. For Destination, choose Amazon S3.
  7. In the S3 bucket drop-down list, choose an existing bucket, or create a new one.
  8. Enter a prefix if needed, and choose Next.
  9. For Data compression, choose GZIP.
  10. In IAM role, choose either an existing role that has access to write to Amazon S3, or choose to generate one automatically. Choose Next.
  11. Review all the details on the screen, and choose Create delivery stream when you’re finished.

 

Creating a Lambda function

Now you can create a Lambda function that is called every time there is a change that needs to be tracked in the database table. This Lambda function passes the data to the Kinesis data delivery stream that you created earlier.

To create the Lambda function:

  1. Open the AWS Lambda console.
  2. Ensure that you are in the AWS Region where your Amazon Aurora database is located.
  3. If you have no Lambda functions yet, choose Get started now. Otherwise, choose Create function.
  4. Choose Author from scratch.
  5. Give your function a name and select Python 3.6 for Runtime
  6. Choose and existing or create a new Role, the role would need to have access to call firehose:PutRecord
  7. Choose Next on the trigger selection screen.
  8. Paste the following code in the code window. Change the stream_name variable to the Kinesis data delivery stream that you created in the previous step.
  9. Choose File -> Save in the code editor and then choose Save.
import boto3
import json

firehose = boto3.client('firehose')
stream_name = ‘AuroraChangesToS3’


def Kinesis_publish_message(event, context):
    
    firehose_data = (("%s,%s,%s,%s,%s,%s,%s,%s\n") %(event['ItemID'], 
    event['Category'], event['Price'], event['Quantity'],
    event['OrderDate'], event['DestinationState'], event['ShippingType'], 
    event['Referral']))
    
    firehose_data = {'Data': str(firehose_data)}
    print(firehose_data)
    
    firehose.put_record(DeliveryStreamName=stream_name,
    Record=firehose_data)

Note the Amazon Resource Name (ARN) of this Lambda function.

Giving Aurora permissions to invoke a Lambda function

To give Amazon Aurora permissions to invoke a Lambda function, you must attach an IAM role with appropriate permissions to the cluster. For more information, see Invoking a Lambda Function from an Amazon Aurora DB Cluster.

Once you are finished, the Amazon Aurora database has access to invoke a Lambda function.

Creating a stored procedure and a trigger in Amazon Aurora

Now, go back to MySQL Workbench, and run the following command to create a new stored procedure. When this stored procedure is called, it invokes the Lambda function you created. Change the ARN in the following code to your Lambda function’s ARN.

DROP PROCEDURE IF EXISTS CDC_TO_FIREHOSE;
DELIMITER ;;
CREATE PROCEDURE CDC_TO_FIREHOSE (IN ItemID VARCHAR(255), 
									IN Category varchar(255), 
									IN Price double(10,2),
                                    IN Quantity int(11),
                                    IN OrderDate timestamp,
                                    IN DestinationState varchar(2),
                                    IN ShippingType varchar(255),
                                    IN Referral  varchar(255)) LANGUAGE SQL 
BEGIN
  CALL mysql.lambda_async('arn:aws:lambda:us-east-1:XXXXXXXXXXXXX:function:CDCFromAuroraToKinesis', 
     CONCAT('{ "ItemID" : "', ItemID, 
            '", "Category" : "', Category,
            '", "Price" : "', Price,
            '", "Quantity" : "', Quantity, 
            '", "OrderDate" : "', OrderDate, 
            '", "DestinationState" : "', DestinationState, 
            '", "ShippingType" : "', ShippingType, 
            '", "Referral" : "', Referral, '"}')
     );
END
;;
DELIMITER ;

Create a trigger TR_Sales_CDC on the Sales table. When a new record is inserted, this trigger calls the CDC_TO_FIREHOSE stored procedure.

DROP TRIGGER IF EXISTS TR_Sales_CDC;
 
DELIMITER ;;
CREATE TRIGGER TR_Sales_CDC
  AFTER INSERT ON Sales
  FOR EACH ROW
BEGIN
  SELECT  NEW.ItemID , NEW.Category, New.Price, New.Quantity, New.OrderDate
  , New.DestinationState, New.ShippingType, New.Referral
  INTO @ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral;
  CALL  CDC_TO_FIREHOSE(@ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral);
END
;;
DELIMITER ;

If a new row is inserted in the Sales table, the Lambda function that is mentioned in the stored procedure is invoked.

Verify that data is being sent from the Lambda function to Kinesis Data Firehose to Amazon S3 successfully. You might have to insert a few records, depending on the size of your data, before new records appear in Amazon S3. This is due to Kinesis Data Firehose buffering. To learn more about Kinesis Data Firehose buffering, see the “Amazon S3” section in Amazon Kinesis Data Firehose Data Delivery.

Every time a new record is inserted in the sales table, a stored procedure is called, and it updates data in Amazon S3.

Querying data in Amazon Redshift

In this section, you use the data you produced from Amazon Aurora and consume it as-is in Amazon Redshift. In order to allow you to process your data as-is, where it is, while taking advantage of the power and flexibility of Amazon Redshift, you use Amazon Redshift Spectrum. You can use Redshift Spectrum to run complex queries on data stored in Amazon S3, with no need for loading or other data prep.

Just create a data source and issue your queries to your Amazon Redshift cluster as usual. Behind the scenes, Redshift Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your dataset grows to beyond an exabyte! Being able to query data that is stored in Amazon S3 means that you can scale your compute and your storage independently. You have the full power of the Amazon Redshift query model and all the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Amazon Redshift tables and in Amazon S3.

Redshift Spectrum supports open, common data types, including CSV/TSV, Apache Parquet, SequenceFile, and RCFile. Files can be compressed using gzip or Snappy, with other data types and compression methods in the works.

First, create an Amazon Redshift cluster. Follow the steps in Launch a Sample Amazon Redshift Cluster.

Next, create an IAM role that has access to Amazon S3 and Athena. By default, Amazon Redshift Spectrum uses the Amazon Athena data catalog. Your cluster needs authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon S3.

In the demo setup, I attached AmazonS3FullAccess and AmazonAthenaFullAccess. In a production environment, the IAM roles should follow the standard security of granting least privilege. For more information, see IAM Policies for Amazon Redshift Spectrum.

Attach the newly created role to the Amazon Redshift cluster. For more information, see Associate the IAM Role with Your Cluster.

Next, connect to the Amazon Redshift cluster, and create an external schema and database:

create external schema if not exists spectrum_schema
from data catalog 
database 'spectrum_db' 
region 'us-east-1'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/RedshiftSpectrumRole'
create external database if not exists;

Don’t forget to replace the IAM role in the statement.

Then create an external table within the database:

 CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_sales(
  ItemID int,
  Category varchar,
  Price DOUBLE PRECISION,
  Quantity int,
  OrderDate TIMESTAMP,
  DestinationState varchar,
  ShippingType varchar,
  Referral varchar)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://{BUCKET_NAME}/CDC/'

Query the table, and it should contain data. This is a fact table.

select top 10 * from spectrum_schema.ecommerce_sales

 

Next, create a dimension table. For this example, we create a date/time dimension table. Create the table:

CREATE TABLE date_dimension (
  d_datekey           integer       not null sortkey,
  d_dayofmonth        integer       not null,
  d_monthnum          integer       not null,
  d_dayofweek                varchar(10)   not null,
  d_prettydate        date       not null,
  d_quarter           integer       not null,
  d_half              integer       not null,
  d_year              integer       not null,
  d_season            varchar(10)   not null,
  d_fiscalyear        integer       not null)
diststyle all;

Populate the table with data:

copy date_dimension from 's3://reparmar-lab/2016dates' 
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/redshiftspectrum'
DELIMITER ','
dateformat 'auto';

The date dimension table should look like the following:

Querying data in local and external tables using Amazon Redshift

Now that you have the fact and dimension table populated with data, you can combine the two and run analysis. For example, if you want to query the total sales amount by weekday, you can run the following:

select sum(quantity*price) as total_sales, date_dimension.d_season
from spectrum_schema.ecommerce_sales 
join date_dimension on spectrum_schema.ecommerce_sales.orderdate = date_dimension.d_prettydate 
group by date_dimension.d_season

You get the following results:

Similarly, you can replace d_season with d_dayofweek to get sales figures by weekday:

With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to use file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost.

Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Amazon Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Amazon Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of the supported compression algorithms in Amazon Redshift Spectrum, less data is scanned.

Analyzing and visualizing Amazon Redshift data in Amazon QuickSight

Modify the Amazon Redshift security group to allow an Amazon QuickSight connection. For more information, see Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters.

After modifying the Amazon Redshift security group, go to Amazon QuickSight. Create a new analysis, and choose Amazon Redshift as the data source.

Enter the database connection details, validate the connection, and create the data source.

Choose the schema to be analyzed. In this case, choose spectrum_schema, and then choose the ecommerce_sales table.

Next, we add a custom field for Total Sales = Price*Quantity. In the drop-down list for the ecommerce_sales table, choose Edit analysis data sets.

On the next screen, choose Edit.

In the data prep screen, choose New Field. Add a new calculated field Total Sales $, which is the product of the Price*Quantity fields. Then choose Create. Save and visualize it.

Next, to visualize total sales figures by month, create a graph with Total Sales on the x-axis and Order Data formatted as month on the y-axis.

After you’ve finished, you can use Amazon QuickSight to add different columns from your Amazon Redshift tables and perform different types of visualizations. You can build operational dashboards that continuously monitor your transactional and analytical data. You can publish these dashboards and share them with others.

Final notes

Amazon QuickSight can also read data in Amazon S3 directly. However, with the method demonstrated in this post, you have the option to manipulate, filter, and combine data from multiple sources or Amazon Redshift tables before visualizing it in Amazon QuickSight.

In this example, we dealt with data being inserted, but triggers can be activated in response to an INSERT, UPDATE, or DELETE trigger.

Keep the following in mind:

  • Be careful when invoking a Lambda function from triggers on tables that experience high write traffic. This would result in a large number of calls to your Lambda function. Although calls to the lambda_async procedure are asynchronous, triggers are synchronous.
  • A statement that results in a large number of trigger activations does not wait for the call to the AWS Lambda function to complete. But it does wait for the triggers to complete before returning control to the client.
  • Similarly, you must account for Amazon Kinesis Data Firehose limits. By default, Kinesis Data Firehose is limited to a maximum of 5,000 records/second. For more information, see Monitoring Amazon Kinesis Data Firehose.

In certain cases, it may be optimal to use AWS Database Migration Service (AWS DMS) to capture data changes in Aurora and use Amazon S3 as a target. For example, AWS DMS might be a good option if you don’t need to transform data from Amazon Aurora. The method used in this post gives you the flexibility to transform data from Aurora using Lambda before sending it to Amazon S3. Additionally, the architecture has the benefits of being serverless, whereas AWS DMS requires an Amazon EC2 instance for replication.

For design considerations while using Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data.

If you have questions or suggestions, please comment below.


Additional Reading

If you found this post useful, be sure to check out Capturing Data Changes in Amazon Aurora Using AWS Lambda and 10 Best Practices for Amazon Redshift Spectrum


About the Authors

Re Alvarez-Parmar is a solutions architect for Amazon Web Services. He helps enterprises achieve success through technical guidance and thought leadership. In his spare time, he enjoys spending time with his two kids and exploring outdoors.

 

 

 

Now Available: New Digital Training to Help You Learn About AWS Big Data Services

Post Syndicated from Sara Snedeker original https://aws.amazon.com/blogs/big-data/now-available-new-digital-training-to-help-you-learn-about-aws-big-data-services/

AWS Training and Certification recently released free digital training courses that will make it easier for you to build your cloud skills and learn about using AWS Big Data services. This training includes courses like Introduction to Amazon EMR and Introduction to Amazon Athena.

You can get free and unlimited access to more than 100 new digital training courses built by AWS experts at aws.training. It’s easy to access training related to big data. Just choose the Analytics category on our Find Training page to browse through the list of courses. You can also use the keyword filter to search for training for specific AWS offerings.

Recommended training

Just getting started, or looking to learn about a new service? Check out the following digital training courses:

Introduction to Amazon EMR (15 minutes)
Covers the available tools that can be used with Amazon EMR and the process of creating a cluster. It includes a demonstration of how to create an EMR cluster.

Introduction to Amazon Athena (10 minutes)
Introduces the Amazon Athena service along with an overview of its operating environment. It covers the basic steps in implementing Athena and provides a brief demonstration.

Introduction to Amazon QuickSight (10 minutes)
Discusses the benefits of using Amazon QuickSight and how the service works. It also includes a demonstration so that you can see Amazon QuickSight in action.

Introduction to Amazon Redshift (10 minutes)
Walks you through Amazon Redshift and its core features and capabilities. It also includes a quick overview of relevant use cases and a short demonstration.

Introduction to AWS Lambda (10 minutes)
Discusses the rationale for using AWS Lambda, how the service works, and how you can get started using it.

Introduction to Amazon Kinesis Analytics (10 minutes)
Discusses how Amazon Kinesis Analytics collects, processes, and analyzes streaming data in real time. It discusses how to use and monitor the service and explores some use cases.

Introduction to Amazon Kinesis Streams (15 minutes)
Covers how Amazon Kinesis Streams is used to collect, process, and analyze real-time streaming data to create valuable insights.

Introduction to AWS IoT (10 minutes)
Describes how the AWS Internet of Things (IoT) communication architecture works, and the components that make up AWS IoT. It discusses how AWS IoT works with other AWS services and reviews a case study.

Introduction to AWS Data Pipeline (10 minutes)
Covers components like tasks, task runner, and pipeline. It also discusses what a pipeline definition is, and reviews the AWS services that are compatible with AWS Data Pipeline.

Go deeper with classroom training

Want to learn more? Enroll in classroom training to learn best practices, get live feedback, and hear answers to your questions from an instructor.

Big Data on AWS (3 days)
Introduces you to cloud-based big data solutions such as Amazon EMR, Amazon Redshift, Amazon Kinesis, and the rest of the AWS big data platform.

Data Warehousing on AWS (3 days)
Introduces you to concepts, strategies, and best practices for designing a cloud-based data warehousing solution, and demonstrates how to collect, store, and prepare data for the data warehouse.

Building a Serverless Data Lake (1 day)
Teaches you how to design, build, and operate a serverless data lake solution with AWS services. Includes topics such as ingesting data from any data source at large scale, storing the data securely and durably, using the right tool to process large volumes of data, and understanding the options available for analyzing the data in near-real time.

More training coming in 2018

We’re always evaluating and expanding our training portfolio, so stay tuned for more training options in the new year. You can always visit us at aws.training to explore our latest offerings.

Set Up a Continuous Delivery Pipeline for Containers Using AWS CodePipeline and Amazon ECS

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/set-up-a-continuous-delivery-pipeline-for-containers-using-aws-codepipeline-and-amazon-ecs/

This post contributed by Abby FullerAWS Senior Technical Evangelist

Last week, AWS announced support for Amazon Elastic Container Service (ECS) targets (including AWS Fargate) in AWS CodePipeline. This support makes it easier to create a continuous delivery pipeline for container-based applications and microservices.

Building and deploying containerized services manually is slow and prone to errors. Continuous delivery with automated build and test mechanisms helps detect errors early, saves time, and reduces failures, making this a popular model for application deployments. Previously, to automate your container workflows with ECS, you had to build your own solution using AWS CloudFormation. Now, you can integrate CodePipeline and CodeBuild with ECS to automate your workflows in just a few steps.

A typical continuous delivery workflow with CodePipeline, CodeBuild, and ECS might look something like the following:

  • Choosing your source
  • Building your project
  • Deploying your code

We also have a continuous deployment reference architecture on GitHub for this workflow.

Getting Started

First, create a new project with CodePipeline and give the project a name, such as “demo”.

Next, choose a source location where the code is stored. This could be AWS CodeCommit, GitHub, or Amazon S3. For this example, enter GitHub and then give CodePipeline access to the repository.

Next, add a build step. You can import an existing build, such as a Jenkins server URL or CodeBuild project, or create a new step with CodeBuild. If you don’t have an existing build project in CodeBuild, create one from within CodePipeline:

  • Build provider: AWS CodeBuild
  • Configure your project: Create a new build project
  • Environment image: Use an image managed by AWS CodeBuild
  • Operating system: Ubuntu
  • Runtime: Docker
  • Version: aws/codebuild/docker:1.12.1
  • Build specification: Use the buildspec.yml in the source code root directory

Now that you’ve created the CodeBuild step, you can use it as an existing project in CodePipeline.

Next, add a deployment provider. This is where your built code is placed. It can be a number of different options, such as AWS CodeDeploy, AWS Elastic Beanstalk, AWS CloudFormation, or Amazon ECS. For this example, connect to Amazon ECS.

For CodeBuild to deploy to ECS, you must create an image definition JSON file. This requires adding some instructions to the pre-build, build, and post-build phases of the CodeBuild build process in your buildspec.yml file. For help with creating the image definition file, see Step 1 of the Tutorial: Continuous Deployment with AWS CodePipeline.

  • Deployment provider: Amazon ECS
  • Cluster name: enter your project name from the build step
  • Service name: web
  • Image filename: enter your image definition filename (“web.json”).

You are almost done!

You can now choose an existing IAM service role that CodePipeline can use to access resources in your account, or let CodePipeline create one. For this example, use the wizard, and go with the role that it creates (AWS-CodePipeline-Service).

Finally, review all of your changes, and choose Create pipeline.

After the pipeline is created, you’ll have a model of your entire pipeline where you can view your executions, add different tests, add manual approvals, or release a change.

You can learn more in the AWS CodePipeline User Guide.

Happy automating!

HiveMQ 3.3.2 released

Post Syndicated from The HiveMQ Team original https://www.hivemq.com/blog/hivemq-3-3-2-released/

The HiveMQ team is pleased to announce the availability of HiveMQ 3.3.2. This is a maintenance release for the 3.3 series and brings the following improvements:

  • Improved unsubscribe performance
  • Webinterface now shows a warning if different HiveMQ versions are in a cluster
  • At Startup HiveMQ shows the enabled cipher suites and protocols for TLS listeners in the log
  • The Web UI Dashboard now shows MQTT Publishes instead of all MQTT messages in the graphs
  • Updated integrated native SSL/TLS library to latest version
  • Improved message ordering while the cluster topology changes
  • Fixed a cosmetic NullPointerException with background cleanup jobs
  • Fixed an issue where Web UI Popups could not be closed on IE/Edge and Safari
  • Fixed an issue which could lead to an IllegalArgumentException with a QoS 0 message in a rare edge-case
  • Improved persistence migrations for updating single HiveMQ deployments
  • Fixed a reference counting issue
  • Fixed an issue with rolling upgrades if the AsyncMetricService is used while the update is in progress

You can download the new HiveMQ version here.

We recommend to upgrade if you are an HiveMQ 3.3.x user.

Have a great day,
The HiveMQ Team

[$] Demystifying container runtimes

Post Syndicated from jake original https://lwn.net/Articles/741897/rss

As we briefly mentioned in our overview article about
KubeCon + CloudNativeCon, there are multiple container “runtimes”, which are
programs that can create and execute containers that are typically fetched
from online
images. That space is slowly reaching maturity both in terms
of standards and implementation: Docker’s containerd 1.0 was released
during KubeCon, CRI-O 1.0 was released a few months ago, and rkt is
also still in the game. With all of those runtimes, it may be a confusing
time for those looking at deploying their own container-based system
or Kubernetes cluster from
scratch. This article will try to explain
what container runtimes are, what they do, how they compare with each other, and
how to choose the right one. It also provides a primer on container
specifications and standards.

Power data ingestion into Splunk using Amazon Kinesis Data Firehose

Post Syndicated from Tarik Makota original https://aws.amazon.com/blogs/big-data/power-data-ingestion-into-splunk-using-amazon-kinesis-data-firehose/

In late September, during the annual Splunk .conf, Splunk and Amazon Web Services (AWS) jointly announced that Amazon Kinesis Data Firehose now supports Splunk Enterprise and Splunk Cloud as a delivery destination. This native integration between Splunk Enterprise, Splunk Cloud, and Amazon Kinesis Data Firehose is designed to make AWS data ingestion setup seamless, while offering a secure and fault-tolerant delivery mechanism. We want to enable customers to monitor and analyze machine data from any source and use it to deliver operational intelligence and optimize IT, security, and business performance.

With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.

Push vs. Pull data ingestion

Presently, customers use a combination of two ingestion patterns, primarily based on data source and volume, in addition to existing company infrastructure and expertise:

  1. Pull-based approach: Using dedicated pollers running the popular Splunk Add-on for AWS to pull data from various AWS services such as Amazon CloudWatch or Amazon S3.
  2. Push-based approach: Streaming data directly from AWS to Splunk HTTP Event Collector (HEC) by using AWS Lambda. Examples of applicable data sources include CloudWatch Logs and Amazon Kinesis Data Streams.

The pull-based approach offers data delivery guarantees such as retries and checkpointing out of the box. However, it requires more ops to manage and orchestrate the dedicated pollers, which are commonly running on Amazon EC2 instances. With this setup, you pay for the infrastructure even when it’s idle.

On the other hand, the push-based approach offers a low-latency scalable data pipeline made up of serverless resources like AWS Lambda sending directly to Splunk indexers (by using Splunk HEC). This approach translates into lower operational complexity and cost. However, if you need guaranteed data delivery then you have to design your solution to handle issues such as a Splunk connection failure or Lambda execution failure. To do so, you might use, for example, AWS Lambda Dead Letter Queues.

How about getting the best of both worlds?

Let’s go over the new integration’s end-to-end solution and examine how Kinesis Data Firehose and Splunk together expand the push-based approach into a native AWS solution for applicable data sources.

By using a managed service like Kinesis Data Firehose for data ingestion into Splunk, we provide out-of-the-box reliability and scalability. One of the pain points of the old approach was the overhead of managing the data collection nodes (Splunk heavy forwarders). With the new Kinesis Data Firehose to Splunk integration, there are no forwarders to manage or set up. Data producers (1) are configured through the AWS Management Console to drop data into Kinesis Data Firehose.

You can also create your own data producers. For example, you can drop data into a Firehose delivery stream by using Amazon Kinesis Agent, or by using the Firehose API (PutRecord(), PutRecordBatch()), or by writing to a Kinesis Data Stream configured to be the data source of a Firehose delivery stream. For more details, refer to Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.

You might need to transform the data before it goes into Splunk for analysis. For example, you might want to enrich it or filter or anonymize sensitive data. You can do so using AWS Lambda. In this scenario, Kinesis Data Firehose buffers data from the incoming source data, sends it to the specified Lambda function (2), and then rebuffers the transformed data to the Splunk Cluster. Kinesis Data Firehose provides the Lambda blueprints that you can use to create a Lambda function for data transformation.

Systems fail all the time. Let’s see how this integration handles outside failures to guarantee data durability. In cases when Kinesis Data Firehose can’t deliver data to the Splunk Cluster, data is automatically backed up to an S3 bucket. You can configure this feature while creating the Firehose delivery stream (3). You can choose to back up all data or only the data that’s failed during delivery to Splunk.

In addition to using S3 for data backup, this Firehose integration with Splunk supports Splunk Indexer Acknowledgments to guarantee event delivery. This feature is configured on Splunk’s HTTP Event Collector (HEC) (4). It ensures that HEC returns an acknowledgment to Kinesis Data Firehose only after data has been indexed and is available in the Splunk cluster (5).

Now let’s look at a hands-on exercise that shows how to forward VPC flow logs to Splunk.

How-to guide

To process VPC flow logs, we implement the following architecture.

Amazon Virtual Private Cloud (Amazon VPC) delivers flow log files into an Amazon CloudWatch Logs group. Using a CloudWatch Logs subscription filter, we set up real-time delivery of CloudWatch Logs to an Kinesis Data Firehose stream.

Data coming from CloudWatch Logs is compressed with gzip compression. To work with this compression, we need to configure a Lambda-based data transformation in Kinesis Data Firehose to decompress the data and deposit it back into the stream. Firehose then delivers the raw logs to the Splunk Http Event Collector (HEC).

If delivery to the Splunk HEC fails, Firehose deposits the logs into an Amazon S3 bucket. You can then ingest the events from S3 using an alternate mechanism such as a Lambda function.

When data reaches Splunk (Enterprise or Cloud), Splunk parsing configurations (packaged in the Splunk Add-on for Kinesis Data Firehose) extract and parse all fields. They make data ready for querying and visualization using Splunk Enterprise and Splunk Cloud.

Walkthrough

Install the Splunk Add-on for Amazon Kinesis Data Firehose

The Splunk Add-on for Amazon Kinesis Data Firehose enables Splunk (be it Splunk Enterprise, Splunk App for AWS, or Splunk Enterprise Security) to use data ingested from Amazon Kinesis Data Firehose. Install the Add-on on all the indexers with an HTTP Event Collector (HEC). The Add-on is available for download from Splunkbase.

HTTP Event Collector (HEC)

Before you can use Kinesis Data Firehose to deliver data to Splunk, set up the Splunk HEC to receive the data. From Splunk web, go to the Setting menu, choose Data Inputs, and choose HTTP Event Collector. Choose Global Settings, ensure All tokens is enabled, and then choose Save. Then choose New Token to create a new HEC endpoint and token. When you create a new token, make sure that Enable indexer acknowledgment is checked.

When prompted to select a source type, select aws:cloudwatch:vpcflow.

Create an S3 backsplash bucket

To provide for situations in which Kinesis Data Firehose can’t deliver data to the Splunk Cluster, we use an S3 bucket to back up the data. You can configure this feature to back up all data or only the data that’s failed during delivery to Splunk.

Note: Bucket names are unique. Thus, you can’t use tmak-backsplash-bucket.

aws s3 create-bucket --bucket tmak-backsplash-bucket --create-bucket-configuration LocationConstraint=ap-northeast-1

Create an IAM role for the Lambda transform function

Firehose triggers an AWS Lambda function that transforms the data in the delivery stream. Let’s first create a role for the Lambda function called LambdaBasicRole.

Note: You can also set this role up when creating your Lambda function.

$ aws iam create-role --role-name LambdaBasicRole --assume-role-policy-document file://TrustPolicyForLambda.json

Here is TrustPolicyForLambda.json.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

 

After the role is created, attach the managed Lambda basic execution policy to it.

$ aws iam attach-role-policy 
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole 
  --role-name LambdaBasicRole

 

Create a Firehose Stream

On the AWS console, open the Amazon Kinesis service, go to the Firehose console, and choose Create Delivery Stream.

In the next section, you can specify whether you want to use an inline Lambda function for transformation. Because incoming CloudWatch Logs are gzip compressed, choose Enabled for Record transformation, and then choose Create new.

From the list of the available blueprint functions, choose Kinesis Data Firehose CloudWatch Logs Processor. This function unzips data and place it back into the Firehose stream in compliance with the record transformation output model.

Enter a name for the Lambda function, choose Choose an existing role, and then choose the role you created earlier. Then choose Create Function.

Go back to the Firehose Stream wizard, choose the Lambda function you just created, and then choose Next.

Select Splunk as the destination, and enter your Splunk Http Event Collector information.

Note: Amazon Kinesis Data Firehose requires the Splunk HTTP Event Collector (HEC) endpoint to be terminated with a valid CA-signed certificate matching the DNS hostname used to connect to your HEC endpoint. You receive delivery errors if you are using a self-signed certificate.

In this example, we only back up logs that fail during delivery.

To monitor your Firehose delivery stream, enable error logging. Doing this means that you can monitor record delivery errors.

Create an IAM role for the Firehose stream by choosing Create new, or Choose. Doing this brings you to a new screen. Choose Create a new IAM role, give the role a name, and then choose Allow.

If you look at the policy document, you can see that the role gives Kinesis Data Firehose permission to publish error logs to CloudWatch, execute your Lambda function, and put records into your S3 backup bucket.

You now get a chance to review and adjust the Firehose stream settings. When you are satisfied, choose Create Stream. You get a confirmation once the stream is created and active.

Create a VPC Flow Log

To send events from Amazon VPC, you need to set up a VPC flow log. If you already have a VPC flow log you want to use, you can skip to the “Publish CloudWatch to Kinesis Data Firehose” section.

On the AWS console, open the Amazon VPC service. Then choose VPC, Your VPC, and choose the VPC you want to send flow logs from. Choose Flow Logs, and then choose Create Flow Log. If you don’t have an IAM role that allows your VPC to publish logs to CloudWatch, choose Set Up Permissions and Create new role. Use the defaults when presented with the screen to create the new IAM role.

Once active, your VPC flow log should look like the following.

Publish CloudWatch to Kinesis Data Firehose

When you generate traffic to or from your VPC, the log group is created in Amazon CloudWatch. The new log group has no subscription filter, so set up a subscription filter. Setting this up establishes a real-time data feed from the log group to your Firehose delivery stream.

At present, you have to use the AWS Command Line Interface (AWS CLI) to create a CloudWatch Logs subscription to a Kinesis Data Firehose stream. However, you can use the AWS console to create subscriptions to Lambda and Amazon Elasticsearch Service.

To allow CloudWatch to publish to your Firehose stream, you need to give it permissions.

$ aws iam create-role --role-name CWLtoKinesisFirehoseRole --assume-role-policy-document file://TrustPolicyForCWLToFireHose.json


Here is the content for TrustPolicyForCWLToFireHose.json.

{
  "Statement": {
    "Effect": "Allow",
    "Principal": { "Service": "logs.us-east-1.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }
}

 

Attach the policy to the newly created role.

$ aws iam put-role-policy 
    --role-name CWLtoKinesisFirehoseRole 
    --policy-name Permissions-Policy-For-CWL 
    --policy-document file://PermissionPolicyForCWLToFireHose.json

Here is the content for PermissionPolicyForCWLToFireHose.json.

{
    "Statement":[
      {
        "Effect":"Allow",
        "Action":["firehose:*"],
        "Resource":["arn:aws:firehose:us-east-1:YOUR-AWS-ACCT-NUM:deliverystream/ FirehoseSplunkDeliveryStream"]
      },
      {
        "Effect":"Allow",
        "Action":["iam:PassRole"],
        "Resource":["arn:aws:iam::YOUR-AWS-ACCT-NUM:role/CWLtoKinesisFirehoseRole"]
      }
    ]
}

Finally, create a subscription filter.

$ aws logs put-subscription-filter 
   --log-group-name " /vpc/flowlog/FirehoseSplunkDemo" 
   --filter-name "Destination" 
   --filter-pattern "" 
   --destination-arn "arn:aws:firehose:us-east-1:YOUR-AWS-ACCT-NUM:deliverystream/FirehoseSplunkDeliveryStream" 
   --role-arn "arn:aws:iam::YOUR-AWS-ACCT-NUM:role/CWLtoKinesisFirehoseRole"

When you run the AWS CLI command preceding, you don’t get any acknowledgment. To validate that your CloudWatch Log Group is subscribed to your Firehose stream, check the CloudWatch console.

As soon as the subscription filter is created, the real-time log data from the log group goes into your Firehose delivery stream. Your stream then delivers it to your Splunk Enterprise or Splunk Cloud environment for querying and visualization. The screenshot following is from Splunk Enterprise.

In addition, you can monitor and view metrics associated with your delivery stream using the AWS console.

Conclusion

Although our walkthrough uses VPC Flow Logs, the pattern can be used in many other scenarios. These include ingesting data from AWS IoT, other CloudWatch logs and events, Kinesis Streams or other data sources using the Kinesis Agent or Kinesis Producer Library. We also used Lambda blueprint Kinesis Data Firehose CloudWatch Logs Processor to transform streaming records from Kinesis Data Firehose. However, you might need to use a different Lambda blueprint or disable record transformation entirely depending on your use case. For an additional use case using Kinesis Data Firehose, check out This is My Architecture Video, which discusses how to securely centralize cross-account data analytics using Kinesis and Splunk.

 


Additional Reading

If you found this post useful, be sure to check out Integrating Splunk with Amazon Kinesis Streams and Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review.


About the Authors

Tarik Makota is a solutions architect with the Amazon Web Services Partner Network. He provides technical guidance, design advice and thought leadership to AWS’ most strategic software partners. His career includes work in an extremely broad software development and architecture roles across ERP, financial printing, benefit delivery and administration and financial services. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

 

 

 

Roy Arsan is a solutions architect in the Splunk Partner Integrations team. He has a background in product development, cloud architecture, and building consumer and enterprise cloud applications. More recently, he has architected Splunk solutions on major cloud providers, including an AWS Quick Start for Splunk that enables AWS users to easily deploy distributed Splunk Enterprise straight from their AWS console. He’s also the co-author of the AWS Lambda blueprints for Splunk. He holds an M.S. in Computer Science Engineering from the University of Michigan.

 

 

 

timeShift(GrafanaBuzz, 1w) Issue 26

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/12/15/timeshiftgrafanabuzz-1w-issue-26/

Welcome to TimeShift

Big news this week: Grafana v5.0 has been merged into master and is available in the nightly builds! We are really excited to share this with the community, and look forward to receiving community feedback (good or bad) on the new features and enhancements. As you see in the video below, there are some big changes that aim to improve workflow, team organization, permissions, and overall user experience. Check out the video below to see it in action, and give it a spin yourself.

  • New Grid Layout Engine: Make it easier to build dashboards and enable more complex layouts
  • Dashboard Folders & Permissions
  • User Teams
  • Improved Dashboard Settings UX
  • Improved Page Design and Navigation

NOTE: That’s actually Torkel Odegaard, creator of Grafana shredding on the soundtrack!


Latest Stable Release

Grafana 4.6.3 is available and includes some bug fixes:

  • Gzip: Fixes bug Gravatar images when gzip was enabled #5952
  • Alert list: Now shows alert state changes even after adding manual annotations on dashboard #99513
  • Alerting: Fixes bug where rules evaluated as firing when all conditions was false and using OR operator. #93183
  • Cloudwatch: CloudWatch no longer display metrics’ default alias #101514, thx @mtanda

Download Grafana 4.6.3 Now


From the Blogosphere

Monitoring MySQL with Prometheus and Grafana: Julien Pivotto (who will be speaking at GrafanaCon EU), gave a great presentation last month on Monitoring MySQL with Prometheus and Grafana. You can also check out his slides.

Monitor your Docker Containers: docker stats doesn’t often give you the level of insight you need to effectively manage your containers. This article discuses how to use cAdvisor, Prometheus and Grafana to get a handle on your Docker performance.

Magento Performance Monitoring with Grafana Dashboards and Alerts: This Christmas-themed post walks you through how to monitor the performance of Magento, start building dashboards, and setup Slack alerts, all while sitting in your rocking chair, sipping eggnog.

Icinga Web2 and Grafana Working Together: This is a follow-up post about displaying service performance data from Icinga2 in Grafana. Now that we know how to list the services on a dashboard, it would be helpful to filter this list so that specific teams can know the status of services they specifically manage.

Setup of sitespeed in AWS with Peter Hedenskog: In this video, Peter Hedenskop from Wikimedia and Stefan Judis set up a video call to go over setting up sitespeed in AWS. They create a fully functional Grafana dashboard, including web performance metrics from Stefan’s personal website running in the cloud.

Deploying Grafana to Access Zabbix in Alibaba Cloud ECS: This article walks you through how to deploy Grafana on Alibaba Cloud ECS to access Zabbix to visualize performance data for your website or application.

Let’s Summarize the Test Results with Grafana Annotations + Prometheus: The engineers of NTT Communications Corporation have created something of an Advent Calendar, with new posts each day. December 14th’s post focused on Grafana’s new annotation functionality via the UI and the API.


New Speakers Added!

We have added new speakers, and talk titles to the lineup at grafanacon.org. Only a few left to include, which should be added in the next few days.

Join us March 1-2, 2018 in Amsterdam for 2 days of talks centered around Grafana and the surrounding monitoring ecosystem including Graphite, Prometheus, InfluxData, Elasticsearch, Kubernetes, and many other topics.

This year we have speakers from Bloomberg, CERN, Tinder, Red Hat, Prometheus, InfluxData, Fastly, Automattic, Percona, and more!

Get Your Ticket Now


Grafana Plugins

This week we have a new plugin for the popular IoT platform DeviceHive, and an update to our own Kubernetes App. To install or update any plugin in an on-prem Grafana instance, use the Grafana-cli tool, or install and update with 1 click on Hosted Grafana.

NEW PLUGIN

DeviceHive is an IOT Platform and now has a data source plugin, which means you can visualize the live commands and notifications from a device.


Install Now

UPDATED PLUGIN

Kubernetes App – The Grafana Kubernetes App allows you to monitor your Kubernetes cluster’s performance. It includes 4 dashboards, Cluster, Node, Pod/Container and Deployment, and also comes with Intel Snap collectors that are deployed to your cluster to collect health metrics.


Update


Upcoming Events:

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. Carl Bergquist is managing the Cloud and Monitoring Devroom, and we’ve heard there were some great talks submitted. There is no need to register; all are welcome.


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove


Ok, ok – This tweet isn’t showing a off a dashboard, but we can’t help but be thrilled when someone post about our poster series. We’ll be working on the fourth poster to be unveiled at GrafanaCon EU!


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

Let us know what you think about timeShift. Submit a comment on this article below, or post something at our community forum. Find an article I haven’t included? Send it my way. Help us make timeShift better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

MQTT 5: Introduction to MQTT 5

Post Syndicated from The HiveMQ Team original https://www.hivemq.com/blog/mqtt-5-introduction-to-mqtt-5/

MQTT 5 Introduction

Introduction to MQTT 5

Welcome to our brand new blog post series MQTT 5 – Features and Hidden Gems. Without doubt, the MQTT protocol is the most popular and best received Internet of Things protocol as of today (see the Google Trends Chart below), supporting large scale use cases ranging from Connected Cars, Manufacturing Systems, Logistics, Military Use Cases to Enterprise Chat Applications, Mobile Apps and connecting constrained IoT devices. Of course, with huge amounts of production deployments, the wish list for future versions of the MQTT protocol grew bigger and bigger.

MQTT 5 is by far the most extensive and most feature-rich update to the MQTT protocol specification ever. We are going to explore all hidden gems and protocol features with use case discussion and useful background information – one blog post at a time.

Be sure to read the MQTT Essentials Blog Post series first before diving into our new MQTT 5 series. To get the most out of the new blog posts, it’s important to have a basic understanding of the MQTT 3.1.1 protocol as we are going to highlight key changes as well as all improvements.

Running Windows Containers on Amazon ECS

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/running-windows-containers-on-amazon-ecs/

This post was developed and written by Jeremy Cowan, Thomas Fuller, Samuel Karp, and Akram Chetibi.

Containers have revolutionized the way that developers build, package, deploy, and run applications. Initially, containers only supported code and tooling for Linux applications. With the release of Docker Engine for Windows Server 2016, Windows developers have started to realize the gains that their Linux counterparts have experienced for the last several years.

This week, we’re adding support for running production workloads in Windows containers using Amazon Elastic Container Service (Amazon ECS). Now, Amazon ECS provides an ECS-Optimized Windows Server Amazon Machine Image (AMI). This AMI is based on the EC2 Windows Server 2016 AMI, and includes Docker 17.06 Enterprise Edition and the ECS Agent 1.16. This AMI provides improved instance and container launch time performance. It’s based on Windows Server 2016 Datacenter and includes Docker 17.06.2-ee-5, along with a new version of the ECS agent that now runs as a native Windows service.

In this post, I discuss the benefits of this new support, and walk you through getting started running Windows containers with Amazon ECS.

When AWS released the Windows Server 2016 Base with Containers AMI, the ECS agent ran as a process that made it difficult to monitor and manage. As a service, the agent can be health-checked, managed, and restarted no differently than other Windows services. The AMI also includes pre-cached images for Windows Server Core 2016 and Windows Server Nano Server 2016. By caching the images in the AMI, launching new Windows containers is significantly faster. When Docker images include a layer that’s already cached on the instance, Docker re-uses that layer instead of pulling it from the Docker registry.

The ECS agent and an accompanying ECS PowerShell module used to install, configure, and run the agent come pre-installed on the AMI. This guarantees there is a specific platform version available on the container instance at launch. Because the software is included, you don’t have to download it from the internet. This saves startup time.

The Windows-compatible ECS-optimized AMI also reports CPU and memory utilization and reservation metrics to Amazon CloudWatch. Using the CloudWatch integration with ECS, you can create alarms that trigger dynamic scaling events to automatically add or remove capacity to your EC2 instances and ECS tasks.

Getting started

To help you get started running Windows containers on ECS, I’ve forked the ECS reference architecture, to build an ECS cluster comprised of Windows instances instead of Linux instances. You can pull the latest version of the reference architecture for Windows.

The reference architecture is a layered CloudFormation stack, in that it calls other stacks to create the environment. Within the stack, the ecs-windows-cluster.yaml file contains the instructions for bootstrapping the Windows instances and configuring the ECS cluster. To configure the instances outside of AWS CloudFormation (for example, through the CLI or the console), you can add the following commands to your instance’s user data:

Import-Module ECSTools
Initialize-ECSAgent

Or

Import-Module ECSTools
Initialize-ECSAgent –Cluster MyCluster -EnableIAMTaskRole

If you don’t specify a cluster name when you initialize the agent, the instance is joined to the default cluster.

Adding -EnableIAMTaskRole when initializing the agent adds support for IAM roles for tasks. Previously, enabling this setting meant running a complex script and setting an environment variable before you could assign roles to your ECS tasks.

When you enable IAM roles for tasks on Windows, it consumes port 80 on the host. If you have tasks that listen on port 80 on the host, I recommend configuring a service for them that uses load balancing. You can use port 80 on the load balancer, and the traffic can be routed to another host port on your container instances. For more information, see Service Load Balancing.

Create a cluster

To create a new ECS cluster, choose Launch stack, or pull the GitHub project to your local machine and run the following command:

aws cloudformation create-stack –template-body file://<path to master-windows.yaml> --stack-name <name>

Upload your container image

Now that you have a cluster running, step through how to build and push an image into a container repository. You use a repository hosted in Amazon Elastic Container Registry (Amazon ECR) for this, but you could also use Docker Hub. To build and push an image to a repository, install Docker on your Windows* workstation. You also create a repository and assign the necessary permissions to the account that pushes your image to Amazon ECR. For detailed instructions, see Pushing an Image.

* If you are building an image that is based on Windows layers, then you must use a Windows environment to build and push your image to the registry.

Write your task definition

Now that your image is built and ready, the next step is to run your Windows containers using a task.

Start by creating a new task definition based on the windows-simple-iis image from Docker Hub.

  1. Open the ECS console.
  2. Choose Task Definitions, Create new task definition.
  3. Scroll to the bottom of the page and choose Configure via JSON.
  4. Copy and paste the following JSON into that field.
  5. Choose Save, Create.
{
   "family": "windows-simple-iis",
   "containerDefinitions": [
   {
     "name": "windows_sample_app",
     "image": "microsoft/iis",
     "cpu": 100,
     "entryPoint":["powershell", "-Command"],
     "command":["New-Item -Path C:\\inetpub\\wwwroot\\index.html -Type file -Value '<html><head><title>Amazon ECS Sample App</title> <style>body {margin-top: 40px; background-color: #333;} </style> </head><body> <div style=color:white;text-align:center><h1>Amazon ECS Sample App</h1> <h2>Congratulations!</h2> <p>Your application is now running on a container in Amazon ECS.</p></body></html>'; C:\\ServiceMonitor.exe w3svc"],
     "portMappings": [
     {
       "protocol": "tcp",
       "containerPort": 80,
       "hostPort": 8080
     }
     ],
     "memory": 500,
     "essential": true
   }
   ]
}

You can now go back into the Task Definition page and see windows-simple-iis as an available task definition.

There are a few important aspects of the task definition file to note when working with Windows containers. First, the hostPort is configured as 8080, which is necessary because the ECS agent currently uses port 80 to enable IAM roles for tasks required for least-privilege security configurations.

There are also some fairly standard task parameters that are intentionally not included. For example, network mode is not available with Windows at the time of this release, so keep that setting blank to allow Docker to configure WinNAT, the only option available today.

Also, some parameters work differently with Windows than they do with Linux. The CPU limits that you define in the task definition are absolute, whereas on Linux they are weights. For information about other task parameters that are supported or possibly different with Windows, see the documentation.

Run your containers

At this point, you are ready to run containers. There are two options to run containers with ECS:

  1. Task
  2. Service

A task is typically a short-lived process that ECS creates. It can’t be configured to actively monitor or scale. A service is meant for longer-running containers and can be configured to use a load balancer, minimum/maximum capacity settings, and a number of other knobs and switches to help ensure that your code keeps running. In both cases, you are able to pick a placement strategy and a specific IAM role for your container.

  1. Select the task definition that you created above and choose Action, Run Task.
  2. Leave the settings on the next page to the default values.
  3. Select the ECS cluster created when you ran the CloudFormation template.
  4. Choose Run Task to start the process of scheduling a Docker container on your ECS cluster.

You can now go to the cluster and watch the status of your task. It may take 5–10 minutes for the task to go from PENDING to RUNNING, mostly because it takes time to download all of the layers necessary to run the microsoft/iis image. After the status is RUNNING, you should see the following results:

You may have noticed that the example task definition is named windows-simple-iis:2. This is because I created a second version of the task definition, which is one of the powerful capabilities of using ECS. You can make the task definitions part of your source code and then version them. You can also roll out new versions and practice blue/green deployment, switching to reduce downtime and improve the velocity of your deployments!

After the task has moved to RUNNING, you can see your website hosted in ECS. Find the public IP or DNS for your ECS host. Remember that you are hosting on port 8080. Make sure that the security group allows ingress from your client IP address to that port and that your VPC has an internet gateway associated with it. You should see a page that looks like the following:

This is a nice start to deploying a simple single instance task, but what if you had a Web API to be scaled out and in based on usage? This is where you could look at defining a service and collecting CloudWatch data to add and remove both instances of the task. You could also use CloudWatch alarms to add more ECS container instances and keep up with the demand. The former is built into the configuration of your service.

  1. Select the task definition and choose Create Service.
  2. Associate a load balancer.
  3. Set up Auto Scaling.

The following screenshot shows an example where you would add an additional task instance when the CPU Utilization CloudWatch metric is over 60% on average over three consecutive measurements. This may not be aggressive enough for your requirements; it’s meant to show you the option to scale tasks the same way you scale ECS instances with an Auto Scaling group. The difference is that these tasks start much faster because all of the base layers are already on the ECS host.

Do not confuse task dynamic scaling with ECS instance dynamic scaling. To add additional hosts, see Tutorial: Scaling Container Instances with CloudWatch Alarms.

Conclusion

This is just scratching the surface of the flexibility that you get from using containers and Amazon ECS. For more information, see the Amazon ECS Developer Guide and ECS Resources.

– Jeremy, Thomas, Samuel, Akram

The re:Invent 2017 Containers After-party Guide

Post Syndicated from Tiffany Jernigan original https://aws.amazon.com/blogs/compute/the-reinvent-2017-containers-after-party-guide/

Feeling uncontainable? re:Invent 2017 might be over, but the containers party doesn’t have to stop. Here are some ways you can keep learning about containers on AWS.

Learn about containers in Austin and New York

Come join AWS this week at KubeCon in Austin, Texas! We’ll be sharing best practices for running Kubernetes on AWS and talking about Amazon ECS, AWS Fargate, and Amazon EKS. Want to take Amazon EKS for a test drive? Sign up for the preview.

We’ll also be talking Containers at the NYC Pop-up Loft during AWS Compute Evolved: Containers Day on December 13th. Register to attend.

Join an upcoming webinar

Didn’t get to attend re:Invent or want to hear a recap? Join our upcoming webinar, What You Missed at re:Invent 2017, on December 11th from 12:00 PM – 12:40 PM PT (3:00 PM – 3:40 PM ET). Register to attend.

Start (or finish) a workshop

All of the containers workshops given at re:Invent are available online. Get comfortable, fire up your browser, and start building!

re:Watch your favorite talks

All of the keynote and breakouts from re:Invent are available to watch on our YouTube playlist. Slides can be found as they are uploaded on the AWS Slideshare. Just slip into your pajamas, make some popcorn, and start watching!

Learn more about what’s new

Andy Jassy announced two big updates to the container landscape at re:Invent, AWS Fargate and Amazon EKS. Here are some resources to help you learn more about all the new features and products we announced, why we built them, and how they work.

AWS Fargate

AWS Fargate is a technology that allows you to run containers without having to manage servers or clusters.

Amazon Elastic Container Service for Kubernetes (Amazon EKS)

Amazon Elastic Container Service for Kubernetes (Amazon EKS) is a managed service that makes it easy for you to run Kubernetes on AWS without needing to configure and operate your own Kubernetes clusters.

We hope you had a great re:Invent and look forward to seeing what you build on AWS in 2018!

– The AWS Containers Team

timeShift(GrafanaBuzz, 1w) Issue 24

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/12/01/timeshiftgrafanabuzz-1w-issue-24/

Welcome to TimeShift

It’s hard to believe it’s already December. Here at Grafana Labs we’ve been spending a lot of time working on new features and enhancements for Grafana v5, and finalizing our selections for GrafanaCon EU. This week we have some interesting articles to share and a number of plugin updates. Enjoy!


Latest Release

Grafana 4.6.2 is now available and includes some bug fixes:

  • Prometheus: Fixes bug with new Prometheus alerts in Grafana. Make sure to download this version if you’re using Prometheus for alerting. More details in the issue. #9777
  • Color picker: Bug after using textbox input field to change/paste color string #9769
  • Cloudwatch: build using golang 1.9.2 #9667, thanks @mtanda
  • Heatmap: Fixed tooltip for “time series buckets” mode #9332
  • InfluxDB: Fixed query editor issue when using > or < operators in WHERE clause #9871

Download Grafana 4.6.2 Now


From the Blogosphere

Monitoring Camel with Prometheus in Red Hat OpenShift: This in-depth walk-through will show you how to build an Apache Camel application from scratch, deploy it in a Kubernetes environment, gather metrics using Prometheus and display them in Grafana.

How to run Grafana with DeviceHive: We see more and more examples of people using Grafana in IoT. This article discusses how to gather data from the IoT platform, DeviceHive, and build useful dashboards.

How to Install Grafana on Linux Servers: Pretty self-explanatory, but this tutorial walks you installing Grafana on Ubuntu 16.04 and CentOS 7. After installation, it covers configuration and plugin installation. This is the first article in an upcoming series about Grafana.

Monitoring your AKS cluster with Grafana: It’s important to know how your application is performing regardless of where it lives; the same applies to Kubernetes. This article focuses on aggregating data from Kubernetes with Heapster and feeding it to a backend for Grafana to visualize.

CoinStatistics: With the price of Bitcoin skyrocketing, more and more people are interested in cryptocurrencies. This is a cool dashboard that has a lot of stats about popular cryptocurrencies, and has a calculator to let you know when you can buy that lambo.

Using OpenNTI As A Collector For Streaming Telemetry From Juniper Devices: Part 1: This series will serve as a quick start guide for getting up and running with streaming real-time telemetry data from Juniper devices. This first article covers some high-level concepts and installation, while part 2 covers configuration options.

How to Get Metrics for Advance Alerting to Prevent Trouble: What good is performance monitoring if you’re never told when something has gone wrong? This article suggests ways to be more proactive to prevent issues and avoid the scramble to troubleshoot issues.

Thoughtworks: Technology Radar: We got a shout-out in the latest Technology Radar in the Tools section, as the dashboard visualization tool of choice for Prometheus!


GrafanaCon Tickets are Going Fast

Tickets are going fast for GrafanaCon EU, but we still have a seat reserved for you. Join us March 1-2, 2018 in Amsterdam for 2 days of talks centered around Grafana and the surrounding monitoring ecosystem including Graphite, Prometheus, InfluxData, Elasticsearch, Kubernetes, and more.

Get Your Ticket Now


Grafana Plugins

We have a number of plugin updates to highlight this week. Authors improve plugins regularly to fix bugs and improve performance, so it’s important to keep your plugins up to date. We’ve made updating easy; for on-prem Grafana, use the Grafana-cli tool, or update with 1 click if you’re using Hosted Grafana.

UPDATED PLUGIN

Clickhouse Data Source – The Clickhouse Data Source received a substantial update this week. It now has support for Ace Editor, which has a reformatting function for the query editor that automatically formats your sql. If you’re using Clickhouse then you should also have a look at CHProxy – see the plugin readme for more details.


Update

UPDATED PLUGIN

Influx Admin Panel – This panel received a number of small fixes. A new version will be coming soon with some new features.

Some of the changes (see the release notes) for more details):

  • Fix issue always showing query results
  • When there is only one row, swap rows/cols (ie: SHOW DIAGNOSTICS)
  • Improve auto-refresh behavior
  • Show ‘message’ response. (ie: please use POST)
  • Fix query time sorting
  • Show ‘status’ field (killed, etc)

Update

UPDATED PLUGIN

Gnocchi Data Source – The latest version of the Gnocchi Data Source adds support for dynamic aggregations.


Update

UPDATED PLUGINS

BT Plugins – All of the BT panel plugins received updates this week.


Upcoming Events:

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We have some awesome talks and events coming soon. Hope to see you at one of these!

KubeCon | Austin, TX – Dec. 6-8, 2017: We’re sponsoring KubeCon 2017! This is the must-attend conference for cloud native computing professionals. KubeCon + CloudNativeCon brings together leading contributors in:

  • Cloud native applications and computing
  • Containers
  • Microservices
  • Central orchestration processing
  • And more

Buy Tickets

FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. Carl Bergquist is managing the Cloud and Monitoring Devroom, and we’ve heard there were some great talks submitted. There is no need to register; all are welcome.


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

YIKES! Glad it’s not – there’s good attention and bad attention.


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

Let us know if you’re finding these weekly roundups valuable. Submit a comment on this article below, or post something at our community forum. Find an article I haven’t included? Send it my way. Help us make timeShift better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

Collect Data Statistics Up to 5x Faster by Analyzing Only Predicate Columns with Amazon Redshift

Post Syndicated from George Caragea original https://aws.amazon.com/blogs/big-data/collect-data-statistics-up-to-5x-faster-by-analyzing-only-predicate-columns-with-amazon-redshift/

Amazon Redshift is a fast, fully managed, petabyte-scale data warehousing service that makes it simple and cost-effective to analyze all of your data. Many of our customers—including Boingo Wireless, Scholastic, Finra, Pinterest, and Foursquare—migrated to Amazon Redshift and achieved agility and faster time to insight, while dramatically reducing costs.

Query optimization and the need for accurate estimates

When a SQL query is submitted to Amazon Redshift, the query optimizer is in charge of generating all the possible ways to execute that query, and picking the fastest one. This can mean evaluating the cost of thousands, if not millions, of different execution plans.

The plan cost is calculated based on estimates of the data characteristics. For example, the characteristics could include the number of rows in each base table, the average width of a variable-length column, the number of distinct values in a column, and the most common values in a column. These estimates (or “statistics”) are computed in advance by running an ANALYZE command, and stored in the system catalog.

How do the query optimizer and ANALYZE work together?

An ideal scenario is to run ANALYZE after every ETL/ingestion job. This way, when running your workload, the query optimizer can use up-to-date data statistics, and choose the most optimal execution plan, given the updates.

However, running the ANALYZE command can add significant overhead to the data ingestion scripts. This can lead to customers not running ANALYZE on their data, and using default or stale estimates. The end result is usually the optimizer choosing a suboptimal execution plan that runs for longer than needed.

Analyzing predicate columns only

When you run a SQL query, the query optimizer requests statistics only on columns used in predicates in the SQL query (join predicates, filters in the WHERE clause and GROUP BY clauses). Consider the following query:

SELECT Avg(salary), 
       Min(hiredate), 
       deptname 
FROM   emp 
WHERE  state = 'CA' 
GROUP  BY deptname; 

In the query above, the optimizer requests statistics only on columns ‘state’ and ‘deptname’, but not on ‘salary’ and ‘hiredate’. If present, statistics on columns ‘salary’ and ‘hiredate’ are ignored, as they do not impact the cost of the execution plans considered.

Based on the optimizer functionality described earlier, the Amazon Redshift ANALYZE command has been updated to optionally collect information only about columns used in previous queries as part of a filter, join condition or a GROUP BY clause, and columns that are part of distribution or sort keys (predicate columns). There’s a recently introduced option for the ANALYZE command that only analyzes predicate columns:

ANALYZE <table name> PREDICATE COLUMNS;

By having Amazon Redshift collect information about predicate columns automatically, and analyzing those columns only, you’re able to reduce the time to run ANALYZE. For example, during the execution of the 99 queries in the TPC-DS workload, only 203 out of the 424 total columns are predicate columns (approximately 48%). By analyzing only the predicate columns for such a workload, the execution time for running ANALYZE can be significantly reduced.

From my experience in the data warehousing space, I have observed that about 20% of columns in a typical use case are marked predicate. In such a case, running ANALYZE PREDICATE COLUMNS can lead to a speedup of up to 5x relative to a full ANALYZE run.

If no information on predicate columns exists in the system (for example, a new table that has not been queried yet), ANALYZE PREDICATE COLUMNS collects statistics on all the columns. When queries on the table are run, Amazon Redshift collects information about predicate column usage, and subsequent runs of ANALYZE PREDICATE COLUMNS only operates on the predicate columns.

If the workload is relatively stable, and the set of predicate columns does not expand continuously over time, I recommend replacing all occurrences of the ANALYZE command with ANALYZE PREDICATE COLUMNS commands in your application and data ingestion code.

Using the Analyze/Vacuum utility

Several AWS customers are using the Analyze/Vacuum utility from the Redshift-Utils package to manage and automate their maintenance operations. By passing the –predicate-cols option to the Analyze/Vacuum utility, you can enable it to use the ANALYZE PREDICATE COLUMNS feature, providing you with the significant changes in overhead in a completely seamless manner.

Enhancements to logging for ANALYZE operations

When running ANALYZE with the PREDICATE COLUMNS option, the type of analyze run (Full vs Predicate Column), as well as information about the predicate columns encountered, is logged in the stl_analyze view:

SELECT status, 
       starttime, 
       prevtime, 
       num_predicate_cols, 
       num_new_predicate_cols 
FROM   stl_analyze;
     status   |    starttime        |   prevtime          | pred_cols | new_pred_cols
--------------+---------------------+---------------------+-----------+---------------
 Full         | 2017-11-09 01:15:47 |                     |         0 |             0
 PredicateCol | 2017-11-09 01:16:20 | 2017-11-09 01:15:47 |         2 |             2

AWS also enhanced the pg_statistic catalog table with two new pieces of information: the time stamp at which a column was marked as “predicate”, and the time stamp at which the column was last analyzed.

The Amazon Redshift documentation provides a view that allows a user to easily see which columns are marked as predicate, when they were marked as predicate, and when a column was last analyzed. For example, for the emp table used above, the output of the view could be as follows:

 SELECT col_name, 
       is_predicate, 
       first_predicate_use, 
       last_analyze 
FROM   predicate_columns 
WHERE  table_name = 'emp';

 col_name | is_predicate | first_predicate_use  |        last_analyze
----------+--------------+----------------------+----------------------------
 id       | f            |                      | 2017-11-09 01:15:47
 name     | f            |                      | 2017-11-09 01:15:47
 deptname | t            | 2017-11-09 01:16:03  | 2017-11-09 01:16:20
 age      | f            |                      | 2017-11-09 01:15:47
 salary   | f            |                      | 2017-11-09 01:15:47
 hiredate | f            |                      | 2017-11-09 01:15:47
 state    | t            | 2017-11-09 01:16:03  | 2017-11-09 01:16:20

Conclusion

After loading new data into an Amazon Redshift cluster, statistics need to be re-computed to guarantee performant query plans. By learning which column statistics are actually being used by the customer’s workload and collecting statistics only on those columns, Amazon Redshift is able to significantly reduce the amount of time needed for table maintenance during data loading workflows.


Additional Reading

Be sure to check out the Top 10 Tuning Techniques for Amazon Redshift, and the Advanced Table Design Playbook: Distribution Styles and Distribution Keys.


About the Author

George Caragea is a Senior Software Engineer with Amazon Redshift. He has been working on MPP Databases for over 6 years and is mainly interested in designing systems at scale. In his spare time, he enjoys being outdoors and on the water in the beautiful Bay Area and finishing the day exploring the rich local restaurant scene.

 

 

Glenn’s Take on re:Invent 2017 Part 1

Post Syndicated from Glenn Gore original https://aws.amazon.com/blogs/architecture/glenns-take-on-reinvent-2017-part-1/

GREETINGS FROM LAS VEGAS

Glenn Gore here, Chief Architect for AWS. I’m in Las Vegas this week — with 43K others — for re:Invent 2017. We have a lot of exciting announcements this week. I’m going to post to the AWS Architecture blog each day with my take on what’s interesting about some of the announcements from a cloud architectural perspective.

Why not start at the beginning? At the Midnight Madness launch on Sunday night, we announced Amazon Sumerian, our platform for VR, AR, and mixed reality. The hype around VR/AR has existed for many years, though for me, it is a perfect example of how a working end-to-end solution often requires innovation from multiple sources. For AR/VR to be successful, we need many components to come together in a coherent manner to provide a great experience.

First, we need lightweight, high-definition goggles with motion tracking that are comfortable to wear. Second, we need to track movement of our body and hands in a 3-D space so that we can interact with virtual objects in the virtual world. Third, we need to build the virtual world itself and populate it with assets and define how the interactions will work and connect with various other systems.

There has been rapid development of the physical devices for AR/VR, ranging from iOS devices to Oculus Rift and HTC Vive, which provide excellent capabilities for the first and second components defined above. With the launch of Amazon Sumerian we are solving for the third area, which will help developers easily build their own virtual worlds and start experimenting and innovating with how to apply AR/VR in new ways.

Already, within 48 hours of Amazon Sumerian being announced, I have had multiple discussions with customers and partners around some cool use cases where VR can help in training simulations, remote-operator controls, or with new ideas around interacting with complex visual data sets, which starts bringing concepts straight out of sci-fi movies into the real (virtual) world. I am really excited to see how Sumerian will unlock the creative potential of developers and where this will lead.

Amazon MQ
I am a huge fan of distributed architectures where asynchronous messaging is the backbone of connecting the discrete components together. Amazon Simple Queue Service (Amazon SQS) is one of my favorite services due to its simplicity, scalability, performance, and the incredible flexibility of how you can use Amazon SQS in so many different ways to solve complex queuing scenarios.

While Amazon SQS is easy to use when building cloud-native applications on AWS, many of our customers running existing applications on-premises required support for different messaging protocols such as: Java Message Service (JMS), .Net Messaging Service (NMS), Advanced Message Queuing Protocol (AMQP), MQ Telemetry Transport (MQTT), Simple (or Streaming) Text Orientated Messaging Protocol (STOMP), and WebSockets. One of the most popular applications for on-premise message brokers is Apache ActiveMQ. With the release of Amazon MQ, you can now run Apache ActiveMQ on AWS as a managed service similar to what we did with Amazon ElastiCache back in 2012. For me, there are two compelling, major benefits that Amazon MQ provides:

  • Integrate existing applications with cloud-native applications without having to change a line of application code if using one of the supported messaging protocols. This removes one of the biggest blockers for integration between the old and the new.
  • Remove the complexity of configuring Multi-AZ resilient message broker services as Amazon MQ provides out-of-the-box redundancy by always storing messages redundantly across Availability Zones. Protection is provided against failure of a broker through to complete failure of an Availability Zone.

I believe that Amazon MQ is a major component in the tools required to help you migrate your existing applications to AWS. Having set up cross-data center Apache ActiveMQ clusters in the past myself and then testing to ensure they work as expected during critical failure scenarios, technical staff working on migrations to AWS benefit from the ease of deploying a fully redundant, managed Apache ActiveMQ cluster within minutes.

Who would have thought I would have been so excited to revisit Apache ActiveMQ in 2017 after using SQS for many, many years? Choice is a wonderful thing.

Amazon GuardDuty
Maintaining application and information security in the modern world is increasingly complex and is constantly evolving and changing as new threats emerge. This is due to the scale, variety, and distribution of services required in a competitive online world.

At Amazon, security is our number one priority. Thus, we are always looking at how we can increase security detection and protection while simplifying the implementation of advanced security practices for our customers. As a result, we released Amazon GuardDuty, which provides intelligent threat detection by using a combination of multiple information sources, transactional telemetry, and the application of machine learning models developed by AWS. One of the biggest benefits of Amazon GuardDuty that I appreciate is that enabling this service requires zero software, agents, sensors, or network choke points. which can all impact performance or reliability of the service you are trying to protect. Amazon GuardDuty works by monitoring your VPC flow logs, AWS CloudTrail events, DNS logs, as well as combing other sources of security threats that AWS is aggregating from our own internal and external sources.

The use of machine learning in Amazon GuardDuty allows it to identify changes in behavior, which could be suspicious and require additional investigation. Amazon GuardDuty works across all of your AWS accounts allowing for an aggregated analysis and ensuring centralized management of detected threats across accounts. This is important for our larger customers who can be running many hundreds of AWS accounts across their organization, as providing a single common threat detection of their organizational use of AWS is critical to ensuring they are protecting themselves.

Detection, though, is only the beginning of what Amazon GuardDuty enables. When a threat is identified in Amazon GuardDuty, you can configure remediation scripts or trigger Lambda functions where you have custom responses that enable you to start building automated responses to a variety of different common threats. Speed of response is required when a security incident may be taking place. For example, Amazon GuardDuty detects that an Amazon Elastic Compute Cloud (Amazon EC2) instance might be compromised due to traffic from a known set of malicious IP addresses. Upon detection of a compromised EC2 instance, we could apply an access control entry restricting outbound traffic for that instance, which stops loss of data until a security engineer can assess what has occurred.

Whether you are a customer running a single service in a single account, or a global customer with hundreds of accounts with thousands of applications, or a startup with hundreds of micro-services with hourly release cycle in a devops world, I recommend enabling Amazon GuardDuty. We have a 30-day free trial available for all new customers of this service. As it is a monitor of events, there is no change required to your architecture within AWS.

Stay tuned for tomorrow’s post on AWS Media Services and Amazon Neptune.

 

Glenn during the Tour du Mont Blanc

AWS Fargate: A Product Overview

Post Syndicated from Deepak Dayama original https://aws.amazon.com/blogs/compute/aws-fargate-a-product-overview/

It was just about three years ago that AWS announced Amazon Elastic Container Service (Amazon ECS), to run and manage containers at scale on AWS. With Amazon ECS, you’ve been able to run your workloads at high scale and availability without having to worry about running your own cluster management and container orchestration software.

Today, AWS announced the availability of AWS Fargate – a technology that enables you to use containers as a fundamental compute primitive without having to manage the underlying instances. With Fargate, you don’t need to provision, configure, or scale virtual machines in your clusters to run containers. Fargate can be used with Amazon ECS today, with plans to support Amazon Elastic Container Service for Kubernetes (Amazon EKS) in the future.

Fargate has flexible configuration options so you can closely match your application needs and granular, per-second billing.

Amazon ECS with Fargate

Amazon ECS enables you to run containers at scale. This service also provides native integration into the AWS platform with VPC networking, load balancing, IAM, Amazon CloudWatch Logs, and CloudWatch metrics. These deep integrations make the Amazon ECS task a first-class object within the AWS platform.

To run tasks, you first need to stand up a cluster of instances, which involves picking the right types of instances and sizes, setting up Auto Scaling, and right-sizing the cluster for performance. With Fargate, you can leave all that behind and focus on defining your application and policies around permissions and scaling.

The same container management capabilities remain available so you can continue to scale your container deployments. With Fargate, the only entity to manage is the task. You don’t need to manage the instances or supporting software like Docker daemon or the Amazon ECS agent.

Fargate capabilities are available natively within Amazon ECS. This means that you don’t need to learn new API actions or primitives to run containers on Fargate.

Using Amazon ECS, Fargate is a launch type option. You continue to define the applications the same way by using task definitions. In contrast, the EC2 launch type gives you more control of your server clusters and provides a broader range of customization options.

For example, a RunTask command example is pasted below with the Fargate launch type:

ecs run-task --launch-type FARGATE --cluster fargate-test --task-definition nginx --network-configuration
"awsvpcConfiguration={subnets=[subnet-b563fcd3]}"

Key features of Fargate

Resource-based pricing and per second billing
You pay by the task size and only for the time for which resources are consumed by the task. The price for CPU and memory is charged on a per-second basis. There is a one-minute minimum charge.

Flexible configurations options
Fargate is available with 50 different combinations of CPU and memory to closely match your application needs. You can use 2 GB per vCPU anywhere up to 8 GB per vCPU for various configurations. Match your workload requirements closely, whether they are general purpose, compute, or memory optimized.

Networking
All Fargate tasks run within your own VPC. Fargate supports the recently launched awsvpc networking mode and the elastic network interface for a task is visible in the subnet where the task is running. This provides the separation of responsibility so you retain full control of networking policies for your applications via VPC features like security groups, routing rules, and NACLs. Fargate also supports public IP addresses.

Load Balancing
ECS Service Load Balancing  for the Application Load Balancer and Network Load Balancer is supported. For the Fargate launch type, you specify the IP addresses of the Fargate tasks to register with the load balancers.

Permission tiers
Even though there are no instances to manage with Fargate, you continue to group tasks into logical clusters. This allows you to manage who can run or view services within the cluster. The task IAM role is still applicable. Additionally, there is a new Task Execution Role that grants Amazon ECS permissions to perform operations such as pushing logs to CloudWatch Logs or pulling image from Amazon Elastic Container Registry (Amazon ECR).

Container Registry Support
Fargate provides seamless authentication to help pull images from Amazon ECR via the Task Execution Role. Similarly, if you are using a public repository like DockerHub, you can continue to do so.

Amazon ECS CLI
The Amazon ECS CLI provides high-level commands to help simplify to create and run Amazon ECS clusters, tasks, and services. The latest version of the CLI now supports running tasks and services with Fargate.

EC2 and Fargate Launch Type Compatibility
All Amazon ECS clusters are heterogeneous – you can run both Fargate and Amazon ECS tasks in the same cluster. This enables teams working on different applications to choose their own cadence of moving to Fargate, or to select a launch type that meets their requirements without breaking the existing model. You can make an existing ECS task definition compatible with the Fargate launch type and run it as a Fargate service, and vice versa. Choosing a launch type is not a one-way door!

Logging and Visibility
With Fargate, you can send the application logs to CloudWatch logs. Service metrics (CPU and Memory utilization) are available as part of CloudWatch metrics. AWS partners for visibility, monitoring and application performance management including Datadog, Aquasec, Splunk, Twistlock, and New Relic also support Fargate tasks.

Conclusion

Fargate enables you to run containers without having to manage the underlying infrastructure. Today, Fargate is availabe for Amazon ECS, and in 2018, Amazon EKS. Visit the Fargate product page to learn more, or get started in the AWS Console.

–Deepak Dayama