Tag Archives: Amazon EMR

How I built a data warehouse using Amazon Redshift and AWS services in record time

Post Syndicated from Stephen Borg original https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

This is a customer post by Stephen Borg, the Head of Big Data and BI at Cerberus Technologies.

Cerberus Technologies, in their own words: Cerberus is a company founded in 2017 by a team of visionary iGaming veterans. Our mission is simple – to offer the best tech solutions through a data-driven and a customer-first approach, delivering innovative solutions that go against traditional forms of working and process. This mission is based on the solid foundations of reliability, flexibility and security, and we intend to fundamentally change the way iGaming and other industries interact with technology.

Over the years, I have developed and created a number of data warehouses from scratch. Recently, I built a data warehouse for the iGaming industry single-handedly. To do it, I used the power and flexibility of Amazon Redshift and the wider AWS data management ecosystem. In this post, I explain how I was able to build a robust and scalable data warehouse without the large team of experts typically needed.

In two of my recent projects, I ran into challenges when scaling our data warehouse using on-premises infrastructure. Data was growing at many tens of gigabytes per day, and query performance was suffering. Scaling required major capital investment for hardware and software licenses, and significant operational costs for maintenance and the technical staff needed to keep it running and performing well. Unfortunately, I couldn’t get the resources needed to scale the infrastructure with data growth, and these projects were abandoned. Thanks to cloud data warehousing, the bottlenecks of infrastructure resources, capital expense, and operational cost have been significantly reduced or have gone away entirely. There is no longer any excuse for letting the obstacles of the past delay delivering timely insights to decision makers, no matter how much data you have.

With Amazon Redshift and AWS, I delivered a cloud data warehouse to the business very quickly, and with a small team: me. I didn’t have to order hardware or software, and I no longer needed to install, configure, tune, or keep up with patches and version updates. Instead, I easily set up a robust data processing pipeline and we were quickly ingesting and analyzing data. Now, my data warehouse team can be extremely lean, and focus more time on bringing in new data and delivering insights. In this post, I show you the AWS services and the architecture that I used.

Handling data feeds

I have several different data sources that provide everything needed to run the business. The data includes activity from our iGaming platform, social media posts, clickstream data, marketing and campaign performance, and customer support engagements.

To handle the diversity of data feeds, I developed abstract integration applications using Docker that run on Amazon EC2 Container Service (Amazon ECS) and feed data to Amazon Kinesis Data Streams. These data streams can be used for real-time analytics. In my system, each record in Kinesis is preprocessed by an AWS Lambda function to cleanse and aggregate the information, and Amazon Kinesis Data Firehose then delivers it to the Amazon S3 locations where I need it stored. Suppose that you used an on-premises architecture to accomplish the same task. A team of data engineers would be required to maintain and monitor a Kafka cluster, develop applications to stream data, and maintain a Hadoop cluster and the infrastructure underneath it for data storage. With my stream processing architecture, there are no servers to manage, no disk drives to replace, and no monitoring code to write.

Setting up a Kinesis data stream takes just a few clicks, and the same is true for Kinesis Data Firehose. Firehose can be configured to automatically consume data from a Kinesis data stream and write compressed data to Amazon S3 every N minutes. When I want to process a Kinesis data stream, it’s easy to set up a Lambda function to be executed on each message received; I can simply set a trigger from the AWS Lambda console.
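The post doesn’t include the Lambda code itself, but to make the preprocessing step concrete, here is a minimal Python sketch of what such a function might look like when triggered by a Kinesis data stream. The field names and cleansing rules are hypothetical; only the event shape (base64-encoded payloads under Records[].kinesis.data) comes from the standard Lambda-Kinesis integration.

import base64
import json

def handler(event, context):
    """Hypothetical cleansing step for records arriving from a Kinesis data stream."""
    cleaned = []
    for record in event["Records"]:
        # Kinesis record payloads are delivered to Lambda base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Example cleansing rules (hypothetical field names): drop records
        # without an event type and normalize its casing.
        if not payload.get("event_type"):
            continue
        payload["event_type"] = payload["event_type"].lower()
        cleaned.append(payload)

    # In the architecture described above, Kinesis Data Firehose batches the
    # cleansed records and writes them to Amazon S3.
    return {"processed": len(cleaned), "received": len(event["Records"])}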

I also monitor the duration of function execution using Amazon CloudWatch and AWS X-Ray.

Regardless of the format in which I receive data from our partners, I can send it to Kinesis as JSON using my own formatters. After Firehose writes it to Amazon S3, I have everything in nearly the same structure I received it in, but compressed, encrypted, and optimized for reading.

This data is automatically crawled by AWS Glue and placed into the AWS Glue Data Catalog. This means that I can immediately query the data directly on S3 using Amazon Athena or through Amazon Redshift Spectrum. Previously, I used Amazon EMR and an Amazon RDS–based metastore in Apache Hive for catalog management. Now I can avoid the complexity of maintaining Hive Metastore catalogs. Glue takes care of high availability and the operations side so that I know that end users can always be productive.

Working with Amazon Athena and Amazon Redshift for analysis

I found Amazon Athena extremely useful out of the box for ad hoc analysis. Our engineers (me) use Athena to understand new datasets that we receive and to understand what transformations will be needed for long-term query efficiency.
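To show what this ad hoc analysis can look like programmatically, here is a hedged boto3 sketch that runs an Amazon Athena query against a table in the AWS Glue Data Catalog. The database, table, and results bucket names are placeholders of my own, not from the post.

import boto3

athena = boto3.client("athena")

# Placeholder database, table, and query-results location.
response = athena.start_query_execution(
    QueryString="SELECT event_type, count(*) AS events FROM clickstream GROUP BY event_type",
    QueryExecutionContext={"Database": "igaming_raw"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/queries/"},
)
print("Started Athena query:", response["QueryExecutionId"])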

For our data analysts and data scientists, we’ve selected Amazon Redshift. Amazon Redshift has proven to be the right tool for us over and over again. It easily processes 20+ million transactions per day, regardless of the footprint of the tables and the type of analytics required by the business. Latency is low and query performance expectations have been more than met. We use Redshift Spectrum for long-term data retention, which enables me to extend the analytic power of Amazon Redshift beyond local data to anything stored in S3, and without requiring me to load any data. Redshift Spectrum gives me the freedom to store data where I want, in the format I want, and have it available for processing when I need it.

To load data directly into Amazon Redshift, I use AWS Data Pipeline to orchestrate data workflows. I create Amazon EMR clusters on an intra-day basis, which I can easily adjust to run more or less frequently as needed throughout the day. EMR clusters are used together with Amazon RDS, Apache Spark 2.0, and S3 storage. The data pipeline application loads ETL configurations from Spring RESTful services hosted on AWS Elastic Beanstalk. The application then loads data from S3 into memory, aggregates and cleans the data, and then writes the final version of the data to Amazon Redshift. This data is then ready to use for analysis. Spark on EMR also helps with recommendations and personalization use cases for various business users, and I find this easy to set up and deliver what users want. Finally, business users use Amazon QuickSight for self-service BI to slice, dice, and visualize the data depending on their requirements.
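The post doesn’t show the Spark job itself, but the shape of the flow is clear: read raw data from S3, aggregate and clean it, and write the result to Amazon Redshift. The following PySpark sketch illustrates that flow under assumptions of my own; the paths, column names, table name, and JDBC connection details are placeholders, and a Redshift-compatible JDBC driver must be available on the cluster.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-to-redshift-etl").getOrCreate()

# Load the raw events that Kinesis Data Firehose wrote to S3 (placeholder path).
raw = spark.read.json("s3://my-ingest-bucket/events/2018/01/")

# Aggregate and clean: hypothetical columns, drop malformed rows, roll up by day.
daily = (raw.dropna(subset=["player_id", "event_type"])
            .withColumn("event_date", F.to_date("event_timestamp"))
            .groupBy("event_date", "player_id", "event_type")
            .count())

# Write the curated result to Amazon Redshift over JDBC (placeholder endpoint
# and credentials).
(daily.write
      .format("jdbc")
      .option("url", "jdbc:redshift://my-cluster.example.eu-west-1.redshift.amazonaws.com:5439/dw")
      .option("dbtable", "analytics.daily_player_events")
      .option("user", "etl_user")
      .option("password", "REPLACE_ME")
      .mode("append")
      .save())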

Each AWS service in this architecture plays its part in saving precious time that’s crucial for delivery and for getting different departments in the business on board. I found the services easy to set up and use, and all have proven highly reliable in our production environments. When the architecture was in place, scaling out was either handled completely by the service or was a matter of a simple API call, and crucially didn’t require me to change a single line of code. Increasing shards for Kinesis can be done in a minute by editing a stream. Increasing capacity for Lambda functions can be accomplished by editing the megabytes allocated for processing, and concurrency is handled automatically. EMR cluster capacity can easily be increased by changing the master and slave node types in Data Pipeline, or by using Auto Scaling. Lastly, RDS and Amazon Redshift can be easily upgraded without any major tasks to be performed by our team (again, me).
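As one concrete illustration of the “simple API call” scaling described above, here is a hedged boto3 sketch that re-shards a Kinesis stream and raises a Lambda function’s memory allocation. The stream and function names are placeholders.

import boto3

kinesis = boto3.client("kinesis")
lambda_client = boto3.client("lambda")

# Uniformly re-shard the stream to a new target shard count (placeholder name).
kinesis.update_shard_count(
    StreamName="ingest-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)

# Raise the memory allocated to a Lambda function (placeholder name).
lambda_client.update_function_configuration(
    FunctionName="record-cleanser",
    MemorySize=1024,
)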

In the end, using AWS services including Kinesis, Lambda, Data Pipeline, and Amazon Redshift allows me to keep my team lean and highly productive. I eliminated the cost and delays of capital infrastructure, as well as the late night and weekend calls for support. I can now give maximum value to the business while keeping operational costs down. My team pushed out an agile and highly responsive data warehouse solution in record time and we can handle changing business requirements rapidly, and quickly adapt to new data and new user requests.


Additional Reading

If you found this post useful, be sure to check out Deploy a Data Warehouse Quickly with Amazon Redshift, Amazon RDS for PostgreSQL and Tableau Server and Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift.


About the Author

Stephen Borg is the Head of Big Data and BI at Cerberus Technologies. He has a background in platform software engineering, and first became involved in data warehousing using the typical RDBMS, SQL, ETL, and BI tools. He quickly became passionate about providing insight to help others optimize the business and add personalization to products.


Build a Multi-Tenant Amazon EMR Cluster with Kerberos, Microsoft Active Directory Integration and EMRFS Authorization

Post Syndicated from Songzhi Liu original https://aws.amazon.com/blogs/big-data/build-a-multi-tenant-amazon-emr-cluster-with-kerberos-microsoft-active-directory-integration-and-emrfs-authorization/

One of the challenges faced by our customers—especially those in highly regulated industries—is balancing the need for security with flexibility. In this post, we cover how to enable multi-tenancy and increase security by using EMRFS (EMR File System) authorization, the Amazon S3 storage-level authorization on Amazon EMR.

Amazon EMR is an easy, fast, and scalable analytics platform enabling large-scale data processing. EMRFS authorization provides Amazon S3 storage-level authorization by configuring EMRFS with multiple IAM roles. With this functionality enabled, different users and groups can share the same cluster and assume their own IAM roles respectively.

Simply put, on Amazon EMR, we can now have an Amazon EC2 role per user assumed at run time instead of one general EC2 role at the cluster level. When a user tries to access Amazon S3 resources, Amazon EMR evaluates the request against a predefined list of mappings in the EMRFS configuration and picks up the right role for the user.

In this post, we discuss what EMRFS authorization is (Amazon S3 storage-level access control) and show how to configure the role mappings with detailed examples, so that you end up with the desired permissions in a multi-tenant environment. We also demo Amazon S3 access from the HDFS command line, Apache Hive on Hue, and Apache Spark.

EMRFS authorization for Amazon S3

There are two prerequisites for using this feature:

  1. Users must be authenticated, because EMRFS needs to map the current user/group/prefix to a predefined user/group/prefix. There are several authentication options. In this post, we launch a Kerberos-enabled cluster that manages the Key Distribution Center (KDC) on the master node, and enable a one-way trust from the KDC to a Microsoft Active Directory domain.
  2. The application must support accessing Amazon S3 via EMRFS. Applications that have their own S3FileSystem APIs (for example, Presto) are not supported at this time.

EMRFS supports three types of mapping entries: user, group, and Amazon S3 prefix. Let’s use an example to show how this works.

Assume that you have the following three identities in your organization, defined in Active Directory: an administrator (admin1), a data engineering group (grp_data_engineering) whose members include dataeng1, and a data science group (grp_data_science) whose members include datascientist1.

To enable all these groups and users to share the EMR cluster, you need to define the following IAM roles: a user role for the administrator (emrfs_auth_user_role_admin_user), a group role for data engineering (emrfs_auth_group_role_data_eng), a group role for data science (emrfs_auth_group_role_data_sci), a prefix role for the shared default Amazon S3 prefix (emrfs_auth_prefix_role_default_s3_prefix), and a restricted base role for the cluster’s EC2 instances (EMR_EC2_RestrictedRole).

In this case, you create a separate Amazon EC2 role that doesn’t grant any permissions to Amazon S3. Let’s call this the base role (the EC2 role attached to the EMR cluster), which in this example is named EMR_EC2_RestrictedRole. You then define all the Amazon S3 permissions for each specific user or group in their own roles. The restricted role serves as the fallback when the user doesn’t match any user or group mapping and isn’t trying to access any of the Amazon S3 prefixes defined in the list.

Important: For all other roles, like emrfs_auth_group_role_data_eng, you need to add the base role (EMR_EC2_RestrictedRole) as a trusted entity so that the base role can assume them. See the following trust policy example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::511586466501:role/EMR_EC2_RestrictedRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The following is an example policy for the admin user role (emrfs_auth_user_role_admin_user):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

We are assuming the admin user has access to all buckets in this example.

The following is an example policy for the data science group role (emrfs_auth_group_role_data_sci):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-science-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

This role grants all Amazon S3 permissions to the emrfs-auth-data-science-bucket-demo bucket and all the objects in it. Similarly, the policy for the role emrfs_auth_group_role_data_eng is shown below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo/*",
                "arn:aws:s3:::emrfs-auth-data-engineering-bucket-demo"
            ],
            "Action": [
                "s3:*"
            ]
        }
    ]
}

Example role mappings configuration

To configure EMRFS authorization, you use an EMR security configuration. The following is the configuration we use in this post.
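The post shows this configuration in the console. As a rough guide to its shape, here is a hedged boto3 sketch that creates an EMR security configuration containing role mappings of the three types discussed below (user, group, and Amazon S3 prefix). The account ID and configuration name are placeholders, and you should verify the field names against the current EMR security configuration schema; the Kerberos trust settings discussed later would live in the same security configuration.

import json
import boto3

emr = boto3.client("emr")

ACCOUNT = "123456789012"  # placeholder account ID

# Mappings are evaluated top to bottom; the first match wins.
role_mappings = [
    {"Role": f"arn:aws:iam::{ACCOUNT}:role/emrfs_auth_user_role_admin_user",
     "IdentifierType": "User", "Identifiers": ["admin1"]},
    {"Role": f"arn:aws:iam::{ACCOUNT}:role/emrfs_auth_group_role_data_eng",
     "IdentifierType": "Group", "Identifiers": ["grp_data_engineering"]},
    {"Role": f"arn:aws:iam::{ACCOUNT}:role/emrfs_auth_group_role_data_sci",
     "IdentifierType": "Group", "Identifiers": ["grp_data_science"]},
    {"Role": f"arn:aws:iam::{ACCOUNT}:role/emrfs_auth_prefix_role_default_s3_prefix",
     "IdentifierType": "Prefix", "Identifiers": ["s3://emrfs-auth-default-bucket-demo/"]},
]

security_configuration = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {"RoleMappings": role_mappings}
    }
}

emr.create_security_configuration(
    Name="EMRFSRoleMappingDemo",  # placeholder name
    SecurityConfiguration=json.dumps(security_configuration),
)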

Consider the following scenario.

First, the admin user admin1 tries to log in and run a command to access Amazon S3 data through EMRFS. The first entry on the mapping list, the user role emrfs_auth_user_role_admin_user, matches and is picked up. Then admin1 has access to the Amazon S3 locations that are defined in this role.

Then a user from the data engineer group (grp_data_engineering) tries to access a data bucket to run some jobs. When EMRFS sees that the user is a member of the grp_data_engineering group, the group role emrfs_auth_group_role_data_eng is assumed, and the user has proper access to Amazon S3 that is defined in the emrfs_auth_group_role_data_eng role.

Next comes a third user, who is not an admin and doesn’t belong to either group. After the first three entries fail to match, EMRFS evaluates whether the user is trying to access an Amazon S3 prefix defined in the last mapping entry. This type of mapping entry is called the prefix type. If the user is trying to access s3://emrfs-auth-default-bucket-demo/, then the prefix mapping is in effect, and the prefix role emrfs_auth_prefix_role_default_s3_prefix is assumed.

If the user is not trying to access any of the Amazon S3 paths that are defined on the list (which means the user failed the evaluation of all the entries), the user only has the permissions defined in EMR_EC2_RestrictedRole. This role is assumed by the EC2 instances in the cluster.

In this process, the mappings are evaluated in the order they are defined; the first role that matches is assumed, and the rest of the list is skipped.
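To make the first-match behavior concrete, here is a small, purely illustrative Python model of the evaluation order described above (this is not EMRFS code, just a sketch of the semantics).

def resolve_role(user, groups, s3_path, mappings, base_role):
    """Illustrative model: return the role of the first matching mapping, else the base role."""
    for m in mappings:
        if m["type"] == "user" and user in m["identifiers"]:
            return m["role"]
        if m["type"] == "group" and set(groups) & set(m["identifiers"]):
            return m["role"]
        if m["type"] == "prefix" and any(s3_path.startswith(p) for p in m["identifiers"]):
            return m["role"]
    return base_role  # for example, EMR_EC2_RestrictedRole

mappings = [
    {"type": "user", "identifiers": ["admin1"], "role": "emrfs_auth_user_role_admin_user"},
    {"type": "group", "identifiers": ["grp_data_engineering"], "role": "emrfs_auth_group_role_data_eng"},
    {"type": "group", "identifiers": ["grp_data_science"], "role": "emrfs_auth_group_role_data_sci"},
    {"type": "prefix", "identifiers": ["s3://emrfs-auth-default-bucket-demo/"],
     "role": "emrfs_auth_prefix_role_default_s3_prefix"},
]

# A data engineer listing the data engineering bucket maps to the group role.
print(resolve_role("dataeng1", ["grp_data_engineering"],
                   "s3://emrfs-auth-data-engineering-bucket-demo/table1/",
                   mappings, "EMR_EC2_RestrictedRole"))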

Setting up an EMR cluster and mapping Active Directory users and groups

Now that we know how EMRFS authorization role mapping works, the next thing we need to think about is how we can use this feature in an easy and manageable way.

Active Directory setup

Many customers manage their users and groups using Microsoft Active Directory or other tools like OpenLDAP. In this post, we create the Active Directory on an Amazon EC2 instance running Windows Server and create the users and groups we will be using in the example below. After setting up Active Directory, we use the Amazon EMR Kerberos auto-join capability to establish a one-way trust from the KDC running on the EMR master node to the Active Directory domain on the EC2 instance. You can use your own directory service as long as it speaks LDAP (Lightweight Directory Access Protocol).

To create and join Active Directory to Amazon EMR, follow the steps in the blog post Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory.

After configuring Active Directory, you can create all the users and groups using the Active Directory tools and add users to the appropriate groups. In this example, we created the users admin1, dataeng1, and datascientist1 and the groups grp_data_engineering and grp_data_science, and then added the users to the right groups.

Join the EMR cluster to an Active Directory domain

For clusters with Kerberos, Amazon EMR now supports automated Active Directory domain joins. You can use the security configuration to configure the one-way trust from the KDC to the Active Directory domain. You also configure the EMRFS role mappings in the same security configuration.

The following is an example of the EMR security configuration with a trusted Active Directory domain EMRKRB.TEST.COM and the EMRFS role mappings as we discussed earlier:

The EMRFS role mapping configuration is shown in this example:

We will also provide an example AWS CLI command that you can run.

Launching the EMR cluster and running the tests

Now you have configured Kerberos and EMRFS authorization for Amazon S3.

Additionally, you need to configure Hue with Active Directory using the Amazon EMR configuration API in order to log in using the AD users created before. The following is an example of Hue AD configuration.

[
  {
    "Classification":"hue-ini",
    "Properties":{

    },
    "Configurations":[
      {
        "Classification":"desktop",
        "Properties":{

        },
        "Configurations":[
          {
            "Classification":"ldap",
            "Properties":{

            },
            "Configurations":[
              {
                "Classification":"ldap_servers",
                "Properties":{

                },
                "Configurations":[
                  {
                    "Classification":"AWS",
                    "Properties":{
                      "base_dn":"DC=emrkrb,DC=test,DC=com",
                      "ldap_url":"ldap://emrkrb.test.com",
                      "search_bind_authentication":"false",
                      "bind_dn":"CN=adjoiner,CN=users,DC=emrkrb,DC=test,DC=com",
                      "bind_password":"Abc123456",
                      "create_users_on_login":"true",
                      "nt_domain":"emrkrb.test.com"
                    },
                    "Configurations":[

                    ]
                  }
                ]
              }
            ]
          },
          {
            "Classification":"auth",
            "Properties":{
              "backend":"desktop.auth.backend.LdapBackend"
            },
            "Configurations":[

            ]
          }
        ]
      }
    ]
  }
]

Note: In the preceding configuration JSON file, change the values as required before pasting it into the software setting section in the Amazon EMR console.

Now let’s use this configuration and the security configuration you created before to launch the cluster.

In the Amazon EMR console, choose Create cluster. Then choose Go to advanced options. On the Step 1: Software and Steps page, under Edit software settings (optional), paste the configuration in the box.

The rest of the setup is the same as an ordinary cluster setup, except in the Security Options section. In Step 4: Security, under Permissions, choose Custom, and then choose the EMR_EC2_RestrictedRole role that you created before.

Choose the appropriate subnets (these should meet the base requirement in order for a successful Active Directory join—see the Amazon EMR Management Guide for more details), and choose the appropriate security groups to make sure it talks to the Active Directory. Choose a key so that you can log in and configure the cluster.

Most importantly, choose the security configuration that you created earlier to enable Kerberos and EMRFS authorization for Amazon S3.

You can use the following AWS CLI command to create a cluster.

aws emr create-cluster --name "TestEMRFSAuthorization" \
--release-label emr-5.10.0 \
--instance-type m3.xlarge \
--instance-count 3 \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=MyEC2KeyPair \
--service-role EMR_DefaultRole \
--security-configuration MyKerberosConfig \
--configurations file://hue-config.json \
--applications Name=Hadoop Name=Hive Name=Hue Name=Spark \
--kerberos-attributes Realm=EC2.INTERNAL,KdcAdminPassword=<YourClusterKDCAdminPassword>,ADDomainJoinUser=<YourADUserLogonName>,ADDomainJoinPassword=<YourADUserPassword>,CrossRealmTrustPrincipalPassword=<MatchADTrustPwd>

Note: If you create the cluster using CLI, you need to save the JSON configuration for Hue into a file named hue-config.json and place it on the server where you run the CLI command.

After the cluster gets into the Waiting state, try to connect to the cluster over SSH using an Active Directory user name and password.

ssh -l <ADUserName>@emrkrb.test.com <EMR IP or DNS name>

Quickly run two commands to show that the Active Directory join is successful:

  1. id [user name] shows the mapped AD users and groups in Linux.
  2. hdfs groups [user name] shows the mapped group in Hadoop.

Both should return the current Active Directory user and group information if the setup is correct.

Now, you can test the user mapping first. Log in with the admin1 user, and run a Hadoop list directory command:

hadoop fs -ls s3://emrfs-auth-data-science-bucket-demo/

Now switch to a user from the data engineer group.

Retry the previous command to access the data science bucket that the admin user could list. It should throw an Amazon S3 Access Denied exception.

When you list an Amazon S3 bucket that the data engineering group does have access to, the group mapping is triggered.

hadoop fs -ls s3://emrfs-auth-data-engineering-bucket-demo/

It successfully returns the listing results. Next we will test Apache Hive and then Apache Spark.


To run jobs successfully, you need to create a home directory under /user/<username> in HDFS for every user, for staging data. You can configure a step at cluster launch time to create a home directory for every user who has access to the cluster. In this example, we use Hue, because Hue creates the home directory in HDFS for a user at first login. For this to work, Hue needs to be integrated with the same Active Directory, as shown in the example configuration described earlier.

First, log in to Hue as a data engineer user, and open a Hive Notebook in Hue. Then run a query to create a new table pointing to the data engineer bucket, s3://emrfs-auth-data-engineering-bucket-demo/table1_data_eng/.

You can see that the table was created successfully. Now try to create another table pointing to the data science group’s bucket, where the data engineer group doesn’t have access.

It failed and threw an Amazon S3 Access Denied error.

Now insert one line of data into the successfully created table.

Next, log out, switch to a data science group user, and create another table, test2_datasci_tb.

The creation is successful.

The last task is to test Spark (it requires the user directory, but Hue created one in the previous step).

Now let’s come back to the command line and run some Spark commands.

Log in to the master node using the datascientist1 user:

Start the SparkSQL interactive shell by typing spark-sql, and run the show tables command. It should list the tables that you created using Hive.

As a data science group user, try select on both tables. You will find that you can only select the table defined in the location that your group has access to.

Conclusion

EMRFS authorization for Amazon S3 enables you to have multiple roles on the same cluster, providing flexibility to configure a shared cluster for different teams to achieve better efficiency. The Active Directory integration and group mapping make it much easier for you to manage your users and groups, and provides better auditability in a multi-tenant environment.


Additional Reading

If you found this post useful, be sure to check out Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory and Launching and Running an Amazon EMR Cluster inside a VPC.


About the Authors

Songzhi Liu is a Big Data Consultant with AWS Professional Services. He works closely with AWS customers to provide them Big Data & Machine Learning solutions and best practices on the Amazon cloud.


AWS Glue Now Supports Scala Scripts

Post Syndicated from Mehul Shah original https://aws.amazon.com/blogs/big-data/aws-glue-now-supports-scala-scripts/

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

This post shows how to build an example Scala script that identifies highly negative issues in the timeline. It pulls out issue events in the timeline sample, analyzes their titles using the sentiment prediction functions from the Stanford CoreNLP libraries, and surfaces the most negative issues.

Getting started

Before we start writing scripts, we use AWS Glue crawlers to get a sense of the data—its structure and characteristics. We also set up a development endpoint and attach an Apache Zeppelin notebook, so we can interactively explore the data and author the script.

Crawl the data

The dataset used in this example was downloaded from the GitHub archive website into our sample dataset bucket in Amazon S3, and copied to the following locations:

s3://aws-glue-datasets-<region>/examples/scala-blog/githubarchive/data/

Choose the best folder by replacing <region> with the region that you’re working in, for example, us-east-1. Crawl this folder, and put the results into a database named githubarchive in the AWS Glue Data Catalog, as described in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, and is organized hierarchically (that is, partitioned) by year, month, and day.
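If you prefer to script this step instead of using the console, the following hedged boto3 sketch creates and starts a crawler over the sample data. The crawler name and IAM role are placeholders; the console walkthrough in the AWS Glue Developer Guide remains the authoritative reference.

import boto3

glue = boto3.client("glue")

# Placeholder crawler name and IAM role; the S3 path points at the sample
# dataset (here for us-east-1).
glue.create_crawler(
    Name="githubarchive-crawler",
    Role="AWSGlueServiceRole-demo",
    DatabaseName="githubarchive",
    Targets={"S3Targets": [{"Path": "s3://aws-glue-datasets-us-east-1/examples/scala-blog/githubarchive/data/"}]},
)

glue.start_crawler(Name="githubarchive-crawler")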

When finished, use the AWS Glue console to navigate to the table named data in the githubarchive database. Notice that this data has eight top-level columns, which are common to each event type, and three partition columns that correspond to year, month, and day.

Choose the payload column, and you will notice that it has a complex schema—one that reflects the union of the payloads of event types that appear in the crawled data. Also note that the schema that crawlers generate is a subset of the true schema because they sample only a subset of the data.

Set up the library, development endpoint, and notebook

Next, you need to download and set up the libraries that estimate the sentiment in a snippet of text. The Stanford CoreNLP libraries contain a number of human language processing tools, including sentiment prediction.

Download the Stanford CoreNLP libraries. Unzip the .zip file, and you’ll see a directory full of jar files. For this example, the following jars are required:

  • stanford-corenlp-3.8.0.jar
  • stanford-corenlp-3.8.0-models.jar
  • ejml-0.23.jar

Upload these files to an Amazon S3 path that is accessible to AWS Glue so that it can load these libraries when needed. For this example, they are in s3://glue-sample-other/corenlp/.

Development endpoints are static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints to interactively send commands and explore and analyze your data. These endpoints have the same configuration as that of AWS Glue’s job execution system. So, commands and scripts that work there also work the same when registered and run as jobs in AWS Glue.

To set up an endpoint and a Zeppelin notebook to work with that endpoint, follow the instructions in the AWS Glue Developer Guide. When you are creating an endpoint, be sure to specify the locations of the previously mentioned jars in the Dependent jars path as a comma-separated list. Otherwise, the libraries will not be loaded.

After you set up the notebook server, go to the Zeppelin notebook by choosing Dev Endpoints in the left navigation pane on the AWS Glue console. Choose the endpoint that you created. Next, choose the Notebook Server URL, which takes you to the Zeppelin server. Log in using the notebook user name and password that you specified when creating the notebook. Finally, create a new note to try out this example.

Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and the output for that command. Moreover, each notebook includes a number of interpreters. If you set up the Zeppelin server using the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already connected to your new development endpoint, with pyspark as the default. Therefore, throughout this example, you need to prepend %spark at the top of your paragraphs. In this example, we omit these for brevity.

Working with the data

In this section, we use AWS Glue extensions to Spark to work with the dataset. We look at the actual schema of the data and filter out the interesting event types for our analysis.

Start with some boilerplate code to import libraries that you need:

%spark

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext

Then, create the Spark and AWS Glue contexts needed for working with the data:

@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)

You need the @transient annotation on the SparkContext when working in Zeppelin; otherwise, you will run into a serialization error when executing commands.

Dynamic frames

This section shows how to create a dynamic frame that contains the GitHub records in the table that you crawled earlier. A dynamic frame is the basic data structure in AWS Glue scripts. It is like an Apache Spark data frame, except that it is designed and optimized for data cleaning and transformation workloads. A dynamic frame is well-suited for representing semi-structured datasets like the GitHub timeline.

A dynamic frame is a collection of dynamic records. In Spark lingo, it is an RDD (resilient distributed dataset) of DynamicRecords. A dynamic record is a self-describing record. Each record encodes its columns and types, so every record can have a schema that is unique from all others in the dynamic frame. This is convenient and often more efficient for datasets like the GitHub timeline, where payloads can vary drastically from one event type to another.

The following creates a dynamic frame, github_events, from your table:

val github_events = glueContext
                    .getCatalogSource(database = "githubarchive", tableName = "data")
                    .getDynamicFrame()

The getCatalogSource() method returns a DataSource, which represents a particular table in the Data Catalog. The getDynamicFrame() method returns a dynamic frame from the source.

Recall that the crawler created a schema from only a sample of the data. You can scan the entire dataset, count the rows, and print the complete schema as follows:

github_events.count
github_events.printSchema()

The result looks like the following:

The data has 414,826 records. As before, notice that there are eight top-level columns, and three partition columns. If you scroll down, you’ll also notice that the payload is the most complex column.

Run functions and filter records

This section describes how you can create your own functions and invoke them seamlessly to filter records. Unlike filtering with Python lambdas, Scala scripts do not need to convert records from one language representation to another, thereby reducing overhead and running much faster.

Let’s create a function that picks only the IssuesEvents from the GitHub timeline. These events are generated whenever someone posts an issue for a particular repository. Each GitHub event record has a field, “type”, that indicates the kind of event it is. The issueFilter() function returns true for records that are IssuesEvents.

def issueFilter(rec: DynamicRecord): Boolean = { 
    rec.getField("type").exists(_ == "IssuesEvent") 
}

Note that the getField() method returns an Option[Any] type, so you first need to check that it exists before checking the type.

You pass this function to the filter transformation, which applies the function on each record and returns a dynamic frame of those records that pass.

val issue_events =  github_events.filter(issueFilter)

Now, let’s look at the size and schema of issue_events.

issue_events.count
issue_events.printSchema()

It’s much smaller (14,063 records), and the payload schema is less complex, reflecting only the schema for issues. Keep a few essential columns for your analysis, and drop the rest using the ApplyMapping() transform:

val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                 ("actor.login", "string", "actor", "string"), 
                                                 ("repo.name", "string", "repo", "string"),
                                                 ("payload.action", "string", "action", "string"),
                                                 ("payload.issue.title", "string", "title", "string")))
issue_titles.show()

The ApplyMapping() transform is quite handy for renaming columns, casting types, and restructuring records. The preceding code snippet tells the transform to select the fields (or columns) that are enumerated in the left half of the tuples and map them to the fields and types in the right half.

Estimating sentiment using Stanford CoreNLP

To focus on the most pressing issues, you might want to isolate the records with the most negative sentiments. The Stanford CoreNLP libraries are Java-based and offer sentiment-prediction functions. Accessing these functions through Python is possible, but quite cumbersome. It requires creating Python surrogate classes and objects for those found on the Java side. Instead, with Scala support, you can use those classes and objects directly and invoke their methods. Let’s see how.

First, import the libraries needed for the analysis:

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

The Stanford CoreNLP libraries have a main driver that orchestrates all of their analysis. The driver setup is heavyweight, setting up threads and data structures that are shared across analyses. Apache Spark runs on a cluster with a main driver process and a collection of backend executor processes that do most of the heavy sifting of the data.

The Stanford CoreNLP shared objects are not serializable, so they cannot be distributed easily across a cluster. Instead, you need to initialize them once for every backend executor process that might need them. Here is how to accomplish that:

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")

object myNLP {
    lazy val coreNLP = new StanfordCoreNLP(props)
}

The properties tell the libraries which annotators to execute and how many words to process. The preceding code creates an object, myNLP, with a field coreNLP that is lazily evaluated. This field is initialized only when it is needed, and only once. So, when the backend executors start processing the records, each executor initializes the driver for the Stanford CoreNLP libraries only one time.

Next is a function that estimates the sentiment of a text string. It first calls Stanford CoreNLP to annotate the text. Then, it pulls out the sentences and takes the average sentiment across all the sentences. The sentiment is a double, from 0.0 as the most negative to 4.0 as the most positive.

def estimatedSentiment(text: String): Double = {
    if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
    val annotations = myNLP.coreNLP.process(text)
    val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
    sentences.foldLeft(0.0)( (csum, x) => { 
        csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
    }) / sentences.length
}

Now, let’s estimate the sentiment of the issue titles and add that computed field as part of the records. You can accomplish this with the map() method on dynamic frames:

val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
    val mbody = rec.getField("title")
    mbody match {
        case Some(mval: String) => { 
            rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
            rec }
        case _ => rec
    }
})

The map() method applies the user-provided function on every record. The function takes a DynamicRecord as an argument and returns a DynamicRecord. The preceding code computes the sentiment, adds it to the record as a top-level field named sentiment, and returns the record.

Count the records with sentiment and show the schema. This takes a few minutes because Spark must initialize the library and run the sentiment analysis, which can be involved.

issue_sentiments.count
issue_sentiments.printSchema()

Notice that all records were processed (14,063), and the sentiment value was added to the schema.

Finally, let’s pick out the titles that have the lowest sentiment (less than 1.5). Count them and print out a sample to see what some of the titles look like.

val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
pressing_issues.count
pressing_issues.show(10)

Next, write them all to a file so that you can handle them later. (You’ll need to replace the output path with your own.)

glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions("""{"path": "s3://<bucket>/out/path/"}"""), 
                              format = "json")
            .writeDynamicFrame(pressing_issues)

Take a look in the output path, and you can see the output files.

Putting it all together

Now, let’s create a job from the preceding interactive session. The following script combines all the commands from earlier. It processes the GitHub archive files and writes out the highly negative issues:

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

object GlueApp {

    object myNLP {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
        props.setProperty("parse.maxlen", "70")

        lazy val coreNLP = new StanfordCoreNLP(props)
    }

    def estimatedSentiment(text: String): Double = {
        if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
        val annotations = myNLP.coreNLP.process(text)
        val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
        sentences.foldLeft(0.0)( (csum, x) => { 
            csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
        }) / sentences.length
    }

    def main(sysArgs: Array[String]) {
        val spark: SparkContext = SparkContext.getOrCreate()
        val glueContext: GlueContext = new GlueContext(spark)

        val dbname = "githubarchive"
        val tblname = "data"
        val outpath = "s3://<bucket>/out/path/"

        val github_events = glueContext
                            .getCatalogSource(database = dbname, tableName = tblname)
                            .getDynamicFrame()

        val issue_events =  github_events.filter((rec: DynamicRecord) => {
            rec.getField("type").exists(_ == "IssuesEvent")
        })

        val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                         ("actor.login", "string", "actor", "string"), 
                                                         ("repo.name", "string", "repo", "string"),
                                                         ("payload.action", "string", "action", "string"),
                                                         ("payload.issue.title", "string", "title", "string")))

        val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
            val mbody = rec.getField("title")
            mbody match {
                case Some(mval: String) => { 
                    rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
                    rec }
                case _ => rec
            }
        })

        val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))

        glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions(s"""{"path": "$outpath"}"""), 
                              format = "json")
                    .writeDynamicFrame(pressing_issues)
    }
}

Notice that the script is enclosed in a top-level object called GlueApp, which serves as the script’s entry point for the job. (You’ll need to replace the output path with your own.) Upload the script to an Amazon S3 location so that AWS Glue can load it when needed.

To create the job, open the AWS Glue console. Choose Jobs in the left navigation pane, and then choose Add job. Create a name for the job, and specify a role with permissions to access the data. Choose An existing script that you provide, and choose Scala as the language.

For the Scala class name, type GlueApp to indicate the script’s entry point. Specify the Amazon S3 location of the script.

Choose Script libraries and job parameters. In the Dependent jars path field, enter the Amazon S3 locations of the Stanford CoreNLP libraries from earlier as a comma-separated list (without spaces). Then choose Next.

No connections are needed for this job, so choose Next again. Review the job properties, and choose Finish. Finally, choose Run job to execute the job.

You can simply edit the script’s input table and output path to run this job on whatever GitHub timeline datasets you might have.

Conclusion

In this post, we showed how to write AWS Glue ETL scripts in Scala via notebooks and how to run them as jobs. Scala has the advantage that it is the native language for the Spark runtime. With Scala, it is easier to call Scala or Java functions and third-party libraries for analyses. Moreover, data processing is faster in Scala because there’s no need to convert records from one language runtime to another.

You can find more examples of Scala scripts in our GitHub examples repository: https://github.com/awslabs/aws-glue-samples. We encourage you to experiment with Scala scripts and let us know about any interesting ETL flows that you want to share.

Happy Glue-ing!



Additional Reading

If you found this post useful, be sure to check out Simplify Querying Nested JSON with the AWS Glue Relationalize Transform and Genomic Analysis with Hail on Amazon EMR and Amazon Athena.



About the Authors

Mehul Shah is a senior software manager for AWS Glue. His passion is leveraging the cloud to build smarter, more efficient, and easier to use data systems. He has three girls, and, therefore, he has no spare time.


Ben Sowell is a software development engineer at AWS Glue.

Vinay Vivili is a software development engineer for AWS Glue.


Now Available: New Digital Training to Help You Learn About AWS Big Data Services

Post Syndicated from Sara Snedeker original https://aws.amazon.com/blogs/big-data/now-available-new-digital-training-to-help-you-learn-about-aws-big-data-services/

AWS Training and Certification recently released free digital training courses that will make it easier for you to build your cloud skills and learn about using AWS Big Data services. This training includes courses like Introduction to Amazon EMR and Introduction to Amazon Athena.

You can get free and unlimited access to more than 100 new digital training courses built by AWS experts at aws.training. It’s easy to access training related to big data. Just choose the Analytics category on our Find Training page to browse through the list of courses. You can also use the keyword filter to search for training for specific AWS offerings.

Recommended training

Just getting started, or looking to learn about a new service? Check out the following digital training courses:

Introduction to Amazon EMR (15 minutes)
Covers the available tools that can be used with Amazon EMR and the process of creating a cluster. It includes a demonstration of how to create an EMR cluster.

Introduction to Amazon Athena (10 minutes)
Introduces the Amazon Athena service along with an overview of its operating environment. It covers the basic steps in implementing Athena and provides a brief demonstration.

Introduction to Amazon QuickSight (10 minutes)
Discusses the benefits of using Amazon QuickSight and how the service works. It also includes a demonstration so that you can see Amazon QuickSight in action.

Introduction to Amazon Redshift (10 minutes)
Walks you through Amazon Redshift and its core features and capabilities. It also includes a quick overview of relevant use cases and a short demonstration.

Introduction to AWS Lambda (10 minutes)
Discusses the rationale for using AWS Lambda, how the service works, and how you can get started using it.

Introduction to Amazon Kinesis Analytics (10 minutes)
Discusses how Amazon Kinesis Analytics collects, processes, and analyzes streaming data in real time. It discusses how to use and monitor the service and explores some use cases.

Introduction to Amazon Kinesis Streams (15 minutes)
Covers how Amazon Kinesis Streams is used to collect, process, and analyze real-time streaming data to create valuable insights.

Introduction to AWS IoT (10 minutes)
Describes how the AWS Internet of Things (IoT) communication architecture works, and the components that make up AWS IoT. It discusses how AWS IoT works with other AWS services and reviews a case study.

Introduction to AWS Data Pipeline (10 minutes)
Covers components like tasks, task runner, and pipeline. It also discusses what a pipeline definition is, and reviews the AWS services that are compatible with AWS Data Pipeline.

Go deeper with classroom training

Want to learn more? Enroll in classroom training to learn best practices, get live feedback, and hear answers to your questions from an instructor.

Big Data on AWS (3 days)
Introduces you to cloud-based big data solutions such as Amazon EMR, Amazon Redshift, Amazon Kinesis, and the rest of the AWS big data platform.

Data Warehousing on AWS (3 days)
Introduces you to concepts, strategies, and best practices for designing a cloud-based data warehousing solution, and demonstrates how to collect, store, and prepare data for the data warehouse.

Building a Serverless Data Lake (1 day)
Teaches you how to design, build, and operate a serverless data lake solution with AWS services. Includes topics such as ingesting data from any data source at large scale, storing the data securely and durably, using the right tool to process large volumes of data, and understanding the options available for analyzing the data in near-real time.

More training coming in 2018

We’re always evaluating and expanding our training portfolio, so stay tuned for more training options in the new year. You can always visit us at aws.training to explore our latest offerings.

Now Open AWS EU (Paris) Region

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-open-aws-eu-paris-region/

Today we are launching our 18th AWS Region, our fourth in Europe. Located in the Paris area, the new Region lets AWS customers better serve end users in and around France.

The Details
The new EU (Paris) Region provides a broad suite of AWS services including Amazon API Gateway, Amazon Aurora, Amazon CloudFront, Amazon CloudWatch, CloudWatch Events, Amazon CloudWatch Logs, Amazon DynamoDB, Amazon Elastic Compute Cloud (EC2), EC2 Container Registry, Amazon ECS, Amazon Elastic Block Store (EBS), Amazon EMR, Amazon ElastiCache, Amazon Elasticsearch Service, Amazon Glacier, Amazon Kinesis Streams, Polly, Amazon Redshift, Amazon Relational Database Service (RDS), Amazon Route 53, Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), Amazon Simple Storage Service (S3), Amazon Simple Workflow Service (SWF), Amazon Virtual Private Cloud, Auto Scaling, AWS Certificate Manager (ACM), AWS CloudFormation, AWS CloudTrail, AWS CodeDeploy, AWS Config, AWS Database Migration Service, AWS Direct Connect, AWS Elastic Beanstalk, AWS Identity and Access Management (IAM), AWS Key Management Service (KMS), AWS Lambda, AWS Marketplace, AWS OpsWorks Stacks, AWS Personal Health Dashboard, AWS Server Migration Service, AWS Service Catalog, AWS Shield Standard, AWS Snowball, AWS Snowball Edge, AWS Snowmobile, AWS Storage Gateway, AWS Support (including AWS Trusted Advisor), Elastic Load Balancing, and VM Import.

The Paris Region supports all sizes of C5, M5, R4, T2, D2, I3, and X1 instances.

There are also four edge locations for Amazon Route 53 and Amazon CloudFront: three in Paris and one in Marseille, all with AWS WAF and AWS Shield. Check out the AWS Global Infrastructure page to learn more about current and future AWS Regions.

The Paris Region will benefit from three AWS Direct Connect locations. Telehouse Voltaire is available today. AWS Direct Connect will also become available at Equinix Paris in early 2018, followed by Interxion Paris.

All AWS infrastructure regions around the world are designed, built, and regularly audited to meet the most rigorous compliance standards and to provide high levels of security for all AWS customers. These include ISO 27001, ISO 27017, ISO 27018, SOC 1 (Formerly SAS 70), SOC 2 and SOC 3 Security & Availability, PCI DSS Level 1, and many more. This means customers benefit from all the best practices of AWS policies, architecture, and operational processes built to satisfy the needs of even the most security sensitive customers.

AWS is certified under the EU-US Privacy Shield, and the AWS Data Processing Addendum (DPA) is GDPR-ready and available now to all AWS customers to help them prepare for May 25, 2018 when the GDPR becomes enforceable. The current AWS DPA, as well as the AWS GDPR DPA, allows customers to transfer personal data to countries outside the European Economic Area (EEA) in compliance with European Union (EU) data protection laws. AWS also adheres to the Cloud Infrastructure Service Providers in Europe (CISPE) Code of Conduct. The CISPE Code of Conduct helps customers ensure that AWS is using appropriate data protection standards to protect their data, consistent with the GDPR. In addition, AWS offers a wide range of services and features to help customers meet the requirements of the GDPR, including services for access controls, monitoring, logging, and encryption.

From Our Customers
Many AWS customers are preparing to use this new Region. Here’s a small sample:

Societe Generale, one of the largest banks in France and the world, has accelerated their digital transformation while working with AWS. They developed SG Research, an application that makes reports from Societe Generale’s analysts available to corporate customers in order to improve the decision-making process for investments. The new AWS Region will reduce latency between applications running in the cloud and in their French data centers.

SNCF is the national railway company of France. Their mobile app, powered by AWS, delivers real-time traffic information to 14 million riders. Extreme weather, traffic events, holidays, and engineering works can cause usage to peak at hundreds of thousands of users per second. They are planning to use machine learning and big data to add predictive features to the app.

Radio France, the French public radio broadcaster, offers seven national networks, and uses AWS to accelerate its innovation and stay competitive.

Les Restos du Coeur is a French charity that provides assistance to the needy, delivering food packages and helping with their social and economic reintegration into French society. Les Restos du Coeur is using AWS for its CRM system to track the assistance given to each of their beneficiaries and the impact this is having on their lives.

AlloResto by JustEat (a leader in the French FoodTech industry) is using AWS to scale during traffic peaks and to accelerate their innovation process.

AWS Consulting and Technology Partners
We are already working with a wide variety of consulting, technology, managed service, and Direct Connect partners in France. Here’s a partial list:

AWS Premier Consulting Partners: Accenture, Capgemini, Claranet, CloudReach, DXC, and Edifixio.

AWS Consulting Partners: ABC Systemes, Atos International SAS, CoreExpert, Cycloid, Devoteam, LINKBYNET, Oxalide, Ozones, Scaleo Information Systems, and Sopra Steria.

AWS Technology Partners: Axway, Commerce Guys, MicroStrategy, Sage, Software AG, Splunk, Tibco, and Zerolight.

AWS in France
We have been investing in Europe, with a focus on France, for the last 11 years. We have also been developing documentation and training programs to help our customers to improve their skills and to accelerate their journey to the AWS Cloud.

As part of our commitment to AWS customers in France, we plan to train more than 25,000 people in the coming years, helping them develop highly sought after cloud skills. They will have access to AWS training resources in France via AWS Academy, AWSome days, AWS Educate, and webinars, all delivered in French by AWS Technical Trainers and AWS Certified Trainers.

Use it Today
The EU (Paris) Region is open for business now and you can start using it today!

Jeff;

 

Power data ingestion into Splunk using Amazon Kinesis Data Firehose

Post Syndicated from Tarik Makota original https://aws.amazon.com/blogs/big-data/power-data-ingestion-into-splunk-using-amazon-kinesis-data-firehose/

In late September, during the annual Splunk .conf, Splunk and Amazon Web Services (AWS) jointly announced that Amazon Kinesis Data Firehose now supports Splunk Enterprise and Splunk Cloud as a delivery destination. This native integration between Splunk Enterprise, Splunk Cloud, and Amazon Kinesis Data Firehose is designed to make AWS data ingestion setup seamless, while offering a secure and fault-tolerant delivery mechanism. We want to enable customers to monitor and analyze machine data from any source and use it to deliver operational intelligence and optimize IT, security, and business performance.

With Kinesis Data Firehose, customers can use a fully managed, reliable, and scalable data streaming solution to Splunk. In this post, we tell you a bit more about the Kinesis Data Firehose and Splunk integration. We also show you how to ingest large amounts of data into Splunk using Kinesis Data Firehose.

Push vs. Pull data ingestion

Presently, customers use a combination of two ingestion patterns, primarily based on data source and volume, in addition to existing company infrastructure and expertise:

  1. Pull-based approach: Using dedicated pollers running the popular Splunk Add-on for AWS to pull data from various AWS services such as Amazon CloudWatch or Amazon S3.
  2. Push-based approach: Streaming data directly from AWS to Splunk HTTP Event Collector (HEC) by using AWS Lambda. Examples of applicable data sources include CloudWatch Logs and Amazon Kinesis Data Streams.

The pull-based approach offers data delivery guarantees such as retries and checkpointing out of the box. However, it requires more operational effort to manage and orchestrate the dedicated pollers, which commonly run on Amazon EC2 instances. With this setup, you pay for the infrastructure even when it’s idle.

On the other hand, the push-based approach offers a low-latency scalable data pipeline made up of serverless resources like AWS Lambda sending directly to Splunk indexers (by using Splunk HEC). This approach translates into lower operational complexity and cost. However, if you need guaranteed data delivery then you have to design your solution to handle issues such as a Splunk connection failure or Lambda execution failure. To do so, you might use, for example, AWS Lambda Dead Letter Queues.
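As a rough sketch of that failure-handling option, the following Python (boto3) snippet attaches a Dead Letter Queue to an existing forwarding function so that failed invocations are captured for later replay. The function name and SQS queue ARN are hypothetical placeholders, not values from this post.

import boto3

lambda_client = boto3.client('lambda')

# Attach an SQS queue as the Dead Letter Queue for a (hypothetical) forwarding function.
# Failed invocations are sent to the queue instead of being silently dropped.
lambda_client.update_function_configuration(
    FunctionName='splunk-hec-forwarder',
    DeadLetterConfig={
        'TargetArn': 'arn:aws:sqs:us-east-1:123456789012:splunk-delivery-dlq'
    }
)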

How about getting the best of both worlds?

Let’s go over the new integration’s end-to-end solution and examine how Kinesis Data Firehose and Splunk together expand the push-based approach into a native AWS solution for applicable data sources.

By using a managed service like Kinesis Data Firehose for data ingestion into Splunk, we provide out-of-the-box reliability and scalability. One of the pain points of the old approach was the overhead of managing the data collection nodes (Splunk heavy forwarders). With the new Kinesis Data Firehose to Splunk integration, there are no forwarders to manage or set up. Data producers (1) are configured through the AWS Management Console to drop data into Kinesis Data Firehose.

You can also create your own data producers. For example, you can drop data into a Firehose delivery stream by using Amazon Kinesis Agent, or by using the Firehose API (PutRecord(), PutRecordBatch()), or by writing to a Kinesis Data Stream configured to be the data source of a Firehose delivery stream. For more details, refer to Sending Data to an Amazon Kinesis Data Firehose Delivery Stream.
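If you would rather call the API directly than run the Kinesis Agent, a minimal producer could look like the following Python (boto3) sketch. The delivery stream name and record contents are placeholders, and error handling is omitted for brevity.

import json
import boto3

firehose = boto3.client('firehose')

# Send a single newline-delimited record to a (placeholder) delivery stream.
record = {'user_id': 'u-123', 'action': 'click', 'ts': 1512345678}
firehose.put_record(
    DeliveryStreamName='events',
    Record={'Data': json.dumps(record) + '\n'}
)

# For higher throughput, PutRecordBatch accepts up to 500 records per call:
# firehose.put_record_batch(DeliveryStreamName='events', Records=[{'Data': ...}, ...])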

You might need to transform the data before it goes into Splunk for analysis. For example, you might want to enrich it or filter or anonymize sensitive data. You can do so using AWS Lambda. In this scenario, Kinesis Data Firehose buffers data from the incoming source data, sends it to the specified Lambda function (2), and then rebuffers the transformed data to the Splunk Cluster. Kinesis Data Firehose provides the Lambda blueprints that you can use to create a Lambda function for data transformation.
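For context, a Firehose transformation function receives a batch of base64-encoded records and must return each one with its recordId, a result status, and the re-encoded data. The sketch below is a simplified, hypothetical transform (not one of the published blueprints) that anonymizes a field before the data is rebuffered to Splunk.

import base64
import json

def lambda_handler(event, context):
    # Each incoming record carries base64-encoded data plus a recordId that must be echoed back.
    output = []
    for record in event['records']:
        payload = json.loads(base64.b64decode(record['data']))

        # Example anonymization step: mask a sensitive field before indexing.
        if 'email' in payload:
            payload['email'] = 'REDACTED'

        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',  # or 'Dropped' / 'ProcessingFailed'
            'data': base64.b64encode((json.dumps(payload) + '\n').encode('utf-8')).decode('utf-8')
        })
    return {'records': output}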

Systems fail all the time. Let’s see how this integration handles outside failures to guarantee data durability. In cases when Kinesis Data Firehose can’t deliver data to the Splunk Cluster, data is automatically backed up to an S3 bucket. You can configure this feature while creating the Firehose delivery stream (3). You can choose to back up all data or only the data that’s failed during delivery to Splunk.

In addition to using S3 for data backup, this Firehose integration with Splunk supports Splunk Indexer Acknowledgments to guarantee event delivery. This feature is configured on Splunk’s HTTP Event Collector (HEC) (4). It ensures that HEC returns an acknowledgment to Kinesis Data Firehose only after data has been indexed and is available in the Splunk cluster (5).

Now let’s look at a hands-on exercise that shows how to forward VPC flow logs to Splunk.

How-to guide

To process VPC flow logs, we implement the following architecture.

Amazon Virtual Private Cloud (Amazon VPC) delivers flow log records to an Amazon CloudWatch Logs log group. Using a CloudWatch Logs subscription filter, we set up real-time delivery of CloudWatch Logs to a Kinesis Data Firehose stream.

Data coming from CloudWatch Logs is gzip compressed. To work with this compression, we configure a Lambda-based data transformation in Kinesis Data Firehose that decompresses the data and deposits it back into the stream. Firehose then delivers the raw logs to the Splunk HTTP Event Collector (HEC).

If delivery to the Splunk HEC fails, Firehose deposits the logs into an Amazon S3 bucket. You can then ingest the events from S3 using an alternate mechanism such as a Lambda function.

When data reaches Splunk (Enterprise or Cloud), Splunk parsing configurations (packaged in the Splunk Add-on for Kinesis Data Firehose) extract and parse all fields. They make data ready for querying and visualization using Splunk Enterprise and Splunk Cloud.

Walkthrough

Install the Splunk Add-on for Amazon Kinesis Data Firehose

The Splunk Add-on for Amazon Kinesis Data Firehose enables Splunk (be it Splunk Enterprise, Splunk App for AWS, or Splunk Enterprise Security) to use data ingested from Amazon Kinesis Data Firehose. Install the Add-on on all the indexers with an HTTP Event Collector (HEC). The Add-on is available for download from Splunkbase.

HTTP Event Collector (HEC)

Before you can use Kinesis Data Firehose to deliver data to Splunk, set up the Splunk HEC to receive the data. From Splunk web, go to the Settings menu, choose Data Inputs, and choose HTTP Event Collector. Choose Global Settings, ensure All tokens is enabled, and then choose Save. Then choose New Token to create a new HEC endpoint and token. When you create a new token, make sure that Enable indexer acknowledgment is checked.

When prompted to select a source type, select aws:cloudwatch:vpcflow.

Create an S3 backsplash bucket

To provide for situations in which Kinesis Data Firehose can’t deliver data to the Splunk Cluster, we use an S3 bucket to back up the data. You can configure this feature to back up all data or only the data that’s failed during delivery to Splunk.

Note: Bucket names are globally unique, so you can’t reuse tmak-backsplash-bucket.

aws s3api create-bucket --bucket tmak-backsplash-bucket --create-bucket-configuration LocationConstraint=ap-northeast-1

Create an IAM role for the Lambda transform function

Firehose triggers an AWS Lambda function that transforms the data in the delivery stream. Let’s first create a role for the Lambda function called LambdaBasicRole.

Note: You can also set this role up when creating your Lambda function.

$ aws iam create-role --role-name LambdaBasicRole --assume-role-policy-document file://TrustPolicyForLambda.json

Here is TrustPolicyForLambda.json.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

 

After the role is created, attach the managed Lambda basic execution policy to it.

$ aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole \
  --role-name LambdaBasicRole

 

Create a Firehose Stream

On the AWS console, open the Amazon Kinesis service, go to the Firehose console, and choose Create Delivery Stream.

In the next section, you can specify whether you want to use an inline Lambda function for transformation. Because incoming CloudWatch Logs are gzip compressed, choose Enabled for Record transformation, and then choose Create new.

From the list of available blueprint functions, choose Kinesis Data Firehose CloudWatch Logs Processor. This function unzips data and places it back into the Firehose stream in compliance with the record transformation output model.

Enter a name for the Lambda function, choose Choose an existing role, and then choose the role you created earlier. Then choose Create Function.

Go back to the Firehose Stream wizard, choose the Lambda function you just created, and then choose Next.

Select Splunk as the destination, and enter your Splunk HTTP Event Collector information.

Note: Amazon Kinesis Data Firehose requires the Splunk HTTP Event Collector (HEC) endpoint to be terminated with a valid CA-signed certificate matching the DNS hostname used to connect to your HEC endpoint. You receive delivery errors if you are using a self-signed certificate.

In this example, we only back up logs that fail during delivery.

To monitor your Firehose delivery stream, enable error logging so that you can track record delivery errors.

Create an IAM role for the Firehose stream by choosing Create new, or Choose. Doing this brings you to a new screen. Choose Create a new IAM role, give the role a name, and then choose Allow.

If you look at the policy document, you can see that the role gives Kinesis Data Firehose permission to publish error logs to CloudWatch, execute your Lambda function, and put records into your S3 backup bucket.

You now get a chance to review and adjust the Firehose stream settings. When you are satisfied, choose Create Stream. You get a confirmation once the stream is created and active.

Create a VPC Flow Log

To send events from Amazon VPC, you need to set up a VPC flow log. If you already have a VPC flow log you want to use, you can skip to the “Publish CloudWatch to Kinesis Data Firehose” section.

On the AWS console, open the Amazon VPC service. Then choose VPC, Your VPC, and choose the VPC you want to send flow logs from. Choose Flow Logs, and then choose Create Flow Log. If you don’t have an IAM role that allows your VPC to publish logs to CloudWatch, choose Set Up Permissions and Create new role. Use the defaults when presented with the screen to create the new IAM role.

Once active, your VPC flow log should look like the following.

Publish CloudWatch to Kinesis Data Firehose

When you generate traffic to or from your VPC, the log group is created in Amazon CloudWatch. The new log group has no subscription filter, so set up a subscription filter. Setting this up establishes a real-time data feed from the log group to your Firehose delivery stream.

At present, you have to use the AWS Command Line Interface (AWS CLI) to create a CloudWatch Logs subscription to a Kinesis Data Firehose stream. However, you can use the AWS console to create subscriptions to Lambda and Amazon Elasticsearch Service.

To allow CloudWatch to publish to your Firehose stream, you need to give it permissions.

$ aws iam create-role --role-name CWLtoKinesisFirehoseRole --assume-role-policy-document file://TrustPolicyForCWLToFireHose.json


Here is the content for TrustPolicyForCWLToFireHose.json.

{
  "Statement": {
    "Effect": "Allow",
    "Principal": { "Service": "logs.us-east-1.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }
}

 

Attach the policy to the newly created role.

$ aws iam put-role-policy \
    --role-name CWLtoKinesisFirehoseRole \
    --policy-name Permissions-Policy-For-CWL \
    --policy-document file://PermissionPolicyForCWLToFireHose.json

Here is the content for PermissionPolicyForCWLToFireHose.json.

{
    "Statement":[
      {
        "Effect":"Allow",
        "Action":["firehose:*"],
        "Resource":["arn:aws:firehose:us-east-1:YOUR-AWS-ACCT-NUM:deliverystream/ FirehoseSplunkDeliveryStream"]
      },
      {
        "Effect":"Allow",
        "Action":["iam:PassRole"],
        "Resource":["arn:aws:iam::YOUR-AWS-ACCT-NUM:role/CWLtoKinesisFirehoseRole"]
      }
    ]
}

Finally, create a subscription filter.

$ aws logs put-subscription-filter \
   --log-group-name "/vpc/flowlog/FirehoseSplunkDemo" \
   --filter-name "Destination" \
   --filter-pattern "" \
   --destination-arn "arn:aws:firehose:us-east-1:YOUR-AWS-ACCT-NUM:deliverystream/FirehoseSplunkDeliveryStream" \
   --role-arn "arn:aws:iam::YOUR-AWS-ACCT-NUM:role/CWLtoKinesisFirehoseRole"

When you run the preceding AWS CLI command, you don’t get any acknowledgment. To validate that your CloudWatch Log Group is subscribed to your Firehose stream, check the CloudWatch console.

As soon as the subscription filter is created, the real-time log data from the log group goes into your Firehose delivery stream. Your stream then delivers it to your Splunk Enterprise or Splunk Cloud environment for querying and visualization. The screenshot following is from Splunk Enterprise.

In addition, you can monitor and view metrics associated with your delivery stream using the AWS console.

Conclusion

Although our walkthrough uses VPC Flow Logs, the pattern can be used in many other scenarios. These include ingesting data from AWS IoT, other CloudWatch logs and events, and Kinesis Data Streams, or from other data sources using the Kinesis Agent or Kinesis Producer Library. We also used the Lambda blueprint Kinesis Data Firehose CloudWatch Logs Processor to transform streaming records from Kinesis Data Firehose. However, you might need to use a different Lambda blueprint or disable record transformation entirely, depending on your use case. For an additional use case using Kinesis Data Firehose, check out the This is My Architecture video, which discusses how to securely centralize cross-account data analytics using Kinesis and Splunk.

 


Additional Reading

If you found this post useful, be sure to check out Integrating Splunk with Amazon Kinesis Streams and Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review.


About the Authors

Tarik Makota is a solutions architect with the Amazon Web Services Partner Network. He provides technical guidance, design advice, and thought leadership to AWS’ most strategic software partners. His career includes work in extremely broad software development and architecture roles across ERP, financial printing, benefit delivery and administration, and financial services. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.

 

 

 

Roy Arsan is a solutions architect in the Splunk Partner Integrations team. He has a background in product development, cloud architecture, and building consumer and enterprise cloud applications. More recently, he has architected Splunk solutions on major cloud providers, including an AWS Quick Start for Splunk that enables AWS users to easily deploy distributed Splunk Enterprise straight from their AWS console. He’s also the co-author of the AWS Lambda blueprints for Splunk. He holds an M.S. in Computer Science Engineering from the University of Michigan.

 

 

 

Now Open – AWS China (Ningxia) Region

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-open-aws-china-ningxia-region/

Today we launched our 17th Region globally, and the second in China. The AWS China (Ningxia) Region, operated by Ningxia Western Cloud Data Technology Co. Ltd. (NWCD), is generally available now and provides customers another option to run applications and store data on AWS in China.

The Details
At launch, the new China (Ningxia) Region, operated by NWCD, supports Auto Scaling, AWS Config, AWS CloudFormation, AWS CloudTrail, Amazon CloudWatch, CloudWatch Events, Amazon CloudWatch Logs, AWS CodeDeploy, AWS Direct Connect, Amazon DynamoDB, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), Amazon EC2 Systems Manager, AWS Elastic Beanstalk, Amazon ElastiCache, Amazon Elasticsearch Service, Elastic Load Balancing, Amazon EMR, Amazon Glacier, AWS Identity and Access Management (IAM), Amazon Kinesis Streams, Amazon Redshift, Amazon Relational Database Service (RDS), Amazon Simple Storage Service (S3), Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), AWS Support API, AWS Trusted Advisor, Amazon Simple Workflow Service (SWF), Amazon Virtual Private Cloud, and VM Import. Visit the AWS China Products page for additional information on these services.

The Region supports all sizes of C4, D2, M4, T2, R4, I3, and X1 instances.

Check out the AWS Global Infrastructure page to learn more about current and future AWS Regions.

Operating Partner
To comply with China’s legal and regulatory requirements, AWS has formed a strategic technology collaboration with NWCD to operate and provide services from the AWS China (Ningxia) Region. Founded in 2015, NWCD is a licensed datacenter and cloud services provider, based in Ningxia, China. NWCD joins Sinnet, the operator of the AWS China (Beijing) Region, as an AWS operating partner in China. Through these relationships, AWS provides its industry-leading technology, guidance, and expertise to NWCD and Sinnet, while NWCD and Sinnet operate and provide AWS cloud services to local customers. While the cloud services offered in both AWS China Regions are the same as those available in other AWS Regions, the AWS China Regions are different in that they are isolated from all other AWS Regions and operated separately by AWS’s Chinese partners. Customers using the AWS China Regions enter into customer agreements with Sinnet and NWCD, rather than with AWS.

Use it Today
The AWS China (Ningxia) Region, operated by NWCD, is open for business, and you can start using it now! Starting today, Chinese developers, startups, and enterprises, as well as government, education, and non-profit organizations, can leverage AWS to run their applications and store their data in the new AWS China (Ningxia) Region, operated by NWCD. Customers already using the AWS China (Beijing) Region, operated by Sinnet, can select the AWS China (Ningxia) Region directly from the AWS Management Console, while new customers can request an account at www.amazonaws.cn to begin using both AWS China Regions.

Jeff;

 

 

Glenn’s Take on re:Invent 2017 – Part 3

Post Syndicated from Glenn Gore original https://aws.amazon.com/blogs/architecture/glenns-take-on-reinvent-2017-part-3/

Glenn Gore here, Chief Architect for AWS. I was in Las Vegas last week — with 43K others — for re:Invent 2017. I checked in to the Architecture blog here and here with my take on what was interesting about some of the bigger announcements from a cloud-architecture perspective.

In the excitement of so many new services being launched, we sometimes overlook feature updates that, while perhaps not as exciting as Amazon DeepLens, have significant impact on how you architect and develop solutions on AWS.

Amazon DynamoDB is used by more than 100,000 customers around the world, handling over a trillion requests every day. From the start, DynamoDB has offered high availability by natively spanning multiple Availability Zones within an AWS Region. As more customers started building and deploying truly global applications, there was a need to replicate a DynamoDB table to multiple AWS Regions, allowing read/write operations to occur in any region where the table is replicated. This update is important for providing a globally consistent view of information as users transition from one region to another, and for providing additional levels of availability, allowing for failover between AWS Regions without loss of information.

There are some interesting concurrency-design aspects you need to be aware of and ensure you can handle correctly. For example, we support “last writer wins” reconciliation when eventual consistency is used and an application updates the same item in different AWS Regions at the same time. If you require strongly consistent reads and writes, you must perform all of them in the same AWS Region. The details can be found in the DynamoDB documentation. Providing a globally distributed, replicated DynamoDB table simplifies many different use cases and allows replication logic, which may previously have been pushed up into the application layer, to be simplified back down into the data layer.
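As a rough illustration, and assuming a table named 'users' already exists in each Region with DynamoDB Streams (new and old images) enabled, a global table can be created with a call like this Python (boto3) sketch; the table name and Regions are placeholders.

import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

# Link existing per-Region tables (same name, streams enabled) into one global table.
dynamodb.create_global_table(
    GlobalTableName='users',
    ReplicationGroup=[
        {'RegionName': 'us-east-1'},
        {'RegionName': 'eu-west-1'},
    ]
)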

The other big update for DynamoDB is that you can now back up your DynamoDB table on demand with no impact on performance. One of the features I really like is that when you trigger a backup, it is available instantly, regardless of the size of the table. Behind the scenes, we use snapshots and change logs to ensure a consistent backup. While the backup itself is instant, restoring the table can take some time depending on its size, ranging from minutes to hours for very large tables.

This feature is especially important for those of you who work in regulated industries with strict requirements around data retention and backups, which in the past sometimes limited the use of DynamoDB or required complex workarounds to implement a backup feature. Those workarounds often incurred significant additional cost due to increased read transactions on the DynamoDB tables.
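To give a sense of the API, here is a minimal Python (boto3) sketch that triggers an on-demand backup and shows how a restore would be started; the table and backup names are placeholders.

import boto3

dynamodb = boto3.client('dynamodb')

# Request an on-demand backup; it becomes available immediately.
backup = dynamodb.create_backup(
    TableName='users',
    BackupName='users-2017-12-01'
)
print(backup['BackupDetails']['BackupArn'])

# Restoring creates a new table from the backup (can take minutes to hours for large tables):
# dynamodb.restore_table_from_backup(
#     TargetTableName='users-restored',
#     BackupArn=backup['BackupDetails']['BackupArn'])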

Amazon Simple Storage Service (Amazon S3) was our first-released AWS service over 11 years ago, and it proved the simplicity and scalability of true API-driven architectures in the cloud. Today, Amazon S3 stores trillions of objects, with transactional requests per second reaching into the millions! Dealing with data as objects opened up an incredibly diverse array of use cases ranging from libraries of static images, game binary downloads, and application log data, to massive data lakes used for big data analytics and business intelligence. With Amazon S3, when you accessed your data in an object, you effectively had to write/read the object as a whole or use the range feature to retrieve a part of the object — if possible — in your individual use case.

Now, with Amazon S3 Select, you can use an SQL-like query language to retrieve just the data you need from delimited text and JSON files, including GZIP-compressed files. Encryption is not supported during the Amazon S3 Select preview.

Amazon S3 Select provides two major benefits:

  • Faster access
  • Lower running costs

Serverless Lambda functions, where every millisecond of execution time is billed, benefit greatly from Amazon S3 Select: data retrieval and processing inside the function see significant speedups and cost reductions. For example, we have seen a 2x speed improvement and an 80% cost reduction with the Serverless MapReduce code.

Other AWS services such as Amazon Athena, Amazon Redshift, and Amazon EMR will support Amazon S3 Select as well as partner offerings including Cloudera and Hortonworks. If you are using Amazon Glacier for longer-term data archival, you will be able to use Amazon Glacier Select to retrieve a subset of your content from within Amazon Glacier.

As the volume of data that can be stored within Amazon S3 and Amazon Glacier continues to scale on a daily basis, we will continue to innovate and develop improved and optimized services that will allow you to work with these magnificently-large data sets while reducing your costs (retrieval and processing). I believe this will also allow you to simplify the transformation and storage of incoming data into Amazon S3 in basic, semi-structured formats as a single copy vs. some of the duplication and reformatting of data sometimes required to do upfront optimizations for downstream processing. Amazon S3 Select largely removes the need for this upfront optimization and instead allows you to store data once and process it based on your individual Amazon S3 Select query per application or transaction need.
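To make this concrete, here is a minimal Python (boto3) sketch that uses S3 Select to pull two columns out of a GZIP-compressed CSV object instead of downloading the whole file; the bucket, key, and column names are placeholders.

import boto3

s3 = boto3.client('s3')

# Retrieve only the columns we need from a compressed CSV object.
response = s3.select_object_content(
    Bucket='my-data-lake',
    Key='logs/2017/12/01/events.csv.gz',
    ExpressionType='SQL',
    Expression="SELECT s.user_id, s.billing FROM S3Object s WHERE s.os = 'Android'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}, 'CompressionType': 'GZIP'},
    OutputSerialization={'CSV': {}}
)

# The result comes back as an event stream; 'Records' events carry the selected rows.
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))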

Thanks for reading!

Glenn contemplating why CSV format is still relevant in 2017 (Italy).

Amazon EC2 Update – Streamlined Access to Spot Capacity, Smooth Price Changes, Instance Hibernation

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-ec2-update-streamlined-access-to-spot-capacity-smooth-price-changes-instance-hibernation/

EC2 Spot Instances give you access to spare compute capacity in the AWS Cloud. Our customers use fleets of Spot Instances to power their CI/CD environments & traffic generators, host web servers & microservices, render movies, and to run many types of analytics jobs, all at prices that offer significant savings in comparison to On-Demand Instances.

New Streamlined Access
Today we are introducing a new, streamlined access model for Spot Instances. You simply indicate your desire to use Spot capacity when you launch an instance via the RunInstances function, the run-instances command, or the AWS Management Console, and the request is fulfilled as long as the capacity is available. With no extra effort on your part you’ll save up to 90% off the On-Demand price for the instance type, allowing you to boost your overall application throughput by up to 10x for the same budget. The instances that you launch in this way continue to run until you terminate them or until EC2 needs to reclaim the capacity for On-Demand usage. At that point the instance is given the usual 2-minute warning and then reclaimed, making this a great fit for applications that are fault-tolerant.

Unlike the old model which required an understanding of Spot markets, bidding, and calls to a standalone asynchronous API, the new model is synchronous and as easy to use as On-Demand. Your code or your script receives an Instance ID immediately and need not check back to see if the request has been processed and accepted.

We’ve made this as clean and as simple as possible, with the expectation that it will be easy to modify many current scripts and applications to request and make use of Spot capacity. If you want to exercise additional control over your Spot instance budget, you have the option to specify a maximum price when you make a request for capacity. If you use Spot capacity to power your Amazon EMR, Amazon ECS, or AWS Batch clusters, or if you launch Spot instances by way of an AWS CloudFormation template or Auto Scaling Group, you will benefit from this new model without having to make any changes.

Applications that are built around RequestSpotInstances or RequestSpotFleet will continue to work just fine with no changes. However, you now have the option to make requests that do not include the SpotPrice parameter.
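In SDK terms, the streamlined request looks roughly like the following Python (boto3) sketch; the AMI ID and maximum price are placeholders, and omitting SpotOptions entirely gives you the default behavior described above.

import boto3

ec2 = boto3.client('ec2')

# Launch a Spot Instance with the same call used for On-Demand capacity.
response = ec2.run_instances(
    ImageId='ami-1a2b3c4d',
    InstanceType='c3.large',
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        'MarketType': 'spot',
        'SpotOptions': {
            'MaxPrice': '0.05'  # optional cap; defaults to the On-Demand price if omitted
        }
    }
)
print(response['Instances'][0]['InstanceId'])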

Smooth Price Changes
As part of today’s launch we are also changing the way that Spot prices change, moving to a model where prices adjust more gradually, based on longer-term trends in supply and demand. As I mentioned earlier, you will continue to save an average of 70-90% off the On-Demand price, and you will continue to pay the Spot price that’s in effect for the time period your instances are running. Applications built around our Spot Fleet feature will continue to automatically diversify placement of their Spot Instances across the most cost-effective pools based on the configuration you specified when you created the fleet.

Spot in Action
To launch a Spot Instance from the command line, simply specify the Spot market:

$ aws ec2 run-instances --instance-market-options MarketType=spot --image-id ami-1a2b3c4d --count 1 --instance-type c3.large

Instance Hibernation
If you run workloads that keep a lot of state in memory, you will love this new feature!

You can arrange for instances to save their in-memory state when they are reclaimed, allowing the instances and the applications on them to pick up where they left off when capacity is once again available, just like closing and then opening your laptop. This feature works on C3, C4, and certain sizes of R3, R4, and M4 instances running Amazon Linux, Ubuntu, or Windows Server, and is supported by the EC2 Hibernation Agent.

The in-memory state is written to the root EBS volume of the instance using space that is set aside when the instance launches. The private IP address and any Elastic IP Addresses are also preserved across a stop/start cycle.

Jeff;

Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production

Post Syndicated from Rafi Ton original https://aws.amazon.com/blogs/big-data/using-amazon-redshift-spectrum-amazon-athena-and-aws-glue-with-node-js-in-production/

This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”

At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.

We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.

Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.

In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.

Amazon Redshift as our foundation

The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.

The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.

However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. In our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident, and we saw our CPU utilization grow to 90%.

Why we extended Amazon Redshift to Redshift Spectrum

Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.

Seamless scalability, high performance, and unlimited concurrency

Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.

Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.

Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.

Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.

Keeping it SQL

Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.

An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.

Leveraging Parquet for higher performance

Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.

Lower cost

From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.

What we learned about Amazon Redshift vs. Redshift Spectrum performance

When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:

  1. What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
  2. Does the data format impact performance?

During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:

  • Amazon Redshift cluster with 28 DC1.large nodes
  • Redshift Spectrum using CSV/GZIP
  • Redshift Spectrum using Parquet

We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.

Simple query

First, we tested a simple query aggregating billing data across a month:

SELECT 
  user_id, 
  count(*) AS impressions, 
  SUM(billing)::decimal /1000000 AS billing 
FROM <table_name> 
WHERE 
  date >= '2017-08-01' AND 
  date <= '2017-08-31'  
GROUP BY 
  user_id;

We ran the same query seven times and measured the response times (red marking the longest time and green the shortest time):

Execution Time (seconds)

           Amazon Redshift    Redshift Spectrum CSV    Redshift Spectrum Parquet
Run #1     39.65              45.11                    11.92
Run #2     15.26              43.13                    12.05
Run #3     15.27              46.47                    13.38
Run #4     21.22              51.02                    12.74
Run #5     17.27              43.35                    11.76
Run #6     16.67              44.23                    13.67
Run #7     25.37              40.39                    12.75
Average    21.53              44.82                    12.61

For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.

What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.

Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:

Data Scanned (GB)

CSV (Gzip)   135.49
Parquet        2.83

Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.

Complex query

Next, we compared the same three configurations with a complex query.

Execution Time (seconds)

           Amazon Redshift    Redshift Spectrum CSV    Redshift Spectrum Parquet
Run #1     329.80             84.20                    42.40
Run #2     167.60             65.30                    35.10
Run #3     165.20             62.20                    23.90
Run #4     273.90             74.90                    55.90
Run #5     167.70             69.00                    58.40
Average    220.84             71.12                    43.14

This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!

Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.

Optimizing the data structure for different workloads

Because the cost of S3 is relatively inexpensive and we pay only for the data scanned by each query, we believe that it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3. It all depends on how we partition the data and update the table partitions.

Data permutations

For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running the query on the table with something as follows:

SELECT 
  user, 
  COUNT(*) 
FROM 
  events_table 
WHERE 
  ts BETWEEN '2017-08-01 14:00:00' AND '2017-08-01 14:00:59'
GROUP BY 
  user;

(Assuming ‘ts’ is your column storing the time stamp for each event.)

With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.

Creating Parquet data efficiently

On the average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic copy command from S3 to Amazon Redshift.

Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.

This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:

  • Collect the event data from the instances.
  • Save the data in Parquet format.
  • Partition the data effectively.

To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:

  1. Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
  2. Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue.
  3. Add the Parquet data to S3 by updating the table partitions.

With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.

Data validation

To store our click data in a table, we considered the following SQL create table command:

create external TABLE spectrum.blog_clicks (
    user_id varchar(50),
    campaign_id varchar(50),
    os varchar(50),
    ua varchar(255),
    ts bigint,
    billing float
)
partitioned by (date date, hour smallint)  
stored as parquet
location 's3://nuviad-temp/blog/clicks/';

The above statement defines a new external table (all Redshift Spectrum tables are external tables) with a few attributes. We stored ‘ts’ as a Unix time stamp and not as Timestamp, and billing data is stored as float and not decimal (more on that later). We also said that the data is partitioned by date and hour, and then stored as Parquet on S3.

First, we need to get the table definitions. This can be achieved by running the following query:

SELECT 
  * 
FROM 
  svv_external_columns 
WHERE 
  tablename = 'blog_clicks';

This query lists all the columns in the table with their respective definitions:

schemaname   tablename     columnname    external_type   columnnum   part_key
spectrum     blog_clicks   user_id       varchar(50)     1           0
spectrum     blog_clicks   campaign_id   varchar(50)     2           0
spectrum     blog_clicks   os            varchar(50)     3           0
spectrum     blog_clicks   ua            varchar(255)    4           0
spectrum     blog_clicks   ts            bigint          5           0
spectrum     blog_clicks   billing       double          6           0
spectrum     blog_clicks   date          date            7           1
spectrum     blog_clicks   hour          smallint        8           2

Now we can use this data to create a validation schema for our data:

const rtb_request_schema = {
    "name": "clicks",
    "items": {
        "user_id": {
            "type": "string",
            "max_length": 100
        },
        "campaign_id": {
            "type": "string",
            "max_length": 50
        },
        "os": {
            "type": "string",
            "max_length": 50            
        },
        "ua": {
            "type": "string",
            "max_length": 255            
        },
        "ts": {
            "type": "integer",
            "min_value": 0,
            "max_value": 9999999999999
        },
        "billing": {
            "type": "float",
            "min_value": 0,
            "max_value": 9999999999999
        }
    }
};

Next, we create a function that uses this schema to validate data:

function valueIsValid(value, schema) {
    if (schema.type == 'string') {
        return (typeof value == 'string' && value.length <= schema.max_length);
    }
    else if (schema.type == 'integer') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'float' || schema.type == 'double') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'boolean') {
        return typeof value == 'boolean';
    }
    else if (schema.type == 'timestamp') {
        return (new Date(value)).getTime() > 0;
    }
    else {
        return true;
    }
}

Near real-time data loading with Kinesis Firehose

On Kinesis Firehose, we created a new delivery stream to handle the events as follows:

Delivery stream name: events
Source: Direct PUT
S3 bucket: nuviad-events
S3 prefix: rtb/
IAM role: firehose_delivery_role_1
Data transformation: Disabled
Source record backup: Disabled
S3 buffer size (MB): 100
S3 buffer interval (sec): 60
S3 Compression: GZIP
S3 Encryption: No Encryption
Status: ACTIVE
Error logging: Enabled

This delivery stream aggregates event data every minute, or up to 100 MB, and writes the data to an S3 bucket as a CSV/GZIP compressed file. Next, after we have the data validated, we can safely send it to our Kinesis Firehose API:

if (validated) {
    let itemString = item.join('|')+'\n'; //Sending csv delimited by pipe and adding new line

    let params = {
        DeliveryStreamName: 'events',
        Record: {
            Data: itemString
        }
    };

    firehose.putRecord(params, function(err, data) {
        if (err) {
            console.error(err, err.stack);        
        }
        else {
            // Continue to your next step 
        }
    });
}

Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.

Automating data distribution using AWS Lambda

We created a simple Lambda function triggered by an S3 put event that copies the file to a different location (or locations), while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose are structured in a pre-defined hierarchy, such as:

S3://your-bucket/your-prefix/2017/08/01/20/events-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz

All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):

/*
  Object key structure in the event object:
  your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz
*/

let key_parts = event.Records[0].s3.object.key.split('/'); 

let event_type = key_parts[0];
let date = key_parts[1] + '-' + key_parts[2] + '-' + key_parts[3];
let hour = key_parts[4];
if (hour.indexOf('0') == 0) {
    hour = parseInt(hour, 10) + '';
}

let parts1 = key_parts[5].split('-');
let minute = parts1[7];
if (minute.indexOf('0') == 0) {
    minute = parseInt(minute, 10) + '';
}

Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:

    copyObjectToHourlyFolder(event, date, hour, minute)
        .then(copyObjectToMinuteFolder.bind(null, event, date, hour, minute))
        .then(addPartitionToSpectrum.bind(null, event, date, hour, minute))
        .then(deleteOldMinuteObjects.bind(null, event))
        .then(deleteStreamObject.bind(null, event))        
        .then(result => {
            callback(null, { message: 'done' });            
        })
        .catch(err => {
            console.error(err);
            callback(null, { message: err });            
        }); 

Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.

Because we partition the data by date and hour, we created a new partition on the Redshift Spectrum table if the processed minute is the first minute in the hour (that is, minute 0). We ran the following:

ALTER TABLE 
  spectrum.events 
ADD partition
  (date='2017-08-01', hour=0) 
  LOCATION 's3://nuviad-temp/events/2017-08-01/0/';

After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.

Migrating CSV to Parquet using AWS Glue and Amazon EMR

The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).

Creating AWS Glue jobs

What this simple AWS Glue script does:

  • Gets parameters for the job, date, and hour to be processed
  • Creates a Spark EMR context allowing us to run Spark code
  • Reads CSV data into a DataFrame
  • Writes the data as Parquet to the destination S3 bucket
  • Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','day_partition_key', 'hour_partition_key', 'day_partition_value', 'hour_partition_value' ])

#day_partition_key = "partition_0"
#hour_partition_key = "partition_1"
#day_partition_value = "2017-08-01"
#hour_partition_value = "0"

day_partition_key = args['day_partition_key']
hour_partition_key = args['hour_partition_key']
day_partition_value = args['day_partition_value']
hour_partition_value = args['hour_partition_value']

print("Running for " + day_partition_value + "/" + hour_partition_value)

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

df = spark.read.option("delimiter","|").csv("s3://nuviad-temp/events/"+day_partition_value+"/"+hour_partition_value)
df.registerTempTable("data")

df1 = spark.sql("select _c0 as user_id, _c1 as campaign_id, _c2 as os, _c3 as ua, cast(_c4 as bigint) as ts, cast(_c5 as double) as billing from data")

df1.repartition(1).write.mode("overwrite").parquet("s3://nuviad-temp/parquet/"+day_partition_value+"/hour="+hour_partition_value)

client = boto3.client('athena', region_name='us-east-1')

response = client.start_query_execution(
    QueryString='alter table parquet_events add if not exists partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ')  location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

response = client.start_query_execution(
    QueryString='alter table parquet_events partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ') set location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

job.commit()

Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.

Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:

S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq

We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.

Creating a Lambda function to trigger conversion

Next, we created a simple Lambda function to trigger the AWS Glue script hourly using a simple Python code:

import boto3
import json
from datetime import datetime, timedelta
 
client = boto3.client('glue')
 
def lambda_handler(event, context):
    last_hour_date_time = datetime.now() - timedelta(hours = 1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d") 
    hour_partition_value = last_hour_date_time.strftime("%-H") 
    response = client.start_job_run(
        JobName='convertEventsParquetHourly',
        Arguments={
            '--day_partition_key': 'date',
            '--hour_partition_key': 'hour',
            '--day_partition_value': day_partition_value,
            '--hour_partition_value': hour_partition_value
        }
    )

Using Amazon CloudWatch Events, we trigger this function hourly. It starts the AWS Glue job named ‘convertEventsParquetHourly’ for the previous hour, passing the partition keys and values to process to AWS Glue.

Redshift Spectrum and Node.js

Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.

Node.js and Parquet

The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.

One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.

Timestamp data type

According to the Parquet documentation, Timestamp data are stored in Parquet as 64-bit integers. However, JavaScript does not support 64-bit integers, because the native number type is a 64-bit double, giving only 53 bits of integer range.

The result is that you cannot store Timestamp correctly in Parquet using Node.js. The solution is to store Timestamp as string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.

Lessons learned

You can benefit from our trial-and-error experience.

Lesson #1: Data validation is critical

As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.

Lesson #2: Structure and partition data effectively

One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.

Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.

Storing data in the right format

Use Parquet whenever you can. The benefits of Parquet are substantial: faster performance, less data to scan, and a much more efficient columnar format. However, it is not supported out of the box by Kinesis Firehose, so you need to implement your own ETL. AWS Glue is a great option.

Creating small tables for frequent tasks

When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.

Lesson #3: Combine Athena and Redshift Spectrum for optimal performance

Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.

Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.

Lesson #4: Sort your Parquet data within the partition

We achieved another performance improvement by sorting data within the partition using sortWithinPartitions(sort_field). For example:

df.repartition(1).sortWithinPartitions("campaign_id")…

Conclusion

We were extremely pleased with using Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift to take advantage of scalability, performance, and cost with Redshift Spectrum.

Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.


About the Author

With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting-edge products and services that generate real value in the real world. As an experienced entrepreneur, Rafi believes in practical programming and the fast adoption of new technologies to achieve a significant market advantage.

AWS Achieves FedRAMP JAB Moderate Provisional Authorization for 20 Services in the AWS US East/West Region

Post Syndicated from Chris Gile original https://aws.amazon.com/blogs/security/aws-achieves-fedramp-jab-moderate-authorization-for-20-services-in-us-eastwest/

The AWS US East/West Region has received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) at the Federal Risk and Authorization Management Program (FedRAMP) Moderate baseline.

Though AWS has maintained an Agency-ATO for the AWS US East/West Region since early 2013, this announcement represents AWS's carefully deliberated move to the JAB for centralized maintenance of our P-ATO for the 10 services already authorized. It also adds 10 new services to our FedRAMP program (see the complete list of services below), doubling the number of FedRAMP Moderate services available to our customers to enable increased use of the cloud and support modernized IT missions. Our public sector customers can now leverage this FedRAMP P-ATO as a baseline for their own authorizations and look to the JAB for centralized Continuous Monitoring reporting and updates. This is also a significant enhancement for partners that build their solutions on the AWS US East/West Region: they can now achieve FedRAMP JAB P-ATOs of their own for their Platform as a Service (PaaS) and Software as a Service (SaaS) offerings.

In line with FedRAMP security requirements, a FedRAMP-accredited Third Party Assessment Organization (3PAO) completed an independent assessment of our technical, management, and operational security controls to validate that they meet or exceed FedRAMP's Moderate baseline requirements. Effective immediately, you can begin leveraging this P-ATO for the following 20 services in the AWS US East/West Region:

  • Amazon Aurora (MySQL)*
  • Amazon CloudWatch Logs*
  • Amazon DynamoDB
  • Amazon Elastic Block Store
  • Amazon Elastic Compute Cloud
  • Amazon EMR*
  • Amazon Glacier*
  • Amazon Kinesis Streams*
  • Amazon RDS (MySQL, Oracle, Postgres*)
  • Amazon Redshift
  • Amazon Simple Notification Service*
  • Amazon Simple Queue Service*
  • Amazon Simple Storage Service
  • Amazon Simple Workflow Service*
  • Amazon Virtual Private Cloud
  • AWS CloudFormation*
  • AWS CloudTrail*
  • AWS Identity and Access Management
  • AWS Key Management Service
  • Elastic Load Balancing

* Services with first-time FedRAMP Moderate authorizations

We continue to work with the FedRAMP Project Management Office (PMO), other regulatory and compliance bodies, and our customers and partners to ensure that we are raising the bar on our customers’ security and compliance needs.

To learn more about how AWS helps customers meet their security and compliance requirements, see the AWS Compliance website. To learn about what other public sector customers are doing on AWS, see our Government, Education, and Nonprofits Case Studies and Customer Success Stories. To review the public posting of our FedRAMP authorizations, see the FedRAMP Marketplace.

– Chris Gile, Senior Manager, AWS Public Sector Risk and Compliance

Tableau 10.4 Supports Amazon Redshift Spectrum with External Amazon S3 Tables

Post Syndicated from Robin Cottiss original https://aws.amazon.com/blogs/big-data/tableau-10-4-supports-amazon-redshift-spectrum-with-external-amazon-s3-tables/

This is a guest post by Robin Cottiss, strategic customer consultant, Russell Christopher, staff product manager, and Vaidy Krishnan, senior manager of product marketing, at Tableau. Tableau, in their own words, “helps anyone quickly analyze, visualize, and share information. More than 61,000 customer accounts get rapid results with Tableau in the office and on the go. Over 300,000 people use Tableau Public to share public data in their blogs and websites.”

We’re excited to announce today an update to our Amazon Redshift connector with support for Amazon Redshift Spectrum to analyze data in external Amazon S3 tables. This feature, the direct result of joint engineering and testing work performed by the teams at Tableau and AWS, was released as part of Tableau 10.3.3 and will be available broadly in Tableau 10.4.1. With this update, you can quickly and directly connect Tableau to data in Amazon Redshift and analyze it in conjunction with data in Amazon S3—all with drag-and-drop ease.

This connector is yet another in a series of market-leading integrations of Tableau with AWS’s analytics platform, with services such as Amazon Redshift, Amazon EMR, and Amazon Athena. These integrations have allowed Tableau to become the natural choice of tool for analyzing data stored on AWS. Beyond this, Tableau Server runs seamlessly in the AWS Cloud infrastructure. If you prefer to deploy all your applications inside AWS, you have a complete solution offering from Tableau.

How does support for Amazon Redshift Spectrum help you?

If you’re like many Tableau customers, you have large buckets of data stored in Amazon S3. You might need to access this data frequently and store it in a consistent, highly structured format. If so, you can provision it to a data warehouse like Amazon Redshift. You might also want to explore this S3 data on an ad hoc basis. For example, you might want to determine whether or not to provision the data, and where—options might be Hadoop, Impala, Amazon EMR, or Amazon Redshift. To do so, you can use Amazon Athena, a serverless interactive query service from AWS that requires no infrastructure setup and management.

But what if you want to analyze both the frequently accessed data stored locally in Amazon Redshift AND your full datasets stored cost-effectively in Amazon S3? What if you want the throughput of disk and sophisticated query optimization of Amazon Redshift AND a service that combines a serverless scale-out processing capability with the massively reliable and scalable S3 infrastructure? What if you want the super-fast performance of Amazon Redshift AND support for open storage formats (for example, Parquet or ORC) in S3?

To enable these ANDs and resolve the tyranny of ORs, AWS launched Amazon Redshift Spectrum earlier this year.

Amazon Redshift Spectrum gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it. Since the Amazon Redshift Spectrum launch, Tableau has worked tirelessly to provide best-in-class support for this new service. With Tableau and Redshift Spectrum, you can extend your Amazon Redshift analyses out to the entire universe of data in your S3 data lakes.

This latest update has been tested by many customers with very positive feedback. One such customer is the world's largest food product distributor, Sysco; you can watch their session referencing the Amazon Redshift Spectrum integration at Tableau Conference 2017. Sysco also plans to reprise its "Tableau on AWS" story in a month's time at AWS re:Invent.

Now, I’d like to use a concrete example to demonstrate how Tableau works with Amazon Redshift Spectrum. In this example, I also show you how and why you might want to connect to your AWS data in different ways.

The setup

I use the pipeline described following to ingest, process, and analyze data with Tableau on an AWS stack. The source data is the New York City Taxi dataset, which captures nine years' worth of taxi ride activity (including pick-up and drop-off location, amount paid, payment type, and so on) in 1.2 billion records.

In this pipeline, this data lands in S3, is cleansed and partitioned by using Amazon EMR, and is then converted to a columnar Parquet format that is analytically optimized. You can point Tableau to the raw data in S3 by using Amazon Athena. You can also access the cleansed data with Tableau using Presto through your Amazon EMR cluster.

Why use Tableau this early in the pipeline? Because sometimes you want to understand what’s there and what questions are worth asking before you even start the analysis.

After you find out what those questions are and determine if this sort of analysis has long-term usefulness, you can automate and optimize that pipeline. You do this to add new data as soon as possible as it arrives, to get it to the processes and people that need it. You might also want to provision this data to a highly performant “hotter” layer (Amazon Redshift or Tableau Extract) for repeated access.

In the illustration preceding, S3 contains the raw denormalized ride data at the timestamp level of granularity. This S3 data is the fact table. Amazon Redshift has the time dimensions broken out by date, month, and year, and also has the taxi zone information.

Now imagine I want to know where and when taxi pickups happen on a certain date in a certain borough. With support for Amazon Redshift Spectrum, I can now join the S3 tables with the Amazon Redshift dimensions, as shown following.
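
As a rough illustration, such a join could look like the following sketch, run from any SQL client connected to Amazon Redshift (psycopg2 here). The external schema, fact table, and dimension column names are assumptions rather than the exact schema used in this demo.

import psycopg2

# s3_external.taxi_rides is the hypothetical Spectrum fact table in S3;
# dim_taxi_zone and dim_date are hypothetical local Amazon Redshift dimensions.
SQL = """
    SELECT z.borough,
           d.hour_of_day,
           COUNT(*) AS pickups
    FROM s3_external.taxi_rides AS r
    JOIN dim_taxi_zone AS z ON r.pickup_zone_id = z.zone_id
    JOIN dim_date      AS d ON r.pickup_ts_key  = d.ts_key
    WHERE d.calendar_date = '2015-12-25'
    GROUP BY z.borough, d.hour_of_day
    ORDER BY pickups DESC
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="taxi", user="analyst", password="...")
with conn.cursor() as cur:
    cur.execute(SQL)
    for borough, hour_of_day, pickups in cur.fetchall():
        print(borough, hour_of_day, pickups)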

I can next analyze the data in Tableau to produce a borough-by-borough view of New York City ride density on Christmas Day 2015.

Or I can home in on just Manhattan and identify pickup hotspots, with ride charges way above the average!

With Amazon Redshift Spectrum, you now have a fast, cost-effective engine that minimizes data processed with dynamic partition pruning. You can further improve query performance by reducing the data scanned. You do this by partitioning and compressing data and by using a columnar format for storage.

At the end of the day, which engine you use behind Tableau is a function of what you want to optimize for. Some possible engines are Amazon Athena, Amazon Redshift, and Redshift Spectrum, or you can bring a subset of data into Tableau Extract. Factors in planning optimization include these:

  • Are you comfortable with the serverless cost model of Amazon Athena and the potential for full scans, or is its zero-setup advantage what matters most?
  • Do you want the throughput of local disk?
  • Effort and time of setup: are you okay with the lead time of setting up an Amazon Redshift cluster, as opposed to just bringing everything into a Tableau Extract?

To meet the many needs of our customers, Tableau's approach is simple: it's all about choice, the choice of how you want to connect to and analyze your data. Throughout the history of our product and into the future, we have empowered and will continue to empower choice for customers.

For more on how to deal with choice, as you go about making architecture decisions for your enterprise, watch this big data strategy session my friend Robin Cottiss and I delivered at Tableau Conference 2017. This session includes several customer examples leveraging the Tableau on AWS platform, and also a run-through of the aforementioned demonstration.

If you’re curious to learn more about analyzing data with Tableau on Amazon Redshift, we encourage you to check out the following resources:

Hot Startups on AWS – October 2017

Post Syndicated from Tina Barr original https://aws.amazon.com/blogs/aws/hot-startups-on-aws-october-2017/

In 2015, the Centers for Medicare and Medicaid Services (CMS) reported that healthcare spending made up 17.8% of the U.S. GDP – that’s almost $3.2 trillion or $9,990 per person. By 2025, the CMS estimates this number will increase to nearly 20%. As cloud technology evolves in the healthcare and life science industries, we are seeing how companies of all sizes are using AWS to provide powerful and innovative solutions to customers across the globe. This month we are excited to feature the following startups:

  • ClearCare – helping home care agencies operate efficiently and grow their business.
  • DNAnexus – providing a cloud-based global network for sharing and managing genomic data.

ClearCare (San Francisco, CA)

ClearCare envisions a future where home care is the only choice for aging in place. Home care agencies play a critical role in the economy and their communities by significantly lowering the overall cost of care, reducing the number of hospital admissions, and bending the cost curve of aging. Patients receiving home care typically have multiple chronic conditions and functional limitations, driving over $190 billion in healthcare spending in the U.S. each year. To offset these costs, health insurance payers are developing in-home care management programs for patients. ClearCare’s goal is to help home care agencies leverage technology to improve costs, outcomes, and quality of life for the aging population. The company’s powerful software platform is specifically designed for use by non-medical, in-home care agencies to manage their businesses.

Founder and CEO Geoff Nudd created ClearCare because of his own grandmother’s need for care. Keeping family members and caregivers up to date on a loved one’s well-being can be difficult, so Geoff created what is now ClearCare’s Family Room, which enables caregivers and agency staff to check schedules and receive real-time updates about what’s happening in the home. Since then, agencies have provided feedback on other areas of their businesses that could be streamlined. ClearCare has now built over 20 modules to help home care agencies optimize operations, with services including telephony, billing and payroll, and more. ClearCare now serves over 4,000 home care agencies, representing 500,000 caregivers and 400,000 seniors.

Using AWS, ClearCare is able to spin up reliable infrastructure for proofs of concept and iterate on those systems to quickly get value to market. The company runs many AWS services including Amazon Elasticsearch Service, Amazon RDS, and Amazon CloudFront. Amazon EMR and Amazon Athena have enabled ClearCare to build a Hadoop-based ETL and data warehousing system that processes terabytes of data each day. By utilizing these managed services, ClearCare has been able to go from concept to customer delivery in less than three months.

To learn more about ClearCare, check out their website.

DNAnexus (Mountain View, CA)

DNAnexus is accelerating the application of genomic data in precision medicine by providing a cloud-based platform for sharing and managing genomic and biomedical data and analysis tools. The company was founded in 2009 by Stanford graduate student Andreas Sundquist and two Stanford professors, Arend Sidow and Serafim Batzoglou, to address the need for scaling secondary analysis of next-generation sequencing (NGS) data in the cloud. The founders quickly learned that users needed a flexible solution to build complex analysis workflows and tools that enable them to share and manage large volumes of data. DNAnexus is optimized to address the challenges of security, scalability, and collaboration for organizations that are pursuing genomic-based approaches to health, both in clinics and research labs. DNAnexus has a global customer base – spanning North America, Europe, Asia-Pacific, South America, and Africa – that runs a million jobs each month and is doubling its storage year over year. The company currently stores more than 10 petabytes of biomedical and genomic data. That is equivalent to approximately 100,000 genomes, or in simpler terms, over 50 billion Facebook photos!

DNAnexus is working with its customers to help expand their translational informatics research, which includes expanding into clinical trial genomic services. This will help companies developing different medicines to better stratify clinical trial populations and develop companion tests that enable the right patient to get the right medicine. In collaboration with the Janssen Human Microbiome Institute, DNAnexus is also launching Mosaic – a community platform for microbiome research.

AWS provides DNAnexus and its customers the flexibility to grow and scale research programs. Building the technology infrastructure required to manage these projects in-house is expensive and time-consuming. DNAnexus removes that barrier for labs of any size by using scalable AWS cloud resources. The company deploys its customers’ genomic pipelines on Amazon EC2, using Amazon S3 for high-performance, high-durability storage and Amazon Glacier for low-cost data archiving. DNAnexus is also an AWS Life Sciences Competency Partner.

Learn more about DNAnexus here.

-Tina