Amazon QuickSight is a fully managed cloud business intelligence service that gives you fast, easy-to-use business analytics for big data. QuickSight makes business analytics available to organizations of all shapes and sizes, with the ability to access data stored in your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) databases, flat files in S3, and (via connectors) data stored in on-premises MySQL, PostgreSQL, and SQL Server databases. QuickSight scales to accommodate tens, hundreds, or thousands of users per organization.
Today we are launching a new, session-based pricing option for QuickSight, along with additional region support and other important new features. Let’s take a look at each one:
Pay-per-Session Pricing Our customers are making great use of QuickSight and taking full advantage of the power it gives them to connect to data sources, create reports, and explore visualizations.
However, not everyone in an organization needs or wants such powerful authoring capabilities. For these users, having access to curated data in dashboards, and being able to interact with the data by drilling down, filtering, or slicing and dicing, is more than adequate. Subscribing them to a monthly or annual plan can be seen as an unwarranted expense, so many of these casual users end up without access to interactive data or BI at all.
In order to allow customers to provide all of their users with interactive dashboards and reports, the Enterprise Edition of Amazon QuickSight now allows Reader access to dashboards on a Pay-per-Session basis. QuickSight users are now classified as Admins, Authors, or Readers, with distinct capabilities and prices:
Authors have access to the full power of QuickSight; they can establish database connections, upload new data, create ad hoc visualizations, and publish dashboards, all for $9 per month (Standard Edition) or $18 per month (Enterprise Edition).
Readers can view dashboards, slice and dice data using drill downs, filters and on-screen controls, and download data in CSV format, all within the secure QuickSight environment. Readers pay $0.30 for 30 minutes of access, with a monthly maximum of $5 per reader.
Admins have all authoring capabilities, and can manage users and purchase SPICE capacity in the account. The QuickSight admin now has the ability to set the desired option (Author or Reader) when they invite members of their organization to use QuickSight. They can extend Reader invites to their entire user base without incurring any up-front or monthly costs, paying only for the actual usage.
A New Region QuickSight is now available in the Asia Pacific (Tokyo) Region:
The UI is in English, with a localized version in the works.
Hourly Data Refresh Enterprise Edition SPICE data sets can now be set to refresh as frequently as every hour. In the past, each data set could be refreshed up to 5 times a day. To learn more, read Refreshing Imported Data.
Access to Data in Private VPCs This feature was launched in preview form late last year, and is now available in production form to users of the Enterprise Edition. As I noted at the time, you can use it to implement secure, private communication with data sources that do not have public connectivity, including on-premises data in Teradata or SQL Server, accessed over an AWS Direct Connect link. To learn more, read Working with AWS VPC.
Parameters with On-Screen Controls QuickSight dashboards can now include parameters that are set using on-screen dropdown, text box, numeric slider or date picker controls. The default value for each parameter can be set based on the user name (QuickSight calls this a dynamic default). You could, for example, set an appropriate default based on each user’s office location, department, or sales territory. Here’s an example:
URL Actions for Linked Dashboards You can now connect your QuickSight dashboards to external applications by defining URL actions on visuals. The actions can include parameters, and become available in the Details menu for the visual. URL actions are defined like this:
You can use this feature to link QuickSight dashboards to third party applications (e.g. Salesforce) or to your own internal applications. Read Custom URL Actions to learn how to use this feature.
Dashboard Sharing You can now share QuickSight dashboards with all of the users in an account.
Larger SPICE Tables The per-data set limit for SPICE tables has been raised from 10 GB to 25 GB.
Upgrade to Enterprise Edition The QuickSight administrator can now upgrade an account from Standard Edition to Enterprise Edition with a click. This enables provisioning of Readers with pay-per-session pricing, private VPC access, row-level security for dashboards and data sets, and hourly refresh of data sets. Enterprise Edition pricing applies after the upgrade.
Available Now Everything I listed above is available now and you can start using it today!
Amazon Neptune is now Generally Available in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. At the core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latencies. Neptune supports two popular graph models, Property Graph and RDF, through Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune can be used to power everything from recommendation engines and knowledge graphs to drug discovery and network security. Neptune is fully-managed with automatic minor version upgrades, backups, encryption, and fail-over. I wrote about Neptune in detail for AWS re:Invent last year and customers have been using the preview and providing great feedback that the team has used to prepare the service for GA.
Now that Amazon Neptune is generally available there are a few changes from the preview:
A large number of performance enhancements and updates
Launching a Neptune cluster is as easy as navigating to the AWS Management Console and clicking create cluster. Of course you can also launch with CloudFormation, the CLI, or the SDKs.
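If you want to script the launch, a minimal boto3 sketch might look like the following; the identifiers and instance class are placeholders, and networking and parameter-group options are omitted:

import boto3

neptune = boto3.client('neptune')

# Create the cluster, then add an instance to it (identifiers are placeholders).
neptune.create_db_cluster(
    DBClusterIdentifier='my-neptune-cluster',
    Engine='neptune'
)
neptune.create_db_instance(
    DBInstanceIdentifier='my-neptune-instance',
    DBInstanceClass='db.r4.large',
    Engine='neptune',
    DBClusterIdentifier='my-neptune-cluster'
)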
You can monitor your cluster health and the health of individual instances through Amazon CloudWatch and the console.
Additional Resources
We’ve created two repos with some additional tools and examples here. You can expect continuous development on these repos as we add additional tools and examples.
Amazon Neptune Tools Repo This repo has a useful tool for converting GraphML files into Neptune compatible CSVs for bulk loading from S3.
Amazon Neptune Samples Repo This repo has a really cool example of building a collaborative filtering recommendation engine for video game preferences.
Purpose Built Databases
There’s an industry trend toward purpose-built databases. Developers and businesses want to access their data in the format that makes the most sense for their applications. As cloud resources make transforming large datasets easier with tools like AWS Glue, we have far more options than we used to for accessing our data. With tools like Amazon Redshift, Amazon Athena, Amazon Aurora, Amazon DynamoDB, and more, we get to choose the best database for the job, or even enable entirely new use cases. Amazon Neptune is perfect for workloads where the data is highly connected across data-rich edges.
I’m really excited about graph databases and I see a huge number of applications. Looking for ideas of cool things to build? I’d love to build a web crawler in AWS Lambda that uses Neptune as the backing store. You could further enrich it by running Amazon Comprehend or Amazon Rekognition on the text and images found and creating a search engine on top of Neptune.
As always, feel free to reach out in the comments or on Twitter to provide any feedback!
This post is courtesy of Otavio Ferreira, Manager, Amazon SNS, AWS Messaging.
Amazon SNS message filtering provides a set of string and numeric matching operators that allow each subscription to receive only the messages of interest. Hence, SNS message filtering can simplify your pub/sub messaging architecture by offloading the message filtering logic from your subscriber systems, as well as the message routing logic from your publisher systems.
After you set the subscription attribute that defines a filter policy, the subscribing endpoint receives only the messages that carry attributes matching this filter policy. Other messages published to the topic are filtered out for this subscription. The native integration between SNS and Amazon CloudWatch then provides visibility into the number of messages delivered, as well as the number of messages filtered out.
CloudWatch metrics are captured automatically for you. To get started with SNS message filtering, see Filtering Messages with Amazon SNS.
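As a quick boto3 sketch of what setting a filter policy looks like (the subscription ARN and the policy itself are placeholders):

import boto3
import json

sns = boto3.client('sns')

# Attach a filter policy so this subscription only receives messages
# whose "sport" attribute equals "rugby" (the ARN is a placeholder).
sns.set_subscription_attributes(
    SubscriptionArn='arn:aws:sns:us-east-1:123456789012:MyTopic:1234abcd-12ab-34cd-56ef-1234567890ab',
    AttributeName='FilterPolicy',
    AttributeValue=json.dumps({'sport': ['rugby']})
)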
Message Filtering Metrics
The following six CloudWatch metrics are relevant to understanding your SNS message filtering activity:
NumberOfMessagesPublished – Inbound traffic to SNS. This metric tracks all the messages that have been published to the topic.
NumberOfNotificationsDelivered – Outbound traffic from SNS. This metric tracks all the messages that have been successfully delivered to endpoints subscribed to the topic. A delivery takes place either when the incoming message attributes match a subscription filter policy, or when the subscription has no filter policy at all, which results in a catch-all behavior.
NumberOfNotificationsFilteredOut – This metric tracks all the messages that were filtered out because they carried attributes that didn’t match the subscription filter policy.
NumberOfNotificationsFilteredOut-NoMessageAttributes – This metric tracks all the messages that were filtered out because they didn’t carry any attributes at all and, consequently, didn’t match the subscription filter policy.
NumberOfNotificationsFilteredOut-InvalidAttributes – This metric keeps track of messages that were filtered out because they carried invalid or malformed attributes and, thus, didn’t match the subscription filter policy.
NumberOfNotificationsFailed – This last metric tracks all the messages that failed to be delivered to subscribing endpoints, regardless of whether a filter policy had been set for the endpoint. This metric is emitted after the message delivery retry policy is exhausted, and SNS stops attempting to deliver the message. At that moment, the subscribing endpoint is likely no longer reachable. For example, the subscribing SQS queue or Lambda function has been deleted by its owner. You may want to closely monitor this metric to address message delivery issues quickly.
Message filtering graphs
Through the AWS Management Console, you can compose graphs to display your SNS message filtering activity. The graph shows the number of messages published, delivered, and filtered out within the timeframe you specify (1h, 3h, 12h, 1d, 3d, 1w, or custom).
To compose an SNS message filtering graph with CloudWatch:
Open the CloudWatch console.
Choose Metrics, SNS, All Metrics, and Topic Metrics.
Select all metrics to add to the graph, such as:
NumberOfMessagesPublished
NumberOfNotificationsDelivered
NumberOfNotificationsFilteredOut
Choose Graphed metrics.
In the Statistic column, switch from Average to Sum.
Title your graph with a descriptive name, such as “SNS Message Filtering”
After you have your graph set up, you may want to copy the graph link for bookmarking, emailing, or sharing with co-workers. You may also want to add your graph to a CloudWatch dashboard for easy access in the future. Both actions are available to you on the Actions menu, which is found above the graph.
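If you would rather pull the same numbers programmatically, a boto3 sketch along these lines works; the topic name and time window are placeholders:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Sum of messages filtered out for a topic over the last day, in 1-hour buckets.
stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/SNS',
    MetricName='NumberOfNotificationsFilteredOut',
    Dimensions=[{'Name': 'TopicName', 'Value': 'MyTopic'}],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=['Sum']
)
print(stats['Datapoints'])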
Summary
SNS message filtering defines how SNS topics behave in terms of message delivery. By using CloudWatch metrics, you gain visibility into the number of messages published, delivered, and filtered out. This enables you to validate the operation of filter policies and more easily troubleshoot during development phases.
SNS message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). CloudWatch metrics for SNS message filtering are available now in all AWS Regions.
In February, the bpfilter mechanism was first posted to the mailing lists. Bpfilter is meant to be a replacement for the current in-kernel firewall/packet-filtering code. It provides little functionality itself; instead, it creates a set of hooks that can run BPF programs to make the packet-filtering decisions. A version of that patch set has been merged into the net-next tree for 4.18. It will not be replacing any existing packet filters in its current form, but it does feature a significant change to one of its more controversial features: the new user-mode helper mechanism.
The CFP will close on July 30th. Notification of acceptance and non-acceptance will go out within 7 days of the closing of the CFP.
All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:
Low-level container executors and infrastructure
IoT and embedded OS infrastructure
BPF and eBPF filtering
OS, container, IoT image delivery and updating
Building Linux devices and applications
Low-level desktop technologies
Networking
System and service management
Tracing and performance measuring
IPC and RPC systems
Security and Sandboxing
While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome, as long as they have a clear and direct relevance for user-space.
One of the more useful features of masscan is the “--banners” check, which connects to the TCP port, sends a request, and gets a basic response back. However, since masscan has its own TCP stack, it’ll interfere with the operating system’s TCP stack if they are sharing the same IPv4 address. The operating system will reply with a RST packet before the TCP connection can be established.
The way to fix this is to use the built-in packet-filtering firewall to block those packets in the operating-system TCP/IP stack. The masscan program still sees everything before the packet-filter, but the operating system can’t see anything after the packet-filter.
Note that we are talking about the “packet-filter” firewall feature here. Remember that macOS, like most operating systems these days, has two separate firewalls: an application firewall and a packet-filter firewall. The application firewall is the one you see in System Settings labeled “Firewall”, and it controls things based upon the application’s identity rather than by which ports it uses. This is normally “on” by default. The packet-filter is normally “off” by default and is of little use to normal users.
Also note that macOS changed packet-filters around version 10.10.5 (“Yosemite”, October 2014). The older one is known as “ipfw”, which was the default firewall for FreeBSD (much of macOS is based on FreeBSD). The replacement is known as PF, which comes from OpenBSD. Whereas you used to use the old “ipfw” command on the command line, you now use the “pfctl” command, as well as the “/etc/pf.conf” configuration file.
What we need to filter is the source port of the packets that masscan will send, so that when replies are received, they won’t reach the operating-system stack but instead go only to masscan. To do this, we need to find a range of ports that won’t conflict with the operating system. When the operating system creates outgoing connections, it randomly chooses a source port within a certain range, so we want masscan to use source ports in a different range.
To figure out the range macOS uses, we run the following command:
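sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last

(These sysctl keys hold the first and last source ports that macOS assigns to outgoing connections.)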
On my laptop, which is probably the default for macOS, I get the following range. Sniffing with Wireshark confirms this is the range used for source ports for outgoing connections.
So this means I shouldn’t use source ports anywhere in the range 49152 to 65535. On my laptop, I’ve decided to use for masscan the ports 40000 to 41023. The range masscan uses must be a power of 2, so here I’m using 1024 (two to the tenth power).
To configure masscan, I can either type the parameter “--source-port 40000-41023” every time I run the program, or I can add the following line to /etc/masscan/masscan.conf. Remember that by default, masscan will look in that configuration file for any configuration parameters, so you don’t have to keep retyping them on the command line.
source-port = 40000-41023
Next, I need to add the following firewall rule to the bottom of /etc/pf.conf:
block in proto tcp from any to any port 40000 >< 41024
However, we aren’t done yet. The packet-filter firewall is off by default on some versions of macOS, so every time you reboot your computer, you need to enable it. The simple way to do this is to run the following on the command line:
pfctl -e
Or, if that doesn’t work, try:
pfctl -E
If the firewall is already running, then you’ll need to load the file explicitly (or reboot):
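pfctl -f /etc/pf.conf

(The -f option reloads the rule set from the named file.)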
Brandon Williams writes about the new Git remote protocol that will debut in the 2.18 release. “We recently rolled out support for protocol version 2 at Google and have seen a performance improvement of 3x for no-op fetches of a single branch on repositories containing 500k references. Protocol v2 has also enabled a reduction of 8x of the overhead bytes (non-packfile) sent from googlesource.com servers. A majority of this improvement is due to filtering references advertised by the server to the refs the client has expressed interest in.”
Today, I’m pleased to announce that, as of April 24th 2018, the AWS IoT Analytics service is generally available. Customers can use IoT Analytics to clean, process, enrich, store, and analyze their connected device data at scale. AWS IoT Analytics is now available in US East (N. Virginia), US West (Oregon), US East (Ohio), and EU (Ireland). In November of last year, my colleague Tara Walker wrote an excellent post that walks through some of the features of the AWS IoT Analytics service, and Ben Kehoe (an AWS Community Hero and Research Scientist at iRobot) spoke at AWS re:Invent about replacing iRobot’s existing “Rube Goldberg machine” for forwarding data into an Elasticsearch cluster with AWS IoT Analytics.
Iterating on customer feedback received during the service preview, the AWS IoT Analytics team has added a number of new features, including the ability to ingest data from external sources using the BatchPutMessage API, the ability to set a data retention policy on stored data, the ability to reprocess existing data, preview pipeline results, and preview messages from channels with the SampleChannelData API.
Let’s cover the core concepts of IoT Analytics and then walk through an example.
AWS IoT Analytics Concepts
AWS IoT Analytics can be broken down into a few simple concepts. For data preparation customers have: Channels, Pipelines, and Data Stores. For analyzing data customers have: Datasets and Notebooks.
Data Preparation
Channels are the entry point into IoT Analytics. They collect data from an existing IoT Core MQTT topic or from external sources that send messages to the channel using the ingestion API. Channels are elastically scalable and consume messages in binary or JSON format. Channels also immutably store raw device data for easy reprocessing with different logic if your needs change.
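As a rough boto3 sketch of sending data through that ingestion API (the channel name and payload below are placeholders):

import boto3
import json

iot_analytics = boto3.client('iotanalytics')

# Send a batch of messages into an existing channel (name is a placeholder).
iot_analytics.batch_put_message(
    channelName='my_channel',
    messages=[
        {
            'messageId': '1',
            'payload': json.dumps({'device_id': 'sensor-42', 'temperature': 22.5}).encode('utf-8')
        }
    ]
)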
Pipelines consume messages from channels and allow you to process messages with steps, called activities, such as filtering on attributes, transforming the content of the message by adding or removing fields, invoking Lambda functions for complex transformations and adding data from external data sources, or even enriching the messages with data from IoT Core. Pipelines output their data to a data store.
Data Stores are a queryable IoT-optimized data storage solution for the output of your pipelines. Data stores support custom retention periods to optimize costs. When a customer queries a Data Store the result is put into a Dataset.
Data Analytics
Datasets are similar to a view in a SQL database. Customers create a dataset by running a query against a data store. Data sets can be generated manually or on a recurring schedule.
Notebooks are Amazon SageMaker hosted Jupyter notebooks that let customers analyze their data with custom code and even build or train ML models on the data. IoT Analytics offers several notebook templates with pre-authored models for common IoT use cases such as Predictive Maintenance, Anomaly Detection, Fleet Segmentation, and Forecasting.
Additionally, you can use IoT Analytics as a data source for Amazon QuickSight for easy visualizations of your data. You can find pricing information for each of these services on the AWS IoT Analytics Pricing Page.
IoT Analytics Walkthrough
While this walkthrough uses the console, everything shown here is equally easy to do with the CLI. When we first navigate to the console, we get a helpful guide telling us to build a channel, a pipeline, and a data store: Our first step is to create a channel. I already have some data arriving on an MQTT topic in IoT Core, so I’ll select that topic. First we’ll name the channel and select a retention period.
Now, I’ll select my IoT Core topic and grab the data. I can also post messages directly into the channel with the BatchPutMessage API.
Now that I have a channel my next step is to create a pipeline. To do this I’ll select “Create a pipeline from this channel” from the “Actions” drop down.
Now, I’ll walk through the pipeline wizard giving my pipeline a name and a source.
I’ll select which of the message attributes the pipeline should expect. The pipeline can sample the channel with the sampling API and infer which attributes are needed, or I can upload a specification in JSON.
Next I define the pipeline activities. If I’m dealing with binary data, I need a Lambda function to first deserialize the message into JSON so the other activities can operate on it. I can create filters, calculate attributes based on other attributes, and enrich the message with metadata from the IoT Core registry.
For now I just want to filter out some messages and make a small transform with a Lambda function.
Finally, I choose or create a data store to output the results of my pipeline.
Now that I have a data store, I can create a view of that data by creating a data set.
I’ll just select all the data from the data store for this dataset but I could also select individual attributes as needed.
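The equivalent API call looks roughly like this boto3 sketch; the dataset and data store names are placeholders:

import boto3

iot_analytics = boto3.client('iotanalytics')

# Create a dataset that materializes a SQL query against the data store.
iot_analytics.create_dataset(
    datasetName='my_dataset',
    actions=[
        {
            'actionName': 'select_all',
            'queryAction': {'sqlQuery': 'SELECT * FROM my_datastore'}
        }
    ]
)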
I have a data set! I can adjust the cron expression in the schedule to re-run this as frequently or infrequently as I wish.
If I want to create a model from my data I can create a SageMaker powered Jupyter notebook. There are a few templates that are great starting points like anomaly detection or output forecasting.
Here you can see an example of the anomaly detection notebook.
Finally, if I want to create simple visualizations of my data I can use QuickSight to bring in an IoT Analytics data set.
Let Us Know
I’m excited to see what customers build with AWS IoT Analytics. My colleagues on the IoT teams are eager to hear your feedback about the service so please let us know in the comments or on Twitter what features you want to see.
AWS Glue provides enhanced support for working with datasets that are organized into Hive-style partitions. AWS Glue crawlers automatically identify partitions in your Amazon S3 data. The AWS Glue ETL (extract, transform, and load) library natively supports partitions when you work with DynamicFrames. DynamicFrames represent a distributed collection of data without requiring you to specify a schema. You can now push down predicates when creating DynamicFrames to filter out partitions and avoid costly calls to S3. We have also added support for writing DynamicFrames directly into partitioned directories without converting them to Apache Spark DataFrames.
Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. For example, you might decide to partition your application logs in Amazon S3 by date—broken down by year, month, and day. Files corresponding to a single day’s worth of data would then be placed under a prefix such as s3://my_bucket/logs/year=2018/month=01/day=23/.
Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value without making unnecessary calls to Amazon S3. This can significantly improve the performance of applications that need to read only a few partitions.
In this post, we show you how to efficiently process partitioned datasets using AWS Glue. First, we cover how to set up a crawler to automatically scan your partitioned dataset and create a table and partitions in the AWS Glue Data Catalog. Then, we introduce some features of the AWS Glue ETL library for working with partitioned data. You can now filter partitions using SQL expressions or user-defined functions to avoid listing and reading unnecessary data from Amazon S3. We’ve also added support in the ETL library for writing AWS Glue DynamicFrames directly into partitions without relying on Spark SQL DataFrames.
Let’s get started!
Crawling partitioned data
In this example, we use the same GitHub archive dataset that we introduced in a previous post about Scala support in AWS Glue. This data, which is publicly available from the GitHub archive, contains a JSON record for every API request made to the GitHub service. A sample dataset containing one month of activity from January 2017 is available at the following location:
Here you can replace <region> with the AWS Region in which you are working, for example, us-east-1. This dataset is partitioned by year, month, and day, so an actual file will be at a path like the following:
To crawl this data, you can either follow the instructions in the AWS Glue Developer Guide or use the provided AWS CloudFormation template. This template creates a stack that contains the following:
An IAM role with permissions to access AWS Glue resources
A database in the AWS Glue Data Catalog named githubarchive_month
A crawler set up to crawl the GitHub dataset
An AWS Glue development endpoint (which is used in the next section to transform the data)
To run this template, you must provide an S3 bucket and prefix where you can write output data in the next section. The role that this template creates will have permission to write to this bucket only. You also need to provide a public SSH key for connecting to the development endpoint. For more information about creating an SSH key, see our Development Endpoint tutorial. After you create the AWS CloudFormation stack, you can run the crawler from the AWS Glue console.
In addition to inferring file types and schemas, crawlers automatically identify the partition structure of your dataset and populate the AWS Glue Data Catalog. This ensures that your data is correctly grouped into logical tables and makes the partition columns available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.
After you crawl the table, you can view the partitions by navigating to the table in the AWS Glue console and choosing View partitions. The partitions should look like the following:
For partitioned paths in the Hive style key=val, crawlers automatically populate the column name. In this case, because the GitHub data is stored in directories of the form 2017/01/01, the crawlers use default names like partition_0, partition_1, and so on. You can easily change these names on the AWS Glue console: navigate to the table, choose Edit schema, and rename partition_0 to year, partition_1 to month, and partition_2 to day:
Now that you’ve crawled the dataset and named your partitions appropriately, let’s see how to work with partitioned data in an AWS Glue ETL job.
Transforming and filtering the data
To get started with the AWS Glue ETL libraries, you can use an AWS Glue development endpoint and an Apache Zeppelin notebook. AWS Glue development endpoints provide an interactive environment to build and run scripts using Apache Spark and the AWS Glue ETL library. They are great for debugging and exploratory analysis, and can be used to develop and test scripts before migrating them to a recurring job.
If you ran the AWS CloudFormation template in the previous section, then you already have a development endpoint named partition-endpoint in your account. Otherwise, you can follow the instructions in this development endpoint tutorial. In either case, you need to set up an Apache Zeppelin notebook, either locally, or on an EC2 instance. You can find more information about development endpoints and notebooks in the AWS Glue Developer Guide.
The following examples are all written in the Scala programming language, but they can all be implemented in Python with minimal changes.
Reading a partitioned dataset
To get started, let’s read the dataset and see how the partitions are reflected in the schema. First, you import some classes that you will need for this example and set up a GlueContext, which is the main class that you will use to read and write data.
Execute the following in a Zeppelin paragraph, which is a unit of executable code:
%spark
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import java.util.Calendar
import java.util.GregorianCalendar
import scala.collection.JavaConversions._
@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
This is straightforward with two caveats: First, each paragraph must start with the line %spark to indicate that the paragraph is Scala. Second, the spark variable must be marked @transient to avoid serialization issues. This is only necessary when running in a Zeppelin notebook.
Next, read the GitHub data into a DynamicFrame, which is the primary data structure that is used in AWS Glue scripts to represent a distributed collection of data. A DynamicFrame is similar to a Spark DataFrame, except that it has additional enhancements for ETL transformations. DynamicFrames are discussed further in the post AWS Glue Now Supports Scala Scripts, and in the AWS Glue API documentation.
The following snippet creates a DynamicFrame by referencing the Data Catalog table that you just crawled and then prints the schema:
%spark
val githubEvents: DynamicFrame = glueContext.getCatalogSource(
  database = "githubarchive_month",
  tableName = "data"
).getDynamicFrame()

githubEvents.schema.asFieldList.foreach { field =>
  println(s"${field.getName}: ${field.getType.getType.getName}")
}
You could also print the full schema using githubEvents.printSchema(). But in this case, the full schema is quite large, so I’ve printed only the top-level columns. This paragraph takes about 5 minutes to run on a standard size AWS Glue development endpoint. After it runs, you should see the following output:
Note that the partition columns year, month, and day were automatically added to each record.
Filtering by partition columns
One of the primary reasons for partitioning data is to make it easier to operate on a subset of the partitions, so now let’s see how to filter data by the partition columns. In particular, let’s find out what people are building in their free time by looking at GitHub activity on the weekends. One way to accomplish this is to use the filter transformation on the githubEvents DynamicFrame that you created earlier to select the appropriate events:
%spark
def filterWeekend(rec: DynamicRecord): Boolean = {
  def getAsInt(field: String): Int = {
    rec.getField(field) match {
      case Some(strVal: String) => strVal.toInt
      // The filter transformation will catch exceptions and mark the record as an error.
      case _ => throw new IllegalArgumentException(s"Unable to extract field $field")
    }
  }

  val (year, month, day) = (getAsInt("year"), getAsInt("month"), getAsInt("day"))
  val cal = new GregorianCalendar(year, month - 1, day) // Calendar months start at 0.
  val dayOfWeek = cal.get(Calendar.DAY_OF_WEEK)

  dayOfWeek == Calendar.SATURDAY || dayOfWeek == Calendar.SUNDAY
}

val filteredEvents = githubEvents.filter(filterWeekend)
filteredEvents.count
This snippet defines the filterWeekend function that uses the Java Calendar class to identify those records where the partition columns (year, month, and day) fall on a weekend. If you run this code, you see that there were 6,303,480 GitHub events falling on the weekend in January 2017, out of a total of 29,160,561 events. This seems reasonable—about 22 percent of the events fell on the weekend, and about 29 percent of the days that month fell on the weekend (9 out of 31). So people are using GitHub slightly less on the weekends, but there is still a lot of activity!
Predicate pushdowns for partition columns
The main downside to using the filter transformation in this way is that you have to list and read all files in the entire dataset from Amazon S3 even though you need only a small fraction of them. This is manageable when dealing with a single month’s worth of data. But as you try to process more data, you will spend an increasing amount of time reading records only to immediately discard them.
To address this issue, we recently released support for pushing down predicates on partition columns that are specified in the AWS Glue Data Catalog. Instead of reading the data and filtering the DynamicFrame at executors in the cluster, you apply the filter directly on the partition metadata available from the catalog. Then you list and read only the partitions from S3 that you need to process.
To accomplish this, you can specify a Spark SQL predicate as an additional parameter to the getCatalogSource method. This predicate can be any SQL expression or user-defined function as long as it uses only the partition columns for filtering. Remember that you are applying this to the metadata stored in the catalog, so you don’t have access to other fields in the schema.
The following snippet shows how to use this functionality to read only those partitions occurring on a weekend:
%spark
val partitionPredicate =
  "date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"

val pushdownEvents = glueContext.getCatalogSource(
  database = "githubarchive_month",
  tableName = "data",
  pushDownPredicate = partitionPredicate).getDynamicFrame()
Here you use the SparkSQL string concat function to construct a date string. You use the to_date function to convert it to a date object, and the date_format function with the ‘E’ pattern to convert the date to a three-character day of the week (for example, Mon, Tue, and so on). For more information about these functions, Spark SQL expressions, and user-defined functions in general, see the Spark SQL documentation and list of functions.
Note that the pushdownPredicate parameter is also available in Python. The corresponding call in Python is as follows:
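(The sketch below assumes a Python Glue job or notebook where glueContext has already been created, mirroring the Scala example above.)

partition_predicate = (
    "date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"
)

# The predicate is passed as push_down_predicate when creating the DynamicFrame
# from the Data Catalog.
pushdown_events = glueContext.create_dynamic_frame.from_catalog(
    database="githubarchive_month",
    table_name="data",
    push_down_predicate=partition_predicate)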
You can observe the performance impact of pushing down predicates by looking at the execution time reported for each Zeppelin paragraph. The initial approach using a Scala filter function took 2.5 minutes:
Because the version using a pushdown lists and reads much less data, it takes only 24 seconds to complete, roughly a 6x improvement!
Of course, the exact benefit that you see depends on the selectivity of your filter. The more partitions that you exclude, the more improvement you will see.
In addition to Hive-style partitioning for Amazon S3 paths, the Parquet and ORC file formats further partition each file into blocks of data that represent column values. Each block also stores statistics for the records that it contains, such as min/max column values. AWS Glue supports pushdown predicates for both Hive-style partitions and block partitions in these formats. While reading data, it prunes unnecessary S3 partitions and also uses the column statistics in Parquet and ORC files to skip blocks that cannot contain matching records.
Additional transformations
Now that you’ve read and filtered your dataset, you can apply any additional transformations to clean or modify the data. For example, you could augment it with sentiment analysis as described in the previous AWS Glue post.
To keep things simple, you can just pick out some columns from the dataset using the ApplyMapping transformation:
ApplyMapping is a flexible transformation for performing projection and type-casting. In this example, we use it to unnest several fields, such as actor.login, which we map to the top-level actor field. We also cast the id column to a long and the partition columns to integers.
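Continuing the Python sketch from above, an ApplyMapping call might look like the following; the exact field list and the source types are assumptions based on the crawled schema:

from awsglue.transforms import ApplyMapping

# Project and cast a handful of columns; the partition columns become integers
# and id becomes a long.
projected_events = ApplyMapping.apply(
    frame=pushdown_events,
    mappings=[
        ("id", "string", "id", "long"),
        ("type", "string", "type", "string"),
        ("actor.login", "string", "actor", "string"),
        ("repo.name", "string", "repo", "string"),
        ("year", "string", "year", "int"),
        ("month", "string", "month", "int"),
        ("day", "string", "day", "int")
    ])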
Writing out partitioned data
The final step is to write out your transformed dataset to Amazon S3 so that you can process it with other systems like Amazon Athena. By default, when you write out a DynamicFrame, it is not partitioned—all the output files are written at the top level under the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it into a Spark SQL DataFrame before writing. We are excited to share that DynamicFrames now support native partitioning by a sequence of keys.
You can accomplish this by passing the additional partitionKeys option when creating a sink. For example, the following code writes out the dataset that you created earlier in Parquet format to S3 in directories partitioned by the type field.
Here, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter can also be specified in Python in the connection_options dict:
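Continuing the Python sketch, with $outpath again standing in for your base output path:

# Write the DynamicFrame to S3 in Parquet format, partitioned by the "type" field.
glueContext.write_dynamic_frame.from_options(
    frame=projected_events,
    connection_type="s3",
    connection_options={"path": "$outpath", "partitionKeys": ["type"]},
    format="parquet")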
When you execute this write, the type field is removed from the individual records and is encoded in the directory structure. To demonstrate this, you can list the output path using the aws s3 ls command from the AWS CLI:
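aws s3 ls $outpath/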
PRE type=CommitCommentEvent/
PRE type=CreateEvent/
PRE type=DeleteEvent/
PRE type=ForkEvent/
PRE type=GollumEvent/
PRE type=IssueCommentEvent/
PRE type=IssuesEvent/
PRE type=MemberEvent/
PRE type=PublicEvent/
PRE type=PullRequestEvent/
PRE type=PullRequestReviewCommentEvent/
PRE type=PushEvent/
PRE type=ReleaseEvent/
PRE type=WatchEvent/
As expected, there is a partition for each distinct event type. In this example, we partitioned by a single value, but this is by no means required. For example, if you want to preserve the original partitioning by year, month, and day, you could simply set the partitionKeys option to be Seq("year", "month", "day").
Conclusion
In this post, we showed you how to work with partitioned data in AWS Glue. Partitioning is a crucial technique for getting the most out of your large datasets. Many tools in the AWS big data ecosystem, including Amazon Athena and Amazon Redshift Spectrum, take advantage of partitions to accelerate query processing. AWS Glue provides mechanisms to crawl, filter, and write partitioned data so that you can structure your data in Amazon S3 however you want, to get the best performance out of your big data applications.
Ben Sowell is a senior software development engineer at AWS Glue. He has worked for more than 5 years on ETL systems to help users unlock the potential of their data. In his free time, he enjoys reading and exploring the Bay Area.
Mohit Saxena is a senior software development engineer at AWS Glue. His passion is building scalable distributed systems for efficiently managing data on cloud. He also enjoys watching movies and reading about the latest technology.
As I have discussed in the past, sophisticated AWS customers invariably control multiple AWS accounts. Some of these are the result of acquisitions or a holdover from bottom-up, departmental adoption of cloud computing. Others create multiple accounts in order to isolate developers, projects, or departments from each other. We strongly endorse this as a best practice, and back it up with cross-account features in many AWS services, as well as AWS Organizations for policy-based management that spans accounts. Many of these customers also make great use of AWS Config and use Config Rules (both their own and those supplied by Config) to check their AWS resources for compliance.
Aggregate Across Accounts and Regions Today we are making Config Rules even more useful by adding the ability to aggregate the compliance data produced by their rules across multiple AWS accounts and/or Regions. The aggregated data can then be viewed in a single dashboard, making this a great way to improve governance and compliance. Even better, the aggregation and dashboard are available at no charge to all AWS Config users!
I’ll show you how to set this up in a moment. First, let’s define a couple of terms:
Aggregator – This is a new Config resource. It identifies the sources (accounts and regions) of the compliance data to be aggregated. Multiple aggregators can be used simultaneously, giving you the ability to fine-tune your governance and compliance model.
Aggregator account – This is an AWS account that owns one or more aggregators.
Source account – This is an AWS account that has compliance data to be aggregated.
Aggregated view – A dashboard that shows compliant and non-compliant rules for an aggregator.
Here’s how it all fits together:
Setting up Aggregation Let’s set up aggregation for some AWS Config data! The first steps take place in the aggregator account. I open the Config Console, find the Aggregated View section and click Aggregators:
I review the list of Aggregators, and click Add aggregator to make a new one:
I grant AWS Config permission to replicate data from the source accounts and enter a name for my aggregator (MyAgg):
Next, I select the source accounts. I have three options here: I can manually add the account IDs, upload a file that contains a comma-separated list, or add all of the accounts in my AWS Organization:
I click on Add source accounts to manually add one account, enter the ID, and click Add source accounts:
Next, I choose the regions of interest, with the option to select current regions as well as future ones, then click Save to move ahead:
The next step takes place in the source account, within the Config Console. An Authorization request appears:
And I confirm it:
You can use CloudFormation StackSets to enable authorization programmatically across all source accounts. Also note that the authorization step is not needed if you choose to aggregate all accounts in your AWS Organization.
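If you prefer the API to the console, the same setup can be sketched with boto3; the aggregator name, account IDs, and Regions below are placeholders:

import boto3

# In the aggregator account: create the aggregator.
config = boto3.client('config')
config.put_configuration_aggregator(
    ConfigurationAggregatorName='MyAgg',
    AccountAggregationSources=[{
        'AccountIds': ['111122223333'],           # source account (placeholder)
        'AllAwsRegions': False,
        'AwsRegions': ['us-east-1', 'us-west-2']
    }]
)

# In each source account: authorize the aggregator account and region.
source_config = boto3.client('config')  # use credentials for the source account
source_config.put_aggregation_authorization(
    AuthorizedAccountId='444455556666',           # aggregator account (placeholder)
    AuthorizedAwsRegion='us-east-1'
)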
Compliance data from the source account begins to flow to the aggregator account and becomes visible in the Console, generally within 2-5 minutes:
As you can see, I have a multitude of filtering options! I can focus the view on a particular region or account, and I can see which rules or accounts have the most issues to address. For example, I can see all of the buckets that do not have server-side encryption enabled:
I can also look at the overall compliance situation for an account, seeing both compliant and non-compliant resources:
Things to Know This new feature is available today in the US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Sydney), and Asia Pacific (Singapore) regions at no charge and you can start using it today. You pay for the use of Config and Config Rules as usual.
The multi-account, multi-region data aggregation capability in AWS Config allows you to view the compliance status of your accounts from a central account. It assumes that you have already enabled Config and Config Rules across your accounts (you can use CloudFormation StackSets to distribute and deploy your Config Rules across multiple accounts).
This blog was contributed by Otavio Ferreira, Software Development Manager for Amazon SNS
Message filtering simplifies the overall pub/sub messaging architecture by offloading message filtering logic from subscribers, as well as message routing logic from publishers. The initial launch of message filtering provided a basic operator that was based on exact string comparison. For more information, see Simplify Your Pub/Sub Messaging with Amazon SNS Message Filtering.
Today, AWS is announcing an additional set of filtering operators that bring even more power and flexibility to your pub/sub messaging use cases.
Message filtering operators
Amazon SNS now supports both numeric and string matching. Specifically, string matching operators allow for exact, prefix, and “anything-but” comparisons, while numeric matching operators allow for exact and range comparisons, as outlined below. Numeric matching operators work for values between -10e9 and +10e9 inclusive, with five digits of accuracy right of the decimal point.
Anything-but matching on string values (Blacklisting): Subscription filter policy {"sport": [{"anything-but": "rugby"}]} matches message attributes such as {"sport": "baseball"} and {"sport": "basketball"} and {"sport": "football"} but not {"sport": "rugby"}
Prefix matching on string values: Subscription filter policy {"sport": [{"prefix": "bas"}]} matches message attributes such as {"sport": "baseball"} and {"sport": "basketball"}
Exact matching on numeric values: Subscription filter policy {"balance": [{"numeric": ["=", 301.5]}]} matches message attributes {"balance": 301.500} and {"balance": 3.015e2}
Range matching on numeric values: Subscription filter policy {"balance": [{"numeric": ["<", 0]}]} matches negative numbers only, and {"balance": [{"numeric": [">", 0, "<=", 150]}]} matches any positive number up to 150.
As usual, you may apply the “AND” logic by appending multiple keys in the subscription filter policy, and the “OR” logic by appending multiple values for the same key, as follows:
AND logic: Subscription filter policy {"sport": ["rugby"], "language": ["English"]} matches only messages that carry both attributes {"sport": "rugby"} and {"language": "English"}
OR logic: Subscription filter policy {"sport": ["rugby", "football"]} matches messages that carry either the attribute {"sport": "rugby"} or {"sport": "football"}
Message filtering operators in action
Here’s how this new set of filtering operators works. The following example is based on a pharmaceutical company that develops, produces, and markets a variety of prescription drugs, with research labs located in Asia Pacific and Europe. The company built an internal procurement system to manage the purchasing of lab supplies (for example, chemicals and utensils), office supplies (for example, paper, folders, and markers) and tech supplies (for example, laptops, monitors, and printers) from global suppliers.
This distributed system is composed of the four following subsystems:
A requisition system that presents the catalog of products from suppliers, and takes orders from buyers
An approval system for orders targeted to Asia Pacific labs
Another approval system for orders targeted to European labs
A fulfillment system that integrates with shipping partners
As shown in the following diagram, the company leverages AWS messaging services to integrate these distributed systems.
Firstly, an SNS topic named “Orders” was created to take all orders placed by buyers on the requisition system.
Secondly, two Amazon SQS queues, named “Lab-Orders-AP” and “Lab-Orders-EU” (for Asia Pacific and Europe respectively), were created to backlog orders that are up for review on the approval systems.
Lastly, an SQS queue named “Common-Orders” was created to backlog orders that aren’t related to lab supplies, which can already be picked up by shipping partners on the fulfillment system.
The company also uses AWS Lambda functions to automatically process lab supply orders that don’t require approval or which are invalid.
In this example, because different types of orders are published to the same SNS topic, each subscribing endpoint sets an advanced filter policy on its SNS subscription so that SNS automatically filters out the orders it can’t handle.
As depicted in the above diagram, the following five filter policies have been created:
The SNS subscription that points to the SQS queue “Lab-Orders-AP” sets a filter policy that matches lab supply orders, with a total value greater than $1,000, and that target Asia Pacific labs only. These more expensive transactions require an approver to review orders placed by buyers.
The SNS subscription that points to the SQS queue “Lab-Orders-EU” sets a filter policy that matches lab supply orders, also with a total value greater than $1,000, but that target European labs instead.
The SNS subscription that points to the Lambda function “Lab-Preapproved” sets a filter policy that only matches lab supply orders that aren’t as expensive, up to $1,000, regardless of their target lab location. These orders simply don’t require approval and can be automatically processed.
The SNS subscription that points to the Lambda function “Lab-Cancelled” sets a filter policy that only matches lab supply orders with total value of $0 (zero), regardless of their target lab location. These orders carry no actual items, obviously need neither approval nor fulfillment, and as such can be automatically canceled.
The SNS subscription that points to the SQS queue “Common-Orders” sets a filter policy that blacklists lab supply orders. Hence, this policy matches only office and tech supply orders, which have a more streamlined fulfillment process, and require no approval, regardless of price or target location.
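As a concrete sketch, the first of these policies might look like the following; the attribute names (category, value, location) are invented for this example, while the operators are the ones described earlier:

{
  "category": ["lab-supplies"],
  "value": [{"numeric": [">", 1000]}],
  "location": [{"prefix": "Asia-Pacific-"}]
}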
After the company finished building this advanced pub/sub architecture, they were then able to launch their internal procurement system and allow buyers to begin placing orders. The diagram above shows six example orders published to the SNS topic. Each order contains message attributes that describe the order, and cause them to be filtered in a different manner, as follows:
Message #1 is a lab supply order, with a total value of $15,700 and targeting a research lab in Singapore. Because the value is greater than $1,000, and the location “Asia-Pacific-Southeast” matches the prefix “Asia-Pacific-“, this message matches the first SNS subscription and is delivered to SQS queue “Lab-Orders-AP”.
Message #2 is a lab supply order, with a total value of $1,833 and targeting a research lab in Ireland. Because the value is greater than $1,000, and the location “Europe-West” matches the prefix “Europe-“, this message matches the second SNS subscription and is delivered to SQS queue “Lab-Orders-EU”.
Message #3 is a lab supply order, with a total value of $415. Because the value is greater than $0 and less than $1,000, this message matches the third SNS subscription and is delivered to Lambda function “Lab-Preapproved”.
Message #4 is a lab supply order, but with a total value of $0. Therefore, it only matches the fourth SNS subscription, and is delivered to Lambda function “Lab-Cancelled”.
Messages #5 and #6 aren’t lab supply orders at all; one is an office supply order, and the other is a tech supply order. Therefore, they only match the fifth SNS subscription, and are both delivered to SQS queue “Common-Orders”.
Although each message only matched a single subscription, each was tested against the filter policy of every subscription in the topic. Hence, depending on which attributes are set on the incoming message, the message might actually match multiple subscriptions, and multiple deliveries will take place. Also, it is important to bear in mind that subscriptions with no filter policies catch every single message published to the topic, as a blank filter policy equates to a catch-all behavior.
Summary
Amazon SNS allows for both string and numeric filtering operators. As explained in this post, string operators allow for exact, prefix, and “anything-but” comparisons, while numeric operators allow for exact and range comparisons. These advanced filtering operators bring even more power and flexibility to your pub/sub messaging functionality and also allow you to simplify your architecture further by removing even more logic from your subscribers.
Message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). SNS filtering operators for numeric matching, prefix matching, and blacklisting are available now in all AWS Regions, for no extra charge.
Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back.
Customers have been taking advantage of Spot Instance interruption notices available via the instance metadata service since January 2015 to orchestrate their workloads seamlessly around any potential interruptions. Examples include saving the state of a job, detaching from a load balancer, or draining containers. Needless to say, the two-minute Spot Instance interruption notice is a powerful tool when using Spot Instances.
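From the instance itself, a typical way to check for the notice is to poll the instance metadata service, for example:

curl http://169.254.169.254/latest/meta-data/spot/instance-action

(This path returns the action and time once an interruption has been scheduled, and a 404 otherwise.)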
In January 2018, the Spot Instance interruption notice also became available as an event in Amazon CloudWatch Events. This allows targets such as AWS Lambda functions or Amazon SNS topics to process Spot Instance interruption notices by creating a CloudWatch Events rule to monitor for the notice.
In this post, I walk through an example use case for taking advantage of Spot Instance interruption notices in CloudWatch Events to automatically deregister Spot Instances from an Elastic Load Balancing Application Load Balancer.
When any of the Spot Instances receives an interruption notice, Spot Fleet sends the event to CloudWatch Events. The CloudWatch Events rule then notifies both targets, the Lambda function and SNS topic. The Lambda function detaches the Spot Instance from the Application Load Balancer target group, taking advantage of nearly a full two minutes of connection draining before the instance is interrupted. The SNS topic also receives a message, and is provided as an example for the reader to use as an exercise.
To complete this walkthrough, you need the AWS CLI installed and configured, as well as the ability to launch CloudFormation stacks.
Launch the stack
Go ahead and launch the CloudFormation stack. You can check it out from GitHub, or grab the template directly. In this post, I use the stack name “spot-spin-cwe”, but feel free to use any name you like. Just remember to change it in the instructions.
Here are the details of the architecture being launched by the stack.
IAM permissions
Give permissions to a few components in the architecture:
The Lambda function
The CloudWatch Events rule
The Spot Fleet
The Lambda function needs basic Lambda function execution permissions so that it can write logs to CloudWatch Logs. You can use the AWS managed policy for this. It also needs to describe EC2 tags as well as deregister targets within Elastic Load Balancing. You can create a custom policy for these.
Finally, Spot Fleet needs permissions to request Spot Instances, tag, and register targets in Elastic Load Balancing. You can tap into an AWS managed policy for this.
Because you are taking advantage of the two-minute Spot Instance notice, you can tune the Elastic Load Balancing target group deregistration timeout delay to match. When a target is deregistered from the target group, it is put into connection draining mode for the length of the timeout delay: 120 seconds to equal the two-minute notice.
To capture the Spot Instance interruption notice being published to CloudWatch Events, create a rule with two targets: the Lambda function and the SNS topic.
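The rule itself is just an event pattern plus the two targets. Here is a rough boto3 sketch (the ARNs are placeholders; the CloudFormation stack in this walkthrough creates the real rule for you):

import boto3
import json

events = boto3.client('events')

# Match EC2 Spot Instance interruption warnings.
events.put_rule(
    Name='spot-interruption-rule',
    EventPattern=json.dumps({
        'source': ['aws.ec2'],
        'detail-type': ['EC2 Spot Instance Interruption Warning']
    })
)

# Send matching events to both the Lambda function and the SNS topic.
events.put_targets(
    Rule='spot-interruption-rule',
    Targets=[
        {'Id': 'lambda-target', 'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:my-function'},
        {'Id': 'sns-target', 'Arn': 'arn:aws:sns:us-east-1:123456789012:my-topic'}
    ]
)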
The Lambda function does the heavy lifting for you. The details of the CloudWatch event are published to the Lambda function, which then uses boto3 to make a couple of AWS API calls. The first call describes the EC2 tags for the Spot Instance, filtering on the tag key “loadBalancerTargetGroup”. If this tag is found, the instance is deregistered from the target group ARN stored as the value of the tag.
import boto3
def handler(event, context):
    instanceId = event['detail']['instance-id']
    instanceAction = event['detail']['instance-action']

    try:
        ec2client = boto3.client('ec2')
        # Look up the target group ARN stored as a tag on the interrupted instance.
        describeTags = ec2client.describe_tags(Filters=[
            {'Name': 'resource-id', 'Values': [instanceId]},
            {'Name': 'key', 'Values': ['loadBalancerTargetGroup']}
        ])
    except:
        print("No action being taken. Unable to describe tags for instance id:", instanceId)
        return

    try:
        elbv2client = boto3.client('elbv2')
        # Deregister the instance so the target group begins connection draining.
        deregisterTargets = elbv2client.deregister_targets(
            TargetGroupArn=describeTags['Tags'][0]['Value'],
            Targets=[{'Id': instanceId}]
        )
    except:
        print("No action being taken. Unable to deregister targets for instance id:", instanceId)
        return

    print("Detaching instance from target:")
    print(instanceId, describeTags['Tags'][0]['Value'], deregisterTargets, sep=",")
    return
SNS topic
Finally, you’ve created an SNS topic as an example target. For example, you could subscribe an email address to the SNS topic in order to receive email notifications when a Spot Instance interruption notice is received.
To create your Spot Fleet request, use some of the resources that the CloudFormation stack created to populate the Spot Fleet request launch configuration. You can find the values in the Outputs section of the CloudFormation stack:
You can confirm that the Spot Fleet request was fulfilled by checking that ActivityStatus is “fulfilled”, or by checking that FulfilledCapacity is greater than or equal to TargetCapacity, while describing the request:
In order to test, you can take advantage of the fact that any interruption action that Spot Fleet takes on a Spot Instance results in a Spot Instance interruption notice being provided. Therefore, you can simply decrease the target size of your Spot Fleet from 2 to 1. The instance that is interrupted receives the interruption notice:
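Both checks can be scripted. Here is a boto3 sketch of the steps described above; the Spot Fleet request ID is a placeholder:

import boto3

ec2 = boto3.client('ec2')
request_id = 'sfr-12345678-1234-1234-1234-123456789012'

# Confirm that the request is fulfilled
response = ec2.describe_spot_fleet_requests(SpotFleetRequestIds=[request_id])
print(response['SpotFleetRequestConfigs'][0]['ActivityStatus'])

# Trigger an interruption notice by shrinking the fleet from 2 to 1
ec2.modify_spot_fleet_request(SpotFleetRequestId=request_id, TargetCapacity=1)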
As soon as the interruption notice is published to CloudWatch Events, the Lambda function triggers and detaches the instance from the target group, effectively putting the instance in a draining state.
In conclusion, Amazon EC2 Spot Instance interruption notices are an extremely powerful tool when taking advantage of Amazon EC2 Spot Instances in your workloads, for tasks such as saving state, draining connections, and much more. I’d love to hear how you are using them in your own environment!
Chad Schmutzer Solutions Architect
Chad Schmutzer is a Solutions Architect at Amazon Web Services based in Pasadena, CA. As an extension of the Amazon EC2 Spot Instances team, Chad helps customers significantly reduce the cost of running their applications, growing their compute capacity and throughput without increasing budget, and enabling new types of cloud computing applications.
We have been busy adding new features and capabilities to Amazon Redshift, and we wanted to give you a glimpse of what we’ve been doing over the past year. In this article, we recap a few of our enhancements and provide a set of resources that you can use to learn more and get the most out of your Amazon Redshift implementation.
In 2017, we made more than 30 announcements about Amazon Redshift. We listened to you, our customers, and delivered Redshift Spectrum, a feature of Amazon Redshift, that gives you the ability to extend analytics to your data lake—without moving data. We launched new DC2 nodes, doubling performance at the same price. We also announced many new features that provide greater scalability, better performance, more automation, and easier ways to manage your analytics workloads.
To see a full list of our launches, visit our what’s new page—and be sure to subscribe to our RSS feed.
Major launches in 2017
Amazon Redshift Spectrum—extend analytics to your data lake, without moving data
We launched Amazon Redshift Spectrum to give you the freedom to store data in Amazon S3, in open file formats, and have it available for analytics without the need to load it into your Amazon Redshift cluster. It enables you to easily join datasets across Redshift clusters and S3 to provide unique insights that you would not be able to obtain by querying independent data silos.
With Redshift Spectrum, you can run SQL queries against data in an Amazon S3 data lake as easily as you analyze data stored in Amazon Redshift. And you can do it without loading data or resizing the Amazon Redshift cluster based on growing data volumes. Redshift Spectrum separates compute and storage to meet workload demands for data size, concurrency, and performance. Redshift Spectrum scales processing across thousands of nodes, so results are fast, even with massive datasets and complex queries. You can query open file formats that you already use—such as Apache Avro, CSV, Grok, ORC, Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV—directly in Amazon S3, without any data movement.
“For complex queries, Redshift Spectrum provided a 67 percent performance gain,” said Rafi Ton, CEO, NUVIAD. “Using the Parquet data format, Redshift Spectrum delivered an 80 percent performance improvement. For us, this was substantial.”
DC2 nodes—twice the performance of DC1 at the same price
We launched second-generation Dense Compute (DC2) nodes to provide low latency and high throughput for demanding data warehousing workloads. DC2 nodes feature powerful Intel E5-2686 v4 (Broadwell) CPUs, fast DDR4 memory, and NVMe-based solid state disks (SSDs). We’ve tuned Amazon Redshift to take advantage of the better CPU, network, and disk on DC2 nodes, providing up to twice the performance of DC1 at the same price. Our DC2.8xlarge instances now provide twice the memory per slice of data and an optimized storage layout with 30 percent better storage utilization.
“Redshift allows us to quickly spin up clusters and provide our data scientists with a fast and easy method to access data and generate insights,” said Bradley Todd, technology architect at Liberty Mutual. “We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1.”
On average, our customers are seeing 3x to 5x performance gains for most of their critical workloads.
We introduced short query acceleration to speed up execution of queries such as reports, dashboards, and interactive analysis. Short query acceleration uses machine learning to predict the execution time of a query, and to move short running queries to an express short query queue for faster processing.
We launched results caching to deliver sub-second response times for queries that are repeated, such as dashboards, visualizations, and those from BI tools. Results caching has an added benefit of freeing up resources to improve the performance of all other queries.
We also introduced late materialization to reduce the amount of data scanned for queries with predicate filters by batching and factoring in the filtering of predicates before fetching data blocks in the next column. For example, if only 10 percent of the table rows satisfy the predicate filters, Amazon Redshift can potentially save 90 percent of the I/O for the remaining columns to improve query performance.
We launched query monitoring rules and pre-defined rule templates. These features make it easier for you to set metrics-based performance boundaries for workload management (WLM) queries, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.
Customer insights
Amazon Redshift and Redshift Spectrum serve customers across a variety of industries and sizes, from startups to large enterprises. Visit our customer page to see the success that customers are having with our recent enhancements. Learn how companies like Liberty Mutual Insurance saw a 9x reduction in month-end reporting time using DC2 nodes. On this page, you can find case studies, videos, and other content that show how our customers are using Amazon Redshift to drive innovation and business results.
In addition, check out these resources to learn about the success our customers are having building out a data warehouse and data lake integration solution with Amazon Redshift:
You can enhance your Amazon Redshift data warehouse by working with industry-leading experts. Our AWS Partner Network (APN) Partners have certified their solutions to work with Amazon Redshift. They offer software, tools, integration, and consulting services to help you at every step. Visit our Amazon Redshift Partner page and choose an APN Partner. Or, use AWS Marketplace to find and immediately start using third-party software.
To see what our Partners are saying about Amazon Redshift Spectrum and our DC2 nodes mentioned earlier, read these blog posts:
If you are evaluating or considering a proof of concept with Amazon Redshift, or you need assistance migrating your on-premises or other cloud-based data warehouse to Amazon Redshift, our team of product experts and solutions architects can help you with architecting, sizing, and optimizing your data warehouse. Contact us using this support request form, and let us know how we can assist you.
If you are an Amazon Redshift customer, we offer a no-cost health check program. Our team of database engineers and solutions architects give you recommendations for optimizing Amazon Redshift and Amazon Redshift Spectrum for your specific workloads. To learn more, email us at [email protected].
Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out, and the taste of classic barbeque.
The Linux kernel currently supports two separate network packet-filtering mechanisms: iptables and nftables. For the last few years, it has been generally assumed that nftables would eventually replace the older iptables implementation; few people expected that the kernel developers would, instead, add a third packet filter. But that would appear to be what is happening with the newly announced bpfilter mechanism. Bpfilter may eventually replace both iptables and nftables, but there are a lot of questions that will need to be answered first.
Using Amazon Cloud Directory, you can build flexible, cloud-native directories for organizing hierarchies of data along multiple dimensions. And now, you can search more efficiently by searching across only a subset of objects in your directory. For example, instead of searching through all of the employees in a company directory built using Cloud Directory, you can choose to search only full-time employees or contractors.
To search across such a subset of objects, you must first create a facet-based index. A facet is a set of attributes defined in a schema that is associated with a directory object. Using facets, you can create different object types in your directory. For instance, you can create different facets for full-time employees and contractors in a schema and then create full-time employee objects and contractor objects. You then can create an index of all the objects that include a specific facet and search those objects more efficiently.
In this blog post, I show how you can create a facet-based index in Cloud Directory to more efficiently search for objects in your directory.
Scenario: Searching a company directory for a specific employee type
Let’s say a company called AnyCompany wants to be able to efficiently search in Cloud Directory for information about its full-time employees and contractors. To do this, AnyCompany must create a company directory using Cloud Directory. (If AnyCompany already had a company directory using Cloud Directory, they could use that directory instead.) AnyCompany starts by creating DirectorySchema, which is a schema that includes three facets: FullTimeEmployeeFacet, ManagerFacet, and ContractorFacet.
The following diagram is a visual representation of AnyCompany’s company directory, and it includes full-time employees and contractors in a reporting hierarchy. The full-time employees are shown in blue nodes and the contractors are shown in green nodes. The directory’s three facets are shown as they correspond to full-time employees, managers, and contractors.
To more efficiently search your directory, follow these steps:
Create a facet-based index that includes the facets you want to use when searching.
Populate the index with the appropriate employee objects.
List all the objects in the index.
List objects in the index that include a specific facet.
1. Create a facet-based index that includes the facets you want to use when searching
The following code example creates a facet-based index of the employee objects in the directory. Cloud Directory currently supports only simple indexes, which means that an index object can only store one type of value, such as a facet.
// Create an index
// <region> indicates an AWS Region value such as “us-east-1”
// <accountId> indicates your AWS account ID
// <directoryId> indicates your Cloud Directory ID
// The schemaArn points to the specific schema, which is DirectorySchema
String schemaArn = "arn:aws:clouddirectory:<region>:<accountId>:directory/<directoryId>/schema/DirectorySchema/1.0" ;
// I define attributes that I want to use for indexing. In this case, I use “facets” to define an
// attribute for indexing. This is a hard-coded value that is defined by Cloud Directory.
AttributeKey indexAttributeKey = new AttributeKey()
.withSchemaArn(schemaArn)
.withFacetName("facets")
.withName("facets") ;
List<AttributeKey> orderedIndexedAttributeList = new ArrayList<AttributeKey>() ;
orderedIndexedAttributeList.add(indexAttributeKey) ;
// The directoryArn points to the specific directory that I am working on
String directoryArn = "arn:aws:clouddirectory:<region>:<accountId>:directory/<directoryId>" ;
// I create the index request and pass in the directoryArn and my attribute list.
// Because I am defining the facetIndex at the root of my directory, my parentReference is the root of the directory
// For LinkName, I have defined “MyFacetIndex,” as shown in the diagram
ObjectReference dirRoot = new ObjectReference().withSelector("/"); // Directory root
CreateIndexRequest createRequest = new CreateIndexRequest()
.withDirectoryArn(directoryArn)
.withOrderedIndexedAttributeList(orderedIndexedAttributeList)
.withIsUnique(false)
.withParentReference(dirRoot) // Attach to directory root
.withLinkName("MyFacetIndex") ;
// I assign the indexed object to facetIndex
CreateIndexResult facetIndexResult = cloudDirectoryClient.createIndex(createRequest) ;
ObjectReference facetIndex = new ObjectReference().withSelector(facetIndexResult.getObjectIdentifier());
2. Populate the index with the appropriate employee objects
Next, I add all the objects that I want to include in the index. The following code example adds objects to facetIndex.
// I assume userObj1 is “Tim”. The following code adds “Tim” to the index.
// Create an index attach request with the directory, facet, and object details
AttachToIndexRequest indexAttachRequest = new AttachToIndexRequest()
.withDirectoryArn(directoryArn)
.withIndexReference(facetIndex)
.withTargetReference(userObj1) ;
// Add the object to the index
cloudDirectoryClient.attachToIndex(indexAttachRequest) ;
// You can follow the same code pattern to add other full-time employee and contractor objects to the index.
3. List all the objects in the index
Now, I can query my directory efficiently for the set of objects I have in facetIndex. The following code example returns all the objects in your index.
// List all objects in the facet-based index
ListIndexResult listResults = cloudDirectoryClient.listIndex(new ListIndexRequest()
.withDirectoryArn(directoryArn)
.withIndexReference(facetIndex)) ;
4. List objects in the index that include a specific facet
I can add a filter for retrieving subsets of objects in the index that contain a specific facet. The following code example shows how to add a filter to the query so that only objects that contain the facet FullTimeEmployeeFacet are returned.
// I choose the specific facet I will use for filtering my query and get all objects that contain this facet in them.
String filterString = "DirectorySchema/1.0/FullTimeEmployeeFacet" ;
TypedAttributeValue filterStringValue = new TypedAttributeValue().withStringValue(filterString);
// I define the filter range and mention both the start mode and end mode as inclusive because I will query for a specific facet
ObjectAttributeRange objectAttributeRange = new ObjectAttributeRange()
.withAttributeKey(new AttributeKey()
.withFacetName("facets")
.withName("facets")
.withSchemaArn(schemaArn))
.withRange(new TypedAttributeValueRange()
.withStartMode(RangeMode.INCLUSIVE)
.withStartValue(filterStringValue)
.withEndMode(RangeMode.INCLUSIVE)
.withEndValue(filterStringValue)) ; // Query for objects with FullTimeEmployeeFacet which is defined in filterString
// List the index results
ListIndexResult filteredResults = cloudDirectoryClient.listIndex(new ListIndexRequest()
.withDirectoryArn(directoryArn)
.withIndexReference(facetIndex)
.withRangesOnIndexedValues(objectAttributeRange)) ;
Using this subset of objects, I can now search for a specific employee without searching across all the objects in my directory.
Summary
You can use facet-based indexing to search your directory more efficiently by searching across only a subset of the objects in your directory. For more information about this feature, see Indexing and Search.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this blog post, start a new thread in the Directory Service forum or contact AWS Support.
We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.
Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.
To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.
This post shows how to build an example Scala script that identifies highly negative issues in the timeline. It pulls out issue events in the timeline sample, analyzes their titles using the sentiment prediction functions from the Stanford CoreNLP libraries, and surfaces the most negative issues.
Getting started
Before we start writing scripts, we use AWS Glue crawlers to get a sense of the data—its structure and characteristics. We also set up a development endpoint and attach an Apache Zeppelin notebook, so we can interactively explore the data and author the script.
Crawl the data
The dataset used in this example was downloaded from the GitHub archive website into our sample dataset bucket in Amazon S3, and copied to the following locations:
Choose the best folder by replacing <region> with the region that you’re working in, for example, us-east-1. Crawl this folder, and put the results into a database named githubarchive in the AWS Glue Data Catalog, as described in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, and is organized hierarchically (that is, partitioned) by year, month, and day.
When finished, use the AWS Glue console to navigate to the table named data in the githubarchive database. Notice that this data has eight top-level columns, which are common to each event type, and three partition columns that correspond to year, month, and day.
Choose the payload column, and you will notice that it has a complex schema—one that reflects the union of the payloads of event types that appear in the crawled data. Also note that the schema that crawlers generate is a subset of the true schema because they sample only a subset of the data.
Set up the library, development endpoint, and notebook
Next, you need to download and set up the libraries that estimate the sentiment in a snippet of text. The Stanford CoreNLP libraries contain a number of human language processing tools, including sentiment prediction.
Download the Stanford CoreNLP libraries. Unzip the .zip file, and you’ll see a directory full of jar files. For this example, the following jars are required:
stanford-corenlp-3.8.0.jar
stanford-corenlp-3.8.0-models.jar
ejml-0.23.jar
Upload these files to an Amazon S3 path that is accessible to AWS Glue so that it can load these libraries when needed. For this example, they are in s3://glue-sample-other/corenlp/.
Development endpoints are static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints to interactively send commands and explore and analyze your data. These endpoints have the same configuration as that of AWS Glue’s job execution system. So, commands and scripts that work there also work the same when registered and run as jobs in AWS Glue.
To set up an endpoint and a Zeppelin notebook to work with that endpoint, follow the instructions in the AWS Glue Developer Guide. When you are creating an endpoint, be sure to specify the locations of the previously mentioned jars in the Dependent jars path as a comma-separated list. Otherwise, the libraries will not be loaded.
After you set up the notebook server, go to the Zeppelin notebook by choosing Dev Endpoints in the left navigation pane on the AWS Glue console. Choose the endpoint that you created. Next, choose the Notebook Server URL, which takes you to the Zeppelin server. Log in using the notebook user name and password that you specified when creating the notebook. Finally, create a new note to try out this example.
Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and the output for that command. Moreover, each notebook includes a number of interpreters. If you set up the Zeppelin server using the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already connected to your new development endpoint, with pyspark as the default. Therefore, throughout this example, you need to prepend %spark at the top of your paragraphs. In this example, we omit these for brevity.
Working with the data
In this section, we use AWS Glue extensions to Spark to work with the dataset. We look at the actual schema of the data and filter out the interesting event types for our analysis.
Start with some boilerplate code to import libraries that you need:
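The boilerplate itself is omitted from this excerpt; the imports below are a subset of those used in the full script at the end of this post and cover what the interactive session needs.

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext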
Then, create the Spark and AWS Glue contexts needed for working with the data:
@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
You need the transient decorator on the SparkContext when working in Zeppelin; otherwise, you will run into a serialization error when executing commands.
Dynamic frames
This section shows how to create a dynamic frame that contains the GitHub records in the table that you crawled earlier. A dynamic frame is the basic data structure in AWS Glue scripts. It is like an Apache Spark data frame, except that it is designed and optimized for data cleaning and transformation workloads. A dynamic frame is well-suited for representing semi-structured datasets like the GitHub timeline.
A dynamic frame is a collection of dynamic records. In Spark lingo, it is an RDD (resilient distributed dataset) of DynamicRecords. A dynamic record is a self-describing record. Each record encodes its columns and types, so every record can have a schema that is unique from all others in the dynamic frame. This is convenient and often more efficient for datasets like the GitHub timeline, where payloads can vary drastically from one event type to another.
The following creates a dynamic frame, github_events, from your table:
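The snippet is omitted from this excerpt; it is the same call used in the full script at the end of this post, with the database and table names inlined:

val github_events = glueContext
  .getCatalogSource(database = "githubarchive", tableName = "data")
  .getDynamicFrame()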
The getCatalogSource() method returns a DataSource, which represents a particular table in the Data Catalog. The getDynamicFrame() method returns a dynamic frame from the source.
Recall that the crawler created a schema from only a sample of the data. You can scan the entire dataset, count the rows, and print the complete schema as follows:
github_events.count
github_events.printSchema()
The result looks like the following:
The data has 414,826 records. As before, notice that there are eight top-level columns, and three partition columns. If you scroll down, you’ll also notice that the payload is the most complex column.
Run functions and filter records
This section describes how you can create your own functions and invoke them seamlessly to filter records. Unlike filtering with Python lambdas, Scala scripts do not need to convert records from one language representation to another, thereby reducing overhead and running much faster.
Let’s create a function that picks only the IssuesEvents from the GitHub timeline. These events are generated whenever someone posts an issue for a particular repository. Each GitHub event record has a field, “type”, that indicates the kind of event it is. The issueFilter() function returns true for records that are IssuesEvents.
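The function body is omitted from this excerpt; it is reconstructed here from the equivalent anonymous filter in the full script, using the issueFilter name from the prose:

def issueFilter(rec: DynamicRecord): Boolean = {
  // Keep only records whose "type" field exists and equals "IssuesEvent"
  rec.getField("type").exists(_ == "IssuesEvent")
}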
Note that the getField() method returns an Option[Any] type, so you first need to check that it exists before checking the type.
You pass this function to the filter transformation, which applies the function on each record and returns a dynamic frame of those records that pass.
val issue_events = github_events.filter(issueFilter)
Now, let’s look at the size and schema of issue_events.
issue_events.count
issue_events.printSchema()
It’s much smaller (14,063 records), and the payload schema is less complex, reflecting only the schema for issues. Keep a few essential columns for your analysis, and drop the rest using the ApplyMapping() transform:
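The mapping call is omitted from this excerpt; it is the same applyMapping() call used in the full script at the end of this post:

val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
  ("actor.login", "string", "actor", "string"),
  ("repo.name", "string", "repo", "string"),
  ("payload.action", "string", "action", "string"),
  ("payload.issue.title", "string", "title", "string")))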
The ApplyMapping() transform is quite handy for renaming columns, casting types, and restructuring records. The preceding code snippet tells the transform to select the fields (or columns) that are enumerated in the left half of the tuples and map them to the fields and types in the right half.
Estimating sentiment using Stanford CoreNLP
To focus on the most pressing issues, you might want to isolate the records with the most negative sentiments. The Stanford CoreNLP libraries are Java-based and offer sentiment-prediction functions. Accessing these functions through Python is possible, but quite cumbersome. It requires creating Python surrogate classes and objects for those found on the Java side. Instead, with Scala support, you can use those classes and objects directly and invoke their methods. Let’s see how.
First, import the libraries needed for the analysis:
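The import list is omitted from this excerpt; these are the same imports used in the full script at the end of this post:

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._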
The Stanford CoreNLP libraries have a main driver that orchestrates all of their analysis. The driver setup is heavyweight, setting up threads and data structures that are shared across analyses. Apache Spark runs on a cluster with a main driver process and a collection of backend executor processes that do most of the heavy sifting of the data.
The Stanford CoreNLP shared objects are not serializable, so they cannot be distributed easily across a cluster. Instead, you need to initialize them once for every backend executor process that might need them. Here is how to accomplish that:
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")
object myNLP {
lazy val coreNLP = new StanfordCoreNLP(props)
}
The properties tell the libraries which annotators to execute and how many words to process. The preceding code creates an object, myNLP, with a field coreNLP that is lazily evaluated. This field is initialized only when it is needed, and only once. So, when the backend executors start processing the records, each executor initializes the driver for the Stanford CoreNLP libraries only one time.
Next is a function that estimates the sentiment of a text string. It first calls Stanford CoreNLP to annotate the text. Then, it pulls out the sentences and takes the average sentiment across all the sentences. The sentiment is a double, from 0.0 as the most negative to 4.0 as the most positive.
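The function is omitted from this excerpt; it appears in the full script at the end of this post and is reproduced here for readability:

def estimatedSentiment(text: String): Double = {
  if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
  val annotations = myNLP.coreNLP.process(text)
  val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
  sentences.foldLeft(0.0)( (csum, x) => {
    csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree]))
  }) / sentences.length
}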
Now, let’s estimate the sentiment of the issue titles and add that computed field as part of the records. You can accomplish this with the map() method on dynamic frames:
val issue_sentiments = issue_titles.map((rec: DynamicRecord) => {
val mbody = rec.getField("title")
mbody match {
case Some(mval: String) => {
rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
rec }
case _ => rec
}
})
The map() method applies the user-provided function on every record. The function takes a DynamicRecord as an argument and returns a DynamicRecord. The code above computes the sentiment, adds it in a top-level field, sentiment, to the record, and returns the record.
Count the records with sentiment and show the schema. This takes a few minutes because Spark must initialize the library and run the sentiment analysis, which can be involved.
Notice that all records were processed (14,063), and the sentiment value was added to the schema.
Finally, let’s pick out the titles that have the lowest sentiment (less than 1.5). Count them and print out a sample to see what some of the titles look like.
val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
pressing_issues.count
pressing_issues.show(10)
Next, write them all to a file so that you can handle them later. (You’ll need to replace the output path with your own.)
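The write call is omitted from this excerpt; it is the same sink used in the full script at the end of this post, shown here with a placeholder output path:

glueContext.getSinkWithFormat(connectionType = "s3",
    options = JsonOptions("""{"path": "s3://<bucket>/out/path/"}"""),
    format = "json")
  .writeDynamicFrame(pressing_issues)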
Take a look in the output path, and you can see the output files.
Putting it all together
Now, let’s create a job from the preceding interactive session. The following script combines all the commands from earlier. It processes the GitHub archive files and writes out the highly negative issues:
import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._
object GlueApp {
object myNLP {
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")
lazy val coreNLP = new StanfordCoreNLP(props)
}
def estimatedSentiment(text: String): Double = {
if ((text == null) || (!text.nonEmpty)) { return Double.NaN }
val annotations = myNLP.coreNLP.process(text)
val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
sentences.foldLeft(0.0)( (csum, x) => {
csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree]))
}) / sentences.length
}
def main(sysArgs: Array[String]) {
val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
val dbname = "githubarchive"
val tblname = "data"
val outpath = "s3://<bucket>/out/path/"
val github_events = glueContext
.getCatalogSource(database = dbname, tableName = tblname)
.getDynamicFrame()
val issue_events = github_events.filter((rec: DynamicRecord) => {
rec.getField("type").exists(_ == "IssuesEvent")
})
val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
("actor.login", "string", "actor", "string"),
("repo.name", "string", "repo", "string"),
("payload.action", "string", "action", "string"),
("payload.issue.title", "string", "title", "string")))
val issue_sentiments = issue_titles.map((rec: DynamicRecord) => {
val mbody = rec.getField("title")
mbody match {
case Some(mval: String) => {
rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
rec }
case _ => rec
}
})
val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
glueContext.getSinkWithFormat(connectionType = "s3",
options = JsonOptions(s"""{"path": "$outpath"}"""),
format = "json")
.writeDynamicFrame(pressing_issues)
}
}
Notice that the script is enclosed in a top-level object called GlueApp, which serves as the script’s entry point for the job. (You’ll need to replace the output path with your own.) Upload the script to an Amazon S3 location so that AWS Glue can load it when needed.
To create the job, open the AWS Glue console. Choose Jobs in the left navigation pane, and then choose Add job. Create a name for the job, and specify a role with permissions to access the data. Choose An existing script that you provide, and choose Scala as the language.
For the Scala class name, type GlueApp to indicate the script’s entry point. Specify the Amazon S3 location of the script.
Choose Script libraries and job parameters. In the Dependent jars path field, enter the Amazon S3 locations of the Stanford CoreNLP libraries from earlier as a comma-separated list (without spaces). Then choose Next.
No connections are needed for this job, so choose Next again. Review the job properties, and choose Finish. Finally, choose Run job to execute the job.
You can simply edit the script’s input table and output path to run this job on whatever GitHub timeline datasets that you might have.
Conclusion
In this post, we showed how to write AWS Glue ETL scripts in Scala via notebooks and how to run them as jobs. Scala has the advantage that it is the native language for the Spark runtime. With Scala, it is easier to call Scala or Java functions and third-party libraries for analyses. Moreover, data processing is faster in Scala because there’s no need to convert records from one language runtime to another.
You can find more examples of Scala scripts in our GitHub examples repository: https://github.com/awslabs/aws-glue-samples. We encourage you to experiment with Scala scripts and let us know about any interesting ETL flows that you want to share.
Mehul Shah is a senior software manager for AWS Glue. His passion is leveraging the cloud to build smarter, more efficient, and easier to use data systems. He has three girls, and, therefore, he has no spare time.
Ben Sowell is a software development engineer at AWS Glue.
Vinay Vivili is a software development engineer for AWS Glue.
In his linux.conf.au 2017 talk [YouTube] on the eBPF in-kernel virtual machine, Brendan Gregg proclaimed that “super powers have finally come to Linux”. Getting eBPF to that point has been a long road of evolution and design. While eBPF was originally used for network packet filtering, it turns out that running user-space code inside a sanity-checking virtual machine is a powerful tool for kernel developers and production engineers.
Over time, new eBPF users have appeared to take advantage of its performance and convenience. This article explains how eBPF evolved, how it works, and how it is used in the kernel.
One of the technology areas I thoroughly enjoy is the Internet of Things (IoT). Even as a child I used to infuriate my parents by taking apart the toys they bought me to see how they worked and whether I could somehow put them back together. It seems I was destined to end up in the tough and ever-changing world of technology. Therefore, it’s no wonder that I am really enjoying learning and tinkering with IoT devices and technologies. It combines my love of development and software engineering with my curiosity around circuits, controllers, and other facets of the electrical engineering discipline, even though I can’t claim to be an electrical engineer.
Despite all of the information collected by deployed IoT devices and solutions, I honestly never thought much about the need to analyze, search, and process that data until I came up against a scenario where it was critical to search and query through loads of sensor data for an anomaly. Of course, I understood the importance of analytics for businesses to make accurate decisions and predictions to drive the organization’s direction, but it didn’t occur to me at first how important it was to make analytics an integral part of my IoT solutions. Well, I learned my lesson just in time, because this re:Invent we are launching a service that makes it easier for anyone to process and analyze IoT messages and device data.
Hello, AWS IoT Analytics! AWS IoT Analytics is a fully managed AWS IoT service that provides advanced analysis of the data collected from your IoT devices. With the AWS IoT Analytics service, you can process messages, gather and store large amounts of device data, and query your data. Also, the new AWS IoT Analytics service integrates with Amazon QuickSight for visualization of your data and brings the power of machine learning through integration with Jupyter Notebooks.
Benefits of AWS IoT Analytics
Helps with predictive analysis of data by providing access to pre-built analytical functions
Provides the ability to visualize analytical output from the service
Provides tools to clean up data
Can help identify patterns in the gathered data
Be In the Know: IoT Analytics Concepts
Channel: archives the raw, unprocessed messages and collects data from MQTT topics.
Pipeline: consumes messages from channels and allows message processing.
Activities: perform transformations on your messages, including filtering attributes and invoking Lambda functions for advanced processing.
Data Store: used as a queryable repository for processed messages. Provides the ability to have multiple data stores for messages coming from different devices or locations, or filtered by message attributes.
Data Set: a data retrieval view from a data store that can be generated on a recurring schedule.
Getting Started with AWS IoT Analytics
First, I’ll create a channel to receive incoming messages. This channel can be used to ingest data sent to the channel via MQTT or messages directed from the Rules Engine. To create a channel, I’ll select the Channels menu option and then click the Create a channel button.
I’ll name my channel, TaraIoTAnalyticsID and give the Channel a MQTT topic filter of Temperature. To complete the creation of my channel, I will click the Create Channel button.
Now that I have my Channel created, I need to create a Data Store to receive and store the messages received on the Channel from my IoT device. Remember you can set up multiple Data Stores for more complex solution needs, but I’ll just create one Data Store for my example. I’ll select Data Stores from menu panel and click Create a data store.
I’ll name my Data Store TaraDataStoreID. Once I click the Create the data store button, I will have successfully set up a Data Store to house messages coming from my Channel.
Now that I have my Channel and my Data Store, I will need to connect the two using a Pipeline. I’ll create a simple pipeline that just connects my Channel and Data Store, but you can create a more robust pipeline to process and filter messages by adding Pipeline activities like a Lambda activity.
To create a pipeline, I’ll select the Pipelines menu option and then click the Create a pipeline button.
I will not add an Attribute for this pipeline, so I will click the Next button.
As we discussed there are additional pipeline activities that I can add to my pipeline for the processing and transformation of messages but I will keep my first pipeline simple and hit the Next button.
The final step in creating my pipeline is for me to select my previously created Data Store and click Create Pipeline.
All that is left for me to take advantage of the AWS IoT Analytics service is to create an IoT rule that sends data to an AWS IoT Analytics channel. Wow, that was a super easy process to set up analytics for IoT devices.
If I wanted to create a Data Set as a result of queries run against my data for visualization with Amazon QuickSight, or to integrate with Jupyter Notebooks to perform more advanced analytical functions, I can choose the Analyze menu option to bring up the screens to create data sets and access the Jupyter Notebook instances.
Summary
As you can see, it was a very simple process to set up the advanced data analysis for AWS IoT. With AWS IoT Analytics, you have the ability to collect, visualize, process, query and store large amounts of data generated from your AWS IoT connected device. Additionally, you can access the AWS IoT Analytics service in a myriad of different ways; the AWS Command Line Interface (AWS CLI), the AWS IoT API, language-specific AWS SDKs, and AWS IoT Device SDKs.
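As a quick illustration, the Channel, Data Store, and Pipeline created in the walkthrough above could also be scripted. Here is a boto3 sketch; the pipeline and activity names are made up, and the parameter shapes should be checked against the current API reference:

import boto3

iotanalytics = boto3.client('iotanalytics')

iotanalytics.create_channel(channelName='TaraIoTAnalyticsID')
iotanalytics.create_datastore(datastoreName='TaraDataStoreID')

# A minimal pipeline that simply moves messages from the channel to the data store
iotanalytics.create_pipeline(
    pipelineName='TaraPipeline',
    pipelineActivities=[
        {'channel': {'name': 'channelActivity',
                     'channelName': 'TaraIoTAnalyticsID',
                     'next': 'datastoreActivity'}},
        {'datastore': {'name': 'datastoreActivity',
                       'datastoreName': 'TaraDataStoreID'}}
    ]
)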
AWS IoT Analytics is available today for you to dig into the analysis of your IoT data. To learn more about AWS IoT and AWS IoT Analytics go to the AWS IoT Analytics product page and/or the AWS IoT documentation.
Messaging holds the parts of a distributed application together, while also adding resiliency and enabling the implementation of highly scalable architectures. For example, earlier this year, Amazon Simple Queue Service (SQS) and Amazon Simple Notification Service (SNS) supported the processing of customer orders on Prime Day, collectively processing 40 billion messages at a rate of 10 million per second, with no customer-visible issues.
SQS and SNS have been used extensively for applications that were born in the cloud. However, many of our larger customers are already making use of open-sourced or commercially-licensed message brokers. Their applications are mission-critical, and so is the messaging that powers them. Our customers describe the setup and on-going maintenance of their messaging infrastructure as “painful” and report that they spend at least 10 staff-hours per week on this chore.
New Amazon MQ Today we are launching Amazon MQ – a managed message broker service for Apache ActiveMQ that lets you get started in minutes with just three clicks! As you may know, ActiveMQ is a popular open-source message broker that is fast & feature-rich. It offers queues and topics, durable and non-durable subscriptions, push-based and poll-based messaging, and filtering.
As a managed service, Amazon MQ takes care of the administration and maintenance of ActiveMQ. This includes responsibility for broker provisioning, patching, failure detection & recovery for high availability, and message durability. With Amazon MQ, you get direct access to the ActiveMQ console and industry standard APIs and protocols for messaging, including JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. This allows you to move from any message broker that uses these standards to Amazon MQ–along with the supported applications–without rewriting code.
You can create a single-instance Amazon MQ broker for development and testing, or an active/standby pair that spans AZs, with quick, automatic failover. Either way, you get data replication across AZs and a pay-as-you-go model for the broker instance and message storage.
Amazon MQ is a full-fledged part of the AWS family, including the use of AWS Identity and Access Management (IAM) for authentication and authorization to use the service API. You can use Amazon CloudWatch metrics to keep a watchful eye on metrics such as queue depth and initiate Auto Scaling of your consumer fleet as needed.
Launching an Amazon MQ Broker To get started, I open up the Amazon MQ Console, select the desired AWS Region, enter a name for my broker, and click on Next step:
Then I choose the instance type, indicate that I want to create a standby broker, and click on Create broker (I can select a VPC and fine-tune other settings in the Advanced settings section):
My broker will be created and ready to use in 5-10 minutes:
The URLs and endpoints that I use to access my broker are all available at a click:
I can access the ActiveMQ Web Console at the link provided:
The broker publishes instance, topic, and queue metrics to CloudWatch. Here are the instance metrics:
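Everything I did in the console can also be scripted. Here is a rough boto3 sketch of creating a comparable broker; the instance type, engine version, and credentials are placeholders, and the parameter names should be checked against the current API reference:

import boto3

mq = boto3.client('mq')

mq.create_broker(
    BrokerName='jeff-broker',
    EngineType='ACTIVEMQ',
    EngineVersion='5.15.0',
    HostInstanceType='mq.m4.large',
    DeploymentMode='ACTIVE_STANDBY_MULTI_AZ',  # or 'SINGLE_INSTANCE' for dev/test
    PubliclyAccessible=True,
    AutoMinorVersionUpgrade=False,
    Users=[{'Username': 'admin', 'Password': 'example-password-12345'}]
)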
Available Now Amazon MQ is available now and you can start using it today in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), and Asia Pacific (Sydney) Regions.
The AWS Free Tier lets you use a single-AZ micro instance for up to 750 hours and store up to 1 gigabyte each month, for one year. After that, billing is based on instance-hours and message storage, plus charges for Internet data transfer if the broker is accessed from outside of AWS.
This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”
At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.
We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.
Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.
In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.
Amazon Redshift as our foundation
The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.
The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.
However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. At our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident, and we saw our CPU utilization grow to 90%.
Why we extended Amazon Redshift to Redshift Spectrum
Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.
Seamless scalability, high performance, and unlimited concurrency
Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.
Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.
Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.
Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.
Keeping it SQL
Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.
An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.
Leveraging Parquet for higher performance
Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.
Lower cost
From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.
What we learned about Amazon Redshift vs. Redshift Spectrum performance
When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:
What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
Does the data format impact performance?
During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:
Amazon Redshift cluster with 28 DC1.large nodes
Redshift Spectrum using CSV/GZIP
Redshift Spectrum using Parquet
We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.
Simple query
First, we tested a simple query aggregating billing data across a month:
SELECT
user_id,
count(*) AS impressions,
SUM(billing)::decimal /1000000 AS billing
FROM <table_name>
WHERE
date >= '2017-08-01' AND
date <= '2017-08-31'
GROUP BY
user_id;
We ran the same query seven times and measured the response times:

Execution Time (seconds)
Run        Amazon Redshift    Redshift Spectrum CSV    Redshift Spectrum Parquet
Run #1     39.65              45.11                    11.92
Run #2     15.26              43.13                    12.05
Run #3     15.27              46.47                    13.38
Run #4     21.22              51.02                    12.74
Run #5     17.27              43.35                    11.76
Run #6     16.67              44.23                    13.67
Run #7     25.37              40.39                    12.75
Average    21.53              44.82                    12.61
For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.
What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.
Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:
Format        Data Scanned (GB)
CSV (Gzip)    135.49
Parquet       2.83
Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.
Complex query
Next, we compared the same three configurations with a complex query.
Execution Time (seconds)
Run        Amazon Redshift    Redshift Spectrum CSV    Redshift Spectrum Parquet
Run #1     329.80             84.20                    42.40
Run #2     167.60             65.30                    35.10
Run #3     165.20             62.20                    23.90
Run #4     273.90             74.90                    55.90
Run #5     167.70             69.00                    58.40
Average    220.84             71.12                    43.14
This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!
Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.
Optimizing the data structure for different workloads
Because S3 storage is relatively inexpensive and we pay only for the data scanned by each query, we believe that it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3. It all depends on how we partition the data and update the table partitions.
Data permutations
For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running the query on the table with something as follows:
SELECT
user,
COUNT(*)
FROM
events_table
WHERE
ts BETWEEN ‘2017-08-01 14:00:00’ AND ‘2017-08-01 14:00:59’
GROUP BY
user;
(Assuming ‘ts’ is your column storing the time stamp for each event.)
With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.
Creating Parquet data efficiently
On average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic copy command from S3 to Amazon Redshift.
Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.
This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:
Collect the event data from the instances.
Save the data in Parquet format.
Partition the data effectively.
To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:
Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue (a conversion sketch follows this list).
Add the Parquet data to S3 by updating the table partitions.
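One way to implement the conversion step, not necessarily the exact mechanism NUVIAD uses, is a small job that reads an hourly batch of the pipe-delimited CSV files with pandas and writes Parquet. A sketch, assuming pandas, pyarrow, and s3fs are available, with illustrative paths and column names:

import pandas as pd

# Read one hour's worth of pipe-delimited, gzipped CSV (paths and columns are illustrative)
columns = ['user_id', 'campaign_id', 'os', 'ua', 'ts', 'billing']
df = pd.read_csv('s3://nuviad-temp/csv/2017-08-01/14/batch.csv.gz',
                 sep='|', names=columns, compression='gzip')

# Write a Parquet file into the partition layout that the external table expects
df.to_parquet('s3://nuviad-temp/blog/clicks/2017-08-01/14/clicks.parquet',
              engine='pyarrow')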
With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.
Data validation
To store our click data in a table, we considered the following SQL create table command:
create external TABLE spectrum.blog_clicks (
user_id varchar(50),
campaign_id varchar(50),
os varchar(50),
ua varchar(255),
ts bigint,
billing float
)
partitioned by (date date, hour smallint)
stored as parquet
location 's3://nuviad-temp/blog/clicks/';
The above statement defines a new external table (all Redshift Spectrum tables are external tables) with a few notable choices: 'ts' is stored as a Unix time stamp rather than a Timestamp, and billing is stored as float rather than decimal (more on that later). We also specified that the data is partitioned by date and hour, and stored as Parquet on S3.
First, we need to get the table definitions. This can be achieved by running the following query:
SELECT
*
FROM
svv_external_columns
WHERE
tablename = 'blog_clicks';
This query lists all the columns in the table with their respective definitions:
| schemaname | tablename   | columnname  | external_type | columnnum | part_key |
|------------|-------------|-------------|---------------|-----------|----------|
| spectrum   | blog_clicks | user_id     | varchar(50)   | 1         | 0        |
| spectrum   | blog_clicks | campaign_id | varchar(50)   | 2         | 0        |
| spectrum   | blog_clicks | os          | varchar(50)   | 3         | 0        |
| spectrum   | blog_clicks | ua          | varchar(255)  | 4         | 0        |
| spectrum   | blog_clicks | ts          | bigint        | 5         | 0        |
| spectrum   | blog_clicks | billing     | double        | 6         | 0        |
| spectrum   | blog_clicks | date        | date          | 7         | 1        |
| spectrum   | blog_clicks | hour        | smallint      | 8         | 2        |
Now we can use this data to create a validation schema for our data:
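A minimal sketch of such a schema, keyed by column name and carrying the fields the validation function below checks; the numeric bounds here are illustrative assumptions, not our production values:

// Illustrative validation schema derived from the table definition above.
const clicks_schema = {
    user_id:     { type: 'string',  max_length: 50 },
    campaign_id: { type: 'string',  max_length: 50 },
    os:          { type: 'string',  max_length: 50 },
    ua:          { type: 'string',  max_length: 255 },
    ts:          { type: 'integer', min_value: 0, max_value: Number.MAX_SAFE_INTEGER },
    billing:     { type: 'float',   min_value: 0, max_value: 1000 } // upper bound is an assumption
};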
Next, we create a function that uses this schema to validate data:
// Validates a single value against its column's schema entry.
function valueIsValid(value, schema) {
    if (schema.type == 'string') {
        return (typeof value == 'string' && value.length <= schema.max_length);
    }
    else if (schema.type == 'integer') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'float' || schema.type == 'double') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'boolean') {
        return typeof value == 'boolean';
    }
    else if (schema.type == 'timestamp') {
        return (new Date(value)).getTime() > 0;
    }
    else {
        return true;
    }
}
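Putting the two together, the validated flag checked before the Firehose call below might be computed like this; a sketch that assumes item is an array of values ordered like the schema's columns:

// Illustrative: validate one event (an array of values ordered like clicks_schema).
const columns = Object.keys(clicks_schema);
let validated = item.every(function (value, index) {
    return valueIsValid(value, clicks_schema[columns[index]]);
});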
Near real-time data loading with Kinesis Firehose
On Kinesis Firehose, we created a new delivery stream to handle the events as follows:
Delivery stream name: events
Source: Direct PUT
S3 bucket: nuviad-events
S3 prefix: rtb/
IAM role: firehose_delivery_role_1
Data transformation: Disabled
Source record backup: Disabled
S3 buffer size (MB): 100
S3 buffer interval (sec): 60
S3 Compression: GZIP
S3 Encryption: No Encryption
Status: ACTIVE
Error logging: Enabled
This delivery stream buffers event data for up to one minute or 100 MB, whichever comes first, and writes it to the S3 bucket as a GZIP-compressed CSV file. After the data is validated, we can safely send it to Kinesis Firehose:
// The aws-sdk Firehose client sends each validated event to the delivery stream defined above.
const AWS = require('aws-sdk');
const firehose = new AWS.Firehose();

if (validated) {
    let itemString = item.join('|') + '\n'; // pipe-delimited CSV record with a trailing newline
    let params = {
        DeliveryStreamName: 'events',
        Record: {
            Data: itemString
        }
    };
    firehose.putRecord(params, function(err, data) {
        if (err) {
            console.error(err, err.stack);
        }
        else {
            // Continue to your next step
        }
    });
}
Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.
Automating data distribution using AWS Lambda
We created a simple Lambda function, triggered by an S3 put event, that copies the file to a different location (or locations) while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose follow a pre-defined hierarchy based on the UTC time prefix, such as your-prefix/YYYY/MM/DD/HH/object-name.gz.
All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):
/*
object key structure in the event object:
your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz
*/
let key_parts = event.Records[0].s3.object.key.split('/');
let event_type = key_parts[0];
let date = key_parts[1] + '-' + key_parts[2] + '-' + key_parts[3];
let hour = key_parts[4];
if (hour.indexOf('0') == 0) {
hour = parseInt(hour, 10) + '';
}
let parts1 = key_parts[5].split('-');
let minute = parts1[7];
if (minute.indexOf('0') == 0) {
minute = parseInt(minute, 10) + '';
}
Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:
Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.
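A hedged sketch of that copy step, continuing the Lambda snippet above; the destination bucket and folder layout are illustrative assumptions:

// Continuing the snippet above: copy the Firehose object to the two destinations.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

let sourceBucket = event.Records[0].s3.bucket.name;
let sourceKey = event.Records[0].s3.object.key;
let fileName = sourceKey.split('/').pop();

let destinations = [
    'minute/' + date + '/' + hour + '/' + minute + '/', // last-minute folder behind the small Spectrum table
    'events/' + date + '/' + hour + '/'                 // hourly folder, later converted to Parquet
];

destinations.forEach(function (prefix) {
    s3.copyObject({
        Bucket: 'nuviad-temp',
        CopySource: sourceBucket + '/' + sourceKey,
        Key: prefix + fileName
    }, function (err) {
        if (err) {
            console.error(err, err.stack);
        }
    });
});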
Because we partition the data by date and hour, we create a new partition on the Redshift Spectrum table whenever the processed minute is the first minute of the hour (that is, minute 0), by running the following:
ALTER TABLE
spectrum.events
ADD partition
(date='2017-08-01', hour=0)
LOCATION 's3://nuviad-temp/events/2017-08-01/0/';
After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.
Migrating CSV to Parquet using AWS Glue and Amazon EMR
The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).
Creating AWS Glue jobs
What this simple AWS Glue script does (a condensed sketch follows the note below):
Gets parameters for the job, date, and hour to be processed
Creates a Spark EMR context allowing us to run Spark code
Reads CSV data into a DataFrame
Writes the data as Parquet to the destination S3 bucket
Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table
Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.
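A condensed sketch of such a Glue job, following the steps above; the job parameter names, S3 paths, table name parquet_events, database spectrumdb, and region are illustrative assumptions rather than our exact production job:

import sys
import boto3
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Job parameters: the date and hour to convert (parameter names are illustrative).
args = getResolvedOptions(sys.argv, ['day_partition_value', 'hour_partition_value'])
day = args['day_partition_value']    # e.g. '2017-08-01'
hour = args['hour_partition_value']  # e.g. '2'

# Spark session for the Glue job.
spark = SparkSession.builder.getOrCreate()

# Read the hour's pipe-delimited CSV files written by Kinesis Firehose.
csv_path = 's3://nuviad-temp/events/{}/{}/'.format(day, hour)
parquet_path = 's3://nuviad-temp/parquet/{}/hour={}/'.format(day, hour)
df = spark.read.csv(csv_path, sep='|', header=False, inferSchema=True)

# Write the same data back to S3 as Parquet.
df.write.mode('overwrite').parquet(parquet_path)

# Register the partition through Athena, which shares the AWS Glue Data Catalog
# with Redshift Spectrum.
athena = boto3.client('athena', region_name='us-east-1')
athena.start_query_execution(
    QueryString="ALTER TABLE parquet_events ADD IF NOT EXISTS "
                "PARTITION (`date`='{}', hour={}) LOCATION '{}'".format(day, hour, parquet_path),
    QueryExecutionContext={'Database': 'spectrumdb'},
    ResultConfiguration={'OutputLocation': 's3://nuviad-temp/athena-results/'}
)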
Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:
S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq
We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.
Creating a Lambda function to trigger conversion
Next, we created a simple Lambda function to trigger the AWS Glue script hourly with a few lines of Python:
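A minimal sketch of that function, assuming the boto3 Glue client and the illustrative job argument names used in the sketch above; the region is also an assumption:

import boto3
from datetime import datetime, timedelta

# Glue client; the region is an assumption.
client = boto3.client('glue', region_name='us-east-1')

def lambda_handler(event, context):
    # Run the conversion job for the previous hour.
    last_hour = datetime.utcnow() - timedelta(hours=1)
    response = client.start_job_run(
        JobName='convertEventsParquetHourly',
        Arguments={
            '--day_partition_value': last_hour.strftime('%Y-%m-%d'),
            '--hour_partition_value': str(last_hour.hour)
        }
    )
    return response['JobRunId']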
Using Amazon CloudWatch Events, we trigger this function hourly. It starts an AWS Glue job named 'convertEventsParquetHourly' for the previous hour, passing the job name and the partition values to process to AWS Glue.
Redshift Spectrum and Node.js
Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.
Node.js and Parquet
The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.
One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.
Timestamp data type
According to the Parquet documentation, Timestamp data are stored in Parquet as 64-bit integers. However, JavaScript does not support 64-bit integers, because the native number type is a 64-bit double, giving only 53 bits of integer range.
The result is that you cannot store Timestamp correctly in Parquet using Node.js. The solution is to store Timestamp as string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.
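For example, if the event time is written from Node.js as a string column, the query can cast it back on the fly; the column names here are illustrative:

-- Illustrative: event_time is assumed to be a varchar column holding
-- '2017-08-01 14:00:00'-style strings written from Node.js.
SELECT user_id, CAST(event_time AS timestamp) AS event_ts
FROM spectrum.events
WHERE date = '2017-08-01';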
Lessons learned
You can benefit from our trial-and-error experience.
Lesson #1: Data validation is critical
As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.
Lesson #2: Structure and partition data effectively
One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.
Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.
Storing data in the right format
Use Parquet whenever you can. The benefits of Parquet are substantial: faster performance, less data to scan, and a much more efficient columnar format. However, it is not supported out of the box by Kinesis Firehose, so you need to implement your own ETL; AWS Glue is a great option.
Creating small tables for frequent tasks
When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.
Lesson #3: Combine Athena and Redshift Spectrum for optimal performance
Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.
Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.
Lesson #4: Sort your Parquet data within the partition
We achieved another performance improvement by sorting data within the partition using sortWithinPartitions(sort_field). For example:
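A hedged PySpark illustration, in the spirit of the Glue sketch above; the sort column ts and the output path are assumptions:

# Illustrative: sort each partition's rows by ts before writing Parquet,
# so time-based queries touch fewer row groups.
df.sortWithinPartitions('ts') \
  .write.mode('overwrite') \
  .parquet(parquet_path)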
Conclusion
We were extremely pleased with Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift with Redshift Spectrum to gain additional scalability, performance, and cost benefits.
Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.
About the Author
With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting-edge products and services in the real world, generating real money. As an experienced entrepreneur, Rafi believes in practical programming and fast adaptation of new technologies to achieve a significant market advantage.