Tag Archives: Amazon Redshift

Bannerconnect uses Amazon Redshift to help clients improve digital marketing results

Post Syndicated from Danny Stommen original https://aws.amazon.com/blogs/big-data/bannerconnect-uses-amazon-redshift-to-help-clients-improve-digital-marketing-results/

Bannerconnect uses programmatic marketing solutions that empower advertisers to win attention and customers by getting their ads seen by the right person at the right time and place. Data-driven insights help large advertisers, trade desks, and agencies boost brand awareness and maximize the results of their digital marketing. Timely analysis of log data is critical to respond to dynamic changes in customer behavior, optimize marketing campaigns quickly, and to gain competitive advantage.

By moving to AWS and Amazon Redshift, our clients can now get near real-time analytics at their fingertips. In this blog post, we describe the challenges that we faced with our legacy, on-premises data warehouse, and discuss the benefits we received by moving to Amazon Redshift. Now, we can ingest data faster, do more sophisticated analytics, and help our clients make faster, data-driven decisions to improve their digital marketing.

Legacy on-premises situation and challenges

Our on-premises, legacy infrastructure consisted of the IBM PureData System as our log level data warehouse. We used a MySQL database for storing all metadata and all analytics data. In this physical, nonvirtualized environment, we needed to plan capacity carefully, far in advance, to handle data growth. We needed a sizeable team to manage upgrades, maintenance, backups, and the day-to-day management of workloads and query performance.

We faced many challenges. We had only 1 gigabyte of bandwidth available to load log-level data into the data warehouse. At peak loads, our extract, transform, load (ETL) server ran completely full, and bandwidth became a bottleneck that delayed when the data was available for analytics. Software and firmware upgrades to the data warehouse needed to be scheduled, and maintenance downtime sometimes took up to eight hours to complete. Our infrastructure was also fragile. We ran everything on one PureData System, and we didn’t have a separate development and test environment. Clients that had direct access to our production environment could submit incorrect SQL queries and pull down the entire data warehouse.

From the log-level data, we created aggregates and stored them in MySQL. Indexes slowed down the loading process significantly. Several aggregations that we wanted to do were simply not possible. Running ad hoc (one-time) queries against 200 gigabytes of uncompressed, row-based data took ages to complete. Many dashboard queries took 10–15 minutes or longer, and were ultimately cancelled. Users were frustrated, so we knew that we had to evolve to a more responsive solution, end-to-end. We chose AWS and Amazon Redshift for our data warehouse.

Moving to Amazon Redshift

Because our legacy software was not designed to run in the cloud, we decided to rebuild our applications using all available AWS components. This saved us the hassle of any migration process, and we could design our applications to use the full potential of AWS.

Our new infrastructure uses Amazon Redshift as our log-level data warehouse. We use a 40-node ds2.xlarge cluster for our production processes. Here, we run log-level queries to aggregate data for the analytics cluster, and run thousands of queries each day to optimize our marketing campaigns.

We set up a separate 30-node ds2.xlarge Amazon Redshift cluster for client access. We replicate the log level data to this cluster, and allow our clients to run queries here without jeopardizing our production processes. Our clients perform data science queries against the data in this cluster.

We also created a 24-node dc2.large cluster for high-performance queries that would not be affected by large, complex queries running on our other clusters. We use this cluster for ad hoc analytics on aggregated data, made available through our API.

We use Amazon S3 as our main data store, giving us infinite storage. Amazon EC2 hosts our ETL processes, API, and several other applications.

Bannerconnect architecture. Amazon S3 is not added to the flowchart to make it simpler. You can add S3 to almost every arrow in the chart.

Our biggest gains

Creating our next-generation solution on AWS and Amazon Redshift provides many benefits for us. We simply follow the best practices provided in Amazon Redshift documentation.

  • Managed service: We wanted to focus on what we are good at, developing software, and not managing and maintaining infrastructure. With Amazon Redshift, we no longer have to do software updates, firmware upgrades, or deal with broken hardware. We no longer need to plan capacity months in advance, or deal with integrating new servers into our environment. Amazon Redshift completely automates all of this for us, including scaling with our data growth, so we can focus on data and analytics to better serve our clients.
  • A full, separate development and test environment: We now have an isolated Amazon Redshift cluster (single node), where we can perform, develop, and test without worrying about breaking the production environment. In our on-premises setup, we did have a development database, but it was always on the production infrastructure. Furthermore, we have an exact copy of our entire infrastructure in our test environment (obviously all in small scale and small instance types). This lets us run automated tests on each deployment to verify that all data flows work as expected.
  • Infinite scalability: We can scale instantly to any capacity we need. Amazon S3 gives us infinite storage, and we can scale Amazon Redshift compute capacity in just a few clicks.
  • Separate clusters: Clients now can’t jeopardize our production processes. Clients still write incorrect SQL queries. However, by using query monitoring rules in Amazon Redshift, we can identify these queries and have Amazon Redshift automatically stop them. Bad queries might affect the client cluster momentarily, but they don’t affect our production processes at all.
  • Faster ad hoc analytics: Due to the massive parallel processing, data compression, and columnar-based storage capabilities in Amazon Redshift, we can create aggregates that were not possible in MySQL. In terms of performance, it’s hard to give good numbers to compare. Running a query against a smaller aggregate using an index on MySQL might be faster at times. However, the majority of our queries are significantly faster on Amazon Redshift. For example, our biggest aggregated table contains about 2 billion records and 500 GB of data (compressed). MySQL couldn’t handle this, but Amazon Redshift results are retrieved within seconds. Large, complex queries took a long time on MySQL. Amazon Redshift completes these in tens of seconds or less.

Building the multicluster environment

This section explores an easy option to build a multicluster setup using AWS CloudFormation templates. With the templates, you can launch multiple Amazon Redshift clusters inside a VPC in both private and public subnets in different Availability Zones. The private subnet enables internal applications, such as EC2 instances, to execute the ETL process to interact with the Amazon Redshift cluster to refresh data. You can use the public Amazon Redshift cluster for the external client tools and scripts. Here is the architecture of this setup:

Let’s walk through the configuration. For demonstration purposes, we use just two Amazon Redshift clusters in a private and public subnet, but you can modify these steps to add more parallel clusters. The configuration is a two-step process to first create the network stack and later launch the Amazon Redshift cluster in those stacks. This process creates the following:

  • VPC and associated subnets, security groups, and routes
  • IAM roles to load data from S3 to Amazon Redshift
  • Amazon Redshift cluster or clusters


Step 1 – Create a network stack

  1. Sign in to the AWS Management Console and navigate to CloudFormation, then do the following:
  • Choose the AWS Region to launch the stack in, for example US East (Ohio).
  • Choose Create Stack.
  • Choose Specify an Amazon S3 template URL.
  • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/networkstack.json
  1. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • CIDR Prefix: Enter a class B CIDR prefix (for example, 168, 10.1, or 172.16).
    • Environment Tag: Name the environment anything convenient to tag the resources.
    • Key Name: The EC2 key pair to allow access to EC2 instances. If you don’t already have one, see Amazon EC2 Key Pairs in the EC2 documentation.
    • Use the default values for all other parameters and choose Create.
  2. The stack takes 10 minutes to launch, after which the network stack is ready.
  3. Review the outputs of the stack when the launch is complete to note the resource names that were created.

Step 2 – Create an Amazon Redshift cluster or clusters

  1. Navigate back to the CloudFormation console and do the following:
  • Choose Create Stack.
  • Choose Specify an Amazon S3 template URL.
  • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/reshiftstack.json
  1. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • Cluster Type: Choose a multi-node or single-node cluster.
    • Inbound Traffic: Allow inbound traffic to the cluster from this CIDR range.
    • Master Username: The user name that is associated with the master user account for the cluster that is being created. The default is adminuser.
    • Master User Password: The password that is associated with the master user account for the cluster that is being created.
    • Network Stack Name: The name of the active CloudFormation stack that was created in step 1, which contains the networking resources such as the subnet and security group.
    • Node Type: The node type to be provisioned for the Amazon Redshift cluster.
    • Number Of Nodes: The number of compute nodes in the Amazon Redshift cluster:
      • When the cluster type is specified as single-node, this value should be 1.
      • When the cluster type is specified as multi-node, this value should be greater than 1.
    • Port Number: The port number on which the cluster accepts incoming connections. The default is 5439.
    • Public Access: Public access to the Amazon Redshift cluster, either true or false. When this value is true, the cluster is launched in a public subnet. When this value is false, the cluster is launched in a private subnet.
    • Use the default values for all other parameters and choose Create.
  2. The stack take 10 minutes to launch, after which the Amazon Redshift cluster is launched in the network stack.
  3. Review the outputs of the stack when the launch is complete to note the resource names that were created for the Amazon Redshift cluster.
  4. Repeat steps 5–8 to add more Amazon Redshift clusters to this network stack.

With this easy deployment using the AWS CloudFormation template, you can launch all the resources needed for a multicluster setup with a few clicks.


Migrating to Amazon Redshift and setting up the data warehouse on AWS enabled us to build highly scalable decoupled applications and to use different clusters for different use cases. Operationally, we were able to build robust dev, test, and prod systems independently that are easy to manage to implement complex data workflows.

Recently, we started using Amazon Redshift Spectrum to query data directly from Amazon S3, without needing to load the data into Amazon Redshift. This saves us loading time and speeds up time to analytics, creating many new possibilities for us. Loading dynamic data with different formats and columns becomes a lot easier with Amazon Redshift Spectrum.


About the Authors

Danny Stommen has been working for Bannerconnect for 10 years, with his current role being Solutions Architect and most of his time working on the CORE solution. Next to work, he enjoys spending quality time with his family and actively playing soccer.




Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Before working at AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Running Amazon Payments analytics on Amazon Redshift with 750TB of data

Post Syndicated from Bishwabandhu Newton original https://aws.amazon.com/blogs/big-data/running-amazon-payments-analytics-on-amazon-redshift-with-750tb-of-data/

The Amazon Payments Data Engineering team is responsible for data ingestion, transformation, and the computation and storage of data. It makes these services available for more than 300 business customers across the globe. These customers include product managers, marketing managers, program managers, data scientists, business analysts and software development engineers. They use the data for their scheduled and one-time queries to make the right business decisions. This data is also used for building weekly, monthly and quarterly business review metrics, which are reviewed by the leadership team.

We support various consumer payment business teams, including the following:

  • Amazon Payment Products (Credit cards, Shop with Points, Amazon Currency Convertor, International Payment Products)
  • Gift Cards
  • Payment Acceptance Experience
  • Amazon Business Payments.

We also feed into the machine learning recommendation engine. This engine suggests the best payment product to a customer on Amazon’s payment checkout page.

Challenges with old datawarehouse

This section describes our previous challenges with our data warehouse and analytical needs. With payment products launches and their extension to new marketplaces, we had exponential growth in data volume. Later, scaling our extract, transform, and load process (ETL) was met with severe challenges, and resulted in delays and operational burdens. Here are the specific challenges we faced with our data warehouse:

  • Upsert did not scale, so we got updates more than ~10MN per run. The consumer product catalog dataset has more than 6.5BN records listed in the US marketplace, and occasionally the daily updates exceeded 10MN mark. We saw a similar trend for the Order attributes dataset.
  • Data aggregation either took longer or never finished if we had to analyze even for 6 months of payments data. Often, business owners wanted to aggregate the data based on certain attributes. For example, the number of successful transactions and monetary value by certain types of cards.
  • Shared cluster, and thus shared storage and compute caused resource crunch and impacted all its users. Each team was given ~100TB each on the Data warehouse. Each team could bring their table and join with central data warehouse tables. Any bad query on the cluster impacted all other queries on the same cluster. It was difficult to identify the owner of those bad queries.
  • There were more than 30,000 production tables. it became almost impossible to host all of them on the same cluster.
  • Index corruption on a larger table was cumbersome to rebuild and backfill the table.
  • We required a Database Administrator to apply patches and updates.

Using Amazon Redshift as the new payments data warehouse

We started exploring different options which suits our analytical needs, which is fast, reliable and scales well for future data growth. With all of the previously described issues, Central Data warehouse moved towards separating the compute and storage layer and they decided to be responsible for storage. They built a data lake on Amazon S3, which is encrypted to store even the highly confidential critical data. Each consumer team got the guideline to bring their own compute capacity for their analytical needs. Our payments team started looking for the following advantages:

  • Expedient analytics.
  • Integrates with S3 and the other AWS services.
  • Affordable storage and compute rates.
  • Works for ETL processing.

We chose Amazon Redshift because of it has the following features:

  • Bulk uploads are faster. ~700MN data inserts into ~30 minutes.
  • Data upsert is exceptionally fast.
  • Aggregation query on multi-million dataset with fewer columns of data returns in a few seconds as compared to a few minutes.
  • No need for DBA time to be allocated to maintain the database. Data engineers can easily perform backups, re-snapshot to a new cluster, set up alarms in case of cluster issues, and add new nodes.
  • The ability to keep data on S3. This data is accessible from multiple independent Amazon Redshift clusters through Spectrum and also allows users to join Spectrum tables with other tables created locally on Amazon Redshift. It offloads some processing to the Spectrum layer while storing the data on S3.
  • The ability to use Amazon Redshift best practices to design our tables in regards to distribution keys, sort keys, and compression. As a result, the query performance exceeded our SLA expectations.
  • An effective compression factor. This saves more than 40 to 50 percent of space by choosing the right compression. This enables faster query and efficient storage option.

Sources of data and storage

We consume data from different sources, like PostgreSQL, Amazon DynamoDB live streams, Central data warehouse data lake and bank partners’ data through secure protocols. Data from PostgreSQL databases are in relational format, whereas DynamoDB has key value pairs. We translate the key/value data to relational format and store in Amazon Redshift and S3. Most frequently accessed data is kept in Amazon Redshift. Less frequently accessed and larger datasets are stored in S3 and accessed through Amazon Redshift Spectrum.

Central data lake hosts more than 30,000 tables from different teams, such as Orders, Shipments, and Refunds. Whereas, we as a payments team need approximately 200 tables from this data lake as source tables. Later, we built a data mart specific to payment products, which feeds both to scheduled and one-time data and reporting needs. All small and mid-size tables, smaller than 50 TB, are directly loaded in Amazon Redshift from data lake, which physically stores the data. Tables larger than 50 TB are not stored locally on Amazon Redshift. Instead, we pull from the data lake using EMR-Hive, convert the format from tsv to ORC/Parquet and store on S3. We create an Amazon Redshift Spectrum table on top of S3 data. Format conversion lowers the runtime for each analytical aggregation queries, whereas storing on S3 makes sure we do not fill up entire Amazon Redshift cluster with data rather use it for efficient computing.

Data Architecture

Different components

  1. Central data warehouse data lake (Andes) — Almost all the systems in Amazon, wanting to share their data with other teams, publish their data to this datalake. It is an encrypted storage built on top of Amazon S3 which has metadata attached along with datafiles. Every dataset has a onetime dump and then incremental delta files. Teams willing to consume the data by
  • Physically copying the data into their own Amazon Redshift cluster. It is efficient for smaller and mid-size tables which are accessed most frequently.
  • Using Amazon Redshift Spectrum to run analytical queries on datasets stored in data lake. It helps in accessing cold large data, generally larger than 50TB, thereby avoids scaling up your Amazon Redshift cluster just because all the space might be consumed by these larger data files.
  • Using the AWS Glue catalog to update the metadata in your team’s S3 bucket and use Amazon EMR to pull the data, apply transformation, change format and store the final data in S3 bucket, which can further be consumed using Amazon Redshift Spectrum. It is efficient when the dataset is large and need transformations before being consumed.
  1. Amazon Redshift clusters — Amazon Redshift has centric architecture and is best suited for being single place for all source of truth, but we are managing three clusters mostly because of having consistent SLA of our reports, decoupling the user query experience with central data lake ingestion process (which is resource intensive). Here are the cluster specific reasons for why we need these as separate clusters.
  • Staging cluster:
    • Our data sources are dynamic which are in the transition state and moving away from relational to non-relational sources; for example, Oracle to Postgres or to DynamoDB.
    • The mechanism to pull data from central data lake and store into Amazon Redshift, is also evolving and is resource intensive in current state.
    • Our datamart is payment specific, though the table names in our datamart looks similar to central data lakes tables, but our datamart data is different than central data lake datasets. We apply necessary transformation and filters before bringing the data to the user’s Amazon Redshift cluster.
  • User cluster: Our internal business users wanted to create the tables in public schema for their analysis. They also needed direct connect access for any adhoc analysis. Most of the users know SQL and are aware of best practices but there are users who are new to SQL and sometimes their query is not optimized and impact other running queries, we have Amazon Redshift workload manager (WLM) settings to protect our cluster from these bad queries.
  • Prod ETL cluster: We have tight SLA to make dataset available for data users. In order to minimize the impact of bad queries running on the system we have set up replica of user cluster. All prod transforms run here and the output data is copied to both user and prod clusters. It insures the SLA we commit to data business users.
  1. Near real time data ingestion — Many applications like promotional data, card registration, gift card issuance etc need realtime data collection to detect fraud. Application data is stored in Amazon DynamoDB, with DynamoDB Streams enabled. We consume the data from these streams through an AWS Lambda function and Amazon Kinesis Data Firehose. Kinesis Firehose delivers the data to S3 and submits the copy command to load the data into Redshift. We have micro batch of 15 mins which makes sure not all the connections are consumed by these near-real time data applications.
  2. Alternate compute on Amazon EMR — We track customer behavior through website clickstream data. There are few metrics on conversion rate (customer applied for payment products after they saw the banner at different placements like payment checkout page etc), payment product application, website hit waterfall analysis, funnel reports which tracks the journey of a customer starting from seeing these adds, clicking to apply button, filling the application form to finally submitting the application. We get almost 10B website hits records on a given day. Before we generate any report, we look for cheaper storage for these massive and not frequently accessed datasets. We decided to choose S3 as a storage option and applied the transformations using Amazon EMR. With this approach, we made sure we do not fill up the database with massive cold data and at the same time we enable data access on S3 using Amazon Redshift Spectrum, which provided similar query performance. As Amazon Redshift is a columnar database and is exceptionally fast for any sort of aggregation if we choose fewer dimensional columns. We wanted similar performance for the data we stored on S3. We were able to do it using Amazon EMR and by changing the data format from TSV to ORC or Parquet. Every day, we created new partition data on S3 and refreshed the Amazon Redshift Spectrum table definition to include new partition data. These Spectrum table were accessed by business users for their one-time analysis using any Amazon Redshift SQL client or for scheduling any ETL pipeline.
  3. Publish the data to data warehouse datalake for non-payment users — We built payment specific datasets. For example, decline transaction behavior, wallet penetration and others. Sometimes non-payments business users are also interested in consuming these datasets. We publish these datasets to central data warehouse data lake for them. Additionally, payments application teams are the source for payment product application data. Data engineering team consumes these datasets, apply needed transformation and publish it both for payments and non-payments user through Amazon Redshift and the data lake.

Schema management

We have a prod schema which stores all production tables and only platform team has access to make any changes to it. We also provide payment product specific sandboxes which is accessible by product specific member. There is generic public schema for any payments data users. They can create, load, truncate/drop the tables in this schema.

Database and ETL lookup

Here are few fun facts about our Amazon Redshift database objects.

  • Number of Databases: 4
    • Staging Database DB1: ds2.8xlarge x 20 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • User Database DB2: ds2.8xlarge x 24 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • Platform Database DB3: ds2.8xlarge x 24 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
    • Reporting Database DB4: ds2.8xlarge x 4 nodes
      Memory. 244 GiB per node
      Storage. 16TB HDD storage per node
  • Size of Databases:
    • Total memory: 17 TB
    • Total storage: 1.15 Petabytes
  • Number of tables:
    • Analytics prod db: 6500
    • Analytics staging db: 390

User cluster

Here are some stats around the database that is exposed to the users. It has both core tables and users are allowed to create their own tables based on their need. We have the mirror image of the same database, which hosts all of the tables, except user=created tables. Another database is used to run ETL platform related prod pipeline. Most of these tables have entire history, except the snapshot tables like clickstream datasets, which have been archived to S3.

Staging cluster

Here are some stats around the staging database. It is the landing zone for all the data coming from other teams or from the central data warehouse data lake. Table retention has been applied to all the tables as most of the ELT downstream jobs look for the last updated date, pulls just incremental data and store in user and replica databases.

Scheduled ETL and Query load on database

  • Number of daily extraction ETL jobs: 2943
  • Number of loading ETL jobs: 1655

  • Total daily load processed volume: 119 BN
  • Total daily loading runtime: 11,415 mins

  • Total daily data extraction volume: 166 BN
  • Total daily date extraction runtimes: 25,585 mins

Both scheduled and one-time query loads on the database

  • Daily query load on database by different database users.

Best practices

  1. Design tables with right sort keys and distribution key: Query performance is dependent on how much data it scans and if the joines are co-located join. Choosing right sort key make sure we do not scan the data which we do not need and selecting right distribution key makes sure the joining data is present on the same node and there is less movement of data over network resulting in better query performance. For more information, see Amazon Redshift Best Practices for Designing Tables.
  1. Refer to Amazon Redshift Best Practices for Designing Queries while writing the query.
  1. Change the loading strategy by splitting larger files into smaller files, use bulk loads instead serial inserts. For more information, see Amazon Redshift Best Practices for Loading Data.
  1. Configure the appropriate WLM setting to avoid system abuse by allocating right run-times, memory, priority queues, etc. For more information, see Tutorial: Configuring Workload Management (WLM) Queues to Improve Query Processing.
  1. Use Amazon Redshift advisor to identify potential tables needing compression, tables with missing stats, uncompressed data loads, and further fine tune your ETL pipelines. For more information, see Working Recommendations from Amazon Redshift Advisor.
  1. Identify the tables with most of the wasted space and vacuum them frequently. It releases the wasted space and, at the same time, increases the query performance. For more information, see Vacuuming Tables.
  1. Analyze the SQLs submitted to DB and identify the pattern on table usage and expensive joins. It helps data engineers build more denormalized tables by pre-joining these tables, and helping users access a single table, which is fast and efficient.

Conclusion and Next Steps

Amazon Redshift clusters with total capacity of 1.15 PB, ~6500 tables, 4500 scheduled ETL runs, 13,000 ETL queries a day, is solving almost all the ETL need of business users in payments team. However, the recent volume growth is filling up our dbs more than we expected, Next step could be choosing cheaper storage option by building a datalake on S3 and access them using Amazon Redshift spectrum without even bothering about scaling challenges and with seamless user experience.


About the authors

Bishwabandhu Newton is a senior data engineer with Amazon Consumer Payments team. He has over 12 years of data warehousing experience, with 9+ years at Amazon.com.




Matt Scaer is a Principal Data Warehousing Specialist Solution Architect, with over 20 years of data warehousing experience, with 11+ years at both AWS and Amazon.com.

Scale your Amazon Redshift clusters up and down in minutes to get the performance you need, when you need it

Post Syndicated from Ayush Jain original https://aws.amazon.com/blogs/big-data/scale-your-amazon-redshift-clusters-up-and-down-in-minutes-to-get-the-performance-you-need-when-you-need-it/

Amazon Redshift is the cloud data warehouse of choice for organizations of all sizes—from fast-growing technology companies such as Turo and Yelp to Fortune 500 companies such as 21st Century Fox and Johnson & Johnson. With quickly expanding use cases, data sizes, and analyst populations, these customers have a critical need for scalable data warehouses.

Since we launched Amazon Redshift, our customers have grown with us. Working closely with them, we have learned how their needs evolve as their data scales. We commonly encounter scenarios like the following for data analytics:

  • A US-based retail company runs a large number of scheduled queries and complex BI reports. Their Amazon Redshift usage peaks between 8 AM and 6 PM, when their data scientists and analysts run heavy workloads. At night, they might have an occasional user who queries the data and generates a small report. Consequently, they don’t need the same cluster capacity at night as they do during the day.
  • A healthcare consulting company is rapidly growing their Data as a Service (DaaS) business. They want to quickly create a duplicate environment and provide the cluster endpoint to their clients. After the duplicate cluster is created, it needs to be right-sized quickly based on the cost and performance requirements of the client.
  • An IoT service provider is on a rapid growth trajectory. Whenever a major event takes place, their sensors send in terabytes of new data that needs to be ingested into Amazon Redshift and analyzed soon after it arrives.

When database administrators (DBAs) don’t have the nimbleness to react to these scenarios, the analysts can experience longer response times for mission-critical workloads. Or, they might be locked out altogether if the data warehouse is down for a resize. The DBAs, in turn, cannot support Service Level Agreements (SLAs) that they have set with their business stakeholders.

With Amazon Redshift, you can already scale quickly in three ways. First, you can query data in your Amazon S3 data lakes in place using Amazon Redshift Spectrum, without needing to load it into the cluster. This flexibility lets you analyze growing data volumes without waiting for extract, transform, and load (ETL) jobs or adding more storage capacity. Second, you can resize your Amazon Redshift clusters by adding more nodes or changing node types in just a few hours. During this time, analysts can continue to run read queries with no downtime. This gives you more agility compared to on-premises data warehouses that take days to scale. Third, you can spin up multiple Amazon Redshift clusters by quickly restoring data from a snapshot. This allows you to add compute resources that might be needed to support high concurrency.

Introducing elastic resize

We’re excited to introduce elastic resize, a new feature that enables you to add or remove nodes in an Amazon Redshift cluster in minutes. This further increases your agility to get better performance and more storage for demanding workloads, and to reduce cost during periods of low demand. You can resize manually from the AWS Management Console or programmatically with a simple API call.

With elastic resize, you can start small and scale up on-demand as your needs grow, as illustrated in the following diagram.

Amazon Redshift customers who have been previewing elastic resize before launch have immediately benefited by the scalability that it unlocks for them. Here is what some of our customers had to say about elastic resize:


Amazon Prime Video uses advanced data analytics to customize viewing recommendations and measure fan viewing experiences. “Redshift’s new elastic resize feature reduced our operational resizing time from 6 hours down to 15 minutes, allowing us to dynamically scale our infrastructure according to the diverse nature of our workloads and optimizing costs while maximizing performance.” Sergio Diaz Bautista, Data Engineer at Amazon Prime Video



Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. “Yelp is at the forefront of using data analytics to drive business decisions and enhance its users’ experience. Using elastic resize, we can confidently optimize for the best performance and keep costs low by configuring the cluster to scale up whenever demand increases beyond the usual variability window and scale down during off-peak hours. The ability to scale our data warehouse containing hundreds of terabytes of data, in minutes, is amazing,” says Shahid Chohan, data architect at Yelp.com


“Coupang is disrupting how the world shops on phones. We cannot always predict analytical demand because of evolving business needs and ad hoc analyses that are unexpectedly required. With elastic resize, we can scale compute and storage quickly to finish large ETL jobs faster and serve more users querying the data,” says Hara Ketha, senior manager of data engineering at Coupang.


OLX uses Amazon Redshift to power personalization and relevance, run reporting, and generate actionable customer insights. “With OLX, millions of people across the world buy and sell from each other daily. Redshift has been at the core of advanced analytics and big data innovation at OLX. For our ever increasing data needs, we can now add nodes using elastic resize blazingly fast. It’s also easy to size down during periods of low activity in order to save costs. Thank you Redshift!” said Michał Adamkiewicz, data architect for the Europe data team at OLX.com

How elastic resize works

Elastic resize is fundamentally different from the existing classic resize operation available in Amazon Redshift. Unlike classic resize, which creates a new cluster and transfers data to it, elastic resize modifies the number of nodes in your existing cluster. It enables you to deploy the additional nodes in minutes with minimal disruption to ongoing read or write queries. Hence, you can size up your cluster quickly for faster performance and size down when the job is finished to save cost.

You can still use classic resize when you want to change the node type (for instance, if you are upgrading from DC1 to DC2 nodes). The following stages describe what happens behind the scenes when you trigger elastic resize.

Stage 1: Preparing for resize while the cluster is fully available

At the start of elastic resize, Amazon Redshift first updates the snapshot on Amazon S3 with the most recent data. Today, Amazon Redshift takes an automated snapshot every 8 hours or 5 GB of data change, whichever comes first. In addition, you can also take snapshots manually. Each snapshot is incremental to capture changes that occurred since the last automated or manual snapshot. While the first snapshot is taken, the cluster is fully available for read and write queries.

On the console and via the describe-clusters API operation, the cluster status shows as available, prep-for-resize during this stage. Additionally, an event notification (REDSHIFT-EVENT-3012) lets you know that a request for elastic resize has been received.

Stage 2: Resizing the cluster while the cluster is unavailable

This stage, when the resizing actually happens, takes just a few minutes to finish. As the resize begins, the existing connections are put on hold. No new connections are accepted until the resize finishes, and the cluster is unavailable for querying. Some of the held connections or queries might fail if they were part of ongoing transactions. New nodes are added (for scaling up) or removed (for scaling down) during this period. Another incremental backup is taken to account for any data updates made by users during stage 1.

On the console and using the describe-clusters API operation, the cluster status now shows as resizing. Additionally, an event notification (REDSHIFT-EVENT-3011) lets you know that elastic resize started on the cluster.

Stage 3: Data transferring while the cluster is fully available

After resizing is finished in stage 2, the cluster is fully available for read and write queries. Queries that were being held are queued for execution automatically. During this stage, data is transferred from Amazon S3 to the nodes. Because Amazon Redshift only subtracts or adds nodes, only a fraction of the data needs to be transferred. For example, if you resized a three-node dc2.8xlarge cluster to six nodes, only 50 percent of the data needs to be moved.

We also improved the rate at which we restore data blocks from S3 by 2x, cutting down the time it would have taken to finish stage 3. In addition, this improvement also makes restores from a snapshot faster.

Moreover, Amazon Redshift moves the data intelligently to minimize the impact of data transfer on queries during this stage. The most frequently accessed data blocks are moved first, followed by other blocks based on their access frequency. This results in most incoming queries finding the data blocks that they are trying to access on disk.

On the console and using the describe-clusters API operation, the cluster status now shows as available, transferring data. Additionally, an event notification (REDSHIFT-EVENT-3534) lets you know that elastic resize has finished and the cluster is available for reads and writes.

After the data transfer is finished, the cluster status on the console and via the describe-clusters API operation once again shows as available.

Elastic resize constraints

Before deciding whether elastic resize is appropriate for your use case, consider the following constraints:

  • The new node configuration must have enough storage for existing data. Even when you add nodes, your new configuration might not have enough storage because of the way that data is redistributed.
  • You can resize only by a factor of 2, up or down, for dc2.large or ds2.xlarge node types. For example, you can resize a four-node cluster up to eight nodes or down to two nodes. This limitation exists to avoid data skew between nodes caused by an uneven distribution of slices.
  • For dc2.8xlarge or ds2.8xlarge node types, you can resize up to two times the original node count, or down to one-half the original node count. For example, you can resize a 16-node cluster to any size up to 32 nodes, or any size down to 8 nodes. This limitation exists to avoid data skew between nodes caused by an uneven distribution of slices.


Elastic resize significantly improves your ability to scale your Amazon Redshift clusters on-demand. Together with features such as Amazon Redshift Spectrum, it enables you to independently scale storage and compute so that you can adapt to the evolving analytical needs of your business.

To learn more about elastic resize and understand how you can size your clusters, watch the AWS Online Tech Talk: Best Practices for Scaling Amazon Redshift. You can also refer to the documentation for Resizing Clusters in the Amazon Redshift Cluster Management Guide.

About the Authors

Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Himanshu Raja is a Senior Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

Create cross-account and cross-region AWS Glue connections

Post Syndicated from Pankaj Malhotra original https://aws.amazon.com/blogs/big-data/create-cross-account-and-cross-region-aws-glue-connections/

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. AWS Glue uses connections to access certain types of source and target data stores, as described in the AWS Glue documentation.

By default, you can use AWS Glue to create connections to data stores in the same AWS account and AWS Region as the one where you have AWS Glue resources. In this blog post, we describe how to access data stores in an account or AWS Region different from the one where you have AWS Glue resources.

AWS Glue connections

AWS Glue uses a connection to crawl and catalog a data store’s metadata in the AWS Glue Data Catalog, as the documentation describes. AWS Glue ETL jobs also use connections to connect to source and target data stores. AWS Glue supports connections to Amazon Redshift, Amazon RDS, and JDBC data stores.

A connection contains the properties needed by AWS Glue to access a data store. These properties might include connection information such as user name and password, data store subnet IDs, and security groups.

If the data store is located inside an Amazon VPC, AWS Glue uses the VPC subnet ID and security group ID connection properties to set up elastic network interfaces in the VPC containing the data store. Doing this enables ETL jobs and crawlers to connect securely to the data store in the VPC.

AWS Glue can create this elastic network interface setup if the VPC containing the data store is in the same account and AWS Region as the AWS Glue resources. The security groups specified in a connection’s properties are applied on each of the network interfaces. The security group rules and network ACLs associated with the subnet control network traffic through the subnet. Correct rules for allowing outbound traffic through the subnet ensure that AWS Glue can establish network connectivity with all subnets in the VPC containing the data store, and therefore access the source or target data store.

VPC components can be interlinked only if they are present in the same AWS Region. Therefore, AWS Glue cannot create the elastic network interfaces inside a VPC in another region. If the VPC containing the data store is in another region, you have to add the network routes and create additional network interfaces which allow network interfaces set up by AWS Glue to establish network connectivity with the data store.

In this blog post, we describe how to configure the networking routes and interfaces to give AWS Glue access to a data store in an AWS Region different from the one with your AWS Glue resources. In our example, we connect AWS Glue, located in Region A, to an Amazon Redshift data warehouse located in Region B.

Note: The examples here assume that the Amazon Redshift cluster is in a different AWS Region, but belongs to the same account. The same setup and instructions are also valid for an Amazon Redshift cluster in a different account.

Setting up VPC components for AWS Glue

AWS Glue requires a VPC with networking routes to the data stores to which it connects. In our solution, the security groups and route tables are configured to enable elastic network interfaces set up by AWS Glue in a private subnet to reach the internet or connect to data stores outside the VPC. The following diagram shows the necessary components and the network traffic flow.

Required components for VPC setup:

  • AWS Glue resources in a private subnet in Region A.
  • A NAT gateway with an Elastic IP address attached to it in a public subnet in Region A.
  • A private route table containing a route allowing outbound network traffic from the private subnet to pass through the NAT gateway.
  • An internet gateway in Region A.
  • A public route table with a route allowing outbound network traffic from the public subnet to pass through the internet gateway.

Note: We must update the default security group of the VPC to include a self-referencing inbound rule and an outbound rule to allow all traffic from all ports. Later in the example, we attach this security group to an AWS Glue connection to let network interfaces set up by AWS Glue communicate with each other within a private subnet.

Network traffic flow through the components:

Outbound network traffic from AWS Glue resources in the private subnet to any destination or data store outside the private subnet is routed through the NAT gateway.

The NAT gateway is present in a public subnet and has an associated Elastic IP address. It forwards network traffic from AWS Glue resources to internet by using an internet gateway.

When AWS Glue tries to establish a connection with a data store outside of the private subnet, the incoming network traffic on the data store side appears to come from the NAT Gateway.

On the data store side, you allow the data store or its security group to accept incoming network traffic from the Elastic IP address attached to the NAT gateway. This is shown in the section “Allow Amazon Redshift to accept network traffic from AWS Glue,” following.

Creating VPC components using AWS CloudFormation

You can automate the creation of a VPC and all the components described preceding using the vpc_setup.yaml CloudFormation template, hosted on GitHub. Follow these step-by-step instructions to create the stack in your AWS account:

  1. Deploy the stack in the US Oregon (us-west-2) Region:

Note: In this example, we create the AWS Glue resources and connection in the us-west-2 Region. You can change this to the AWS Region where you have your AWS Glue connection and resources.

You are directed to the AWS CloudFormation console, with the stack name and URL template fields pre-filled.

  1. Choose Next.
  2. Use the default IP ranges and choose Next.
  3. Skip this step and choose Next.
  4. Review and choose Create.
  5. Wait for stack creation to complete. After completion, all the VPC components and necessary setup required are created.
  6. Navigate to the VPC console and copy the Elastic IP address for the newly created NAT.
    Note: This IP address is used for outbound network flow from AWS Glue resources and so should be whitelisted on the data store side. For more detail, see “Allow Amazon Redshift to accept network traffic from AWS Glue,” following.

Before creating and testing an AWS Glue connection to your data store, you need an IAM role that lets AWS Glue access the VPC components that you just created.

Creating an IAM role to let AWS Glue access Amazon VPC components

For this example, we create a role called TestAWSGlueConnectionIAMRole with a managed IAM policy AWSGlueServiceRole attached to it.

  1. Choose the Roles tab from the AWS Identity and Access Management (IAM) console.
  2. Choose Create role and select AWS Glue as a trusted entity.

  1. Attach an IAM policy to the role that allows AWS Glue to access the VPC components. In this example, we are using the default AWSGlueServiceRole policy, which contains all the required permissions for the setup.

  1. We name the role TestAWSGlueConnectionRole.

Note: The default GlueServiceRole policy that we attached to our custom role TestAWSGlueConnectionIAMRole has permissions for accessing VPC components. If you are using a custom policy instead of the default one, it should also contain the same permissions to be able to access VPC components.

Creating an Amazon Redshift cluster using AWS CloudFormation

For this example, we create a sample Amazon Redshift cluster in a VPC in the US N. Virginia (us-east-1) Region. Follow these step-by-step instructions to create the stack in your AWS account:

  1. Navigate to the CloudFormation console in region us-east-1 and create a new stack using this CloudFormation template, described in the documentation.
  2. Provide the configuration for the cluster and MasterUsername and MasterUserPassword. MasterUserPassword must follow the following constraints:
  • It must be 8–64 characters in length.
  • It must contain at least one uppercase letter, one lowercase letter, and one number.
  • It can use any printable ASCII characters (ASCII code 33–126) except ‘ (single quote), ” (double quote), :, \, /, @, or space.

  1. Choose Next and proceed with the stack creation.
  2. Review the configuration and choose Create.

  1. Wait for stack creation to complete, which can take a few minutes.

  1. Navigate to the Amazon Redshift console and choose the cluster name to see the cluster properties.

  1. Note the JDBC URL for the cluster and the attached security group for later use.

Note: We created a sample Amazon Redshift cluster in a public subnet present inside a VPC in Region B. We recommend that you follow the best practices for increased security and availability while setting up a new Amazon Redshift cluster, as shown in our samples on GitHub.

Creating an AWS Glue connection

Now you have the required VPC setup, Amazon Redshift cluster, and IAM role in place. Next, you can create an AWS Glue connection and test it as follows:

  1. Choose Add Connection under the Connections tab in AWS Glue console. The AWS Region in which we are creating this connection is the same as for our VPC setup, that is US Oregon (us-west-2).

  1. Choose a JDBC connection type. You can choose to enforce JDBC SSL or not, depending on the configuration for your data store.

  1. Add the connection-specific configuration. Note the URL for our Amazon Redshift cluster. It shows that the Amazon Redshift cluster is present in us-east-1.

Note: We use the VPC (VPCForGlue) and the private subnet (GluePrivateSubnet) we created for this connection. For security groups, we use the default security group for the VPC. This security group has a self-referencing inbound rule and an outbound rule that allows all traffic.

  1. Review configuration and choose Finish.

The AWS Glue console should now show that the connection was created successfully.

Note: Completing this step just means that an AWS Glue connection was created. It doesn’t guarantee that AWS Glue can actually connect to your data store. Before we test the connection, we also need to allow Amazon Redshift to accept network traffic coming from AWS Glue.

Allow Amazon Redshift to accept network traffic from AWS Glue

The Amazon Redshift cluster in a different AWS Region (us-east-1) from AWS Glue must allow incoming network traffic from AWS Glue.

For this, we update the security group attached to the Amazon Redshift cluster, and whitelist the Elastic IP address attached to the NAT gateway for the AWS Glue VPC.

Testing the AWS Glue connection

As a best practice, before you use a data store connection in an ETL job, choose Test connection. AWS Glue uses the parameters in your connection to confirm that it can access your data store and reports back any errors.

  1. Select the connection TestAWSGlueConnection that we just created and choose Test Connection.
  2. Select the TestAWSGlueConnectionIAMRole that we created for allowing AWS Glue resources to access VPC components.

  1. After you choose the Test connection button in the previous step, it can take a few seconds for AWS Glue to successfully connect to the data store. When it does, the console shows a message saying it “connected successfully to your instance.”


By creating a VPC setup similar to the one we describe, you can let AWS Glue connect to a data store in a different account or AWS Region. By doing this, you establish network connectivity between AWS Glue resources and your data store. You can now use this AWS Glue connection in ETL jobs and AWS Glue crawlers to connect with the data store.

If you have questions or suggestions, please leave a comment following.

Additional Reading

If you found this post helpful, be sure to check out Connecting to and running ETL jobs across multiple VPCs using a dedicated AWS Glue VPC, and How to access and analyze on-premises data stores using AWS Glue.


About the Author

Pankaj Malhotra is a Software Development Engineer at Amazon Web Services. He enjoys solving problems related to cloud infrastructure and distributed systems. He specializes in developing multi-regional, resilient services using serverless technologies.




Connecting to and running ETL jobs across multiple VPCs using a dedicated AWS Glue VPC

Post Syndicated from Nivas Shankar original https://aws.amazon.com/blogs/big-data/connecting-to-and-running-etl-jobs-across-multiple-vpcs-using-a-dedicated-aws-glue-vpc/

Many organizations use a setup that includes multiple VPCs based on the Amazon VPC service, with databases isolated in separate VPCs for security, auditing, and compliance purposes. This blog post shows how you can use AWS Glue to perform extract, transform, load (ETL) and crawler operations for databases located in multiple VPCs.

The solution presented here uses a dedicated AWS Glue VPC and subnet to perform the following operations on databases located in different VPCs:

  • Scenario 1: Ingest data from an Amazon RDS for MySQL database, transform it in AWS Glue, and output the results to an Amazon Redshift data warehouse.
  • Scenario 2: Ingest data from an Amazon RDS for MySQL database, transform it in AWS Glue, and output the results to an Amazon RDS for PostgreSQL database.

In this blog post, we’ll go through the steps needed to build an ETL pipeline that consumes from one source in one VPC and outputs it to another source in a different VPC. We’ll set up in multiple VPCs to reproduce a situation where your database instances are in multiple VPCs for isolation related to security, audit, or other purposes.

For this solution, we create a VPC dedicated to AWS Glue. Next, we set up VPC peering between the AWS Glue VPC and all of the other database VPCs. Then we configure an Amazon S3 endpoint, route tables, security groups, and IAM so that AWS Glue can function properly. Lastly, we create AWS Glue connections and an AWS Glue job to perform the task at hand.

Step 1: Set up a VPC

To simulate these scenarios, we create four VPCs with their respective IPv4 CIDR ranges. (Note: CIDR ranges can’t overlap when you use VPC peering.)

VPC 1 Amazon Redshift
VPC 2 Amazon RDS for MySQL
VPC 3 Amazon RDS for PostgreSQL
VPC 4 AWS Glue

Key configuration notes:

  1. The AWS Glue VPC needs at least one private subnet for AWS Glue to use.
  2. Ensure that DNS hostnames are enabled for all of your VPCs (unless you plan to refer to your databases by IP address later on, which isn’t recommended).

Step 2: Set up a VPC peering connection

Next, we peer our VPCs together to ensure that AWS Glue can communicate with all of the database targets and sources. This approach is necessary because AWS Glue resources are created with private addresses only. Thus, they can’t use an internet gateway to communicate with public addresses, such as public database endpoints. If your database endpoints are public, you can alternatively use a network address translation (NAT) gateway with AWS Glue rather than peer across VPCs.

Create the following peering connections.

Requester Accepter
Peer 1 VPC 4 VPC 1
Peer 2 VPC 4 -VPC 2
Peer 3 VPC 4 VPC 3

These peering connections can be across separate AWS Regions if needed. The database VPCs are not peered together; they are all peered with the AWS Glue VPC instead. We do this because AWS Glue connects to each database from its own VPC. The databases don’t connect to each other.

Key configuration notes:

  1. Create a VPC peering connection, as described in the Amazon VPC documentation. Select the AWS Glue VPC as the requester and the VPC for your database as the accepter.
  2. Accept the VPC peering request. If you are peering to a different AWS Region, switch to that AWS Region to accept the request.

Important: Enable Domain Name Service (DNS) settings for each of the peering connections. Doing this ensures that AWS Glue can retrieve the private IP address of your database endpoints. Otherwise, AWS Glue resolves your database endpoints to public IP addresses. AWS Glue can’t connect to public IP addresses without a NAT gateway.

Step 3: Create an Amazon S3 endpoint for the AWS Glue subnet

We need to add an Amazon S3 endpoint to the AWS Glue VPC (VPC 4). During setup, associate the endpoint with the route table that your private subnet uses. For more details on creating an S3 endpoint for AWS Glue, see Amazon VPC Endpoints for Amazon S3 in the AWS Glue documentation.

AWS Glue uses S3 to store your scripts and temporary data to load into Amazon Redshift.

Step 4: Create a route table configuration

Add the following routes to the route tables used by the respective services’ subnets. These routes are configured along with existing settings.

VPC 4—AWS Glue Destination Target
Route table VPC 3 Peer 3 VPC 1 Peer 1 VPC 2 Peer 2


VPC 1—Amazon Redshift Destination Target
Route table VPC 4 Peer 1


VPC 2—Amazon RDS MySQL Destination Target
Route table VPC 4 Peer 2


VPC 3—Amazon RDS PostgreSQL Destination Target
Route table VPC 4 Peer 3

Key configuration notes:

  • The route table for the AWS Glue VPC has peering connections to all VPCs. It has these so that AWS Glue can initiate connections to all of the databases.
  • All of the database VPCs have a peering connection back to the AWS Glue VPC. They have these connections to allow return traffic to reach AWS Glue.
  • Ensure that your S3 endpoint is present in the route table for the AWS Glue VPC.

Step 5: Update the database security groups

Each database’s security group must allow traffic to its listening port (3306, 5432, 5439, and so on) from the AWS Glue VPC for AWS Glue to be able to connect to it. It’s also a good idea to restrict the range of source IP addresses as much as possible.

There are two ways to accomplish this. If your AWS Glue job will be in the same AWS Region as the resource, you can define the source as the security group that you use for AWS Glue. If you are using AWS Glue to connect across AWS Regions, specify the IP range from the private subnet in the AWS Glue VPC instead. The examples following use a security group as our AWS Glue job, and data sources are all in the same AWS Region.

In addition to configuring the database’s security groups, AWS Glue requires a special security group that allows all inbound traffic from itself. Because it isn’t secure to allow traffic from, we create a self-referencing rule that simply allows all traffic originating from the security group. You can create a new security group for this purpose, or you can modify an existing security group. In the example following, we create a new security group to use later when AWS Glue connections are created.

The security group Amazon RDS for MySQL needs to allow traffic from AWS Glue:

Amazon RDS for PostgreSQL allows traffic to its listening port from the same:

Amazon Redshift does it as so:

AWS Glue does it as so:

Step 6: Set up IAM

Make sure that you have an AWS Glue IAM role with access to Amazon S3. You might want to provide your own policy for access to specific Amazon S3 resources. Data sources require s3:ListBucket and s3:GetObject permissions. Data targets require s3:ListBucket, s3:PutObject, and s3:DeleteObject permissions. For more information on creating an Amazon S3 policy for your resources, see Policies and Permissions in the IAM documentation.

The role should look like this:

Or you can create an S3 policy that’s more restricted to suit your use case.

Step 7: Set up an AWS Glue connection

The Amazon RDS for MySQL connection in AWS Glue should look like this:

The Amazon Redshift connection should look like this:

The Amazon RDS for PostgreSQL connection should look like this:

Step 8: Set up an AWS Glue job

Key configuration notes:

  1. Create a crawler to import table metadata from the source database (Amazon RDS for MySQL) into the AWS Glue Data Catalog. The scenario includes a database in the catalog named gluedb, to which the crawler adds the sample tables from the source Amazon RDS for MySQL database.
  2. Use either the source connection or destination connection to create a sample job as shown following. (This step is required for the AWS Glue job to establish a network connection and create the necessary elastic network interfaces with the databases’ VPCs and peered connections.)
  3. This scenario uses pyspark code and performs the load operation from Amazon RDS for MySQL to Amazon Redshift. The ingest from Amazon RDS for MySQL to Amazon RDS for PostgreSQL includes a similar job.
  4. After running the job, verify that the table exists in the target database and that the counts match.

The following screenshots show the steps to create a job in the AWS Glue Management Console.

Following are some of examples of loading data from source tables to target instances. These are simple one-to-one mappings, with no transformations applied. Notice that the data sources and data sink (target) connection configuration access multiple VPCs from a single AWS Glue job.

Sample script 1 (Amazon RDS for MySQL to Amazon Redshift)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource = glueContext.create_dynamic_frame.from_catalog(database = "gluedb", table_name = "mysqldb_events", transformation_ctx = "datasource")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = datasource, catalog_connection = "Redshift", connection_options = {"dbtable": "mysqldb_events", "database": "dmartblog"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink")


Sample script 2:  Amazon RDS for MySQL to Amazon RDS for PostgreSQL (can also change with other RDS endpoint)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource = glueContext.create_dynamic_frame.from_catalog(database = "gluedb", table_name = "mysqldb_events", transformation_ctx = "datasource")

datasink = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, catalog_connection = "vpc-pgsql", connection_options = {"dbtable": "mysqldb_events", "database": "mypgsql"}, transformation_ctx = "datasink")



In this blog post, you learn how to configure AWS Glue to run in a separate VPC so that it can execute jobs for databases located in multiple VPCs.

The benefits of doing this include the following:

  • A separate VPC and dedicated pool on the running AWS Glue job, isolated from database and compute nodes.
  • Dedicated ETL developer access to a single VPC for better security control and provisioning.


Additional Reading

If you found this post useful, be sure to check out Restrict access to your AWS Glue Data Catalog with resource-level IAM permissions and resource-based policies, and Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production.


About the Author

Nivas Shankar is a Senior Big Data Consultant at Amazon Web Services. He helps and works closely with enterprise customers building big data applications on the AWS platform. He holds a Masters degree in physics and is highly passionate about theoretical physics concepts. He enjoys spending time with his wife and two adorable kids. In his spare time, he takes his kids to tennis and football practice.



Ian Eberhart is a Cloud Support Engineer on the Big Data team for AWS Premium Support. He works with customers on a daily basis to find solutions for moving and sorting their data on the AWS platform. In his spare time, Ian enjoys seeing independent and weird movies, riding his bike, and hiking in the mountains.


Chasing earthquakes: How to prepare an unstructured dataset for visualization via ETL processing with Amazon Redshift

Post Syndicated from Ian Funnell original https://aws.amazon.com/blogs/big-data/chasing-earthquakes-how-to-prepare-an-unstructured-dataset-for-visualization-via-etl-processing-with-amazon-redshift/

As organizations expand analytics practices and hire data scientists and other specialized roles, big data pipelines are growing increasingly complex. Sophisticated models are being built using the troves of data being collected every second.

The bottleneck today is often not the know-how of analytical techniques. Rather, it’s the difficulty of building and maintaining ETL (extract, transform, and load) jobs using tools that might be unsuitable for the cloud.

In this post, I demonstrate a solution to this challenge. I start with a noisy semistructured dataset of seismic events, spanning several years and recorded at different locations across the globe. I set out to obtain broad insights about the nature of the rocks forming the Earth’s surface itself—the tectonic plate structure—to be visualized using the mapping capability in Amazon QuickSight.

To accomplish this, I use several AWS services, orchestrated together using Matillion ETL for Amazon Redshift:

Tectonic plate structure context

An earthquake is caused by a buildup of pressure that gets suddenly released. Earthquakes tend to be more severe at the boundaries of destructive tectonic plates. These boundaries are formed when a heavier and denser oceanic plate collides with a lighter continental plate, or when two oceanic plates collide. Due to the difference in density, the oceanic lithosphere is pushed underneath the continental plate, forming what is called a subduction zone. (See the following diagram.) In subduction zones, earthquakes can occur at depths as great as 700 kilometers.

Photography by KDS4444 [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons

For our analysis, we ultimately want to visualize the depth of an earthquake focus to locate subduction zones, and therefore find places on earth with the most severe earthquakes.

Seismographic data source

The data source is from the International Federation of Digital Seismograph Networks (FDSN). The event data is in JSON format (from the European Mediterranean Seismological Centre, or EMSC). An external process accumulates files daily into an Amazon S3 bucket, as shown following.

Each individual file contains all the seismic events for one day—usually several hundred—in an embedded array named “features,” as shown in the following example:

  "type": "FeatureCollection",
  "metadata": {
    "totalCount": 103
  "features": [
      "geometry": {
        "type": "Point",
        "coordinates": [26.76, 45.77, -140]
      "type": "Feature",
      "id": "20180302_0000103",
      "properties": {
        "lastupdate": "2018-03-02T23:27:00.0Z",
        "lon": 26.76, "lat": 45.77, "depth": 140,
        "mag": 3.7,
        "time": "2018-03-02T23:22:52.1Z",
        "flynn_region": "ROMANIA"
      "geometry": {
        "type": "Point",

Architecture overview

Athena reads and flattens the S3 data and makes it available for Matillion ETL to load into Amazon Redshift via JDBC. Matillion orchestrates this data movement, and it also provides a graphical framework to design and build the more complex data enrichment and aggregation steps to be performed by Amazon Redshift. Finally, the prepared data is queried by Amazon QuickSight for visualization.

Amazon Athena setup

You can use Athena to query data in S3 using standard SQL, via a serverless infrastructure that is managed entirely by AWS on your behalf. Before you can query your data, start by creating an external table. By doing this, you are defining the schema to apply to the data when it is being queried.

You can choose to use an AWS Glue crawler to assist in automatically discovering the schema and format of your source data files.

The following is the CREATE TABLE statement that you can copy and paste into the Athena console to create the schema needed to query the seismic data. Make sure that you substitute the correct S3 path to your seismic data in the LOCATION field of the statement.

  `type` string COMMENT 'from deserializer', 
  `metadata` struct<totalcount:int> COMMENT 'from deserializer', 
  `features` array<struct<geometry:struct<type:string,coordinates:array<double>>,type:string,id:string,properties:struct<lastupdate:string,magtype:string,evtype:string,lon:double,auth:string,lat:double,depth:double,unid:string,mag:double,time:string,source_id:string,source_catalog:string,flynn_region:string>>> COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-bucket/SeismicPortal'

After the table is created, you are ready to query it. Athena uses the in-memory, distributed SQL engine called Apache Presto. It provides the ability to unnest arrays, which you use next.

Transfer to Amazon Redshift

The embedded array in every source record gets flattened out and converted to individual records during the JDBC export (download the .jar file) into Amazon Redshift. You use a Matillion ETL Database Query component to assist with the data transfer during this step, as shown in the following image.

This component simplifies ETL by automating the following steps:

Runs the SQL SELECT statement (shown in the following example).

  1. Streams query results across the network from Athena and into temporary storage in S3.
  2. Performs a bulk data load into Amazon Redshift.

Athena executes the following SQL statement:

SELECT f.id,
	   f.properties.time AS event_time,
FROM “seismic”.”sp_events”
CROSS JOIN UNNEST (features) as t(f)

The CROSS JOIN UNNEST syntax flattens the embedded array, generating hundreds of individual event records per day.

Now that the data has been copied and flattened into individual event records (shown in the following image), it’s ready for enrichment and aggregation.

Data enrichment

Earthquakes occur along a continuous range of spatial coordinates. In order to aggregate them, as we’ll be doing very soon, it’s necessary to first group them together. A convenient method is to assign every event into a Universal Transverse Mercator (UTM) zone. These zones are six-degree bands of longitudes that convert the spherical latitude/longitude coordinates into a 2D representation. Performing this conversion provides good granularity for visualization later.

The calculation to convert a spherical longitude/latitude coordinate into a two-dimensional UTM coordinate is complex. It can be performed ideally using an Amazon Redshift user-defined function (UDF). I chose a UDF for the ability to invoke it, via a Matillion component, in the next step.

CREATE OR REPLACE FUNCTION f_ll_utm (Lat float, Long float)
AS $$
From math import pi, sin, cos, tan, sqrt

_deg2rad = pi / 180.0
_rad2deg = 180.0 / pi

_EquatorialRadius = 1
_eccentricitySquared = 2
_ellipsoid = [ “WGS-84”, 6378137, 0.00669438]

The UDF has to return three pieces of information:

  • UTM Zone code
  • Easting (x-axis measurement in meters)
  • Northing (ditto, for the y-axis)

A scalar UDF can only return a single value. Therefore the three results were returned as a pipe-delimited string, in which the three values are pipe-separated:

To bring the values out into individual fields, the UDF is first invoked using a Matillion ETL Calculator component, followed by a field splitter and a Calculator to perform data type conversion and rounding.

Data aggregation

To reiterate, we’re interested in the depth of earthquake focus specifically on destructive plate boundaries. Knowing the depth helps us estimate the potential severity of earthquakes.

We need to find the average event depth within each UTM zone, in the expectation that a spatial pattern will appear that will highlight the subduction zones.

The last three steps in the Matillion transformation (shown in the following diagram) perform the necessary aggregation, add a depth quartile, and create an output table from the resulting data.

The ”Aggregate to UTM ref” step gets Amazon Redshift to perform a GROUP BY function in SQL, which approximates every event to the appropriate UTM zone. While doing this aggregation, you simultaneously do the following:

  • Count the events (which determines the size of the visual representation).
  • Find the average depth (which determines the color of the visual representation).
  • Determine the average latitude and longitude (which approximates to the center of the UTM zone, and determines the position of the visual representation).

The following image shows the aggregation type for each column:

Average depth is a useful measure, but to maximize the visual impact of the final presentation, we also take the opportunity to rank the results into quartiles. This allows the zones with the deepest quartile to stand out consistently on the map.

NTILE(4) OVER (ORDER BY "avg_depth")

Amazon Redshift is great at performing this type of analytics, which is delivered inside another Matillion ETL Calculator component.

The Recreate Output step materializes the dataset into an Amazon Redshift table, ready for Amazon QuickSight to visualize.

Amazon QuickSight visualization

The Amazon QuickSight “points on map” visualization is perfect for this 2D rendering. The values for the field wells come straight from the aggregated data in Amazon Redshift:

  • Geospatial — the average lat/long per UTM grid.
  • Size — the record count, in other words, the number of events in that UTM zone.
  • Color — the Depth Ntile, with the fourth quartile in pink.

The resulting map shows the global subduction zones highlighted clearly in pink, these being the areas with the deepest earthquake’s focus on average.

Recap and summary

In this post, I used seismological data as an example to explore challenges around the visualization of unstructured data and to provide best practices. I suggested a way to overcome these challenges with an architecture that is also applicable for datasets from a wide array of sources, beyond geology. I then explored how to orchestrate activities of different data processing tasks between S3, Athena, and Amazon Redshift using Matillion ETL for Amazon Redshift.

If you’re trying to solve similar analytics challenges and want to see how Amazon Redshift and Matillion can help, launch a 14 day free trial of Matillion ETL for Amazon Redshift on the AWS Marketplace or schedule a demo today. If you have questions or suggestions, please comment below.

Additional Reading

If you found this post helpful, be sure to check out Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift, and Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda.


About the Author

Ian Funnell, a lead solution architect at Matillion, is a seasoned cloud and data warehousing specialist. Ian is dedicated to supporting customers in driving their data transformation forward and solving their deepest technical challenges. Ian’s has 25+ years experience in the tech industry.




Performance matters: Amazon Redshift is now up to 3.5x faster for real-world workloads

Post Syndicated from Ayush Jain original https://aws.amazon.com/blogs/big-data/performance-matters-amazon-redshift-is-now-up-to-3-5x-faster-for-real-world-workloads/

Since we launched Amazon Redshift, thousands of customers have trusted us to get uncompromising speed for their most complex analytical workloads. Over the course of 2017, our customers benefited from a 3x to 5x performance gain, resulting from short query acceleration, result caching, late materialization, and many other under-the-hood improvements. In this post, we highlight recent improvements to Amazon Redshift and how our continued focus on performance enhancements is benefiting customers. We also discuss performance testing derived from industry-standard benchmarks that help us measure the impact of these ongoing improvements.

Recent performance improvements

With the largest number of data warehousing deployments in the cloud, we have the ability to analyze usage patterns across a variety of analytical workloads and uncover opportunities to improve performance. We leverage these insights to deliver improvements that seamlessly benefit thousands of customers. Major improvements in performance over the past six months include the following:

  • Improved resource management for memory-intensive queries: Amazon Redshift improved how joins and aggregations consume and reserve memory. This improved cache efficiency for the majority of the hash tables and reduced spilling for memory-intensive joins and aggregations by up to 1.6x.
  • Improved performance for commits: As a central component of write transactions, commit has a direct impact on the performance of data update and data ingestion workloads, such as ETL (extract, transform, and load) jobs. Since November 2017, we’ve delivered a series of commit performance optimizations such as batching multiple commits in a single operation, improved usage of commit locks, and a locality-aware metadata defragmenter. These and other related optimizations have resulted in a 4x commit time reduction on average for HDD-based clusters. For heavy transactions (the top 5 percent of commit operations in Amazon Redshift), the delivered optimizations resulted in a 7.5x improvement.
  • Improved performance for repeated queries: With Amazon Redshift’s result caching, dashboards, visualization, and business intelligence (BI) tools that execute queries repeatedly now see a significant boost in performance. In addition, result caching frees up resources that can improve the performance of all other queries.
  • Query processing improvements: Amazon Redshift now performs 2x–6x faster for scenarios such as repeated subqueries, advanced analytics functions with predicates, and complex query plans by eliminating duplicate work and streamlining steps.
  • Faster string manipulation: Amazon Redshift yields 5x better performance for frequently used string functions because of more efficient code generation techniques.

We’ve also complemented these out-of-the-box improvements with tailored recommendations to help you get better performance at a lower cost with Amazon Redshift Advisor. Advisor has already provided close to 50,000 recommendations since it launched in July 2018.

All of these optimizations have transparently boosted customers’ ability to get faster insights from their AWS analytics platform and saved thousands of hours of execution time on a daily basis. This applies to even the largest deployments, where customers have multiple petabytes of data in Redshift clusters, and seamless access to even larger data volumes in their Amazon S3 data lakes with Amazon Redshift Spectrum. “Redshift’s query performance and scalability has been increasing, even though our data has grown.” said Minero Aoki, Senior Data Engineer, Cookpad Inc. “In the last 10 months, we have seen commit performance increase by 500% without any increase in cost.”

Using benchmarks to measure success

To measure the impact of these ongoing improvements, we measure performance on a nightly basis and run queries derived from industry-standard benchmarks such as TPC-DS. We also occasionally benchmark Amazon Redshift against other data warehouse services. We set up these measurements to reflect our customers’ real-world usage, as highlighted earlier. This enables us to accurately gauge whether Amazon Redshift is getting better with each release, which happens every two weeks.

Comparing Amazon Redshift releases over the past few months, we observed that Amazon Redshift is now 3.5x faster versus six months ago, running all 99 queries derived from the TPC-DS benchmark. This is shown in the following chart.

Note: We used a Cloud DW benchmark derived from TPC-DS for this study. As such, the Cloud DW benchmark is not comparable to published TPC-DS results. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council.

For this post, we also compared the latest Amazon Redshift release with Microsoft Azure SQL Data Warehouse using the Cloud DW benchmark derived from TPC-DS. Queries ran against a 3 TB dataset on a 4-node cluster on both services, using dc2.8xlarge for Amazon Redshift and DW2000c Gen2 for Azure SQL Data Warehouse. We could not run a larger dataset because Azure could not allocate the DW15000c cluster required for a 30 TB dataset owing to capacity constraints at the time of publishing.

We observed that Amazon Redshift is 15x faster than Azure SQL Data Warehouse running all 99 queries with one user, and 14x faster with four concurrent users. There were a couple of outlier queries that took Azure SQL Data Warehouse several hours to complete. Excluding the two long running queries, Amazon Redshift is 2x faster than Azure SQL Data Warehouse with 1 user and 1.6x faster with four concurrent users. The following charts compare the two services.

Note: We used queries derived from TPC-DS v2.9 for this study. Amazon Redshift and Azure SQL DW do not support rollup queries, so we used TPC-DS provided variants for queries 5, 14, 18, 27, 36, 67, 70, 77, 80, and 86. We used out-of-the-box Workload Management configuration for Amazon Redshift, which allows for 5 concurrent queries, and ‘largerc’ resource class for Azure SQL DW, which has a lower limit of 4 concurrent queries. Amazon Redshift took 25 minutes to run all 99 queries, whereas Azure SQL Data Warehouse took 6.4 hours. Ignoring two queries that each took Azure SQL Data Warehouse more than 1 hour to execute (Q38 and Q67), Amazon Redshift took 22 minutes, while Azure SQL Data Warehouse took 42 minutes.


Evaluating Amazon Redshift

Although benchmarks against other data warehouse services are interesting, they are of limited value. First, there’s no one-size-fits-all benchmark. Each service has its unique real-world usage patterns and ways to configure and tune for them. We make a best effort to configure the services based on publicly available guidance, but we can’t guarantee optimal performance for any given service. We see this commonly with third-party benchmarks, for instance, where Amazon Redshift’s powerful distribution and sort keys are not used—even though the large majority of our customers use them.

Similarly, each benchmark query can only be run once, in contrast to real-world scenarios where 99.5 percent of queries we observe have components that can be found in the compilation cache (Amazon Redshift generates and compiles code for each query execution plan. The compiled code segments are stored in a least recently used cache and shared across sessions in a cluster). In other words, they are similar to queries that were run previously. So, the query run times measured by benchmarking studies can end up over-indexing on compilation times, which might not indicate the actual performance you can expect to get.

Secondly, these studies are, by necessity, a point-in-time assessment. As cloud vendors update and evolve their service, benchmark numbers might already be obsolete by the time they’re published.

Therefore, we don’t recommend that you make product selection decisions based on these benchmarks because your data and your query workloads have their own unique characteristics. If you’re evaluating Amazon Redshift for your analytics platform, we have created a Proof of Concept guide to help. You can also request assistance from us, or work with one of our System Integration and Consulting Partners and make a data-driven decision.

Finally, we invite you to watch the recent Fireside chat webinar and join us at re:Invent 2018 in Las Vegas, where we have a ton of exciting news to share with you. Happy querying!

If you would like instruction to reproduce the benchmark, please contact us at [email protected]. If you have questions or suggestions, please comment below.

About the Authors

Ayush Jain is a Product Marketer at Amazon Web Services. He loves growing cloud services and helping customers get more value from the cloud deployments. He has several years of experience in Software Development, Product Management and Product Marketing in developer and data services.




Mostafa Mokhtar is an engineer working on Redshift performance. Previously, he held similar roles at Cloudera, Hortonworks and on the SQL Server team at Microsoft.


Closing the customer journey loop with Amazon Redshift at Equinox Fitness Clubs

Post Syndicated from Ryan Kelly original https://aws.amazon.com/blogs/big-data/closing-the-customer-journey-loop-with-amazon-redshift-at-equinox-fitness-clubs/

Clickstream analysis tools handle their data well, and some even have impressive BI interfaces. However, analyzing clickstream data in isolation comes with many limitations. For example, a customer is interested in a product or service on your website. They go to your physical store to purchase it. The clickstream analyst asks, “What happened after they viewed the product?” and the commerce analyst asks themselves, “What happened before they bought it?”

It’s no surprise that clickstream data can enhance your other data sources. When used with purchase data, it helps you determine abandoned carts or optimize marketing spending. Similarly, it helps you analyze offline and online behavior, and also behavior before customers even registered an account. However, once the benefits of clickstream data feeds are evident, you must accommodate new requests quickly.

This blog post shows how we, at Equinox Fitness Clubs, moved our data from Amazon Redshift to Amazon S3 to use a late-binding view strategy with our clickstream data. Expect such fun things as Apache Spark, Apache Parquet, data lakes, hive partitioning, and external tables, all of which we will talk about extensively in this post!

When we first started passing our clickstream data from its own tool to our Amazon Redshift data warehouse, speed was the primary concern. Our initial use case was to tie Salesforce data and Adobe Analytics data together to get a better understanding about our lead process. Adobe Analytics could tell us what channel and campaigns people came from, what pages they looked at during the visit, and if they submitted a lead form on our site. Salesforce could tell us if a lead was qualified, if they ever met an advisor, and ultimately if they became a member. Tying these two datasets together helped us better understand and optimize our marketing.

To begin, we knew the steps involved to centralize Salesforce and Adobe Analytics data into Amazon Redshift. However, even when joined together in Redshift, they needed a common identifier to talk to each other. Our first step involved generating and sending the same GUID to both Salesforce and Adobe Analytics when a person submitted a lead form on our website.

Next, we had to pass Salesforce data to Redshift. Luckily, those feeds already existed, so we could add this new GUID attribute in the feed and describe it in Redshift.

Similarly, we had to generate data feeds from Adobe Analytics to Amazon Redshift. Adobe Analytics provides Amazon S3 as a destination option for our data, so we passed the data to S3 and then created a job to send it to Redshift. The job involved taking the daily Adobe Analytics feed – which comes with a data file containing hundreds of columns and hundreds of thousands of rows, a collection of lookup files like the headers for the data, and a manifest file that describes the files that were sent – and passing it all to Amazon S3 in its raw state. From there, we used Amazon EMR with Apache Spark to process the data feed files into a single CSV file, and then save it to S3 so that we could perform the COPY command to send the data to Amazon Redshift.

The job ran for a few weeks and worked well until we started to use the data more frequently. While the job was effective, backdating data with new columns (schema evolution) started to occur. That’s when we decided we needed greater flexibility because of the nature of the data.

Data lake to the rescue

When we decided to refactor the job, we had two things lined up for us. First, we were already moving towards more of a data lake strategy. Second, Redshift Spectrum had been recently released. It would enable us to query these flat files of clickstream data in our data lake without ever having to run the COPY command and store it in Redshift. Also, we could more efficiently join the clickstream data to other data sources stored inside of Redshift.

We wanted to take advantage of self-describing data which combines the schema of the data with the data itself. Converting the data to self-describing data would help us manage the wide clickstream datasets and prevent the schema-evolution related challenges. We could pull every column we desired into the data lake file and then use only the important columns in our queries to speed up processing. To accomplish this flexibility, we used the Apache Parquet file format, which is both self-describing and blazing fast due to its columnar storage technology. We used Apache Spark on Amazon EMR to convert from CSV to Parquet, and partition the data for scanning performance, as shown in the following code.


from datetime import date, timedelta
from pyspark.sql.types import *
from pyspark.sql import SparkSession
import json
import argparse
# Usage
# spark-submit all_omniture_to_parquet.py 2017-10-31 s3a:// eqxdl-prod-l-omniture eqxios eqxdl-prod eqeqxiosprod omniture_eqxios
# python -m tasks.s2w_all_omniture_to_parquet 2017-10-31

parser = argparse.ArgumentParser()
parser.add_argument('year_month_day_arg', help='Run date (yyyy-mm-dd)', type=str, default='XXX')
parser.add_argument('s3_protocol', help='S3 protocol i.e. s3a://',type=str, default='XXX')
parser.add_argument('source_bucket', help='Omniture source data bucket',type=str, default='XXX')
parser.add_argument('source_path', help='Omniture source data path',type=str, default='XXX')
parser.add_argument('target_bucket', help='Omniture target data bucket',type=str, default='XXX')
parser.add_argument('report_suite', help='Omniture report suite ID',type=str, default='XXX')
parser.add_argument('application', help='App name for job',type=str, default='XXX')
args = parser.parse_args()

spark = SparkSession\

sc = spark.sparkContext

def manifest_toJSON(file, location):
	text = sc.textFile(file).collect()
	manifest = {'lookup_files': [], 'data_files': location, 'total_rows': 0}
	for x in text:
		if 'Lookup-File:' in x:
			manifest['lookup_files'].append(location+x.split(': ')[1])
		elif 'Data-File: 01' in x:
			wildcard_path = x.replace('Data-File: 01','*')
			manifest['data_files'] += wildcard_path
		elif 'Record-Count:' in x:
			manifest['total_rows'] += int(x.split(': ')[1])
	return manifest

# Create metadata by stitching together the file paths for
# the header file and the data file from the manifest file
# base_filepath = '/Users/rkelly/projects/sparkyTest/project_remodeling/ios_test_data/'
base_filepath = '{}{}/{}/'.format(args.s3_protocol, args.source_bucket, args.source_path)
manifest_filepath = base_filepath+'{}_{}.txt'.format(args.report_suite, args.year_month_day_arg)
metadata = manifest_toJSON(manifest_filepath, base_filepath)

# Create a list of files and their data
# Look specifically for the column_headers.tsv data
# Split on \x00 to remove the garbage encoding and return a string of headers
lookup_files = sc.textFile(','.join(metadata['lookup_files'])).collect()
encoded_header = lookup_files[[idx for idx, s in enumerate(lookup_files) if 'column_headers.tsv' in s][0]].split('\x00')
header = encoded_header[[idx for idx, s in enumerate(encoded_header) if '\t' in s][0]]\
			.replace('\n', '')\
			.replace('(', '')\
			.replace(')', '')\
			.replace(' ', '-')

# Create a schema for the list from the header file splitting on tabs
# Cast everything as a string to avoid data type failures
schema = StructType([ StructField(field, StringType(), True) for field in header.split('\t')])

# Bypass RDD and write data file as a dataframe
# then save as parquet to tie headers to their respective values
df = spark.read.csv(metadata['data_files'], header=False, schema=schema, sep='\t', nullValue=None)
destination_filepath = '{}{}/{}/dt={}/'.format(args.s3_protocol, args.target_bucket, args.application, args.year_month_day_arg)

# Gracefully exit out of spark and this file


Using the AWS Glue Data Catalog allowed us to make our clickstream data available to be queried within Amazon Redshift and other query tools like Amazon Athena and Apache Spark. This is accomplished by mapping the Parquet file to a relational schema. AWS Glue, enables querying additional data in mere seconds. This is because schema changes can occur in real time. This means you can delete and add columns, reorder column indices, and change column types all at once. You can then query the data immediately after saving the schema. Additionally, Parquet format prevents failures when the shape of the data changes, or when certain columns are deprecated and removed from the data set.

We used the following query to create our first AWS Glue table for our Adobe Analytics website data. We ran this query in Amazon Redshift in SQL Workbench.


--First create your schema
create external schema omniture_prod
from data catalog 
database 'omniture' 
iam_role 'arn:aws:iam:::role

--Then create your “table” 
CREATE EXTERNAL TABLE omniture_prod.eqx_web (
  date_time		VARCHAR,
  va_closer_id		VARCHAR,
  va_closer_detail	VARCHAR,
  va_finder_detail	VARCHAR,
  va_finder_id		VARCHAR,
  ip			VARCHAR,
  domain		VARCHAR,
  post_evar1		VARCHAR
LOCATION 's3://eqxdl-prod/omniture/eqx_web/'
table properties ('parquet.compress'='SNAPPY');

--Check your databases, schemas, and tables
select * from pg_catalog.svv_external_databases;
select * from pg_catalog.svv_external_schemas;
select * from pg_catalog.svv_external_tables;


After running this query, we added additional columns to the schema upon request throughthe AWS Glue interface. We also used partitioning to make our queries faster and cheaper.

At this point, we had a new schema folder in our database. It contained the external table that could be queried, but we wanted to take it a step further. We needed to add some transformations to the data such as:

  • Renaming IDs to strings
  • Concatenating values
  • Manipulating strings, excluding bot traffic that we send from AWS to test the website
  • Changing the column names to be more user friendly.

To do this, we created a view of the external table, as follows:


create view edw_t.f_omniture_web as 
    REPLACE(dt, '-', '') as hive_date_key,
    va_closer_detail as last_touch_campaign,
        WHEN (va_closer_id) = '1' THEN 'Paid Search'
        WHEN (va_closer_id) = '2' THEN 'Natural Search'
        WHEN (va_closer_id) = '3' THEN 'Display'
        WHEN (va_closer_id) = '4' THEN 'Email Acq'
        WHEN (va_closer_id) = '5' THEN 'Direct'
        WHEN (va_closer_id) = '6' THEN 'Session Refresh'
        WHEN (va_closer_id) = '7' THEN 'Social Media'
        WHEN (va_closer_id) = '8' THEN 'Referring Domains'
        WHEN (va_closer_id) = '9' THEN 'Email Memb'
        WHEN (va_closer_id) = '10' THEN 'Social Placement'
        WHEN (va_closer_id) = '11' THEN 'Other Placement'
        WHEN (va_closer_id) = '12' THEN 'Partnership'
        WHEN (va_closer_id) = '13' THEN 'Other Eqx Sites'
        WHEN (va_closer_id) = '14' THEN 'Influencers'
        ELSE NULL
    END AS last_touch_channel,
    va_finder_detail as first_touch_campaign,
    va_finder_id as va_finder_id,
        WHEN (va_finder_id) = '1' THEN 'Paid Search'
        WHEN (va_finder_id) = '2' THEN 'Natural Search'
        WHEN (va_finder_id) = '3' THEN 'Display'
        WHEN (va_finder_id) = '4' THEN 'Email Acq'
        WHEN (va_finder_id) = '5' THEN 'Direct'
        WHEN (va_finder_id) = '6' THEN 'Session Refresh'
        WHEN (va_finder_id) = '7' THEN 'Social Media'
        WHEN (va_finder_id) = '8' THEN 'Referring Domains'
        WHEN (va_finder_id) = '9' THEN 'Email Memb'
        WHEN (va_finder_id) = '10' THEN 'Social Placement'
        WHEN (va_finder_id) = '11' THEN 'Other Placement'
        WHEN (va_finder_id) = '12' THEN 'Partnership'
        WHEN (va_finder_id) = '13' THEN 'Other Eqx Sites'
        WHEN (va_closer_id) = '14' THEN 'Influencers'
        ELSE NULL
    END AS first_touch_channel,
    ip as ip_address,
    domain as domain,
    post_evar1 AS internal_compaign,
    post_evar10 as site_subsection_nm,
    post_evar11 as IOS_app_view_txt,
    post_evar12 AS site_section_nm,
    post_evar15 AS transaction_id,
    post_evar23 as join_barcode_id,
    post_evar3 AS page_nm,
    post_evar32 as host_nm,
    post_evar41 as class_category_id,
    post_evar42 as class_id,
    post_evar43 as class_instance_id,
    post_evar60 AS referral_source_txt,
    post_evar69 as adwords_gclid,
    post_evar7 as usersec_tracking_id, 
    post_evar8 as facility_id,
    post_event_list as post_event_list,  
    post_visid_low||post_visid_high as unique_adobe_id,
    post_visid_type as post_visid_type,
    post_page_event as hit_type,
    visit_num as visit_number,
    post_evar25 as login_status,
    exclude_hit as exclude_hit,
    hit_source as hit_source,
    post_evar64 as api_error_msg,
    post_evar70 as page_load_time,
    post_evar78 as join_transaction_id,
    post_evar9 as page_url,
    visit_start_pagename as entry_pg,
    post_tnt as abtest_campaign,
    post_tnt_action as abtest_experience,
    user_agent as user_agent,
    mobile_id as mobile_id,
    cast(date_time as timestamp) as date_time,
        'America/New_York', -- timezone of origin
            when post_t_time_info like '%undefined%' then '0'
            when post_t_time_info is null then '0'
            when post_t_time_info = '' then '0'
            when cast(split_part(post_t_time_info,' ',4) as int) < 0
              then left(split_part(post_t_time_info,' ',4),4)
            else left(split_part(post_t_time_info,' ',4),3) end as int
        cast(date_time as timestamp)
    ) as date_time_local,
    post_t_time_info as local_timezone
from omniture_prod.eqx_web
where exclude_hit = '0'
and hit_source not in ('5','7','8','9')

and domain <> 'amazonaws.com'
and domain <> 'amazon.com'



Now we can perform queries from Amazon Redshift, blending our structured Salesforce data with our semi-structured, dynamic Adobe Analytics data. With these changes, our data became extremely flexible, friendly on storage size, and very performant when queried. Since then, we have started using Redshift Spectrum for many use cases from data quality checks, machine data, historical data archiving, and empowering our data analysts and scientists to more easily blend and onboard data.


with web_leads as (
  select transaction_id,last_touch_channel
  from edw_t.f_omniture_web
  where hive_date_key = '20170301'
      and post_event_list like '%201%'
      and transaction_id != '807f0cdc-80cf-42d3-8d75-e55e277a8718'
opp_lifecycle as (
  SELECT lifecycle,weblead_transactionid
  FROM edw_t.f_opportunity
  where weblead_transactionid is not null
  and created_date_key between '20170220'and '20170310'
  coalesce(opp_lifecycle.lifecycle, 'N/A') as Lifecycle
from web_leads
left join opp_lifecycle on web_leads.transaction_id = opp_lifecycle.weblead_transactionid


We were able to setup an efficient and flexible analytics platform for clickstream data by combining the Amazon S3 data lake with Amazon Redshift. This has eliminated the need to always load clickstream data into the data warehouse, and also made the platform adaptable to schema changes in the incoming data. You can read the Redshift documentation to get started with Redshift Spectrum, and also watch our presentation at the AWS Chicago Summit 2018 below.

Additional Reading

If you found this post helpful, be sure to check out From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum, and Narrativ is helping producers monetize their digital content with Amazon Redshift.


About the Author

Ryan Kelly is a data architect at Equinox, where he helps outline and implement frameworks for data initiatives. He also leads clickstream tracking which helps aid teams with insights on their digital initiatives. Ryan loves making it easier for people to reach and ingest their data for the purposes of business intelligence, analytics, and product/service enrichment. He also loves exploring and vetting new technologies to see how they can enhance what they do at Equinox.




How Annalect built an event log data analytics solution using Amazon Redshift

Post Syndicated from Eric Kamm original https://aws.amazon.com/blogs/big-data/how-annalect-built-an-event-log-data-analytics-solution-using-amazon-redshift/

Ingesting and analyzing event log data into a data warehouse in near real-time is challenging. Data ingest must be fast and efficient. The data warehouse must be able to scale quickly to handle incoming data volumes. Storing massive amounts of historical data in the data warehouse for infrequent access is prohibitively expensive. If you store infrequently accessed data outside the data warehouse, loading time is unacceptable.

At Annalect, we found a way to overcome these challenges. In this article, we discuss how we built a solution on AWS to manage, enhance, and analyze event log data from our ad technology partners. We use Amazon S3 for storage, Amazon EC2 and AWS Batch for compute, AWS Glue Data Catalog, and Amazon Redshift and Amazon Redshift Spectrum for analytics. This scalable, on-demand architecture has proven to be a high-performance solution for our analytics users and is very cost-effective.

Architecture overview

In mid-2016, we were using Amazon Redshift with local tables to provide SQL access to event-level ad tech datasets for analysts and modelers to improve ad buying strategies. We retained up to six months of historical data for select data sources; this represented about 230 billion events. By early 2017, the same data footprint grew to over 385 billion events, and we increased our Amazon Redshift capacity to keep pace. As we looked forward, we wanted to support more data, both in terms of history and data source coverage. At the same time, we wanted to maintain or improve query performance, ideally without increasing costs.

Because we were already using Amazon Redshift as our data warehouse, we looked at Redshift Spectrum. Redshift Spectrum is a feature of Amazon Redshift that queries data stored in Amazon S3. It uses the same SQL capabilities as Amazon Redshift, without the need to load the data into Amazon Redshift tables stored on disks within an Amazon Redshift cluster.

With Redshift Spectrum, a table’s storage is separate from a cluster’s compute resources, which enables each to scale independently of the other. We can keep large data tables in S3 instead of local Amazon Redshift storage, which reduces the size of our Amazon Redshift clusters and lowers costs. Queries that use external Redshift Spectrum tables use Redshift Spectrum compute resources (separate from the Amazon Redshift cluster) to scan the data in S3. Queries are billed by AWS at a metered rate based on the amount of data scanned. For query efficiency, we convert the data stored in S3 into Parquet files using AWS Batch.

The result is an efficient, mostly serverless analytics environment where the data is always available to query with Redshift Spectrum. Although we manage some always-on Amazon Redshift clusters, we think of this as “mostly serverless.” This is because the cluster footprint is very small relative to the serverless compute power we use through Redshift Spectrum for analysis and through AWS Batch for data preparation.

The Amazon Redshift clusters are dedicated to performing analytics for our analysts, rather than having a portion of their time dedicated to load jobs. This is because no time is spent copying data into local tables in the Amazon Redshift clusters and because we do the data preparation work outside of Amazon Redshift.

With the evolution to Redshift Spectrum and storing table data in S3, by early 2018 we increased the amount of data that we deliver to analysts by an order of magnitude. At the same time, we kept our costs at the same level as 2017. Using Redshift Spectrum is a great cost saving to us. We now have up to 36 months of historical data available for many more data sources, representing trillions of events.

Our data management infrastructure, the Enhanced Event-Level Data Environment (EELDE), is structured as shown following.

The EELDE provides four primary functions:

  • Data ingest
  • Data transformation
  • Compute-intensive enrichment
  • Analyst ad hoc query environment

These four functions are described following in more detail.

Data ingest

We architected our ingest system to ensure smooth and rapid downstream data availability. We took into account the inherent inconsistencies of vendor data feeds, both in the contents of the feeds themselves and in different vendors’ data push conventions. We also took into account the large volume of data files that data vendors might deliver at a given time. The diagram following shows our data ingest process.

We set up two storage areas in Amazon S3 to initially hold data from our vendors. The Landing Zone serves as a repository for long-term storage of historical, unaltered raw log files. Log files that are pulled by us directly from vendors are stored immediately in the Landing Zone. The Datafeeds storage serves as a transient place for log files that are pushed out by the vendors. We curate and arrange the data files delivered into the Datafeeds storage by vendors as we pull them into the Landing Zone.

The pull processes are executed in the EELDE’s Job Execution Layer, which is autoscaling and can handle many data files in parallel. This ensures that data processing and availability are not delayed by sudden spikes in data volume. Some spikes might occur from the need to backfill a large dataset from a new vendor. Others might occur when a vendor delivers a large volume of logs accumulated after a delay on their side.

Whether the vendor logs are pushed or pulled, we monitor the Landing Zone. We send alerts when we detect changes in file formats or gaps in data that might have an impact on downstream analysis. We found that this architecture and process gives us the control, flexibility, and agility that we need to respond rapidly to any volume of data and any changes in the data.

Data transformation

The EELDE can process, enhance, and analyze large volumes of event data in a compute-efficient and cost-efficient manner. An important part of this ability is the initial transformation of files in the Landing Zone. The EELDE’s Job Execution Layer uses AWS Batch to parallelize the compute-intensive data transformation from the original source data. This data is typically in comma-separated value (CSV) or JSON format. AWS Batch parallelizes it into a columnar, compressed storage format, Apache Parquet. In the diagrams, you can see the Parquet storage location labeled as the “Level 2 (columnar)” bucket.

For our Parquet transformation jobs, we have had a good experience running many containers with local-mode Apache Spark using AWS Batch. The goal for us is supporting a high throughput of file transformations where each file’s size is in the range of a few megabytes to a few gigabytes. A given Parquet transformation might take anywhere from a few seconds to 30 minutes.

The resource isolation and job management provided by AWS Batch makes running hundreds of independent data transformation jobs straightforward. At the same time, it removes some of the challenges of running many tasks simultaneously within Amazon EMR. In contrast, to transform larger datasets, the parallel compute per job that EMR can provide is attractive.

The diagram following shows data transformation and cataloging.

We partition the large Parquet-backed event data tables by date. Date partitioning is strategically implemented to optimize both query cost and query performance by providing an efficient path to the required data for a given analytics process. After the files are converted to Parquet, they are exposed as table partitions in the AWS Glue Data Catalog. Although the data storage format has been transformed, the data still retains all of its original fields, which is important for supporting ad hoc analyses.

Once the Parquet conversion and AWS Glue Data Catalog management processes have run, the data is accessible through SQL queries using Redshift Spectrum. Redshift Spectrum uses the AWS Glue Data Catalog to reference external tables stored in S3 so that you can query them through the Amazon Redshift cluster. Using a columnar format like Parquet increases the speed and cost efficiency of performing queries with Redshift Spectrum.

Compute-intensive enrichment

Before analytics work, the Job Execution Layer might also run other compute intensive data transformations to enhance the logs so that they can support more advanced types of analysis. Event-level log combination and event attribute augmentation are two examples.

Although these enrichment tasks might have different compute requirements, they can all use Redshift Spectrum to gather relevant datasets, which are then processed in the Job Execution Layer. The results are made available again through Redshift Spectrum. This enables complex enrichment of large datasets in acceptable timeframes and at a low cost.

Analyst ad hoc query environment

Our environment provides analysts with a data landscape and SQL query environment for mission-critical analytics. The data is organized within the EELDE into separate external schemas by ad tech vendor and system seat or account. Our agency-based analysts and data scientists are granted access to specific schemas according to project needs.

By having access to various datasets within a single platform, analysts can gain unique insights that are not possible by analyzing data in separate silos. They can create new datasets from a combination of source vendors for use during specific analytic projects while respecting data separation requirements. Doing this helps support use cases such as predictive modeling, channel planning, and multisource attribution.

Best practices to keep costs low

Along the way, we have developed some best practices for Redshift Spectrum:

  • Use compressed columnar format. We use Apache Parquet for our large event-level tables. Doing this reduces query run time, data scan costs, and storage costs. Our large tables typically have 40–200 columns, yet most of our queries use only 4–8 columns at a time. Using Parquet commonly lowers the scan cost by over 90 percent compared to noncolumnar alternatives.
  • Partition tables. We partition our large event level tables by date, and we train users to use the partition column in WHERE clauses. Doing this provides cost savings proportional to the portion of dates used by the queries. For example, suppose that we expose six months of data and a query looks at only one month. In this case, the cost for that query is reduced by five-sixths by using the appropriate WHERE clause.
  • Actively manage table partitions. Although we have up to 36 months of historical data available as Parquet files, by default we expose only the most recent 6 months of data as external table partitions. This practice reduces the cost of queries when users fail to filter on the partition column. On request, we temporarily add older date partitions to the necessary tables to support queries that need more than 6 months of data. In this way, we support longer timeframes when needed. At the same time, we usually maintain a lower cost limit by not exposing more data than is commonly necessary.
  • Move external tables into Amazon Redshift when necessary. When a large number of queries repeatedly hit the same external table, we often create a temporary local table in Amazon Redshift to query against. Depending on the number of queries and their scope, it can be helpful to load the data into a local Amazon Redshift table with the appropriate distribution key and sort keys. This practice helps eliminate the repeated Redshift Spectrum scan costs.
  • Use spectrum_scan_size_mb WLM rules. Most of our users and jobs operate with an effective Redshift Spectrum data scan limit of 1 terabyte by setting a spectrum_scan_size_mb WLM rule on the default WLM queue. For a single analysis or modeling query, this limit rarely needs to be exceeded.
  • Add clusters to increase concurrency when needed. Multiple Amazon Redshift clusters have access to the same external tables in S3. If we have a lot of users running queries at the same time against external tables, we might choose to add more clusters. Redshift Spectrum and the management tools we have created for our environment make this an easy task.
  • Use short-lived Amazon Redshift clusters. For some scheduled production jobs, we use short-lived Amazon Redshift clusters with the necessary data made available through external schemas. Redshift Spectrum allows us to avoid lengthy data load times, so it is practical to launch a Amazon Redshift cluster for a couple of hours of work and then terminate it every day. Doing this can save 80–90 percent compared to having an always-on cluster for a project. It also avoids contention with other query activity, because the short-lived cluster is dedicated to the project.


By establishing a data warehouse strategy using Amazon S3 for storage and Redshift Spectrum for analytics, we increased the size of the datasets we support by over an order of magnitude. In addition, we improved our ability to ingest large volumes of data quickly, and maintained fast performance without increasing our costs. Our analysts and modelers can now perform deeper analytics to improve ad buying strategies and results.


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and How I built a data warehouse using Amazon Redshift and AWS services in record time.


About the Author

Eric Kamm is a senior engineer and architect at Annalect. He has been creating and managing large-scale ETL workflows and providing data and compute environments for teams of analysts and modelers for over twenty years. He enjoys working with cloud technologies because they can accelerate project planning and implementation while increasing reliability and operational efficiency.

Store, Protect, Optimize Your Healthcare Data with AWS: Part 2

Post Syndicated from Stephen Jepsen original https://aws.amazon.com/blogs/architecture/store-protect-optimize-your-healthcare-data-with-aws-part-2/

Leveraging Analytics and Machine Learning Tools for Readmissions Prediction

This blog post was co-authored by Ujjwal Ratan, a senior AI/ML solutions architect on the global life sciences team.

In Part 1, we looked at various options to ingest and store sensitive healthcare data using AWS. The post described our shared responsibility model and provided a reference architecture that healthcare organizations could use as a foundation to build a robust platform on AWS to store and protect their sensitive data, including protected health information (PHI). In Part 2, we will dive deeper into how customers can optimize their healthcare datasets for analytics and machine learning (ML) to address clinical and operational challenges.

There are a number of factors creating pressures for healthcare organizations, both providers and payers, to adopt analytic tools to better understand their data: regulatory requirements, changing reimbursement models from volume- to value-based care, population health management for risk-bearing organizations, and movement toward personalized medicine. As organizations deploy new solutions to address these areas, the availability of large and complex datasets from electronic health records, genomics, images (for example, CAT, PET, MRI, ultrasound, X-ray), and IoT has been increasing. With these data assets growing in size, healthcare organizations want to leverage analytic and ML tools to derive new actionable insights across their departments.

One example of the use of ML in healthcare is diagnostic image analysis, including digital pathology. Pathology is extremely important in diagnosing and treating patients, but it is also extremely time-consuming and largely a manual process. While the complexity and quantity of workloads are increasing, the number of pathologists is decreasing. According to one study, the number of active pathologists could drop by 30 percent by 2030 compared to 2010 levels. (1) A cloud architecture and solution can automate part of the workflow, including sample management, analysis, storing, sharing, and comparison with previous samples to complement existing provider workflows effectively. A recent study using deep learning to analyze metastatic breast cancer tissue samples resulted in an approximately 85% reduction in human error rate. (2)

ML is also being used to assist radiologists in examining other diagnostic images such as X-rays, MRIs, and CAT scans. Having large quantities of images and metadata to train the algorithms that are the key to ML is one of the main challenges for ML adoption. To help address this problem, the National Institutes of Health recently released 90,000 X-ray plates tagged either with one of 14 diseases or tagged as being normal. Leading academic medical centers are using these images to build their neural networks and train their algorithms. With advanced analytics and ML, we can answer the hard questions such as “what is the next best action for my patient, the expected outcome, and the cost.”

The foundations for a great analytical layer

Let’s pick up from where we left off in Part 1. We have seen how providers can ingest data into AWS from their data centers and store it securely into different services depending on the type of data. For example:

  1. All object data is stored in Amazon S3, Amazon S3 Infrequent Access, or Amazon Glacier depending on how often they are used.
  2. Data from the provider’s database is either processed and stored as objects in Amazon S3 or aggregated into data marts on Amazon Redshift.
  3. Metadata of the objects on Amazon S3 are maintained in the DynamoDB database.
  4. Amazon Athena is used to query the objects directly stored on Amazon S3 to address ad hoc requirements.

We will now look at two best practices that are key to building a robust analytical layer using these datasets.

  1. Separating storage and compute: You should not be compelled to scale compute resources just to store more data. The scaling rules of the two layers should be separate.
  2. Leverage the vast array of AWS big data services when it comes to building the analytical platforms instead of concentrating on just a few of them. Remember, one size does not fit all.

Technical overview

In this overview, we will demonstrate how we can leverage AWS big data and ML services to build a scalable analytical layer for our healthcare data. We will use a single source of data stored in Amazon S3 for performing ad hoc analysis using Amazon Athena, integrate it with a data warehouse on Amazon Redshift, build a visual dashboard for some metrics using Amazon QuickSight, and finally build a ML model to predict readmissions using Amazon SageMaker. By not moving the data around and just connecting to it using different services, we avoid building redundant copies of the same data. There are multiple advantages to this approach:

  1. We optimize our storage. Not having redundant copies reduces the amount of storage required.
  2. We keep the data secure with only authorized services having access to it. Keeping multiple copies of the data can result in higher security risk.
  3. We are able to scale the storage and compute separately as needed.
  4. It becomes easier to manage the data and monitor usage metrics centrally such as how often the data has been accessed, who has been accessing it, and what has been the growth pattern of the data over a period of time. These metrics can be difficult to aggregate if the data is duplicated multiple times.

Let’s build out this architecture using the following steps:

  1. Create a database in AWS Glue Data Catalog

We will do this using a Glue crawler. First create a JSON file that contains the parameters for the Glue crawler.

"Name": "readmissions",
"Role": "arn of the role for Glue",
"DatabaseName": "readmissions",
"Description": "glue data catalog for storing readmission data",
"Targets": {
"S3Targets": [
"Path": "s3://<bucket>/<prefix>"
"Path": "s3://<bucket>/<prefix>"

As you can see, the crawler will crawl two locations in Amazon S3 and save the resulting tables in a new database called “readmissions.” Replace the role ARN and Amazon S3 locations with your corresponding details. Save this in a file create_crawler.json. Then from the AWS CLI, call the following command to create the crawler:

aws glue create-crawler --cli-input-json file://create_crawler.json

Once the crawler is created, run it by calling the following command:

aws glue start-crawler --name readmissions

Log on to the AWS Glue console, navigate to the crawlers, and wait until the crawler completes running.

This will create two tables — phi and non-phi — in a database named “readmissions” in the AWS Glue Data Catalog as shown below.

  1. Query the data using Athena

The Amazon Glue Data Catalog is seamlessly integrated with Amazon Athena. For details on how to enable this, see Integration with AWS Glue.

As a result of this integration, the tables created using the Glue crawler can now be queried using Amazon Athena. Amazon Athena allows you to do ad hoc analysis on the dataset. You can do exploratory analysis on the data and also determine its structure and quality. This type of upfront ad hoc analysis is invaluable for ensuring the data quality in your downstream data warehouse or your ML algorithms that will make use of this data for training models. In the next few sections, we will explore these aspects in greater detail.

To query the data using Amazon Athena, navigate to the Amazon Athena console.

NOTE: Make sure the region is the same as the region you chose in the previous step. If it’s not the same, switch the region by using the drop-down menu on the top right-hand corner of the screen.

Once you arrive in the Amazon Athena console, you should already see the tables and databases you created previously, and you should be able to see the data in the two tables by writing Amazon Athena queries. Here is a list of the top 10 rows from the table readmissions.nonphi:

Now that we are able to query the dataset, we can run some queries for exploratory analysis. Here are just a few examples:

Analysis Amazon Athena Query
How many Patients have been discharged to home? SELECT count(*) from nonphi where discharge_disposition = ‘Discharged to home’
What’s the minimum and the maximum number of procedures carried out on a patient? SELECT min(num_procedures), max(num_procedures) from nonphi
How many patients were referred to this hospital by another physician? SELECT count(*) FROM nonphi group by admission_source having admission_source = ‘Physician Referral’
What were the top 5 specialties with positive readmissions?

SELECT count(readmission_result) as num_readmissions, medical_specialty from

(select readmission_result,medical_specialty from nonphi where readmission_result = ‘Yes’)

group by medical_specialty order by num_readmissions desc limit 5

Which payer was responsible for paying for treatments that involved more than 5 procedures? SELECT distinct payer_code from nonphi where num_procedures >5 and payer_code !='(null)’

While this information is valuable, you typically do not want to invest too much time and effort into building an ad hoc query platform like this because at this stage, you are not even sure if the data is of any value for your business-critical analytical applications. One benefit of using Amazon Athena for ad hoc analysis is that it requires little effort or time. It uses Schema-On-Read instead of schema on write, allowing you to work with various source data formats without worrying about the underlying structures. You can put the data on Amazon S3 and start querying immediately.

  1. Create an external table in Amazon Redshift Spectrum with the same data

Now that we are satisfied with the data quality and understand the structure of the data, we would like to integrate this with a data warehouse. We’ll use Amazon Redshift Spectrum to create external tables on the files in S3 and then integrate these external tables with a physical table in Amazon Redshift.

Amazon Redshift Spectrum allows you to run Amazon Redshift SQL queries against data on Amazon S3, extending the capabilities of your data warehouse beyond the physical Amazon Redshift clusters. You don’t need to do any elaborate ETL or move the data around. The data exists in one place in Amazon S3 and you interface with it using different services (Athena and Redshift Spectrum) to satisfy different requirements.

Before beginning, please look at this step by step guide to set up Redshift Spectrum.

After you have set up Amazon Redshift Spectrum, you can begin executing the steps below:

  1. Create an external schema called “readmissions.” Amazon Redshift Spectrum integrates with the Amazon Glue Data Catalog and allows you to create spectrum tables by referring the catalog. This feature allows you to build the external table on the same data that you analyzed with Amazon Athena in the previous step without the need for ETL. This can be achieved by the following:
create external schema readmissions
from data catalog
database 'readmissions'
iam_role 'arn for your redshift spectrum role '
region ‘region when the S3 data exists’;

NOTE: Make sure you select the appropriate role arn and region.

  1. Once the command executes successfully, you can confirm the schema was created by running the following:
select * from svv_external_schemas;

You should see a row similar to the one above with your corresponding region and role.

You can also see the external tables that were created by running the following command:

select * from SVV_EXTERNAL_TABLES;

  1. Let’s confirm we can see all the rows in the external table by counting the number of rows:
select count(*) from readmissions.phi;
select count(*) from readmissions.nonphi;

You should see 101,766 rows in both the tables, confirming that your external tables contain all the records that you read using the AWS Glue data crawler and analyzed using Athena.

  1. Now that we have all the external tables created, let’s create an aggregate fact table in the physical Redshift data warehouse. We can use the “As Select” clause of the Redshift create table query to do this:
create table readmissions_aggregate_fact as
avg(time_in_hospital) as avg_time_in_hospital,
min(num_procedures) as min_procedures,
max(num_procedures) as max_procedures,
avg(num_procedures) as avg_num_procedures,
avg(num_medications) as avg_num_medications,
avg(number_outpatient) as avg_number_outpatient,
avg(number_emergency) as avg_number_emergency,
avg(number_inpatient) as avg_number_inpatient,
avg(number_diagnoses) as avg_number_diagnoses
from readmissions.nonphi
group by readmission_result,admission_type,discharge_disposition,diabetesmed

Once this query executes successfully, you can see a new table created in the physical public schema of your Amazon Redshift cluster. You can confirm this by executing the following query:

select distinct(tablename) from pg_table_def where schemaname = 'public'

  1. Build a QuickSight Dashboard from the aggregate fact

We can now create dashboards to visualize the data in our readmissions aggregate fact table using Amazon QuickSight. Here are some examples of reports you can generate using Amazon QuickSight on the readmission data.

For more details on Amazon QuickSight, refer to the service documentation.

  1. Build a ML model in Amazon SageMaker to predict readmissions

As a final step, we will create a ML model to predict the attribute readmission_result, which denotes if a patient was readmitted or not, using the non-PHI dataset.

  1. Create a notebook instance in Amazon SageMaker that is used to develop our code.
  2. The code reads non-PHI data from the Amazon S3 bucket as a data frame in Python. This is achieved using the pandas.readcsv function.

  1. Use the pandas.get_dummies function to encode categorical values into numeric values for use with the model.

  1. Split the data into two, 80% for training and 20% for testing, using the numpy.random.rand function.

  1. Form train_X, train_y and test_X, test_y corresponding to training features, training labels, testing features, and testing labels respectively.

  1. Use the Amazon SageMaker Linear learner algorithm to train our model. The implementation of the algorithm uses dense tensor format to optimize the training job. Use the function write_numpy_to_dense_tensor from the Amazon SageMaker library to convert the numpy array into the dense tensor format.

  1. Create the training job in Amazon SageMaker with appropriate configurations and run it.

  1. Once the training job completes, create an endpoint in Amazon SageMaker to host our model, using the linear.deploy function to deploy the endpoint.

  1. Finally, run a prediction by invoking the endpoint using the linear_predictor.predict function.

You can view the complete notebook here.

Data, analytics, and ML are strategic assets to help you manage your patients, staff, equipment, and supplies more efficiently. These technologies can also help you be more proactive in treating and preventing disease. Industry luminaries share this opinion: “By leveraging big data and scientific advancements while maintaining the important doctor-patient bond, we believe we can create a health system that will go beyond curing disease after the fact to preventing disease before it strikes by focusing on health and wellness,” writes Lloyd B. Minor, MD, dean of the Stanford School of Medicine.

ML and analytics offer huge value in helping achieve the quadruple aim : improved patient satisfaction, improved population health, improved provider satisfaction, and reduced costs. Technology should never replace the clinician but instead become an extension of the clinician and allow them to be more efficient by removing some of the mundane, repetitive tasks involved in prevention, diagnostics, and treatment of patients.

(1) “The Digital Future of Pathology.” The Medical Futurist, 28 May 2018, medicalfuturist.com/digital-future-pathology.

(2) Wang, Dayong, et al. “Deep Learning for Identifying Metastatic Breast Cancer.” Deep Learning for Identifying Metastatic Breast Cancer, 18 June 2016, arxiv.org/abs/1606.05718.

About the Author

Stephen Jepsen is a Global HCLS Practice Manager in AWS Professional Services.


Narrativ is helping producers monetize their digital content with Amazon Redshift

Post Syndicated from Colin Nichols original https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/

Narrativ, in their own words: Narrativ is building monetization technology for the next generation of digital content producers. Our product portfolio includes a real-time bidding platform and visual storytelling tools that together generate millions of dollars of advertiser value and billions of data points each month.

At Narrativ, we have seen massive growth in our platform usage with a similar order-of-magnitude increase in data generated by our products over the past 15 months. In this blog post, we share our evolution to a robust, scalable, performant, and cost-effective analytics environment with AWS. We also discuss the best practices that we learned along the way in data warehousing and data lake analytics.

Anticipating Narrativ’s continued growth acceleration, we began planning late last year for the next order of magnitude. We have been using Amazon Redshift as our data warehouse, which has served us very well. As our data continued to grow, we utilized Amazon S3 as a data lake and used external tables in Amazon Redshift Spectrum to query data directly in S3. We were excited that this allowed us to easily scale storage and compute resources separately to meet our needs without trading off against cost or complexity.

In the process, we created Spectrify, which simplifies working with Redshift Spectrum and encapsulates a number of best practices. Spectrify accomplishes three things in one easy command. First, it exports an Amazon Redshift table to S3 in comma-separated value (CSV) format. Second, it converts exported CSVs to Apache Parquet files in parallel. Third, it creates the external tables in the Amazon Redshift cluster. Queries can now span massive volumes of Parquet data in S3 with data in Amazon Redshift, and return results in just moments.


The preceding diagram shows how Spectrify simplifies querying across Parquet data in Amazon S3 and data in Amazon Redshift, returning results in just moments.

Amazon Redshift at Narrativ

Amazon Redshift has been the foundation of our data warehouse since Narrativ’s inception. We use it to produce traffic statistics, to feed the data-driven algorithms in our programmatic systems, and as a window to explore data insights. Amazon Redshift has scaled well to support the company’s needs. Our cluster has grown from 3 nodes to 36 nodes, and queries per hour have increased significantly without losing performance. As our workload moved from statistics and accounting toward data-driven insights, Amazon Redshift kept pace with increasing query complexity.

When planning the next iteration of our data strategy, our requirements were as follows:

  • Be maintainable by a small engineering team (mostly hands-off)
  • Be able to query all historic data at any time
  • Be able to query the previous three months of data very quickly
  • Maintain performance at 10x current data volume

We considered a few ideas to meet these requirements:

  • Upgrade to a DC2 cluster and a DS2 cluster
  • Upgrade to a very large DC2 cluster
  • Move to an open-source alternative (difficult to maintain)
  • Move to another big data provider (high barrier to entry, costly, risky)

Then we discovered Redshift Spectrum, which changes the game by separating compute costs from storage costs. With Redshift Spectrum, we can offload older, infrequently used data to S3 while maintaining a modestly sized, yet fast, Amazon Redshift cluster. Compared to other solutions, Redshift Spectrum reduces complexity by allowing us to join to the same set of dimension tables for all of our queries. And finally, it even preserves, if not improves, performance (depending on the query). All of our requirements were met with Redshift Spectrum, and it was easy, fast, and cost-effective to implement.

Moving forward, we view Redshift Spectrum as a critical tool to scaling our data capabilities on AWS to the next order of magnitude. Redshift Spectrum allows us to define Amazon Redshift tables backed by data stored in Amazon S3. In this way, Redshift Spectrum turns Amazon Redshift into a hybrid data warehouse/data lake. It provides the best of both worlds—unlimited storage separated from compute costs, plus the performance and converged workflow of Amazon Redshift itself.

Unlocking Redshift Spectrum

Using best practices can provide massive performance improvements in Redshift Spectrum. We shared our Spectrify implementation best practices in an open-source Python library in GitHub to help others reap the same rewards that we have. Following are further details on our top suggestions for optimizing Redshift Spectrum, which you will find incorporated into Spectrify.

Use Apache Parquet

Apache Parquet is a columnar file format. When you store your data in Parquet, Redshift Spectrum can scan only the column data relevant to your query. In contrast, row-oriented formats like CSV or JSON require a scan of every column, regardless of the query. For Narrativ’s workload, using Parquet led to large performance gains and drastically lower bills.

Columnar formats also enable more efficient data compression. Data compresses better when similar data is grouped together. In this case, we saw about an 80 percent reduction in gzipped Parquet files vs. gzipped CSVs. This reduction means data can be ingested, and thus scanned, much faster.

When we first investigated converting our data to Parquet, we found no options that fit our needs. We don’t use Spark or any other project from the Hadoop ecosystem, which is the most common toolset for creating Parquet files. A few Python projects had write support for Parquet. However, we had reservations about data consistency caveats, especially in relation to nullable columns. Additionally, we didn’t find a library that supported the correct timestamp format.

We built Spectrify to solve all these problems. We worked with the Apache Arrow project to ensure full compatibility between Redshift Spectrum and Parquet files generated by Spectrify.

Partition your data

Narrativ stores timestamped event data in Redshift Spectrum. We partition our data on event creation date, allowing Redshift Spectrum to skip inspecting most of our data files for many of our queries. Generally, our queries are concerned with a small subset of the time domain, such as a month or even just a day (for example, answering a specific question from Black Friday 2017). Queries are rarely concerned with to the entire years-long history.

This drastically reduces the amount of data scanned, and we see most queries complete in one-tenth the time. Also, our bill is much lower. Partitioning alone reduced our average cost per query by a factor of 20.


As we prepared for an order-of-magnitude increase in data volumes, we turned to a data lake strategy built on Amazon S3 using Amazon Redshift and Redshift Spectrum for analytics. Redshift Spectrum enabled us to query data across S3 and Amazon Redshift, maintain or improve query performance, and do it easily and cost-effectively. Along the way, we learned best practices and built Spectrify to streamline data preparation and analytics even further. We hope that sharing our code examples and what we learned will also help you achieve similar results with your growing data.

For more information, see our Spectrify Usage documentation and code samples, and our full example on GitHub.


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and How I built a data warehouse using Amazon Redshift and AWS services in record time.


About the Author

Colin Nichols leads big data initiatives at Narrativ. He enjoys working at the intersection of systems, and has a strong record of shaping engineering culture around the tenets of quality, automation, and continuous learning. In his spare time, Colin enjoys teaching his dog, MacGyver, new tricks.

Get sub-second query response times with Amazon Redshift result caching

Post Syndicated from Entong Shen original https://aws.amazon.com/blogs/big-data/get-sub-second-query-response-times-with-amazon-redshift-result-caching/

Customers tell us that their data warehouse and business intelligence users want extremely fast response times so that they can make equally fast decisions. They also tell us that their users often repeat the same queries over and over again, even when the data has not changed. Repeat queries consume compute resources each time they are executed, which slows down performance for all queries.

In this post, we take a look at query result caching in Amazon Redshift. Result caching does exactly what its name implies—it caches the results of a query. When the same query comes in against the same data, the prior results are retrieved from the cache and returned immediately, instead of rerunning the same query. Result caching reduces system use, making more resources available for other workloads. It delivers faster response times for users, improves throughput for all queries, and increases concurrency.

Result caching options

There are two main ways that you can implement data warehouse result caching. The first method is to save subsets of the data tables and cache query results outside the data warehouse. This method requires additional logic and memory outside the data warehouse. You must take great care to ensure that the cache is invalidated and a query is rerun when table data is modified.

The second method is to cache the results of a query inside the data warehouse and return the cached result for future repeat queries. This method delivers higher performance because it is faster to cache data and serve it from within the cluster.

Amazon Redshift uses the second method to cache query results within the cluster to achieve higher query throughput. Amazon Redshift result caching automatically responds to data and workload changes, transparently serving multiple BI applications and SQL tools. It is available by default for all Amazon Redshift customers for no additional charge.

Since Amazon Redshift introduced result caching, the feature has saved customers thousands of hours of execution time on a daily basis. “With Amazon Redshift result caching, 20 percent of our queries now complete in less than one second,” said Greg Rokita, Executive Director of Technology, Edmunds, at the AWS Summit in San Francisco. “Our cluster reliance on disk has decreased, and consequently the cluster is able to better serve the rest of our queries. Best of all, we didn’t have to change anything to get this speed-up with Redshift, which supports our mission-critical workloads.”

How does it work?

When a query is executed in Amazon Redshift, both the query and the results are cached in memory. As future queries come in, they are normalized and compared to the queries in the cache to determine whether there are repeat queries. In this comparison, Amazon Redshift also determines whether the underlying data has changed in any way. Any data modification language (DML) or data definition language (DDL) on a table or function invalidates only the cache entries that refer to it. Examples of such statements include INSERT, DELETE, UPDATE, COPY, and TRUNCATE. Amazon Redshift manages the cache memory to evict old entries, ensuring that optimal memory use is maintained for the cache itself.

Result caching is fully managed by Amazon Redshift, and it requires no changes in your application code. Amazon Redshift automatically selects the optimal configuration based on the specific condition of your cluster, and no tuning is required for you to get the most effective configuration.

When Amazon Redshift determines that a query is eligible to reuse prior query cached results, it bypasses query planning, the workload manager (WLM), and the query execution engine altogether. Cached result rows are returned to the client application immediately with sub-second performance. This method frees up cluster resources for ETL (extract, transform, and load) and other workloads that need the compute resources.

Design and usage

The following diagram illustrates the architecture of Amazon Redshift result caching.

The query result cache resides in the memory of the leader node and is shared across different user sessions to the same database. The feature is transparent, so it works by default without the need for user configurations. Result caching complies with Amazon Redshift multi-version concurrency control (MVCC). It acquires the proper locks on the table objects and manages the lifecycle of the cache entries when multiple user sessions read/write a table object at the same time. In addition, access control of the cached results is managed so that a user must have the required permission of the objects used in the query to retrieve result rows from the cache.

Which queries are served from the result cache?

Read-only queries are eligible for caching with some exceptions. Query results are not cached in the following circumstances:

  • When a query uses a function that must be evaluated each time it is run, such as getdate().
  • When a query refers to system tables or views.
  • When a query refers to external tables, that is, Amazon Redshift Spectrum tables.
  • When a query runs only on the leader node, or the result is too large.

Suppose that your query contains functions like current_date and you want to take advantage of the result cache. You can consider rewriting the query by materializing the value of current_date (for example, in your JDBC application), using the query text, and refreshing it as needed.

To determine which executed queries served results from the cache, a new column source_query has been added to system view SVL_QLOG to record the source query ID when a query is executed from the cache. You can use the following example query to find out which queries used cached results:

select userid, query, elapsed, source_query from svl_qlog where userid > 1
order by query desc;

userid | query  | elapsed  | source_query
   104 | 629035 |       27 |       628919
   104 | 629034 |       60 |       628900
   104 | 629033 |       23 |       628891
   102 | 629017 |  1229393 |             
   102 | 628942 |       28 |       628919
   102 | 628941 |       57 |       628900
   102 | 628940 |       26 |       628891
   100 | 628919 | 84295686 |             
   100 | 628900 | 87015637 |             
   100 | 628891 | 58808694 | 

For more information about result cache usage, see Result Caching in the Amazon Redshift documentation.


Amazon Redshift result caching helps ensure that no computing resources are wasted on repeat queries. It enables you to do more analytics in less time to support decision making and improve outcomes. In this post, we explained how Amazon Redshift result caching works and discussed the significant impact for Amazon Redshift customers. Result caching is enabled automatically, and we encourage you to see the difference it can make in your environment.

Additional Reading

If you found this post useful, be sure to check out Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required, Collect Data Statistics Up to 5x Faster by Analyzing Only Predicate Columns with Amazon Redshift and Amazon Redshift – 2017 Recap.


About the Authors

Entong Shen is a software engineer on the Amazon Redshift query processing team. He has been working on MPP databases for over 5 years and has focused on query optimization, statistics and SQL language features. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.



Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out and the taste of classic barbeque.



Maor Kleider is a Senior Product Manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.



Meng Tong is a Senior Software Engineer with Amazon Redshift Query Processing team. His work has been focused on query performance improvements across rewriter, optimizer and executor, Redshift Spectrum, and most recently Redshift Result Caching. His passion lies in discovering simple, elegant solutions for customer needs in big data systems. He is a big Rafael Nadal fan and enjoys watching and playing tennis in his spare time.



Naresh Chainani is a Senior Software Development Manager at Amazon Redshift where he leads the Query Processing team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.




Streaming Events from Amazon Pinpoint to Redshift

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/streaming-events-from-amazon-pinpoint-to-redshift/

Note: This post was originally written by Ryan Idrigo-Lam, one of the founding members of the Amazon Pinpoint team.

You can use Amazon Pinpoint to segment, target, and engage with your customers directly from the console. The Pinpoint console also includes a variety of dashboards that you can use to keep track of how your customers use your applications, and measure how likely your customers are to engage with the messages you send them.

Some Pinpoint customers, however, have use cases that require a bit more than what these dashboards have to offer. For example, some customers want to join their Pinpoint data to external data sets, or to collect historical data beyond the six month window that Pinpoint retains. To help customers meet these needs, and many more, Amazon Pinpoint includes a feature called Event Streams.

This article provides information about using Event Streams to export your data from Amazon Pinpoint and into a high-performance Amazon Redshift database. Once your data is in Redshift, you can run queries against it, join it with other data sets, use it as a data source for analytics and data visualization tools, and much more.

Step 1: Create a Redshift Cluster

The first step in this process involves creating a new Redshift cluster to store your data. You can complete this step in a few clicks by using the Amazon Redshift console. For more information, see Managing Clusters Using the Console in the Amazon Redshift Cluster Management Guide.

When you create the new cluster, make a note of the values you specify for the Cluster Identifier, Database Name, Master User Name, and Master User Password. You’ll use all of these values when you set up Amazon Kinesis Firehose in the next section.

Step 2: Create a Firehose Delivery Stream with a Redshift Destination

After you create your Redshift cluster, you can create the Amazon Kinesis Data Firehose delivery stream that will deliver your Pinpoint data to the Redshift cluster.

To create the Kinesis Data Firehose delivery stream

  1. Open the Amazon Kinesis Data Firehose console at https://console.aws.amazon.com/firehose/home.
  2. Choose Create delivery stream.
  3. For Delivery stream name, type a name.
  4. Under Choose source, for Source, choose Direct PUT or other sources. Choose Next.
  5. On the Process records page, do the following:
    1. Under Transform source records with AWS Lambda, choose Enabled if you want to use a Lambda function to transform the data before Firehose loads it into Redshift. Otherwise, choose Disabled.
    2. Under Convert record format, choose Disabled, and then choose Next.
  6. On the Choose destination page, do the following:
    1. For Destination, choose Amazon Redshift.
    2. Under Amazon Redshift destination, specify the Cluster name, User name, Password, and Database for the Redshift database you created earlier. Also specify a name for the Table.
    3. Under Intermediate S3 destination, choose an S3 bucket to store data in. Alternatively, choose Create new to create a new bucket. Choose Next.
  7. On the Configure settings page, do the following:
    1. Under IAM role, choose an IAM role that Firehose can use to access your S3 bucket and KMS key. Alternatively, you can have the Firehose console create a new role. Choose Next.
    2. On the Review page, confirm the settings you specified on the previous pages. If the settings are correct, choose Create delivery stream.

Step 3: Create a JSONPaths file

The next step in this process is to create a JSONPaths file and upload it to an Amazon S3 bucket. You use the JSONPaths file to tell Amazon Redshift how to interpret the unstructured JSON that Amazon Pinpoint provides.

To create a JSONPaths file and upload it to Amazon S3

  1. In a text editor, create a new file.
  2. Paste the following code into the text file:
      "jsonpaths": [

  3. Modify the preceding code example to include the fields that you want to import into Redshift.
    Note: You can specify custom attributes or metrics by replacing my_custom_attribute or my_custom_metric in the example above with your custom attributes or metrics, respectively.
  4. When you finish modifying the code example, remove all whitespace, including spaces and line breaks, from the file. Save the file as json-paths.json.
  5. Open the Amazon S3 console at https://s3.console.aws.amazon.com/s3/home.
  6. Choose the S3 bucket you created when you set up the Firehose stream. Upload json-paths.json into the bucket.

Step 4: Configure the table in Redshift

At this point, it’s time to finish setting up your Redshift database. In this section, you’ll create a table in the Redshift cluster you created earlier. The columns in this table mirror the values you specified in the JSONPaths file in the previous section.

  1. Connect to your Redshift cluster by using a database tool such as SQL Workbench/J. For more information about connecting to a cluster, see Connect to the Cluster in the Amazon Redshift Getting Started Guide.
  2. Create a new table that contains a column for each field in the JSONPaths file you created in the preceding section. You can use the following example as a template.
    CREATE schema AWSMA;
      event_type VARCHAR(256) NOT NULL ENCODE LZO,
      event_timestamp TIMESTAMP NOT NULL ENCODE LZO,
      arrival_timestamp TIMESTAMP NULL ENCODE LZO,
      event_version CHAR(12) NULL ENCODE LZO,
      application_app_id VARCHAR(64) NOT NULL ENCODE LZO,
      application_package_name VARCHAR(256) NULL ENCODE LZO,
      application_version_name VARCHAR(256) NULL ENCODE LZO,
      application_version_code VARCHAR(256) NULL ENCODE LZO,
      application_title VARCHAR(256) NULL ENCODE LZO,
      application_cognito_identity_pool_id VARCHAR(64) NULL ENCODE LZO,
      application_sdk_name VARCHAR(256) NULL ENCODE LZO,
      application_sdk_version VARCHAR(256) NULL ENCODE LZO,
      client_cognito_id VARCHAR(64) NULL ENCODE LZO,
      device_model VARCHAR(256) NULL ENCODE LZO,
      device_make VARCHAR(256) NULL ENCODE LZO,
      device_platform_name VARCHAR(256) NULL ENCODE LZO,
      device_platform_version VARCHAR(256) NULL ENCODE LZO,
      device_locale_code VARCHAR(256) NULL ENCODE LZO,
      device_locale_language VARCHAR(64) NULL ENCODE LZO,
      device_locale_country VARCHAR(64) NULL ENCODE LZO,
      session_id VARCHAR(64) NULL ENCODE LZO,
      session_start_timestamp TIMESTAMP NULL ENCODE LZO,
      session_stop_timestamp TIMESTAMP NULL ENCODE LZO,
      monetization_transaction_id VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_store VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_item_id VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_quantity FLOAT8 NULL,
      monetization_transaction_price_reported VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_price_amount FLOAT8 NULL,
      monetization_transaction_price_currency_code VARCHAR(16) NULL ENCODE LZO,
      monetization_transaction_price_currency_symbol VARCHAR(32) NULL ENCODE LZO,
      - Custom Attributes
      a_campaign_id VARCHAR(4000),
      a_campaign_activity_id VARCHAR(4000),
      a_my_custom_attribute VARCHAR(4000),
      - Custom Metrics
      m_my_custom_metric float8
    SORTKEY ( application_app_id, event_timestamp, event_type);

Step 5: Configure the Firehose Stream

You’re getting close! At this point, you’re ready to point the Kinesis Data Firehose stream to your JSONPaths file so that Redshift parses the incoming data properly. You also need to list the columns of the table that your data will be copied into.

To configure the Firehose Stream

  1. Open the Amazon Kinesis Data Firehose console at https://console.aws.amazon.com/firehose/home.
  2. In the list of delivery streams, choose the delivery stream you created earlier.
  3. On the Details tab, choose Edit.
  4. Under Amazon Redshift destination, for COPY options, paste the following:
    JSON 's3://s3-bucket/json-paths.json'
    TIMEFORMAT 'epochmillisecs'

  5. Replace s3-bucket in the preceding code example with the path to the S3 bucket that contains json-paths.json.
  6. For Columns, list all of the columns that are present in the JSONPaths file you created earlier. Specify the column names in the same order as they’re listed in the json-paths.json file, using commas to separate the column names. When you finish, choose Save.

Step 6: Enable Event Streams in Amazon Pinpoint

The only thing left to do now is to tell Amazon Pinpoint to start sending data to Amazon Kinesis.

To enable Event Streaming in Amazon Pinpoint

  1. Open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/home.
  2. Choose the application or project that you want to enable event streams for.
  3. In the navigation pane, choose Settings.
  4. On the Event stream tab, choose Enable streaming of events to Amazon Kinesis.
  5. Under Stream to Amazon Kinesis, select Send events to an Amazon Kinesis Firehose delivery stream.
  6. For Amazon Kinesis Firehose delivery stream, choose the stream you created earlier.
  7. For IAM role, choose an existing role that allows the firehose:PutRecordBatch action, or choose Automatically create a role to have Amazon Pinpoint create a role with the appropriate permissions. If you choose to have Amazon Pinpoint create a role for you, type a name for the role. Choose Save.

That’s it! Once you complete this final step, Amazon Pinpoint starts exporting the data you specified into your Redshift cluster.

I hope this walk through was helpful. If you have any questions, please let us know in the comments or in the Amazon Pinpoint forum.

New – Pay-per-Session Pricing for Amazon QuickSight, Another Region, and Lots More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-pay-per-session-pricing-for-amazon-quicksight-another-region-and-lots-more/

Amazon QuickSight is a fully managed cloud business intelligence system that gives you Fast & Easy to Use Business Analytics for Big Data. QuickSight makes business analytics available to organizations of all shapes and sizes, with the ability to access data that is stored in your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) relational databases, flat files in S3, and (via connectors) data stored in on-premises MySQL, PostgreSQL, and SQL Server databases. QuickSight scales to accommodate tens, hundreds, or thousands of users per organization.

Today we are launching a new, session-based pricing option for QuickSight, along with additional region support and other important new features. Let’s take a look at each one:

Pay-per-Session Pricing
Our customers are making great use of QuickSight and take full advantage of the power it gives them to connect to data sources, create reports, and and explore visualizations.

However, not everyone in an organization needs or wants such powerful authoring capabilities. Having access to curated data in dashboards and being able to interact with the data by drilling down, filtering, or slicing-and-dicing is more than adequate for their needs. Subscribing them to a monthly or annual plan can be seen as an unwarranted expense, so a lot of such casual users end up not having access to interactive data or BI.

In order to allow customers to provide all of their users with interactive dashboards and reports, the Enterprise Edition of Amazon QuickSight now allows Reader access to dashboards on a Pay-per-Session basis. QuickSight users are now classified as Admins, Authors, or Readers, with distinct capabilities and prices:

Authors have access to the full power of QuickSight; they can establish database connections, upload new data, create ad hoc visualizations, and publish dashboards, all for $9 per month (Standard Edition) or $18 per month (Enterprise Edition).

Readers can view dashboards, slice and dice data using drill downs, filters and on-screen controls, and download data in CSV format, all within the secure QuickSight environment. Readers pay $0.30 for 30 minutes of access, with a monthly maximum of $5 per reader.

Admins have all authoring capabilities, and can manage users and purchase SPICE capacity in the account. The QuickSight admin now has the ability to set the desired option (Author or Reader) when they invite members of their organization to use QuickSight. They can extend Reader invites to their entire user base without incurring any up-front or monthly costs, paying only for the actual usage.

To learn more, visit the QuickSight Pricing page.

A New Region
QuickSight is now available in the Asia Pacific (Tokyo) Region:

The UI is in English, with a localized version in the works.

Hourly Data Refresh
Enterprise Edition SPICE data sets can now be set to refresh as frequently as every hour. In the past, each data set could be refreshed up to 5 times a day. To learn more, read Refreshing Imported Data.

Access to Data in Private VPCs
This feature was launched in preview form late last year, and is now available in production form to users of the Enterprise Edition. As I noted at the time, you can use it to implement secure, private communication with data sources that do not have public connectivity, including on-premises data in Teradata or SQL Server, accessed over an AWS Direct Connect link. To learn more, read Working with AWS VPC.

Parameters with On-Screen Controls
QuickSight dashboards can now include parameters that are set using on-screen dropdown, text box, numeric slider or date picker controls. The default value for each parameter can be set based on the user name (QuickSight calls this a dynamic default). You could, for example, set an appropriate default based on each user’s office location, department, or sales territory. Here’s an example:

To learn more, read about Parameters in QuickSight.

URL Actions for Linked Dashboards
You can now connect your QuickSight dashboards to external applications by defining URL actions on visuals. The actions can include parameters, and become available in the Details menu for the visual. URL actions are defined like this:

You can use this feature to link QuickSight dashboards to third party applications (e.g. Salesforce) or to your own internal applications. Read Custom URL Actions to learn how to use this feature.

Dashboard Sharing
You can now share QuickSight dashboards across every user in an account.

Larger SPICE Tables
The per-data set limit for SPICE tables has been raised from 10 GB to 25 GB.

Upgrade to Enterprise Edition
The QuickSight administrator can now upgrade an account from Standard Edition to Enterprise Edition with a click. This enables provisioning of Readers with pay-per-session pricing, private VPC access, row-level security for dashboards and data sets, and hourly refresh of data sets. Enterprise Edition pricing applies after the upgrade.

Available Now
Everything I listed above is available now and you can start using it today!

You can try QuickSight for 60 days at no charge, and you can also attend our June 20th Webinar.



Amazon Neptune Generally Available

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-neptune-generally-available/

Amazon Neptune is now Generally Available in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. At the core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latencies. Neptune supports two popular graph models, Property Graph and RDF, through Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune can be used to power everything from recommendation engines and knowledge graphs to drug discovery and network security. Neptune is fully-managed with automatic minor version upgrades, backups, encryption, and fail-over. I wrote about Neptune in detail for AWS re:Invent last year and customers have been using the preview and providing great feedback that the team has used to prepare the service for GA.

Now that Amazon Neptune is generally available there are a few changes from the preview:

Launching an Amazon Neptune Cluster

Launching a Neptune cluster is as easy as navigating to the AWS Management Console and clicking create cluster. Of course you can also launch with CloudFormation, the CLI, or the SDKs.

You can monitor your cluster health and the health of individual instances through Amazon CloudWatch and the console.

Additional Resources

We’ve created two repos with some additional tools and examples here. You can expect continuous development on these repos as we add additional tools and examples.

  • Amazon Neptune Tools Repo
    This repo has a useful tool for converting GraphML files into Neptune compatible CSVs for bulk loading from S3.
  • Amazon Neptune Samples Repo
    This repo has a really cool example of building a collaborative filtering recommendation engine for video game preferences.

Purpose Built Databases

There’s an industry trend where we’re moving more and more onto purpose-built databases. Developers and businesses want to access their data in the format that makes the most sense for their applications. As cloud resources make transforming large datasets easier with tools like AWS Glue, we have a lot more options than we used to for accessing our data. With tools like Amazon Redshift, Amazon Athena, Amazon Aurora, Amazon DynamoDB, and more we get to choose the best database for the job or even enable entirely new use-cases. Amazon Neptune is perfect for workloads where the data is highly connected across data rich edges.

I’m really excited about graph databases and I see a huge number of applications. Looking for ideas of cool things to build? I’d love to build a web crawler in AWS Lambda that uses Neptune as the backing store. You could further enrich it by running Amazon Comprehend or Amazon Rekognition on the text and images found and creating a search engine on top of Neptune.

As always, feel free to reach out in the comments or on twitter to provide any feedback!


Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift

Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/

Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.

To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.

Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.

What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.

Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.

Solution overview

Here is how the solution is laid out:



The following sections walk you through each of these steps to set up the pipeline.

1. Define the schema

When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. The reason is that many times, incoming events contain all or some of the expected fields based on which values the producers are advertising. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.

To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.

The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.

CREATE EXTERNAL TABLE default.kfhconnectblog (
  awsaccountid string,
  agentarn string,
  currentagentsnapshot string,
  eventid string,
  eventtimestamp string,
  eventtype string,
  instancearn string,
  previousagentsnapshot string,
  version string
STORED AS parquet
LOCATION 's3://your_bucket/kfhconnectblog/'
TBLPROPERTIES ("parquet.compression"="SNAPPY")

That’s all you have to do to prepare the schema for Kinesis Data Firehose.

2. Define the data streams

Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events.  Open the Kinesis Data Streams console and create two streams.  You can configure them with only one shard each because you don’t have a lot of data right now.

3. Define the Kinesis Data Firehose delivery stream for Parquet

Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.

As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.

To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash “ / “ so that Data Firehose creates the date partitions inside that prefix.

On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.

Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.

Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.

Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.

4. Set up the Amazon Connect contact center

After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.

After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.

At this point, your pipeline is complete.  Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.

So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After it is created, Agent One can get to work and log into their console and set their availability to Available so that they are ready to receive calls.

To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.

5. Analyze the data with Athena

Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though in the documentation they were complex structures.  The reason for doing that was simply to show some of the flexibility of Athena to be able to parse JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.

Let’s run the first query to see which agents have logged into the system.

The query might look complex, but it’s fairly straightforward:

WITH dataset AS (
    from_iso8601_timestamp(eventtimestamp) AS event_ts,
      '$.agentstatus.name') AS current_status,
        '$.agentstatus.starttimestamp')) AS current_starttimestamp,
      '$.configuration.firstname') AS current_firstname,
      '$.configuration.lastname') AS current_lastname,
      '$.configuration.username') AS current_username,
      '$.configuration.routingprofile.defaultoutboundqueue.name') AS               current_outboundqueue,
      '$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
      '$.agentstatus.name') as prev_status,
       '$.agentstatus.starttimestamp')) as prev_starttimestamp,
      '$.configuration.firstname') as prev_firstname,
      '$.configuration.lastname') as prev_lastname,
      '$.configuration.username') as prev_username,
      '$.configuration.routingprofile.defaultoutboundqueue.name') as current_outboundqueue,
      '$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
  from kfhconnectblog
  where eventtype <> 'HEART_BEAT'
  current_status as status,
  current_username as username,
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC

The query output looks something like this:

Here is another query that shows the sessions each of the agents engaged with. It tells us where they were incoming or outgoing, if they were completed, and where there were missed or failed calls.

WITH src AS (
     json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
     cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
     cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
  from kfhconnectblog
src2 AS (
  FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
dataset AS (
  json_extract_scalar(c_item, '$.contactid') as c_contactid,
  json_extract_scalar(c_item, '$.channel') as c_channel,
  json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
  json_extract_scalar(c_item, '$.queue.name') as c_queue,
  json_extract_scalar(c_item, '$.state') as c_state,
  from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
  json_extract_scalar(p_item, '$.contactid') as p_contactid,
  json_extract_scalar(p_item, '$.channel') as p_channel,
  json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
  json_extract_scalar(p_item, '$.queue.name') as p_queue,
  json_extract_scalar(p_item, '$.state') as p_state,
  from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
  c_channel as channel,
  c_direction as direction,
  p_state as prev_state,
  c_state as current_state,
  c_ts as current_ts,
  c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC

The query output looks similar to the following:

6. Analyze the data with Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.

Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.

  json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
  json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
  json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
    'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue,
FROM default_schema.kfhconnectblog

The following shows the query output:


In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This great feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data. This feature is equally important if you are investing in building data lakes on AWS.


Additional Reading

If you found this post useful, be sure to check out Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight and Work with partitioned data in AWS Glue.

About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.




10 visualizations to try in Amazon QuickSight with sample data

Post Syndicated from Karthik Kumar Odapally original https://aws.amazon.com/blogs/big-data/10-visualizations-to-try-in-amazon-quicksight-with-sample-data/

If you’re not already familiar with building visualizations for quick access to business insights using Amazon QuickSight, consider this your introduction. In this post, we’ll walk through some common scenarios with sample datasets to provide an overview of how you can connect yuor data, perform advanced analysis and access the results from any web browser or mobile device.

The following visualizations are built from the public datasets available in the links below. Before we jump into that, let’s take a look at the supported data sources, file formats and a typical QuickSight workflow to build any visualization.

Which data sources does Amazon QuickSight support?

At the time of publication, you can use the following data methods:

  • Connect to AWS data sources, including:
    • Amazon RDS
    • Amazon Aurora
    • Amazon Redshift
    • Amazon Athena
    • Amazon S3
  • Upload Excel spreadsheets or flat files (CSV, TSV, CLF, and ELF)
  • Connect to on-premises databases like Teradata, SQL Server, MySQL, and PostgreSQL
  • Import data from SaaS applications like Salesforce and Snowflake
  • Use big data processing engines like Spark and Presto

This list is constantly growing. For more information, see Supported Data Sources.

Answers in instants

SPICE is the Amazon QuickSight super-fast, parallel, in-memory calculation engine, designed specifically for ad hoc data visualization. SPICE stores your data in a system architected for high availability, where it is saved until you choose to delete it. Improve the performance of database datasets by importing the data into SPICE instead of using a direct database query. To calculate how much SPICE capacity your dataset needs, see Managing SPICE Capacity.

Typical Amazon QuickSight workflow

When you create an analysis, the typical workflow is as follows:

  1. Connect to a data source, and then create a new dataset or choose an existing dataset.
  2. (Optional) If you created a new dataset, prepare the data (for example, by changing field names or data types).
  3. Create a new analysis.
  4. Add a visual to the analysis by choosing the fields to visualize. Choose a specific visual type, or use AutoGraph and let Amazon QuickSight choose the most appropriate visual type, based on the number and data types of the fields that you select.
  5. (Optional) Modify the visual to meet your requirements (for example, by adding a filter or changing the visual type).
  6. (Optional) Add more visuals to the analysis.
  7. (Optional) Add scenes to the default story to provide a narrative about some aspect of the analysis data.
  8. (Optional) Publish the analysis as a dashboard to share insights with other users.

The following graphic illustrates a typical Amazon QuickSight workflow.

Visualizations created in Amazon QuickSight with sample datasets

Visualizations for a data analyst

Source:  https://data.worldbank.org/

Download and Resources:  https://datacatalog.worldbank.org/dataset/world-development-indicators

Data catalog:  The World Bank invests into multiple development projects at the national, regional, and global levels. It’s a great source of information for data analysts.

The following graph shows the percentage of the population that has access to electricity (rural and urban) during 2000 in Asia, Africa, the Middle East, and Latin America.

The following graph shows the share of healthcare costs that are paid out-of-pocket (private vs. public). Also, you can maneuver over the graph to get detailed statistics at a glance.

Visualizations for a trading analyst

Source:  Deutsche Börse Public Dataset (DBG PDS)

Download and resources:  https://aws.amazon.com/public-datasets/deutsche-boerse-pds/

Data catalog:  The DBG PDS project makes real-time data derived from Deutsche Börse’s trading market systems available to the public for free. This is the first time that such detailed financial market data has been shared freely and continually from the source provider.

The following graph shows the market trend of max trade volume for different EU banks. It builds on the data available on XETRA engines, which is made up of a variety of equities, funds, and derivative securities. This graph can be scrolled to visualize trade for a period of an hour or more.

The following graph shows the common stock beating the rest of the maximum trade volume over a period of time, grouped by security type.

Visualizations for a data scientist

Source:  https://catalog.data.gov/

Download and resources:  https://catalog.data.gov/dataset/road-weather-information-stations-788f8

Data catalog:  Data derived from different sensor stations placed on the city bridges and surface streets are a core information source. The road weather information station has a temperature sensor that measures the temperature of the street surface. It also has a sensor that measures the ambient air temperature at the station each second.

The following graph shows the present max air temperature in Seattle from different RWI station sensors.

The following graph shows the minimum temperature of the road surface at different times, which helps predicts road conditions at a particular time of the year.

Visualizations for a data engineer

Source:  https://www.kaggle.com/

Download and resources:  https://www.kaggle.com/datasnaek/youtube-new/data

Data catalog:  Kaggle has come up with a platform where people can donate open datasets. Data engineers and other community members can have open access to these datasets and can contribute to the open data movement. They have more than 350 datasets in total, with more than 200 as featured datasets. It has a few interesting datasets on the platform that are not present at other places, and it’s a platform to connect with other data enthusiasts.

The following graph shows the trending YouTube videos and presents the max likes for the top 20 channels. This is one of the most popular datasets for data engineers.

The following graph shows the YouTube daily statistics for the max views of video titles published during a specific time period.

Visualizations for a business user

Source:  New York Taxi Data

Download and resources:  https://data.cityofnewyork.us/Transportation/2016-Green-Taxi-Trip-Data/hvrh-b6nb

Data catalog: NYC Open data hosts some very popular open data sets for all New Yorkers. This platform allows you to get involved in dive deep into the data set to pull some useful visualizations. 2016 Green taxi trip dataset includes trip records from all trips completed in green taxis in NYC in 2016. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

The following graph presents maximum fare amount grouped by the passenger count during a period of time during a day. This can be further expanded to follow through different day of the month based on the business need.

The following graph shows the NewYork taxi data from January 2016, showing the dip in the number of taxis ridden on January 23, 2016 across all types of taxis.

A quick search for that date and location shows you the following news report:


Using Amazon QuickSight, you can see patterns across a time-series data by building visualizations, performing ad hoc analysis, and quickly generating insights. We hope you’ll give it a try today!


Additional Reading

If you found this post useful, be sure to check out Amazon QuickSight Adds Support for Combo Charts and Row-Level Security and Visualize AWS Cloudtrail Logs Using AWS Glue and Amazon QuickSight.

Karthik Odapally is a Sr. Solutions Architect in AWS. His passion is to build cost effective and highly scalable solutions on the cloud. In his spare time, he bakes cookies and cupcakes for family and friends here in the PNW. He loves vintage racing cars.




Pranabesh Mandal is a Solutions Architect in AWS. He has over a decade of IT experience. He is passionate about cloud technology and focuses on Analytics. In his spare time, he likes to hike and explore the beautiful nature and wild life of most divine national parks around the United States alongside his wife.