Tag Archives: Amazon Redshift

How Annalect built an event log data analytics solution using Amazon Redshift

Post Syndicated from Eric Kamm original https://aws.amazon.com/blogs/big-data/how-annalect-built-an-event-log-data-analytics-solution-using-amazon-redshift/

Ingesting event log data into a data warehouse and analyzing it in near real time is challenging. Data ingest must be fast and efficient. The data warehouse must be able to scale quickly to handle incoming data volumes. Storing massive amounts of historical data in the data warehouse for infrequent access is prohibitively expensive, yet if you store that infrequently accessed data outside the data warehouse, loading it back takes unacceptably long.

At Annalect, we found a way to overcome these challenges. In this article, we discuss how we built a solution on AWS to manage, enhance, and analyze event log data from our ad technology partners. We use Amazon S3 for storage, Amazon EC2 and AWS Batch for compute, AWS Glue Data Catalog, and Amazon Redshift and Amazon Redshift Spectrum for analytics. This scalable, on-demand architecture has proven to be a high-performance solution for our analytics users and is very cost-effective.

Architecture overview

In mid-2016, we were using Amazon Redshift with local tables to provide SQL access to event-level ad tech datasets for analysts and modelers to improve ad buying strategies. We retained up to six months of historical data for select data sources; this represented about 230 billion events. By early 2017, the same data footprint grew to over 385 billion events, and we increased our Amazon Redshift capacity to keep pace. As we looked forward, we wanted to support more data, both in terms of history and data source coverage. At the same time, we wanted to maintain or improve query performance, ideally without increasing costs.

Because we were already using Amazon Redshift as our data warehouse, we looked at Redshift Spectrum. Redshift Spectrum is a feature of Amazon Redshift that queries data stored in Amazon S3. It uses the same SQL capabilities as Amazon Redshift, without the need to load the data into Amazon Redshift tables stored on disks within an Amazon Redshift cluster.

With Redshift Spectrum, a table’s storage is separate from a cluster’s compute resources, which enables each to scale independently of the other. We can keep large data tables in S3 instead of local Amazon Redshift storage, which reduces the size of our Amazon Redshift clusters and lowers costs. Queries that use external Redshift Spectrum tables use Redshift Spectrum compute resources (separate from the Amazon Redshift cluster) to scan the data in S3. Queries are billed by AWS at a metered rate based on the amount of data scanned. For query efficiency, we convert the data stored in S3 into Parquet files using AWS Batch.

The result is an efficient, mostly serverless analytics environment where the data is always available to query with Redshift Spectrum. Although we manage some always-on Amazon Redshift clusters, we think of this as “mostly serverless.” This is because the cluster footprint is very small relative to the serverless compute power we use through Redshift Spectrum for analysis and through AWS Batch for data preparation.

The Amazon Redshift clusters are dedicated to performing analytics for our analysts, rather than spending a portion of their time on load jobs. This is possible because we do the data preparation work outside of Amazon Redshift, so no cluster time is spent copying data into local tables.

With the evolution to Redshift Spectrum and storing table data in S3, by early 2018 we had increased the amount of data that we deliver to analysts by an order of magnitude while keeping our costs at 2017 levels. Redshift Spectrum has been a significant cost saving for us. We now have up to 36 months of historical data available for many more data sources, representing trillions of events.

Our data management infrastructure, the Enhanced Event-Level Data Environment (EELDE), is structured as shown following.

The EELDE provides four primary functions:

  • Data ingest
  • Data transformation
  • Compute-intensive enrichment
  • Analyst ad hoc query environment

These four functions are described following in more detail.

Data ingest

We architected our ingest system to ensure smooth and rapid downstream data availability. We took into account the inherent inconsistencies of vendor data feeds, both in the contents of the feeds themselves and in different vendors’ data push conventions. We also took into account the large volume of data files that data vendors might deliver at a given time. The diagram following shows our data ingest process.

We set up two storage areas in Amazon S3 to initially hold data from our vendors. The Landing Zone serves as a repository for long-term storage of historical, unaltered raw log files. Log files that are pulled by us directly from vendors are stored immediately in the Landing Zone. The Datafeeds storage serves as a transient place for log files that are pushed out by the vendors. We curate and arrange the data files delivered into the Datafeeds storage by vendors as we pull them into the Landing Zone.

The pull processes are executed in the EELDE’s Job Execution Layer, which is autoscaling and can handle many data files in parallel. This ensures that data processing and availability are not delayed by sudden spikes in data volume. Some spikes might occur from the need to backfill a large dataset from a new vendor. Others might occur when a vendor delivers a large volume of logs accumulated after a delay on their side.

Whether the vendor logs are pushed or pulled, we monitor the Landing Zone. We send alerts when we detect changes in file formats or gaps in data that might have an impact on downstream analysis. We found that this architecture and process gives us the control, flexibility, and agility that we need to respond rapidly to any volume of data and any changes in the data.

Data transformation

The EELDE can process, enhance, and analyze large volumes of event data in a compute-efficient and cost-efficient manner. An important part of this ability is the initial transformation of files in the Landing Zone. The EELDE’s Job Execution Layer uses AWS Batch to parallelize the compute-intensive transformation of the original source data, which is typically in comma-separated value (CSV) or JSON format, into a compressed, columnar storage format, Apache Parquet. In the diagrams, you can see the Parquet storage location labeled as the “Level 2 (columnar)” bucket.

For our Parquet transformation jobs, we have had a good experience running many containers with local-mode Apache Spark using AWS Batch. The goal for us is supporting a high throughput of file transformations where each file’s size is in the range of a few megabytes to a few gigabytes. A given Parquet transformation might take anywhere from a few seconds to 30 minutes.
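As a rough illustration of one such job, here is a minimal sketch of a single CSV-to-Parquet conversion using local-mode Spark, as it might run inside one AWS Batch container. The bucket names and paths are placeholders, and the sketch assumes the container image is configured for S3 access through the s3a filesystem (hadoop-aws).

# Minimal sketch of one CSV-to-Parquet transformation job (hypothetical paths).
from pyspark.sql import SparkSession

def convert_to_parquet(input_path: str, output_path: str) -> None:
    spark = (
        SparkSession.builder
        .master("local[*]")          # use all vCPUs of the Batch container
        .appName("csv-to-parquet")
        .getOrCreate()
    )
    # Read the raw vendor log file; this sketch assumes a CSV with a header row.
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    # Write a compressed, columnar copy for Redshift Spectrum to scan.
    df.write.mode("overwrite").parquet(output_path, compression="snappy")
    spark.stop()

if __name__ == "__main__":
    convert_to_parquet(
        "s3a://landing-zone-bucket/vendor/2018/05/01/events.csv",        # hypothetical
        "s3a://level-2-columnar-bucket/vendor_events/date=2018-05-01/",  # hypothetical
    )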

The resource isolation and job management provided by AWS Batch make running hundreds of independent data transformation jobs straightforward, while avoiding some of the challenges of running many tasks simultaneously within Amazon EMR. For transforming larger datasets, however, the parallel compute per job that EMR provides is attractive.

The diagram following shows data transformation and cataloging.

We partition the large Parquet-backed event data tables by date. Partitioning by date optimizes both query cost and query performance by giving each analytics process an efficient path to only the data it needs. After the files are converted to Parquet, they are exposed as table partitions in the AWS Glue Data Catalog. Although the storage format has changed, the data retains all of its original fields, which is important for supporting ad hoc analyses.
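To illustrate that cataloging step, the sketch below registers one day of Parquet files as a new partition through the AWS Glue API. The database, table, and S3 location are hypothetical, and it assumes the external table itself already exists in the Data Catalog.

# Sketch: add a date partition to an existing Glue Data Catalog table (hypothetical names).
import boto3

glue = boto3.client("glue", region_name="us-east-1")

def add_date_partition(database: str, table: str, day: str, location: str) -> None:
    # Reuse the table's storage descriptor so the partition inherits the
    # Parquet SerDe and column definitions already defined on the table.
    table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
    sd = table_def["StorageDescriptor"].copy()
    sd["Location"] = location
    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput={"Values": [day], "StorageDescriptor": sd},
    )

add_date_partition(
    database="eelde",            # hypothetical
    table="vendor_events",       # hypothetical
    day="2018-05-01",
    location="s3://level-2-columnar-bucket/vendor_events/date=2018-05-01/",  # hypothetical
)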

Once the Parquet conversion and AWS Glue Data Catalog management processes have run, the data is accessible through SQL queries using Redshift Spectrum. Redshift Spectrum uses the AWS Glue Data Catalog to reference external tables stored in S3 so that you can query them through the Amazon Redshift cluster. Using a columnar format like Parquet increases the speed and cost efficiency of performing queries with Redshift Spectrum.

Compute-intensive enrichment

Before analytics work, the Job Execution Layer might also run other compute-intensive data transformations to enhance the logs so that they can support more advanced types of analysis. Event-level log combination and event attribute augmentation are two examples.

Although these enrichment tasks might have different compute requirements, they can all use Redshift Spectrum to gather relevant datasets, which are then processed in the Job Execution Layer. The results are made available again through Redshift Spectrum. This enables complex enrichment of large datasets in acceptable timeframes and at a low cost.

Analyst ad hoc query environment

Our environment provides analysts with a data landscape and SQL query environment for mission-critical analytics. The data is organized within the EELDE into separate external schemas by ad tech vendor and system seat or account. Our agency-based analysts and data scientists are granted access to specific schemas according to project needs.

By having access to various datasets within a single platform, analysts can gain unique insights that are not possible by analyzing data in separate silos. They can create new datasets from a combination of source vendors for use during specific analytic projects while respecting data separation requirements. Doing this helps support use cases such as predictive modeling, channel planning, and multisource attribution.

Best practices to keep costs low

Along the way, we have developed some best practices for Redshift Spectrum:

  • Use compressed columnar format. We use Apache Parquet for our large event-level tables. Doing this reduces query run time, data scan costs, and storage costs. Our large tables typically have 40–200 columns, yet most of our queries use only 4–8 columns at a time. Using Parquet commonly lowers the scan cost by over 90 percent compared to noncolumnar alternatives.
  • Partition tables. We partition our large event-level tables by date, and we train users to use the partition column in WHERE clauses. Doing this provides cost savings proportional to the share of dates that a query actually touches. For example, suppose that we expose six months of data and a query looks at only one month. In this case, the cost for that query is reduced by five-sixths by using the appropriate WHERE clause.
  • Actively manage table partitions. Although we have up to 36 months of historical data available as Parquet files, by default we expose only the most recent 6 months of data as external table partitions. This practice reduces the cost of queries when users fail to filter on the partition column. On request, we temporarily add older date partitions to the necessary tables to support queries that need more than 6 months of data. In this way, we support longer timeframes when needed. At the same time, we usually maintain a lower cost limit by not exposing more data than is commonly necessary.
  • Move external tables into Amazon Redshift when necessary. When a large number of queries repeatedly hit the same external table, we often create a temporary local table in Amazon Redshift to query against. Depending on the number of queries and their scope, it can be helpful to load the data into a local Amazon Redshift table with the appropriate distribution key and sort keys. This practice helps eliminate the repeated Redshift Spectrum scan costs.
  • Use spectrum_scan_size_mb WLM rules. Most of our users and jobs operate with an effective Redshift Spectrum data scan limit of 1 terabyte by setting a spectrum_scan_size_mb WLM rule on the default WLM queue. For a single analysis or modeling query, this limit rarely needs to be exceeded.
  • Add clusters to increase concurrency when needed. Multiple Amazon Redshift clusters have access to the same external tables in S3. If we have a lot of users running queries at the same time against external tables, we might choose to add more clusters. Redshift Spectrum and the management tools we have created for our environment make this an easy task.
  • Use short-lived Amazon Redshift clusters. For some scheduled production jobs, we use short-lived Amazon Redshift clusters with the necessary data made available through external schemas. Redshift Spectrum allows us to avoid lengthy data load times, so it is practical to launch an Amazon Redshift cluster for a couple of hours of work and then terminate it every day (a minimal sketch of this pattern follows this list). Doing this can save 80–90 percent compared to having an always-on cluster for a project. It also avoids contention with other query activity, because the short-lived cluster is dedicated to the project.
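The following boto3 sketch illustrates the short-lived cluster pattern for a hypothetical nightly job: launch a small cluster, wait until it is available, run the job, and delete the cluster afterward. The identifiers, node type, IAM role, and credentials are all placeholders.

# Sketch: launch and later terminate a short-lived Amazon Redshift cluster (hypothetical values).
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

def launch_job_cluster() -> None:
    redshift.create_cluster(
        ClusterIdentifier="nightly-reporting",     # hypothetical
        NodeType="dc2.large",
        NumberOfNodes=4,
        DBName="analytics",
        MasterUsername="admin",
        MasterUserPassword="REPLACE_ME",           # store in AWS Secrets Manager in practice
        IamRoles=["arn:aws:iam::123456789012:role/redshift-spectrum-role"],  # hypothetical
    )
    # Block until the cluster is reachable before running queries against it.
    redshift.get_waiter("cluster_available").wait(ClusterIdentifier="nightly-reporting")

def terminate_job_cluster() -> None:
    redshift.delete_cluster(
        ClusterIdentifier="nightly-reporting",
        SkipFinalClusterSnapshot=True,             # table data lives in S3, not the cluster
    )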

Summary

By establishing a data warehouse strategy using Amazon S3 for storage and Redshift Spectrum for analytics, we increased the size of the datasets we support by over an order of magnitude. In addition, we improved our ability to ingest large volumes of data quickly, and maintained fast performance without increasing our costs. Our analysts and modelers can now perform deeper analytics to improve ad buying strategies and results.

 


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and How I built a data warehouse using Amazon Redshift and AWS services in record time.

 


About the Author

Eric Kamm is a senior engineer and architect at Annalect. He has been creating and managing large-scale ETL workflows and providing data and compute environments for teams of analysts and modelers for over twenty years. He enjoys working with cloud technologies because they can accelerate project planning and implementation while increasing reliability and operational efficiency.

Store, Protect, Optimize Your Healthcare Data with AWS: Part 2

Post Syndicated from Stephen Jepsen original https://aws.amazon.com/blogs/architecture/store-protect-optimize-your-healthcare-data-with-aws-part-2/

Leveraging Analytics and Machine Learning Tools for Readmissions Prediction

This blog post was co-authored by Ujjwal Ratan, a senior AI/ML solutions architect on the global life sciences team.

In Part 1, we looked at various options to ingest and store sensitive healthcare data using AWS. The post described our shared responsibility model and provided a reference architecture that healthcare organizations could use as a foundation to build a robust platform on AWS to store and protect their sensitive data, including protected health information (PHI). In Part 2, we will dive deeper into how customers can optimize their healthcare datasets for analytics and machine learning (ML) to address clinical and operational challenges.

There are a number of factors creating pressures for healthcare organizations, both providers and payers, to adopt analytic tools to better understand their data: regulatory requirements, changing reimbursement models from volume- to value-based care, population health management for risk-bearing organizations, and movement toward personalized medicine. As organizations deploy new solutions to address these areas, the availability of large and complex datasets from electronic health records, genomics, images (for example, CAT, PET, MRI, ultrasound, X-ray), and IoT has been increasing. With these data assets growing in size, healthcare organizations want to leverage analytic and ML tools to derive new actionable insights across their departments.

One example of the use of ML in healthcare is diagnostic image analysis, including digital pathology. Pathology is extremely important in diagnosing and treating patients, but it is also extremely time-consuming and largely a manual process. While the complexity and quantity of workloads are increasing, the number of pathologists is decreasing. According to one study, the number of active pathologists could drop by 30 percent by 2030 compared to 2010 levels. (1) A cloud architecture and solution can automate parts of the workflow, including sample management, analysis, storage, sharing, and comparison with previous samples, effectively complementing existing provider workflows. A recent study using deep learning to analyze metastatic breast cancer tissue samples resulted in an approximately 85% reduction in the human error rate. (2)

ML is also being used to assist radiologists in examining other diagnostic images such as X-rays, MRIs, and CAT scans. Having large quantities of images and metadata to train the algorithms that are the key to ML is one of the main challenges for ML adoption. To help address this problem, the National Institutes of Health recently released 90,000 X-ray plates tagged either with one of 14 diseases or tagged as being normal. Leading academic medical centers are using these images to build their neural networks and train their algorithms. With advanced analytics and ML, we can answer the hard questions such as “what is the next best action for my patient, the expected outcome, and the cost.”

The foundations for a great analytical layer

Let’s pick up from where we left off in Part 1. We have seen how providers can ingest data into AWS from their data centers and store it securely into different services depending on the type of data. For example:

  1. All object data is stored in Amazon S3, Amazon S3 Infrequent Access, or Amazon Glacier, depending on how often it is accessed.
  2. Data from the provider’s database is either processed and stored as objects in Amazon S3 or aggregated into data marts on Amazon Redshift.
  3. Metadata for the objects in Amazon S3 is maintained in Amazon DynamoDB.
  4. Amazon Athena is used to query the objects directly stored on Amazon S3 to address ad hoc requirements.

We will now look at two best practices that are key to building a robust analytical layer using these datasets.

  1. Separate storage and compute: You should not be compelled to scale compute resources just to store more data. The scaling rules of the two layers should be independent.
  2. Leverage the broad range of AWS big data services when building analytical platforms instead of concentrating on just a few of them. Remember, one size does not fit all.

Technical overview

In this overview, we will demonstrate how we can leverage AWS big data and ML services to build a scalable analytical layer for our healthcare data. We will use a single source of data stored in Amazon S3 for performing ad hoc analysis using Amazon Athena, integrate it with a data warehouse on Amazon Redshift, build a visual dashboard for some metrics using Amazon QuickSight, and finally build a ML model to predict readmissions using Amazon SageMaker. By not moving the data around and just connecting to it using different services, we avoid building redundant copies of the same data. There are multiple advantages to this approach:

  1. We optimize our storage. Not having redundant copies reduces the amount of storage required.
  2. We keep the data secure with only authorized services having access to it. Keeping multiple copies of the data can result in higher security risk.
  3. We are able to scale the storage and compute separately as needed.
  4. It becomes easier to manage the data and monitor usage metrics centrally, such as how often the data has been accessed, who has been accessing it, and what the growth pattern of the data has been over time. These metrics can be difficult to aggregate if the data is duplicated multiple times.

Let’s build out this architecture using the following steps:

  1. Create a database in AWS Glue Data Catalog

We will do this using a Glue crawler. First create a JSON file that contains the parameters for the Glue crawler.

{
  "Name": "readmissions",
  "Role": "arn of the role for Glue",
  "DatabaseName": "readmissions",
  "Description": "glue data catalog for storing readmission data",
  "Targets": {
    "S3Targets": [
      {
        "Path": "s3://<bucket>/<prefix>"
      },
      {
        "Path": "s3://<bucket>/<prefix>"
      }
    ]
  }
}

As you can see, the crawler will crawl two locations in Amazon S3 and save the resulting tables in a new database called “readmissions.” Replace the role ARN and Amazon S3 locations with your corresponding details. Save this in a file create_crawler.json. Then from the AWS CLI, call the following command to create the crawler:

aws glue create-crawler --cli-input-json file://create_crawler.json

Once the crawler is created, run it by calling the following command:

aws glue start-crawler --name readmissions

Log on to the AWS Glue console, navigate to the crawlers, and wait until the crawler completes running.

This will create two tables — phi and non-phi — in a database named “readmissions” in the AWS Glue Data Catalog as shown below.

  2. Query the data using Athena

The AWS Glue Data Catalog is seamlessly integrated with Amazon Athena. For details on how to enable this, see Integration with AWS Glue.

As a result of this integration, the tables created using the Glue crawler can now be queried using Amazon Athena. Amazon Athena allows you to do ad hoc analysis on the dataset. You can do exploratory analysis on the data and also determine its structure and quality. This type of upfront ad hoc analysis is invaluable for ensuring data quality in your downstream data warehouse and in the ML algorithms that will use this data for training models. In the next few sections, we will explore these aspects in greater detail.

To query the data using Amazon Athena, navigate to the Amazon Athena console.

NOTE: Make sure the region is the same as the region you chose in the previous step. If it’s not the same, switch the region by using the drop-down menu on the top right-hand corner of the screen.

Once you arrive in the Amazon Athena console, you should already see the tables and databases you created previously, and you should be able to see the data in the two tables by writing Amazon Athena queries. Here is a list of the top 10 rows from the table readmissions.nonphi:

Now that we are able to query the dataset, we can run some queries for exploratory analysis. Here are just a few examples:

  • How many patients have been discharged to home?
    SELECT count(*) FROM nonphi WHERE discharge_disposition = 'Discharged to home'

  • What's the minimum and the maximum number of procedures carried out on a patient?
    SELECT min(num_procedures), max(num_procedures) FROM nonphi

  • How many patients were referred to this hospital by another physician?
    SELECT count(*) FROM nonphi GROUP BY admission_source HAVING admission_source = 'Physician Referral'

  • What were the top 5 specialties with positive readmissions?
    SELECT count(readmission_result) AS num_readmissions, medical_specialty
    FROM (SELECT readmission_result, medical_specialty FROM nonphi WHERE readmission_result = 'Yes')
    GROUP BY medical_specialty ORDER BY num_readmissions DESC LIMIT 5

  • Which payer was responsible for paying for treatments that involved more than 5 procedures?
    SELECT DISTINCT payer_code FROM nonphi WHERE num_procedures > 5 AND payer_code != '(null)'

While this information is valuable, you typically do not want to invest too much time and effort into building an ad hoc query platform like this, because at this stage you are not even sure whether the data has value for your business-critical analytical applications. One benefit of using Amazon Athena for ad hoc analysis is that it requires little effort or time. It uses schema-on-read instead of schema-on-write, allowing you to work with various source data formats without worrying about the underlying structures. You can put the data on Amazon S3 and start querying immediately.
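For repeatable exploration, the same queries can also be submitted through the Athena API rather than the console. The boto3 sketch below is illustrative only; the result output location is a placeholder.

# Sketch: run one of the exploratory queries above through the Athena API.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql: str) -> list:
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "readmissions"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # hypothetical
    )["QueryExecutionId"]
    # Poll until the query finishes (fine for ad hoc use).
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {qid} ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

rows = run_query("SELECT count(*) FROM nonphi WHERE discharge_disposition = 'Discharged to home'")
print(rows)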

  3. Create an external table in Amazon Redshift Spectrum with the same data

Now that we are satisfied with the data quality and understand the structure of the data, we would like to integrate this with a data warehouse. We’ll use Amazon Redshift Spectrum to create external tables on the files in S3 and then integrate these external tables with a physical table in Amazon Redshift.

Amazon Redshift Spectrum allows you to run Amazon Redshift SQL queries against data on Amazon S3, extending the capabilities of your data warehouse beyond the physical Amazon Redshift clusters. You don’t need to do any elaborate ETL or move the data around. The data exists in one place in Amazon S3 and you interface with it using different services (Athena and Redshift Spectrum) to satisfy different requirements.

Before beginning, please look at this step-by-step guide to set up Redshift Spectrum.

After you have set up Amazon Redshift Spectrum, you can begin executing the steps below:

  1. Create an external schema called “readmissions.” Amazon Redshift Spectrum integrates with the AWS Glue Data Catalog and allows you to create Spectrum tables by referring to the catalog. This feature allows you to build the external table on the same data that you analyzed with Amazon Athena in the previous step without the need for ETL. This can be achieved by the following:
create external schema readmissions
from data catalog
database 'readmissions'
iam_role 'arn for your redshift spectrum role'
region 'region where the S3 data exists';

NOTE: Make sure you select the appropriate role ARN and region.

  2. Once the command executes successfully, you can confirm the schema was created by running the following:
select * from svv_external_schemas;

You should see a row similar to the one above with your corresponding region and role.

You can also see the external tables that were created by running the following command:

select * from SVV_EXTERNAL_TABLES;

  3. Let's confirm we can see all the rows in the external table by counting the number of rows:
select count(*) from readmissions.phi;
select count(*) from readmissions.nonphi;

You should see 101,766 rows in both tables, confirming that your external tables contain all the records that you cataloged using the AWS Glue crawler and analyzed using Athena.

  4. Now that we have all the external tables created, let's create an aggregate fact table in the physical Redshift data warehouse. We can use a CREATE TABLE AS SELECT statement to do this:
create table readmissions_aggregate_fact as
select
readmission_result,admission_type,discharge_disposition,diabetesmed,
avg(time_in_hospital) as avg_time_in_hospital,
min(num_procedures) as min_procedures,
max(num_procedures) as max_procedures,
avg(num_procedures) as avg_num_procedures,
avg(num_medications) as avg_num_medications,
avg(number_outpatient) as avg_number_outpatient,
avg(number_emergency) as avg_number_emergency,
avg(number_inpatient) as avg_number_inpatient,
avg(number_diagnoses) as avg_number_diagnoses
from readmissions.nonphi
group by readmission_result,admission_type,discharge_disposition,diabetesmed

Once this query executes successfully, you can see a new table created in the physical public schema of your Amazon Redshift cluster. You can confirm this by executing the following query:

select distinct(tablename) from pg_table_def where schemaname = 'public'

  4. Build a QuickSight Dashboard from the aggregate fact

We can now create dashboards to visualize the data in our readmissions aggregate fact table using Amazon QuickSight. Here are some examples of reports you can generate using Amazon QuickSight on the readmission data.

For more details on Amazon QuickSight, refer to the service documentation.

  5. Build an ML model in Amazon SageMaker to predict readmissions

As a final step, we will create an ML model to predict the attribute readmission_result, which denotes whether a patient was readmitted, using the non-PHI dataset. A sketch of the main data-preparation steps appears after the following list.

  1. Create a notebook instance in Amazon SageMaker that is used to develop our code.
  2. The code reads non-PHI data from the Amazon S3 bucket as a data frame in Python. This is achieved using the pandas.read_csv function.
  3. Use the pandas.get_dummies function to encode categorical values into numeric values for use with the model.
  4. Split the data into two sets, 80% for training and 20% for testing, using the numpy.random.rand function.
  5. Form train_X, train_y and test_X, test_y corresponding to training features, training labels, testing features, and testing labels respectively.
  6. Use the Amazon SageMaker Linear Learner algorithm to train our model. The implementation of the algorithm uses the dense tensor format to optimize the training job. Use the function write_numpy_to_dense_tensor from the Amazon SageMaker library to convert the numpy array into the dense tensor format.
  7. Create the training job in Amazon SageMaker with appropriate configurations and run it.
  8. Once the training job completes, create an endpoint in Amazon SageMaker to host our model, using the linear.deploy function to deploy the endpoint.
  9. Finally, run a prediction by invoking the endpoint using the linear_predictor.predict function.
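Here is a minimal sketch of the data-preparation steps (2 through 6) above. It is illustrative only: the S3 path, bucket, and label column name are assumptions, and the complete notebook linked below remains the authoritative version.

# Sketch of the data-preparation steps; paths and column names are hypothetical.
import io
import boto3
import numpy as np
import pandas as pd
import sagemaker.amazon.common as smac

# Steps 1-2: read the non-PHI dataset from S3 into a data frame
# (assumes the s3fs package is available for pandas S3 reads).
df = pd.read_csv("s3://my-healthcare-bucket/nonphi/readmissions.csv")  # hypothetical path

# Step 3: encode categorical values as numeric indicator columns.
df = pd.get_dummies(df)

# Step 4: random 80/20 split into training and testing sets.
mask = np.random.rand(len(df)) < 0.8
train, test = df[mask], df[~mask]

# Step 5: separate features from the label. The label column name below is an
# assumption about what readmission_result becomes after one-hot encoding.
label_col = "readmission_result_Yes"
leak_cols = [c for c in df.columns if c.startswith("readmission_result")]
train_X = train.drop(columns=leak_cols).values.astype("float32")
train_y = train[label_col].values.astype("float32")
test_X = test.drop(columns=leak_cols).values.astype("float32")
test_y = test[label_col].values.astype("float32")

# Step 6: convert the training set to the dense tensor (RecordIO-protobuf)
# format expected by Linear Learner, and stage it in S3 for the training job.
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, train_X, train_y)
buf.seek(0)
boto3.client("s3").put_object(
    Bucket="my-healthcare-bucket", Key="train/readmissions.rec", Body=buf  # hypothetical
)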

You can view the complete notebook here.

Data, analytics, and ML are strategic assets to help you manage your patients, staff, equipment, and supplies more efficiently. These technologies can also help you be more proactive in treating and preventing disease. Industry luminaries share this opinion: “By leveraging big data and scientific advancements while maintaining the important doctor-patient bond, we believe we can create a health system that will go beyond curing disease after the fact to preventing disease before it strikes by focusing on health and wellness,” writes Lloyd B. Minor, MD, dean of the Stanford School of Medicine.

ML and analytics offer huge value in helping achieve the quadruple aim: improved patient satisfaction, improved population health, improved provider satisfaction, and reduced costs. Technology should never replace the clinician but should instead become an extension of the clinician, allowing them to be more efficient by removing some of the mundane, repetitive tasks involved in prevention, diagnostics, and treatment of patients.

(1) “The Digital Future of Pathology.” The Medical Futurist, 28 May 2018, medicalfuturist.com/digital-future-pathology.

(2) Wang, Dayong, et al. “Deep Learning for Identifying Metastatic Breast Cancer.” arXiv, 18 June 2016, arxiv.org/abs/1606.05718.

About the Author

Stephen Jepsen is a Global HCLS Practice Manager in AWS Professional Services.

 

Narrativ is helping producers monetize their digital content with Amazon Redshift

Post Syndicated from Colin Nichols original https://aws.amazon.com/blogs/big-data/narrativ-is-helping-producers-monetize-their-digital-content-with-amazon-redshift/

Narrativ, in their own words: Narrativ is building monetization technology for the next generation of digital content producers. Our product portfolio includes a real-time bidding platform and visual storytelling tools that together generate millions of dollars of advertiser value and billions of data points each month.

At Narrativ, we have seen massive growth in our platform usage with a similar order-of-magnitude increase in data generated by our products over the past 15 months. In this blog post, we share our evolution to a robust, scalable, performant, and cost-effective analytics environment with AWS. We also discuss the best practices that we learned along the way in data warehousing and data lake analytics.

Anticipating Narrativ’s continued growth acceleration, we began planning late last year for the next order of magnitude. We have been using Amazon Redshift as our data warehouse, which has served us very well. As our data continued to grow, we utilized Amazon S3 as a data lake and used external tables in Amazon Redshift Spectrum to query data directly in S3. We were excited that this allowed us to easily scale storage and compute resources separately to meet our needs without trading off against cost or complexity.

In the process, we created Spectrify, which simplifies working with Redshift Spectrum and encapsulates a number of best practices. Spectrify accomplishes three things in one easy command. First, it exports an Amazon Redshift table to S3 in comma-separated value (CSV) format. Second, it converts the exported CSVs to Apache Parquet files in parallel. Third, it creates the external tables in the Amazon Redshift cluster. Queries can now span massive volumes of Parquet data in S3 together with data in Amazon Redshift, and return results in just moments.

 

The preceding diagram shows how Spectrify simplifies querying across Parquet data in Amazon S3 and data in Amazon Redshift, returning results in just moments.
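To make those three steps concrete, the sketch below walks through them with plain psycopg2 and SQL rather than Spectrify's own API; see the Spectrify documentation for the real interface. The connection details, table definition, IAM role, and S3 paths are placeholders, and it assumes an external schema named spectrum already exists.

# Conceptual sketch of the three steps Spectrify automates (all names are placeholders).
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439, dbname="analytics", user="admin", password="REPLACE_ME",
)
conn.autocommit = True  # CREATE EXTERNAL TABLE cannot run inside a transaction
cur = conn.cursor()

# 1. Export an Amazon Redshift table to S3 as CSV.
cur.execute("""
    UNLOAD ('SELECT * FROM events')
    TO 's3://my-data-lake/events/csv/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
    DELIMITER ',' ALLOWOVERWRITE GZIP;
""")

# 2. Convert the exported CSVs to Parquet. Spectrify does this in parallel with
#    pyarrow; the conversion itself is omitted from this sketch.

# 3. Define an external table over the Parquet files so queries can join
#    Spectrum data with local Amazon Redshift tables.
cur.execute("""
    CREATE EXTERNAL TABLE spectrum.events (
        event_id   BIGINT,
        created_at TIMESTAMP,
        url        VARCHAR(2048)
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/events/parquet/';
""")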

Amazon Redshift at Narrativ

Amazon Redshift has been the foundation of our data warehouse since Narrativ’s inception. We use it to produce traffic statistics, to feed the data-driven algorithms in our programmatic systems, and as a window to explore data insights. Amazon Redshift has scaled well to support the company’s needs. Our cluster has grown from 3 nodes to 36 nodes, and queries per hour have increased significantly without losing performance. As our workload moved from statistics and accounting toward data-driven insights, Amazon Redshift kept pace with increasing query complexity.

When planning the next iteration of our data strategy, our requirements were as follows:

  • Be maintainable by a small engineering team (mostly hands-off)
  • Be able to query all historic data at any time
  • Be able to query the previous three months of data very quickly
  • Maintain performance at 10x current data volume

We considered a few ideas to meet these requirements:

  • Upgrade to a DC2 cluster and a DS2 cluster
  • Upgrade to a very large DC2 cluster
  • Move to an open-source alternative (difficult to maintain)
  • Move to another big data provider (high barrier to entry, costly, risky)

Then we discovered Redshift Spectrum, which changes the game by separating compute costs from storage costs. With Redshift Spectrum, we can offload older, infrequently used data to S3 while maintaining a modestly sized, yet fast, Amazon Redshift cluster. Compared to other solutions, Redshift Spectrum reduces complexity by allowing us to join to the same set of dimension tables for all of our queries. And finally, it even preserves, if not improves, performance (depending on the query). All of our requirements were met with Redshift Spectrum, and it was easy, fast, and cost-effective to implement.

Moving forward, we view Redshift Spectrum as a critical tool to scaling our data capabilities on AWS to the next order of magnitude. Redshift Spectrum allows us to define Amazon Redshift tables backed by data stored in Amazon S3. In this way, Redshift Spectrum turns Amazon Redshift into a hybrid data warehouse/data lake. It provides the best of both worlds—unlimited storage separated from compute costs, plus the performance and converged workflow of Amazon Redshift itself.

Unlocking Redshift Spectrum

Using best practices can provide massive performance improvements in Redshift Spectrum. We shared our Spectrify implementation best practices in an open-source Python library on GitHub to help others reap the same rewards that we have. Following are further details on our top suggestions for optimizing Redshift Spectrum, which you will find incorporated into Spectrify.

Use Apache Parquet

Apache Parquet is a columnar file format. When you store your data in Parquet, Redshift Spectrum can scan only the column data relevant to your query. In contrast, row-oriented formats like CSV or JSON require a scan of every column, regardless of the query. For Narrativ’s workload, using Parquet led to large performance gains and drastically lower bills.

Columnar formats also enable more efficient data compression. Data compresses better when similar data is grouped together. In this case, we saw about an 80 percent reduction in file size for gzipped Parquet files vs. gzipped CSVs. This reduction means data can be ingested, and thus scanned, much faster.

When we first investigated converting our data to Parquet, we found no options that fit our needs. We don’t use Spark or any other project from the Hadoop ecosystem, which is the most common toolset for creating Parquet files. A few Python projects had write support for Parquet. However, we had reservations about data consistency caveats, especially in relation to nullable columns. Additionally, we didn’t find a library that supported the correct timestamp format.

We built Spectrify to solve all these problems. We worked with the Apache Arrow project to ensure full compatibility between Redshift Spectrum and Parquet files generated by Spectrify.

Partition your data

Narrativ stores timestamped event data in Redshift Spectrum. We partition our data on event creation date, allowing Redshift Spectrum to skip inspecting most of our data files for many of our queries. Generally, our queries are concerned with a small subset of the time domain, such as a month or even just a day (for example, answering a specific question from Black Friday 2017). Queries are rarely concerned with the entire years-long history.

This drastically reduces the amount of data scanned, and we see most queries complete in one-tenth the time. Also, our bill is much lower. Partitioning alone reduced our average cost per query by a factor of 20.

Summary

As we prepared for an order-of-magnitude increase in data volumes, we turned to a data lake strategy built on Amazon S3 using Amazon Redshift and Redshift Spectrum for analytics. Redshift Spectrum enabled us to query data across S3 and Amazon Redshift, maintain or improve query performance, and do it easily and cost-effectively. Along the way, we learned best practices and built Spectrify to streamline data preparation and analytics even further. We hope that sharing our code examples and what we learned will also help you achieve similar results with your growing data.

For more information, see our Spectrify Usage documentation and code samples, and our full example on GitHub.

 


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and How I built a data warehouse using Amazon Redshift and AWS services in record time.

 


About the Author

Colin Nichols leads big data initiatives at Narrativ. He enjoys working at the intersection of systems, and has a strong record of shaping engineering culture around the tenets of quality, automation, and continuous learning. In his spare time, Colin enjoys teaching his dog, MacGyver, new tricks.

Get sub-second query response times with Amazon Redshift result caching

Post Syndicated from Entong Shen original https://aws.amazon.com/blogs/big-data/get-sub-second-query-response-times-with-amazon-redshift-result-caching/

Customers tell us that their data warehouse and business intelligence users want extremely fast response times so that they can make equally fast decisions. They also tell us that their users often repeat the same queries over and over again, even when the data has not changed. Repeat queries consume compute resources each time they are executed, which slows down performance for all queries.

In this post, we take a look at query result caching in Amazon Redshift. Result caching does exactly what its name implies—it caches the results of a query. When the same query comes in against the same data, the prior results are retrieved from the cache and returned immediately, instead of rerunning the same query. Result caching reduces system use, making more resources available for other workloads. It delivers faster response times for users, improves throughput for all queries, and increases concurrency.

Result caching options

There are two main ways that you can implement data warehouse result caching. The first method is to save subsets of the data tables and cache query results outside the data warehouse. This method requires additional logic and memory outside the data warehouse. You must take great care to ensure that the cache is invalidated and a query is rerun when table data is modified.

The second method is to cache the results of a query inside the data warehouse and return the cached result for future repeat queries. This method delivers higher performance because it is faster to cache data and serve it from within the cluster.

Amazon Redshift uses the second method to cache query results within the cluster to achieve higher query throughput. Amazon Redshift result caching automatically responds to data and workload changes, transparently serving multiple BI applications and SQL tools. It is available by default for all Amazon Redshift customers for no additional charge.

Since Amazon Redshift introduced result caching, the feature has saved customers thousands of hours of execution time on a daily basis. “With Amazon Redshift result caching, 20 percent of our queries now complete in less than one second,” said Greg Rokita, Executive Director of Technology, Edmunds, at the AWS Summit in San Francisco. “Our cluster reliance on disk has decreased, and consequently the cluster is able to better serve the rest of our queries. Best of all, we didn’t have to change anything to get this speed-up with Redshift, which supports our mission-critical workloads.”

How does it work?

When a query is executed in Amazon Redshift, both the query and the results are cached in memory. As future queries come in, they are normalized and compared to the queries in the cache to determine whether there are repeat queries. In this comparison, Amazon Redshift also determines whether the underlying data has changed in any way. Any data manipulation language (DML) or data definition language (DDL) statement on a table or function invalidates only the cache entries that refer to it. Examples of such statements include INSERT, DELETE, UPDATE, COPY, and TRUNCATE. Amazon Redshift manages the cache memory to evict old entries, ensuring that optimal memory use is maintained for the cache itself.

Result caching is fully managed by Amazon Redshift, and it requires no changes in your application code. Amazon Redshift automatically selects the optimal configuration based on the specific condition of your cluster, and no tuning is required for you to get the most effective configuration.

When Amazon Redshift determines that a query is eligible to reuse prior query cached results, it bypasses query planning, the workload manager (WLM), and the query execution engine altogether. Cached result rows are returned to the client application immediately with sub-second performance. This method frees up cluster resources for ETL (extract, transform, and load) and other workloads that need the compute resources.

Design and usage

The following diagram illustrates the architecture of Amazon Redshift result caching.

The query result cache resides in the memory of the leader node and is shared across different user sessions to the same database. The feature is transparent, so it works by default without the need for user configurations. Result caching complies with Amazon Redshift multi-version concurrency control (MVCC). It acquires the proper locks on the table objects and manages the lifecycle of the cache entries when multiple user sessions read/write a table object at the same time. In addition, access control of the cached results is managed so that a user must have the required permission of the objects used in the query to retrieve result rows from the cache.

Which queries are served from the result cache?

Read-only queries are eligible for caching with some exceptions. Query results are not cached in the following circumstances:

  • When a query uses a function that must be evaluated each time it is run, such as getdate().
  • When a query refers to system tables or views.
  • When a query refers to external tables, that is, Amazon Redshift Spectrum tables.
  • When a query runs only on the leader node, or the result is too large.

Suppose that your query contains a function like current_date and you want to take advantage of the result cache. Consider rewriting the query to materialize the value of current_date directly into the query text (for example, in your JDBC application) and refreshing that literal as needed.
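For example, a reporting script might compute the date once on the client and submit it as a literal, so repeated runs send identical query text. The following sketch is illustrative; the table name and connection details are placeholders.

# Sketch: materialize current_date on the client so repeat queries can hit the result cache.
import datetime
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439, dbname="analytics", user="report_user", password="REPLACE_ME",
)

today = datetime.date.today().isoformat()  # refresh this literal when the date rolls over
sql = "SELECT count(*) FROM page_views WHERE view_date = %s"  # hypothetical table
with conn.cursor() as cur:
    # psycopg2 interpolates the parameter client-side, so Amazon Redshift receives
    # a literal date in the query text instead of a call to current_date.
    cur.execute(sql, (today,))
    print(cur.fetchone())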

To determine which executed queries served results from the cache, a new column source_query has been added to system view SVL_QLOG to record the source query ID when a query is executed from the cache. You can use the following example query to find out which queries used cached results:

select userid, query, elapsed, source_query from svl_qlog where userid > 1
order by query desc;

userid | query  | elapsed  | source_query
-------+--------+----------+-------------
   104 | 629035 |       27 |       628919
   104 | 629034 |       60 |       628900
   104 | 629033 |       23 |       628891
   102 | 629017 |  1229393 |             
   102 | 628942 |       28 |       628919
   102 | 628941 |       57 |       628900
   102 | 628940 |       26 |       628891
   100 | 628919 | 84295686 |             
   100 | 628900 | 87015637 |             
   100 | 628891 | 58808694 | 

For more information about result cache usage, see Result Caching in the Amazon Redshift documentation.

Summary

Amazon Redshift result caching helps ensure that no computing resources are wasted on repeat queries. It enables you to do more analytics in less time to support decision making and improve outcomes. In this post, we explained how Amazon Redshift result caching works and discussed the significant impact for Amazon Redshift customers. Result caching is enabled automatically, and we encourage you to see the difference it can make in your environment.


Additional Reading

If you found this post useful, be sure to check out Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required, Collect Data Statistics Up to 5x Faster by Analyzing Only Predicate Columns with Amazon Redshift and Amazon Redshift – 2017 Recap.

 


About the Authors

Entong Shen is a software engineer on the Amazon Redshift query processing team. He has been working on MPP databases for over 5 years and has focused on query optimization, statistics and SQL language features. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

 

 

Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out and the taste of classic barbeque.

 

 

Maor Kleider is a Senior Product Manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

 

 

Meng Tong is a Senior Software Engineer with Amazon Redshift Query Processing team. His work has been focused on query performance improvements across rewriter, optimizer and executor, Redshift Spectrum, and most recently Redshift Result Caching. His passion lies in discovering simple, elegant solutions for customer needs in big data systems. He is a big Rafael Nadal fan and enjoys watching and playing tennis in his spare time.

 

 

Naresh Chainani is a Senior Software Development Manager at Amazon Redshift where he leads the Query Processing team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.

 

 

 

Streaming Events from Amazon Pinpoint to Redshift

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/streaming-events-from-amazon-pinpoint-to-redshift/

Note: This post was originally written by Ryan Idrigo-Lam, one of the founding members of the Amazon Pinpoint team.


You can use Amazon Pinpoint to segment, target, and engage with your customers directly from the console. The Pinpoint console also includes a variety of dashboards that you can use to keep track of how your customers use your applications, and measure how likely your customers are to engage with the messages you send them.

Some Pinpoint customers, however, have use cases that require a bit more than what these dashboards have to offer. For example, some customers want to join their Pinpoint data to external data sets, or to collect historical data beyond the six-month window that Pinpoint retains. To help customers meet these needs, and many more, Amazon Pinpoint includes a feature called Event Streams.

This article provides information about using Event Streams to export your data from Amazon Pinpoint and into a high-performance Amazon Redshift database. Once your data is in Redshift, you can run queries against it, join it with other data sets, use it as a data source for analytics and data visualization tools, and much more.

Step 1: Create a Redshift Cluster

The first step in this process involves creating a new Redshift cluster to store your data. You can complete this step in a few clicks by using the Amazon Redshift console. For more information, see Managing Clusters Using the Console in the Amazon Redshift Cluster Management Guide.

When you create the new cluster, make a note of the values you specify for the Cluster Identifier, Database Name, Master User Name, and Master User Password. You’ll use all of these values when you set up Amazon Kinesis Firehose in the next section.

Step 2: Create a Firehose Delivery Stream with a Redshift Destination

After you create your Redshift cluster, you can create the Amazon Kinesis Data Firehose delivery stream that will deliver your Pinpoint data to the Redshift cluster.

To create the Kinesis Data Firehose delivery stream

  1. Open the Amazon Kinesis Data Firehose console at https://console.aws.amazon.com/firehose/home.
  2. Choose Create delivery stream.
  3. For Delivery stream name, type a name.
  4. Under Choose source, for Source, choose Direct PUT or other sources. Choose Next.
  5. On the Process records page, do the following:
    1. Under Transform source records with AWS Lambda, choose Enabled if you want to use a Lambda function to transform the data before Firehose loads it into Redshift. Otherwise, choose Disabled.
    2. Under Convert record format, choose Disabled, and then choose Next.
  6. On the Choose destination page, do the following:
    1. For Destination, choose Amazon Redshift.
    2. Under Amazon Redshift destination, specify the Cluster name, User name, Password, and Database for the Redshift database you created earlier. Also specify a name for the Table.
    3. Under Intermediate S3 destination, choose an S3 bucket to store data in. Alternatively, choose Create new to create a new bucket. Choose Next.
  7. On the Configure settings page, do the following:
    1. Under IAM role, choose an IAM role that Firehose can use to access your S3 bucket and KMS key. Alternatively, you can have the Firehose console create a new role. Choose Next.
    2. On the Review page, confirm the settings you specified on the previous pages. If the settings are correct, choose Create delivery stream.
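If you script your infrastructure, the console steps above map to a single Kinesis Data Firehose API call. The boto3 sketch below is illustrative only: the stream name, ARNs, JDBC URL, and credentials are placeholders, and the COPY columns and options are filled in later, in Step 5.

# Sketch: create the Firehose delivery stream with a Redshift destination (placeholder values).
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="pinpoint-to-redshift",   # hypothetical
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "ClusterJDBCURL": "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/pinpoint",
        "Username": "master_user",
        "Password": "REPLACE_ME",                # keep in AWS Secrets Manager in practice
        "CopyCommand": {
            "DataTableName": "awsma.event",      # table created in Step 4
            # DataTableColumns and CopyOptions are configured in Step 5.
        },
        "S3Configuration": {                     # intermediate S3 destination
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-pinpoint-firehose-bucket",
        },
    },
)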

Step 3: Create a JSONPaths file

The next step in this process is to create a JSONPaths file and upload it to an Amazon S3 bucket. You use the JSONPaths file to tell Amazon Redshift how to interpret the unstructured JSON that Amazon Pinpoint provides.

To create a JSONPaths file and upload it to Amazon S3

  1. In a text editor, create a new file.
  2. Paste the following code into the text file:
    {
      "jsonpaths": [
        "$['event_type']",
        "$['event_timestamp']",
        "$['arrival_timestamp']",
        "$['event_version']",
        "$['application']['app_id']",
        "$['application']['package_name']",
        "$['application']['version_name']",
        "$['application']['version_code']",
        "$['application']['title']",
        "$['application']['cognito_identity_pool_id']",
        "$['application']['sdk']['name']",
        "$['application']['sdk']['version']",
        "$['client']['client_id']",
        "$['client']['cognito_id']",
        "$['device']['model']",
        "$['device']['make']",
        "$['device']['platform']['name']",
        "$['device']['platform']['version']",
        "$['device']['locale']['code']",
        "$['device']['locale']['language']",
        "$['device']['locale']['country']",
        "$['session']['session_id']",
        "$['session']['start_timestamp']",
        "$['session']['stop_timestamp']",
        "$['monetization']['transaction']['transaction_id']",
        "$['monetization']['transaction']['store']",
        "$['monetization']['transaction']['item_id']",
        "$['monetization']['transaction']['quantity']",
        "$['monetization']['transaction']['price']['reported_price']",
        "$['monetization']['transaction']['price']['amount']",
        "$['monetization']['transaction']['price']['currency']['code']",
        "$['monetization']['transaction']['price']['currency']['symbol']",
        "$['attributes']['campaign_id']",
        "$['attributes']['campaign_activity_id']",
        "$['attributes']['my_custom_attribute']",
        "$['metrics']['my_custom_metric']"
      ]
    }

  3. Modify the preceding code example to include the fields that you want to import into Redshift.
    Note: You can specify custom attributes or metrics by replacing my_custom_attribute or my_custom_metric in the example above with your custom attributes or metrics, respectively.
  4. When you finish modifying the code example, remove all whitespace, including spaces and line breaks, from the file. Save the file as json-paths.json.
  5. Open the Amazon S3 console at https://s3.console.aws.amazon.com/s3/home.
  6. Choose the S3 bucket you created when you set up the Firehose stream. Upload json-paths.json into the bucket.

Step 4: Configure the table in Redshift

At this point, it’s time to finish setting up your Redshift database. In this section, you’ll create a table in the Redshift cluster you created earlier. The columns in this table mirror the values you specified in the JSONPaths file in the previous section.

  1. Connect to your Redshift cluster by using a database tool such as SQL Workbench/J. For more information about connecting to a cluster, see Connect to the Cluster in the Amazon Redshift Getting Started Guide.
  2. Create a new table that contains a column for each field in the JSONPaths file you created in the preceding section. You can use the following example as a template.
    CREATE schema AWSMA;
    CREATE TABLE AWSMA.event(
      event_type VARCHAR(256) NOT NULL ENCODE LZO,
      event_timestamp TIMESTAMP NOT NULL ENCODE LZO,
      arrival_timestamp TIMESTAMP NULL ENCODE LZO,
      event_version CHAR(12) NULL ENCODE LZO,
      application_app_id VARCHAR(64) NOT NULL ENCODE LZO,
      application_package_name VARCHAR(256) NULL ENCODE LZO,
      application_version_name VARCHAR(256) NULL ENCODE LZO,
      application_version_code VARCHAR(256) NULL ENCODE LZO,
      application_title VARCHAR(256) NULL ENCODE LZO,
      application_cognito_identity_pool_id VARCHAR(64) NULL ENCODE LZO,
      application_sdk_name VARCHAR(256) NULL ENCODE LZO,
      application_sdk_version VARCHAR(256) NULL ENCODE LZO,
      client_id VARCHAR(64) NULL DISTKEY ENCODE LZO,
      client_cognito_id VARCHAR(64) NULL ENCODE LZO,
      device_model VARCHAR(256) NULL ENCODE LZO,
      device_make VARCHAR(256) NULL ENCODE LZO,
      device_platform_name VARCHAR(256) NULL ENCODE LZO,
      device_platform_version VARCHAR(256) NULL ENCODE LZO,
      device_locale_code VARCHAR(256) NULL ENCODE LZO,
      device_locale_language VARCHAR(64) NULL ENCODE LZO,
      device_locale_country VARCHAR(64) NULL ENCODE LZO,
      session_id VARCHAR(64) NULL ENCODE LZO,
      session_start_timestamp TIMESTAMP NULL ENCODE LZO,
      session_stop_timestamp TIMESTAMP NULL ENCODE LZO,
      monetization_transaction_id VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_store VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_item_id VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_quantity FLOAT8 NULL,
      monetization_transaction_price_reported VARCHAR(64) NULL ENCODE LZO,
      monetization_transaction_price_amount FLOAT8 NULL,
      monetization_transaction_price_currency_code VARCHAR(16) NULL ENCODE LZO,
      monetization_transaction_price_currency_symbol VARCHAR(32) NULL ENCODE LZO,
      -- Custom Attributes
      a_campaign_id VARCHAR(4000),
      a_campaign_activity_id VARCHAR(4000),
      a_my_custom_attribute VARCHAR(4000),
      -- Custom Metrics
      m_my_custom_metric float8
    )
    SORTKEY ( application_app_id, event_timestamp, event_type);

Step 5: Configure the Firehose Stream

You’re getting close! At this point, you’re ready to point the Kinesis Data Firehose stream to your JSONPaths file so that Redshift parses the incoming data properly. You also need to list the columns of the table that your data will be copied into.

To configure the Firehose Stream

  1. Open the Amazon Kinesis Data Firehose console at https://console.aws.amazon.com/firehose/home.
  2. In the list of delivery streams, choose the delivery stream you created earlier.
  3. On the Details tab, choose Edit.
  4. Under Amazon Redshift destination, for COPY options, paste the following:
    JSON 's3://s3-bucket/json-paths.json'
    TRUNCATECOLUMNS
    TIMEFORMAT 'epochmillisecs'

  5. Replace s3-bucket in the preceding code example with the path to the S3 bucket that contains json-paths.json.
  6. For Columns, list all of the columns that are present in the JSONPaths file you created earlier. Specify the column names in the same order as they’re listed in the json-paths.json file, using commas to separate the column names. When you finish, choose Save.
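
Behind the scenes, Kinesis Data Firehose loads each delivery batch into Amazon Redshift by issuing a COPY command built from the column list and COPY options you configured. The following is a rough sketch of what that command looks like for the table and JSONPaths file in this walkthrough; the manifest path and IAM role are placeholders, and Kinesis Data Firehose generates the real values for each delivery batch.

    -- Sketch only: the FROM manifest path and the IAM role below are placeholders.
    COPY AWSMA.event (
      event_type, event_timestamp, arrival_timestamp, event_version,
      application_app_id, application_package_name, application_version_name,
      application_version_code, application_title, application_cognito_identity_pool_id,
      application_sdk_name, application_sdk_version, client_id, client_cognito_id,
      device_model, device_make, device_platform_name, device_platform_version,
      device_locale_code, device_locale_language, device_locale_country,
      session_id, session_start_timestamp, session_stop_timestamp,
      monetization_transaction_id, monetization_transaction_store,
      monetization_transaction_item_id, monetization_transaction_quantity,
      monetization_transaction_price_reported, monetization_transaction_price_amount,
      monetization_transaction_price_currency_code, monetization_transaction_price_currency_symbol,
      a_campaign_id, a_campaign_activity_id, a_my_custom_attribute, m_my_custom_metric
    )
    FROM 's3://s3-bucket/manifests/generated-manifest-file'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::111122223333:role/firehose-delivery-role'
    MANIFEST
    JSON 's3://s3-bucket/json-paths.json'
    TRUNCATECOLUMNS
    TIMEFORMAT 'epochmillisecs';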

Step 6: Enable Event Streams in Amazon Pinpoint

The only thing left to do now is to tell Amazon Pinpoint to start sending data to Amazon Kinesis.

To enable Event Streaming in Amazon Pinpoint

  1. Open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/home.
  2. Choose the application or project that you want to enable event streams for.
  3. In the navigation pane, choose Settings.
  4. On the Event stream tab, choose Enable streaming of events to Amazon Kinesis.
  5. Under Stream to Amazon Kinesis, select Send events to an Amazon Kinesis Firehose delivery stream.
  6. For Amazon Kinesis Firehose delivery stream, choose the stream you created earlier.
  7. For IAM role, choose an existing role that allows the firehose:PutRecordBatch action, or choose Automatically create a role to have Amazon Pinpoint create a role with the appropriate permissions. If you choose to have Amazon Pinpoint create a role for you, type a name for the role. Choose Save.

That’s it! Once you complete this final step, Amazon Pinpoint starts exporting the data you specified into your Redshift cluster.

I hope this walkthrough was helpful. If you have any questions, please let us know in the comments or in the Amazon Pinpoint forum.

New – Pay-per-Session Pricing for Amazon QuickSight, Another Region, and Lots More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-pay-per-session-pricing-for-amazon-quicksight-another-region-and-lots-more/

Amazon QuickSight is a fully managed cloud business intelligence system that gives you fast, easy-to-use business analytics for big data. QuickSight makes business analytics available to organizations of all shapes and sizes, with the ability to access data that is stored in your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) relational databases, flat files in S3, and (via connectors) data stored in on-premises MySQL, PostgreSQL, and SQL Server databases. QuickSight scales to accommodate tens, hundreds, or thousands of users per organization.

Today we are launching a new, session-based pricing option for QuickSight, along with additional region support and other important new features. Let’s take a look at each one:

Pay-per-Session Pricing
Our customers are making great use of QuickSight and take full advantage of the power it gives them to connect to data sources, create reports, and explore visualizations.

However, not everyone in an organization needs or wants such powerful authoring capabilities. For many users, access to curated dashboards, with the ability to drill down, filter, and slice and dice the data, is more than adequate. Subscribing these users to a monthly or annual plan can seem like an unwarranted expense, so many casual users end up without access to interactive data or BI at all.

In order to allow customers to provide all of their users with interactive dashboards and reports, the Enterprise Edition of Amazon QuickSight now allows Reader access to dashboards on a Pay-per-Session basis. QuickSight users are now classified as Admins, Authors, or Readers, with distinct capabilities and prices:

Authors have access to the full power of QuickSight; they can establish database connections, upload new data, create ad hoc visualizations, and publish dashboards, all for $9 per month (Standard Edition) or $18 per month (Enterprise Edition).

Readers can view dashboards, slice and dice data using drill downs, filters and on-screen controls, and download data in CSV format, all within the secure QuickSight environment. Readers pay $0.30 for 30 minutes of access, with a monthly maximum of $5 per reader.
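
To put the Reader pricing in concrete terms: a Reader who opens dashboards in ten separate 30-minute sessions during a month is billed $3.00 (10 × $0.30), while a Reader with 17 or more sessions reaches the $5 monthly maximum and pays no more than that.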

Admins have all authoring capabilities, and can manage users and purchase SPICE capacity in the account. The QuickSight admin now has the ability to set the desired option (Author or Reader) when they invite members of their organization to use QuickSight. They can extend Reader invites to their entire user base without incurring any up-front or monthly costs, paying only for the actual usage.

To learn more, visit the QuickSight Pricing page.

A New Region
QuickSight is now available in the Asia Pacific (Tokyo) Region.

The UI is in English, with a localized version in the works.

Hourly Data Refresh
Enterprise Edition SPICE data sets can now be set to refresh as frequently as every hour. In the past, each data set could be refreshed up to 5 times a day. To learn more, read Refreshing Imported Data.

Access to Data in Private VPCs
This feature was launched in preview form late last year, and is now available in production form to users of the Enterprise Edition. As I noted at the time, you can use it to implement secure, private communication with data sources that do not have public connectivity, including on-premises data in Teradata or SQL Server, accessed over an AWS Direct Connect link. To learn more, read Working with AWS VPC.

Parameters with On-Screen Controls
QuickSight dashboards can now include parameters that are set using on-screen dropdown, text box, numeric slider or date picker controls. The default value for each parameter can be set based on the user name (QuickSight calls this a dynamic default). You could, for example, set an appropriate default based on each user’s office location, department, or sales territory. Here’s an example:

To learn more, read about Parameters in QuickSight.

URL Actions for Linked Dashboards
You can now connect your QuickSight dashboards to external applications by defining URL actions on visuals. The actions can include parameters, and become available in the Details menu for the visual. URL actions are defined like this:

You can use this feature to link QuickSight dashboards to third party applications (e.g. Salesforce) or to your own internal applications. Read Custom URL Actions to learn how to use this feature.

Dashboard Sharing
You can now share QuickSight dashboards across every user in an account.

Larger SPICE Tables
The per-data set limit for SPICE tables has been raised from 10 GB to 25 GB.

Upgrade to Enterprise Edition
The QuickSight administrator can now upgrade an account from Standard Edition to Enterprise Edition with a click. This enables provisioning of Readers with pay-per-session pricing, private VPC access, row-level security for dashboards and data sets, and hourly refresh of data sets. Enterprise Edition pricing applies after the upgrade.

Available Now
Everything I listed above is available now and you can start using it today!

You can try QuickSight for 60 days at no charge, and you can also attend our June 20th Webinar.

Jeff;

 

Amazon Neptune Generally Available

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-neptune-generally-available/

Amazon Neptune is now Generally Available in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. At the core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latencies. Neptune supports two popular graph models, Property Graph and RDF, through Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune can be used to power everything from recommendation engines and knowledge graphs to drug discovery and network security. Neptune is fully-managed with automatic minor version upgrades, backups, encryption, and fail-over. I wrote about Neptune in detail for AWS re:Invent last year and customers have been using the preview and providing great feedback that the team has used to prepare the service for GA.

Now that Amazon Neptune is generally available, there are a few changes from the preview.

Launching an Amazon Neptune Cluster

Launching a Neptune cluster is as easy as navigating to the AWS Management Console and choosing Create cluster. Of course, you can also launch with CloudFormation, the CLI, or the SDKs.

You can monitor your cluster health and the health of individual instances through Amazon CloudWatch and the console.

Additional Resources

We’ve created two repos with some additional tools and examples, and you can expect continuous development on them as we add more.

  • Amazon Neptune Tools Repo
    This repo has a useful tool for converting GraphML files into Neptune-compatible CSVs for bulk loading from S3.
  • Amazon Neptune Samples Repo
    This repo has a really cool example of building a collaborative filtering recommendation engine for video game preferences.

Purpose Built Databases

There’s an industry trend toward purpose-built databases. Developers and businesses want to access their data in the format that makes the most sense for their applications. As cloud resources make transforming large datasets easier with tools like AWS Glue, we have a lot more options than we used to for accessing our data. With tools like Amazon Redshift, Amazon Athena, Amazon Aurora, Amazon DynamoDB, and more, we get to choose the best database for the job or even enable entirely new use cases. Amazon Neptune is perfect for workloads where the data is highly connected across data-rich edges.

I’m really excited about graph databases and I see a huge number of applications. Looking for ideas of cool things to build? I’d love to build a web crawler in AWS Lambda that uses Neptune as the backing store. You could further enrich it by running Amazon Comprehend or Amazon Rekognition on the text and images found and creating a search engine on top of Neptune.

As always, feel free to reach out in the comments or on Twitter to provide any feedback!

Randall

Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift

Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/

Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.

To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.

Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.

What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.

Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.

Solution overview

Here is how the solution is laid out:

 

 

The following sections walk you through each of these steps to set up the pipeline.

1. Define the schema

When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. This is because incoming events often contain only some of the expected fields, depending on which values the producers emit. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.

To see the available columns and structures, see Amazon Connect Agent Event Streams. For simplicity, I opted to make all the columns type String rather than creating the nested structures, but you can definitely do that if you want.

The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.

CREATE EXTERNAL TABLE default.kfhconnectblog (
  awsaccountid string,
  agentarn string,
  currentagentsnapshot string,
  eventid string,
  eventtimestamp string,
  eventtype string,
  instancearn string,
  previousagentsnapshot string,
  version string
)
STORED AS parquet
LOCATION 's3://your_bucket/kfhconnectblog/'
TBLPROPERTIES ("parquet.compression"="SNAPPY")

That’s all you have to do to prepare the schema for Kinesis Data Firehose.

2. Define the data streams

Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events.  Open the Kinesis Data Streams console and create two streams.  You can configure them with only one shard each because you don’t have a lot of data right now.

3. Define the Kinesis Data Firehose delivery stream for Parquet

Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.

As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.

To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash ("/") so that Data Firehose creates the date partitions inside that prefix.

On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.

Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API to update the OutputFormatConfiguration.

Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.

Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.

4. Set up the Amazon Connect contact center

After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.

After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.

At this point, your pipeline is complete.  Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.

So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After it is created, Agent One can get to work and log into their console and set their availability to Available so that they are ready to receive calls.

To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.

5. Analyze the data with Athena

Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though the documentation describes them as complex structures.  The reason for doing that was simply to show the flexibility of Athena in parsing JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.
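
As a sketch of what that nested approach could look like, the following Athena DDL declares currentagentsnapshot as a nested struct instead of a string. It is only a partial illustration: it covers just the fields used in the queries in this post, the table name kfhconnectblog_nested is hypothetical, previousagentsnapshot would mirror the same structure, and the full field list comes from the Amazon Connect Agent Event Streams documentation.

CREATE EXTERNAL TABLE default.kfhconnectblog_nested (
  awsaccountid string,
  eventid string,
  eventtimestamp string,
  eventtype string,
  currentagentsnapshot struct<
    agentstatus: struct<
      name: string,
      starttimestamp: string
    >,
    configuration: struct<
      firstname: string,
      lastname: string,
      username: string,
      routingprofile: struct<
        defaultoutboundqueue: struct<name: string>,
        inboundqueues: array<struct<name: string>>
      >
    >,
    contacts: array<struct<
      contactid: string,
      channel: string,
      initiationmethod: string,
      state: string,
      statestarttimestamp: string,
      queue: struct<name: string>
    >>
  >
)
STORED AS parquet
LOCATION 's3://your_bucket/kfhconnectblog_nested/'
TBLPROPERTIES ("parquet.compression"="SNAPPY")

With a schema like this, you could select currentagentsnapshot.agentstatus.name directly instead of parsing JSON with json_extract_scalar.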

Let’s run the first query to see which agents have logged into the system.

The query might look complex, but it’s fairly straightforward:

WITH dataset AS (
  SELECT 
    from_iso8601_timestamp(eventtimestamp) AS event_ts,
    eventtype,
    -- CURRENT STATE
    json_extract_scalar(
      currentagentsnapshot,
      '$.agentstatus.name') AS current_status,
    from_iso8601_timestamp(
      json_extract_scalar(
        currentagentsnapshot,
        '$.agentstatus.starttimestamp')) AS current_starttimestamp,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.firstname') AS current_firstname,
    json_extract_scalar(
      currentagentsnapshot,
      '$.configuration.lastname') AS current_lastname,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.username') AS current_username,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.routingprofile.defaultoutboundqueue.name') AS current_outboundqueue,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
    -- PREVIOUS STATE
    json_extract_scalar(
      previousagentsnapshot, 
      '$.agentstatus.name') as prev_status,
    from_iso8601_timestamp(
      json_extract_scalar(
        previousagentsnapshot, 
       '$.agentstatus.starttimestamp')) as prev_starttimestamp,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.firstname') as prev_firstname,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.lastname') as prev_lastname,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.username') as prev_username,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.routingprofile.defaultoutboundqueue.name') as prev_outboundqueue,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
  from kfhconnectblog
  where eventtype <> 'HEART_BEAT'
)
SELECT
  current_status as status,
  current_username as username,
  event_ts
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC

The query output looks something like this:

Here is another query that shows the sessions each of the agents engaged in. It tells us whether the calls were incoming or outgoing, whether they were completed, and whether there were missed or failed calls.

WITH src AS (
  SELECT
     eventid,
     json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
     cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
     cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
  from kfhconnectblog
),
src2 AS (
  SELECT *
  FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
),
dataset AS (
SELECT 
  eventid,
  username,
  json_extract_scalar(c_item, '$.contactid') as c_contactid,
  json_extract_scalar(c_item, '$.channel') as c_channel,
  json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
  json_extract_scalar(c_item, '$.queue.name') as c_queue,
  json_extract_scalar(c_item, '$.state') as c_state,
  from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
  
  json_extract_scalar(p_item, '$.contactid') as p_contactid,
  json_extract_scalar(p_item, '$.channel') as p_channel,
  json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
  json_extract_scalar(p_item, '$.queue.name') as p_queue,
  json_extract_scalar(p_item, '$.state') as p_state,
  from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
)
SELECT 
  username,
  c_channel as channel,
  c_direction as direction,
  p_state as prev_state,
  c_state as current_state,
  c_ts as current_ts,
  c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC

The query output looks similar to the following:

6. Analyze the data with Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.

Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.
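
A minimal sketch of that external schema definition is shown below. The schema name matches the default_schema used in the query that follows, the database is the same Data Catalog database created in the Athena step, and the IAM role ARN is a placeholder for a role with access to the Data Catalog and the underlying S3 data.

CREATE EXTERNAL SCHEMA default_schema
FROM DATA CATALOG
DATABASE 'default'
IAM_ROLE 'arn:aws:iam::111122223333:role/MySpectrumRole';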

SELECT 
  eventtype,
  json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
  json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
  json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
  json_extract_path_text(
    currentagentsnapshot,
    'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue
FROM default_schema.kfhconnectblog

The following shows the query output:

Summary

In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This great feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data. This feature is equally important if you are investing in building data lakes on AWS.

 


Additional Reading

If you found this post useful, be sure to check out Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight and Work with partitioned data in AWS Glue.


About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.

 

 

 

10 visualizations to try in Amazon QuickSight with sample data

Post Syndicated from Karthik Kumar Odapally original https://aws.amazon.com/blogs/big-data/10-visualizations-to-try-in-amazon-quicksight-with-sample-data/

If you’re not already familiar with building visualizations for quick access to business insights using Amazon QuickSight, consider this your introduction. In this post, we’ll walk through some common scenarios with sample datasets to provide an overview of how you can connect your data, perform advanced analysis, and access the results from any web browser or mobile device.

The following visualizations are built from the public datasets available in the links below. Before we jump into that, let’s take a look at the supported data sources, file formats, and a typical QuickSight workflow to build any visualization.

Which data sources does Amazon QuickSight support?

At the time of publication, you can use the following data methods:

  • Connect to AWS data sources, including:
    • Amazon RDS
    • Amazon Aurora
    • Amazon Redshift
    • Amazon Athena
    • Amazon S3
  • Upload Excel spreadsheets or flat files (CSV, TSV, CLF, and ELF)
  • Connect to on-premises databases like Teradata, SQL Server, MySQL, and PostgreSQL
  • Import data from SaaS applications like Salesforce and Snowflake
  • Use big data processing engines like Spark and Presto

This list is constantly growing. For more information, see Supported Data Sources.

Answers in instants

SPICE is the Amazon QuickSight super-fast, parallel, in-memory calculation engine, designed specifically for ad hoc data visualization. SPICE stores your data in a system architected for high availability, where it is saved until you choose to delete it. Improve the performance of database datasets by importing the data into SPICE instead of using a direct database query. To calculate how much SPICE capacity your dataset needs, see Managing SPICE Capacity.

Typical Amazon QuickSight workflow

When you create an analysis, the typical workflow is as follows:

  1. Connect to a data source, and then create a new dataset or choose an existing dataset.
  2. (Optional) If you created a new dataset, prepare the data (for example, by changing field names or data types).
  3. Create a new analysis.
  4. Add a visual to the analysis by choosing the fields to visualize. Choose a specific visual type, or use AutoGraph and let Amazon QuickSight choose the most appropriate visual type, based on the number and data types of the fields that you select.
  5. (Optional) Modify the visual to meet your requirements (for example, by adding a filter or changing the visual type).
  6. (Optional) Add more visuals to the analysis.
  7. (Optional) Add scenes to the default story to provide a narrative about some aspect of the analysis data.
  8. (Optional) Publish the analysis as a dashboard to share insights with other users.

The following graphic illustrates a typical Amazon QuickSight workflow.

Visualizations created in Amazon QuickSight with sample datasets

Visualizations for a data analyst

Source:  https://data.worldbank.org/

Download and Resources:  https://datacatalog.worldbank.org/dataset/world-development-indicators

Data catalog:  The World Bank invests in multiple development projects at the national, regional, and global levels. It’s a great source of information for data analysts.

The following graph shows the percentage of the population that has access to electricity (rural and urban) during 2000 in Asia, Africa, the Middle East, and Latin America.

The following graph shows the share of healthcare costs that are paid out-of-pocket (private vs. public). You can also hover over the graph to get detailed statistics at a glance.

Visualizations for a trading analyst

Source:  Deutsche Börse Public Dataset (DBG PDS)

Download and resources:  https://aws.amazon.com/public-datasets/deutsche-boerse-pds/

Data catalog:  The DBG PDS project makes real-time data derived from Deutsche Börse’s trading market systems available to the public for free. This is the first time that such detailed financial market data has been shared freely and continually from the source provider.

The following graph shows the market trend of maximum trade volume for different EU banks. It builds on the data available from the XETRA engines, which covers a variety of equities, funds, and derivative securities. The graph can be scrolled to visualize trades over a period of an hour or more.

The following graph shows common stock leading the rest in maximum trade volume over a period of time, grouped by security type.

Visualizations for a data scientist

Source:  https://catalog.data.gov/

Download and resources:  https://catalog.data.gov/dataset/road-weather-information-stations-788f8

Data catalog:  Data derived from different sensor stations placed on the city bridges and surface streets are a core information source. The road weather information station has a temperature sensor that measures the temperature of the street surface. It also has a sensor that measures the ambient air temperature at the station each second.

The following graph shows the present max air temperature in Seattle from different RWI station sensors.

The following graph shows the minimum temperature of the road surface at different times, which helps predict road conditions at a particular time of the year.

Visualizations for a data engineer

Source:  https://www.kaggle.com/

Download and resources:  https://www.kaggle.com/datasnaek/youtube-new/data

Data catalog:  Kaggle provides a platform where people can donate open datasets. Data engineers and other community members have open access to these datasets and can contribute to the open data movement. The platform hosts more than 350 datasets in total, more than 200 of which are featured datasets. It includes a few interesting datasets that aren’t available elsewhere, and it’s also a place to connect with other data enthusiasts.

The following graph shows the trending YouTube videos and presents the max likes for the top 20 channels. This is one of the most popular datasets for data engineers.

The following graph shows the YouTube daily statistics for the max views of video titles published during a specific time period.

Visualizations for a business user

Source:  New York Taxi Data

Download and resources:  https://data.cityofnewyork.us/Transportation/2016-Green-Taxi-Trip-Data/hvrh-b6nb

Data catalog: NYC Open Data hosts some very popular open datasets for all New Yorkers. The platform lets you dive deep into the data to pull out useful visualizations. The 2016 Green Taxi Trip dataset includes trip records from all trips completed in green taxis in NYC in 2016. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

The following graph presents the maximum fare amount grouped by passenger count over the course of a day. This can be further expanded to cover different days of the month, based on business needs.

The following graph shows New York taxi data from January 2016, highlighting the dip in the number of taxi rides on January 23, 2016, across all types of taxis.

A quick search for that date and location shows you the following news report:

Summary

Using Amazon QuickSight, you can see patterns across time-series data by building visualizations, performing ad hoc analysis, and quickly generating insights. We hope you’ll give it a try today!

 


Additional Reading

If you found this post useful, be sure to check out Amazon QuickSight Adds Support for Combo Charts and Row-Level Security and Visualize AWS Cloudtrail Logs Using AWS Glue and Amazon QuickSight.


Karthik Odapally is a Sr. Solutions Architect in AWS. His passion is to build cost effective and highly scalable solutions on the cloud. In his spare time, he bakes cookies and cupcakes for family and friends here in the PNW. He loves vintage racing cars.

 

 

 

Pranabesh Mandal is a Solutions Architect in AWS. He has over a decade of IT experience. He is passionate about cloud technology and focuses on Analytics. In his spare time, he likes to hike and explore the beautiful nature and wild life of most divine national parks around the United States alongside his wife.

 

 

 

 

How to retain system tables’ data spanning multiple Amazon Redshift clusters and run cross-cluster diagnostic queries

Post Syndicated from Karthik Sonti original https://aws.amazon.com/blogs/big-data/how-to-retain-system-tables-data-spanning-multiple-amazon-redshift-clusters-and-run-cross-cluster-diagnostic-queries/

Amazon Redshift is a data warehouse service that logs the history of the system in STL log tables. The STL log tables manage disk space by retaining only two to five days of log history, depending on log usage and available disk space.

To retain STL tables’ data for an extended period, you usually have to create a replica table for every system table. Then, for each table, you load the data from the system table into the replica at regular intervals. By maintaining replica tables for STL tables, you can run diagnostic queries on historical data from the STL tables. You can then derive insights from query execution times, query plans, and disk-spill patterns, and make better cluster-sizing decisions. However, refreshing replica tables with live data from STL tables at regular intervals requires schedulers such as Cron or AWS Data Pipeline. Also, these tables are specific to one cluster and they are not accessible after the cluster is terminated. This is especially true for transient Amazon Redshift clusters that last for only a finite period of ad hoc query execution.
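
As a rough sketch of that replica-table approach, the refresh job might run statements like the following on a schedule. The history table name is illustrative, and stl_query's starttime column is used here as the incremental marker.

-- One-time setup: create an empty replica with the same columns as the system table
CREATE TABLE stl_query_history AS SELECT * FROM stl_query WHERE 1 = 0;

-- Scheduled refresh: append only the rows newer than what the replica already holds
INSERT INTO stl_query_history
SELECT *
FROM stl_query
WHERE starttime > (SELECT COALESCE(MAX(starttime), '1900-01-01'::timestamp) FROM stl_query_history);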

In this blog post, I present a solution that exports system tables from multiple Amazon Redshift clusters into an Amazon S3 bucket. This solution is serverless, and you can schedule it as frequently as every five minutes. The AWS CloudFormation deployment template that I provide automates the solution setup in your environment. The system tables’ data in the Amazon S3 bucket is partitioned by cluster name and query execution date to enable efficient joins in cross-cluster diagnostic queries.

I also provide another CloudFormation template later in this post. This second template helps to automate the creation of tables in the AWS Glue Data Catalog for the system tables’ data stored in Amazon S3. After the system tables are exported to Amazon S3, you can run cross-cluster diagnostic queries on the system tables’ data and derive insights about query executions in each Amazon Redshift cluster. You can do this using Amazon QuickSight, Amazon Athena, Amazon EMR, or Amazon Redshift Spectrum.

You can find all the code examples in this post, including the CloudFormation templates, AWS Glue extract, transform, and load (ETL) scripts, and the resolution steps for common errors you might encounter in this GitHub repository.

Solution overview

The solution in this post uses AWS Glue to export system tables’ log data from Amazon Redshift clusters into Amazon S3. The AWS Glue ETL jobs are invoked at a scheduled interval by AWS Lambda. AWS Systems Manager, which provides secure, hierarchical storage for configuration data management and secrets management, maintains the details of Amazon Redshift clusters for which the solution is enabled. The last-fetched time stamp values for the respective cluster-table combination are maintained in an Amazon DynamoDB table.

The following diagram covers the key steps involved in this solution.

The solution as illustrated in the preceding diagram flows like this:

  1. The Lambda function, invoke_rs_stl_export_etl, is triggered at regular intervals, as controlled by Amazon CloudWatch. It’s triggered to look up the AWS Systems Manager parameter store to get the details of the Amazon Redshift clusters for which the system table export is enabled.
  2. The same Lambda function, based on the Amazon Redshift cluster details obtained in step 1, invokes the AWS Glue ETL job designated for the Amazon Redshift cluster. If an ETL job for the cluster is not found, the Lambda function creates one.
  3. The ETL job invoked for the Amazon Redshift cluster gets the cluster credentials from the parameter store. It gets from the DynamoDB table the last exported time stamp of when each of the system tables was exported from the respective Amazon Redshift cluster.
  4. The ETL job unloads the system tables’ data from the Amazon Redshift cluster into an Amazon S3 bucket.
  5. The ETL job updates the DynamoDB table with the last exported time stamp value for each system table exported from the Amazon Redshift cluster.
  6. The Amazon Redshift cluster system tables’ data is available in Amazon S3 and is partitioned by cluster name and date for running cross-cluster diagnostic queries.

Understanding the configuration data

This solution uses AWS Systems Manager parameter store to store the Amazon Redshift cluster credentials securely. The parameter store also securely stores other configuration information that the AWS Glue ETL job needs for extracting and storing system tables’ data in Amazon S3. Systems Manager comes with a default AWS Key Management Service (AWS KMS) key that it uses to encrypt the password component of the Amazon Redshift cluster credentials.

The following table explains the global parameters and cluster-specific parameters required in this solution. The global parameters are defined once and applicable at the overall solution level. The cluster-specific parameters are specific to an Amazon Redshift cluster and repeat for each cluster for which you enable this post’s solution. The CloudFormation template explained later in this post creates these parameters as part of the deployment process.

Parameter name Type Description
Global parameters (defined once and applied to all jobs)
redshift_query_logs.global.s3_prefix String The Amazon S3 path where the query logs are exported. Under this path, each exported table is partitioned by cluster name and date.
redshift_query_logs.global.tempdir String The Amazon S3 path that AWS Glue ETL jobs use for temporarily staging the data.
redshift_query_logs.global.role String The name of the role that the AWS Glue ETL jobs assume. Just the role name is sufficient. The complete Amazon Resource Name (ARN) is not required.
redshift_query_logs.global.enabled_cluster_list StringList A comma-separated list of cluster names for which system tables’ data export is enabled. This gives flexibility for a user to exclude certain clusters.
Cluster-specific parameters (one set for each cluster specified in the enabled_cluster_list parameter)
redshift_query_logs.<<cluster_name>>.connection String The name of the AWS Glue Data Catalog connection to the Amazon Redshift cluster. For example, if the cluster name is product_warehouse, the entry is redshift_query_logs.product_warehouse.connection.
redshift_query_logs.<<cluster_name>>.user String The user name that AWS Glue uses to connect to the Amazon Redshift cluster.
redshift_query_logs.<<cluster_name>>.password SecureString The password that AWS Glue uses to connect to the Amazon Redshift cluster, encrypted with a key that is managed in AWS KMS.

For example, suppose that you have two Amazon Redshift clusters, product-warehouse and category-management, for which the solution described in this post is enabled. In this case, the parameters shown in the following screenshot are created by the solution deployment CloudFormation template in the AWS Systems Manager parameter store.

Solution deployment

To make it easier for you to get started, I created a CloudFormation template that automatically configures and deploys the solution—only one step is required after deployment.

Prerequisites

To deploy the solution, you must have one or more Amazon Redshift clusters in a private subnet. This subnet must have a network address translation (NAT) gateway or a NAT instance configured, and also a security group with a self-referencing inbound rule for all TCP ports. For more information about why AWS Glue ETL needs the configuration it does, described previously, see Connecting to a JDBC Data Store in a VPC in the AWS Glue documentation.

To start the deployment, launch the CloudFormation template:

CloudFormation stack parameters

The following table lists and describes the parameters for deploying the solution to export query logs from multiple Amazon Redshift clusters.

Property Default Description
S3Bucket mybucket The bucket this solution uses to store the exported query logs, stage code artifacts, and perform unloads from Amazon Redshift. For example, the mybucket/extract_rs_logs/data prefix is used for storing all the exported query logs for each system table, partitioned by cluster. The mybucket/extract_rs_logs/temp/ prefix is used for temporarily staging the unloaded data from Amazon Redshift. The mybucket/extract_rs_logs/code prefix is used for storing all the code artifacts required for Lambda and the AWS Glue ETL jobs.
ExportEnabledRedshiftClusters Requires Input A comma-separated list of cluster names from which the system table logs need to be exported.
DataStoreSecurityGroups Requires Input A list of security groups with an inbound rule to the Amazon Redshift clusters provided in the parameter, ExportEnabledClusters. These security groups should also have a self-referencing inbound rule on all TCP ports, as explained on Connecting to a JDBC Data Store in a VPC.

After you launch the template and create the stack, you see that the following resources have been created:

  1. AWS Glue connections for each Amazon Redshift cluster you provided in the CloudFormation stack parameter, ExportEnabledRedshiftClusters.
  2. All parameters required for this solution created in the parameter store.
  3. The Lambda function that invokes the AWS Glue ETL jobs for each configured Amazon Redshift cluster at a regular interval of five minutes.
  4. The DynamoDB table that captures the last exported time stamps for each exported cluster-table combination.
  5. The AWS Glue ETL jobs to export query logs from each Amazon Redshift cluster provided in the CloudFormation stack parameter, ExportEnabledRedshiftClusters.
  6. The IAM roles and policies required for the Lambda function and AWS Glue ETL jobs.

After the deployment

For each Amazon Redshift cluster for which you enabled the solution through the CloudFormation stack parameter, ExportEnabledRedshiftClusters, the automated deployment includes temporary credentials that you must update after the deployment:

  1. Go to the parameter store.
  2. Note the parameters redshift_query_logs.<<cluster_name>>.user and redshift_query_logs.<<cluster_name>>.password that correspond to each Amazon Redshift cluster for which you enabled this solution. Edit these parameters to replace the placeholder values with the right credentials.

For example, if product-warehouse is one of the clusters for which you enabled system table export, you edit these two parameters with the right user name and password and choose Save parameter.

Querying the exported system tables

Within a few minutes after the solution deployment, you should see Amazon Redshift query logs being exported to the Amazon S3 location, <<S3Bucket_you_provided>>/extract_redshift_query_logs/data/. In that location, you should see the eight system tables partitioned by cluster name and date: stl_alert_event_log, stl_ddltext, stl_explain, stl_query, stl_querytext, stl_scan, stl_utilitytext, and stl_wlm_query.

To run cross-cluster diagnostic queries on the exported system tables, create external tables in the AWS Glue Data Catalog. To make it easier for you to get started, I provide a CloudFormation template that creates an AWS Glue crawler, which crawls the exported system tables stored in Amazon S3 and builds the external tables in the AWS Glue Data Catalog.

Launch this CloudFormation template to create external tables that correspond to the Amazon Redshift system tables. S3Bucket is the only input parameter required for this stack deployment. Provide the same Amazon S3 bucket name where the system tables’ data is being exported. After you successfully create the stack, you can see the eight tables in the database, redshift_query_logs_db, as shown in the following screenshot.

Now, navigate to the Athena console to run cross-cluster diagnostic queries. The following screenshot shows a diagnostic query executed in Athena that retrieves query alerts logged across multiple Amazon Redshift clusters.
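
For example, a diagnostic query along the following lines counts alert events by cluster and day across every exported cluster. The partition column names cluster_name and log_date are assumptions for illustration; use whatever partition names the crawler registered in your Data Catalog.

SELECT cluster_name,
       log_date,
       event,
       COUNT(*) AS alert_count
FROM redshift_query_logs_db.stl_alert_event_log
GROUP BY cluster_name, log_date, event
ORDER BY alert_count DESC;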

You can build the following example Amazon QuickSight dashboard by running cross-cluster diagnostic queries on Athena to identify the hourly query count and the key query alert events across multiple Amazon Redshift clusters.

How to extend the solution

You can extend this post’s solution in two ways:

  • Add any new Amazon Redshift clusters that you spin up after you deploy the solution.
  • Add other system tables or custom query results to the list of exports from an Amazon Redshift cluster.

Extend the solution to other Amazon Redshift clusters

To extend the solution to more Amazon Redshift clusters, add the three cluster-specific parameters in the AWS Systems Manager parameter store following the guidelines earlier in this post. Modify the redshift_query_logs.global.enabled_cluster_list parameter to append the new cluster to the comma-separated string.

Extend the solution to add other tables or custom queries to an Amazon Redshift cluster

The current solution ships with the export functionality for the following Amazon Redshift system tables:

  • stl_alert_event_log
  • stl_ddltext
  • stl_explain
  • stl_query
  • stl_querytext
  • stl_scan
  • stl_utilitytext
  • stl_wlm_query

You can easily add another system table or custom query by adding a few lines of code to the AWS Glue ETL job, <<cluster-name>>_extract_rs_query_logs. For example, suppose that from the product-warehouse Amazon Redshift cluster you want to export orders greater than $2,000. To do so, add the following five lines of code to the AWS Glue ETL job product-warehouse_extract_rs_query_logs, where product-warehouse is your cluster name:

  1. Get the last-processed time-stamp value. The function creates a value if it doesn’t already exist.

salesLastProcessTSValue = functions.getLastProcessedTSValue(trackingEntry="mydb.sales_2000", job_configs=job_configs)

  2. Run the custom query with the time stamp.

returnDF = functions.runQuery(query="select * from sales s join order o where o.order_amnt > 2000 and sale_timestamp > '{}'".format(salesLastProcessTSValue), tableName="mydb.sales_2000", job_configs=job_configs)

  3. Save the results to Amazon S3.

functions.saveToS3(dataframe=returnDF, s3Prefix=s3Prefix, tableName="mydb.sales_2000", partitionColumns=["sale_date"], job_configs=job_configs)

  4. Get the latest time-stamp value from the data frame returned in Step 2.

latestTimestampVal = functions.getMaxValue(returnDF, "sale_timestamp", job_configs)

  5. Update the last-processed time-stamp value in the DynamoDB table.

functions.updateLastProcessedTSValue("mydb.sales_2000", latestTimestampVal[0], job_configs)

Conclusion

In this post, I demonstrate a serverless solution to retain the system tables’ log data across multiple Amazon Redshift clusters. By using this solution, you can incrementally export the data from system tables into Amazon S3. By performing this export, you can build cross-cluster diagnostic queries, build audit dashboards, and derive insights into capacity planning by using services such as Athena. I also demonstrate how you can extend this solution to other ad hoc query use cases or tables other than system tables by adding a few lines of code.


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production and Amazon Redshift – 2017 Recap.


About the Author

Karthik Sonti is a senior big data architect at Amazon Web Services. He helps AWS customers build big data and analytical solutions and provides guidance on architecture and best practices.

 

 

 

 

AWS Achieves Spain’s ENS High Certification Across 29 Services

Post Syndicated from Oliver Bell original https://aws.amazon.com/blogs/security/aws-achieves-spains-ens-high-certification-across-29-services/

AWS has achieved Spain’s Esquema Nacional de Seguridad (ENS) High certification across 29 services. To successfully achieve the ENS High Standard, BDO España conducted an independent audit and attested that AWS meets confidentiality, integrity, and availability standards. This provides the assurance needed by Spanish Public Sector organizations wanting to build secure applications and services on AWS.

The National Security Framework, regulated under Royal Decree 3/2010, was developed through close collaboration between ENAC (Entidad Nacional de Acreditación), the Ministry of Finance and Public Administration and the CCN (National Cryptologic Centre), and other administrative bodies.

The following AWS Services are ENS High accredited across our Dublin and Frankfurt Regions:

  • Amazon API Gateway
  • Amazon DynamoDB
  • Amazon Elastic Container Service
  • Amazon Elastic Block Store
  • Amazon Elastic Compute Cloud
  • Amazon Elastic File System
  • Amazon Elastic MapReduce
  • Amazon ElastiCache
  • Amazon Glacier
  • Amazon Redshift
  • Amazon Relational Database Service
  • Amazon Simple Queue Service
  • Amazon Simple Storage Service
  • Amazon Simple Workflow Service
  • Amazon Virtual Private Cloud
  • Amazon WorkSpaces
  • AWS CloudFormation
  • AWS CloudTrail
  • AWS Config
  • AWS Database Migration Service
  • AWS Direct Connect
  • AWS Directory Service
  • AWS Elastic Beanstalk
  • AWS Key Management Service
  • AWS Lambda
  • AWS Snowball
  • AWS Storage Gateway
  • Elastic Load Balancing
  • VM Import/Export

Amazon Redshift – 2017 Recap

Post Syndicated from Larry Heathcote original https://aws.amazon.com/blogs/big-data/amazon-redshift-2017-recap/

We have been busy adding new features and capabilities to Amazon Redshift, and we wanted to give you a glimpse of what we’ve been doing over the past year. In this article, we recap a few of our enhancements and provide a set of resources that you can use to learn more and get the most out of your Amazon Redshift implementation.

In 2017, we made more than 30 announcements about Amazon Redshift. We listened to you, our customers, and delivered Redshift Spectrum, a feature of Amazon Redshift that gives you the ability to extend analytics to your data lake—without moving data. We launched new DC2 nodes, doubling performance at the same price. We also announced many new features that provide greater scalability, better performance, more automation, and easier ways to manage your analytics workloads.

To see a full list of our launches, visit our what’s new page—and be sure to subscribe to our RSS feed.

Major launches in 2017

Amazon Redshift Spectrum: extend analytics to your data lake, without moving data

We launched Amazon Redshift Spectrum to give you the freedom to store data in Amazon S3, in open file formats, and have it available for analytics without the need to load it into your Amazon Redshift cluster. It enables you to easily join datasets across Redshift clusters and S3 to provide unique insights that you would not be able to obtain by querying independent data silos.

With Redshift Spectrum, you can run SQL queries against data in an Amazon S3 data lake as easily as you analyze data stored in Amazon Redshift. And you can do it without loading data or resizing the Amazon Redshift cluster based on growing data volumes. Redshift Spectrum separates compute and storage to meet workload demands for data size, concurrency, and performance. Redshift Spectrum scales processing across thousands of nodes, so results are fast, even with massive datasets and complex queries. You can query open file formats that you already use—such as Apache Avro, CSV, Grok, ORC, Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV—directly in Amazon S3, without any data movement.

“For complex queries, Redshift Spectrum provided a 67 percent performance gain,” said Rafi Ton, CEO, NUVIAD. “Using the Parquet data format, Redshift Spectrum delivered an 80 percent performance improvement. For us, this was substantial.”

To learn more about Redshift Spectrum, watch our AWS Summit session Intro to Amazon Redshift Spectrum: Now Query Exabytes of Data in S3, and read our announcement blog post Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data.

DC2 nodes—twice the performance of DC1 at the same price

We launched second-generation Dense Compute (DC2) nodes to provide low latency and high throughput for demanding data warehousing workloads. DC2 nodes feature powerful Intel E5-2686 v4 (Broadwell) CPUs, fast DDR4 memory, and NVMe-based solid state disks (SSDs). We’ve tuned Amazon Redshift to take advantage of the better CPU, network, and disk on DC2 nodes, providing up to twice the performance of DC1 at the same price. Our DC2.8xlarge instances now provide twice the memory per slice of data and an optimized storage layout with 30 percent better storage utilization.

“Redshift allows us to quickly spin up clusters and provide our data scientists with a fast and easy method to access data and generate insights,” said Bradley Todd, technology architect at Liberty Mutual. “We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1.”

Read our customer testimonials to see the performance gains our customers are experiencing with DC2 nodes. To learn more, read our blog post Amazon Redshift Dense Compute (DC2) Nodes Deliver Twice the Performance as DC1 at the Same Price.

Performance enhancements—3x to 5x faster queries

On average, our customers are seeing 3x to 5x performance gains for most of their critical workloads.

We introduced short query acceleration to speed up execution of queries such as reports, dashboards, and interactive analysis. Short query acceleration uses machine learning to predict the execution time of a query, and to move short running queries to an express short query queue for faster processing.

We launched results caching to deliver sub-second response times for queries that are repeated, such as dashboards, visualizations, and those from BI tools. Results caching has an added benefit of freeing up resources to improve the performance of all other queries.

We also introduced late materialization to reduce the amount of data scanned for queries with predicate filters. Amazon Redshift evaluates the predicates in batches first, and then fetches data blocks for the remaining columns only for the rows that pass the filters. For example, if only 10 percent of the table rows satisfy the predicate filters, Amazon Redshift can potentially save 90 percent of the I/O on the remaining columns, improving query performance.

We launched query monitoring rules and pre-defined rule templates. These features make it easier for you to set metrics-based performance boundaries for workload management (WLM) queries, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.
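As a sketch of what such rules can look like, the snippet below sets the wlm_json_configuration parameter on a hypothetical parameter group with two rules drawn from the examples above: abort queries that run longer than 60 seconds, and log queries that perform nested loop joins. The parameter group name, queue layout, and concurrency values are placeholders; adapt them to your own WLM setup:

```python
import json
import boto3

# A WLM configuration with one monitored queue plus a default queue.
wlm_config = [
    {
        "query_concurrency": 5,
        "rules": [
            {
                "rule_name": "abort_long_queries",
                "predicate": [
                    # query_execution_time is measured in seconds
                    {"metric_name": "query_execution_time", "operator": ">", "value": 60}
                ],
                "action": "abort",
            },
            {
                "rule_name": "log_nested_loops",
                "predicate": [
                    {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 0}
                ],
                "action": "log",
            },
        ],
    },
    {"query_concurrency": 5},  # default queue
]

redshift = boto3.client("redshift", region_name="us-east-1")
redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-wlm",  # assumed parameter group name
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)
```

Rules that take the log action write matching queries to the STL_WLM_RULE_ACTION system table, so you can review them later.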

Customer insights

Amazon Redshift and Redshift Spectrum serve customers across a variety of industries and sizes, from startups to large enterprises. Visit our customer page to see the success that customers are having with our recent enhancements. Learn how companies like Liberty Mutual Insurance saw a 9x reduction in month-end reporting time using DC2 nodes. On this page, you can find case studies, videos, and other content that show how our customers are using Amazon Redshift to drive innovation and business results.

In addition, check out these resources to learn how our customers are building data warehouse and data lake integration solutions with Amazon Redshift:

Partner solutions

You can enhance your Amazon Redshift data warehouse by working with industry-leading experts. Our AWS Partner Network (APN) Partners have certified their solutions to work with Amazon Redshift. They offer software, tools, integration, and consulting services to help you at every step. Visit our Amazon Redshift Partner page and choose an APN Partner. Or, use AWS Marketplace to find and immediately start using third-party software.

To see what our Partners are saying about Amazon Redshift Spectrum and our DC2 nodes mentioned earlier, read these blog posts:

Resources

Blog posts

Visit the AWS Big Data Blog for a list of all Amazon Redshift articles.

YouTube videos

GitHub

Our community of experts contribute on GitHub to provide tips and hints that can help you get the most out of your deployment. Visit GitHub frequently to get the latest technical guidance, code samples, administrative task automation utilities, the analyze & vacuum schema utility, and more.

Customer support

If you are evaluating or considering a proof of concept with Amazon Redshift, or you need assistance migrating your on-premises or other cloud-based data warehouse to Amazon Redshift, our team of product experts and solutions architects can help you with architecting, sizing, and optimizing your data warehouse. Contact us using this support request form, and let us know how we can assist you.

If you are an Amazon Redshift customer, we offer a no-cost health check program. Our team of database engineers and solutions architects give you recommendations for optimizing Amazon Redshift and Amazon Redshift Spectrum for your specific workloads. To learn more, email us at [email protected].

If you have any questions, email us at [email protected].

Additional Reading

If you found this post useful, be sure to check out Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data, Using Amazon Redshift for Fast Analytical Reports, and How to Migrate Your Oracle Data Warehouse to Amazon Redshift Using AWS SCT and AWS DMS.


About the Author

Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out, and the taste of classic barbeque.