Tag Archives: Analytics

New for Amazon Redshift – Data Lake Export and Federated Query

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-data-lake-export-and-federated-queries/

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools.

To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets.

Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake:

  • Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics.
  • Federated Query to query, directly from a Redshift cluster, across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases.

This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services.

Let’s take a closer look at the interactions in the diagram, starting with how you can use these features and the advantages they provide.

Using Redshift Data Lake Export

You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Redshift into your S3 data lake in an open format.

You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker.

To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events.

Let’s build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the sales and date tables, adding information on the gross sales for an event (total_price in the query) and its percentile in terms of all-time gross sales compared to all other events.

To export the result of the query to S3 in Parquet format, I use the following SQL command:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid) event_totals) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (CLI). As expected, the output of the query is exported using the Parquet columnar data format:

$ aws s3 ls s3://MY-BUCKET/DataLake/Sales/
2019-11-25 14:26:56 1638550 0000_part_00.parquet
2019-11-25 14:26:56 1635489 0001_part_00.parquet
2019-11-25 14:26:56 1624418 0002_part_00.parquet
2019-11-25 14:26:56 1646179 0003_part_00.parquet

To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost.

To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales.

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid) event_totals) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/SalesPartitioned/'
FORMAT AS PARQUET
PARTITION BY (year, caldate)
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

This time, the output of the query is stored in multiple partitions. For example, here’s the content of a folder for a specific year and date:

$ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/
2019-11-25 14:36:17 11940 0000_part_00.parquet
2019-11-25 14:36:17 11052 0001_part_00.parquet
2019-11-25 14:36:17 11138 0002_part_00.parquet
2019-11-25 14:36:18 12582 0003_part_00.parquet

Optionally, I can use AWS Glue to set up a Crawler that (on demand or on a schedule) looks for data in my S3 bucket to update the Glue Data Catalog. When the Data Catalog is updated, I can easily query the data using Redshift Spectrum, Athena, or EMR.
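
For example, once the Crawler has cataloged the unloaded files, I can point a Redshift Spectrum external schema at that Glue database and query the Parquet files straight from the cluster. The following is only a minimal sketch: the datalake schema and database, the sales_partitioned table, and the string type of the year partition column are assumptions standing in for whatever the Crawler actually creates in your account.

CREATE EXTERNAL SCHEMA IF NOT EXISTS datalake
FROM DATA CATALOG
DATABASE 'datalake'
IAM_ROLE 'arn:aws:iam::123412341234:role/myRedshiftRole';

SELECT eventid,
       max(total_price) AS gross_sales,
       min(percentile)  AS sales_percentile
  FROM datalake.sales_partitioned
 WHERE year = '2008'
 GROUP BY eventid
 ORDER BY gross_sales DESC
 LIMIT 10;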

The sales data is now ready to be processed together with the unstructured and semi-structured (JSON, XML, Parquet) data in my data lake. For example, I can now use Apache Spark with EMR, or any SageMaker built-in algorithm, to access the data and get new insights.

Using Redshift Federated Query
You can now also access data in RDS and Aurora PostgreSQL stores directly from your Redshift data warehouse. In this way, you can access data as soon as it is available. Straight from Redshift, you can now perform queries processing data in your data warehouse, transactional databases, and data lake, without requiring ETL jobs to transfer data to the data warehouse.

Redshift leverages its advanced optimization capabilities to push down and distribute a significant portion of the computation directly into the transactional databases, minimizing the amount of data moving over the network.

Using this syntax, you can add an external schema from an RDS or Aurora PostgreSQL database to a Redshift cluster:

CREATE EXTERNAL SCHEMA IF NOT EXISTS online_system
FROM POSTGRES
DATABASE 'online_sales_db' SCHEMA 'online_system'
URI 'my-hostname' PORT 5432
IAM_ROLE 'iam-role-arn'
SECRET_ARN 'ssm-secret-arn';

The schema and port are optional here. The schema defaults to public if left unspecified, and the default port for PostgreSQL databases is 5432. Redshift uses AWS Secrets Manager to manage the credentials used to connect to the external databases.

With this command, all tables in the external schema are available and can be used by Redshift for any complex SQL query processing data in the cluster or, using Redshift Spectrum, in your S3 data lake.

Coming back to the sales data example I used before, I can now correlate the trends of my historical data of musical events with real-time sales. In this way, I can understand if an event is performing as expected or not, and calibrate my marketing activities without delays.

For example, after I define the online commerce database as the online_system external schema in my Redshift cluster, I can compare previous sales with what is in the online commerce system with this simple query:

SELECT eventid,
       sum(pricepaid) total_price,
       sum(online_pricepaid) online_total_price
  FROM sales, online_system.current_sales
 WHERE eventid = online_eventid
 GROUP BY eventid;

Redshift doesn’t import the database or schema catalog in its entirety. When a query is run, it localizes the metadata for the Aurora and RDS tables (and views) that are part of the query. This localized metadata is then used for query compilation and plan generation.
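
Putting the pieces together, a single statement can now join data across all three tiers. The following is a minimal sketch that reuses the online_system schema from above; the spectrum.sales_unloaded table standing in for the S3 data lake is a hypothetical name for an external (Spectrum) table over the files exported earlier.

SELECT sales.eventid,
       sum(pricepaid)        AS total_price,
       sum(online_pricepaid) AS online_total_price,
       max(percentile)       AS all_time_percentile
  FROM sales, online_system.current_sales,
       (SELECT eventid, max(percentile) AS percentile
          FROM spectrum.sales_unloaded
         GROUP BY eventid) lake_percentiles
 WHERE sales.eventid = online_eventid
   AND lake_percentiles.eventid = sales.eventid
 GROUP BY sales.eventid;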

Available Now
Amazon Redshift data lake export is a new tool to improve your data processing pipeline and is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Redshift availability, and check the version of your clusters.
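
A quick way to check the version is to query the cluster itself; the exact string returned depends on your cluster, but it ends with the Redshift release number:

SELECT version();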

The new federation capability in Amazon Redshift is released as a public preview and allows you to bring together data stored in Redshift, S3, and one or more RDS and Aurora PostgreSQL databases. When creating a cluster in the Amazon Redshift management console, you can pick one of three maintenance tracks: Current, Trailing, or Preview. Within the Preview track, choose preview_features to participate in the Federated Query public preview.

These features simplify data processing and analytics, giving you more tools to react quickly, and a single point of view for your data. Let me know what you are going to use them for!

Danilo

Highlight the breadth of your data and analytics technical expertise with new AWS Certification beta

Post Syndicated from Beth Shepherd original https://aws.amazon.com/blogs/big-data/highlight-the-breadth-of-your-data-and-analytics-technical-expertise-with-new-aws-certification-beta/

AWS offers the broadest set of analytics tools and engines that analyze data using open formats and open standards. To validate expertise with AWS data analytics solutions, builders can now take the beta for the AWS Certified Data Analytics — Specialty certification.

The AWS Certified Data Analytics — Specialty certification validates technical expertise with designing, building, securing, and maintaining analytics solutions on AWS. This certification first launched in 2017 as AWS Certified Big Data — Specialty. The new name highlights the breadth of the data and analytics technical skills and experience validated by the certification. Candidates who take and pass the beta exam available for registration now or the general availability release in April 2020 will earn a certification with the new name.

The new exam version includes updated content across categories from collection to visualization. View the exam guide to learn more about what’s on the exam.

This is the only AWS Certification to specifically focus on data analytics expertise. This certification demonstrates your ability to design and implement analytics solutions that provide insight through visualizing data with the appropriate security measures and automation in place.

You can take the AWS Certified Data Analytics — Specialty beta exam at testing centers worldwide through January 10, 2020, and onsite at re:Invent 2019. Space is limited, so register today. The beta exam is available in English for 150 USD, which is 50% off the standard pricing for Specialty-level certifications. Results for the beta exam will be available approximately 90 days after the end of the beta period. If you miss the beta, the standard version is expected in April 2020.


About the Authors

Beth Shepherd is the Product Marketing Manager for AWS Certification. She joined Amazon in 2019 and is based in Boston.

Bill Baldwin is the Head of AWS Data Days and the WW Tech Leader AWS Databases. He has been with Amazon since 2016 and is based in Atlanta.

Extract, Transform and Load data into S3 data lake using CTAS and INSERT INTO statements in Amazon Athena

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/extract-transform-and-load-data-into-s3-data-lake-using-ctas-and-insert-into-statements-in-amazon-athena/

Amazon Athena is an interactive query service that makes it easy to analyze the data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. You can reduce your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. To learn more about best practices to boost query performance and reduce costs, see Top 10 Performance Tuning Tips for Amazon Athena.

Overview

This blog post discusses how to use Athena for extract, transform and load (ETL) jobs for data processing. This example optimizes the dataset for analytics by partitioning it and converting it to a columnar data format using Create Table as Select (CTAS) and INSERT INTO statements.

CTAS statements create new tables using standard SELECT queries to filter data as required. You can also partition the data, specify compression, and convert the data into columnar formats like Apache Parquet and Apache ORC using CTAS statements. As part of the execution, the resultant tables and partitions are added to the AWS Glue Data Catalog, making them immediately available for subsequent queries.

INSERT INTO statements insert new rows into a destination table based on a SELECT query statement that runs on a source table. If the source table’s underlying data is in CSV format and the destination table’s data is in Parquet format, then INSERT INTO can easily transform and load data into the destination table’s format. CTAS and INSERT INTO statements can be used together to perform an initial batch conversion of data as well as incremental updates to the existing table.

Here is an overview of the ETL steps to be followed in Athena for data conversion:

  1. Create a table on the original dataset.
  2. Use a CTAS statement to create a new table in which the format, compression, partition fields and location of the new table can be specified.
  3. Add more data into the table using an INSERT INTO statement.

This example uses a subset of NOAA Global Historical Climatology Network Daily (GHCN-D), a publicly available dataset on Amazon S3.

This subset of data is available at the following S3 location:

s3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/
Total objects: 41727 
Size of CSV dataset: 11.3 GB
Region: us-east-1

Procedure

Follow these steps to use Athena for an ETL job.

Create a table based on original dataset

The original data is in CSV format with no partitions in Amazon S3. The following files are stored at the Amazon S3 location:

2019-10-31 13:06:57  413.1 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0000
2019-10-31 13:06:57  412.0 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0001
2019-10-31 13:06:57   34.4 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0002
2019-10-31 13:06:57  412.2 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0100
2019-10-31 13:06:57  412.7 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0101

Note that the file sizes are pretty small. Merging them into larger files and reducing the total number of files would lead to faster query execution. CTAS and INSERT INTO can help achieve this.

Execute the queries in the Athena console (preferably in us-east-1 to avoid inter-Region Amazon S3 data transfer charges). First, create a database for this demo:

CREATE DATABASE blogdb

Now, create a table from the data above.

CREATE EXTERNAL TABLE `blogdb`.`original_csv` (
  `id` string, 
  `date` string, 
  `element` string, 
  `datavalue` bigint, 
  `mflag` string, 
  `qflag` string, 
  `sflag` string, 
  `obstime` bigint)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/'

Use CTAS to partition data and convert into Parquet format with Snappy compression

Now, convert the data to Parquet format with Snappy compression and partition the data on a yearly basis. All these actions are performed using the CTAS statement. For the purpose of this blog, the initial table only includes data from 2015 to 2019. You can add new data to this table using the INSERT INTO command.

The table created in Step 1 has a date field with the date formatted as YYYYMMDD (e.g. 20100104). The new table is partitioned on year. Extract the year value from the date field using the Presto function substr(“date”,1,4).

CREATE table new_parquet
WITH (format='PARQUET', 
parquet_compression='SNAPPY', 
partitioned_by=array['year'], 
external_location = 's3://your-bucket/optimized-data/') 
AS
SELECT id,
         date,
         element,
         datavalue,
         mflag,
         qflag,
         sflag,
         obstime,
         substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) >= 2015
        AND cast(substr("date",1,4) AS bigint) <= 2019

Once the query completes successfully, check the Amazon S3 location specified in the CTAS statement above. You should see partitions and Parquet files in each of those partitions, as shown in the following examples (a SQL alternative for checking the partitions follows the listings):

  1. Partitions:
    $ aws s3 ls s3://your-bucket/optimized-data/
                               PRE year=2015/
                               PRE year=2016/
                               PRE year=2017/
                               PRE year=2018/
                               PRE year=2019/

  2. Parquet files:
    $ aws s3 ls s3://your-bucket/optimized-data/ --recursive --human-readable | head -5
    
    2019-10-31 14:51:05    7.3 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_1be48df2-3154-438b-b61d-8fb23809679d
    2019-10-31 14:51:05    7.0 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_2a57f4e2-ffa0-4be3-9c3f-28b16d86ed5a
    2019-10-31 14:51:05    9.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_34381db1-00ca-4092-bd65-ab04e06dc799
    2019-10-31 14:51:05    7.5 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_354a2bc1-345f-4996-9073-096cb863308d
    2019-10-31 14:51:05    6.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_42da4cfd-6e21-40a1-8152-0b902da385a1
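
Instead of listing the bucket, you can also confirm the partitions from Athena itself with a single statement:

SHOW PARTITIONS new_parquet;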

Add more data into table using INSERT INTO statement

Now, add more data and partitions into the new table created above. The original dataset has data from 2010 to 2019. Since you added 2015 to 2019 using CTAS, add the rest of the data now using an INSERT INTO statement:

INSERT INTO new_parquet
SELECT id,
         date,
         element,
         datavalue,
         mflag,
         qflag,
         sflag,
         obstime,
         substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) < 2015

List the Amazon S3 location of the new table:

 $ aws s3 ls s3://your-bucket/optimized-data/
                           PRE year=2010/
                           PRE year=2011/
                           PRE year=2012/
                           PRE year=2013/
                           PRE year=2014/
                           PRE year=2015/
                           PRE year=2016/
                           PRE year=2017/
                           PRE year=2018/
                           PRE year=2019/ 

You can see that INSERT INTO is able to determine that “year” is a partition column and writes the data to Amazon S3 accordingly. There is also a significant reduction in the total size of the dataset thanks to compression and columnar storage in the Parquet format:

Size of the dataset after Parquet conversion with Snappy compression: 1.2 GB
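
As a quick sanity check on what the INSERT INTO statements added, you can also count rows per partition directly in Athena; the exact counts depend on the dataset at the time you run it:

SELECT year, count(*) AS records
FROM new_parquet
GROUP BY year
ORDER BY year;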

You can also run INSERT INTO statements if more CSV data is added to original table. Assume you have new data for the year 2020 added to the original Amazon S3 dataset. In that case, you can run the following INSERT INTO statement to add this data and the relevant partition(s) to the new_parquet table:

INSERT INTO new_parquet
SELECT id,
         date,
         element,
         datavalue,
         mflag,
         qflag,
         sflag,
         obstime,
         substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) = 2020

Query the results

Now that you have transformed the data, run some queries to see what you gained in terms of performance and cost optimization.

First, find the number of distinct IDs for every value of the year:

    1. Query on the original table:
      SELECT substr("date",1,4) as year, 
             COUNT(DISTINCT id) 
      FROM original_csv 
      GROUP BY 1 ORDER BY 1 DESC

    2. Query on the new table:
      SELECT year, 
        COUNT(DISTINCT id) 
      FROM new_parquet 
      GROUP BY  1 ORDER BY 1 DESC

      Original table: run time 16.88 seconds, data scanned 11.35 GB, cost $0.0567
      New table: run time 3.79 seconds, data scanned 428.05 MB, cost $0.002145
      Savings: 77.5% faster and 96.2% cheaper


Next, calculate the average maximum temperature (Celsius), average minimum temperature (Celsius), and average rainfall (mm) for the Earth in 2018:

      1. Query on the original table:
        SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value
        FROM original_csv
        WHERE element IN ('TMIN', 'TMAX', 'PRCP') AND substr("date",1,4) = '2018'
        GROUP BY  1

      2. Query on the new table:
        SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value
        FROM new_parquet 
        WHERE element IN ('TMIN', 'TMAX', 'PRCP') and year = '2018'
        GROUP BY  1

        Original table: run time 18.65 seconds, data scanned 11.35 GB, cost $0.0567
        New table: run time 1.92 seconds, data scanned 68.08 MB, cost $0.000345
        Savings: 90% faster and 99.4% cheaper


Conclusion

This post showed you how to perform ETL operations using CTAS and INSERT INTO statements in Athena. You can perform the first set of transformations using a CTAS statement. When new data arrives, use an INSERT INTO statement to transform and load data to the table created by the CTAS statement. Using this approach, you converted data to the Parquet format with Snappy compression, converted a non-partitioned dataset to a partitioned dataset, reduced the overall size of the dataset and lowered the costs of running queries in Athena.

About the Author

 Pathik Shah is a big data architect for Amazon EMR at AWS.

Connect Amazon Athena to your Apache Hive Metastore and use user-defined functions

Post Syndicated from Janak Agarwal original https://aws.amazon.com/blogs/big-data/connect-amazon-athena-to-your-apache-hive-metastore-and-use-user-defined-functions/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. This post details the two new preview features that you can start using today: connecting to Apache Hive Metastore and using user-defined functions.

Connecting Athena to your Apache Hive Metastore

Several customers use the Hive Metastore as a common metadata catalog for their big data environments. Such customers run Apache Spark, Presto, and Apache Hive on Amazon EC2 and Amazon EMR clusters with a self-hosted Hive Metastore as the common catalog. AWS also offers the AWS Glue Data Catalog, which is a fully managed catalog and drop-in replacement for the Hive Metastore. With this release, you can now use the Hive Metastore in addition to the Data Catalog with Athena. Athena now allows you to connect to multiple Hive Metastores along with the existing Data Catalog.

To connect to a self-hosted Hive Metastore, you need a metastore connector. You can download a reference implementation of this connector, which runs as a Lambda function in your account. The current version of the implementation supports only SELECT queries, and DDL support is limited to basic metadata syntax. For more information, see the Considerations and Limitations for this feature. You can also write your own Hive Metastore connector, using the reference implementation as an example, deploy it as a Lambda function, and subsequently use it with Athena. For more information about the feature, see the Using Athena Data Connector for External Hive Metastore (Preview) documentation.
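
Once a metastore connector is registered as a catalog in Athena, its databases and tables can be referenced with the usual catalog.database.table syntax. The following is a minimal sketch; the hms_catalog catalog name and the sales_db.orders table are hypothetical placeholders for objects in your own Hive Metastore:

SELECT count(*) AS order_count
FROM hms_catalog.sales_db.orders
WHERE order_date >= DATE '2019-11-01';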

Using user-defined functions in Athena

Athena also offers preview support for scalar user-defined functions (UDFs). UDFs enable you to write functions and invoke them in SQL queries. A scalar UDF is applied one row at a time and returns a single column value. Athena invokes your scalar UDF with batches of rows to limit the performance impact associated with making a remote call for the UDF itself.

With the latest release as of this writing, you can use the Athena Query Federation SDK to define your functions and invoke them inline in SQL queries. You can now compress and decompress row values, scrub personally identifiable information (PII) from columns, transform dates to a different format, read image metadata, and execute proprietary custom code in your queries. You can also execute UDFs in both the SELECT and FILTER phase of the query and invoke multiple UDFs in the same query.
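
To give a flavor of the syntax, a scalar UDF is declared and invoked inline with USING FUNCTION. The following is a minimal sketch; the mask_pii function, the my-udf-function Lambda function, and the customers table are hypothetical names:

USING
FUNCTION mask_pii(value VARCHAR) RETURNS VARCHAR TYPE LAMBDA_INVOKE
WITH (lambda_name = 'my-udf-function')
SELECT mask_pii(email) AS masked_email
FROM customers
LIMIT 10;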

For more information about UDFs, see our documentation. For common UDF example implementations, see the GitHub repo. For more information about writing functions using the Athena Query Federation SDK, see the SDK documentation.

Testing the preview features

All Athena queries originating from the workgroup AmazonAthenaPreviewFunctionality are considered Preview test queries.

Create a new workgroup AmazonAthenaPreviewFunctionality using Athena APIs or the Athena console. For more information, see Create a Workgroup.

The following considerations are important when using preview features.

Do not edit the workgroup name. You can edit other workgroup properties, such as enabling Amazon CloudWatch metrics and requester pays. You can use the Athena console, JDBC/ODBC drivers, or APIs to submit your test queries. Specify the workgroup AmazonAthenaPreviewFunctionality when you submit test queries.

Preview functionality is available only in the us-east-1 Region. If you use Athena in any other Region and submit queries using AmazonAthenaPreviewFunctionality, your query fails. Cross-Region calls are not supported in preview mode.

During the preview, you do not incur charges for the data scanned from federated data sources. However, you are charged standard Athena rates for data scanned from S3. Additionally, you are charged standard rates for the AWS services that you use with Athena, such as S3, AWS Lambda, AWS Glue, Amazon SageMaker, and AWS SAM. For example, you are charged S3 rates for storage, requests, and inter-Region data transfer. By default, query results are stored in an S3 bucket of your choice and are billed at standard S3 rates. If you use Lambda, you are charged based on the number of requests for your functions and the duration (the time it takes for your code to execute).

We do not recommend onboarding your production workloads to AmazonAthenaPreviewFunctionality.

Query performance may vary between the preview workgroup and the other workgroups in your account. Additionally, new features and bug fixes may not be backwards compatible.

Summary

In summary, we introduced two new Athena features released today in Preview.

Customers who use the Apache Hive Metastore for metadata management, and were previously unable to use Athena, can now connect their Hive Metastore to Athena to run queries. Also, customers can now use Athena’s Query Federation SDK to define and invoke their own functions in their SQL queries in Athena.

Both these features are available in Preview in the AWS us-east-1 region. Begin your Preview now by following these steps in the Athena FAQ. We welcome your feedback at [email protected]

About the Author

Janak Agarwal is a product manager for Athena at AWS.

Prepare data for model-training and invoke machine learning models with Amazon Athena

Post Syndicated from Janak Agarwal original https://aws.amazon.com/blogs/big-data/prepare-data-for-model-training-and-invoke-machine-learning-models-with-amazon-athena/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Amazon Athena has announced a public preview of a new feature that provides an easy way to run inference using machine learning (ML) models deployed on Amazon SageMaker directly from SQL queries. The ability to use ML models in SQL queries makes complex tasks such as anomaly detection, customer cohort analysis, and sales predictions as simple as writing a SQL query.

Overview

Users now have the flexibility to use ML models trained on proprietary datasets or to use out-of-the-box, pre-trained ML models deployed on Amazon SageMaker. You can now easily invoke a variety of ML algorithms across text analysis, statistical tools, and any algorithm deployed on Amazon SageMaker. There is no additional setup required. Users can invoke ML models in SQL queries from the Athena console, Athena APIs, and through use of Athena’s JDBC and ODBC drivers in tools such as Amazon QuickSight. Within seconds, analysts can run inferences to forecast sales, detect suspicious logins to an application, or categorize all users into customer cohorts.

During the preview, you are not charged for the data scanned from federated data sources. However, you are charged standard Athena rates for data scanned from Amazon S3. Additionally, you are charged standard rates for the AWS services that you use with Athena, such as Amazon S3, AWS Lambda, AWS Glue, Amazon SageMaker, and AWS Serverless Application Repository. For example, you are charged S3 rates for storage, requests, and inter-region data transfer. By default, query results are stored in an S3 bucket of your choice and are also billed at standard Amazon S3 rates. If you use AWS Lambda, you are charged based on the number of requests for your functions and the duration, the time it takes for your code to execute.

Athena has also added support for federated queries and user-defined functions (UDFs). This blog demonstrates how to:

  1. Ingest, clean, and transform a dataset using Athena SQL queries to ready it for model training purposes
  2. Use Amazon SageMaker to train the model
  3. Use Athena’s ML inference capabilities to run prediction queries in SQL using the model created in Step 2

For illustration purposes, consider an imaginary video game company whose goal is to predict which games will be successful.

Assume that the video game company has a raw dataset in Amazon S3 with the schema represented by the following SQL statement.

CREATE EXTERNAL TABLE `video_game_data`.`video_games`(
  `gameid` int COMMENT '', 
  `name` string COMMENT '', 
  `platform` string COMMENT '',
  `year_of_release` int COMMENT '',
  `genre` string COMMENT '',
  `publisher` string COMMENT '',
  `sales` int COMMENT '',
  `critic_score` int COMMENT '',
  `critic_count` int COMMENT '',
  `user_score` int COMMENT '',
  `user_count` int COMMENT '',
  `developer` string COMMENT '',
  `rating` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'separatorChar'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://video-game-demo/'
TBLPROPERTIES (
  'has_encrypted_data'='false')

A screenshot of the sample dataset in Athena follows.

Diagram 1: Screenshot of the sample dataset in our example.

Data Analysis and Cleaning

To prepare the dataset for training the model using Amazon SageMaker, you need to select the relevant data required to train the ML model. You don’t need columns that are not relevant for training, such as gameid, name, year_of_release, sales, critic_count, user_count, and developer.

Additionally, this example defines success as the scenario in which sales of a particular game exceed 1,000,000. You can create a success column that is either 0 (denoting no success) or 1 (denoting success) in the dataset to reflect success.

A sample query and screenshot showing the success column follow:

SELECT platform,
         genre,
         critic_score,
         user_score,
         rating,
         (sales > 1000000) AS success
FROM "video_game_data"."video_games";

Diagram 2: Screenshot of our sample dataset with the columns that are irrelevant for training our ML model deleted.

ML models typically do not work well with String Enums. To optimize performance and improve model accuracy, convert the platform, genre, publisher, and rating to integer Enum constants. You can use Athena’s UDF functionality to accomplish this.

The following sample code creates an AWS Lambda function that you can invoke in the Athena SQL query as UDF.

import com.amazonaws.athena.connector.lambda.handlers.UserDefinedFunctionHandler;
import com.google.common.collect.ImmutableMap;

public class AthenaUDFHandler
        extends UserDefinedFunctionHandler
{
    private static final String SOURCE_TYPE = "athena_common_udfs";
    private final ImmutableMap<String, Integer> genreMap;

    public AthenaUDFHandler()
    {
        super(SOURCE_TYPE);
        // Map each genre name to a stable integer constant.
        genreMap = ImmutableMap.<String, Integer>builder()
                .put("Action", 1)
                .put("Adventure", 2)
                .put("Fighting", 3)
                .put("Misc", 4)
                .put("Platform", 5)
                .put("Puzzle", 6)
                .put("Racing", 7)
                .put("Role-Playing", 8)
                .put("Shooter", 9)
                .put("Simulation", 10)
                .put("Sports", 11)
                .put("Strategy", 12).build();
    }

    // Returns the integer constant for a genre, or 0 for unknown genres.
    public Integer normalize_genre(String genre)
    {
        return genreMap.getOrDefault(genre, 0);
    }

    // Returns an integer representation of the rating string.
    public Integer normalize_rating(String rating)
    {
        return rating.hashCode();
    }

    // Returns an integer representation of the platform string.
    public Integer normalize_platform(String platform)
    {
        return platform.hashCode();
    }
}

A sample Athena Query that normalizes the dataset using the functions created above follows:

USING
FUNCTION normalize_genre(value VARCHAR) RETURNS INTEGER TYPE LAMBDA_INVOKE
WITH (lambda_name='VideoNormalization'),

FUNCTION normalize_platform(value VARCHAR) RETURNS INTEGER TYPE LAMBDA_INVOKE
WITH (lambda_name='VideoNormalization'),

FUNCTION normalize_rating(value VARCHAR) RETURNS INTEGER TYPE LAMBDA_INVOKE
WITH (lambda_name='VideoNormalization')

SELECT CAST((sales > 1000000) AS integer) AS success,
         normalize_platform(platform) AS platform, 
         normalize_genre(genre) AS genre, 
         critic_score, 
         user_score, 
         normalize_rating(rating) AS rating
FROM video_game_data.video_games;

The following screenshot shows the normalized dataset:

Diagram 3: Screenshot after defining, and invoking our UDFs that help to normalize our dataset.

Creating the machine learning model

Next, create the ML model in Amazon SageMaker.

Open an Amazon SageMaker notebook instance with permissions to execute Athena queries and run the following sample code. The code first imports the required Amazon SageMaker libraries and PyAthena into your Amazon SageMaker notebook, executes an Athena query to retrieve the training dataset, invokes the training algorithm on this dataset, and deploys the resulting model on the selected Amazon SageMaker instance. PyAthena allows you to invoke Athena SQL queries from within your Amazon SageMaker notebook.

Note that you can also train your model using a different algorithm and evaluate your model. For more details on using Amazon SageMaker, please visit the SageMaker documentation.

import sys
!{sys.executable} -m pip install PyAthena
import boto3, os
import sagemaker
import pandas as pd

from sagemaker import get_execution_role
from pyathena import connect

# Create the training dataset by querying Athena through PyAthena
athena_output_bucket = 'athena-results'
region = 'us-east-1'
connection = connect(s3_staging_dir='s3://{}/'.format(athena_output_bucket),
                     region_name=region,
                     work_group='AmazonAthenaPreviewFunctionality')

results = pd.read_sql("""USING FUNCTION normalize_genre(value VARCHAR) RETURNS INTEGER TYPE LAMBDA_INVOKE WITH (lambda_name='VideoNormalization'),
FUNCTION normalize_platform(value VARCHAR) RETURNS INTEGER TYPE LAMBDA_INVOKE WITH (lambda_name='VideoNormalization'),
FUNCTION normalize_rating(value VARCHAR) RETURNS INTEGER TYPE LAMBDA_INVOKE WITH (lambda_name='VideoNormalization')
SELECT CAST((sales > 1000000) AS integer) AS success, normalize_platform(platform) AS platform, normalize_genre(genre) AS genre, critic_score, user_score, normalize_rating(rating) AS rating FROM video_game_data.video_games""", connection)

training_data_output_location = 'video-games-sales-prediction'

# Write the training data to S3 as a headerless CSV, the format expected by the XGBoost container
results.to_csv('training_data.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(training_data_output_location).Object('train/train.csv').upload_file('training_data.csv')
s3_input = sagemaker.s3_input(s3_data='s3://{}/train'.format(training_data_output_location), content_type='csv')

# Model training
role = get_execution_role()
container = '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest'

sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/model'.format(training_data_output_location),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        num_round=100)

xgb.fit({'train': s3_input})

# Model deployment
xgb_predictor = xgb.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
xgb_predictor.endpoint_name

Using the model in SQL queries

Having trained the ML model and deployed it on Amazon SageMaker, your next task is to use this model in your Athena SQL queries to predict whether a given video game will be a success.

A sample prediction query follows. You can run this query in your Amazon SageMaker notebook after loading PyAthena, or using the Athena Console. You can also submit this query using Athena’s APIs or using the Athena Preview JDBC driver.

-- Prediction

USING FUNCTION predict(platform int, genre int, critic_score int, user_score int, rating int) returns DOUBLE type SAGEMAKER_INVOKE_ENDPOINT
WITH (sagemaker_endpoint='xgboost-2019-11-22-00-52-22-742'),

FUNCTION normalize_genre(value VARCHAR) RETURNS int TYPE LAMBDA_INVOKE
WITH (lambda_name='VideoNormalization'),

FUNCTION normalize_platform(value VARCHAR) RETURNS int TYPE LAMBDA_INVOKE
WITH (lambda_name='VideoNormalization'),

FUNCTION normalize_rating(value VARCHAR) RETURNS int TYPE LAMBDA_INVOKE
WITH (lambda_name='VideoNormalization')

SELECT predict(platform,
         genre,
         critic_score,
         user_score,
         rating),
         name
FROM 
    (SELECT name,
         normalize_platform(platform) AS platform, 
         normalize_genre(genre) AS genre, 
         critic_score, 
         user_score, 
         normalize_rating(rating) AS rating
FROM video_game_data.video_games);

Conclusion

In this blog, we introduced Athena’s new functionality that enables you to invoke your machine learning models in SQL queries to run inference. Using an example, we saw how to use Athena’s UDF functionality. We defined and invoked our functions to ready our dataset for machine learning model training purposes. To train our model, we used PyAthena to invoke Athena SQL queries from Amazon SageMaker and, finally, invoked our ML model in a SQL query to run inference.

Amazon Athena’s ML functionality is available today in preview in the us-east-1 (N. Virginia) Region. Begin your preview now by following these steps in the Athena FAQ.
To learn more about the feature, see our documentation on querying ML models.

About the Authors

Janak Agarwal is a product manager for Athena at AWS.

 Ronak Shah is a software development engineer for Athena at AWS.

Query any data source with Amazon Athena’s new federated query

Post Syndicated from Janak Agarwal original https://aws.amazon.com/blogs/big-data/query-any-data-source-with-amazon-athenas-new-federated-query/

Organizations today use data stores that are the best fit for the applications they build. For example, for an organization building a social network, a graph database such as Amazon Neptune is likely the best fit when compared to a relational database. Similarly, for workloads that require flexible schema for fast iterations, Amazon DocumentDB (with MongoDB compatibility) is likely a better fit. As Werner Vogels, CTO and VP of Amazon.com, said: “Seldom can one database fit the needs of multiple distinct use cases.” Developers today build highly distributed applications using a multitude of purpose-built databases. In a sense, developers are doing what they do best: dividing complex applications into smaller pieces, which allows them to choose the right tool for the right job. As the number of data stores and applications increase, running analytics across multiple data sources can become challenging.

Today, we are happy to announce a Public Preview of Amazon Athena support for federated queries.

Federated Query in Amazon Athena

Federated query is a new Amazon Athena feature that enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources. With Athena federated query, customers can submit a single SQL query and analyze data from multiple sources running on-premises or hosted on the cloud. Athena executes federated queries using Data Source Connectors that run on AWS Lambda. AWS has open-sourced Athena Data Source connectors for Amazon DynamoDB, Apache HBase, Amazon DocumentDB, Amazon Redshift, Amazon CloudWatch Logs, AWS CloudWatch Metrics, and JDBC-compliant relational data sources such as MySQL and PostgreSQL under the Apache 2.0 license. Customers can use these connectors to run federated SQL queries in Athena across these data sources. Additionally, using the Query Federation SDK, customers can build connectors to any proprietary data source and enable Athena to run SQL queries against the data source. Since connectors run on Lambda, customers continue to benefit from Athena’s serverless architecture and do not have to manage infrastructure or scale for peak demands.

Running analytics on data spread across applications can be complex and time consuming. Application developers pick a data store based on the application’s primary function. As a result, data required for analytics is often spread across relational, key-value, document, in-memory, search, graph, time-series, and ledger databases. Event and application logs are often stored in object stores such as Amazon S3. To analyze data across these sources, analysts have to learn new programming languages and data access constructs, and build complex pipelines to extract, transform and load into a data warehouse before they can easily query the data. Data pipelines introduce delays and require custom processes to validate data accuracy and consistency across systems. Moreover, when source applications are modified, data pipelines have to be updated and data has to be re-stated for corrections. Federated queries in Athena eliminate this complexity by allowing customers to query data in-place wherever it resides. Analysts can use familiar SQL constructs to JOIN data across multiple data sources for quick analysis or use scheduled SQL queries to extract and store results in Amazon S3 for subsequent analysis.

The Athena Query Federation SDK extends the benefits of federated querying beyond AWS provided connectors. With fewer than 100 lines of code, customers can build connectors to proprietary data sources and share them across the organization. Connectors are deployed as Lambda functions and registered for use in Athena as data sources. Once registered, Athena invokes the connector to retrieve databases, tables, and columns available from the data source. A single Athena query can span multiple data sources. When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements. Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Since connectors are executed in Lambda, they can be used to access data from any data source on the cloud or on-premises that is accessible from Lambda.

Data Source Connectors

You can run SQL queries against new data sources by registering the data source with Athena. When registering the data source, you associate an Athena Data Connector specific to the data source. You can use AWS-provided open-source connectors, build your own, contribute to existing connectors, or use community or marketplace-built connectors. Depending on the type of data source, a connector manages metadata information; identifies specific parts of the tables that need to be scanned, read or filtered; and manages parallelism. Athena Data Source Connectors run as Lambda functions in your account.

Each data connector is composed of two Lambda functions, each specific to a data source: one for metadata and one for record reading. The connector code is open-source and should be deployed as Lambda functions. You can also deploy Lambda functions to the AWS Serverless Application Repository and use them with Athena. Once the Lambda functions are deployed, they produce a unique Amazon Resource Name (ARN). You must register these ARNs with Athena. Registering an ARN allows Athena to understand which Lambda function to talk to during query execution. Once both ARNs are registered, you can query the registered data source.
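
Once registered, the data source is addressed in your SQL through its catalog name. As a minimal sketch, using the lambda:cloudwatch catalog that also appears in the sample analyst query later in this post, a registered CloudWatch Logs connector can be queried like any other table:

SELECT log_stream, message
FROM "lambda:cloudwatch"."/var/ecommerce-engine/order-processor".all_log_streams
LIMIT 10;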

When a query runs on a federated data source, Athena fans out the Lambda invocations reading metadata and data in parallel. The number of parallel invocations depends on the Lambda concurrency limits enforced in your account. For example, if you have a limit of 300 concurrent Lambda invocations, Athena can invoke 300 parallel Lambda functions for record reading. For two queries running in parallel, Athena invokes twice the number of concurrent executions.

Diagram 1 shows how Athena Federated Queries work. When you submit a federated query to Athena, Athena will invoke the right Lambda-based connector to connect with your Data Source. Athena will fan out Lambda invocations to read metadata and data in parallel.

Diagram 1: Athena Federated Query Architecture

Example

This blog post demonstrates how data analysts can query data in multiple databases for faster analysis in one SQL query. For illustration purposes, consider an imaginary e-commerce company whose architecture leverages the following purpose-built data sources:

  1. Payment transaction records stored in Apache HBase running on AWS.
  2. Active orders, defined as customer orders not yet delivered, stored in Redis so that the processing engine can retrieve such orders quickly.
  3. Customer data such as email addresses, shipping information, etc., stored in DocumentDB.
  4. Product Catalog stored in Aurora.
  5. Order Processor’s log events housed in Amazon CloudWatch Logs.
  6. Historical orders and analytics in Redshift.
  7. Shipment tracking data in DynamoDB.
  8. A fleet of drivers performing last-mile delivery while using IoT-enabled tablets.

Customers of this imaginary e-commerce company have a problem. They have complained that their orders are stuck in a weird state. Some orders show as pending even though they have actually been delivered while other orders show as delivered but have not actually been shipped.

The company management has tasked the customer service analysts to determine the true state of all orders.

Using Athena federated queries

Using Athena’s query federation, the analysts can quickly analyze records from different sources. Additionally, they can set up a pipeline that extracts data from these sources, stores it in Amazon S3, and uses Athena to query it.

Diagram 2 shows Athena invoking Lambda-based connectors to connect with data sources that are on premises and in the cloud in the same query. In this diagram, Athena is scanning data from S3 and executing the Lambda-based connectors to read data from HBase on EMR, DynamoDB, MySQL, Redshift, ElastiCache (Redis), and Amazon Aurora.

Diagram 2: Federated Query Example.

Analysts can register and use the following connectors found in this repository and run a query that:

  1. Grabs all active orders from Redis. (see athena-redis)
  2. Joins against any orders with ‘WARN’ or ‘ERROR’ events in Cloudwatch logs by using regex matching and extraction. (see athena-cloudwatch)
  3. Joins against our EC2 inventory to get the hostname(s) and status of the Order Processor(s) that logged the ‘WARN’ or ‘ERROR’. (see athena-cmdb)
  4. Joins against DocumentDB to obtain customer contact details for the affected orders. (see athena-docdb)
  5. Joins against a scatter-gather query sent to the Driver Fleet via Android Push notification. (see athena-android)
  6. Joins against DynamoDB to get shipping status and tracking details. (see athena-dynamodb)
  7. Joins against HBase to get payment status for the affected orders. (see athena-hbase)
  8. Joins against the advertising conversion data in BigQuery to see which promotions need to be applied if a re-order is needed. (see athena-bigquery)

Data Source Connector Registration

Analysts can register a data source connector using the Connect data source Flow in the Athena Query Editor.

  1. Choose Connect data source or Data sources on the Query Editor.
  2. Select the data source to which you want to connect, as shown in the following screenshot. You can also choose to write your own data source connector using the Query Federation SDK.
  3. Follow the rest of the steps in the UX to complete the registration. They involve configuring the connector function for your data source (as shown in the following screenshot), selecting a Name as the Catalog Name to use in your query, and providing a description.

Sample Analyst Query

Once the registration of the data source connectors is complete, the customer service analyst can write the following sample query to identify the affected orders in one SQL query, thus increasing the organization’s business velocity.

A video demonstration of a sample federated query is available in the original post; the query itself follows:

WITH logs 
     AS (SELECT log_stream, 
                message                                          AS 
                order_processor_log, 
                Regexp_extract(message, '.*orderId=(\d+) .*', 1) AS orderId, 
                Regexp_extract(message, '(.*):.*', 1)            AS log_level 
         FROM 
     "lambda:cloudwatch"."/var/ecommerce-engine/order-processor".all_log_streams 
         WHERE  Regexp_extract(message, '(.*):.*', 1) != 'WARN'), 
     active_orders 
     AS (SELECT * 
         FROM   redis.redis_db.redis_customer_orders), 
     order_processors 
     AS (SELECT instanceid, 
                publicipaddress, 
                state.NAME 
         FROM   awscmdb.ec2.ec2_instances), 
     customer 
     AS (SELECT id, 
                email 
         FROM   docdb.customers.customer_info), 
     addresses 
     AS (SELECT id, 
                is_residential, 
                address.street AS street 
         FROM   docdb.customers.customer_addresses),
     drivers
     AS ( SELECT name as driver_name, 
                 result_field as driver_order, 
                 device_id as truck_id, 
                 last_updated 
         FROM android.android.live_query where query_timeout = 5000 and query_min_results=5),
     impressions 
     AS ( SELECT path as advertisement, 
                 conversion
         FROM bigquery.click_impressions.click_conversions),
     shipments 
     AS ( SELECT order_id, 
                 shipment_id, 
                 from_unixtime(cast(shipped_date as double)) as shipment_time,
                 carrier
        FROM lambda_ddb.default.order_shipments),
     payments
     AS ( SELECT "summary:order_id", 
                 "summary:status", 
                 "summary:cc_id", 
                 "details:network" 
        FROM "hbase".hbase_payments.transactions)

SELECT _key_            AS redis_order_id, 
       customer_id, 
       customer.email   AS cust_email, 
       "summary:cc_id"  AS credit_card,
       "details:network" AS CC_type,
       "summary:status" AS payment_status,
       impressions.advertisement as advertisement,
       status           AS redis_status, 
       addresses.street AS street_address, 
       shipments.shipment_time as shipment_time,
       shipments.carrier as shipment_carrier,
       driver_name     AS driver_name,
       truck_id       AS truck_id,
       last_updated AS driver_updated,
       publicipaddress  AS ec2_order_processor, 
       NAME             AS ec2_state, 
       log_level, 
       order_processor_log 
FROM   active_orders 
       LEFT JOIN logs 
              ON logs.orderid = active_orders._key_ 
       LEFT JOIN order_processors 
              ON logs.log_stream = order_processors.instanceid 
       LEFT JOIN customer 
              ON customer.id = customer_id 
       LEFT JOIN addresses 
              ON addresses.id = address_id 
       LEFT JOIN drivers 
              ON drivers.driver_order = active_orders._key_ 
       LEFT JOIN impressions
              ON impressions.conversion = active_orders._key_
       LEFT JOIN shipments
              ON shipments.order_id = active_orders._key_
       LEFT JOIN payments
              ON payments."summary:order_id" = active_orders._key_

Additionally, Athena writes all query results to an S3 bucket that you specify in your query. If your use case requires you to ingest data into S3, you can use Athena’s query federation capabilities to register your data source and ingest data into S3, and then use CTAS or INSERT INTO statements to create partitions and metadata in the AWS Glue Data Catalog, as well as convert the data to a supported format.
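
For instance, a scheduled CTAS statement can snapshot a federated source into partitioned Parquet in your data lake. The following is a minimal sketch that reuses the lambda_ddb catalog from the sample query above; the datalake target database, the table name, and the bucket are hypothetical:

CREATE TABLE datalake.shipments_snapshot
WITH (format = 'PARQUET',
      external_location = 's3://your-bucket/shipments-snapshot/',
      partitioned_by = ARRAY['carrier'])
AS
SELECT order_id,
       shipment_id,
       from_unixtime(cast(shipped_date as double)) AS shipment_time,
       carrier
FROM lambda_ddb.default.order_shipments;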

Conclusion

In this blog, we introduced Athena’s new federated query feature. Using an example, we saw how to register and use Athena data source connectors to write federated queries that connect Athena to any data source accessible by AWS Lambda from your account. Finally, we learned that we can use federated queries not only to enable faster analytics, but also to extract, transform, and load data into your data lake in S3.

Athena federated query is available in Preview in the us-east-1 (N. Virginia) Region. Begin your Preview now by following these steps in the Athena FAQ.
To learn more about the feature, see the Connect to a Data Source documentation.
To get started with an existing connector, follow the Connect to a Data Source guide.
To learn how to build your own data source connector using the Athena Query Federation SDK, see the Athena examples on GitHub.

About the Author

Janak Agarwal is a product manager for Athena at AWS.

Simplify ETL data pipelines using Amazon Athena’s federated queries and user-defined functions

Post Syndicated from Manny Wald original https://aws.amazon.com/blogs/big-data/simplify-etl-data-pipelines-using-amazon-athenas-federated-queries-and-user-defined-functions/

Amazon Athena recently added support for federated queries and user-defined functions (UDFs), both in Preview. See Query any data source with Amazon Athena’s new federated query for more details. Jornaya helps marketers intelligently connect consumers who are in the market for major life purchases such as homes, mortgages, cars, insurance, and education.

Jornaya collects data from a variety of data sources. Our main challenge is to clean and ingest this data into Amazon S3 to enable access for data analysts and data scientists.

Legacy ETL and analytics solution

In 2012, Jornaya moved from MySQL to Amazon DynamoDB for our real-time database. DynamoDB allowed a company of our size to receive the benefits of create, read, update, and delete (CRUD) operations with predictable low latency, high availability, and excellent data durability without the administrative burden of having to manage the database. This allowed our technology team to focus on solving business problems and rapidly building new products that we could bring to market.

Running analytical queries against NoSQL databases can be tough, so we decided to extract the data from DynamoDB and query it elsewhere. This was not simple.

Here are a few methods we use at Jornaya to get data from DynamoDB:

  • Leveraging EMR: We temporarily provision additional read capacity with DynamoDB tables and create transient EMR clusters to read data from DynamoDB and write to Amazon S3.
    • Our Jenkins jobs trigger pipelines that spin up a cluster, extract data using EMR, and use the Amazon Redshift COPY command to load data into Amazon Redshift. This is an expensive process because we use excess read capacity. To lower EMR costs, we use Spot Instances.
  • Enabling DynamoDB Streams: We use a homegrown Python AWS Lambda function named Dynahose to consume data from the stream and write it to an Amazon Kinesis Data Firehose delivery stream. We then configure the Kinesis Data Firehose delivery stream to write the data to an Amazon S3 location. Finally, we use another homegrown Python Lambda function named Partition to ensure that the partitions corresponding to the locations of the data written to Amazon S3 are added to the AWS Glue Data Catalog, so that the data can be read using tools like AWS Glue, Amazon Redshift Spectrum, EMR, etc.

The process is shown in the following diagram.

We go through such pipelines because we want to ask questions about our operational data in a natural way, using SQL.

Using Amazon Athena to simplify ETL workflows and enable quicker analytics

Athena, a fully managed serverless interactive service for querying data in Amazon S3 using SQL, has been rapidly adopted by multiple departments across our organization. For our use case, we did not need an always-on EMR cluster waiting for an analytics query; Athena's serverless nature is a perfect fit. Along the way we discovered that we could use Athena to run extract, transform, and load (ETL) jobs.

However, Athena is a lot more than an interactive service for querying data in Amazon S3. We also found Athena to be a robust, powerful, reliable, scalable, and cost-effective ETL tool. The ability to schedule SQL statements, along with support for Create Table As Select (CTAS) and INSERT INTO statements, helped us accelerate our ETL workloads.
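
As a minimal sketch of that pattern (the table names here are illustrative, not our actual production tables), a scheduled statement can append the previous day's data to an existing Glue-cataloged table:

INSERT INTO analytics.device_frequency_daily
SELECT device_id,
       token,
       industry_code,
       create_day
FROM rdl.device_frequency_stream_curated
WHERE create_day = CAST(current_date - interval '1' day AS VARCHAR);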

Before Athena, business users in our organization had to rely on engineering resources to build pipelines. The release of Athena changed that in a big way. Athena enabled software engineers and data scientists to work with data that would have otherwise been inaccessible or required help from data engineers.

With the addition of query federation and UDFs to Athena, Jornaya has been able to replace many of our unstable data pipelines with Athena to extract and transform data from DynamoDB and write it to Amazon S3. The product and engineering teams at Jornaya noticed our reduced ETL failure rates. The finance department took note of lower EMR and DynamoDB costs. And the members of our on-call rotation (as well as their spouses) have been able to enjoy uninterrupted sleep.

For instance, the build history of one ETL pipeline using EMR looked like this (the history of ETL pipeline executions is shown in this chart with the job execution id on the x-axis and the execution time in minutes on the y-axis):

After migrating this pipeline to Athena and using federated queries to query DynamoDB, we were able to easily access data sources that we simply could not reach before, with queries like the following:

CREATE TABLE "__TABLE_NAME__"
WITH (
  external_location = '__S3_LOCATION__'
, format = 'PARQUET'
, parquet_compression = 'SNAPPY'
, partitioned_by = ARRAY['create_day']
) AS
SELECT DISTINCT
  d.key.s AS device_id
, CAST(d.created.n AS DECIMAL(14, 4)) AS created
, d.token.s AS token
, c.industry AS industry_code
, CAST(CAST(FROM_UNIXTIME(CAST(d.created.n AS DECIMAL(14, 4))) AS DATE) AS VARCHAR) AS create_day
FROM "rdl"."device_frequency_stream" d
  LEFT OUTER JOIN "lambda::dynamodb"."default"."campaigns" c ON c.key = d.campaign_key
WHERE d.insert_ts BETWEEN TIMESTAMP '__PARTITION_START_DATE__' AND TIMESTAMP '__PARTITION_END_DATE__'
  AND d.created.n >= CAST(CAST(TO_UNIXTIME(DATE '__START_DATE__') AS DECIMAL(14, 4)) AS VARCHAR)
  AND d.created.n < CAST(CAST(TO_UNIXTIME(DATE '__END_DATE__') AS DECIMAL(14, 4)) AS VARCHAR);

We achieved a much more performant process with a build history shown in the following diagram:

Conclusion

Using one SQL query, we were able to process data from DynamoDB, convert that data to Parquet, apply Snappy compression, create the correct partitions in our AWS Glue Data Catalog, and ingest data to Amazon S3. Our ETL process execution time was reduced from hours to minutes, the cost was significantly lowered, and the new process is simpler and much more reliable. The new process using Athena for ETL is also future-proof due to extensibility. In case we need to import a dataset from another purpose-built data store that does not have a ready data source connector, we can simply use the data source connector SDK to write our own connector and deploy it in Production—a one-time effort that will cost us barely one day.

Additionally, Athena federated queries have empowered Jornaya to run queries that connect data from not just different data sources, but from different data paradigms! We can run a single query that seamlessly links data from a NoSQL datastore, an RDS RDBMS, and an Amazon S3 data lake.
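
A sketch of such a query might look like the following. The lambda::dynamodb catalog follows the naming used in the example above; the lambda::mysql catalog and the AwsDataCatalog database and table names are placeholders for an RDS MySQL connector and an S3-backed Glue table.

SELECT o.order_id,
       c.customer_name,
       s.shipment_status
FROM "lambda::dynamodb"."default"."orders" o
  JOIN "lambda::mysql"."sales"."customers" c ON c.customer_id = o.customer_id
  JOIN "AwsDataCatalog"."datalake"."shipments" s ON s.order_id = o.order_id;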

About the Authors

Manny Wald is the technical co-founder at Jornaya. He holds multiple patents and is passionate about the power of the cloud, big data and AI to accelerate the rate at which companies can bring products to market to solve real-world problems. He has a background in BI, application development, data warehousing, web services, and building tools to manage transactional and analytical information. Additionally, Manny created the internet’s first weekly hip hop turntablism mix show, is admitted to practice law at the state and federal levels, and plays basketball whenever he gets the chance.

Janak Agarwal is a product manager for Athena at AWS.

Highlight Critical Insights with Conditional Formatting in Amazon QuickSight

Post Syndicated from Susan Fang original https://aws.amazon.com/blogs/big-data/highlight-critical-insights-with-conditional-formatting-in-amazon-quicksight/

Amazon QuickSight now makes it easier for you to spot the highlights or low-lights in data through conditional formatting. With conditional formatting, you can specify customized text or background colors based on field values in the dataset, using solid or gradient colors. You can also display data values with the supported icons. Using color coding and icons, you can visually explore and analyze data, quickly detect issues, and identify trends, as seen in the following screenshot. These dynamic visual cues also help your dashboard viewers cut through the densely populated data values and get to the attention-worthy insights much faster.

With this release, you can apply conditional formatting to tables, pivot tables and KPI charts. This blog post walks you through how to leverage conditional formatting in each supported chart type in Amazon QuickSight.

Tables

You can apply conditional formatting to tables through the following methods.

Conditionally formatting columns based on data fields

When applying conditional formatting to tables, there are two ways to access the configuration pane. One option is to select Table in an analysis, and choose the caret in the upper right corner of the Table visual. Select Conditional formatting from the dropdown menu, as shown in the following screenshots, and the Conditional formatting configuration pane pops up to the left of the analysis canvas. Choose the + sign to select the target column to which you’d like to apply the formatting; it can be any column present in the table.

For example, select Sales from the dropdown menu to add a condition for the Sales target column.

Alternatively, from the selected table, you can expand the field well on the top of the analysis canvas and choose the Group by or Value target column directly from the columns used in the table. For example, select Sales from the Value well and select Conditional formatting from the dropdown, as seen in the following screenshots. The conditional formatting configuration pane pops up to the left of the analysis canvas, with a condition already added for the Sales target column.

Next, choose the formatting style from the three available options.

  • Add background color
  • Add text color
  • Add icon

For each target column, you can add any formatting style or any combination of the styles.

Background and text color

Choose Add background color to expand the configuration pane for New background color. For the Sales target column, there are two Fill type options, Solid and Gradient, with Solid as the default.

You are prompted to decide the base column of the conditions, which can be any field in the dataset, regardless of whether it is present in the Table or not.

Next, define the Aggregation of the base column used in the conditions, which can be any aggregation currently supported in QuickSight, such as sum, average, min, max, or standard deviation. Then you can create the conditions that drive the color coding of the cell background.

This example highlights cells based on the following rules:

  • Cells with sum of sales greater than $100, green
  • Cells with sum of sales between $10 and $100 (inclusive), yellow
  • Cells with sum of sales less than $10, red

Since the conditions are discrete, apply solid color based on Sales in the Sum aggregation.

In Conditional #1:

  1. Choose the Greater than comparison method.
  2. Enter 100 as the threshold Value.
  3. Use the color picker to pick green.
  4. Choose Add condition and configure the other two conditions accordingly.
  5. Choose Apply, and the Sales column is conditionally formatted, showing green, yellow and red colors when conditions are met.

If you'd like to re-order the conditions, select a condition and move it up or down. You can also delete a condition. To reset all of the conditions, choose Clear to clear the applied formatting from the visual and remove the conditions from the configuration pane.

The following screenshots demonstrate the configuration step by step:

If you are happy with the highlighting rules, choose Close to return to the summary page.

You see a brief summary of the conditions applied to background color of the Sales column. You can Edit or Delete the applied conditions, Add other formatting styles such as text color or icon to the same Sales column, Delete the target column altogether, or Add another target column up top.

Icons

This example adds icons to the same Sales column based on data in the Profits column so that a thumbs-up icon shows when profits are greater than 0.

  1. Start with Add icon. In the New icon configuration page, choose the base column and Aggregation.
  2. Under Icon set, QuickSight offers a list of canned color icon sets for quick application.
  3. For icons beyond this list, collapse the dropdown and check the Custom conditions box. Choose the pre-populated + icon for more icon selections and the option to Use custom Unicode icon. This example uses the thumbs-up icon and sets the color to orange.
  4. Choose Apply to see the icons show up in the Sales column accordingly.

The following screenshots show the configuration step by step:

Please note that you have the option to Show icon only. Once checked, the text values of the target column are hidden, showing icons only.

Conditionally formatting entire rows based on calculated fields

Next, here is a slightly more complex example that involves calculated fields. This example highlights rows in which Ship charges exceeds 20% of the total sales amount. This condition needs to be based on a flag that differentiates orders with high shipping charges from the rest.

Create a calculated field called high_shipping_flag using this function:

ifelse(sum({Ship charges}) > 0.2 * sum(Sales),1,0)

As a result, orders with shipping charges over 20% of sales amount take a value of 1, otherwise, they take a value of 0.

With the base field ready, proceed with conditional formatting configuration. Since the target property for formatting is the entire row instead of a specific column, access the configuration pane from the visual dropdown menu instead of a particular column from the field well.

Scroll down to the bottom of the dropdown list and select the option to conditionally format the [Entire row] in addition to individual columns.

Note the options to Add background color or Add text color, without the Add icon option, because it is not applicable to entire rows. Also note the Apply on top checkbox, which allows the conditional formatting for the entire row to paint over conditional formatting applied to individual columns. For this example, color the entire row teal if the shipping charge flag is 1.

The following screenshots demonstrate the configuration step by step:

Pivot tables

Conditional formatting is enabled for pivot tables as well, and the configuration workflow is very similar to that of tables, with the following differences:

  • Conditional formatting can be applied to measures only, or fields in the Values well, but not the other data types.
  • Conditional formatting in solid colors can be applied to various levels of your choice: values, subtotals, and totals.
  • Conditional formatting can only be applied to individual columns, not entire rows.

Applying gradient colors

Next, explore the conditional formatting functionalities unique to pivot tables. As mentioned above, only the two measures, or numeric fields, in the Values well can be the target fields for conditional formatting.

The following screenshot shows a conditionally formatted pivot table:

As with tables, you can access Conditional formatting configuration pane from the dropdown of either the visual or the field itself in the Values well. Note that only the two measures show up in the dropdown list.

Applying gradient colors to Percent Tested is fairly straightforward. When you select Gradient, you are prompted to define a base field, which can be any field from the dataset as well as Aggregation. Amazon QuickSight offers three pre-set gradient colors for easy application: blue, red-green, and red-yellow-green.

Choose the diverging color option. You have the option to further customize the three colors as well as the corresponding min, mid, and max values by overriding the default, as shown in the following screenshots.

Applying solid colors to column values, subtotals and totals

For Average Score, apply solid color rules to the text.

Add Average Score as the target column for conditional formatting and select Add text color. As mentioned above, only solid colors can be applied to subtotals and totals.

Note the extra Values label next to Condition, right above the comparison method dropdown box, which you don't see for tables. By default, Amazon QuickSight applies the solid color conditions to the most fine-grained values of the target field, with the dimension fields at all levels expanded.

This example examines how the average scores at the Borough level compare with each other, which requires applying conditions to both Values and Subtotals. To do so, choose the icon to expose the options. This is a multi-select dropdown list, in which you can check one or more options. After checking the desired boxes, Values and Subtotals show up next to Condition. You can Add condition after you have completed the first.

Note that the next condition added inherits the level of granularity from the previous condition. In this example, Condition #2 automatically applies to Values and Subtotals, just like in the previous one. You can override from the check boxes of the multi-select dropdown.

The following screenshot shows the configuration and the effect.

For illustration purposes, add icons as well and enable Show subtotals for rows from the visual dropdown menu. With some rows collapsed at the Borough level, the solid color rules are applied to the average score for the collapsed boroughs, along with the school-level average scores on the expanded rows. The solid color rules also show up in the subtotals enabled for rows.

The following screenshots show conditional formatting applied to subtotals:

KPI Chart

This release also makes conditional formatting available for KPI charts so that you can easily tell whether the KPI is over or under business-critical thresholds, using text color and icons as quick indicators. In cases in which progress bars are shown, you also have the option to conditionally format the foreground color of the progress bar. The configuration workflow is similar to that of a table or pivot table. The difference is that only the primary value field can be conditionally formatted, not the Target or Trend group. This example uses a simple KPI chart showing Primary value only and adds an icon if the Average SAT Score is above 1200.

The following screenshots show the configuration and the effect:

Summary

This post illustrated how you can leverage conditional formatting to create richer data visualizations and thus more appealing dashboards in Amazon QuickSight. As of this writing, conditional formatting is enabled for tables, pivot tables, and KPI charts. You can expect to see enhanced capability to dynamically customize data colors based on numeric fields in other chart types as well. Stay tuned.

If you have any questions or feedback, please leave a comment.

About the Authors

Susan Fang is a senior product manager for QuickSight with AWS.

Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/evolve-your-analytics-with-amazon-quicksights-new-apis-and-theming-capabilities/

The Amazon QuickSight team is excited to announce the availability of Themes and more APIs! Themes for dashboards let you align the look and feel of Amazon QuickSight dashboards with your application’s branding or corporate themes. The new APIs added allow you to manage your Amazon QuickSight deployments programmatically, with support for dashboards, datasets, data sources, SPICE ingestion, and fine-grained access control over AWS resources. Together, they allow you to creatively tailor Amazon QuickSight to your audiences, whether you are using Amazon QuickSight to provide your users with an embedded analytics experience or for your corporate Business Intelligence (BI) needs. This post provides an overview of these new capabilities and details on getting started.

Theming your dashboards

Amazon QuickSight themes allow you to control the look and feel of your dashboards. Specifically, you can affect the following items through themes, and then share the theme with other authors in the account.

  • Margins
  • Borders around visuals
  • Gutters (spacing between visuals)
  • Data colors (used within the visuals)
  • Background colors (for visuals within the dashboard and the dashboard itself)
  • Foreground colors (for text in the visuals and on the dashboard)
  • Highlight colors on visuals

The following screenshot of the dashboard highlights aspects of the dashboards you can modify via the theming options currently available:

With the available options, you can also make your QuickSight dashboards denser (e.g. no margins or gutters), as shown in the view below that uses QuickSight’s new “Seaside” theme.

You also have a dark dashboard option with QuickSight’s new “Midnight” theme as shown below.

Themes are accessible via the left hand panel in the dashboard authoring interface.

You can create your own theme by starting with one of the built-in themes and customizing it. The interactive theme editing experience makes this easy – you can edit as little or as much as you want, and get to the perfect theme for your dashboards.

Using APIs for dashboards, datasets, data sources, SPICE ingestion, and permissions

The new APIs cover dashboards, datasets, data sources, SPICE ingestion, and fine-grained access control over S3/Athena. We’ve introduced templates in QuickSight with this release: templates allow you to store the visual configuration and data schema associated with a dashboard and then easily replicate the dashboard for different sets of users or different data, within or across accounts.

The new template APIs allow you to create templates from QuickSight analyses. You can then use the dashboard APIs to create a dashboard from the template in the same or different account, version control dashboards, and connect them to different datasets as needed. The dataset APIs allow you to create custom datasets, with filters applied at the dataset level (for example, data for a specific customer only). Dataset APIs can be used to create both SPICE and direct query datasets. SPICE APIs let you refresh datasets programmatically and view SPICE ingestion status and history. Data source APIs programmatically define data source connections. Finally, fine-grained access permissions APIs let you secure access to data in Amazon S3, Amazon Athena, and other AWS resources.

As a developer using Amazon QuickSight’s embedded capabilities, or author using Amazon QuickSight to publish dashboards on different subject areas, the new APIs allow you to automate your workflows and programmatically perform repetitive activities such as rolling out identical versions of dashboards to different teams or customers. QuickSight assets can now be migrated from dev to QA to production via API, overcoming the need to manually access each environment. With versioning, you also have your previous dashboards backed up, providing you flexibility to roll back dashboard deployments if the need arises.

As an Amazon QuickSight administrator, you can use the new APIs to audit and manage access to dashboards, data sources, and datasets across your account. You can also pre-populate dashboards and datasets into individual author accounts for easier and faster onboarding. You no longer need to distribute individual credentials to data sources: you can provision centrally-managed data sources shared with all your users. You can also manage individual author access to S3, Athena, or other AWS resources on-the-fly using IAM policies through the fine-grained access control APIs. The APIs open up a lot of possibilities with Amazon QuickSight – we’ll cover a few scenarios in this blog post and then follow up with additional posts that cover more.
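
As a quick sketch of those administrative calls, you could list every dashboard in the account and then inspect who can access a specific one. MyDashboard is a placeholder ID, and $AAI holds your AWS account ID, as in the examples that follow.

aws quicksight list-dashboards --aws-account-id $AAI

aws quicksight describe-dashboard-permissions --aws-account-id $AAI --dashboard-id MyDashboard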

The Amazon QuickSight APIs can be invoked via the AWS Software Development Kit (SDK) or the AWS Command Line Interface (CLI). For our discussion, we will use the AWS CLI. To capture all the details needed for each API, this post uses the generate-cli-skeleton option available in the AWS CLI and walks you through the example of creating a Redshift data source, creating data sets, and dashboards associated with it. This option, when invoked, provides a skeleton file with all the required API details, which you can edit to the correct inputs and use for your API calls. The code below invokes the generate-cli-skeleton option and writes the details to the create-data-source-cli-input.json file; we would then modify this file and use it as the input to the create-data-source API when invoking it.

aws quicksight create-data-source --generate-cli-skeleton > create-data-source-cli-input.json

Let’s say that you want to help Alice, a new hire in your company, create an Amazon QuickSight dashboard from some customer revenue data in Redshift. To set things up for Alice, you want to use the new APIs to connect to Redshift and create a QuickSight data set within her QuickSight user account.

Creating a data source

To connect to your Redshift database, you’ll need to create a data source in QuickSight. First, you describe it with the following configuration, which covers the data source connection details, credentials, and permissions around who should be able to see the data sources.

cat create-data-source-cli-input.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSourceId": "SampleRedshiftDatasouce",
    "Name": "Sample Redshift Datasouce",
    "Type": "REDSHIFT",
    "DataSourceParameters": {
        "RedshiftParameters": {
            "Host": "redshift-cluster-1.xxxxxxxxxx.us-east-1.redshift.amazonaws.com",
            "Port": 5439,
            "Database": "dev",
            "ClusterId": "redshift-cluster-1"
        }
    },
    "Credentials": {
        "CredentialPair": {
            "Username": "xxxxxxxxx",
            "Password": "xxxxxxxxxxx"
        }
    },
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSourcePermissions",
                "quicksight:DescribeDataSource",
                "quicksight:DescribeDataSourcePermissions",
                "quicksight:PassDataSource",
                "quicksight:UpdateDataSource",
                "quicksight:DeleteDataSource"
                
            ]
        }
    ],
    "VpcConnectionProperties": {
        "VpcConnectionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:vpcConnection/VPC to Redshift"
    },
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-RedShiftDemoDataSource"
        }
    ]
}

You can then invoke the create-data-source API with this input to start creating your data source, as shown below.

aws quicksight create-data-source --cli-input-json file://./create-data-source-cli-input.json

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
    "DataSourceId": "SampleRedshiftDatasouce",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "ac9fb8fe-71d8-4005-a7c9-d66d814e224e"
}

You can validate that the data source was successfully created by calling the describe-data-source API as shown below and checking the response’s Status:

aws quicksight describe-data-source --aws-account-id $AAI --data-source-id SampleRedshiftDatasouce

{
    "Status": 200,
    "DataSource": {
        "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
        "DataSourceId": "SampleRedshiftDatasouce",
        "Name": "Sample Redshift Datasouce",
        "Type": "REDSHIFT",
        "Status": "CREATION_SUCCESSFUL",
        "CreatedTime": 1574053515.41,
        "LastUpdatedTime": 1574053516.368,
        "DataSourceParameters": {
            "RedshiftParameters": {
                "Host": "redshift-cluster-1.c6qmmwnqxxxx.us-east-1.redshift.amazonaws.com",
                "Port": 5439,
                "Database": "dev",
                "ClusterId": "redshift-cluster-1"
            }
        },
        "VpcConnectionProperties": {
            "VpcConnectionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:vpcConnection/VPC to Redshift"
        },
        "SslProperties": {
            "DisableSsl": false
        }
    },
    "RequestId": "57d2d6d6-ebbf-4e8c-82ab-c38935cae8aa"
}

Because you granted permissions to Alice as part of the call to create-data-source, this data source will now be visible to her when she logs in to QuickSight. If you didn’t grant user permissions, this data source is only visible to programmatic users with appropriate permissions.
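
If you need to grant access later, the update-data-source-permissions API can add permissions to an existing data source. The sketch below grants Alice a reasonable read-only set of actions; adjust the list to your needs.

aws quicksight update-data-source-permissions --aws-account-id $AAI --data-source-id SampleRedshiftDatasouce \
  --grant-permissions '[{"Principal":"arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice","Actions":["quicksight:DescribeDataSource","quicksight:DescribeDataSourcePermissions","quicksight:PassDataSource"]}]'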

Creating a dataset

Before Alice can build her dashboard, you'll need to create a dataset, which identifies the specific data in a data source that you want to use. To create a dataset from this data source, you call the create-data-set API, which allows you to join two tables to create the desired view of data for Alice. In the sample dataset definition below, PhysicalTableMap points to the tables in your Redshift data source that you want to join, and LogicalTableMap defines the join between them:

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSetId": "DemoDataSet1",
    "Name": "Demo Data Set 1",
    "PhysicalTableMap": {
        "DemoDataSet1-LineOrder": {
            "RelationalTable": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Schema": "public",
                "Name": "lineorder",
                "InputColumns": [
                        {
                            "Name": "lo_custkey",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_revenue",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_quantity",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_orderkey",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_shipmode",
                            "Type": "STRING"
                        },
                        {
                            "Name": "lo_orderdate",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_ordertotalprice",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_tax",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_discount",
                            "Type": "INTEGER"
                        }
                ]
	    }
            },
        "DemoDataSet1-Customer": {
            "RelationalTable": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Schema": "public",
                "Name": "customer",
                "InputColumns": [
                        {
                            "Name": "c_custkey",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "c_phone",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_address",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_name",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_city",
                            "Type": "STRING"
                        }
                ]
            }
        }
    },
    "LogicalTableMap": {
        "DemoDataSet1-LineOrder-Logical": {
            "Alias": "LineOrder",
            "Source": {
                "PhysicalTableId": "DemoDataSet1-LineOrder"
            }
        },
        "DemoDataSet1-Customer-Logical": { 
            "Alias": "Customer",
            "Source": {
                "PhysicalTableId": "DemoDataSet1-Customer"
            }
        },
        "DemoDataSet1-Intermediate": { 
            "Alias": "Intermediate",
            "DataTransforms": [
                {   
                    "ProjectOperation": {
                        "ProjectedColumns": [
                                "lo_revenue",
                                "c_name",
                                "c_city"
                        ]
                    }
                }
            ],
            "Source": {
                "JoinInstruction": {
                    "LeftOperand": "DemoDataSet1-LineOrder-Logical",
                    "RightOperand": "DemoDataSet1-Customer-Logical",
                    "Type": "INNER",
                    "OnClause": "{lo_custkey} = {c_custkey}"
                }
            }
        }

    },
    "ImportMode": "DIRECT_QUERY",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDataSet1"
        }
    ]
}
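
Assuming you saved this configuration to a file named create-data-set-cli-input.json (the file name is an arbitrary choice), you create the dataset the same way you created the data source, and can verify it afterwards with describe-data-set:

aws quicksight create-data-set --cli-input-json file://./create-data-set-cli-input.json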

If you wanted to create a similar dataset with a SQL statement instead of specifying all the columns, call the create-data-set API specifying your SQL statement in PhysicalTableMap as shown below:

aws quicksight create-data-set --generate-cli-skeleton >create-data-set-cli-input-sql.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSetId": "DemoDataSet2",
    "Name": "Demo Data Set 2",
    "PhysicalTableMap": {
        "DemoDataSet2-CustomerSales": {
            "CustomSql": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Name": "Customer Sales",
                "SqlQuery":"SELECT lineorder.lo_revenue, customer.c_name, customer.c_city FROM public.lineorder INNER JOIN public.customer ON lineorder.lo_custkey = customer.c_custkey",
                    "Columns": [
                        {
                            "Name": "lo_revenue",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "c_name",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_city",
                            "Type": "STRING"
                        }
                    ]
 
	    }
        }
    },
    "LogicalTableMap": {
        "DemoDataSet2-CustomerSales-Logical": {
            "Alias": "Customer Sales",
            "Source": {
                "PhysicalTableId": "DemoDataSet2-CustomerSales"
            }
        }

    },
    "ImportMode": "DIRECT_QUERY",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDataSet2"
        }
    ]
}

Because you granted permissions to Alice, these data sets will be visible to her under Your Data Sets when she logs into QuickSight.

This dataset can be used like any other dataset in Amazon QuickSight. Alice can log in to the QuickSight UI and create an analysis that utilizes this dataset, as shown below:

Creating a template

Once Alice creates her dashboard and it gets popular with her team, the next ask might be to provide the same view to 100 other teams within the company, using different datasets for each team. Without APIs, this ask would be a challenge. However, with the APIs, you can now define a template from Alice's analysis and then create dashboards as needed. A template is created by pointing to a source analysis in QuickSight, which creates references for all datasets used, so that new datasets can be referenced when using the template.

The following configuration describes a template created from Alice’s analysis:

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "TemplateId": "DemoDashboardTemplate",
    "Name": "Demo Dashboard Template",
    "SourceEntity": {
        "SourceAnalysis": {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:analysis/7975f7aa-261c-4e7c-b430-305d71e07a8e",
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "DS1",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet1"
                }
            ]
        }
    },
    "VersionDescription": "1"
}

Templates are not visible within the QuickSight UI and are a developer/admin managed asset that is only accessible via the APIs at this point. You can save the template configuration by calling the create-template API to start creation of your template:

aws quicksight create-template --cli-input-json file://./create-template-cli-input.json

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate",
    "VersionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/version/1",
    "TemplateId": "DemoDashboardTemplate",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "aa189aee-8ab5-4358-9bba-8f4e32450aef"
}

To confirm the template was created, call the describe-template API as shown below:

aws quicksight describe-template --aws-account-id $AAI --template-id DemoDashboardTemplate

{
    "Status": 200,
    "Template": {
        "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate",
        "Name": "Demo Dashboard Template",
        "Version": {
            "CreatedTime": 1574137165.136,
            "VersionNumber": 1,
            "Status": "CREATION_SUCCESSFUL",
            "DataSetConfigurations": [
                {
                    "Placeholder": "DS1",
                    "DataSetSchema": {
                        "ColumnSchemaList": [
                            {
                                "Name": "c_city",
                                "DataType": "STRING"
                            },
                            {
                                "Name": "lo_revenue",
                                "DataType": "INTEGER"
                            }
                        ]
                    },
                    "ColumnGroupSchemaList": []
                }
            ],
            "Description": "1",
            "SourceEntityArn": "arn:aws:quicksight:us-east-1: xxxxxxxxxxxx:analysis/7975f7aa-261c-4e7c-b430-305d71e07a8e"
        },
        "TemplateId": "DemoDashboardTemplate",
        "LastUpdatedTime": 1574137165.104,
        "CreatedTime": 1574137165.104
    }
}

Creating dashboards

You can then use this template to create a dashboard that is connected to the second, similar, dataset you created earlier. When calling the create-dashboard API, you define a SourceEntity (in this case, the template we just created), permissions, and any dashboard publish options, as shown in the following create-dashboard-cli-input.json configuration.

create-dashboard-cli-input.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DashboardId": "DemoDashboard1",
    "Name": "Demo Dashboard 1",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:UpdateDashboardPermissions",
                "quicksight:QueryDashboard",
                "quicksight:UpdateDashboard",
                "quicksight:DeleteDashboard",
                "quicksight:DescribeDashboardPermissions",
                "quicksight:UpdateDashboardPublishedVersion"
            ]
        }
    ],
    "SourceEntity": {
        "SourceTemplate": {
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "DS1",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet2"
                }
            ],
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate"
        }
    },
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDashboard"
        }
    ],
    "VersionDescription": "1",
    "DashboardPublishOptions": {
        "AdHocFilteringOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "ExportToCSVOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "SheetControlsOption": {
            "VisibilityState": "EXPANDED"
        }
    }
}
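
You can then create the dashboard from this configuration, following the same pattern as the earlier create-* calls:

aws quicksight create-dashboard --cli-input-json file://./create-dashboard-cli-input.json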

You can confirm you created the dashboard correctly with the describe-dashboard API. See the following code:

aws quicksight describe-dashboard --aws-account-id $AAI --dashboard-id DemoDashboard1

{
    "Status": 200,
    "Dashboard": {
        "DashboardId": "DemoDashboard1",
        "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dashboard/DemoDashboard1",
        "Name": "Demo Dashboard 1",
        "Version": {
            "CreatedTime": 1574138252.449,
            "Errors": [],
            "VersionNumber": 1,
            "Status": "CREATION_SUCCESSFUL",
            "SourceEntityArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/version/1",
            "Description": "1"
        },
        "CreatedTime": 1574138252.395,
        "LastPublishedTime": 1574138252.395,
        "LastUpdatedTime": 1574138252.449
    },
    "RequestId": "62945539-bb63-4faf-b897-8e84ea5644ae"
}

Because you granted permissions to Alice during creation, this dashboard will be visible to her under All dashboards when she logs into QuickSight as shown below:

Configuring SPICE datasets        

Alice’s datasets are direct query datasets and may cause higher load on your Redshift cluster during peak times, which might also cause Alice and her customers to wait longer for their dashboards to load. With SPICE, Amazon QuickSight’s in-memory data store, you can offload user traffic from your database, while also achieving high concurrency limits and fast, responsive performance for your end-users. The following configuration redefines our second dataset and sets ImportMode to SPICE as highlighted below:

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSetId": "DemoDataSet3",
    "Name": "Demo Data Set 3",
    "PhysicalTableMap": {
        "DemoDataSet2-CustomerSales": {
            "CustomSql": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Name": "Customer Sales",
                "SqlQuery":"SELECT lineorder.lo_revenue, customer.c_name, customer.c_city FROM public.lineorder INNER JOIN public.customer ON lineorder.lo_custkey = customer.c_custkey",
                    "Columns": [
                        {
                            "Name": "lo_revenue",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "c_name",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_city",
                            "Type": "STRING"
                        }
                    ]
 
	    }
        }
    },
    "LogicalTableMap": {
        "DemoDataSet2-CustomerSales-Logical": {
            "Alias": "Customer Sales",
            "Source": {
                "PhysicalTableId": "DemoDataSet2-CustomerSales"
            }
        }

    },
    "ImportMode": "SPICE",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDataSet2"
        }
    ]
}

You can call the create-data-set API with that configuration to create the dataset in SPICE mode:

aws quicksight create-data-set --aws-account-id $AAI --cli-input-json file://./create-data-set-cli-input-sql-spice.json

{
    "Status": 201,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3",
    "DataSetId": "DemoDataSet3",
    "IngestionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/589eaf9d-0e21-4572-9ebe-f61b129886d9",
    "IngestionId": "589eaf9d-0e21-4572-9ebe-f61b129886d9",
    "RequestId": "7a442521-ab8f-448f-8668-1f0d6823f987"
}

When creating datasets in SPICE mode, you’ll consume SPICE capacity. The describe-data-set API provides details on the SPICE capacity consumed. See the relevant bits from the describe-data-set response highlighted in the code snippet below:

aws quicksight describe-data-set --aws-account-id $AAI --data-set-id DemoDataSet3|tail -5

        "ImportMode": "SPICE",
        "ConsumedSpiceCapacityInBytes": 107027200
    },
    "RequestId": "b0175568-94ed-40e4-85cc-b382979ca377"
}

For SPICE datasets, the APIs also provide the option of setting up ingestions or listing past ingestions. Listing past ingestions allows you to identify ingestion status and when the data in the dataset was last updated. To list past ingestions, call the list-ingestions API for your dataset:

aws quicksight list-ingestions --aws-account-id $AAI --data-set-id DemoDataSet3

{
    "Status": 200,
    "Ingestions": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionId": "589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionStatus": "COMPLETED",
            "ErrorInfo": {},
            "RowInfo": {
                "RowsIngested": 1126608,
                "RowsDropped": 0
            },
            "CreatedTime": 1574138852.564,
            "IngestionTimeInSeconds": 24,
            "IngestionSizeInBytes": 107027200,
            "RequestSource": "MANUAL",
            "RequestType": "INITIAL_INGESTION"
        }
    ],
    "RequestId": "ee1d9a6f-a290-418a-a526-8906f4689776"
}

Set up ingestions to update the data in your dataset from your data source. As you may have new data in your Redshift cluster, you can use the create-ingestion API for your dataset to trigger a SPICE refresh:

aws quicksight create-ingestion --aws-account-id $AAI --data-set-id DemoDataSet3 --ingestion-id DemoDataSet3-Ingestion2

{
    "Status": 201,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/DemoDataSet3-Ingestion2",
    "IngestionId": "DemoDataSet3-Ingestion2",
    "IngestionStatus": "INITIALIZED",
    "RequestId": "fc1f7eea-1327-41d6-9af7-c12f097ed343"
}

Listing ingestion history again shows the new ingestion in the state RUNNING. See the following code:

aws quicksight list-ingestions --aws-account-id $AAI --data-set-id DemoDataSet3

{
    "Status": 200,
    "Ingestions": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/DemoDataSet3-Ingestion2",
            "IngestionId": "DemoDataSet3-Ingestion2",
            "IngestionStatus": "RUNNING",
            "RowInfo": {
                "RowsIngested": 0
            },
            "CreatedTime": 1574139332.876,
            "RequestSource": "MANUAL",
            "RequestType": "FULL_REFRESH"
        },
        {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionId": "589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionStatus": "COMPLETED",
            "ErrorInfo": {},
            "RowInfo": {
                "RowsIngested": 1126608,
                "RowsDropped": 0
            },
            "CreatedTime": 1574138852.564,
            "IngestionTimeInSeconds": 24,
            "IngestionSizeInBytes": 107027200,
            "RequestSource": "MANUAL",
            "RequestType": "INITIAL_INGESTION"
        }
    ],
    "RequestId": "322c165e-fb90-45bf-8fa5-246d3034a08e"
}
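
If you prefer to poll a single refresh rather than list them all, the describe-ingestion API returns the status and row counts for one ingestion by ID, which is convenient when a scheduler needs to wait for a refresh before starting downstream work:

aws quicksight describe-ingestion --aws-account-id $AAI --data-set-id DemoDataSet3 --ingestion-id DemoDataSet3-Ingestion2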

Versioning and aliasing

As you may have noticed, the template and dashboard APIs support versioning. When you create a template or dashboard, version 1 of that entity is created, and each update of a template or dashboard results in a new version of that entity with an incremented version number. Versioning can be useful when you want to republish an old version of a dashboard or ensure you’re building off a known version of the template. For example, if you published your dashboard multiple times, but want to republish the first version, you can call the update-dashboard-published-version API as follows:

aws quicksight update-dashboard-published-version --aws-account-id $AAI --dashboard-id DemoDashboard1 --dashboard-version-number 1

{
   "DashboardArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dashboard/DemoDashboard1",
   "DashboardId": "DemoDashboard1",
   "RequestId": "c97cdbff-d954-4f5b-a887-28659ad9cc9a"
}

Versioning becomes even more powerful with aliases. A template or dashboard alias is a pointer to a specific template or dashboard version. Aliases are helpful when you want to point to a logical version of a resource rather than an explicit version number that needs to be manually updated. This feature can be helpful for development workflows where you want to make changes to a template and only want production code to use an approved template version. For example, you can create a ‘PROD’ template alias for your approved production template and point it to version 1, as shown below:

aws quicksight create-template-alias --aws-account-id $AAI --template-id DemoDashboardTemplate --template-version-number 1 --alias-name PROD

{
   "RequestId": "c112a48c-14da-439e-b4d8-538b60c1f188",
   "TemplateAlias": {
      "AliasName": "PROD",
      "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/alias/PROD",
      "TemplateVersionNumber": 1
   }
}

When you create a dashboard from this template, you can specify the SourceEntity as the ‘PROD’ template alias to always publish the dashboard from the approved production template version. Note how the template Arn contains the template alias name in the call to the create-dashboard API below:

aws quicksight create-dashboard --generate-cli-skeleton > create-dashboard-cli-input.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DashboardId": "DemoDashboard2",
    "Name": "Demo Dashboard 2",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:UpdateDashboardPermissions",
                "quicksight:QueryDashboard",
                "quicksight:UpdateDashboard",
                "quicksight:DeleteDashboard",
                "quicksight:DescribeDashboardPermissions",
                "quicksight:UpdateDashboardPublishedVersion"
            ]
        }
    ],
    "SourceEntity": {
        "SourceTemplate": {
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "DS1",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet2"
                }
            ],
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/alias/PROD"
        }
    },
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDashboard"
        }
    ],
    "VersionDescription": "1",
    "DashboardPublishOptions": {
        "AdHocFilteringOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "ExportToCSVOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "SheetControlsOption": {
            "VisibilityState": "EXPANDED"
        }
    }
}

If you were to create new versions of the template, you could then update the ‘PROD’ template alias to point to the new template version, without touching your dashboard creation code from above. For example, if you wanted to set template version 3 to be your approved production version, you can call the update-template-alias API as shown below:

aws quicksight update-template-alias --aws-account-id $AAI --template-id DemoDashboardTemplate --template-version-number 3 --alias-name PROD

{
   "RequestId": "5ac9ff9e-28c1-4b61-aeab-5949986f5b2b",
   "TemplateAlias": {
      "AliasName": "PROD",
      "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/alias/PROD",
      "TemplateVersionNumber": 3
   }
}

Keep an eye out for a future blog post that will detail more examples of using versioning and aliases for large scale deployments!

Amazon QuickSight’s data source, dataset, dashboard, and SPICE APIs are available in both Standard and Enterprise editions, with edition-specific support for functionality. Fine-grained access control APIs and template support for dashboard APIs are available in Amazon QuickSight Enterprise edition. APIs for Amazon QuickSight are part of the AWS SDK and are available in a variety of programming languages. For more information, see API Reference and Tools to Build on AWS.

Conclusion

With QuickSight’s new themes, APIs, and additional customization options underway, there are many creative ways to author the next chapter in your analytics story. To be a part of future API previews and connect with us, sign up for the embedding early access program.

About the Author

Jose Kunnackal John is a principal product manager for Amazon QuickSight.

Using Spatial Data with Amazon Redshift

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/using-spatial-data-with-amazon-redshift/

Today, Amazon Redshift announced support for a new native data type called GEOMETRY. This new type enables ingestion, storage, and queries against two-dimensional geographic data, together with the ability to apply spatial functions to that data. Geographic data (also known as georeferenced data) refers to data that has some association with a location relative to Earth. Coordinates, elevation, addresses, city names, zip (or postal) codes, administrative and socioeconomic boundaries are all examples of geographic data.

The GEOMETRY type enables us to easily work with coordinates such as latitude and longitude in our table columns, which can then be converted or combined with other types of geographic data using spatial functions. The type is abstract, meaning it cannot be directly instantiated, and polymorphic. The actual types supported for this data (and which will be used in table columns) are points, linestrings, polygons, multipoints, multilinestrings, multipolygons, and geometry collections. In addition to creating GEOMETRY-typed data columns in tables, the new support also enables ingestion of geographic data from delimited text files using the existing COPY command. The data in the files is expected to be in hexadecimal Extended Well-Known Binary (EWKB) format, a standard for representing geographic data.
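
As a minimal sketch (the table, bucket, and IAM role names are illustrative, not the ones used in the walkthrough below), a GEOMETRY column is declared like any other column, and COPY loads it from delimited text whose geometry values are EWKB hex strings:

CREATE TABLE public.attractions_demo (
  name     VARCHAR(100),
  location GEOMETRY
);

COPY public.attractions_demo
FROM 's3://my-spatial-bucket/berlin/attractions.csv'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/MyRedshiftCopyRole'
DELIMITER ';'
IGNOREHEADER 1;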

To show the new type in action I imagined a scenario where I am working as a personal tour coordinator based in Berlin, Germany, and my client has supplied me with a list of attractions that they want to visit. My task is to locate accommodation for this client that is reasonably central to the set of attractions, and within a certain budget. Geographic data is ideal for solving this scenario. Firstly, the set of points representing the attractions combine to form one or more polygons which I can use to restrict my search for accommodation. In a single query I can then join the data representing those polygons with data representing a set of accommodations to arrive at the results. This spatial query is actually quite expensive in CPU terms, yet Redshift is able to execute it in less than one second.

Sample Scenario Data
To show my scenario in action I needed to first source various geographic data related to Berlin. Firstly I obtained the addresses, and latitude/longitude coordinates, of a variety of attractions in the city using several ‘top X things to see’ travel websites. For accommodation I used Airbnb data, licensed under the Creative Commons 1.0 Universal “Public Domain Dedication” from http://insideairbnb.com/get-the-data.html. To this I then added zip code data for the city, licensed under Creative Commons Attribution 3.0 Germany (CC BY 3.0 DE). The provider for this data is Amt für Statistik Berlin-Brandenburg.

Any good tour coordinator would of course have a web site or application with an interactive map to show clients the locations of the accommodation that matched their criteria. In real life, I’m not a tour coordinator (outside of my family!), so for this post I’m going to focus solely on the back-end processes – the loading of the data, and the eventual query to satisfy our client’s request using the Redshift console.

Creating a Redshift Cluster
My first task is to load the various sample data sources into database tables in a Redshift cluster. To do this I go to the Redshift console dashboard and select Create cluster. This starts a wizard that walks me through the process of setting up a new cluster, starting with the type and number of nodes that I want to create.

In Cluster details I fill out a name for my new cluster, set a password for the master user, and select an AWS Identity and Access Management (IAM) role that will give permission for Redshift to access one of my buckets in Amazon Simple Storage Service (S3) in read-only mode when I come to load my sample data later. The new cluster will be created in my default Amazon Virtual Private Cloud for the region, and I also opt to use the defaults for node types and number of nodes. You can read more about available options for creating clusters in the Management Guide. Finally I click Create cluster to start the process, which will take just a few minutes.
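
If you prefer to script cluster creation instead of using the console wizard, a minimal boto3 sketch might look like the following. The cluster identifier, node type, password, and IAM role ARN are illustrative assumptions, not values from this walkthrough:

import boto3

redshift = boto3.client("redshift")

# Create a small two-node cluster and attach an IAM role that allows
# read-only access to the S3 bucket holding the sample data.
redshift.create_cluster(
    ClusterIdentifier="redshift-spatial-demo",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe1234",
    IamRoles=["arn:aws:iam::123456789012:role/RedshiftDemoRole"],
)

# Wait until the cluster is available before connecting with the query editor.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="redshift-spatial-demo")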

Loading the Sample Data
With the cluster ready to use I can load the sample data into my database, so I head to the Query editor and, using the pop-up, connect to my default database for the cluster.

My sample data is sourced from delimited text files that I’ve uploaded as private objects to an S3 bucket, and I will load it into three tables. The first, accommodations, will hold the Airbnb data. The second, zipcode, will hold the zip or postal codes for the city. The final table, attractions, will hold the coordinates of the city attractions that my client can choose from. To create and load the accommodations data I paste the following statements into tabs in the query editor, one at a time, and run them. Note that schemas in databases have access control semantics; the public prefix shown on the table names below simply means I am referencing the public schema, accessible to all users, for the database in use.

To create the accommodations table I use:

CREATE TABLE public.accommodations (
  id INTEGER PRIMARY KEY,
  shape GEOMETRY,
  name VARCHAR(100),
  host_name VARCHAR(100),
  neighbourhood_group VARCHAR(100),
  neighbourhood VARCHAR(100),
  room_type VARCHAR(100),
  price SMALLINT,
  minimum_nights SMALLINT,
  number_of_reviews SMALLINT,
  last_review DATE,
  reviews_per_month NUMERIC(8,2),
  calculated_host_listings_count SMALLINT,
  availability_365 SMALLINT
);

To load the data from S3:

COPY public.accommodations
FROM 's3://my-bucket-name/redshift-gis/accommodations.csv'
DELIMITER ';'
IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole';

Next, I repeat the process for the zipcode table.

CREATE TABLE public.zipcode (
  ogc_field INTEGER,
  wkb_geometry GEOMETRY,
  gml_id VARCHAR,
  spatial_name VARCHAR,
  spatial_alias VARCHAR,
  spatial_type VARCHAR
);
COPY public.zipcode
FROM 's3://my-bucket-name/redshift-gis/zipcode.csv'
DELIMITER ';'
IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole';

And finally I create the attractions table and load data into it.

CREATE TABLE public.berlin_attractions (
  name VARCHAR,
  address VARCHAR,
  lat DOUBLE PRECISION,
  lon DOUBLE PRECISION,
  gps_lat VARCHAR,
  gps_lon VARCHAR
);
COPY public.berlin_attractions
FROM 's3://my-bucket-name/redshift-gis/berlin-attraction-coordinates.txt'
DELIMITER '|'
IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole';

Finding Somewhere to Stay!
With the data loaded, I can now put on my travel coordinator hat and select some properties for my client to consider for their stay in Berlin! Remember, in the real world this would likely be surfaced from a web or other application for the client. I’m simply going to make use of the query editor again.

My client has decided they want a trip to focus on the city museums and they have a budget of 200 EUR per night for accommodation. Opening a new tab in the editor, I paste in and run the following query.

WITH museums(name,loc) AS
(SELECT name, ST_SetSRID(ST_Point(lon,lat),4326) FROM public.berlin_attractions
WHERE name LIKE '%Museum%')
SELECT a.name,a.price,avg(ST_DistanceSphere(m.loc,a.shape)) AS avg_distance
FROM museums m,public.accommodations a
WHERE a.price <= 200 GROUP BY a.name,a.price ORDER BY avg_distance
LIMIT 10;

The query finds the accommodation(s) that are “best located” to visit all the museums, and whose price is within the client’s budget. Here “best located” is defined as having the smallest average distance from all the selected museums. In the query you can see some of the available spatial functions: ST_SetSRID and ST_Point construct GEOMETRY points from the latitude and longitude columns of the attractions, and ST_DistanceSphere determines the distance between each of those points and an accommodation’s shape column.

This yields the following results.

Wrap a web or native application front-end around this and we have a new geographic data-based application that we can use to delight clients who have an idea of what they want to see in the city and also want convenient and in-budget accommodation best placed to enable that!

Let’s also consider another scenario. Imagine I have a client who wants to stay in the center of Berlin but isn’t sure what attractions or accommodations are present in the central district, and has a budget of 150 EUR per night. How can we answer that question? First we need the coordinates of what we might consider to be the center of Berlin – latitude 52.516667, longitude 13.388889. Using the zipcode table we can convert this coordinate location to a polygon enclosing that region of the city. Our query must then get all attractions within that polygon, plus all accommodations (within budget), ordered by average distance from the attractions. Here’s the query:

WITH center(geom) AS
(SELECT wkb_geometry FROM zipcode
  WHERE ST_Within(ST_SetSRID(ST_Point(13.388889, 52.516667), 4326), wkb_geometry)),
pois(name,loc) AS
(SELECT name, ST_SetSRID(ST_Point(lon,lat),4326) FROM public.berlin_attractions,center
  WHERE ST_Within(ST_SetSRID(ST_Point(lon,lat),4326), center.geom))
SELECT a.name,a.price,avg(ST_DistanceSphere(p.loc,a.shape))
  AS avg_distance, LISTAGG(p.name, ';') as pois
FROM pois p,public.accommodations a
WHERE a.price <= 150 GROUP BY a.name,a.price ORDER BY avg_distance
LIMIT 10;

When I run this in the query editor, I get the following results. You can see the list of attractions in the area represented by the zipcode in the pois column.

So there we have some scenarios for making use of geographic data in Amazon Redshift using the new GEOMETRY type and associated spatial functions, and I’m sure there are many more! The new type and functions are available now in all AWS Regions to all customers at no additional cost.

— Steve

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/provisioning-the-intuit-data-lake-with-amazon-emr-amazon-sagemaker-and-aws-service-catalog/

This post shares Intuit’s learnings and recommendations for running a data lake on AWS. The Intuit Data Lake is built and operated by numerous teams in Intuit Data Platform. Thanks to Tristan Baker (Chief Architect), Neil Lamka (Principal Product Manager), Achal Kumar (Development Manager), Nicholas Audo, and Jimmy Armitage for their feedback and support.

A data lake is a centralized repository for storing structured and unstructured data at any scale. At Intuit, creating such a pile of raw data is easy. However, more interesting challenges present themselves:

  1. How should AWS accounts be organized?
  2. What ingestion methods will be used? How will analysts find the data they need?
  3. Where should data be stored? How should access be managed?
  4. What security measures are needed to protect Intuit’s sensitive data?
  5. Which parts of this ecosystem can be automated?

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation).

We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Architecture

Account Structure

Data lakes typically follow a hub-and-spoke model, with the hub account containing shared services that control access to data sources. For the purposes of this post, we’ll refer to the hub account as Central Data Lake.

In this pattern, access to Central Data Lake is apportioned to spoke accounts called Processing Accounts. This model maintains separation between end users and allows for division of billing among distinct business units.

 

 

It is common to maintain two ecosystems: pre-production (Pre-Prod) and production (Prod). This allows data lake administrators to silo access to data by preventing connectivity between Pre-Prod and Prod.

To enable experimentation and testing, it may also be advisable to maintain separate VPC-based environments within Pre-Prod accounts, such as dev, qa, and e2e. Processing Account VPCs would then be connected to the corresponding VPC in Central Data Lake.

Note that at first, we connected accounts via VPC Peering. However, as we scaled, we quickly approached the hard limit of 125 VPC peering connections, requiring us to migrate to AWS Transit Gateway. As of this writing, we connect multiple new Processing Accounts weekly.

 

 

Central Data Lake

There may be numerous services running in a hub account, but we’ll focus on the aspects that are most relevant to this blog: ingestion, sanitization, storage, and a data catalog.

 

 

Ingestion, Sanitization, and Storage

A key component to Central Data Lake is a uniform ingestion pattern for streaming data. One example is an Apache Kafka cluster running on Amazon EC2. (You can read about how Intuit engineers do this in another AWS blog.) As we deal with hundreds of data sources, we’ve enabled access to ingestion mechanisms via AWS PrivateLink.

Note: Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an alternative for running Apache Kafka on Amazon EC2, but was not available at the start of Intuit’s migration.

In addition to stream processing, another method of ingestion is batch processing, such as jobs running on Amazon EMR. After data is ingested by one of these methods, it can be stored in Amazon S3 for further processing and analysis.

Intuit deals with a large volume of customer data, and each field is carefully considered and classified with a sensitivity level. All sensitive data that enters the lake is encrypted at the source. The ingestion systems retrieve the encrypted data and move it into the lake. Before it is written to S3, the data is sanitized by a proprietary RESTful service. Analysts and engineers operating within the data lake consume this masked data.

Data Catalog

A data catalog is a common way to give end users information about the data and where it lives. One example is a Hive Metastore backed by Amazon Aurora. Another alternative is the AWS Glue Data Catalog.
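
As a rough illustration of the Glue Data Catalog option (not part of Intuit's described setup), registering a database and a table over Parquet data in S3 takes only a couple of API calls. The names, columns, and S3 location below are placeholders:

import boto3

glue = boto3.client("glue")

# Register a database and a simple table pointing at Parquet data in S3.
glue.create_database(DatabaseInput={"Name": "data_lake_demo"})

glue.create_table(
    DatabaseName="data_lake_demo",
    TableInput={
        "Name": "events",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "event_id", "Type": "string"},
                {"Name": "event_time", "Type": "timestamp"},
            ],
            "Location": "s3://my-data-lake-bucket/events/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)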

Processing Accounts

When Processing Accounts are delivered to end users, they include an identical set of resources. We’ll discuss the automation of Processing Accounts below, but the primary components are as follows:

 

 

                           Processing Account structure upon delivery to the customer

 

Data Storage Mechanisms

One reasonable question is whether all data should reside in Central Data Lake, or if it’s acceptable to distribute data across multiple accounts. A data lake might employ a combination of the two approaches, and classify data locations as primary or secondary.

The primary location for data is Central Data Lake, and it arrives there via the ingestion pipelines discussed previously. Processing Accounts can read from the primary source, either directly from the ingestion pipelines or from S3. Processing Accounts can contribute their transformed data back into Central Data Lake (primary), or store it in their own accounts (secondary). The proper storage location depends on the type of data, and who needs to consume it.

One rule worth enforcing is that no cross-account writes should be permitted. In other words, the IAM principal (in most cases, an IAM role assumed by EC2 via an instance profile) must be in the same account as the destination S3 bucket. This is because cross-account delegation is not supported—specifically, S3 bucket policies in Central Data Lake cannot grant Processing Account A access to objects written by a role in Processing Account B.

Another possibility is for EMR to assume different IAM roles via a custom credentials provider (see this AWS blog), but we chose not to go down this path at Intuit because it would have required many EMR jobs to be rewritten.

 

 

Data Access Patterns

The majority of end users are interested in the data that resides in S3. In Central Data Lake and some Processing Accounts, there may be a set of read-only S3 buckets: any account in the data lake ecosystem can read data from this type of bucket.

To facilitate management of S3 access for read-only buckets, we built a mechanism to control S3 bucket policies, administered entirely via code. Our deployment pipelines use account metadata to dynamically generate the correct S3 bucket policy based on the type of account (Pre-Prod or Prod). These policies are committed back into our code repository for auditability and ease of management.

We employ the same method for managing KMS key policies, as we use KMS with customer managed customer master keys (CMKs) for at-rest encryption in S3.

Here’s an example of a generated S3 bucket policy for a read-only bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProcessingAccountReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root",
                    "arn:aws:iam::444444444444:root",
                    "arn:aws:iam::555555555555:root",
                    ...
                    ...
                    ...
                    "arn:aws:iam::999999999999:root",
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::intuit-data-lake-example/*",
                "arn:aws:s3:::intuit-data-lake-example"
            ]
        }
    ]
}
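
As a sketch of how a policy like this can be rendered from account metadata, the helper below generates the same structure from a list of account IDs. It is a simplified, hypothetical version of such a generator, not Intuit's actual tooling:

import json

def build_read_only_policy(bucket_name, account_ids):
    # Render a read-only S3 bucket policy for a list of data lake accounts.
    # In a real pipeline, account_ids would come from per-environment metadata.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProcessingAccountReadOnly",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in account_ids]
                },
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }

# Example: generate and print the policy for two Pre-Prod accounts.
print(json.dumps(
    build_read_only_policy("intuit-data-lake-example",
                           ["111111111111", "222222222222"]),
    indent=4))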

Note that we grant access at the account level, rather than using explicit IAM principal ARNs. Because the reads are cross-account, permissions are also required on the IAM principals in Processing Accounts. Maintaining these policies—with automation, at that level of granularity—is untenable at scale. Furthermore, using specific IAM principal ARNs would create an external dependency on foreign accounts. For example, if a Processing Account deletes an IAM role that is referenced in an S3 bucket policy in Central Data Lake, the bucket policy can no longer be saved, causing interruptions to deployment pipelines.

Security

Security is mission critical for any data lake. We’ll mention a subset of the controls we use, but not dive deep.

Encryption

Encryption can be enforced both in transit and at rest, using multiple methods:

  1. Traffic within the lake should use the latest version of TLS (1.2 as of this writing)
  2. Data can be encrypted with application-level (client-side) encryption
  3. KMS keys can be used for at-rest encryption of S3, EBS, and RDS

Ingress and Egress

There’s nothing out of the ordinary in our approach to ingress and egress, but it’s worth calling out the standard pattern we’ve found most important:

Policies restricting ingress and egress are the primary points at which a data lake can guarantee quality (ingress) and prevent loss (egress).

Authorization

Access to the Intuit Data Lake is controlled via IAM roles, meaning no IAM users (with long-term credentials) are created. End users are granted access via an internal service that manages role-based, federated access to AWS accounts. Regular reviews are conducted to remove nonessential users.

Configuration Management

We use an internal fork of Cloud Custodian, which is a suite of preventative, detective, and responsive controls consisting of Amazon CloudWatch Events and AWS Config rules. Some of the violations it reports and (optionally) mitigates include:

  • Unauthorized CIDRs in inbound security group rules
  • Public S3 bucket policies and ACLs
  • IAM user console access
  • Unencrypted S3 buckets, EBS volumes, and RDS instances

Lastly, Amazon GuardDuty is enabled in all Intuit Data Lake accounts and is monitored by Intuit Security.

Automation

If there is one thing we’ve learned building the Intuit Data Lake, it is to automate everything.

There are four areas of automation we’ll discuss in this blog:

  1. Creation of Processing Accounts
  2. Processing Account Orchestration Pipeline
  3. Processing Account Terraform Pipeline
  4. EMR and SageMaker deployment via Service Catalog

Creation of Processing Accounts

The first step in creating a Processing Account is to make a request through an internal tool. This triggers automation that provisions an Intuit-stamped AWS account under the correct business unit.

 

Note: AWS Control Tower’s Account Factory was not available at the start of our journey, but it can be leveraged to provision new AWS accounts in a secured, best practice, self-service way.

Account setup also includes VPC creation (with optional VPN), fully automated using Service Catalog. End users simply specify subnet sizes.

It’s worth noting that Intuit leverages Service Catalog for self-service deployment of other common patterns, including ingress security groups, VPC endpoints, and VPC peering. Here’s an example portfolio:

Processing Account Orchestration Pipeline

After account creation and VPC provisioning, the Processing Account Orchestration Pipeline runs. This pipeline executes one-time tasks required for Processing Accounts. These tasks include:

  • Bootstrapping an IAM role for use in further configuration management
  • Creation of KMS keys for S3, EBS, and RDS encryption
  • Creation of variable files for the new account
  • Updating the master configuration file with account metadata
  • Generation of scripts to orchestrate the Terraform pipeline discussed below
  • Sharing Transit Gateways via Resource Access Manager

Processing Account Terraform Pipeline

This pipeline manages the lifecycle of dynamic, frequently updated resources, including IAM roles, S3 buckets and bucket policies, KMS key policies, security groups, NACLs, and bastion hosts.

There is one pipeline for every Processing Account, and each pipeline deploys a series of layers into the account, using a set of parameterized deployment jobs. A layer is a logical grouping of Terraform modules and AWS resources, providing a way to shrink Terraform state files and reduce blast radius if redeployment of specific resources is required.

EMR and SageMaker Deployment via Service Catalog

AWS Service Catalog facilitates the provisioning of Amazon EMR and Amazon SageMaker, allowing end users to launch EMR clusters and SageMaker notebook instances that work out of the box, with embedded security.

Service Catalog allows data scientists and data engineers to launch EMR clusters in a self-service fashion with user-friendly parameters, and provides them with the following:

  • Bootstrap action to enable connectivity to services in Central Data Lake
  • EC2 instance profile to control S3, KMS, and other granular permissions
  • Security configuration that enables at-rest and in-transit encryption
  • Configuration classifications for optimal EMR performance
  • Encrypted AMI with monitoring and logging enabled
  • Custom Kerberos connection to LDAP

For SageMaker, we use Service Catalog to launch notebook instances with custom lifecycle configurations that set up connections or initialize the following: Hive Metastore, Kerberos, security, Splunk logging, and OpenDNS. You can read more about lifecycle configurations in this AWS blog. Launching a SageMaker notebook instance with best-practice configuration is as easy as follows:
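
Outside the console, the same Service Catalog products can also be launched with the AWS SDK. The following is a minimal boto3 sketch in which the product ID, provisioning artifact ID, and parameter keys are placeholder assumptions taken from a hypothetical portfolio:

import boto3

servicecatalog = boto3.client("servicecatalog")

# Launch a Service Catalog product (for example, a SageMaker notebook product)
# programmatically. All identifiers and parameter keys are illustrative only.
servicecatalog.provision_product(
    ProductId="prod-examplesagemaker1",
    ProvisioningArtifactId="pa-exampleversion1",
    ProvisionedProductName="team-a-notebook",
    ProvisioningParameters=[
        {"Key": "InstanceType", "Value": "ml.t3.medium"},
        {"Key": "SubnetId", "Value": "subnet-0123456789abcdef0"},
    ],
)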

 

 

Conclusion

This post illustrates the building blocks we used in creating the Intuit Data Lake. Our solution isn’t wholly unique, but is composed of common-sense approaches we’ve gleaned from dozens of engineers across Intuit, representing decades of experience. These practices have enabled us to push petabytes of data into the lake, and serve hundreds of Processing Accounts with varying needs. We are still building, but we hope our story helps you in your data lake journey.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

 

 

 

 

Ben Covi is a staff software engineer at Intuit. At any given moment, he’s probably losing a game of Catan.

 

 

 

New – Using Step Functions to Orchestrate Amazon EMR workloads

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-step-functions-to-orchestrate-amazon-emr-workloads/

AWS Step Functions allows you to add serverless workflow automation to your applications. The steps of your workflow can run anywhere, including in AWS Lambda functions, on Amazon Elastic Compute Cloud (EC2), or on-premises. To simplify building workflows, Step Functions is directly integrated with multiple AWS Services: Amazon ECS, AWS Fargate, Amazon DynamoDB, Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), AWS Batch, AWS Glue, Amazon SageMaker, and (to run nested workflows) with Step Functions itself.

Starting today, Step Functions connects to Amazon EMR, enabling you to create data processing and analysis workflows with minimal code, saving time, and optimizing cluster utilization. For example, building data processing pipelines for machine learning is time consuming and hard. With this new integration, you have a simple way to orchestrate workflow capabilities, including parallel executions and dependencies from the result of a previous step, and handle failures and exceptions when running data processing jobs.

Specifically, a Step Functions state machine can now:

  • Create or terminate an EMR cluster, including the possibility to change the cluster termination protection. In this way, you can reuse an existing EMR cluster for your workflow, or create one on-demand during execution of a workflow.
  • Add or cancel an EMR step for your cluster. Each EMR step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster, including tools such as Apache Spark, Hive, or Presto.
  • Modify the size of an EMR cluster instance fleet or group, allowing you to manage scaling programmatically depending on the requirements of each step of your workflow. For example, you may increase the size of an instance group before adding a compute-intensive step, and reduce the size just after it has completed.

When you create or terminate a cluster or add an EMR step to a cluster, you can use synchronous integrations to move to the next step of your workflow only when the corresponding activity has completed on the EMR cluster.

Reading the configuration or the state of your EMR clusters is not part of the Step Functions service integration. In case you need that, the EMR List* and Describe* APIs can be accessed using Lambda functions as tasks.
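
For example, a small Lambda function used as a task state could return a cluster's status through the DescribeCluster API. A minimal sketch follows; the event shape (a ClusterId field in the state input) is an assumption for illustration:

import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Return the state of an EMR cluster so a Choice state can branch on it.
    # Expects the cluster ID in the state input, e.g. {"ClusterId": "j-..."}.
    cluster_id = event["ClusterId"]
    response = emr.describe_cluster(ClusterId=cluster_id)
    return {
        "ClusterId": cluster_id,
        "State": response["Cluster"]["Status"]["State"],
    }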

Building a Workflow with EMR and Step Functions
On the Step Functions console, I create a new state machine. The console renders it visually, which makes it much easier to understand:

To create the state machine, I use the following definition, written in the Amazon States Language (ASL):

{
  "StartAt": "Should_Create_Cluster",
  "States": {
    "Should_Create_Cluster": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.CreateCluster",
          "BooleanEquals": true,
          "Next": "Create_A_Cluster"
        },
        {
          "Variable": "$.CreateCluster",
          "BooleanEquals": false,
          "Next": "Enable_Termination_Protection"
        }
      ],
      "Default": "Create_A_Cluster"
    },
    "Create_A_Cluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
      "Parameters": {
        "Name": "WorkflowCluster",
        "VisibleToAllUsers": true,
        "ReleaseLabel": "emr-5.28.0",
        "Applications": [{ "Name": "Hive" }],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "LogUri": "s3://aws-logs-123412341234-eu-west-1/elasticmapreduce/",
        "Instances": {
          "KeepJobFlowAliveWhenNoSteps": true,
          "InstanceFleets": [
            {
              "InstanceFleetType": "MASTER",
              "TargetOnDemandCapacity": 1,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m4.xlarge"
                }
              ]
            },
            {
              "InstanceFleetType": "CORE",
              "TargetOnDemandCapacity": 1,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m4.xlarge"
                }
              ]
            }
          ]
        }
      },
      "ResultPath": "$.CreateClusterResult",
      "Next": "Merge_Results"
    },
    "Merge_Results": {
      "Type": "Pass",
      "Parameters": {
        "CreateCluster.$": "$.CreateCluster",
        "TerminateCluster.$": "$.TerminateCluster",
        "ClusterId.$": "$.CreateClusterResult.ClusterId"
      },
      "Next": "Enable_Termination_Protection"
    },
    "Enable_Termination_Protection": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:setClusterTerminationProtection",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "TerminationProtected": true
      },
      "ResultPath": null,
      "Next": "Add_Steps_Parallel"
    },
    "Add_Steps_Parallel": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Step_One",
          "States": {
            "Step_One": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "Step": {
                  "Name": "The first step",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "hive-script",
                      "--run-hive-script",
                      "--args",
                      "-f",
                      "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
                      "-d",
                      "INPUT=s3://eu-west-1.elasticmapreduce.samples",
                      "-d",
                      "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
                    ]
                  }
                }
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "Wait_10_Seconds",
          "States": {
            "Wait_10_Seconds": {
              "Type": "Wait",
              "Seconds": 10,
              "Next": "Step_Two (async)"
            },
            "Step_Two (async)": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "Step": {
                  "Name": "The second step",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "hive-script",
                      "--run-hive-script",
                      "--args",
                      "-f",
                      "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
                      "-d",
                      "INPUT=s3://eu-west-1.elasticmapreduce.samples",
                      "-d",
                      "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
                    ]
                  }
                }
              },
              "ResultPath": "$.AddStepsResult",
              "Next": "Wait_Another_10_Seconds"
            },
            "Wait_Another_10_Seconds": {
              "Type": "Wait",
              "Seconds": 10,
              "Next": "Cancel_Step_Two"
            },
            "Cancel_Step_Two": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:cancelStep",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "StepId.$": "$.AddStepsResult.StepId"
              },
              "End": true
            }
          }
        }
      ],
      "ResultPath": null,
      "Next": "Step_Three"
    },
    "Step_Three": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "Step": {
          "Name": "The third step",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "hive-script",
              "--run-hive-script",
              "--args",
              "-f",
              "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
              "-d",
              "INPUT=s3://eu-west-1.elasticmapreduce.samples",
              "-d",
              "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
            ]
          }
        }
      },
      "ResultPath": null,
      "Next": "Disable_Termination_Protection"
    },
    "Disable_Termination_Protection": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:setClusterTerminationProtection",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "TerminationProtected": false
      },
      "ResultPath": null,
      "Next": "Should_Terminate_Cluster"
    },
    "Should_Terminate_Cluster": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.TerminateCluster",
          "BooleanEquals": true,
          "Next": "Terminate_Cluster"
        },
        {
          "Variable": "$.TerminateCluster",
          "BooleanEquals": false,
          "Next": "Wrapping_Up"
        }
      ],
      "Default": "Wrapping_Up"
    },
    "Terminate_Cluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId"
      },
      "Next": "Wrapping_Up"
    },
    "Wrapping_Up": {
      "Type": "Pass",
      "End": true
    }
  }
}

I let the Step Functions console create a new AWS Identity and Access Management (IAM) role for the executions of this state machine. The role automatically includes all permissions required to access EMR.

This state machine can either use an existing EMR cluster, or create a new one. I can use the following input to create a new cluster that is terminated at the end of the workflow:

{
  "CreateCluster": true,
  "TerminateCluster": true
}

To use an existing cluster, I need to provide the cluster ID in the input, using this syntax:

{
  "CreateCluster": false,
  "TerminateCluster": false,
  "ClusterId": "j-..."
}
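
To start an execution with one of these inputs outside the console, a boto3 sketch could look like the following; the state machine ARN and cluster ID are placeholders for your own resources:

import json
import boto3

sfn = boto3.client("stepfunctions")

# Start the workflow against an existing EMR cluster.
sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:EMRWorkflow",
    input=json.dumps({
        "CreateCluster": False,
        "TerminateCluster": False,
        "ClusterId": "j-EXAMPLE1234567",
    }),
)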

Let’s see how that works. As the workflow starts, the Should_Create_Cluster Choice state looks into the input to decide if it should enter the Create_A_Cluster state or not. There, I use a synchronous call (elasticmapreduce:createCluster.sync) to wait for the new EMR cluster to reach the WAITING state before progressing to the next workflow state. The AWS Step Functions console shows the resource that is being created with a link to the EMR console:

After that, the Merge_Results Pass state merges the input state with the cluster ID of the newly created cluster to pass it to the next step in the workflow.

Before starting to process any data, I use the Enable_Termination_Protection state (elasticmapreduce:setClusterTerminationProtection) to help ensure that the EC2 instances in my EMR cluster are not shut down by accident or error.

Now I am ready to do something with the EMR cluster. I have three EMR steps in the workflow. For the sake of simplicity, these steps are all based on this Hive tutorial. For each step, I use Hive’s SQL-like interface to run a query on some sample CloudFront logs and write the results to Amazon Simple Storage Service (S3). In a production use case, you’d probably have a combination of EMR tools processing and analyzing your data in parallel (two or more steps running at the same time) or with some dependencies (the output of one step is required by another step). Let’s try to do something similar.

First I execute Step_One and Step_Two inside a Parallel state:

  • Step_One is running the EMR step synchronously as a job (elasticmapreduce:addStep.sync). That means that the execution waits for the EMR step to be completed (or cancelled) before moving on to the next step in the workflow. You can optionally add a timeout to monitor that the execution of the EMR step happens within an expected time frame.
  • Step_Two is adding an EMR step asynchronously (elasticmapreduce:addStep). In this case, the workflow moves to the next step as soon as EMR replies that the request has been received. After a few seconds, to try another integration, I cancel Step_Two (elasticmapreduce:cancelStep). This integration can be really useful in production use cases. For example, you can cancel an EMR step if you get an error from another step running in parallel that would make it useless to continue with the execution of this step.

After those two steps have both completed and produced their results, I execute Step_Three as a job, similarly to what I did for Step_One. When Step_Three has completed, I enter the Disable_Termination_Protection step, because I am done using the cluster for this workflow.

Depending on the input state, the Should_Terminate_Cluster Choice state is going to enter the Terminate_Cluster state (elasticmapreduce:terminateCluster.sync) and wait for the EMR cluster to terminate, or go straight to the Wrapping_Up state and leave the cluster running.

Finally I have a state for Wrapping_Up. I am not doing much in this final state actually, but you can’t end a workflow from a Choice state.

In the EMR console I see the status of my cluster and of the EMR steps:

Using the AWS Command Line Interface (CLI), I find the results of my query in the S3 bucket configured as output for the EMR steps:

aws s3 ls s3://MY-BUCKET/MyHiveQueryResults/
...

Based on my input, the EMR cluster is still running at the end of this workflow execution. I follow the resource link in the Create_A_Cluster step to go to the EMR console and terminate it. In case you are following along with this demo, be careful to not leave your EMR cluster running if you don’t need it.

Available Now
Step Functions integration with EMR is available in all regions. There is no additional cost for using this feature on top of the usual Step Functions and EMR pricing.

You can now use Step Functions to quickly build complex workflows for executing EMR jobs. A workflow can include parallel executions, dependencies, and exception handling. Step Functions makes it easy to retry failed jobs and terminate workflows after critical errors, because you can specify what happens when something goes wrong. Let me know what you are going to use this feature for!

Danilo

Amazon EMR introduces EMR runtime for Apache Spark

Post Syndicated from Joseph Marques original https://aws.amazon.com/blogs/big-data/amazon-emr-introduces-emr-runtime-for-apache-spark/

Amazon EMR is happy to announce Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is active by default on Amazon EMR clusters. EMR runtime for Spark is up to 32 times faster than EMR 5.16, with 100% API compatibility with open-source Spark. This means that your workloads run faster, saving you compute costs without making any changes to your applications.

Amazon EMR has been adding Spark runtime improvements since EMR 5.24, and discussed them in Optimizing Spark Performance. EMR 5.28 features several new improvements.

To measure these improvements, we compared EMR 5.16 (with open source Apache Spark version 2.4) with EMR 5.28 (with EMR runtime for Apache Spark compatible with Apache Spark version 2.4). We used TPC-DS benchmark queries with 3 TB scale and ran them on a six-node c4.8xlarge EMR cluster with data in Amazon S3. We measured performance improvements as the geometric mean of improvement in total query execution time, and the total query execution time across all queries. The results showed considerable improvement: the geometric mean was 2.4 times faster, and the total query runtime was 3.2 times faster.

The following graph shows performance improvements measured as total runtime for 104 TPC-DS queries. EMR 5.28 has the better (lower) runtime.

The following graph shows performance improvements measured as the geometric mean for 104 TPC-DS queries. EMR 5.28 has the better (lower) geomean.

In breaking down the per-query improvements, you can observe the highest performance gains in long-running queries.

The following graph shows performance improvements in EMR 5.28 compared to EMR 5.16 for long-running queries (running for more than 130 seconds in EMR 5.16). In this comparison, the higher numbers are better.

The following graph shows performance improvements in EMR 5.28 compared to EMR 5.16 for short-running queries (running for less than 130 seconds). Again, the higher numbers are better.

Queries running for more than 130 seconds are up to 32 times faster, as seen in query 72. Queries running for less than 130 seconds are up to 6 times faster, with an average improvement of 2 times across the board.

Customers use Spark for a wide array of analytics use cases ranging from large-scale transformations to streaming, data science, and machine learning. They choose to run Spark on EMR because EMR provides the latest, stable, open-source community innovations, performant storage with Amazon S3, and the unique cost savings capabilities of Spot Instances and Auto Scaling. It also provides ease of use with managed EMR Notebooks, notebook-scoped libraries, Git integration, and easy debugging and monitoring with off-cluster Spark History Services. Combined with the runtime improvements, and fine-grained access control using AWS Lake Formation, Amazon EMR presents an excellent choice for customers running Apache Spark.

With each of these performance optimizations to Apache Spark, you benefit from better query performance. Stay tuned for additional updates that improve Apache Spark performance in Amazon EMR. To keep up to date, subscribe to the Big Data blog’s RSS feed to learn about more Apache Spark optimizations, configuration best practices, and tuning advice.

 


About the Authors

Joseph Marques is a principal engineer for EMR at Amazon Web Services.

 

 

 

 

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.

 

 

 

 

Simplify management of Amazon Redshift clusters with the Redshift console

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/simplify-management-of-amazon-redshift-clusters-with-the-redshift-console/

Amazon Redshift is the most popular and the fastest cloud data warehouse. It includes a console for administrators to create, configure, and manage Amazon Redshift clusters. The new Amazon Redshift console modernizes the user interface and adds several features to improve managing your clusters and workloads running on clusters.

The new Amazon Redshift console provides the following benefits:

  • Visibility to the health and performance of clusters from a unified dashboard.
  • Simplified management of clusters by streamlining several screens and flows, and reducing the number of clicks for several everyday operations.
  • Improved mean time to diagnose query performance issues by adding capabilities to monitor user queries and correlate them with cluster performance metrics.

This post discusses how you can use the new console to create your first cluster, and manage and monitor your clusters in your AWS account.

Prerequisites

For the best experience in using the new Amazon Redshift console, make sure you meet the following prerequisites.

If you are using the AmazonRedshiftFullAccess managed policy, you don’t need to make any changes to your permissions. This policy grants full access to all Amazon Redshift resources for the AWS account.

If you are using custom policies, either attach AmazonRedshiftFullAccess or add the following code to your IAM user’s policy:

{  
   "Version":"2012-10-17",
   "Statement":[  
      {  
         "Action":[  
            "cloudwatch:ListMetrics",
            "cloudwatch:GetMetricWidgetImage",
            "cloudwatch:GetMetricData",
            "tag:GetResources",
            "tag:UntagResources",
            "tag:GetTagValues",
            "tag:GetTagKeys",
            "tag:TagResources",
            "iam:ListRoles"
         ],
         "Effect":"Allow",
         "Resource":"*"
      }
   ]
}

You also have to upgrade your clusters to the latest maintenance patch.

New console launch page

When you log in to the new console and you don’t have a cluster, you see the launch page. You can navigate to your dashboard, clusters, queries, Query Editor, configuration changes, and advisor using the left navigation menu.

You are redirected to the Dashboard page if you have at least one cluster.

The following screenshot shows your Amazon Redshift console and navigation menu.

To expand the menu, choose the expand icon at the top of the menu. See the following screenshot.

Setting up a new Amazon Redshift data warehouse

To use Amazon Redshift, you have to first create a new Amazon Redshift cluster. You can launch a cluster by providing a few parameters, such as node type, number of nodes, and the master user password, and accepting the default values for the rest. You can also get an estimated cost for your cluster, and calculate the best configuration based on the size of your data.

Now that you’ve created an Amazon Redshift cluster, you’re ready to create some tables and load data using the Query Editor.

Query Editor

The Query Editor allows you to author and execute queries in your cluster. To connect to your cluster, from Editor, choose Connect to database.

You have to provide your cluster URL and database credentials to connect to your database. See the following screenshot.

You can run SQL commands to create tables, load data, run queries, and perform visual analysis in the Query Editor.

You can switch to a new database connection by clicking on the “Change connection” button as highlighted below.

Monitoring dashboard

More than 15,000 customers use Amazon Redshift to power analytical workloads and enable modern analytics use cases such as Business Intelligence, predictive analytics, and real-time streaming analytics. As an Amazon Redshift administrator or developer, you want your users, such as data analysts and BI professionals, to get optimal performance for their workloads. To be proactive, you might set up Amazon CloudWatch alarms on different Amazon Redshift metrics.
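
For example, a minimal boto3 sketch of an alarm on the DatabaseConnections metric might look like the following; the cluster identifier, threshold, and SNS topic ARN are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the number of database connections on a cluster stays above 100
# for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="redshift-demo-high-db-connections",
    Namespace="AWS/Redshift",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "redshift-demo-cluster"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:redshift-alerts"],
)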

With the existing console, you navigate to different pages to get a view of the health and performance of your Amazon Redshift clusters. The new dashboard provides unified visibility into the health and performance of your clusters.

The dashboard enables Amazon Redshift administrators and operators to gain visibility into the following:

  • The number of clusters, number of nodes, availability, and cluster health status.
  • CloudWatch alarms for your clusters.
  • Critical performance metrics for the top five clusters, such as the number of queries, database connections, and CPU utilization.
  • Performance workloads such as query throughput, average query length, average run time, and average wait time for your clusters.
  • Consolidated view of all events for your clusters.

You can filter data in each widget by a specific filter or modify the range to perform a trend analysis for a metric.

The dashboard also allows you to isolate issues in your clusters to focus on the problem. For example, if you have a CloudWatch alarm for an Amazon Redshift metric such as DB Connections, as the preceding screenshot shows, you can drill down to view the alarm details or drill down to the cluster to get details such as queries and loads running, and cluster and database metrics.

Cluster management

The Clusters page allows you to view node details and key performance metrics for the clusters, tags, and notifications. You can customize this page to include your preferred metric and areas of focus. In this post, the page includes a maintenance track and release status. Through the Actions menu, you can perform frequent operations such as modify, resize, manage tags, reboot, create a snapshot or configure a cross-region snapshot, or delete your cluster.

The following screenshot shows the Clusters page and options on the Actions menu.

Cluster details

If you are diagnosing a performance problem, you can drill down to view cluster details. The cluster details page shows the details for a specific cluster, organized in the following categories:

  • Query monitoring
  • Cluster performance
  • Maintenance
  • Backup
  • Properties

The Actions menu allows you to perform everyday operations such as resize, reboot, or change configuration. You can also modify the password for your master user for the database, create a snapshot, or restore a table in your database.

The following screenshot shows the details page for a cluster and the available menus and options.

Query and Load monitoring

You can monitor the current workload for your WLM queues and view critical metrics such as active queries and query throughput.

The following screenshot shows the Query monitoring section and the time-series view of the breakdown of the workload.

Cluster and database monitoring

You can view cluster and database performance metrics for your cluster, such as CPU utilization, percentage of disk space used, and database connections, using the Cluster performance tab. You can also change the time range and the granularity of the data.

The following screenshot shows the Cluster performance section.

You can change your default metric view to the metrics of your choice by using the config button. You can display 6, 9, or 20 metrics on this page; keeping to six metrics is recommended for optimal page performance.

Cluster maintenance

The Maintenance and monitoring tab allows you to view your CloudWatch alarms, events, and maintenance level. You can also enable auditing, create CloudWatch alarms, and create maintenance windows for your cluster.

The following screenshot shows the Maintenance and monitoring section.

Backup and restore

Amazon Redshift automatically takes incremental snapshots (backups) of your data every eight hours or 5 GB per node of data change. Alternatively, you can create a snapshot schedule to control when to create snapshots. You can create a new cluster by restoring from an existing snapshot.

You can view, search your snapshots, enable automatic snapshots, take a manual snapshot, and configure cross-region snapshots from this page.
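
If you prefer to script manual snapshots, a minimal boto3 sketch might look like this; the cluster and snapshot identifiers are placeholders:

import boto3

redshift = boto3.client("redshift")

# Take a manual snapshot of a cluster and wait for it to become available.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="demo-cluster-manual-2019-12-03",
    ClusterIdentifier="redshift-demo-cluster",
)

waiter = redshift.get_waiter("snapshot_available")
waiter.wait(
    ClusterIdentifier="redshift-demo-cluster",
    SnapshotIdentifier="demo-cluster-manual-2019-12-03",
)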

You can also create a new cluster from an existing snapshot. With this release, Amazon Redshift allows you to restore snapshots to create a cluster with different node types than the original version. You can either migrate your Amazon Redshift data warehouse to a more powerful node type or clone to a smaller node type for development and testing.

You can view snapshots for your clusters by navigating to the Snapshots submenu from the cluster menu. You can search, view, and delete snapshots generated by existing and deleted clusters. You can restore a cluster from a snapshot, including a snapshot generated from a deleted cluster.

The following screenshot shows the Backup section and available actions in the Snapshots section.

Properties

Amazon Redshift clusters provide several customization options. The properties tab allows you to view and edit standard configurations such as IAM roles, cluster configuration details, database configurations, and network configurations. You can also manage tags for your cluster.

You need the JDBC and ODBC URLs to connect to a cluster from different ETL, Business Intelligence, and developer tools. This tab also provides connection details such as the JDBC and ODBC URLs and the IP addresses of the nodes in the cluster.

The following screenshot shows the Properties section.

Conclusion

This post showed you the new features in the Amazon Redshift management console, such as the monitoring dashboard and the updated flows to create, manage, and monitor Amazon Redshift clusters. The Clusters page logically organizes information for better navigation and simplifies everyday operations. Switch today to the new Redshift console and simplify management of your Amazon Redshift clusters.

In a later post, you can learn how the Amazon Redshift console improves query monitoring and helps you diagnose query performance.

 


About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

 

 

 

 

Raja Bhogi is an engineering manager at AWS. He is responsible for building delightful and easy-to-use web experiences for analytics and blockchain products. His work includes launching web experiences for new analytics products, and working on new feature launches for existing products. He is passionate about web technologies, performance insights, and tuning. He is a thrill seeker and enjoys everything from roller coasters to bungy jumping.

 

 

 

 

Publish and update data products dynamically with AWS Data Exchange

Post Syndicated from Akram Chetibi original https://aws.amazon.com/blogs/big-data/publish-and-update-data-products-dynamically-with-aws-data-exchange/

Data is revolutionizing the way organizations of all sizes conduct their business. Companies are increasingly using third-party data to complement their internal data and deliver value for their customers. Third party data is used across a wide variety of use-cases, such as to build applications for customers, to run analytics workloads to improve business operations and marketing activities, or to build predictive models using machine learning (ML) techniques.

However, as data becomes the center of how companies operate, the way data providers deliver data to subscribers has not changed in years. As a data provider, you spend time and effort on undifferentiated heavy lifting to build data delivery and entitlement management mechanisms to serve your customers. Many data providers also rely on traditional sales and delivery channels and are often unable to reach many customers interested in their data, which leads to slower adoption of their data products.

Enter AWS Data Exchange.

AWS Data Exchange makes it easy to exchange data in the cloud efficiently. In a few minutes, customers can find and subscribe to hundreds of data products from more than 80 qualified data providers across industries, such as Financial Services, Healthcare and Life Sciences, and Consumer and Retail. After subscribing, customers can download a dataset or copy it to Amazon S3 and analyze it with a wide variety of AWS analytics and ML services. AWS Data Exchange gives data providers a secure, transparent, and reliable channel to reach millions of AWS customers. AWS Data Exchange also helps you service your existing customer subscriptions more efficiently and at a lower cost by eliminating the need to build and maintain data delivery, licensing, or billing infrastructure.

Many data providers publish data products that are updated regularly. For example, a stock market data provider may want to publish daily closing prices every day, or a weather forecast data provider may want to provide an updated forecast every week. This post walks through the process of publishing and updating products dynamically on AWS Data Exchange. The post first shows how to publish a new product and make it available to subscribers, which can be done in minutes using the AWS Data Exchange console. The post also reviews a workflow using a Lambda function to automatically update the product by publishing new revisions to its underlying data sets.

Prerequisites

Before you begin, complete the following prerequisites:

  1. You must be a registered provider on AWS Data Exchange. Only eligible and registered providers can publish data products on AWS Data Exchange. Eligible providers must agree to the Terms and Conditions for AWS Marketplace under a valid legal entity domiciled in the United States or a member state of the EU, supply valid banking and taxation identification, and be qualified by the AWS Data Exchange business operations team. For more information, see Providing Data Products on AWS Data Exchange.
  2. The data that you publish must be compliant with the AWS Marketplace Terms and Conditions and the AWS Data Exchange Publishing Guidelines.
  3. You must have the appropriate IAM permissions to use AWS Data Exchange as a provider. For example, you can use the AWSDataExchangeProviderFullAccess managed IAM policy.
  4. You need an S3 bucket for your ready-to-publish data files. For more information, see Create a Bucket and What is Amazon S3?

AWS Data Exchange concepts

Products are the unit of exchange in AWS Data Exchange. A product is a package of data sets that a provider publishes and others subscribe to. The AWS Data Exchange product catalog and AWS Marketplace website both list products. A product can contain one or more data sets, as well as product details, including the product’s name and description, categories, and contact details. The product also contains information related to the product’s offer terms, which are the terms that subscribers agree to when subscribing to a product. These terms include the available pricing and duration options, the data subscription agreement, and the refund policy.

A data set is a dynamic set of file-based data content. Data sets are dynamic and versioned using revisions. A revision is a specific version of a data set. Each revision can contain multiple files called assets, which you can import to a revision using an asynchronous workflow called a job. After creating a revision and importing assets into it, you need to finalize the revision to mark it as ready for publishing, before publishing it into the dataset’s product. For more information, see Working with Data Sets.
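If you want to explore this hierarchy programmatically, the following is a minimal boto3 sketch that walks your owned data sets, their revisions, and the assets in each revision (pagination is omitted for brevity):

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Walk the hierarchy: data set -> revisions -> assets.
for data_set in dx.list_data_sets(Origin="OWNED")["DataSets"]:
    print("Data set:", data_set["Name"], data_set["Id"])
    for revision in dx.list_data_set_revisions(DataSetId=data_set["Id"])["Revisions"]:
        print("  Revision:", revision["Id"], "finalized:", revision["Finalized"])
        assets = dx.list_revision_assets(
            DataSetId=data_set["Id"], RevisionId=revision["Id"]
        )["Assets"]
        for asset in assets:
            print("    Asset:", asset["Name"])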

The following diagram summarizes the concepts described above and the hierarchy of the different resources.

Publishing a new product to AWS Data Exchange

Before reviewing how to automatically update an existing product, let’s start by setting up and creating a new product. If you already have a published product, skip this section and move on to “Publishing new data files to the product automatically.”

Creating a dataset

To publish a product, first create a dataset. Complete the following steps:

  1. On the AWS Data Exchange console’s, under Data sets, choose Create data set.
  2. Enter a Name and Description for the dataset and choose Create.

The name of the data set is visible as part of the product details in the catalog; consider using a concise name that enables customers to understand the content of the data set easily. The description is visible to subscribers who have an active subscription to the product; consider including coverage information as well as the features and benefits of the dataset.

The following screenshot shows the Create data set section with the name and description. This post uses the name Exchange-A End of Day Prices and the description End-of-day pricing of all equities listed on Exchange-A. Covers all industries and all equities traded on the exchange (2,000+). This data set contains full history from 1985, and is updated daily with a new file every day around 5pm EST.

Creating a revision

After creating the dataset, but before publishing it into a product, you need to create its first revision. Complete the following steps:

  1. On your data set’s page, choose the Revisions tab.
  2. Choose Create revision.
  3. For Revision settings, enter a brief comment about the data in this revision.
  4. Choose Create. The revision comment is visible to subscribers after they subscribe to your product. The following screenshot shows that this post entered the comment Historical data from January 1st, 1985 to November 13th, 2019. You can import files (assets) to this revision from either an S3 bucket or your computer; this post imports a file from an S3 bucket. Note that by default, AWS Data Exchange uses the source S3 object’s key as the asset name. The following screenshot shows the example file this post uses.
  5. When the import status is complete, choose Finalize.

Marking a revision as finalized means that it is staged for publishing. You can only publish finalized revisions to subscribers; you can’t modify a revision after publishing it.
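Finalizing in the console maps to the UpdateRevision API. The following is a minimal boto3 equivalent, with placeholder identifiers:

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Placeholder identifiers; use your own data set and revision IDs.
dx.update_revision(
    DataSetId="1234567890abcdef1234567890abcdef",
    RevisionId="abcdef1234567890abcdef1234567890",
    Finalized=True,
    Comment="Historical data from January 1st, 1985 to November 13th, 2019",
)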

Publishing a new product

You are now ready to publish a new product using this data set. Complete the following steps:

  1. On the AWS Data Exchange console, under Publish data, choose Products.
  2. Choose Publish new product.
  3. In Product overview, enter the product details that subscribers can use to identify the product. For information about best practices when populating your product’s details, see Publishing Products. In particular, you may want to consider including links to a Data due diligence questionnaire (DDQ), information about the data set file types and schemas, and any other fact sheets. Note that you can use markdown to include links and format your product description.
  4. Choose Next to proceed to the Add data page. You can then add the data set that you created earlier.
  5. Choose Next to proceed to the Configure the public offer page. This is the page where you configure the offer details for your product, including the available pricing options, the Data Subscription Agreement, and the refund policy. You can also choose whether you would like to enable subscription verification. If you enable subscription verification, prospective subscribers will have to fill in information such as their name, company name, email address, and use case before being able to subscribe. The subscription request will then appear on your Product Dashboard page, and you will have up to 45 days to approve or decline the request. For information about subscription verification, see Subscription Verification for Providers.
  6. Choose Next to review your product. You can preview the product as it will appear on the AWS Data Exchange product catalog. When you are satisfied with your product and offer details, choose Publish the product. Important: Choosing Publish the product will publish your product to the AWS Data Exchange catalog and make it publicly available to subscribers.

You have now created a new data set, added your first revision to this data set with historical data, finalized the revision, and published a product using this finalized revision. This product is available for subscribers to purchase within a few hours after publishing.

Publishing new data files to the product automatically

Now that the product is available to customers, you need to update the product and continuously publish new revisions to it. In our example, you need to publish new equity prices every day. To do so, set up the following architecture, which automatically picks up any files uploaded to your S3 bucket and publishes them to the product’s data set as part of a new revision. The workflow creates and publishes a new revision for each file uploaded to the S3 bucket.

The workflow is as follows:

  1. You upload a ready-to-publish data file to the S3 bucket to update your data set.
  2. S3 invokes an AWS Lambda function with the S3 API event that contains details about the object. For more information, see Using AWS Lambda with Amazon S3.
  3. The AWS Lambda function creates a new revision under the pre-existing data set and starts a job to import the file.
  4. The AWS Lambda function modifies the pre-existing product to include the new dataset revision.
  5. Subscribers can now consume the new revision, which appears as part of their entitled data set.

Building a Lambda function

Now that you published a product with a data set, you have the foundational pieces in place to build a Lambda function that picks up a new data file uploaded to S3 and publishes it as part of that product.

To configure your Lambda function correctly, you first need to record the data set ID and product ID that you created earlier. You can retrieve them from the AWS Data Exchange console. The product ID is available on the product page, which you can access from your Product Dashboard. The data set ID is available on the data set’s page, which you can access from the Data sets page.
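If you prefer not to copy the data set ID by hand, the following sketch looks it up by name with boto3; the name shown assumes the data set created earlier in this post, and the product ID is still easiest to copy from the product page.

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Assumed data set name from earlier in this post.
target_name = "Exchange-A End of Day Prices"

data_set_id = next(
    ds["Id"]
    for ds in dx.list_data_sets(Origin="OWNED")["DataSets"]
    if ds["Name"] == target_name
)
print("DATA_SET_ID =", data_set_id)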

Data set page

Product page

Creating an IAM role

To give the Lambda function permission to read from the source S3 bucket, create a revision, upload files to it, and publish it to a product, you need to create an IAM role with the appropriate permissions.

To do so, create an IAM role and attach the following policy to it. Be sure to replace {INSERT-BUCKET-NAME} and {INSERT-ACCOUNTID} with your S3 bucket’s name and your account ID respectively.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3PermissionforGettingDataSet",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::{INSERT-BUCKET-NAME}/*"
        },
        {
            "Sid": "S3DataExchangeServicePermissions",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::*aws-data-exchange*"
        },
        {
            "Sid": "DataExchangeAPIPermissions",
            "Effect": "Allow",
            "Action": [
                "dataexchange:CreateRevision",
                "dataexchange:UpdateRevision",
                "dataexchange:CreateJob",
                "dataexchange:StartJob",
                "dataexchange:GetJob"
            ],
            "Resource": "*"
        },
        {
            "Sid": "MarketplaceAPIPermissions",
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:DescribeEntity",
                "aws-marketplace:StartChangeSet",
                "aws-marketplace:DescribeChangeSet"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CreateCloudwatchLogGroup",
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-1:{INSERT-ACCOUNTID}:*"
        },
        {
            "Sid": "CloudwatchLogsPermissions",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:us-east-1:{INSERT-ACCOUNTID}:log-group:*"
        }
    ]
}

For more information, see Creating IAM Roles.
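The following is a minimal boto3 sketch of creating such a role; the role and policy names are placeholders, and the permissions policy above is assumed to be saved locally as publisher-policy.json.

import json
import boto3

iam = boto3.client("iam")

# Trust policy so that Lambda can assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Hypothetical role and policy names.
iam.create_role(
    RoleName="DataExchangePublisherLambdaRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the permissions policy shown above.
with open("publisher-policy.json") as f:
    iam.put_role_policy(
        RoleName="DataExchangePublisherLambdaRole",
        PolicyName="DataExchangePublisherPermissions",
        PolicyDocument=f.read(),
    )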

Deploying a Lambda layer

This post uses a Lambda layer that extends the AWS Python SDK (boto3) that is built into the Lambda Python runtime by adding the AWS Data Exchange and AWS Marketplace Catalog API SDKs as of November 13, 2019. You can deploy a sample layer published for this post, but you should use the version of the AWS SDK that matches your needs.
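If you build your own layer containing an up-to-date boto3, you can publish it with a few lines of boto3; the layer name and ZIP path below are placeholders.

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# The ZIP is assumed to contain an up-to-date boto3/botocore under python/ (the standard layer layout).
with open("boto3-dataexchange-layer.zip", "rb") as f:
    layer = lambda_client.publish_layer_version(
        LayerName="boto3-with-dataexchange",  # hypothetical layer name
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.7"],
    )

print("Layer version ARN:", layer["LayerVersionArn"])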

Creating a Lambda function

Now that you created the IAM role and deployed the Lambda layer with the updated SDK, you can create a Lambda function using the following steps (a programmatic sketch follows the steps):

  1. On the Lambda console, choose Create a function.
  2. In the Create function section, choose Author from scratch.
  3. In the Basic information section, configure your Lambda function with the following information:
    1. For Function name, enter a name of your choice.
    2. For Runtime, choose Python 3.7.
    3. For Permissions, select Use an existing role.
    4. From the Existing role dropdown, select the Lambda role you created earlier.
  4. Choose Create function.
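As an alternative to the console, the following sketch creates the function with boto3; the role ARN, layer ARN, and deployment package are placeholders from the previous steps.

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Placeholders: execution role ARN from the IAM step, layer ARN from the previous section,
# and a deployment package containing lambda_function.py.
with open("function.zip", "rb") as f:
    lambda_client.create_function(
        FunctionName="data-exchange-publisher",
        Runtime="python3.7",
        Handler="lambda_function.lambda_handler",
        Role="arn:aws:iam::123456789012:role/DataExchangePublisherLambdaRole",
        Code={"ZipFile": f.read()},
        Layers=["arn:aws:lambda:us-east-1:123456789012:layer:boto3-with-dataexchange:1"],
    )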

Configuring your Lambda function

You can now configure your Lambda function. You first need to configure the function to be triggered when new files are uploaded to the S3 bucket. Complete the following steps (a programmatic equivalent follows the steps):

  1. On the Lambda console, choose Functions.
  2. Select the newly created function.
  3. On the function configuration page, choose Add trigger.
  4. Under Trigger Configuration, choose S3.
  5. From the drop-down, select the bucket you created as a part of the prerequisites.
  6. Under Event type, choose All Object Create Events.
  7. Optionally, choose a Prefix or a Suffix if you want to only publish specific files to your AWS Data Exchange product.
  8. Choose Add.
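The following is a rough boto3 equivalent of this trigger configuration; the bucket and function names are placeholders from earlier steps.

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda", region_name="us-east-1")

bucket = "my-data-exchange-staging-bucket"  # placeholder bucket from the prerequisites
function_name = "data-exchange-publisher"   # placeholder function name
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:data-exchange-publisher"

# Allow S3 to invoke the function.
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
)

# Send all object-created events in the bucket to the function.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {"LambdaFunctionArn": function_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)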

To confirm your code is running with the appropriate SDK, associate the Lambda layer that you deployed earlier with your Lambda function. As noted previously, this post published a sample layer, but you should use the appropriate version of the AWS SDK that matches your needs.

  1. On the Lambda console, choose Functions.
  2. Select the newly created function.
  3. On the function configuration page, under the function name, choose Layers.
  4. Choose Add a layer.
  5. Under Layer Selection, deselect Select from list of runtime compatible layers.
  6. From the drop-down, choose the layer you deployed earlier.
  7. Choose Add.

You now need to configure the Lambda function’s code. You can copy the following code for the Lambda function. This code programmatically calls the following APIs, which are the same APIs that you performed earlier using the console:

  • CreateRevision creates a new revision.
  • CreateJob and StartJob start importing the file to the revision.
  • GetJob checks the status of the import.
  • UpdateRevision marks the revision as finalized.

To publish an update to the product, the Lambda function uses the AWS Marketplace Catalog API service with the following APIs. To learn more, see the AWS Marketplace Catalog API Reference.

  • DescribeEntity gets the product details.
  • StartChangeSet starts an update.
  • DescribeChangeSet checks the status of the product update.

Complete the following steps:

  1. On the Lambda console, choose Functions.
  2. Select your newly created function.
  3. Scroll down to the Function code section.
  4. Enter the following code:
    import os
    
    #Include the Lambda layer extracted location
    os.environ['AWS_DATA_PATH'] = '/opt/' 
    
    import boto3
    import time
    import datetime
    import json
    
    region = os.environ['AWS_REGION']
    
    try:
        data_set_id = os.environ['DATA_SET_ID']
    except KeyError:
        raise Exception("DATA_SET_ID environment variable must be defined!") 
    
    try:
        product_id = os.environ['PRODUCT_ID']
    except KeyError:
        raise Exception("PRODUCT_ID environment variable must be defined!")
    
    def lambda_handler(event, context):
        # Setup the boto3 clients needed
        dataexchange = boto3.client(
            service_name='dataexchange',
            region_name=region
    
        )
        marketplace_catalog = boto3.client(
            service_name='marketplace-catalog',
            region_name=region
        )
    
        # parse the s3 details from the triggered event
        bucket_name = event['Records'][0]['s3']['bucket']['name']
        object_key = event['Records'][0]['s3']['object']['key']
    
        # CREATE REVISION under the dataset provided as an environment variable
        current_time_for_creating_revision = datetime.datetime.utcnow().strftime("%d %B %Y %I:%M%p UTC")
        create_revision_response = dataexchange.create_revision(DataSetId=data_set_id,
                                                         Comment='Revision created programmatically on ' + current_time_for_creating_revision)
        revision_id = create_revision_response['Id']
    
        # CREATE JOB under the revision to import file from S3 to DataExchange
        create_job_s3_import = dataexchange.create_job(
            Type='IMPORT_ASSETS_FROM_S3',
            Details={
                'ImportAssetsFromS3': {
                    'DataSetId': data_set_id,
                    'RevisionId': revision_id,
                    'AssetSources': [
                        {
                            'Bucket': bucket_name,
                            'Key': object_key
                        }
    
                    ]
                }
            }
        )
    
        # Filter the ID of the Job from the response
        job_id = create_job_s3_import['Id']
    
        # invoke START JOB on the created job to change it from Waiting to Completed state
        start_created_job = dataexchange.start_job(JobId=job_id)
    
        # GET JOB details to track the state of the job and wait until it reaches COMPLETED state
        job_status = ''
    
        while job_status != 'COMPLETED':
            get_job_status = dataexchange.get_job(JobId=job_id)
            job_status = get_job_status['State']
            print('Job Status ' + job_status)
            
            if job_status=='ERROR' :
                job_errors = get_job_status['Errors']
                raise Exception('JobId: {} failed with error:{}'.format(job_id, job_errors))
            
            time.sleep(.5)
            
        # Finalize revision by invoking UPDATE REVISION
        current_time_for_finalize_revision = datetime.datetime.utcnow().strftime("%d %B %Y %I:%M%p UTC")
        print(current_time_for_finalize_revision)
        finalize_revision = dataexchange.update_revision(DataSetId=data_set_id, RevisionId=revision_id, Finalized=True,
                                                  Comment='Revision finalized programmatically on ' + current_time_for_finalize_revision)
    
        # New dataset version created and finalized, now let’s add it to an existing product specified as an env variable
    
        # Describe Product details to get the metadata about the product
        describe_entity = marketplace_catalog.describe_entity(Catalog='AWSMarketplace', EntityId=product_id)
    
        # Use the output to pull out producttype, productid and datasetarn for startchangeset call
        entity_type = describe_entity['EntityType']
        entity_id = describe_entity['EntityIdentifier']
        dataset_arn = ((json.loads(describe_entity['Details']))['DataSets'][0]['DataSetArn'])
        revision_arn = create_revision_response['Arn']
     
    
        # StartChangeSet to add the newly finalized revision to an existing product
        start_change_set = marketplace_catalog.start_change_set(
            Catalog='AWSMarketplace',
            ChangeSetName="Adding revision to my Product",
            ChangeSet=[
                {
                    "ChangeType": "AddRevisions",
                    "Entity": {
                        "Identifier": entity_id,
                        "Type": entity_type
                    },
                    "Details": json.dumps({
                        "DataSetArn": dataset_arn,
                        "RevisionArns": [revision_arn]
                    })
                }
            ]
        )
        
        #Filter the changeset id from the response
        changeset_id = start_change_set['ChangeSetId']
    
        # DESCRIBE CHANGESET to get the status of the changeset and wait until it reaches SUCCEEDED state
        change_set_status = ''
    
        while change_set_status != 'SUCCEEDED':
            describe_change_set = marketplace_catalog.describe_change_set(
                Catalog='AWSMarketplace',
                ChangeSetId=changeset_id
                )
            change_set_status = describe_change_set['Status']
            print('Change Set Status ' + change_set_status)
    
            if change_set_status=='FAILED' :
                print(describe_change_set)
                failurereason = describe_change_set['FailureDescription']
                raise Exception('ChangeSetID: {} failed with error:\n{}'.format(changeset_id, failurereason))
            time.sleep(1)
            
        return ('Your data has been published successfully')

  5. Scroll down to the Environment variables section (a programmatic equivalent of these settings follows the steps).
  6. Set the DATA_SET_ID and PRODUCT_ID variables to the values you retrieved from the console.
  7. Scroll further down to Basic Settings and set the Timeout value to 1 minute.
  8. Choose Save.
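The following sketch shows the equivalent configuration call with boto3; the function name and identifiers are placeholders.

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Equivalent of setting the environment variables and timeout in the console.
lambda_client.update_function_configuration(
    FunctionName="data-exchange-publisher",
    Timeout=60,  # 1 minute
    Environment={
        "Variables": {
            "DATA_SET_ID": "1234567890abcdef1234567890abcdef",
            "PRODUCT_ID": "prod-xxxxxxxxxxxxx",
        }
    },
)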

When you upload a file to your S3 bucket, the S3 event now triggers the Lambda function, which updates the dataset automatically and publishes the new file to your subscribers. Subscribers also receive an Amazon CloudWatch event from AWS Data Exchange to automate exporting the data to their S3 buckets.

Conclusion

AWS Data Exchange provides an easy and convenient way for data providers to exchange data with their customers in a cloud-native, secure, and efficient way. This post showed you how to publish a new product based on a newly created data set and revision in the AWS Data Exchange console. You also learned how to automatically publish files uploaded to your S3 bucket as new revisions. To learn more, visit AWS Data Exchange.


About the Authors

Akram Chetibi is a senior product manager of AWS Data Exchange. Akram joined AWS more than two years ago, and has launched multiple services including AWS Data Exchange and AWS Fargate.

Keerti Shah is a global solutions architect with Amazon Web Services. She enjoys working with Financial Services customers to drive innovation, digitization, and modernization of legacy applications.

Harsha W. Sharma is a global account solutions architect with AWS New York. Harsha joined AWS more than three years ago and works with Global Financial Services customers to design and develop architectures on AWS and support their journey on the cloud.

Find and acquire new data sets and retrieve new updates automatically using AWS Data Exchange

Post Syndicated from Akram Chetibi original https://aws.amazon.com/blogs/big-data/find-and-acquire-new-data-sets-and-retrieve-new-updates-automatically-using-aws-data-exchange/

Customers are doing some amazing things with data, such as improving medicine and tackling climate change. With AWS services, such as AWS Glue, Amazon EMR, Amazon SageMaker, Amazon QuickSight, and Amazon Athena, it is easier than ever to get data insights for your business. But how can you find the right data to fuel these analytics? This is where AWS Data Exchange steps in.

AWS Data Exchange makes it simple to exchange data in the cloud. In a few minutes, you can find and subscribe to hundreds of data products from more than 80 qualified data providers across industries such as Financial Services, Healthcare and Life Sciences, and Consumer and Retail. After subscribing, you can download data sets or copy them to Amazon S3 and analyze them with AWS’s analytics and machine learning services. With AWS Data Exchange, you can subscribe to data products and get access to data sets. Subscribers also access new data set revisions as providers publish new data.

This post uses an example scenario in which you would like to analyze daily treasury maturities in order to understand changes in the economy. We will use Rearc’s Daily Treasury Maturities | Federal Reserve Board product, which contains a data set that is updated daily with new data. This post walks through the process, from browsing the catalog and subscribing to the data products to setting up an automation to retrieve new revisions to S3 automatically, making it readily available to analyze using other AWS services.

Solution overview

The solution has three steps:

  1. Configure your prerequisites: an S3 bucket for your data and IAM permissions for using AWS Data Exchange.
  2. Subscribe to a new data product in AWS Data Exchange.
  3. Set up an automation using Amazon CloudWatch events to retrieve new revisions of subscribed data products in AWS Data Exchange automatically.

Prerequisites

This post assumes you have an S3 bucket to which you export your data sets. For more information, see Create a Bucket.

You also need permissions to use AWS Data Exchange and associated services to subscribe to and export data sets. You can, for example, use the AWS Data Exchange managed policy AWSDataExchangeSubscriberFullAccess, which gives you all the necessary permissions needed to use AWS Data Exchange as a subscriber. For more information, see Identity and Access Management in AWS Data Exchange.

Browsing the catalog and subscribing to data products

Browsing and subscribing to a new data product is straightforward. The first step is to determine what data products you wish to subscribe to. Complete the following steps:

  1. On the AWS Data Exchange console, choose Product catalog. You can search for a term and filter results by provider name and pricing plan.
  2. For Product catalog, enter federal reserve.
  3. Choose Search. You can see multiple data products listed, including a few products by Rearc and Crux Informatics. You can narrow the list further using the options under Refine results.
  4. Under Refine results, under Vendors, select Rearc. This post is looking for free product offerings, so filter the results further.
  5. Under Pricing plans, select Free. The filtered results contain Daily Treasury Maturities | Federal Reserve Board, which you can use for testing. Choosing the product name shows more product details, including its full description, which data sets are included in the product (some products offer multiple data sets in a single subscription), the product’s support contact information, as well as its offer details, such as the data subscription agreement, available pricing options, and the refund policy. See the following screenshot of the product detail page. It is important to understand the offer details you are agreeing to, including the price and Data Subscription Agreement (DSA). A link to view the DSA is under the Usage tab. Read over the DSA; it is a legal agreement that defines the rights to use the data. You need to make sure that the agreement aligns with your intended usage before subscribing.
  6. Choose Continue to subscribe.
  7. Under Complete subscription, for Pricing information, choose a subscription duration and price.
  8. For Renewal settings, choose whether you want to enable auto-renewal when the subscription expires. The following screenshot shows that this post chooses a 12-month subscription that renews automatically.
  9. Choose Subscribe. The subscription process can take up to a few minutes to complete.

When your subscription is active, it is visible under the Active subscriptions tab of the Subscriptions page. Choose your subscription to view its details, including the data sets included in the subscription. You can also see the Region to which the vendor publishes the data set.

Viewing revisions and exporting assets

When you click on the data set name, you proceed to the data set page. You can view revisions under the Revisions tab. The following screenshot shows the list of revisions organized by Revision ID and time created.

Over time, as Rearc updates the data set, you can see multiple revisions listed.

Choosing the latest Revision ID brings up all the files (called assets in AWS Data Exchange) available in the revision. To export the assets, complete the following steps:

  1. Choose the asset to export.
  2. Choose Export to Amazon S3.
  3. Choose an S3 Bucket in the S3 navigation modal.
  4. Choose Export.

AWS Data Exchange starts copying the asset to your S3 bucket. In the console, AWS Data Exchange uses the asset name as an S3 object key. View the export progress in the Jobs list. It progresses through three steps: Waiting, Processing, and Completed.
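You can also run the same export programmatically. The following is a minimal boto3 sketch that exports one asset of a revision to your bucket; all identifiers, bucket names, and keys are placeholders.

import time
import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Placeholder identifiers for the entitled data set, revision, and asset.
data_set_id = "ENTITLED_DATA_SET_ID"
revision_id = "REVISION_ID"
asset_id = "ASSET_ID"

# Create and start an export job that copies the asset to your bucket.
job = dx.create_job(
    Type="EXPORT_ASSETS_TO_S3",
    Details={
        "ExportAssetsToS3": {
            "DataSetId": data_set_id,
            "RevisionId": revision_id,
            "AssetDestinations": [
                {"AssetId": asset_id, "Bucket": "my-subscriber-bucket", "Key": "rearc/daily.csv"}
            ],
        }
    },
)
dx.start_job(JobId=job["Id"])

# Poll until the export finishes.
while True:
    state = dx.get_job(JobId=job["Id"])["State"]
    print("Job state:", state)
    if state in ("COMPLETED", "ERROR"):
        break
    time.sleep(2)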

Subscription Verification

AWS Data Exchange also has a feature called Subscription Verification. Subscription Verification allows providers to approve or decline subscription requests before granting subscription to certain products. For products with Subscription Verification enabled, you need to complete a form to share some information with the provider, who has up to 45 days to approve or reject the request. The form includes information such as your contact name, email address, company name, AWS account number, and intended use case for the data. The provider uses this information (and potentially reaches out to you for more information) to decide whether to approve your subscription request. You can view your subscription request status on the Subscriptions page under the Subscription requests tab. To learn more about subscription verification, see Subscription Verification for Subscribers.

Automating the retrieval for new data set revisions

Providers update many products regularly by creating and publishing new revisions to the underlying data sets. For example, the Rearc data product is updated daily. You want your analytics and visualizations to add these revisions to their insights easily. To do so, you need to set up an automation to retrieve the new files stored in newly published revisions.

The following diagram shows the workflow of this process.

Every time a new revision is published, AWS Data Exchange publishes a CloudWatch event sourced from aws.dataexchange. A CloudWatch event rule triggers a Lambda function, which starts an AWS Data Exchange job that exports the revision’s assets to a predefined S3 bucket. Because AWS Data Exchange uses the asset name as the default S3 object key when exporting to Amazon S3, and Rearc publishes a new revision with the same asset name every day, this automation always overwrites the previous day’s file with the new file, so you can always refer to the same S3 object and it will have the latest data.
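The following is a rough boto3 sketch of wiring up such a rule yourself rather than through the CloudFormation template described below; the detail-type string and ARNs are assumptions to verify against the AWS Data Exchange event documentation and your own resources.

import json
import boto3

events = boto3.client("events", region_name="us-east-1")

# Event pattern for AWS Data Exchange revision notifications. The detail-type value is an
# assumption to confirm against the AWS Data Exchange documentation; the resource is your
# entitled data set ID.
event_pattern = {
    "source": ["aws.dataexchange"],
    "detail-type": ["Revision Published To Data Set"],
    "resources": ["1234567890abcdef1234567890abcdef"],
}

events.put_rule(
    Name="data-exchange-new-revision",
    EventPattern=json.dumps(event_pattern),
)

# Point the rule at the export Lambda function (placeholder ARN). The function also needs a
# lambda:InvokeFunction permission for events.amazonaws.com.
events.put_targets(
    Rule="data-exchange-new-revision",
    Targets=[
        {
            "Id": "export-lambda",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:export-new-revision",
        }
    ],
)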

An AWS CloudFormation template packages this automation. It contains all the necessary resources, including an S3 bucket to store the data, the Lambda function to export the data, its IAM role and policy, and the CloudWatch event rule to trigger the function. Packaging this automation in an AWS CloudFormation template makes it simple to repeat the automation for each data set you subscribe to. You can configure the template using the Data Set ID, which you can retrieve from the data set page that we have seen above.

In this post, we use a Lambda layer that extends the AWS Python SDK (boto3) that is built into the Lambda Python runtime by adding the AWS Data Exchange and AWS Marketplace Catalog API SDKs as of November 13, 2019. This is an example layer published for this post; use the correct version of the AWS SDK for your needs.

Deploying the automation

Before deploying the automation, make sure you are in the Region in which the data set is located. You can find this on the Subscription details page under Data sets.

  1. Click this button to deploy the CloudFormation template in the us-east-1 Region from the CloudFormation console.

    Alternatively, if you’re using a different region, you can manually create the stack in that region:

    • On the AWS CloudFormation console, choose Create Stack.
    • On the Create stack screen, for Template source, select Amazon S3 URL, and enter this URL in the box:
      https://aws-bigdata-blog.s3.amazonaws.com/artifacts/aws-blog-DataExchange/DataExchangeDownloadDataSet-v0.5.yaml
  1. On the stack details screen, give the stack a name and paste in the ID of the data set from the subscription. You can retrieve the data set ID from the Subscriptions page on the AWS Data Exchange console. Optionally, you can enter a revision ID to download an existing revision to the S3 bucket immediately after stack creation. If you leave the revision ID blank, only revisions published after this time are downloaded to the S3 bucket. Choose Next.
  2. On the Configure stack options page, choose Next.
  3. On the Review screen, scroll down and check the three boxes in the Capabilities and transforms section. Then choose the Create stack button.

The stack takes 3–4 minutes to complete. Choose the refresh icon to see the latest status. You can see the created S3 bucket under the Resources tab. This is where new data set revisions are delivered.

Conclusion

In this post, you have searched and subscribed to a product and deployed the automation needed to automatically export new revisions to Amazon S3. This automation makes the data readily available to catalog and analyze using other AWS services. For example, you can catalog the new data automatically with an AWS Glue crawler, which creates and updates a table in your database with the Rearc data automatically. For more information, see Build and automate a serverless data lake using an AWS Glue trigger for the Data Catalog and ETL jobs. After cataloging the data, you can run a serverless ETL job to transform it into Parquet, or use it directly as-is from Amazon Athena or Amazon QuickSight.

About the Authors

Akram Chetibi is a Senior Product Manager of AWS Data Exchange. Akram joined AWS more than two years ago, and has launched multiple services including AWS Data Exchange and AWS Fargate.

George Seib is an Enterprise Solutions Architect with Amazon Web Services. He helps Financial Services and Enterprise customers cost effectively scale and secure data workloads.

Enhancing dashboard interactivity with Amazon QuickSight Actions

Post Syndicated from Sahitya Pandiri original https://aws.amazon.com/blogs/big-data/enhancing-dashboard-interactivity-with-amazon-quicksight-actions/

Amazon QuickSight now offers enhanced dashboard interactivity capabilities through QuickSight Actions. QuickSight Actions provide advanced filtering capabilities through single point-and-click actions on dashboards. With Actions, you can link visuals within a dashboard so that selecting a dimensional point on one visual provides you with granular insights on the selected point on other visuals within your dashboard. Therefore, you can start with summaries and dive deep into details of your business metrics, all within the same dashboard sheet. You can define what visuals within your dashboard are interactive and how these interact with each other. As of this writing, QuickSight Actions lets you define two primary actions of interactivity: filter actions and URL actions. URL actions within Amazon QuickSight are not new, but the point of entry to create URL actions is now consolidated with Actions.

You can apply QuickSight Actions to any supported chart that holds at least one dimension. This post provides examples of getting started with Actions, configuring different Actions on a dashboard, and enabling different forms of interactivity for each action configured.

This post uses the following data sets:

B2B Sales
This data set holds order details for a fictitious company ABCDO for 2016 and 2017. The dashboard we build reports on sales metrics by industry, segment, and region as primary dimensions, and also provides granular details for each order purchased.

Product Availability
This data set holds available quantity for every product by ID.

Prerequisites

Before implementing Actions on your Amazon QuickSight dashboards, review how to create and publish dashboards.

Getting started with QuickSight Actions

This screenshot is a dashboard built from the above two data sets. It shows sales by category, industry, and region on line 1; segment sales by quarter and industry sales by segment on line 2; total profit, sales, discount, and quantity sold on line 3; an order details pivot on line 4; and a shipping details pivot on line 5.

Before getting started, note the following terminology:

  • Source visual – The visual on which an action is created. Choosing a point on the source visual triggers the action, and the dimensional values chosen are passed as filters to target visuals.
  • Target visual – The visual that is filtered by the dimensional values chosen on the source visual.
  • Activation – You can trigger an action either by selecting it directly (left-click), or selecting from the menu options (right-click).
  • Action type – You can configure two types of actions on a visual: filter actions and URL actions. Filter actions pass select or all dimensions as filters across select or all visuals across the dashboard. URL actions allow you to navigate from the dashboard to an external site or different sheet within the same dashboard with the selected dimension passed along.

Setting up a click/select action

To set up a click/select action, complete the following steps:

  1. Select the source visual on the analysis and choose Actions. The following screenshot shows you this step.
  2. Within Actions, you can either create and configure your actions or use the 1-click action setup. The 1-click setup creates a default action on the visual. As a result, when a point on the visual is selected/clicked, all dimensions on a selected point are passed as filters across all other visuals on the dashboard. Under Quick create, choose Filter same-sheet visuals.

This creates a default action, which you can modify at any time.

After creating the 1-click action, you can select any point on the source visual. The following screenshot shows that the selected point is highlighted and all the underlying dimensions are used as filters on all other visuals on the analysis or dashboard.

In this screenshot, selecting Copper and Diamonds for the year 2017 passes these two as filters to all other visuals.

To verify what filter actions are applied on a particular visual, choose the filter icon on the top right. This shows all the filters applied through filter actions and also regular filters.

Setting up custom filter actions

To perform further analysis on the sales metrics for any segment, configure an action on the Industry sales by Segment visual. To configure a filter action to pass segment as the filter onto the KPI charts, complete the following steps:

  1. Choose the source visual and choose Actions.
  2. Choose Define a custom action.
  3. For Action name, enter a name for your action. You could also add dynamic placeholder values to the name to be more descriptive to your readers by choosing the plus sign next to the Action name. You have four configuration settings. While this post uses values specific to this use case, you can choose other values according to your needs.
  4. For Activation, select Menu option.
  5. For Action type, choose Filter action.
  6. For Filter scope, select Selected fields, segment.
  7. For Target visuals, select Select visuals and check needed target visuals Total quantity, Total discount offered, and Total profit by year.
  8. Choose Save.

To view menu options and choose a filter action, right-click any segment on the visual. This post selects Strategic. The following screenshot shows that all the Strategic points are highlighted and the KPIs are updated.

Adding filter actions on pivot tables

Filter actions on pivot tables provide exceptional flexibility to pass anything from the innermost to the outermost dimensions. Additionally, when a filter action is triggered, the source pivot table is highlighted to show values that are passed as filters to the target visuals.

This post adds click/select filter actions on the Order details pivot table to analyze shipping details for each order. This post also adds two filter actions to the menu options: one to filter by the product name and the other by the customer name.

To create click/select filter actions on order_id, complete the following steps:

  1. Choose the Order details pivot table chart.
  2. Choose Actions.
  3. Choose Define custom action.
  4. For Action name, enter your action name.
  5. For Activation, select Select.
  6. For Filter scope, select Selected fields and check order_id.
  7. For Target visuals, select Select visuals, and check Shipping details.
  8. Choose Save.

The following screenshot shows that when choosing any point on the pivot table, the order pertaining to the point is highlighted and the Shipping details table shows shipping details for the selected order.

You can only create one Selection action on a source visual and any number of menu option actions.

You can now create additional actions in the menu. To create new actions, go to Actions, choose the plus sign, and create these two actions.

After creating the two actions, they are visible in the Actions menu. See the following screenshot.

After setting up these action filters, you can right-click on the visual and trigger these filters from the menu.

The screenshot below shows the menu that appears when you right-click the product name.

The screenshot below shows the menu that appears when you right-click the customer name.

Removing applied filter actions

There are three ways to remove a triggered filter action:

  • On source visuals with filter actions triggered via select or menu options, clear the filter action by choosing the selected point again or a different point.
  • On source visuals with empty spaces within the visual (such as bar chart family or pie chart), clicking within the empty space also clears the filter action selection. This deselection is not applicable to heat maps and tree maps. Deselect points on these charts by choosing an already selected point.
  • On target visuals with a highlighted filter icon, choosing the X on the filter actions displayed removes the filter for this visual and all other visuals the filter is applied to.

Using action filters when the source and target visuals use different data sets

If you use multiple data sets within an analysis, Amazon QuickSight automatically maps fields across these data sets by the field name. For example, product_id is a field in the B2B Sales data set and Product Availability data set. When setting a filter action on product_id from a source visual using the B2B Sales data set, the target visual (the adjacent KPI) showing product quantity availability shows results for the selected product_id from the source visual. This implicit mapping of fields is case sensitive.

You can add the new data set Product Availability to this analysis. For more information, see Adding a Data Set to an Analysis.

Add a KPI chart to show count of product_id as in the screenshot below.

To show the total quantity available for a selected product, create an action filter on the menu options, as seen in the screenshot below:

To drill down into product quantity availability, choose (right-click) product_id on the order details pivot table. The following screenshot shows the product availability details for the selected product_id.

Creating custom URL actions

Custom URL actions aren’t new to Amazon QuickSight, but the point of entry to create URL actions is now consolidated with Actions. To create a custom URL action on the Order details pivot table to see Google search results on a customer, complete the following steps:

  1. Choose the Order details pivot table.
  2. Choose Actions.
  3. Choose Add a new action.
  4. For Action name, enter an action name.
  5. For Action type, choose URL action.
  6. Enter the URL https://www.google.com/search?q=<<customer_name>>. Here, customer_name is the value of the dimension field to select.
  7. Choose New browser tab.
  8. Choose Save.

When you right-click on the pivot table, menu options appear with your two filter actions created previously and the new custom URL action. See the following screenshot.

Conclusion

With Amazon QuickSight, you can create filter actions on any chart type with at least one dimensional value. You can also create URL actions on all chart types, including KPIs and gauge charts, which do not hold dimensional values. As of this writing, QuickSight Actions don’t support selecting multiple data points within the same source visual, cascading filter actions in which you choose one filter action on one visual and a subsequent filter on another visual, or adding custom mappings between different data sets used in the dashboard. You can expect to see these features in future releases.

QuickSight Actions is available on both Enterprise and Standard editions in all QuickSight-supported regions.

About the Author

Sahitya Pandiri is a technical program manager with Amazon Web Services. Sahitya has been in product/program management for 6 years now, and has built multiple products in the retail, healthcare, and analytics space.

Secure your data on Amazon EMR using native EBS and per bucket S3 encryption options

Post Syndicated from Duncan Chan original https://aws.amazon.com/blogs/big-data/secure-your-data-on-amazon-emr-using-native-ebs-and-per-bucket-s3-encryption-options/

Data encryption is an effective solution to bolster data security. You can make sure that only authorized users or applications read your sensitive data by encrypting your data and managing access to the encryption key. One of the main reasons that customers from regulated industries such as healthcare and finance choose Amazon EMR is because it provides them with a compliant environment to store and access data securely.

This post provides a detailed walkthrough of two new encryption options to help you secure your EMR cluster that handles sensitive data. The first option is native EBS encryption to encrypt volumes attached to EMR clusters. The second option is an Amazon S3 encryption that allows you to use different encryption modes and customer master keys (CMKs) for individual S3 buckets with Amazon EMR.

Local disk encryption on Amazon EMR

Previously you could only choose Linux Unified Key Setup (LUKS) for at-rest encryption. You now have a choice of using LUKS or native EBS encryption to encrypt EBS volumes attached to an EMR cluster. EBS encryption provides the following benefits:

  • End-to-end encryption – When you enable EBS encryption for Amazon EMR, all data on EBS volumes, including intermediate disk spills from applications and Disk I/O between the nodes and EBS volumes, are encrypted. The snapshots that you take of an encrypted EBS volume are also encrypted and you can move them between AWS Regions as needed.
  • Amazon EMR root volumes encryption – There is no need to create a custom Amazon Linux Image for encrypting root volumes.
  • Easy auditing for encryption – When you use LUKS encryption, though your EBS volumes are encrypted along with any instance store volumes, you still see EBS volumes with a Not Encrypted status when you use an Amazon EC2 API or the EC2 console to check the encryption status. This is because the API doesn’t look into the EMR cluster to check the disk status; your auditors would need to SSH into the cluster to check for disk encryption compliance. However, with EBS encryption, you can check the encryption status from the EC2 console or through an EC2 API call.
  • Transparent Encryption – EBS encryption is transparent to any applications running on Amazon EMR and doesn’t require you to modify any code.

Amazon EBS encryption integrates with AWS KMS to provide the encryption keys that protect your data. To use this feature, you have to use a CMK in your account and Region. A CMK gives you control to create and manage the key, including enabling and disabling the key, controlling access, rotating the key, and deleting it. For more information, see Customer Master Keys.

Enabling EBS encryption on Amazon EMR

To enable EBS encryption on Amazon EMR, complete the following steps:

  1. Create your CMK in AWS KMS.
    You can do this either through the AWS KMS console, AWS CLI, or the AWS KMS CreateKey API. Create keys in the same Region as your EMR cluster. For more information, see Creating Keys.
  2. Give the Amazon EMR service role and EC2 instance profile permission to use your CMK on your behalf.
    If you are using the EMR_DefaultRole, add the policy with the following steps:

    • Open the AWS KMS console.
    • Choose your AWS Region.
    • Choose the key ID or alias of the CMK you created.
    • On the key details page, under Key Users, choose Add.
    • Choose the Amazon EMR service role. The name of the default role is EMR_DefaultRole.
    • Choose Attach.
    • Choose the Amazon EC2 instance profile. The name of the default role for the instance profile is EMR_EC2_DefaultRole.
    • Choose Attach.
      If you are using a customized policy, add the following code to the service role to allow Amazon EMR to create and use the CMK, with the resource being the CMK ARN:

      { 
      "Version": "2012-10-17", 
      "Statement": [ 
         { 
         "Sid": "EmrDiskEncryptionPolicy", 
         "Effect": "Allow", 
         "Action": [ 
            "kms:Encrypt", 
            "kms:Decrypt", 
            "kms:ReEncrypt*", 
            "kms:CreateGrant", 
            "kms:GenerateDataKeyWithoutPlaintext", 
            "kms:DescribeKey" 
            ], 
         "Resource": [ 
            "arn:aws:kms:region:account-id:key/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" 
            ] 
         } 
      ] 
      } 


  3. Create and configure the Amazon EMR security configuration template. Do this either through the console or using the CLI or SDK, with the following steps (a boto3 sketch follows these steps):
    • Open the Amazon EMR console.
    • Choose Security Configuration.
    • Under Local disk encryption, choose Enable at-rest encryption for local disks.
    • For Key provider type, choose AWS KMS.
    • For AWS KMS customer master key, choose the key ARN of your CMK. This post uses the key ebsEncryption_emr_default_role.
    • Select Encrypt EBS volumes with EBS encryption.
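As mentioned above, you can also create the security configuration with the CLI or SDK. The following boto3 sketch mirrors the console settings; the CMK ARN and configuration name are placeholders, and you should verify the JSON field names against the EMR security configuration reference.

import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Security configuration that enables EBS encryption for local disks; the CMK ARN is a placeholder.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
                "EnableEbsEncryption": True,
            }
        },
    }
}

emr.create_security_configuration(
    Name="ebs-encryption-config",
    SecurityConfiguration=json.dumps(security_configuration),
)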

Default encryption with EC2 vs. Amazon EMR EBS encryption

EC2 has a similar feature called default encryption. With this feature, all EBS volumes in your account are encrypted without exception using a single CMK that you specify per Region. With EBS encryption from Amazon EMR, you can use a different KMS key per EMR cluster to secure your EBS volumes. You can use both EBS encryption provided by Amazon EMR and default encryption provided by EC2.

For this post, EBS encryption provided by Amazon EMR takes precedence, and you encrypt the EBS volumes attached to the cluster with the CMK that you selected in the security configuration.

S3 encryption

Amazon S3 encryption also works with Amazon EMR File System (EMRFS) objects read from and written to S3. You can use either server-side encryption (SSE) or client-side encryption (CSE) mode to encrypt objects in S3 buckets. The following table summarizes the different encryption modes available for S3 encryption in Amazon EMR.

Encryption mode | Encryption location | Key storage | Key management
SSE-S3 | Server side on S3 | S3 | S3
SSE-KMS | Server side on S3 | KMS | Choose the AWS managed CMK for Amazon S3 with the alias aws/s3, or create a custom CMK.
CSE-KMS | Client side on the EMR cluster | KMS | A custom CMK that you create.
CSE-Custom | Client side on the EMR cluster | You | Your own key provider.

The encryption choice you make depends on your specific workload requirements. Though SSE-S3 is the most straightforward option that allows you to fully delegate the encryption of S3 objects to Amazon S3 by selecting a check box, SSE-KMS or CSE-KMS are better options that give you granular control over CMKs in KMS by using policies. With AWS KMS, you can see when, where, and by whom your customer managed keys (CMK) were used, because AWS CloudTrail logs API calls for key access and key management. These logs provide you with full audit capabilities for your keys. For more information, see Encryption at Rest for EMRFS Data in Amazon S3.

Encrypting your S3 buckets with different encryption modes and keys

With S3 encryption on Amazon EMR, all the encryption modes use a single CMK by default to encrypt objects in S3. If you have highly sensitive content in specific S3 buckets, you may want to manage the encryption of these buckets separately by using different CMKs or encryption modes for individual buckets. You can accomplish this using the per bucket encryption overrides option in Amazon EMR (a sketch of the corresponding security configuration follows these steps). To do so, complete the following steps:

  1. Open the Amazon EMR console.
  2. Choose Security Configuration.
  3. Under S3 encryption, select Enable at-rest encryption for EMRFS data in Amazon S3.
  4. For Default encryption mode, choose your encryption mode. This post uses SSE-KMS.
  5. For AWS KMS customer master key, choose your key. The key you provide here encrypts all S3 buckets used with Amazon EMR. This post uses ebsEncryption_emr_default_role.
  6. Choose Per bucket encryption overrides. You can set different encryption modes for different buckets.
  7. For S3 bucket, add your S3 bucket that you want to encrypt differently.
  8. For Encryption mode, choose an encryption mode.
  9. For Encryption materials, enter your CMK.
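The following boto3 sketch illustrates what the corresponding security configuration might look like; the field names for the per bucket overrides, the bucket names, and the key ARNs are illustrative assumptions to verify against the EMR security configuration reference.

import json
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Illustrative only: verify the override field names against the EMR security configuration
# reference. Bucket names and key ARNs are placeholders.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableInTransitEncryption": False,
        "EnableAtRestEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/default-key-id",
                "Overrides": [
                    {
                        "BucketName": "s3://my-highly-sensitive-bucket",
                        "EncryptionMode": "CSE-KMS",
                        "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/sensitive-key-id",
                    }
                ],
            }
        },
    }
}

emr.create_security_configuration(
    Name="s3-per-bucket-encryption-config",
    SecurityConfiguration=json.dumps(security_configuration),
)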

If you have already enabled default encryption for S3 buckets directly in Amazon S3, you can also choose to bypass the S3 encryption options in the security configuration setting in Amazon EMR. This allows Amazon EMR to delegate encrypting objects in the buckets to Amazon S3, which uses the encryption key specified in the bucket policy to encrypt objects before persisting it on S3.

Summary

This post walked through the native EBS and S3 encryption options available with Amazon EMR to encrypt and secure your data. Please share your feedback on how these encryption options benefit your real-world workloads.

About the Author

Duncan Chan is a software development engineer for Amazon EMR. He enjoys learning and working on big data technologies. When he is not working, he will be playing with his dogs.

Amazon QuickSight announces the all-new QuickSight Mobile app

Post Syndicated from Tina Kelleher original https://aws.amazon.com/blogs/big-data/announcing-the-new-mobile-app-for-amazon-quicksight/

AWS is happy to announce the release of QuickSight Mobile for iOS and Android devices. This release is both a major update to the existing iOS app and the launch of a new Android application. The app enables you to securely get insights from your data from anywhere; favorite, browse, and interact with your dashboards; explore your data with drilldowns and filters; stay ahead of the curve via forecasting; get email alerts when unexpected changes happen in your data; and share those insights with colleagues.

To download the QuickSight Mobile app, visit the iOS App Store and Google Play.

The new QuickSight Mobile optimizes the dashboard consumption experience with newly added features and enhanced interactivity. In this blog, we will walk you through the new Mobile experience in detail.

Features and capabilities

Home page

After authenticating with the app, you land on the QuickSight home screen. From here, you have quick access to your favorite dashboards (synched from the browser application), and most recently viewed dashboards. The recently viewed list can display up to the last 100 dashboards you previously opened.

The following screenshots show your favorited and recently viewed dashboards.

Dashboards

You can browse through all your dashboards via the Dashboards tab. You can sort the dashboards by name or updated (published) date and change to a list or a grid view. You can also search for dashboards by name. Lastly, you can easily add any dashboard to your favorites by choosing the star icon next to the dashboard’s name.

The following screenshots show dashboards in grid view, display preferences and search tool.

Dashboard view

The dashboard view is optimized for your mobile device by showing the visuals in vertical stack view, with each visual expanded to full width. A multi-sheet dashboard shows the first sheet by default. To navigate across sheets, choose the sheet title and select the sheet name you want to view.

The following screenshots show an opened dashboard with visuals in vertical stack view and how you navigate across different sheets within the dashboard.

To apply filters to visuals, choose the funnel icon to view all available filter controls and options for your dashboard sheet. After you’ve made your selection, choose Apply, and choose Done. You can see the number of filters applied via the small blue tag next to the funnel icon. You can also expand the filter control view and reset a given filter to the default.

The following screenshots show dashboard filters, expanded filter pane and how you select filter values.

Some visuals have a scroll bar, which allows you to customize the range to zoom in on a visual. A long press on a data point on most charts, such as a line or a bar, brings up the context menu. You can focus on this data point, drill up or down if the visual has a drill hierarchy, and exclude the chosen data point.

The following screenshots show interactions with zoom bars and context menu.

To go back to the full view, choose the … icon in the upper right and choose Undo. To exit the dashboard view, choose the back arrow.

If it is an anomaly detection widget, you can click on “Explore anomalies” and get redirected to the detailed anomalies page. Here, you can see all of the anomalies detected for the latest period. To perform contribution analysis to uncover the hidden drivers, long press the anomaly data point and you will see the option show up in the bottom sheet. To find out more about anomaly detection and contribution analysis, see Amazon QuickSight Announces General Availability of ML Insights.

You can expect contribution analysis to extend support to all visuals beyond anomaly detection on the app in the near future.

The following screenshots show the anomaly details and contribution analysis flow.

Visual view

From the dashboard view, choose the expand icon in the upper right corner to expand the desired visual to full-screen. The app supports both portrait and landscape modes in the expanded visual, and you can use the same interactions such as zoom, drill, focus, and exclude.

The following screenshots show expanded visuals in both portrait and landscape modes.

Settings

The Settings tab allows you to change language, Region, and biometric authentication methods. You can also provide feedback by choosing Provide feedback, which initiates an email to our mobile team.

The following screenshot shows you the options on settings page.

Setup and authentication

The QuickSight Mobile app supports authentication by any provisioned user. You can authenticate by using either your Amazon QuickSight username and password (with MFA, if this option is enabled), or through your own identity provider (IdP). After authentication, you can re-authenticate using biometric authentication such as face ID or touch ID, if supported by your mobile device, or an application-specific PIN.

The following screenshots show the generic sign-in flow.

IdP-initiated login

For IdP-initiated logins, set up your account with identity federation. For more information, see Federate Amazon QuickSight access with Okta. You can federate using the standard mobile federation processes that your IdP provides, either through an app or through the web. This post uses the Okta mobile app. After you federate, you receive a prompt to continue your session either in the mobile web experience or in the QuickSight Mobile app.

Account administrators can set the duration of mobile sessions to be anywhere from 1–30 days through the administrative console. The default session length for new accounts is 30 days.

The following screenshots show IdP-initiated login flow.

Microsoft Active Directory

If your account uses Microsoft Active Directory, you can log in to the app using your current AD credentials. If your account requires MFA, you also need to provide your MFA token to authenticate. For more information, see Can I use AWS Directory Service for Microsoft Active Directory to authenticate users in Amazon QuickSight?

Email Invitations

If you receive an email invitation to Amazon QuickSight, follow the instructions in the email to set your username and password through the website (mobile or desktop). Use these credentials to authenticate with the mobile applications.

IAM Users

If you access AWS as an IAM user, you need to provision yourself as an Amazon QuickSight user before authenticating in the mobile application. For more information, see Provisioning Users for Amazon QuickSight.

Summary

This post provided an overview of the key features and functionalities as well as sign-in flows for the new QuickSight Mobile app. The app is available as a free download in Google Play and the iOS App Store. You can expect the app to continue to evolve its features and capabilities to give you the best experience for getting insights from your data while on the go!

If you have any questions or feedback, please leave a comment.

About the Authors

Susan Fang is a senior product manager for QuickSight with AWS.

Brian Stein is a software development manager at AWS.


Joining across data sources on Amazon QuickSight

Post Syndicated from Rakshith Dayananda original https://aws.amazon.com/blogs/big-data/joining-across-data-sources-on-amazon-quicksight/

Amazon QuickSight announced the launch of Cross Data Source Join, which allows you to connect to multiple data sources and join data across these sources directly in Amazon QuickSight to create the data sets used to build dashboards. For example, you can join transactional data in Amazon Redshift that contains customer IDs with Salesforce tables that contain customer profile data to create an interactive dashboard with order and customer details. You can slice and dice transactional data by customer dimensions such as segment, geography, or demographics without first pulling the data into a single source outside Amazon QuickSight.

With cross data source joins, you can join across all data sources supported by Amazon QuickSight, including file-to-file, file-to-database, and database-to-database joins, using the built-in drag-and-drop UI, without relying heavily on BI and data engineering teams to set up complex and time-consuming ETL pipelines. Whether it is local CSV files, Amazon RDS databases, or JSON objects in an S3 bucket, you can now join any of these data sources together to create data sets.

Finally, you can schedule refreshes as frequently as hourly to keep the joined data set up to date with the latest information.

Getting started with Cross Data Source Join

The following screenshot shows all the data sources that you can connect to from QuickSight.

Amazon QuickSight allows you to connect to many different data sources. It is common for businesses to have data spread across multiple data sources, depending on their requirements. For example, you might have your web server logs stored in Amazon S3, customer details in Amazon Redshift tables, and order details in RDS. You may need to build reports that combine data from two or more of these data sources.

You can accomplish this to some extent by building data pipelines that consolidate multiple data sources into a single one. However, these pipelines duplicate data across AWS services and add the cost, in effort and time, of moving data to a single source, from which you would then build your Amazon QuickSight data sets. With cross data source join available directly in Amazon QuickSight, you can eliminate this problem.

There is no size restriction on your largest source, as long as the post-join table fits into your per-data-set SPICE capacity. The rest of the tables together need to be within 1 GB in size. For example, if you have 20 numeric columns in your smaller table, you can fit about 5 million rows before you exceed the 1 GB memory limit.
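
As a rough sanity check on that estimate, you can back out the implied row size from the figures above. The byte counts in this sketch are assumptions for illustration only; the exact per-value accounting that SPICE uses is not spelled out here.

-- Back-of-envelope check of the 1 GB limit, using the figures in the paragraph above.
-- Assumption: byte sizes are approximations, not the exact SPICE accounting.
SELECT 1073741824 / 5000000      AS approx_bytes_per_row,    -- ~214 bytes per row
       1073741824 / 5000000 / 20 AS approx_bytes_per_column; -- ~10 bytes per numeric value, including overhead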

This post demonstrates how to create data sets from two CSV files and how to join an RDS table with an S3 file. The examples use orders-related data in one data source and returns-related data in another. The final goal is to create a single data set that contains both orders and returns data.

Prerequisites

Before getting started, download these CSV files to your local machine:

Also, learn how to create, edit, and delete QuickSight data sources from Working with Data Sources in Amazon QuickSight.

Joining multiple CSV files

To join two CSV files, complete the following steps:

  1. Upload the orders CSV file that you downloaded earlier to QuickSight.
  2. After selecting the orders sheet, go to the edit/preview data page, where your data set details appear.
  3. From the top menu, choose Add data. A pop-up window appears with the option to either switch data sources or upload a new CSV file to join with the current data set; the window also shows the existing source.
  4. Choose Upload a file and upload the returns CSV file. After uploading the file, you can see the sheet names for both CSVs. The following screenshot shows orders and returns.
  5. Choose the two circles between the files. This step allows you to edit the join configuration between the two CSVs. In the Join configuration section, you can select the join type (inner, left, right, or full) and the columns on which to apply the join. This post uses an inner join.
  6. For Join type, choose Inner. This post joins the two files on their Order ID columns.
  7. For Join clauses, select Order ID in both drop-downs.
  8. Choose Apply.

You now have a data set that contains both orders and returns data from two different CSVs. You can save your data set and continue to create your analysis.
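
Conceptually, the join you just configured behaves like a standard SQL inner join. The following sketch is only an illustration of the semantics, not a query you run in QuickSight, and it assumes hypothetical table names orders and returns for the two uploaded sheets, each with an Order ID column.

SELECT o.*, r.*
FROM orders AS o
INNER JOIN returns AS r
    ON o."Order ID" = r."Order ID"; -- rows without a matching Order ID in both files are dropped

Choosing left, right, or full in the Join configuration corresponds to LEFT, RIGHT, or FULL OUTER JOIN semantics, which keep unmatched rows from one or both sides instead of dropping them.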

Joining an RDS table with S3

In this example, you have orders data in RDS and returns data as a JSON file in an S3 bucket. To join this data, complete the following steps:

  1. Create a data set on QuickSight from the RDS table. Review Create a Database Data Set and an Analysis to learn how to connect to RDS to create data sets.
  2. Next, go to Manage data set and select the RDS data set. This post uses the orders data set.
  3. Choose Edit data set. A page with your data set details appears.
  4. From the top menu, choose Add data. A pop-up window appears with the option to either switch data sources or upload a new CSV file to join with the current data set; the window also shows the existing source.
  5. Choose Switch data source. A list appears of the different data sets and their sources.
  6. Choose a data set. This post chooses Returns. You can now see both data sets linked together.
  7. Choose the two circles between the data sets.
  8. Under Join configuration, choose your desired Join type. This post chooses Inner.
  9. For Join clauses, select the column on which to apply the join from the drop-down for each data set. This post chooses order_id for orders and Order ID for Returns.
  10. Choose Apply.

Your new data set contains both orders and returns data from two different data sources. You can save your data set and continue to create your analysis.
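
As with the CSV example, this configuration corresponds to a plain SQL inner join, this time across columns with different names. The sketch below is illustrative only; it assumes the RDS table is exposed as orders (with an order_id column) and the S3 data set as returns (with an Order ID column), matching the names used in the steps above.

-- Inner join: keeps only orders that have a matching return
SELECT o.*, r.*
FROM orders AS o
INNER JOIN returns AS r
    ON o.order_id = r."Order ID";

-- Left join alternative: keeps every order, with NULLs where no return exists
SELECT o.*, r.*
FROM orders AS o
LEFT JOIN returns AS r
    ON o.order_id = r."Order ID";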

Conclusion

This post showed you how to join data from two files and how to join an RDS table with data in S3. You can join data from any two data sources (except IoT) on Amazon QuickSight with this method. Cross data source join capability is now available in both Enterprise and Standard editions in all Amazon QuickSight Regions.

About the Author

Rakshith Dayananda is an IT app dev engineer at AWS.