All posts by Louis Hourcade

Build a data storytelling application with Amazon Redshift Serverless and Toucan

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/build-a-data-storytelling-application-with-amazon-redshift-serverless-and-toucan/

This post was co-written with Django Bouchez, Solution Engineer at Toucan.

Business intelligence (BI) with dashboards, reports, and analytics remains one of the most popular use cases for data and analytics. It provides business analysts and managers with a visualization of the business’s past and current state, helping leaders make strategic decisions that shape the future. However, customers continue to ask for better ways to tell stories with their data and thereby increase the adoption rate of their BI tools.

Most BI tools on the market provide an exhaustive set of customization options to build data visualizations. This might seem like an advantage, but it ultimately burdens business analysts who need to navigate endless possibilities before building a report. Analysts are not graphic designers, and a poorly designed data visualization can hide the insight it’s intended to convey, or even mislead the viewer. To realize more value from your data, you should focus on building data visualizations that tell stories and are easily understood by your audience. This is where guided analytics helps. Instead of presenting unlimited options for customization, it intentionally limits choice by enforcing design best practices. The simplicity of a guided experience enables business analysts to spend more time generating actual insights rather than worrying about how to present them.

This post illustrates the concept of guided analytics and shows you how you can build a data storytelling application with Amazon Redshift Serverless and Toucan, an AWS Partner. Toucan natively integrates with Redshift Serverless, which enables you to deploy a scalable data stack in minutes without the need to manage any infrastructure component.

Amazon Redshift is a fully managed cloud data warehouse service that enables you to analyze large amounts of structured and semi-structured data. Amazon Redshift can scale from a few gigabytes to a petabyte-scale data warehouse, and AWS recently announced the global availability of Redshift Serverless, making it one of the best options for storing data and running ad hoc analytics in a scalable and cost-efficient way.

With Redshift Serverless, you can get insights from your data by running standalone SQL queries or by using data visualization tools such as Amazon QuickSight, Toucan, or other third-party options, without having to manage your data warehouse infrastructure.

Toucan is a cloud-based guided analytics platform built with one goal in mind: reduce the complexity of bringing data insights to business users. For this purpose, Toucan provides a comprehensive, no-code user experience covering every stage of a data storytelling application: connecting to data, building visualizations, and distributing them on any device.

If you’re in a hurry and want to see what you can do with this integration, check out Shark attacks visualization with AWS & Toucan, where Redshift Serverless and Toucan help you understand how shark attacks have evolved around the world.

Overview of solution

There are many BI tools on the market, each providing an ever-increasing set of capabilities and customization options to differentiate itself from the competition. Paradoxically, this doesn’t seem to increase the adoption rate of BI tools in enterprises. With more complex tools, data owners spend time building fancy visuals, and tend to add as much information as possible to their dashboards instead of providing a clear and simple message to business users.

In this post, we illustrate the concept of guided analytics by putting ourselves in the shoes of a data engineer who needs to communicate stories to business users with data visualizations. This fictional data engineer has to create dashboards to understand how shark attacks have evolved over the last 120 years. After loading the shark attacks dataset into Redshift Serverless, we guide you in using Toucan to build stories that provide a better understanding of shark attacks through time. With Toucan, you can natively connect to datasets in Redshift Serverless, transform the data with a no-code interface, build storytelling visuals, and publish them for business users. The shark attacks visualization example illustrates what you can achieve by following the instructions in this post.

Additionally, we have recorded a video tutorial that explains how to connect Toucan with Redshift Serverless and start building charts.

Solution architecture

The following diagram depicts the architecture of our solution.

Architecture diagram

We use an AWS CloudFormation stack to deploy all the resources you need in your AWS account:

  • Networking components – This includes a VPC, three public subnets, an internet gateway, and a security group to host the Redshift Serverless endpoint. In this post, we use public subnets to facilitate data access from external sources such as Toucan instances. Even so, the data in Redshift Serverless is still protected by the security group, which restricts incoming traffic, and by the database credentials. For a production workload, we recommend keeping traffic within the Amazon network. To do so, you can place the Redshift Serverless endpoints in private subnets and deploy Toucan in your AWS account through the AWS Marketplace.
  • Redshift Serverless components – This includes a Redshift Serverless namespace and workgroup. The Redshift Serverless workgroup is publicly accessible to facilitate the connection from Toucan instances. The database name and the administrator user name are defined as parameters when deploying the CloudFormation stack, and the administrator password is created in AWS Secrets Manager. In this post, we use database credentials to connect to Redshift Serverless, but Toucan also supports connection with AWS credentials and AWS Identity and Access Management (IAM) profiles.
  • Custom resources – The CloudFormation stack includes a custom resource, which is an AWS Lambda function that loads shark attacks data automatically in your Redshift Serverless database when the CloudFormation stack is created.
  • IAM roles and permissions – Finally, the CloudFormation stack includes all IAM roles associated with services previously mentioned to interact with other AWS resources in your account.

In the following sections, we provide all the instructions to connect Toucan with your data in Redshift Serverless, and guide you to build your data storytelling application.

Sample dataset

In this post, we use a custom dataset that lists all known shark attacks in the world, starting from 1900. You don’t have to import the data yourself; we use the Amazon Redshift COPY command to load the data when deploying the CloudFormation stack. The COPY command is one of the fastest and most scalable methods to load data into Amazon Redshift. For more information, refer to Using a COPY command to load data.
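For reference, the following is a minimal sketch of this kind of COPY statement; the S3 path, file name, and IAM role are illustrative placeholders rather than the actual values used by the CloudFormation stack:

COPY public.shark_attacks
FROM 's3://<your-data-bucket>/shark_attacks.csv'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
CSV
IGNOREHEADER 1;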

The dataset contains 4,900 records with the following columns:

  • Date
  • Year
  • Decade
  • Century
  • Type
  • Zone_Type
  • Zone
  • Country
  • Activity
  • Sex
  • Age
  • Fatal
  • Time
  • Species
  • href (a PDF link with the description of the context)
  • Case_Number

Prerequisites

For this solution, you should have the following prerequisites:

  • An AWS account. If you don’t have one already, see the instructions in Sign Up for AWS.
  • An IAM user or role with permissions on AWS resources used in this solution.
  • A Toucan free trial to build the data storytelling application.

Set up the AWS resources

You can launch the CloudFormation stack in any Region where Redshift Serverless is available.

  1. Choose Launch Stack to start creating the required AWS resources for this post:

  2. Specify the database name in Redshift Serverless (default is dev).
  3. Specify the administrator user name (default is admin).

You don’t have to specify the database administrator password because it’s created in Secrets Manager by the CloudFormation stack. The secret’s name is AWS-Toucan-Redshift-Password. We use the secret value in subsequent steps.

Test the deployment

The CloudFormation stack takes a few minutes to deploy. When it’s complete, you can confirm the resources were created. To access your data, you need to get the Redshift Serverless database credentials.

  1. On the Outputs tab for the CloudFormation stack, note the name of the Secrets Manager secret.


  2. On the Secrets Manager console, navigate to the Amazon Redshift database secret and choose Retrieve secret value to get the database administrator user name and password.

  3. To make sure your Redshift Serverless database is available and contains the shark attacks dataset, open the Redshift Serverless workgroup on the Amazon Redshift console and choose Query data to access the query editor.
  4. Also note the Redshift Serverless endpoint, which you need to connect with Toucan.

  5. In the Amazon Redshift query editor, run the following SQL query to view the shark attacks data:
SELECT * FROM "dev"."public"."shark_attacks";

Redshift Query Editor v2

Note that you need to change the name of the database in the SQL query if you changed the default value when launching the CloudFormation stack.

You have configured Redshift Serverless in your AWS account and uploaded the shark attacks dataset. Now it’s time to use this data by building a storytelling application.

Launch your Toucan free trial

The first step is to access the Toucan platform through the Toucan free trial.

Fill out the form and complete the signup steps. You then arrive in the Storytelling Studio, in Staging mode. Feel free to explore what has already been created.

Toucan Home page

Connect Redshift Serverless with Toucan

To connect Redshift Serverless and Toucan, complete the following steps:

  1. Choose Datastore at the bottom of the Toucan Storytelling Studio.
  2. Choose Connectors.

Toucan natively integrates with Redshift Serverless through AnyConnect.

  3. Search for the Amazon Redshift connector, and complete the form with the following information:
    • Name – The name of the connector in Toucan.
    • Host – Your Redshift Serverless endpoint.
    • Port – The listening port of your Amazon Redshift database (5439).
    • Default Database – The name of the database to connect to (dev by default, unless edited in the CloudFormation stack parameters).
    • Authentication Method – The authentication mechanism to connect to Redshift Serverless. In this case, we use database credentials.
    • User – The user name to use for authentication with Redshift Serverless (admin by default, unless edited in the CloudFormation stack parameters).
    • Password – The password to use for authentication with Redshift Serverless (you should retrieve it from Secrets Manager; the secret’s name is AWS-Toucan-Redshift-Password).

Toucan connection

Create a live query

You are now connected to Redshift Serverless. Complete the following steps to create a query:

  1. On the home page, choose Add tile to create a new visualization.

Toucan new tile

  2. Choose the Live Connections tab, then choose the Amazon Redshift connector you created in the previous step.

Toucan Live Connection

The Toucan trial guides you in building your first live query, in which you can transform your data without writing code by using the Toucan YouPrep module.

For instance, as shown in the following screenshot, you can use this no-code interface to compute the sum of fatal shark attacks by activity, get the top five activities, and calculate each one’s percentage of the total.

Toucan query data
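If you’re curious what this transformation looks like in SQL, the following is a rough equivalent that you could run directly against the shark_attacks table. This is an illustrative sketch, not a query generated by Toucan, and it assumes the Fatal column flags fatal attacks with the value 'Y':

SELECT activity,
       COUNT(*) AS fatal_attacks,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_total
FROM "dev"."public"."shark_attacks"
WHERE fatal = 'Y'
GROUP BY activity
ORDER BY fatal_attacks DESC
LIMIT 5;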

Build your first chart

When your data is ready, choose the Tile tab and complete the form that helps you build charts.

For example, you can configure a leaderboard of the five most dangerous activities, and add a highlight for activities with more than 100 attacks.

Choose Save Changes to save your work and go back to the home page.

Toucan chart builder

Publish and share your work

Up to this stage, you have been working in Staging mode. To make your work available to everyone, you need to publish it to Production.

On the bottom right of the home page, choose the eye icon to preview your work by putting yourself in the shoes of your future end-users. You can then choose Publish to make your work available to all.

Toucan publish

Toucan also offers multiple embedding options that make your charts easier for end-users to access on other devices, such as mobile and tablet.

Toucan multi devices

Following these steps, you connected to Redshift Serverless, transformed the data with the Toucan no-code interface, and built data visualizations for business end-users. The Toucan trial guides you in every stage of this process to help you get started.

Redshift Serverless and Toucan guided analytics provide an efficient approach to increasing the adoption rate of BI tools, by decreasing infrastructure work for data engineers and by making dashboards easier for business end-users to understand. This post only covered a small part of what Redshift Serverless and Toucan offer, so feel free to explore other functionalities in the Amazon Redshift Serverless documentation and Toucan documentation.

Clean up

Some of the resources deployed in this post through the CloudFormation template incur costs as long as they’re in use. Be sure to remove the resources and clean up your work when you’re finished in order to avoid unnecessary costs.

On the CloudFormation console, choose Delete stack to remove all resources.

Conclusion

This post showed you how to set up an end-to-end architecture for guided analytics with Redshift Serverless and Toucan.

This solution benefits from the scalability of Redshift Serverless, which enables you to store, transform, and expose data in a cost-efficient way, and without any infrastructure to manage. Redshift Serverless natively integrates with Toucan, a guided analytics tool designed to be used by everyone, on any device.

Guided analytics focuses on communicating stories through data reports. By setting intentional constraints on customization options, Toucan makes it easy for data owners to build meaningful dashboards with a clear and concise message for end-users. It works for both your internal and external customers, on an unlimited number of use cases.

Try it now with our CloudFormation template and a free Toucan trial!


About the Authors


Louis
Louis Hourcade
is a Data Scientist in the AWS Professional Services team. He works with AWS customers across various industries to accelerate their business outcomes with innovative technologies. In his spare time, he enjoys running, climbing big rocks, and surfing (not so big) waves.


Benjamin
Benjamin Menuet
is a Data Architect with AWS Professional Services. He helps customers develop big data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished legendary races such as the UTMB.


Xavier
Xavier Naunay
is a Data Architect with AWS Professional Services. He is part of the AWS ProServe team, helping enterprise customers solve complex problems using AWS services. In his free time, he is either traveling or learning about technology and other cultures.


Django
Django Bouchez
is a Solution Engineer at Toucan. He works alongside the Sales team to provide support for technical and functional validation and proofs, and also helps R&D demo new features with Cloud Partners like AWS. Outside of work, Django is a homebrewer and practices scuba diving and sport climbing.

Query cross-account AWS Glue Data Catalogs using Amazon Athena

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/query-cross-account-aws-glue-data-catalogs-using-amazon-athena/

Many AWS customers rely on a multi-account strategy to scale their organization and better manage their data lake across different projects or lines of business. The AWS Glue Data Catalog contains references to data used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. Using a centralized Data Catalog offers organizations a unified metadata repository and minimizes the administrative overhead related to sharing data across different accounts, thereby expanding access to the data lake.

Amazon Athena is one of the most popular choices for running analytical queries in data lakes. This interactive query service makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you’re charged based on the amount of data scanned by your queries.

In May 2021, Athena introduced the ability to query Data Catalogs across multiple AWS accounts, enabling you to access your data lake without the complexity of replicating catalog metadata in individual AWS accounts. This blog post details the procedure for using the feature.

Solution overview

The following diagram shows the necessary components used in two different accounts (a consumer account and a producer account, which hosts the central Data Catalog) and the flow between the two for cross-account Data Catalog access using Athena.

Our use case showcases Data Catalog sharing between two accounts:

  • Producer account – The account that administers the central Data Catalog
  • Consumer account – The account querying data from the producer’s Data Catalog (the central Data Catalog)

In this walkthrough, we use the following two tables, extracted from an ecommerce dataset:

  • The orders table logs the website’s orders and contains the following key attributes:
    • Row ID – Unique entry identifier in the orders table
    • Order ID – Unique order identifier
    • Order date – Date the order was placed
    • Profit – Profit value of the order
  • The returns table logs the returned items and contains the following attributes:
    • Returned – If the order has been returned (Yes/No)
    • Order ID – Unique order identifier
    • Market – Region market

We walk you through the following high-level steps to use this solution:

  1. Set up the producer account.
  2. Set up the consumer account.
  3. Set up permissions.
  4. Register the producer account in the Data Catalog.
  5. Query your data.

You use Athena in the consumer account to perform different operations using the producer account’s Data Catalog.

First, you use the consumer account to query the orders table in the producer account’s Data Catalog.

Next, you use the consumer account to join the two tables and retrieve information about lost profit from returned items. The returns table is in the consumer’s Data Catalog, and the orders table is in the producer’s.

Prerequisites

The following are the prerequisites for this walkthrough:

  • Two AWS accounts: one acting as the producer account, which hosts the central Data Catalog, and one acting as the consumer account.
  • An Athena workgroup that runs on Athena engine version 2. To check this, open the Athena console and choose Workgroups in the navigation pane. This lists all your Athena workgroups. Make sure that the one you use runs on Athena engine version 2.

If all your workgroups are using Athena engine version 1, you need to update the engine version of an existing workgroup or create a new workgroup with the appropriate version.

Set up the producer account

In the producer account, complete the following steps:

  1. Create an S3 bucket for your producer’s data. For information about how to secure your S3 bucket, see Security Best Practices for Amazon S3.
  2. In this bucket, create a prefix named orders.
  3. Download the orders table in CSV format and upload it to the orders prefix.
  4. Run the following Athena query to create the producer’s database:
CREATE DATABASE producer_database
  COMMENT 'Producer data'
  5. Run the following Athena query to create the orders table in the producer’s database. Make sure to replace <your-producer-s3-bucket-name> with the name of the bucket you created.
CREATE EXTERNAL TABLE producer_database.orders(
  `row id` bigint, 
  `order id` string, 
  `order date` string, 
  `ship date` string, 
  `ship mode` string, 
  `customer id` string, 
  `customer name` string, 
  `segment` string, 
  `city` string, 
  `state` string, 
  `country` string, 
  `postal code` bigint, 
  `market` string, 
  `region` string, 
  `product id` string, 
  `category` string, 
  `sub-category` string, 
  `product name` string, 
  `sales` string, 
  `quantity` bigint, 
  `discount` string, 
  `profit` string, 
  `shipping cost` string, 
  `order priority` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ';'
LOCATION
  's3://<your-producer-s3-bucket-name>/orders/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)

Set up the consumer account

In the consumer account, complete the following steps:

  1. Create an S3 bucket for your consumer’s data.
  2. In this bucket, create a prefix named returns.
  3. Download the returns table in CSV format and upload it to the returns prefix.
  4. Run the following Athena query to create the consumer’s database:
CREATE DATABASE consumer_database
COMMENT 'Consumer data'
  5. Run the following Athena query to create the returns table in the consumer’s database. Make sure to replace <your-consumer-s3-bucket-name> with the name of the bucket you created.
CREATE EXTERNAL TABLE consumer_database.returns(
  `returned` string, 
  `order id` string, 
  `market` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ';'
LOCATION
  's3://<your-consumer-s3-bucket-name>/returns/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)

Set up permissions

For the consumer account to query data in the producer account, we need to set up permissions.

First, we give the consumer account permission to access the producer account’s AWS Glue resources.

  1. In the producer account’s Data Catalog settings, add the following AWS Glue resource policy, which grants the consumer account access to the Data Catalog:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Consumer-account-id>:role/<role-in-consumer-account>"
            },
            "Action": [
        "glue:GetDatabases",
        "glue:GetTables"
      ],
            "Resource": [
                "arn:aws:glue:<Region>:<Producer-account-id>:catalog",
                "arn:aws:glue:<Region>:<Producer-account-id>:database/producer-database",
                "arn:aws:glue:<Region>:<Producer-account-id>:table/producer-database/orders"
            ]
        }
    ]
}

Next, we give the consumer account permission to list and get data from the S3 bucket in the producer account.

  2. In the producer account, add the following S3 bucket policy to the bucket <Producer-bucket>, which stores the data:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Consumer-account-id>:role/<role-in-consumer-account>"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<Producer-bucket>",
                "arn:aws:s3:::<Producer-bucket>/orders/*"
            ]
        }
    ]
}

Register the producer account’s Data Catalog

At this stage, you have set up the required permissions to access the central Data Catalog in the producer account from the consumer account. You now need to register the central Data Catalog as a data source in Athena.

  1. In the consumer account, go to the Athena console and choose Connect data source.
  2. Select S3 – AWS Glue Data Catalog as the data source.
  3. Select AWS Glue Data Catalog in another account.

You then need to provide some information regarding the central Data Catalog you want to register.

  4. For Data source name, enter a name for the catalog (for example, Central_Data_Catalog). This serves as an alias in the consumer account, pointing to the central Data Catalog in the producer account.
  5. For Catalog ID, enter the producer account ID.
  6. Choose Register to complete the process.

Query your data

You have now registered the central Data Catalog as a data source in the consumer account. In the Athena query editor, you can then choose Central_Data_Catalog as a data source. Under Database, you can see all the databases for which you were granted access in the producer account’s AWS Glue resource policy. The same applies for the tables. After completing the steps in the earlier sections, you should see the orders table from producer_database located in the producer account.

You can start querying the Data Catalog of the producer account directly from Athena in the consumer account. You can test this by running the following SQL query in Athena:

SELECT * FROM "Central_Data_Catalog"."producer_database"."orders" limit 10;

This SQL query extracts the first 10 rows of the orders table located in the producer account.

You just queried a Data Catalog located in another AWS account, which enables you to easily access your central Data Catalog and scale your data lake strategy.

Now, let’s see how we can join two tables that are in different AWS accounts. In our scenario, the returns table is in the consumer account and the orders table is in the producer account. Suppose you want to join the two tables and see the total quantity of items returned in each market. The Athena built-in support for cross-account Data Catalogs makes this operation easy. In the Athena query editor, run the following SQL query:

SELECT
returns_tb.Market as Market,
sum(orders_tb.quantity) as Total_Quantity
FROM "Central_Data_Catalog"."producer_database"."orders" as orders_tb
JOIN "AwsDataCatalog"."consumer_database"."returns" as returns_tb
ON orders_tb."order id" = returns_tb."order id"
GROUP BY returns_tb.Market;

In this SQL query, you use both the consumer’s Data Catalog AwsDataCatalog and the producer’s Data Catalog Central_Data_Catalog to join tables and get insights from your data.
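To get the lost profit mentioned earlier, you could adapt the query along the following lines. This is a sketch rather than a definitive query: it assumes the Returned flag uses the literal 'Yes', and it casts the profit strings (declared as string in the DDL above) to double, which presumes they hold clean numeric values:

SELECT
returns_tb.market AS market,
SUM(CAST(orders_tb.profit AS double)) AS lost_profit
FROM "Central_Data_Catalog"."producer_database"."orders" AS orders_tb
JOIN "AwsDataCatalog"."consumer_database"."returns" AS returns_tb
ON orders_tb."order id" = returns_tb."order id"
WHERE returns_tb.returned = 'Yes'
GROUP BY returns_tb.market;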

Limitations and considerations

The following are some limitations that you should take into consideration before using Athena built-in support for cross-account Data Catalogs:

  • This Athena feature is available only in Regions where Athena engine version 2 is supported. For a list of Regions that support Athena engine version 2, see Athena engine version 2. To upgrade a workgroup to engine version 2, see Changing Athena Engine Versions.
  • As of this writing, CREATE VIEW statements that include a cross-account Data Catalog are not supported (see the illustrative example after this list).
  • Cross-Region Data Catalog queries are not supported.
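For example, a view definition like the following, which references the cross-account catalog alias, is expected to fail as of this writing (the view name is an illustrative placeholder):

CREATE VIEW producer_orders_view AS
SELECT orders_tb.*
FROM "Central_Data_Catalog"."producer_database"."orders" AS orders_tb;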

Clean up

After you query and analyze the data, you should clean up the resources used in this tutorial to prevent any recurring AWS costs.

To clean up the resources, navigate to the Amazon S3 console in both the producer and consumer accounts, and empty the S3 buckets. Also, navigate to the AWS Glue console and delete the databases.
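If you prefer SQL, you can also drop the catalog objects directly from Athena. The following is a minimal sketch for the producer account; run the equivalent statements for consumer_database in the consumer account. Note that dropping tables and databases doesn’t delete the underlying S3 data, so you still need to empty the buckets.

DROP TABLE IF EXISTS producer_database.orders;
DROP DATABASE IF EXISTS producer_database;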

Conclusion

In this post, you learned how to query data across multiple accounts using Athena, which allows your organization to access a centralized Data Catalog. We hope that this post helps you build and explore your data lake across multiple accounts.

To learn more about AWS tools to manage access to your data, check out AWS Lake Formation. This service facilitates setting up a centralized data lake and allows you to grant users and ETL jobs cross-account access to Data Catalog metadata and underlying data.


About the Authors

Louis Hourcade is a Data Scientist in the AWS Professional Services team. He works with AWS customers across various industries to accelerate their business outcomes with innovative technologies. In his spare time, he enjoys running, climbing big rocks, and surfing (not so big) waves.

Sara Kazdagli is a Professional Services consultant specialized in data analytics and machine learning. She helps customers across different industries build innovative solutions and make data-driven decisions. Sara holds an MSc in software engineering and an MSc in data science. In her spare time, she likes to go on hikes and walks with her Australian Shepherd dog Kiba.

Jahed Zaïdi is an AI/ML & Big Data specialist at AWS Professional Services. He is a builder and a trusted advisor to companies across industries, helping them innovate faster and on a larger scale. As a lifelong explorer, Jahed enjoys discovering new places, cultures, and outdoor activities.