All posts by Ying Wang

Talk to your data: Query your data lake with Amazon QuickSight Q

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/talk-to-your-data-query-your-data-lake-with-amazon-quicksight-q/

Amazon QuickSight Q uses machine learning (ML) and natural language technology to empower you to ask business questions about your data and get answers instantly. You can simply enter your questions (for example, “What is the year-over-year sales trend?”) and get the answer in seconds in the form of a QuickSight visual.

Some business questions can’t be answered through existing business intelligence (BI) dashboards. It can take days or weeks for the BI team to accommodate these needs and refine their solution. Because Q doesn’t depend on prebuilt dashboards or reports to answer questions, it removes the need for BI teams to create or update dashboards every time a new business question arises. You can ask questions and receive answers in the form of visuals in seconds directly from within QuickSight or from web applications and portals. Q empowers every business user to self-serve and get insights faster, regardless of their background or skillset.

In this post, we walk you through the steps to configure Q using an Olympic Games public dataset and demonstrate how an end-user can ask simple questions directly from Q in an interactive manner and receive answers in seconds.

Solution overview

We use Olympic Games public datasets to configure a Q topic and share tips and tricks for further configuring the topic so that Q can provide prompt answers. Q uses ML-powered natural language query (NLQ) capabilities that let you ask questions about your data in everyday business language.

The video from Data Con LA provides a high-level demonstration of the capabilities covered in this post.

Additionally, we discuss the following:

  • Best practices for data modeling of a Q topic
  • How to perform data cleansing using AWS Glue DataBrew, SQL, or an Amazon SageMaker Jupyter notebook on datasets to build a Q topic

We use multiple publicly available datasets from Kaggle. The datasets have historical information about athletes, including name, ID, age, weight, country, and medals.

We use the 2020 Olympic datasets and historical data. We also use the datasets Introduction of Women Olympics Sports and Women of Olympic Games to determine the participation of women athletes in the Olympics and discover trends. The QuickSight datasets created using these public data files are added to a Q topic, as shown in the following screenshot. We provide details on creating QuickSight datasets later in this post.

Prerequisites

To follow along with the solution presented in this post, you must have access to an AWS account with Amazon QuickSight Enterprise edition (with the Q add-on), Amazon S3, Athena, and AWS CloudFormation.

Create solution resources

The public datasets in Kaggle can't be directly used to create a Q topic. We have already cleansed the raw data and provide the cleansed datasets in the GitHub repo. If you're interested in learning more about data cleansing, we discuss three different data cleansing methods at the end of this post.

To create your resources, complete the following steps:

  1. Create an S3 bucket called olympicsdata.
  2. Create a folder for each data file, as shown in the following screenshot.
  3. Upload the data files from the GitHub repo into their respective folders.
  4. Deploy the provided CloudFormation template and provide the necessary information.

The template creates an Athena database and tables, as shown in the following screenshot.

The template also creates the QuickSight data source athena-olympics and datasets.
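
If you prefer to script steps 1 through 3, the following is a minimal boto3 sketch; the bucket name and the file-to-folder mapping are assumptions to adapt to your own layout.

import boto3

s3 = boto3.client("s3")
bucket = "olympicsdata"  # S3 bucket names are globally unique; adjust if this one is taken

# Outside us-east-1, add CreateBucketConfiguration={"LocationConstraint": "<region>"}
s3.create_bucket(Bucket=bucket)

# Hypothetical mapping of local files from the GitHub repo to their S3 folders
files = {
    "medals.csv": "medals/medals.csv",
    "athletes.csv": "athletes/athletes.csv",
    "coaches.csv": "coaches/coaches.csv",
}
for local_path, key in files.items():
    s3.upload_file(local_path, bucket, key)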

Create datasets in QuickSight

To build the Q topic, we need to combine the datasets, because each table contains only partial data. Joining these tables helps answer questions across all the features of the 2020 Olympics.

We create the Olympics 2021 dataset by joining the tables Medals_athletes_2021, Athletes_full_2021, Coach_full_2021, and Tech_official_2021.

The following screenshot shows the joins for our complete dataset.

Medals_athletes_2021 is the main table, with the following join conditions:

  • Left outer join athletes_full_2021 on athlete_name, discipline_code, and country_code
  • Left outer join coach_full_2021 on country, discipline, and event
  • Left outer join tech_official_2021 on discipline
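
For reference, the same joins expressed in SQL look like the following sketch (table and column names as described above):

SELECT *
FROM medals_athletes_2021 m
LEFT OUTER JOIN athletes_full_2021 a
  ON m.athlete_name = a.athlete_name
  AND m.discipline_code = a.discipline_code
  AND m.country_code = a.country_code
LEFT OUTER JOIN coach_full_2021 c
  ON m.country = c.country
  AND m.discipline = c.discipline
  AND m.event = c.event
LEFT OUTER JOIN tech_official_2021 t
  ON m.discipline = t.discipline;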

Finally, we have the following datasets that we use for our Q topic:

  • Olympics 2021 Details
  • Medals 2021
  • Olympics History (created using the Olympics table)
  • Introduction of Women Olympics Sports
  • Women in the Olympic Movement

Create a Q topic

Topics are collections of one or more datasets that represent a subject area that your business users can ask questions about. In QuickSight, you can create and manage topics on the Topics page. When you create a topic, your business users can ask questions about it in the Q search bar.

When you create topics in Q, you can add multiple datasets to them and then configure all the fields in the datasets to make them natural language-friendly. This enables Q to provide your business users with the correct visualizations and answers to their questions.

The following are data modeling best practices for Q topics:

  • Reduce the number of datasets by consolidating the data. Any given question can only hit one dataset, so only include multiple datasets if they're related enough to be part of the same topic, but distinct enough that you can ask questions against each of them independently.
  • For naming conventions, provide a meaningful name or alias (synonym) of a field to allow the end-user to easily query it.
  • If a field appears in different datasets, make sure that this field has the same name across different datasets.
  • Validate data consistency. For example, the total value of a metric that aggregates from different datasets should be consistent.
  • For fields that don’t require on-the-fly calculations, for example, metrics with distributive functions (sum, max, min, and so on), push down the calculation into a data warehouse.
  • For fields that require on-the-fly calculations, create the calculated field in the QuickSight dataset or Q topic. If other topics or dashboards might reuse the same field, create it in the datasets.

To create a topic, complete the following steps:

  1. On the QuickSight console, choose Topics in the navigation pane.
  2. Choose New topic.
  3. For Topic name, enter a name.
  4. For Description, enter a description.
  5. Choose Save.
  6. On the Add data to topic page that opens, choose Datasets, and then select the datasets that we created in the previous section.
  7. Choose Add data to create the topic.

Enhance the topic

In this section, we discuss various ways that you can enhance the topic.

Add calculated fields to a topic dataset

You can add new fields to a dataset in a topic by creating calculated fields.

For example, we have the column Age in our Olympics dataset. We can create a calculated field to group age into different ranges using the ifelse function. This calculated field can help us ask a question like “How many athletes for each age group?”

  1. Choose Add calculated field.
  2. In the calculation editor, enter the following syntax:
    ifelse(
    Age<=20, '0-20',
    Age>20 and Age <=40, '21-40',
    Age>40 and Age<=60, '41-60',
    '60+'
    )

  3. Name the calculated field Age Groups.
  4. Choose Save.

The calculated field is added to the list of fields in the topic.

Add filters to a topic dataset

Let’s say a lot of analysis is expected on the dataset for the summer season. We can add a filter to allow for easy selection of this value. Furthermore, if we want to allow analysis against data for the summer season only, we can choose to always apply this filter or apply it as the default choice, but allow users to ask questions about other seasons as well.

  1. Choose Add filter.
  2. For Name, enter Summer.
  3. Choose the Women in the Olympic Movement dataset.
  4. Choose the Olympics Season field.
  5. Choose Custom filter list for Filter type and set the rule as include.
  6. Enter Summer under Values.
  7. Choose Apply always, unless a question results in an explicit filter from the dataset.
  8. Choose Save.

The filter is added to the list of fields in the topic.

Add named entities to a topic dataset

We can define named entities if we need to show users a combination of fields. For example, when someone asks for player details, it makes sense to show them player name, age, country, sport, and medal. We can make this happen by defining a named entity.

  1. Choose Add named entity.
  2. Choose the Olympics dataset.
  3. Enter Player Profile for Name.
  4. Enter Information of Player for Description.
  5. Choose Add field.
  6. Choose Player Name from the list.
  7. Choose Add field again and add the fields Age, Countries, Sport, and Medal.
    The fields are listed in the order in which they appear in answers. To move a field, choose the six dots next to its name and drag the field to the order that you want.
  8. Choose Save.

The named entity is added to the list of fields in the topic.

Make Q topics natural language-friendly

To help Q interpret your data and better answer your readers’ questions, provide as much information about your datasets and their associated fields as possible.

To make the topic more natural language-friendly, use the following procedures.

Rename fields

You can make your field names more user-friendly in your topics by renaming them and adding descriptions.

Q uses field names to understand the fields and link them to terms in your readers’ questions. When your field names are user-friendly, it’s easier for Q to draw links between the data and a reader’s question. These friendly names are also presented to readers as part of the answer to their question to provide additional context.

Let’s rename the birth date field from the athlete dataset as Athlete Birth Date. Because we have multiple birth date fields in the topics for coach, athlete, and tech roles, renaming the athletes’ birth date field helps Q easily link to the data field when we ask questions regarding athletes’ birth dates.

  1. On the Fields page, choose the down arrow at the far right of the Birth Date field to expand it.
  2. Choose the pencil icon next to the field name.
  3. Rename the field to Athlete Birth Date.

Add synonyms to fields in a topic

Even if you update your field names to be user-friendly and provide a description for them, your readers might still use different names to refer to them. For example, a player name field might be referred to as player, players, or sportsman in your reader’s questions.

To help Q make sense of these terms and map them to the correct fields, you can add one or more synonyms to your fields. Doing this improves Q’s accuracy.

  1. On the Fields page, under Synonyms, choose the pencil icon for Player Name.
  2. Enter player and sportsman as synonyms.

Add synonyms to field values

Like we did for field names, we can add synonyms for category values as well.

  1. Choose the Gender field’s row to expand it.
  2. Choose Configure value synonyms, then choose Add.
  3. Choose the pencil icon next to the F value.
  4. Add the synonym Female.
  5. Repeat these steps to add the synonym Male for M.
  6. Choose Done.

Assign field roles

Every field in your dataset is either a dimension or a measure. Knowing whether a field is a dimension or a measure determines what operations Q can and can’t perform on a field.

For example, setting the field Age as a dimension means that Q doesn’t try to aggregate it as it does measures.

  1. On the Fields page, expand the Age field.
  2. For Role, choose Dimension.

Set field aggregations

Setting field aggregations tells Q which function should or shouldn’t be used when those fields are aggregated across multiple rows. You can set a default aggregation for a field, and specify aggregations that aren’t allowed.

A default aggregation is the aggregation that’s applied when there’s no explicit aggregation function mentioned or identified in a reader’s question. For example, let’s ask Q “Show total number of events.” In this case, Q uses the field Total Events, which has a default aggregation of Sum, to answer the question.

  1. On the Fields page, expand the Total Events field.
  2. For Default aggregation, choose Sum.
  3. For Not allowed aggregation, choose Average.

Specify field semantic types

Providing more details on the field context will help Q answer more natural language questions. For example, users might ask “Who won the most medals?” We haven’t set any semantic information for any fields in our dataset yet, so Q doesn’t know what fields to associate with “who.” Let’s see how we can enable Q to tackle this question.

  1. On the Fields page, expand the Player Name field.
  2. For Semantic Type, choose Person.

This enables Q to surface Player Name as an option when answering “who”-based questions.

Exclude unused or unnecessary fields

Fields from all included datasets are displayed by default. However, we have a few fields like Short name of Country, URL Coach Full 2021, and URL Tech Official 2021 that we don’t need in our topic. We can exclude unnecessary fields from the topic to prevent them from showing up in results by choosing the slider next to each field.

Ask questions with Q

After we create and configure our topic, we can now interact with Q by entering questions in the Q search bar.

For example, let’s enter show total medals by country. Q presents an answer to your question as a visual.

You can see how Q interpreted your question in the description at the visual’s upper left. Here you can see the fields, aggregations, topic filters, and datasets used to answer the question. The topic filter na is applied on the Medal attribute, which excludes na values from the aggregation. For more information on topic filters, see Adding filters to a topic dataset.

Q displays the results using the visual type best suited to convey the information. However, Q also gives you the flexibility to view results in other visual types by choosing the Visual icon.

As another example, let’s enter who is the oldest player in basketball. Q presents an answer to your question as a visual.

Sometimes Q might not interpret your question the way you wanted. When this happens, you can provide feedback on the answer or make suggestions for corrections to the answer. For more information about providing answer feedback, see Providing feedback about QuickSight Q topics. For more information about correcting answers, see Correcting wrong answers provided by Amazon QuickSight Q.

Conclusion

In this post, we showed you how to configure Q using an Olympic Games public dataset so that end-users can ask simple questions directly from Q in an interactive manner and receive answers in seconds. If you have any feedback or questions, please leave them in the comments section.

Appendix 1: Types of questions supported by Q

Let’s look at samples of each question type that Q can answer using the topic created earlier in this post.

Try the following questions or your own questions and continue enhancing the topic to improve accuracy of responses.

Question Type – Example
Dimensional Group Bys – show total medals by country
Dimensional Filters (Include) – show total medals for united states
Date Group Bys – show yearly trend of women participants
Multi Metrics – number of women events compared to total events
KPI-Based Period over Periods (PoPs) – how many women participants in 2018 over 2016
Relative Date Filters – show total medals for united states in the last 5 years
Time Range Filters – list of women sports introduced since 2016
Top/Bottom Filter – show me the top 3 player with gold medal
Sort Order – show top 3 countries with maximum medals
Aggregate Metrics Filter – show teams that won more than 50 medals
List Questions – list the women sports by year in which they are introduced
OR Filters – Show player who got gold or silver medal
Percent of Total – Percentage of players by country
Where Questions – where are the most number of medals
When Questions – when women volleyball introduced into olympic games
Who Questions – who is the oldest player in basketball
Exclude Questions – show countries with highest medals excluding united states

Appendix 2: Data cleansing

In this section, we provide three options for data cleansing: SQL, DataBrew, and Python.

Option 1: SQL

For our first option, we discuss how to create Athena tables on the downloaded Excel or CSV files and then perform the data cleansing using SQL. This option is suitable for those who use Athena tables as a data source for QuickSight datasets and are comfortable using SQL.

The SQL queries to create Athena tables are available in the GitHub repo. In these queries, we perform data cleansing by renaming columns, changing the data types of some columns, and removing duplicate rows. Proper naming conventions and accurate data types help Q efficiently link the questions to the data fields and provide accurate answers.

Use the following sample DDL query to create an Athena table for women_introduction_to_olympics:

CREATE EXTERNAL TABLE women_introduction_to_olympics(
year string,
sport string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<<s3 bucket name>>/womeninolympics/introduction_of_women_olympic_sports'
TBLPROPERTIES (
'has_encrypted_data'='false')

In our data files, a few columns that are common across more than one dataset have different column names. For example, gender is available as gender or sex; country is available as country, team, or team/noc; and person names have a role prefix in one dataset but not in the other datasets. We rename such columns using SQL to maintain consistent column names.

Additionally, we need to change other demographic columns like age, height, and weight to the INT data type, so that they don’t get imported as String.

The following columns from the data files have been transformed using SQL.

medals:
  • Discipline – Sport
  • Medal_date (timestamp) – Medal_date (date)

Athletes:
  • name – athlete_name
  • gender – athlete_gender
  • birth_date – athlete_birth_date
  • birth_place – athlete_birth_place
  • birth_country – athlete_birth_country

Coaches:
  • name – coach_name
  • gender – coach_gender
  • birth_date – coach_birth_date
  • function – coach_function

Athlete_events (history):
  • Team – country
  • NOC – country_code
  • Age (String) – Age (Integer)
  • Height (String) – Height (Integer)
  • Weight (String) – Weight (Integer)

Option 2: DataBrew

In this section, we discuss a data cleansing option using DataBrew. DataBrew is a visual data preparation tool that makes it easy to clean and prepare data with no prior coding knowledge. You can directly load the results into an S3 bucket or load the data by uploading an Excel or CSV file.

For our example, we walk you through the steps to implement data cleansing on the medals_athletes_2021 dataset. You can follow the same process to perform any necessary data cleaning on other datasets as well.

Create a new dataset in DataBrew using medals_athletes.csv and then create a DataBrew project and implement the following recipes to cleanse the data in the medals_athletes_2021 dataset.

  1. Delete empty rows in the athlete_name column.
  2. Delete empty rows in the medal_type column.
  3. Delete duplicate rows in the dataset.
  4. Rename discipline to Sport.
  5. Delete the column discipline_code.
  6. Split the column medal_type on a single delimiter.
  7. Delete the column medal_type_2, which was created as a result of step 6.
  8. Rename medal_type_1 to medal_type.
  9. Change the data type of column medal_date from timestamp to date.

After you create the recipe, publish it and create a job to output the results in your desired destination. You can create QuickSight SPICE datasets by importing the cleaned CSV file.

Option 3: Python

In this section, we discuss data cleansing using the Python NumPy and pandas libraries on the medals_athletes_2021 dataset. You can follow the same process to perform any necessary data cleansing on other datasets as well. The sample Python code is available on GitHub. This option is suitable for someone who is comfortable processing data using Python.

  1. Delete the column discipline_code:
    olympic = olympic.drop(columns='discipline_code')

  2. Rename the column discipline to sport:
    olympic = olympic.rename(columns={'discipline': 'sport'})
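
The two steps above can be made self-contained as follows; this sketch assumes the raw file medals_athletes.csv is in the working directory and mirrors the cleansing steps of the DataBrew recipe:

import pandas as pd

olympic = pd.read_csv("medals_athletes.csv")

# Drop rows with empty athlete_name or medal_type, then remove duplicates
olympic = olympic.dropna(subset=["athlete_name", "medal_type"])
olympic = olympic.drop_duplicates()

# Drop discipline_code and rename discipline to sport
olympic = olympic.drop(columns="discipline_code")
olympic = olympic.rename(columns={"discipline": "sport"})

olympic.to_csv("medals_athletes_cleansed.csv", index=False)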

You can create QuickSight SPICE datasets by importing the cleansed CSV.

Appendix 3: Data cleansing and modeling in the QuickSight data preparation layer

In this section, we discuss one more method of data cleansing that you can perform from the QuickSight data preparation layer, in addition to the methods discussed previously. Using SQL, DataBrew, or Python has advantages because you can prepare and clean the data outside QuickSight so that other AWS services can use the cleansed results. Additionally, you can automate the scripts. However, Q authors have to learn other tools and programming languages to take advantage of these options.

Cleansing data in the QuickSight dataset preparation stage allows non-technical Q authors to build the application end to end in QuickSight with a codeless method.

The QuickSight dataset stores any data preparation done on the data, so that the prepared data can be reused in multiple analyses and topics.

We have provided a few examples for data cleansing in the QuickSight data preparation layer.

Change a field name

Let’s rename the name field of the Athletes_full_2021 dataset to athlete_name.

  1. In the data preview pane, choose the edit icon on the field that you want to change.
  2. For Name, enter a new name.
  3. Choose Apply.

Change a field data type

You can change the data type of any field from the data source in the QuickSight data preparation layer using the following procedure.

  1. In the data preview pane, choose the edit icon on the field you want to change (for example, birth_date).
  2. Choose Change data type and choose Date.

This converts the string field to a date field.

Appendix 4: Information about the tables

The following table illustrates the scope of each table in the dataset.

Table Name Link Table Data Scope
medals https://www.kaggle.com/piterfm/tokyo-2020-olympics?select=medals.csv Information about medals won by each athlete and the corresponding event and country details
athletes https://www.kaggle.com/piterfm/tokyo-2020-olympics?select=athletes.csv Details about each athlete, such as demographic and country
coaches https://www.kaggle.com/piterfm/tokyo-2020-olympics?select=coaches.csv Details about each coach, such as demographic and country
technical_officials https://www.kaggle.com/piterfm/tokyo-2020-olympics?select=technical_officials.csv Details about each technical official, such as demographic and country
athlete_events https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results Historical information of Olympic games
Introduction_of_Women_Olympics_Sports https://data.world/sports/women-in-the-olympic-games Information on when the women Olympic sports were introduced
womens_participation_in_the_olympic https://data.world/sports/women-in-the-olympic-games Information on participation of women in Olympic sports

About the authors

Ying Wang is a Manager of Software Development Engineering. She has 12 years of experience in data analytics and data science. In her data architect life, she helped customers with enterprise data architecture solutions to scale their data analytics in the cloud. Currently, she helps customers unlock the power of data with QuickSight by delivering new features from the engineering and product side.

Ginni Malik is a Data & ML Engineer with AWS Professional Services. She assists customers by architecting enterprise level data lake solutions to scale their data analytics in the cloud. She is a travel enthusiast and likes to run half-marathons.

Niharika Katnapally is a QuickSight Business Intelligence Engineer with AWS Professional Services. She assists customers by developing QuickSight dashboards to help them gain insights into their data and make data driven business decisions.

BIOps: Amazon QuickSight object migration and version control

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/biops-amazon-quicksight-object-migration-and-version-control/

DevOps is a set of practices that combines software development and IT operations. It aims to shorten the systems development lifecycle and provide continuous delivery with high software quality. Similarly, BIOps (business intelligence and IT operations) can help your Amazon QuickSight admin team automate asset migration and version control. Your team can design the migration and version control strategy for your organization by following the pattern suggested in this post. You can use the suggested framework and sample scripts to reduce your daily workload.

In this post, we discuss the following:

  • The suggested automation process of QuickSight assets migration across accounts or Regions
  • The suggested workflow of dashboard version control in one QuickSight account
  • The advanced usage of Amazon QuickSight APIs

For migration across accounts, we provide two options and sample code packages:

  • Amazon SageMaker notebooks with migration Python scripts for data scientists or engineers. These Python scripts can do batch migration (migrate all assets of the source account to the target account) and on-demand incremental migration (migrate specific assets across accounts).
  • An application to perform migrations with a QuickSight embedded website as a UI. The backend of this application contains an Amazon API Gateway endpoint, several AWS Lambda functions, an Amazon Simple Queue Service (Amazon SQS) queue, and an Amazon Simple Storage Service (Amazon S3) bucket. This application is packaged in AWS Cloud Development Kit (AWS CDK) stacks and can be easily deployed into your environment.

Migrate across accounts and Regions automatically

Let’s assume that we have two QuickSight accounts: development and production. Both accounts are configured to connect to valid data sources. The following diagram illustrates our architecture.

The architecture contains the following workflow:

  1. The Python scripts (SageMaker notebooks or Lambda functions) call QuickSight APIs (list_data_sources) to get the data source list in the development account.
  2. The scripts call the QuickSight describe_data_source API to describe the data source. The response of the describe_data_source API is a JSON object. The scripts update the JSON object with production account information, for instance, Amazon Redshift credentials or cluster ID.
  3. The scripts create the data source in the production account and share the data source with the BI admin team.
  4. The scripts perform the same procedure to the datasets.
  5. The scripts create a template of the dashboard or analysis that the BI admin wants to migrate. (A template can only be created from an analysis or an existing template. To create a template of a dashboard, we have to create it from the dashboard’s underlying analysis; the version of the published dashboard might be behind the underlying analysis.)
  6. The scripts call the create_analysis or create_dashboard API in the production account to create the analysis or the dashboard from the remote template in the development account, and apply the theme.
  7. The scripts share the analysis or dashboard to some specific groups or users.
  8. The scripts log the success messages and error messages to Amazon CloudWatch Logs.
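
As an illustration of steps 1 through 3, the following is a minimal boto3 sketch; the account IDs and profile names are placeholders, and pagination and error handling are omitted:

import boto3

SOURCE_ACCOUNT_ID = "111111111111"  # placeholder development account
TARGET_ACCOUNT_ID = "222222222222"  # placeholder production account

qs_src = boto3.Session(profile_name="source").client("quicksight")
qs_tgt = boto3.Session(profile_name="target").client("quicksight")

# Step 1: list the data sources in the development account
for ds in qs_src.list_data_sources(AwsAccountId=SOURCE_ACCOUNT_ID)["DataSources"]:
    # Step 2: describe the data source; the response is a JSON object
    desc = qs_src.describe_data_source(
        AwsAccountId=SOURCE_ACCOUNT_ID, DataSourceId=ds["DataSourceId"]
    )["DataSource"]
    params = desc.get("DataSourceParameters")
    if not params:
        continue  # some data source types return no parameters
    # ...update params and credentials with production account information here...

    # Step 3: create the data source in the production account
    qs_tgt.create_data_source(
        AwsAccountId=TARGET_ACCOUNT_ID,
        DataSourceId=desc["DataSourceId"],
        Name=desc["Name"],
        Type=desc["Type"],
        DataSourceParameters=params,
    )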

For migration across Regions, the BI admin can follow the same procedure to migrate assets from the source Region to the target Region. Instead of changing the account ID in the ARN of assets, change the Region name of the ARN.

We provide sample Python scripts later in this post.

Dashboard version control in one account

Under some conditions, the BI team might want to perform version control of the dashboard development in one account. The following diagram illustrates our architecture.

The workflow includes the following steps:

  1. The BI developer creates an analysis and a template of this analysis. Let’s call the analysis and template version 1 assets.
  2. The BI developer publishes the analysis as a dashboard, and the QA team runs tests on this dashboard.
  3. After the QA test, the BI developer continues to develop the analysis to be version 2.
  4. The BI team publishes version 2 of the dashboard.
  5. The QA team tests version 2 of the dashboard again, and takes the following action based on the result:
    1. If the test is successful, the BI admin can update the template to be version 2.
    2. If the tests detect errors, the BI developer has to edit the analysis to fix the issues. However, some issues in the analysis may be unfixable. The BI admin can roll back the analysis or dashboard to version 1 with the backup template. QuickSight allows authors to roll back an analysis to a previous version using the undo button. If the undo history was reset (with the user’s confirmation) by an event like a dataset swap, or if the authors want to go back to a confirmed version 1 starting point, you can use the version 1 template in an update_analysis API call to reset the analysis to the version 1 state (see the sketch after these steps).
  6. The BI developer works on the version 1 analysis to repeat the development cycle.
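
A minimal sketch of the template-based rollback in steps 1 and 5 (the analysis, template, and dataset IDs are illustrative):

import boto3

qs = boto3.client("quicksight")
ACCOUNT_ID = "111111111111"  # placeholder account ID
DATASET_REFS = [{
    "DataSetPlaceholder": "sales",
    "DataSetArn": f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:dataset/sales-dataset",
}]

# Step 1: snapshot the analysis as a version 1 template
qs.create_template(
    AwsAccountId=ACCOUNT_ID,
    TemplateId="sales-template",
    VersionDescription="v1",
    SourceEntity={
        "SourceAnalysis": {
            "Arn": f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:analysis/sales-analysis",
            "DataSetReferences": DATASET_REFS,
        }
    },
)

# Step 5b: reset the analysis to the version 1 state from the backup template
qs.update_analysis(
    AwsAccountId=ACCOUNT_ID,
    AnalysisId="sales-analysis",
    Name="Sales Analysis",
    SourceEntity={
        "SourceTemplate": {
            "Arn": f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:template/sales-template",
            "DataSetReferences": DATASET_REFS,
        }
    },
)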

This workflow is the best practice we suggest to QuickSight users. You can modify the sample code packages we provide to automate this suggested process.

QuickSight API

For more information about the QuickSight API, see the QuickSight API reference and Boto3 QuickSight documentation.

Option 1: SageMaker notebooks of migration scripts

In this section, we present the first migration option for data scientists and engineers: using SageMaker notebooks with migration scripts.

Solution overview

We provide the sample Python scripts for migrating across accounts in three SageMaker notebooks:

  • functions – Provides all the functions, including describing objects, creating objects, and so on. Supporting functions perform the tasks that automate the whole process, for example, updating the data source connection information, getting the dashboard ID from a dashboard name, and writing logs.
  • batch migration – Provides the sample automation procedure to migrate all the assets from the source account to the target account.
  • incremental migration – Provides on-demand incremental migration to migrate specific assets across accounts.

The following diagram illustrates the functions of each notebook.

You can download the notebooks from the GitHub repo.

Prerequisites

For this solution, you should have the following prerequisites:

  • Access to the AWS services used in this solution (QuickSight, SageMaker, and AWS Identity and Access Management (IAM))
  • Two different QuickSight accounts, for instance, development and production
  • Basic knowledge of Python
  • Basic AWS SDK knowledge

Create resources

Create your resources in the source account by completing the following steps:

  1. Download the notebooks from the GitHub repository.
  2. Create a notebook instance.
  3. Edit the IAM role of this instance to add an inline policy called qs-admin-source:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole",
                "quicksight:*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Deny",
            "Action": [
                "quicksight:DeleteA*",
                "quicksight:DeleteC*",
                "quicksight:DeleteD*",
                "quicksight:DeleteG*",
                "quicksight:DeleteI*",
                "quicksight:DeleteN*",
                "quicksight:DeleteTh*",
                "quicksight:DeleteU*",
                "quicksight:DeleteV*",
                "quicksight:Unsubscribe"
            ],
            "Resource": "*"
        }
    ]
}
  4. On the notebook instance page, on the Actions menu, choose Open JupyterLab.
  5. Upload the three notebooks into the notebook instance.

Implement the solution

In this section, we walk you through the steps to implement the solution.

AssumeRole

To use AssumeRole, complete the following steps:

  1. Create an IAM role in the target (production) account that can be used by the source (development) account.
  2. On the IAM console, choose Roles in the navigation pane.
  3. Choose Create role.
  4. Choose the Another AWS account role type.
  5. For Account ID, enter the source (development) account ID.
  6. Create an IAM policy called qs-admin-target:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "quicksight:*",
                "sts:AssumeRole"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Deny",
            "Action": "quicksight:Unsubscribe",
            "Resource": "*"
        }
    ]
}
  7. Grant the IAM role the qs-admin-target IAM policy.
  8. Provide the qs-admin-source and qs-admin-target role names in the Assume Role cells of the notebooks.
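
Once the roles are in place, the notebooks can obtain a QuickSight client for the target account with AssumeRole; the following is a minimal sketch (the account ID is a placeholder):

import boto3

sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/qs-admin-target",  # placeholder target account
    RoleSessionName="quicksight-migration",
)
creds = resp["Credentials"]

qs_target = boto3.client(
    "quicksight",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)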

Static profile

To use a static profile, complete the following steps:

  1. Create the IAM user qs-admin-source with policy qs-admin-source in the source account.
  2. Create the IAM user qs-admin-target with policy qs-admin-target in the target account.
  3. Get the aws_access_key_id and aws_secret_access_key of these two IAM users.
  4. In the terminal of the SageMaker notebook, go to the directory /home/ec2-user/.aws.
  5. Edit the config and credentials files to add a profile named source with the aws_access_key_id and aws_secret_access_key of qs-admin-source.
  6. Edit the config and credentials files to add a profile named target with the aws_access_key_id and aws_secret_access_key of qs-admin-target.
  7. Provide the source and target profile name in the Static Profile cell of the notebook.

The tutorials for these notebooks are provided as comments inside the notebooks. You can run them cell by cell. If you want to schedule the notebooks to run automatically, you can schedule the Jupyter notebooks on SageMaker ephemeral instances.

In this solution, we assume that dashboard and dataset names are unique in the target (production) account. If you have multiple dashboards or datasets with the same name, you encounter an error during the migration. Every dashboard has its own business purpose, so we shouldn’t create multiple dashboards with the same name in the production environment and confuse the dashboard viewers.

Option 2: Dashboard as UI to enter the migration workflow

In this section, we present the second migration option, which uses a QuickSight embedded website as a UI.

Solution overview

The following diagram illustrates our solution architecture.

The following diagram illustrates the resources deployed in the central account to facilitate the migration process.

The resources include the following:

  • Dashboard as UI – The QuickSight dashboard is based on a ticketing backend, QuickSight assets information, and migration status data. You can use the bottom banner of the dashboard to trigger a migration of resources. Choosing Submit sends the migration request and required parameters (asset name, source environment, and target environment) to API Gateway. The dashboard also displays the migration results, which are stored in an S3 bucket.
  • S3 bucket – An S3 bucket hosts a static website to present you with a simple embedded dashboard that shows all active dashboards, analyses, datasets, data sources, and migration status.
  • API Gateway – API Gateway provides endpoints for embedding a QuickSight dashboard and accepting POST requests to perform migrations:
    • quicksight-embed – Embeds the migration status QuickSight dashboard. The API endpoint invokes the Lambda backend to generate a short-lived QuickSight embed URL, and presents the dashboard in an iFrame.
    • quicksight-migration-sqs – Presents a footer form that allows the user to submit migration details with POST requests, which invoke the QuickSight migration Lambda function.
  • SQS queue – We use an SQS queue between the QuickSight migration API endpoint and the backend Lambda function to perform the migration. Messages are deleted after a migration is complete.
  • Lambda functions – We use three different functions:
    • QuickSight migration – This function is invoked by the SQS queue, and it performs the necessary migration tasks depending on the parameters it receives. This function can perform both batch and incremental migration of QuickSight resources by querying the QuickSight service API, AWS Systems Manager Parameter Store, and AWS Secrets Manager.
    • QuickSight embed URL – When invoked, this function fetches an embed URL of a given dashboard and returns an HTTP payload to the caller.
    • QuickSight status – This function periodically queries the QuickSight API for details about dashboards, datasets, data sources, analyses, and themes, and uploads the results to Amazon S3. This S3 bucket is then used as a data source for a QuickSight dashboard to display a centralized view of all relevant resources.
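
For reference, the core of the embed URL function might look like the following sketch; it uses the get_dashboard_embed_url API with placeholder IDs, and the deployed function’s exact logic may differ:

import boto3

def handler(event, context):
    qs = boto3.client("quicksight")
    resp = qs.get_dashboard_embed_url(
        AwsAccountId="111111111111",  # placeholder central account ID
        DashboardId="migration-status-dashboard",  # hypothetical dashboard ID
        IdentityType="QUICKSIGHT",
        UserArn="arn:aws:quicksight:us-east-1:111111111111:user/default/quicksight-migration-user",
        SessionLifetimeInMinutes=60,
    )
    # The short-lived EmbedUrl is returned to the static site for the iFrame
    return {"statusCode": 200, "body": resp["EmbedUrl"]}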

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • Access to the AWS services used in this solution (QuickSight, Lambda, API Gateway, Amazon SQS, Amazon S3, and CloudFront)
  • Two different QuickSight accounts, such as development and production
  • Basic knowledge of Python
  • Basic AWS SDK knowledge
  • Git and npm installed
  • The AWS CDK installed (see AWS CDK Intro Workshop: Python Workshop)

Create resources

Create your resources by cloning the following AWS CDK stack from the GitHub repo:

git clone https://github.com/aws-samples/amazon-quicksight-sdk-proserve.git ~/amazon-quicksight-sdk-proserve

Implement the solution

The following diagram illustrates the services deployed to our central and target accounts.

Deploy to the central account

We use the following stacks to deploy resources to the central account:

  • QuicksightStatusStack – Deploys the Lambda functions and related resources to populate the S3 bucket with active QuickSight dashboard details
  • QuicksightMigrationStack – Deploys the Lambda function, SQS queue, S3 bucket, and the API Gateway endpoint for initiating migration of QuickSight resources
  • QuicksightEmbedStack – Deploys the API Gateway endpoint, CloudFront distribution, and Lambda functions to process the embed URL requests

The migration scripts require a QuickSight user to be created with the name quicksight-migration-user. This user is given permissions to the migrated resources in the destination. However, another QuickSight user or group can be used in place of quicksight-migration-user by replacing the following:

  • The parameter in ~/amazon-quicksight-sdk-proserve/Migration-scripts/cdk/lambda/quicksight_migration/quicksight_migration/lambda_function.py (line 66)
  • The QUICKSIGHT_USER_ARN variable in ~/amazon-quicksight-sdk-proserve/Migration-scripts/cdk/cdk/quicksight_embed_stack.py (line 81)

Creating VPC connections in QuickSight allows QuickSight to access your private data resources and enhances your security. Create this connection in the central account with the VPC connection name set to the VPC ID.

Set up your environment

Set up your environment with the following code:

cd ~/amazon-quicksight-sdk-proserve/Migration-scripts/cdk/
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Deploy QuickSight status and migration stacks

Deploy the QuickSight status and migration stacks with the following code:

export CDK_DEPLOY_ACCOUNT=CENTRAL_ACCOUNT_ID
export CDK_DEPLOY_REGION=CENTRAL_REGION
cdk bootstrap aws://CENTRAL_ACCOUNT_ID/CENTRAL_REGION
cdk deploy quicksight-status-stack quicksight-migration-stack

Note down the API Gateway endpoint from the output for a future step.

Create a dashboard

After the AWS CDK is deployed, run the Lambda function quicksight_status manually; two files, group_membership.csv and object_access.csv, are then created in the S3 bucket quicksight-dash-CENTRAL_ACCOUNT_ID. By default, this Lambda function is invoked hourly.

In the source account, you can run the following SQL query to create two Athena tables (group_membership and object_access):

CREATE EXTERNAL TABLE `group_membership`(
  `account_id` string,
  `namespace` string,
  `group` string,
  `user` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://quicksight-dash-Source_ACCOUNT_ID/monitoring/quicksight/group_membership/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'typeOfData'='file')

CREATE EXTERNAL TABLE `object_access`(
  `account_id` string,
  `aws_region` string,
  `object_type` string,
  `object_name` string,
  `object_id` string,
  `principal_type` string,
  `principal_name` string,
  `namespace` string,
  `permissions` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://quicksight-dash-Source_ACCOUNT_ID/monitoring/quicksight/object_access/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'typeOfData'='file')

You can create two SPICE datasets in QuickSight with the two new Athena tables, and then create a dashboard based on these two datasets. For more information, see Using administrative dashboards for a centralized view of Amazon QuickSight objects.

Deploy the QuickSight embed stack

Update DASHBOARD_ID in cdk/quicksight_embed_stack.py (line 80) with the dashboard ID that you just created with the two joined Athena tables.

Update basic authentication credentials in lambda/embed_auth/index.js (lines 9–10) with your preferred username and password.

Deploy the QuickSight embed stack with the following code:

cdk deploy quicksight-embed-stack

In the html/index.html file, update the following values to the output values from the QuicksightMigrationStack and QuicksightEmbedStack deployment steps, then upload it to the S3 bucket (quicksight-embed-CENTRAL_ACCOUNT_ID) created by this stack:

  • quicksight-embed-stack.EmbedAPIGatewayURL output value (line 85): apiGatewayUrl: '<quicksight-embed-stack.EmbedAPIGatewayURL>'
  • quicksight-migration-stack.MigrationAPIGatewayURL output value (line 38): const apiGatewayUrl = '<quicksight-migration-stack.MigrationAPIGatewayURL>';

The index.html file should be placed in the root of the S3 bucket with no prefix.

If index.html was updated after CloudFront was deployed and the distribution is displaying the wrong content, the CloudFront cache may need to be invalidated. Invalidate the URI /index.html in the central account CloudFront distribution.

The URL for the CloudFront distribution we created is outputted when the QuicksightEmbedStack stack is deployed. In QuickSight, add the output value for quicksight-embed-stack.EmbedCloudFrontURL to allow dashboard embedding, and select Include subdomains.

Deploy to the target account

We use the following stacks to deploy resources to the target account:

  • InfraTargetAccountStack – Deploys an IAM role that can be assumed by the migration Lambda role. This stack should also be deployed to any target accounts that contain QuickSight resources.
  • OptionalInfraTargetAccountStack – Deploys Amazon VPC, Amazon Redshift cluster, and Amazon Aurora cluster. This stack is optional and can be ignored if you have existing infrastructure for this proof of concept.

Deployment of target resources to the target account can either be done from the central account Amazon Elastic Compute Cloud (Amazon EC2) instance with the appropriate cross-account permissions or from an EC2 instance provisioned within the target account.

For this post, Amazon Redshift and Amazon Relational Database Service (Amazon RDS) clusters are required to perform the example migrations. Amazon Redshift and Amazon RDS aren’t necessary for migrating QuickSight resources that depend on Amazon S3 or Athena, for example. The optional-infra-target-account-stack stack deploys both Amazon Redshift and Amazon RDS clusters in the target account. Although deploying this stack isn’t necessary if you already have these resources provisioned in your target account, it does set up the environment correctly for the example migrations. To deploy, use the following command:

cdk deploy optional-infra-target-account-stack

Deploy the target account stack

Update line 16 in cdk/infra_target_account_stack.py by replacing the placeholder account ID in self.central_account_id = "123456789123" with your central account ID.

If OptionalInfraTargetAccountStack was deployed, update the /infra/config AWS Systems Manager parameter found in the cdk/infra_target_account_stack.py file (lines 67–77) with the values of the newly created Amazon Redshift or Amazon RDS clusters. All values are provided as AWS CloudFormation outputs.

However, if you already have clusters deployed in your environment, update the /infra/config Systems Manager parameter found in the cdk/infra_target_account_stack.py file (lines 67–77) with the values of your existing Amazon Redshift or Amazon RDS clusters. Set redshiftPassword and rdsPassword to the name of the secret found in Secrets Manager for these resources. These secrets for Amazon Redshift and Amazon RDS should include username and password values, as shown in the following screenshot.

The following are example values for the /infra/config parameter:

def to_dict(self):
    config = {}
    config['vpcId'] = 'vpc-0b13eb0989c8de79f'
    config['redshiftUsername'] = 'admin'
    config['redshiftPassword'] = 'redshift-secret-name-here'
    config['redshiftClusterId'] = 'test-redshift-cluster'
    config['redshiftHost'] = 'test-redshift-cluster-1fsoudi1bunb6.c6mqlnbwke57.us-east-1.redshift.amazonaws.com'
    config['redshiftDB'] = 'test'
    config['rdsUsername'] = 'admin'
    config['rdsPassword'] = 'rds-secret-name-here'
    config['rdsClusterId'] = 'test-rds-cluster-socet72ltstg'
    config['namespace'] = 'default'
    config['version'] = '1'
    return config

After the values have been updated in the cdk/infra_target_account_stack.py file, run the following shell commands:

export CDK_DEPLOY_ACCOUNT=TARGET_ACCOUNT_ID
export CDK_DEPLOY_REGION=TARGET_REGION
cdk bootstrap aws://TARGET_ACCOUNT_ID/TARGET_REGION
cdk deploy infra-target-account-stack

Creating an Amazon VPC connection in QuickSight allows QuickSight to access your private data resources, and enhances your security. Create this connection in the target account with the VPC connection name set to the VPC ID. This is required because the data sources created in the OptionalInfraTargetAccountStack stack are within a VPC.

Trigger a migration

The URL for the CloudFront distribution we created is outputted when the QuicksightEmbedStack stack is deployed. Navigate in your browser to the output value for quicksight-embed-stack.EmbedCloudFrontURL; the CloudFront distribution prompts you for basic authentication credentials, then redirects to the embedded QuickSight dashboard.

You can trigger a migration via the migration tool at the bottom of the dashboard.

As of this writing, this tool supports two types of migrations:

  • Batch – Attempts to migrate all resources, including themes, dashboards, analyses, datasets, and data sources
  • Incremental – Allows you to select a resource type (dashboard, analysis, or theme) and provide the names of those items in the Migration Items field

Choosing Submit sends a message to the SQS queue, which triggers the migration Lambda function on the backend. The embedded dashboard should reflect the status of the dashboard migration after it performs its periodic refresh.

Clean up

Finally, to clean up the resources created in this post, perform the following cleanup steps, depending on the solution option you used.

Option 1 cleanup

If you implemented the Option 1 solution, complete the following steps:

  1. Delete the SageMaker notebooks running the migration scripts.
  2. Delete the IAM role attached to the SageMaker notebooks.

Option 2 cleanup

If you implemented the Option 2 solution, complete the following steps:

  1. Destroy the resources created by the AWS CDK:
cdk destroy quicksight-status-stack quicksight-migration-stack quicksight-embed-stack
  2. Destroy the resources created by the AWS CDK in the target accounts:
cdk destroy infra-target-account-stack optional-infra-target-account-stack
  3. Manually delete the S3 buckets created in both central and target accounts.

Things to consider

This solution helps you with QuickSight object migration and version control. Here are a few limitations to consider:

  1. If the QuickSight analyses or dashboards use datasets that have been deleted, consider removing such datasets from them before migration.
  2. If there are duplicate object names, rename them so that each name is unique.
  3. If there are file-based data sources, consider converting them to S3-based data sources.

Pricing

For the pricing details of the services used in this post, see their respective pricing pages.

Conclusion

Programmatic object migration and version control are in high demand in the BI community. This post provides best practices and a practical code package to address QuickSight object migration and version control. This solution can readily fit into a ticketing system or CI/CD pipelines.

If you have any feedback or questions, please leave them in the comments section. You can also start a new thread on the Amazon QuickSight forum.


About the Authors

Ying Wang is a Senior Data Visualization Architect with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Vamsi Bhadriraju is a Data Architect at AWS. He works closely with enterprise customers to build data lakes and analytical applications on the AWS Cloud.

Samruth Reddy is a DevOps Consultant in the AWS ProServe Global Delivery team working on automation tooling, security and infrastructure implementations, and promoting DevOps methodologies and practices to his customers.

Build a centralized granular access control to manage assets and data access in Amazon QuickSight

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/build-a-centralized-granular-access-control-to-manage-assets-and-data-access-in-amazon-quicksight/

A large business intelligence (BI) project with many users and teams and sensitive information demands a multi-faceted security architecture. Such architecture should provide BI administrators and architects with the capability to minimize the amount of information accessible to users. For a straightforward solution to manage Amazon QuickSight user and asset access permissions, you can use the AWS Command Line Interface (AWS CLI) or AWS Management Console to manually edit QuickSight user role and dashboard access. However, in specific cases, an enterprise can easily have hundreds or thousands of users and groups, and these access management methods aren’t efficient. We have received a large number of requests to provide an advanced programmable approach to deploy and manage a centralized QuickSight security architecture.

This post describes the best practices for QuickSight authentication and authorization granular access control, and provides a centralized cloud application with an AWS Cloud Development Kit (AWS CDK) stack to download. One of the advantages of our solution is that enterprises can deploy the security framework to administer access control of their BI without leaving AWS.

All configurations are saved in the AWS Systems Manager Parameter Store. Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. You can store data such as user name, user permissions, passwords, and database strings as parameter values. You can reference AWS Systems Manager parameters in your scripts and configuration and automation workflows by using the unique name that you specified when you created the parameter.
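
For example, the automation scripts can read and update a configuration parameter with the AWS SDK. The following is a minimal sketch, assuming the parameter holds JSON, as the /qs/config/access group-dashboard mapping used later in this post does:

import boto3
import json

ssm = boto3.client("ssm")

# Read the group-dashboard mapping
response = ssm.get_parameter(Name="/qs/config/access")
access_config = json.loads(response["Parameter"]["Value"])

# BI developers edit the mapping to release a new dashboard to a group
ssm.put_parameter(
    Name="/qs/config/access",
    Type="String",
    Overwrite=True,
    Value=json.dumps(access_config),
)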

The AWS CDK application template fits into the continuous integration and continuous deployment (CI/CD) infrastructure and grants or revokes all authentications and authorizations based on a defined policy prescribed by AWS. This avoids possible human errors made by BI developers or administrators. BI developers can edit configuration parameters to release new dashboards to end-users. At the same time, BI administrators can edit another set of parameters to manage users or groups. This AWS CDK CI/CD design bridges the gaps between development and operation activities by enforcing automation in building and deploying BI applications.

Security requirements

In enterprise BI application design, multi-tenancy is a common use case, in which one infrastructure serves multiple sets of users. Tenants could either be different customers of an independent software vendor (ISV), or different departments of an enterprise. In a multi-tenancy design, each tenant shares the dashboards, analyses, and other QuickSight assets. Each user can see all other users belonging to the same tenant (for example, when sharing content) but remains invisible to other tenants. Within each tenant, the BI admin team has to create different user groups to control data authorization, including asset access permissions and granular-level data access.

Let’s discuss some use cases of asset access permissions in detail. In a BI application, different assets are usually categorized according to business domains (such as an operational dashboard or executive summary dashboard) and data classification (critical, highly confidential, internal only, and public). For example, you can have two dashboards for analyzing sales results data. The look and feel of both dashboards are similar, but the security classification of the data is different. One dashboard, named Sales Critical Dashboard, contains critical columns and rows of data. The other dashboard, called Sales Highly-Confidential Dashboard, contains highly confidential columns and rows of data. Some users are granted permission to view both dashboards, and others have lower security level permission and can only access Sales Highly-Confidential Dashboard.

In the following use case, we address granular-level data access as follows:

  • Row-level access (RLS) – For the users who can access Sales Critical Dashboard, some of them can only view US data. However, some global users can view the data of all countries, including the US and UK.
  • Column-level access (CLS) – Some users can only view non-personally identifiable information (PII) data columns of a dataset, whereas the HR team can view all the columns of the same dataset.
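
For example, row-level access in QuickSight is driven by a permissions rules dataset attached to the data. A minimal sketch of such a rules file (the group names are illustrative; in QuickSight RLS, an empty cell grants access to all values of that column):

GroupName,Country
critical-us,US
critical-global,

Here, members of critical-us only see US rows, whereas members of critical-global see rows for all countries.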

Large projects might have several tenants, hundreds of groups, and thousands of users in one QuickSight account. The data leader team wants to deploy one protocol for user creation and authentication in order to reduce the maintenance cost and security risk. The architecture and workflow described in this post help the data leader achieve this goal.

Additionally, to avoid human errors in daily operation, we want these security permissions to be granted and revoked automatically, and fit into the CI/CD infrastructure. The details are explained later in this post.

Architecture overview

The following diagram shows the QuickSight account architecture of this solution.

  • Authors create dashboards and update the AWS Systems Manager Parameter Store to release dashboards to different groups
  • Admins approve the requests from authors
  • Admins update user management (roles, namespaces) by editing the AWS Systems Manager Parameter Store
  • DevOps deploys the updates with the AWS CDK

*Groups: Object access permission groups control the owner/viewer of the objects. Data segment groups combined with RLS/CLS control data access.

*Datasets: Contain all data, restricted by row-level security (RLS) and column-level security (CLS)

The following diagram illustrates the authentication workflow of the architecture:

*First-time sign-in to QuickSight: If the QuickSight user isn’t registered before their first sign-in, a reader is created, and this reader can only view the landing page dashboard, which is shared with all users of this account. The landing page provides the list of reports that this user can view.

The following diagram illustrates the authorization workflow of the architecture.

Authorization diagram details:

  1. User information (department, team, geographic location) is stored in Amazon Redshift, Amazon Athena, or any other database. Combined with the group-user mapping, RLS datasets are built to control data access.
  2. Hourly permissions assignment (a minimal code sketch follows this list):
    1. According to the group-employee name (user) mapping (membership.csv) and group-role mapping (/qs/console/roles), an AWS Lambda function creates groups, registers users, assigns group memberships, removes group memberships, promotes readers to author or admin, and deletes users if they’re demoted from author or admin to reader.
    2. According to the group-dashboard mapping in /qs/config/access, an AWS Lambda function updates dashboard permissions for QuickSight groups.
    3. According to the group-namespace mapping in membership.csv, an AWS Lambda function creates QuickSight groups in the specified namespace.
  3. Sample parameters of object access permissions and data segments:
  4. Sample parameters of the QuickSight user role:
  5. Sample data of membership.csv:
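To make the hourly assignment concrete, the following is a minimal Python (boto3) sketch of the membership-sync step. The bucket name, CSV column names, and account ID are assumptions for illustration; the actual implementation lives in the solution’s GitHub repo.

import csv

import boto3

ACCOUNT_ID = "123456789012"  # hypothetical account ID
qs = boto3.client("quicksight")
s3 = boto3.client("s3")

def sync_group_memberships(bucket="my-qs-config-bucket"):  # hypothetical bucket
    # Read the group-user mapping exported from LDAP (membership.csv).
    body = s3.get_object(Bucket=bucket, Key="membership.csv")["Body"]
    reader = csv.DictReader(line.decode("utf-8") for line in body.iter_lines())
    for row in reader:
        namespace, group, user = row["namespace"], row["group"], row["user"]
        try:
            # Idempotently create the group in its namespace.
            qs.create_group(AwsAccountId=ACCOUNT_ID, Namespace=namespace,
                            GroupName=group)
        except qs.exceptions.ResourceExistsException:
            pass
        # Add the user to the group; re-adding an existing member is
        # harmless for an hourly reconciliation job.
        qs.create_group_membership(
            AwsAccountId=ACCOUNT_ID,
            Namespace=namespace,
            GroupName=group,
            MemberName=user,
        )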

In this solution, custom namespaces are deployed to support multi-tenancy. The default namespace is for all internal users of a company (we call it OkTank). OkTank creates the 3rd-Party namespace for external users. If we have to support more tenants, we can create more custom namespaces. By default, we’re limited to 100 namespaces per AWS account. To increase this limit, contact the QuickSight product team. For more information about multi-tenancy, see Embed multi-tenant analytics in applications with Amazon QuickSight.
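As a sketch, creating such a custom namespace takes a single API call; the account ID and namespace name below are illustrative:

import boto3

qs = boto3.client("quicksight")

# Create an isolated namespace for an external tenant.
# QUICKSIGHT is currently the only supported identity store.
qs.create_namespace(
    AwsAccountId="123456789012",  # hypothetical account ID
    Namespace="3rd-Party",
    IdentityStore="QUICKSIGHT",
)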

In each namespace, we create different types of groups. For example, in the default namespace, we create the BI-Admin and BI-Developer groups for the admin and author users. For readers, we deploy two types of QuickSight groups to control asset access permissions and data access: object access permission groups and data segment groups.

The following list summarizes how the object access permission groups control permissions:

  • critical (default namespace) – View both dashboards (containing the critical data and highly confidential data).
  • highlyconfidential (default namespace) – Only view Sales Highly-Confidential Dashboard.
  • BI-Admin (default namespace) – Account management and edit all assets. Users in the BI-Admin group are assigned the Admin QuickSight user role.
  • BI-Developer (default namespace) – Edit all assets. Users in the BI-Developer group are assigned the Author QuickSight user role.
  • Power-reader (default namespace) – View all assets and create ad hoc analyses to run self-service analytics reports. Users in the Power-reader group are assigned the Author QuickSight user role; however, this group can’t save or share their ad hoc reports.
  • 3rd-party (non-default namespaces, such as the 3rd-party namespace) – Can only share with readers in the same namespace (for example, the 3rd-party-reader group). In non-default namespaces, we can also create other object access permission groups, similar to the critical group in the default namespace.

For more information about QuickSight groups, users, and user roles, see Managing User Access Inside Amazon QuickSight, Provisioning Users for Amazon QuickSight, and Using administrative dashboards for a centralized view of Amazon QuickSight objects.

The second type of group (data segment groups), combined with row-level security datasets and column-level security, controls data access as summarized in the following list:

  • USA (default namespace) – Only view US data on any dashboard (row-level).
  • GBR (default namespace) – Only view UK data on any dashboard (row-level).
  • All countries (default namespace) – View data of all countries on any dashboard (row-level).
  • non-PII (default namespace) – Can’t view Social Security numbers, annual income, and all other columns of PII data (column-level).
  • PII (default namespace) – Can view all columns, including PII data (column-level).

We can set up similar groups in non-default namespaces.

These different groups can overlap each other. For example, if a user belongs to the groups USA, Critical, and PII, they can view US data on both dashboards, with all columns. The following Venn diagram illustrates the relationships between these groups.

In summary, we can define a multi-faceted security architecture by combining QuickSight features, including namespaces, groups, users, RLS, and CLS. All related configurations are saved in the Parameter Store. The QuickSight user list and group-user mapping information are stored in an Amazon Simple Storage Service (Amazon S3) bucket as a CSV file (named membership.csv). This CSV file could be the output of LDAP queries. Several AWS Lambda functions are scheduled to run hourly (you can also invoke these functions on demand, or on a daily, weekly, or any other schedule that fits your requirements) to read the parameters and membership.csv. According to the configuration defined, the Lambda functions create, update, or delete groups, users, and asset access permissions.

When the necessary security configurations are complete, a Lambda function calls the QuickSight APIs to get the updated information and record the results in an S3 bucket as CSV files. The BI admin team can build datasets with these files and visualize the results with dashboards. For more information, see Using administrative dashboards for a centralized view of Amazon QuickSight objects and Building an administrative console in Amazon QuickSight to analyze usage metrics.

In addition, the errors of Lambda functions and the user deletion events are stored in this S3 bucket for the admin team to review.

Automation

The following diagram illustrates the overall workflow of the Lambda functions.

We use a programmable method to create and configure the groups and users automatically. For any ad hoc user registration request (for example, a user who isn’t recorded in membership.csv yet due to latency), as long as the user can be authenticated, they can assume the AWS Identity and Access Management (IAM) role quicksight-fed-user to self-provision as a QuickSight reader. This self-provisioned reader can only view a landing page dashboard, which provides the list of dashboards and corresponding groups. According to the dashboard-group mapping, the new reader can apply for membership in a given group to access its dashboards. If the group owner approves the application, the hourly Lambda functions add the new user to the group the next time they run.

The CI/CD pipeline starts from AWS CDK. The BI administrator and author can update the Systems Manager parameters to release new dashboards or other QuickSight assets in the AWS CDK stack granular_access_stack.py. The BI administrator can update the Systems Manager parameters in the same stack to create, update, or delete namespaces, groups, or users. Then the DevOps team can deploy the updated AWS CDK stack to apply these changes to the Systems Manager parameters or other AWS resources. The Lambda functions are triggered hourly to call APIs to apply changes to the related QuickSight account.

Scale

The Lambda functions are restricted by the maximum runtime of 15 minutes. To overcome this limitation, we can convert the Lambda functions to AWS Glue Python shell scripts with the following high-level steps:

  1. Download Boto3 wheel files from pypi.org.
  2. Upload the wheel file into an S3 bucket.
  3. Download the Lambda functions, merge them into one Python script, and create an AWS Glue Python shell script from it.
  4. Add the S3 path of the Boto3 wheel file into the Python library path. If you have multiple files to add, separate them with a comma.
  5. Schedule this AWS Glue job to run daily.

For more information, see Program AWS Glue ETL Scripts in Python and Using Python Libraries with AWS Glue.
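The following hedged sketch shows what creating such a Python shell job with boto3 could look like; the job name, role ARN, script location, and wheel path are all assumptions:

import boto3

glue = boto3.client("glue")

# Create a Python shell job that runs the merged script.
# --extra-py-files adds the Boto3 wheel to the Python library path;
# separate multiple S3 paths with a comma.
glue.create_job(
    Name="qs-granular-access",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical role
    Command={
        "Name": "pythonshell",
        "PythonVersion": "3.9",
        "ScriptLocation": "s3://my-bucket/scripts/granular_access.py",
    },
    DefaultArguments={
        "--extra-py-files": "s3://my-bucket/wheels/boto3-1.26.0-py3-none-any.whl"
    },
    MaxCapacity=0.0625,  # smallest DPU allocation for a Python shell job
)

You can then attach a daily schedule to the job with a Glue trigger or an EventBridge rule.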

Prerequisites

You must have the following prerequisites to implement this solution:

  • A QuickSight Enterprise account
  • Basic knowledge of Python
  • Basic knowledge of SQL
  • Basic knowledge of BI

Create the resources

Create your resources by downloading the AWS CDK stack from the GitHub repo.

In the granular_access folder, run the command cdk deploy granular-access to deploy the resources. For more information, see AWS CDK Intro Workshop: Python Workshop.

Deploy the solution

When you deploy the AWS CDK stack, it creates five Lambda functions, as shown in the following screenshot.

The stack also creates additional supportive resources in your account.

The granular_user_governance function is triggered by the Amazon CloudWatch event rule qs-gc-everyhour. The information of groups and users is defined in the file membership.csv. The S3 bucket name is stored in the parameter store /qs/config/groups. The following diagram shows the flowchart of this function.

  1. Set the destination of granular_user_governance to another Lambda function, downgrade_user, with source=Asynchronous invocation and condition=On Success.

The following diagram is a flowchart of this function.

To avoid breaking critical access to QuickSight assets governed by an admin or author, we demote an admin or author by deleting that user and creating a new reader user with the Lambda function downgrade_user; a minimal sketch of this logic follows. The granular_user_governance function handles downgrading admin to author, or upgrading author to admin.
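The following sketch illustrates that delete-and-re-register logic; the account ID, identity type, and federation role ARN are assumptions:

import boto3

ACCOUNT_ID = "123456789012"  # hypothetical account ID
qs = boto3.client("quicksight")

def demote_to_reader(user_name, email, namespace="default"):
    # Delete the existing author/admin user, then register the same
    # identity again as a reader.
    qs.delete_user(AwsAccountId=ACCOUNT_ID, Namespace=namespace,
                   UserName=user_name)
    qs.register_user(
        AwsAccountId=ACCOUNT_ID,
        Namespace=namespace,
        IdentityType="IAM",
        IamArn="arn:aws:iam::123456789012:role/quicksight-fed-user",  # hypothetical
        SessionName=email,  # must match the federated session name
        Email=email,
        UserRole="READER",
    )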

  2. Set the destination of downgrade_user to the Lambda function granular_access_assets_govenance with source=Asynchronous invocation and condition=On Success.

The following diagram shows a flowchart of this function.

  3. Set the destination of downgrade_user to the Lambda function check_team_members with source=Asynchronous invocation and condition=On Failure.

The check_team_members function simply calls QuickSight APIs to get the namespace, group, user, and asset information, and saves the results in the S3 bucket. The S3 keys are monitoring/quicksight/group_membership/group_membership.csv and monitoring/quicksight/object_access/object_access.csv.

Besides the two output files of the previous step, the error logs and user deletion logs (logs of downgrade_user) are also saved in the monitoring/quicksight folder.
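As an illustration of how such recording logic can work, the following sketch flattens namespaces, groups, and members into the group_membership.csv file (pagination is omitted for brevity; the bucket name and account ID are assumptions):

import csv
import io

import boto3

ACCOUNT_ID = "123456789012"  # hypothetical account ID
qs = boto3.client("quicksight")
s3 = boto3.resource("s3")

def record_group_membership(bucket="my-monitoring-bucket"):  # hypothetical
    # Flatten namespace -> group -> member into CSV rows.
    rows = [("namespace", "group", "user")]
    for ns in qs.list_namespaces(AwsAccountId=ACCOUNT_ID)["Namespaces"]:
        groups = qs.list_groups(
            AwsAccountId=ACCOUNT_ID, Namespace=ns["Name"])["GroupList"]
        for g in groups:
            members = qs.list_group_memberships(
                AwsAccountId=ACCOUNT_ID,
                Namespace=ns["Name"],
                GroupName=g["GroupName"],
            )["GroupMemberList"]
            rows.extend(
                (ns["Name"], g["GroupName"], m["MemberName"]) for m in members)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    s3.Object(
        bucket,
        "monitoring/quicksight/group_membership/group_membership.csv",
    ).put(Body=buf.getvalue())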

  4. Set the destination of granular_access_assets_govenance to the Lambda function check_team_members with source=Asynchronous invocation and condition=On Success or condition=On Failure.

Create row-level security datasets

As a final step, we create RLS datasets. These allow you to restrict the dashboard records shown based on the user who views the dashboard.

QuickSight supports RLS by applying a system-managed dataset that sub-selects records from the dashboard dataset. The mechanism allows the administrator to provide a filtering dataset (the RLS dataset) with username or groupname columns, which are automatically filtered to the user that is logged in. For example, a user named YingWang belongs to the QuickSight group BI, so all the rows of the RLS dataset that correspond to the username YingWang or group name BI are retained. The rows that remain in the RLS dataset after applying the username and group name filters are then used to filter the dashboard datasets further by matching columns with the same names. For more information about row-level security, see Using Row-Level Security (RLS) to Restrict Access to a Dataset.

In this solution, we export the sample user information into the file membership.csv, which is stored in an S3 bucket. In this file, we provide some sample groups for RLS dataset definition. These groups are the data segment groups, as described in the overall architecture design. The following screenshot shows some of the groups and the users in those groups.

The granular_user_governance function creates these groups and adds the related users to be members of these groups.

How do we create the RLS dataset? Let’s say we have a table called employee_information in our organization’s HR database. The following screenshot shows some sample data.

Based on the employee_information table, we create a view called rls for an RLS dataset. See the following SQL code:

create view
rls(groupname, username, country, city)
as
(SELECT 
concat('quicksight-fed-'::text, lower(employee_information.country::text)) AS groupname,
concat(concat('quicksight-fed-us-users/'::text, employee_information.employee_login::text),'@oktank.com'::text) AS username,
employee_information.country,
employee_information.city
FROM 
employee_information)

The following screenshot shows our sample data.

Now that we have the table ready, we can create the RLS dataset with the following custom SQL:

select distinct 
r.groupname as GroupName,
null as UserName,
r.country,
null as city 
from 
rls as r 
join fact_revenue as f 
on r.country=f.country
union
select distinct 'quicksight-fed-all-countries' as GroupName,
null as UserName,
null as country,
null as city
from rls as r
union
select distinct null as GroupName,
r.username as UserName,
r.country,
r.city 
from 
rls as r
join fact_revenue as f 
on r.country=f.country 
and 
r.city=f.city

The following screenshot shows our sample data.

For the group quicksight-fed-all-countries, we set the username, country, and city as null, which means that all the users in this group can view the data of all countries.

For country level, only the security rules defined in the groupname and country columns are used for filtering. The username and city columns are set as null. The users in the quicksight-fed-usa group can view the data of USA, and the users in the quicksight-fed-gbr group can view the data of GBR.

For each user with groupname set as null, they can only view the specific country and city assigned to their username. For example, TerryRigaud can only view data of Austin, in the US.

In QuickSight, multiple rules in an RLS dataset are combined together with OR.

With these multi-faceted RLS rules, we can define a comprehensive data access pattern.

Clean up

To avoid incurring future charges, delete the resources you created by running the following command:

cdk destroy granular-access

Conclusion

This post discussed how BI administrators can design and automate QuickSight authentication and authorization granular access control. We combined QuickSight security features like row-level and column-level security, groups, and namespaces to provide a comprehensive solution. Managing these changes through “BIOps” ensures a robust, scalable mechanism for managing QuickSight security. To learn more, sign up for a QuickSight demo.


About the Authors

Ying Wang is a Senior Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.


Amir Bar Or is a Principal Data Architect at AWS Professional Services. After 20 years leading software organizations and developing data analytics platforms and products, he is now sharing his experience with large enterprise customers and helping them scale their data analytics in the cloud.

Building an administrative console in Amazon QuickSight to analyze usage metrics

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/building-an-administrative-console-in-amazon-quicksight-to-analyze-usage-metrics/

Given the scalability of Amazon QuickSight to hundreds and thousands of users, a common use case is to monitor QuickSight group and user activities, analyze the utilization of dashboards, and identify usage patterns of an individual user and dashboard. With timely access to interactive usage metrics, business intelligence (BI) administrators and data team leads can efficiently plan for stakeholder engagement and dashboard improvements. For example, you can remove inactive authors to reduce license cost, as well as analyze dashboard popularity to understand user acceptance and stickiness.

This post demonstrates how to build an administrative console dashboard and serverless data pipeline. We combine QuickSight APIs with AWS CloudTrail logs to create the datasets to collect comprehensive information of user behavior and QuickSight asset usage patterns.

This post provides a detailed workflow that covers the data pipeline, sample Python code, and a sample dashboard of this administrative console. With the guidance of this post, you can configure this administrative console in your own environment.

Let’s look at Forwood Safety, an innovative, values-driven company with a laser focus on fatality prevention. An early adopter of QuickSight, they have collaborated with AWS to deploy this solution to collect BI application usage insights.

“Our engineers love this admin console solution,” says Faye Crompton, Leader of Analytics and Benchmarking at Forwood. “It helps us to understand how users analyze critical control learnings by helping us to quickly identify the most frequently visited dashboards in Forwood’s self-service analytics and reporting tool, FAST.”

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. The AWS Lambda function Data_Prepare is scheduled to run hourly. This function calls QuickSight APIs to get QuickSight namespace, group, user, and assets access permissions information and saves the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. CloudTrail logs are stored in an S3 bucket.
  3. Based on the file in Amazon S3 that contains the user-group information and the QuickSight asset access permissions, as well as the view-dashboard and user login events in the CloudTrail logs, three Amazon Athena tables and several views are created. Optionally, the BI engineer can combine these tables with employee information tables to display human resources information for the users.
  4. Two QuickSight datasets fetch the data in the Athena tables created in Step 3 through SPICE mode. Then, based on these datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Access to the following AWS services:
    • Amazon QuickSight
    • Amazon Athena
    • AWS Lambda
    • Amazon S3
  • Basic knowledge of Python
  • Optionally, Security Assertion Markup Language 2.0 (SAML 2.0) or OpenID Connect (OIDC) single sign-on (SSO) configured for QuickSight access

Creating resources

Create your resources by launching the following AWS CloudFormation stack:

After the stack creation is successful, you have one Amazon CloudWatch Events rule, one Lambda function, one S3 bucket, and the corresponding AWS Identity and Access Management (IAM) policies.

To create the resources in a Region other than us-east-1, download the Lambda function.

Creating Athena tables

The Data_Prepare Lambda function is scheduled to run hourly with the CloudWatch Events rule admin-console-every-hour. This function calls the QuickSight APIs list_namespaces, list_users, list_user_groups, list_dashboards, list_datasets, list_datasources, list_analyses, list_themes, describe_data_set_permissions, describe_dashboard_permissions, describe_data_source_permissions, describe_analysis_permissions, and describe_theme_permissions to get QuickSight users and assets access permissions information. Finally, this function creates two files, group_membership.csv and object_access.csv, and saves these files to an S3 bucket.
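As an illustrative fragment of that logic, the following sketch expands dashboard permissions into rows shaped like the object_access table; the account ID is a placeholder, and the region and namespace values are simplified assumptions:

import boto3

ACCOUNT_ID = "123456789012"  # hypothetical account ID
qs = boto3.client("quicksight")

def dashboard_permission_rows(region="us-east-1"):
    # Page through every dashboard and expand its permissions into
    # (region, type, name, id, principal_type, principal, namespace, actions).
    rows = []
    for page in qs.get_paginator("list_dashboards").paginate(AwsAccountId=ACCOUNT_ID):
        for d in page["DashboardSummaryList"]:
            perms = qs.describe_dashboard_permissions(
                AwsAccountId=ACCOUNT_ID, DashboardId=d["DashboardId"])["Permissions"]
            for p in perms:
                principal_type = "group" if ":group/" in p["Principal"] else "user"
                rows.append((
                    region,
                    "dashboard",
                    d["Name"],
                    d["DashboardId"],
                    principal_type,
                    p["Principal"].split("/")[-1],
                    "default",  # namespace, simplified for this sketch
                    ";".join(p["Actions"]),
                ))
    return rows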

Run the following SQL statements to create two Athena tables (group_membership and object_access):

CREATE EXTERNAL TABLE `group_membership`(
`namespace` string,
`group` string,
`user` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/group_membership/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'typeOfData'='file');

CREATE EXTERNAL TABLE `object_access`(
`aws_region` string,
`object_type` string,
`object_name` string,
`object_id` string,
`principal_type` string,
`principal_name` string,
`namespace` string,
`permissions` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/object_access/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'typeOfData'='file');

The following screenshot is sample data of the group_membership table.

The following screenshot is sample data of the object_access table.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

Creating views in Athena

Now we have the tables ready in Athena and can run SQL queries against them to generate some views to analyze the usage metrics of dashboards and users.

Create a view of a user’s role status with the following code:

CREATE OR REPLACE VIEW users AS
(SELECT "namespace",
"group",
"user",
(CASE
WHEN "group" IN ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin')
THEN 'Author'
ELSE 'Reader'
END)
AS author_status
FROM "group_membership");

Create a view of GetDashboard events that happened in the last 3 months with the following code:

CREATE OR REPLACE VIEW getdashboard AS
(SELECT
"useridentity"."type",
"split_part"("useridentity"."sessioncontext"."sessionissuer"."arn", '/', 2) AS "assumed_role",
COALESCE("useridentity"."username", "concat"("split_part"("useridentity"."arn", '/', 2), '/', "split_part"("useridentity"."arn", '/', 3))) AS "user_name",
awsregion,
"split_part"("split_part"("serviceeventdetails", 'dashboardName":', 2), ',', 1) AS dashboard_name,
"split_part"("split_part"("split_part"("split_part"("serviceeventdetails", 'dashboardId":', 2), ',', 1), 'dashboard/', 2), '"}', 1) AS dashboardId,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time,
max(date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) AS latest_event_time
FROM cloudtrail_logs
WHERE
eventsource = 'quicksight.amazonaws.com'
AND
eventname = 'GetDashboard'
AND
DATE_TRUNC('day', date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) > cast(current_date - interval '3' month AS date)
GROUP BY 1,2,3,4,5,6,7)

In the preceding query, the conditions defined in the where clause only fetch the records of GetDashboard events of QuickSight.

How can we design queries to fetch records of other events? We can review the CloudTrail logs to look for the information. For example, let’s look at the sample GetDashboard CloudTrail event:

{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "<principal_id>: <user_name>",
        "arn": "arn:aws:sts:: <aws_account_id>:assumed-role/<IAM_role_ name>/<user_name>",
        "accountId": "<aws_account_id>",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "<principal_id>",
                …
            }
        }
    },
    "eventTime": "2021-01-13T16:55:36Z",
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "GetDashboard",
    "awsRegion": "us-east-1",
    "eventID": "a599c8be-003f-46b7-a40f-2319efb6b87a",
    "readOnly": true,
    "eventType": "AwsServiceEvent",
    "serviceEventDetails": {
        "eventRequestDetails": {
            "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>"
        },
        "eventResponseDetails": {
            "dashboardDetails": {
                "dashboardName": "Admin Console",
                "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>",
                "analysisIdList": [
                    "arn:aws:quicksight:us-east-1: <aws_account_id>:analysis/<analysis_id>"
            }
        }
    }
}

With eventSource="quicksight.amazonaws.com" and eventName="GetDashboard", we can get all the view QuickSight dashboard events.

Similarly, we can define the condition as eventname = 'AssumeRoleWithSAML' to fetch the user login events. (This solution assumes that the users log in to their QuickSight account with identity federation through SAML.) For more information about querying CloudTrail logs to monitor other interesting user behaviors, see Using administrative dashboards for a centralized view of Amazon QuickSight objects.

Furthermore, we can join with employee information tables to get a QuickSight user’s human resources information.

Finally, we can generate a view called admin_console with QuickSight group and user information, assets information, CloudTrail logs, and, optionally, employee information. The following screenshot shows an example preview.

Creating datasets

With the Athena views ready, we can build some QuickSight datasets. We can load the view called admin_console to build a SPICE dataset called admin_console and schedule this dataset to be refreshed hourly. Optionally, you can create a similar dataset called admin_console_login_events with the Athena table, based on eventname = 'AssumeRoleWithSAML', to analyze QuickSight user login events. According to the usage metrics requirements in your organization, you can create other datasets to serve different requests.
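If you’d rather trigger the SPICE refresh programmatically than rely on the console schedule, a scheduled function can start an ingestion. The following is a minimal sketch; the dataset ID is a placeholder:

import time

import boto3

ACCOUNT_ID = "123456789012"  # hypothetical account ID
qs = boto3.client("quicksight")

def refresh_admin_console():
    # Start a SPICE ingestion for the admin_console dataset.
    # Each ingestion needs a unique IngestionId.
    qs.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId="admin-console-dataset-id",  # hypothetical dataset ID
        IngestionId=f"refresh-{int(time.time())}",
    )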

Building dashboards

Now we can build a QuickSight dashboard as the administrative console to analyze usage metrics. The following steps are based on the dataset admin_console. The schema of the optional dataset admin_console_login_events is the same as admin_console. You can apply the same logic to create the calculated fields to analyze user login activities.

  1. Create parameters.

For example, we can create a parameter called InActivityMonths, as in the following screenshot. Similarly, we can create other parameters such as InActivityDays, Start Date, and End Date.

  2. Create controls based on the parameters.

  3. Create calculated fields.

For instance, we can create a calculated field to detect the active or inactive status of QuickSight authors. If the time span between the latest view-dashboard activity and now is greater than or equal to the number defined in the InActivityMonths control, the author status is Inactive. The following screenshot shows the relevant code.

According to end-user requirements, we can define several calculated fields to perform the analysis.

  4. Create visuals.

For example, we create an insight to display the top three dashboards viewed by readers and a visual to display the authors of these dashboards.

  5. Add URL actions to define extra features, such as emailing inactive authors or checking user details.

The following sample code defines the action to email inactive authors:

mailto:<<email>>?subject=Alert to inactive author! &body=Hi, <<username>>, any author without activity for more than a month will be deleted. Please log in to your QuickSight account to continue accessing and building analyses and dashboards!

The following screenshots show an example dashboard that you can make using our data.

The following is the administrative console landing page. This page provides an overview, a terminology explanation, and thumbnails of the other two tabs.

The following screenshots show the User Analysis tab.

The following screenshots show the Dashboards Analysis tab.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

You can reference the public template of the preceding dashboard in create-template, create-analysis, and create-dashboard API calls to create this dashboard and analysis in your account. The public template of this dashboard is identified by the template ARN 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console'.
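For example, the create-dashboard call from that public template might look like the following sketch; the dashboard ID, dataset placeholder name, and dataset ARN are assumptions that must match your own dataset and the template definition:

import boto3

ACCOUNT_ID = "123456789012"  # hypothetical account ID
qs = boto3.client("quicksight")

# Create a dashboard in your account from the public admin-console template.
qs.create_dashboard(
    AwsAccountId=ACCOUNT_ID,
    DashboardId="admin-console",  # hypothetical dashboard ID
    Name="Admin Console",
    SourceEntity={
        "SourceTemplate": {
            "Arn": "arn:aws:quicksight:us-east-1:889399602426:template/admin-console",
            "DataSetReferences": [{
                # Must match the dataset placeholder defined in the template.
                "DataSetPlaceholder": "admin_console",  # assumed placeholder name
                "DataSetArn": (
                    f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:dataset/<dataset_id>"
                ),
            }],
        }
    },
)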

Additional usage metrics

Additionally, we can perform some more complicated analysis to collect advanced usage metrics. For example, Forwood Safety raised a unique request to analyze the readers who log in but don’t view any dashboards (see the following code). This helps their clients identify and prevent any waste of reader session fees. Leadership teams value the ability to minimize uneconomical user activity.

CREATE OR REPLACE VIEW "loginwithoutviewdashboard" AS
with login as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name",
awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventname = 'AssumeRoleWithSAML'
GROUP BY 1,2,3),
dashboard as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name",
awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventsource = 'quicksight.amazonaws.com'
AND
eventname = 'GetDashboard'
GROUP BY 1,2,3),
users as
(select "namespace",
"group",
"user",
(case
when "group" in ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin')
then 'Author'
else 'Reader'
end)
as author_status
from "group_membership")
select l.*
from login as l
join dashboard as d
on l.user_name = d.user_name
and l.awsregion = d.awsregion
join users as u
on l.user_name = u."user"
-- keep only dashboard views outside the 30-minute window after login
where (d.event_time > (l.event_time + interval '30' minute)
or d.event_time < l.event_time)
and u.author_status = 'Reader'

Cleaning up

To avoid incurring future charges, delete the resources you created with the CloudFormation template.

Conclusion

This post discussed how BI administrators can use QuickSight, CloudTrail, and other AWS services to create a centralized view to analyze QuickSight usage metrics. We also presented a serverless data pipeline to support the administrative console dashboard.

You can request a demo of this administrative console to try for yourself.


About the Authors

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.


Jill Florant manages Customer Success for the Amazon QuickSight Service team.

Using administrative dashboards for a centralized view of Amazon QuickSight objects

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/using-administrative-dashboards-for-a-centralized-view-of-amazon-quicksight-objects/

“Security is job 0” is the primary maxim of all endeavors undertaken at AWS. Amazon QuickSight, the fast-growing, cloud-native business intelligence (BI) platform from AWS, provides security controls through a variety of means, including web browsers and API calls. These controls apply to various functions, such as user management, authorization, authentication, and data governance.

This post demonstrates how to build a workflow to enable a centralized visualization of QuickSight groups and user information, as well as QuickSight objects access permission auditing information. Combined with AWS CloudTrail logs, the solution enables your security team to detect any abnormal behavior in near-real time to ensure security compliance.

Benefits of a centralized dashboard

A group in QuickSight consists of a set of users. Using groups makes it easy to manage access and security. For example, you can configure three groups, called Marketing, HR, and BI Developer, and each has specific access privileges:

  • The users in the Marketing group can only view the dashboards with marketing data
  • The users in the HR group can only view the human resources data
  • The users in the BI Developer group can edit all objects, including data sources, datasets, and dashboards

After the users and groups are configured, BI administrators can check and edit the object access permission by choosing Share for dashboards, datasets, and all other objects. The following screenshot shows the Manage dashboard sharing page on the QuickSight console.

As of this writing, individual object permission information is available on the QuickSight console, and user information is provided on the user management view on the QuickSight console. Our solution integrates QuickSight APIs with other AWS services to create an administrative dashboard that provides a centralized view of essential security information. This dashboard covers not only user lists and individual object access permission information available on the current platform, but also additional security information like group lists, user-group mapping information, and overall objects access permissions. This dashboard allows you to acquire unique security insights with its collection of comprehensive security information.

This post provides a detailed workflow that covers the data pipeline, sample Python codes, the AWS CloudFormation template, and a sample administrative dashboard. With the guidance of this post, you can configure a centralized information center in your own environment.

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. A new user creation event in the CloudTrail log triggers the Amazon CloudWatch Events rule CreateUser.
  2. The CreateUser rule triggers the AWS Lambda function User_Initiation. This function checks if the new user belongs to an existing group (for this post, we assume that their AWS Identity and Access Management (IAM) role equates to the group they should belong to in QuickSight). If such a group exists, the function adds the user into the group (CreateGroupMembership). Otherwise, it creates a new group (CreateGroup). The following is the process flow diagram of the Lambda function.
  3. If the CreateGroupMembership event occurs, it triggers the Lambda function Data_Prepare. This function calls QuickSight APIs to get QuickSight group, user, and object access permissions information and saves the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  4. If the Lambda function User_Initiation creates a new QuickSight group in Step 2, it triggers the CloudWatch rule CreateGroup and the Lambda function Group_Initiation. The Group_Initiation function updates the QuickSight objects permission for the new group, such as granting it permission to view a dashboard.
  5. The update object permission event in Step 4 triggers the Lambda function Data_Prepare, which updates the object access permissions information and saves the updated information to an S3 bucket.
  6. The DeleteUser and DeleteGroup events also trigger the Lambda function Data_Prepare.
  7. Based on the file in Amazon S3 that contains the user-group mapping information and the QuickSight objects access permissions information, an Amazon Athena table is created.
  8. A QuickSight dataset fetches the data in the Athena table created in Step 7 through DirectQuery. Another QuickSight dataset is created based on the CloudTrail logs data. Then, based on these two datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Access to the following AWS services:
    • QuickSight
    • Athena
    • Lambda
    • Amazon S3
  • Basic knowledge of Python
  • Security Assertion Markup Language 2.0 (SAML 2.0) or OpenID Connect (OIDC) single sign-on (SSO) configured for QuickSight access

Creating resources

Create your resources by downloading the following AWS Cloud Development Kit (AWS CDK) stack from the GitHub repo.

Pull the Administrative Dashboard folder and run the command cdk deploy QuickSightStack to deploy the resources. For more information, see AWS CDK Intro Workshop: Python Workshop.

Implementing the solution

This solution assumes that the users log in to their QuickSight account with identity federation through SAML or OIDC. For instructions on setting up SAML SSO, see Single Sign-On Access to Amazon QuickSight Using SAML 2.0, Federate Amazon QuickSight access with Okta, and Enabling Amazon QuickSight federation with Azure AD. For OIDC SSO, see Use Amazon QuickSight Federated Single Sign-On with Amazon Cognito User Pools.

After you set up the IAM policy of the web identity or SAML federation role, you don’t need to invite users manually. A QuickSight user is provisioned automatically when opening QuickSight for the first time.

In this solution, one SAML federation role corresponds to a QuickSight group. There are four sample SAML roles: Marketing, HR, BI-Admin, and BI Developer.

The following code is the sample CreateUser CloudTrail event:

{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAZEAWJBC3FBJ7KDH2N:[email protected]",
        "arn": "arn:aws:sts::<aws_account_id>:assumed-role/ BI-Developer/[email protected]",
        "accountId": <aws_account_id>,
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAZEAWJBC3FBJ7KDH2N",
                "arn": "arn:aws:iam:: <aws_account_id>:role/BI-Developer",
                "accountId": <aws_account_id>,
                "userName": " BI-Developer"}
        }
    },
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "CreateUser",
    "awsRegion": "us-east-1",
    "eventType": "AwsServiceEvent",
…    
}

This event triggers the CloudWatch events rule CreateUser. The following screenshot shows the details of this rule.

The CreateUser rule triggers the Lambda function User_Initiation. This function gets the QuickSight group name (Marketing, HR, BI-Admin, or BI Developer) and compares the group name with the existing group list. If such a group exists, it adds this new user into that group (CreateGroupMembership). Otherwise, it creates a new group (CreateGroup).
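Because the rule’s screenshot isn’t reproduced here, the following hedged boto3 sketch shows an equivalent event pattern (the rule name is illustrative; QuickSight service events surface in CloudTrail as AWS Service Events):

import json

import boto3

events = boto3.client("events")

# Match QuickSight CreateUser service events recorded by CloudTrail.
pattern = {
    "source": ["aws.quicksight"],
    "detail-type": ["AWS Service Event via CloudTrail"],
    "detail": {
        "eventSource": ["quicksight.amazonaws.com"],
        "eventName": ["CreateUser"],
    },
}
events.put_rule(Name="CreateUser", EventPattern=json.dumps(pattern))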

The Data_Prepare Lambda function is triggered by the event of adding a new user into a group (CreateGroupMembership). This function calls the QuickSight APIs describe_data_set_permissions, describe_dashboard_permissions, and describe_data_source_permissions to get the object access permissions. It also calls the APIs list_user_groups and list_users to get the list of users and the groups of each user. Finally, this function creates two files containing the QuickSight group, user, and object access information, and saves these files into an S3 bucket.

If a new QuickSight group is created, it triggers the Lambda function Group_Initiation to update the QuickSight dashboard, dataset, or data source permission for this new group. For example, if the HR group is created, the Group_Initiation function lets the HR group view the Employee Information dashboard.

The UpdateDashboardPermissions, UpdateDatasetPermissions, and UpdateDatasourcePermissions events trigger the Lambda function Data_Prepare to update the object access permissions information stored in the S3 bucket.

To create two Athena tables (Groups and Objects), run an AWS Glue crawler.
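Assuming the crawler (named qs-admin-crawler here for illustration) is already configured to point at the two S3 prefixes, you can run it on demand:

import boto3

glue = boto3.client("glue")

# Run the crawler that builds the Groups and Objects tables.
glue.start_crawler(Name="qs-admin-crawler")  # hypothetical crawler name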

The following screenshot is sample data of the Groups table.

The following screenshot is sample data of the Objects table.

You can create a DirectQuery dataset in QuickSight with the two new Athena tables joined. See the following screenshot.

The Objects table contains the information about the objects (such as dashboards and datasets) belonging to each group or user. Furthermore, we can create a calculated field called Ownership, based on the permissions information (the actions column in the Objects table), to distinguish object owners from viewers or users (the object owner can delete the object, whereas a viewer or user can’t).

The following screenshot shows the relevant code.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

After that, run the following SQL query to build an Athena view with QuickSight events for the last 24 hours:

CREATE OR REPLACE VIEW qsctlog_last_24h AS 
SELECT "useridentity"."type", "split_part"("useridentity"."sessioncontext"."sessionissuer"."arn", '/', 2) "group_name"
, COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) "user_name"
, "eventtime"
, "eventname"
, "awsregion"
, "resources"
, "eventtype"
, "recipientaccountid"
, "serviceeventdetails"
FROM default.cloudtrail_logs
WHERE (("eventsource" = 'quicksight.amazonaws.com') AND (CAST("split_part"("eventtime", 'T', 1) AS date) > "date_add"('hour', -24, "now"()))) 

Running queries in Athena

Now we have the datasets ready in Athena and can run SQL queries against them to answer some common administrative questions.

To create a QuickSight dataset to catch all orphan users that don’t belong to any group, as well as the events done by these users in the last 24 hours, run the following SQL query:

SELECT g.*
,log.group_name as role_name
,log.user_name as log_user_name
,log.type
,log.eventtime
,log.eventname
,log.awsregion
,log.eventtype
,log.recipientaccountid
,log.serviceeventdetails
FROM "default"."qsctlog_last_24h" as log 
full outer join 
"default"."groups" as g 
on log.awsregion=g.aws_region AND log.group_name=g.group_name AND log.user_name=g.user_name 
where g.group_name is null or g.group_name=''

To create a QuickSight dataset to list objects belonging to each group or user, run the following query:

SELECT group_name AS "Group/User Name"
, object_name
, object_type
, if((actions LIKE '%Delete%'), 'Owner', 'Viewer/User') AS Ownership
FROM "default"."object" full outer
JOIN "default"."groups"
    ON group_name=principal_name
WHERE principal_type='group'
UNION
SELECT user_name AS "Group/User Name"
, object_name
, object_type
, if((actions LIKE '%Delete%'), 'Owner', 'Viewer/User') AS Ownership
FROM "default"."object" full outer
JOIN "default"."groups"
    ON user_name=principal_name
WHERE principal_type='user'
ORDER BY  "Group/User Name" asc;

The following screenshot shows the sample data.

Building dashboards

In addition to running queries directly in Athena, we can build a dashboard using this same data in QuickSight. The following screenshot shows an example dashboard that you can make using our data.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

Cleaning up

To avoid incurring future charges, delete the resources you created by running the following command:

cdk destroy QuickSightStack 

Then, on the Amazon S3 console, delete the S3 bucket administrative-dashboard<your_aws_account_id>.

Conclusion

This post discussed how BI administrators can use the QuickSight dashboard, Lambda functions, and other AWS services to create a centralized view of groups, users, and objects access permission information and abnormal access auditing. We also presented a serverless data pipeline to support the administrative dashboard. This dashboard can provide you with unique security insights with its collection of comprehensive security information.


About the Author

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.