Tag Archives: Amazon QuickSight

Improving Retail Forecast Accuracy with Machine Learning

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/improving-retail-forecast-accuracy-with-machine-learning/

The global retail market continues to grow larger and the influx of consumer data increases daily. The rise in volume, variety, and velocity of data poses challenges with demand forecasting and inventory planning. Outdated systems generate inaccurate demand forecasts, which leaves retailers facing overstocking or lost sales and often forces them to carry increased levels of safety stock to compensate.

A recent McKinsey study indicates that AI-based forecasting improves forecasting accuracy by 10–20 percent. This translates to revenue increases of 2–3 percent. An accurate forecasting system can also help determine ideal inventory levels and better predict the impact of sales promotions. It provides a single view of demand across all channels and a better customer experience overall.

In this blog post, we will show you how to build a reliable retail forecasting system. We will use Amazon Forecast, and an AWS-vetted solution called Improving Forecast Accuracy with Machine Learning. This is an AWS Solutions Implementation that automatically produces forecasts and generates visualization dashboards. This solution can be extended to use cases across a variety of industries.

Improving Forecast Accuracy solution architecture

This post will illustrate a retail environment that has an SAP S/4 HANA system for overall enterprise resource planning (ERP). We will show a forecasting solution based on Amazon Forecast to predict demand across product categories. The environment also has a unified platform for customer experience provided by SAP Customer Activity Repository (CAR). Replenishment processes are driven by SAP Forecasting and Replenishment (F&R), and SAP Fiori apps are used to manage forecasts.

The solution is divided into four parts: Data extraction and preparation, Forecasting and monitoring, Data visualization, and Forecast import and utilization in SAP.

Figure 1. Notional architecture for improving forecasting accuracy solution and SAP integration

Data extraction and preparation

Historical demand data such as sales, web traffic, inventory numbers, and resource demand are extracted from SAP and uploaded to Amazon Simple Storage Service (S3). There are multiple ways to extract data from an SAP system into AWS. As part of this architecture, we will use operational data provisioning (ODP) extraction. ODP acts as a data source for OData services, enabling REST-based integrations with external applications. The ODP-Based Data Extraction via OData document details this approach. The steps involved are:

  1. Create a data source using transaction RSO2, and allow Change Data Capture for the specific data to be extracted.
  2. Create an OData service using transaction SEGW.
  3. Create a data model for ODP extraction that refers to the defined data source, then register the service.
  4. Initiate the service from the SAP gateway client.
  5. In the AWS Management Console, create an AWS Lambda function to extract the data and upload it to Amazon S3. Check out the sample extractor code using Python, referenced in the blog Building data lakes with SAP on AWS; a minimal sketch follows this list.
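
The blog post above ships a full extractor; the following is only a minimal sketch of the Lambda function pattern, assuming a hypothetical ODP OData service URL, SAP credentials in environment variables, and a forecast input bucket name. The real extractor additionally handles paging, delta tokens, and error handling.

import base64
import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Assumed environment variables: ODATA_URL, SAP_USER, SAP_PASSWORD, BUCKET
    url = os.environ["ODATA_URL"]  # e.g. an ODP-based OData service exposed through SAP Gateway
    auth = base64.b64encode(
        f"{os.environ['SAP_USER']}:{os.environ['SAP_PASSWORD']}".encode()
    ).decode()

    request = urllib.request.Request(url, headers={"Authorization": f"Basic {auth}"})
    with urllib.request.urlopen(request) as response:  # pull the extract as JSON
        payload = json.loads(response.read())

    # Land the raw extract in the forecast input bucket as a uniquely named object
    key = f"sap-extracts/demand-{context.aws_request_id}.json"
    s3.put_object(Bucket=os.environ["BUCKET"], Key=key, Body=json.dumps(payload))
    return {"records": len(payload.get("d", {}).get("results", [])), "s3_key": key}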

Related data that can potentially affect demand levels can be uploaded to Amazon S3. These could include seasonal events, promotions, and item price. Additional item metadata, such as product descriptions, color, brand, and size, may also be uploaded. Amazon Forecast provides built-in related time series data for holidays and weather. These three components together form the forecast inputs.

Forecasting and monitoring

An S3 event notification will be initiated when new datasets are uploaded to the input bucket. This, in turn, starts an AWS Step Functions state machine. The state machine orchestrates a series of AWS Lambda functions that build, train, and deploy machine learning models in Amazon Forecast. All AWS Step Functions logs are sent to Amazon CloudWatch. Administrators will be notified of the AWS Step Functions results through Amazon Simple Notification Service (Amazon SNS).
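
The deployed solution wires this trigger for you. For illustration only, the following is a minimal sketch of how an S3-notified Lambda function could start the state machine, assuming a hypothetical STATE_MACHINE_ARN environment variable.

import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    executions = []
    for record in event.get("Records", []):  # one record per uploaded dataset file
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],
            input=json.dumps({"bucket": bucket, "dataset_file": key}),
        )
        executions.append(response["executionArn"])
    return {"startedExecutions": executions}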

An AWS Glue job combines raw forecast input data, metadata, predictor backtest exports, and forecast exports into an aggregated view of forecasts in an S3 bucket. This aggregated view is then translated into the format expected by the External Forecast import interface. Amazon Athena can be used to query the forecast output in S3 with standard SQL.
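
For example, the aggregated output can be queried programmatically with Athena. The database, table, column names, and result bucket below are illustrative placeholders for whatever the Glue job registers in your account.

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT item_id, date, p50, p90
        FROM forecast_output
        WHERE date >= date '2021-06-01'
        ORDER BY item_id, date
    """,
    QueryExecutionContext={"Database": "forecast"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/forecast/"},
)
print("Query execution ID:", response["QueryExecutionId"])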

Data visualization

Amazon QuickSight analyses can be created on a per-forecast basis. This provides users with forecast output visualization across hierarchies and categories of forecasted items. It also displays item-level accuracy metrics. Dashboards can be created from these analyses and shared within the organization. Additionally, data scientists and developers can prepare and process data, and evaluate Forecast outputs using an Amazon SageMaker Notebook Instance.

Forecast import and utilization in SAP

Amazon Forecast outputs located in Amazon S3 will be imported into the Unified Demand Forecast (UDF) module within the SAP Customer Activity Repository (CAR). See the SAP documentation for how to import external forecasts. An AWS Lambda function will be initiated when aggregated forecasts are uploaded to the S3 bucket. The Lambda function performs a remote function call (RFC) to the SAP system through the official SAP JCo library. The SAP RFC credentials and connection information may be stored securely inside AWS Secrets Manager and read on demand to establish connectivity.
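
The following is a minimal sketch of the Secrets Manager lookup inside such an import function. The secret name and JSON keys are assumptions; the RFC call itself goes through the SAP JCo library (Java) or another SAP connector, so it appears only as a placeholder comment here.

import json

import boto3

secrets = boto3.client("secretsmanager")

def get_sap_connection_info(secret_name="sap/car/rfc"):  # assumed secret name
    secret = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(secret["SecretString"])  # for example {"host": ..., "sysnr": ..., "user": ..., "password": ...}

def handler(event, context):
    conn = get_sap_connection_info()
    # Placeholder: hand the connection details and the S3 key of the aggregated
    # forecast to the RFC-capable layer (JCo) that imports it into UDF.
    print("Importing forecast into SAP host", conn["host"], "for event", event)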

Once imported, forecast values from the solution can be retrieved by SAP Forecasting and Replenishment (F&R). They will be consumed as an input to replenishment processes, which consist of requirements calculation and requirement quantity optimization. SAP F&R calculates requirements based on the forecast, the current stock, and the open purchase orders. The requirement quantity may then be improved in accordance with optimization settings defined in SAP F&R.


Additionally, you have the flexibility to adjust the system forecast as required by the demand situation, or to analyze forecasts in the respective SAP Fiori apps.

Sample use case: AnyCompany Stores, Inc.

To illustrate how beneficial this solution can be for retail organizations, let’s consider AnyCompany Stores, Inc. This is a hypothetical customer and a leader in the retail industry with 985 stores across the United States. They struggle with issues stemming from their existing forecasting implementation. That implementation only understands certain categories and does not factor in the entire product portfolio. Additionally, it is limited to available demand history and does not consider related information that may affect forecasts. AnyCompany Stores is looking to improve their demand forecasting system.

Using Improving Forecast Accuracy with Machine Learning, AnyCompany Stores can easily generate AI-based forecasts at appropriate quantiles to address sensitivities associated with respective product categories. This mitigates inconsistent inventory buys, overstocks, out-of-stocks, and margin erosion. The solution also considers all relevant related data in addition to the historical demand data. This ensures that generated forecasts are accurate for each product category.

The generated forecasts may be used to complement existing forecasting and replenishment processes. With an improved forecasting solution, AnyCompany Stores will be able to meet demand, while holding less inventory and improving customer experience. This also helps ensure that potential demand spikes are accurately captured, so staples will always be in stock. Additionally, the company will not overstock expensive items with short shelf lives that are likely to spoil.

Conclusion

In this post, we explored how to implement an accurate retail forecasting solution using a ready-to-deploy AWS Solution. We use generated forecasts to drive inventory replenishment optimization and improve customer experience. The solution can be extended to inventory, workforce, capacity, and financial planning.

We showcase one of the ways in which Improving Forecast Accuracy with Machine Learning may be extended for a use case in the retail industry. If your organization would like to improve business outcomes with the power of forecasting, explore customizing this solution to fit your unique needs.

Further reading:

Analyze Fraud Transactions using Amazon Fraud Detector and Amazon Athena

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/analyze-fraud-transactions-using-amazon-fraud-detector-and-amazon-athena/

Organizations with online businesses have to be on guard constantly for fraudulent activity, such as fake accounts or payments made with stolen credit cards. One way they try to identify fraudsters is by using fraud detection applications. Some of these applications use machine learning (ML).

A common challenge with ML is the need for a large, labeled dataset to create ML models to detect fraud. You will also need the skill set and infrastructure to build, train, deploy, and scale your ML model.

In this post, I discuss how to perform fraud detection on a batch of many events using Amazon Fraud Detector. Amazon Fraud Detector is a fully managed service that can identify potentially fraudulent online activities. These can be situations such as the creation of fake accounts or online payment fraud. Unlike general-purpose ML packages, Amazon Fraud Detector is designed specifically to detect fraud. You can analyze fraud transaction prediction results by using Amazon Athena and Amazon QuickSight. I will explain how to review fraud using Amazon Fraud Detector and Amazon SageMaker built-in algorithms.

Batch fraud prediction use cases

You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. You may want to generate fraud predictions for a batch of events, such as payment fraud, account takeover or compromise, and free tier misuse, while performing an offline proof of concept. You can also use batch predictions to evaluate the risk of events on an hourly, daily, or weekly basis depending upon your business need.

Batch fraud insights using Amazon Fraud Detector

Organizations such as ecommerce companies and credit card companies use ML to detect fraud. Some of the most common types of fraud include email account compromise (personal or business), new account fraud, and non-payment or non-delivery (which includes compromised card numbers).

Amazon Fraud Detector automates the time-consuming and expensive steps to build, train, and deploy an ML model for fraud detection. Amazon Fraud Detector customizes each model it creates to your dataset, making the accuracy of models higher than current one-size-fits-all ML solutions. And because you pay only for what you use, you can avoid large upfront expenses.

If you want to analyze fraud transactions after the fact, you can perform batch fraud predictions using Amazon Fraud Detector. Then you can store fraud prediction results in an Amazon S3 bucket. Amazon Athena helps you analyze the fraud prediction results. You can create fraud prediction visualization dashboards using Amazon QuickSight.

The following diagram illustrates how to perform fraud predictions for a batch of events and analyze them using Amazon Athena.

Figure 1. Example architecture for analyzing fraud transactions using Amazon Fraud Detector and Amazon Athena

The architecture flow follows these general steps:

  1. Create and publish a detector. First create and publish a detector using Amazon Fraud Detector. It should contain your fraud prediction model and rules. For additional details, see Get started (console).
  2. Create an input Amazon S3 bucket and upload your CSV file. Prepare a CSV file that contains the events you want to evaluate. Then upload your CSV file into the input S3 bucket. In this file, include a column for each variable in the event type associated with your detector. In addition, include columns for EVENT_ID, ENTITY_ID, EVENT_TIMESTAMP, ENTITY_TYPE. Refer to Amazon Fraud Detector batch input and output files for more details. Read Create a variable for additional information on Amazon Fraud Detector variable data types and formatting.
  3. Create an output Amazon S3 bucket. Create an output Amazon S3 bucket to store your Amazon Fraud Detector prediction results.
  4. Perform a batch prediction. You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. For more information, see Batch predictions; a minimal API sketch follows this list.
  5. Review your prediction results. Review your results in the CSV file that is generated and stored in the Amazon S3 output bucket.
  6. Analyze your fraud prediction results.
    • After creating a Data Catalog by using AWS Glue, you can use Amazon Athena to analyze your fraud prediction results with standard SQL.
    • You can develop user-friendly dashboards to analyze fraud prediction results using Amazon QuickSight by creating new datasets with Amazon Athena as your data source.
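
The batch prediction in step 4 can also be started from the AWS SDK. The following minimal sketch assumes placeholder detector, event type, role, and bucket names created in steps 1–3.

import boto3

frauddetector = boto3.client("frauddetector")

frauddetector.create_batch_prediction_job(
    jobId="registration-batch-2021-07-01",
    inputPath="s3://my-fraud-input-bucket/events.csv",
    outputPath="s3://my-fraud-output-bucket/",
    eventTypeName="registration",
    detectorName="new_account_detector",
    detectorVersion="1",
    iamRoleArn="arn:aws:iam::111122223333:role/FraudDetectorBatchRole",
)

# Check the job status until it reaches COMPLETE
jobs = frauddetector.get_batch_prediction_jobs(jobId="registration-batch-2021-07-01")
print(jobs["batchPredictions"][0]["status"])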

Fraud detection using Amazon SageMaker

The Amazon Web Services (AWS) Solutions Implementation, Fraud Detection Using Machine Learning, enables you to run automated transaction processing. This can be on an example dataset or your own dataset. The included ML model detects potentially fraudulent activity and flags that activity for review. The following diagram presents the architecture you can automatically deploy using the solution’s implementation guide and accompanying AWS CloudFormation template.

SageMaker provides several built-in machine learning algorithms that you can use for a variety of problem types. This solution leverages the built-in Random Cut Forest algorithm for unsupervised learning and the built-in XGBoost algorithm for supervised learning. In the SageMaker Developer Guide, you can see how Random Cut Forest and XGBoost algorithms work.

Figure 2. Fraud detection using machine learning architecture on AWS

This architecture can be segmented into three phases.

  1. Develop a fraud prediction machine learning model. The AWS CloudFormation template deploys an example dataset of credit card transactions contained in an Amazon Simple Storage Service (Amazon S3) bucket. The Random Cut Forest and XGBoost models are then trained on the dataset from an Amazon SageMaker notebook instance.
  2. Perform fraud prediction. The solution also deploys an AWS Lambda function that processes transactions from the example dataset. It invokes the two SageMaker endpoints that assign anomaly scores and classification scores to incoming data points. An Amazon API Gateway REST API initiates predictions using signed HTTP requests. An Amazon Kinesis Data Firehose delivery stream loads the processed transactions into another Amazon S3 bucket for storage. The solution also provides an example of how to invoke the prediction REST API as part of the Amazon SageMaker notebook. A minimal sketch of invoking the two endpoints follows this list.
  3. Analyze fraud transactions. Once the transactions have been loaded into Amazon S3, you can use analytics tools and services for visualization, reporting, ad-hoc queries, and more detailed analysis.
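
The following is a minimal sketch of scoring a single transaction against the two endpoints referenced in step 2. The endpoint names and the CSV feature payload are illustrative; the deployed stack defines its own endpoint names and feature order.

import boto3

runtime = boto3.client("sagemaker-runtime")

transaction_csv = "0.5,1.2,0.0,3.4"  # one transaction's feature vector, comma separated

anomaly = runtime.invoke_endpoint(
    EndpointName="fraud-detection-rcf",  # Random Cut Forest endpoint (assumed name)
    ContentType="text/csv",
    Body=transaction_csv,
)
classification = runtime.invoke_endpoint(
    EndpointName="fraud-detection-xgb",  # XGBoost endpoint (assumed name)
    ContentType="text/csv",
    Body=transaction_csv,
)

print("Anomaly score:", anomaly["Body"].read().decode())
print("Fraud probability:", classification["Body"].read().decode())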

By default, the solution is configured to process transactions from the example dataset. To use your own dataset, you must modify the solution. For more information, see Customization.

Conclusion

In this post, we showed you how to analyze fraud transactions using Amazon Fraud Detector and Amazon Athena. You can build fraud insights using Amazon Fraud Detector and the Amazon SageMaker built-in algorithms Random Cut Forest and XGBoost. With the information in this post, you can build your own fraud insights models on AWS, detect fraud faster, and address a variety of fraud types, such as new account fraud, online transaction fraud, and fake reviews, among others.

Read more and get started on building fraud detection models on AWS.

Establish private connectivity between Amazon QuickSight and Snowflake using AWS PrivateLink

Post Syndicated from Maxwell Moon original https://aws.amazon.com/blogs/big-data/establish-private-connectivity-between-amazon-quicksight-and-snowflake-using-aws-privatelink/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include Machine Learning-powered insights. QuickSight dashboards can be accessed from any device, and seamlessly embedded into your applications, portals, and websites.

QuickSight offers several sources for data, including but not limited to Amazon Athena, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Snowflake. This post presents solutions to enable you to set up Snowflake as a data source for QuickSight regardless of your network configuration requirements.

We cover the following configurations of Snowflake as a data source for QuickSight:

  • QuickSight connection to Snowflake via AWS PrivateLink
  • QuickSight connection to Snowflake via AWS PrivateLink and virtual private cloud (VPC) peering (same Region)
  • QuickSight connection to Snowflake via AWS PrivateLink and VPC peering (cross-Region)
  • QuickSight connection to Snowflake (public network)

Prerequisites

To complete this solution, you need the following:

QuickSight connection to Snowflake via AWS PrivateLink

First, we show you how to connect to Snowflake with QuickSight over AWS PrivateLink. The following diagram illustrates the solution architecture.

Set up the Snowflake AWS PrivateLink integration

To start, we walk through enabling AWS PrivateLink for your Snowflake account. This includes locating resources in your AWS account, accessing the Snowflake UI, and creating a support case with Snowflake.

  1. Identify the VPC you want to use to set up the AWS PrivateLink integration. To do so, retrieve a list of VPCs from the command line, then retrieve the VpcId element from the resulting JSON object for the desired VPC. See the following code:
aws ec2 describe-vpcs --output json
  1. Retrieve your AWS account ID. This post assumes that the account you’re targeting is your default account on your AWS CLI configuration.
aws sts get-caller-identity --output json
  1. If you’re setting up multiple accounts, repeat these steps for all accounts and VPCs (this post assumes you’re setting up a single account and VPC and will use this as the context moving forward).
  2. Contact Snowflake Support with your AWS account ID, VPC ID, and the corresponding account URL you use to access Snowflake (for example, <account id>.snowflakecomputing.com).

Enabling AWS PrivateLink for your Snowflake account can take up to two business days.

  1. After AWS PrivateLink is enabled, retrieve the AWS PrivateLink configuration for your Region by running the following command in a Snowflake worksheet, then retrieve the values for privatelink-vpce-id, privatelink-account-url, and privatelink_ocsp-url from the resulting JSON object. Examples of each value are as follows:
select SYSTEM$GET_PRIVATELINK_CONFIG();

privatelink-vpce-id: com.amazonaws.vpce.<region_id>.vpce-svc-xxxxxxxxxxxxxxxxx
privatelink-account-url: xxxxxxxx.<region>.privatelink.snowflakecomputing.com
privatelink_ocsp-url: ocsp.xxxxxxxx.<region>.privatelink.snowflakecomputing.com
  1. Store these values in a text editor for later use.

Next, we configure the VPC endpoint on the Amazon Virtual Private Cloud (Amazon VPC) console and create all the required security groups.

  1. On the Amazon VPC console, choose Endpoints in the navigation menu.
  2. Choose Create endpoint.
  3. Select Find AWS Service by Name.
  4. For Service Name, enter the value for privatelink-vpce-id that we retrieved earlier.
  5. Choose Verify.

A green alert with “Service Name Found” appears and the VPC and subnet options automatically expand.

Depending on your targeted Region, your resulting screen may show another Region name.

  1. Choose the same VPC ID that you sent to Snowflake.
  2. Select the subnets where you want to create endpoints.

AWS recommends using more than one subnet for high availability.

  1. For Security group, choose Create a new security group.

This opens the Security groups page on the Amazon VPC console in a new tab.

  1. Choose Create security group.

  1. Give your new security group a name (for example, quicksight-doc-snowflake-privatelink-connection) and description.
  2. Choose the VPC ID you used in previous steps.

Next, you add two rules that allow traffic from within your VPC to this VPC endpoint.

  1. Retrieve the CIDR block for your targeted VPC:
aws ec2 describe-vpcs --vpc-ids vpc-xxxxxxxxxxxxxxxxx | jq -r '.Vpcs[].CidrBlock'
  1. Choose Add rule in the Inbound rules section.
  2. Choose HTTPS for the type, leave the source as Custom, and enter the value retrieved from the preceding describe-vpcs call (for example, 10.0.0.0/16).
  3. Choose Add rule in the Inbound rules section.
  4. Choose HTTP for the type, leave the source as Custom, and enter the same CIDR value from the describe-vpcs call.
  5. Choose Create security group.

  1. Retrieve the security group ID from the newly created security group.
  2. On the VPC endpoint configuration page, remove the default security group.
  3. Search for and select the new security group ID.

  1. Choose Create endpoint.

You’re redirected to a page that has a link to your VPC endpoint configuration, specified by the VPC ID. The next page has a link to view the configuration in full.

  1. Retrieve the topmost record in the DNS names list.

This can be differentiated from other DNS names because it only includes the Region name (such as us-west-2), and no Availability Zone letter notation (such as us-west-2a).

  1. Store this record in a text editor for later use.

Configure DNS for Snowflake endpoints in your VPC

To configure your Snowflake endpoints, complete the following steps:

  1. On the Route 53 console, choose Hosted Zones in the navigation pane.
  2. Choose Create hosted zone.
  3. For Domain name, enter the value you stored for privatelink-account-url from the previous steps.

In this field, we remove the Snowflake account ID from the DNS name and only use the value starting with the Region identifier (for example, <region>.privatelink.snowflakecomputing.com). We create a resource record set later for the subdomain.

  1. For Type, select Private hosted zone.

Your Region code may not be us-west-2; reference the DNS name returned to you by Snowflake.

  1. In the VPCs to associate with the hosted zone section, choose the Region in which your VPC is located and the VPC ID used in previous steps.

  1. Choose Create hosted zone.

Next, we create two records: one for privatelink-account-url and one for privatelink_ocsp-url.

  1. On the Hosted zones page, choose Create record set.
  2. For Record name, enter your Snowflake account ID (the first eight characters in privatelink-account-url).
  3. For Record type, choose CNAME.
  4. For Value, enter the DNS name for the Regional VPC endpoint we retrieved in the previous section.
  5. Choose Create records.

  1. Repeat these steps for the OCSP record we noted as privatelink_ocsp-url earlier, starting with ocsp through the eight-character Snowflake ID for the record name (for example, ocsp.xxxxxxxx). A scripted sketch of both records follows.
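
If you prefer to script this, both CNAME records can be created in one pass with the Route 53 API. The hosted zone ID, Snowflake account ID, Region, and VPC endpoint DNS name below are placeholders for the values you collected earlier.

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # private hosted zone created above
VPCE_DNS_NAME = "vpce-0abc123-xyz.vpce-svc-0abc123.us-west-2.vpce.amazonaws.com"

for name in (
    "xxxxxxxx.us-west-2.privatelink.snowflakecomputing.com",       # privatelink-account-url
    "ocsp.xxxxxxxx.us-west-2.privatelink.snowflakecomputing.com",  # privatelink_ocsp-url
):
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "CNAME",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": VPCE_DNS_NAME}],
                },
            }]
        },
    )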

Configure a Route 53 resolver inbound endpoint for your VPC

QuickSight doesn’t use the standard AWS resolver (the VPC’s .2 resolver). To resolve private DNS from QuickSight, you need to set up Route 53 resolver endpoints.

First, we create a security group for the Route 53 resolver inbound endpoint.

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name for your security group (for example, quicksight-doc-route53-resolver-sg) and a description.
  3. Choose the VPC ID used in previous steps.
  4. Create rules that allow for DNS (Port 53) over UDP and TCP from within the VPC CIDR block.
  5. Choose Create security group.
  6. Note the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.

Now we create the Route 53 resolver inbound endpoint for our VPC.

  1. On the Route 53 console, choose Inbound endpoint in the navigation pane.
  2. Choose Create inbound endpoint.
  3. For Endpoint name, enter a name (for example, quicksight-inbound-resolver).
  4. For VPC in the Region, choose the VPC ID used in previous steps.
  5. For Security group for the endpoint, choose the security group ID you saved earlier.

  1. In the IP address section, choose two Availability Zones and subnets, and leave Use an IP address that is selected automatically selected.

  1. Choose Submit.
  2. Choose the inbound endpoint after it’s created and take note of the two IP addresses for the resolvers.
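
The same inbound endpoint can be created with the Route 53 Resolver API; a minimal sketch follows, assuming placeholder subnet and security group IDs for the VPC used above.

import uuid

import boto3

resolver = boto3.client("route53resolver")

endpoint = resolver.create_resolver_endpoint(
    CreatorRequestId=str(uuid.uuid4()),  # idempotency token
    Name="quicksight-inbound-resolver",
    SecurityGroupIds=["sg-0123456789abcdef0"],  # resolver security group created above
    Direction="INBOUND",
    IpAddresses=[
        {"SubnetId": "subnet-0aaa111122223333a"},  # two subnets in different Availability Zones
        {"SubnetId": "subnet-0bbb444455556666b"},
    ],
)
print(endpoint["ResolverEndpoint"]["Id"])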

Connect a VPC to QuickSight

To connect a VPC to QuickSight, complete the following steps:

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-snowflake-privatelink-sg) and a description.
  3. Choose the VPC ID used in previous steps.

Security groups for QuickSight are different from other security groups in that they are stateless, rather than stateful. This means you must explicitly allow return traffic from the targeted security group. The inbound rule in your security group must allow traffic on all ports. It needs to do this because the destination port number of any inbound return packets is set to a randomly allocated port number. For more information, see Inbound Rules.

  1. Choose Create security group.
  2. Take note of the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.
  3. On the Security groups page, search for the security group ID that is used for the VPC endpoint.
  4. Choose Edit inbound rules.
  5. Add rules for both HTTPS and HTTP traffic, using the security group ID for the security group you created as the source.
  6. Choose Save rules.

Next, we move to the QuickSight console to configure the VPC connection.

  1. Navigate to the QuickSight console.
  2. Choose the user name and choose Manage QuickSight.

  1. In the navigation pane, choose Manage VPC connections.
  2. Choose Add a VPC connection.
  3. For VPC connection name, enter a name (for example, snowflake-privatelink).
  4. For VPC ID, choose the VPC used in previous steps.
  5. For Subnet ID, choose one of the subnets that has a VPC endpoint, as specified when you created the endpoint earlier.
  6. For Security group ID, enter the ID of the security group you created.
  7. For DNS resolver endpoints, enter the two IPs for the inbound resolver endpoint you created earlier.

  1. Choose Create.
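
Recent versions of the QuickSight API also expose a CreateVPCConnection operation, so this step can be scripted instead of using the console. The following is only a sketch under that assumption; the account ID, execution role, resolver IPs, and network IDs are placeholders, and the console flow above remains the documented path.

import boto3

quicksight = boto3.client("quicksight")

quicksight.create_vpc_connection(
    AwsAccountId="111122223333",
    VPCConnectionId="snowflake-privatelink",
    Name="snowflake-privatelink",
    SubnetIds=["subnet-0aaa111122223333a", "subnet-0bbb444455556666b"],
    SecurityGroupIds=["sg-0quicksight0000000"],
    DnsResolvers=["10.0.1.10", "10.0.2.10"],  # the two resolver IPs noted earlier
    RoleArn="arn:aws:iam::111122223333:role/quicksight-vpc-connection-role",
)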

Set up a Snowflake data source through the VPC

To set up a Snowflake data source, complete the following steps.

  1. On the QuickSight console, choose Datasets in the navigation page.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. For Data source name, enter a name (for example, snowflake).
  5. For Connection type, choose the VPC connection you created earlier (snowflake-privatelink).
  6. For Database server, enter privatelink-account-url.
  7. For Database name, enter the name of your database.
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Upon successful validation, choose Create data source.
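
The same data source can also be registered through the QuickSight CreateDataSource API. The following minimal sketch uses placeholder account, host, credential, and ARN values; in practice, prefer storing the Snowflake credentials outside of source code.

import boto3

quicksight = boto3.client("quicksight")

quicksight.create_data_source(
    AwsAccountId="111122223333",
    DataSourceId="snowflake-privatelink-ds",
    Name="snowflake",
    Type="SNOWFLAKE",
    DataSourceParameters={
        "SnowflakeParameters": {
            "Host": "xxxxxxxx.us-west-2.privatelink.snowflakecomputing.com",  # privatelink-account-url
            "Database": "MY_DATABASE",
            "Warehouse": "MY_WAREHOUSE",
        }
    },
    Credentials={"CredentialPair": {"Username": "my_user", "Password": "my_password"}},
    VpcConnectionProperties={
        "VpcConnectionArn": "arn:aws:quicksight:us-west-2:111122223333:vpcConnection/snowflake-privatelink"  # placeholder ARN
    },
)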

Create your first QuickSight dashboard

In this section, we cover creating a dataset in QuickSight, then using this data in a visualization. We’re using a dummy dataset that has information about fictional employees.

  1. For Schema, choose your schema.
  2. For Tables, select your tables.
  3. Choose Select.

In the Finish dataset creation section, you can determine if QuickSight imports your dataset into SPICE to improve query performance or directly queries your data each time a dashboard is loaded. For more information about SPICE, see Importing Data into SPICE.

  1. For this post, we select Import to SPICE for quicker analytics.
  2. Choose Visualize.

Now that we have the schema, table, and SPICE configuration for the dataset, we can create our first visualization.

  1. Choose a field from the available fields list. For this post, we choose City.
  2. Choose a visualization in the Visual types pane.

This only scratches the surface of the visualization capabilities of QuickSight. For more information, see Working with Amazon QuickSight Visuals.

Next, we cover a network configuration in which QuickSight is connected to one VPC, the AWS PrivateLink connection resides in another VPC, and VPC peering allows QuickSight to reach the AWS PrivateLink connection.

QuickSight connection to Snowflake via AWS PrivateLink and VPC peering within the same Region

In this section, we show you how to connect QuickSight to Snowflake over AWS PrivateLink with two peered VPCs. The following diagram illustrates the solution architecture.

Set up VPC peering

First, we create the VPC peering connection from the requesting VPC.

  1. On the Peering connections page of the Amazon VPC console, choose Create peering connection.
  2. For Select a local VPC to peer with, choose the VPC in which you configured your Snowflake AWS PrivateLink connection.

  1. In the Select another VPC to peer with section, leave the default options for Account and Region (My account and This Region, respectively).
  2. For VPC (Accepter), choose the VPC that QuickSight is connected to.

  1. Choose Create peering connection.

Next, we accept the VPC connection from the accepting VPC.

  1. On the Peering connections page, select the connection you created.
  2. On the Actions menu, choose Accept.
  3. Review the information about the request. If everything looks correct, choose Yes, Accept.
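
Creating and accepting the peering connection (and enabling the cross-VPC DNS resolution covered next) can also be done from the AWS SDK. The VPC IDs below are placeholders.

import boto3

ec2 = boto3.client("ec2")

peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0requester0000000",      # VPC with the Snowflake AWS PrivateLink endpoint
    PeerVpcId="vpc-0accepter00000000",  # VPC that QuickSight connects to
)
peering_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=peering_id)

# Allow DNS resolution across the peering connection in both directions
ec2.modify_vpc_peering_connection_options(
    VpcPeeringConnectionId=peering_id,
    RequesterPeeringConnectionOptions={"AllowDnsResolutionFromRemoteVpc": True},
    AccepterPeeringConnectionOptions={"AllowDnsResolutionFromRemoteVpc": True},
)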

Next, we configure DNS to resolve between the two VPCs.

  1. On the Peering connections page, choose your new peering connection.
  2. On the DNS tab, check if the two options show as Disabled.

If they’re enabled, you can skip to the steps on creating route tables.

  1. On the Actions menu, choose Edit DNS Settings.

This requires your VPC to have DNS host name and resolution enabled.

  1. Select both check boxes to allow DNS to resolve from both the accepter and requester VPCs.
  2. Choose Save.

Next, create the route table entry to allow for routes to propagate between the two VPCs.

  1. On the Route tables page, choose the route tables in your requesting VPC.
  2. On the Route tab, choose Edit routes.
  3. Add a route for the CIDR block that your peered VPC uses (for this post, 172.31.0.0/16).
  4. Choose Save routes.

  1. Repeat for the route tables in your accepter VPC.

Configure DNS in the accepter VPC

In this section, we associate the accepter VPC with the same private hosted zone as the requester VPC (<region>.privatelink.snowflakecomputing.com).

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Select the hosted zone <region>.privatelink.snowflakecomputing.com and choose Edit.
  3. In the VPCs to associate with the hosted zone section, choose Add VPC.
  4. Choose the Region and VPC ID associated with the accepter VPC.

  1. Choose Save changes.

Configure Route 53 resolver inbound endpoints in the accepter VPC

To configure your Route 53 resolver inbound endpoints, complete the following steps:

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-doc-route53-resolver-sg) and a description.
  3. Choose the VPC ID used in previous steps.
  4. Create rules that allow for DNS (port 53) over UDP and TCP from within the VPC CIDR block (for this post, 172.31.0.0/16).
  5. Choose Create security group.

  1. Take note of the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.

Next, we set up the Route 53 inbound endpoint for this VPC.

  1. On the Route 53 console, choose Inbound endpoint in the navigation pane.
  2. Choose Create inbound endpoint.
  3. Enter a name for the endpoint (for example, quicksight-inbound-resolver).
  4. For VPC in the Region, choose the VPC ID for the accepter VPC.
  5. For Security group, choose the security group ID you saved earlier.
  6. In the IP Address section, select two Availability Zones and subnets, and leave Use an IP address that is selected automatically selected.
  7. Choose Submit.
  8. Choose the inbound endpoint after it’s created.
  9. After the inbound endpoint has provisioned, note the two IP addresses for the resolvers.

Connect the accepter VPC to QuickSight

To start, we need to create a security group for QuickSight to allow traffic to the Route 53 resolver inbound endpoints, the VPC endpoint for AWS PrivateLink, and traffic within the local network.

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-snowflake-privatelink-vpc-peering-sg) and a description.
  3. Choose the VPC ID for the accepter VPC.
  4. Create the following ingress rules:
    1. One rule for the local network for all TCP ports (e.g., 172.31.0.0/16).
    2. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all TCP ports.
    3. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all UDP ports.
    4. One rule allowing traffic to the security group for the VPC endpoint (located in the peered VPC).

As discussed earlier, security groups for QuickSight are different from other security groups. You must explicitly allow return traffic from the targeted security group, and the inbound rule in your security group must allow traffic on all ports. For more information, see Inbound Rules.

Next, we modify the security group for the Route 53 resolver inbound endpoint to allow traffic from the security group we created.

  1. On the Security groups page, search for the security group ID used for the Route 53 resolver inbound endpoint.
  2. Choose Edit inbound rules.
  3. Add rules for both DNS over UDP and DNS over TCP, using the security group ID for the security group we created for QuickSight as the source.

  1. Choose Save rules.

Next, modify the security group that was created for the VPC endpoint for the AWS PrivateLink connection.

  1. On the Security groups page, search for the security group ID used for the VPC endpoint for the AWS PrivateLink connection.
  2. Choose Edit inbound rules.
  3. Add rules for both HTTPS and HTTP, using the security group ID for the security group created for QuickSight as the source.
  4. Choose Save rules.

Next, we set up the VPC connection in QuickSight.

  1. On the QuickSight console, choose the user name and choose Manage QuickSight.
  2. In the navigation pane, choose Manage VPC connections.
  3. Choose Add a VPC connection.
  4. For VPC connection name, enter a name (for example, snowflake-privatelink-vpc-peering).
  5. For Subnet, choose a subnet ID that has a route table with a peering connection to the requester VPC where the AWS PrivateLink connection resides.
  6. For Security group ID, enter the ID of the security group created earlier.
  7. For DNS resolver endpoints, enter the two IPs for the inbound resolver endpoint you created.
  8. Choose Create.

Set up a Snowflake data source in QuickSight through the VPC

To set up a Snowflake data source in QuickSight, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. Enter a data source name (for example, snowflake-dataset).
  5. Choose the VPC connection you created (snowflake-privatelink-vpc-peering).
  6. For Database server, enter the privatelink-account-url.
  7. For Database name, enter the name of your database.
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Upon successful validation, choose Create data source.

For steps to create a dashboard, see the earlier section, Create your first QuickSight dashboard.

In the next section, we cover a similar network configuration, with the difference being that we use cross-Region VPC peering.

QuickSight connection to Snowflake via AWS PrivateLink and VPC peering across Regions

In this section, we show you how to connect to Snowflake with QuickSight over AWS PrivateLink with two VPCs peered across Regions.

We refer to Regions generically throughout this post, denoting the Region that has the Snowflake AWS PrivateLink connection as Region A and the Region in which QuickSight is set up as Region B.

The following diagram illustrates our solution architecture.

Set up VPC peering between two Regions

First, we create the VPC peering connection from the requesting VPC.

  1. Navigate to the Peering connections page on the Amazon VPC console in Region B (the Region in which you plan to use QuickSight to deploy dashboards).
  2. Choose Create peering connection.
  3. In the Select a local VPC to peer with section, for VPC (Requester), choose the VPC in which you have connected or intend to connect QuickSight.

  1. For Select another VPC to peer with, select My account and Another Region.
  2. Choose the Region in which your Snowflake AWS PrivateLink connection exists.
  3. For VPC ID (Accepter), enter the VPC ID for the VPC in which your Snowflake AWS PrivateLink exists.

  1. Choose Create peering connection.
  2. Copy the VPC peering connection ID so we can easily locate it in the next steps (it looks like pcx-xxxxxxxxxxxx).

Next, we accept the VPC peering connection from the Region in which you created your AWS PrivateLink connection.

  1. Navigate to the Amazon VPC console in Region A (where your Snowflake AWS PrivateLink connection exists).
  2. Search for and select the peering connection you created.
  3. On the Actions menu, choose Accept Request.

  1. Review the information about the request. If everything looks correct, choose Yes, Accept.

Next, we configure DNS to resolve between the two VPCs.

  1. On the Peering connections page of the Amazon VPC console, choose your newly created VPC peering connection.
  2. On the DNS tab, check if the two options show Disabled.

If they’re enabled, skip to the steps on creating route tables.

  1. On the Actions menu, choose Edit DNS settings.

This requires your VPC to have DNS host name and resolution enabled.

  1. Select both check boxes to allow DNS to resolve from both the accepter and requester VPCs.
  2. Choose Save.

Next, we create the route table entry to allow for routes to propagate between the two VPCs for Region B.

  1. Navigate to the Amazon VPC console in Region B (the Region in which you plan to use QuickSight to deploy dashboards).
  2. In the navigation pane, choose Route tables.
  3. Select the route tables in your requesting VPC.
  4. On the Route tab, choose Edit routes.
  5. Add a route for the CIDR block that your peered VPC uses (for this post, 10.0.0.0/16 is the CIDR block for the VPC in which the Snowflake AWS PrivateLink connection resides).
  6. Choose Save routes.

Next, create the route table entry to allow for routes to propagate between the two VPCs for Region A.

  1. Navigate to the Amazon VPC console in Region A (where your Snowflake AWS PrivateLink connection exists).
  2. Repeat the previous steps, using the CIDR block for the peered VPC (in this post, 172.16.0.0/16).

Configure DNS in the VPC in Region B

First, we need to associate the VPC in Region B (where you deploy QuickSight) with the same private hosted zone as the VPC in Region A where your Snowflake AWS PrivateLink connection exists (<region>.privatelink.snowflakecomputing.com).

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Select the private hosted zone <region>.privatelink.snowflakecomputing.com and choose Edit.
  3. In the VPCs to associate with the hosted zone section, choose Add VPC.
  4. Choose the Region and VPC ID associated with the accepter VPC.

  1. Choose Save changes.

Configure the Route 53 resolver inbound endpoint for your VPC in Region B

To configure the resolver inbound endpoint in Region B, complete the following steps:

  1. On the Security groups page on the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-doc-route53-resolver-sg) and a description.
  3. Choose the VPC ID used in previous steps.
  4. Create rules that allow for DNS (port 53) over UDP and TCP from within the VPC CIDR block (for this post, 172.16.0.0/16).

  1. Choose Create security group.
  2. Take note of the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.

Next, we set up the Route 53 inbound endpoint for this VPC.

  1. On the Route 53 console, choose Inbound endpoint in the navigation pane.
  2. Choose Create inbound endpoint.
  3. Enter a name for the endpoint (for example, quicksight-inbound-resolver).
  4. For VPC in the Region, choose the VPC ID used in previous steps.
  5. For Security group, choose the security group ID from the previous step.
  6. In the IP Address section, select two Availability Zones and subnets, and leave Use an IP address that is selected automatically selected.
  7. Choose Submit.
  8. Choose the inbound endpoint after it’s created.
  9. After the inbound endpoint has provisioned, note the two IP addresses for the resolvers.

Connect the VPC to QuickSight in Region B

To start, we need to create a security group for QuickSight to allow traffic to the Route 53 resolver inbound endpoints, the VPC endpoint for AWS PrivateLink, and traffic within the local network.

  1. On the Security groups page of the Amazon VPC console in Region B, choose Create security group.
  2. Enter a name (for example, quicksight-snowflake-sg) and a description.
  3. Choose the VPC ID for the VPC where you previously created the VPC peering connection.
  4. Create the following ingress rules:
    1. One rule for the local network for all TCP ports (for example, 172.16.0.0/16).
    2. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all TCP ports.
    3. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all UDP ports.
    4. One rule allowing traffic for all TCP ports to the CIDR block for the VPC located in Region A, where your Snowflake AWS PrivateLink connection exists (for this post, 10.0.0.0/16).

As discussed earlier, security groups for QuickSight are different from other security groups. You must explicitly allow return traffic from the targeted security group, and the inbound rule in your security group must allow traffic on all ports. For more information, see Inbound Rules.

Next, we modify the security group for the Route 53 resolver inbound endpoint in Region B to allow traffic from the security group we created.

  1. On the Security groups page, search for the security group ID used for the Route 53 resolver inbound endpoint.
  2. Choose Edit inbound rules.
  3. Add rules for both DNS over UDP and DNS over TCP, using the CIDR block for the VPC in Region B (for this post, 172.16.0.0/16).

  1. Choose Save rules.

Next, we need to modify the security group we’re using for the AWS PrivateLink connection.

  1. Navigate to the Security groups page on the Amazon VPC console in Region A.
  2. Search for the security group ID that is used for the VPC endpoint for the AWS PrivateLink connection.
  3. Choose Edit inbound rules.
  4. Add rules for both HTTPS and HTTP, using the CIDR Block for the VPC in Region B as the source (for this post, 172.16.0.0/16).

  1. Choose Save rules.

Finally, we set up the QuickSight VPC connection.

  1. Navigate to the QuickSight console in Region B.
  2. Choose the user name and choose Manage QuickSight.
  3. In the navigation pane, choose Manage VPC connections.
  4. Choose Add a VPC connection.
  5. For VPC connection name, enter a connection name (for example, snowflake-privatelink-cross-region).
  6. For VPC ID, choose the VPC ID of the VPC in Region B.
  7. For Subnet, choose a subnet ID from the VPC in Region B that has a route table with a peering connection to the VPC where the AWS PrivateLink connection resides.
  8. For Security group ID, enter the ID of the security group you created.
  9. For DNS resolver endpoints, enter the two IPs for the inbound resolver endpoint created earlier.

  1. Choose Create.

Set up a Snowflake data source in QuickSight through the VPC

To set up a Snowflake data source in QuickSight, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. Enter a name for your data source (for example, snowflake-dataset).
  5. Choose the VPC connection you created (snowflake-privatelink-cross-region).
  6. For Database server, enter the privatelink-account-url.
  7. For Database name, enter the name of your database.
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Upon successful validation, choose Create data source.

For steps to create a dashboard, see the earlier section, Create your first QuickSight dashboard.

For our last configuration, we cover how to set up a QuickSight connection to Snowflake without AWS PrivateLink.

QuickSight connection to Snowflake without AWS PrivateLink

In this section, we show you how to connect to Snowflake with QuickSight without using AWS PrivateLink.

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. Enter a data source name (for example, snowflake-dataset).
  5. Leave the connection type as Public network.
  6. For Database name, enter the name of your database.
  7. For your database server, enter the URL you use to log in to your Snowflake account (xxxxxxxx.snowflakecomputing.com).
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Choose Create data source.

For steps to create a dashboard, see the earlier section, Create your first QuickSight dashboard.

Clean up

If your work with QuickSight, Snowflake, and AWS PrivateLink is complete, remove your Route 53 resolver inbound endpoint, Route 53 private hosted zone, and the VPC endpoint for Snowflake in order to avoid incurring additional fees.

Conclusion

In this post, we covered four ways to connect QuickSight to Snowflake as a data source: over AWS PrivateLink directly, over AWS PrivateLink with VPC peering in the same Region, over AWS PrivateLink with VPC peering across Regions, and over the public network without AWS PrivateLink.

After you set up the data source, you can gain further insights from your data by setting up ML Insights in QuickSight, creating graphical representations of your data using QuickSight visuals, or joining data from multiple datasets, along with all other QuickSight features.


About the Author

Maxwell Moon is a Senior Solutions Architect at AWS working with Independent Software Vendors (ISVs) to design and scale their applications on AWS. Outside of work, Maxwell is a dad to two cats, is an avid supporter of the Wolverhampton Wanderers Football Club, and is patiently waiting for a new wave of ska music.

 

 

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS with over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions, and has led engineering teams in designing and implementing data analytics platforms and data products.

Enable federation to multiple Amazon QuickSight accounts with Microsoft Azure Active Directory

Post Syndicated from Srikanth Baheti original https://aws.amazon.com/blogs/big-data/enable-federation-to-multiple-amazon-quicksight-accounts-with-microsoft-azure-active-directory/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning (ML)-powered business intelligence (BI) service built for the cloud that supports identity federation in both Standard and Enterprise editions. Organizations are working towards centralizing their identity and access strategy across all of their applications, including on-premises, third-party, and applications on AWS. Many organizations use Microsoft Azure Active Directory (Azure AD) to control and manage user authentication and authorization centrally. If your organization uses Azure AD for cloud applications and multiple QuickSight accounts, you can enable federation to all of your QuickSight accounts without needing to create and manage users multiple times. This authorizes users to access QuickSight assets—analyses, dashboards, folders, and datasets—through centrally managed Azure AD.

In this post, we go through the steps to configure federated single sign-on (SSO) between a single Azure AD instance and multiple QuickSight accounts. We demonstrate registering an SSO application in Azure AD, creating roles in Azure AD, and assigning these roles to map to QuickSight roles (admin, author, and reader). These QuickSight roles represent three different personas supported in QuickSight. Administrators can publish the QuickSight app in the Azure App portal to enable users to SSO to QuickSight using their Azure AD credentials.

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  • An Azure AD subscription
  • One or more QuickSight account subscriptions

Solution overview

The walkthrough includes the following steps:

  1. Register an AWS Single Sign-On (AWS SSO) application in Azure AD.
  2. Configure the application in Azure AD.
  3. Add Azure AD as your SAML identity provider (IdP) in AWS.
  4. Configure AWS Identity and Access Management (IAM) policies.
  5. Configure IAM roles.
  6. Create roles in Microsoft Graph Explorer.
  7. Assign the newly created roles through Graph Explorer to users in Azure AD.
  8. Test the application from Azure AD.

Register an AWS SSO application in Azure AD

To configure the integration of an AWS SSO application in Azure AD, you need to add AWS SSO to your list of managed software as a service (SaaS) apps.

  1. Sign in to the Azure portal using a Microsoft account.
  2. Under Azure services, choose Azure Active Directory.

  1. In the navigation pane, under Manage, choose Enterprise Applications.

  1. Choose All applications.
  2. Choose New application.

  1. In the Browse Azure AD Gallery section, search for AWS Single Sign-On.
  2. Choose AWS Single Sign-On from the results panel and add the application.

  1. For Name, enter Amazon QuickSight.
  2. After the application is created, copy the Object ID value from the application overview.

You need this object ID in the later steps.

Configure an AWS SSO application in Azure AD

Follow these steps to enable Azure AD SSO in the Azure portal.

  1. In the Azure portal, on the AWS SSO application registered in the first step, in the Manage section, choose single sign-on.
  2. On the Select a single sign-on method page, choose SAML.
  3. Choose the pencil icon.
  4. For Identifier (Entity ID), enter URN:AMAZON:WEBSERVICES.
  5. For Reply URL, enter https://signin.aws.amazon.com/saml.
  6. Leave Sign on URL blank.
  7. For Relay State, enter https://quicksight.aws.amazon.com.
  8. Leave Logout URL blank.
  9. Choose Save.

  1. On the Set up Single Sign-On with SAML page, under User Attributes & Claims, choose Edit.

  1. In the Additional Claims section, configure SAML token attributes by using the values in the following table.
Name | Source attribute | Namespace
RoleSessionName | user.userprincipalname | https://aws.amazon.com/SAML/Attributes
Role | user.assignedroles | https://aws.amazon.com/SAML/Attributes
SessionDuration | Provide a value from 900 seconds (15 minutes) to 43,200 seconds (12 hours) | https://aws.amazon.com/SAML/Attributes
  1. In the SAML Signing Certificate section, choose Download to download the federation metadata XML file.

You use this XML document later when setting up the SAML provider in IAM.

Add Azure AD as your SAML IdP in AWS

To configure Azure AD as your SAML IdP, complete the following steps:

  1. Open a new tab in your browser.
  2. Sign in to the IAM console in your AWS account with admin permissions.
  3. On the IAM console, under Access Management in the navigation pane, choose Identity providers.
  4. Choose Add provider.

  1. For Provider name, enter AzureActiveDirectory.
  2. Choose Choose file to upload the metadata document you downloaded in the earlier step.
  3. Choose Add provider.

  1. In the banner message that appears, choose View provider.

  1. Copy the ARN to use in a later step.

  1. Repeat these steps in other accounts where you want to enable SSO.
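
If you're repeating this step across several accounts, the provider can also be created with the IAM API. This minimal sketch assumes the federation metadata XML was downloaded to the local file shown.

import boto3

iam = boto3.client("iam")

with open("azure-ad-federation-metadata.xml") as f:
    metadata = f.read()

provider = iam.create_saml_provider(
    SAMLMetadataDocument=metadata,
    Name="AzureActiveDirectory",
)
print(provider["SAMLProviderArn"])  # save this ARN for the role trust policy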

Configure IAM policies

In this step, you create three IAM policies for mapping to three different roles with permissions in QuickSight (admin, author, and reader).

Use the following steps to set up the QuickSight-Admin-Account1 policy. This policy grants admin privileges in QuickSight to the federated user.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. Choose JSON and replace the existing text with the code from the following table for QuickSight-Admin-Account1.
Policy name: QuickSight-Admin-Account1

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "quicksight:CreateAdmin",
            "Resource": "*"
        }
    ]
}

Policy name: QuickSight-Author-Account1

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "quicksight:CreateUser",
            "Resource": "*"
        }
    ]
}

Policy name: QuickSight-Reader-Account1

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "quicksight:CreateReader",
            "Resource": "*"
        }
    ]
}

  1. Choose Review policy.
  2. For Name, enter QuickSight-Admin-Account1.
  3. Choose Create policy.
  4. Repeat the steps for QuickSight-Author-Account1 and QuickSight-Reader-Account1.
  5. Repeat these steps in other accounts where you want to enable SSO.
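
To avoid repeating the console steps in every account, the three policies above can also be created with a short script against the IAM API, for example:

import json

import boto3

iam = boto3.client("iam")

POLICIES = {
    "QuickSight-Admin-Account1": "quicksight:CreateAdmin",
    "QuickSight-Author-Account1": "quicksight:CreateUser",
    "QuickSight-Reader-Account1": "quicksight:CreateReader",
}

for name, action in POLICIES.items():
    iam.create_policy(
        PolicyName=name,
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{"Effect": "Allow", "Action": action, "Resource": "*"}],
        }),
    )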

Configure IAM roles

Next, create the roles that your Azure AD users assume when federating into QuickSight. Use the following steps to set up the admin role:

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created earlier (AzureActiveDirectory).
  5. Select Allow programmatic and AWS Management Console access.
  6. For Attribute, choose SAML:aud.
  7. For Value, enter https://signin.aws.amazon.com/saml.

  1. Choose Next: Permissions.
  2. Choose the QuickSight-Admin-Account1 IAM policy you created in the previous step.
  3. Choose Next: Tags.
  4. Choose Next: Review.
  5. For Role name, enter QuickSight-Admin-Role.
  6. For Role description, enter a description.
  7. Choose Create role.

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Choose the QuickSight-Admin-Role role you created to open the role’s properties.
  3. Copy the role ARN to the notepad.
  4. On the Trust Relationships tab, choose Edit Trust Relationship.
  5. Under Trusted Entities, verify that the IdP you created is listed.
  6. Under Conditions, verify that SAML:aud with a value of https://signin.aws.amazon.com/saml is present.

  1. Repeat these steps to create your author and reader roles and attach the appropriate policies:
    1. For QuickSight-Author-Role, use the policy QuickSight-Author-Account1.
    2. For QuickSight-Reader-Role, use the policy QuickSight-Reader-Account1.
  2. Repeat these steps in other accounts where you want to enable SSO.
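
The admin role and its SAML trust policy can likewise be created with the IAM API. The account ID below is a placeholder; the trust policy mirrors the SAML:aud condition you verified in the console.

import json

import boto3

iam = boto3.client("iam")
account_id = "111122223333"  # placeholder AWS account ID

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": f"arn:aws:iam::{account_id}:saml-provider/AzureActiveDirectory"},
        "Action": "sts:AssumeRoleWithSAML",
        "Condition": {"StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}},
    }],
}

iam.create_role(
    RoleName="QuickSight-Admin-Role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Admin role assumed by Azure AD federated users",
)
iam.attach_role_policy(
    RoleName="QuickSight-Admin-Role",
    PolicyArn=f"arn:aws:iam::{account_id}:policy/QuickSight-Admin-Account1",
)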

Create roles in Microsoft Graph Explorer

Optionally, you can create roles within Azure AD with paid subscriptions. Open Microsoft Graph Explorer, and then do the following:

  1. Sign in to the Microsoft Graph Explorer site with the domain account for your tenant.

You need sufficient permissions to create the roles.

  1. To grant permissions, choose the ellipsis (three dots) next to your name and choose Select permissions.

  1. On the Permission list, expand Directory.
  2. Select the three directory-level permissions as shown in the following screenshot and choose Consent.

  1. Sign in to Graph Explorer again, and accept the site usage conditions.
  2. Choose GET for the method, and 1.0 for the version.
  3. In the query box, enter https://graph.microsoft.com/v1.0/servicePrincipals/<objectId> (use the object ID you saved earlier).
  4. In the Response preview pane, copy the response to an editor of your choice to modify.

  1. Extract the appRoles property from the service principal object.

Now you generate new roles for your application. These roles must match the IAM roles in AWS that you created earlier.

  1. From the notepad, use the format <Role ARN>, <IdP ARN> to create your roles:
    1. arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Admin-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory
    2. arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Author-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory
    3. arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Reader-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory
    4. arn:aws:iam::0xxxxxxxxxx2:role/QS-Admin-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2
    5. arn:aws:iam::0xxxxxxxxxx2:role/QS-Author-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2
    6. arn:aws:iam::0xxxxxxxxxx2:role/QS-Reader-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2
  2. The following JSON code is an example of the appRoles property. Create a similar object to add the roles for your application:
            "appRoles": [
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "User",
                    "displayName": "User",
                    "id": "8774f594-1d59-4279-b9d9-59ef09a23530",
                    "isEnabled": true,
                    "origin": "Application",
                    "value": null
                },
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "msiam_access",
                    "displayName": "msiam_access",
                    "id": "e7f1a7f3-9eda-48e0-9963-bd67bf531afd",
                    "isEnabled": true,
                    "origin": "Application",
                    "value": null
                },
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "Raji Quicksight Admin",
                    "displayName": "RajiQSAdmin",
                    "id": "9a07d03d-667f-405d-b5d7-68bec5b64584",
                    "isEnabled": true,
                    "origin": "ServicePrincipal",
                    "value": "arn:aws:iam::0xxxxxxxxxx2:role/QS-Admin-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2"
                },
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "Sri Quicksight Admin",
                    "displayName": "SriQSAdmin",
                    "id": "77dd76d1-f897-4093-bf9a-8f3aaf25f30e",
                    "isEnabled": true,
                    "origin": "ServicePrincipal",
                    "value": "arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Admin-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory"
                }
            ]

New roles must be followed by msiam_access for the patch operation. You can also add multiple roles, depending on your organization’s needs. Azure AD sends the value of these roles as the claim value in the SAML response.

When adding new roles, you must provide a new GUID for each ID attribute in the JSON payload. You can use an online GUID generation tool to generate a new unique GUID for each role.

  1. In Microsoft Graph Explorer, change the method from GET to PATCH.
  2. Patch the service principal object with the roles you want by updating the appRoles property, like the one shown in the preceding example.
  3. Choose Run Query to run the patch operation. A success message confirms the creation of the role for your AWS application.
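
If you prefer to script this update instead of using Graph Explorer, you can issue the same patch with any HTTP client against the Microsoft Graph API. The following curl sketch assumes you have a valid access token with the directory permissions granted earlier, and that the file approles-patch.json contains an appRoles object like the preceding example.

curl -X PATCH "https://graph.microsoft.com/v1.0/servicePrincipals/<objectId>" \
     -H "Authorization: Bearer <access-token>" \
     -H "Content-Type: application/json" \
     --data @approles-patch.json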

After the service principal is patched with new roles, you can assign users and groups to their respective roles.

  1. In the Azure portal, go to the QuickSight application you created and choose Users and Groups.
  2. Create your groups.

We recommend creating a new group for every AWS role in order to assign a particular role to the group. This one-to-one mapping means that one group is assigned to one role. You can then add members to the group.

  1. After you create the groups, choose the group and assign it to the application.

Nested groups are not allowed.

  1. To assign the role to the group, choose the role, then choose Assign.

Test the application

In this section, you test your Azure AD SSO configuration by using Microsoft Applications.

  1. Navigate to Microsoft Applications.
  2. On the My Apps page, choose AWS Single Sign-On.

  1. Choose a specific role for the QuickSight account you want to use.

You’re redirected to the QuickSight console.

Summary

This post provided step-by-step instructions to configure federated SSO between a single Azure AD instance and multiple QuickSight accounts. We also discussed how to create new roles and map users and groups in Azure AD to IAM for secure access into multiple QuickSight accounts.

For information about federating from Azure AD to a single QuickSight account, see Enabling Amazon QuickSight federation with Azure AD.


About the Authors

 

Srikanth Baheti is a Specialized World Wide Sr. Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high traffic web applications, highly scalable and maintainable data pipelines for reporting platforms using AWS services and Serverless computing.

 

Raji Sivasubramaniam is a Specialist Solutions Architect at AWS, focusing on Analytics. Raji has 20 years of experience in architecting end-to-end Enterprise Data Management, Business Intelligence and Analytics solutions for Fortune 500 and Fortune 100 companies across the globe. She has in-depth experience in integrated healthcare data and analytics with wide variety of healthcare datasets including managed market, physician targeting and patient analytics. In her spare time, Raji enjoys hiking, yoga and gardening.

 

Padmaja Suren is a Senior Solutions Architect specialized in QuickSight. She has 20+ years of experience in building scalable data platforms for Reporting, Analytics and AI/ML using a variety of technologies. Prior to AWS, in her role as BI Architect at ERT, she designed, engineered and cloud-enabled the BI and Analytics platform for the management of large scale clinical trial data conducted across the world. She dedicates her free time on her passion project SanghWE which helps sexual trauma survivors in developing nations heal and recover.

Migrate Amazon QuickSight across AWS accounts

Post Syndicated from Abhinav Sarin original https://aws.amazon.com/blogs/big-data/migrate-amazon-quicksight-across-aws-accounts/

This blog post is co-written by Glen Douglas and Alex Savchenko from Integrationworx.

Enterprises that follow an Agile software development lifecycle (SDLC) process for their dashboard development and deployment typically have distinct environments for development, staging, QA and test, and production. One recommended approach when developing on AWS is to create multiple AWS accounts corresponding to the various environments. Amazon QuickSight is a fully managed, serverless business intelligence service offered by AWS for building interactive dashboards. With QuickSight, you can share dashboards with your internal users, or embed dashboards into your applications for external users or customers, scaling to hundreds of thousands of users with no servers or infrastructure to maintain. A QuickSight account corresponds to the underlying AWS account, so when dashboards are created in the development environment, you need to migrate these assets to a higher environment to align with current DevOps practices. This requires cross-account dashboard migration. This post outlines the steps involved in migrating dashboards from one account to another.

Solution overview

The following diagram shows the architecture of how QuickSight accounts are mapped to AWS accounts. In this post, we outline the steps involved in migrating QuickSight assets in the dev account to the prod account.

Migrating QuickSight dashboards from a dev account to a prod account involves converting the underlying dev dashboard assets into JSON and then recreating them in prod. QuickSight provides a robust set of APIs to create QuickSight assets, such as dashboards, analysis, datasets, and themes. You can run these APIs via the AWS Command Line Interface (AWS CLI) or through SDKs available for various programming languages. For this post, we use the AWS CLI for illustration; for regular usage, we recommend you implement this with the AWS SDK.

To set up the AWS CLI, see Developing Applications with the QuickSight API.

The following diagram illustrates the high-level steps to migrate QuickSight assets from one account to another. First, we need to prepare the dataset and data sources, and create the analysis, template, and dashboards in the dev account.

 Next, we use the dev template with the test dataset and data sources to promote the dashboards.

You can create any QuickSight entity programmatically using APIs (for example, create-data-set and create-data-source). Other APIs generate a JSON representation of an existing entity (such as describe-data-set and describe-data-source). The analysis entity is an exception; as of this writing, there are no APIs for creating an analysis or generating its JSON representation. An analysis must be created using the AWS Management Console the first time. From that point on, you can manage the analysis programmatically using templates.

Definitions

The following table defines the parameters used in this post.

Environment Name   Demo Reference    AWS Account ID   QuickSight Account Name
Source (Dev)       Source account    31********64     QS-Dev
Target (Test)      Target account    86********55     QS-Test

The following figure summarizes the QuickSight objects and AWS Identity and Access Management (IAM) resources that are referenced in the migration between accounts. The QuickSight objects from the source account are referenced and recreated in the target account.

The following table summarizes the QuickSight objects used in this post for migration from the source account (dev) to the target account (test). As part of the migration, the name of the data source is changed in order to denote the change in underlying data used in the target environment. As we demonstrate in this post, a QuickSight template is created in the source account only. This template contains the information about the source dashboard that is to be created in the target account.

QuickSight Object Type   Object Name (Source)                    Object Name (Target)
Data Source              QS_Dev                                  QS_Test
Dataset                  sporting_event_ticket_info              sporting_event_ticket_info
Dashboard                Sporting_event_ticket_info_dashboard    Sporting_event_ticket_info_dashboard
Template                 sporting_event_ticket_info_template     (created in the source account only)

For this post, we use an Amazon RDS for PostgreSQL database as the dataset and create a QuickSight visualization using the database table sporting_event_ticket_info. You can use any of the data sources that QuickSight supports or easily test this process using a spreadsheet. The dashboard to be migrated from the development environment shows data from the corresponding dev database (QS_Dev).

Prerequisites

Consider the situation in which you need to migrate a QuickSight dashboard from a source (or dev) environment to a target (or test) environment. The migration requires the following prerequisites:

  • Two AWS accounts (source and target), each with an active QuickSight subscription
  • Permissions to create IAM policies, roles, and users in both accounts
  • The AWS CLI installed and configured to call the QuickSight APIs

Step 1: Enable permissions to migrate QuickSight objects

To facilitate using AWS CLI commands and sharing QuickSight templates from the source account to the target account, we perform the following actions in the source environment:

1a) Create an IAM policy.

1b) Create a new IAM group and attach the policy.

1c) Create a new IAM user and assign it to the group.

1d) Invite the IAM user created to the QuickSight (dev) account.

1e) Create another reader policy in the source (dev) account to grant access to the target (test) account.

1f) Create a deployer role in the source account.

Step 1a: Create an IAM policy

You start by creating IAM resources. For this post, we create an IAM policy called Dev-QuickSightAdmin, which is used by IAM users and roles in the source account. The purpose of this policy is to allow a user in the source account to perform various actions on QuickSight objects.

  1. To get started, as the admin user, sign in to the IAM console.
  2. In the navigation pane, choose Policies.
  3. Choose Create policy.
  4. In the Service section, select Choose a service and choose QuickSight.
  5. Under Actions, select the following permissions:
    1. List – ListAnalysis, ListDashboards, ListDataSets, ListDataSources, ListTemplates
    2. Read – DescribeAnalysis, DescribeDashboard, DescribeDataSet, DescribeDataSource, DescribeTemplate
    3. Write – CreateTemplate, UpdateTemplate
    4. Permissions management – DescribeTemplatePermissions, UpdateTemplatePermissions
  6. Select the appropriate resources.

You can restrict the resources and Region based on your requirement. We allow all resources for this post.

  1. Review and create the policy.

Step 1b: Create a new IAM group

Create a new IAM group Dev-grp-QuickSightAdmin and assign the Dev-QuickSightAdmin policy (from Step 1a) to the group in the source account.

Step 1c: Create a new IAM user

Create a new IAM user called Dev-qs-admin-user and assign it to the Dev-grp-QuickSightAdmin group. You use this user later to run the AWS CLI commands in the source account. Alternatively, you can use an existing IAM user for this purpose.
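
If you are scripting the setup, the equivalent AWS CLI commands are shown in the following sketch:

aws iam create-user --user-name Dev-qs-admin-user

aws iam add-user-to-group \
    --user-name Dev-qs-admin-user \
    --group-name Dev-grp-QuickSightAdmin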

Step 1d: Invite the IAM user to the QuickSight (dev) account

Sign in to QuickSight in the source (dev) account and invite the user from Step 1c to QuickSight. Assign the role of ADMIN and for IAM user, choose Yes to indicate that this is an IAM user.
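
Alternatively, you can register the IAM user with QuickSight through the API. The following sketch uses the register-user command; the email address is a placeholder, and the account ID should be that of the source (dev) account.

aws quicksight register-user \
    --aws-account-id 31********64 \
    --namespace default \
    --identity-type IAM \
    --iam-arn arn:aws:iam::31********64:user/Dev-qs-admin-user \
    --user-role ADMIN \
    --email dev-qs-admin@example.com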

Step 1e: Create another reader policy in the source (dev) account

In the source (dev) account, create another IAM policy, called Dev-QuickSightReader, to grant access to the target (test) account. The purpose of this policy is to allow the target account to perform list and read actions on QuickSight objects in the source account.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. In the Service section, select Choose a service and choose QuickSight.
  4. Under Actions, make sure that All QuickSight actions is not selected.
  5. Under Access level, select List and Read.
  6. Select the following permissions:
    1. List – ListAnalysis, ListDashboards, ListDataSets, ListDataSources, ListTemplates
    2. Read – DescribeAnalysis, DescribeDashboard, DescribeDataSet, DescribeDataSource, DescribeTemplate
  7. Review and create the policy.

Verify the reader IAM policy Dev-QuickSightReader shows only the list and read access level for QuickSight services when complete.

Step 1f: Create a deployer role in the source account (dev)

You now create an IAM role called Dev-QuickSight-Deployer in the source account (dev). This role is specifically assigned to the target account ID and assigned the QuickSightReader policy, as noted in the previous step. This allows the external AWS target (test) account to read the QuickSight template contained in the source (dev) account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Another AWS account and provide the account ID of the target account.
  4. In the Attach permissions policies section, attach the Dev-QuickSightReader policy.
  5. In the Add tags section, add any tags (optional).
  6. Choose Next.
  7. In the Review section, assign a role name (we use Dev-QuickSight-Deployer).
  8. Enter a role description.
  9. Choose Create role.

You have completed the creation of the policy, group, and user in the source account. The next step is to configure the permissions in the target account.

  1. Switch from the source account to the target account.
  2. In the target account, sign in to the IAM console and repeat the steps in this section to create the Test-QuickSightAdmin policy and Test-grp-QuickSightAdmin group, and assign the policy to the group. Test-QuickSightAdmin should have the following permissions:
    1. List – ListAnalysis, ListDashboards, ListDataSets, ListDataSources, ListTemplates
    2. Read – DescribeAnalysis, DescribeDashboard, DescribeDataSet, DescribeDataSource, DescribeTemplate
    3. Write – CreateTemplate, UpdateTemplate, CreateDataSource, CreateDashboard, UpdateDataSet
    4. Permissions management – DescribeTemplatePermissions, UpdateTemplatePermissions
    5. Tag – TagResource, UntagResource
  1. Create the IAM user Test-qs-admin-user and add it to the Test-grp-QuickSightAdmin group
  2. Sign in to QuickSight and invite the test user.

One final step is to add Test-qs-admin-user as a trusted entity in the Dev-QuickSight-Deployer role, because the target account needs cross-account access to the template in the source account. We use the IAM user Test-qs-admin-user to create the dashboard in the test account.

  1. Switch back to the source (dev) account.
  2. Select role and search for the Dev-QuickSight-Deployer role.
  3. Choose Trust relationships.
  4. Edit the trust relationship and add the following ARN in the Principal section of the policy: arn:aws:iam::86XXXXX55:user/Test-qs-admin-user.
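
After the edit, the trust policy of the Dev-QuickSight-Deployer role should look similar to the following sketch (account IDs are masked; the root principal for the target account was added when the role was created). If you prefer the AWS CLI, you can apply the same document with update-assume-role-policy; the file name trust-policy.json is an assumption.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::86XXXXX55:root",
                    "arn:aws:iam::86XXXXX55:user/Test-qs-admin-user"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

aws iam update-assume-role-policy \
    --role-name Dev-QuickSight-Deployer \
    --policy-document file://trust-policy.json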

Step 2: Prepare QuickSight objects in the source account

To migrate QuickSight objects to a target environment, we perform the following actions in the source environment:

2a) Prepare the data source file.

2b) Prepare the dataset file.

2c) Create a dashboard template.

Step 2a: Prepare the data source file

Data sources represent connections to specific sources of data and are made available within a QuickSight account. Multiple data source types are supported within QuickSight, including a variety of relational databases, flat files, JSON semi-structured data files, and software as a service (SaaS) data providers.

Any QuickSight dashboards and datasets to be migrated to the target environment must have their corresponding data sources also migrated within the target environment.

In this step, you create a JSON file with the data source information from the source environment. Then you use the create-data-source AWS CLI command in the target environment with the JSON file as input. 

  1. First, identify the data source for your implementation by running the list-data-sources command in the source (dev) account: 
aws quicksight list-data-sources --aws-account-id 31********64

The following code shows the top portion of the command output:

{
    "Status": 200,
    "DataSources": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:31********64:datasource/4b98fee7-4df1-4dc2-8ca3-115c1c1839ab"",
            "DataSourceId": "4b98fee7-4df1-4dc2-8ca3-115c1c1839ab",
            "Name": "QS_Dev",
            "Type": "POSTGRESQL",
            "Status": "CREATION_SUCCESSFUL",
            "CreatedTime": "2020-12-16T20:42:37.280000-08:00",
            "LastUpdatedTime": "2020-12-16T20:42:37.280000-08:00",
            "DataSourceParameters": {
                "RdsParameters": {
                    "InstanceId": "dmslabinstance",
                    "Database": "sportstickets"
                }
            },
            "SslProperties": {
                "DisableSsl": false
            }
        },

  1. Run the describe-data-source command, using the DataSourceId from the previous list-data-source command output.

This command provides details about the data source, which we use to create the data source in the target account.

aws quicksight describe-data-source --aws-account-id 31********64 --data-source-id "4b98fee7-4df1-4dc2-8ca3-115c1c1839ab"

The following is the resulting output:

{
    "Status": 200,
    "DataSource": {
        "Arn": "arn:aws:quicksight:us-east-1:31*******64:datasource/4b98fee7-4df1-4dc2-8ca3-115c1c1839ab",
        "DataSourceId": "4b98fee7-4df1-4dc2-8ca3-115c1c1839ab",
        "Name": "QS_Dev",
        "Type": "POSTGRESQL",
        "Status": "CREATION_SUCCESSFUL",
        "CreatedTime": "2020-12-16T20:42:37.280000-08:00",
        "LastUpdatedTime": "2020-12-16T20:42:37.280000-08:00",
        "DataSourceParameters": {
            "RdsParameters": {
                "InstanceId": "dmslabinstance",
                "Database": "sportstickets"
            }
        },
        "SslProperties": {
            "DisableSsl": false
        }
    },
    "RequestId": "c2455720-118c-442d-9ba3-4f446cb543f1"
}

If you’re migrating more than one data source, you need to repeat the step for each data source.

  1. Use the output from the describe-data-source command to create a JSON file called create-data-source-cli-input.json, which represents the data source that is being migrated.

The contents of the following JSON file reference the data source information (name, host, credentials) for the target environment:

{
    "AwsAccountId": "86********55",
    "DataSourceId": "QS_Test",
    "Name": "QS_Test",
    "Type": "POSTGRESQL",
    "DataSourceParameters": {
        "PostgreSqlParameters": {
            "Host": "dmslabinstance.***********.us-east-1.rds.amazonaws.com",
            "Port": 5432,
            "Database": "sportstickets"
        }
    },
    "Credentials": {
        "CredentialPair": {
            "Username": "xxxxxxx",
            "Password": "yyyyyyy"
        }
    },
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:86********55:user/default/Test-qs-admin-user",
            "Actions": [
                "quicksight:UpdateDataSourcePermissions",
                "quicksight:DescribeDataSource",
                "quicksight:DescribeDataSourcePermissions",
                "quicksight:PassDataSource",
                "quicksight:UpdateDataSource",
                "quicksight:DeleteDataSource"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "QS_Test"
        }
    ]
}

For this post, because the target environment is connecting to the same data source as the source environment, the values in the JSON can simply be provided from the previous describe-data-source command output.

Step 2b: Prepare the dataset file

Datasets provide an abstraction layer in QuickSight, which represents prepared data from a data source in the QuickSight account. The intent of prepared datasets is to enable reuse in multiple analyses and sharing amongst QuickSight users. A dataset can include calculated fields, filters, and changed file names or data types. When based on a relational database, datasets can join tables within QuickSight, or as part of the underlying SQL query used to define the dataset.

The sample dataset sporting_event_ticket_info represents a single table; however, in a relational database, datasets can join tables within QuickSight, or as part of the underlying SQL query used to define the dataset.

Similar to the process used for data sources, you create a JSON file representing the datasets from the source account.

  1. Run the list-data-sets command to get all datasets from the source account:
aws quicksight list-data-sets --aws-account-id 31********64

The following code is the output:

{
            "Arn": "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
            "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
            "Name": "sporting_event_ticket_info",
            "CreatedTime": "2020-12-16T20:45:36.672000-08:00",
            "LastUpdatedTime": "2020-12-16T20:45:36.835000-08:00",
            "ImportMode": "SPICE"
}

  1. Run the describe-data-set command, specifying the DataSetId from the previous command’s response:
aws quicksight describe-data-set --aws-account-id 31********64 --data-set-id "24b1b03a-86ce-41c7-9df7-5be5343ff9d9"

The following code shows the output:

{
    "Status": 200,
    "DataSet": {
        "Arn": "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
        "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
        "Name": "sporting_event_ticket_info",
        "CreatedTime": "2020-12-16T20:45:36.672000-08:00",
        "LastUpdatedTime": "2020-12-16T20:45:36.835000-08:00",
        "PhysicalTableMap": {
            "d7ec8dff-c136-4c9a-a338-7017a95a4473": {
                "RelationalTable": {
                    "DataSourceArn": "arn:aws:quicksight:us-east-1:31********64:datasource/f72f8c2a-f9e2-4c3c-8221-0b1853e920b2",
                    "Schema": "dms_sample",
                    "Name": "sporting_event_ticket_info",
                    "InputColumns": [
                        {
                            "Name": "ticket_id",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "event_id",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "sport",
                            "Type": "STRING"
                        },
                        {
                            "Name": "event_date_time",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "home_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "away_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "location",
                            "Type": "STRING"
                        },
                        {
                            "Name": "city",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_level",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "seat_section",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_row",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat",
                            "Type": "STRING"
                        },
                        {
                            "Name": "ticket_price",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "ticketholder",
                            "Type": "STRING"
                        }
                    ]
                }
            }
        },
        "LogicalTableMap": {
            "d7ec8dff-c136-4c9a-a338-7017a95a4473": {
                "Alias": "sporting_event_ticket_info",
                "DataTransforms": [
                    {
                        "TagColumnOperation": {
                            "ColumnName": "city",
                            "Tags": [
                                {
                                    "ColumnGeographicRole": "CITY"
                                }
                            ]
                        }
                    }
                ],
                "Source": {
                    "PhysicalTableId": "d7ec8dff-c136-4c9a-a338-7017a95a4473"
                }
            }
        },
        "OutputColumns": [
            {
                "Name": "ticket_id",
                "Type": "DECIMAL"
            },
            {
                "Name": "event_id",
                "Type": "INTEGER"
            },
            {
                "Name": "sport",
                "Type": "STRING"
            },
            {
                "Name": "event_date_time",
                "Type": "DATETIME"
            },
            {
                "Name": "home_team",
                "Type": "STRING"
            },
            {
                "Name": "away_team",
                "Type": "STRING"
            },
            {
                "Name": "location",
                "Type": "STRING"
            },
            {
                "Name": "city",
                "Type": "STRING"
            },
            {
                "Name": "seat_level",
                "Type": "DECIMAL"
            },
            {
                "Name": "seat_section",
                "Type": "STRING"
            },
            {
                "Name": "seat_row",
                "Type": "STRING"
            },
            {
                "Name": "seat",
                "Type": "STRING"
            },
            {
                "Name": "ticket_price",
                "Type": "DECIMAL"
            },
            {
                "Name": "ticketholder",
                "Type": "STRING"
            }
        ],
        "ImportMode": "SPICE",
        "ConsumedSpiceCapacityInBytes": 318386511
    },
    "RequestId": "8c6cabb4-5b9d-4607-922c-1916acc9da1a"
}
  1. Based on the dataset description, create a JSON file based on the template file (create-data-set-cli-input-sql.json) with the details listed in the describe-data-set command output:
{

    "AwsAccountId": "86********55",
    "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
    "Name": "person",
    "PhysicalTableMap": {
        "23fb761f-df37-4242-9f23-ba61be20a0df": {
            "RelationalTable": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1: 86********55:datasource/QS_Test",
                "Schema": "dms_sample",
                "Name": " sporting_event_ticket_info ",
                "InputColumns": [

                        {
                            "Name": "ticket_id",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "event_id",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "sport",
                            "Type": "STRING"
                        },
                        {
                            "Name": "event_date_time",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "home_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "away_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "location",
                            "Type": "STRING"
                        },
                        {
                            "Name": "city",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_level",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "seat_section",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_row",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat",
                            "Type": "STRING"
                        },
                        {
                            "Name": "ticket_price",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "ticketholder",
                            "Type": "STRING"
                        }
                    ]
                }
            }
        },
        "LogicalTableMap": {
            "23fb761f-df37-4242-9f23-ba61be20a0df": {
                "Alias": "person",
                "Source": {
                    "PhysicalTableId": "23fb761f-df37-4242-9f23-ba61be20a0df"
                }
            }
        },
    "ImportMode": "SPICE",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1: 86********55:user/default/Test-qs-admin-user",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "QS_Test"
        }
    ]
}

The DataSourceArn should reference the ARN of the existing data source (in this case, the data source created in the previous step).

Step 2c: Create a QuickSight template

A template in QuickSight is an entity that encapsulates the metadata that describes an analysis. A template provides a layer of abstraction from a specific analysis by using placeholders for the underlying datasets used to create the analysis. When we replace dataset placeholders in a template, we can recreate an analysis for a different dataset that follows the same schema as the original analysis.

You can also share templates across accounts. This feature, combined with the dataset placeholders in a template, provides the means to migrate a dashboard from one account to another.

  1. To create a template, begin by using the list-dashboards command to get list of available dashboards in the source environment:
aws quicksight list-dashboards --aws-account-id 31********64

The command output can be lengthy, depending on the number of dashboards in the source environment.

  1. Search for the “Name” of the desired dashboard in the output file and copy the corresponding DashboardId to use in the next step:
{
            "Arn": "arn:aws:quicksight:us-east-1:31********64:dashboard/c5345f81-e79d-4a46-8203-2763738489d1",
            "DashboardId": "c5345f81-e79d-4a46-8203-2763738489d1",
            "Name": "Sportinng_event_ticket_info_dashboard",
            "CreatedTime": "2020-12-18T01:30:41.209000-08:00",
            "LastUpdatedTime": "2020-12-18T01:30:41.203000-08:00",
            "PublishedVersionNumber": 1,
            "LastPublishedTime": "2020-12-18T01:30:41.209000-08:00"
        }
  1. Run the describe-dashboard command for the DashboardId copied in the preceding step:
aws quicksight describe-dashboard --aws-account-id 31********64 --dashboard-id "c5345f81-e79d-4a46-8203-2763738489d1"

The response should look like the following code:

{
    "Status": 200,
    "Dashboard": {
        "DashboardId": "c5345f81-e79d-4a46-8203-2763738489d1",
        "Arn": "arn:aws:quicksight:us-east-1:31********64:dashboard/c5345f81-e79d-4a46-8203-2763738489d1",
        "Name": "Sportinng_event_ticket_info_dashboard",
        "Version": {
            "CreatedTime": "2020-12-18T01:30:41.203000-08:00",
            "Errors": [],
            "VersionNumber": 1,
            "SourceEntityArn": "arn:aws:quicksight:us-east-1:31*******64:analysis/daddefc4-c97c-4460-86e0-5b488ec29cdb",
            "DataSetArns": [
                "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9"
            ]
        },
        "CreatedTime": "2020-12-18T01:30:41.209000-08:00",
        "LastPublishedTime": "2020-12-18T01:30:41.209000-08:00",
        "LastUpdatedTime": "2020-12-18T01:30:41.203000-08:00"
    },
    "RequestId": "be9e60a6-d271-4aaf-ab6b-67f2ba5c1c20"
}
  1. Use the details obtained from the describe-dashboard command to create a JSON file based on the file create-template-cli-input.json.

The following code represents the input for creating a QuickSight template:

 {
    "AwsAccountId": "31********64",
    "TemplateId": "Sporting_event_ticket_info_template",
    "Name": "Sporting event ticket info template",
    "SourceEntity": {
        "SourceAnalysis": {
            "Arn": "arn:aws:quicksight:us-east-1:31********64:analysis/daddefc4-c97c-4460-86e0-5b488ec29cdb",
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "TicketInfo",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9"
                }
            ]
        }
    },
    "VersionDescription": "1"
}
  1. Run the create-template command to create a template object based on the file you created.

For example, the JSON file named create-template-cli-input.json would be run as follows:

aws quicksight create-template --cli-input-json file://./create-template-cli-input.json

The following is the expected response for the create-template command:

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template",
    "VersionArn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template/version/1",
    "TemplateId": "Sporting_event_ticket_info_template",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "c32e4eb1-ecb6-40fd-a2da-cce86e15660a" 
}

The template is created in the background, as noted by the CreationStatus. Templates aren’t visible within the QuickSight UI; they’re a developer-managed or admin-managed asset that is only accessible via the APIs.
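
Although templates don't appear in the UI, you can list the templates in an account with the CLI, for example:

aws quicksight list-templates --aws-account-id 31********64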

  1. To check status of the template, run the describe-template command:
aws quicksight describe-template --aws-account-id 31********64 --template-id "Sporting_event_ticket_info_template"

The expected response for the describe-template command should indicate a Status of CREATION_SUCCESSFUL:

{
    "Status": 200,
    "Template": {
        "Arn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template",
        "Name": "Sporting event ticket info template",
        "Version": {
            "CreatedTime": "2020-12-18T01:56:49.135000-08:00",
            "VersionNumber": 1,
            "Status": "CREATION_SUCCESSFUL",
            "DataSetConfigurations": [
                {
                    "Placeholder": "TicketInfo",
                    "DataSetSchema": {
                        "ColumnSchemaList": [
                        {
                            "Name": "ticket_id",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "event_id",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "sport",
                            "Type": "STRING"
                        },
                        {
                            "Name": "event_date_time",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "home_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "away_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "location",
                            "Type": "STRING"
                        },
                        {
                            "Name": "city",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_level",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "seat_section",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_row",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat",
                            "Type": "STRING"
                        },
                        {
                            "Name": "ticket_price",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "ticketholder",
                            "Type": "STRING"
                        }
                        ]
                    },
                    "ColumnGroupSchemaList": []
                }
            ],
            "Description": "1",
            "SourceEntityArn": "arn:aws:quicksight:us-east-1:31********64:analysis/daddefc4-c97c-4460-86e0-5b488ec29cdb"
        },
        "TemplateId": "Sporting_event_ticket_info_template",
        "LastUpdatedTime": "2020-12-18T01:56:49.125000-08:00",
        "CreatedTime": "2020-12-18T01:56:49.125000-08:00"
    },
    "RequestId": "06c61098-2a7e-4b0c-a27b-1d7fc1742e06"
}
  1. Take note of the TemplateArn value in the output to use in subsequent steps.
  2. After you verify the template has been created, create a second JSON file (TemplatePermissions.json) and replace the Principal value with the ARN for the target account:
[
{
"Principal": "arn:aws:iam::86*******55:root",
"Actions": ["quicksight:UpdateTemplatePermissions","quicksight:DescribeTemplate"]
}
]
  1. Use this JSON file as the input for the update-template-permissions command, which allows cross-account read access from the source template (source account) to the target account:
aws quicksight update-template-permissions --aws-account-id 31********64 --template-id "Sporting_event_ticket_info_template" --grant-permissions file://./TemplatePermissions.json --profile default

This command permits the target account to view the template in the source account. The expected response for the update-template-permissions command should look like the following code:

{
    "Status": 200,
    "TemplateId": "Sporting_event_ticket_info_template",
    "TemplateArn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template",
    "Permissions": [
        {
            "Principal": "arn:aws:iam::86********55:root",
            "Actions": [
                "quicksight:UpdateTemplatePermissions",
                "quicksight:DescribeTemplate"
            ]
        }
    ],
    "RequestId": "fa153511-5674-4891-9018-a8409ad5b8b2"
}

At this point, all the required work in the source account is complete. The next steps use the AWS CLI configured for the target account.

Step 3: Create QuickSight resources in the target account

To create the data sources and data templates in the target account, you perform the following actions in the target environment using Test-qs-admin-user:

3a) Create a data source in the target account.

3b) Create datasets in the target account.

3c) Create dashboards in the target account.

Step 3a: Create a data source in the target account

To create a data source in your target account, complete the following steps:

  1. Use the data source file created in Step 2a to run the create-data-source command in the target environment:
aws quicksight create-data-source --cli-input-json file://./create-data-source-cli-input.json

The response from the command should indicate the creation is in progress:

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:86********55:datasource/QS_Test",
    "DataSourceId": "QS_Test",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "3bf160e2-a5b5-4c74-8c67-f651bef9b729"
}
  1. Use the describe-data-source command to validate that the data source was created successfully:
aws quicksight describe-data-source --aws-account-id 86********55 --data-source-id "QS_Test"

The following code shows the response:

{
    "Status": 200,
    "DataSource": {
        "Arn": "arn:aws:quicksight:us-east-1:86********55:datasource/QS_Test",
        "DataSourceId": "QS_Test",
        "Name": "QS_Test",
        "Type": "POSTGRESQL",
        "Status": "CREATION_SUCCESSFUL",
        "CreatedTime": "2020-12-21T12:11:30.734000-08:00",
        "LastUpdatedTime": "2020-12-21T12:11:31.236000-08:00",
        "DataSourceParameters": {
            "PostgreSqlParameters": {
                "Host": "dmslabinstance.*************.us-east-1.rds.amazonaws.com",
                "Port": 5432,
                "Database": "sportstickets"
            }
        },
        "SslProperties": {
            "DisableSsl": false
        }
    },
    "RequestId": "fbb72f11-1f84-4c57-90d2-60e3cbd20454"}
  1. Take note of the DataSourceArn value to reference later when creating the dataset.

The data source should now be available within QuickSight. Keep in mind that the data source is visible to the Test-qs-admin-user user, so you must sign in as Test-qs-admin-user and open QuickSight. For this post, the input JSON file renamed the data source to reflect the test environment. Alternatively, sign in to the target QuickSight account and choose Create new dataset to view the available data source.

Step 3b: Create datasets in the target account

Now that the data source is in the target environment, you’re ready to create your datasets.

  1. Use the create-data-set command to create the dataset using the create-data-set-cli-input-sql.json created in Step 2b.

Make sure to replace the DataSourceArn in create-data-set-cli-input-sql.json with the Arn value shown in the describe-data-source command output in Step 3a.

aws quicksight create-data-set --cli-input-json file://./create-data-set-cli-input-sql.json

The following code shows our results:

{
    "Status": 201,
    "Arn": "arn:aws:quicksight:us-east-1:86********55:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
    "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
    "IngestionArn": "arn:aws:quicksight:us-east-1:86********55:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9/ingestion/d819678e-89da-4392-b550-04221a0e4c11",
    "IngestionId": "d819678e-89da-4392-b550-04221a0e4c11",
    "RequestId": "d82caa28-ca43-4c18-86ab-5a92b6b5aa0c"
}
  1. Make note of the ARN of the dataset to use in a later step.
  2. Validate that the dataset was created using the describe-data-set command:
aws quicksight describe-data-set --aws-account-id 86********55 --data-set-id "24b1b03a-86ce-41c7-9df7-5be5343ff9d9"

Alternatively, sign in to QuickSight to see the new dataset in the list.

Step 3c: Create dashboards in the target account

Now that you shared the template with the target account in Step 2c, the final step is to create a JSON file that contains details about the dashboard to migrate to the target account.

  1. Create a JSON file (create-dashboard-cli-input.json) based on the following sample code, and provide the target account and the source account that contains the template:
{
    "AwsAccountId": "86********55",
    "DashboardId": "TicketanalysisTest",
    "Name": "Sportinng_event_ticket_info_dashboard",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:86********55:user/default/Test-qs-admin-user",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:UpdateDashboardPermissions",
                "quicksight:QueryDashboard",
                "quicksight:UpdateDashboard",
                "quicksight:DeleteDashboard",
                "quicksight:DescribeDashboardPermissions",
                "quicksight:UpdateDashboardPublishedVersion"
            ]
        }
    ],
    "SourceEntity": {
        "SourceTemplate": {
            "DataSetReferences": [
                {
                   "DataSetPlaceholder": "TicketInfo",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:86********55:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9"
                }
            ],
            "Arn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template"
        }
    },
    "VersionDescription": "1",
    "DashboardPublishOptions": {
        "AdHocFilteringOption": {
            "AvailabilityStatus": "DISABLED"
        },
        "ExportToCSVOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "SheetControlsOption": {
            "VisibilityState": "EXPANDED"
        }
    }
}

The preceding JSON file has a few important values:

  • The Principal value in the Permissions section references a QuickSight user (Test-qs-admin-user) in the target QuickSight account, which is assigned various actions on the new dashboard to be created.
  • The DataSetPlaceholder in the SourceTemplate must use the same name as specified in the template created in Step 2c. This applies to all DataSetPlaceholder values if more than one is referenced in the dashboard.
  • The DataSetArn value is the ARN of the dataset created in Step 3b.
  • The ARN value in the SourceTemplate section references the ARN of the template created in the source account in Step 2c.
  1. After you create the file, run the create-dashboard command to create the dashboard in the target QuickSight account:
aws quicksight create-dashboard --cli-input-json file://./create-dashboard-cli-input.json

The following code shows the response:

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:86********55:dashboard/TicketanalysisTest",
    "VersionArn": "arn:aws:quicksight:us-east-1:86********55:dashboard/TicketanalysisTest/version/1",
    "DashboardId": "TicketanalysisTest",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "d57fa0fb-4736-441d-9e74-c5d64b3e4024"
}
  1. Open QuickSight for the target account to see the newly created dashboard.

Use the same Test-qs-admin-user to sign in to QuickSight. The following screenshot shows that the provided DashboardId is part of the URL.

Like any QuickSight dashboard, you can share the dashboard with users and groups within the target QuickSight account.

The QuickSight CLI includes additional commands for performing operations to update existing datasets and dashboards in a QuickSight account. You can use these commands to promote new versions of existing dashboards.
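
For example, promoting a new version of the dashboard might look like the following sketch. The input file update-dashboard-cli-input.json is assumed to mirror create-dashboard-cli-input.json (pointing at an updated template version), and version 2 is assumed to be the new version number returned by update-dashboard.

aws quicksight update-dashboard --cli-input-json file://./update-dashboard-cli-input.json

aws quicksight update-dashboard-published-version \
    --aws-account-id 86********55 \
    --dashboard-id TicketanalysisTest \
    --version-number 2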

Clean up your resources

As a matter of good practice, when migration is complete, you should revoke the cross-account role created in Step 1 that allows trusted account access. Similarly, you should disable or remove any other user accounts and permissions created as part of this process.

Conclusion

This post demonstrated an approach to migrate QuickSight objects from one QuickSight account to another. You can use this solution as a general-purpose method to move QuickSight objects between any two accounts or as a way to support SDLC practices for managing and releasing versions of QuickSight solutions in operational environments.

For more information about automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation, check out the video Admin Level-Up Virtual Workshop, V2 on YouTube.


About the authors

Abhinav Sarin is a Senior Partner Solutions Architect at Amazon Web Services; his core interests include databases, data analytics, and machine learning. He works with AWS customers and partners to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

 

Michael Heyd is a Solutions Architect with Amazon Web Services and is based in Vancouver, Canada. Michael works with enterprise AWS customers to transform their business through innovative use of cloud technologies. Outside work he enjoys board games and biking.

 

 

Glen Douglas is an Enterprise Architect with over 25 years IT experience and is a Managing Partner at Integrationworx. He works with clients to solve challenges with data integration, master data management, analytics and data engineering, in a variety of industries and computing platforms. Glen is involved in all aspects of client solution definition through project delivery and has been TOGAF 8 & 9 certified since 2008.

 

Alex Savchenko is a Senior Data Specialist with Integrationworx. He is TOGAF 9 certified and has over 22 years experience applying deep knowledge in data movement, processing, and analytics in a variety of industries and platforms.

Power operational insights with Amazon QuickSight

Post Syndicated from Luis Gerardo Baeza original https://aws.amazon.com/blogs/big-data/power-operational-insights-with-amazon-quicksight/

Organizations need a consolidated view of their applications, but typically application health status is siloed: end-users complain on social media platforms, operational data from application logs is stored in complex monitoring tools, formal ticketing systems track reported issues, and synthetic monitoring data is only available to the tool administrators.

In this post, we show how to use Amazon QuickSight, AWS’ fully managed, cloud-native Business Intelligence service to quickly build a dashboard that consolidates:

  • Operational data from application logs coming from Amazon CloudWatch.
  • Issues reported on Jira software as a service (SaaS) edition.
  • Public posts on Twitter.
  • Synthetic monitoring performed using a CloudWatch Synthetics canary.

This dashboard can provide a holistic view of the health status of a workload. For example, you can:

  • Trace end-users complaining on Twitter or creating issues on Jira back to the application logs where the error occurred.
  • Identify when customers are complaining about system performance or availability on social media.
  • Correlate reports with monitoring metrics (such as availability and latency).
  • Track down errors in application code using log information.
  • Prioritize issues already being addressed based on Jira information.

Solution overview

The following architecture for the solution consists of a subscription filter for CloudWatch Logs to continuously send the application logs to an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue crawler to update Amazon S3 log table metadata, a view on Amazon Athena that formats the data on the bucket, and three QuickSight datasets: Athena, Jira, and Twitter.

To implement this solution, complete the following steps:

  1. Set up CloudWatch and AWS Glue resources.
  2. Set up a QuickSight dataset for Athena.
  3. Set up a QuickSight dataset for CloudWatch Synthetics.
  4. Set up a QuickSight dataset for Twitter.
  5. Set up a QuickSight dataset for Jira.
  6. Create a QuickSight overview analysis.
  7. Create a QuickSight detailed analysis.
  8. Publish your QuickSight dashboard.

Prerequisites

To get started, make sure you have the following prerequisites:

  • An AWS account.
  • Previous experience working with the AWS Management Console.
  • A Twitter account.
  • A Jira SaaS account. Make sure that the DNS name of your Jira Cloud is accessible to QuickSight.
  • Access to the Athena engine v2.

Set up CloudWatch and AWS Glue resources

Start by deploying a Lambda transformation function to use with your Amazon Kinesis Data Firehose delivery stream:

  1. On the Lambda console, launch a new function using the kinesis-firehose-cloudwatch-logs-processor blueprint.
  2. Enter a function name and choose Create function.
  3. Modify the transformLogEvent function in the Lambda code:
function transformLogEvent(logEvent) {
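// Emit each CloudWatch Logs event as a CSV line: <timestamp>,<message>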
return Promise.resolve(`${logEvent.timestamp},${logEvent.message}\n`);
}
  1. Choose Deploy to update the function.

As part of these steps, you create a new AWS Identity and Access Management (IAM) role with basic permissions and will attach an IAM policy created by AWS CloudFormation later.

  5. Choose Copy ARN and save the ARN temporarily for later use.

To create the sample resources, complete the following steps:

  1. Choose Launch Stack:

  2. Choose Next.
  3. Enter a stack name.
  4. For TransformationLambdaArn, enter the function ARN you copied earlier.
  5. Choose Next twice, then acknowledge the message about IAM capabilities.
  6. Choose Create stack.

By default, when you launch the template, you’re taken to the AWS CloudFormation Events page. After 5 minutes, the stack launch is complete.

Test the Lambda function

We can test the function to write sample logs into the log group.

  1. On the Resources page of the AWS CloudFormation console search for LogGenerator.
  2. Choose the physical ID of the Lambda function.
  3. On the Lambda console, choose the function you created.
  4. Choose Test, enter sample for the Event name and leave the other configurations at their default.
  5. Choose Create.
  6. Choose Test again and you should receive the message “Logs generated.”
  7. On the Functions page of the Lambda console, choose the transformation function you created.

Change the AWS Lambda function configuration

  1. On the Configuration tab, choose General configuration.
  2. Choose Edit.
  3. For Memory, set to 256 MB.
  4. For Timeout, increase to 5 minutes.
  5. Choose Save.
  6. Choose Permissions, then choose the role name ID to open the IAM console.
  7. Choose Attach policies.
  8. Search for and select InsightsTransformationFunctionPolicy.
  9. Choose Attach policy.

The policy you attached allows your Lambda transformation function to put records into your Kinesis Data Firehose delivery stream.

Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. The Firehose delivery stream automatically partitions data by date.

  1. On the Amazon S3 console, locate and choose the bucket insightslogsbucket.
  2. Choose the cwlogs prefix and navigate through the partitions created by Kinesis Data Firehose (year/month/day/hour).

Make sure the bucket contains at least one file. (It may take up to 5 minutes to show because Kinesis Data Firehose buffers the data by default.)
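
If you prefer a quick script over browsing the console, the following boto3 sketch lists the partitioned keys; the bucket name is an assumption based on the insightslogsbucket resource (the bucket the template actually creates will carry a generated suffix).

import boto3

s3 = boto3.client("s3")

# Assumed bucket name; replace with the bucket created in your account.
BUCKET = "insightslogsbucket"

# Keys follow the cwlogs/<year>/<month>/<day>/<hour>/ layout written by Kinesis Data Firehose.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="cwlogs/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])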

To optimize the log storage, you can later enable record transformation to Parquet, a columnar data format. For more information, see Converting Input Record Format (Console).

Run the AWS Glue crawler

To complete this section, run the AWS Glue crawler to discover bucket metadata:

  1. On the AWS Glue console, choose Crawlers.
  2. Select the crawler InsightsLogCrawler and choose Run crawler.

  3. Wait for the crawler to complete, then choose Tables.
  4. Choose the filter bar and choose the resource attribute Database.
  5. Enter insightsdb and press Enter.
  6. You should see a table with a CSV classification.
  7. On the Athena console, enter the following into the query editor:
select * from cwlogs limit 5;
  8. Choose Run query.

You should see a table like the following screenshot.

Set up a QuickSight dataset for Athena

If you haven’t signed up for a QuickSight subscription, do so before creating your dataset.

To use the table created in the AWS Glue Data Catalog, you have to authorize connections to Athena.

  1. On the QuickSight console, choose your QuickSight username and choose Manage QuickSight.
  2. Choose Security & permissions in the navigation pane.
  3. Choose Add or remove to set up access to Amazon S3.

  4. Choose your S3 bucket insightslogsbucket created by the CloudFormation template to allow QuickSight to access it.
  5. Choose the QuickSight logo to exit the management screen.

Add the Athena dataset

Now we can set up the Athena dataset.

  1. On the QuickSight console, choose Datasets in the navigation pane, then choose New dataset.
  2. Choose Athena from the available options.
  3. For Data source name, enter cwlogs.
  4. Leave the default workgroup selected (primary).
  5. Choose Create data source.
  6. Open the list of databases and choose insightsdb.
  7. Choose the cwlogs table and choose Use custom SQL.
  8. Replace New custom SQL with cwlogs and enter the following SQL code:
SELECT col1 as logmessage, col0 as datetime FROM "insightsdb"."cwlogs"
  9. Choose Confirm query.

You receive a confirmation of the dataset creation.

  1. Choose Edit/Preview Data.
  2. For Dataset name, enter cwlogs.

Separate severity levels

The logs contain a severity level (INFO, WARN, ERRR) embedded in the message. To enable analysis of the logs based on this value, you separate the severity level from the message using QuickSight calculated fields.
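
The two calculated fields you create in the following steps perform the equivalent of this small Python sketch, assuming each log line looks like ERRR - Database connection timed out (shown only to make the string manipulation explicit; QuickSight evaluates the calculated fields itself):

def split_log(logmessage: str):
    # substring(logmessage, 1, 4) -> the first four characters hold the severity level
    level = logmessage[:4]
    # replace(logmessage, concat(level, " - "), "") -> strip the level prefix
    message = logmessage.replace(f"{level} - ", "")
    return level, message

print(split_log("ERRR - Database connection timed out"))
# ('ERRR', 'Database connection timed out')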

  1. Choose Dataset below the query editor and choose the datetime field.
  2. Change the type to Date.

  3. Open the Fields panel on the left and choose Add calculated field.
  4. For Add name, enter level.
  5. Enter the following code:
substring(logmessage,1,4)
  6. Choose Save.
  7. Choose Add calculated field.
  8. For Add name, enter message.
  9. Enter the following code:
replace(logmessage,concat(level," - "),"")
  10. Choose Save.
  11. Choose the options icon (three dots) next to the logmessage field and choose Exclude field.

The final dataset should be similar to the following screenshot.

  12. Choose Save.

Set up a QuickSight dataset for CloudWatch Synthetics

You add Synthetics monitoring metrics to your QuickSight dashboard to gain visibility into the availability of your website. For this, you use the Athena CloudWatch connector.

Create a CloudWatch Synthetics canary

To create a CloudWatch Synthetics canary that monitors your website using heartbeats (sample requests) to test availability, complete the following steps:

  1. On the CloudWatch Synthetics console, choose Create canary.
  2. Make sure the blueprint Heartbeat monitoring is selected.
  3. For Name, enter webstatus.
  4. For Application or endpoint URL, enter your website’s URL.
  5. Under Schedule, enter a frequency that works best for you (1–60 minutes).

The default setup is every 5 minutes.

  1. Enter an S3 location to store artifacts.
  2. Choose Create canary.

Set up the Athena CloudWatch connector

CloudWatch Synthetics sends availability data from the canary to CloudWatch Metrics. To query the metrics, you can use the Athena CloudWatch Metrics connector.

  1. On the Athena console, choose Data sources.
  2. Choose Connect data source and choose Query a data source.
  3. Choose Amazon CloudWatch Metrics and choose Next.
  4. Choose Configure new AWS Lambda function.

 A new tab opens in the browser, which you return to after deploying the Lambda function.

  1. For SpillBucket, enter the name of your S3 bucket insightslogsbucket created by the CloudFormation template you deployed.
  2. For AthenaCatalogName, enter cwmetrics.
  3. Select I acknowledge that this app creates custom IAM roles.
  4. Choose Deploy.
  5. Close the browser tab and go to the Athena tab you were on before.
  6. Refresh the list of Lambda functions by choosing the refresh icon.
  7. Under Lambda function, choose the Lambda function you just created.
  8. For Catalog name, enter cwmetrics.
  9. Choose Connect.

Set up permissions for QuickSight to use the connector

The CloudWatch connector runs on Lambda, which uses a spill bucket to handle large queries, so QuickSight needs permission to invoke the Lambda function and write to the spill bucket. For more information, see the GitHub repo. Let's set up permissions to allow QuickSight to use the CloudWatch connector.

  1. On the QuickSight console admin page, choose Security & permissions.
  2. Choose Add or remove.
  3. Choose Athena.
  4. On the S3 tab, specify write permissions for your insightslogsbucket S3 bucket
  5. On the Lambda tab, choose your Lambda function cwmetrics.

Create the new QuickSight dataset

Now we set up the QuickSight dataset for CloudWatch Synthetics.

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset and choose Athena.
  3. For Data source name, enter cwmetrics.
  4. Choose Create data source.
  5. Open the list of catalogs and choose cwmetrics.
  6. Choose the metric_samples table and choose Use custom SQL.
  7. For New custom SQL field, enter cwmetrics.
  8. Enter the following SQL code:
SELECT timestamp, value, dimension.dim_name, dimension.dim_value
    FROM cwmetrics.default.metric_samples CROSS JOIN UNNEST(dimensions) AS t(dimension)
    WHERE namespace='CloudWatchSynthetics' AND metric_name='SuccessPercent'
        AND statistic='Average' AND dim_name='StepName'
        AND dimension.dim_name='CanaryName' AND dimension.dim_value='webstatus'
        AND timestamp BETWEEN To_unixtime(Now() - INTERVAL '7' DAY)
        AND To_unixtime(Now())
  9. Choose Confirm query.

You receive a confirmation of the dataset creation.

  1. Choose Edit/Preview data.
  2. Choose Dataset below the query editor and choose the timestamp field.
  3. Change the type to Date.
  4. Choose Save.

Set up a QuickSight dataset for Twitter

To set up the Twitter dataset, complete the following steps:

  1. On the QuickSight console, create a new dataset with Twitter as the source.
  2. For Data source name, enter twitterds.
  3. For Query, enter a hashtag or keyword to analyze from Twitter posts.
  4. Choose Create data source.

A new window opens requesting you to give QuickSight OAuth authorization for Twitter.

  1. Sign in to Twitter and choose Authorize application.

  1. Choose the table Twitt, then choose Edit/Preview data.

It might take a couple of minutes for the records to be imported into SPICE, the QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. You can continue with the tutorial while SPICE finishes in the background.

Let’s create a calculated field that classifies tweets as a Good experience or Bad experience by searching for the words “error,” “problem,” or “expensive.” You can choose other words that fit your use case.

  1. For Dataset name, enter twitterds.
  2. Choose Add calculated field.
  3. For Add name, enter Experience.
  4. Enter the following code:
ifelse(locate(toLower({Text}),"error")<>0,"BAD",locate(toLower({Text}),"problem")<>0,"BAD",locate(toLower({Text}),"expensive")<>0,"BAD","GOOD")
  5. Choose Save, then choose Save & visualize.

The Twitter Standard Search API returns data for 7 days only. For more information, see Supported Data Sources.

Set up a QuickSight dataset for Jira Cloud

QuickSight can connect to SaaS data sources, including Jira Cloud. To set up the Jira dataset, complete the following steps:

  1. To create an API token, open the Jira website within an authenticated browser.
  2. Choose Create API token.
  3. For Label, enter QuickSight.
  4. Choose Create.

  1. Create a new dataset in QuickSight and choose Jira as the data source.
  2. For Data source name, enter jirads.
  3. For Site base URL, enter the URL you use to access Jira.
  4. For Username, enter your Jira username.
  5. For API token or password, enter the token you created.
  6. Choose Create and choose the Issues table.
  7. Choose Select and then choose Visualize.

You’re redirected to a newly created QuickSight analysis.

Create a QuickSight overview analysis

We use the previously created analysis as a starting point for our overview analysis.

  1. Choose the Edit data icon (pencil).

  1. Choose Add dataset.
  2. Choose the Issues dataset and choose Select.
  3. Repeat the steps for the twitterds, cwlogs, and cwmetrics datasets.

You should see the four datasets added to the QuickSight visual.

In the navigation pane, you can find the available fields for the selected dataset and on the right, the visuals. Each visual is tied to a particular dataset. When the tutorial instructs you to create a new visual from a dataset, you must choose the dataset from the list first, then choose + Add.

  1. Choose the visual, then the options icon (three dots).
  2. Choose Delete.

  1. Update the QuickSight analysis name to Operational Dashboard.

Add a visual for the Twitter dataset

To add your first visual, complete the following steps:

  1. Add a new visual from the Twitter dataset.
  2. Change the visual type to KPI.
  3. Choose the field RetweetCount.
  4. Name the visual Bad experience retweet count.
  5. Use the resize button to set the size to a fourth of the screen width.

  1. In the navigation pane, choose Filter and then choose Create one.
  2. Choose the Created field and then choose the filter.
  3. Change the filter type to relative dates.
  4. Choose Last N hours and enter 24 for Number of hours.
  5. Choose Apply.

This filter allows us to see only the most up-to-date information from the last 24 hours.

  1. Add a second filter and choose the Experience field.
  2. Leave BAD selected and choose Apply.

Now you only see information about Twitter customers with a bad experience.

Add a visual for CloudWatch metrics

Next, we add a new visual for our CloudWatch metrics.

  1. Choose Visualize and then choose the cwmetrics dataset.
  2. Add a new gauge chart.
  3. Choose Field wells to open the visual field configuration.
  4. Choose Value and change Aggregate to Average.

  1. Drag and drop the value field from the field list into Target value and change the aggregate.
  2. Name the visual System health status.

  1. Similar to what you did on the previous visual, add a filter to include only the last 24 hours based on the timestamp field.

Add a visual for CloudWatch error logs

Next, we create a visual for our CloudWatch logs dataset.

  1. Choose the cwlogs dataset and add a new KPI visual.
  2. Drag and drop the message field into the Value box.
  3. Name the visual Error log count.
  4. Create a filter using the level field and from the list of values.
  5. Deselect all but ERRR.
  6. Choose Apply.

This filters only logs where there was an error found.

  1. Create a filter for the last 24 hours.

Add a visual for Jira issues

Now we add a visual for open Jira issues.

  1. Choose the Issues dataset.
  2. Create a KPI visual using the Id field for Value.
  3. Add a filter for issues of type bug.
    1. Use the IssueType_Name field and select only the records with value Error.
  4. Add a filter for issues open.
    1. Use the Status_Name field and select only the records with the value On-going or To-do.
  5. Add a filter for the last 24 hours using the Date_Created
  6. Name the filter Bug Issues open.
  7. Resize the visuals and organize them as needed.

Complete the overview dashboard

We're almost done with our overview dashboard.

  1. Create four new visuals as specified in the following table.
  • twitterds – Stacked line chart; X-axis: Created (aggregate: hour); Value: RetweetCount (aggregate: sum)
  • cwmetrics – Stacked line chart; X-axis: timestamp (aggregate: hour); Value: Value (aggregate: avg)
  • cwlogs – Stacked line chart; X-axis: datetime (aggregate: hour); Value: message (aggregate: count)
  • Issues – Vertical stacked bar; X-axis: Date_Created (aggregate: hour); Value: Id (aggregate: count); Color: Status_Name
  1. Modify every filter you created to apply them to all the visuals.

  1. Choose the Sheet 1 tab twice to edit its name.
  2. Enter Overview.
  3. Press Enter.

The overview analysis of our application health dashboard is complete.

QuickSight provides a drill-up and drill-down feature to view data at different levels of a hierarchy; the feature is added automatically for date fields. For more information, see Adding Drill-Downs to Visual Data in Amazon QuickSight.

In the previous step, you applied a drill-down when you changed the aggregation to hour.

Create a QuickSight detailed analysis

To create a detailed analysis, we create new tabs for CloudWatch logs, Tweets, and Jira issues.

Create a tab for CloudWatch logs

To create a tab to analyze our CloudWatch logs, complete the following steps:

  1. Choose the add icon next to the Overview tab and name it Logs.
  2. Create three visuals from the cwlogs dataset:
    1. Donut chart with the level field for Group/Color and message (count) for Value.
    2. Stacked combo bar with the datetime (aggregate: hour) field for X axis box, level (count) for Bars, and level for Group/Color for bars.
    3. Table with the fields level, message, and datetime for Value.

To improve the analysis of log data, format the level field based on its content using QuickSight conditional formatting: red for ERRR and green for DEBG.

  1. Choose the table visual, choose the visual options icon (three dots), then choose Conditional formatting.
  2. Choose the add icon and select the level field, then choose Add Text color.
  3. For Value, enter ERRR.
  4. For Color, choose red.
  5. Choose Add condition.
  6. For Value, enter DEBG.
  7. For Color, choose green.

  1. Choose Apply.
  2. Resize the visuals and update their titles as needed.

To enable data exploration, let’s set up an action on the Logs tab.

  1. Choose the Top Logs visual and choose Actions.
  2. Choose Filter same-sheet visuals.
  3. Choose ERRR on the donut chart.

The Latest Logs table now shows only the ERRR level rows.

  1. Create a filter for the last 24 hours applicable to all visuals from the dataset.

Create a tab for Tweets

To create a tab for Twitter analysis and add visuals, complete the following steps:

  1. Choose the add icon next to the Logs tab and name the new tab Tweets.
  2. Delete the visual added and create a new donut chart from the twitterds dataset.
  3. Choose the field Source and name the visual Twit Source.
  4. Create a new word cloud visual and choose the Username field.

With QuickSight, you can exclude visual elements to focus the analysis on certain data. If you see a big “Other” username on the word cloud, choose it and then choose Hide “other” categories.

  1. To narrow down the elements on the word cloud to the top 50, choose the mesh icon.
  2. Under the Group by panel, enter 50 for Number of words.
  3. Choose the mesh icon again and choose Allow vertical words.
  4. Name the visual Top 50 Users.

Let’s create a table with the Twitter details.

  1. Add a new table visual.
  2. Drag and drop the field Text into the Group by box and RetweetCount into Value.
  3. Name the visual Top Retweet.
  4. Resize the columns on the table using the headers border as needed.
  5. To sort the table from the top retweeted posts, choose the header of the field RetweetCount and choose the sort descending icon.

Let’s add a color and an icon based on number of retweets.

  1. Choose the configuration icon (three dots) and choose conditional formatting.
  2. Choose the RetweetCount field, then choose the add icon and choose the three bars icon set.

  1. Choose the Custom conditions option and enter the values as follows:
    1. Condition #1 – Value greater than 10000; color: red
    2. Condition #2 – Start value: 2000; End value: 10000; color: orange
    3. Condition #3 – Value less than 2000; color: keep the default

Now you can see the field RetweetCount formatted with an icon and color based on the value.

Now we add the user location to the analysis.

  1. Add a new horizontal bar chart visual.
  2. Use the UserLocation field for Y axis and the RetweetCount as Value.
  3. Sort descending by RetweetCount.
  4. Choose the mesh icon to expand the Y-axis panel and enter 10 for Number of data points to show.
  5. If you see an empty country, choose it and choose Exclude empty.
  6. Name the visual Top 10 Locations.

To complete this tab of your analysis, resize the visuals and organize them as needed.

  1. Similarly, as you did before, choose every visual and add the action Filter same-sheet visuals, which allows you to explore your data.

For example, you can choose one location or source and the Top users and Retweet tables are filtered.

  1. Create a filter for the last 24 hours applicable to all visuals from the dataset.

Create a tab for Jira issues

Finally, we create a tab for Jira issue analysis.

  1. Choose the add icon next to the Tweets tab and name the new tab Issues.
  2. Delete the visual created and create a new horizontal stacked 100% bar chart visual from the Issues dataset.
  3. Drag and drop the fields as follows (because this dataset has many fields, you can find them using the Field list search bar):
    1. Y-axis: Status_Name
    2. Value: Id (count)
    3. Group/Color: Assigne_DisplayName

This visual shows how issues are progressing, broken down by assignee.

  1. Add a new area line chart visual with the field Date_Updated for X axis and TimeEstimate for Value.
  2. Add another word cloud visual to find out who the top issue reporters are; use Reporter_DisplayName for Group by and Id (count) for Size.
  3. The last visual you add for this tab is a table; include all the fields you need for investigation in the Value box. We suggest including Id, Key, Summary, Votes, WatchCount, Priority, and Reporter_DisplayName.
  4. Resize and rearrange the visuals as needed.

  1. As you did before, choose every visual and add the action Filter same-sheet visuals, which allows you to explore your data.

For example, you can choose one reporter display name or a status and the other visuals are filtered.

  1. Create the filter for the last 24 hours applicable to all visuals from the dataset.

Publish your QuickSight dashboard

The analysis that you’ve been working on is automatically saved. To enable other people to view your findings with read-only capabilities, publish your analysis as a dashboard.

  1. Choose Share and choose Publish dashboard.
  2. Enter a name for your dashboard, such as holistic health status, and choose Publish dashboard.
  3. Optionally, select a person or group to share the dashboard with by entering a name in the search bar.
  4. Choose Share.

Your dashboard is now published and ready to use. You can easily correlate errors in application logs with posts on Twitter and availability data from your website, and quickly identify which errors are already being addressed based on open Jira bug issues.

By default, this dashboard can only be accessed by you, but you can share your dashboard with other people in your QuickSight account.

When you created the QuickSight datasets for Twitter and Jira, the data was automatically imported into SPICE, accelerating query performance. You can also set up SPICE for other dataset types. Remember that data imported into SPICE must be refreshed to stay current.
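
If you later want to refresh a SPICE dataset programmatically instead of from the console, one possible approach (a sketch only, not part of the original setup) is to start an ingestion with boto3; the dataset ID below is a placeholder you would look up in your account.

import time
import boto3

quicksight = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Placeholder dataset ID -- list your datasets with quicksight.list_data_sets().
DATASET_ID = "example-twitter-dataset-id"

# Start a SPICE refresh (ingestion) for the dataset.
quicksight.create_ingestion(
    AwsAccountId=account_id,
    DataSetId=DATASET_ID,
    IngestionId=f"manual-refresh-{int(time.time())}",
)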

Conclusion

In this post, you created a dashboard with a holistic view of your workload health status, including application logs, issue tracking on Jira, social media comments on Twitter, and monitoring data from CloudWatch Synthetics. To expand on this solution, you can include data from Amazon CloudFront logs or Application Load Balancer access logs to get a complete view of your application. You could also easily embed your dashboard into a custom application.

You can also use machine learning to discover hidden data trends, saving hours of manual analysis with QuickSight ML Insights, or use QuickSight Q to power data discovery using natural language questions on your dashboards. Both features are ready to use in QuickSight without machine learning experience required.


About the author

Luis Gerardo Baeza is an Amazon Web Services solutions architect with 10 years of experience in business process transformation, enterprise architecture, agile methodologies adoption, and cloud technologies integration. Luis has worked with education, healthcare, and financial companies in México and Chile.

Building a Showback Dashboard for Cost Visibility with Serverless Architectures

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-showback-dashboard-for-cost-visibility-with-serverless-architectures/

Enterprises with centralized IT organizations and multiple lines of businesses frequently use showback or chargeback mechanisms to hold their departments accountable for their technology usage and costs. Chargeback involves actually billing a department for the cost of their division’s usage. Showback focuses on visibility to make the department more cost conscientious and encourage operational efficiency.

Building a showback mechanism can be potentially challenging for business and financial analysts of an AWS Organization. You may not have the scripting or data engineering skills needed to coordinate workflows and build reports at scale. Although you can use AWS Cost Explorer as a starting point, you may want greater customizability, larger datasets beyond a one-year period, and more of a business intelligence (BI) experience.

In this post, we discuss the benefits of building a showback dashboard using the AWS Cost and Usage Report (AWS CUR). You can track costs by cost center, business unit, or project using managed services. Using a showback strategy, you can consolidate and present costs to a business unit to show resource use over a set period of time for your entire AWS Organization. Building this solution with managed services allows you to spend time understanding your costs rather than maintaining the underlying infrastructure.

This solution highlights AWS Glue DataBrew to prepare your data into the appropriate format for dashboards. We recommend DataBrew because it provides a no-code environment for data transformation. It allows anyone to create dashboards similar to those built in the Cloud Intelligence Dashboards Workshop for your Organization.

Figure 1. QuickSight showback dashboard using CUR data transformed by Glue DataBrew and leveraging QuickSight insights

Tags for cost allocation

The success of your showback dashboard partially depends on your cost allocation tagging strategy. Typically, customers use business tags such as cost center, business unit, or project to associate AWS costs with traditional financial reporting dimensions within their organization.

The CUR supports the ability to break down AWS costs by tag. For example, if a group of resources are labeled with the same tag, you’ll be able to see the total cost and usage of that group of resources. Read more about Tagging Best Practices to develop a tagging strategy for your organization.

A serverless data workflow for showback dashboards

You can build showback dashboards with managed services such as Amazon QuickSight, without the need to write any code or manage any servers.

Figure 2. A serverless architecture representing data workflow

AWS automatically delivers the data you need for showback dashboards through the CUR. Once this data arrives in an Amazon Simple Storage Service (S3) bucket, you can transform the data without the need to write any code by using DataBrew. You can also automatically identify the data schema, and catalog the data’s properties to run queries using Amazon Athena. Lastly, you can visualize the results by publishing and sharing dashboards to key stakeholders within your organization using Amazon QuickSight.

The key benefits of this approach are:

  • Automatic data delivery
  • No-code data transformation
  • Automatic cataloging and querying
  • Serverless data visualization

Let’s take a look at each in more detail.

Automatic data delivery

The CUR is the source for historical cost and usage data. The CUR provides the most comprehensive set of cost and usage data available and will include your defined cost allocation tags for your entire Organization. You configure CUR to deliver your billing data to an Amazon S3 bucket at the payer account level. This will consolidate data for all linked accounts. After delivery starts, Amazon updates the CUR files at least once a day.

No-code data transformation

You can use DataBrew to transform the data in the Amazon S3 bucket aggregating cost and usage according to your tagging strategy. DataBrew summarizes your data for discovery. You can also run transformations called “jobs” in DataBrew without writing any code, using over 250 built-in transforms. Figures 3 through 5 show several job examples.

Figure 3. DataBrew recipe action: rename column

Figure 4. DataBrew recipe action: Create column from function

Figure 5. DataBrew recipe action: fill missing values

For a full list of columns available in CUR, review the CUR Data Dictionary. Following is a list of relevant columns for an executive summary showback dashboard:

  • bill_billing_period_start_date
  • line_item_usage_account_id
  • line_item_line_item_type
  • product_product_name
  • product_product_family
  • product_product_transfer_type
  • savings_plan_savings_plan_effective_cost
  • reservation_effective_cost
  • line_item_unblended_cost

Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). This can be run at a particular time of day, or as defined by a valid CRON expression. This helps you automate your transformation workflows.
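
As a hedged illustration, the boto3 sketch below attaches a 12-hour schedule to an existing DataBrew job; the job and schedule names are placeholders rather than resources defined in this post.

import boto3

databrew = boto3.client("databrew")

# Placeholder names -- replace with your own DataBrew job.
databrew.create_schedule(
    Name="cur-transform-every-12-hours",
    JobNames=["cur-showback-transform"],
    # AWS cron syntax: minutes, hours, day-of-month, month, day-of-week, year.
    CronExpression="cron(0 */12 * * ? *)",
)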

Automatic cataloging and querying

You can use a Glue crawler to automatically classify your data to determine the data’s format, schema, and associated properties. The crawlers write metadata to an AWS Glue Data Catalog to help data users find the data they need.

With the results in Amazon S3 and the metadata in the Glue Data Catalog, you can run standard SQL queries with Athena. This will help you make more informed business decisions by tracking financial metrics and optimizing costs. This is done directly in Amazon S3 without having to move data around. Using standard SQL, you can create views that aggregate cost and usage by your defined tags.
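
For example, a view along the lines of the sketch below (submitted here through boto3) could sum unblended cost by a cost center tag and billing period; the database, table, and tag column names are assumptions that depend on how your CUR and crawler are set up.

import boto3

athena = boto3.client("athena")

# Assumed names -- adjust to your Data Catalog database, CUR table, and tag column.
CREATE_VIEW_SQL = """
CREATE OR REPLACE VIEW showback_by_cost_center AS
SELECT
    bill_billing_period_start_date,
    resource_tags_user_cost_center AS cost_center,
    SUM(line_item_unblended_cost)  AS unblended_cost
FROM cur_database.cur_table
GROUP BY 1, 2
"""

athena.start_query_execution(
    QueryString=CREATE_VIEW_SQL,
    QueryExecutionContext={"Database": "cur_database"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)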

Serverless data visualization

You can use Amazon QuickSight to create and share dashboards with your teams for cost visibility. QuickSight provides native integration with Athena and S3, and lets you easily create and publish interactive BI dashboards that include ML-powered insights. When building a showback dashboard such as the example in Figure 1, QuickSight authors create visuals and publish interactive dashboards.

Readers log in using your preferred authentication mechanism to view the shared dashboard. You can then filter data based on billing periods, account number, or cost allocation tags. You can also drill down to details using a web browser or mobile app.

Conclusion

In this blog, we’ve discussed designing and building a data transformation process and a showback dashboard. This gives you highly granular cost visualization without having to provision and manage any servers. You can use managed services such as AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight to crawl, catalog, analyze, and visualize your data.

We recommend defining your organization tagging strategy to be able to view costs by tags. You can then get started by creating Cost and Usage Reports. With the data in Amazon S3, you can use the services described in this post to transform the data that works for your business. Additionally, you can get started today by experimenting with the Cloud Intelligence Dashboards Workshop. This workshop provides examples of visualizations that you can build using native AWS services on top of your Cost and Usage Report. You will be able to get cost, usage, and operational insights about your AWS Cloud usage.

Enhancing Existing Building Systems with AWS IoT Services

Post Syndicated from Lewis Taylor original https://aws.amazon.com/blogs/architecture/enhancing-existing-building-systems-with-aws-iot-services/

With the introduction of cloud technology and by extension the rapid emergence of Internet of Things (IoT), the barrier to entry for creating smart building solutions has never been lower. These solutions offer commercial real estate customers potential cost savings and the ability to enhance their tenants’ experience. You can differentiate your business from competitors by offering new amenities and add new sources of revenue by understanding more about your buildings’ operations.

There are several building management systems to consider in commercial buildings, such as air conditioning, fire, elevator, security, and grey/white water. Each system continues to add more features and become more automated, meaning that control mechanisms use all kinds of standards and protocols. This has led to fragmented building systems and inefficiency.

In this blog, we’ll show you how to use AWS for the Edge to bring these systems into one data path for cloud processing. You’ll learn how to use AWS IoT services to review and use this data to build smart building functions. Some common use cases include:

  • Provide building facility teams a holistic view of building status and performance, alerting them to problems sooner and helping them solve problems faster.
  • Provide a detailed record of the efficiency and usage of the building over time.
  • Use historical building data to help optimize building operations and predict maintenance needs.
  • Offer enriched tenant engagement through services like building control and personalized experiences.
  • Allow building owners to gather granular usage data from multiple buildings so they can react to changing usage patterns in a single platform.

Securely connecting building devices to AWS IoT Core

AWS IoT Core supports connections with building devices, wireless gateways, applications, and services. Devices connect to AWS IoT Core to send and receive data from AWS IoT Core services and other devices. Buildings often use different device types, and AWS IoT Core has multiple options to ingest data and enable connectivity within your building. AWS IoT Core is made up of the following components:

  • Device Gateway is the entry point for all devices. It manages your device connections and supports HTTPS and MQTT (3.1.1) protocols.
  • Message Broker is an elastic and fully managed pub/sub message broker that securely transmits messages (for example, device telemetry data) to and from all your building devices.
  • Registry is a database of all your devices and associated attributes and metadata. It allows you to group devices and services based upon attributes such as building, software version, vendor, class, floor, etc.

The architecture in Figure 1 shows how building devices can connect into AWS IoT Core. AWS IoT Core supports multiple connectivity options:

  • Native MQTT – Many building management systems and device controllers already support MQTT natively.
  • AWS IoT Device SDK – This option supports MQTT protocol and multiple programming languages.
  • AWS IoT Greengrass – The previous options assume that devices are connected to the internet, but this isn’t always possible. AWS IoT Greengrass extends the cloud to the building’s edge. Devices can connect directly to AWS IoT Greengrass and send telemetry to AWS IoT Core.
  • AWS for the Edge partner products – There are several partner solutions, such as Ignition Edge from Inductive Automation, that offer protocol translation software to normalize in-building sensor data.

Figure 1. Data ingestion options from on-premises devices to AWS

Challenges when connecting buildings to the cloud

There are two common challenges when connecting building devices to the cloud:

  • You need a flexible platform to aggregate building device communication data
  • You need to transform the building data to a standard protocol, such as MQTT

Building data is made up of various protocols and formats. Many of these are system-specific or legacy protocols. To overcome this, we suggest processing building device data at the edge, extracting important data points/values before transforming to MQTT, and then sending the data to the cloud.

Transforming protocols can be complex because they can abstract naming and operation types. AWS IoT Greengrass and partner products such as Ignition Edge make it possible to read that data, normalize the naming, and extract useful information for device operation. Combined with AWS IoT Greengrass, this gives you a single way to validate the building device data and standardize its processing.
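
As a simplified sketch of that normalization step (the raw payload shape and topic naming below are assumptions, not a published standard), the logic running at the edge might look like the following before the message is published over MQTT by the AWS IoT Device SDK or an AWS IoT Greengrass component:

import json
import time

def normalize_reading(raw: dict):
    """Map a vendor-specific sensor reading to a common MQTT topic and JSON payload."""
    topic = f"buildings/{raw['building']}/floors/{raw['floor']}/{raw['sensor_type']}"
    payload = {
        "device_id": raw["device_id"],
        "value": float(raw["value"]),
        "unit": raw.get("unit", "unknown"),
        "timestamp": int(time.time()),
    }
    return topic, json.dumps(payload)

# Example reading from a hypothetical building controller.
topic, payload = normalize_reading(
    {"building": "hq-1", "floor": "3", "sensor_type": "temperature",
     "device_id": "ahu-17", "value": "21.4", "unit": "celsius"}
)
print(topic, payload)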

Using building data to develop smart building solutions

The architecture in Figure 2 shows an in-building lighting system. It is connected to AWS IoT Core and reports on devices’ status and gives users control over connected lights.

The architecture in Figure 2 has two data paths, which we’ll provide details on in the following sections, but here’s a summary:

  1. The “cold” path gathers all incoming data for batch data analysis and historical dashboarding.
  2. The “warm” bidirectional path is for faster, real-time data. It gathers devices’ current state data. This path is used by end-user applications for sending control messages, real-time reporting, or initiating alarms.

Figure 2. Architecture diagram of a building lighting system connected to AWS IoT Core

Cold data path

The cold data path gathers all lighting device telemetry data, such as power consumption, operating temperature, health data, etc. to help you understand how the lighting system is functioning.

Building devices can often deliver unstructured, inconsistent, and large volumes of data. AWS IoT Analytics helps clean up this data by applying filters, transformations, and enrichment from other data sources before storing it. By using Amazon Simple Storage Service (Amazon S3), you can analyze your data in different ways. Here we use Amazon Athena and Amazon QuickSight for building operational dashboard visualizations.

Let's discuss a real-world example. For building lighting systems, understanding your energy consumption is important for evaluating energy and cost efficiency. Data ingested into AWS IoT Core can be stored long term in Amazon S3, making it available for historical reporting. Athena and QuickSight can quickly query this data and build visualizations that show lighting state (on or off) and annual energy consumption over a set period of time. You can also overlay this data with sunrise and sunset data to provide insight into whether you are using your lighting systems efficiently. For example, you could adjust the lighting schedule according to the darker winter months versus the brighter summer months.

Warm data path

In the warm data path, the AWS IoT Device Shadow service makes the device state available. Shadow updates are forwarded by an AWS IoT rule into downstream services such as AWS IoT Events, which tracks and monitors multiple devices and data points and then initiates actions based on specific events. Further, you could build APIs that interact with AWS IoT Device Shadow. In this architecture, we have used AWS AppSync and AWS Lambda to enable building controls via a tenant smartphone application.

Let’s discuss a real-world example. In an office meeting room lighting system, maintaining a certain brightness level is important for health and safety. If that space is unoccupied, you can save money by turning the lighting down or off. AWS IoT Events can take inputs from lumen sensors, lighting systems, and motorized blinds and put them into a detector model. This model calculates and prompts the best action to maintain the room’s brightness throughout the day. If the lumen level drops below a specific brightness threshold in a room, AWS IoT Events could prompt an action to maintain an optimal brightness level in the room. If an occupancy sensor is added to the room, the model can know if someone is in the room and maintain the lighting state. If that person leaves, it will turn off that lighting. The ongoing calculation of state can also evaluate the time of day or weather conditions. It would then select the most economical option for the room, such as opening the window blinds rather than turning on the lighting system.
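
The decision logic such a detector model encodes can be sketched in a few lines of Python (an illustration of the rules described above, not an AWS IoT Events model definition; the threshold value is arbitrary):

def lighting_action(lumens: float, occupied: bool, daylight_available: bool,
                    target_lumens: float = 500.0) -> str:
    """Return the most economical action that keeps an occupied room bright enough."""
    if not occupied:
        return "turn_lights_off"
    if lumens >= target_lumens:
        return "no_change"
    # Prefer free daylight (open the blinds) before spending energy on lighting.
    return "open_blinds" if daylight_available else "increase_lighting"

print(lighting_action(lumens=320, occupied=True, daylight_available=True))   # open_blinds
print(lighting_action(lumens=320, occupied=False, daylight_available=True))  # turn_lights_off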

Conclusion

In this blog, we demonstrated how to collect and aggregate the data produced by on-premises building management platforms. We discussed how augmenting this data with the AWS IoT Core platform allows for development of smart building solutions such as building automation and operational dashboarding. AWS products and services can enable your buildings to be more efficient while also providing engaging tenant experiences. For more information on how to get started, check out the getting started with AWS IoT Core developer guide.

Building a Cloud-based OLAP Cube and ETL Architecture with AWS Managed Services

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-cloud-based-olap-cube-and-etl-architecture-with-aws-managed-services/

For decades, enterprises used online analytical processing (OLAP) workloads to answer complex questions about their business by filtering and aggregating their data. These complex queries were compute and memory-intensive. This required teams to build and maintain complex extract, transform, and load (ETL) pipelines to model and organize data, oftentimes with commercial-grade analytics tools.

In this post, we discuss building a cloud-based OLAP cube and ETL architecture that will yield faster results at lower costs without sacrificing performance by:

  • Connecting your on-premises database to the cloud for data profiling, discovery, and transformation
  • Running OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate data
  • Using AWS Glue Data Catalog, Amazon Athena, Amazon QuickSight, and Amazon SageMaker to catalog and visualize data with machine learning (ML)

Data analytics pipeline with AWS Managed Services

The proposed architecture in Figure 1 relies on AWS Managed Services. AWS Glue DataBrew is a no-code data transformation service that you can use to quickly build your transformation jobs. AWS Glue crawlers collect metadata from the transformed data and catalog it for analytics and visualization using Athena and QuickSight. SageMaker will build, train, and deploy ML models.

This architecture will help you get answers from your data to your users as fast as possible without needing to migrate your data to AWS. There is no coding required, so you can leverage data transformation, cataloging, analytics, and ML quickly.

Figure 1. Example architecture using AWS Managed Services

Benefits of AWS Managed Services for data analytics

Immediate connectivity to on-premises databases

The example architecture in Figure 1 begins with an online transaction processing (OLTP) database running in your corporate data center. Figure 2 shows how you can establish a Java database connectivity (JDBC) connection from the OLTP database to DataBrew running in AWS to run OLAP workloads. DataBrew supports data sources using JDBC for common data stores such as Microsoft SQL Server, MySQL, Oracle, and PostgreSQL.

Figure 2. DataBrew – JDBC connection to data source

Automatic data discovery

Figures 3 through 6 show how DataBrew summarizes your data for discovery. You can profile your data to understand patterns and detect anomalies. You can also run transformations called “jobs” in DataBrew without writing any code using over 250 built-in transforms.

Figure 3. DataBrew – dataset profiling overview

Figure 4. DataBrew – data correlation patterns

Figure 5. DataBrew – data points distribution

No-code data transformation and cataloging

To run OLAP-type transactions, you can create jobs based on the transformation steps shown in Figure 6. These steps are collectively referred to as a DataBrew recipe. The recipe can be run as a job, with the results output to an Amazon Simple Storage Service (Amazon S3) bucket.
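
If you would rather create the recipe job programmatically than through the console, a rough boto3 sketch could look like the following; the dataset, recipe, role, and bucket names are placeholders.

import boto3

databrew = boto3.client("databrew")

# Placeholder names -- replace with your own DataBrew resources.
databrew.create_recipe_job(
    Name="olap-transform-job",
    DatasetName="oltp-orders",
    RecipeReference={"Name": "olap-transform-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    Outputs=[{
        "Format": "CSV",
        "Location": {"Bucket": "example-olap-output-bucket", "Key": "curated/"},
        "Overwrite": True,
    }],
)

# Run the job once on demand; a schedule can run it on a recurring basis.
databrew.start_job_run(Name="olap-transform-job")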

Figure 6. A DataBrew project user interface view with sample data and transformation functions

Scheduled DataBrew jobs act similarly to scheduled ETL pipelines in OLAP. Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). This can be run at a particular time of day, or as defined by a valid CRON expression. This helps you automate your transformation workflows.

The OLAP catalog is a set of metadata that sits between the actual OLAP data stored and applications. To create a Data Catalog, you can use AWS Glue crawlers to automatically classify your data to determine the data’s format, schema, and associated properties. Figure 7 shows the results of a crawler’s results written to Data Catalog as metadata to help data users find the data they need.

Figure 7. AWS Glue crawler metadata table output of column names and data types

Data analytics without third-party software licenses

You can run analytics on your data by referring to the metadata definitions in the Data Catalog as references to the actual data in Amazon S3 using Athena. Athena is well suited for running one-time queries using standard SQL to query the transformed data directly in Amazon S3 without having to move data around. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Enterprises often supplement their OLAP workloads with separate visualization and business intelligence (BI) tools. These tools often come with their own licensing, server management, and security considerations.

You can visualize curated data using QuickSight, a scalable, serverless, embeddable, ML-powered BI service. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights, as shown in Figure 8. These dashboards can be shared with other users and embedded within your own applications.

Figure 8. A sample of data visualization options with Amazon QuickSight

Finally, you can incorporate ML workloads to OLAP workloads using SageMaker. In the past, ML workloads were often expensive, resource-intensive, and inaccessible. SageMaker provides a fully managed ML service to quickly and easily build and train ML models and directly deploy them into a production-ready hosted environment.

Conclusion

In this post, we show you how to connect your on-premises database using a JDBC connection to DataBrew for data profiling, discovery, and transformation. We looked at how you can use DataBrew recipes and jobs to run OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate any data. We also looked at AWS capabilities in data cataloging, visualization, and machine learning using Data Catalog, Athena, QuickSight, and SageMaker without having to manage any servers.

Laying the foundation to modernize an analytics workflow is critical for many enterprises that are looking to reduce the time it takes to understand their business. With AWS, you can perform enterprise-scale analytics with our portfolio of analytics services.

DOCOMO empowers business units with self-service knowledge access thanks to agile AWS QuickSight business intelligence

Post Syndicated from Daiki Itoh original https://aws.amazon.com/blogs/big-data/docomo-empowers-business-units-with-self-service-knowledge-access-thanks-to-agile-aws-quicksight-business-intelligence/

NTT DOCOMO is the largest telecom company in Japan. It provides innovative, convenient, and secure mobile services that enable customers to realize smarter lives. More than 73 million customers in Japan connect through its advanced wireless networks, including a nationwide LTE network and one of the world’s most progressive LTE Advanced networks. In addition to a wide range of communications-related services, DOCOMO offers value-added Smart Life offerings, including those under its +d initiative. These include the d POINT CLUB customer loyalty point program and d Payment, which enables customers to shop online and make electronic payments using their mobile devices.

All of these services create tremendous amounts of data, providing an opportunity for the company to extract insights that drive business value. To accomplish this goal, the company uses Amazon QuickSight and AWS data technologies to help better understand customers and support sales teams.

Many products, one data team

The company’s data team manages the marketing platform, which includes capturing, analyzing, and reporting on data for DOCOMO Smart Life businesses. For data collection and aggregation, it uses Amazon Redshift as a data warehouse. Amazon Redshift is ideal for storing and querying large amounts of structured data with high performance and reliability.

“With millions of connected customers, we need a highly capable data platform,” says Issei Nishimura, Manager, Marketing Platform Planning Department at DOCOMO. “AWS delivers the right levels of performance.”

In addition to regular reporting from the data warehouse, the company’s business leadership is interested in real-time key performance indicators (KPIs). For d Payment, these include metrics such as active users on a monthly and daily basis. Based on these analyses, leadership can decide how to improve the usage or the sales of each service.

Helping business users access analytics

However, when non-technical decision-makers requested self-service access, the data team had no easy way to provide it—until it decided to adopt QuickSight. QuickSight is a fast, cloud-powered business intelligence service that was easy to deploy and required no on-premises infrastructure.

“Because of native integration with existing AWS services, especially Amazon Redshift, we were able to roll out Amazon QuickSight quickly and easily,” Nishimura says. “In fact, it only took one day to build our first QuickSight dashboards.”

The automated data pipeline starts with Amazon Elastic Compute Cloud (Amazon EC2) to perform extract, transform, and load (ETL) on data, which is then pushed to Amazon Redshift. Amazon SageMaker aggregates and exports it to Amazon Simple Storage Service (Amazon S3), from which QuickSight accesses data for dashboards.

The following is a sample dashboard from NTT DOCOMO. For every marketing campaign, NTT DOCOMO analyzes the number of new unique users of d Payment relative to the number of registrations to the campaign. This allows them to understand how much effect the campaign had on each user category.

With pay-per-session pricing, the company can provide ad hoc data access for line of business decision-makers without capital investment and at low total cost. At the same time, QuickSight can scale to support as many users as needed.

The dashboards can be accessed from any device, providing convenience to the product management teams. It’s easy and intuitive enough for non-technical users—there’s no need for them to write SQL or scripts.

Faster insights to compete in an accelerated marketplace

Previously, it would take a few days to meet requests for ad hoc reports. Now, when people want to check a KPI, they can do it instantly.

“Our team can focus on managing the data warehouse and the regular reporting cadence because the volume of out-of-band requests has been reduced,” Nishimura says.

The solution has had immediate benefits for d Payment sales representatives, who work closely with retailers that use the payment service. These sales representatives want to be able to present relevant KPIs and demographics to the retailers to show trends and improve the services. With QuickSight, the sales team can generate appealing, up-to-date visualizations of the relevant information. They no longer have to spend time building graphics because the QuickSight visualizations are ready to use right away.

Summary

DOCOMO is a data-driven company, using insights to continuously improve its services. AWS enables them to run an enterprise data warehouse that ingests millions of data points with unlimited scale—and provides services such as QuickSight that give non-technical users rapid access to the real-time information they need.

“With these solutions, DOCOMO is breaking down barriers to enable greater use of analytics across the organization,” Nishimura says.


About the Authors

Daiki Itoh is a Business Development Manager for Amazon QuickSight in Japan.

Chirag Dhull is a Principal Product Marketing Manager for Amazon QuickSight.

Build a data quality score card using AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight

Post Syndicated from Nitin Aggarwal original https://aws.amazon.com/blogs/big-data/build-a-data-quality-score-card-using-aws-glue-databrew-amazon-athena-and-amazon-quicksight/

Data quality plays an important role when building an extract, transform, and load (ETL) pipeline for sending data to downstream analytical applications and machine learning (ML) models. The adage “garbage in, garbage out” aptly describes why it's important to filter out bad data before further processing. Continuously monitoring data quality and comparing it with predefined target metrics helps you comply with your governance frameworks.

In November 2020, AWS announced the general availability of AWS Glue DataBrew, a new visual data preparation tool that helps you clean and normalize data without writing code. This reduces the time it takes to prepare data for analytics and ML by up to 80% compared to traditional approaches to data preparation.

In this post, we walk through a solution in which we apply various business rules to determine the quality of incoming data and separate good and bad records. Furthermore, we publish a data quality score card using Amazon QuickSight and make records available for further analysis.

Use case overview

For our use case, we use a public dataset that is available for download at Synthetic Patient Records with COVID-19. It contains 100,000 synthetic patient records in CSV format. Data hosted within SyntheticMass has been generated by SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation.

When we unzip the 100k_synthea_covid19_csv.zip file, we see the following CSV files:

  • Allergies.csv
  • Careplans.csv
  • Conditions.csv
  • Devices.csv
  • Encounters.csv
  • Imaging_studies.csv
  • Immunizations.csv
  • Medications.csv
  • Observations.csv
  • Organizations.csv
  • Patients.csv
  • Payer_transitions.csv
  • Payers.csv
  • Procedures.csv
  • Providers.csv
  • Supplies.csv

We perform the data quality checks categorized by the following data quality dimensions:

  • Completeness
  • Consistency
  • Integrity

For our use case, these CSV files are maintained by your organization’s data ingestion team, which uploads the updated CSV file to Amazon Simple Storage Service (Amazon S3) every week. The good and bad records are separated through a series of data preparation steps, and the business team uses the output data to create business intelligence (BI) reports.

Architecture overview

The following architecture uses DataBrew for data preparation and building key performance indicators (KPIs), Amazon Athena for data analysis with standard SQL, and QuickSight for building the data quality score card.

The workflow includes the following steps:

  1. The ingestion team receives CSV files in an S3 input bucket every week.
  2. The DataBrew job scheduled to run every week triggers the recipe job.
  3. DataBrew processes the input files and generates output files that contain additional fields depending on the recipe job logic.
  4. After the output data is written, we create an external table on top of it by creating and running an AWS Glue crawler.
  5. The good and bad records are separated by creating views on top of the external table.
  6. Data analysts can use Athena to analyze good and bad records.
  7. The records can also be separated directly using QuickSight calculated fields.
  8. We use QuickSight to create the data quality score card in the form of a dashboard, which fetches data through Athena.

Prerequisites

Before beginning this tutorial, make sure you have the permissions required to create the resources used in this solution.

Additionally, create the S3 input and output buckets to capture the data, and upload the input data into the input bucket.

Create DataBrew datasets

To create a DataBrew dataset for the patient data, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for this post, Patients).
  4. For Enter your source from S3, enter the S3 path of the patients input CSV.
  5. Choose Create Dataset.

Repeat these steps to create datasets for other CSV files, such as encounters, conditions, and so on.

Create a DataBrew project

To create a DataBrew project for the patient data, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create a project.
  3. For Project name, enter a name (for this post, patients-data-quality).
  4. For Select a dataset, select My datasets.
  5. Select the patients dataset.
  6. Under Permissions, for Role name, choose an AWS Identity and Access Management (IAM) role that allows DataBrew to read from your Amazon S3 input location.

You can choose a role if you already created one, or create a new one. For more information, see Adding an IAM role with data resource permissions.

  1. Wait till the dataset is loaded (about 1–2 minutes).
  2. To create a consistency check, choose the BIRTHDATE column.
  3. On the Create menu, choose Flag column.
  4. Under Create column, for Values to flag, select Custom value.
  5. For Source column, choose BIRTHDATE.
  6. For Values to flag, enter the regular expression (?:(?:18|19|20)[0-9]{2}).
  7. For Flag values as, choose Yes or no.
  8. For Destination column, enter BIRTHDATE_flagged.

The new column BIRTHDATE_FLAGGED now displays Yes for a valid four-digit year within BIRTHDATE.
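
The following small Python sketch illustrates the same consistency check outside DataBrew, using the regular expression from the recipe step above; the sample birthdates are hypothetical.

import re

# Same pattern used in the DataBrew flag column: a four-digit year starting with 18, 19, or 20.
pattern = re.compile(r"(?:(?:18|19|20)[0-9]{2})")

# Hypothetical sample values; in the recipe, the input comes from the BIRTHDATE column.
for birthdate in ["1987-04-12", "0087-04-12", ""]:
    flagged = "Yes" if pattern.search(birthdate) else "No"
    print(birthdate or "<missing>", "->", flagged)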

  1. To create a completeness check, repeat the preceding steps to create a DRIVERS_FLAGGED column by choosing the DRIVERS column to mark missing values.
  2. To create an integrity check, choose the JOIN transformation.
  3. Choose the encounters dataset and choose Next.
  4. For Select join type, select Left join.
  5. For Join keys, choose Id for Table A and Patient for Table B.
  6. Under Column list, unselect all columns from Table B except for Patient.
  7. Choose Finish.
  8. Choose the Patient column and create another flag column PATIENT_FLAGGED to mark missing values from the Patient column.

For our use case, we created three new columns to demonstrate data quality checks for data quality dimensions in scope (consistency, completeness, and integrity), but you can integrate additional transformations on the same or additional columns as needed.

  1. After you finish applying all your transformations, choose Publish on the recipe.
  2. Enter a description of the recipe version and choose Publish.

Create a DataBrew job

Now that our recipe is ready, we can create a job for it, which runs the recipe against the full dataset on the weekly schedule described in the architecture.

  1. On the DataBrew console, choose Jobs.
  2. Choose Create a job.
  3. For Job name, enter a name (for example, patient-data-quality).

Your recipe is already linked to the job.

  1. Under Job output settings, for File type, choose your final storage format (for this post, we choose CSV).
  2. For S3 location, enter your final S3 output bucket path.
  3. For Compression, choose the compression type you want to apply (for this post, we choose None).
  4. For File output storage, select Replace output files for each job run.

We choose this option because our use case is to publish a data quality score card for every new set of data files.

  1. Under Permissions, for Role name, choose your IAM role.
  2. Choose Create and run job.
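
If you prefer to start the recipe job programmatically (for example, from a scheduler), you can do so with the AWS SDK for Python (Boto3). The following is a minimal sketch; it assumes the job name used above.

import boto3

# Start the DataBrew recipe job created in the previous steps.
databrew = boto3.client("databrew")

response = databrew.start_job_run(Name="patient-data-quality")
print("Started job run:", response["RunId"])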

Create an Athena table

If you’re familiar with Apache Hive, you may find that creating tables in Athena works in much the same way. You can create tables by writing a DDL statement in the query editor, or by using the wizard or JDBC driver. To use the query editor, enter the following DDL statement to create a table:

CREATE EXTERNAL TABLE `blog_output`(
  `id` string, 
  `birthdate` string, 
  `birthdate_flagged` string, 
  `deathdate` string, 
  `ssn` string, 
  `drivers` string, 
  `drivers_flagged` string, 
  `passport` string, 
  `prefix` string, 
  `first` string, 
  `last` string, 
  `suffix` string, 
  `maiden` string, 
  `marital` string, 
  `race` string, 
  `ethnicity` string, 
  `gender` string, 
  `birthplace` string, 
  `address` string, 
  `city` string, 
  `state` string, 
  `county` string, 
  `zip` bigint, 
  `lat` double, 
  `lon` double, 
  `healthcare_expenses` double, 
  `healthcare_coverage` double, 
  `patient` string, 
  `patient_flagged` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<your-bucket>/blog_output/';

Let’s validate the table output in Athena by running a simple SELECT query. The following screenshot shows the output.

Create views to filter good and bad records (optional)

To create a good records view, enter the following code:

CREATE OR REPLACE VIEW good_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'Yes' AND
drivers_flagged = 'No' AND
patient_flagged = 'No'

To create a bad records view, enter the following code:

CREATE OR REPLACE VIEW bad_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'No' OR
drivers_flagged = 'Yes' OR 
patient_flagged = 'Yes'

Now you have the ability to query the good and bad records in Athena using these views.
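
As a quick illustration, the following Boto3 sketch runs an ad hoc Athena query against these views to compute the share of bad records. The database name databrew_blog comes from the views above; the S3 output location is a placeholder you should replace.

import time

import boto3

athena = boto3.client("athena")

query = """
SELECT (SELECT count(*) FROM bad_records) * 100.0 /
       ((SELECT count(*) FROM good_records) + (SELECT count(*) FROM bad_records)) AS pct_bad
"""

# Start the query; the output location is a placeholder bucket for Athena results.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "databrew_blog"},
    ResultConfiguration={"OutputLocation": "s3://<your-bucket>/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then read the single result value.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print("Percent bad records:", rows[1]["Data"][0]["VarCharValue"])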

Create a score card using QuickSight

Now let’s complete our final step of the architecture, which is creating a data quality score card through QuickSight by connecting to the Athena table.

  1. On the QuickSight console, choose Athena as your data source.
  2. For Data source name, enter a name.
  3. Choose Create data source.
  4. Choose your catalog and database.
  5. Select the table you have in Athena.
  6. Choose Select.

Now you have created a dataset.

To build the score card, you add calculated fields by editing the dataset blog_output.

  1. Locate your dataset.
  2. Choose Edit dataset.
  3. Choose Add calculated field.
  4. Add the field DQ_Flag with value ifelse({birthdate_flagged} = 'No' OR {drivers_flagged} = 'Yes' OR {patient_flagged} = 'Yes' , 'Invalid', 'Valid').

Similarly, add other calculated fields.

  1. Add the field % Birthdate Invalid Year with value countIf({birthdate_flagged}, {birthdate_flagged} = 'No')/count({birthdate_flagged}).
  2. Add the field % Drivers Missing with value countIf({drivers_flagged}, {drivers_flagged} = 'Yes')/count({drivers_flagged}).
  3. Add the field % Patients missing encounters with value countIf({patient_flagged}, {patient_flagged} = 'Yes')/count({patient_flagged}).
  4. Add the field % Bad records with the value countIf({DQ_Flag}, {DQ_Flag} = 'Invalid')/count({DQ_Flag}).

Now we create the analysis blog_output_analysis.

  1. Change the format of the calculated fields to display the Percent format.
  2. Start adding visuals by choosing Add visual on the + Add menu.

Now you can create a quick report to visualize your data quality score card, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after you receive notification about the completion of the data refresh. If the QuickSight report is running an Athena query for every request, you might see a “table not found” error when data refresh is in progress. We recommend using SPICE storage to get better performance.
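
If you want to trigger the SPICE refresh programmatically after a new DataBrew output lands in Amazon S3, a minimal Boto3 sketch follows. The dataset ID is a placeholder; you can look it up on the dataset details page or through the list_data_sets API.

import uuid

import boto3

quicksight = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Kick off a SPICE ingestion (refresh) for the dataset backing the score card.
response = quicksight.create_ingestion(
    AwsAccountId=account_id,
    DataSetId="<blog-output-dataset-id>",  # placeholder for your dataset ID
    IngestionId=str(uuid.uuid4()),
)
print("Ingestion status:", response["IngestionStatus"])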

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

Conclusion

This post explains how to create a data quality score card using DataBrew, Athena queries, and QuickSight.

This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets. We encourage you to use various built-in transformations to get the maximum value for your project.


About the Authors

Nitin Aggarwal is a Senior Solutions Architect at AWS, where he helps digital native customers architect data analytics solutions and provides technical guidance on various AWS services. He brings more than 16 years of experience in software engineering and architecture roles for various large-scale enterprises.

 

 

 

Gaurav Sharma is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

 

 

 

Vivek Kumar is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

Create threshold-based alerts in Amazon QuickSight

Post Syndicated from Lillie Atkins original https://aws.amazon.com/blogs/big-data/create-threshold-based-alerts-in-amazon-quicksight/

Every business has a set of key metrics that stakeholders focus on to make the most accurate, data-driven decisions, such as sales per week, inventory turnover rate, daily website visitors, and so on. With threshold-based alerts in Amazon QuickSight, we’re making it simpler than ever for consumers of QuickSight dashboards to stay informed about their key metrics. This blog post walks you through the process of setting up threshold-based alerts to track important metrics on QuickSight dashboards.

Set up an alert

Threshold alerts are set up on dashboards and can be created from KPIs or gauge charts. To set a threshold alert, choose the visual and then choose the alert icon. The alert takes into account all of the filters currently applied to the visual and creates a new alert rule.

The following two screenshots show first a KPI visual and then a gauge visual as well as where to locate the alert icon:

You can set up multiple alerts from a visual, which lets you monitor the data for different sets of changes or conditions.

After you choose the alert icon, you must provide a few configuration details. The alert name auto-fills to the name of the visual. The Alert value is the value in the data that the threshold is checked against, based on your rule; this defaults to the primary value of the visual. For the gauge, this means the percent of the goal already achieved (currently 79.81%). For the KPI in the following screenshot, it means the week-over-week difference in the forecasted revenue (currently -$367,456).

Let’s say you want to be alerted whenever the forecasted new monthly revenue dips below $300,000—even if that is multiple times a day. To configure this alert, complete the following steps:

  1. For Alert value¸ choose the actual value rather than the difference value (which was the default).
  2. For Condition, choose Is below.
  3. Enter the value 300,000.
  4. For Notification preference, choose As frequently as possible.
  5. Choose Save.

Let’s now say you change your mind and want to be notified at most once a week when this alert triggers. This is controlled through the notification preference. To make changes to the alert, go to the management portal, which you can find by choosing Alerts on the navigation bar.

Here you can edit, delete, enable, or disable the alert. After the alert has been triggered, you can see it in a list along with other historical (90-day) alerts. This alert doesn’t have any history yet, because it hasn’t been triggered.

To update your notification preference, choose Edit, under Notification preference choose Weekly at most, and then choose Save.

When an alert is triggered, you receive an email notification customized to the alert value you configured. You can quickly get to the dashboard by choosing View Dashboard.

Alerts are created based on the visual at that point in time and don’t update with changes to the visual in the future. This means the visual can change or be deleted and the alert continues to work as long as the data in the dataset remains valid.

The evaluation schedule for threshold alerts is based on the dataset. For SPICE datasets, alert rules are checked against the data after a successful data refresh. For datasets that query your data sources directly, alerts are evaluated daily at a random time between 6 PM and 8 AM, based on the Region of the dataset. We’re working on a control for direct query dataset owners to set their own schedules for checking alerts and increase the frequency up to hourly.

The admin for the QuickSight account can restrict who has access to set threshold alerts through custom permissions. For more information, see Customizing user permissions in Embed multi-tenant analytics in applications with Amazon QuickSight.

Pricing and availability

Threshold alerts are billed for each evaluation, and follow the familiar pricing used for anomaly detection, starting at $0.50 per 1,000 evaluations. For example, if you set up an alert on a SPICE dataset that refreshes daily, you have 30 evaluations of the alert rule in a month, which costs 30 * $0.5/1000 = $0.015 in a month. For more information, see Amazon QuickSight Pricing.

Threshold alerts are a QuickSight Enterprise Edition feature and available for dashboards consumed in the QuickSight website. Threshold alerts aren’t yet available in embedded QuickSight dashboards or on the mobile app.

Conclusion

In this post, we demonstrated how to set up threshold-based alerts to track important metrics on QuickSight dashboards. This makes it even easier for consumers of QuickSight dashboards to stay up to date on their key metrics. For more information, see the Amazon QuickSight Documentation.


About the Author

Lillie Atkins is a Product Manager for Amazon QuickSight, Amazon Web Services’ cloud-native, fully managed BI service.

 

Securely analyze your data with AWS Lake Formation and Amazon QuickSight

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/securely-analyze-your-data-with-aws-lake-formation-and-amazon-quicksight/

Many useful business insights can arise from analyzing customer preferences, behavior, and usage patterns. With this information, businesses can innovate faster and improve the customer experience, leading to better engagement and accelerating product adoption. More and more businesses are looking for ways to securely store and restrict access to customer data, which may include personally identifiable information (PII) and other sensitive information. Accessing customer data for use cases such as analytics and Machine Learning requires careful management of access permissions and the ability to audit usage.

This post explores the ways in which you can securely store, analyze, and visualize sensitive customer data. We start by storing encrypted sample test data in our Amazon Simple Storage Service (Amazon S3) based data lake. We use AWS Lake Formation to configure fine-grained permissions to restrict user access, and finally analyze the data and visualize business insights using Amazon QuickSight.

Prerequisites

For this post, you should be familiar with the following:

The AWS CloudFormation template associated with this post automatically sets up the different architecture components. You then need to take additional steps to configure user permissions directly in Lake Formation. If you already manage access using AWS Glue resource permissions and IAM, launching this CloudFormation template in the same account could result in access being denied to existing users. We recommend launching this template in a new AWS account or one not running business-critical workloads.

Architecture overview

The following diagram illustrates our solution’s architecture.

The workflow includes the following steps:

  1. Use Lake Formation to create a data lake on Amazon S3 that is protected with IAM and encrypted with AWS Key Management Service (AWS KMS).
  2. AWS Glue crawlers scan your datasets and populate the Data Catalog.
  3. The Data Catalog serves as a central repository to store the metadata of the datasets.
  4. Athena accesses the data for ad hoc queries, using the Data Catalog.
  5. You can securely visualize your data with QuickSight.

Use case and business value

The following use case helps illustrate the challenge we’re trying to solve.

Assume you run an ecommerce company and, to help improve the customer experience, you need to collect and store customers’ purchasing history. It’s common for the credit card holder to be different from the individual purchasing a product. If you can identify the relationship between the account holder and the cardholder, you may be able to create targeted recommendations.

For example, the account holder and cardholder share the same last name and home address but have a different first name, and the cardholder’s age is greater. You can use this information to deduce with high certainty that the account holder is using the credit card of a family member. You could analyze the account holder’s purchasing history and correlate it with third-party information collected on the cardholder to create a holistic view of the customer and their circle of close family influencers.

With this new information, you can now make personal product recommendations to improve the customer experience. In the following sections, we demonstrate how to accomplish this by setting up a secure data lake, encrypting data at rest, masking sensitive fields, and restricting access to data using fine-grained permissions.

For similar customer use cases related to security and data analytics on AWS, see AWS Cloud Security.

Set up the environment

After you sign in to your test AWS account, launch the CloudFormation template by choosing Launch Stack:  

This template configures the following resources:

  • An encrypted S3 bucket representing the data lake that is registered with Lake Formation
  • Two IAM groups that represent the data lake admin and analyst
  • Two IAM users ([email protected] and [email protected])
  • IAM roles and policies that go along with these services
  • An AWS Glue crawler to populate the Data Catalog with tables
  • An AWS Glue role for running within the data lake
  • An AWS Glue extract, transform, and load (ETL) job to mask the credit card information
  • A KMS customer master key (CMK) for the Amazon S3 server-side encryption (SSE)

When you launch the template, enter values for the following parameters:

  • AdminPassword – An initial password for the [email protected] user
  • AnalystPassword – An initial password for the [email protected] user
  • BucketNameParam – A unique name to be used for the data lake bucket
  • CFNExecRoleOrUserArnParam – The ARN of the principal (user or role) executing the CloudFormation stack. You can find the ARN of the logged-in user on the IAM console

The passwords you provide must comply with your account’s password policy. When you later log in as those users, you’re asked to reset these passwords. Make note of the passwords you choose.

Configure your data lake’s security

This section doesn’t cover all the available security offerings in AWS, but rather focuses on restricting user access and protecting data at rest. You should always start by securing all access to your data and resources. It’s important to create policies that use least privilege from the start, even before any data lands in the data lake. You can do this with IAM and AWS KMS.

Set up IAM access policies for your users

First, consider the types of users that access the AWS account that hosts a data lake and use analytic services to access the data assets in the data lake. For this post, we discuss two types of AWS users:

  • Data lake administrator – Responsible for configuring the data lake and managing Lake Formation permissions to the data lake users
  • Data analyst – Responsible for data visualization

When you work in your production account (not the test account), you consider a third user type: the data engineer. The data engineer is responsible for data transformation, loading data into Amazon S3, and making the necessary changes to the data structures so the analysts don’t have access to sensitive data directly. For this post, the data lake administrator acts as the data engineer.

On the IAM console, on the Users page, you can see the following users created by the CloudFormation template.

The recommended best practice is to provide your users with the least access privileges.

For central access control for data in the data lake, we use the Lake Formation permission model. This requires us to create a data lake administrator who can grant any principal (including themselves) any permission on any Data Catalog resource and data location. The designated data lake administrator can then grant more granular permissions on resources to other principals. The permission scope of the data lake administrator should be limited to managing the data lake within Lake Formation and the AWS Glue Data Catalog. The data lake administrator should be able to add existing IAM users and roles to the data lake, but not create IAM users. The permissions scope of the data engineer user should be limited to the IAM policies required to do their job, following the principle of least privilege. Specifically, this means running ETL jobs to mask sensitive information in the data records prior to analysis. Also, because the aim of the use case is to maintain compliance, it’s imperative that no individual users have access to cardholder data, either visually or programmatically.

It’s considered a best practice to use AWS managed policies whenever possible to reduce the operational overhead of maintaining your own policies. The data lake admin group role should be granted the AWSLakeFormationDataAdmin, AWSGlueConsoleFullAccess, and AmazonS3ReadOnlyAccess managed policies to administer data pipelines with AWS Glue, configure Lake Formation permissions, run AWS Glue crawlers, and store data on Amazon S3. For more information about IAM managed permissions, see Data Lake Administrator Permissions.

Encryption at rest within the data lake

In addition to using IAM to delegate permissions to your data lake, we recommend encrypting the data at rest. In this post, Amazon S3 acts as the basis for storage of our data lake. Amazon S3 offers several options for encryption based on your preference and compliance requirements. For more information, see Protecting data using encryption.

In this post, we use server-side encryption with AWS KMS (SSE-KMS) because it provides low operational overhead. SSE-KMS uses a CMK stored in AWS KMS to protect customer data. AWS KMS provides fine-grained audit and access control over requests for encryption keys. SSE-KMS and AWS Glue both use data keys protected by your CMK. You can view the default encryption details by navigating to the Amazon S3 console, choosing the bucket created as part of the CloudFormation template, and viewing its properties.

Encryption in transit within the data lake

After you secure the data at rest, let’s make sure that the S3 service endpoints are protected with SSL. By default, public S3 service endpoints can be accessed via HTTPS or HTTP. Therefore, to ensure that traffic is encrypted in transit, use the default HTTPS endpoint for the Amazon S3 API. To ensure that data in Amazon S3 is protected with SSL, you must force SSL on the bucket using a bucket policy. This policy is enforced on all of your buckets where encryption is required to meet your security needs. For details on encrypting your data in transit as you move your data into your data lake in the AWS Well-Architected Framework, see How do you anticipate, respond to, and recover from incidents?
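
The following Boto3 sketch applies such a policy, denying any request that is not sent over SSL. The bucket name is a placeholder for your data lake bucket; if the bucket already has a policy, merge this statement into it instead of overwriting it.

import json

import boto3

s3 = boto3.client("s3")
bucket = "<your-data-lake-bucket>"  # placeholder

# Deny all S3 actions on the bucket and its objects when the request does not use SSL.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))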

Preprocess data to restrict access to sensitive data

Not all consumers of the data are allowed to see sensitive information, so we need to split the data in a way that gives us more control over how data is exposed to users. To do that, we transform it using PySpark running in AWS Glue. This allows us to aggregate the data for our analysts and mask the sensitive information before saving it back to Amazon S3 for analysis.
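
The following is a minimal PySpark sketch of the kind of masking the AWS Glue ETL job performs: it keeps only the last four digits of the card number before writing the cleaned data back to Amazon S3. The column name card_number and the S3 paths are assumptions for illustration; the actual job created by the CloudFormation template may differ.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-card-numbers").getOrCreate()

# Read the raw cards data from the data lake (assumed path and schema).
cards = spark.read.option("header", True).csv("s3://<your-data-lake-bucket>/cards/")

# Replace the full card number with a masked value that keeps only the last four digits.
masked = cards.withColumn(
    "card_number",
    F.concat(F.lit("XXXX-XXXX-XXXX-"), F.substring(F.col("card_number"), -4, 4)),
)

masked.write.mode("overwrite").option("header", True).csv(
    "s3://<your-data-lake-bucket>/clean_cards/"
)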

Configure Lake Formation

First, we use Lake Formation to create a central data lake repository on Amazon S3 to store and analyze your data.

  1. On the Lake Formation console, under Data Catalog, choose Settings.
  2. Deselect the two check boxes associated with the Data Catalog using only IAM permissions.

To maintain backward compatibility with AWS Glue, these settings are enabled by default for new Data Catalog resources. These settings effectively cause access to Data Catalog resources to be controlled solely by IAM policies. Deselect the permissions so that you can give individual permissions to Data Catalog resources from Lake Formation.

  1. Choose Save.

Next, we revoke generic IAM principal access to tables in our database. This makes sure that only permissions applied using Lake Formation will apply.

  1. On the Lake Formation console, choose Administrative roles and tasks.
  2. Under Database Creators, select IAMAllowedPrincipals.
  3. Choose Revoke.
  4. For Catalog permissions, select Create database.
  5. Leave all other settings at their default and choose Revoke.

We now need to revoke permissions for IAMAllowedPrincipals.

  1. On the Data permissions page, revoke all grants to the group IAMAllowedPrincipals.

Lake Formation is now the single pane of glass for data governance within your data lake. To configure user permissions in Lake Formation, you must be a data lake admin. The CloudFormation template already created [email protected] as our data lake admin. When you’re logged in as the admin, you need to grant them the ability to manage permissions for users.

  1. On the IAM console, choose Users.
  2. Choose the [email protected] user.
  3. On the Security Credentials tab, copy the link for that user to log in.
  4. Open the link in a new browser or private browser window.
  5. Reset the password (on your first login).
  6. On the Lake Formation console, choose Data permissions.
  7. Choose Grant.
  8. Make sure the admin user has both database and grantable Super permissions on the db1 database.
  9. On the Databases page, select the db1 database.
  10. On the Actions menu, choose Edit.
  11. Choose the S3 bucket created as part of the CloudFormation stack as the database location.

The naming convention of the S3 bucket is <account_number><region_name><bucket_name>. For example, it should appear similar to 111122223333virginiamybucketname. Do not choose the bucket name ending with athenaoutput.

  1. Deselect Use only IAM access control for new tables in this database.
  2. Choose Save.

After this step, if you see IAMAllowedPrincipals under Data permissions, follow the steps explained earlier to revoke its permissions.

The next permission we want to grant is the ability for our AWS Glue execution role to create new tables in our db1 database.

  1. On the Data permissions page, choose Grant.
  2. For IAM users and roles, choose the AWS Glue role created as part of the CloudFormation stack.
  3. For Database, choose the db1 database.
  4. For Database permissions, select Create table.
  5. Make sure that no options are selected for Grantable permissions.
  6. Choose Grant.

Now that the data lake admin is set up and Lake Formation is managing permissions, we can create table definitions for the cards, customers, and sales data in the Lake Formation Data Catalog. Let’s verify the files created by the CloudFormation template in the S3 bucket folders.

  1. On the Amazon S3 console, choose the bucket that you chose for the db1 location.

The following CSV files are in their respective folders cards, customers, and sales:

  • cards.csv
  • customers.csv
  • sales.csv

Now that we’ve verified the files, let’s catalog it in the Lake Formation Data Catalog using AWS Glue crawlers.

  1. On the AWS Glue console, choose Crawlers.
  2. Select the crawler security-blog-crawler and choose Run crawler.

This crawler was created by the CloudFormation template. It can crawl multiple data stores like cards, customers, and sales to populate the Data Catalog.

After you run the crawler, you should see the tables in the Data Catalog. To view the tables, switch to the Lake Formation console, choose Databases, select the db1 database, and choose View tables.

Configure data access controls

Now that our datasets are cataloged, it’s important to define access controls to ensure only authorized users are allowed to see sensitive information. We use Lake Formation to define these fine-grained controls for users who need access to data in the data lake.

Configure data encryption for AWS Glue

We use an AWS Glue security configuration to secure the data accessed by AWS Glue crawlers and ETL jobs, the data written by the ETL jobs to Amazon S3 targets, and the logs sent to Amazon CloudWatch. The security settings were automatically configured by the CloudFormation template and can be viewed on the AWS Glue console.

Process the result set

Our dataset includes information about our customer demographics and references between customers that share credit cards when making purchases. We develop a simple job using PySpark to combine the purchasing user’s information with the cardholder. You can perform other transformations and enrichment such as masking sensitive fields or looking up additional details in external systems. When the job is complete, it outputs the data in columnar format to give us better performance and reduce cost when we later analyze it. For more information on how this process works and the tools required, see How to extract, transform, and load data for analytic processing using AWS Glue (Part 2).

To demonstrate this capability in action, you run the AWS Glue ETL jobs created by the CloudFormation template. To run the script, you log in as an admin user, but ideally, a data engineer would manage the ETL at this point. For the sake of simplicity, we configured the data lake administrator to have these permissions.

Let’s run the ETL jobs to clean the cards and sales data. They create new files under the clean_cards and clean_sales S3 folders with the modifications. We start with cleaning the card data. The job replaces full card numbers with the last four digits of the card numbers and creates a new file in the clean_cards folder.

  1. Make sure you’re signed in as the data lake admin with username [email protected].
  2. On the AWS Glue console, choose Jobs.
  3. Select the job clean_cards_data and on the Action menu, choose Run job.
  4. Expand Security configuration, script libraries, and job parameters.
  5. Under Job parameters, add the key --output_s3_bucket_name and the value as the bucket name that contains the CSV files.
  6. Choose Run job.

Next, we clean up our sales data. The dollar amounts for the purchase prices are cast as strings with a dollar sign ($) in them. To make analytics easier downstream, we want those cast as decimals without the dollar signs.

  1. Follow the same procedure to run the clean_sales_data job.

Now that we have generated our clean cards and clean sales data in the S3 bucket, we run security-blog-crawler to add the clean cards and clean sales tables to our Data Catalog.

  1. In the navigation pane, choose Crawlers.
  2. Select the crawler called security-blog-crawler and choose Run crawler.

Now that we have our new tables with masked card data and cleaned sales data, you grant the analyst user permission to access it in Lake Formation.

  1. On the Lake Formation console, grant the Select permission to the clean_cards and clean_sales tables for the user [email protected].

This completes the permissions scope for the analyst user.

Query and visualize data with QuickSight

Now that our data is ready, we use QuickSight to visualize the data. We first add [email protected] to QuickSight as an author who can create different analyses for viewers. We use Lake Formation fine-grained permissions to grant secure access to the analyst user, who can prepare analyses in QuickSight.

You need to be logged in as your account administrator, not the analyst or data lake admin. Your account must be subscribed to QuickSight as an Enterprise user to enable integration with Lake Formation fine-grained access control.

Enable fine-grained permission for QuickSight users

Fine-grained permissions defined in Lake Formation are mapped to QuickSight authors or a group of viewers. To enable fine-grained permissions for viewers, you must first add them to a QuickSight group that can be configured in Lake Formation with fine-grained permissions. For this post, we create the QuickSight user [email protected].

  1. Sign in to your AWS account with the admin user (not the Lake Formation admin user).
  2. In a new tab, open the QuickSight console.
  3. Choose the logged-in user and choose Manage QuickSight.
  4. Add the user [email protected] with the author role.

Now we create the QuickSight group Analyst and add the QuickSight user [email protected] to the group. We use the AWS Command Line Interface (AWS CLI) for this purpose.

  1. Run the following command to create the group (provide your account ID):
    aws quicksight create-group --aws-account-id=<account_id> --namespace=default --group-name="Analyst" --description="Data Analyst Group"

  2. Run the following command to add the user to the group:
    aws quicksight create-group-membership --group-name 'Analyst' --aws-account-id <account_id> --namespace default [email protected]

  3. Run the following command to get the ARN of the QuickSight group:
    aws quicksight describe-group --group-name=Analyst --aws-account-id <account_id> --namespace=default

  4. Make a note of this ARN.

We use this ARN to configure access permissions to this QuickSight group in Lake Formation.

Because we configured the data lake bucket and Athena output bucket with CMKs, we need to grant the following key operations to the QuickSight role.

  1. Enter the following AWS CLI command to create a grant for the QuickSight role that was created when you subscribed to QuickSight (also provide the KMS key ID, created by the CloudFormation stack):
    aws kms create-grant --key-id <kms-key> --grantee-principal arn:aws:iam::<accountid>:role/service-role/aws-quicksight-s3-consumers-role-v0 --operations Decrypt Encrypt DescribeKey GenerateDataKey GenerateDataKeyPair

  2. Sign in with the [email protected] user.
  3. On the Lake Formation console, choose Data permissions.
  4. Choose Grant.
  5. For SAML and Amazon QuickSight users and groups, enter the Analyst group ARN you copied earlier.
  6. For Database, choose db1.
  7. For Tables, choose clean_cards and clean_sales.
  8. For Table permissions, select Select.
  9. Choose Grant.

Now let’s grant permissions to the customers table by excluding the address and email fields.

  1. On the Data permissions page, choose Grant.
  2. For SAML and Amazon QuickSight users and groups, enter the Analyst group ARN you copied earlier.
  3. For Database, choose db1.
  4. For Tables, choose customers.
  5. For Columns, choose Exclude columns.
  6. For Exclude columns, choose address and email.
  7. For Table permissions, select Select.
  8. Choose Grant.

Now the Analyst group has the Lake Formation permission for the proper datasets.
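
You can apply the same grants programmatically. The following Boto3 sketch grants SELECT on the customers table to the Analyst group while excluding the address and email columns; the group ARN placeholder is the one you noted earlier, and the database and table names are the ones used in this post.

import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on the customers table, excluding the sensitive columns,
# to the QuickSight Analyst group (use the group ARN noted earlier).
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "<analyst-group-arn>"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "db1",
            "Name": "customers",
            "ColumnWildcard": {"ExcludedColumnNames": ["address", "email"]},
        }
    },
    Permissions=["SELECT"],
)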

Additional security measures

In addition to configuring permissions for QuickSight, you can implement security measures to ensure that you’re visualizing sensitive data properly.

  1. On the QuickSight console, choose Security & permissions.

Because we’re using Athena to connect to source data in data lake, we grant write permissions to the S3 bucket for the Athena query result output. The naming convention of the Athena S3 bucket is <account_number><region_name><bucket_name>. For example, it should appear similar to 111122223333virginiamybucketname.

Visualize the user data

This post would be incomplete without some data science that demonstrates how we can create correlations between users. We have secured our infrastructure, scrubbed our sensitive data in preprocessing, output the results to columnar tables in our AWS Glue Data Catalog managed by Lake Formation, and configured security settings in QuickSight. The next step is to import the dataset into QuickSight to analyze the data. For more information, see Working with Datasets.

In our use case, we logically grouped our users and cardholders into a columnar format in preprocessing, which we can now use to sort through and derive value. In QuickSight, we use the heat map visualization to map the buyer to the number of times they used someone else’s credit card. This gives us a visual of which users used different credit cards the most.

  1. Sign in using [email protected].
  2. On the QuickSight console, choose New analysis.
  3. Choose New datasets.
  4. For Create a Dataset, choose Athena.
  5. Enter ABCCompany Analysis as the data source.
  6. Choose Create data source.
  7. Choose the database db1.

As shown in the following screenshot, you’re only shown the table permissions granted to the QuickSight Analyst group.

  1. Choose Use custom SQL.
  2. Enter the query name as Sales-Cards-Query.

We now run queries to verify that the analyst user doesn’t have access to the cards and sales tables.

  1. Enter the following query for the cards table:
    SELECT * from "db1"."cards"

The following screenshot shows that we get a permission error.

  1. Similarly, you can verify the permissions for the sales table by running the following query. You should see the same permission error as for the cards table.
    SELECT * from "db1"."sales"

  2. Enter the following query for the customers table:
    SELECT * from "db1"."customers"

The following screenshot shows that the analyst only has access to customer fields other than address and email.

The analyst only has access to the clean_sales, clean_cards, and customers tables (excluding the address and email fields).

  1. Enter the following SQL and choose Edit/Preview Data:
    SELECT "db1"."clean_sales"."customer_id", COUNT("db1"."clean_sales"."customer_id") as num_diff_card_used FROM "db1"."clean_sales" JOIN "db1"."customers" ON "db1"."clean_sales"."customer_id"="db1"."customers"."customer_id" AND "db1"."clean_sales"."card_id" != "db1"."customers"."card_id" GROUP BY "db1"."clean_sales"."customer_id" ORDER BY num_diff_card_used DESC

The following screenshot shows our query results.

  1. Choose Save & visualize to create a visualization.
  2. Choose + Add and choose Add visual.
  3. Choose the heat map visual type.
  4. Set Rows to customer_id.
  5. Set Columns and Values to num_diff_card_used.
  6. On the Values drop-down menu, choose Aggregate sum.

The following screenshot shows our QuickSight analysis. You can change the color by choosing Format visual.

From this visualization, we can see that several customers are making purchases with more than one card that they don’t own. We can also add further visualizations that add more context to our data like customer IDs and the total number of purchases made with cards that customers don’t own.

The following are some of the additional datasets and visualizations that you can add to your analysis.

This data may provide valuable insights into the relationships between users and also provide a starting point for forensic investigations into customers that may be making fraudulent purchases.

For instructions on creating a similar dataset, see Creating a Dataset Using Amazon Athena Data. For instructions on creating visualizations with your datasets, see Creating an Amazon QuickSight Visual.

Clean up

When you’re done with this exercise, complete the following steps to delete your resources and stop incurring costs:

  1. On the Amazon S3 console, delete the objects in the buckets created by the CloudFormation stack.
  2. On the CloudFormation console, select your stack and choose Delete.

This cleans up all the resources created by the stack.

  1. Cancel your QuickSight account.

Conclusion

In this post, we used an example of credit card purchases to discuss different ways to protect sensitive data, based on separation of responsibilities with least privileges, encryption, and fine-grained access control. With AWS, you gain the control and confidence you need to securely run your business with the most flexible and secure cloud computing environment available today. To show this in action, we created separate IAM users and granted permissions based on the principle of least privilege to allow users to perform their duties. We reviewed the required AWS KMS configuration to protect data at rest using server-side encryption. We used AWS Glue to prepare the data, remove sensitive information, and further protect data access using Lake Formation’s fine-grained access controls. After we applied the security controls, we showed you how the analyst user can safely create different analyses using QuickSight for visualization.




About the Authors

Julia Soscia is a Solutions Architect Manager with Amazon Web Services on the Startup team, based out of New York City. Her main focus is to help startups create well-architected environments on the AWS cloud platform and build their business. She enjoys skiing on the weekends in Vermont and visiting the many art museums across New York City.

 

 

 

Mitesh Patel is a Senior Solutions Architect at AWS. He works with customers in SMB to help them develop scalable, secure and cost effective solutions in AWS. He enjoys helping customers in modernizing applications using microservices and implementing serverless analytics platform.

Run usage analytics on Amazon QuickSight using AWS CloudTrail

Post Syndicated from Sunil Salunkhe original https://aws.amazon.com/blogs/big-data/run-usage-analytics-on-amazon-quicksight-using-aws-cloudtrail/

Amazon QuickSight is a cloud-native BI service that allows end users to create and publish dashboards in minutes, without provisioning any servers or requiring complex licensing. You can view these dashboards on the QuickSight product console or embed them into applications and websites. After you deploy a dashboard, it’s important to assess how it and other assets are being adopted, accessed, and used across various departments or customers.

In this post, we use a QuickSight dashboard to present the following insights:

  • Most viewed and accessed dashboards
  • Most updated dashboards and analyses
  • Most popular datasets
  • Active users vs. idle users
  • Idle authors
  • Unused datasets (wasted SPICE capacity)

You can use these insights to reduce costs and create operational efficiencies in a deployment. The following diagram illustrates this architecture.


Solution components

The following list summarizes the AWS services and resources that this solution uses:

  • AWS CloudTrail logs (CloudTrailMultiAccount) – Capture all API calls for all AWS services across all AWS Regions for this account. You can use AWS Organizations to consolidate trails across multiple AWS accounts.
  • AWS Glue crawlers (QSCloudTrailLogsCrawler and QSProcessedDataCrawler) – Ensure that all CloudTrail data is crawled periodically and that partitions are updated in the AWS Glue Data Catalog.
  • AWS Glue ETL job (QuickSightCloudTrailProcessing) – Reads catalogued data from the crawler, processes, transforms, and stores it in an S3 output bucket.
  • AWS Lambda function (ExtractQSMetadata_func) – Extracts event data using the AWS SDK for Python, Boto3. The event data is enriched with QuickSight metadata objects like user, analysis, datasets, and dashboards.
  • Amazon Simple Storage Service (Amazon S3) buckets (CloudTrailLogsBucket and QuickSight-BIonBI-processed) – One bucket stores CloudTrail data. The other stores processed data.
  • Amazon QuickSight analysis (Quicksight_BI_On_BO_Analysis) – Visualizes the processed data.

 Solution walkthrough

AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account. You can use CloudTrail to log, continuously monitor, and retain account activity related to actions across your AWS infrastructure. You can define a trail to collect API actions across all AWS Regions. Although we have enabled a trail for all Regions in our solution, the dashboard shows data for a single Region only.

After you enable CloudTrail, it starts capturing all API actions and then, at 15-minute intervals, delivers logs in JSON format to a configured Amazon Simple Storage Service (Amazon S3) bucket. Before the logs are made available to our ad hoc query engine, Amazon Athena, they must be parsed, transformed, and processed by the AWS Glue crawler and ETL job.


The AWS Glue crawler and AWS Glue ETL job handle this step. The AWS Glue crawler crawls through the data every day and populates new partitions in the Data Catalog. The data is later made available as a table on the Athena console for processing by the AWS Glue ETL job QuickSightCloudtrail_GlueJob.txt, which filters the logs and processes only those events where the event source is QuickSight (that is, eventSource = 'quicksight.amazonaws.com').


The following screenshot shows the sample JSON for the QuickSight API calls.


The job processes those events and creates a Parquet file with the following fields and data types:

  • eventtime – Datetime
  • eventname – String
  • awsregion – String
  • accountid – String
  • username – String
  • analysisname – String
  • date – Date

The processed data is stored in an S3 folder at s3://<BucketName>/processedlogs/. For performance optimization during querying and when connecting this data to QuickSight for visualization, these logs are partitioned by the date field. For this reason, we recommend that you configure the AWS Glue crawler to detect the new data and partitions and update the Data Catalog for subsequent analysis. We have configured the crawler to run once a day.

We need to enrich this log data with metadata from QuickSight, such as a list of analyses, users, and datasets. This metadata can be extracted using describe_analysis, describe_user, and describe_data_set in the AWS SDK for Python (Boto3).

We provide an AWS Lambda function that is ideal for this extraction. We configured it to be triggered once a day through Amazon EventBridge. The extracted metadata is stored in the S3 folder at s3://<BucketName>/metadata/.
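
A minimal Boto3 sketch of that kind of extraction follows; it lists the analyses in the account and writes the result to the metadata folder. The bucket name and file name are placeholders, and the Lambda function provided with the solution is more complete (it also extracts users and datasets).

import json

import boto3

quicksight = boto3.client("quicksight")
s3 = boto3.resource("s3")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Collect analysis summaries, following pagination tokens.
analyses, token = [], None
while True:
    kwargs = {"AwsAccountId": account_id}
    if token:
        kwargs["NextToken"] = token
    page = quicksight.list_analyses(**kwargs)
    analyses.extend(page.get("AnalysisSummaryList", []))
    token = page.get("NextToken")
    if not token:
        break

# Store the metadata for later enrichment in Athena.
s3.Object("<BucketName>", "metadata/analyses.json").put(
    Body=json.dumps(analyses, default=str)
)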

Now that we have processed logs and metadata for enrichment, we need to prepare the data visualization in QuickSight. Athena allows us to build views that can be imported into QuickSight as datasets.

We build the following views based on the tables populated by the Lambda function and the ETL job:

CREATE VIEW vw_quicksight_bionbi 
AS 
  SELECT Date_parse(eventtime, '%Y-%m-%dT%H:%i:%SZ') AS "Event Time", 
         eventname  AS "Event Name", 
         awsregion  AS "AWS Region", 
         accountid  AS "Account ID", 
         username   AS "User Name", 
         analysisname AS "Analysis Name", 
         dashboardname AS "Dashboard Name", 
         Date_parse(date, '%Y%m%d') AS "Event Date" 
  FROM   "quicksightbionbi"."quicksightoutput_aggregatedoutput" 

CREATE VIEW vw_users 
AS 
  SELECT usr.username "User Name", 
         usr.role     AS "Role", 
         usr.active   AS "Active" 
  FROM   (quicksightbionbi.users 
          CROSS JOIN Unnest("users") t (usr)) 

CREATE VIEW vw_analysis 
AS 
  SELECT aly.analysisname "Analysis Name", 
         aly.analysisid   AS "Analysis ID" 
  FROM   (quicksightbionbi.analysis 
          CROSS JOIN Unnest("analysis") t (aly)) 

CREATE VIEW vw_analysisdatasets 
AS 
  SELECT alyds.analysesname "Analysis Name", 
         alyds.analysisid   AS "Analysis ID", 
         alyds.datasetid    AS "Dataset ID", 
         alyds.datasetname  AS "Dataset Name" 
  FROM   (quicksightbionbi.analysisdatasets 
          CROSS JOIN Unnest("analysisdatasets") t (alyds)) 

CREATE VIEW vw_datasets 
AS 
  SELECT ds.datasetname AS "Dataset Name", 
         ds.importmode  AS "Import Mode" 
  FROM   (quicksightbionbi.datasets 
          CROSS JOIN Unnest("datasets") t (ds))

QuickSight visualization

Follow these steps to connect the prepared data with QuickSight and start building the BI visualization.

  1. Sign in to the AWS Management Console and open the QuickSight console.

You can set up QuickSight access for end users through SSO providers such as AWS Single Sign-On (AWS SSO), Okta, Ping, and Azure AD so they don’t need to open the console.


  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset to create a dataset for our analysis.


  1. For Create a Data Set, choose Athena.

In the previous steps, we prepared all our data in the form of Athena views.

  1. Configure permission for QuickSight to access AWS services, including Athena and its S3 buckets. For information, see Accessing Data Sources.


  1. For Data source name, enter QuickSightBIbBI.
  2. Choose Create data source.


  1. On Choose your table, for Database, choose quicksightbionbi.
  2. For Tables, select vw_quicksight_bionbi.
  3. Choose Select.


  1. For Finish data set creation, there are two options to choose from:
    1. Import to SPICE for quicker analytics – Built from the ground up for the cloud, SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, and machine code generation to run interactive queries on large datasets and get rapid responses. We use this option for this post.
    2. Directly query your data – You can connect to the data source in real time, but if the data query is expected to bring bulky results, this option might slow down the dashboard refresh.
  2. Choose Visualize to complete the data source creation process.


Now you can build your visualization sheets. QuickSight refreshes the data source first. You can also schedule a periodic refresh of your data source.


The following screenshot shows some examples of visualizations we built from the data source.


 

This dashboard presents us with two main areas for cost optimization:

  • Usage analysis – We can see how analyses and dashboards are being consumed by users. This area highlights the opportunity for cost savings by looking at datasets that have not been used in any analysis for the last 90 days but are still holding a major chunk of SPICE capacity.
  • Account governance – Because author subscriptions are charged on a fixed fee basis, it’s important to monitor if they are actively used. The dashboard helps us identify idle authors for the last 60 days.

Based on the information in the dashboard, we could, for example, remove unused datasets to free up SPICE capacity and review the subscriptions of idle authors to save costs.

Conclusion

In this post, we showed how you can use CloudTrail logs to review the use of QuickSight objects, including analyses, dashboards, datasets, and users. You can use the information available in the dashboards to save money on storage and subscriptions, understand the maturity of QuickSight adoption, and more.


About the Author

Sunil Salunkhe is a Senior Solutions Architect working with Strategic Accounts on their vision to leverage the cloud to drive aggressive growth strategies. He practices customer obsession by solving their complex challenges in all aspects of the cloud journey, including scale, security, and reliability. While not working, he enjoys playing cricket and cycling with his wife and son.

Building an administrative console in Amazon QuickSight to analyze usage metrics

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/building-an-administrative-console-in-amazon-quicksight-to-analyze-usage-metrics/

Given the scalability of Amazon QuickSight to hundreds and thousands of users, a common use case is to monitor QuickSight group and user activities, analyze the utilization of dashboards, and identify usage patterns of an individual user and dashboard. With timely access to interactive usage metrics, business intelligence (BI) administrators and data team leads can efficiently plan for stakeholder engagement and dashboard improvements. For example, you can remove inactive authors to reduce license cost, as well as analyze dashboard popularity to understand user acceptance and stickiness.

This post demonstrates how to build an administrative console dashboard and serverless data pipeline. We combine QuickSight APIs with AWS CloudTrail logs to create the datasets to collect comprehensive information of user behavior and QuickSight asset usage patterns.

This post provides a detailed workflow that covers the data pipeline, sample Python code, and a sample dashboard of this administrative console. With the guidance of this post, you can configure this administrative console in your own environment.

Let’s look at Forwood Safety, an innovative, values-driven company with a laser focus on fatality prevention. An early adopter of QuickSight, they have collaborated with AWS to deploy this solution to collect BI application usage insights.

“Our engineers love this admin console solution,” says Faye Crompton, Leader of Analytics and Benchmarking at Forwood. “It helps us to understand how users analyze critical control learnings by helping us to quickly identify the most frequently visited dashboards in Forwood’s self-service analytics and reporting tool, FAST.”

Solution overview

The following diagram illustrates the workflow of the solution.


The workflow involves the following steps:

  1. The AWS Lambda function Data_Prepare is scheduled to run hourly. This function calls QuickSight APIs to get QuickSight namespace, group, user, and assets access permissions information and saves the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. CloudTrail logs are stored in S3 bucket.
  3. Based on the file in Amazon S3 that contains the user-group information and the QuickSight asset access permissions, as well as the view-dashboard and user login events in the CloudTrail logs, three Amazon Athena tables and several views are created. Optionally, the BI engineer can combine these tables with employee information tables to display human resource information about the users.
  4. Two QuickSight datasets fetch the data in the Athena tables created in Step 3 through SPICE mode. Then, based on these datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Access to the following AWS services:
    • Amazon QuickSight
    • Amazon Athena
    • AWS Lambda
    • Amazon S3
  • Basic knowledge of Python
  • Optionally, Security Assertion Markup Language 2.0 (SAML 2.0) or OpenID Connect (OIDC) single sign-on (SSO) configured for QuickSight access

Creating resources

Create your resources by launching the following AWS CloudFormation stack:

After the stack creation is successful, you have one Amazon CloudWatch Events rule, one Lambda function, one S3 bucket, and the corresponding AWS Identity and Access Management (IAM) policies.

To create the resources in a Region other than us-east-1, download the Lambda function.

Creating Athena tables

The Data_Prepare Lambda function is scheduled to run hourly with the CloudWatch Events rule admin-console-every-hour. This function calls the QuickSight APIs list_namespaces, list_users, list_user_groups, list_dashboards, list_datasets, list_datasources, list_analyses, list_themes, describe_data_set_permissions, describe_dashboard_permissions, describe_data_source_permissions, describe_analysis_permissions, and describe_theme_permissions to get QuickSight users and assets access permissions information. Finally, this function creates two files, group_membership.csv and object_access.csv, and saves these files to an S3 bucket.
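
The following is a simplified Boto3 sketch of how the Data_Prepare function can build group_membership.csv; the deployed Lambda function additionally handles pagination, errors, and the asset permission calls that produce object_access.csv.

import csv
import io

import boto3

quicksight = boto3.client("quicksight")
s3 = boto3.resource("s3")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Build (namespace, group, user) rows for every QuickSight user.
rows = []
for namespace in quicksight.list_namespaces(AwsAccountId=account_id)["Namespaces"]:
    users = quicksight.list_users(
        AwsAccountId=account_id, Namespace=namespace["Name"]
    )["UserList"]
    for user in users:
        groups = quicksight.list_user_groups(
            UserName=user["UserName"],
            AwsAccountId=account_id,
            Namespace=namespace["Name"],
        )["GroupList"]
        for group in groups:
            rows.append([namespace["Name"], group["GroupName"], user["UserName"]])

# Write the result as CSV to the prefix used by the group_membership Athena table.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
s3.Object(
    f"admin-console{account_id}",
    "monitoring/quicksight/group_membership/group_membership.csv",
).put(Body=buffer.getvalue())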

Run the following SQL query to create two Athena tables (group_membership and object_access):

CREATE EXTERNAL TABLE `group_membership`(
`namespace` string,   
`group` string, 
`user` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/group_membership/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',',
  'typeOfData'='file');

CREATE EXTERNAL TABLE `object_access`(
`aws_region` string,   
`object_type` string, 
`object_name` string,
`object_id` string,
`principal_type` string,
`principal_name` string,
`namespace` string,
`permissions` string
)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/object_access/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',',
  'typeOfData'='file')

The following screenshot is sample data of the group_membership table.

The following screenshot is sample data of the object_access table.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

Creating views in Athena

Now we have the tables ready in Athena and can run SQL queries against them to generate some views to analyze the usage metrics of dashboards and users.

Create a view of a user’s role status with the following code:

CREATE OR REPLACE VIEW users AS
(select Namespace,
 Group,
 User,
(case 
when Group in ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin') 
then 'Author' 
else 'Reader' 
end) 
as author_status
from "group_membership" );

Create a view of GetDashboard events that happened in the last 3 months with the following code:

CREATE OR REPLACE VIEW getdashboard AS 
(SELECT 
"useridentity"."type",
"split_part"("useridentity"."sessioncontext"."sessionissuer"."arn", '/', 2) AS "assumed_role",
COALESCE("useridentity"."username", "concat"("split_part"("useridentity"."arn", '/', 2), '/', "split_part"("useridentity"."arn", '/', 3))) AS "user_name",
awsregion,
"split_part"("split_part"("serviceeventdetails", 'dashboardName":', 2), ',', 1) AS dashboard_name,
"split_part"("split_part"("split_part"("split_part"("serviceeventdetails", 'dashboardId":', 2), ',', 1), 'dashboard/', 2), '"}', 1) AS dashboardId,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time,
max(date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) AS latest_event_time
FROM cloudtrail_logs
WHERE 
eventsource = 'quicksight.amazonaws.com' 
AND
eventname = 'GetDashboard' 
AND
DATE_TRUNC('day', date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) > cast(current_date - interval '3' month AS date)
GROUP BY 1,2,3,4,5,6,7)

In the preceding query, the conditions defined in the where clause only fetch the records of GetDashboard events of QuickSight.

How can we design queries to fetch records of other events? We can review the CloudTrail logs to look for the information. For example, let’s look at the sample GetDashboard CloudTrail event:

{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "<principal_id>: <user_name>",
        "arn": "arn:aws:sts:: <aws_account_id>:assumed-role/<IAM_role_ name>/<user_name>",
        "accountId": "<aws_account_id>",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "<principal_id>",
                …
            }
        }
    },
    "eventTime": "2021-01-13T16:55:36Z",
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "GetDashboard",
    "awsRegion": "us-east-1",
    "eventID": "a599c8be-003f-46b7-a40f-2319efb6b87a",
    "readOnly": true,
    "eventType": "AwsServiceEvent",
    "serviceEventDetails": {
        "eventRequestDetails": {
            "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>"
        },
        "eventResponseDetails": {
            "dashboardDetails": {
                "dashboardName": "Admin Console",
                "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>",
                "analysisIdList": [
                    "arn:aws:quicksight:us-east-1: <aws_account_id>:analysis/<analysis_id>"
            }
        }
    }
}

With eventSource="quicksight.amazonaws.com" and eventName="GetDashboard", we can get all of the QuickSight dashboard view events.

Similarly, we can define the condition as eventname = 'AssumeRoleWithSAML' to fetch the user login events. (This solution assumes that the users log in to their QuickSight account with identity federation through SAML.) For more information about querying CloudTrail logs to monitor other interesting user behaviors, see Using administrative dashboards for a centralized view of Amazon QuickSight objects.
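As an illustration only (this view is not part of the deployed solution), the following Python sketch creates a login-events view with that condition through the Athena API; the view name userlogin, the default database, and the query result location are assumptions.

import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Mirrors the getdashboard view, but keeps only SAML federation login events
CREATE_LOGIN_VIEW = """
CREATE OR REPLACE VIEW userlogin AS
SELECT COALESCE("useridentity"."username",
                "split_part"("useridentity"."arn", '/', 3)) AS user_name,
       awsregion,
       date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE eventname = 'AssumeRoleWithSAML'
"""

athena.start_query_execution(
    QueryString=CREATE_LOGIN_VIEW,
    QueryExecutionContext={'Database': 'default'},                              # cloudtrail_logs lives in the default database
    ResultConfiguration={'OutputLocation': 's3://your-athena-query-results/'})  # placeholder results location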

Furthermore, we can join with employee information tables to get a QuickSight user’s human resources information.

Finally, we can generate a view called admin_console with QuickSight group and user information, assets information, CloudTrail logs, and, optionally, employee information. The following screenshot shows an example preview.

Creating datasets

With the Athena views ready, we can build some QuickSight datasets. We can load the view called admin_console to build a SPICE dataset called admin_console and schedule this dataset to be refreshed hourly. Optionally, you can create a similar dataset called admin_console_login_events with the Athena table based on eventname = 'AssumeRoleWithSAML' to analyze QuickSight user login events. Depending on the usage metrics requirements in your organization, you can create other datasets to serve different requests.
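If you prefer to script this step rather than use the QuickSight UI, the create_data_set API can build the SPICE dataset on top of the admin_console view. The following is a minimal sketch; the data source ARN, dataset ID, SQL, and column list are placeholders, and the real view contains more columns.

import boto3

qs = boto3.client('quicksight', region_name='us-east-1')

qs.create_data_set(
    AwsAccountId='111122223333',                         # placeholder account ID
    DataSetId='admin-console-dataset',                   # placeholder dataset ID
    Name='admin_console',
    ImportMode='SPICE',                                  # load the view into SPICE
    PhysicalTableMap={
        'admin-console-sql': {
            'CustomSql': {
                'DataSourceArn': 'arn:aws:quicksight:us-east-1:111122223333:datasource/<athena_datasource_id>',  # placeholder
                'Name': 'admin_console',
                'SqlQuery': 'SELECT namespace, "group", "user", dashboard_name, latest_event_time FROM admin_console',  # placeholder columns
                'Columns': [
                    {'Name': 'namespace', 'Type': 'STRING'},
                    {'Name': 'group', 'Type': 'STRING'},
                    {'Name': 'user', 'Type': 'STRING'},
                    {'Name': 'dashboard_name', 'Type': 'STRING'},
                    {'Name': 'latest_event_time', 'Type': 'DATETIME'}]}}})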

Building dashboards

Now we can build a QuickSight dashboard as the administrative console to analyze usage metrics. The following steps are based on the dataset admin_console. The schema of the optional dataset admin_console_login_events is the same as admin_console. You can apply the same logic to create the calculated fields to analyze user login activities.

  1. Create parameters.

For example, we can create a parameter called InActivityMonths, as in the following screenshot. Similarly, we can create other parameters such as InActivityDays, Start Date, and End Date.

  2. Create controls based on the parameters.

  3. Create calculated fields.

For instance, we can create a calculated field to detect the active or inactive status of QuickSight authors. If the time span between the latest dashboard view activity and now is greater than or equal to the number defined in the Inactivity Months control, the author status is Inactive. The following screenshot shows the relevant code.

According to end users' requirements, we can define several calculated fields to perform the analysis.

  4. Create visuals.

For example, we create an insight to display the top three dashboards viewed by readers and a visual to display the authors of these dashboards.

  5. Add URL actions to define some extra features, such as emailing inactive authors or checking the details of users.

The following sample code defines the action to email inactive authors:

mailto:<<email>>?subject=Alert to inactive author! &body=Hi, <<username>>, any author without activity for more than a month will be deleted. Please log in to your QuickSight account to continue accessing and building analyses and dashboards!

The following screenshots show an example dashboard that you can make using our data.

The following is the administrative console landing page. This page provides an overview, a terminology explanation, and thumbnails of the other two tabs.

The following screenshots show the User Analysis tab.

The following screenshots show the Dashboards Analysis tab.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

You can reference the public template of the preceding dashboard in create-template, create-analysis, and create-dashboard API calls to create this dashboard and analysis in your account. The public template of this dashboard has the following template ARN: 'TemplateArn': 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console'.
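For example, a create_dashboard call along the following lines creates a dashboard from that public template. This is a sketch: the account ID, dashboard ID, dataset ARN, and dataset placeholder name are assumptions, and the placeholder name must match the one defined in the template.

import boto3

qs = boto3.client('quicksight', region_name='us-east-1')

qs.create_dashboard(
    AwsAccountId='111122223333',                 # placeholder account ID
    DashboardId='admin-console',                 # placeholder dashboard ID
    Name='Admin Console',
    SourceEntity={
        'SourceTemplate': {
            'Arn': 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console',
            'DataSetReferences': [{
                'DataSetPlaceholder': 'admin_console',                                          # assumed placeholder name
                'DataSetArn': 'arn:aws:quicksight:us-east-1:111122223333:dataset/<dataset_id>'  # your admin_console dataset
            }]}})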

Additional usage metrics

Additionally, we can perform some more complicated analyses to collect advanced usage metrics. For example, Forwood Safety raised a unique request to analyze the readers who log in but don't view any dashboards (see the following code). This helps their clients identify and prevent wasted reader session fees. Leadership teams value the ability to minimize uneconomical user activity.

CREATE OR REPLACE VIEW "loginwithoutviewdashboard" AS
with login as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventname = 'AssumeRoleWithSAML'
GROUP BY  1,2,3),
dashboard as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventsource = 'quicksight.amazonaws.com'
AND
eventname = 'GetDashboard'
GROUP BY  1,2,3),
users as 
(select Namespace,
Group,
User,
(case
when Group in ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin')
then 'Author'
else 'Reader'
end)
as author_status
from "group_membership" )
select l.* 
from login as l 
join dashboard as d 
join users as u 
on l.user_name=d.user_name 
and 
l.awsregion=d.awsregion 
and 
l.user_name=u.user_name
where d.event_time>(l.event_time + interval '30' minute ) 
and 
d.event_time<l.event_time 
and 
u.author_status='Reader'

Cleaning up

To avoid incurring future charges, delete the resources you created with the CloudFormation template.

Conclusion

This post discussed how BI administrators can use QuickSight, CloudTrail, and other AWS services to create a centralized view to analyze QuickSight usage metrics. We also presented a serverless data pipeline to support the administrative console dashboard.

You can request a demo of this administrative console to try for yourself.


About the Authors

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Jill Florant manages Customer Success for the Amazon QuickSight Service team.

Create a custom data connector to Slack’s Member Analytics API in Amazon QuickSight with Amazon Athena Federated Query

Post Syndicated from Pablo Redondo Sanchez original https://aws.amazon.com/blogs/big-data/create-a-custom-data-connector-to-slacks-member-analytics-api-in-amazon-quicksight-with-amazon-athena-federated-query/

Amazon QuickSight recently added support for Amazon Athena Federated Query, which allows you to query data in place from various data sources. With this capability, QuickSight can extend support to query additional data sources like Amazon CloudWatch Logs, Amazon DynamoDB, and Amazon DocumentDB (with MongoDB compatibility) via their existing Amazon Athena data source. You can also use the Athena Query Federation SDK to write custom connectors and query any source accessible with a Java API, whether it is relational, non-relational, object, or a custom data endpoint.

A common analytics use case is to access data from a REST API endpoint and blend it with information from other sources. In this post, I walk you through the process of setting up a custom federated query connector in Athena to query data from a REST API endpoint and build a QuickSight dashboard that blends data from the REST API endpoint with other data sources.

To illustrate this use case, we work with Slack, the makers of a leading channel-based messaging platform, to test their Member Analytics API, which can help our mock company, Example Corp, understand Slack adoption and member engagement across different teams.

How the Slack Member Analytics API works

The following diagram illustrates the Slack Member Analytics API.

The Slack Member Analytics API is a REST API endpoint available for Slack Enterprise Grid customers. Authorized users and services can access the member usage stats dataset via the admin.analytics.getFile endpoint of the Slack Web API. The data consists of a newline-delimited JSON file with daily Slack activity stats at the member level. A record looks like the following code:

{ 
    "enterprise_id":"AAAAAAA",
    "date":"2020-11-10",
    "user_id":"U01ERHY4589",
    "email_address":"[email protected]",
    "is_guest":false,
    "is_billable_seat":false,
    "is_active":true,
    "is_active_ios":false,
    "is_active_android":false,
    "is_active_desktop":true,
    "reactions_added_count":3,
    "messages_posted_count":10, 
    "channel_messages_posted_count":0,
    "files_added_count":0
}

To request data, you must provide a date argument in the format of YYYY-MM-DD, a type argument with the value member, and an OAuth bearer token as the header. The response is a compressed (.gzip) JSON file with data for the requested date. See the following code of a sample request:

curl -X GET -H "Authorization: Bearer xoxp-..."  https://slack.com/api/admin.analytics.getFile?date=2020-09-01&type=member > data.gzip
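The equivalent request in Python might look like the following sketch, which uses the requests library, reads the bearer token from an environment variable, and omits error handling:

import gzip
import json
import os

import requests

SLACK_TOKEN = os.environ['SLACK_TOKEN']   # assumption: token exported as an environment variable

response = requests.get(
    'https://slack.com/api/admin.analytics.getFile',
    headers={'Authorization': f'Bearer {SLACK_TOKEN}'},
    params={'date': '2020-09-01', 'type': 'member'})

# On success, the body is a gzip-compressed, newline-delimited JSON file
records = [json.loads(line) for line in gzip.decompress(response.content).splitlines() if line]
print(records[0]['messages_posted_count'])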

Building the solution for Example Corp

For our use case, Example Corp has recently purchased Slack for 1,000 users and as the Collaboration team onboards new teams to Slack, they want to measure Slack adoption and engagement within each new team. If they see low adoption or engagement within a group at the company, they can work with that group to understand why they aren’t using Slack and provide education and support, as needed.

Example Corp wants to provide analysts access to the Slack member usage stats to run ad hoc queries in place (directly from the source) without maintaining a new extract, transform, and load (ETL) pipeline. They use the QuickSight cross data source join feature to blend their Slack usage stats with their HR dataset.

To achieve this, Example Corp implements the following steps:

  1. Authorize the custom federated query connector with Slack to access the Member Analytics API.
  2. Develop and deploy a custom federated query connector in the Example Corp AWS account.
  3. Create a dataset in the Example Corp QuickSight environment that reads Slack member usage data for the last 30 days and blends it with an HR dataset.
  4. Create a QuickSight dashboard that shows usage trends of provisioned vs. active users.

Example Corp program managers can now monitor Slack engagement using their QuickSight dashboard (see the following screenshot).

The following diagram illustrates the overall architecture of the solution.

The following sections describe the components in detail and provide sample code to implement the solution in your environment.

Authorizing the custom federated query connector to access the Slack Analytics API

Data REST API endpoints typically have an authentication mechanism such as standard HTTP authentication or a bearer token. In the case of the Slack Web API, a bearer token is required on every request. The Slack Member Analytics API uses an OAuth protocol to authorize applications’ read access to data from an organization’s Slack environment.

To perform the OAuth handshake, Example Corp deploys a custom web application on Amazon Elastic Compute Cloud (Amazon EC2) and registers it as a new Slack application. When it’s deployed, Example Corp Slack admins can access the web application UI to authenticate with Slack and authorize read access to the custom federated query connector. After successful authentication, the custom web application stores the bearer token as a secret in AWS Secrets Manager. Only the custom application server and the federated query connector have access to this secret.

The following is an architecture diagram and brief description of the OAuth authorization workflow between Slack.com and the custom web application. As a prerequisite, you need to register your custom application with Slack.

  1. The Slack admin accesses the custom application UI from their browser and chooses Add to Slack to begin the authorization process.

  2. The custom application redirects the admin to Slack.com to authenticate and authorize the client with admin.analytics:read access for the Example Corp Slack Enterprise Grid.

  3. Slack.com redirects the admin back to the custom application UI, passing a temporary authorization code in the request.
  4. On the backend, the custom application retrieves Slack client secrets from a Secrets Manager secret. The Slack client secrets are obtained during the Slack application registration.
  5. The custom application server makes a request for a bearer token to the Slack API, passing both the temporary authorization code and the Slack client secrets.
  6. If both the temporary authorization code and the client secrets are valid, then the Slack API returns a bearer token to the custom application server.
  7. The custom application saves the bearer token in the Secrets Manager secret.
  8. Finally, the application sends a confirmation of successful authorization to the admin.

Slack admins can revoke access to the application from the organization’s console at any time.

You can find the source code and detailed instructions to deploy this sample OAuth web application in the GitHub repo. When the authorization workflow is complete, you can pause or stop the resources running the web application. Going forward, the federated query connector accesses the token from Secrets Manager.
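Although the connector itself is built with the Athena Query Federation Java SDK, the token lookup it performs is simple. The following Python sketch shows the equivalent call; the secret name slack/bearer_token and the JSON key access_token are assumptions about how the web application stores the secret.

import json

import boto3

secrets = boto3.client('secretsmanager', region_name='us-east-1')

# Retrieve the bearer token that the OAuth web application saved in Secrets Manager
secret = secrets.get_secret_value(SecretId='slack/bearer_token')       # hypothetical secret name
bearer_token = json.loads(secret['SecretString'])['access_token']      # assumed key in the secret JSON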

Deploying the custom federated query connector

When the OAuth workflow is complete, we can deploy the custom federated query connector in the Example Corp AWS environment. For Example Corp, we develop a custom AWS Lambda function using the Athena Query Federation Java SDK and a Java HTTP client to connect with the Slack Member Analytics REST API. Finally, we register it as a new data source within Athena.

The following is a diagram of how the custom connector workflow operates.

The workflow includes the following steps:

  1. Users submit a query to Athena, such as select * from <catalog_name>.slackanalytics.member_analytics where date='2020-11-10', where <catalog_name> is the name specified when creating the Athena data source (see the sketch after this list).
  2. Athena compiles the query and runs the Lambda function to retrieve the Slack authorization token from Secrets Manager and determine the number of partitions based on the query predicates (where clause).
  3. The Slack Member Analytics Connector partitions the data by date and runs a Lambda function for each partition (date) specified in the query. For example, if the predicate is WHERE date IN ('2020-11-10', '2020-11-12'), Athena runs two instances of the Lambda function. When no dates are specified in the where clause, the connector gets data for the last 30 days.
  4. Each instance of the Lambda function makes a request to the Slack Member API to retrieve data for each day.
  5. Finally, Athena performs any aggregation and computation specified in the query and returns the results to the client.
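As referenced in step 1, the following Python sketch submits the example query through the Athena API instead of the console; the catalog name slackanalytics_catalog and the query result location are assumptions.

import boto3

athena = boto3.client('athena', region_name='us-east-1')

query = athena.start_query_execution(
    QueryString="SELECT * FROM slackanalytics_catalog.slackanalytics.member_analytics WHERE date='2020-11-10'",
    QueryExecutionContext={'Catalog': 'slackanalytics_catalog'},                 # the data source name registered in Athena
    ResultConfiguration={'OutputLocation': 's3://your-athena-query-results/'})   # placeholder results location
print(query['QueryExecutionId'])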

You can deploy this sample Slack Member Analytics Lambda function in your AWS environment via AWS CloudFormation with the following template. If you want to modify and build the connector from scratch, you can find the source code and instructions in the GitHub repo.

After the Lambda function has been deployed, create a new data source in Athena. For step-by-step instructions, see Deploying a Connector and Connecting to a Data Source.

  1. On the Athena console, in the query editor, choose Connect data source.

  2. Select All other data sources.
  3. Point your catalog to your new Lambda function.

You should be able to browse your new catalog within Athena from the Athena console and query the Slack Member Analytics API using SQL.

Creating a dataset that reads Slack member usage data and blends it with an HR dataset

As a prerequisite to query the Slack Member Analytics API from QuickSight, we must provide the proper permission for QuickSight to access the federated query data source in Athena. We do this directly from the QuickSight admin UI following these steps:

  1. As an admin, on the Admin menu, choose Manage QuickSight.
  2. Under Security & Permissions, choose QuickSight access to AWS services.
  3. Choose Add or Remove services.
  4. Select Athena.
  5. Choose Next when prompted to set the Amazon Simple Storage Service (Amazon S3) bucket and Lambda function permissions.

QuickSight browses the Athena catalogs and displays any Lambda functions associated with your account. If you don’t see a Lambda function, it means you haven’t mapped a data source within Athena.

  6. Select the function.
  7. Choose Finish.

When the Example Corp QuickSight environment has the proper permissions, analysts can query the Slack Analytics Member API using their existing Athena data source. For instructions on creating your own dataset, see Creating a Dataset Using Amazon Athena Data.

The custom connector appears as a new Catalog, Database, and Tables option.

  1. In QuickSight, on the Datasets page, choose New dataset.

  2. Choose Athena as your data source.
  3. Choose Create dataset.

  4. Choose your table or, for this use case, choose Use custom SQL.

For this analysis, we write a custom SQL that gets member activity for the last 30 days:

SELECT date,
       is_active,
       email_address,
       messages_posted_count
FROM   slackanalytics_catalog.slackanalytics.member_analytics
WHERE  date >= date_format(date_trunc('month',current_date),'%Y-%m-%d')

With the QuickSight cross data source join feature, analysts can enrich the Slack member stats with their HR info. For this use case, we imported a local HR_dataset.csv file containing the list of subscribed users with their respective Example Corp department, and joined them via the employee_email field.

The result is a dataset with Slack activity by employee and department. We’ve also updated the date field from a String type to a Date type using the QuickSight Data Prep page to take advantage of additional visualization features with Date type fields.

Creating a QuickSight dashboard that shows usage trends of provisioned vs. active users

Example Corp Analysts want to visualize the trend of provisioned users vs. active users and understand Slack adoption by department. To support these visualizations, we created the following calculated fields within our QuickSight analysis:

  • active: distinct_countIf(employee,{is_active}='true')
  • provisioned: distinct_count(employee)

You can also create these calculated fields when you create your dataset. This way, you can reuse them in other QuickSight analyses. 

We use QuickSight narrative insights, a line chart, a bar chart, and a pivot table with conditional formatting to create the following analysis.

From this analysis, Example Corp can see that the adoption trend is positive; however, there is an adoption gap within the Marketing team. The program managers can engage the Marketing department leads and focus their training resources to improve their adoption.

This dashboard can now be published to stakeholders within the organization as needed—either within the QuickSight app or embedded within existing enterprise applications. 

Conclusion

With the recent integration of QuickSight and Athena Federated Query, organizations can access additional data sources beyond those already supported by QuickSight. Analysts can leverage QuickSight capabilities to analyze and build dashboards that blend data from a variety of data sources, and with the Athena Query Federation SDK, you can build custom connectors to access relational, non-relational, object, and custom data endpoints using standard SQL.

To get started, try the lab Athena Deploying Custom Connector.


About the Author

Pablo Redondo Sanchez is a Senior Solutions Architect at Amazon Web Services. He is a data enthusiast and works with customers to help them achieve better insights and faster outcomes from their data analytics workflows. In his spare time, Pablo enjoys woodworking and spending time outdoors with his family in Northern California.

How the Yahoo! JAPAN Smart Devices Team is improving voice user interfaces with Amazon QuickSight business intelligence

Post Syndicated from Kazuhide Fujita original https://aws.amazon.com/blogs/big-data/how-the-yahoo-japan-smart-devices-team-is-improving-voice-user-interfaces-with-amazon-quicksight-business-intelligence/

This is a guest blog post by Kazuhide Fujita, Product Manager at Yahoo! JAPAN.

Yahoo! JAPAN is a large internet search and media company, with Yahoo! JAPAN's web portal being one of the most commonly used websites in Japan. Our smart devices team is responsible for building and improving Yahoo! JAPAN apps for voice user interfaces (VUI) such as Amazon Alexa and Google Assistant. We see VUI as a market that will grow exponentially in the future, and we want to be ready to lead the consumer experience with such devices. In this post, I discuss how we're using Amazon QuickSight business intelligence (BI) to help our product teams improve these services.

Enhanced access to insights at lower cost

To continuously improve our services, we use data to understand how consumers are interacting with the software and to identify growth trends. However, the data we get directly from smart device makers is limited. So, we built our own log system to capture more granular data, such as the types of commands customers are using, the time of day they use the application, and how frequently they use it.

Early on, we used Amazon Elasticsearch Service (Amazon ES) and Kibana to analyze data. Although this solution was very capable, it came at a higher price point than we were targeting. Another option was to export data directly to Microsoft Excel for ad hoc analysis. However, this was very time consuming and limited us to working with extracts of historical data rather than the latest information.

We decided to look for a solution that would suit the full spectrum of our needs while being cost-effective for our specific use case. While we were searching, our data team made the decision to standardize on a data lake architecture using Amazon Simple Storage Service (Amazon S3) and Amazon Athena. This approach provided a high level of flexibility and scalability. To visualize our data, it made sense to use QuickSight, the serverless BI solution with pay-per-session pricing on AWS.

Unifying data to understand customers better

This system has proven to be a good fit for our needs. The data lake allows us to accumulate different types of data from monitoring many KPIs and VUI products. For example, we might want to know the number of active users over a given period, and then drill down into how active those users were in the 2 weeks from when they registered. The data lake makes this possible. It’s easy to maintain even though the data is very diverse. For aggregating and performing calculations on the data, we use Athena because it provides optimal performance for complex queries thanks to the distributed computing model.

For ad hoc analysis, dashboards, and reporting, QuickSight connects seamlessly to our data lake. QuickSight makes it easy to view trends in customer behavior such as the time of usage, method of interaction, typical settings, and so on. The following screenshot shows a sample dashboard in QuickSight.

For example, the default wake word for Alexa-powered devices is to say the name of the voice assistant: “Hey, Alexa.” However, Japanese customers may prefer to say “ohayō,” which means “good morning” in Japanese. Which setting customers prefer could be an important trend for us to know when we configure our offerings. With QuickSight, it’s easy to compare trends for this type of behavior across other user characteristics.

This is only one small example of the kinds of insights we glean by using QuickSight. Another use case is regarding initiatives to increase product usage through marketing or incentives. We can track the outcome of these programs using QuickSight by tracking whether they result in an uptick in usage relative to the communications we send out.

The freedom to focus on what matters to the business

One of the big advantages of using QuickSight and other AWS services is that we don’t have to worry about maintaining on-premises systems for our data lake and analytics. It’s easy to manage and we can focus on gaining insights and improving our products—not running data center infrastructure. Building our end-to-end data-to-insights pipeline on AWS ensures that we can easily apply security and governance policies to all our data.

Overall, QuickSight provides us with the flexibility to analyze all kinds of data quickly, so we can aim to be the market leader in the VUI marketplace. We’re excited to see what the future holds for this powerful tool—and to apply the knowledge we gain to improving our services.


About the Author

Kazuhide Fujita is the Skill Team Product Manager, Smart Device Division, at Yahoo Japan Corporation

How Edmunds GovTech unifies data and analytics data for municipalities with Amazon QuickSight

Post Syndicated from Edmunds GovTech original https://aws.amazon.com/blogs/big-data/how-edmunds-govtech-unifies-data-and-analytics-data-for-municipalities-with-amazon-quicksight/

This is a guest post from an Amazon QuickSight customer, Edmunds GovTech

Over the past 30 years, Edmunds GovTech has grown to provide enterprise resource planning (ERP) solutions to thousands of East Coast municipalities. We also serve cities and towns in 25 other states. In this blog, I’ll talk about how we used Amazon QuickSight embedded business intelligence (BI) to quickly bring powerful dashboards to our on-premises and cloud-based customers.

Unifying insights

Our customers rely on our suite of solutions to manage finances, personnel, revenue, and municipal management activities such as permits, land management, business licensing, and fleet maintenance. They can access a wide variety of reports and data analysis tools tailored to the needs of users in finance, operations, and other departments. Recent acquisitions have also added new capabilities to our offering, each with its own set of reporting tools.

These reports serve specialist users well. However, we wanted to add the ability to aggregate and visualize information in one easy-to-consume service. Time-starved executives, boards, and decision-makers needed a better way to gain key insights into spending trends and implement better cost and cash management strategies. They strive to better achieve the financial goals of their municipalities without having to spend time running reports in different areas of the solution.

Production-ready in record time

With this vision in place, our primary directive was speed to market, with the aim of releasing a production-ready solution in just 4 months. We carefully evaluated our priorities and functional requirements, and ultimately decided that outsourcing infrastructure management was key. QuickSight, a fully managed, cloud-native BI service from AWS, was the only option that allowed us to deliver so quickly.

Just as importantly, our professional services team saves an extensive amount of time implementing the solution and training customers. That means immediate value for the customer and more time for our professional services team to spend on other activities, increasing profitability. We sell the embedded dashboard service as a subscription-based add-on, so customers can easily purchase and use it.

Flexible and future-proof

Although many of our customers use traditional client/server configurations in their own data centers, our cloud-hosted solution is becoming increasingly popular, especially with increasing numbers of remote workers. We’re also developing a software as a service (SaaS) version of our suite and continue to acquire other vendors to add functionality. All these factors mean our QuickSight dashboard service needs to be platform-agnostic. It must work with any source application, whether in AWS or on premises.

We accomplished this using Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Storage Service (Amazon S3). The source application emits events about finance accounts, vendors, and yearly budgets using Amazon SQS, with Amazon S3 available to ingest large messages that exceed the limits of Amazon SQS. We rely on AWS Lambda serverless functions to handle the ingestion and routing of the messages. Each customer has an individual reporting store, separate from the database of the source system.

This system transforms data from the customer’s system into a format that is normalized for QuickSight reporting. By pointing QuickSight at these schemas, we enable it to report on that data. The customer dashboard is embedded in the ERP application, so the customer doesn’t need to go to the QuickSight website to access it.

Any source application that can adhere to the messaging format can be reported on. The source system is responsible for the number-crunching, so any customizations the customer has applied are reflected in the reports.

The following diagram illustrates this architecture.

The high-level architecture is as follows:

  1. The application sends a JSON message to the SQS queue. If the message is too large, it is added to an S3 bucket and the message contains a reference to the large object. Note that the source can be any application, as long as it produces messages adhering to the predefined JSON schema.
  2. A Lambda consumer ingests a batch of messages, validates the payloads, and transforms them from JSON into the tenant's Aurora MySQL reporting database, which uses a star schema. The consumer can ingest small messages directly from the SQS event or retrieve large messages from S3 (see the sketch after this list).
  3. The QuickSight namespace for the tenant contains a dashboard created from a master template that points to the appropriate reporting database.
  4. The source application requests the dashboard on the user's behalf. The dashboard is embedded within the source application UI.
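The following Python sketch illustrates the large-message handling described in step 2: the Lambda consumer reads each SQS record and, when the body only carries an S3 reference, fetches the full payload from the bucket. The field names s3_bucket and s3_key are assumptions about the message contract, and the database write is left as a stub.

import json

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:                      # batch of SQS messages delivered to Lambda
        body = json.loads(record['body'])
        if 's3_key' in body:                             # assumption: large payloads are referenced by bucket/key
            obj = s3.get_object(Bucket=body['s3_bucket'], Key=body['s3_key'])
            payload = json.loads(obj['Body'].read())
        else:
            payload = body
        load_into_reporting_store(payload)

def load_into_reporting_store(payload):
    # Stub: validate the payload and upsert it into the tenant's Aurora MySQL star schema
    pass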

Because the system relies on Lambda functions, it’s a modern, decoupled architecture that is inherently future-proof and scalable. We don’t have to manage cloud or on-premises servers, and we only pay for what clients actually use.

Additionally, we were able to build a user interface that makes it easy to deploy new customers with just a few clicks. We use the installer to create the infrastructure for new clients using the AWS Command Line Interface (AWS CLI). The customer simply pushes a button from the source system to push data to the dashboard. They can be using the dashboard in less than an hour from start to finish.
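The installer itself isn't described in detail in this post, but as a rough illustration of per-tenant provisioning (shown here with boto3 rather than the AWS CLI), a deployment step could create each client's stack from a shared template; the stack name, template URL, and parameter key below are all hypothetical.

import boto3

cfn = boto3.client('cloudformation', region_name='us-east-1')

cfn.create_stack(
    StackName='dashboards-tenant-0042',                                            # hypothetical tenant stack name
    TemplateURL='https://example-bucket.s3.amazonaws.com/tenant-template.yaml',    # hypothetical template location
    Parameters=[{'ParameterKey': 'TenantId', 'ParameterValue': 'tenant-0042'}],    # hypothetical parameter
    Capabilities=['CAPABILITY_NAMED_IAM'])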

Continuously increasing customer value

QuickSight has rolled out a lot of new features just in the short time we’ve been using it, and we’re already taking advantage of them. For example, QuickSight now remembers filter selections by user, so that the choices a user makes are remembered from session to session. That saves our customers time and effort and helps them get the information they need faster.

Embedded authoring is another significant feature that we’re looking forward to rolling out soon. As of this writing, we manage and maintain reporting templates for customers and push them out to clients using the AWS CLI. With the new embedded authoring capability of QuickSight, customers will be able to explore data in a self-service manner, perform ad hoc analysis, and create new dashboards. This will greatly increase the utility of the service while maintaining ease of use for customers and simplicity of management for our team. We’re also adopting the new namespace functionality to help customers maintain data separation from others in our multi-tenant solution.

Together today and tomorrow

Working with AWS has been a great experience. Our account representative has always been available for questions and feedback, which helped us succeed especially on such an accelerated timeframe. In addition to bringing QuickSight to our customers, we value the relationship we’ve developed with AWS and look forward to building on it as we move forward with our cloud solutions. Partnering with AWS has led to many benefits across our entire organization.

Marketing and sales teams in our organization are leading client demos with the QuickSight dashboard because it looks great and works seamlessly, and it’s something a lot of customers have been asking for. For department heads, executives, and other leaders, the ability to quickly visualize current and historical budget information is huge. They can also show their boards the information they need in a very easy-to-consume way. By giving customers one place to go for a high-level strategic view across their municipality, we’re helping them make better decisions and ultimately serve their constituents more effectively.


About the Author

Thomas Mancini is the VP, Concept Development at Edmunds GovTech

Centrally tracking dashboard lineage, permissions, and more with Amazon QuickSight administrative dashboards

Post Syndicated from Jesse Gebhardt original https://aws.amazon.com/blogs/big-data/centrally-tracking-dashboard-lineage-permissions-and-more-with-amazon-quicksight-administrative-dashboards/

This post is co-written with Shawn Koupal, an Enterprise Analytics IT Architect at Best Western International, Inc.

A common ask from Amazon QuickSight administrators is to understand the lineage of a given dashboard (what analysis is it built from, what datasets are used in the analysis, and what data sources do those datasets use). QuickSight APIs allow us to capture the metadata from each object and build a complete picture of the linkages between each object. As a QuickSight administrator, you can build a dashboard that displays the lineage from dashboard to data source, along with the permissions for each asset type. It can be helpful to see all permissions assigned to each of your assets as well as the relationships between them, all in one place.

Solution overview

In this solution, you build an end-to-end data pipeline using QuickSight to ingest data from an AWS Glue table.

The following diagram illustrates the architecture of the solution.

You can invoke the QuickSight APIs via the AWS Software Development Kit (AWS SDK) or the AWS Command Line Interface (AWS CLI). For this post, we use the AWS SDK.

The solution starts with an AWS Lambda function that calls the QuickSight list APIs (list_data_sources, list_data_sets, list_analyses, list_templates, and list_dashboards) depending on the event message to build lists of assets in chunks of 100, which are iterated through by a second Lambda function. The reason for splitting the work into two functions is to work around the 15-minute time limit in Lambda. You can schedule the Lambda function to run on each asset type based on an event rule trigger. See the following code:

import boto3
import os
import time
import datetime
import json

# Configuration passed in through Lambda environment variables
AWS_ACCOUNT_ID=os.environ['AWS_ACCOUNT_ID']
AWS_REGION=os.environ['AWS_REGION']
QS_S3_BUCKET=os.environ['QS_S3_BUCKET']
DownloaderFunctionName=os.environ['DownloaderFunctionName']

# boto3 clients for QuickSight and Lambda
client = boto3.client('quicksight', region_name=AWS_REGION)
lambda_client = boto3.client('lambda')

# Invoke the downloader Lambda function asynchronously with one page (up to 100 objects) of list results
def invoke_downloader(iteration, apicall, list_results):
    # Map the list API name to the object type used in the S3 folder structure
    apicall=apicall.replace("list_data_sources","datasource").replace("list_data_sets","dataset").replace("list_analyses","analysis").replace("list_dashboards","dashboard").replace("list_templates","template")
    msg = {"Iteration": iteration, "api": apicall, "Objects":  list_results }
    invoke_response = lambda_client.invoke(FunctionName=DownloaderFunctionName,
                                           InvocationType='Event',
                                           Payload=json.dumps(msg, default=datetime_handler))

# Serialize datetime objects returned by the QuickSight APIs to ISO-8601 strings
def datetime_handler(x):
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    raise TypeError("Unknown type")

# Delete the CSV files produced by a previous run for this object type
def file_cleanup(apicall):
    #Replace the apicall with the S3 folder name
    object_type=apicall.replace("list_data_sources","datasource").replace("list_data_sets","dataset").replace("list_analyses","analysis").replace("list_dashboards","dashboard").replace("list_templates","template")
    
    s3_path='quicksight_lineage/'+object_type+'/'
    s3_path2='quicksight_lineage/'+object_type+'_permissions/'
    fileprefix="QuickSight_"+object_type
    botoSession = boto3.Session (region_name = 'us-west-2')
    s3_session = botoSession.resource('s3')
    bucket = s3_session.Bucket(QS_S3_BUCKET)
    #Delete Any files with prefix in s3_path and s3_path2
    bucket.objects.filter(Prefix=s3_path+fileprefix).delete()
    bucket.objects.filter(Prefix=s3_path2+fileprefix).delete()

# Entry point: decide which QuickSight list APIs to call based on the incoming event
def lambda_handler(event, context):
    if event == {}:
        #Call All APIs assests 
        apicall_list=['list_data_sources','list_data_sets','list_analyses','list_dashboards','list_templates']
    elif  event["api"] == 'datasource':
        apicall_list=['list_data_sources']
    elif event["api"] == 'dataset':
        apicall_list=['list_data_sets']
    elif event["api"] == 'analysis':
        apicall_list=['list_analyses']
    elif event["api"] == 'dashboard':
        apicall_list=['list_dashboards']
    elif event["api"] == 'template':
        apicall_list=['list_templates']
    else:
        print("[WARN] Exception: Invalid Event Type.")
        return
    for apicall in apicall_list: 
        try:
            #Clean up files from previous run
            file_cleanup(apicall)
            #Reset variables for each apicall
            iteration=0
            user_token = None
            list_results={}

            # Page through the list API 100 results at a time until NextToken is no longer returned
            while True:
                iteration+=1
                print("Calling ",apicall, iteration)
                
                if user_token is None:
                    exec("""results=client."""+apicall+"""(AwsAccountId='"""+AWS_ACCOUNT_ID+"""', MaxResults=100);""",globals(), list_results)
                else:
                    exec("""results=client."""+apicall+"""(AwsAccountId='"""+AWS_ACCOUNT_ID+"""', MaxResults=100,NextToken='"""+user_token+"""');""",globals(), list_results)
​
                invoke_downloader(iteration, apicall, list_results["results"])
                user_token=list_results["results"]["NextToken"]
                print(user_token)
        except KeyError:
            print("NextToken not found.")

The second Lambda function consumes the list of assets from the event parameter from the first function and uses the QuickSight describe APIs (describe_datasource, describe_dataset, describe_analysis, describe_template, and describe_dashboard). The details of each QuickSight asset are written to CSV files in an Amazon Simple Storage Service (Amazon S3) bucket in groups of 100. Because the first function calls the second function in parallel, it’s recommended to set the reserved concurrency to 2 in the second Lambda function to avoid throttling errors (if you use the AWS CloudFormation template provided later in this post, this is automatically configured for you). See the following code:

import boto3
import os
import time
import datetime
import json

AWS_ACCOUNT_ID=os.environ['AWS_ACCOUNT_ID']
AWS_REGION=os.environ['AWS_REGION']
QS_S3_BUCKET=os.environ['QS_S3_BUCKET']

client = boto3.client('quicksight', region_name=AWS_REGION)
lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

def process_dashboards(list_dashboard,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DashboardId,Name,SourceEntityArn,VersionCreatedTime,VersionNumber,CreatedTime,DataSetArns,LastPublishedTime,LastUpdatedTime" + '\n')

    for dashboard in list_dashboard["DashboardSummaryList"]:
        dashboard_desc= client.describe_dashboard(AwsAccountId=AWS_ACCOUNT_ID,DashboardId=dashboard["DashboardId"])
        
        source_entity_arn = dashboard_desc["Dashboard"]["Version"]["SourceEntityArn"]
        version_created_time = dashboard_desc["Dashboard"]["Version"]["CreatedTime"].isoformat()
        version_number = str(dashboard_desc["Dashboard"]["Version"]["VersionNumber"])
        created_time = dashboard_desc["Dashboard"]["CreatedTime"].isoformat()

        last_published_time = dashboard_desc["Dashboard"]["LastPublishedTime"].isoformat()
        last_updated_time = dashboard_desc["Dashboard"]["LastUpdatedTime"].isoformat()
        try:
            for arn in dashboard_desc["Dashboard"]["Version"]["DataSetArns"]:
                f.write(dashboard["DashboardId"]+',"'+ dashboard["Name"] + '",' + source_entity_arn + ',' + version_created_time + ',' + version_number + ','   + created_time + ','+ arn + ',' + last_published_time + ',' + last_updated_time +'\n')
        except Exception as e:
            print(e)
            dataset_arn=''
            f.write(dashboard["DashboardId"]+',"'+ dashboard["Name"] + '",' + source_entity_arn + ',' + version_created_time + ',' + version_number + ','   + created_time + ','+ dataset_arn + ',' + last_published_time + ',' + last_updated_time +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            
        

def process_dashboards_permissions(list_dashboard,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DashboardId,Name,Principal,Permission,Iteration" + '\n')
	
    for dashboard in list_dashboard["DashboardSummaryList"]:

        try:
            list_permissions = client.describe_dashboard_permissions(AwsAccountId=AWS_ACCOUNT_ID,DashboardId=dashboard["DashboardId"])
        except:
            print("Error Listing Permissions for:"+dashboard["DashboardId"])
            continue

        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteDashboard" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"

            f.write(dashboard["DashboardId"]+',"'+ dashboard["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)      


def process_analysis(list_analyses,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("AnalysisId,Name,AnalysisArn,CreatedTime,LastUpdatedTime,DataSetArn,Iteration" + '\n')

    for analysis in list_analyses["AnalysisSummaryList"]:
        #Call describe_analysis
        analysis_desc= client.describe_analysis(AwsAccountId=AWS_ACCOUNT_ID,AnalysisId=analysis["AnalysisId"])

        analysis_arn = analysis_desc["Analysis"]["Arn"]
        created_time = analysis_desc["Analysis"]["CreatedTime"].isoformat()
        last_updated_time = analysis_desc["Analysis"]["LastUpdatedTime"].isoformat()

        try:
            for arn in analysis_desc["Analysis"]["DataSetArns"]:
                f.write(analysis["AnalysisId"]+',"'+ analysis["Name"] + '",' + analysis_arn + ',' + created_time + ','  + last_updated_time + ',' + arn + ',' + iteration  +'\n')
        except Exception as e:
            print(e)
            dataset_arn=''
            f.write(analysis["AnalysisId"]+',"'+ analysis["Name"] + '",' + analysis_arn + ',' + created_time + ','  + last_updated_time + ',' + dataset_arn  + ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            

        
def process_analysis_permissions(list_analyses,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("AnalysisId,Name,Principal,Permission,Iteration" + '\n')
	
    for analysis in list_analyses["AnalysisSummaryList"]:

        try:
            list_permissions = client.describe_analysis_permissions(AwsAccountId=AWS_ACCOUNT_ID,AnalysisId=analysis["AnalysisId"])
        except:
            print("Error Listing Permissions for:"+analysis["AnalysisId"])
            continue
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteAnalysis" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"

            f.write(analysis["AnalysisId"]+',"'+ analysis["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')
    
    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)      


def process_templates(list_templates,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("TemplateId,Name,TemplateArn,CreatedTime,LastUpdatedTime,SourceEntityArn,VersionNumber,Iteration" + '\n')

    for template in list_templates["TemplateSummaryList"]:
        #Call describe_template
        template_desc= client.describe_template(AwsAccountId=AWS_ACCOUNT_ID,TemplateId=template["TemplateId"])

        template_arn = template_desc["Template"]["Arn"]
        created_time = template_desc["Template"]["CreatedTime"].isoformat()
        last_updated_time = template_desc["Template"]["LastUpdatedTime"].isoformat()
        source_entity_arn = template_desc["Template"]["Version"]["SourceEntityArn"]
        version_number = str(template_desc["Template"]["Version"]["VersionNumber"])
        f.write(template["TemplateId"]+',"'+ template["Name"] + '",' + template_arn + ',' + created_time + ','  + last_updated_time + ',' + source_entity_arn + ',' + version_number +  ',' + iteration  +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            

        
def process_templates_permissions(list_templates,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("TemplateId,Name,Principal,Permission,Iteration" + '\n')
	
    for template in list_templates["TemplateSummaryList"]:

        try:
            list_permissions = client.describe_template_permissions(AwsAccountId=AWS_ACCOUNT_ID,TemplateId=template["TemplateId"])
        except:
            print("Error Listing Permissions for:"+template["TemplateId"])
            continue
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteTemplate" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"

            f.write(template["TemplateId"]+',"'+ template["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')
    
    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)      


def process_datasources(list_data_sources,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DataSourceId,DataSourceArn,Name,Type,LastUpdatedTime,CreatedTime,Status,ErrorInfo,Iteration" + '\n')

    global datasource_list
    datasource_list=[]
    for datasource in list_data_sources["DataSources"]:
        datasource_id=datasource["DataSourceId"]
        name=datasource["Name"]
        datasource_type=datasource["Type"]
        try:
            status=datasource["Status"]
        except:
            status=''
        CreatedTime=str(datasource["CreatedTime"])
        LastUpdatedTime=str(datasource["LastUpdatedTime"])
        try:
            ErrorInfo="Type: "+datasource["ErrorInfo"]["Type"]+" Message: "+datasource["ErrorInfo"]["Message"]
        except:
            ErrorInfo="Null"

        f.write( datasource_id + ',' + datasource["Arn"] + ',"' + name + '",'  + datasource_type + ',' + LastUpdatedTime+ ',' + CreatedTime + ',' + status + ',' + ErrorInfo+ ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            


def process_datasources_permissions(list_data_sources,iteration,object_type):
    
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DataSourceID,Name,Principal,Permission,Iteration" + '\n')

    for datasource in list_data_sources["DataSources"]:
        try:
            list_permissions = client.describe_data_source_permissions(AwsAccountId=AWS_ACCOUNT_ID,DataSourceId=datasource["DataSourceId"])
        except:
            print("Error Listing Permissions for:"+datasource["DataSourceId"])
            continue
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteDataSource" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"
                
            f.write(datasource["DataSourceId"]+',"'+ datasource["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')
    
    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            
    

def process_datasets(list_datasets,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write('DatasetId,DataSetArn,Name,SpiceSize,ImportMode,LastUpdatedTime,CreatedTime,DataSourceArn,DataSourceName,DataSourceType,Source,Columns,Iteration' + '\n')
    
    for dataset in list_datasets["DataSetSummaries"]:
        
        try:
            response= client.describe_data_set(AwsAccountId=AWS_ACCOUNT_ID,DataSetId=dataset["DataSetId"])
        except Exception as e:
            print("Dataset ID: ", dataset["DataSetId"], e)
            f.write( dataset["DataSetId"] + ',' + dataset["Arn"] + ',"' + dataset["Name"] + '",' + '0' + ',' + dataset["ImportMode"] + ',' + str(dataset["LastUpdatedTime"])+ ','+ str(dataset["CreatedTime"])+ ',' + 'n/a' + ',"' + 'n/a' + '",' +  'n/a'  + ',' +  'n/a' + ',"'  + 'n/a'+ '",' + iteration +'\n')
            continue

        dataset_id=response["DataSet"]["DataSetId"]
        dataset_name=response["DataSet"]["Name"]
        dataset_size=response["DataSet"]["ConsumedSpiceCapacityInBytes"]
        ImportMode=response["DataSet"]["ImportMode"]
        LastUpdatedTime=response["DataSet"]["LastUpdatedTime"].isoformat()
        CreatedTime=response["DataSet"]["CreatedTime"].isoformat()

        #Default in case the physical table structure is not recognized below
        Columns=""
        try:
            for key in response["DataSet"]["PhysicalTableMap"].keys():
                
                if key == 's3PhysicalTable':
                    
                    source='S3Source'
                    DataSourceArn=response["DataSet"]["PhysicalTableMap"]["s3PhysicalTable"]["S3Source"]["DataSourceArn"]
                    Columns=response["DataSet"]["PhysicalTableMap"]["s3PhysicalTable"]["S3Source"]["InputColumns"]
                    #SqlQuery="Null"

                else:

                    try:
                        DataSourceArn=response["DataSet"]["PhysicalTableMap"][key]["RelationalTable"]["DataSourceArn"]
                        Columns=""
                        source="VisualEditor"
                    except:
                        DataSourceArn=response["DataSet"]["PhysicalTableMap"][key]["CustomSql"]["DataSourceArn"]
                        Columns=response["DataSet"]["PhysicalTableMap"][key]["CustomSql"]["Columns"]
                        source="CustomSql"

                DataSourceName=""
                DataSourceType=""
                
                f.write( dataset_id + ',' + dataset["Arn"] + ',"' + dataset_name + '",' + str(dataset_size) + ',' + ImportMode + ',' + LastUpdatedTime+ ','+ CreatedTime+ ',' + DataSourceArn + ',"' + DataSourceName + '",' +  DataSourceType  + ',' +  source + ',"'  + str(Columns) + '",' + iteration +'\n')
                
        except:
            print("[DEBUG]: Exception in main write for: " + str(dataset))
            f.write( dataset_id  + ',' + dataset["Arn"] +',"' + dataset_name + '",' + str(dataset_size) + ',' + ImportMode + ',' + LastUpdatedTime+ ',' + CreatedTime + ',,,,Unknown,"'  + str(Columns) + '",' + iteration +'\n')

    f.close()
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)


def process_datasets_permissions(list_datasets,iteration,object_type):
    
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    f.write('DataSetID,Name,Principal,Permission,Iteration'+'\n')

    for dataset in list_datasets["DataSetSummaries"]:
        try:
            list_permissions = client.describe_data_set_permissions(AwsAccountId=AWS_ACCOUNT_ID,DataSetId=dataset["DataSetId"])
        except:
            print("Error Listing Permissions for:"+dataset["DataSetId"])
            continue
        
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteDataSet" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"
                
            f.write(dataset["DataSetId"]+',"'+ dataset["Name"] + '",' + permission["Principal"] +  ',' + action+  ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            


def lambda_handler(event, context):

    list_objects=event["Objects"]
    iteration=str(event["Iteration"])
    
    print("Iteration: ", iteration)
    print("[INFO]Processing QuickSight:", event["api"] )
    
    if  event["api"] == 'datasource':
        process_datasources(list_objects, iteration, event["api"])
        process_datasources_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'dataset':
        process_datasets(list_objects, iteration, event["api"])
        process_datasets_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'analysis':
        process_analysis(list_objects, iteration, event["api"])
        process_analysis_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'dashboard':
        process_dashboards(list_objects, iteration, event["api"])
        process_dashboards_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'template':
        process_templates(list_objects, iteration, event["api"])
        process_templates_permissions(list_objects, iteration, event["api"]+'_permissions')
    else:
        print("[WARN] Exception: Invalid Event Type.")
        return

Afterwards, the S3 bucket has the directory structure under the quicksight_lineage folder as shown in the following screenshot.

You then use AWS Glue to store the metadata of each file in an AWS Glue table, which allows you to query the information from QuickSight using an Amazon Athena or Amazon Redshift Spectrum data source. (If you run the CloudFormation stack, the tables are set up for you.)
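
For example, once the AWS Glue tables are in place, you can query the extracted metadata directly through the Athena APIs. The following is a minimal sketch; the quicksight-lineage database name comes from this walkthrough, while the S3 output location is a placeholder you should change to your own query results bucket:

    import time
    import boto3

    athena = boto3.client("athena")

    # Assumed names from this walkthrough; adjust to your environment
    DATABASE = "quicksight-lineage"
    OUTPUT_LOCATION = "s3://<your-athena-query-results-bucket>/quicksight_lineage/"

    query = 'SELECT dashboardid, name, lastupdatedtime FROM "quicksight-lineage"."dashboard"'

    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )

    # Poll until the query finishes, then print the first page of results
    query_id = execution["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows[1:]:  # skip the header row
            print([col.get("VarCharValue") for col in row["Data"]])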

The following diagram shows the tables and relationships.

Walkthrough overview

The workflow comprises the following high-level steps:

  1. Deploy the CloudFormation template to build the Lambda functions, AWS Identity and Access Management (IAM) roles, S3 bucket, AWS Glue database, and AWS Glue tables.
  2. Run the Python Lambda functions to build CSV files that contain the QuickSight object details.
  3. Visualize the data in QuickSight. To do so, you must create your data source, dataset, and then analysis.

For this post, we use Athena as the query engine. To use Redshift Spectrum, you must modify the provided queries.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An IAM user with access to AWS resources used in this solution (CloudFormation, IAM, Amazon S3, AWS Glue, Athena, QuickSight)
  • Athena configured with a query result location
  • QuickSight Enterprise Edition

Creating resources

Create your resources by launching the following CloudFormation stack:

During the stack creation process, you must provide an S3 bucket name in the S3BucketName parameter (AWSAccountNumber is appended to the bucket name provided to make it unique).

After the stack creation is successful, you have two Lambda functions, two S3 buckets, an AWS Glue database and tables, and the corresponding IAM roles and policies.
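
If you prefer to create the stack from the command line instead of the console, the following sketch shows the general shape. The stack name and template URL are placeholders only; use the template referenced by the Launch Stack link, and supply your own value for the S3BucketName parameter described above:

    import boto3

    cloudformation = boto3.client("cloudformation")

    # Placeholder values -- substitute the template URL from the Launch Stack link
    # and a bucket name prefix of your choice
    cloudformation.create_stack(
        StackName="quicksight-lineage",
        TemplateURL="https://<template-bucket>.s3.amazonaws.com/<path>/quicksight-lineage.yaml",
        Parameters=[{"ParameterKey": "S3BucketName", "ParameterValue": "my-quicksight-lineage"}],
        Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles and policies
    )

    # Wait for the stack to finish creating
    waiter = cloudformation.get_waiter("stack_create_complete")
    waiter.wait(StackName="quicksight-lineage")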

Running the Lambda function

To run your Lambda function, complete the following steps:

  1. On the Lambda console, navigate to the QuickSight-Lineage-Dispatcher function.
  2. From the Select a test event menu, choose Configure test events.
  3. Select Create new test event.

You create one test event for all QuickSight assets.

  4. For Event name, enter all.
  5. Enter an empty JSON object ({}).
  6. Choose Test to run the Lambda function and generate CSV files of the assets.

Alternatively, you can create test events for each QuickSight object (Data Source, DataSet, Analysis, Dashboard, and Template) for larger QuickSight environments:

  • Test event DataSource code:
    {
      "api": "datasource"
    }

  • Test event DataSet code:
    {
      "api": "dataset"
    }

  • Test event Analysis code:
    {
      "api": "analysis"
    }

  • Test event Dashboard code:
    {
      "api": "dashboard"
    }

  • Test event Template code:
    {
      "api": "template"
    }

The following screenshot shows the configuration of a test event for Analysis.
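
You can also invoke the dispatcher function programmatically with the same payloads rather than using console test events. The following sketch assumes the QuickSight-Lineage-Dispatcher function name created by the CloudFormation stack:

    import json
    import boto3

    lambda_client = boto3.client("lambda")

    # Same payloads as the console test events; an empty object ({}) processes all asset types
    payloads = [{"api": "datasource"}, {"api": "dataset"}, {"api": "analysis"},
                {"api": "dashboard"}, {"api": "template"}]

    for payload in payloads:
        response = lambda_client.invoke(
            FunctionName="QuickSight-Lineage-Dispatcher",
            InvocationType="Event",  # asynchronous; use RequestResponse to wait for completion
            Payload=json.dumps(payload),
        )
        print(payload["api"], response["StatusCode"])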

Creating your data source and lineage data set

In this step, you use QuickSight to access the tables in your AWS Glue database.

  1. Log in to QuickSight.
  2. Choose Manage QuickSight.
  3. Choose Security & permissions.
  4. Ensure that access to the S3 bucket (that was created through CloudFormation) is enabled.
  5. Choose New analysis.
  6. Choose New dataset.
  7. For the data source, choose Athena.

  8. For your data source name, enter QuickSight-Lineage.
  9. Choose Create data source.

QuickSight prompts you to select your schema or database.

  10. Choose Use custom SQL.

  11. Update the query name from New custom SQL to QuickSight Lineage.
  12. Enter the following code into the query box:
    select 
       a.analysisid      as analysis_id,
       a.name            as analysis_name,
       a.analysisarn     as analysis_arn,
       date_parse(substr(a.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')     as analysis_createdtime,
       date_parse(substr(a.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as analysis_lastupdatedtime,
       a.datasetarn      as analysis_datasetarn,
       r.dashboardid        as dashboard_id,
       r.name               as dashboard_name,
       r.name||' - ID: '||r.dashboardid as dashboard_name_w_id,
       date_parse(substr(r.versioncreatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as dashboard_versioncreatedtime,
       r.versionnumber      as dashboard_versionnumber     ,
       date_parse(substr(r.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')  as dashboard_createdtime,
       date_parse(substr(r.lastpublishedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as dashboard_lastpublishedtime ,
       date_parse(substr(r.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as dashboard_lastupdatedtime,
       d.datasetid        as dataset_id,
       d.datasetarn       as dataset_arn,
       d.name             as dataset_name,
       d.spicesize        as dataset_spicesize,
       d.importmode       as dataset_importmode,
       date_parse(substr(d.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')  as dataset_lastupdatedtime,
       date_parse(substr(d.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')      as dataset_createdtime,
       d.source           as dataset_source,
       d.columns          as dataset_columns,
       s.datasourceid     as datasource_id,
       s.datasourcearn    as datasource_arn,
       s.name             as datasource_name,
       s.type             as datasource_type,
       date_parse(substr(s.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as datasource_lastupdatedtime,
       date_parse(substr(s.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as datasource_createdtime,
       s.status           as datasource_status,
       s.errorinfo        as datasource_errorinfo,
       t.templateid       as template_id,
       t.name             as template_name,
       t.templatearn      as template_arn,
       date_parse(substr(t.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')      as template_createtime,
       date_parse(substr(t.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')  as template_lastupdatedtime
    from "quicksight-lineage"."dashboard" r
    left join  "quicksight-lineage"."analysis" a
    on a.analysisarn = r.sourceentityarn and a.datasetarn=r.datasetarns
    left join "quicksight-lineage"."template" t
    on t.templatearn = r.sourceentityarn
    left join  "quicksight-lineage"."dataset" d
    on d.datasetarn = r.datasetarns
    left join  "quicksight-lineage"."datasource" s
    on s.datasourcearn = d.datasourcearn

  13. Choose Confirm query.
  14. Select Import to SPICE for quicker analytics.
  15. Choose Visualize.

In the new analysis, one empty visual is loaded by default.

  16. Change the visual type to pivot table.
  17. Choose (single-click) dashboard_name, analysis_name, template_name, dataset_name, and datasource_name in the Fields list.

You can search for name in the Fields list to make this step easier.

  18. Confirm that all fields were also added to the Rows field well.

If you have assets with duplicate names, it can be helpful to add the corresponding ID columns to the visual; for example, dashboard_id, analysis_id, template_id, dataset_id, and datasource_id.

Visualizing your assets and lineage

You now create five new visuals, one for each asset type (Dashboard, Analysis, Template, Dataset, Data Source), to display the additional columns pulled from the APIs.

  1. From the Add menu, choose Add visual.

  2. For the first new visual, choose the table visual type.
  3. Search for dashboard_ in the Field List.
  4. Choose (single-click) all matching columns.
  5. For the second visual, choose the table visual type.
  6. Search for analysis_ in the Field List.
  7. Choose (single-click) all matching columns.
  8. Move the second visual underneath the first visual.
  9. Repeat the same steps for template_, dataset_, and datasource_.

Creating your permissions data set

You now create your permissions data set.

  1. Leave the analysis by choosing the QuickSight logo on the top left.
  2. In the navigation pane, choose Datasets.
  3. Choose New dataset.
  4. Locate and choose the QuickSight-Lineage data source created earlier in the FROM EXISTING DATA SOURCES section.
  5. In the QuickSight Lineage data source window, choose Create data set.
  6. Choose Use custom SQL.

  7. Update the name from New custom SQL to QuickSight Lineage Permissions.
  8. Enter the following code into the query box:
    select distinct 'datasource' as QuickSightObjectType, sp.datasourceid as "QuickSightID",sp.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id
     from "quicksight-lineage"."datasource_permissions"  sp
     inner join  "quicksight-lineage"."datasource" s
      on s.datasourceid = sp.datasourceid
     left join  "quicksight-lineage"."dataset" d
       on s.datasourcearn = d.datasourcearn
     left join  "quicksight-lineage"."dashboard" r
       on d.datasetarn = r.datasetarns
    union
    select distinct 'dataset' as QuickSightObjectType, dp.datasetid as "QuickSightID",dp.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id 
     from "quicksight-lineage"."dataset_permissions" dp
     inner join  "quicksight-lineage"."dataset" d
       on d.datasetid = dp.datasetid
     left join  "quicksight-lineage"."dashboard" r
       on d.datasetarn = r.datasetarns
    union
    select distinct 'analysis' as QuickSightObjectType, ap.analysisid as "QuickSightID",ap.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id  
     from "quicksight-lineage"."analysis_permissions" ap
      inner join  "quicksight-lineage"."analysis" a
       on a.analysisid = ap.analysisid
      left join  "quicksight-lineage"."dashboard" r
       on a.analysisarn = r.sourceentityarn  
    union
    select distinct 'template' as QuickSightObjectType, tp.templateid as "QuickSightID",tp.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id  
     from "quicksight-lineage"."template_permissions" tp
      inner join  "quicksight-lineage"."template" t
       on t.templateid = tp.templateid
      left join  "quicksight-lineage"."dashboard" r
       on t.templatearn = r.sourceentityarn     
    union
    select distinct 'dashboard' as QuickSightObjectType, dashboardid as "QuickSightID",name,
    split_part(principal,':',6) as principal,permission, name||' - ID: '||dashboardid as dashboard_name_w_id
     from "quicksight-lineage"."dashboard_permissions"

  9. Choose Edit / Preview data.
  10. Choose Apply.
  11. For Query mode, select SPICE.
  12. Choose Save.
  13. Navigate to the Analyses page and open the analysis created earlier.
  14. Choose the pencil icon to add the new dataset.
  15. Choose Add data set.
  16. Select QuickSight Lineage Permissions.
  17. Choose Select.
  18. Make sure the new QuickSight Lineage Permissions dataset is active in the Data set drop-down menu.

Visualizing your permissions

You now add a new visual to display permissions. 

  1. Choose the table visual type.
  2. Choose (single-click) name, principal, and permission in the Field List.
  3. In the navigation pane, choose Filter.
  4. Choose +.
  5. Choose quicksightobjecttype.
  6. Choose the new filter.
  7. Deselect Select all.
  8. Select dashboard.
  9. Choose Apply.
  10. Choose Close.
  11. Move the new permissions visual so it’s to the right of the dashboard visual.
  12. On the new permissions visual, choose the menu options (…).
  13. Choose Duplicate visual.
  14. Repeat this step four times.
  15. Modify the quicksightobjecttype filter on each new permission visual so you have one visual for each asset type.
  16. Move each visual to the right of the corresponding asset type visual.


Creating parameters for filtering

At this point all the visuals are created; next you need to create a parameter. You can simplify the following steps by using the new simplified filter control creation process. For more information, see Amazon QuickSight adds support for on-sheet filter controls. The following steps still work fine, but to add filter controls to an analysis, you don’t need to create parameters anymore. 

  1. Navigate to the Parameters menu.
  2. Choose Create one.
  3. For Name, enter DashboardNameWithID.
  4. Choose Create.
  5. Choose Create a new control for a filter or a calculated field.
  6. For Display name, enter Dashboard Name with ID.
  7. For Style, choose Single select drop down.
  8. For Values, select Link to a data set field.
  9. For Select a data set, choose QuickSight Lineage Permissions.
  10. For Select a column, choose dashboard_name_w_id.
  11. Choose Add.
  12. Choose the first visual (Count of Records by Dashboard_name, Template_name, Dataset_name, Datasource_name, and Analysis_name).
  13. Add a filter on the dashboard_name_w_id field.
  14. Choose the newly added filter.
  15. Set the filter scope to All visuals.
  16. For Filter type, choose Custom filter.
  17. Select Use parameters.
  18. From the drop-down menu, choose DashboardNameWithID.
  19. Choose Apply.
  20. Choose Close.
  21. Choose the first permissions visual (Permission, Principal, and Name).
  22. Add a filter on the dashboard_name_w_id field.
  23. Set the filter scope to All visuals.
  24. For Filter type, choose Custom filter.
  25. Select Use parameters.
  26. From the drop-down menu, choose DashboardNameWithID.
  27. Choose Apply.
  28. Choose Close.

The analysis build is complete and can be published as a dashboard.
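
You can publish the dashboard from the analysis Share menu; if you prefer to script it, the QuickSight API can publish a dashboard from a template. The following sketch is illustrative only: the analysis and dataset ARNs are placeholders (you can look them up in the lineage CSVs produced earlier), and the template and dashboard IDs are arbitrary names:

    import boto3

    quicksight = boto3.client("quicksight")
    account_id = boto3.client("sts").get_caller_identity()["Account"]

    # Placeholders -- look these up in the lineage data or the QuickSight console
    analysis_arn = "arn:aws:quicksight:us-east-1:123456789012:analysis/<analysis-id>"
    dataset_arn = "arn:aws:quicksight:us-east-1:123456789012:dataset/<dataset-id>"

    # 1. Create a template from the analysis; add a DataSetReference for each
    #    dataset the analysis uses (this analysis uses two)
    template = quicksight.create_template(
        AwsAccountId=account_id,
        TemplateId="quicksight-lineage-template",
        Name="QuickSight Lineage Template",
        SourceEntity={
            "SourceAnalysis": {
                "Arn": analysis_arn,
                "DataSetReferences": [
                    {"DataSetPlaceholder": "lineage", "DataSetArn": dataset_arn}
                ],
            }
        },
    )

    # Template creation is asynchronous; a robust script should poll
    # describe_template until the version reports CREATED before publishing.

    # 2. Publish a dashboard from that template
    quicksight.create_dashboard(
        AwsAccountId=account_id,
        DashboardId="quicksight-lineage-dashboard",
        Name="QuickSight Lineage",
        SourceEntity={
            "SourceTemplate": {
                "Arn": template["Arn"],
                "DataSetReferences": [
                    {"DataSetPlaceholder": "lineage", "DataSetArn": dataset_arn}
                ],
            }
        },
    )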

Creating additional visuals

You can also create additional visuals for different use cases.

Visualizing SPICE usage across all your SPICE datasets

To visualize SPICE usage across all your SPICE datasets, complete the following steps.

  1. Use the QuickSight Lineage dataset and choose the donut chart visual.
  2. For Group/Color, add dataset_name.
  3. For Value, add dataset_spicesize.
  4. Change the aggregation of dataset_spicesize to Average because a dataset can be listed multiple times in the data if it is reused across multiple dashboards.

This visual can be useful to track down what is consuming SPICE storage.
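
The same number is available directly from the API if you want to check capacity outside the dashboard. A minimal sketch, using the ConsumedSpiceCapacityInBytes field that the lineage Lambda function already records:

    import boto3

    quicksight = boto3.client("quicksight")
    account_id = boto3.client("sts").get_caller_identity()["Account"]

    total_bytes = 0
    paginator = quicksight.get_paginator("list_data_sets")
    for page in paginator.paginate(AwsAccountId=account_id):
        for summary in page["DataSetSummaries"]:
            try:
                detail = quicksight.describe_data_set(
                    AwsAccountId=account_id, DataSetId=summary["DataSetId"]
                )["DataSet"]
            except Exception:
                continue  # some dataset types (for example, file uploads) can't be described
            size = detail.get("ConsumedSpiceCapacityInBytes", 0)
            total_bytes += size
            print(f'{detail["Name"]}: {size / 1024 ** 3:.2f} GB')

    print(f"Total SPICE usage: {total_bytes / 1024 ** 3:.2f} GB")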

Visualizing SPICE refreshes by hour

To visualize SPICE refreshes by hour, complete the following steps:

  1. Use the QuickSight Lineage dataset to create a vertical stacked bar chart.
  2. For X axis, add dataset_lastupdatedtime aggregated by HOUR.
  3. For Value, add dataset_id aggregated by Count distinct.
  4. For Group/Color, add dataset_name.
  5. Create a filter on dataset_importmode equal to SPICE.

This visual can be useful to see when all the SPICE dataset refreshes last occurred. The source data is a snapshot in time, so you need to update the source data by running the Lambda function on a regular basis.
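
One way to keep the snapshot current is an Amazon EventBridge schedule that invokes the dispatcher function on an interval. A sketch, assuming the QuickSight-Lineage-Dispatcher function and an hourly schedule (the rule name is arbitrary):

    import json
    import boto3

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    function_arn = lambda_client.get_function(
        FunctionName="QuickSight-Lineage-Dispatcher"
    )["Configuration"]["FunctionArn"]

    # Hourly schedule; the empty payload processes all QuickSight asset types
    rule_arn = events.put_rule(
        Name="quicksight-lineage-hourly",
        ScheduleExpression="rate(1 hour)",
    )["RuleArn"]

    events.put_targets(
        Rule="quicksight-lineage-hourly",
        Targets=[{"Id": "dispatcher", "Arn": function_arn, "Input": json.dumps({})}],
    )

    # Allow EventBridge to invoke the function
    lambda_client.add_permission(
        FunctionName="QuickSight-Lineage-Dispatcher",
        StatementId="quicksight-lineage-hourly",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )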

Cleaning up

To avoid incurring future charges, delete the resources you created in this walkthrough by deleting the CloudFormation stack. Also, be sure to delete the analysis and dataset (to free up SPICE usage).

Conclusion

In this post, we showed how to use the QuickSight APIs to collect metadata and permissions for your QuickSight assets and visualize the lineage from data source to dashboard. We also created some visuals to display SPICE usage by data set as well as the last refresh time per data set, allowing you to view the health of your SPICE refreshes and to free up SPICE capacity by cleaning up older data sets.

Give this technique of building administrative dashboards from data collected via the QuickSight APIs a try, and share your feedback and questions in the comments.


About the Authors

Jesse Gebhardt is a senior global business development manager focused on analytics. He has spent over 10 years in the Business Intelligence industry. At AWS, he helps customers around the globe gain insight and value from the data they have stored in their data lakes and data warehouses. Jesse lives in sunny Phoenix, and is an amateur electronic music producer.


Arun Santhosh is a Specialized World Wide Solution Architect for Amazon QuickSight. Arun started his career at IBM as a developer and progressed on to be an Application Architect. Later, he worked as a Technical Architect at Cognizant. Business Intelligence has been his core focus in these prior roles as well.


Shawn Koupal is an Enterprise Analytics IT Architect at Best Western International, Inc.


Accessing and visualizing data from multiple data sources with Amazon Athena and Amazon QuickSight

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-data-from-multiple-data-sources-with-amazon-athena-and-amazon-quicksight/

Amazon Athena now supports federated query, a feature that allows you to query data in sources other than Amazon Simple Storage Service (Amazon S3). You can use federated queries in Athena to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources. Athena queries including federated queries can be run from the Athena console, a JDBC or ODBC connection, the Athena API, the Athena CLI, the AWS SDK, or AWS Tools for Windows PowerShell.

The goal for this post is to discuss how we can use different connectors to run federated queries with complex joins across different data sources with Athena and visualize the data with Amazon QuickSight.

Athena Federated Query

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that translates between your target data source and Athena. You can think of a connector as an extension of the Athena query engine. Prebuilt Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), Amazon Elasticsearch Service (Amazon ES), Amazon ElastiCache for Redis, and JDBC-compliant relational data sources such as MySQL, PostgreSQL, and Amazon Redshift under the Apache 2.0 license. You can also use the Athena Query Federation SDK to write custom connectors. After you deploy data source connectors, the connector is associated with a catalog name that you can specify in your SQL queries. You can combine SQL statements from multiple catalogs and span multiple data sources with a single query.

When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements. Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Because connectors run in Lambda, you can use them to access data from any data source in the cloud or on premises that is accessible from Lambda.
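
As with any Athena query, federated queries can also be run programmatically. The following sketch is illustrative: it assumes a connector already registered under the catalog name dynamo, and the database, table, and output location are placeholders:

    import boto3

    athena = boto3.client("athena")

    # "dynamo" is the catalog name the connector was registered under;
    # the database, table, and output location are placeholders
    response = athena.start_query_execution(
        QueryString='SELECT * FROM "dynamo"."default"."lineitem" LIMIT 10',
        ResultConfiguration={"OutputLocation": "s3://<your-query-results-bucket>/results/"},
        WorkGroup="primary",  # a workgroup running Athena engine version 2
    )
    print(response["QueryExecutionId"])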

Prerequisites

Before creating your development environment, you must have the following prerequisites:

Configuring your data source connectors

After you deploy your CloudFormation stack, follow the instructions in the post Extracting and joining data from multiple data sources with Athena Federated Query to configure various Athena data source connectors for HBase on Amazon EMR, DynamoDB, ElastiCache for Redis, and Amazon Aurora MySQL.

You can run Athena federated queries in the AmazonAthenaPreviewFunctionality workgroup created as part of the CloudFormation stack or you could run them in the primary workgroup or other workgroups as long as you’re running with Athena engine version 2. As of this writing, Athena Federated Query is generally available in the Asia Pacific (Mumbai), Asia Pacific (Tokyo), Europe (Ireland), US East (N. Virginia), US East (Ohio), US West (N. California), and US West (Oregon) Regions. If you’re running in other Regions, use the AmazonAthenaPreviewFunctionality workgroup.

For information about changing your workgroup to Athena engine version 2, see Changing Athena Engine Versions.
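
If you manage workgroups programmatically, the engine version can also be set through the API. A minimal sketch, assuming the primary workgroup:

    import boto3

    athena = boto3.client("athena")

    # Pin the workgroup to engine version 2 (use "AUTO" to let Athena choose)
    athena.update_work_group(
        WorkGroup="primary",
        ConfigurationUpdates={
            "EngineVersion": {"SelectedEngineVersion": "Athena engine version 2"}
        },
    )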

Configuring QuickSight

The next step is to configure QuickSight to use these connectors to query and visualize the data.

  1. On the AWS Management Console, navigate to QuickSight.
  2. If you’re not signed up for QuickSight, you’re prompted with the option to sign up. Follow the steps to sign up to use QuickSight.
  3. After you log in to QuickSight, choose Manage QuickSight under your account.
  4. In the navigation pane, choose Security & permissions.
  5. Under QuickSight access to AWS services, choose Add or remove.

A page appears for enabling QuickSight access to AWS services.

  6. Choose Athena.
  7. In the pop-up window, choose Next.
  8. On the S3 tab, select the necessary S3 buckets. For this post, I select the athena-federation-workshop-<account_id> bucket and another one that stores my Athena query results.
  9. For each bucket, also select Write permission for Athena Workgroup.
  10. On the Lambda tab, select the Lambda functions corresponding to the Athena federated connectors that Athena federated queries use. If you followed the post Extracting and joining data from multiple data sources with Athena Federated Query when configuring your Athena federated connectors, you can select dynamo, hbase, mysql, and redis.

For information about registering a data source in Athena, see the appendix in this post.

  11. Choose Finish.
  12. Choose Update.
  13. On the QuickSight console, choose New analysis.
  14. Choose New dataset.
  15. For Datasets, choose Athena.
  16. For Data source name, enter Athena-federation.
  17. For Athena workgroup, choose primary.
  18. Choose Create data source.

As stated earlier, you can use the AmazonAthenaPreviewFunctionality workgroup or another workgroup as long as you’re running Athena engine version 2 in a supported Region.

  19. For Catalog, choose the catalog that you created for your Athena federated connector.

For information about creating and registering a data source in Athena, see the appendix in this post.

  20. For this post, I choose the dynamo catalog, which does a federation to the Athena DynamoDB connector.

I can now see the database and tables listed in QuickSight.

  21. Choose Edit/Preview data to see the data.
  22. Choose Save & Visualize to start using this data for creating visualizations in QuickSight.
  23. To do a join with another Athena data source, choose Add data and select the catalog and table.
  24. Choose the join link between the two datasets and choose the appropriate join configuration.
  25. Choose Apply.


You should be able to see the joined data.


Running a query in QuickSight

Now we use the custom SQL option in QuickSight to run a complex query with multiple Athena federated data sources.

  1. On the QuickSight console, choose New analysis.
  2. Choose New dataset.
  3. For Datasets, choose Athena.
  4. For Data source name, enter Athena-federation.
  5. For the workgroup, choose primary.
  6. Choose Create data source.
  7. Choose Use custom SQL.
  8. Enter the query for ProfitBySupplierNation.
  9. Choose Edit/Preview data.


Under Query mode, you have the option to view your query in either SPICE or direct query. SPICE is the QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. Using SPICE can save time and money because your analytical queries process faster, you don’t need to wait for a direct query to process, and you can reuse data stored in SPICE multiple times without incurring additional costs. You also can refresh data in SPICE on a recurring basis as needed or on demand. For more information about refresh options, see Refreshing Data.

With direct query, QuickSight doesn’t use SPICE data and sends the query every time to Athena to get the data.
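
Refreshes of a SPICE dataset can also be triggered on demand through the QuickSight API, for example from a schedule or after an upstream load completes. A minimal sketch with a placeholder dataset ID:

    import uuid
    import boto3

    quicksight = boto3.client("quicksight")
    account_id = boto3.client("sts").get_caller_identity()["Account"]

    # Placeholder dataset ID -- use the ID of the SPICE dataset you want to refresh
    ingestion = quicksight.create_ingestion(
        AwsAccountId=account_id,
        DataSetId="<your-dataset-id>",
        IngestionId=str(uuid.uuid4()),  # any unique ID for this refresh
    )
    print(ingestion["IngestionStatus"])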

  10. Select SPICE.
  11. Choose Apply.
  12. Choose Save & visualize.


  13. On the Visualize page, under Fields list, choose nation and sum_profit.

QuickSight automatically chooses the best visualization type based on the selected fields. You can change the visual type based on your requirement. The following screenshot shows a pie chart for Sum_profit grouped by Nation.


You can add more datasets using Athena federated queries and create dashboards. The following screenshot is an example of a visual analysis over various datasets that were added as part of this post.


When your analysis is ready, you can choose Share to create a dashboard and share it within your organization.

Summary

QuickSight is a powerful visualization tool, and with Athena federated queries, you can run analysis and build dashboards on various data sources like DynamoDB, HBase on Amazon EMR, and many more. You can also easily join relational, non-relational, and custom object stores in Athena queries and use them with QuickSight to create visualizations and dashboards.

For more information about Athena Federated Query, see Using Amazon Athena Federated Query and Query any data source with Amazon Athena’s new federated query.


Appendix

To register a data source in Athena, complete the following steps:

  1. On the Athena console, choose Data sources.
  2. Choose Connect data source.
  3. Select Query a data source.
  4. For Choose a data source, select a data source (for this post, I select Redis).
  5. Choose Next.
  6. For Lambda function, choose your function.

For this post, I use the redis Lambda function, which I configured as part of configuring the Athena federated connector in the post Extracting and joining data from multiple data sources with Athena Federated Query.

  7. For Catalog name, enter a name (for example, redis).

The catalog name you specify here is the one that is displayed in QuickSight when selecting Lambda functions for access.

  8. Choose Connect.

When the data source is registered, it’s available in the Data source drop-down list on the Athena console.
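
The same registration can be scripted with the Athena API. The following sketch assumes the redis connector’s Lambda function ARN (a placeholder here) and registers it under the catalog name redis:

    import boto3

    athena = boto3.client("athena")

    # Placeholder ARN -- use the ARN of the deployed redis connector function
    athena.create_data_catalog(
        Name="redis",
        Type="LAMBDA",
        Description="Redis federated connector",
        Parameters={
            "function": "arn:aws:lambda:us-east-1:123456789012:function:redis"
        },
    )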



About the Author

Saurabh Bhutyani is a Senior Big Data Specialist Solutions Architect at Amazon Web Services. He is an early adopter of open-source big data technologies. At AWS, he works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. In his free time, he likes to watch movies and spend time with his family.