Tag Archives: Amazon QuickSight

Organize and share your content with folders in Amazon QuickSight

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/organize-and-share-your-content-with-folders-in-amazon-quicksight/

Amazon QuickSight Enterprise Edition now supports folders for organizing and sharing content. Folders in QuickSight are of two types:

  • Personal folders – Allow individual authors and administrators to organize assets for their personal ease of navigation and manageability
  • Shared folders – Allow authors and administrators to define folder hierarchies that they can share across the organization and use to manage user permissions and access to dashboards, analyses, and datasets

You can access folders directly from shortcuts on the new QuickSight home page (see the following screenshot). In this post, we take a deeper look at folders and how you can implement this in your QuickSight account.

Asset permissions and folders

Before we dive into how the two types of folders work, let’s understand how asset permissions work in QuickSight. QuickSight assets (dashboards, analyses, and datasets) are created by authors or admins, reside in the cloud, and by default are visible in the UI only to the owner, which in this case is the creator of the asset. The owner can share the asset with specific users (authors or admins, or in the case of dashboards, readers) or groups of users, who can then be granted viewer or owner access.
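For example, an owner can inspect a dashboard’s current permissions from the AWS CLI; the following is a minimal sketch, assuming a hypothetical account ID and dashboard ID:

# Minimal sketch: list who currently has access to a dashboard.
# The account ID and dashboard ID are hypothetical placeholders.
aws quicksight describe-dashboard-permissions --aws-account-id 111122223333 --dashboard-id sales-overview-dashboard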

Previously, this meant that admins and authors with hundreds of assets had to manage permissions for users and groups individually, and there was no hierarchical structure for navigating and discovering key assets. We built personal folders to give authors and admins a way to organize their content, while shared folders provide easier bulk permissions management for authors and simpler asset discovery for both authors and readers.

Personal folders are available to all authors and admins in QuickSight Enterprise Edition. You can create these folders within your user interface and add assets in them. Personal folders aren’t visible to other users within the account, and they don’t affect the permissions of any objects placed within. This means that if you create a personal folder called Published dashboards and add a dashboard to it, there are no changes to user permissions in the dashboard on account of its addition to this folder. An important difference here is that unlike traditional folders, QuickSight allows you to place the same asset in multiple folders, which avoids the need to replicate the same asset in different folders. This allows you to update one time and make sure all your stakeholders get the latest information.

The following screenshot shows the My folders page on the QuickSight console.

Shared folders in QuickSight are visible to permissioned users across author, admin, and reader roles in QuickSight Enterprise Edition. Top (root)-level shared folders can only be created by admins of the QuickSight account, who can share these with other users or groups. When sharing, folders offer two levels of permissions:

  • Owner access – Allows admins or authors with access to the folder to add and remove assets (including subfolders), modify the folder itself, and share as needed with users or groups.
  • Viewer access – Restricts users to only viewing the folder and contents within, including subfolders. Readers can only be assigned viewer access, and can see the Shared folders section when at least one folder is shared with them.

The following screenshot shows the Shared folders page.

The following screenshot shows the Share folder pop-up window, which you use to choose who to share folders with.

Permissions granted to a user or group at a parent folder level are propagated to subfolders within, which means that owners of a parent folder have access to subfolders. As a result, it’s best to model your permissions tree and folder structure before implementing and sharing folders in your account. Users who are to be restricted to specific folders are best granted access at the lowest level possible.

Folder permissions are currently also inherited by the assets within. For example, if a dashboard is placed in a shared folder, and Sally is granted access to the folder as an owner, Sally now has ownership over the folder and the dashboard. This model allows you to effectively use folders to manage shared permissions across thousands of users without having to implement this on a per-user or per-asset level.

For example, a team of 10 analysts could have owner permissions to a shared folder, which allows them to own both the folder and contents within, while thousands of other users (readers, authors, and admins) can be granted viewer permissions to the folder. This ensures that permissions management for these viewers can be done by the one-time action of granting them viewer permissions over the folder, instead of granting these permissions to users and groups within each dashboard. Permissions applied at the individual asset level continue to be enforced, and a user’s final permissions are the combination of the folder and individual asset permissions (whichever is higher).

Shared folders also enforce a uniqueness check over the folder path, which means that you can’t have two folders that have the same name at the same level in the folder tree. For example, if the admin creates /Oktank/ and shares with Tom and Sally as owners, and Tom creates /Oktank/Marketing/, Sally can no longer create a folder with the name Marketing. She should coordinate with Tom on permissions and get Tom to share this folder as an owner so that she can also contribute to the marketing assets. For personal folders (and for other asset types including dashboards, analyses, and datasets), QuickSight doesn’t require such uniqueness.

With QuickSight Enterprise Edition, dashboards, analyses, and datasets—whether owned by a user or shared with them—exist within the user’s QuickSight account and can be accessed via the asset-specific details page or search. All assets continue to be displayed via these pages, while those added to specific folders become visible via the folders view.

Use case: Oktank Analytics

Let’s put this all together and look at how a fictional customer, Oktank Analytics, can set up shared folders within their account. Let’s assume that Oktank has three departments: marketing, sales, and finance, with the sales team subdivided into US and EU orgs. Each of these departments and sub-teams has its own set of analysts that build and manage dashboards, and departmental users that expect to see data pertaining to their functional area. Oktank also has C-level executives that need access to dashboards from each department. Finally, QuickSight administrators oversee the overall business intelligence solution.

To implement this in QuickSight and provide a scalable model, the admin team first creates the top-level folder /Oktank/ and grants viewer access to the C-level executives. This grants the leadership team access to all subfolders underneath, making sure that there are no access issues. Access is also limited to viewer, so that the leadership has visibility but can’t accidentally make any changes.
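Folder creation and sharing are console flows in this post; the QuickSight CLI has since added folder operations. The following is a rough sketch only, where the folder ID, account ID, executive group name, and viewer action set are assumptions rather than details from this post:

# Sketch: create the top-level shared folder and grant a leadership group viewer access.
# Folder ID, account ID, group name, and the action list are assumptions.
aws quicksight create-folder \
    --aws-account-id 111122223333 \
    --folder-id oktank-root \
    --name Oktank \
    --folder-type SHARED \
    --permissions '[{"Principal":"arn:aws:quicksight:us-east-1:111122223333:group/default/c-level-executives","Actions":["quicksight:DescribeFolder"]}]'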

Next, the admin team creates subfolders for marketing, sales, and finance. Both the admins and C-level executives have access to these folders (as owner and viewer, respectively) due to their permissions on the top-level folder.

The following diagram illustrates this folder hierarchy.

Oktank admins grant owner permissions to the Marketing folder to the marketing analyst team (via QuickSight groups). This allows the analyst team to create subfolders that match expectations of their users and leadership. To streamline access, the marketing analyst team creates two subfolders: Assets and Dashboards. The marketing analyst team uses Assets (/Oktank/Marketing/Assets/) to store datasets and analyses that they need to build and manage dashboards. Because all the marketing analysts have access to this folder, critical assets aren’t disrupted when an analyst is on vacation or leaves the company. Marketing analysts have owner permissions, the admin team has owner permissions, and C-level executives have viewer permissions.

The marketing analyst team uses the Dashboards folder to store dashboards that are shared to all marketing users (via QuickSight groups). All marketing users are granted viewer permissions to this folder (/Oktank/Marketing/Dashboards/); marketing analysts grant themselves owner permissions while the admin team and C-level executives have owner and viewer permissions propagated. For marketing users, access to this folder means that all the dashboards relevant to their roles can be explored in /Oktank/Marketing/Dashboards/, which is available through the Shared Folders link on the home page. The marketing analyst team also doesn’t have to share these assets individually or worry about permissions being missed out for specific users or dashboards.

The sales team needs further division because US and EU have different teams and data. The admin team creates the Sales subfolder, and then creates US and EU subfolders. They grant US sales analysts owner access to the US subfolder (/Oktank/Sales/US/), which gives the analysts the ability to create subfolders and share with users as appropriate. This allows the US sales analyst team to create /Oktank/Sales/US/Assets and /Oktank/Sales/US/Dashboards/. Similar to the marketing team, they can now store their critical datasets, analyses, and dashboards in the Assets folder, and open up the Dashboards folder to all US sales personnel, providing a one-stop shop for their users. The C-level executives have viewer access to these folders and can access these assets and anything added in the future.

Admins and C-level executives see the following hierarchy in their shared folder structure; admins have owner access to all, and C-level executives have viewer access:

Oktank

Oktank > Marketing

Oktank > Marketing > Assets

Oktank > Marketing > Dashboards

Oktank > Sales

Oktank > Sales > US

Oktank > Sales > US > Assets

Oktank > Sales > US > Dashboards

Oktank > Sales > EU

Oktank > Sales > EU > Assets

Oktank > Sales > EU > Dashboards

Oktank > Finance

Oktank > Finance > Assets

Oktank > Finance > Dashboards

A member of the marketing analyst team sees the following:

Oktank

Oktank > Marketing

Oktank > Marketing > Assets

Oktank > Marketing > Dashboards

A member of the Oktank marketing team (e.g., marketing manager) sees the following:

Oktank

Oktank > Marketing

Oktank > Marketing > Dashboards

A member of the US Sales analyst team sees the following:

Oktank

Oktank > Sales

Oktank > Sales > US

Oktank > Sales > US > Assets

Oktank > Sales > US > Dashboards

A member of the Oktank US Sales team (e.g., salesperson) sees the following:

Oktank

Oktank > Sales

Oktank > Sales > US

Oktank > Sales > US > Dashboards

Conclusion

QuickSight folders provide a powerful way for admins and authors to organize, manage, and share content while being a powerful discovery mechanism for readers. Folders are now generally available in QuickSight Enterprise Edition in all supported QuickSight Regions.

 


About the Author

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

Embed multi-tenant analytics in applications with Amazon QuickSight

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/embed-multi-tenant-analytics-in-applications-with-amazon-quicksight/

Amazon QuickSight recently introduced four new features—embedded authoring, namespaces for multi-tenancy, custom user permissions, and account-level customizations—that, together with the existing dashboard embedding and API capabilities available in the Enterprise Edition, allow you to integrate advanced dashboarding and analytics capabilities within SaaS applications. Developers and independent software vendors (ISVs) who build these applications can now offer embedded, pre-configured (canned) dashboards to all end-users, while also providing sophisticated ad-hoc exploration and dashboard-building capabilities to power users.

In this post, we look at a use case for a fictional ISV and explore how QuickSight makes it easy to embed analytics into the app with no infrastructure to deploy or manage and scale to hundreds of thousands of users.

New features

  • Embedded authoring – You can embed the full dashboard-building experience within a portal or application, including the QuickSight home page, search, and data experiences. This allows ISVs to provide ad-hoc data exploration and authoring capabilities to an application’s power users, who might want to explore usage data, create specific views as dashboards, and share their creations with other users or groups in their organization. For the ISV, this means freedom from having to perform ad-hoc development for every customer request, while for customers, this empowers them to quickly and easily find insights that are relevant to them.
  • Namespaces – With namespaces, you can logically group and isolate sets of users in a QuickSight account. Before using namespaces, all users in a QuickSight account resided in a single (default) namespace, and as a result could see each other (for example, when trying to share content). You can now provision authors and readers from an organization within a unique namespace so they can see each other but are isolated from all other users in the QuickSight account. Namespaces aren’t required for reader-only scenarios, but are important for providing secure multi-tenancy when using embedded authoring so that authors are restricted to sharing and collaborating with their coworkers only. All users provisioned with a QuickSight account by default (via UI or SSO) exist in the default namespace; non-default namespaces are currently only accessible for federated single sign-on users.
  • Custom user permissions – This feature allows you to customize author permissions—for example, you can remove the ability to create a data source or dataset. This allows admins to provide a restricted set of capabilities to embedded or regular authors, and creates a curated experience for authors by sharing only specific datasets or data sources that are required for the use case.
  • QuickSight customizations – You can turn off the sample analyses and video content in QuickSight and also specify a default theme at a namespace level.

Multi-tenant embedded analytics architecture

The following diagram illustrates how authors reside within namespaces and how they tie into the overall AWS account.

Without namespaces, developers could set up their own embedded dashboards for hundreds of thousands of users with QuickSight. For example, see the following dashboard for our fictional company, Oktank Analytica.

With namespaces in place, you can extend this to provide ad-hoc authoring capabilities using curated datasets specific to each customer, created and shared by the developer or ISV. See the following screenshot.

Use case

With this end-result as a target, let’s explore how Oktank Analytica implements such multi-tenant analytics—with both canned dashboards and ad-hoc analysis and dashboard-building capabilities—in its existing SaaS application. For simplicity, we assume that Oktank has two customers, with two authors and two readers within each customer. They also have a development team that uses the default namespace to develop content.

If Oktank only provided embedded dashboards for all users, we could simplify this implementation by using a single namespace for everyone—dashboards are view-only, so the readers provisioned for each customer can’t discover or view each other. We could simply use the QuickSight row-level security feature to make sure the appropriate data is displayed to the right users or groups. For more information, see Using Row-level Security (RLS) to Restrict Access to a Dataset.

The following table summarizes Oktank’s namespaces.

           Default Namespace     FooCompany Namespace     Foo2Company Namespace
Readers    Bob, Sally            Peter, David             Julie, Tom
Authors    Tom, Susan            Jill, Emma               John, Nancy
Admins     Mike, Andrew
Notes      Development team      Foo Company users        Foo2 Company users

When provisioning these users, Oktank uses the default namespace (created as part of QuickSight account setup) for the development team, and provisions admin, authors, and readers as they do normally. They set up customer users after creating the namespaces. For this post, we walk you through the example of Foo Company.

Users in the default namespace are regular QuickSight authors and admins because these are users from Oktank Analytica. However, Oktank wants to prevent the users from Foo Company and Foo2 Company from discovering any assets within the account, and to limit them to the specific datasets shared with them.

Creating a namespace

To implement this solution, we first create the namespace. See the following code:

aws quicksight create-namespace --aws-account-id 111122223333 --region us-east-1 --namespace FooCompanyNamespace --identity-store QUICKSIGHT
{
    "Status": 202,
    "Name": "FooCompanyNamespace",
    "CapacityRegion": "us-east-1",
    "CreationStatus": "CREATING",
    "IdentityStore": "QUICKSIGHT",
    "RequestId": "9576f625-39b4-47ee-b56a-bcb95ed1f040"
}

This operation is asynchronous, so we wait and verify that namespace creation finished:

aws quicksight describe-namespace --aws-account-id 111122223333 --region us-east-1 --namespace FooCompanyNamespace
{
    "Status": 200,
    "Namespace": {
        "Name": "FooCompanyNamespace",
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:namespace/FooCompanyNamespace",
        "CapacityRegion": "us-east-1",
        "CreationStatus": "CREATED",
        "IdentityStore": "QUICKSIGHT"
    },
    "RequestId": "9a00b6bd-02b1-471a-9eaf-30db68bedd99"
}

Customizing QuickSight

Now that the namespace is created, we can customize QuickSight to make sure it works well in the multi-tenant setup.

First, we turn off the instructional videos and samples in QuickSight, because Oktank has its own product welcome videos and wants to include QuickSight onboarding in those. QuickSight admins can access this on the Account customization page.

Customizing user permissions

Next, we use custom user permissions to restrict users from creating new data sources or datasets. This allows Oktank to create users that can explore data and create dashboards but not connect to data sources outside of what is shared in the application.

To do this, create a custom user permissions policy. On the Manage users page, choose Manage permissions.

In the policy, restrict the user to using only the datasets that Oktank admins share. You can expect more options to become available on this screen over time than what is shown in this screenshot.
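Once the custom permissions profile exists (named DataExploration here, matching the register-user call in the next section), you can also attach it to an already registered user from the CLI. The following is a minimal sketch with a hypothetical email address:

# Sketch: attach the DataExploration custom permissions profile to an existing author.
# The email address is a hypothetical placeholder.
aws quicksight update-user \
    --aws-account-id 111122223333 \
    --namespace FooCompanyNamespace \
    --user-name jill \
    --email jill@example.com \
    --role AUTHOR \
    --custom-permissions-name DataExploration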

Creating a user account and group

With the custom user permissions policy available, we can create the user account for Jill, who is an author in Foo Company. See the following code:

aws quicksight register-user --namespace FooCompanyNamespace --identity-type QUICKSIGHT --user-role AUTHOR --region us-east-1 --custom-permissions-name DataExploration --email [email protected] --user-name jill --aws-account-id 111122223333

{
    "Status": 201,
    "User": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill",
        "UserName": "jill",
        "Email": "[email protected]",
        "Role": "AUTHOR",
        "Active": false,
        "PrincipalId": "user/d-90677fdc8e/086f9e70-f140-4ac0-9d10-7a21fa718bb9",
        "CustomPermissionsName": "DataExploration"
    },
    "UserInvitationUrl": "https://signin.aws.amazon.com/inviteuser?token=11a1mJtOYlD9T-quoo8b7tJK7bd4_Sa0lPb8Wdr9AW5p29NA7m30lvLKvomMNPnKIaZr3lmLwFf3E0tij5fPj5R9XkT4dSf5b11xsO8MnfNOmerqtHdrgt2StqywbMP2PNUuii1Pz3Xz8pIOIS_4xzFhtPuNwSuiP7JALO5kK3So_HuNdhZn_WLNjiMj47u_dq-NvuBrnZB8Lc8w",
    "RequestId": "78d36175-5d67-4fa9-b45f-124bbf4806dc"
}

For convenience, we can also create groups for authors and readers within FooCompanyNamespace.

aws quicksight create-group --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors"
{
    "Status": 201,
    "Group": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors",
        "GroupName": "all-authors",
        "PrincipalId": "group/d-90677fdc8e/3ec30833-3a1c-4e67-ac65-92950a770c0e"
    },
    "RequestId": "b70a78c8-2453-4b47-8a49-f1f9058a0716"
}

aws quicksight create-group-membership --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors" --member-name="jill"
{
    "Status": 200,
    "GroupMember": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill",
        "MemberName": "jill"
    },
    "RequestId": "593d1774-27fd-4b11-8fa0-78025b819a8f"
}

aws quicksight create-group-membership --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors"  --member-name="emma"
{
    "Status": 200,
    "GroupMember": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/emma",
        "MemberName": "emma"
    },
    "RequestId": "32e65803-497b-475c-a959-d334403320ac"
}
aws quicksight list-group-memberships --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors"
{
    "Status": 200,
    "GroupMemberList": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/emma",
            "MemberName": "emma"
        },
        {
            "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill",
            "MemberName": "jill"
        }
    ],
    "RequestId": "7085373a-53a0-4209-9d6d-cf685caa3184"
}

Sharing assets

Oktank can now use this group to share assets with the users. For example, if Oktank wants to share a dataset with these users, they can use the following code:

aws quicksight update-data-set-permissions --cli-input-json file://datapermissions.json

The datapermissions.json file contains the following:

{
    "AwsAccountId": "111122223333", 
    "DataSetId": "011cef30-cd53-425c-827b-045a171f90f3", 
    "GrantPermissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors",
            "Actions": [
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions"
            ]
        }
    ]
}

Similarly, Oktank can share a pre-built dashboard with this group:

aws quicksight update-dashboard-permissions --cli-input-json file://dbpermissions.json

The dbpermissions.json file contains the following:

{
    "AwsAccountId": ""111122223333", ", 
    "DashboardId": "Oktank-supply-chain-dashboard",
    "GrantPermissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors", 
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:QueryDashboard"
                
            ]
        }
    ]
}

If needed, you can also share dashboards with the namespace, which makes it easy to make sure access is always granted to new users in the namespace.
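The following is a rough sketch of such a grant, assuming the namespace ARN is accepted as the principal in the same way as user and group ARNs:

# Sketch: grant view access on the dashboard to all users in FooCompanyNamespace.
# Assumes the namespace ARN works as the principal; verify against current documentation.
aws quicksight update-dashboard-permissions \
    --aws-account-id 111122223333 \
    --dashboard-id Oktank-supply-chain-dashboard \
    --grant-permissions '[{"Principal":"arn:aws:quicksight:us-east-1:111122223333:namespace/FooCompanyNamespace","Actions":["quicksight:DescribeDashboard","quicksight:ListDashboardVersions","quicksight:QueryDashboard"]}]'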

Creating and sharing a default theme

Before we start with the embedded authoring flow, we can set up a default theme for authors in the Foo namespace. The theme editor in QuickSight allows you to change the colors, fonts, and layouts that are used in a dashboard.

After you create a custom theme, you can share it with the namespace so all users within the namespace have access to it. See the following code:

aws quicksight update-theme-permissions --region us-east-1 --aws-account-id 111122223333 --cli-input-json file://permissions.json

The permissions.json file contains the following:

{
    "AwsAccountId": "111122223333", 
    "ThemeId": "25515eb4-e7e3-4a68-a274-4a863bd79d81", 
    "GrantPermissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors",
            "Actions": [
    "quicksight:DescribeTheme",
    "quicksight:DescribeThemeAlias",
    "quicksight:ListThemeAliases",
    "quicksight:ListThemeVersions"
            ]
        }
    ]
}

aws quicksight create-account-customization --region us-east-1 --aws-account-id 111122223333 --namespace=FooCompanyNamespace --account-customization DefaultTheme="arn:aws:quicksight:us-east-1:111122223333:theme/25515eb4-e7e3-4a68-a274-4a863bd79d81"

{
    "Status": 201,
    "AwsAccountId": "111122223333",
    "Namespace": "FooCompanyNamespace",
    "AccountCustomization": {
        "DefaultTheme": "arn:aws:quicksight:us-east-1:111122223333:theme/25515eb4-e7e3-4a68-a274-4a863bd79d81"
    },
    "RequestId": "221ac63b-8c71-48c2-95e4-ffd9b5476e07"
}

Setting up the embedded authoring experience

Now that we have created the namespace and users, customized QuickSight, enabled a default theme, and shared the theme and dataset, we can set up the embedded authoring experience.

First, make sure that the domain where you embed QuickSight is allowed in your admin settings—access this on the Domains and embedding page.

Next, make sure that your application server has the permissions to invoke the get-dashboard-embed-url and get-session-embed-url commands, which you need for embedding dashboards and the authoring interface, respectively.

For user authentication, QuickSight supports both AWS Identity and Access Management (IAM) federated users and direct QuickSight federation. Both options mean that your end-users never see a QuickSight login page and are simply authenticated by your server. In this use case, we use QuickSight federation because we use a QuickSight identity type for Jill and Emma. See the following code:

aws quicksight get-session-embed-url --aws-account-id 111122223333 --region us-east-1 --user-arn arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill --entry-point /start/favorites 
{
    "Status": 200,
    "EmbedUrl": "https://us-east-1.quicksight.aws.amazon.com/embedding/eab15737343f4335867483528cd53d20/sn/data-sets?code=AYABeK_9AGgfIBA6_NSqDti_M1cAAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy1lYXN0LTE6MjU5NDgwNDYyMTMyOmtleS81NGYwMjdiYy03MDJhLTQxY2YtYmViNS0xNDViOTExNzFkYzMAuAECAQB4P-lAb3AsrVHPwO-wVCEmuFDTp0yz4wFfkrwutzPupewBHRHPzBvGEF0mwTMFwR7fSQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDAYnevNrQG_42UsUigIBEIA7_amsZyNiF2wOi-LEXZq8X3ToZ3LwlTCAGHhNli8208lv5zIfjEbSr9zOxeF4SsyEWryxNzVYiXd6kg4CAAAAAAwAABAAAAAAAAAAAAAAAAAACuwtWXlySImCI98W0x6jkv____8AAAABAAAAAAAAAAAAAAABAAAA5i3XsPdfGRy9rMCB-EM39rDeOgFDMZKIFqA3lQcLZI_nXith2wGH_1dDP8n5uL1BFLIxVURk3caapiNXyDMxDkHS9x22U1w2TH6wZlQ0_Nd4Eqzn05rPowTzAXU0yiG1nTXo6Rv-_p01tQ1g2IVzm4ykigGXwuxD92ekgKsgEE46OvTwjBRQMtsUaaJKKKcl0J-whFJ3G-p2ATZMcVKACwyUtHWScIJFpVOc4AIM0m4u5quf2m9_OyWpAJgzJn_TtyKxkl-jcOht7S9KsIuLEmvnlJTMdUB6FXk6M1-OxyQ_ogoXH2hvYnW7D7GA1yrUR2T2tj29NA%3D%3D&identityprovider=quicksight&isauthcode=true",
    "RequestId": "48d1c858-36fc-421e-a22b-e64ce6e58545"
}

With the help of the JavaScript SDK, we can embed this URL within the Oktank SaaS application and send Jill directly to the home page, which shows the default analyses curated for her. From here Jill can navigate to the datasets page, where she can access the curated dataset to start exploring from scratch. Because we disabled dataset and data source creation, the embedded authoring interface doesn’t show options to connect to any data sources in the account or outside; it only displays the specific dataset shared.

When Jill tries to explore the data available, QuickSight applies the Oktank theme by default, and allows her to explore data and create dashboards, as needed.

As this dashboard is being built, Jill can share it with Emma, collaborate on it, and decide which users within Foo Company to publish to. With namespaces, Oktank knows that Jill doesn’t see any other users except Emma.

Overall, this set of Amazon QuickSight capabilities allows Oktank to provide a compelling analytics experience within their SaaS portal, while making sure that only the right users see the right data. QuickSight provides a highly scalable, multi-tenant analytics option that you can set up and productionize in days. For more information about this integration, see the tutorial Embed Amazon QuickSight.

Conclusion

The combination of embedded dashboards and authoring capabilities, together with namespaces for multi-tenancy and account customization options, allows developers and ISVs to quickly and easily set up sophisticated analytics for their application users—all without any infrastructure to set up or manage, and with the ability to scale to millions of users. For more updates from QuickSight Embedded Analytics, see What’s New in the Amazon QuickSight User Guide!



About the Author

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

Enable fine-grained permissions for Amazon QuickSight authors in AWS Lake Formation

Post Syndicated from Adnan Hasan original https://aws.amazon.com/blogs/big-data/enable-fine-grained-permissions-for-amazon-quicksight-authors-in-aws-lake-formation/

We’re excited to announce the integration of Amazon QuickSight with the AWS Lake Formation security model, which provides fine-grained access control for QuickSight authors. Data lake administrators can now use the Lake Formation console to grant QuickSight users and groups permissions to AWS Glue Data Catalog databases, tables, and Amazon Simple Storage Service (Amazon S3) buckets that are registered and managed via Lake Formation.

This new feature enhances the fine-grained access control capability previously introduced in QuickSight, which allows admins to use AWS Identity and Access Management (IAM) policies to scope down QuickSight author access to Amazon S3, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift. The scope-down access is enforced by attaching IAM policies to the QuickSight user or a group in the QuickSight portal. For more information, see Introducing Amazon QuickSight fine-grained access control over Amazon S3 and Amazon Athena.

For Athena-based datasets, you’re no longer required to use IAM policies to scope down QuickSight author access to Amazon S3, or Data Catalog databases and tables. You can grant permissions directly in the Lake Formation console. An added benefit is that you can also grant column-level permissions to the QuickSight users and groups. Lake Formation handles all this for you centrally.

This feature is currently available in the QuickSight Enterprise edition in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)

It will soon be available in all Regions where Lake Formation is available as of this post. For more information, see Region Table.

This post compares the new fine-grained permissions model in Lake Formation to the IAM policy-based access control in QuickSight. It also provides guidance on how to migrate fine-grained permissions for QuickSight users and groups to Lake Formation.

QuickSight fine-grained permissions vs. Lake Formation permissions

In QuickSight, you can limit user or group access to AWS resources by attaching a scope-down IAM policy. If no such policies exist for a user or a group (that the user is a member of), QuickSight service role permissions determine access to the AWS resources. The following diagram illustrates how permissions work for a QuickSight user trying to create an Athena dataset.

With the Lake Formation integration, the permissions model changes slightly. The two important differences while creating an Athena dataset are:

  • Users can view the Data Catalog resources (databases and tables) that meet one of the following conditions:
    1. The IAMAllowedPrincipals security group is granted Super permission to the resource in Lake Formation.
    2. An ARN for the QuickSight user or group (that the user is a member of) is explicitly granted permissions to the resource in Lake Formation.
  • If the S3 source bucket for the Data Catalog resource is registered in Lake Formation, Amazon S3 access settings in QuickSight are ignored, including scope-down IAM policies for users and groups.

The following diagram shows the change in permission model when a QuickSight user tries to create an Athena dataset.

The following sections dive into how fine-grained permissions work in QuickSight and how you can migrate the existing permissions to the Lake Formation security model.

Existing fine-grained access control in QuickSight

For this use case, a business analyst in the marketing team, lf-qs-author, created an Athena dataset Monthly Sales in QuickSight. It was built using the month_b2bsalesdata table in AWS Glue and the data in the S3 bucket b2bsalesdata.

The following screenshot shows the table details.

The following screenshot shows the dataset details.

The dataset is also shared with a QuickSight group analystgroup. See the following screenshot of the group details.

A fine-grained IAM policy enforces access to the S3 bucket b2bsalesdata for lf-qs-author and analystgroup. The following code is an example of an Amazon S3 access policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::"
        },
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::b2bsalesdata"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::b2bsalesdata/"
            ]
        }
    ]
}

Enabling QuickSight permissions in Lake Formation

To migrate QuickSight permissions to Lake Formation, follow the steps described below in the given order:

1.) Capturing the ARN for the QuickSight user and group

First, capture the QuickSight ARN for the business analyst user and marketing team group. You can use the describe-user API and the describe-group API to retrieve the user ARN and the group ARN, respectively. For example, to retrieve the ARN for the QuickSight group analystgroup, enter the following code in the AWS Command Line Interface (AWS CLI):

aws quicksight describe-group --group-name 'analystgroup' --aws-account-id 253914981264 --namespace default

Record the group ARN from the response, similar to the following code:

{
 "Status": 200,
 "Group": {
 "Arn": "arn:aws:quicksight:us-east-1:253914981264:group/default/analystgroup",
 "GroupName": "analystgroup",
 "PrincipalId": "group/d-906706bd27/3095e3ab-e901-479b-88da-92f7629b202d"
 },
 "RequestId": "504ec460-2ceb-46ca-844b-a33a46bc7080"
}

Repeat the same step to retrieve the ARN for the business analyst lf-qs-author.
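For example, the following command returns the user details, including the Arn value to record:

aws quicksight describe-user --user-name 'lf-qs-author' --aws-account-id 253914981264 --namespace default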

2.) Granting permissions in the data lake

To grant permissions to the month_b2bsalesdata table in salesdb, complete the following steps:

  1. Sign in to the Lake Formation console as the data lake administrator.

A data lake administrator can grant any principal (IAM, QuickSight, or Active Directory) permissions to Data Catalog resources (databases and tables) or data lake locations in Amazon S3. For more information about creating a data lake administrator and the data lake security model, see AWS Lake Formation: How It Works.

  2. Choose Tables.
  3. Select month_b2bsalesdata.
  4. From the Actions drop-down menu, choose View permissions.

You see a list of principals with associated permissions for each resource type.

  5. Choose Grant.
  6. For Active Directory and Amazon QuickSight users and groups, enter the QuickSight user ARN.
  7. For Table permissions, select Select.
  8. Optionally, under Column permissions, you can grant column-level permissions to the user. This is a benefit of using Lake Formation permissions over QuickSight policies.
  9. Choose Grant.

  10. Repeat the preceding steps to grant Select table permissions to analystgroup, using the ARN you recorded earlier.
  11. Select month_b2bsalesdata.
  12. From the Actions drop-down menu, choose View permissions.

The following screenshot shows the added permissions for the QuickSight user and group.
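If you prefer to script these grants, the following is a sketch of the equivalent AWS CLI call using the ARNs captured earlier (run it once with the user ARN and once with the analystgroup ARN as the principal):

# Sketch: grant Select on the table to the QuickSight user captured earlier.
aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=arn:aws:quicksight:us-east-1:253914981264:user/default/lf-qs-author --permissions "SELECT" --resource '{ "Table": {"DatabaseName":"salesdb", "Name":"month_b2bsalesdata"} }'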

3.) Removing IAMAllowedPrincipal group permissions

For Lake Formation permissions to take effect, you must remove the IAMAllowedPrincipal group from the month_b2bsalesdata table.

  1. Select month_b2bsalesdata.
  2. From the Actions drop-down menu, choose View permissions.
  3. Select IAMAllowedPrincipals.
  4. Choose Revoke.
  5. Choose Revoke again to confirm.
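The same revocation can be scripted. The following sketch assumes IAM_ALLOWED_PRINCIPALS is the principal identifier the API expects for this group and that the Super permission corresponds to ALL:

# Sketch: revoke the blanket permission from IAMAllowedPrincipals on the table.
aws lakeformation revoke-permissions --principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS --permissions "ALL" --resource '{ "Table": {"DatabaseName":"salesdb", "Name":"month_b2bsalesdata"} }'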

4.) Registering your S3 bucket in Lake Formation

You can now register the S3 source bucket (b2bsalesdata) in Lake Formation. Registering the S3 bucket switches Amazon S3 authorization from QuickSight scope-down policies to Lake Formation security.

  1. Choose Data lake locations.
  2. Choose Register location.
  3. For Amazon S3 path, enter the path for your source bucket (s3://b2bsalesdata).
  4. For IAM role, choose the role with permissions to that bucket.
  5. Choose Register location.
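The following is a minimal CLI sketch of the same registration; the role ARN is a placeholder for the role you chose above:

# Sketch: register the source bucket with Lake Formation using a placeholder role ARN.
aws lakeformation register-resource --resource-arn arn:aws:s3:::b2bsalesdata --role-arn arn:aws:iam::253914981264:role/LakeFormationS3Role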

5.) Cleaning up the scope-down policies in QuickSight

You can now remove the scope-down policies for the user and group in QuickSight. To find these policies, under Security and Permissions, choose IAM policy assignments.

6.) Creating a dataset in QuickSight

To create a dataset, complete the following steps:

  1. Log in to QuickSight as a user who is a member of analystgroup (someone besides lf-qs-author).
  2. Choose Manage data.
  3. Choose New data set.
  4. Choose Athena.
  5. For the data source name, enter Marketing Data.
  6. Choose Create data source.
  7. In the list of databases, choose salesdb.
  8. Choose month_b2bsalesdata.
  9. Choose Edit/Preview data.

The following screenshot shows the details of month_b2bsalesdata table.

You can also use custom SQL to query the data.

Conclusion

This post demonstrates how to extend the Lake Formation security model to QuickSight users and groups, which allows data lake administrators to manage data catalog resource permissions centrally from one console. As organizations embark on the journey to secure their data lakes with Lake Formation, having the ability to centrally manage fine-grained permissions for QuickSight authors can extend the data governance and enforcement of security controls at the data consumption (business intelligence) layer. You can enable these fine-grained permissions for QuickSight users and groups at the database, table, or column level, and they’re reflected in the Athena dataset in QuickSight.

Start migrating your fine-grained permissions to Lake Formation today, and leave your thoughts and questions in the comments.

 


About the Author

Adnan Hasan is a Solutions Architect with Amazon QuickSight at Amazon Web Services.

 

Enforce column-level authorization with Amazon QuickSight and AWS Lake Formation

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/enforce-column-level-authorization-with-amazon-quicksight-and-aws-lake-formation/

Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights and integrates seamlessly with your data lake built on Amazon Simple Storage Service (Amazon S3). QuickSight users in your organization often need access to only a subset of columns for compliance and security reasons. Without a proper solution to enforce column-level security, you have to develop additional solutions, such as views, data masking, or encryption, to enforce security.

QuickSight accounts can now take advantage of AWS Lake Formation column-level authorization to enforce granular-level access control for their users.

Overview of solution

In this solution, you build an end-to-end data pipeline using Lake Formation to ingest data from an Amazon Aurora MySQL database to an Amazon S3 data lake and use Lake Formation to enforce column-level access control for QuickSight users.

The following diagram illustrates the architecture of this solution.

Walkthrough overview

The detailed steps in this solution include building a data lake using Lake Formation, which uses an Aurora MySQL database as the source and Amazon S3 as the target data lake storage. You create a workflow in Lake Formation that imports a single table from the source database to the data lake. You then use Lake Formation security features to enforce column-level security for the QuickSight service on the imported table. Finally, you use QuickSight to connect to this data lake and visualize only the columns for which Lake Formation has given access to the QuickSight user.

To implement the solution, you complete the following steps:

  1. Prerequisites
  2. Creating a source database
  3. Importing a single table from the source database
    • Creating a connection to the data source
    • Creating and registering your S3 bucket
    • Creating a database in the Data Catalog and granting permissions
    • Creating and running the workflow
    • Granting Data Catalog permissions
  4. Enforcing column-level security in Lake Formation
  5. Creating visualizations in QuickSight

Prerequisites

For this walkthrough, you should have the following prerequisites:

Creating a source database

In this step, create an Aurora MySQL database cluster and use the DDLs in the following GitHub repo to create an HR schema with associated tables and sample data.

You should then see the schema you created using the MySQL monitor or your preferred SQL client. For this post, I used SQL Workbench. See the following screenshot.

Record the Aurora database JDBC endpoint information; you need it in subsequent steps.
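If you need to look the endpoint up again later, the following is a small CLI sketch; the cluster identifier is a placeholder:

# Sketch: retrieve the Aurora cluster endpoint; the identifier is a placeholder.
aws rds describe-db-clusters --db-cluster-identifier hr-blog-cluster --query 'DBClusters[0].Endpoint' --output text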

Importing a single table from the source database

Before you complete the following steps, make sure you have set up Lake Formation and met the JDBC prerequisites.

The Lake Formation setup creates a datalake_user IAM user. You need to add the same user as a QuickSight user. For instructions, see Managing User Access Inside Amazon QuickSight. For Role, choose AUTHOR.

Creating a connection to the data source

After you complete the Lake Formation prerequisites, which include creating the IAM users datalake_admin and datalake_user, create a connection to your Aurora database (a CLI sketch follows the list below). For instructions, see Create a Connection in AWS Glue. Provide the following information:

  • Connection name – <yourPrefix>-blog-datasource
  • Connection type – JDBC
  • Database connection parameters – JDBC URL, user name, password, VPC, subnet, and security group
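The following is a sketch of the same connection created from the AWS CLI; all values are placeholders, and in practice you may prefer to keep database credentials out of the command line:

# Sketch: create the Glue JDBC connection; every value below is a placeholder.
aws glue create-connection --connection-input '{
    "Name": "<yourPrefix>-blog-datasource",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:mysql://<aurora-endpoint>:3306/HR",
        "USERNAME": "<db-user>",
        "PASSWORD": "<db-password>"
    },
    "PhysicalConnectionRequirements": {
        "SubnetId": "subnet-0123456789abcdef0",
        "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        "AvailabilityZone": "us-east-1a"
    }
}'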

Creating and registering your S3 bucket

In this step, you create an S3 bucket named <yourPrefix>-blog-datalake, which you use as the root location of your data lake. After you create the bucket, you need to register the Amazon S3 path. Lastly, grant data location permissions.
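The following is a sketch of these steps from the CLI, assuming the LakeFormationWorkflowRole (created as part of the Lake Formation prerequisites) is the principal that needs data location access; the account ID is a placeholder:

# Sketch: create the bucket, register it with Lake Formation, and grant data location access.
aws s3 mb s3://<yourPrefix>-blog-datalake

aws lakeformation register-resource --resource-arn arn:aws:s3:::<yourPrefix>-blog-datalake --use-service-linked-role

aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/LakeFormationWorkflowRole --permissions "DATA_LOCATION_ACCESS" --resource '{ "DataLocation": {"ResourceArn": "arn:aws:s3:::<yourPrefix>-blog-datalake"} }'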

Creating a database in the Data Catalog and granting permissions

Create a database in the Lake Formation Data Catalog named <yourPrefix>-blog-database, which stores the metadata tables. For instructions, see Create a Database in the Data Catalog.

After you create the database, you grant data permissions to the metadata tables to the LakeFormationWorkflowRole role, which you use to run the workflows.

Creating and running the workflow

In this step, you copy the EMPLOYEES table from the source database using a Lake Formation blueprint. Provide the following information:

  • Blueprint type – Database snapshot
  • Database connection – <yourPrefix>-blog-datasource
  • Source data path – HR/EMPLOYEES
  • Target database – <yourPrefix>-blog-database
  • Target storage location – <yourPrefix>-blog-datalake
  • Workflow name – <yourPrefix>-datalake-quicksight
  • IAM role – LakeFormationWorkflowRole
  • Table prefix – blog

For instructions, see Use a Blueprint to Create a Workflow.

When the workflow is ready, you can start the workflow and check its status by choosing View graph. When the workflow is complete, you can see the employee table available in your Data Catalog under <yourPrefix>-blog-database. See the following screenshot.

You can also view the imported data using Athena, which is integrated with Lake Formation. To do so, choose View Data from the Actions drop-down menu. See the following screenshot.

Granting Data Catalog permissions

In this step, you provide the Lake Formation Data Catalog access to the IAM user datalake_user. This is the same user that you added in QuickSight to create the dashboard. For Database permissions, select Create table and Alter for this use case, but you can change the permission level based on your specific requirements. For instructions, see Granting Data Catalog Permissions.

When this step is complete, you see the permissions for your database <yourPrefix>-blog-database.

Enforcing column-level security in Lake Formation

Now that your table is imported into the data lake, enforce column-level security to the dataset. For this use case, you want to hide the Salary and Phone_Number columns from business intelligence QuickSight users.

  1. In the Lake Formation Data Catalog, choose Databases.
  2. From the list of databases, choose <yourPrefix>-blog-database.
  3. Choose View tables.
  4. Select blog_hr_employees.
  5. From the Actions drop-down menu, choose Grant.

  6. For Active Directory and Amazon QuickSight users and groups, provide the QuickSight user ARN.

You can find the ARN by running aws quicksight list-users --aws-account-id <your AWS account id> --namespace default --region us-east-1 in the AWS Command Line Interface (AWS CLI).

  7. For Database, choose <yourPrefix>-blog-database.
  8. For Table, choose blog_hr_employees.
  9. For Columns, choose Exclude columns.
  10. For Exclude columns, choose salary and phone_number.
  11. For Table permissions, select Select.

You should receive a confirmation on the console that says Permission granted for: datalake_user to Exclude: <yourPrefix>-blog-database.blog_hr_employees.[salary, phone_number].

You can also verify that appropriate permission is reflected for the QuickSight user on the Lake Formation console by navigating to the Permissions tab and filtering for your database and table.

You can also specify column-level permissions in the AWS CLI with the following code:

aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=<QuickSight User ARN> --permissions "SELECT" --resource '{ "TableWithColumns": {"DatabaseName":"<yourPrefix>-blog-database", "Name":"blog_hr_employees", "ColumnWildcard": {"ExcludedColumnNames": ["salary", "phone_number"]}}}'  --region us-west-2 --profile datalake_admin

Creating visualizations in QuickSight

In this step, you use QuickSight to access the blog_hr_employees table in your data lake. While accessing this dataset from QuickSight, you can see that QuickSight doesn’t show the salary and phone_number columns, which you excluded from the source table in the previous step.

  1. Log in to QuickSight using the datalake_user IAM user.
  2. Choose New analysis.
  3. Choose New dataset.
  4. For the data source, choose Athena.

  5. For your data source name, enter Athena-HRDB.
  6. For Database, choose <yourPrefix>-blog-database.
  7. For Tables, select blog_hr_employees.
  8. Choose Select.

  9. Choose Import to SPICE for quicker analysis or Directly query your data.

For this use case, choose Import to SPICE. This provides faster visualization in a production setup, and you can run a scheduled refresh to make sure your dashboards are referring to the current data. For more information, see Scheduled Refresh for SPICE Data Sets on Amazon QuickSight.

When you complete the previous steps, your data is imported into SPICE and you arrive at the QuickSight visualization dashboard. You can see that SPICE has excluded the salary and phone_number fields from the table. In the following screenshot, we created a pie chart visualization to show how many employees are present in each department.

Cleaning up

To avoid incurring future charges, delete the resources you created in this walkthrough, including your S3 bucket, Aurora cluster, and other associated resources.

Conclusion

Restricting access to sensitive data to various users in a data lake is a very common challenge. In this post, we demonstrated how to use Lake Formation to enforce column-level access to QuickSight dashboard users. You can enhance security further with Athena workgroups. For more information, see Creating a Data Set Using Amazon Athena Data and Benefits of Using Workgroups.

 


About the Author

Avijit Goswami is a Sr. Startups Solutions Architect at AWS, helping startup customers become tomorrow’s enterprises. When not at work, Avijit likes to cook, travel, watch sports, and listen to music.

 

 

Visualizing Amazon API Gateway usage plans using Amazon QuickSight

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/visualizing-amazon-api-gateway-usage-plans-using-amazon-quicksight/

This post is courtesy of Roberto Iturralde, Solutions Architect.

Many customers build applications for their users accessible via HTTP API endpoints. Users provide unique keys in their requests for authentication, authorization, and optional metering by the service provider. Business and technical owners benefit from detailed analytics across the API endpoints and usage patterns across customers. This information helps understand product adoption and informs future features.

Amazon API Gateway can produce detailed access logs to show who has accessed the API. When using usage plans, a customer identifier is included in the log records. You can use these logs to populate a business intelligence service, such as Amazon QuickSight, to analyze and report on usage patterns across your APIs and customers.

Solution overview

QuickSight dashboard

Using enriched API Gateway access logs, you can analyze how customers are accessing your API products. This dashboard shows several visualizations in Amazon QuickSight based on traffic to a sample API Gateway endpoint.

  • The pie chart shows the share of month-to-date traffic across all APIs by usage plan.
  • The bar chart shows the top customers in the Enterprise usage plan by month-to-date traffic, with bar coloring by HTTP status code.
  • The pivot table shows the percent of traffic to each API endpoint by usage plan and customer.

The solution described in this post is meant for business intelligence (BI) analysis. A BI dashboard is useful for historical reporting and typically the data freshness ranges from hours to days.

Solution architecture


Components:

  • API access logs stream – API access logs are streamed in real time from Amazon API Gateway to Amazon Kinesis Data Firehose. Kinesis Firehose buffers the records and enriches them with information from the API usage plans. It then writes the batches of enriched records to an Amazon S3 bucket for durable, secure storage.
  • Access logs indexing – Metadata about the API access logs is stored in an AWS Glue Data Catalog that is used by Amazon QuickSight for querying. A nightly AWS Glue crawler detects and indexes newly written access logs. The Glue crawler can run more frequently for fresher data in QuickSight.
  • Data visualization – Amazon QuickSight is configured with the S3 location of the access logs as a data source to feed a QuickSight analysis.

Implementation walk-through

This tutorial assumes you already have an API Gateway API with a usage plan configured. If you do not, follow this tutorial to create an API and follow this article to create a usage plan.

First, deploy an AWS SAM template into your account. This template creates an Amazon S3 bucket where the access logs are stored for analysis. It also creates an AWS Lambda function to enrich the API access logs.

Then you create a Kinesis Data Firehose delivery stream to receive access logs from API Gateway. The stream enriches the records using the Lambda function, buffers and batches the records, and writes them to the S3 bucket. Finally, you update a deployed API Gateway stage to write access logs to the Kinesis delivery stream.

Launch the AWS SAM Template

To create some of the resources referenced in this post, you can download the SAM template or choose the button below to launch the stack.

Launch Stack button

Choose Next on each screen of the CloudFormation stack creation process. Once the stack creation completes, note the names of the resources on the Outputs tab.

Stack outputs tab

The Lambda function created by the SAM template performs a few key tasks. During function initialization, it fetches API Gateway usage plan details into memory. On each invocation, it iterates through each access log record from Kinesis Firehose. Each record is decoded from base64 encoded binary and enriched with usage plan name and customer name. Each record is then converted back to base64 encoded binary to return to the Kinesis Firehose stream.

API access logs stream

  1. Navigate to the Kinesis Data Firehose console and choose Create delivery stream.
    Create delivery stream
  2. Under Delivery stream name, enter a name in the format amazon-apigateway-{your-delivery-stream-name}. It is required that your stream name begin with amazon-apigateway-.
    New delivery stream
  3. Leave the default Source setting of Direct PUT or other sources. Choose Next.
  4. Under Data transformation, select Enabled. In the Lambda function dropdown, select the function created earlier. Choose Next.
    Transform source records
  5. Select Amazon S3 as the Destination. In the S3 bucket dropdown, select the bucket created earlier.
    S3 destination
  6. Under S3 prefix, enter logs/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/. This naming convention allows the AWS Glue crawler to automatically partition this data during indexing.
    S3 prefix
  7. Under S3 error prefix, enter errors/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/. This will write errors encountered by the Firehose delivery stream to a folder named errors in the S3 bucket, followed by folders by error type and error timestamp. Choose Next.
    S3 error prefix
  8. Leave the default buffer, compression, and other settings. At the bottom of the screen, select Create new or choose to create a new IAM role for this delivery stream. In the window that opens, leave the default settings. Choose Allow.
    Setting permissions
  9. This will return you to the Kinesis Firehose delivery stream creation wizard. Choose Next.
  10. On the review page, verify the settings and choose Create delivery stream. Wait for the stream to be successfully created.
  11. You can now configure API Gateway to stream access logs to this Kinesis Firehose delivery stream. Follow these instructions to enable access logging on your API stages using the ARN of the Firehose delivery stream you created (a CLI sketch follows this list).
  12. Under Log Format, choose the fields to include in the access logs in JSON format. Find examples in the API Gateway documentation as well as the full set of available fields in the $context variable. The following fields and mapped names are required for the enrichment Lambda function. Choose Save Changes.
    {
      "apiId": "$context.apiId",
      "identity.apiKeyId": "$context.identity.apiKeyId",
      "stage": "$context.stage"
    }
  13. As the API stages where you enabled access logging receive traffic, you will see files written to your Amazon S3 bucket. Note that the Firehose delivery stream buffers data before writing to S3, so it may take some time before files appear.

Access logs for your API are now flowing to an Amazon S3 bucket enriched with usage plan information. You now need to index this data for querying and make it available in Amazon QuickSight for analysis.

Access Logs Indexing

  1. Navigate to the AWS Glue console. If this is your first time using AWS Glue, choose the Get Started button on the landing page. On the left side of the console select Crawlers. On the Crawlers tab, choose Add crawler.
  2. Enter a name for the Crawler and choose Next.
  3. On the Specify crawler source type page, choose Data stores. Choose Next.
  4. Select S3 as the data store and leave the Connection field empty. In the Include path section, use the folder icon to browse your existing S3 buckets. Use the plus sign to expand the folders beneath the S3 bucket created earlier. Select the logs folder and choose Select. If you don’t see the logs folder, you can add it manually in the next step.
    Choose S3 path
  5. If you did not see a logs folder on the prior screen, you can add it to the end of the S3 location in the input box. Choose Next.
    Crawl data options
  6. On the Add another data source screen, leave No selected and choose Next.
  7. Select Create an IAM Role and enter a name for the IAM role that AWS Glue uses to crawl the S3 bucket. Choose Next.
  8. Under the Frequency for scheduled crawling, select Daily and choose the time when you want to update your index of access logs. The crawl frequency can be modified later. Choose Next.
  9. On the crawler output selection page, select Add database to create a new metadata database for the API Gateway access logs. Name your metadata database and choose Create. Back on the output configuration screen, choose Next.
    Configure the crawler's output
  10. Choose Finish.
  11. In the Crawlers tab of the AWS Glue console, select the checkbox next to the crawler you created. Choose Run crawler.
    Run crawler
  12. After the crawler finishes, you see a table named logs in the Glue database. Navigate to the Tables page of the Glue console to view this table. Selecting the table name will show the metadata that the crawler populated, including the file format, number of records, and schema of the access logs records.
    Tables

You now have an AWS Glue database with metadata of the access logs stored in Amazon S3 and a scheduled Glue crawler. Lastly, you need to make this data available in Amazon QuickSight for visualization and analysis.
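You can also query the new table directly with Athena before setting up QuickSight. The following Python sketch uses boto3; the database name apigw_access_logs_db, the query results location, and the enriched column names are placeholders to adapt to the resources in your account and to the log format you configured.

import time

import boto3

athena = boto3.client("athena")

DATABASE = "apigw_access_logs_db"                  # placeholder: your Glue database
OUTPUT_LOCATION = "s3://my-athena-results/apigw/"  # placeholder: Athena query results location

# Hypothetical columns added by the enrichment Lambda; adjust to your log format.
QUERY = """
SELECT usageplanname, customername, COUNT(*) AS requests
FROM logs
GROUP BY usageplanname, customername
ORDER BY requests DESC
"""

execution_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)["QueryExecutionId"]

# Poll until the query finishes; production code should add a timeout.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row contains the column headers
        print([col.get("VarCharValue") for col in row["Data"]])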

Data Visualization

  1. Navigate to the Amazon QuickSight console.
    First-time QuickSight users: Follow these instructions to create a QuickSight account.
    All users: Follow these instructions to update your S3 permissions to include the S3 bucket created earlier containing the API Gateway access logs.
  2. In the menu bar, select Manage data.
    Manage data
  3. On the top left of the Data Sets page, choose New data set.
  4. On the Create data set page, select Amazon Athena.
    Create a data set
  5. On the New Athena data source page, enter a name for this data source. Leave the Athena workgroup on the default setting and select Create data source.
  6. On the following page, use the Database section to select the Glue database you created earlier. Once selected, you will see the tables available inside that database. Select the database table you created earlier to hold the metadata for the access logs in S3.
  7. On the final data set creation page, select Direct query your data. You can change this option later to use QuickSight’s native data cache to improve performance. Choose Visualize.
  8. This will create a QuickSight analysis based on a data set of the API Gateway access logs data. You should see the logs data set selected and the access logs fields available in the Fields list. You can now create visuals based on the API Gateway access logs data.
    QuickSight create visuals menu

Conclusion

In this post, I walk through configuring streaming of API access logs from Amazon API Gateway to Amazon S3 via a Kinesis Firehose delivery stream. An AWS Glue crawler periodically updates metadata in an AWS Glue data catalog for the access logs in S3. This metadata is used by Amazon QuickSight to query the data in S3 to populate visuals in a QuickSight analysis. This allows business and technical owners of API-based products to analyze access trends by customers accessing their APIs.

To learn more, read about different types of visualizations available in QuickSight. As a performance and cost optimization, enable compression and format conversion from JSON to a columnar data format in your Kinesis Firehose delivery stream.

Measure Effectiveness of Virtual Training in Real Time with AWS AI Services

Post Syndicated from Rajeswari Malladi original https://aws.amazon.com/blogs/architecture/measure-effectiveness-of-virtual-training-in-real-time-with-aws-ai-services/

According to International Data Corporation (IDC), worldwide spending on digital transformation will reach $2.3 trillion in 2023. As organizations adopt digital transformation, training becomes an important part of the journey. Whether it is internal training to upskill an existing workforce or packaged content for commercial use, this training needs to be efficient and cost-effective. With the advent of education technology, it is common practice to deliver training via digital platforms. This makes training accessible to a larger population and is cost-effective, but it is important that the training is also interactive and effective. According to a recent article published by Forbes, immersive education and data-driven insights are among the top five education technology (EdTech) innovations, and these are key characteristics of an effective training experience.

An earlier blog series explored how to build a virtual trainer on AWS using Amazon Sumerian. This series illustrated how to build an immersive and highly engaging virtual training experience without needing additional devices or complex virtual reality platform management. These trainings are easy to maintain and cost-effective.

In this blog post, we will further extend the architecture to gather real-time feedback about the virtual trainings and create data-driven insights to measure their effectiveness with the help of Amazon artificial intelligence (AI) services.

Architecture and its benefits

Virtual training on AWS and AI Services – Architecture

Consider a scenario where you are a vendor in the health care sector. You’ve developed a cutting-edge device, such as patient vital monitoring hardware, that goes through frequent software upgrades and is about to be rolled out across different U.S. hospitals. The nursing staff needs to be well trained before it can begin using the device. Let’s take a look at an architecture to solve this problem. We will first explain the architecture for building the training, and then show how to measure its effectiveness.

At the core of the architecture is Amazon Sumerian. Sumerian is a managed service that lets you create and run 3D, Augmented Reality (AR), and Virtual Reality (VR) applications. Within Sumerian, real-life scenes from a hospital environment can be created by importing assets from the assets library. Scenes include one or more hosts: AI-driven animated characters with built-in animation, speech, and behavior. The hosts act as virtual trainers that interact with the nursing staff. The speech component assigns text to the virtual trainer for playback with Amazon Polly. Polly helps convert training content from Sumerian to life-like speech in real time and ensures the nursing staff receives the latest content related to the equipment on which it’s being trained.

The nursing staff accesses the training via web browsers on iOS or Android mobile devices or laptops, and authenticates using Amazon Cognito. Cognito is a service that lets you easily add user sign-up and authentication to your mobile and web apps. Sumerian then uses the Cognito identity pool to create temporary credentials to access AWS services.

The flow of the interactions within Sumerian is controlled using a visual state machine in the Sumerian editor. Within the editor, the dialogue component assigns an Amazon Lex chatbot to an entity, in this case the virtual trainer or host. Lex is a service for building conversational interfaces with voice and text. It provides you the ability to have interactive conversations with the nursing staff, understand its areas of interest, and deliver appropriate training material. This is an important aspect of the architecture where you can customize the training per users’ needs.

Lex has native interoperability with AWS Lambda, a serverless compute offering where you just write and run your code in Lambda functions. Lambda can be used to validate user inputs or apply any business logic, such as fetching the user selected training material from Amazon DynamoDB (or another database) in real time. This material is then delivered to Lex as a response to user queries.
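As an illustration, the fulfillment Lambda function for the Lex bot might resemble the following Python sketch. The DynamoDB table name TrainingMaterial, its topic key, and the Topic slot are hypothetical names, and the response uses the Lex V1 dialog action format; your bot and data model may differ.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("TrainingMaterial")  # hypothetical table keyed on training topic


def handler(event, context):
    # "Topic" is a hypothetical slot on the bot's intent that captures what
    # the nursing staff asked about.
    slots = event["currentIntent"]["slots"]
    topic = slots.get("Topic") or "general"

    item = table.get_item(Key={"topic": topic}).get("Item", {})
    content = item.get("content", "Sorry, I could not find material on that topic.")

    # Close the intent and hand the selected training content back to the
    # virtual trainer in Sumerian.
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": content},
        }
    }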

You can extend the state machine within the Sumerian editor to introduce new interactive flows to collect user feedback. Amazon Lex collects user feedback, which is saved in Amazon Simple Storage Service (S3) and analyzed by Amazon Comprehend. Amazon Comprehend is a natural language processing service that uses AI to find meaning and insights/sentiments in text. Insights from user feedback are stored in S3, which is a highly scalable, durable, and highly available object storage.
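For example, extracting the sentiment of each free-text feedback entry takes only a few lines of code; the following sketch assumes the feedback has already been read from S3 into a list of strings.

import boto3

comprehend = boto3.client("comprehend")

# Assume these strings were read from the feedback objects stored in S3.
feedback_entries = [
    "The virtual trainer explained the vitals monitor clearly.",
    "The session moved too fast and I could not follow the alarm setup.",
]

for text in feedback_entries:
    result = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    # Sentiment is POSITIVE, NEGATIVE, NEUTRAL, or MIXED, with a score per class.
    print(result["Sentiment"], result["SentimentScore"])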

You can analyze the insights from user feedback using Amazon Athena, an interactive query service which analyzes data in S3 using standard SQL. You can then easily build visualizations using Amazon QuickSight.

By using this architecture, you not only deliver the virtual training to your nursing staff in an immersive environment created by Amazon Sumerian, but you can also gather the feedback interactively. You can gain insights from this feedback and iterate over it to make the training experience more effective.

Conclusion and next steps

In this blog post we reviewed the architecture to build interactive trainings and measure their effectiveness. The serverless nature of this architecture makes it cost effective, agile, and easy to manage, and you can apply it to a number of use cases. For example, an educational institution can develop training content designed for multiple learning levels and the training level can be adjusted in real time based on live interactions with the students. In a manufacturing scenario, you can build a digital twin of your process and train your resources to handle different scenarios with full interactions. You can integrate AWS services just like Lego blocks, and you can further expand this architecture to integrate with Amazon Kendra to build an interactive FAQ or integrate with Amazon Comprehend Medical to build trainings for the healthcare industry. Happy building!

Build an automatic data profiling and reporting solution with Amazon EMR, AWS Glue, and Amazon QuickSight

Post Syndicated from Francesco Marelli original https://aws.amazon.com/blogs/big-data/build-an-automatic-data-profiling-and-reporting-solution-with-amazon-emr-aws-glue-and-amazon-quicksight/

In a typical analytics pipeline, one of the first tasks you perform after importing data into your data lake is data profiling and high-level data quality analysis to check the content of the datasets. In this way, you can enrich the basic metadata that contains information such as table and column names and their types.

The results of data profiling help you determine whether the datasets contain the expected information and how to use them downstream in your analytics pipeline. Moreover, you can use these results as one of the inputs to an optional data semantics analysis stage.

The great quantity and variety of data in modern data lakes make unstructured manual data profiling and data semantics analysis impractical and time-consuming. This post shows how to implement a process for the automatic creation of a data profiling repository, as an extension of AWS Glue Data Catalog metadata, and a reporting system that can help you in your analytics pipeline design process by providing a reliable tool for further analysis.

This post describes in detail the application Data Profiler for AWS Glue Data Catalog and provides step-by-step instructions for an example implementation.

Overview and architecture

The following diagram illustrates the architecture of this solution.

Data Profiler for AWS Glue Data Catalog is an Apache Spark Scala application that profiles all the tables defined in a database in the Data Catalog using the profiling capabilities of the Amazon Deequ library and saves the results in the Data Catalog and an Amazon S3 bucket in a partitioned Parquet format. You can use other analytics services such as Amazon Athena and Amazon QuickSight to query and visualize the data.

For more information about the Amazon Deequ data library, see Test data quality at scale with Deequ or the source code on the GitHub repo.

Metadata can be defined as data about data. Metadata for a table contains information like the table name and other attributes, column names and types, and the physical location of the files that contain the data. The Data Catalog is the metadata repository in AWS, and you can use it with other AWS services like Athena, Amazon EMR, and Amazon Redshift.

After you create or update the metadata for tables in a database (for example, adding new data to the table), either with an AWS Glue crawler or manually, you can run the application to profile each table. The results are stored as new versions of the tables’ metadata in the Data Catalog, which you can view interactively via the AWS Lake Formation console or query programmatically via the AWS CLI for AWS Glue.

For more information about the Data Profiler, see the GitHub repo.

The Deequ library does not support tables with nested data (such as JSON). If you want to run the application on a table with nested data, this must be un-nested/flattened or relationalized before profiling. For more information about useful transforms for this task, see AWS Glue Scala DynamicFrame APIs or AWS Glue PySpark Transforms Reference.
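For example, in an AWS Glue PySpark job you could flatten a nested table with the relationalize transform before profiling it. The database, table, and S3 path names in the following sketch are placeholders.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder names: a catalog table with nested JSON columns and an S3 staging path.
nested = glue_context.create_dynamic_frame.from_catalog(
    database="my_nested_db", table_name="nested_events"
)

# relationalize un-nests structs and explodes arrays into a collection of flat
# DynamicFrames keyed by name ("root", "root_items", and so on).
flattened = nested.relationalize("root", "s3://my-staging-bucket/tmp/")

for name in flattened.keys():
    glue_context.write_dynamic_frame.from_options(
        frame=flattened.select(name),
        connection_type="s3",
        connection_options={"path": f"s3://my-flattened-bucket/{name}/"},
        format="parquet",
    )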

The following table shows the profiling metrics the application computes for column data types Text and Numeric. The computation of some profiling metrics for Text columns can be costly and is disabled by default. You can enable it by setting the compExp input parameter to true (see next section).

Metric | Description | Data Type
ApproxCountDistinct | Approximate number of distinct values, computed with HyperLogLogPlusPlus sketches. | Text / Numeric
Completeness | Fraction of non-null values in a column. | Text / Numeric
Distinctness | Fraction of distinct values of a column over the number of all values of a column. Distinct values occur at least one time. For example, [a, a, b] contains two distinct values a and b, so distinctness is 2/3. | Text / Numeric
MaxLength | Maximum length of the column. | Text
MinLength | Minimum length of the column. | Text
CountDistinct | Exact number of distinct values. | Text
Entropy | Entropy is a measure of the level of information contained in an event (value in a column) when considering all possible events (values in a column). It is measured in nats (natural units of information). Entropy is estimated using observed value counts as the negative sum of (value_count/total_count) * log(value_count/total_count). For example, [a, b, b, c, c] has three distinct values with counts [1, 2, 2]. Entropy is then (-1/5*log(1/5)-2/5*log(2/5)-2/5*log(2/5)) = 1.055. | Text
Histogram | The summary of values in a column of a table. Groups the given column’s values and calculates the number of rows with that specific value and the fraction of this value. | Text
UniqueValueRatio | Fraction of unique values over the number of all distinct values of a column. Unique values occur exactly one time; distinct values occur at least one time. Example: [a, a, b] contains one unique value b, and two distinct values a and b, so the unique value ratio is 1/2. | Text
Uniqueness | Fraction of unique values over the number of all values of a column. Unique values occur exactly one time. Example: [a, a, b] contains one unique value b, so uniqueness is 1/3. | Text
ApproxQuantiles | Approximate quantiles of a distribution. | Numeric
Maximum | Maximum value of the column. | Numeric
Mean | Mean value of the column. | Numeric
Minimum | Minimum value of the column. | Numeric
StandardDeviation | Standard deviation value of the column. | Numeric
Sum | Sum of the column. | Numeric

Application description

You can run the application via spark-submit on a transient or permanent EMR cluster (see the “Creating an EMR cluster” section in this post for minimum version specification) with Spark installed and configured with the Data Catalog settings option Use for Spark table metadata enabled.

The following example code executes the application:

 $ spark-submit \
  --class awsdataprofiler.DataProfilerForAWSGlueDataCatalog \
  --master yarn \
  --deploy-mode cluster \
  --name data-profiler-for-aws-glue-data-catalog \
  /home/hadoop/data-profiler-for-aws-glue-data-catalog-assembly-1.0.jar \
  --dbName nyctlcdb \
  --region eu-west-1 \
  --compExp true \
  --statsPrefix DQP \
  --s3BucketPrefix deequ-profiler/deequ-profiler-metrics \
  --profileUnsupportedTypes true \
  --noOfBins 30 \
  --quantiles 10

The following table summarizes the input parameters that the application accepts.

Name | Type | Required | Default | Description
--dbName (-d) | String | Yes | N/A | Data Catalog database name. The database must be defined in the Catalog owned by the same account where the application is executed.
--region (-r) | String | Yes | N/A | AWS Region endpoint where the Data Catalog database is defined, for example us-west-1 or us-east-1. For more information, see Regional Endpoints.
--compExp (-c) | Boolean | No | false | If true, the application also executes “expensive” profiling analyzers on Text columns. These are CountDistinct, Entropy, Histogram, UniqueValueRatio, and Uniqueness. If false, only the following default analyzers are executed: ApproxCountDistinct, Completeness, Distinctness, MaxLength, MinLength. All analyzers for Numeric columns are always executed.
--statsPrefix (-p) | String | No | DQP | String prepended to the statistics names in the Data Catalog. The application also adds two underscores (__). This is useful to identify metrics calculated by the application.
--s3BucketPrefix (-s) | String | No | blank | Format must be s3BucketName/prefix. If specified, the application writes Parquet files with metrics in the prefixes db_name=…/table_name=….
--profileUnsupportedTypes (-u) | Boolean | No | false | By default, the Amazon Deequ library only supports Text and Numeric columns. If this parameter is set to true, the application also profiles columns of type Boolean and Date.
--noOfBins (-b) | Integer | No | 10 | When --compExp (-c) is true, sets the number of maximum values to create for the Histogram analyzer for String columns.
--quantiles (-q) | Integer | No | 10 | Sets the number of quantiles to calculate for the ApproxQuantiles analyzer for numeric columns.

Setting up your environment

The following walkthrough demonstrates how to create and populate a Data Catalog database with three tables, which simulates a process with monthly updates. For this post, you simulate three monthly runs: February 2, 2019, March 2, 2019, and April 2, 2019.

After table creation and after each monthly table update, you run the application to generate and update the profiling information for each table in the Data Catalog and Amazon S3 repository. The post also provides examples of how to query the data using the AWS CLI and Athena, and build a simple Amazon QuickSight dashboard.

This post uses the New York City Taxi and Limousine Commission (TLC) Trip Record Data on the Registry of Open Data on AWS. In particular, you use the tables yellow_tripdata, fhv_tripdata, and green_tripdata.

The following steps explain how to set up the environment.

Creating an EMR cluster

The first step is to create an EMR cluster. Connect to the cluster master node and execute the code via spark-submit. Make sure that the cluster version is at least 5.28.0 with at least Hadoop and Spark installed and that you use the Data Catalog as table metadata for Spark.

The master node should also be accessible via SSH. For instructions, see Connect to the Master Node Using SSH.
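If you prefer to script the cluster creation, the following boto3 sketch captures the essential settings: a release of at least emr-5.28.0, Hadoop and Spark, and the classification that points Spark table metadata at the Data Catalog. The key pair, subnet, log bucket, and instance sizes are placeholders; adjust them to your environment.

import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="data-profiler-cluster",
    ReleaseLabel="emr-5.28.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Configurations=[
        {
            # Use the AWS Glue Data Catalog as the metastore for Spark tables.
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "my-key-pair",                # placeholder: enables SSH to the master node
        "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder
    },
    LogUri="s3://my-emr-logs-bucket/logs/",         # placeholder
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)

print(response["JobFlowId"])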

Downloading the application

You can download the source code from the application GitHub repo and build it as an uber JAR with all dependencies using the Scala Build Tool (sbt) with the following commands (adjust the memory values according to your needs):

$ export SBT_OPTS="-Xms1G -Xmx3G -Xss2M -XX:MaxMetaspaceSize=3G" && sbt assembly

By default, the .jar file is created in the following path, relative to the project root directory:

./target/scala-2.11/data-profiler-for-aws-glue-data-catalog-assembly-1.0.jar

When the .jar file is available, copy it to the master node of the EMR cluster. One way to copy the file to the master node is to upload it to Amazon S3 from the client where the file was created and then download it from Amazon S3 to the master node.

For this post, copy the file in the /home/hadoop directory of the master node. The full path is /home/hadoop/data-profiler-for-aws-glue-data-catalog-assembly-1.0.jar.

Setting up S3 buckets and copying initial data

You use an S3 bucket to store the data that you profile. For this post, the bucket name is

s3://aws-big-data-blog-samples/

You need to create a bucket with a unique name in your account and store the data there. When the bucket is created and available, use the AWS CLI to copy the first set of files for January 2019 (therefore simulating the February 2, 2019, run) from the s3://nyc-tlc/ bucket. See the following code:

$ DEST_BUCKET=aws-big-data-blog-samples
$ MONTH=2019-01
 
$ aws s3 cp "s3://nyc-tlc/trip data/yellow_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-yellow/yellow_tripdata_${MONTH}.csv"
 
$ aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-fhv/fhv_tripdata_${MONTH}.csv"
 
$ aws s3 cp "s3://nyc-tlc/trip data/green_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-green/green_tripdata_${MONTH}.csv"

After you copy the data to your destination bucket, create a second bucket to store the files with the profiling metrics created by the application. This step is optional because you can write the metrics to a prefix in an existing bucket. See the following code:

$ aws s3 mb s3://deequ-profiler/

You are now ready to create the database and tables metadata in the Data Catalog.

Creating metadata in the Data Catalog

The first step is to create a database in AWS Glue. You can create the database using the AWS CLI. See the following code:

$ aws glue create-database \
    --database-input '{"Name": "nyctlcdb"}'

Alternatively, on the AWS Glue console, choose Databases, Add database.

After you create the database, create a new AWS Glue Crawler to infer the schema of the data in the files you copied in the previous step. Complete the following steps:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Add crawler.
  3. For Crawler name, enter nyc-tlc-db-raw.
  4. Choose Next.
  5. For Choose a data store, choose S3.
  6. For Crawl data in, choose Specified path in my account.
  7. For Include path, enter the S3 bucket and prefix where you copied the data earlier.
  8. Choose Next.
  9. In the Choose an IAM role section, select Choose an existing IAM role.
  10. Choose an IAM role that provides access to the S3 bucket and allows writing to the Data Catalog, or create a new one while creating this crawler.
  11. Choose Next.
  12. Choose the database you created earlier to store the tables’ metadata.
  13. Review the crawler properties and choose Finish.
    You can run the crawler when it’s ready. It creates three new tables in the database. The following screenshot shows the notification that the crawler is complete.
  14. You can now use the Lake Formation console to check that the tables are correct. See the following screenshot of the Tables page. If you select one of the tables, the table version is now 0. See the following screenshot. You can also perform the same check using the AWS CLI. See the following code:
    $ aws glue get-table-versions \
    --database-name nyctlcdb \
    --table-name trip_data_yellow \
    --query 'TableVersions[*].VersionId' 

    [
        "0"
    ]

  15. Check the parameters in the table metadata to verify which values the crawler generated. See the following code:
    $ aws glue get-table \
    	--database-name nyctlcdb \
    	--name trip_data_yellow \
       	--query 'Table.Parameters' 

    {
        "CrawlerSchemaDeserializerVersion": "1.0",
        "CrawlerSchemaSerializerVersion": "1.0",
        "UPDATED_BY_CRAWLER": "nyc-tlc-db-raw",
        "areColumnsQuoted": "false",
        "averageRecordSize": "144",
        "classification": "csv",
        "columnsOrdered": "true",
        "compressionType": "none",
        "delimiter": ",",
        "objectCount": "1",
        "recordCount": "4771445",
        "sizeKey": "687088084",
        "skip.header.line.count": "1",
        "typeOfData": "file"
    }

  16. Check the metadata attributes for three columns in the same table. This post chooses the following columns because they have different data types, though any other column is suitable for this check. In the following code, the only attributes currently available for the columns are “Name” and “Type”:
    $ aws glue get-table \
    --database-name nyctlcdb \
    --name trip_data_yellow \
    --query 'Table.StorageDescriptor.Columns[?Name==`store_and_fwd_flag`]'

    [
        {
            "Name": "store_and_fwd_flag",
            "Type": "string"
        }
    ]

    $ aws glue get-table \
    --database-name nyctlcdb \
    --name trip_data_yellow \
    --query 'Table.StorageDescriptor.Columns[?Name==`fare_amount`]'

    [
        {
            "Name": "fare_amount",
            "Type": "double"
        }
    ]

    $ aws glue get-table \
    --database-name nyctlcdb \
    --name trip_data_yellow \
    --query 'Table.StorageDescriptor.Columns[?Name==`passenger_count`]'

    [
        {
            "Name": "passenger_count",
            "Type": "bigint"
        }
    ]

You can display the same information via the Lake Formation console. See the following screenshot.

You are now ready to execute the application.

First application execution

Connect to the EMR cluster master node via SSH and run the application with the following code (change the input parameters as needed, especially the value for the s3BucketPrefix parameter):

$ spark-submit \
    --class awsdataprofiler.DataProfilerForAWSGlueDataCatalog \
    --master yarn \
    --deploy-mode cluster \
    --name data-profiler-for-aws-glue-data-catalog \
    /home/hadoop/data-profiler-for-aws-glue-data-catalog-assembly-1.0.jar \
    --dbName nyctlcdb \
    --region eu-west-1 \
    --compExp true \
    --statsPrefix DQP \
    --s3BucketPrefix deequ-profiler/deequ-profiler-metrics \
    --profileUnsupportedTypes true \
    --noOfBins 30 \
    --quantiles 10

Profiling information in the metadata in the Data Catalog

When the application is complete, you can recheck the metadata via the Lake Formation console for the tables and verify that a new table version was created. See the following screenshot.

You can verify the same information via the AWS CLI. See the following code:

$ aws glue get-table-versions \
    --database-name nyctlcdb \
    --table-name trip_data_yellow \
    --query 'TableVersions[*].VersionId'
[
    "1",
    "0"
]

Check the metadata for the table and verify that the profiling information the application generated was successfully stored. In the following code, the parameter “DQP__Size” was generated, which contains the number of records in the table as calculated by the Deequ library:

$ aws glue get-table \ 
    --database-name nyctlcdb \
    --name trip_data_yellow \
    --query 'Table.Parameters'
{
    "CrawlerSchemaDeserializerVersion": "1.0-",
    "CrawlerSchemaSerializerVersion": "1.0",
    "DQP__Size": "7667793.0",
    "UPDATED_BY_CRAWLER": "nyc-tlc-db-raw",
    "areColumnsQuoted": "false",
    "averageRecordSize": "144",
    "classification": "csv",
    "columnsOrdered": "true",
    "compressionType": "none",
    "delimiter": ",",
    "objectCount": "1",
    "recordCount": "4771445",
    "sizeKey": "687088084",
    "skip.header.line.count": "1",
    "typeOfData": "file"
}

Similarly, you can verify that the metadata for the columns you checked previously contains the profiling information the application generated. This is stored in the “Parameters” object for each column. Each new attribute starts with the string “DQP” as specified in the statsPrefix input parameter. See the following code:

$ aws glue get-table \
    --database-name nyctlcdb \
    --name trip_data_yellow \
    --query 'Table.StorageDescriptor.Columns[?Name==`store_and_fwd_flag`]' 
[
    {
        "Name": "store_and_fwd_flag",
        "Type": "string",
        "Parameters": {
            "DQP__ApproxCountDistinct": "3.0",
            "DQP__Completeness": "1.0",
            "DQP__CountDistinct": "3.0",
            "DQP__Distinctness": "3.912468685578758E-7",
            "DQP__Entropy": "0.03100483390393341",
            "DQP__Histogram.abs.N": "7630142.0",
            "DQP__Histogram.abs.Y": "37650.0",
            "DQP__Histogram.abs.store_and_fwd_flag": "1.0",
            "DQP__Histogram.bins": "3.0",
            "DQP__Histogram.ratio.N": "0.9950897213839758",
            "DQP__Histogram.ratio.Y": "0.004910148200401341",
            "DQP__Histogram.ratio.store_and_fwd_flag": "1.3041562285262527E-7",
            "DQP__MaxLength": "18.0",
            "DQP__MinLength": "1.0",
            "DQP__UniqueValueRatio": "0.3333333333333333",
            "DQP__Uniqueness": "1.3041562285262527E-7"
        }
    }
]
$ aws glue get-table \
    --database-name nyctlcdb \
    --name trip_data_yellow \
    --query 'Table.StorageDescriptor.Columns[?Name==`fare_amount`]'
[
    {
        "Name": "fare_amount",
        "Type": "double",
        "Parameters": {
            "DQP__ApproxCountDistinct": "6125.0",
            "DQP__Completeness": "0.9999998695843771",
            "DQP__Distinctness": "8.187492802687814E-4",
            "DQP__Maximum": "623259.86",
            "DQP__Mean": "12.40940884025023",
            "DQP__Minimum": "-362.0",
            "DQP__StandardDeviation": "262.0720412055651",
            "DQP__Sum": "9.515276582999998E7",
            "DQP__name-0.1": "5.0",
            "DQP__name-0.2": "6.0",
            "DQP__name-0.3": "7.0",
            "DQP__name-0.4": "8.0",
            "DQP__name-0.5": "9.0",
            "DQP__name-0.6": "10.5",
            "DQP__name-0.7": "12.5",
            "DQP__name-0.8": "15.5",
            "DQP__name-0.9": "23.5",
            "DQP__name-1.0": "623259.86"
        }
    }
]

The parameters named “DQP__name-x.x” are the results of the ApproxQuantiles Deequ analyzer for numeric columns; the number of quantiles is set via the --quantiles (-q) input parameter of the application. See the following code:

$ aws glue get-table \
    --database-name nyctlcdb \
    --name trip_data_yellow \
    --query 'Table.StorageDescriptor.Columns[?Name==`passenger_count`]'
[
    {
        "Name": "passenger_count",
        "Type": "bigint",
        "Parameters": {
            "DQP__ApproxCountDistinct": "10.0",
            "DQP__Completeness": "0.9999998695843771",
            "DQP__Distinctness": "1.3041562285262527E-6",
            "DQP__Maximum": "9.0",
            "DQP__Mean": "1.5670782410373156",
            "DQP__Minimum": "0.0",
            "DQP__StandardDeviation": "1.2244305354114957",
            "DQP__Sum": "1.201603E7",
            "DQP__name-0.1": "1.0",
            "DQP__name-0.2": "1.0",
            "DQP__name-0.3": "1.0",
            "DQP__name-0.4": "1.0",
            "DQP__name-0.5": "1.0",
            "DQP__name-0.6": "1.0",
            "DQP__name-0.7": "1.0",
            "DQP__name-0.8": "2.0",
            "DQP__name-0.9": "3.0",
            "DQP__name-1.0": "9.0"
        }
    }
]

Profiling information in Amazon S3

You can now also verify that the profiling information was saved in Parquet format in the S3 bucket you specified in the s3BucketPrefix application input parameter. The following screenshot shows the buckets via the Amazon S3 console.

The data is stored using prefixes that are compatible with Apache Hive partitions. This is useful to optimize performance and costs when you use analytics services like Athena. The partitions are defined on db_name and table_name. The following screenshot shows the details of table_name=trip_data_yellow.

Each execution of the application generates one Parquet file appending data to the metrics table for each physical table.

Second execution after monthly table updates

To run the application after monthly table updates, complete the following steps:

  1. Copy the new files for February 2019 to simulate the March 2 monthly update of the system. See the following code:
    $ DEST_BUCKET=aws-big-data-blog-samples
    $ MONTH=2019-02
     
    $ aws s3 cp "s3://nyc-tlc/trip data/yellow_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-yellow/yellow_tripdata_${MONTH}.csv"
     
    $ aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_${MONTH}" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-fhv/fhv_tripdata_${MONTH}.csv"
     
    $ aws s3 cp "s3://nyc-tlc/trip data/green_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-green/green_tripdata_${MONTH}.csv"

  2. Run the nyc-tlc-db-raw crawler to update the table metadata to include the new files. The following screenshot shows that the three tables were updated successfully.
  3. Check that the crawler created a third version of the table. See the following code:
    $ aws glue get-table-versions \
        --database-name nyctlcdb \
        --table-name trip_data_yellow \
        --query 'TableVersions[*].VersionId'

    [
        "2",
        "1",
        "0"
    ]

  4. Rerun the application to generate the new profiling metadata, entering the same code as before. To keep the information clean, before storing new profiling information in the metadata, the application removes all custom attributes starting with the string specified in the statsPrefix parameter. See the following code:
    $ spark-submit \
        --class awsdataprofiler.DataProfilerForAWSGlueDataCatalog \
        --master yarn \
        --deploy-mode cluster \
        --name data-profiler-for-aws-glue-data-catalog \
        /home/hadoop/data-profiler-for-aws-glue-data-catalog-assembly-1.0.jar \
        --dbName nyctlcdb \
        --region eu-west-1 \
        --compExp true \
        --statsPrefix DQP \
        --s3BucketPrefix deequ-profiler/deequ-profiler-metrics \
        --profileUnsupportedTypes true \
        --noOfBins 30 \
        --quantiles 10

    Following a successful execution, a new version of the table was created. See the following code:

    $ aws glue get-table-versions \
        --database-name nyctlcdb \
        --table-name trip_data_yellow \
        --query 'TableVersions[*].VersionId'

    [
        "3",
        "2",
        "1",
        "0"
    ]

  5. Check the value of the DQP__Size attribute; its value has changed. See the following code:
    $ aws glue get-table \
        --database-name nyctlcdb \
        --name trip_data_yellow \
        --query Table.Parameters.{'DQP__Size:DQP__Size}'

    {
        "DQP__Size": "1.4687169E7"
    }

  6. Check one of the columns you saw earlier to see the updated profiling property values. See the following code:
    $ aws glue get-table \
        --database-name nyctlcdb \
        --name trip_data_yellow \
        --query 'Table.StorageDescriptor.Columns[?Name==`store_and_fwd_flag`]'

    [
        {
            "Name": "store_and_fwd_flag",
            "Type": "string",
            "Parameters": {
                "DQP__ApproxCountDistinct": "3.0",
                "DQP__Completeness": "1.0",
                "DQP__CountDistinct": "3.0",
                "DQP__Distinctness": "2.042599223853147E-7",
                "DQP__Entropy": "0.0317381414905775",
                "DQP__Histogram.abs.N": "1.4613018E7",
                "DQP__Histogram.abs.Y": "74149.0",
                "DQP__Histogram.abs.store_and_fwd_flag": "2.0",
                "DQP__Histogram.bins": "3.0",
                "DQP__Histogram.ratio.N": "0.9949513074984022",
                "DQP__Histogram.ratio.Y": "0.005048556328316233",
                "DQP__Histogram.ratio.store_and_fwd_flag": "1.361732815902098E-7",
                "DQP__MaxLength": "18.0",
                "DQP__MinLength": "1.0",
                "DQP__UniqueValueRatio": "0.0",
                "DQP__Uniqueness": "0.0"
            }
        }
    ]

Third execution after monthly tables updates

To run the application a third time, complete the following steps:

  1. Copy the new files for March 2019 to simulate the April 2 monthly update of the system. See the following code:
    $ DEST_BUCKET=aws-big-data-blog-samples
    $ MONTH=2019-03
     
    $ aws s3 cp "s3://nyc-tlc/trip data/yellow_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-yellow/yellow_tripdata_${MONTH}.csv"
     
    $ aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-fhv/fhv_tripdata_${MONTH}.csv"
     
    $ aws s3 cp "s3://nyc-tlc/trip data/green_tripdata_${MONTH}.csv" "s3://${DEST_BUCKET}/data/raw/nyc-tlc/trip-data-green/green_tripdata_${MONTH}.csv"

  2. Run the nyc-tlc-db-raw crawler to update the table metadata to include the new files. You now have five versions of the table metadata. See the following code:
    $ aws glue get-table-versions \
        --database-name nyctlcdb \
        --table-name trip_data_yellow \
        --query 'TableVersions[*].VersionId'

    [
        "4",
        "3",
        "2",
        "1",
        "0"
    ]

  3. Rerun the application to update the profiling information. See the following code:
    $ spark-submit \
        --class awsdataprofiler.DataProfilerForAWSGlueDataCatalog \
        --master yarn \
        --deploy-mode cluster \
        --name data-profiler-for-aws-glue-data-catalog \
        /home/hadoop/data-profiler-for-aws-glue-data-catalog-assembly-1.0.jar \
        --dbName nyctlcdb \
        --region eu-west-1 \
        --compExp true \
        --statsPrefix DQP \
        --s3BucketPrefix deequ-profiler/deequ-profiler-metrics \
        --profileUnsupportedTypes true \
        --noOfBins 30 \
        --quantiles 10

  4. Check the DQP__Size parameter to see its new updated value. See the following code:
    $ aws glue get-table \
        --database-name nyctlcdb \
        --name trip_data_yellow \
        --query Table.Parameters.{'DQP__Size:DQP__Size}'

    {
        "DQP__Size": "2.2519715E7"
    }

  5. Check one of the columns you saw earlier to see the updated profiling property values. See the following code:
    $ aws glue get-table \
        --database-name nyctlcdb \
        --name trip_data_yellow \
        --query 'Table.StorageDescriptor.Columns[?Name==`store_and_fwd_flag`]' 

    [
        {
            "Name": "store_and_fwd_flag",
            "Type": "string",
            "Parameters": {
                "DQP__ApproxCountDistinct": "3.0",
                "DQP__Completeness": "1.0",
                "DQP__CountDistinct": "3.0",
                "DQP__Distinctness": "1.3321660598280218E-7",
                "DQP__Entropy": "0.030948463301702846",
                "DQP__Histogram.abs.N": "2.2409376E7",
                "DQP__Histogram.abs.Y": "110336.0",
                "DQP__Histogram.abs.store_and_fwd_flag": "3.0",
                "DQP__Histogram.bins": "3.0",
                "DQP__Histogram.ratio.N": "0.9951003376374878",
                "DQP__Histogram.ratio.Y": "0.004899529145906154",
                "DQP__Histogram.ratio.store_and_fwd_flag": "1.3321660598280218E-7",
                "DQP__MaxLength": "18.0",
                "DQP__MinLength": "1.0",
                "DQP__UniqueValueRatio": "0.0",
                "DQP__Uniqueness": "0.0"
            }
        }
    ]

You can view and manage the same values via the Lake Formation console. See the following screenshot of the Edit column section.

Data profiling reporting with Athena and Amazon QuickSight

As demonstrated earlier, the application can save profiling information in Parquet format to an S3 bucket and prefix, partitioned by db_name and table_name. See the following code:

$ aws s3 ls s3://deequ-profiler/deequ-profiler-metrics/ --recursive
2020-01-28 09:30:12          0 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_fhv/_SUCCESS
2020-01-28 09:17:15       6506 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_fhv/part-00000-760dafb1-fc37-4700-a506-a9dc71b7a745-c000.snappy.parquet
2020-01-28 09:01:19       6498 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_fhv/part-00000-78dd2c4a-83c2-44c4-aa71-30e7a9fb0089-c000.snappy.parquet
2020-01-28 09:30:11       6505 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_fhv/part-00000-cff4f2de-64b4-4338-a0f6-a50ed34a378f-c000.snappy.parquet
2020-01-28 09:30:08          0 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_green/_SUCCESS
2020-01-28 09:01:15       6355 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_green/part-00000-0d5969c9-70a7-4cd4-ac64-8f16e35e23b5-c000.snappy.parquet
2020-01-28 09:17:11       6353 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_green/part-00000-12a7b0b0-6a2a-45d5-a241-645148af41d7-c000.snappy.parquet
2020-01-28 09:30:08       6415 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_green/part-00000-adecccd6-a884-403f-aa80-c574647a10f9-c000.snappy.parquet
2020-01-28 09:29:56          0 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_yellow/_SUCCESS
2020-01-28 09:16:59       6408 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_yellow/part-00000-2e5e3280-29db-41b9-be67-a68ef8cb9777-c000.snappy.parquet
2020-01-28 09:01:02       6424 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_yellow/part-00000-c4972037-7d3c-4279-8b77-361741133816-c000.snappy.parquet
2020-01-28 09:29:55       6398 deequ-profiler-metrics/db_name=nyctlcdb/table_name=trip_data_yellow/part-00000-f2d6076e-7019-4b03-97ba-a6aab8a677b8-c000.snappy.parquet

The application generates one Parquet file per execution.
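Before building the Athena and QuickSight reporting described next, you can also sanity-check these metrics directly in a notebook by reading the partitioned Parquet files into a pandas DataFrame. The following sketch assumes the AWS SDK for pandas (awswrangler) library and the bucket and prefix used in this post; the column names match the schema the crawler infers in the next section.

import awswrangler as wr

# Reads the partitioned metrics, including the db_name and table_name
# partition columns, into a pandas DataFrame.
metrics = wr.s3.read_parquet(
    path="s3://deequ-profiler/deequ-profiler-metrics/",
    dataset=True,
)

# Example: follow the dataset-level Size metric for one table across runs.
size_history = metrics[
    (metrics["table_name"] == "trip_data_yellow")
    & (metrics["entity"] == "Dataset")
    & (metrics["name"] == "Size")
].sort_values("profiler_run_ts")

print(size_history[["profiler_run_ts", "value"]])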

Preparing metadata for profiler metrics data

To prepare the metadata for profiler metrics data, complete the following steps:

  1. On the Lake Formation console, create a new database with the name deequprofilerdb to contain the metadata.
  2. On the AWS Glue console, create a new crawler with the name deequ-profiler-metrics to infer the schema of the profiling information stored in Amazon S3.

The following screenshot shows the properties of the new crawler.

After you run the crawler, a table named deequ_profiler_metrics is created in the database. The table has the following columns.

Name | Data Type | Partition | Description
instance | string | | Column name the statistic in column “name” refers to. Set to “*” if entity is “Dataset”.
entity | string | | Entity the statistic refers to. Valid values are “Column” and “Dataset”.
name | string | | Metric name, derived from the Deequ analyzer used for the calculation.
value | double | | Value of the metric.
type | string | | Data type of the column if entity is “Column”, blank otherwise.
db_name_embed | string | | Database name, same values as in partition “db_name”.
table_name_embed | string | | Table name, same values as in partition “table_name”.
profiler_run_dt | date | | Date the profiler application was run.
profiler_run_ts | timestamp | | Date/time the profiler application was run; it can also be used as an execution identifier.
db_name | string | 1 | Database name.
table_name | string | 2 | Table name.

Reporting with Athena

You can use Athena to run a query that checks the statistics for a column in the database for the execution you ran on March 2, 2019. See the following code:

SELECT db_name, 
    profiler_run_dt,
    profiler_run_ts,
    table_name, 
    entity,
    instance,
    name,
    value
FROM "deequprofilerdb"."deequ_profiler_metrics" 
WHERE db_name = 'nyctlcdb' AND
    table_name = 'trip_data_yellow' AND 
    entity = 'Column' AND
    instance = 'extra' AND
    profiler_run_dt = date_parse('2019-03-02','%Y-%m-%d')
ORDER BY name
;

The following screenshot shows the query results.

Reporting with Amazon QuickSight

To create a dashboard in Amazon QuickSight based on the profiling metrics data the application generated, complete the following steps:

  1. Create a new QuickSight dataset called deequ_profiler_metrics with Athena as the data source.
  2. In the Choose your table section, select the profiling metrics table that you created earlier.
  3. Import the data into SPICE.

After you create the dataset, you can view it and edit its properties. For this post, leave the properties unchanged.

You are now ready to build visualizations and dashboards.

The following images in this section show a simple analysis with controls that allow for the selection of the Database, Table profiled, Entity, Column, and Profiling Metric.

Control Name | Mapped Column
Database | db_name
Table | table_name
Entity | entity
Column | instance
Profiling Metric | name

For more information about adding controls, see Create Amazon QuickSight dashboards that have impact with parameters, on-screen controls, and URL actions.

For example, you can select the Size metric of a specific table to see how many records are available in the table after each monthly load. See the following screenshot.

Similarly, you can use the same analysis to see how a specific metric changes over time for a column. The following screenshot shows that the mean of the fare_amount column changes after each monthly load.

You can select any metric calculated on any column, which makes for a very flexible profiling data reporting system.

Conclusion

This post demonstrated how to extend the metadata contained in the Data Catalog with profiling information calculated with an Apache Spark application based on the Amazon Deequ library running on an EMR cluster.

You can query the Data Catalog using the AWS CLI. You can also build a reporting system with Athena and Amazon QuickSight to query and visualize the data stored in Amazon S3.

Special thanks go to Sebastian Schelter at Amazon Search and to Sven Hansen and Vincent Gromakowski at AWS for their help and support.

 


About the Author

Francesco Marelli is a senior solutions architect at Amazon Web Services. He lived and worked in London for 10 years, after which he worked in Italy, Switzerland, and other countries in EMEA. He specializes in the design and implementation of analytics, data management, and big data systems, mainly for enterprise and FSI customers. Francesco also has strong experience in systems integration and in the design and implementation of web applications. He loves sharing his professional knowledge, collecting vinyl records, and playing bass.

New – Announcing Amazon AppFlow

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/new-announcing-amazon-appflow/

Software as a service (SaaS) applications are becoming increasingly important to our customers, and adoption is growing rapidly. While there are many benefits to this way of consuming software, one challenge is that data is now living in lots of different places. To get meaningful insights from this data, we need to have a way to analyze it, and that can be hard when our data is spread out across multiple data islands.

Developers spend huge amounts of time writing custom integrations to pass data between SaaS applications and AWS services so that it can be analyzed; these integrations can be expensive and often take months to complete. If data requirements change, then costly and complicated modifications have to be made to the integrations. Companies that don’t have the luxury of engineering resources might find themselves manually importing and exporting data from applications, which is time-consuming, risks data leakage, and has the potential to introduce human error.

Today it is my pleasure to announce a new service called Amazon AppFlow that will solve this issue. Amazon AppFlow allows you to automate the data flows between AWS services and SaaS applications such as Salesforce, Zendesk, and ServiceNow. SaaS application administrators, business analysts, and BI specialists can quickly implement most of the integrations they need without waiting months for IT to finish integration projects.

As well as allowing data to flow in from SaaS applications to AWS services, it’s also capable of sending data from AWS services to SaaS applications. Security is our top priority at AWS, and so all of the data is encrypted while in motion. Some of the SaaS applications have integrated with AWS PrivateLink; this adds an extra layer of security and privacy. When the data flows between the SaaS application and AWS with AWS PrivateLink, the traffic stays on the Amazon network rather than using the public internet. If the SaaS application supports it, Amazon AppFlow automatically takes care of this connection, making private data transfer easy for everyone and minimizing the threat from internet-based attacks and the risk of sensitive data leakage.

You can run the data transfer on a schedule, in response to a business event, or on demand, giving you speed and flexibility with your data sharing.

To show you the power of this new service, I thought it would be interesting to show you how to set up a simple flow.

I run a Slack workspace in the United Kingdom and Ireland for web community organizers. Since Slack is one of the supported SaaS applications in Amazon AppFlow, I thought it would be nice to try and import some of the conversation data into S3. Once it was in S3 I could then start to analyze it using Amazon Athena and then ultimately create a visualization using Amazon QuickSight.

To get started, I go to the Amazon AppFlow console and click the Create Flow button.

In the next step, I enter the Flow name and a Flow description. There are also some options for data encryption. By default, all data is encrypted in transit, and in the case of S3 it’s also encrypted at rest. You have the option to supply your own encryption key, but for this demo, I’m just going to use the key in my account that is used by default.

On this step, you are also given the option to enter Tags for the resource. I have been getting into the habit of tagging demo infrastructure in my account as Demo, which makes it easier for me to know which resources I can delete.

On the next step, I select the source of my data. I pick Slack and go through the wizard to establish a connection with my Slack workspace. I also get to choose what data I want to import from my Slack workspace. I select the Conversations object in the general Slack channel. This will import any messages that are posted to the general channel and then send them to the destination that I configure next.

There are a few destinations that I can pick, but to keep things simple, I ask for the data to be sent to an S3 bucket. On this step, I also set the frequency at which I want to fetch the data. I want the data to be retrieved every hour, so I select Run the flow on schedule and make the necessary configurations. Slack can be triggered on demand or on schedule; some other sources can be triggered by specific events, such as converting a lead in Salesforce.

The next step is to map the data fields. I am just going to go with the defaults, but you could customize this and combine fields or take only the specific fields required for analysis.

Now the flow has been created, and I have activated it; it runs automatically every hour, adding new data to my S3 bucket.

I won’t go into the specifics of Amazon Athena or Amazon QuickSight, but I used both of these AWS services to take my data stored in S3 and produce a word cloud of the most common words used in my Slack channel.

The cool thing about Athena is that you can run SQL queries directly over the encrypted data in S3 without needing any additional data warehouse. You can see the results in the image below. I could now easily share this as a dashboard with anyone in my organization.

Amazon AppFlow is launching today with support for S3 and 13 SaaS applications as sources of data, and S3, Amazon Redshift, Salesforce, and Snowflake as destinations, and you will see us add hundreds more as the service develops.

The service automatically scales up or down to meet the demands you place on it. It also allows you to transfer 100 GB in a single flow, which means you don’t need to break data down into batches. You can trust Amazon AppFlow with your most valuable data, as we have architected it to be highly available and resilient.

Amazon AppFlow is available from today in US East (Northern Virginia), US East (Ohio), US West (Northern California), US West (Oregon), Canada (Central), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia Pacific (Seoul), Asia Pacific (Mumbai), Europe (Paris), Europe (Ireland), Europe (Frankfurt), Europe (London), and South America (São Paulo) with more regions to come.

Happy Data Flowing

— Martin

Speed up your ELT and BI queries with Amazon Redshift materialized views

Post Syndicated from Juan Yu original https://aws.amazon.com/blogs/big-data/speed-up-your-elt-and-bi-queries-with-amazon-redshift-materialized-views/

The Amazon Redshift materialized views function helps you achieve significantly faster query performance on repeated or predictable workloads, such as dashboard queries from Business Intelligence (BI) tools like Amazon QuickSight. It also speeds up and simplifies extract, load, and transform (ELT) data processing. You can use materialized views to store frequently used precomputations and seamlessly use them to achieve lower latency on subsequent analytical queries.

This post demonstrates how to create a materialized view, refresh it after data ingestion, and speed up your BI workload.

Setting up your sample datasets

This walkthrough uses the Amazon Customer Reviews Dataset. It is a public dataset stored in the us-east-1 Region. You will create the following three tables:

  • product_reviews – Contains customer reviews for a specific product
  • customer – Contains customer profile data
  • customer_address – Contains customer address information

The following diagram shows the relationship of the three tables.

To download the script and set up the tables, choose mv_blog.sql.

Creating and using materialized views

For this use case, your marketing team wants to build a report that shows how many customers per state like your products. You also want to drill down to each product category when needed.

In this first step, you create a regular view. See the following code:

CREATE VIEW v_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code is a report to analyze the product review count per state:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

The following code is a report to analyze the product review count per state for specific categories:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The preceding reports take approximately 4 seconds to run. As you sell more products and get more reviews, this elapsed time gradually gets longer. To speed up those reports, you can create a materialized view to precompute the count of reviews per product category and per state. See the following code:

CREATE MATERIALIZED VIEW mv_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code shows the same reports, now run against the materialized view.

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The same reports against materialized views take less than 200 milliseconds because the new queries access precomputed joins, filters, grouping, and partial sums instead of the multiple, larger base tables.

Speeding up and simplifying ELT data processing

To achieve similar performance without the use of materialized views, many users use the CREATE TABLE AS (CTAS) command. However, as you update base tables with new data inserts, updates, or deletes, the CTAS tables become stale; you must recreate them to keep them up-to-date with the latest changes from the base tables. Now with Amazon Redshift materialized views, you can overcome this problem by efficiently and incrementally refreshing the materialized views with supported SQL. For example, the following code ingests another 10,000 reviews:

INSERT INTO product_reviews
SELECT   marketplace, 
  cast(customer_id as bigint) customer_id, 
  review_id, 
  product_id, 
  cast(product_parent as bigint) product_parent, 
  product_title, 
  star_rating, 
  helpful_votes, 
  total_votes, 
  vine, 
  verified_purchase, 
  review_headline, 
  review_body, 
  review_date, 
  year,
  product_category
FROM demo.products_reviews
WHERE review_date = '2015-07-01' LIMIT 10000;

Now the materialized view is out-of-date. To refresh the materialized view, enter the following code:

REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;

Within 200 milliseconds, the materialized view is up-to-date again. Your report queries have the same consistent, fast performance.

The following screenshot is the query log that shows query performance. The log shows newer statements at the top.

The materialized view refresh is much faster because it’s incremental: Amazon Redshift only uses the new data to update the materialized view instead of recomputing the entire materialized view from the base tables. For more information, see REFRESH MATERIALIZED VIEW.

Materialized views also simplify and make ELT easier and more efficient. Without materialized views, you might create an ELT job and use CTAS to precompute the product analysis data. The ELT job recomputes this data after new data is ingested and stores the data in the precomputed product analysis table to meet the dashboard latency requirement.

In particular, the ELT job drops and recreates the precomputed product analysis table after each ingestion. See the following code:

BEGIN;
    DROP TABLE IF EXISTS latest_product_analysis;
    CREATE TABLE latest_product_analysis as SELECT ...;
END;

With materialized views, you just need to create the materialized view one time and refresh it to keep it up-to-date. To refresh materialized views after ingesting new data, add REFRESH MATERIALIZED VIEW to the ELT data ingestion scripts. Amazon Redshift automatically and incrementally brings the materialized view up-to-date.
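
The following Python sketch (not part of the original post) shows what such an ingestion script could look like when driven through the Amazon Redshift Data API. The cluster identifier, database name, secret ARN, and the ingest_reviews.sql file name are placeholders you would replace with your own values.

import time
import boto3

# Minimal sketch of an ELT step that loads new data and then refreshes the
# materialized view. CLUSTER_ID, DATABASE, SECRET_ARN, and the SQL file name
# are placeholders, not values from this walkthrough.
CLUSTER_ID = "my-redshift-cluster"
DATABASE = "dev"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-demo"

client = boto3.client("redshift-data")

def run_sql(sql):
    # The Data API is asynchronous; submit the statement and return its ID.
    return client.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        SecretArn=SECRET_ARN,
        Sql=sql,
    )["Id"]

def wait_for(statement_id):
    # Poll until the statement reaches a terminal state.
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            return status
        time.sleep(2)

# 1. Run the ingestion INSERT shown earlier (saved here to a local .sql file).
with open("ingest_reviews.sql") as f:
    wait_for(run_sql(f.read()))

# 2. Incrementally refresh the materialized view so report queries stay current.
wait_for(run_sql("REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;"))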

Achieving faster performance for BI dashboards

You can use materialized views to help your BI team build a dashboard to analyze product trends.

For example, to create a materialized view to join customer and customer_address dimension tables and precompute reviews and ratings, enter the following code:

CREATE MATERIALIZED VIEW mv_product_analysis 
    sortkey(product_category, Customer_State, review_date) 
AS
SELECT PR.product_category,
       A.ca_state AS Customer_State,
       PR.review_date AS Review_Date,
       PR.product_title,
       COUNT(1) AS review_total,
       SUM(PR.star_rating) AS rating
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND marketplace = 'US'
GROUP BY 1,2,3,4;

You access materialized views the same way you access a regular table. For this walkthrough, choose a materialized view as the source for an Amazon QuickSight dataset, as shown in the following screenshot.

You can preview the data of the materialized view in Amazon QuickSight to understand what information you can use to build the dashboard. The following screenshot shows the sample data of mv_product_analysis.

To track how many reviews customers post over time, use review_date as the X-axis and Sum(review_total) as the Y-axis. The following graph shows this visualization.

The following screenshot shows a complete dashboard “Product trend” that analyzes the top product category, product popularity by state, and more.

Because you are using materialized views, the product trend dashboard loads in seconds and is always up-to-date. You can gain the latest insights, understand customer purchase behavior, and identify business opportunities and optimizations.

You can compare the performance of materialized views with other possible alternatives, such as using regular views and using CTAS. The following graph shows the overall query execution for the product trend dashboard. Materialized views not only improve query performance by more than an order of magnitude compared to using a regular view, but also have low maintenance costs compared to using a CTAS because the incremental refresh time is proportional to the delta of changes in the base tables. In contrast, the CTAS recreate approach needs to process all the data in the base tables.

The following animated gif shows the actual response time for the product trend dashboard built using Amazon QuickSight in direct query mode.

Conclusion

This post showed how to create Amazon Redshift materialized views with one or more base tables to speed up both BI queries and ELT. You can easily build and maintain efficient data processing pipelines and seamlessly extend the low latency query execution benefits of materialized views to data analysis.

About the Authors

Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

Query, visualize, and forecast TruFactor web session intelligence with AWS Data Exchange

Post Syndicated from Jay Park original https://aws.amazon.com/blogs/big-data/query-visualize-and-forecast-trufactor-web-session-intelligence-with-aws-data-exchange/

Given the sheer volume and variety of data available, finding the right data set to gain business insights can be a challenge. You can improve your business by having access to a central repository of various data sets to query, visualize, and forecast. With AWS Data Exchange, finding the right data set has become much simpler. As an example, you can use data sets on web session visitation and demographics to help you understand which demographic groups visit your website most frequently. You can then improve your business through machine learning (ML) models and visitation forecasts.

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. After you subscribe to a data product within AWS Data Exchange, you can use the AWS Data Exchange API, AWS CLI, or the AWS Management Console to load data into Amazon S3 directly. You can then analyze the imported data with a wide variety of AWS services, ranging from analytics to machine learning.

This post showcases TruFactor Intelligence-as-a-Service data on AWS Data Exchange. TruFactor’s anonymization platform and proprietary AI ingests, filters, and transforms more than 85 billion high-quality raw signals daily from wireless carriers, OEMs, and mobile apps into a unified phygital consumer graph across physical and digital dimensions. TruFactor intelligence is application-ready for use within any AWS analytics or ML service to power your models and applications running on AWS, with no additional processing required. Common use cases include the following:

  • Consumer segmentation – Web intelligence on internet browsing behavior in the US provides a complete view of the consumer, including interests, opinions, values, digital behavior, and sentiment, to inform segmentation of your customers and those of your competitors.
  • Customer acquisition or churn campaigns – Internet browsing behavior can identify affinity properties for new prospects as well as switching to competitors’ websites.

This walkthrough uses TruFactor’s Daily Mobile Web Session Index and Daily Demographics by Mobile Web Sessions data sets, which are both available for free subscription through the AWS Data Exchange console. While there are commercial data sets available for purchase in AWS Data Exchange, this post uses trial data sets to showcase the breadth and depth of analytics possible with TruFactor’s intelligence.

This TruFactor intelligence is aggregated on over 3 billion records from telco carrier networks and mobile apps per day, originating from approximately 30 million consistent users, distilled into session-level information that provides a complete view of user digital interests. The accuracy, breadth of data provided, and the persistency of the panel deliver a unified view of consumers that can inform insights or power analytic models or applications on AWS.

These two data sets have applications across verticals such as retail, financial services, and advertising. Common use cases include creating detailed customer segmentation (for example, full DNA maps of consumers based on visits to specific web HTTP hosts), identifying affinity properties, and estimating demand for apps or services. This intelligence is also ideal for identifying trends and changes over time.

Solution overview

The following diagram illustrates the architecture of the solution.

The workflow is comprised of the following steps:

  1. Subscribe to a data set from AWS Data Exchange and export to Amazon S3
  2. Run an AWS Glue crawler to load product data
  3. Perform queries with Amazon Athena
  4. Visualize the queries and tables with Amazon QuickSight
  5. Run an ETL job with AWS Glue
  6. Create a time series forecast with Amazon Forecast
  7. Visualize the forecasted data with Amazon QuickSight

This post looks at the demographic distributions across various websites and how to use ML to forecast website visitation.

Walkthrough overview

The walkthrough includes the following steps:

  1. Subscribe to a TruFactor data set from the AWS Data Exchange console and export the data set to Amazon S3
  2. Use an AWS Glue crawler to load the product data into an AWS Glue Data Catalog
  3. Use Amazon Athena for SQL querying
  4. Visualize the query views and tables with Amazon QuickSight
  5. Use AWS Glue jobs to extract, transform, and load your data for forecasting with Amazon Forecast
  6. Use Amazon Forecast to create a time series forecast of the transformed data
  7. Visualize the forecasted web visitation data with Amazon QuickSight

You do not have to perform additional processing or manipulation of the TruFactor intelligence for this walkthrough.

The data sets

The TruFactor data sets this post uses are in Parquet format with Snappy compression. The following section provides additional details and the schema for each data set.

TruFactor Daily Mobile Web Session Index (US – Nationwide) — Trial

The TruFactor Daily Mobile Web Session Index (US – Nationwide) — Trial data set provides aggregate information per HTTP host as a view of the internet browsing behavior in the US. TruFactor generates the data from high-quality packet layer data sourced from mobile carriers that includes the mobile internet traffic originating from a user’s device. TruFactor derives the projected counts from observed counts that are filtered for exclusion and anonymized to make sure users cannot be re-identified. It extrapolates values from US Census data using a proprietary algorithm. For the avoidance of doubt, this data set does not include user-level data.

The following screenshot shows the schema for the mobile web session data set by HTTP host, session time, MB transferred, number of events, sessions, users, and dates.

TruFactor Daily Demographics by Mobile Web Session (US) — Trial

The TruFactor Daily Demographics by Mobile Web Session (US) — Trial data set includes aggregate demographics: a projected distribution of users per HTTP host as a view of the internet browsing behavior in the US. TruFactor generates the data from high-quality packet layer data sourced from mobile carriers that includes the mobile internet traffic originating from a user’s device. TruFactor derives the distribution from observed counts that are filtered for exclusion and anonymized to make sure users cannot be re-identified. It extrapolates values from US Census data using a proprietary algorithm. Demographics include gender, age range, ethnicity, and income range.

The following screenshot shows the partial schema for the demographics by web session data set. The full schema includes the following attributes: HTTP host, age ranges, genders, ethnicity, income ranges, and date.
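
After you export the files to Amazon S3 in the next section, you can quickly confirm the format and schema of a single file before building anything on top of it. The following sketch downloads one Parquet object and prints its schema; the object key is an illustrative placeholder, since the actual file names depend on the revision you export.

import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")

# Placeholder key: the exported data is partitioned into folders by date,
# so pick any Parquet object from your own export.
BUCKET = "trufactor-data-exchange-bucket"
KEY = "web_sess/20191001/part-00000.snappy.parquet"

s3.download_file(BUCKET, KEY, "/tmp/sample.parquet")

parquet_file = pq.ParquetFile("/tmp/sample.parquet")
print(parquet_file.schema_arrow)       # column names and types
print(parquet_file.metadata.num_rows)  # rows in this file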

Prerequisites

To complete this walkthrough successfully, you must have the following resources:

  • An AWS account.
  • Familiarity with AWS core services and concepts.
  • The ability to launch new resources in your account. Some resources may not be eligible for Free Tier usage and might incur costs.
  • Subscription to TruFactor’s Daily Mobile Web Session Index (US – Nationwide) – Trial and Daily Demographics by Mobile Web Session (US) – Trial data sets. For instructions on subscribing to a data set on AWS Data Exchange, see AWS Data Exchange – Find, Subscribe To, and Use Data Products.

Using AWS Data Exchange, Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight

This section examines the key demographics of visitors to the top seven e-commerce websites. This information can help you understand which demographic groups are visiting your website most frequently and also help you target ads and cater to certain demographics groups. You use AWS Glue crawlers to crawl your data sets in Amazon S3, populate your AWS Glue Data Catalog, query the AWS Glue Data Catalog using Amazon Athena, and use Amazon QuickSight to visualize the queries.

Step 1: Exporting the data from AWS Data Exchange to Amazon S3

To export your TruFactor data set subscriptions into an Amazon S3 bucket, complete the following steps:

  1. Create an Amazon S3 bucket in your working account. For the purposes of our demo, we have named our S3 bucket trufactor-data-exchange-bucket.
  2. Create two folders within the S3 bucket: web_sess and demo_by_web_sess.

This post uses a trial data set with a sample of 14 days. A paid subscription to TruFactor’s Web Sessions data on AWS Data Exchange includes 6 months of historical data, which refreshes daily.

The following screenshot shows the two folders within the S3 bucket. You are now ready to export the data sets.

  1. On the AWS Data Exchange console, under Subscriptions, locate TruFactor Daily Mobile Web Sessions Index (US – Nationwide) – Trial.
  2. Under Revisions, choose the most recent Revision ID.
  3. Choose all assets except the manifest.json files.
  4. Choose Export to Amazon S3.
  5. In the window that opens, choose the S3 bucket and folder to export the product data into.
    • Export all the assets into the S3 bucket’s web_sess folder.
  6. Repeat the previous steps for the TruFactor Daily Demographics by Mobile Web Sessions (US) – Trial data set, with the following change:
    • Export the assets into the demo_by_web_sess folder in your S3 bucket.
  7. Check to make sure you successfully imported the TruFactor data sets in the Overview. The following screenshot shows that the data sets are partitioned into folders by date. Each folder contains Parquet files of web session data for each day.
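
The console export above is the simplest path. If you want to script the same export, for example to pick up each daily revision automatically, AWS Data Exchange exposes it as a job. The following sketch is one way to do that; the data set and revision IDs are placeholders you can copy from the subscription’s Revisions page.

import time
import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

# Placeholder identifiers -- copy the real values from your TruFactor subscription.
DATA_SET_ID = "<data-set-id>"
REVISION_ID = "<revision-id>"
BUCKET = "trufactor-data-exchange-bucket"

# Export every asset in the revision except the manifest files.
assets = dx.list_revision_assets(DataSetId=DATA_SET_ID, RevisionId=REVISION_ID)["Assets"]
destinations = [
    {"AssetId": asset["Id"], "Bucket": BUCKET, "Key": "web_sess/" + asset["Name"]}
    for asset in assets
    if not asset["Name"].endswith("manifest.json")
]

job = dx.create_job(
    Type="EXPORT_ASSETS_TO_S3",
    Details={"ExportAssetsToS3": {
        "DataSetId": DATA_SET_ID,
        "RevisionId": REVISION_ID,
        "AssetDestinations": destinations,
    }},
)
dx.start_job(JobId=job["Id"])

# Wait for the export to finish before crawling the data.
while dx.get_job(JobId=job["Id"])["State"] in ("WAITING", "IN_PROGRESS"):
    time.sleep(10)
print(dx.get_job(JobId=job["Id"])["State"])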

Step 2: Populating your AWS Glue Data Catalog with the TruFactor data sets

Now that you have successfully exported the TruFactor data sets into an Amazon S3 bucket, you create and run an AWS Glue crawler to crawl your Amazon S3 bucket and populate the AWS Glue Data Catalog. Complete the following steps:

  1. On the AWS Glue console, under Data Catalog, choose Crawlers.
  2. Choose Add crawler.
  3. For Crawler name, enter a name; for example, trufactor-data-exchange-crawler.
  4. For Crawler source type, choose Data stores.
  5. Choose Next.
  6. For Choose a data store, choose S3.
  7. For Crawl data in, select Specified path in my account.
  8. For Include path, enter the path for the web_sess data set folder. The crawler points to the following path: s3://<trufactor-data-exchange-bucket>/web_sess.
  9. Choose Next.
  10. Select Yes to Add another data store.
  11. Choose Next.
  12. For Include path, enter the path for the demo_by_web_sess data set folder. The crawler points to the following path: s3://<trufactor-data-exchange-bucket>/demo_by_web_sess.
  13. Choose Next.
  14. In the Choose an IAM role section, select Create an IAM role. This is the role that the AWS Glue crawler and AWS Glue jobs use to access the Amazon S3 bucket and its content.
  15. For IAM role, enter the suffix demo-data-exchange.
  16. Choose Next.
  17. In the schedule section, leave the Frequency with the default Run on Demand.
  18. Choose Next.
  19. In the Output section, choose Add database.
  20. Enter a name for the database; for example, trufactor-db.
  21. Choose Next, then choose Finish. This database contains the tables that the crawler discovers and populates. With these data sets separated into different tables, you join and relationalize the data.
  1. In the Review all steps section, review the crawler settings and choose Finish.
  2. Under Data Catalog, choose Crawlers.
  3. Select the crawler you just created.
  4. Choose Run crawler. The AWS Glue crawler crawls the data sources and populates your AWS Glue Data Catalog. This process can take up to a few minutes. When the crawler is finished, you can see two tables added to your crawler details. See the following screenshot. You can now view your new tables.
  1. Under Databases, choose Tables.
  2. Choose your database.
  3. Choose View the tables. The table names correspond to the Amazon S3 folders that you pointed the AWS Glue crawler to. See the following screenshot.
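
If you prefer to define the crawler in code rather than through the console, a minimal boto3 sketch of the same configuration looks like the following; it assumes the IAM role created above already exists.

import boto3

glue = boto3.client("glue")
CRAWLER_NAME = "trufactor-data-exchange-crawler"

glue.create_crawler(
    Name=CRAWLER_NAME,
    Role="AWSGlueServiceRole-demo-data-exchange",  # role created in the console steps above
    DatabaseName="trufactor-db",
    Targets={"S3Targets": [
        {"Path": "s3://trufactor-data-exchange-bucket/web_sess"},
        {"Path": "s3://trufactor-data-exchange-bucket/demo_by_web_sess"},
    ]},
)

glue.start_crawler(Name=CRAWLER_NAME)

# The crawler state returns to READY when the run has finished.
print(glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"])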

Step 3: Querying the data using Amazon Athena

After you populate the AWS Glue Data Catalog with TruFactor’s Mobile Web Session and Demographics data, you can use Amazon Athena to run SQL queries and create views for visualization. Complete the following steps:

  1. On the Amazon Athena console, choose Query Editor.
  2. On the Database drop-down menu, choose the database you created.
  3. To preview one of the tables in Amazon Athena, choose Preview table.
    In the Results section, you should see 10 records from the web_sess table. See the following screenshot. In this next step, you run a query that creates a view of the Web Session Index and Demographics data across a group of e-commerce HTTP hosts. The view is broken down by the percentage of users categorized by age and gender, the number of users, MB transferred, and the number of sessions, ordered by date.
  4. Run the following SQL query in Amazon Athena:
    CREATE OR REPLACE VIEW e_commerce_web_sess_data AS 
    SELECT
      "date_parse"("a"."partition_0", '%Y%m%d') "date",
      "a"."http_host",
      "a"."users",
      "a"."mb_transferred",
      "a"."number_of_sessions",
      "b"."18_to_25",
      "b"."26_to_35",
      "b"."36_to_45",
      "b"."46_to_55",
      "b"."56_to_65",
      "b"."66_to_75",
      "b"."76_plus",
      "b"."male",
      "b"."female"
    FROM  
      ((
       SELECT
         "partition_0",
         "http_host",
         "users",
         "mb_transferred",
         "number_of_sessions"
       FROM
         "trufactor-db"."web_sess"
       WHERE ("http_host" IN ('www.amazon.com', 'www.walmart.com', 'www.ebay.com', 'www.aliexpress.com', 'www.etsy.com', 'www.rakuten.com', 'www.craigslist.com'))
    )  a
    LEFT JOIN (
       SELECT
         "http_host" "http_host_2",
         "partition_0" "partition_2",
         "age_ranges"."18_to_25",
         "age_ranges"."26_to_35",
         "age_ranges"."36_to_45",
         "age_ranges"."46_to_55",
         "age_ranges"."56_to_65",
         "age_ranges"."66_to_75",
         "age_ranges"."76_plus",
         "genders"."male",
         "genders"."female"
       FROM
         "trufactor-db"."demo_by_web_sess"
       WHERE ("http_host" IN ('www.amazon.com', 'www.walmart.com', 'www.ebay.com', 'www.aliexpress.com', 'www.etsy.com', 'www.rakuten.com', 'www.craigslist.com'))
    )  b ON (("a"."http_host" = "b"."http_host_2") AND ("a"."partition_0" = "b"."partition_2")))
    ORDER BY "date" ASC

  5. After you create the view, you can preview it by repeating the above steps for previewing a table. The following screenshot shows the results, which include the number of users, user percentages by age group and gender, and a list of e-commerce hosts listed by date.
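
You can also run the same SQL outside the console. The following sketch submits a quick validation query against the new view through the Athena API; the query result location is a placeholder prefix in a bucket you own.

import time
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString='SELECT * FROM "e_commerce_web_sess_data" LIMIT 10',
    QueryExecutionContext={"Database": "trufactor-db"},
    ResultConfiguration={"OutputLocation": "s3://trufactor-data-exchange-bucket/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Wait for the query to complete, then print the first few rows (the first row is the header).
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
for row in rows[:5]:
    print([col.get("VarCharValue") for col in row["Data"]])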

Step 4: Visualizing with Amazon QuickSight

After you query your data sets in Amazon Athena, you can use Amazon QuickSight to visualize your results. You must first grant Amazon QuickSight access to the Amazon S3 bucket that holds the TruFactor data sets, which you can do through the Manage QuickSight setting on the Amazon QuickSight console. After you grant access to the Amazon S3 bucket, you visualize the tables and queries with Amazon QuickSight. Complete the following steps:

  1. In the Amazon QuickSight console, choose New Analysis.
  2. Choose New data set.
  3. Choose Athena as the data source.
  4. For Data source name, enter trufactor-data-exchange-source.
  5. From the drop-down menu, choose the database and view you created.
  6. Choose Directly query your data.
  7. Choose Visualize. Because TruFactor intelligence is application-ready, you can gain immediate insights by using Amazon Athena to query and Amazon QuickSight to visualize. This post includes visualizations of the data set for the first two weeks of October 2019. The following graph visualizes the number of users on different HTTP hosts. The following pie charts further filter the HTTP hosts by age range. The following bar chart offers another visualization of users by age range. You could add other fields such as income range, ethnicity, and gender.

Running AWS Glue Jobs and Amazon Forecast

This section discusses how to use AWS Glue jobs to query and export your data set for forecasting with Amazon Forecast. This walkthrough examines the amount of users’ visitation over 14 days across the top 50 HTTP hosts ranked by users’ visitation. From there, you forecast the users’ visitation for these HTTP hosts for the next three days.

Step 1: Creating and running an AWS Glue job

To create and run your AWS Glue job, complete the following steps:

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Add job.
  3. For Name, enter a name for the AWS Glue job; for example, demo-glue-job.
  4. For Type and Glue version, keep the default values.
  5. For This job runs, select A new script to be authored by you.
  6. In the Security configuration, script libraries, and job parameters (optional) section, set the Maximum capacity cluster size to 2. This reduces the cost of running the AWS Glue job. By default, the cluster size is set to 10 Data Processing Units (DPU).
  7. Choose Next.
  8. In the Connections section, keep the default values.
  9. Choose Save job and edit script.
  10. Enter the following code in the script section, and replace YOUR_BUCKET_NAME in the S3 output path at the end of the script with the name of your bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    db_name = "trufactor-db"
    tbl_name = "web_sess"
    
    web_sess_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_name, transformation_ctx = "web_sess_dyf")
    web_sess_df = web_sess_dyf.toDF()
    web_sess_df.createOrReplaceTempView("webSessionTable")
    web_sess_sql_df = spark.sql("""
    SELECT to_date(partition_0, 'yyyyMMdd') AS date,
             http_host,
             users
    FROM 
        (SELECT partition_0,
             http_host,
             users,
             row_number()
            OVER ( PARTITION By partition_0
        ORDER BY users DESC ) AS rn
        FROM webSessionTable )
    WHERE rn<=50
    ORDER BY date""")
    
    web_sess_sql_df.coalesce(1).write.format("csv").option("header","false").save("s3://YOUR_BUCKET_NAME/amazon_forecast_demo/dataset/sampleset")
    job.commit()

    This code queries the top 50 HTTP hosts, ranked by users’ visitation during the first half of October, and returns the users, date, and HTTP host columns. The query results are uploaded to your Amazon S3 bucket in CSV format (you need the files in CSV format to use Amazon Forecast).

  11. Choose Save and close the AWS Glue job screen. Before you can run the AWS Glue job, you need to modify the IAM role associated with AWS Glue. Currently, the IAM role only has permission to get and put objects in the directories you specified earlier. You need to update the IAM policy to allow permission to get and put objects in all subdirectories of the Amazon S3 bucket.
  12. On the IAM console, choose the role you used for this walkthrough: AWSGlueServiceRole-demo-data-exchange.
  13. In the Summary section for the IAM role, on the Permissions tab, choose the IAM policy associated with the Managed policy.
  14. Choose Edit policy.
  15. Change the view from Visual editor to JSON.
  16. Within this JSON object, under Resource, add another resource into the list of values. The following code is the updated IAM policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::trufactor-data-exchange-bucket/web_sess*",
                    "arn:aws:s3:::trufactor-data-exchange-bucket/demo_by_web_sess*",
                    "arn:aws:s3:::trufactor-data-exchange-bucket/*"
                ]
            }
        ]
    }

  17. Choose Review policy and Save changes.
  18. On the AWS Glue console, under ETL, choose Jobs. Select the job you created earlier.
  19. From the Action drop-down menu, choose Run job. On the History tab, you can see when the status changes to Succeeded. See the following screenshot. This job can take 15–20 minutes to complete.
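
To run the job from a script or a scheduler instead of the console, a minimal boto3 sketch looks like the following; it uses the job name created above.

import time
import boto3

glue = boto3.client("glue")
JOB_NAME = "demo-glue-job"

run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]

# Poll until the run reaches a terminal state; SUCCEEDED means the CSV export is in S3.
while True:
    state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
    if state not in ("STARTING", "RUNNING", "STOPPING"):
        break
    time.sleep(30)
print(state)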

Step 2: Creating a dataset group, training a predictor, and creating forecasts in Amazon Forecast

To create your dataset group, train a predictor, and create forecasts, complete the following steps:

  1. On the Amazon Forecast console, choose View dataset groups.
  2. Choose Create dataset group.
  3. For Dataset group name, enter a name; for example, users_visitation_sample_dataset_group.
  4. For Forecasting domain, choose Web traffic.
  5. Choose Next.
  6. On the Create target time series dataset page, for Dataset name, enter the name of your dataset; for example, users_visitation_sample_dataset.
  7. For Frequency of your data, choose 1 day.
  8. For Data schema, update the data schema JSON object with the following code:
    {
      "Attributes":[
        {
          "AttributeName": "timestamp",
          "AttributeType": "timestamp"
        },
        {
          "AttributeName": "item_id",
          "AttributeType": "string"
        },
        {
          "AttributeName": "value",
          "AttributeType": "float"
        }
      ]
    }

  9. Choose Next.
  10. On the Import target time series data page, for Dataset import name, enter your dataset name; for example, users_visitation_sample_dataset_import.
  11. For Timestamp format, enter yyyy-MM-dd.
  12. For IAM Role, create a new role and grant Amazon Forecast access to the S3 bucket that you are using for this demo.
  13. For Data Location, use the S3 path that you exported your CSV file to after the AWS Glue job: s3://<trufactor-data-exchange-bucket>/amazon_forecast_demo/dataset/sampleset.
  14. Review the settings for import target time series data and choose Start import. The process of importing the data can take approximately 10 minutes. When the status changes to Active, you can begin training a predictor.
  1. On the Dashboard page, choose Start next to Predictor training.
  2. On the Train predictor page, for Predictor name, enter a name for the predictor; for example, users_visitation_sample_dataset_predictor.
  3. For Forecast horizon, choose 3.
  4. For Forecast frequency, choose day.
  5. For Algorithm selection, select Manual. If you use the other algorithm option, AutoML, you allow Amazon Forecast to choose the right algorithm based on a pre-defined objective function, which is not necessary for this walkthrough.
  6. For Algorithm, choose Deep_AR_Plus (you use deep learning to forecast users’ visitation across 50 HTTP hosts).
  7. Leave all other options with the default values.
  8. Review the settings and choose Train predictor. The predictor training process can take 20–30 minutes. When the training completes, the status changes to Active. To evaluate the predictor’s (ML model) accuracy, Amazon Forecast splits the input time series data into two data sets: training and test. This process tests a predictive model on historical data and is called backtesting. When it splits the input time series data, it maintains the data’s order, which is crucial for time series data. After training, Amazon Forecast calculates the root mean square error (RMSE) and weighted quantile losses to determine how well the predictor performed. For more detailed information about backtesting and predictor metrics, see Evaluating Predictor Accuracy. When the predictor is finished training, you can create a forecast.
  9. On the Dashboard page, under Generate forecasts, choose Start.
  10. For Forecast name, enter a forecast name; for example, users_visitation_sample_forecast.
  11. For Predictor, choose your trained predictor.
  12. For Forecast types, you can enter any values between 0.01 and 0.99, plus the mean. These are the quantiles at which Amazon Forecast generates the probabilistic predictions. This post enters .50, .90, .99, mean.
  13. Choose Create a forecast. The forecast creation process can take 15–20 minutes.
  14. When the forecast is complete, choose Forecasts.
    You should see a single forecast. See the following screenshot.
    You can now export the generated forecast to a new folder within your existing Amazon S3 bucket for visualization with Amazon QuickSight.
  1. Choose the newly generated forecast.
  2. Under Exports, choose Create forecast export.
  3. For Export name, enter a name for the export; for example, users_visitation_sample_forecast_export.
  4. For Generated forecast, choose users_visitation_sample_forecast.
  5. For IAM Role, choose the role you created earlier.
  6. For S3 forecast export location, enter the S3 path to store the forecasts: s3://<trufactor-data-exchange-bucket>/amazon_forecast_demo/forecasts/sampleset.
  7. Choose Create forecast export. The exporting process can take up to 5 minutes. Alternatively, you can visualize the user visitation forecasts for the 50 HTTP hosts directly through the Amazon Forecast console or Query API.
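
The same Amazon Forecast resources can be created through the API. The following sketch condenses the console steps above; the IAM role ARN is a placeholder, and because every create_* call is asynchronous, a real script would wait for each resource to become ACTIVE before moving to the next step.

import boto3

forecast = boto3.client("forecast")

ROLE_ARN = "arn:aws:iam::123456789012:role/ForecastDemoRole"  # placeholder role with S3 access
BUCKET = "trufactor-data-exchange-bucket"

# Dataset group and target time series dataset (WEB_TRAFFIC domain, daily frequency).
dsg = forecast.create_dataset_group(
    DatasetGroupName="users_visitation_sample_dataset_group",
    Domain="WEB_TRAFFIC",
)
ds = forecast.create_dataset(
    DatasetName="users_visitation_sample_dataset",
    Domain="WEB_TRAFFIC",
    DatasetType="TARGET_TIME_SERIES",
    DataFrequency="D",
    Schema={"Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "value", "AttributeType": "float"},
    ]},
)
forecast.update_dataset_group(
    DatasetGroupArn=dsg["DatasetGroupArn"], DatasetArns=[ds["DatasetArn"]]
)

# Import the CSV produced by the AWS Glue job (wait until this is ACTIVE before training).
forecast.create_dataset_import_job(
    DatasetImportJobName="users_visitation_sample_dataset_import",
    DatasetArn=ds["DatasetArn"],
    DataSource={"S3Config": {
        "Path": f"s3://{BUCKET}/amazon_forecast_demo/dataset/sampleset",
        "RoleArn": ROLE_ARN,
    }},
    TimestampFormat="yyyy-MM-dd",
)

# Train a DeepAR+ predictor with a three-day horizon, then generate and export the forecast.
predictor = forecast.create_predictor(
    PredictorName="users_visitation_sample_dataset_predictor",
    AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
    ForecastHorizon=3,
    PerformAutoML=False,
    InputDataConfig={"DatasetGroupArn": dsg["DatasetGroupArn"]},
    FeaturizationConfig={"ForecastFrequency": "D"},
)
fc = forecast.create_forecast(
    ForecastName="users_visitation_sample_forecast",
    PredictorArn=predictor["PredictorArn"],
    ForecastTypes=["0.50", "0.90", "0.99", "mean"],
)
forecast.create_forecast_export_job(
    ForecastExportJobName="users_visitation_sample_forecast_export",
    ForecastArn=fc["ForecastArn"],
    Destination={"S3Config": {
        "Path": f"s3://{BUCKET}/amazon_forecast_demo/forecasts/sampleset",
        "RoleArn": ROLE_ARN,
    }},
)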

Step 3: Querying a view using Amazon Athena and downloading the forecast file

Before you visualize users’ visitation forecast data, create a view in Amazon Athena for the top 50 HTTP hosts ranked by users’ visitation over 14 days. Complete the following steps:

  1. Run the following query in Amazon Athena:
    CREATE OR REPLACE VIEW "top_50_users" AS
    SELECT date_format(date_parse(partition_0,
             '%Y%m%d'),'%Y-%m-%d') AS "date", http_host, users
    FROM 
        (SELECT partition_0,
             http_host,
             users,
             row_number()
            OVER (PARTITION By partition_0
        ORDER BY  users DESC ) AS rn
        FROM "trufactor-db"."web_sess")
    WHERE rn<=50
    ORDER BY date

    The code queries the top 50 HTTP hosts ranked by users’ visitation sorted by date.

  2. In the Amazon S3 console, navigate to the S3 bucket and directory holding the files: s3://<trufactor-data-exchange-bucket>/amazon_forecast_demo/forecasts/sampleset. The following screenshot shows three different files inside the folder.
  1. Download the CSV file.
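
If you would rather skip the console download, the following sketch lists the export prefix and pulls down the CSV part files that Amazon Forecast wrote.

import os
import boto3

s3 = boto3.client("s3")
BUCKET = "trufactor-data-exchange-bucket"
PREFIX = "amazon_forecast_demo/forecasts/sampleset"

response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in response.get("Contents", []):
    key = obj["Key"]
    if key.endswith(".csv"):
        local_name = os.path.basename(key)
        s3.download_file(BUCKET, key, local_name)
        print("Downloaded", local_name)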

Step 4: Visualizing in Amazon QuickSight

To visualize the data in Amazon QuickSight, complete the following steps:

  1. On the Amazon QuickSight console, choose Manage data.
  2. Choose New data set.
  3. Choose Upload a file.
  4. Upload the CSV file that you downloaded.
  5. On the Confirm file upload settings page, choose Next.
  6. Choose Visualize.
  7. Return to the Amazon QuickSight console and choose Manage data.
  8. Choose New data set for the top 50 HTTP hosts view you queried earlier.
  9. On the Create a Data set page, find the data source you created earlier: trufactor-data-exchange-source.
  10. From the drop-down list, choose the database and view you created.
  11. Choose Directly query your data.
  12. Choose Visualize.
  13. On the new Amazon QuickSight analysis page, choose the pencil icon next to Data set.
  14. Choose Add data set.
  15. Choose the CSV file you uploaded.

You now have a single Amazon QuickSight analysis with multiple data sets to visualize.

The following graphs highlight the historical data for the users’ visitation across 50 HTTP hosts for the first two weeks of October and the mean forecast for users’ visitation for the next three days.

The following graphs highlight the historical data and forecasted P50, P90, and P99 quantile values for www.google.com.

Amazon Forecast makes it easier to get started with machine learning without having to create your own ML models from scratch. You can use this information to anticipate the web traffic for the upcoming week, which can aid in scaling your resources and applications accordingly.

Cleaning up

To avoid incurring future charges, delete the following resources that you created in this walkthrough:

  • The Amazon S3 bucket trufactor-data-exchange-bucket
  • The AWS Glue crawler trufactor-data-exchange-crawler
  • The AWS Glue job demo-glue-job
  • The AWS IAM role AWSGlueServiceRole-demo-data-exchange
  • The AWS Glue database trufactor-db
  • The Amazon QuickSight demo data sets and analysis
  • The following Amazon Forecast resources (in this order) for users_visitation_sample_dataset_group via the console:
    • Existing forecasts under Forecasts
    • Existing predictors under Predictors
    • Existing datasets under Datasets

Conclusion

This walkthrough detailed how to import a data set to Amazon S3 from AWS Data Exchange, use AWS Glue to run crawlers and an ETL job on the data, run SQL queries with Amazon Athena, create a time series forecast of the queried data with Amazon Forecast, and visualize the queried and forecasted data with Amazon QuickSight.

This post used TruFactor Intelligence-as-a-Service, one of the AWS Data Exchange launch partners, to power this walkthrough. TruFactor intelligence on AWS Data Exchange highlighted the ease of loading directly into Amazon S3 and layering advanced AWS services.

For more information about TruFactor and the AWS Data Exchange, see TruFactor on AWS Data Exchange on the TruFactor website. You can subscribe to TruFactor Intelligence directly on AWS Data Exchange or engage with TruFactor directly to identify the right offering from the larger product portfolio of anonymized consumer intelligence.


About the Authors

Jay Park is a solutions architect at AWS.

Ariana Rahgozar is a solutions architect at AWS.

Build a cloud-native network performance analytics solution on AWS for wireless service providers

Post Syndicated from Angelo Sampietro original https://aws.amazon.com/blogs/big-data/build-a-cloud-native-network-performance-analytics-solution-on-aws-for-wireless-service-providers/

This post demonstrates a serverless, cloud-based approach to building a network performance analytics solution using AWS services that can provide flexibility and performance while keeping costs under control with pay-per-use AWS services.

Without good network performance, you may struggle to meet the challenges of real-time, low-latency services and the growing total bandwidth your customers consume.

Considering the large volume of data that you need to ingest, store, and process every second for optimal performance monitoring, standard on-premises monitoring approaches are no longer efficient.

A cloud-native approach allows you to invest in the solutions that generate business value and move from the typical capital expenditure model to an operational expenditure model by avoiding upfront costs and over-provisioning of infrastructure.

Data and voice network complexity for mobile service providers

According to Cisco’s global mobile data traffic forecast, there will be 13.1 billion mobile-connected devices by 2023, and 1.4 billion of those will be 5G capable.

As a mobile service provider, you must understand how to perform accurate network planning and sizing for your access and core networks.

The increase in the global demand for network throughput and the number of services such as VoLTE, IoT, and video streaming on mobile networks is forcing mobile service providers to implement new architectures to match the desired quality of service (QoS).

Addressing optimal QoS when a multitude of services are running on a converged network is not an easy task. The workflow is complex, starting from collecting counters and statistics data from a multitude of different network elements to transforming the collected data into key performance indicators (KPIs) that you can link to the quality of one of the multiple services delivered over the network.

Modern mobile networks, with the deployment of 4G, 5G, and IoT services, have given rise to an increased number of cells deployed across the coverage area, so you must collect counters and generate KPIs for thousands of different network elements.

Considering that every network element can generate a few thousand counters, the network performance system has to manage millions of measurements at every collection cycle.

This is difficult to manage at scale with on-premises deployments without a high-cost solution. Instead, you can use AWS services to design a modern network performance analytics solution that covers all the requirements of different departments of telecommunications service providers (TSPs).

Data and voice network architecture

The main problem that you may face as a service provider is the complexity of modern mobile networks, which originate from several evolutions of communication standards in the last few decades (from 2G to 5G for data core, from CS to VoLTE for voice core) and the hardware and capabilities of the network elements.

The following diagram illustrates a simplified schema of a mobile wireless network as currently deployed, where you can find:

  • Access network, with the network elements needed for coverage from 2G to 5G
  • Core network, which includes the network elements for all the functions needed to deliver the services, plus authentication and the database of all the users on the network
  • Services delivered by the network, which include voice (PSTN/PLMN), internet (data service), and application services

IoT network architecture and service differentiation

The IoT traffic paradigm is completely different from the traffic patterns of smartphones and tablets. These mobile devices normally open an always-on session (PDP context or bearer) and generate a considerable amount of user plane traffic. An IoT device instead opens a session, sends a few bytes, and closes the session to limit power consumption and avoid allocating resources on cells when not necessary.

The following diagram depicts a real-world use case in which a service provider deploys IoT services. In most cases, IoT traffic is delivered to a separate 4G core for the following reasons:

  • EPC optimization needed to manage IoT traffic
  • Avoiding IoT traffic managed by the same core network of customers’ traffic for security reasons
  • IoT devices generate huge volumes of control plane traffic that can impact customer performance if managed by the same network

This usage pattern requires a different traffic model and forces mobile service providers to perform firmware upgrades to the current cells or deploy new cells into the network. This has an impact on the QoS due to the increase in the number of counters and cells.

What is network performance, and why is it important?

You can use network performance monitoring systems to generate reports and insights and track network performance. Multiple organizations like Operations, Network Planning, and Engineering use these tools to view the overall quality of networks and services and control important KPIs like availability, response times, and download and upload speeds.

Network performance is strictly connected to the QoS that, by definition, is the mechanism to control the performance, reliability, and usability of a telecommunications service.

Pain points for service providers

Every vendor of network elements usually also provides the element managers for the hardware. These systems are proprietary, but most of the time they comply with the 3rd Generation Partnership Project (3GPP) standard for the XML format of performance measurement (PM) files, which export the counters you need to measure network performance accurately.

Service providers normally also use multiple vendors for core and access network elements, so monitoring such a heterogeneous, multi-vendor environment is one of the main challenges in today’s deployments.

Data types and 3GPP standards

Counters and statistics about the performance on a network element are exported in PM files. 3GPP standardized the format in a specific document, which you can download from the 3GPP website.

The network vendor is not guaranteed to follow this standard, but even if there are some customizations, the structure and format in most cases remain very similar to the open standard.

The following schema from 3GPP describes the XML format for PM that is exported from network elements.

The XML schema root element is “measCollecFile” and it has three child elements:

  1. “fileHeader”: contains information such as file sender and the beginning time of the measurements.
  2. “fileFooter”: contains the end time of the measurements.
  3. “measData”: contains all the measurements information such as measurement types and their values.

The most important tags are:

  • measInfo – Contains the family of the measurements, the granularity, and the counters list for each measValue
  • measValue – Contains multiple measResults with the results of the measurements

Building a network performance solution on AWS

In this section, we describe a possible architecture for implementing the solution on AWS with a serverless approach.

Prerequisites

To implement this solution, you must have the following prerequisites:

  • An AWS account in the same Region
  • The AdministratorAccess policy granted to your AWS account (for production, you should restrict access as needed)

This post uses the EU (Ireland) Region. However, you can choose any other Region where the services used in this post are available.

For more information about AWS Regions and where AWS services are available, see Region Table.

The following diagram illustrates the high-level, end-to-end solution and the AWS services it uses. The workflow begins with the ingestion of the files using AWS Transfer for SFTP or Kinesis Data Firehose. The data is stored in Amazon S3 and processed with a Lambda function and AWS Glue to create a data catalog. You query the data with Amazon Athena and visualize it in Amazon QuickSight.

Collecting data using Kinesis Data Firehose or AWS SFTP

For information about collecting PM files via element managers, see chapter 5 of Technical Specification 3GPP TS 31.432 on the 3GPP website.

Element managers act as collectors of XML measurement files that the network elements provide. They can send the files to an S3 bucket via AWS SFTP or Kinesis Data Firehose.

For the data transfer, you can refer to the documentation about creating a Kinesis Data Firehose delivery stream and sending data to a delivery stream, or, if you plan to use SFTP transfer, you can read how AWS Transfer for SFTP works.
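
As a small illustration of the Kinesis Data Firehose path, the following sketch pushes one PM file into a delivery stream whose destination is the raw S3 prefix used later in this post. The stream name and file name are placeholders, the delivery stream is assumed to already exist, and a single Firehose record is limited to roughly 1 MB, so very large PM files are better transferred with AWS Transfer for SFTP.

import boto3

firehose = boto3.client("firehose")

STREAM_NAME = "wireless-pm-raw"  # placeholder; create the delivery stream beforehand

# Read one PM XML file exported by the element manager (hypothetical file name)
# and send it as a single record. Firehose buffers and delivers it to the S3
# prefix configured on the delivery stream.
with open("A20000301.1400-RNC-Gbg-1.xml", "rb") as pm_file:
    payload = pm_file.read()

firehose.put_record(
    DeliveryStreamName=STREAM_NAME,
    Record={"Data": payload},
)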

Transforming data using a Lambda function

This post provides a Lambda function that is associated with the ObjectCreated event type of the destination S3 bucket rawxml prefix (s3://wireless-pm/rawxml). The function runs every time a new XML file is saved in this location.

The Lambda function is written in Python 3.7. You can download it from the GitHub repository for this blog post. The function uses a layer to resolve the dependency for the xmltodict library used in the code.

The following screenshot of the AWS Management Console shows some of the main properties of the Lambda function:

  1. The Lambda name and its associated Lambda Layer.
  2. The S3 bucket trigger event on the ObjectCreated event.

The function transposes the XML files, converts them into CSV or JSON (depending on the value set in the function’s output_format environment variable), and formats the files with one record per measurement (measData), measure type, and value (measInfo). The technical specification 3GPP TS 32.435 document on the 3GPP website provides three example XML files in the Annex section.

The following code is an example A.1 XML file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<measCollecFile xmlns="http://www.3gpp.org/ftp/specs/archive/32_series/32.435#measCollec"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.3gpp.org/ftp/specs/archive/32_series/32.435#measCollec http://www.3gpp.org/ftp/specs/archive/32_series/32.435#measCollec">
    <fileHeader fileFormatVersion="32.435 V7.0" vendorName="Company NN" dnPrefix="DC=a1.companyNN.com,SubNetwork=1,IRPAgent=1">
        <fileSender localDn="SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1" elementType="RNC"/>
        <measCollec beginTime="2000-03-01T14:00:00+02:00"/>
    </fileHeader>
    <measData>
        <managedElement localDn="SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1" userLabel="RNC Telecomville"/>
        <measInfo>
            <job jobId="1231"/>
            <granPeriod duration="PT900S" endTime="2000-03-01T14:14:30+02:00"/>
            <repPeriod duration="PT1800S"/>
            <measTypes>attTCHSeizures succTCHSeizures attImmediateAssignProcs succImmediateAssignProcs</measTypes>
            <measValue measObjLdn="RncFunction=RF-1,UtranCell=Gbg-997">
                <measResults>234 345 567 789</measResults>
            </measValue>
            <measValue measObjLdn="RncFunction=RF-1,UtranCell=Gbg-998">
                <measResults>890 901 123 234</measResults>
            </measValue>
            <measValue measObjLdn="RncFunction=RF-1,UtranCell=Gbg-999">
                <measResults>456 567 678 789</measResults>
                <suspect>true</suspect>
            </measValue>
        </measInfo>
    </measData>
    <fileFooter>
        <measCollec endTime="2000-03-01T14:15:00+02:00"/>
    </fileFooter>
</measCollecFile>

The Lambda function transforms the preceding file into the following JSON format (this code example only shows the first record of the transformed and transposed dataset):

{
    "fh_file_format_version": "32.435 V7.0",
    "fh_vendor_name": "Company NN",
    "fh_dn_prefix": "DC=a1.companyNN.com,SubNetwork=1,IRPAgent=1",
    "fh_fs_local_dn": "SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1",
    "fh_fs_element_type": "RNC",
    "fh_mc_begin_time": "2000-03-01T14:00:00+02:00",
    "ff_mc_end_time": "2000-03-01T14:15:00+02:00",
    "md_me_local_dn": "SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1",
    "md_me_user_label": "RNC Telecomville",
    "md_me_sw_version": "",
    "md_mi_meas_info_id": "",
    "md_mi_job_jobid": "1231",
    "md_mi_gp_duration": "PT900S",
    "md_mi_gp_end_time": "2000-03-01T14:14:30+02:00",
    "md_mi_rp_duration": "PT1800S",
    "md_mi_meas_obj_ldn": "RncFunction=RF-1,UtranCell=Gbg-997",
    "md_mi_meas_name": "attTCHSeizures",
    "md_mi_meas_value": 234,
    "md_mi_meas_p": null,
    "md_mi_meas_suspect": null
}

You can change the output field names in the get_record_header static method of the GPPXml class defined in the Lambda function. The fields md_mi_meas_name and md_mi_meas_value contain the measure name and measure value, respectively.
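
To make the transposition concrete, the following standalone sketch (simplified, and not the Lambda function from the GitHub repository) parses a PM file with xmltodict and emits one record per measured object and counter by pairing each name in measTypes with the corresponding value in measResults. It only handles the measTypes/measResults variant used in the sample above.

import json
import xmltodict

def as_list(node):
    # 3GPP PM files can contain one or many child elements; normalize to a list.
    return node if isinstance(node, list) else [node]

def transpose(xml_bytes):
    doc = xmltodict.parse(xml_bytes)["measCollecFile"]
    records = []
    for meas_data in as_list(doc["measData"]):
        for meas_info in as_list(meas_data["measInfo"]):
            counter_names = meas_info["measTypes"].split()
            for meas_value in as_list(meas_info["measValue"]):
                counter_values = meas_value["measResults"].split()
                for name, value in zip(counter_names, counter_values):
                    records.append({
                        "md_mi_meas_obj_ldn": meas_value["@measObjLdn"],
                        "md_mi_meas_name": name,
                        "md_mi_meas_value": value,
                    })
    return records

if __name__ == "__main__":
    # Run against the A.1 sample shown above, saved locally as example_a1.xml.
    with open("example_a1.xml", "rb") as f:
        print(json.dumps(transpose(f.read())[:3], indent=2))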

The transformed CSV and JSON files are saved in the raw_transform_csv and raw_transform_json prefixes, respectively, in the S3 bucket (only one format is created for each execution, depending on the value of the output_format environment variable). The following screenshot shows the S3 bucket overview on the Amazon S3 console.

For this use case, a Lambda function is triggered for the transformation task every time a new XML file arrives. Depending on the volume and size of the XML files, you can choose other compute options on AWS based on the right fit, such as in containers using AWS Batch, Amazon ECS, or Amazon EKS. For more information about configuring and running processes using Amazon ECS, see Building, deploying, and operating containerized applications with AWS Fargate.

Alternatively, you could ingest the raw XML file using a Kinesis Firehose stream with a Lambda function attached to it to transform the data to JSON and output directly to Amazon S3 in Parquet or ORC format using the record format conversion feature of Kinesis Firehose. You would need to predefine a table in the Data Catalog whose schema, serializer, and deserializer you use to convert the data.

Building a Data Catalog using AWS Glue

To create an AWS Glue table in the Data Catalog, first create an AWS Glue crawler. Complete the following steps:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Crawlers.
  3. Choose Add crawler.
  4. For Crawler name, enter a name for your crawler; for example, Wireless_PM_crawler.
  5. Choose Next.
  6. For Choose a data store, choose S3.
  7. For Crawl data in, select Specified path in my account.
  8. For Include path, enter the path to the input files in CSV format.
  9. Choose Next.
  10. In the Choose an IAM role section, select Choose an existing IAM role.
  11. For IAM role, enter the name of your role to grant read permission to access the S3 bucket.
  12. Choose Next. The following code is the AWSGlueServiceRole-S3 IAM policy for this post:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::npm-blog/raw_transform_csv/Parquet*",
                    "arn:aws:s3:::npm-blog/raw_transform_csv/*"
                ]
            }
        ]
    }

  13. In the Choose the crawler’s output section, for Database, choose the database in which the Data Catalog table is created; for example, wireless_pm.
  14. Choose Next. In these next steps, you create an ETL job that converts from CSV to Parquet (a columnar file format) and writes the Parquet files to an S3 bucket. Parquet file format helps with performance of the downstream consumption of the data. To convert to Parquet, you can either use an AWS Glue generated script (which this post uses) or use your own PySpark or Scala script.
  15. In the Configure the job properties section, for Name, enter wireless_pm_parquet_conversion.
  16. For IAM role, enter AWSGlueServiceRole-wirelesspm
  17. For Type, choose Spark.
  18. For Glue version, choose Spark 2.4, Python 3 (Glue Version 1.0).
  19. For This job runs, select A proposed script generated by AWS Glue.
  20. For Script file name, enter wireless_pm_parquet_conversion.
  21. In the Choose a data target section, select Create tables in your data target.
  22. For Data store, choose Amazon S3.
  23. For Format, choose Parquet.
  24. For Target path, enter the path to your S3 bucket.
  25. Choose Next.

You can configure a trigger to start the job by creating a workflow in AWS Glue that runs the job after the crawler. When the job is complete, a new set of Parquet files are stored in the destination S3 bucket.

You can analyze the processed data directly with Athena by selecting the database created in AWS Glue, and analyzing the data from the AWS Glue table.

The following screenshot shows the AWS Glue database and the two tables created.

The following screenshot shows an example Athena query, which you can run by selecting a few columns to validate the data from the Data Catalog table you created.

You can also use other AWS services for analytics such as Amazon Redshift or Amazon EMR for further processing, analysis, and KPI calculations from the data.

Visualizing data using Amazon QuickSight

Amazon QuickSight is a business analytics service you can use to build visualizations, perform ad hoc analysis, and get business insights from your data. For more information, see What Is Amazon QuickSight?

Amazon QuickSight provides out-of-the-box integration with Athena, which lets you run SQL queries on top of the metadata in your Data Catalog. For more information, see Creating a Data Set Using Amazon Athena Data.

The following screenshot shows an Amazon QuickSight analysis created from an Amazon QuickSight dataset. The dataset is based on the table that AWS Glue defined, and you query it via Athena.

You can create a visualization by loading the example dataset of PM counters. For example, the following screenshot shows the measurement values for the GPRS SuccessAttach and AbortedAttach and highlights a problem.

Summary

This post discussed the main pain points for wireless service providers and how AWS services can help you build a serverless solution that scales according to your network growth with no upfront costs.

This solution also helps you visualize and analyze data. Additionally, it provided insights that can help operations and network planning departments manage and evolve their network according to current and future standards and services.

As always, AWS welcomes feedback. Please submit comments or questions in the comments section.

About the Authors

Angelo Sampietro is a senior partner solutions architect at Amazon Web Services. Angelo has a strong background in cloud computing, with 20 years of experience in the Telecommunications industry, working in the United States and Europe. With 5 AWS Certifications, including Big Data and Machine Learning Specialties, he currently works as Senior Partner Solutions Architect, helping Global System Integrators to be successful in the partnership with AWS. He loves to share new ideas with colleagues and friends and propose new solutions thinking out of the box.

Francesco Marelli is a senior solutions architect at Amazon Web Services. He lived and worked in London for 10 years and has since worked in Italy, Switzerland, and other countries in EMEA. He is specialized in the design and implementation of Analytics, Data Management and Big Data systems, mainly for Enterprise and FSI customers. Francesco also has strong experience in systems integration and in the design and implementation of web applications. He loves sharing his professional knowledge, collecting vinyl records and playing bass.

 

 

 

A public data lake for analysis of COVID-19 data

Post Syndicated from AWS Data Lake Team original https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/

As the COVID-19 pandemic continues to threaten and take lives around the world, we must work together across organizations and scientific disciplines to fight this disease. Innumerable healthcare workers, medical researchers, scientists, and public health officials are already on the front lines caring for patients, searching for therapies, educating the public, and helping to set policy. At AWS, we believe that one way we can help is to provide these experts with the data and tools needed to better understand, track, plan for, and eventually contain and neutralize the virus that causes COVID-19.

Today, we are making a public AWS COVID-19 data lake available – a centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up to date. We have seeded this curated data lake, hosted on the AWS Cloud, with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and related coronaviruses from the Allen Institute for AI. We will regularly add to this data lake as other reliable sources make their data publicly available.

The breakthroughs that can win the battle against this disease arrive faster when it’s easy for everyone to access and experiment with this vital information. The AWS COVID-19 data lake allows experimenters to quickly run analyses on the data in place without wasting time extracting and wrangling data from all the available data sources. They can use AWS or third-party tools to perform trend analysis, do keyword search, perform question/answer analysis, build and run machine learning models, or run custom analyses to meet their specific needs. Since every stakeholder in this battle brings their own perspective, users can choose to work with the public data lake, combine it with their own data, or subscribe to the source datasets directly through AWS Data Exchange.

We imagine local health authorities could build dashboards to track infections and collaborate to efficiently deploy vital resources like hospital beds and ventilators. Or epidemiologists could complement their own models and datasets to generate better forecasts of hotspots and trends.

For example, at Chan Zuckerberg Biohub, a nonprofit where leaders in science and technology collaborate to cure, prevent, or manage disease, scientists are using the AWS COVID-19 data lake for new epidemiological insights. “Our team of researchers is now analyzing trends in disease spread, its geography, and time evolution by leveraging datasets from the AWS COVID-19 data lake, combined with our own data, in order to better predict COVID epidemiology,” said Jim Karkanias, Vice President of Data Science and Information Technology at Chan Zuckerberg Biohub.

This post walks you through examples of how to use the AWS COVID-19 data lake for analysis. The data lake consists of data in a publicly readable Amazon S3 bucket (s3://covid19-lake). The post shows how to set up the definitions for that data in an AWS Glue Data Catalog to expose it to analytics engines. You can then query the AWS COVID-19 data lake with Amazon Athena, a serverless SQL query engine.

Prerequisites

This post assumes you have the following:

  • Access to an AWS account
  • Permissions to create an AWS CloudFormation stack
  • Permissions to create AWS Glue resources (catalog databases and tables)

Configuring access to the data using a CloudFormation template

To make the data from the AWS COVID-19 data lake available in the Data Catalog in your AWS account, create a CloudFormation stack using the following template. If you are signed in to your AWS account, the following link fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get Started in the AWS CloudFormation documentation.

This template creates a covid-19 database in your Data Catalog and tables that point to the public AWS COVID-19 data lake. You do not need to host the data in your account, and you can rely on AWS to refresh the data as datasets are updated through AWS Data Exchange.

Exploring the data through the Data Catalog in your AWS account

When the CloudFormation stack shows a status of CREATE_COMPLETE, access the Glue Data Catalog to see the tables that the template created. You should see the following tables:

  • Global Coronavirus (COVID-19) Data – Tracks confirmed COVID-19 cases in provinces, states, and countries across the world with a breakdown to the county level in the US.

 

Table Name | Description | Source | Provider
enigma_jhu | Confirmed COVID-19 cases | Johns Hopkins | Enigma

Table Name | Description | Source | Provider
nytimes_states | Data on COVID-19 cases at US state level | NY Times | Rearc
nytimes_counties | Data on COVID-19 cases at US county level | NY Times | Rearc

Table Name | Description | Source | Provider
covid_testing_states_daily | USA total test daily trend by state | COVID Tracking Project | Rearc
covid_testing_us_daily | USA total test daily trend | COVID Tracking Project | Rearc
covid_testing_us_total | USA total tests | COVID Tracking Project | Rearc

Table Name | Description | Source | Provider
hospital_beds | Hospital beds and their utilization in the US | Definitive Healthcare | Rearc

Table Name | Description | Source/Provider
alleninstitute_metadata | Metadata on papers pulled from the CORD-19 dataset. The sha column indicates the paper ID, which is the file name of the paper in the data lake. | Allen Institute for AI
alleninstitute_comprehend_medical | Results from Amazon Comprehend Medical run against the CORD-19 dataset. | Allen Institute for AI

  • Lookup tables to support visualizations.

Table Name | Description
country_codes | Lookup table for country codes
county_populations | Lookup table for the population for each county based on recent census data
us_state_abbreviations | Lookup table for US state abbreviations

In addition, you can see descriptions of the columns in these tables. For example, the following screenshot shows the metadata of the table containing COVID-19 cases from Johns Hopkins.

Querying data via Amazon Athena

This section demonstrates how to query these tables using Athena. Athena is a serverless interactive query service that makes it easy to analyze the data in the AWS COVID-19 data lake. Athena supports SQL, a common language that data analysts use for analyzing structured data. To query the data, complete the following steps:

  1. Sign in to the Athena console.

If this is the first time you are using Athena, you must specify a query result location on Amazon S3.

  2. From the drop-down menu, choose the covid-19 database.
  3. Enter your query.

The following query returns the growth of confirmed cases for the past 7 days joined side-by-side with hospital bed availability, broken down by US county:

SELECT 
  cases.fips, 
  admin2 as county, 
  province_state, 
  confirmed,
  growth_count, 
  sum(num_licensed_beds) as num_licensed_beds, 
  sum(num_staffed_beds) as num_staffed_beds, 
  sum(num_icu_beds) as num_icu_beds
FROM 
  "covid-19"."hospital_beds" beds, 
  ( SELECT 
      fips, 
      admin2, 
      province_state, 
      confirmed, 
      last_value(confirmed) over (partition by fips order by last_update) - first_value(confirmed) over (partition by fips order by last_update) as growth_count,
      first_value(last_update) over (partition by fips order by last_update desc) as most_recent,
      last_update
    FROM  
      "covid-19"."enigma_jhu" 
    WHERE 
      from_iso8601_timestamp(last_update) > now() - interval '7' day AND country_region = 'US') cases
WHERE 
  beds.fips = cases.fips AND last_update = most_recent
GROUP BY cases.fips, confirmed, growth_count, admin2, province_state
ORDER BY growth_count desc

The following screenshot shows the results of this query.

Athena also allows you to run these queries through REST APIs, for example, for building your own visualizations. Moreover, Athena is just one of the many engines that you can use on the data lake. For example, you can use Amazon Redshift Spectrum to join lake data with other datasets in your Redshift data warehouse, or use Amazon QuickSight to visualize your datasets.
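As a rough illustration of the API route, the following boto3 sketch runs a simple aggregation against the enigma_jhu table and prints the results. The query result location is a placeholder you must replace with an S3 bucket you own.

import time
import boto3

athena = boto3.client("athena")

# Placeholder output location for query results; use any S3 bucket you own.
RESULTS = "s3://my-athena-results-bucket/covid19/"

QUERY = (
    'SELECT province_state, sum(confirmed) AS confirmed '
    'FROM "covid-19"."enigma_jhu" '
    "WHERE country_region = 'US' "
    "GROUP BY province_state ORDER BY confirmed DESC LIMIT 10"
)

# Start the query and wait for it to finish.
query_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "covid-19"},
    ResultConfiguration={"OutputLocation": RESULTS},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result rows (the first row contains the column headers).
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])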

We have also created a public Amazon QuickSight dashboard from the COVID-19 case tracking data, testing data, and hospital bed data. You can track daily updates with this dashboard. You can also drill down to see breakdowns by country, province, and county without having to write a line of SQL. The following is a recent screenshot of the dashboard.

CORD-19 research articles

The CORD-19 dataset is a collection of metadata and full-text of research articles about COVID-19, SARS-CoV-2, and related coronaviruses. You can index this data with Amazon Kendra for question/answer exploration, or enrich the data with Amazon Comprehend Medical. We have already done the latter and put it in the table called alleninstitute_comprehend_medical.

The alleninstitute_metadata table provides detailed fields for each paper, such as the title, authors, journal, and URL. The alleninstitute_comprehend_medical table contains key medical concepts such as medical condition, medication, dosage, strength, and frequency. With this metadata, you can quickly query over concepts, analyze or aggregate over authors and journals, and locate papers.

Aggregating over journals

IL-6 inhibitors are a possible therapy for COVID-19, and clinical trials are underway. To demonstrate how to use these tables, this post presents a use case in which you want to understand which journals discuss IL-6 the most by counting the papers they published. You can do this by running the following query:

SELECT m.journal,
       count(distinct(cm.paper_id)) as paper_count
FROM "covid-19".alleninstitute_metadata m
JOIN "covid-19".alleninstitute_comprehend_medical cm
    ON (contains(split(m.sha, '; '), cm.paper_id))
WHERE contains(generic_name, 'IL-6')
GROUP BY  m.journal
ORDER BY paper_count desc

The following screenshot shows an example of the results. The data provider updates this dataset over time, so your results may look different (here, we notice that the second highest count has no journal information).

Drilling down into papers

To see the URLs and the titles of the papers in one of these journals, you simply query both these tables again. For example, to drill into IL-6 related papers in the Crit Care journal, enter the following query:

SELECT distinct m.url, m.title
FROM "covid-19".alleninstitute_metadata m
JOIN "covid-19".alleninstitute_comprehend_medical cm
    ON (contains(split(m.sha, '; '), cm.paper_id))
WHERE contains(generic_name, 'IL-6')
      AND m.journal = 'Crit Care'

The following screenshot shows an example of the results.

These examples are a few of the innumerable analyses you can run on the public data lake. You incur no additional cost for accessing the AWS COVID-19 data lake beyond the standard charges for the AWS services that you use. For example, if you use Athena, you incur the cost of running queries and of storing results in the S3 query result location, but no cost for accessing the data lake. In addition, if you want this data in raw form, you can subscribe to, download, and stay up to date through AWS Data Exchange. We encourage you to try using the public AWS COVID-19 data lake yourself.

Conclusion

Combining our efforts across organizations and scientific disciplines can help us win the fight against the COVID-19 pandemic. With the AWS COVID-19 data lake, anyone can experiment with and analyze curated data related to the disease, as well as share their own data and results. We believe that through an open and collaborative effort that combines data, technology, and science, we can inspire insights and foster breakthroughs necessary to contain, curtail, and ultimately cure COVID-19.

For daily updates on how AWS is addressing the crisis, see Amazon’s COVID-19 blog.

 


About the Authors

The AWS Data Lake Team members are Roy Ben-Alta, Jason Berkowitz, Chris Casey, Patrick Combes, Lucy Friedmann, Fred Lee, Megan Maxwell, Rourke McNamara, Herain Oberoi, Stephen Orban, Brian Ross, Nikki Rouda, Noah Schwartz, Noritaka Sekiyama, Mehul A. Shah, Ben Snively, and Ying Wang.

Visualize user behavior with Auth0 and Amazon EventBridge

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/visualize-user-behavior-with-auth0-and-amazon-eventbridge/

In this post, I show how to capture user events and monitor user behavior by using the Amazon EventBridge partner integration with Auth0. This enables you to gain insights to help deliver a more customized application experience for your users.

Auth0 is a flexible, drop-in solution that adds authentication and authorization services to your applications. The EventBridge integration automatically and continuously pushes Auth0 log events to your AWS account via a custom SaaS event bus.

The examples used in this post are implemented in a custom-built serverless application called FreshTracks. This is a demo application built in Vue.js, which I will use to demonstrate multiple SaaS integrations into AWS with EventBridge in this and future blog posts.

FreshTracks – a demo serverless web application with multiple SaaS Integrations.

The components for this EventBridge integration with Auth0 have been extracted into a separate example application in this GitHub repo.

How the application works

Routing Auth0 Events with Amazon EventBridge.

  1. Events are emitted from Auth0 when a user interacts with the login service on the front-end application.
  2. These events are streamed into a custom SaaS event bus in EventBridge.
  3. Event rules match events and send them downstream to a Lambda function target.
  4. The receiving Lambda function performs some data transformation before writing an object to S3.
  5. These objects are made available by a QuickSight data source manifest file and used as datapoints for QuickSight visuals.

Configuring the Auth0 EventBridge integration

To capture Auth0 emitted events in EventBridge, you must first configure Auth0 for use as the Event Source on your Auth0 Dashboard.

  1. Log in to the Auth0 Dashboard.
  2. Choose Logs > Streams.
  3. Choose + Create Stream.
  4. Choose Amazon EventBridge and enter a unique name for the new Amazon EventBridge Event Stream.
  5. Create the Event Source by providing your AWS Account ID and AWS Region. The Region you select must match the Region of the Amazon EventBridge bus.
  6. Choose Save.

Event Source Configuration on Auth0 dashboard

Auth0 provides you with an Event Source Name. Make sure to save your Event Source Name value since you need this at a later point to complete the integration.

Creating a custom event bus

  1. Go to the EventBridge partners tab in your AWS Management Console. Ensure the AWS Region matches where the Event Source was created.
  2. Paste the Event Source Name in the Partner event sources search box to find and choose the new Auth0 event source. Note: The Event Source remains in a pending state until it is associated with an event bus.

    Partner event source

  3. Choose the event source, then choose Associate with Event Bus.
  4. Choose Associate.

Deploying the application

Once you have associated the Event Source with a new partner event bus, you are ready to deploy backend services to receive and respond to these events.

To set up the example application, visit the GitHub repo and follow the instructions in the README.md file.

When deploying the application stack, make sure to provide the custom event bus name with --parameter-overrides.

sam deploy --parameter-overrides Auth0EventBusName=aws.partner/auth0.com/auth0username-0123344567-e5d2-4514-84f2-97dd4ff8aad0/auth0.logs

You can find the name of the new Auth0 custom event bus in the custom event bus section of the EventBridge console:

Custom event bus name

Routing events with rules

The AWS Serverless Application Model (SAM) template in the example application creates four event rules:

  1. Successful sign-in
  2. Successful signup
  3. Successful log-out
  4. Unsuccessful signup

These are defined with the `AWS::Events::Rule` resource type. Each of these rules is routed to a single target Lambda function. For a successful sign-in, the rule event pattern is matched with detail:data:type:s.  This refers to the Auth0 event type code for a successful sign-in. Every Auth0 event code is listed here.

SuccessfullSignIn: 
    Type: AWS::Events::Rule
    Properties: 
      Description: "Auth0 User Successfully signed in"
      EventBusName: 
         Ref: Auth0EventBusName
      EventPattern: 
        account:
        - !Sub '${AWS::AccountId}'
        detail:
          data:
            type:
            - s
      Targets: 
        - 
          Arn:
            Fn::GetAtt:
              - "SaveAuth0EventToS3" 
              - "Arn"
          Id: "SignInSuccessV1"

To respond to additional events, copy this event rule pattern and change the event code string for the event you want to match.
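If you manage rules outside the SAM template, you can create an equivalent rule with the AWS SDK. The following is a minimal boto3 sketch for a signup rule; the event bus name and function ARN are placeholders, and the event code "ss" is an assumption you should verify against the Auth0 event code list mentioned above.

import json
import boto3

events = boto3.client("events")

# Placeholders: use your partner event bus name and the ARN of the deployed function.
EVENT_BUS = "aws.partner/auth0.com/<tenant>/auth0.logs"
TARGET_ARN = "arn:aws:lambda:us-east-1:123456789012:function:SaveAuth0EventToS3"

# "ss" is assumed here to be Auth0's successful signup code; confirm it in the
# Auth0 log event code list linked above.
events.put_rule(
    Name="SuccessfulSignUp",
    EventBusName=EVENT_BUS,
    EventPattern=json.dumps({"detail": {"data": {"type": ["ss"]}}}),
)

# Route matched events to the same Lambda function. EventBridge also needs
# permission to invoke the function (lambda add-permission) if the template
# does not already grant it.
events.put_targets(
    Rule="SuccessfulSignUp",
    EventBusName=EVENT_BUS,
    Targets=[{"Id": "SignUpSuccessV1", "Arn": TARGET_ARN}],
)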

Writing events to S3 with Lambda

The application routes events to a Lambda function, which performs some data transformation before writing an object to S3. The function code uses an environment variable named AuthLogBucket to store the S3 bucket name. The permissions to write to S3 are granted by a policy defined within the SAM template:

  SaveAuth0EventToS3:
    Type: AWS::Serverless::Function 
    Properties:
      CodeUri: src/
      Handler: saveAuth0EventToS3.handler
      Runtime: nodejs12.x
      MemorySize: 128
      Environment:
        Variables:
          AuthLogBucket: !Ref AuthZeroToEventBridgeUserActivitylogs
      Policies:
        - S3CrudPolicy:
            BucketName: !Ref AuthZeroToEventBridgeUserActivitylogs

The S3 object is a CSV file with context about each event. Each of the Auth0 event schemas is different. To maintain a consistent CSV file structure across different event types, this Lambda function explicitly defines each of the header and row values. An output string is constructed from the Auth0 event:

Lambda function output string

This string is placed into a new buffer and written to S3 with the AWS SDK for JavaScript, as referenced in the GitHub repo.
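The repository implements this function in Node.js. As an illustration of the same idea, here is a minimal Python sketch; the selected fields and the object key format are hypothetical, not the exact ones used in the repo.

import csv
import io
import os
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The Auth0 log entry arrives under the EventBridge "detail" envelope.
    data = event["detail"].get("data", {})

    # Explicit headers keep the CSV structure consistent across event schemas.
    headers = ["date", "type", "client_name", "user_name", "ip"]  # hypothetical field selection
    row = [data.get(h) for h in headers]

    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(headers)
    writer.writerow(row)

    bucket = os.environ["AuthLogBucket"]
    key = "auth0/{}-{}.csv".format(data.get("date"), data.get("type"))  # hypothetical key format
    s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
    return {"written": key}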

Sending events to the application

There is a test event in the /events directory of the example application. This contains an example of a successful sign-in event emitted from Auth0.

Sending a test Auth0 event to Lambda

Send a test event to the Lambda function using the AWS Command Line Interface.

Run the following command in the root directory of the example application, replacing {function-name} with the full name of your Lambda function.

aws lambda invoke --function-name {function-name} --invocation-type Event --payload file://events/event.json  events/response.json --log-type Tail

Response:

{  
 "StatusCode": 202
}

The response output appears in the output terminal window. To confirm that an object is stored in S3, navigate to the S3 Console.  Choose the AuthZeroToEventBridgeUserActivityLogs bucket. You see a new auth0 directory and can open the CSV file that holds context about the event.

Object written to S3

Sending real Auth0 events from a front-end application

Follow the instructions in the Fresh Tracks repo on GitHub to deploy the front-end application. This application includes Auth0’s authentication flow. You can connect to your Auth0 application by entering your credentials in the `auth0_config.json` file:

{
  "domain": "<YOUR AUTH0 DOMAIN>",
  "clientId": "<YOUR AUTH0 CLIENT ID>"
}

The example backend application starts receiving Auth0 emitted events immediately.

To see the full Fresh Tracks application continue to the backend deployment instructions. This is not required for the examples in this blog post.

Building a QuickSight dashboard

You can visualize these Auth0 user events with an Amazon QuickSight dashboard. This provides a snapshot analysis that you can share with other QuickSight users for reporting purposes.

To use Auth0 events as metrics, create a separate calculated field for each event (for example, successful signup and successful login). An analysis could then include multiple visuals, custom fields, conditional formatting, and events, giving a snapshot of user interaction with the front-end application at any given time.

FreshTracks final dashboard example

The example application in the GitHub repo provides instructions on how to create a dashboard.

Conclusion

This post explains how to set up EventBridge’s third-party integration with Auth0 to capture events. The example backend application demonstrates how to filter these events, perform computations on them, save as S3 objects, and send to a downstream service.

The ability to build QuickSight storyboards from these events and share visuals with key business stakeholders can provide a narrative about the analysis data. This is implemented with minimal code to provide near real-time streaming of events and without adding latency to your application.

The possibilities are vast. I am excited to see how builders use this serverless pattern to create their own visuals to build a better, more customized application experience for their users.

Start here to learn about other SaaS integrations with Amazon EventBridge.

Ingest Excel data automatically into Amazon QuickSight

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/ingest-excel-data-automatically-into-amazon-quicksight/

Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to deliver insights to everyone in your organization. This post demonstrates how to build a serverless data ingestion pipeline to automatically import frequently changed data into a SPICE (Super-fast, Parallel, In-memory Calculation Engine) dataset of Amazon QuickSight dashboards.

It is sometimes quite difficult to be agile in BI development. For example, end-users that perform self-service analytics may want to add their additional ad hoc data into an existing dataset and have a view of the corresponding updated dashboards and reports in a timely fashion. However, dashboards and reports are usually built on top of a single online analytic processing (OLAP) data warehouse with a rigid schema. Therefore, an end-user (who doesn’t have permission to update the dataset directly) has to go through a complicated and time-consuming procedure to have their data updated in the warehouse. Alternatively, they could open a ticket for you to edit the dataset manually, but it is still a very inconvenient solution that involves a significant amount of repetitive manual effort, especially if they frequently need to update the data.

Therefore, an automated data processing tool that can perform real-time data ingestion is very useful. This post discusses a tool that, when an end-user uploads Excel files into Amazon S3 or any other data file sharing location, performs the following end-to-end process:

  • Cleans the raw data from the Excel files, which might contain a lot of formatting and redundant information.
  • Ingests the cleaned data.
  • Performs a status check to monitor the data cleaning and ingestion process.
  • Sends a notification of the results to the end-user and BI development team.

With the recently launched feature cross data source joins, you can join across all data sources that Amazon QuickSight supports, including file-to-file, file-to-database, and database-to-database joins. For more information, see Joining across data sources on Amazon QuickSight.

In addition to cross data source joins, Amazon QuickSight has also launched new APIs for SPICE ingestion. For more information, see Importing Data into SPICE and Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities.

This post shows how you can combine these features to build an agile solution that cleans and ingests an Excel file into a SPICE dataset of Amazon QuickSight automatically. In SPICE, the real-time data from Excel joins with the Amazon Redshift OLAP data warehouse, and end-users receive Amazon SNS messages about its status throughout the process.

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow includes the following steps:

  1. An end-user uploads an Excel file into an on-premises shared folder.
  2. The Excel files upload to the Amazon S3 bucket excel-raw-data. Alternatively, the end-user can skip this step and upload the Excel file into this Amazon S3 bucket directly.
  3. This upload event triggers the SNS message Excel-Raw-Data-Uploaded.
  4. Both the end-user and the BI team receive a message about the new upload event.
  5. The upload event also triggers the AWS Lambda function DataClean to process the Excel data.
  6. The Lambda function removes the formatting and redundant information of the Excel file, saves the cleaned data as a CSV file into the S3 bucket autoingestionqs, and publishes an SNS message to notify end-users about the data cleansing status (a minimal sketch of this function appears after this list).
  7. This cleaned CSV file is mapped as an Amazon Athena table.
  8. In the Amazon QuickSight SPICE dataset, this table joins with the Amazon Redshift table through the cross data source join functionality.
  9. The CSV file creation event in the S3 bucket autoingestionqs triggers the Lambda function qsAutoIngestion.
  10. This function calls the data ingestion API of Amazon QuickSight and checks the data ingestion status.
  11. When the data ingestion is complete, end-users receive the Ingestion-Finished SNS message.
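The actual DataClean and qsAutoIngestion functions are available in the GitHub repo referenced later in the walkthrough. To make the cleaning step concrete, the following is a minimal Python sketch of the DataClean idea; the environment variable name, the cleaning operations, and the use of pandas are assumptions, not the repo's exact implementation.

import io
import json
import os
import urllib.parse

import boto3
import pandas as pd  # must be packaged with the function or supplied as a Lambda layer

s3 = boto3.client("s3")
sns = boto3.client("sns")

TARGET_BUCKET = "autoingestionqs"
TOPIC_ARN = os.environ["CLEANING_DONE_TOPIC_ARN"]  # hypothetical variable holding the Cleaning-is-Done topic ARN

def handler(event, context):
    # The function subscribes to the Excel-Raw-Data-Uploaded SNS topic, so the
    # S3 event is wrapped inside the SNS message body.
    s3_event = json.loads(event["Records"][0]["Sns"]["Message"])
    record = s3_event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Read the Excel file and drop fully empty rows and columns left by formatting.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_excel(io.BytesIO(body))
    df = df.dropna(how="all").dropna(axis=1, how="all")

    # Save the cleaned data as CSV for Athena and Amazon QuickSight, then notify end-users.
    out_key = key.rsplit(".", 1)[0] + ".csv"
    s3.put_object(Bucket=TARGET_BUCKET, Key=out_key, Body=df.to_csv(index=False))
    sns.publish(TopicArn=TOPIC_ARN, Message="Cleaning finished for {}".format(key))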

Prerequisites

For this walkthrough, you should have the following prerequisites:

Creating resources

Create your resources by launching the following AWS CloudFormation stack:

During the stack creation process, you have to provide a valid email address as the endpoint of Amazon SNS services. After the stack creation is successful, you have three SNS topics, two S3 buckets, and the corresponding IAM policies.

Walkthrough

To implement this solution, complete the following steps:

  1. Enable SNS notification of the new object creation event in the S3 bucket excel-raw-data. For more information, see How Do I Enable and Configure Event Notifications for an S3 Bucket? When an end-user uploads an Excel file into the excel-raw-data S3 bucket, the event triggers an Amazon SNS message. The following screenshot shows the example Excel file that this post uses. The following screenshot shows the SNS message Excel-Raw-Data-Uploaded, which includes details of the upload event.
  2. Download the sample code DataClean.py in Python 3.7 from the GitHub repo.
  3. Create a Lambda function named DataClean.
  4. Configure the function to be a subscriber of the SNS topic Excel-Raw-Data-Uploaded.
  5. Edit the SNS topic Cleaning-is-Done, and add the following code to the access policy:
    "Sid": "example-statement-ID",
          "Effect": "Allow",
          "Principal": {
            "AWS": "*"
          },
          "Action": "SNS:Publish",
          "Resource": " arn:aws:sns:us-east-1:AWS Account ID: SNS Topic Name",
          "Condition": {
            "ArnLike": {
              "aws:SourceArn": "arn:aws:lambda:us-east-1:AWSAccountID:function:DataClean"
            }
          }

    The policy allows the Lambda function DataClean to trigger the SNS message Cleaning-is-Done.

    The function DataClean saves the CSV file of the cleaned data into the S3 bucket autoingestionqs. You should see the new CSV file in this bucket. See the following screenshot.

    When the Lambda function ends, it triggers the SNS message Cleaning-is-Done. The following screenshot shows the text of the notification message.

  6. Add an event notification into the S3 bucket autoingestionqs to trigger a Lambda function named qsAutoIngestion. This function calls the Amazon QuickSight data API to ingest data into the SPICE dataset. The cleaned CSV file in the S3 bucket autoingestionqs is mapped as an Athena table. The following screenshot shows the sample data in the CSV file.

    In the Amazon QuickSight SPICE dataset, the Athena table joins with the Amazon Redshift table through the cross data source join functionality.

  7. Create the SPICE dataset. For more information, see Joining across data sources on Amazon QuickSight. The following screenshot shows the Data page in Amazon QuickSight where your dataset details appear. The Athena table joins a Redshift table. The new object creation event in the Amazon S3 bucket autoingestionqs triggers another Lambda function named qsAutoIngestion. This function calls the data ingestion API of Amazon QuickSight and checks the data ingestion status. If the data ingestion is completed successfully, end-users receive the SNS message Ingestion-Finished. You can download the sample code of qsAutoIngestion from the GitHub repo (a minimal sketch of the ingestion call follows this list).
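The following is a minimal Python sketch of the ingestion call that qsAutoIngestion makes. The account ID and dataset ID are placeholders, and the real function in the repo also publishes the Ingestion-Finished SNS message.

import time
import uuid
import boto3

qs = boto3.client("quicksight")

# Placeholder identifiers; use your AWS account ID and the ID of the SPICE dataset.
ACCOUNT_ID = "123456789012"
DATASET_ID = "excel-redshift-joined-dataset"

def handler(event, context):
    ingestion_id = str(uuid.uuid4())

    # Start a SPICE ingestion for the dataset.
    qs.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=DATASET_ID,
        IngestionId=ingestion_id,
    )

    # Poll the ingestion status until it finishes.
    while True:
        status = qs.describe_ingestion(
            AwsAccountId=ACCOUNT_ID,
            DataSetId=DATASET_ID,
            IngestionId=ingestion_id,
        )["Ingestion"]["IngestionStatus"]
        if status in ("COMPLETED", "FAILED", "CANCELLED"):
            return {"ingestionId": ingestion_id, "status": status}
        time.sleep(10)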

Cleaning up

To avoid incurring future charges, delete the resources you created: the two Lambda functions, three SNS topics, two S3 buckets, and corresponding IAM policies.

Conclusion

This post discussed how BI developers and architects can use data API, Lambda functions, and other AWS services to complete an end-to-end automation process. End-users can have their real-time data ingested and joined with OLAP data warehouse tables and visualize their data in a timely fashion without the need to wait for nightly or hourly ETL or the need to understand the complex technical development steps. You should now be fully equipped to construct a solution in a development environment and demo it to non-technical business end-users.

 


About the Author

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Build machine learning-powered business intelligence analyses using Amazon QuickSight

Post Syndicated from Osemeke Isibor original https://aws.amazon.com/blogs/big-data/build-machine-learning-powered-business-intelligence-analyses-using-amazon-quicksight/

Imagine you can see the future—to know how many customers will order your product months ahead of time so you can make adequate provisions, or to know how many of your employees will leave your organization several months in advance so you can take preemptive actions to encourage staff retention. For an organization that sees the future, the possibilities are limitless. Machine learning (ML) makes it possible to predict the future with a higher degree of accuracy.

Amazon SageMaker provides every developer and data scientist the ability to build, train, and deploy ML models quickly, but for business users who usually work on creating business intelligence dashboards and reports rather than ML models, Amazon QuickSight is the service of choice. With Amazon QuickSight, you can still use ML to forecast the future. This post goes through how to create business intelligence analyses that use ML to forecast future data points and detect anomalies in data, with no technical expertise or ML experience needed.

Overview of solution

Amazon QuickSight ML Insights uses AWS-proven ML and natural language capabilities to help you gain deeper insights from your data. These powerful, out-of-the-box features make it easy to discover hidden trends and outliers, identify key business drivers, and perform powerful what-if analysis and forecasting with no technical or ML experience. You can use ML insights in sales reporting, web analytics, financial planning, and more. You can detect insights buried in aggregates, perform interactive what-if analysis, and discover what activities you need to meet business goals.

This post imports data from Amazon S3 into Amazon QuickSight and creates ML-powered analyses with the imported data. The following diagram illustrates this architecture.

Walkthrough

In this walkthrough, you create an Amazon QuickSight analysis that contains ML-powered visuals that forecast the future demand for taxis in New York City. You also generate ML-powered insights to detect anomalies in your data. This post uses the New York City Taxi and Limousine Commission (TLC) Trip Record Data on the Registry of Open Data on AWS.

The walkthrough includes the following steps:

  1. Set up and import data into Amazon QuickSight
  2. Create an ML-powered visual to forecast the future demand for taxis
  3. Generate an ML-powered insight to detect anomalies in the data set

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Amazon QuickSight Enterprise Edition
  • Basic knowledge of AWS

Setting up and importing data into Amazon QuickSight

Set up Amazon QuickSight as an individual user. Complete the following steps:

  1. On the AWS Management Console, in the Region list, select US East (N. Virginia) or any other Region of your choice that Amazon QuickSight supports.
  2. Under Analytics, for Services, choose Amazon QuickSight. If you already have an existing Amazon QuickSight account, make sure it is the Enterprise edition; if it is not, upgrade to Enterprise edition. For more information, see Upgrading your Amazon QuickSight Subscription from Standard Edition to Enterprise Edition. If you do not have an existing Amazon QuickSight account, proceed with the setup and make sure you choose Enterprise Edition when setting up the account. For more information, see Setup a Free Standalone User Account in Amazon QuickSight. After you complete the setup, a Welcome Wizard screen appears.
  3. Choose Next on each of the Welcome Wizard screens.
  4. Choose Get Started.

Before you import the data set, make sure that you have at least 3GB of SPICE capacity. For more information, see Managing SPICE Capacity.

Importing the NYC Taxi data set into Amazon QuickSight

The NYC Taxi data set is in an S3 bucket. To import S3 data into Amazon QuickSight, use a manifest file. For more information, see Supported Formats for Amazon S3 Manifest Files. To import your data, complete the following steps:

  1. In a new text file, copy and paste the following code:
    "fileLocations": [
            {
                "URIs": [
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-01.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-02.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-03.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-04.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-05.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-06.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-07.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-08.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-09.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-10.csv",
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-11.csv", 
              "https://nyc-tlc.s3.amazonaws.com/trip data/green_tripdata_2018-12.csv"
    
                ]
            }
        ],
        "globalUploadSettings": {
            "textqualifier": "\""
        }
    }

  2. Save the text file as nyc-taxi.json.
  3. On the Amazon QuickSight console, choose New analysis.
  4. Choose New data set.
  5. For data source, choose S3.
  6. Under New S3 data source, for Data source name, enter a name of your choice.
  7. For Upload a manifest file field, select Upload.
  8. Choose the nyc-taxi.json file you created earlier.
  9. Choose Connect.
    The S3 bucket this post uses is a public bucket that contains a public dataset and is open to the public. When using S3 buckets in your account with Amazon QuickSight, it is highly recommended that the buckets are not open to the public; you need to configure authentication to access your S3 bucket from Amazon QuickSight. For more information about troubleshooting, see I Can’t Connect to Amazon S3. After you choose Connect, the Finish data set creation screen appears.
  10. Choose Visualize.
  11. Wait for the import to complete.

You can see the progress on the top right corner of the screen. When the import is complete, the result shows the number of rows imported successfully and the number of rows skipped.

Creating an ML-powered visual

After you import the data set into Amazon QuickSight SPICE, you can start creating analyses and visuals. Your goal is to create an ML-powered visual to forecast the future demand for taxis. For more information, see Forecasting and Creating What-If Scenarios with Amazon QuickSight.

To create your visual, complete the following steps:

  1. From the Data source details pop-up screen, choose Visualize.
  2. From the field list, select Lpep_pickup_datetime.
  3. Under Visual types, select the first visual. Amazon QuickSight automatically uses the best visual based on the number and data type of fields you selected; from your selection, Amazon QuickSight displays a line chart visual. From the preceding graph, you can see that the bulk of your data clusters are around December 31, 2017, to January 1, 2019, for the Lpep_pickup_datetime field. There are a few data points with date ranges up to June 2080. These values are incorrect and can impact your ML forecasts. To clean up your data set, filter out the incorrect data using the Lpep_pickup_datetime field. This post only uses data in which Lpep_pickup_datetime falls between January 1, 2018, and December 18, 2018, because there is a more consistent amount of data within this date range.
  4. Use the filter menu to create a filter using the Lpep_pickup_datetime field.
  5. Under Filter type, choose Time range and Between.
  6. For Start date, enter 2018-01-01 00:00.
  7. Select Include start date.
  8. For End date, enter 2018-12-18 00:00.
  9. Select Include end date.
  10. Choose Apply.

The line chart should now contain only data with Lpep_pickup_datetime from January 1, 2018, and December 18, 2018. You can now add the forecast for the next 31 days to the line chart visual.

Adding the forecast to your visual

To add your forecast, complete the following steps:

  1. On the visual, choose the arrow.
  2. From the drop-down menu, choose Add forecast.
  3. Under Forecast properties, for Forecast length, set Periods forward to 31.
  4. For Periods backwards, enter 0.
  5. For Prediction interval, leave at the default value 90.
  6. For Seasonality, leave at the default selection Automatic.
  7. Choose Apply.

You now see an orange line on your graph, which is the forecasted pickup quantity per day for the next 31 days after December 18, 2018. You can explore the different dates by hovering your cursor over different points on the forecasted pickup line. For example, hovering your cursor over January 10, 2019, shows that the expected forecasted number of pickups for that day is approximately 22,000. The forecast also provides an upper bound (maximum number of pickups forecasted) of about 26,000 and a lower bound (minimum number of pickups forecasted) of about 18,000.

You can create multiple visuals with forecasts and combine them into a sharable Amazon QuickSight dashboard. For more information, see Working with Dashboards.

Generating an ML-powered insight to detect anomalies

In Amazon QuickSight, you can add insights, autonarratives, and ML-powered anomaly detection to your analyses without ML expertise or knowledge. Amazon QuickSight generates suggested insights and autonarratives automatically, but for ML-powered anomaly detection, you need to perform additional steps. For more information, see Using ML-Powered Anomaly Detection.

This post checks if there are any anomalies in the total fare amount over time from select locations. For example, if the total fare charged for taxi rides is about $1,000 and above from the first pickup location (for example, the airport) for most dates in your data set, an anomaly is when the total fare charged deviates from the standard pattern. Anomalies are not necessarily negative, but rather abnormalities that you can choose to investigate further.

To create an anomaly insight, complete the following steps:

  1. From the top right corner of the analysis creation screen, click the Add drop-down menu and choose Add insight.
  2. On the Computation screen, for Computation type, select Anomaly detection.
  3. Under Fields list, choose the following fields:
  • fare_amount
  • lpep_pickup_datetime
  • PULocationID
  4. Choose Get started. The Configure anomaly detection screen appears.
  5. Choose Analyze all combinations of these categories.
  6. Leave the other settings at their defaults. You can now perform a contribution analysis and discover how the drop-off location contributed to the anomalies. For more information, see Viewing Top Contributors.
  7. Under Contribution analysis, choose DOLocationID.
  8. Choose Save.
  9. Choose Run Now. The anomaly detection can take up to 10 minutes to complete. If it is still running after about 10 minutes, your browser may have timed out. Refresh your browser and you should see the anomalies displayed in the visual.
  10. Choose Explore anomalies.

By default, the anomalies you see are for the last date in your data set. You can explore the anomalies across the entire date range of your data set by choosing SHOW ANOMALIES BY DATE and dragging the slider at the bottom of the visual to display the entire date range from January 1, 2018, to December 30, 2018.

This graph shows that March 21, 2018, has the highest number of anomalies in the fare charged in the entire data set. For example, the total fare amount charged by taxis that picked up passengers from location 74 on March 21, 2018, was 7,181, a change of -64% (about 19,728.5) compared with the total fare charged by taxis for the same pickup location on March 20, 2018. When you explore the anomalies of other pickup locations for that same date, you can see that they all have similar drops in the total fare charged. You can also see the top DOLocationID contributors to these anomalies.

What happened in New York City on March 21, 2018, to cause this drop? A quick online search reveals that New York City experienced a severe weather condition on March 21, 2018.

Publishing your analyses to a dashboard, sharing the dashboard, and setting up email alerts

You can create additional visuals to your analyses, publish the analyses as a dashboard, and share the dashboard with other users. QuickSight anomaly detection allows you to uncover hidden insights in your data by continuously analyzing billions of data points. You can subscribe to receive alerts to your inbox if an anomaly occurs in your business metrics. The email alert also indicates the factors that contribute to these anomalies. This allows you to act immediately on the business metrics that need attention.

From the QuickSight dashboard, you can configure an anomaly alert to be sent to your email with Severity set to High and above and Direction set to Lower than expected. Also make sure to schedule a data refresh so that the anomaly detection runs on your most recent data. For more information, see Refreshing Data.

Cleaning up

To avoid incurring future charges, cancel your Amazon QuickSight subscription.

Conclusion

This post walked you through how to use ML-powered insights with Amazon QuickSight to forecast future data points, detect anomalies, and derive valuable insights from your data without needing prior experience or knowledge of ML. If you want to do more forecasting without ML experience, check out Amazon Forecast.

If you have questions or suggestions, please leave a comment.

 


About the Author

Osemeke Isibor is a partner solutions architect at AWS. He works with AWS Partner Network (APN) partners to design secure, highly available, scalable and cost optimized solutions on AWS. He is a science-fiction enthusiast and a fan of anime.

 

 

 

ICYMI: Serverless Q4 2019

Post Syndicated from Rob Sutter original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2019/

Welcome to the eighth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

The three months comprising the fourth quarter of 2019

AWS re:Invent

AWS re:Invent 2019

re:Invent 2019 dominated the fourth quarter at AWS. The serverless team presented a number of talks, workshops, and builder sessions to help customers increase their skills and deliver value more rapidly to their own customers.

Serverless talks from re:Invent 2019

Chris Munns presenting 'Building microservices with AWS Lambda' at re:Invent 2019

We presented dozens of sessions showing how customers can improve their architecture and agility with serverless. Here are some of the most popular.

Videos

Decks

You can also find decks for many of the serverless presentations and other re:Invent presentations on the AWS Events Content page.

AWS Lambda

For developers needing greater control over performance of their serverless applications at any scale, AWS Lambda announced Provisioned Concurrency at re:Invent. This feature enables Lambda functions to execute with consistent start-up latency, making them ideal for building latency-sensitive applications.

As shown in the below graph, provisioned concurrency reduces tail latency, directly impacting response times and providing a more responsive end user experience.

Graph showing performance enhancements with AWS Lambda Provisioned Concurrency

Lambda rolled out enhanced VPC networking to 14 additional Regions around the world. This change brings dramatic improvements to startup performance for Lambda functions running in VPCs due to more efficient usage of elastic network interfaces.

Illustration of AWS Lambda VPC to VPC NAT

New VPC to VPC NAT for Lambda functions

Lambda now supports three additional runtimes: Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new version-specific features and benefits, which are covered in the linked release posts. Like the Node.js 10 runtime, these new runtimes are all based on an Amazon Linux 2 execution environment.

Lambda released a number of controls for both stream and async-based invocations:

  • You can now configure error handling for Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams. It’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities help you deal with potential “poison pill” records that would previously cause streams to pause in processing.
  • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.

AWS Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set separate destinations for success and failure. This unlocks new patterns for distributed event-based applications and can replace custom code previously used to manage routing results.
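As a rough sketch of how the asynchronous invocation controls and destinations fit together (the function name and destination ARNs are hypothetical), the configuration might look like the following:

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical function name and destination ARNs.
FUNCTION = "order-processor"
ON_SUCCESS_ARN = "arn:aws:sns:us-east-1:123456789012:order-processed"
ON_FAILURE_ARN = "arn:aws:sqs:us-east-1:123456789012:order-failures"

# Cap retries and event age for async invocations, and route results to
# separate success and failure destinations.
lambda_client.put_function_event_invoke_config(
    FunctionName=FUNCTION,
    MaximumRetryAttempts=1,
    MaximumEventAgeInSeconds=3600,
    DestinationConfig={
        "OnSuccess": {"Destination": ON_SUCCESS_ARN},
        "OnFailure": {"Destination": ON_FAILURE_ARN},
    },
)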

Illustration depicting AWS Lambda Destinations with success and failure configurations

Lambda Destinations

Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Kinesis Data Streams and DynamoDB Streams. This enables faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.
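The stream error-handling controls described above and the parallelization factor are all set on the event source mapping. The following boto3 sketch combines them; the stream ARN, queue ARN, and function name are hypothetical.

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical stream, queue, and function names.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream"
FAILURE_QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:clickstream-failures"

lambda_client.create_event_source_mapping(
    EventSourceArn=STREAM_ARN,
    FunctionName="clickstream-processor",
    StartingPosition="LATEST",
    BatchSize=100,
    ParallelizationFactor=4,           # up to 10 concurrent batches per shard
    MaximumRetryAttempts=2,            # stop retrying a failing batch after two attempts
    MaximumRecordAgeInSeconds=3600,    # skip records older than one hour
    BisectBatchOnFunctionError=True,   # split a failing batch to isolate a bad record
    DestinationConfig={"OnFailure": {"Destination": FAILURE_QUEUE_ARN}},
)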

Illustration of multiple AWS Lambda invocations per Kinesis Data Streams shard

Lambda Parallelization Factor diagram

Lambda introduced Amazon SQS FIFO queues as an event source. “First in, first out” (FIFO) queues guarantee the order of record processing, unlike standard queues. FIFO queues support message batching via a MessageGroupId attribute that supports parallel Lambda consumers of a single FIFO queue, enabling high throughput of record processing by Lambda.

Lambda now supports Environment Variables in the AWS China (Beijing) Region and the AWS China (Ningxia) Region.

You can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics show the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, discover outliers, and find hard-to-spot situations that affect customer experience for a subset of your users.

Amazon API Gateway

Screen capture of creating an Amazon API Gateway HTTP API in the AWS Management Console

Amazon API Gateway announced the preview of HTTP APIs. In addition to significant performance improvements, most customers see an average cost savings of 70% when compared with API Gateway REST APIs. With HTTP APIs, you can create an API in four simple steps. Once the API is created, additional configuration for CORS and JWT authorizers can be added.
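The same quick-create flow is available programmatically. The following boto3 sketch creates an HTTP API with a default route and stage pointing at a hypothetical Lambda function:

import boto3

apigw = boto3.client("apigatewayv2")

# Hypothetical Lambda integration target.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:hello-http-api"

# Quick create: one call builds the HTTP API, a $default route and stage, and a
# Lambda proxy integration. Remember to grant API Gateway permission to invoke
# the function (lambda add-permission) if it is not already granted.
api = apigw.create_api(
    Name="hello-http-api",
    ProtocolType="HTTP",
    Target=LAMBDA_ARN,
)
print(api["ApiEndpoint"])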

AWS SAM CLI

Screen capture of the new 'sam deploy' process in a terminal window

The AWS SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

AWS Step Functions

One powerful feature of AWS Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. In Q4, Step Functions expanded its integration with Amazon SageMaker to simplify machine learning workflows. Step Functions also added a new integration with Amazon EMR, making EMR big data processing workflows faster to build and easier to monitor.

Screen capture of an AWS Step Functions step with Amazon EMR

Step Functions step with EMR

Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor trends and react to usage on your AWS account.

You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.

Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates. This increases the limits to 1,500 state transitions per second and a default start rate of 300 state machine executions per second in US East (N. Virginia), US West (Oregon), and Europe (Ireland). Click the above link to learn more about the limit increases in other Regions.

Screen capture of choosing Express Workflows in the AWS Management Console

Step Functions released AWS Step Functions Express Workflows. With the ability to support event rates greater than 100,000 per second, this feature is designed for high-performance workloads at a reduced cost.

Amazon EventBridge

Illustration of the Amazon EventBridge schema registry and discovery service

Amazon EventBridge announced the preview of the Amazon EventBridge schema registry and discovery service. This service allows developers to automate discovery and cataloging event schemas for use in their applications. Additionally, once a schema is stored in the registry, you can generate and download a code binding that represents the schema as an object in your code.

Amazon SNS

Amazon SNS now supports the use of dead letter queues (DLQ) to help capture unhandled events. By enabling a DLQ, you can catch events that are not processed and re-submit them or analyze to locate processing issues.

Amazon CloudWatch

Amazon CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

Screenshot of Amazon CloudWatch ServiceLens in the AWS Management Console

CloudWatch ServiceLens

CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables the outside-in view of your customers’ experiences, and your service’s availability from their point of view.

CloudWatch introduced Embedded Metric Format, which helps you ingest complex high-cardinality application data as logs and easily generate actionable metrics. You can publish these metrics from your Lambda function by using the PutLogEvents API or using an open source library for Node.js or Python applications.
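As a rough sketch of what an Embedded Metric Format payload looks like when printed from a Python Lambda function (the namespace and metric name are hypothetical):

import json
import time

def handler(event, context):
    # Writing EMF-structured JSON to stdout is enough in Lambda: the log line is
    # sent to CloudWatch Logs, which extracts "ProcessedRecords" as a metric.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "MyApp",  # hypothetical namespace
                "Dimensions": [["FunctionName"]],
                "Metrics": [{"Name": "ProcessedRecords", "Unit": "Count"}],
            }],
        },
        "FunctionName": context.function_name,
        "ProcessedRecords": len(event.get("Records", [])),
        "RequestId": context.aws_request_id,  # high-cardinality field kept as a log property
    }))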

Finally, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

AWS X-Ray

AWS X-Ray announced trace maps, which enable you to map the end-to-end path of a single request. Identifiers show issues and how they affect other services in the request’s path. These can help you to identify and isolate service points that are causing degradation or failures.

X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. CloudWatch Synthetics on X-Ray support tracing canary scripts throughout the application, providing metrics on performance or application issues.

Screen capture of AWS X-Ray Service map in the AWS Management Console

X-Ray Service map with CloudWatch Synthetics

Amazon DynamoDB

Amazon DynamoDB announced support for customer-managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows customers to bring their own keys (BYOK), giving them full control over how they encrypt and manage the security of their DynamoDB data.

It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.

Another new DynamoDB capability to identify frequently accessed keys and database traffic trends is currently in preview. With this, you can now more easily identify “hot keys” and understand usage of your DynamoDB tables.

Screen capture of Amazon CloudWatch Contributor Insights for DynamoDB in the AWS Management Console

CloudWatch Contributor Insights for DynamoDB

DynamoDB also released adaptive capacity. Adaptive capacity helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This helps reduce cost by enabling you to provision throughput for a more balanced workload instead of over provisioning for uneven data access patterns.

Amazon RDS

Amazon Relational Database Services (RDS) announced a preview of Amazon RDS Proxy to help developers manage RDS connection strings for serverless applications.

Illustration of Amazon RDS Proxy

The RDS Proxy maintains a pool of established connections to your RDS database instances. This pool enables you to support a large number of application connections so your application can scale without compromising performance. It also increases security by enabling IAM authentication for database access and enabling you to centrally manage database credentials using AWS Secrets Manager.

AWS Serverless Application Repository

The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge appears next to your name in the SAR and links to your GitHub profile.

Screen capture of SAR Verified developer badge in the AWS Management Console

SAR Verified developer badges

AWS Developer Tools

AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has passed through specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.

AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

Screen capture of AWS CodeBuild

CodeBuild test trends in the AWS Management Console

Amazon CodeGuru

AWS announced a preview of Amazon CodeGuru at re:Invent 2019. CodeGuru is a machine learning-based service that makes code reviews more effective and aids developers in writing code that is more secure, performant, and consistent.

AWS Amplify and AWS AppSync

AWS Amplify added iOS and Android as supported platforms. Now developers can build iOS and Android applications using the Amplify Framework with the same category-based programming model that they use for JavaScript apps.

Screen capture of 'amplify init' for an iOS application in a terminal window

The Amplify team has also improved offline data access and synchronization by announcing Amplify DataStore. Developers can now create applications that allow users to continue to access and modify data, without an internet connection. Upon connection, the data synchronizes transparently with the cloud.

For a summary of Amplify and AppSync announcements before re:Invent, read: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

Illustration of AWS AppSync integrations with other AWS services

Q4 serverless content

Blog posts

October

November

December

Tech talks

We hold several AWS Online Tech Talks covering serverless topics throughout the year. These are listed in the Serverless section of the AWS Online Tech Talks page.

Here are the ones from Q4:

Twitch

October

There are also a number of other helpful video series covering Serverless available on the AWS Twitch Channel.

AWS Serverless Heroes

We are excited to welcome some new AWS Serverless Heroes to help grow the serverless community. We look forward to some amazing content to help you with your serverless journey.

AWS Serverless Application Repository (SAR) Apps

In this edition of ICYMI, we are introducing a section devoted to SAR apps written by the AWS Serverless Developer Advocacy team. You can run these applications and review their source code to learn more about serverless and to see examples of suggested practices.

Still looking for more?

The Serverless landing page has much more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. We’re also kicking off a fresh series of Tech Talks in 2020 with new content providing greater detail on everything new coming out of AWS for serverless application developers.

Throughout 2020, the AWS Serverless Developer Advocates are crossing the globe to tell you more about serverless, and to hear more about what you need. Follow this blog to keep up on new launches and announcements, best practices, and examples of serverless applications in action.

You can also follow all of us on Twitter to see latest news, follow conversations, and interact with the team.

Chris Munns: @chrismunns
Eric Johnson: @edjgeek
James Beswick: @jbesw
Moheeb Zara: @virgilvox
Ben Smith: @benjamin_l_s
Rob Sutter: @rts_rob
Julian Wood: @julian_wood

Happy coding!

Amazon QuickSight: 2019 in review

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/amazon-quicksight-2019-in-review/

2019 has been an exciting year for Amazon QuickSight. We onboarded thousands of customers, expanded our global presence to 10 AWS Regions, and launched over 60 features—more than a feature a week! We are inspired by you—our customers, and all that you do with Amazon QuickSight. We are thankful for the time you spend with us across in-person meetings, conference calls, emails, discussion forums, and AWS summits. As we close this year, here’s a quick summary of the highlights.

re:Invent 2019

The Amazon QuickSight team was at re:Invent alongside customers such as Best Western, Capital One, and Club OS, who spoke about their experiences implementing and using Amazon QuickSight for their analytics needs. We also ran two hands-on workshops using the newly launched APIs.

ANT324: Deploy business analytics at enterprise scale with Amazon QuickSight

This session discussed how enterprises are rolling out Amazon QuickSight Enterprise Edition for all their users and using features such as Active Directory or Federated SSO (SAML/OpenID Connect) authentication, private connectivity to data in AWS, email reports, and embedded dashboards.

Best Western rolled out Amazon QuickSight for business reporting across tens of thousands of users, using it in conjunction with Amazon Redshift and AWS Glue. While their previous legacy reporting system took 18 months and over 7,000 staff hours to upgrade server software versions, they now benefit from Amazon QuickSight’s modern reporting infrastructure and constant feature updates, and do not need to manage any servers to report on this data.

Capital One has made Amazon QuickSight available internally to their BI community of over 25,000 users. Key benefits that the Capital One team sees from QuickSight include the ability to roll out new BI use cases without server setup or capacity planning, integrated machine learning (ML) capabilities, and the absence of a traditional software update cycle. All this is with PrivateLink connectivity to data in AWS (Snowflake, Presto, Amazon Redshift, and Amazon RDS), and pay-per-session pricing for consumption with a max charge of $5 per month, per reader.

ANT217: Embedding analytics into applications with Amazon QuickSight

This session dove into the latest features that allow you to integrate Amazon QuickSight dashboards into your software development lifecycle, including new APIs, theming capabilities, and dashboard embedding. The full session recording is available on YouTube.

As an Independent Software Vendor (ISV), Club OS uses Amazon QuickSight’s embedded capabilities to add analytics to their cloud-based Customer Relationship Management (CRM) platform for fitness and wellness businesses. Amazon QuickSight’s serverless architecture lets Club OS focus on meeting actual customer needs without being burdened by operations and infrastructure management. This architecture also provides fast, consistent performance for end-users; the analytics dashboards are rolled out to over 40,000 users across 3,500 health and fitness locations. For more information, see the section of the talk on YouTube by Nicholas Hahn, VP of Product at Club OS.

As an enterprise with tens of thousands of internal users, Capital One uses Amazon QuickSight’s embedded capabilities to add analytics to internal applications and portals. With no servers to set up, teams can set up embedded dashboards in their portals in as quickly as a week, whether they want to serve this to just a few users or thousands. Traditional BI approaches to such portals would require server setup and maintenance, and expensive licensing contracts. For more information, see the section of the talk on YouTube by Latha Govada, Senior Manager Analytics at Capital One.

ANT302: Enhancing your applications with Amazon QuickSight dashboards

This hands-on workshop session guided users through setting up an Amazon QuickSight Enterprise Edition account, connecting to data, creating dashboards, and using APIs to add users, move assets across development and staging versions, and embedding dashboards in a web portal. For more information, see Dashboard Embedding & Operationalizing with Quicksight APIs. Try it out for yourself!

Embedding, theming, and APIs

We followed up on our launch of dashboard embedding for federated AWS identities at re:Invent 2018 (for more information about how the NFL uses embedded Amazon QuickSight dashboards for hundreds of stakeholders of Next Gen Stats, see the session video on YouTube). In 2019, we launched the following features:

Machine learning integrations

We added the following native ML integrations in Amazon QuickSight to help you make the most of your data in AWS and QuickSight:

Visualizations and interactivity

We added the following support for new chart types, conditional formatting, QuickSight Actions, sheets in dashboards, and more:

Calculations and aggregations

We added the following new aggregation options, new functions, and more:

Data connectivity and security

We increased SPICE limits, provided finer control over data sources in QuickSight, enabled data source sharing, and more:

  • SPICE datasets now accommodate up to 100 million rows of data in Enterprise Edition and 25 million rows for Standard Edition. For more information, see Data Source Limits.
  • Fine-grained access control over S3 and Athena allows you to scope down access to these data sources to specific users or groups in Amazon QuickSight using IAM.
  • Cross data source joins allow business analysts to perform joins across supported data sources without relying on data engineering teams to set up complex ETL processes.
  • Shared data sources centralize credential management of data sources. This also allows shared ownership of data connections, allowing collaboration over SQL scripts that define custom SQL datasets.
  • Connect to Presto data sources in AWS without any public routing of data, similar to Amazon Redshift, Snowflake, RDS, and others.
  • Amazon QuickSight data sources that connect to Amazon Athena can now be tagged with specific Athena workgroups. This allows cost allocation of Athena queries by workgroup.

Mobile

We launched the new QuickSight Mobile app for iOS and Android, which allows you to access your dashboards and explore your data with drill-downs and filters. For more information, see New QuickSight Mobile App on YouTube.

Geographic availability

We added the following new Regions and language support in Amazon QuickSight:

  • Amazon QuickSight is now available in 10 AWS regions: US East (N. Virginia and Ohio), US West (Oregon), EU (Frankfurt, Ireland, and London), and Asia Pacific (Seoul, Singapore, Sydney, and Tokyo).
  • The Amazon QuickSight interface is now available in 10 languages: English, German, Spanish, French, Portuguese, Italian, Japanese, Korean, Simplified Chinese, and Traditional Chinese.

With all of our updates you can build dashboards like the ones below and integrate them into your applications and portals.

Looking ahead

To see the full list of 2019 launches, see What’s New in Amazon QuickSight or subscribe to the Amazon QuickSight YouTube channel for the latest training and feature walkthroughs. We have a packed roadmap for 2020, and continue to focus on enabling you with insights from all your data, sharing these with all your users, while not having to worry about operations and servers. Thank you for your support.

We wish you all the very best in the new year (and decade)!


About the authors

Jose Kunnackal John is a principal product manager for Amazon QuickSight.

Sahitya Pandiri is a senior technical program manager with Amazon Web Services. Sahitya has been in product/program management for 6 years now, and has built multiple products in the retail, healthcare, and analytics space.

Highlight Critical Insights with Conditional Formatting in Amazon QuickSight

Post Syndicated from Susan Fang original https://aws.amazon.com/blogs/big-data/highlight-critical-insights-with-conditional-formatting-in-amazon-quicksight/

Amazon QuickSight now makes it easier for you to spot the highlights or low-lights in data through conditional formatting. With conditional formatting, you can specify customized text or background colors based on field values in the dataset, using solid or gradient colors. You can also display data values with the supported icons. Using color coding and icons, you can visually explore and analyze data, quickly detect issues, and identify trends, as seen in the following screenshot. These dynamic visual cues also help your dashboard viewers cut through the densely populated data values and get to the attention-worthy insights much faster.

With this release, you can apply conditional formatting to tables, pivot tables and KPI charts. This blog post walks you through how to leverage conditional formatting in each supported chart type in Amazon QuickSight.

Tables

You can apply conditional formatting to tables through the following methods.

Conditionally formatting columns based on data fields

When applying conditional formatting to tables, there are two ways to access the configuration pane. One option is to select Table in an analysis, and choose the caret in the upper right corner of the Table visual. Select Conditional formatting from the dropdown menu, as shown in the following screenshots, and the Conditional formatting configuration pane pops up to the left of the analysis canvas. Choose the + sign to select the target column to which you’d like to apply the formatting; it can be any column present in the table.

For example, select Sales from the dropdown menu to add a condition for the Sales target column.

Alternatively, from the selected table, you can expand the field well on the top of the analysis canvas and choose the Group by or Value target column directly from the columns used in the table. For example, select Sales from the Value well and select Conditional formatting from the dropdown, as seen in the following screenshots. The conditional formatting configuration pane pops up to the left of the analysis canvas, with a condition already added for the Sales target column.

Next, choose the formatting style from the three available options.

  • Add background color
  • Add text color
  • Add icon

For each target column, you can add any formatting style or any combination of the styles.

Background and text color

Choose Add background color to expand the configuration pane for New background color. For the Sales target column, there are two Fill type options, Solid and Gradient, with Solid as the default.

You are prompted to decide the base column of the conditions, which can be any field in the dataset, regardless of whether it is present in the Table or not.

Next, define the Aggregation of the base column used in the conditions, which can be any aggregation currently supported in QuickSight, such as sum, average, min, max, or standard deviation. Then you can create the conditions that drive the color coding of the cell background.

This example highlights cells based on the following rules:

  • Cells with sum of sales greater than $100, green
  • Cells with sum of sales between $10 and $100 (inclusive), yellow
  • Cells with sum of sales less than $10, red

Since the conditions are discrete, apply solid color based on Sales in the Sum aggregation.

In Conditional #1:

  1. Choose the Greater than comparison method.
  2. Enter 100 as the threshold Value.
  3. Use the color picker to pick green.
  4. Choose Add condition and configure the other two conditions accordingly.
  5. Choose Apply, and the Sales column is conditionally formatted, showing green, yellow and red colors when conditions are met.

If you’d like to reorder the conditions, select a condition and move it up or down. You can also delete a condition. To reset all of the conditions, choose Clear to clear the applied formatting from the visual and remove the conditions from the configuration pane.

The following screenshots demonstrate the configuration step by step:

If you are happy with the highlighting rules, choose Close to return to the summary page.

You see a brief summary of the conditions applied to background color of the Sales column. You can Edit or Delete the applied conditions, Add other formatting styles such as text color or icon to the same Sales column, Delete the target column altogether, or Add another target column up top.

Icons

This example adds icons to the same Sales column based on data in the Profits column so that a thumbs-up icon shows when profits are greater than 0.

  1. Start with Add icon. On the New icon configuration page, choose the base column and Aggregation.
  2. Under Icon set, QuickSight offers a list of canned color icon sets for quick application.
  3. For icons beyond this list, collapse the dropdown and check the Custom conditions box. Choose the pre-populated + icon for more icon selections and the option to Use custom Unicode icon. This example uses the thumbs-up icon and sets the color to orange.
  4. Choose Apply to see the icons show up in the Sales column accordingly.

The following screenshots show the configuration step by step:

Please note that you have the option to Show icon only. Once checked, the text values of the target column are hidden, showing icons only.

Conditionally formatting entire rows based on calculated fields

Next, here is a slightly more complex example that involves calculated fields. This example highlights rows in which Ship charges exceed 20% of the total sales amount. This condition needs to be based on a flag that differentiates orders with high shipping charges from the rest.

Create a calculated field called high_shipping_flag using this function:

ifelse(sum({Ship charges}) > 0.2 * sum(Sales),1,0)

As a result, orders with shipping charges over 20% of the sales amount take a value of 1; otherwise, they take a value of 0.

With the base field ready, proceed with conditional formatting configuration. Since the target property for formatting is the entire row instead of a specific column, access the configuration pane from the visual dropdown menu instead of a particular column from the field well.

Scroll down to the bottom of the dropdown list and select the option to conditionally format the [Entire row] in addition to individual columns.

Note the options to Add background color or Add text color, without the Add icon option, because it is not applicable to entire rows. Also note the Apply on top checkbox, which allows the conditional formatting for the entire row to paint over conditional formatting applied to individual columns. For this example, color the entire row teal if the shipping charge flag is 1.

The following screenshots demonstrate the configuration step by step:

Pivot tables

Conditional formatting is enabled for pivot tables as well, and the configuration workflow is very similar to that of tables, with the following differences:

  • Conditional formatting can be applied only to measures (that is, fields in the Values well), not to other field types.
  • Conditional formatting in solid colors can be applied to various levels of your choice: values, subtotals, and totals.
  • Conditional formatting can only be applied to individual columns, not entire rows.

Applying gradient colors

Next, explore the conditional formatting functionalities unique to pivot tables. As mentioned above, only the two measures, or numeric fields, in the Values well can be the target fields for conditional formatting.

The following screenshot shows a conditionally formatted pivot table:

As with tables, you can access Conditional formatting configuration pane from the dropdown of either the visual or the field itself in the Values well. Note that only the two measures show up in the dropdown list.

Applying gradient colors to Percent Tested is fairly straightforward. When you select Gradient, you are prompted to define a base field, which can be any field from the dataset as well as Aggregation. Amazon QuickSight offers three pre-set gradient colors for easy application: blue, red-green, and red-yellow-green.

Choose the diverging color option. You have the option to further customize the three colors as well as the corresponding min, mid, and max values by overriding the default, as shown in the following screenshots.

Applying solid colors to column values, subtotals and totals

For Average Score, apply solid color rules to the text.

Add Average Score as the target column for conditional formatting and select Add text color. As mentioned above, only solid colors can be applied to subtotals and totals.

Note the extra text Values next to Condition, right above the comparison method dropdown box, which does not appear for tables. By default, Amazon QuickSight applies the solid color conditions to the most fine-grained values of the target field, with the dimension fields at all levels expanded.

This example examines how the average scores at the Borough level compare with each other, which requires applying conditions to both Values and Subtotals. To do so, choose the icon to expose the options. This is a multi-select dropdown list, in which you can check one or multiple options. After you check the desired boxes, Values and Subtotals show up next to Condition. You can choose Add condition after you have completed the first.

Note that the next condition added inherits the level of granularity from the previous condition. In this example, Condition #2 automatically applies to Values and Subtotals, just like in the previous one. You can override this using the check boxes in the multi-select dropdown.

The following screenshot shows the configuration and the effect.

For illustration purposes, add icons as well and enable Show subtotals for rows from the visual dropdown menu. With some rows collapsed at the Borough level, the solid color rules are applied to the average score for the collapsed boroughs, along with the school-level average scores on the expanded rows. The solid color rules also show up in the subtotals enabled for rows.

The following screenshots show conditional formatting applied to subtotals:

KPI Chart

This release also makes conditional formatting available for KPI charts so that you can easily tell whether the KPI is over or under the business-critical thresholds, using text color and icons as quick indicators. In cases in which progress bars are shown, you also have the option to conditionally format the foreground color of the progress bar. The configuration workflow is similar to that of a table or pivot table. The difference is that only the primary value field can be conditionally formatted, not the Target or the Trend group. This example uses a simple KPI chart showing Primary value only and adds an icon if the Average SAT Score is above 1200.

The following screenshots show the configuration and the effect:

Summary

This post illustrated how you can leverage conditional formatting to create richer data visualizations and thus more appealing dashboards in Amazon QuickSight. As of this writing, conditional formatting is enabled for tables, pivot tables, and KPI charts. You can expect to see enhanced capability to dynamically customize data colors based on numeric fields in other chart types as well. Stay tuned.

If you have any questions or feedback, please leave a comment.

About the Authors

Susan Fang is a senior product manager for QuickSight with AWS.

Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/evolve-your-analytics-with-amazon-quicksights-new-apis-and-theming-capabilities/

The Amazon QuickSight team is excited to announce the availability of Themes and more APIs! Themes for dashboards let you align the look and feel of Amazon QuickSight dashboards with your application’s branding or corporate themes. The new APIs added allow you to manage your Amazon QuickSight deployments programmatically, with support for dashboards, datasets, data sources, SPICE ingestion, and fine-grained access control over AWS resources. Together, they allow you to creatively tailor Amazon QuickSight to your audiences, whether you are using Amazon QuickSight to provide your users with an embedded analytics experience or for your corporate Business Intelligence (BI) needs. This post provides an overview of these new capabilities and details on getting started.

Theming your dashboards

Amazon QuickSight themes allow you to control the look and feel of your dashboards. Specifically, you can control the following items through themes, and then share the theme with other authors in the account.

  • Margins
  • Borders around visuals
  • Gutters (spacing between visuals)
  • Data colors (used within the visuals)
  • Background colors (for visuals within the dashboard and the dashboard itself)
  • Foreground colors (for text in the visuals and on the dashboard)
  • Highlight colors on visuals

The following screenshot of the dashboard highlights aspects of the dashboards you can modify via the theming options currently available:

With the available options, you can also make your QuickSight dashboards denser (e.g. no margins or gutters), as shown in the view below that uses QuickSight’s new “Seaside” theme.

You also have a dark dashboard option with QuickSight’s new “Midnight” theme as shown below.

Themes are accessible via the left-hand panel in the dashboard authoring interface.

You can create your own theme by starting with one of the built-in themes and customizing it. The interactive theme editing experience makes this easy – you can edit as little or as much as you want, and get to the perfect theme for your dashboards.

Using APIs for dashboards, datasets, data sources, SPICE ingestion, and permissions

The new APIs cover dashboards, datasets, data sources, SPICE ingestion, and fine-grained access control over S3/Athena. We’ve introduced templates in QuickSight with this release: templates allow you to store the visual configuration and data schema associated with a dashboard and then easily replicate the dashboard for different sets of users or different data, within or across accounts.

The new template APIs allow you to create templates from QuickSight analyses. You can then use the dashboard APIs to create a dashboard from the template in the same or different account, version control dashboards, and connect them to different datasets as needed. The dataset APIs allow you to create custom datasets, with filters applied at the dataset level (for example, data for a specific customer only). Dataset APIs can be used to create both SPICE and direct query datasets. SPICE APIs let you refresh datasets programmatically and view SPICE ingestion status and history. Data source APIs programmatically define data source connections. Finally, fine-grained access permissions APIs let you secure access to data in Amazon S3, Amazon Athena, and other AWS resources.

As a developer using Amazon QuickSight’s embedded capabilities, or author using Amazon QuickSight to publish dashboards on different subject areas, the new APIs allow you to automate your workflows and programmatically perform repetitive activities such as rolling out identical versions of dashboards to different teams or customers. QuickSight assets can now be migrated from dev to QA to production via API, overcoming the need to manually access each environment. With versioning, you also have your previous dashboards backed up, providing you flexibility to roll back dashboard deployments if the need arises.

As an Amazon QuickSight administrator, you can use the new APIs to audit and manage access to dashboards, data sources, and datasets across your account. You can also pre-populate dashboards and datasets into individual author accounts for easier and faster onboarding. You no longer need to distribute individual credentials to data sources: you can provision centrally-managed data sources shared with all your users. You can also manage individual author access to S3, Athena, or other AWS resources on-the-fly using IAM policies through the fine-grained access control APIs. The APIs open up a lot of possibilities with Amazon QuickSight – we’ll cover a few scenarios in this blog post and then follow up with additional posts that cover more.

The Amazon QuickSight APIs can be invoked via the AWS Software Development Kit (SDK) or the AWS Command Line Interface (CLI). For our discussion, we will use the AWS CLI. To capture all the details needed for each API, this post uses the generate-cli-skeleton option available in the AWS CLI and walks you through the example of creating a Redshift data source and creating the datasets and dashboards associated with it. This option, when invoked, provides a skeleton file with all the required API details, which you can edit with the correct inputs and use for your API calls. The code below invokes the generate-cli-skeleton option and writes the details to the create-data-source-cli-input.json file; you then modify this file and use it as the input when invoking the create-data-source API.

aws quicksight create-data-source --generate-cli-skeleton > create-data-source-cli-input.json

Let’s say that you want to help Alice, a new hire in your company, create an Amazon QuickSight dashboard from some customer revenue data in Redshift. To set things up for Alice, you want to use the new APIs to connect to Redshift and create a QuickSight data set within her QuickSight user account.

Creating a data source

To connect to your Redshift database, you’ll need to create a data source in QuickSight. First, you describe it with the following configuration, which covers the data source connection details, credentials, and permissions around who should be able to see the data sources.

cat create-data-source-cli-input.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSourceId": "SampleRedshiftDatasouce",
    "Name": "Sample Redshift Datasouce",
    "Type": "REDSHIFT",
    "DataSourceParameters": {
        "RedshiftParameters": {
            "Host": "redshift-cluster-1.xxxxxxxxxx.us-east-1.redshift.amazonaws.com",
            "Port": 5439,
            "Database": "dev",
            "ClusterId": "redshift-cluster-1"
        }
    },
    "Credentials": {
        "CredentialPair": {
            "Username": "xxxxxxxxx",
            "Password": "xxxxxxxxxxx"
        }
    },
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSourcePermissions",
                "quicksight:DescribeDataSource",
                "quicksight:DescribeDataSourcePermissions",
                "quicksight:PassDataSource",
                "quicksight:UpdateDataSource",
                "quicksight:DeleteDataSource"
                
            ]
        }
    ],
    "VpcConnectionProperties": {
        "VpcConnectionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:vpcConnection/VPC to Redshift"
    },
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-RedShiftDemoDataSource"
        }
    ]
}

You can then invoke the create-data-source API with this input to start creating your data source, as shown below.

aws quicksight create-data-source --cli-input-json file://./create-data-source-cli-input.json

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
    "DataSourceId": "SampleRedshiftDatasouce",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "ac9fb8fe-71d8-4005-a7c9-d66d814e224e"
}

You can validate that the data source was successfully created by calling the describe-data-source API as shown below and checking the response’s Status:

aws quicksight describe-data-source --aws-account-id $AAI --data-source-id SampleRedshiftDatasouce

{
    "Status": 200,
    "DataSource": {
        "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
        "DataSourceId": "SampleRedshiftDatasouce",
        "Name": "Sample Redshift Datasouce",
        "Type": "REDSHIFT",
        "Status": "CREATION_SUCCESSFUL",
        "CreatedTime": 1574053515.41,
        "LastUpdatedTime": 1574053516.368,
        "DataSourceParameters": {
            "RedshiftParameters": {
                "Host": "redshift-cluster-1.c6qmmwnqxxxx.us-east-1.redshift.amazonaws.com",
                "Port": 5439,
                "Database": "dev",
                "ClusterId": "redshift-cluster-1"
            }
        },
        "VpcConnectionProperties": {
            "VpcConnectionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:vpcConnection/VPC to Redshift"
        },
        "SslProperties": {
            "DisableSsl": false
        }
    },
    "RequestId": "57d2d6d6-ebbf-4e8c-82ab-c38935cae8aa"
}

Because you granted permissions to Alice as part of the call to create-data-source, this data source will now be visible to her when she logs in to QuickSight. If you didn’t grant user permissions, this data source is only visible to programmatic users with appropriate permissions.
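
If you later need to grant another user access to this data source without recreating it, you can call the update-data-source-permissions API. A minimal sketch, assuming a second user named bob exists in the account:

aws quicksight update-data-source-permissions \
    --aws-account-id xxxxxxxxxxxx \
    --data-source-id SampleRedshiftDatasouce \
    --grant-permissions '[{
        "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/bob",
        "Actions": [
            "quicksight:DescribeDataSource",
            "quicksight:DescribeDataSourcePermissions",
            "quicksight:PassDataSource"
        ]
    }]'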

Creating a dataset

Before Alice can build her dashboard, you’ll need to create a dataset, which identifies the specific data in a data source you want to use. To create a dataset from this data source, you can call the create-data-set API. The create-data-set API allows you to join two tables to create a desired view of data for Alice. Below, PhysicalTableMap points to the tables in your Redshift data source that you want to join, and LogicalTableMap defines the join between them.

A sample dataset definition with joins included is shown below:

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSetId": "DemoDataSet1",
    "Name": "Demo Data Set 1",
    "PhysicalTableMap": {
        "DemoDataSet1-LineOrder": {
            "RelationalTable": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Schema": "public",
                "Name": "lineorder",
                "InputColumns": [
                        {
                            "Name": "lo_custkey",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_revenue",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_quantity",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_orderkey",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_shipmode",
                            "Type": "STRING"
                        },
                        {
                            "Name": "lo_orderdate",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_ordertotalprice",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_tax",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "lo_discount",
                            "Type": "INTEGER"
                        }
                ]
	    }
            },
        "DemoDataSet1-Customer": {
            "RelationalTable": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Schema": "public",
                "Name": "customer",
                "InputColumns": [
                        {
                            "Name": "c_custkey",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "c_phone",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_address",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_name",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_city",
                            "Type": "STRING"
                        }
                ]
            }
        }
    },
    "LogicalTableMap": {
        "DemoDataSet1-LineOrder-Logical": {
            "Alias": "LineOrder",
            "Source": {
                "PhysicalTableId": "DemoDataSet1-LineOrder"
            }
        },
        "DemoDataSet1-Customer-Logical": { 
            "Alias": "Customer",
            "Source": {
                "PhysicalTableId": "DemoDataSet1-Customer"
            }
        },
        "DemoDataSet1-Intermediate": { 
            "Alias": "Intermediate",
            "DataTransforms": [
                {   
                    "ProjectOperation": {
                        "ProjectedColumns": [
                                "lo_revenue",
                                "c_name",
                                "c_city"
                        ]
                    }
                }
            ],
            "Source": {
                "JoinInstruction": {
                    "LeftOperand": "DemoDataSet1-LineOrder-Logical",
                    "RightOperand": "DemoDataSet1-Customer-Logical",
                    "Type": "INNER",
                    "OnClause": "{lo_custkey} = {c_custkey}"
                }
            }
        }

    },
    "ImportMode": "DIRECT_QUERY",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDataSet1"
        }
    ]
}
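
Assuming this configuration is saved to a file (the file name below is just an example), you can invoke the create-data-set API in the same way you invoked create-data-source:

aws quicksight create-data-set --aws-account-id $AAI --cli-input-json file://./create-data-set-cli-input.json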

If you wanted to create a similar dataset with a SQL statement that performs the join instead, call the create-data-set API specifying your SQL statement in PhysicalTableMap, as shown below:

aws quicksight create-data-set --generate-cli-skeleton >create-data-set-cli-input-sql.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSetId": "DemoDataSet2",
    "Name": "Demo Data Set 2",
    "PhysicalTableMap": {
        "DemoDataSet2-CustomerSales": {
            "CustomSql": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Name": "Customer Sales",
                "SqlQuery":"SELECT lineorder.lo_revenue, customer.c_name, customer.c_city FROM public.lineorder INNER JOIN public.customer ON lineorder.lo_custkey = customer.c_custkey",
                    "Columns": [
                        {
                            "Name": "lo_revenue",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "c_name",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_city",
                            "Type": "STRING"
                        }
                    ]
 
	    }
        }
    },
    "LogicalTableMap": {
        "DemoDataSet2-CustomerSales-Logical": {
            "Alias": "Customer Sales",
            "Source": {
                "PhysicalTableId": "DemoDataSet2-CustomerSales"
            }
        }

    },
    "ImportMode": "DIRECT_QUERY",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDataSet2"
        }
    ]
}

Because you granted permissions to Alice, these data sets will be visible to her under Your Data Sets when she logs into QuickSight.
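
You can also confirm this programmatically; the list-data-sets API returns summaries of the datasets in the account:

aws quicksight list-data-sets --aws-account-id $AAI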

This dataset can be used like any other dataset in Amazon QuickSight. Alice can log in to the QuickSight UI and create an analysis that utilizes this dataset, as shown below:

Creating a template

Once Alice creates her dashboard and it gets popular with her team, the next ask might be to provide the same view to 100 other teams within the company, using a different dataset for each team. Without APIs, this ask would be a challenge. However, with the APIs, you can now define a template from Alice’s analysis and then create dashboards as needed. A template is created by pointing to a source analysis in QuickSight, which creates references for all datasets used, so that new datasets can be referenced when using the template.

The following configuration describes a template created from Alice’s analysis:

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "TemplateId": "DemoDashboardTemplate",
    "Name": "Demo Dashboard Template",
    "SourceEntity": {
        "SourceAnalysis": {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:analysis/7975f7aa-261c-4e7c-b430-305d71e07a8e",
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "DS1",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet1"
                }
            ]
        }
    },
    "VersionDescription": "1"
}

Templates are not visible within the QuickSight UI and are a developer/admin managed asset that is only accessible via the APIs at this point. You can save the template configuration by calling the create-template API to start creation of your template:

aws quicksight create-template --cli-input-json file://./create-template-cli-input.json

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate",
    "VersionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/version/1",
    "TemplateId": "DemoDashboardTemplate",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "aa189aee-8ab5-4358-9bba-8f4e32450aef"
}

To confirm the template was created, call the describe-template API as shown below:

aws quicksight describe-template --aws-account-id $AAI --template-id DemoDashboardTemplate

{
    "Status": 200,
    "Template": {
        "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate",
        "Name": "Demo Dashboard Template",
        "Version": {
            "CreatedTime": 1574137165.136,
            "VersionNumber": 1,
            "Status": "CREATION_SUCCESSFUL",
            "DataSetConfigurations": [
                {
                    "Placeholder": "DS1",
                    "DataSetSchema": {
                        "ColumnSchemaList": [
                            {
                                "Name": "c_city",
                                "DataType": "STRING"
                            },
                            {
                                "Name": "lo_revenue",
                                "DataType": "INTEGER"
                            }
                        ]
                    },
                    "ColumnGroupSchemaList": []
                }
            ],
            "Description": "1",
            "SourceEntityArn": "arn:aws:quicksight:us-east-1: xxxxxxxxxxxx:analysis/7975f7aa-261c-4e7c-b430-305d71e07a8e"
        },
        "TemplateId": "DemoDashboardTemplate",
        "LastUpdatedTime": 1574137165.104,
        "CreatedTime": 1574137165.104
    }
}
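
Because templates don't appear in the QuickSight UI, the list-templates API is a convenient way to enumerate the templates in an account, for example when auditing or cleaning them up:

aws quicksight list-templates --aws-account-id $AAI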

Creating dashboards

You can then use this template to create a dashboard that is connected to the second, similar dataset you created earlier. When calling the create-dashboard API, you define a SourceEntity (in this case, the template we just created), permissions, and any dashboard publish options, as shown in the following configuration:

create-dashboard-cli-input.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DashboardId": "DemoDashboard1",
    "Name": "Demo Dashboard 1",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:UpdateDashboardPermissions",
                "quicksight:QueryDashboard",
                "quicksight:UpdateDashboard",
                "quicksight:DeleteDashboard",
                "quicksight:DescribeDashboardPermissions",
                "quicksight:UpdateDashboardPublishedVersion"
            ]
        }
    ],
    "SourceEntity": {
        "SourceTemplate": {
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "DS1",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet2"
                }
            ],
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate"
        }
    },
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDashboard"
        }
    ],
    "VersionDescription": "1",
    "DashboardPublishOptions": {
        "AdHocFilteringOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "ExportToCSVOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "SheetControlsOption": {
            "VisibilityState": "EXPANDED"
        }
    }
}
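
With this configuration saved as create-dashboard-cli-input.json, you can then call the create-dashboard API, following the same pattern as the earlier create calls:

aws quicksight create-dashboard --cli-input-json file://./create-dashboard-cli-input.json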

You can confirm you created the dashboard correctly with the describe-dashboard API. See the following code:

aws quicksight describe-dashboard --aws-account-id $AAI --dashboard-id DemoDashboard1

{
    "Status": 200,
    "Dashboard": {
        "DashboardId": "DemoDashboard1",
        "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dashboard/DemoDashboard1",
        "Name": "Demo Dashboard 1",
        "Version": {
            "CreatedTime": 1574138252.449,
            "Errors": [],
            "VersionNumber": 1,
            "Status": "CREATION_SUCCESSFUL",
            "SourceEntityArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/version/1",
            "Description": "1"
        },
        "CreatedTime": 1574138252.395,
        "LastPublishedTime": 1574138252.395,
        "LastUpdatedTime": 1574138252.449
    },
    "RequestId": "62945539-bb63-4faf-b897-8e84ea5644ae"
}

Because you granted permissions to Alice during creation, this dashboard will be visible to her under All dashboards when she logs into QuickSight as shown below:

Configuring SPICE datasets        

Alice’s datasets are direct query datasets and may cause higher load on your Redshift cluster during peak times, which might also cause Alice and her customers to wait longer for their dashboards to load. With SPICE, Amazon QuickSight’s in-memory data store, you can offload user traffic from your database, while also achieving high concurrency limits and fast, responsive performance for your end-users. The following configuration redefines the second dataset (as DemoDataSet3) and sets ImportMode to SPICE:

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DataSetId": "DemoDataSet3",
    "Name": "Demo Data Set 3",
    "PhysicalTableMap": {
        "DemoDataSet2-CustomerSales": {
            "CustomSql": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:datasource/SampleRedshiftDatasouce",
                "Name": "Customer Sales",
                "SqlQuery":"SELECT lineorder.lo_revenue, customer.c_name, customer.c_city FROM public.lineorder INNER JOIN public.customer ON lineorder.lo_custkey = customer.c_custkey",
                    "Columns": [
                        {
                            "Name": "lo_revenue",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "c_name",
                            "Type": "STRING"
                        },
                        {
                            "Name": "c_city",
                            "Type": "STRING"
                        }
                    ]
 
	    }
        }
    },
    "LogicalTableMap": {
        "DemoDataSet2-CustomerSales-Logical": {
            "Alias": "Customer Sales",
            "Source": {
                "PhysicalTableId": "DemoDataSet2-CustomerSales"
            }
        }

    },
    "ImportMode": "SPICE",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDataSet2"
        }
    ]
}

You can call the create-data-set API with that configuration to create the dataset in SPICE mode:

aws quicksight create-data-set --aws-account-id $AAI --cli-input-json file://./create-data-set-cli-input-sql-spice.json

{
    "Status": 201,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3",
    "DataSetId": "DemoDataSet3",
    "IngestionArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/589eaf9d-0e21-4572-9ebe-f61b129886d9",
    "IngestionId": "589eaf9d-0e21-4572-9ebe-f61b129886d9",
    "RequestId": "7a442521-ab8f-448f-8668-1f0d6823f987"
}

When creating datasets in SPICE mode, you’ll consume SPICE capacity. The describe-data-set API provides details on the SPICE capacity consumed. See the relevant bits from the describe-data-set response highlighted in the code snippet below:

aws quicksight describe-data-set --aws-account-id $AAI --data-set-id DemoDataSet3|tail -5

        "ImportMode": "SPICE",
        "ConsumedSpiceCapacityInBytes": 107027200
    },
    "RequestId": "b0175568-94ed-40e4-85cc-b382979ca377"
}

For SPICE datasets, the APIs also provide the option of setting up ingestions or listing past ingestions. Listing past ingestions allows you to identify ingestion status and when the data in the dataset was last updated. To list past ingestions, call the list-ingestions API for your dataset:

aws quicksight list-ingestions --aws-account-id $AAI --data-set-id DemoDataSet3

{
    "Status": 200,
    "Ingestions": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionId": "589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionStatus": "COMPLETED",
            "ErrorInfo": {},
            "RowInfo": {
                "RowsIngested": 1126608,
                "RowsDropped": 0
            },
            "CreatedTime": 1574138852.564,
            "IngestionTimeInSeconds": 24,
            "IngestionSizeInBytes": 107027200,
            "RequestSource": "MANUAL",
            "RequestType": "INITIAL_INGESTION"
        }
    ],
    "RequestId": "ee1d9a6f-a290-418a-a526-8906f4689776"
}

Set up ingestions to update the data in your dataset from your data source. When there is new data in your Redshift cluster, you can use the create-ingestion API for your dataset to trigger a SPICE refresh:

aws quicksight create-ingestion --aws-account-id $AAI --data-set-id DemoDataSet3 --ingestion-id DemoDataSet3-Ingestion2

{
    "Status": 201,
    "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/DemoDataSet3-Ingestion2",
    "IngestionId": "DemoDataSet3-Ingestion2",
    "IngestionStatus": "INITIALIZED",
    "RequestId": "fc1f7eea-1327-41d6-9af7-c12f097ed343"
}

Listing ingestion history again shows the new ingestion in the state RUNNING. See the following code:

aws quicksight list-ingestions --aws-account-id $AAI --data-set-id DemoDataSet3

{
    "Status": 200,
    "Ingestions": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/DemoDataSet3-Ingestion2",
            "IngestionId": "DemoDataSet3-Ingestion2",
            "IngestionStatus": "RUNNING",
            "RowInfo": {
                "RowsIngested": 0
            },
            "CreatedTime": 1574139332.876,
            "RequestSource": "MANUAL",
            "RequestType": "FULL_REFRESH"
        },
        {
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet3/ingestion/589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionId": "589eaf9d-0e21-4572-9ebe-f61b129886d9",
            "IngestionStatus": "COMPLETED",
            "ErrorInfo": {},
            "RowInfo": {
                "RowsIngested": 1126608,
                "RowsDropped": 0
            },
            "CreatedTime": 1574138852.564,
            "IngestionTimeInSeconds": 24,
            "IngestionSizeInBytes": 107027200,
            "RequestSource": "MANUAL",
            "RequestType": "INITIAL_INGESTION"
        }
    ],
    "RequestId": "322c165e-fb90-45bf-8fa5-246d3034a08e"
}
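
If you prefer to track a single refresh instead of listing the full history, you can poll the describe-ingestion API until the ingestion reaches a terminal state. A minimal shell sketch (the 15-second interval is arbitrary):

while true; do
    # Query just the status field of the ingestion
    STATUS=$(aws quicksight describe-ingestion --aws-account-id $AAI \
        --data-set-id DemoDataSet3 --ingestion-id DemoDataSet3-Ingestion2 \
        --query 'Ingestion.IngestionStatus' --output text)
    echo "Ingestion status: $STATUS"
    # COMPLETED, FAILED, and CANCELLED are terminal states
    if [ "$STATUS" = "COMPLETED" ] || [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "CANCELLED" ]; then
        break
    fi
    sleep 15
done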

Versioning and aliasing

As you may have noticed, the template and dashboard APIs support versioning. When you create a template or dashboard, version 1 of that entity is created, and each update of a template or dashboard results in a new version of that entity with an incremented version number. Versioning can be useful when you want to republish an old version of a dashboard or ensure you’re building off a known version of the template. For example, if you published your dashboard multiple times, but want to republish the first version, you can call the update-dashboard-published-version API as follows:

aws quicksight update-dashboard-published-version --aws-account-id $AAI --dashboard-id DemoDashboard1 --dashboard-version-number 1

{
   "DashboardArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dashboard/DemoDashboard1",
   "DashboardId": "DemoDashboard1",
   "RequestId": "c97cdbff-d954-4f5b-a887-28659ad9cc9a"
}

Versioning becomes even more powerful with aliases. A template or dashboard alias is a pointer to a specific template or dashboard version. Aliases are helpful when you want to point to a logical version of a resource rather than an explicit version number that needs to be manually updated. This feature can be helpful for development workflows where you want to make changes to a template and only want production code to use an approved template version. For example, you can create a ‘PROD’ template alias for your approved production template and point it to version 1, as shown below:

aws quicksight create-template-alias --aws-account-id $AAI --template-id DemoDashboardTemplate --template-version-number 1 --alias-name PROD

{
   "RequestId": "c112a48c-14da-439e-b4d8-538b60c1f188",
   "TemplateAlias": { 
      "AliasName": "PROD",
      "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/alias/PROD",
      "TemplateVersionNumber": 1
   }
}

When you create a dashboard from this template, you can specify the SourceEntity as the ‘PROD’ template alias to always publish the dashboard from the approved production template version. Note how the template Arn contains the template alias name in the call to the create-dashboard API below:

aws quicksight create-dashboard --generate-cli-skeleton > create-dashboard-cli-input.json

{
    "AwsAccountId": "xxxxxxxxxxxx",
    "DashboardId": "DemoDashboard2",
    "Name": "Demo Dashboard 2",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:user/default/alice",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:UpdateDashboardPermissions",
                "quicksight:QueryDashboard",
                "quicksight:UpdateDashboard",
                "quicksight:DeleteDashboard",
                "quicksight:DescribeDashboardPermissions",
                "quicksight:UpdateDashboardPublishedVersion"
            ]
        }
    ],
    "SourceEntity": {
        "SourceTemplate": {
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "DS1",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet2"
                }
            ],
            "Arn": "arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/alias/PROD"
        }
    },
    "Tags": [
        {
            "Key": "Name",
            "Value": "API-DemoDashboard"
        }
    ],
    "VersionDescription": "1",
    "DashboardPublishOptions": {
        "AdHocFilteringOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "ExportToCSVOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "SheetControlsOption": {
            "VisibilityState": "EXPANDED"
        }
    }
}
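
After you edit the generated skeleton (saved above as create-dashboard-cli-input.json) to match the JSON shown, you can pass the file back to the create-dashboard API. The following is a minimal sketch of that call, assuming the account ID placeholder in the file has been replaced with your account ID:

aws quicksight create-dashboard --cli-input-json file://create-dashboard-cli-input.json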

If you create new versions of the template, you can then update the ‘PROD’ template alias to point to the new template version, without touching your dashboard creation code from above. For example, to make template version 3 your approved production version, call the update-template-alias API as shown below:

aws quicksight update-template-alias --aws-account-id $AAI --template-id DemoDashboardTemplate --template-version-number 3 --alias-name PROD

{
   "RequestId": "5ac9ff9e-28c1-4b61-aeab-5949986f5b2b",
   "TemplateAlias": { 
      "AliasName": "PROD",
      "Arn": "arn:aws:quicksight:us-east-
1:xxxxxxxxxxxx:template/DemoDashboardTemplate/alias/PROD",
      "TemplateVersionNumber": 3
   }
}
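
Because the SourceEntity in your dashboard code still points to the ‘PROD’ alias, re-running that code now picks up template version 3. As a minimal sketch (reusing the dashboard ID, placeholder, and dataset ARN from the create call above), you could refresh DemoDashboard2 with the update-dashboard API; note that this creates a new dashboard version, which you then publish with update-dashboard-published-version:

aws quicksight update-dashboard --aws-account-id $AAI --dashboard-id DemoDashboard2 --name "Demo Dashboard 2" --source-entity '{"SourceTemplate":{"DataSetReferences":[{"DataSetPlaceholder":"DS1","DataSetArn":"arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:dataset/DemoDataSet2"}],"Arn":"arn:aws:quicksight:us-east-1:xxxxxxxxxxxx:template/DemoDashboardTemplate/alias/PROD"}}'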

Keep an eye out for a future blog post that will detail more examples of using versioning and aliases for large-scale deployments!

Amazon QuickSight’s data source, dataset, dashboard, and SPICE APIs are available in both Standard and Enterprise editions, though some functionality is edition specific. Fine-grained access control APIs and template support for dashboard APIs are available in Amazon QuickSight Enterprise edition. APIs for Amazon QuickSight are part of the AWS SDK and are available in a variety of programming languages. For more information, see API Reference and Tools to Build on AWS.

Conclusion

With QuickSight’s new themes, APIs, and additional customization options underway, there are many creative ways to author the next chapter in your analytics story. To be a part of future API previews and connect with us, sign up for the embedding early access program.

About the Author

Jose Kunnackal John is a principal product manager for Amazon QuickSight.


Enhancing dashboard interactivity with Amazon QuickSight Actions

Post Syndicated from Sahitya Pandiri original https://aws.amazon.com/blogs/big-data/enhancing-dashboard-interactivity-with-amazon-quicksight-actions/

Amazon QuickSight now offers enhanced dashboard interactivity through QuickSight Actions. Actions provide advanced filtering through single point-and-click interactions on dashboards. With Actions, you can link visuals within a dashboard so that selecting a dimensional point on one visual surfaces granular insights about that point on other visuals in the dashboard. You can therefore start with summaries and dive deep into the details of your business metrics, all within the same dashboard sheet. You define which visuals in your dashboard are interactive and how they interact with each other. As of this writing, QuickSight Actions let you define two types of interactivity: filter actions and URL actions. URL actions aren’t new to Amazon QuickSight, but the point of entry to create them is now consolidated with Actions.

You can apply QuickSight Actions to any supported chart that holds at least one dimension. This post provides examples of getting started with Actions, configuring different actions on a dashboard, and enabling different forms of interactivity for each configured action.

This post uses the following data sets:

B2B Sales
This data set holds order details for a fictitious company, ABCDO, for 2016 and 2017. The dashboard we build reports on sales metrics with industry, segment, and region as primary dimensions, and also provides granular details for each order purchased.

Product Availability
This data set holds available quantity for every product by ID.

Prerequisites

Before implementing Actions on your Amazon QuickSight dashboards, review how to create and publish dashboards.

Getting started with QuickSight Actions

The following screenshot shows a dashboard built from the two data sets above. It shows sales by category, industry, and region in row 1; segment sales by quarter and industry sales by segment in row 2; total profit, sales, discount, and quantity sold in row 3; an order details pivot table in row 4; and a shipping details pivot table in row 5.

Before getting started, note the following terminology:

  • Source visual – The visual on which an action is created. Choosing a point on the source visual triggers the action, and the dimensional values chosen are passed as filters to target visuals.
  • Target visual – The visual that is filtered by the dimensional values chosen on the source visual.
  • Activation – You can trigger an action either by selecting a point directly (left-click) or by selecting it from the menu options (right-click).
  • Action type – You can configure two types of actions on a visual: filter actions and URL actions. Filter actions pass selected (or all) dimensions as filters to selected (or all) visuals on the dashboard. URL actions let you navigate from the dashboard to an external site or to a different sheet within the same dashboard, with the selected dimension values passed along.

Setting up a click/select action

To set up a click/select action, complete the following steps:

  1. Select the source visual in the analysis and choose Actions. The following screenshot shows this step.
  2. Within Actions, you can either create and configure your own actions or use the 1-click action setup. The 1-click setup creates a default action on the visual; when you select a point on the visual, all dimensions of that point are passed as filters to all other visuals on the dashboard. Under Quick create, choose Filter same-sheet visuals.

This creates a default action, which you can modify at any time.

After creating the 1-click action, you can select any point on the source visual. The following screenshot shows that the selected point is highlighted and all the underlying dimensions are used as filters on all other visuals on the analysis or dashboard.

In this screenshot, selecting Copper and Diamonds for the year 2017 passes these two as filters to all other visuals.

To verify which filter actions are applied to a particular visual, choose the filter icon at the top right of the visual. This shows all filters applied through filter actions, as well as regular filters.

Setting up custom filter actions

To perform further analysis on the sales metrics for any segment, configure an action on the Industry sales by Segment visual. To configure a filter action that passes segment as a filter to the KPI charts, complete the following steps:

  1. Choose the source visual and choose Actions.
  2. Choose Define a custom action.
  3. For Action name, enter a name for your action. You can also make the name more descriptive for your readers by adding dynamic placeholder values; to do so, choose the plus sign next to Action name. You have four configuration settings. While this post uses values specific to this use case, you can choose other values according to your needs.
  4. For Activation, select Menu option.
  5. For Action type, choose Filter action.
  6. For Filter scope, select Selected fields and choose segment.
  7. For Target visuals, select Select visuals and check the target visuals Total quantity, Total discount offered, and Total profit by year.
  8. Choose Save.

To view menu options and choose a filter action, right-click any segment on the visual. This post selects Strategic. The following screenshot shows that all the Strategic points are highlighted and the KPIs are updated.

Adding filter actions on pivot tables

Filter actions on pivot tables provide exceptional flexibility to pass anything from the innermost to the outermost dimensions. Additionally, when a filter action is triggered, the source pivot table is highlighted to show values that are passed as filters to the target visuals.

This post adds click/select filter actions on the Order details pivot table to analyze shipping details for each order. This post also adds two filter actions to the menu options: one to filter by the product name and the other by the customer name.

To create click/select filter actions on order_id, complete the following steps:

  1. Choose the Order details pivot table chart.
  2. Choose Actions.
  3. Choose Define custom action.
  4. For Action name, enter your action name.
  5. For Activation, select Select.
  6. For Filter scope, select Selected fields and check order_id.
  7. For Target visuals, select Select visuals, and check Shipping details.
  8. Choose Save.

The following screenshot shows that when choosing any point on the pivot table, the order pertaining to the point is highlighted and the Shipping details table shows shipping details for the selected order.

You can create only one select-activated action on a source visual, but any number of menu option actions.

You can now create additional actions in the menu. To create the new actions, go to Actions, choose the plus sign, and create the two actions (one filtering by product name and one by customer name).

After creating the two actions, they are visible in the Actions menu. See the following screenshot.

After setting up these action filters, you can right-click on the visual and trigger these filters from the menu.

The following screenshot shows the menu that appears when you right-click the product name.

The following screenshot shows the menu that appears when you right-click the customer name.

Removing applied filter actions

There are three ways to remove a triggered filter action:

  • On source visuals with filter actions triggered via select or menu options, clear the filter action by choosing the selected point again or a different point.
  • On source visuals with empty space within the visual (such as the bar chart family or pie charts), choosing the empty space also clears the filter action selection. This doesn’t apply to heat maps and tree maps; deselect points on these charts by choosing an already selected point.
  • On target visuals with a highlighted filter icon, choosing the X on the displayed filter action removes the filter from that visual and from all other visuals the filter is applied to.

Using action filters when the source and target visuals use different data sets

If you use multiple data sets within an analysis, Amazon QuickSight automatically maps fields across these data sets by field name. For example, product_id is a field in both the B2B Sales and Product Availability data sets. When you set a filter action on product_id from a source visual that uses the B2B Sales data set, the target visual (the adjacent KPI) showing product quantity availability shows results for the product_id selected on the source visual. This implicit mapping of fields is case sensitive.

You can add the new data set Product Availability to this analysis. For more information, see Adding a Data Set to an Analysis.

Add a KPI chart to show the count of product_id, as in the following screenshot.

To show the total quantity available for a selected product, create a filter action activated from the menu options, as shown in the following screenshot:

To drill down into product quantity availability, choose (right-click) product_id on the order details pivot table. The following screenshot shows the product availability details for the selected product_id.

Creating custom URL actions

Custom URL actions aren’t new to Amazon QuickSight, but the point of entry to create them is now consolidated with Actions. To create a custom URL action on the Order details pivot table that shows Google search results for a customer, complete the following steps:

  1. Choose the Order details pivot table.
  2. Choose Actions.
  3. Choose Add a new action.
  4. For Action name, enter an action name.
  5. For Action type, choose URL action.
  6. Enter the URL https://www.google.com/search?q=<<customer_name>>. Here, <<customer_name>> is a placeholder that is replaced with the value of the customer_name field you select.
  7. Choose New browser tab.
  8. Choose Save.

When you right-click on the pivot table, menu options appear with your two filter actions created previously and the new custom URL action. See the following screenshot.

Conclusion

With Amazon QuickSight, you can create filter actions on any chart type with at least one dimensional value. You can also create URL actions on all chart types, including KPIs and gauge charts, which don’t hold dimensional values. As of this writing, QuickSight Actions don’t support selecting multiple data points within the same source visual, cascading filter actions (in which you choose a filter action on one visual and a subsequent filter on another visual), or adding custom mappings between different data sets used in the dashboard. You can expect to see these features in future releases.

QuickSight Actions are available in both Standard and Enterprise editions in all QuickSight-supported Regions.

About the Author

Sahitya Pandiri is a technical program manager with Amazon Web Services. Sahitya has been in product/program management for 6 years now, and has built multiple products in the retail, healthcare, and analytics space.