Tag Archives: Amazon QuickSight

AWS Weekly Roundup: AWS Parallel Computing Service, Amazon EC2 status checks, and more (September 2, 2024)

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-aws-parallel-computing-service-amazon-ec2-status-checks-and-more-september-2-2024/

With the arrival of September, AWS re:Invent 2024 is now 3 months away and I am very excited about the new services and announcements coming at the conference. I remember attending re:Invent 2019, just before the COVID-19 pandemic. It was the biggest in-person re:Invent with 60,000+ attendees, and it was my second one. It was amazing to be in that atmosphere! Registration is now open for AWS re:Invent 2024. Come join us in Las Vegas for five exciting days of keynotes, breakout sessions, chalk talks, interactive learning opportunities, and career-changing connections!

Now let’s look at last week’s new announcements.

Last week’s launches
Here are the launches that got my attention.

Announcing AWS Parallel Computing Service – AWS Parallel Computing Service (AWS PCS) is a new managed service that lets you run and scale high performance computing (HPC) workloads on AWS. You can build scientific and engineering models and run simulations using a fully managed Slurm scheduler with built-in technical support and a rich set of customization options. Tailor your HPC environment to your specific needs and integrate it with your preferred software stack. Build complete HPC clusters that integrate compute, storage, networking, and visualization resources, and seamlessly scale from zero to thousands of instances. To learn more, visit AWS Parallel Computing Service and read Channy’s blog post.

Amazon EC2 status checks now support reachability health of attached EBS volumes – You can now use Amazon EC2 status checks to directly monitor if the Amazon EBS volumes attached to your instances are reachable and able to complete I/O operations. With this new status check, you can quickly detect attachment issues or volume impairments that may impact the performance of your applications running on Amazon EC2 instances. You can further integrate these status checks within Auto Scaling groups to monitor the health of EC2 instances and replace impacted instances to ensure high availability and reliability of your applications. Attached EBS status checks can be used along with the instance status and system status checks to monitor the health of your instances. To learn more, refer to the Status checks for Amazon EC2 instances documentation.
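If you monitor fleet health programmatically, the new check surfaces in the same describe call as the existing instance and system checks. The following is a minimal sketch; the instance ID is a placeholder, and the AttachedEbsStatus field name is an assumption based on the announcement, so verify it against your CLI version.

# Describe the status checks for one instance; the attached EBS check is
# reported alongside the instance and system checks (field name assumed).
aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --query 'InstanceStatuses[].{Instance:InstanceId,System:SystemStatus.Status,InstanceCheck:InstanceStatus.Status,AttachedEbs:AttachedEbsStatus.Status}'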

Amazon QuickSight now supports sharing views of embedded dashboards – You can now share views of embedded dashboards in Amazon QuickSight. This feature allows you to enable more collaborative capabilities in your application with embedded QuickSight dashboards. Additionally, you can enable personalization capabilities such as bookmarks for anonymous users. You can share a unique link that displays only your changes while staying within the application, and use dashboard or console embedding to generate a shareable link to your application page with QuickSight’s reference encapsulated using the QuickSight Embedding SDK. QuickSight Readers can then send this shareable link to their peers. When their peer accesses the shared link, they are taken to the page on the application that contains the embedded QuickSight dashboard. For more information, refer to Embedded view documentation.

Amazon Q Business launches IAM federation for user identity authentication – Amazon Q Business is a fully managed service that deploys a generative AI business expert for your enterprise data. You can use the Amazon Q Business IAM federation feature to connect your applications directly to your identity provider to source user identity and user attributes for these applications. Previously, you had to sync your user identity information from your identity provider into AWS IAM Identity Center, and then connect your Amazon Q Business applications to IAM Identity Center for user authentication. At launch, Amazon Q Business IAM federation will support the OpenID Connect (OIDC) and SAML 2.0 protocols for identity provider connectivity. To learn more, visit the Amazon Q Business documentation.

Amazon Bedrock now supports cross-Region inference – Amazon Bedrock announces support for cross-Region inference, an optional feature that enables you to seamlessly manage traffic bursts by utilizing compute across different AWS Regions. If you are using on-demand mode, you’ll be able to get higher throughput limits (up to 2x your allocated in-Region quotas) and enhanced resilience during periods of peak demand by using cross-Region inference. By opting in, you no longer have to spend time and effort predicting demand fluctuations. Instead, cross-Region inference dynamically routes traffic across multiple Regions, ensuring optimal availability for each request and smoother performance during high-usage periods. You can control where your inference data flows by selecting from a pre-defined set of Regions, helping you comply with applicable data residency requirements and sovereignty laws. Find the list at Supported Regions and models for cross-Region inference. To get started, refer to the Amazon Bedrock documentation or this Machine Learning blog.
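Cross-Region inference is used by invoking an inference profile instead of a foundation model ID. The following CLI call is a sketch; the profile ID is a placeholder, and you can list the profiles available to your account with aws bedrock list-inference-profiles (assuming a recent CLI version).

# Invoke a model through a cross-Region inference profile (placeholder ID);
# Bedrock routes the request to an available Region on your behalf.
aws bedrock-runtime converse \
  --model-id "<cross-region-inference-profile-id>" \
  --messages '[{"role":"user","content":[{"text":"Summarize cross-Region inference in one sentence."}]}]'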

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

We launched existing services and instance types in additional Regions:

Other AWS events
AWS GenAI Lofts are collaborative spaces and immersive experiences that showcase AWS’s cloud and AI expertise, while providing startups and developers with hands-on access to AI products and services, exclusive sessions with industry leaders, and valuable networking opportunities with investors and peers. Find a GenAI Loft location near you and don’t forget to register.

Gen AI loft workshop

credit: Antje Barth

Upcoming AWS events
Check your calendar and sign up for upcoming AWS events:

AWS Summits are free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. AWS Summits for this year are coming to an end. There are three more that you can still register for: Jakarta (September 5), Toronto (September 11), and Ottawa (October 9).

AWS Community Days feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. While AWS Summits 2024 are almost over, AWS Community Days are in full swing. Upcoming AWS Community Days are in Belfast (September 6), SF Bay Area (September 13), where our own Antje Barth is a keynote speaker, Argentina (September 14), and Armenia (September 14).

Browse all upcoming AWS-led in-person and virtual events here.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Esra

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Set up cross-account AWS Glue Data Catalog access using AWS Lake Formation and AWS IAM Identity Center with Amazon Redshift and Amazon QuickSight

Post Syndicated from Poulomi Dasgupta original https://aws.amazon.com/blogs/big-data/set-up-cross-account-aws-glue-data-catalog-access-using-aws-lake-formation-and-aws-iam-identity-center-with-amazon-redshift-and-amazon-quicksight/

Most organizations manage their workforce identities centrally in external identity providers (IdPs) and are made up of multiple business units that produce their own datasets and manage their lifecycle across multiple AWS accounts. These business units have varying landscapes, where a data lake is managed by Amazon Simple Storage Service (Amazon S3) and analytics workloads are run on Amazon Redshift, a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data.

Business units that create data products like to share them with others, without copying the data, so that others can analyze them and derive insights. They also want tighter control over user access and the ability to audit access to their data products. To address this, enterprises usually catalog the datasets in the AWS Glue Data Catalog for data discovery and use AWS Lake Formation for fine-grained access control to adhere to the compliance and operating security model for their business units. Given the diverse range of services, fine-grained data sharing, and personas involved, these enterprises often want a streamlined experience for enterprise user identities when accessing their data using AWS analytics services.

AWS IAM Identity Center enables centralized management of workforce user access to AWS accounts and applications using a local identity store or by connecting corporate directories using IdPs. Amazon Redshift and AWS Lake Formation are integrated with the new trusted identity propagation capability in IAM Identity Center, allowing you to use third-party IdPs such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin.

With trusted identity propagation, Lake Formation enables data administrators to directly provide fine-grained access to corporate users and groups, and simplifies the traceability of end-to-end data access across supported AWS services. Because access is managed based on a user’s corporate identity, end-users don’t need to use database local user credentials or assume an AWS Identity and Access Management (IAM) role to access data. Furthermore, this enables effective user permissions based on collective group membership and supports group hierarchy.

In this post, we cover how to enable trusted identity propagation with AWS IAM Identity Center, Amazon Redshift, and AWS Lake Formation residing on separate AWS accounts and set up cross-account sharing of an S3 data lake for enterprise identities using AWS Lake Formation to enable analytics using Amazon Redshift. Then we use Amazon QuickSight to build insights using Redshift tables as our data source.

Solution overview

This post covers a use case where an organization centrally manages corporate users within their IdP and where the users belong to multiple business units. Their goal is to enable centralized user authentication through IAM Identity Center in the management account, while keeping the business unit that analyzes data using a Redshift cluster and the business unit that produces data cataloged in the Data Catalog in separate member accounts. This allows them to maintain a single authentication mechanism through IAM Identity Center within the organization while retaining access control, resource, and cost separation through the use of separate AWS accounts per business unit and enabling cross-account data sharing using Lake Formation.

For this solution, AWS Organizations is enabled in the central management account and IAM Identity Center is configured for managing workforce identities. The organization has two member accounts: one account that manages the S3 data lake using the Data Catalog, and another account that runs analytical workloads on Amazon Redshift and QuickSight, with all the services enabled for trusted identity propagation. Amazon Redshift and QuickSight in member account 1 access the cross-account AWS Glue resources as the IAM Identity Center users and groups set up in the central management account. In member account 2, permissions on the AWS Glue resources are managed using Lake Formation and are shared with member account 1 using Lake Formation data sharing.

The following diagram illustrates the solution architecture.

The solution consists of the following:

  • In the centralized management account, we create a permission set and create account assignments for Redshift_Member_Account. We integrate users and groups from the IdP with IAM Identity Center.
  • Member account 1 (Redshift_Member_Account) is where the Redshift cluster and application exist.
  • Member account 2 (Glue_Member_Account) is where metadata is cataloged in the Data Catalog and Lake Formation is enabled with IAM Identity Center integration.
  • We assign permissions to two IAM Identity Center groups to access the Data Catalog resources:
    • awssso-sales – We apply column-level filtering for this group so that users belonging to this group will be able to select two columns and read all rows.
    • awssso-finance – We apply row-level filtering using data filters for this group so that users belonging to this group will be able to select all columns and see rows after row-level filtering is applied.
  • We apply different permissions for three IAM Identity Center users:
    • User Ethan, part of awssso-sales – Ethan will be able to select two columns and read all rows.
    • User Frank, part of awssso-finance – Frank will be able to select all columns and see rows after row-level filtering is applied.
    • User Brian, part of awssso-sales and awssso-finance – Brian inherits permissions defined for both groups.
  • We set up QuickSight in the same account where Amazon Redshift exists, enabling authentication using IAM Identity Center.

Prerequisites

You should have the following prerequisites already set up:

Member account 2 configuration

Sign in to the Lake Formation console as the data lake administrator. To learn more about setting up permissions for a data lake administrator, see Create a data lake administrator.

In this section, we walk through the steps to set up Lake Formation, enable Lake Formation permissions, and grant database and table permissions to IAM Identity Center groups.

Set up Lake Formation

Complete the steps in this section to set up Lake Formation.

Create AWS Glue resources

You can use an existing AWS Glue database that has a few tables. For this post, we use a database called customerdb and a table called reviews whose data is stored in the S3 bucket lf-datalake-<account-id>-<region>.

Register the S3 bucket location

Complete the following steps to register the S3 bucket location (a CLI equivalent is shown after these steps):

  • On the Lake Formation console, in the navigation pane, under Administration, choose Data lake locations.
  • Choose Register location.
  • For Amazon S3 location, enter the S3 bucket location that contains table data.
  • For IAM role, provide a user-defined IAM role. For instructions to create a user-defined IAM role, refer to Requirements for roles used to register locations.
  • For Permission mode, select Lake Formation.
  • Choose Register location.
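As an alternative to the console steps, the same registration can be done with the AWS CLI. This is a sketch; the bucket name follows the convention used in this post, and the role name is a placeholder for your user-defined registration role.

# Register the data lake location with Lake Formation using a user-defined role.
aws lakeformation register-resource \
  --resource-arn "arn:aws:s3:::lf-datalake-<account-id>-<region>" \
  --role-arn "arn:aws:iam::<account-id>:role/<lf-register-role>"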

Set cross-account version

Complete the following steps to set your cross-account version:

  • Sign in to the Lake Formation console as the data lake admin.
  • In the navigation pane, under Administration, choose Data Catalog settings.
  • Under Cross-account version settings, keep the latest version (Version 4) as the current cross-account version.
  • Choose Save.

Add permissions required for cross-account access

If the AWS Glue Data Catalog resource policy is already enabled in the account, you can either remove the policy or add the following permissions to it; they are required for cross-account grants. The policy allows AWS Resource Access Manager (AWS RAM) to share a resource policy while cross-account grants are made using Lake Formation. For more information, refer to Prerequisites. If your policy is blank under Catalog settings, skip to the next step. (A CLI sketch follows these steps.)

  • Sign in to the AWS Glue console as an IAM admin.
  • In the navigation pane, under Data Catalog, choose Catalog settings.
  • Under Permissions, add the following policy, and provide the account ID where your AWS Glue resources exist:
{ "Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ram.amazonaws.com"
},
"Action": "glue:ShareResource",
"Resource": [
"arn:aws:glue:us-east-1:<account-id>:table/*/*",
"arn:aws:glue:us-east-1:<account-id>:database/*",
"arn:aws:glue:us-east-1:<account-id>:catalog"
]
}
]
}
  • Choose Save.
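The same policy can be applied with the AWS CLI. The following sketch assumes the policy document is saved locally as glue-ram-policy.json; the --enable-hybrid flag is shown because it lets the resource policy coexist with Lake Formation cross-account grants, but confirm it matches your catalog settings before using it.

# Apply the Data Catalog resource policy that allows AWS RAM to share resources.
aws glue put-resource-policy \
  --policy-in-json file://glue-ram-policy.json \
  --enable-hybrid TRUE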

For more information, see Granting cross-account access.

Enable IAM Identity Center integration for Lake Formation

To integrate IAM Identity Center with your Lake Formation organization instance of IAM Identity Center, refer to Connecting Lake Formation with IAM Identity Center.

To enable cross-account sharing for IAM Identity Center users and groups, add the target recipient accounts to your Lake Formation IAM Identity Center integration under the AWS account and organization IDs.

  • Sign in to the Lake Formation console as a data lake admin.
  • In the navigation pane, under Administration, choose IAM Identity Center integration.
  • Under AWS account and organization IDs, choose Add.
  • Enter your target accounts.
  • Choose Add.

Enable Lake Formation permissions for databases

For Data Catalog databases that contain tables that you might share, you can stop new tables from having the default grant of Super to IAMAllowedPrincipals. Complete the following steps:

  • Sign in to the Lake Formation console as a data lake admin.
  • In the navigation pane, under Data Catalog, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Edit.
  • Under Default permissions for newly created tables, deselect Use only IAM access control for new tables in this database.
  • Choose Save.

For Data Catalog databases, remove IAMAllowedPrincipals.

  • Under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose View.
  • Select IAMAllowedPrincipals and choose Revoke.

Repeat the same steps for tables under the customerdb database.
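A CLI sketch of the same revocation follows. IAM_ALLOWED_PRINCIPALS is the group principal behind the default Super grant (Super maps to ALL in the API); repeat the table-level call for each table in customerdb.

# Revoke the default IAMAllowedPrincipals grant at the database and table level.
aws lakeformation revoke-permissions \
  --principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
  --resource '{"Database":{"Name":"customerdb"}}' \
  --permissions ALL

aws lakeformation revoke-permissions \
  --principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
  --resource '{"Table":{"DatabaseName":"customerdb","Name":"reviews"}}' \
  --permissions ALL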

Grant database permissions to IAM Identity Center groups

Complete the following steps to grant database permissions to your IAM Identity Center groups:

  • On the Lake Formation console, under Data Catalog, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select your IAM Identity Center group names and choose Assign.

  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Database permissions, select Describe.
  • Choose Grant.

Grant table permissions to IAM Identity Center groups

In the following sections, we grant different permissions to our two IAM Identity Center groups.

Column filter

We first add permissions to the group awssso-sales. This group will have access to the customerdb database and be able to select only two columns and read all rows.

  • On the Lake Formation console, under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select awssso-sales and choose Assign.

  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Tables, choose reviews.
  • Under Table permissions, select Select.
  • Select Column-based access.
  • Select Include columns and choose product_title and star_rating.
  • Choose Grant.

Row filter

Next, we grant permissions to awssso-finance. This group will have access to customerdb and be able to select all columns and apply filters on rows.

First, we need to create a data filter by performing the following steps (a CLI equivalent is shown after the steps):

  • On the Lake Formation console, choose Data filters under Data Catalog.
  • Choose Create data filter.
  • For Data filter name, provide a name.
  • For Target database, choose customerdb.
  • For Target table, choose reviews.
  • For Column-level access, select Access to all columns.
  • For Row-level access, choose Filter rows and apply your filter. In this example, we are filtering reviews with star_rating as 5.
  • Choose Create data filter.
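The same filter can be created with the AWS CLI. The sketch below assumes the filter is named High_Rating, matching the name referenced in the grant steps that follow; <account-id> is the catalog ID of member account 2.

# Create a row-level data filter that keeps rows with star_rating = 5 and
# grants access to all columns (ColumnWildcard).
aws lakeformation create-data-cells-filter \
  --table-data '{
    "TableCatalogId": "<account-id>",
    "DatabaseName": "customerdb",
    "TableName": "reviews",
    "Name": "High_Rating",
    "RowFilter": {"FilterExpression": "star_rating = 5"},
    "ColumnWildcard": {}
  }'

Next, grant the data filter to the awssso-finance group: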

  • Under Data Catalog in the navigation pane, choose Databases.
  • Select the database customerdb.
  • Choose Actions, then choose Grant.
  • Select IAM Identity Center.
  • Choose Add and select Get Started.
  • Search for and select awssso-finance and choose Assign.
  • Select Named Data Catalog resources.
  • Under Databases, choose customerdb.
  • Under Tables, choose reviews.
  • Under Data Filters, choose the High_Rating data filter that you created.
  • Under Data Filter permissions, select Select.
  • Choose Grant.

Member account 1 configuration

In this section, we walk through the steps to add Amazon Redshift Spectrum table access in member account 1, where the Redshift cluster and application exist.

Accept the invite from AWS RAM

You should have received an AWS Resource Access Manager (AWS RAM) invite in member account 1 when you added it as a target account under the IAM Identity Center integration in Lake Formation in member account 2. Accept the resource share using the following steps (a CLI equivalent follows them):

  • Navigate to the AWS RAM console in member account 1.
  • Under Shared with me, choose Resource shares.
  • Select the resource share and choose Accept resource share.
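The following CLI sketch does the same from member account 1; the invitation ARN comes from the output of the first command.

# List pending invitations, then accept the share from member account 2.
aws ram get-resource-share-invitations \
  --query 'resourceShareInvitations[?status==`PENDING`]'

aws ram accept-resource-share-invitation \
  --resource-share-invitation-arn "<invitation-arn>"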

Make sure that you have followed the linked blog post in its entirety to establish the Amazon Redshift integration with IAM Identity Center before proceeding with the next steps.

Set up Redshift Spectrum table access for the IAM Identity Center group

Complete the following steps to set up Redshift Spectrum table access:

  1. Sign in to the Amazon Redshift console using the admin role.
  2. Navigate to Query Editor v2.
  3. Choose the options menu (three dots) next to the cluster and choose Create connection.
  4. Connect as the admin user and run the following commands to make the shared resource link data in the S3 data lake available to the sales group (use the account ID where the Data Catalog exists):
create external schema if not exists <schema_name> from DATA CATALOG database '<glue_catalog_name>' catalog_id '<accountid>';
grant usage on schema <schema_name> to role "<role_name>";

For example:

create external schema if not exists cross_account_glue_schema from DATA CATALOG database 'customerdb' catalog_id '932880906720';
grant usage on schema cross_account_glue_schema to role "awsidc:awssso-sales";
grant usage on schema cross_account_glue_schema to role "awsidc:awssso-finance";

Validate Redshift Spectrum access as an IAM Identity Center user

Complete the following steps to validate access:

  • On the Amazon Redshift console, navigate to Query Editor v2.
  • Choose the options menu (three dots) next to the cluster and choose Create connection.
  • Select IAM Identity Center.
  • Enter your Okta user name and password in the browser pop-up.

  • When you’re connected as a federated user, run the following SQL commands to query the cross_account_glue_schema data lake table.
select * from "dev"."cross_account_glue_schema"."reviews";

The following screenshot shows that user Ethan, who is part of the awssso-sales group, has access to two columns and all rows from the Data Catalog.

The following screenshot shows that user Frank, who is part of the awssso-finance group, has access to all columns for records that have star_rating as 5.

The following screenshot shows that user Brian, who is part of awssso-sales and awssso-finance, has access to all columns for records that have star_rating as 5 and access to only two columns (other columns are returned NULL) for records with star_rating other than 5.

Subscribe to QuickSight with IAM Identity Center

In this post, we set up QuickSight in the same account where the Redshift cluster exists. You can use the same or a different member account for QuickSight setup. To subscribe to QuickSight, complete the following steps:

  • Sign in to your AWS account and open the QuickSight console.
  • Choose Sign up for QuickSight.

  • Enter a notification email address for the QuickSight account owner or group. This email address will receive service and usage notifications.
  • Select the identity option that you want to subscribe with. For this post, we select Use AWS IAM Identity Center.
  • Enter a QuickSight account name.
  • Choose Configure.

  • Next, assign groups in IAM Identity Center to roles in QuickSight (admin, author, and reader). This step enables your users to access the QuickSight application. In this post, we choose awssso-sales and awssso-finance for the Admin group.
  • Specify an IAM role to control QuickSight access to your AWS resources. In this post, we select Use QuickSight managed role (default).
  • For this post, we deselect Add Paginated Reports.
  • Review the choices that you made, then choose Finish.

Enable trusted identity propagation in QuickSight

Trusted identity propagation authenticates the end-user in Amazon Redshift when they access QuickSight assets that use a trusted identity propagation enabled data source. When an author creates a data source with trusted identity propagation, the identity of the data source consumers in QuickSight is propagated and logged in AWS CloudTrail. This allows database administrators to centrally manage data security in Amazon Redshift and automatically apply data security rules to data consumers in QuickSight.

To configure QuickSight to connect to Amazon Redshift data sources with trusted identity propagation, configure Amazon Redshift OAuth scopes to your QuickSight account:

aws quicksight update-identity-propagation-config --aws-account-id "AWSACCOUNTID" --service "REDSHIFT" --authorized-targets "IAM Identity Center managed application ARN"

For example:

aws quicksight update-identity-propagation-config --aws-account-id "1234123123" --service "REDSHIFT" --authorized-targets "arn:aws:sso::XXXXXXXXXXXX:application/ssoins-XXXXXXXXXXXX/apl-XXXXXXXXXXXX"

After you have added the scope, the following command lists all OAuth scopes that are currently on a QuickSight account:

aws quicksight list-identity-propagation-configs --aws-account-id "AWSACCOUNTID"

The following is an example command with its output:

aws quicksight list-identity-propagation-configs --aws-account-id "1234123123"
{
"Status": 200,
"Services": [
{
"Service": "REDSHIFT",
"AuthorizedTargets": [
"arn:aws:sso::1004123000:application/ssoins-1234f1234bb1f123/apl-12a1234e2e391234"
]
}
],
"RequestId": "116ec1b0-1533-4ed2-b5a6-d7577e073b35"
}

For more information, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.

For QuickSight to connect to a Redshift instance, you must add an appropriate IP address range in the Redshift security group for the specific AWS Region. For more information, see AWS Regions, websites, IP address ranges, and endpoints.
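A sketch of the corresponding security group rule is shown below; the security group ID and the QuickSight IP range for your Region are placeholders, and 5439 is the default Redshift port.

# Allow inbound connections from the QuickSight IP range for your Region.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5439 \
  --cidr <quicksight-region-ip-range>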

Test your IAM Identity Center and Amazon Redshift integration with QuickSight

Now you’re ready to connect to Amazon Redshift using QuickSight.

  • In the management account, open the IAM Identity Center console and copy the AWS access portal URL from the dashboard.
  • Sign out from the management account and enter the AWS access portal URL in a new browser window.
  • In the pop-up window, enter your IdP credentials.
  • On the Applications tab, select the QuickSight app.
  • After you federate to QuickSight, choose Datasets.
  • Select New Dataset and then choose Redshift (Auto Discovered).
  • Enter your data source details. Make sure to select Single sign-on for Authentication method.
  • Choose Create data source.

Congratulations! You’re signed in using IAM Identity Center integration with Amazon Redshift and are ready to explore and analyze your data using QuickSight.

The following screenshot from QuickSight shows that user Ethan, who is part of the awssso-sales group, has access to two columns and all rows from the Data Catalog.

The following screenshot from QuickSight shows that user Frank, who is part of the awssso-finance group, has access to all columns for records that have star_rating as 5.

The following screenshot from QuickSight shows that user Brian, who is part of awssso-sales and awssso-finance, has access to all columns for records that have star_rating as 5 and access to only two columns (other columns are returned NULL) for records with star_rating other than 5.

Clean up

Complete the following steps to clean up your resources (a CLI sketch follows the list):

  • Delete the data from the S3 bucket.
  • Delete the Data Catalog objects that you created as part of this post.
  • Delete the Lake Formation resources and QuickSight account.
  • If you created a new Redshift cluster for testing this solution, delete the cluster.
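The following CLI sketch covers most of the cleanup; the bucket, database, and cluster identifiers are placeholders, and --skip-final-cluster-snapshot discards the cluster data, so use it only for a test cluster.

# Remove the sample data, the Data Catalog database, and the test cluster.
aws s3 rm "s3://lf-datalake-<account-id>-<region>" --recursive
aws glue delete-database --name customerdb
aws redshift delete-cluster \
  --cluster-identifier <cluster-identifier> \
  --skip-final-cluster-snapshot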

Conclusion

In this post, we established cross-account access to enable centralized user authentication through IAM Identity Center in the management account, while keeping the Amazon Redshift and AWS Glue resources isolated by business unit in separate member accounts. We used Query Editor V2 for querying the data from Amazon Redshift. Then we showed how to build user-facing dashboards by integrating with QuickSight. Refer to Integrate Tableau and Okta with Amazon Redshift using AWS IAM Identity Center to learn about integrating Tableau and Okta with Amazon Redshift using IAM Identity Center.

Learn more about IAM Identity Center with Amazon Redshift, QuickSight, and Lake Formation. Leave your questions and feedback in the comments section.


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

AWS Weekly Roundup: Global AWS Heroes Summit, AWS Lambda, Amazon Redshift, and more (July 22, 2024)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-global-aws-heroes-summit-aws-lambda-amazon-redshift-and-more-july-22-2024/

Last week, AWS Heroes from around the world gathered to celebrate the 10th anniversary of the AWS Heroes program at Global AWS Heroes Summit. This program recognizes a select group of AWS experts worldwide who go above and beyond in sharing their knowledge and making an impact within developer communities.

Matt Garman, CEO of AWS and a long-time supporter of developer communities, made a special appearance for a Q&A session with the Heroes to listen to their feedback and respond to their questions.

Here’s an epic photo from the AWS Heroes Summit:

As Matt mentioned in his LinkedIn post, “The developer community has been core to everything we have done since the beginning of AWS.” Thank you, Heroes, for all you do. Wishing you all a safe flight home.

Last week’s launches
Here are some launches that caught my attention last week:

Announcing the July 2024 updates to Amazon Corretto — The latest updates for the Corretto distribution of OpenJDK are now available. This includes security and critical updates for the Long-Term Supported (LTS) and Feature Release (FR) versions.

New open-source Advanced MySQL ODBC Driver now available for Amazon Aurora and RDS — The new AWS ODBC Driver for MySQL provides faster switchover and failover times, and authentication support for AWS Secrets Manager and AWS Identity and Access Management (IAM), making it a more efficient and secure option for connecting to Amazon RDS and Amazon Aurora MySQL-compatible edition databases.

Productionize Fine-tuned Foundation Models from SageMaker Canvas — Amazon SageMaker Canvas now allows you to deploy fine-tuned Foundation Models (FMs) to SageMaker real-time inference endpoints, making it easier to integrate generative AI capabilities into your applications outside the SageMaker Canvas workspace.

AWS Lambda now supports SnapStart for Java functions that use the ARM64 architecture — Lambda SnapStart for Java functions on ARM64 architecture delivers up to 10x faster function startup performance and up to 34% better price performance compared to x86, enabling the building of highly responsive and scalable Java applications using AWS Lambda.

Amazon QuickSight improves controls performance — Amazon QuickSight has improved the performance of controls, allowing readers to interact with them immediately without having to wait for all relevant controls to reload. This enhancement reduces the loading time experienced by readers.

Amazon OpenSearch Serverless levels up speed and efficiency with smart caching — The new smart caching feature for indexing in Amazon OpenSearch Serverless automatically fetches and manages data, leading to faster data retrieval, efficient storage usage, and cost savings.

Amazon Redshift Serverless with lower base capacity available in the Europe (London) Region — Amazon Redshift Serverless now allows you to start with a lower data warehouse base capacity of 8 Redshift Processing Units (RPUs) in the Europe (London) region, providing more flexibility and cost-effective options for small to large workloads.

AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions — AWS Lambda now supports Amazon MQ for ActiveMQ and RabbitMQ in five new regions, enabling you to build serverless applications with Lambda functions that are invoked based on messages posted to Amazon MQ message brokers.

From community.aws
Here are my top five personal favorite posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. To learn more about future AWS Summit events, visit the AWS Summit page. Register in your nearest city: AWS Summit Taipei (July 23–24), AWS Summit Mexico City (Aug. 7), and AWS Summit Sao Paulo (Aug. 15).

AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world. Upcoming AWS Community Days are in Aotearoa (Aug. 15), Nigeria (Aug. 24), New York (Aug. 28), and Belfast (Sept. 6).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Building a scalable streaming data platform that enables real-time and batch analytics of electric vehicles on AWS

Post Syndicated from Ayush Agrawal original https://aws.amazon.com/blogs/big-data/building-a-scalable-streaming-data-platform-that-enables-real-time-and-batch-analytics-of-electric-vehicles-on-aws/

The automobile industry has undergone a remarkable transformation because of the increasing adoption of electric vehicles (EVs). EVs, known for their sustainability and eco-friendliness, are paving the way for a new era in transportation. As environmental concerns and the push for greener technologies have gained momentum, the adoption of EVs has surged, promising to reshape our mobility landscape.

The surge in EVs brings with it a profound need for data acquisition and analysis to optimize their performance, reliability, and efficiency. In the rapidly evolving EV industry, the ability to harness, process, and derive insights from the massive volume of data generated by EVs has become essential for manufacturers, service providers, and researchers alike.

As the EV market expands, with many new and incumbent players trying to capture market share, the performance of the vehicles will be a major differentiating factor.

Modern EVs are equipped with an array of sensors and systems that continuously monitor various aspects of their operation including parameters such as voltage, temperature, vibration, speed, and so on. From battery management to motor performance, these data-rich machines provide a wealth of information that, when effectively captured and analyzed, can revolutionize vehicle design, enhance safety, and optimize energy consumption. The data can be used for predictive maintenance, device anomaly detection, real-time customer alerts, remote device management, and monitoring.

However, managing this deluge of data isn’t without its challenges. As the adoption of EVs accelerates, the need for robust data pipelines capable of collecting, storing, and processing data from an exponentially growing number of vehicles becomes more pronounced. Moreover, the granularity of data generated by each vehicle has increased significantly, making it essential to efficiently handle the ever-increasing number of data points. The challenges include not only the technical intricacies of data management but also concerns related to data security, privacy, and compliance with evolving regulations.

In this blog post, we delve into the intricacies of building a reliable data analytics pipeline that can scale to accommodate millions of vehicles, each generating hundreds of metrics every second using Amazon OpenSearch Ingestion. We also provide guidelines and sample configurations to help you implement a solution.

Of the prerequisites that follow, the IoT topic rule and the Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster can be set up by following How to integrate AWS IoT Core with Amazon MSK. The steps to create an Amazon OpenSearch Service cluster are available in Creating and managing Amazon OpenSearch Service domains.

Prerequisites

Before you begin implementing the solution, you need the following:

  • AWS IoT topic rule
  • Amazon MSK Simple Authentication and Security Layer/Salted Challenge Response Mechanism (SASL/SCRAM) cluster
  • Amazon OpenSearch Service domain

Solution overview

The following architecture diagram provides a scalable and fully managed modern data streaming platform. The architecture uses Amazon OpenSearch Ingestion to stream data into OpenSearch Service and Amazon Simple Storage Service (Amazon S3) to store the data. The data in OpenSearch powers real-time dashboards. The data can also be used to notify customers of any failures occurring on the vehicle (see Configuring alerts in Amazon OpenSearch Service). The data in Amazon S3 is used for business intelligence and long-term storage.

Architecture diagram

In the following sections, we focus on the following three critical pieces of the architecture in depth:

1. Amazon MSK to OpenSearch ingestion pipeline

2. Amazon OpenSearch Ingestion pipeline to OpenSearch Service

3. Amazon OpenSearch Ingestion to Amazon S3

Solution walkthrough

Step 1: MSK to Amazon OpenSearch Ingestion pipeline

Because each electric vehicle streams massive volumes of data to Amazon MSK clusters through AWS IoT Core, making sense of this data avalanche is critical. OpenSearch Ingestion provides a fully managed serverless integration to tap into these data streams.

The Amazon MSK source in OpenSearch Ingestion uses Kafka’s Consumer API to read records from one or more MSK topics. The MSK source in OpenSearch Ingestion seamlessly connects to MSK to ingest the streaming data into OpenSearch Ingestion’s processing pipeline.

The following snippet illustrates the pipeline configuration for an OpenSearch Ingestion pipeline used to ingest data from an MSK cluster.

While creating an OpenSearch Ingestion pipeline, add the following snippet in the Pipeline configuration section.

version: "2"
msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true                  
      topics: 
         - name: "ev-device-topic " 
           group_id: "opensearch-consumer" 
           serde_format: json                 
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        sts_role_arn: "arn:aws:iam:: ::<<account-id>>:role/opensearch-pipeline-Role"
        # Provide the region of the domain. 
        region: "<<region>>" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:<<region>>:<<account-id>>:cluster/<<name>>/<<id>>" 

When configuring Amazon MSK and OpenSearch Ingestion, it’s essential to establish an optimal relationship between the number of partitions in your Kafka topics and the number of OpenSearch Compute Units (OCUs) allocated to your ingestion pipelines. This optimal configuration ensures efficient data processing and maximizes throughput. You can read more about it in Configure recommended compute units (OCUs) for the Amazon MSK pipeline.

Step 2: OpenSearch Ingestion pipeline to OpenSearch Service

OpenSearch Ingestion offers a direct method for streaming EV data into OpenSearch. The OpenSearch sink plugin channels data from multiple sources directly into the OpenSearch domain. Instead of manually provisioning the pipeline, you define the capacity for your pipeline using OCUs. Each OCU provides 6 GB of memory and two virtual CPUs. To use OpenSearch Ingestion auto-scaling optimally, it’s essential to configure the maximum number of OCUs for a pipeline based on the number of partitions in the topics being ingested. If a topic has a large number of partitions (for example, more than 96, which is the maximum OCUs per pipeline), it’s recommended to configure the pipeline with a maximum of 1–96 OCUs. This way, the pipeline can automatically scale up or down within this range as needed. However, if a topic has a low number of partitions (for example, fewer than 96), it’s advisable to set the maximum number of OCUs to be equal to the number of partitions. This approach ensures that each partition is processed by a dedicated OCU enabling parallel processing and optimal performance. In scenarios where a pipeline ingests data from multiple topics, the topic with the highest number of partitions should be used as a reference to configure the maximum OCUs. Additionally, if higher throughput is required, you can create another pipeline with a new set of OCUs for the same topic and consumer group, enabling near-linear scalability.

OpenSearch Ingestion provides several pre-defined configuration blueprints that can help you quickly build your ingestion pipeline on AWS.

The following snippet illustrates pipeline configuration for an OpenSearch Ingestion pipeline using OpenSearch as a SINK with a dead letter queue (DLQ) to Amazon S3. When a pipeline encounters write errors, it creates DLQ objects in the configured S3 bucket. DLQ objects exist within a JSON file as an array of failed events.

sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          hosts: [ "https://<<domain-name>>.<<region>>.es.amazonaws.com" ] 
          aws: 
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com 
            sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>" 
          # Provide the region of the domain. 
            region: "<<region>>" 
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection 
          # serverless: true 
          # index name can be auto-generated from topic name 
          index: "index_ev_pipe-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          #distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in an S3 bucket 
          dlq: 
            s3: 
            # Provide an S3 bucket 
              bucket: "<<bucket-name>>"
            # Provide a key path prefix for the failed requests
              key_path_prefix: "oss-pipeline-errors/dlq"
            # Provide the region of the bucket.
              region: "<<region>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
              sts_role_arn: "arn:aws:iam:: <<account-id>>:role/<<role-name>>"

Step 3: OpenSearch Ingestion to Amazon S3

OpenSearch Ingestion offers a built-in sink for loading streaming data directly into S3. The service can compress, partition, and optimize the data for cost-effective storage and analytics in Amazon S3. Data loaded into S3 can be partitioned for easier query isolation and lifecycle management. Partitions can be based on vehicle ID, date, geographic region, or other dimensions as needed for your queries.

The following snippet illustrates how we’ve partitioned and stored EV data in Amazon S3.

- s3:
            aws:
              # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
                sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
              # Provide the region of the domain.
                region: "<<region>>"
            # Replace with the bucket to send the logs to
            bucket: "evbucket"
            object_key:
              # Optional path_prefix for your s3 objects
              path_prefix: "index_ev_pipe/year=%{yyyy}/month=%{MM}/day=%{dd}/hour=%{HH}"
            threshold:
              event_collect_timeout: 60s
            codec:
              parquet:
                auto_schema: true

The pipeline can be created following the steps in Creating Amazon OpenSearch Ingestion pipelines.

The following is the complete pipeline configuration, combining the configuration of all three steps. Update the Amazon Resource Names (ARNs), AWS Region, OpenSearch Service domain endpoint, and S3 bucket names as needed.

The entire OpenSearch Ingestion pipeline configuration can be directly copied into the Pipeline configuration field in the AWS Management Console while creating the OpenSearch Ingestion pipeline.

version: "2"
msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true           # Default is false  
      topics: 
         - name: "<<msk-topic-name>>" 
           group_id: "opensearch-consumer" 
           serde_format: json        
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
        # Provide the region of the domain. 
        region: "<<region>>" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:us-east-1:<<account-id>>:cluster/<<cluster-name>>/<<cluster-id>>" 
  processor:
      - parse_json:
  sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          hosts: [ "https://<<opensearch-service-domain-endpoint>>.us-east-1.es.amazonaws.com" ] 
          aws: 
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com 
            sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>" 
          # Provide the region of the domain. 
            region: "<<region>>" 
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection 
          # index name can be auto-generated from topic name 
          index: "index_ev_pipe-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          #distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in an S3 bucket 
          dlq: 
            s3: 
            # Provide an S3 bucket 
              bucket: "<<bucket-name>>"
            # Provide a key path prefix for the failed requests
              key_path_prefix: "oss-pipeline-errors/dlq"
            # Provide the region of the bucket.
              region: "<<region>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
              sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
      - s3:
            aws:
              # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
                sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
              # Provide the region of the domain.
                region: "<<region>>"
            # Replace with the bucket to send the logs to
            bucket: "<<bucket-name>>"
            object_key:
              # Optional path_prefix for your s3 objects
              path_prefix: "index_ev_pipe/year=%{yyyy}/month=%{MM}/day=%{dd}/hour=%{HH}"
            threshold:
              event_collect_timeout: 60s
            codec:
              parquet:
                auto_schema: true
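If you prefer the AWS CLI over the console, the configuration above can be deployed as shown in the following sketch. It assumes the configuration is saved as ev-pipeline.yaml; the pipeline name and OCU limits are examples and should follow the partition-based sizing guidance discussed earlier.

# Create the OpenSearch Ingestion pipeline from the saved configuration file.
aws osis create-pipeline \
  --pipeline-name ev-msk-pipeline \
  --min-units 1 \
  --max-units 4 \
  --pipeline-configuration-body file://ev-pipeline.yaml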

Real-time analytics

After the data is available in OpenSearch Service, you can build real-time monitoring and notifications. OpenSearch Service has robust support for multiple notification channels, allowing you to receive alerts through services like Slack, Chime, custom webhooks, Microsoft Teams, email, and Amazon Simple Notification Service (Amazon SNS).

The following screenshot illustrates supported notification channels in OpenSearch Service.

The notification feature in OpenSearch Service allows you to create monitors that watch for certain conditions or changes in your data and trigger alerts, such as monitoring vehicle telemetry data and alerting on issues like battery degradation or abnormal energy consumption. For example, you can create a monitor that analyzes battery capacity over time and notifies the on-call team using Slack if capacity drops below expected degradation curves in a significant number of vehicles. This could indicate a potential manufacturing defect requiring investigation.

In addition to notifications, OpenSearch Service makes it easy to build real-time dashboards to visually track metrics across your fleet of vehicles. You can ingest vehicle telemetry data like location, speed, fuel consumption, and so on, and visualize it on maps, charts, and gauges. Dashboards can provide real-time visibility into vehicle health and performance.

The following screenshot illustrates creating a sample dashboard on OpenSearch Service.

OpenSearch dashboard

A key benefit of OpenSearch Service is its ability to handle high sustained ingestion and query rates with millisecond latencies. It distributes incoming vehicle data across data nodes in a cluster for parallel processing. This allows OpenSearch to scale out to handle very large fleets while still delivering the real-time performance needed for operational visibility and alerting.

Batch analytics

After the data is available in Amazon S3, you can build a secure data lake to power a variety of analytics use cases deriving powerful insights. As an immutable store, new data is continually stored in S3 while existing data remains unaltered. This serves as a single source of truth for downstream analytics.
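For example, the hour-partitioned Parquet objects written by the pipeline can be queried in place with Amazon Athena. The following sketch assumes a Glue table named ev_telemetry has been created over the index_ev_pipe/ prefix (for example, by a crawler); the column names and the query results location are illustrative placeholders.

# Run an ad hoc Athena query over one day of partitioned EV telemetry.
aws athena start-query-execution \
  --work-group primary \
  --result-configuration "OutputLocation=s3://<query-results-bucket>/athena/" \
  --query-string "SELECT vehicle_id, avg(battery_temperature) AS avg_temp
                  FROM ev_telemetry
                  WHERE year = '2024' AND month = '09' AND day = '02'
                  GROUP BY vehicle_id
                  ORDER BY avg_temp DESC
                  LIMIT 10;"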

For business intelligence and reporting, you can analyze trends, identify insights, and create rich visualizations powered by the data lake. You can use Amazon QuickSight to build and share dashboards without needing to set up servers or infrastructure. Here’s an example of a QuickSight dashboard for IoT device data. For example, you can use a dashboard to gain insights from historical data that can help with better vehicle and battery design.

The Amazon QuickSight public gallery shows examples of dashboards across different domains.

Consider OpenSearch dashboards for your day-to-day operational use cases, such as identifying issues and alerting in near real time, and use Amazon QuickSight to analyze the big data stored in your lake house and generate actionable insights from it.

Clean up

Delete the OpenSearch pipeline and Amazon MSK cluster to stop incurring costs on these services.

Conclusion

In this post, you learned how Amazon MSK, OpenSearch Ingestion, OpenSearch Service, and Amazon S3 can be integrated to ingest, process, store, analyze, and act on endless streams of EV data efficiently.

With OpenSearch Ingestion as the integration layer between streams and storage, the entire pipeline scales up and down automatically based on demand. No more complex cluster management or lost data from bursts in streams.

See Amazon OpenSearch Ingestion to learn more.


About the authors

Ayush Agrawal is a Startups Solutions Architect from Gurugram, India with 11 years of experience in Cloud Computing. With a keen interest in AI, ML, and Cloud Security, Ayush is dedicated to helping startups navigate and solve complex architectural challenges. His passion for technology drives him to constantly explore new tools and innovations. When he’s not architecting solutions, you’ll find Ayush diving into the latest tech trends, always eager to push the boundaries of what’s possible.

Fraser SequeiraFraser Sequeira is a Solutions Architect with AWS based in Mumbai, India. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS.

Simplify custom contact center insights with Amazon Connect analytics data lake

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/simplify-custom-contact-center-insights-with-amazon-connect-analytics-data-lake/

Analytics are vital to the success of a contact center. Having insights into each touchpoint of the customer experience allows you to accurately measure performance and adapt to shifting business demands. While you can find common metrics in the Amazon Connect console, sometimes you need more detailed or custom reporting based on the unique needs of your business.

Starting today, the Amazon Connect analytics data lake is generally available. As announced in preview last year, this new capability helps you eliminate the need to build and maintain complex data pipelines. The Amazon Connect analytics data lake is zero-ETL capable, so no extract, transform, and load (ETL) is needed.

Here’s a quick look at the Amazon Connect analytics data lake:

Improving your customer experience with Amazon Connect
Amazon Connect analytics data lake helps you to unify disparate data sources, including customer contact records and agent activity, into a single location. By having your data in a centralized location, you now have access to analyze contact center performance and gain insights while reducing the costs associated with implementing complex data pipelines.

With Amazon Connect analytics data lake, you can access and analyze contact center data, such as contact trace records and Amazon Connect Contact Lens data. This provides you the flexibility to prepare and analyze data with Amazon Athena and use the business intelligence (BI) tools of your choice, such as Amazon QuickSight and Tableau.

Get started with the Amazon Connect analytics data lake
To get started with the Amazon Connect analytics data lake, you’ll first need to have an Amazon Connect instance set up. You can follow the steps on the Create an Amazon Connect instance page to create a new Amazon Connect instance. Because I’ve already created my Amazon Connect instance, I will go straight to showing you how you can get started with Amazon Connect analytics data lake.

First, I navigate to the Amazon Connect console and select my instance.

Then, on the next page, I can set up my analytics data lake by navigating to Analytics tools and selecting Add data share.

This brings up a pop-up dialog, and I first need to define the target AWS account ID. With this option, I can set up a centralized account to receive all data from Amazon Connect instances running in multiple accounts. Then, under Data types, I can select the types I need to share with the target AWS account. To learn more about the data types that you can share in the Amazon Connect analytics data lake, please visit Associate tables for Analytics data lake.

Once it’s done, I can see the list of all the target AWS account IDs with which I have shared all the data types.

Besides using the AWS Management Console, I can also use the AWS Command Line Interface (AWS CLI) to associate my tables with the analytics data lake. The following is a sample command:

$> aws connect batch-associate-analytics-data-set --cli-input-json file://input_batch_association.json

Where input_batch_association.json is a JSON file that contains association details. Here’s a sample:

{
    "InstanceId": "<YOUR_INSTANCE_ID>",
    "DataSetIds": [
        "<DATA_SET_ID>"
    ],
    "TargetAccountId": "<YOUR_ACCOUNT_ID>"
}
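
If you script this step with the AWS SDK for Python (Boto3) instead of the AWS CLI, the same association call looks roughly like the following sketch. The instance ID, data set ID, and target account ID are placeholders.

import boto3

# Placeholder values: replace with your Amazon Connect instance ID, the
# data set IDs you want to share, and the target AWS account ID.
INSTANCE_ID = "11111111-2222-3333-4444-555555555555"
DATA_SET_IDS = ["<DATA_SET_ID>"]
TARGET_ACCOUNT_ID = "123456789012"

connect = boto3.client("connect")

# Associate the selected data sets with the target account; this is what
# creates the underlying AWS RAM resource share for the analytics data lake.
response = connect.batch_associate_analytics_data_set(
    InstanceId=INSTANCE_ID,
    DataSetIds=DATA_SET_IDS,
    TargetAccountId=TARGET_ACCOUNT_ID,
)

print("Created:", response.get("Created", []))
print("Errors:", response.get("Errors", []))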

Next, I need to approve (or reject) the request in the AWS Resource Access Manager (RAM) console in the target account. RAM is a service to help you securely share resources across AWS accounts. I navigate to AWS RAM and select Resource shares in the Shared with me section.

Then, I select the resource and select Accept resource share.
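
If you want to accept the share programmatically in the target account instead, a minimal Boto3 sketch looks like the following; in practice you would also match on the share name or the sending account before accepting.

import boto3

# Run this with credentials for the target account that received the share.
ram = boto3.client("ram")

# List invitations shared with this account and accept the pending ones.
invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
for invitation in invitations:
    if invitation["status"] == "PENDING":
        print("Accepting resource share:", invitation["resourceShareName"])
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation["resourceShareInvitationArn"]
        )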

At this stage, I can access shared resources from Amazon Connect. Now, I can start creating linked tables from shared tables in AWS Lake Formation. In the Lake Formation console, I navigate to the Tables page and select Create table.

I need to create a Resource link to a shared table. Then, I fill in the details and select the available Database and the Shared table’s region.

Then, when I select Shared table, it will list all the available shared tables that I can access.

Once I select the shared table, it will automatically populate Shared table’s database and Shared table’s owner ID. Once I’m happy with the configuration, I select Create.
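
The Lake Formation console creates this resource link for you through the AWS Glue API. If you prefer to script it, the following Boto3 sketch does the same thing; the database, table, and account ID values are placeholders for the shared resources in your setup.

import boto3

glue = boto3.client("glue")

# Placeholder values: the local database that will hold the resource link,
# the source (Amazon Connect) account ID, and the shared database and table.
LOCAL_DATABASE = "connect_datalake"
SOURCE_ACCOUNT_ID = "111122223333"
SHARED_DATABASE = "<shared_database_name>"
SHARED_TABLE = "<shared_table_name>"

# A resource link is a Data Catalog table whose TargetTable points at the
# table shared with you through AWS RAM and Lake Formation.
glue.create_table(
    DatabaseName=LOCAL_DATABASE,
    TableInput={
        "Name": f"{SHARED_TABLE}_link",
        "TargetTable": {
            "CatalogId": SOURCE_ACCOUNT_ID,
            "DatabaseName": SHARED_DATABASE,
            "Name": SHARED_TABLE,
        },
    },
)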

To run some queries on the data, I go to the Amazon Athena console. The following is an example of a query that I ran:
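
As an illustration, the following Boto3 sketch runs a similar query through the Athena API. The database name, resource link name, and S3 output location are placeholders; use the database and linked table you created in Lake Formation and a results bucket you own.

import time
import boto3

athena = boto3.client("athena")

# Placeholder names: adjust to your own database, resource link, and bucket.
DATABASE = "connect_datalake"
TABLE = "contact_record_link"
OUTPUT_LOCATION = "s3://amzn-s3-demo-bucket/athena-results/"

query = athena.start_query_execution(
    QueryString=f'SELECT * FROM "{TABLE}" LIMIT 10',
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then print the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])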

With this configuration, I have access to certain Amazon Connect data types. I can even visualize the data by integrating with Amazon QuickSight. The following screenshot shows some visuals in the Amazon QuickSight dashboard with data from Amazon Connect.

Customer voice
During the preview period, we heard lots of feedback from our customers about the Amazon Connect analytics data lake. Here’s what our customers had to say:

Joulica is an analytics platform supporting insights for software like Amazon Connect and Salesforce. Tony McCormack, founder and CEO of Joulica, said, “Our core business is providing real-time and historical contact center analytics to Amazon Connect customers of all sizes. In the past, we frequently had to set up complex data pipelines, and so we are excited about using Amazon Connect analytics data lake to simplify the process of delivering actionable intelligence to our shared customers.”

Things you need to know

  • Pricing — With the Amazon Connect analytics data lake, you can use up to 2 years of data without any additional charges in Amazon Connect. You only pay for the services you use to interact with the data.
  • Availability — Amazon Connect analytics data lake is generally available in the following AWS Regions: US East (N. Virginia), US West (Oregon), Africa (Cape Town), Asia Pacific (Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), and Europe (Frankfurt, London).
  • Learn more — For more information, please visit the Analytics data lake documentation page.

Happy building,
Donnie

Simplify data lake access control for your enterprise users with trusted identity propagation in AWS IAM Identity Center, AWS Lake Formation, and Amazon S3 Access Grants

Post Syndicated from Shoukat Ghouse original https://aws.amazon.com/blogs/big-data/simplify-data-lake-access-control-for-your-enterprise-users-with-trusted-identity-propagation-in-aws-iam-identity-center-aws-lake-formation-and-amazon-s3-access-grants/

Many organizations use external identity providers (IdPs) such as Okta or Microsoft Azure Active Directory to manage their enterprise user identities. These users interact with and run analytical queries across AWS analytics services. To enable them to use the AWS services, their identities from the external IdP are mapped to AWS Identity and Access Management (IAM) roles within AWS, and access policies are applied to these IAM roles by data administrators.

Given the diverse range of services involved, different IAM roles may be required for accessing the data. Consequently, administrators need to manage permissions across multiple roles, a task that can become cumbersome at scale.

To address this challenge, you need a unified solution to simplify data access management using your corporate user identities instead of relying solely on IAM roles. AWS IAM Identity Center offers a solution through its trusted identity propagation feature, which is built upon the OAuth 2.0 authorization framework.

With trusted identity propagation, data access management is anchored to a user’s identity, which can be synchronized to IAM Identity Center from external IdPs using the System for Cross-domain Identity Management (SCIM) protocol. Integrated applications exchange OAuth tokens, and these tokens are propagated across services. This approach empowers administrators to grant access directly based on existing user and group memberships federated from external IdPs, rather than relying on IAM users or roles.

In this post, we showcase the seamless integration of AWS analytics services with trusted identity propagation by presenting an end-to-end architecture for data access flows.

Solution overview

Let’s consider a fictional company, OkTank. OkTank has multiple user personas that use a variety of AWS Analytics services. The user identities are managed externally in an external IdP: Okta. User1 is a Data Analyst and uses the Amazon Athena query editor to query AWS Glue Data Catalog tables with data stored in Amazon Simple Storage Service (Amazon S3). User2 is a Data Engineer and uses Amazon EMR Studio notebooks to query Data Catalog tables and also query raw data stored in Amazon S3 that is not yet cataloged to the Data Catalog. User3 is a Business Analyst who needs to query data stored in Amazon Redshift tables using the Amazon Redshift Query Editor v2. Additionally, this user builds Amazon QuickSight visualizations for the data in Redshift tables.

OkTank wants to simplify governance by centralizing data access control for their variety of data sources, user identities, and tools. They also want to define permissions directly on their corporate user or group identities from Okta instead of creating IAM roles for each user and group and managing access on the IAM role. In addition, for their audit requirements, they need the capability to map data access to the corporate identity of users within Okta for enhanced tracking and accountability.

To achieve these goals, we use trusted identity propagation with the aforementioned services and use AWS Lake Formation and Amazon S3 Access Grants for access controls. We use Lake Formation to centrally manage permissions to the Data Catalog tables and Redshift tables shared with Redshift datashares. In our scenario, we use S3 Access Grants for granting permission for the Athena query result location. Additionally, we show how to access a raw data bucket governed by S3 Access Grants with an EMR notebook.

Data access is audited with AWS CloudTrail and can be queried with AWS CloudTrail Lake. This architecture showcases the versatility and effectiveness of AWS analytics services in enabling efficient and secure data analysis workflows across different use cases and user personas.

We use Okta as the external IdP, but you can also use other IdPs like Microsoft Azure Active Directory. Users and groups from Okta are synced to IAM Identity Center. In this post, we have three groups, as shown in the following diagram.

User1 needs to query a Data Catalog table with data stored in Amazon S3. The S3 location is secured and managed by Lake Formation. The user connects to an IAM Identity Center enabled Athena workgroup using the Athena query editor with EMR Studio. The IAM Identity Center enabled Athena workgroups need to be secured with S3 Access Grants permissions for the Athena query results location. With this feature, you can also enable the creation of identity-based query result locations that are governed by S3 Access Grants. These user identity-based S3 prefixes let users in an Athena workgroup keep their query results isolated from other users in the same workgroup. The following diagram illustrates this architecture.

User2 needs to query the same Data Catalog table as User1. This table is governed using Lake Formation permissions. Additionally, the user needs to access raw data in another S3 bucket that isn’t cataloged to the Data Catalog and is controlled using S3 Access Grants; in the following diagram, this is shown as S3 Data Location-2.

The user uses an EMR Studio notebook to run Spark queries on an EMR cluster. The EMR cluster uses a security configuration that integrates with IAM Identity Center for authentication and uses Lake Formation for authorization. The EMR cluster is also enabled for S3 Access Grants. With this kind of hybrid access management, you can use Lake Formation to centrally manage permissions for your datasets cataloged to the Data Catalog and use S3 Access Grants to centrally manage access to your raw data that is not yet cataloged to the Data Catalog. This gives you flexibility to access data managed by either of the access control mechanisms from the same notebook.

User3 uses the Redshift Query Editor V2 to query a Redshift table. The user also accesses the same table with QuickSight. For our demo, we use a single user persona for simplicity, but in reality, these could be completely different user personas. To enable access control with Lake Formation for Redshift tables, we use data sharing in Lake Formation.

Data access requests by the specific users are logged to CloudTrail. Later in this post, we also briefly touch upon using CloudTrail Lake to query the data access events.

In the following sections, we demonstrate how to build this architecture. We use AWS CloudFormation to provision the resources. AWS CloudFormation lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code. We also use the AWS Command Line Interface (AWS CLI) and AWS Management Console to complete some steps.

The following diagram shows the end-to-end architecture.

Prerequisites

Complete the following prerequisite steps:

  1. Have an AWS account. If you don’t have an account, you can create one.
  2. Have IAM Identity Center set up in a specific AWS Region.
  3. Make sure you use the same Region where you have IAM Identity Center set up throughout the setup and verification steps. In this post, we use the us-east-1 Region.
  4. Have Okta set up with three different groups and users, and enable sync to IAM Identity Center. Refer to Configure SAML and SCIM with Okta and IAM Identity Center for instructions.

After the Okta groups are pushed to IAM Identity Center, you can see the users and groups on the IAM Identity Center console, as shown in the following screenshot. You need the group IDs of the three groups to be passed in the CloudFormation template.
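
One way to look up these group IDs without using the console is through the IAM Identity Center APIs. The following Boto3 sketch assumes the groups are named Group1, Group2, and Group3 in IAM Identity Center; adjust the display names to match your Okta groups.

import boto3

sso_admin = boto3.client("sso-admin")
identitystore = boto3.client("identitystore")

# Find the identity store backing your IAM Identity Center instance.
instance = sso_admin.list_instances()["Instances"][0]
identity_store_id = instance["IdentityStoreId"]

# Resolve each group's ID by its display name.
for group_name in ("Group1", "Group2", "Group3"):
    groups = identitystore.list_groups(
        IdentityStoreId=identity_store_id,
        Filters=[{"AttributePath": "DisplayName", "AttributeValue": group_name}],
    )["Groups"]
    if groups:
        print(group_name, "=>", groups[0]["GroupId"])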

  5. To enable User2 access using the EMR cluster, you need to have an SSL certificate .zip file available in your S3 bucket. You can download the following sample certificate to use in this post. In production use cases, you should create and use your own certificates. You need to reference the bucket name and the certificate bundle .zip file in AWS CloudFormation. The CloudFormation template lets you choose the components you want to provision. If you do not intend to deploy the EMR cluster, you can ignore this step.
  6. Have an administrator user or role to run the CloudFormation stack. The user or role should also be a Lake Formation administrator to grant permissions.

Deploy the CloudFormation stack

The CloudFormation template provided in the post lets you choose the components you want to provision from the solution architecture. In this post, we enable all components, as shown in the following screenshot.

Run the provided CloudFormation stack to create the solution resources. The following list describes the important parameters, grouped by parameter group.

Choose components to provision – Choose the components you want to be provisioned:

  • DeployAthenaFlow – Yes/No. If you choose No, you can ignore the parameters in the “Athena Configuration” group.
  • DeployEMRFlow – Yes/No. If you choose No, you can ignore the parameters in the “EMR Configuration” group.
  • DeployRedshiftQEV2Flow – Yes/No. If you choose No, you can ignore the parameters in the “Redshift Configuration” group.
  • CreateS3AGInstance – Yes/No. If you already have an S3 Access Grants instance, choose No. Otherwise, choose Yes to let the stack create a new S3 Access Grants instance. The S3 Access Grants instance is needed for User1 and User2.

Identity Center Configuration – IAM Identity Center parameters:

  • IDCGroup1Id – Group ID corresponding to Group1 from IAM Identity Center.
  • IDCGroup2Id – Group ID corresponding to Group2 from IAM Identity Center.
  • IDCGroup3Id – Group ID corresponding to Group3 from IAM Identity Center.
  • IAMIDCInstanceArn – IAM Identity Center instance ARN. You can get this from the Settings section of IAM Identity Center.

Redshift Configuration – Redshift parameters. Ignore these if you chose No for DeployRedshiftQEV2Flow:

  • RedshiftServerlessAdminUserName – Redshift admin user name.
  • RedshiftServerlessAdminPassword – Redshift admin password.
  • RedshiftServerlessDatabase – Redshift database in which to create the tables.

EMR Configuration – EMR parameters. Ignore these if you chose No for DeployEMRFlow:

  • SSlCertsS3BucketName – Bucket name where you copied the SSL certificates.
  • SSlCertsZip – Name of the SSL certificates file (my-certs.zip if you use the sample certificate provided in the post).

Athena Configuration – Athena parameters. Ignore these if you chose No for DeployAthenaFlow:

  • IDCUser1Id – User ID corresponding to User1 from IAM Identity Center.

The CloudFormation stack provisions the following resources:

  • A VPC with a public and private subnet.
  • If you chose the Redshift components, it also creates three additional subnets.
  • S3 buckets for data and Athena query results location storage. It also copies some sample data to the buckets.
  • EMR Studio with IAM Identity Center integration.
  • Amazon EMR security configuration with IAM Identity Center integration.
  • An EMR cluster that uses the EMR security configuration.
  • Registration of the source S3 bucket with Lake Formation.
  • An AWS Glue database named oktank_tipblog_temp and a table named customer under the database. The table points to the Amazon S3 location governed by Lake Formation.
  • A Lake Formation application integration setting that allows external engines to access data in Amazon S3 locations with full table access. This is required for Amazon EMR integration with Lake Formation for trusted identity propagation. As of this writing, Amazon EMR supports table-level access with IAM Identity Center enabled clusters.
  • An S3 Access Grants instance.
  • S3 Access Grants for Group1 to the User1 prefix under the Athena query results location bucket.
  • S3 Access Grants for Group2 to the S3 bucket input and output prefixes. The user has read access to the input prefix and write access to the output prefix under the bucket.
  • An Amazon Redshift Serverless namespace and workgroup. This workgroup is not integrated with IAM Identity Center; we complete subsequent steps to enable IAM Identity Center for the workgroup.
  • An AWS Cloud9 integrated development environment (IDE), which we use to run AWS CLI commands during the setup.

Note the stack outputs on the AWS CloudFormation console. You use these values in later steps.

Choose the link for Cloud9URL in the stack output to open the AWS Cloud9 IDE. In AWS Cloud9, go to the Window tab and choose New Terminal to start a new bash terminal.

Set up Lake Formation

You need to enable Lake Formation with IAM Identity Center and enable an EMR application with Lake Formation integration. Complete the following steps:

  1. In the AWS Cloud9 bash terminal, enter the following command to get the Amazon EMR security configuration created by the stack:
aws emr describe-security-configuration --name TIP-EMRSecurityConfig | jq -r '.SecurityConfiguration | fromjson | .AuthenticationConfiguration.IdentityCenterConfiguration.IdCApplicationARN'
  2. Note the value for IdcApplicationARN from the output.
  3. Enter the following command in AWS Cloud9 to enable the Lake Formation integration with IAM Identity Center and add the Amazon EMR security configuration application as a trusted application in Lake Formation. If you already have the IAM Identity Center integration with Lake Formation, sign in to Lake Formation and add the preceding value to the list of applications instead of running the following command, then proceed to the next step.
aws lakeformation create-lake-formation-identity-center-configuration --catalog-id <Replace with CatalogId value from Cloudformation output> --instance-arn <Replace with IDCInstanceARN value from CloudFormation stack output> --external-filtering Status=ENABLED,AuthorizedTargets=<Replace with IdcApplicationARN value copied in previous step>

After this step, you should see the application on the Lake Formation console.

This completes the initial setup. In subsequent steps, we apply some additional configurations for specific user personas.

Validate user personas

To review the S3 Access Grants created by AWS CloudFormation, open the Amazon S3 console and choose Access Grants in the navigation pane. Choose the access grant you created to view its details.

The CloudFormation stack created the S3 Access Grants for Group1 for the User1 prefix under the Athena query results location bucket. This allows User1 to access their prefix in the query results bucket. The stack also created grants for Group2 so that User2 can access the raw data bucket input and output prefixes.
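
You can review the same grants programmatically through the S3 Control API. The following Boto3 sketch lists the grants in the account that owns the S3 Access Grants instance; the account ID is a placeholder.

import boto3

s3control = boto3.client("s3control")

# Placeholder: the AWS account that owns the S3 Access Grants instance.
ACCOUNT_ID = "123456789012"

# Print who is granted which permission on which S3 prefix.
response = s3control.list_access_grants(AccountId=ACCOUNT_ID)
for grant in response.get("AccessGrantsList", []):
    print(
        grant["Grantee"]["GranteeIdentifier"],
        grant["Permission"],
        grant["GrantScope"],
    )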

Set up User1 access

Complete the steps in this section to set up User1 access.

Create an IAM Identity Center enabled Athena workgroup

Let’s create the Athena workgroup that will be used by User1.

Enter the following command in the AWS Cloud9 terminal. The command creates an IAM Identity Center integrated Athena workgroup and enables S3 Access Grants for the user-level prefix. These user identity-based S3 prefixes let users in an Athena workgroup keep their query results isolated from other users in the same workgroup. The prefix is automatically created by Athena when the CreateUserLevelPrefix option is enabled. Access to the prefix was granted by the CloudFormation stack.

aws athena create-work-group --cli-input-json '{
"Name": "AthenaIDCWG",
"Configuration": {
"ResultConfiguration": {
"OutputLocation": "<Replace with AthenaResultLocation from CloudFormation stack>"
},
"ExecutionRole": "<Replace with TIPStudioRoleArn from CloudFormation stack>",
"IdentityCenterConfiguration": {
"EnableIdentityCenter": true,
"IdentityCenterInstanceArn": "<Replace with IDCInstanceARN from CloudFormation stack>"
},
"QueryResultsS3AccessGrantsConfiguration": {
"EnableS3AccessGrants": true,
"CreateUserLevelPrefix": true,
"AuthenticationType": "DIRECTORY_IDENTITY"
},
"EnforceWorkGroupConfiguration":true
},
"Description": "Athena Workgroup with IDC integration"
}'

Grant access to User1 on the Athena workgroup

Sign in to the Athena console and grant access to Group1 to the workgroup as shown in the following screenshot. You can grant access to the user (User1) or to the group (Group1). In this post, we grant access to Group1.

Grant access to User1 in Lake Formation

Sign in to the Lake Formation console, choose Data lake permissions in the navigation pane, and grant access to the user group on the database oktank_tipblog_temp and table customer.

For Athena access, you can grant permissions on specific columns and, using row-level filtering, on specific rows. For this post, we grant column-level access and restrict access to only selected columns of the table.
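
The same column-level grant can be issued with the Lake Formation API. The following Boto3 sketch is illustrative only: the group ID and column names are placeholders, and the Identity Center principal identifier format shown here is an assumption you should verify against the Lake Formation documentation for trusted identity propagation.

import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder values. The identitystore ARN format for an IAM Identity Center
# group principal is an assumption; confirm it in the Lake Formation docs.
GROUP1_ID = "<IDC_GROUP1_ID>"
PRINCIPAL = f"arn:aws:identitystore:::group/{GROUP1_ID}"

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": PRINCIPAL},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "oktank_tipblog_temp",
            "Name": "customer",
            # Placeholder column names: only these columns become visible.
            "ColumnNames": ["c_customer_id", "c_first_name", "c_last_name"],
        }
    },
    Permissions=["SELECT"],
)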

This completes the access permission setup for User1.

Verify access

Let’s see how User1 uses Athena to analyze the data.

  1. Copy the URL for EMRStudioURL from the CloudFormation stack output.
  2. Open a new browser window and connect to the URL.

You will be redirected to the Okta login page.

  3. Log in with User1.
  4. In the EMR Studio query editor, change the workgroup to AthenaIDCWG and choose Acknowledge.
  5. Run the following query in the query editor:
SELECT * FROM "oktank_tipblog_temp"."customer" limit 10;


You can see that the user is only able to access the columns for which permissions were previously granted in Lake Formation. This completes the access flow verification for User1.

Set up User2 access

User2 accesses the table using an EMR Studio notebook. Note the current considerations for EMR with IAM Identity Center integrations.

Complete the steps in this section to set up User2 access.

Grant Lake Formation permissions to User2

Sign in to the Lake Formation console and grant access to Group2 on the table, similar to the steps you followed earlier for User1. Also grant Describe permission on the default database to Group2, as shown in the following screenshot.

Create an EMR Studio Workspace

Next, User2 creates an EMR Studio Workspace.

  1. Copy the URL for EMR Studio from the EMRStudioURL value from the CloudFormation stack output.
  2. Log in to EMR Studio as User2 on the Okta login page.
  3. Create a Workspace, giving it a name and leaving all other options as default.

This will open a JupyterLab notebook in a new window.

Connect to the EMR Studio notebook

In the Compute pane of the notebook, select the EMR cluster (named EMRWithTIP) created by the CloudFormation stack to attach the Workspace to it. After the notebook is attached to the cluster, choose the PySpark kernel to run Spark queries.

Verify access

Enter the following query in the notebook to read from the customer table:

spark.sql("select * from oktank_tipblog_temp.customer").show()


The user access works as expected based on the Lake Formation grants you provided earlier.

Run the following Spark query in the notebook to read data from the raw bucket. Access to this bucket is controlled by S3 Access Grants.

spark.read.option("header",True).csv("s3://tip-blog-s3-s3ag/input/*").show()

Let’s write this data to the same bucket and input prefix. This should fail because you only granted read access to the input prefix with S3 Access Grants.

spark.read.option("header",True).csv("s3://tip-blog-s3-s3ag/input/*").write.mode("overwrite").parquet("s3://tip-blog-s3-s3ag/input/")

The user has access to the output prefix under the bucket. Change the query to write to the output prefix:

spark.read.option("header",True).csv("s3://tip-blog-s3-s3ag/input/*").write.mode("overwrite").parquet("s3://tip-blog-s3-s3ag/output/test.part")

The write should now be successful.

We have now seen the data access controls and access flows for User1 and User2.

Set up User3 access

Following the target architecture in our post, Group3 users use the Redshift Query Editor v2 to query the Redshift tables.

Complete the steps in this section to set up access for User3.

Enable Redshift Query Editor v2 console access for User3

Complete the following steps:

  1. On the IAM Identity Center console, create a custom permission set and attach the following policies:
    1. AWS managed policy AmazonRedshiftQueryEditorV2ReadSharing.
    2. Customer managed policy redshift-idc-policy-tip. This policy is already created by the CloudFormation stack, so you don’t have to create it.
  2. Provide a name (tip-blog-qe-v2-permission-set) to the permission set.
  3. Set the relay state as https://<region-id>.console.aws.amazon.com/sqlworkbench/home (for example, https://us-east-1.console.aws.amazon.com/sqlworkbench/home).
  4. Choose Create.
  5. Assign Group3 to the account in IAM Identity Center, select the permission set you created, and choose Submit.

Create the Redshift IAM Identity Center application

Enter the following in the AWS Cloud9 terminal:

aws redshift create-redshift-idc-application \
--idc-instance-arn '<Replace with IDCInstanceARN value from CloudFormation Output>' \
--redshift-idc-application-name 'redshift-iad-<Replace with CatalogId value from CloudFormation output>-tip-blog-1' \
--identity-namespace 'tipblogawsidc' \
--idc-display-name 'TIPBlog_AWSIDC' \
--iam-role-arn '<Replace with TIPRedshiftRoleArn value from CloudFormation output>' \
--service-integrations '[
  {
    "LakeFormation": [
    {
     "LakeFormationQuery": {
     "Authorization": "Enabled"
    }
   }
  ]
 }
]'

Enter the following command to get the application details:

aws redshift describe-redshift-idc-applications --output json

Keep a note of the IdcManagedApplicationArn, IdcDisplayName, and IdentityNamespace values in the output for the application with IdcDisplayName TIPBlog_AWSIDC. You need these values in the next step.

Enable the Redshift Query Editor v2 for the Redshift IAM Identity Center application

Complete the following steps:

  1. On the Amazon Redshift console, choose IAM Identity Center connections in the navigation pane.
  2. Choose the application you created.
  3. Choose Edit.
  4. Select Enable Query Editor v2 application and choose Save changes.
  5. On the Groups tab, choose Add or assign groups.
  6. Assign Group3 to the application.

The Redshift IAM Identity Center connection is now set up.

Enable the Redshift Serverless namespace and workgroup with IAM Identity Center

The CloudFormation stack you deployed created a serverless namespace and workgroup. However, they’re not enabled with IAM Identity Center. To enable them, complete the following steps. You can get the namespace name from the RedshiftNamespace value of the CloudFormation stack output.

  1. On the Amazon Redshift Serverless dashboard console, navigate to the namespace you created.
  2. Choose Query Data to open Query Editor v2.
  3. Choose the options menu (three dots) and choose Create connections for the workgroup redshift-idc-wg-tipblog.
  4. Choose Other ways to connect and then Database user name and password.
  5. Use the credentials you provided for the Redshift admin user name and password parameters when deploying the CloudFormation stack and create the connection.

Create resources using the Redshift Query Editor v2

You now enter a series of commands in the query editor with the database admin user.

  1. Create an IdP for the Redshift IAM Identity Center application:
CREATE IDENTITY PROVIDER "TIPBlog_AWSIDC" TYPE AWSIDC
NAMESPACE 'tipblogawsidc'
APPLICATION_ARN '<Replace with IdcManagedApplicationArn value you copied earlier in Cloud9>'
IAM_ROLE '<Replace with TIPRedshiftRoleArn value from CloudFormation output>';
  2. Enter the following command to check the IdP you added previously:
SELECT * FROM svv_identity_providers;

Next, you grant permissions to the IAM Identity Center user.

  3. Create a role in Redshift. This role should correspond to the group in IAM Identity Center to which you intend to provide the permissions (Group3 in this post). The role should follow the format <namespace>:<GroupNameinIDC>.
Create role "tipblogawsidc:Group3";
  4. Run the following command to see the role you created. The external_id corresponds to the group ID value for Group3 in IAM Identity Center.
Select * from svv_roles where role_name = 'tipblogawsidc:Group3';

  5. Create a sample table to verify access for the Group3 user:
CREATE TABLE IF NOT EXISTS revenue
(
account INTEGER ENCODE az64
,customer VARCHAR(20) ENCODE lzo
,salesamt NUMERIC(18,0) ENCODE az64
)
DISTSTYLE AUTO
;

insert into revenue values (10001, 'ABC Company', 12000);
insert into revenue values (10002, 'Tech Logistics', 175400);
  6. Grant access to the user on the schema:
-- Grant usage on schema
grant usage on schema public to role "tipblogawsidc:Group3";
  7. To create a datashare and add the preceding table to the datashare, enter the following statements:
CREATE DATASHARE demo_datashare;
ALTER DATASHARE demo_datashare ADD SCHEMA public;
ALTER DATASHARE demo_datashare ADD TABLE revenue;
  8. Grant usage on the datashare to the account using the Data Catalog:
GRANT USAGE ON DATASHARE demo_datashare TO ACCOUNT '<Replace with CatalogId from CloudFormation output>' VIA DATA CATALOG;

Authorize the datashare

For this post, we use the AWS CLI to authorize the datashare. You can also do it from the Amazon Redshift console.

Enter the following command in the AWS Cloud9 IDE to describe the datashare you created and note the value of DataShareArn and ConsumerIdentifier to use in subsequent steps:

aws redshift describe-data-shares

Enter the following command in the AWS Cloud9 IDE to authorize the datashare:

aws redshift authorize-data-share --data-share-arn <Replace with DataShareArn value copied from the earlier command’s output> --consumer-identifier <Replace with ConsumerIdentifier value copied from the earlier command’s output>

Accept the datashare in Lake Formation

Next, accept the datashare in Lake Formation.

  1. On the Lake Formation console, choose Data sharing in the navigation pane.
  2. In the Invitations section, select the datashare invitation that is pending acceptance.
  3. Choose Review invitation and accept the datashare.
  4. Provide a database name (tip-blog-redshift-ds-db), which will be created in the Data Catalog by Lake Formation.
  5. Choose Skip to Review and Create and create the database.

Grant permissions in Lake Formation

Complete the following steps:

  1. On the Lake Formation console, choose Data lake permissions in the navigation pane.
  2. Choose Grant and in the Principals section, choose User3 to grant permissions with the IAM Identity Center-new option. Refer to the Lake Formation access grants steps performed for User1 and User2 if needed.
  3. Choose the database (tip-blog-redshift-ds-db) you created earlier and the table public.revenue, which you created in the Redshift Query Editor v2.
  4. For Table permissions, select Select.
  5. For Data permissions, select Column-based access and select the account and salesamt columns.
  6. Choose Grant.

Mount the AWS Glue database to Amazon Redshift

As the last step in the setup, mount the AWS Glue database to Amazon Redshift. In the Query Editor v2, enter the following statements:

create external schema if not exists tipblog_datashare_idc_schema from DATA CATALOG DATABASE 'tip-blog-redshift-ds-db' catalog_id '<Replace with CatalogId from CloudFormation output>';

grant usage on schema tipblog_datashare_idc_schema to role "tipblogawsidc:Group3";

grant select on all tables in schema tipblog_datashare_idc_schema to role "tipblogawsidc:Group3";

You are now done with the required setup and permissions for User3 on the Redshift table.

Verify access

To verify access, complete the following steps:

  1. Get the AWS access portal URL from the IAM Identity Center Settings section.
  2. Open a different browser and enter the access portal URL.

This will redirect you to your Okta login page.

  3. Sign in, select the account, and choose the tip-blog-qe-v2-permission-set link to open the Query Editor v2.

If you’re using private or incognito mode for testing this, you may need to enable third-party cookies.

  4. Choose the options menu (three dots) and choose Edit connection for the redshift-idc-wg-tipblog workgroup.
  5. Use IAM Identity Center in the pop-up window and choose Continue.

If you get an error with the message “Redshift serverless cluster is auto paused,” switch to the other browser with admin credentials and run any sample queries to un-pause the cluster. Then switch back to this browser and continue the next steps.

  6. Run the following query to access the table:
SELECT * FROM "dev"."tipblog_datashare_idc_schema"."public.revenue";

You can only see the two columns due to the access grants you provided in Lake Formation earlier.

This completes configuring User3 access to the Redshift table.

Set up QuickSight for User3

Let’s now set up QuickSight and verify access for User3. We already granted access to User3 to the Redshift table in earlier steps.

  1. Create a new IAM Identity Center enabled QuickSight account. Refer to Simplify business intelligence identity management with Amazon QuickSight and AWS IAM Identity Center for guidance.
  2. Choose Group3 for the author and reader for this post.
  3. For IAM Role, choose the IAM role matching the RoleQuickSight value from the CloudFormation stack output.

Next, you add a VPC connection to QuickSight to access the Redshift Serverless namespace you created earlier.

  1. On the QuickSight console, manage your VPC connections.
  2. Choose Add VPC connection.
  3. For VPC connection name, enter a name.
  4. For VPC ID, enter the value for VPCId from the CloudFormation stack output.
  5. For Execution role, choose the value for RoleQuickSight from the CloudFormation stack output.
  6. For Security Group IDs, choose the security group for QSSecurityGroup from the CloudFormation stack output.

  7. Wait for the VPC connection to be AVAILABLE.
  8. Enter the following command in AWS Cloud9 to enable QuickSight with Amazon Redshift for trusted identity propagation:
aws quicksight update-identity-propagation-config --aws-account-id "<Replace with CatalogId from CloudFormation output>" --service "REDSHIFT" --authorized-targets "<Replace with the IdcManagedApplicationArn value you copied earlier from the output of aws redshift describe-redshift-idc-applications --output json>"

Verify User3 access with QuickSight

Complete the following steps:

  1. Sign in to the QuickSight console as User3 in a different browser.
  2. On the Okta sign-in page, sign in as User3.
  3. Create a new dataset with Amazon Redshift as the data source.
  4. Choose the VPC connection you created above for Connection Type.
  5. Provide the Redshift server (the RedshiftSrverlessWorkgroup value from the CloudFormation stack output), port (5439 in this post), and database name (dev in this post).
  6. Under Authentication method, select Single sign-on.
  7. Choose Validate, then choose Create data source.

If you encounter an issue with validating using single sign-on, switch to Database username and password for Authentication method, validate with any dummy user and password, and then switch back to validate using single sign-on and proceed to the next step. Also check that the Redshift serverless cluster is not auto-paused as mentioned earlier in Redshift access verification.

  8. Choose the schema you created earlier (tipblog_datashare_idc_schema) and the table public.revenue.
  9. Choose Select to create your dataset.

You should now be able to visualize the data in QuickSight. You are only able to see the account and salesamt columns from the table because of the access permissions you granted earlier with Lake Formation.

This finishes all the steps for setting up trusted identity propagation.

Audit data access

Let’s see how we can audit the data access with the different users.

Access requests are logged to CloudTrail. The IAM Identity Center user ID is logged under the onBehalfOf tag in the CloudTrail event. The following screenshot shows the GetDataAccess event generated by Lake Formation. You can view the CloudTrail event history and filter by event name GetDataAccess to view similar events in your account.

You can see the userId corresponds to User2.

You can run the following commands in AWS Cloud9 to confirm this.

Get the identity store ID:

aws sso-admin describe-instance --instance-arn <Replace with your instance arn value> | jq -r '.IdentityStoreId'

Describe the user in the identity store:

aws identitystore describe-user --identity-store-id <Replace with output of above command> --user-id <User Id from above screenshot>

One way to query the CloudTrail log events is by using CloudTrail Lake. Set up the event data store (refer to the following instructions) and rerun the queries for User1, User2, and User3. You can query the access events using CloudTrail Lake with the following sample query:

SELECT eventTime,userIdentity.onBehalfOf.userid AS idcUserId,requestParameters as accessInfo, serviceEventDetails
FROM 04d81d04-753f-42e0-a31f-2810659d9c27
WHERE userIdentity.arn IS NOT NULL AND (eventName='BatchGetTable' OR eventName='GetDataAccess' OR eventName='CreateDataSet')
order by eventTime DESC

The following screenshot shows an example of the detailed results with audit explanations.

Clean up

To avoid incurring further charges, delete the CloudFormation stack. Before you delete the CloudFormation stack, delete all the resources you created using the console or AWS CLI:

  1. Manually delete any EMR Studio Workspaces you created with User2.
  2. Delete the Athena workgroup created as part of the User1 setup.
  3. Delete the QuickSight VPC connection you created.
  4. Delete the Redshift IAM Identity Center connection.
  5. Deregister IAM Identity Center from S3 Access Grants.
  6. Delete the CloudFormation stack.
  7. Manually delete the VPC created by AWS CloudFormation.

Conclusion

In this post, we delved into the trusted identity propagation feature of AWS IAM Identity Center alongside various AWS analytics services, demonstrating its utility in managing permissions using corporate user or group identities rather than IAM roles. We examined diverse user personas using interactive tools like Athena, EMR Studio notebooks, Redshift Query Editor v2, and QuickSight, all centralized under Lake Formation for streamlined permission management. Additionally, we explored S3 Access Grants for S3 bucket access management, and concluded with insights into auditing through CloudTrail events and CloudTrail Lake for a comprehensive overview of user data access.

For further reading, refer to the following resources:


About the Author

Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.

Architectural Patterns for real-time analytics using Amazon Kinesis Data Streams, Part 2: AI Applications

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/big-data/architectural-patterns-for-real-time-analytics-using-amazon-kinesis-data-streams-part-2-ai-applications/

Welcome back to our exciting exploration of architectural patterns for real-time analytics with Amazon Kinesis Data Streams! In this fast-paced world, Kinesis Data Streams stands out as a versatile and robust solution to tackle a wide range of use cases with real-time data, from dashboarding to powering artificial intelligence (AI) applications. In this series, we streamline the process of identifying and applying the most suitable architecture for your business requirements, and help kickstart your system development efficiently with examples.

Before we dive in, we recommend reviewing Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1 for the basic functionalities of Kinesis Data Streams. Part 1 also contains architectural examples for building real-time applications for time series data and event-sourcing microservices.

Now get ready as we embark on the second part of this series, where we focus on the AI applications with Kinesis Data Streams in three scenarios: real-time generative business intelligence (BI), real-time recommendation systems, and Internet of Things (IoT) data streaming and inferencing.

Real-time generative BI dashboards with Kinesis Data Streams, Amazon QuickSight, and Amazon Q

In today’s data-driven landscape, your organization likely possesses a vast amount of time-sensitive information that can be used to gain a competitive edge. The key to unlocking the full potential of this real-time data lies in your ability to effectively make sense of it and transform it into actionable insights in real time. This is where real-time BI tools such as live dashboards come into play, assisting you with data aggregation, analysis, and visualization, thereby accelerating your decision-making process.

To help streamline this process and empower your team with real-time insights, Amazon has introduced Amazon Q in QuickSight. Amazon Q is a generative AI-powered assistant that you can configure to answer questions, provide summaries, generate content, and complete tasks based on your data. Amazon QuickSight is a fast, cloud-powered BI service that delivers insights.

With Amazon Q in QuickSight, you can use natural language prompts to build, discover, and share meaningful insights in seconds, creating context-aware data Q&A experiences and interactive data stories from the real-time data. For example, you can ask “Which products grew the most year-over-year?” and Amazon Q will automatically parse the questions to understand the intent, retrieve the corresponding data, and return the answer in the form of a number, chart, or table in QuickSight.

By using the architecture illustrated in the following figure, your organization can harness the power of streaming data and transform it into visually compelling and informative dashboards that provide real-time insights. With the power of natural language querying and automated insights at your fingertips, you’ll be well-equipped to make informed decisions and stay ahead in today’s competitive business landscape.

Build real-time generative business intelligence dashboards with Amazon Kinesis Data Streams, Amazon QuickSight, and Amazon Q

The steps in the workflow are as follows:

  1. We use Amazon DynamoDB here as an example for the primary data store. Kinesis Data Streams can ingest data in real time from data stores such as DynamoDB to capture item-level changes in your table.
  2. After capturing data to Kinesis Data Streams, you can ingest the data into analytic databases such as Amazon Redshift in near-real time. Amazon Redshift Streaming Ingestion simplifies data pipelines by letting you create materialized views directly on top of data streams. With this capability, you can use SQL (Structured Query Language) to connect to and directly ingest the data stream from Kinesis Data Streams to analyze and run complex analytical queries. A minimal sketch of this step follows the list.
  3. After the data is in Amazon Redshift, you can create a business report using QuickSight. Connectivity between a QuickSight dashboard and Amazon Redshift enables you to deliver visualization and insights. With the power of Amazon Q in QuickSight, you can quickly build and refine the analytics and visuals with natural language inputs.
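
The following Boto3 sketch illustrates step 2 by using the Redshift Data API to create a streaming ingestion materialized view on top of a Kinesis data stream. The workgroup, database, stream, and IAM role names are placeholders, and you should confirm the exact SQL against the Amazon Redshift streaming ingestion documentation.

import boto3

redshift_data = boto3.client("redshift-data")

# Placeholder names: a Redshift Serverless workgroup, a database, a Kinesis
# data stream, and an IAM role that lets Redshift read from the stream.
WORKGROUP = "analytics-wg"
DATABASE = "dev"
STREAM_NAME = "orders-stream"
STREAM_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-streaming-role"

sqls = [
    # Map the Kinesis data stream into Redshift as an external schema.
    f"CREATE EXTERNAL SCHEMA IF NOT EXISTS kds FROM KINESIS IAM_ROLE '{STREAM_ROLE_ARN}'",
    # Materialized view over the stream; auto refresh pulls in new records.
    f'''CREATE MATERIALIZED VIEW orders_stream_mv AUTO REFRESH YES AS
        SELECT approximate_arrival_timestamp,
               JSON_PARSE(from_varbyte(kinesis_data, 'utf-8')) AS payload
        FROM kds."{STREAM_NAME}"''',
]

response = redshift_data.batch_execute_statement(
    WorkgroupName=WORKGROUP, Database=DATABASE, Sqls=sqls
)
print("Submitted batch statement:", response["Id"])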

For more details on how customers have built near real-time BI dashboards using Kinesis Data Streams, refer to the following:

Real-time recommendation systems with Kinesis Data Streams and Amazon Personalize

Imagine creating a user experience so personalized and engaging that your customers feel truly valued and appreciated. By using real-time data about user behavior, you can tailor each user’s experience to their unique preferences and needs, fostering a deep connection between your brand and your audience. You can achieve this by using Kinesis Data Streams and Amazon Personalize, a fully managed machine learning (ML) service that generates product and content recommendations for your users, instead of building your own recommendation engine from scratch.

With Kinesis Data Streams, your organization can effortlessly ingest user behavior data from millions of endpoints into a centralized data stream in real time. This allows recommendation engines such as Amazon Personalize to read from the centralized data stream and generate personalized recommendations for each user on the fly. Additionally, you could use enhanced fan-out to deliver dedicated throughput to your mission-critical consumers at even lower latency, further enhancing the responsiveness of your real-time recommendation system. The following figure illustrates a typical architecture for building real-time recommendations with Amazon Personalize.

Build real-time recommendation systems with Kinesis Data Streams and Amazon Personalize

The steps are as follows:

  1. Create a dataset group, schemas, and datasets that represent your items, interactions, and user data.
  2. After importing your datasets into a dataset group using Amazon Simple Storage Service (Amazon S3), select the best recipe matching your use case, and then create a solution to train a model by creating a solution version. When your solution version is complete, you can create a campaign for your solution version.
  3. After a campaign has been created, you can integrate calls to the campaign in your application. This is where calls to the GetRecommendations or GetPersonalizedRanking APIs are made to request near-real-time recommendations from Amazon Personalize. Your website or mobile application calls an AWS Lambda function through Amazon API Gateway to receive recommendations for your business apps.
  4. An event tracker provides an endpoint that allows you to stream interactions that occur in your application back to Amazon Personalize in near-real time. You do this by using the PutEvents API. You can build an event collection pipeline using API Gateway, Kinesis Data Streams, and Lambda to receive and forward interactions to Amazon Personalize. The event tracker performs two primary functions. First, it persists all streamed interactions so they will be incorporated into future retrainings of your model. This is also how Amazon Personalize cold starts new users. When a new user visits your site, Amazon Personalize will recommend popular items. After you stream in an event or two, Amazon Personalize immediately starts adjusting recommendations.
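
To make the event collection step concrete, here is a minimal sketch of an AWS Lambda function that reads interaction records from the Kinesis data stream and forwards them to Amazon Personalize with the PutEvents API. The tracking ID and the payload field names are placeholders for your own event tracker and event schema.

import base64
import json
import time

import boto3

personalize_events = boto3.client("personalize-events")

# Placeholder: the tracking ID of an Amazon Personalize event tracker that
# you have already created for your dataset group.
TRACKING_ID = "<EVENT_TRACKER_TRACKING_ID>"


def handler(event, context):
    """Forward interaction events from Kinesis Data Streams to Amazon Personalize."""
    records = event.get("Records", [])
    for record in records:
        # Kinesis record payloads arrive base64 encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        personalize_events.put_events(
            trackingId=TRACKING_ID,
            userId=payload["user_id"],
            sessionId=payload["session_id"],
            eventList=[
                {
                    "eventType": payload.get("event_type", "click"),
                    "itemId": payload["item_id"],
                    "sentAt": payload.get("timestamp", time.time()),
                }
            ],
        )
    return {"processed": len(records)}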

To learn how other customers have built personalized recommendations using Kinesis Data Streams, refer to the following:

Real-time IoT data streaming and inferencing with AWS IoT Core and Amazon SageMaker

From office lights that automatically turn on as you enter the room to medical devices that monitor a patient’s health in real time, a proliferation of smart devices is making the world more automated and connected. In technical terms, IoT is the network of devices that connect with the internet and can exchange data with other devices and software systems. Many organizations increasingly rely on the real-time data from IoT devices, such as temperature sensors and medical equipment, to drive automation, analytics, and AI systems. It’s important to choose a robust streaming solution that can achieve very low latency and handle high volumes of data throughput to power real-time AI inferencing.

With Kinesis Data Streams, IoT data across millions of devices can simultaneously write to a centralized data stream. Alternatively, you can use AWS IoT Core to securely connect and easily manage the fleet of IoT devices, collect the IoT data, and then ingest to Kinesis Data Streams for real-time transformation, analytics, and event-driven microservices. Then, you can use integrated services such as Amazon SageMaker for real-time inference. The following diagram depicts the high-level streaming architecture with IoT sensor data.

Build real-time IoT data streaming & inferencing pipeline with AWS IoT & Amazon SageMaker

The steps are as follows:

  1. Data originates in IoT devices such as medical devices, car sensors, and industrial IoT sensors. This telemetry data is collected using AWS IoT Greengrass, an open source IoT edge runtime and cloud service that helps your devices collect and analyze data closer to where the data is generated.
  2. Event data is ingested into the cloud using edge-to-cloud interface services such as AWS IoT Core, a managed cloud platform that connects, manages, and scales devices effortlessly and securely. You can also use AWS IoT SiteWise, a managed service that helps you collect, model, analyze, and visualize data from industrial equipment at scale. Alternatively, IoT devices could send data directly to Kinesis Data Streams.
  3. AWS IoT Core can stream ingested data into Kinesis Data Streams.
  4. The ingested data gets transformed and analyzed in near real time using Amazon Managed Service for Apache Flink. Stream data can further be enriched using lookup data hosted in a data warehouse such as Amazon Redshift. Managed Service for Apache Flink can persist streamed data into Amazon Redshift after the customer’s integration and stream aggregation (for example, 1 minute or 5 minutes). The results in Amazon Redshift can be used for further downstream BI reporting services, such as QuickSight. Managed Service for Apache Flink can also write to a Lambda function, which can invoke SageMaker models. After the ML model is trained and deployed in SageMaker, inferences are invoked in a microbatch using Lambda. The inference results are sent to Amazon OpenSearch Service to create personalized monitoring dashboards using OpenSearch Dashboards. The transformed IoT sensor data can be stored in DynamoDB. You can use AWS AppSync to provide near real-time data queries to API services for downstream applications. These enterprise applications can be mobile apps or business applications to track and monitor the IoT sensor data in near real time. A minimal sketch of the Lambda-based inference step follows this list.
  5. The streamed IoT data can be written to an Amazon Data Firehose delivery stream, which microbatches data into Amazon S3 for future analytics.
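
As a rough illustration of the inference step in this pipeline, the following is a minimal sketch of an AWS Lambda function that sends a microbatch of sensor readings to a SageMaker real-time endpoint. The endpoint name, the event shape, and the JSON payload format are assumptions; they depend on how your stream processor invokes the function and how your model was deployed.

import json

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

# Placeholder: a SageMaker real-time inference endpoint hosting your model.
ENDPOINT_NAME = "iot-anomaly-detector"


def handler(event, context):
    """Invoke a SageMaker endpoint for a microbatch of IoT sensor readings."""
    # The event shape depends on how Managed Service for Apache Flink (or
    # another consumer) invokes this function; a simple list is assumed here.
    readings = event.get("readings", [])

    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(readings),
    )
    predictions = json.loads(response["Body"].read())

    # Downstream, these predictions can be indexed into OpenSearch Service or
    # written to DynamoDB, as described in step 4.
    return {"predictions": predictions}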

To learn how other customers have built IoT device monitoring solutions using Kinesis Data Streams, refer to:

Conclusion

This post demonstrated additional architectural patterns for building low-latency AI applications with Kinesis Data Streams and its integrations with other AWS services. Customers looking to build generative BI, recommendation systems, or IoT data streaming and inferencing can refer to these patterns as a starting point for designing their cloud architecture. We will continue to add new architectural patterns in future posts of this series.

For detailed architectural patterns, refer to the following resources:

If you want to build a data vision and strategy, check out the AWS Data-Driven Everything (D2E) program.


About the Authors

Raghavarao Sodabathina is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML, and cloud security. He engages with customers to create innovative solutions that address customer business problems and to accelerate the adoption of AWS services. In his spare time, Raghavarao enjoys spending time with his family, reading books, and watching movies.

Hang Zuo is a Senior Product Manager on the Amazon Kinesis Data Streams team at Amazon Web Services. He is passionate about developing intuitive product experiences that solve complex customer problems and enable customers to achieve their business goals.

Shwetha Radhakrishnan is a Solutions Architect for AWS with a focus in Data Analytics. She has been building solutions that drive cloud adoption and help organizations make data-driven decisions within the public sector. Outside of work, she loves dancing, spending time with friends and family, and traveling.

Brittany Ly is a Solutions Architect at AWS. She is focused on helping enterprise customers with their cloud adoption and modernization journey and has an interest in the security and analytics field. Outside of work, she loves to spend time with her dog and play pickleball.

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics-part-3-visualization-and-trend-analysis-using-amazon-quicksight/

In Part 2 of this series, we discussed how to enable AWS Glue job observability metrics and integrate them with Grafana for real-time monitoring. Grafana provides powerful customizable dashboards to view pipeline health. However, to analyze trends over time, aggregate from different dimensions, and share insights across the organization, a purpose-built business intelligence (BI) tool like Amazon QuickSight may be more effective for your business. QuickSight makes it straightforward for business users to visualize data in interactive dashboards and reports.

In this post, we explore how to connect QuickSight to Amazon CloudWatch metrics and build graphs to uncover trends in AWS Glue job observability metrics. Analyzing historical patterns allows you to optimize performance, identify issues proactively, and improve planning. We walk through ingesting CloudWatch metrics into QuickSight using a CloudWatch metric stream and QuickSight SPICE. With this integration, you can use line charts, bar charts, and other graph types to uncover daily, weekly, and monthly patterns. QuickSight lets you perform aggregate calculations on metrics for deeper analysis. You can slice data by different dimensions like job name, see anomalies, and share reports securely across your organization. With these insights, teams have the visibility to make data integration pipelines more efficient.

Solution overview

The following architecture diagram illustrates the workflow to implement the solution.

The workflow includes the following steps:

  1. AWS Glue jobs emit observability metrics to CloudWatch metrics.
  2. CloudWatch streams metric data through a metric stream into Amazon Data Firehose.
  3. Data Firehose uses an AWS Lambda function to transform data and ingest the transformed records into an Amazon Simple Storage Service (Amazon S3) bucket. A minimal sketch of such a transformation function follows this list.
  4. An AWS Glue crawler scans data on the S3 bucket and populates table metadata on the AWS Glue Data Catalog.
  5. QuickSight periodically runs Amazon Athena queries to load query results to SPICE and then visualize the latest metric data.
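
As an illustration of step 3, the following is a minimal sketch of a Data Firehose transformation Lambda function that keeps only records from the Glue metric namespace. It follows the standard Firehose record-transformation contract, and the filtering logic is a simplified stand-in for what the sample template's function does.

import base64
import json


def handler(event, context):
    """Firehose transformation: keep only AWS Glue observability metrics."""
    output = []
    for record in event["records"]:
        # Each record from the CloudWatch metric stream contains
        # newline-delimited JSON metric documents.
        payload = base64.b64decode(record["data"]).decode("utf-8")
        kept = [
            line
            for line in payload.splitlines()
            if line and json.loads(line).get("namespace") == "Glue"
        ]

        if kept:
            data = base64.b64encode(("\n".join(kept) + "\n").encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        else:
            output.append({"recordId": record["recordId"], "result": "Dropped", "data": record["data"]})

    return {"records": output}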

All of the resources are defined in a sample AWS Cloud Development Kit (AWS CDK) template. You can deploy the end-to-end solution to visualize and analyze trends of the observability metrics.

Sample AWS CDK template

This post provides a sample AWS CDK template for a dashboard using AWS Glue observability metrics.

Typically, you have multiple accounts to manage and run resources for your data pipeline.

In this template, we assume the following accounts:

  • Monitoring account – This hosts the central S3 bucket, central Data Catalog, and QuickSight-related resources
  • Source account – This hosts individual data pipeline resources on AWS Glue and the resources to send metrics to the monitoring account

The template works even when the monitoring account and source account are the same.

This sample template consists of four stacks:

  • Amazon S3 stack – This provisions the S3 bucket
  • Data Catalog stack – This provisions the AWS Glue database, table, and crawler
  • QuickSight stack – This provisions the QuickSight data source, dataset, and analysis
  • Metrics sender stack – This provisions the CloudWatch metric stream, Firehose delivery stream, and Lambda function for transformation

Prerequisites

You should have the following prerequisites:

  • Python 3.9 or later
  • AWS accounts for the monitoring account and source account
  • An AWS named profile for the monitoring account and source account
  • The AWS CDK Toolkit 2.87.0 or later

Initialize the CDK project

To initialize the project, complete the following steps:

  1. Clone the AWS CDK template to your workspace:
    $ git clone git@github.com:aws-samples/aws-glue-cdk-baseline.git
    
    $ cd aws-glue-cdk-baseline

  2. Create a Python virtual environment specific to the project on the client machine:
    $ python3 -m venv .venv

We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

  3. Activate the virtual environment according to your OS:
    • On MacOS and Linux, use the following code:
      $ source .venv/bin/activate

    • On a Windows platform, use the following code:
      % .venv\Scripts\activate.bat

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

  4. Install the required dependencies described in requirements.txt to the virtual environment:
    $ pip install -r requirements.txt

  5. Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):
    create_s3_stack: false
    create_metrics_sender_stack: false
    create_catalog_stack: false
    create_quicksight_stack: true
    
    s3_bucket_name: glue-observability-demo-dashboard
    
    firehose_log_group_name: /aws/kinesisfirehose/observability-demo-metric-stream
    firehose_lambda_buffer_size_mb: 2
    firehose_lambda_buffer_interval_seconds: 60
    firehose_s3_buffer_size_mb: 128
    firehose_s3_buffer_interval_seconds: 300
    
    glue_database_name: observability_demo_db
    glue_table_name: metric_data
    glue_crawler_name: observability_demo_crawler
    glue_crawler_cron_schedule: "cron(42 * * * ? *)"
    
    athena_workgroup_name: primary

Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

  1. In the monitoring account, provide your monitoring account number, AWS Region, and monitoring profile:
    $ cdk bootstrap aws://<MONITORING-ACCOUNT-NUMBER>/<REGION> --profile <MONITORING-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

  2. In the source account, provide your source account number, Region, and source profile:
    $ cdk bootstrap aws://<SOURCE-ACCOUNT-NUMBER>/<REGION> --profile <SOURCE-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

When you use only one account for all environments, you can just run the cdk bootstrap command one time.

Deploy your AWS resources

Run the following commands to deploy your AWS resources:

  1. Run the following command using the monitoring account to deploy resources defined in the AWS CDK template:
    $ cdk deploy '*' --profile <MONITORING-PROFILE>

  2. Run the following command using the source account to deploy resources defined in the AWS CDK template:
    $ cdk deploy MetricSenderStack --profile <SOURCE-PROFILE>

Configure QuickSight permissions

Initially, the new QuickSight resources, including the dataset and analysis created by the AWS CDK template, are not visible to you because no QuickSight permissions have been configured yet.

To make the dataset and analysis visible for you, complete the following steps:

  1. On the QuickSight console, navigate to the user menu and choose Manage QuickSight.
  2. In the navigation pane, choose Manage assets.
  3. Under Browse assets, choose Analysis.
  4. Search for GlueObservabilityAnalysis, and select it.
  5. Choose SHARE.
  6. For User or Group, select your user, then choose SHARE (1).
  7. Wait for the share to be complete, then choose DONE.
  8. On the Manage assets page, choose Datasets.
  9. Search for observability_demo.metrics_data, and select it.
  10. Choose SHARE.
  11. For User or Group, select your user, then choose SHARE (1).
  12. Wait for the share to be complete, then choose DONE.

Explore the default QuickSight analysis

Now your QuickSight analysis and dataset are visible to you. You can return to the QuickSight console and choose GlueObservabilityAnalysis under Analysis. The following screenshot shows your dashboard.

The sample analysis has two tabs: Monitoring and Insights. By default, the Monitoring tab has the following charts:

  • [Reliability] Job Run Errors Breakdown
  • [Reliability] Job Run Errors (Total)
  • [Performance] Skewness Job
  • [Performance] Skewness Job per Job

  • [Resource Utilization] Worker Utilization
  • [Resource Utilization] Worker Utilization per Job
  • [Throughput] BytesRead, RecordsRead, FilesRead, PartitionRead (Avg)
  • [Throughput] BytesWritten, RecordsWritten, FilesWritten (Avg)

  • [Resource Utilization] Disk Available GB (Min)
  • [Resource Utilization] Max Disk Used % (Max)

  • [Driver OOM] OOM Error Count
  • [Driver OOM] Max Heap Memory Used % (Max)
  • [Executor OOM] OOM Error Count
  • [Executor OOM] Max Heap Memory Used % (Max)

By default, the Insights tab has the following insights:

  • Bottom Ranked Worker Utilization
  • Top Ranked Skewness Job

  • Forecast Worker Utilization
  • Top Mover readBytes

You can add new charts or insights using the observability metrics based on your requirements.

Publish the QuickSight dashboard

When the analysis is ready, complete the following steps to publish the dashboard:

  1. Choose PUBLISH.
  2. Select Publish new dashboard as, and enter GlueObservabilityDashboard.
  3. Choose Publish dashboard.

Then you can view and share the dashboard.

Visualize and analyze with AWS Glue job observability metrics

Let’s use the dashboard to make AWS Glue usage more performant.

Looking at the Skewness Job per Job visualization, there was a spike on November 1, 2023. The skewness metric of the job multistage-demo showed 9.53, which is significantly higher than the others.

Let’s drill down into the details. You can choose Controls and change the filter conditions based on date and time, Region, AWS account ID, AWS Glue job name, job run ID, and the source and sink of the data stores. For now, let’s filter on the job name multistage-demo.

The filtered Worker Utilization per Job visualization shows 0.5, and its minimum value was 0.16. It seems that there is room for improvement in resource utilization. This observation guides you to enable auto scaling for this job to increase worker utilization, as sketched below.
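
As a hedged illustration of that remediation, the following boto3 sketch enables AWS Glue auto scaling on the job by setting the --enable-auto-scaling job parameter (supported for AWS Glue 3.0 and later) and treating NumberOfWorkers as the upper bound; the worker type and count shown here are assumptions, so keep your job's existing values.

import boto3

glue = boto3.client("glue")

job_name = "multistage-demo"  # the under-utilized job identified in the dashboard
job = glue.get_job(JobName=job_name)["Job"]

# Keep the existing default arguments and add the auto scaling flag.
default_args = dict(job.get("DefaultArguments", {}))
default_args["--enable-auto-scaling"] = "true"

glue.update_job(
    JobName=job_name,
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": default_args,
        "GlueVersion": job.get("GlueVersion", "4.0"),
        "WorkerType": job.get("WorkerType", "G.1X"),
        "NumberOfWorkers": 10,  # acts as the maximum when auto scaling is on
    },
)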

Clean up

Run the following commands to clean up your AWS resources:

  1. Run the following command using the monitoring account to clean up resources:
    $ cdk destroy '*' --profile <MONITORING-PROFILE>

  2. Run the following command using the source account to clean up resources:
    $ cdk destroy MetricSenderStack --profile <SOURCE-PROFILE>

Considerations

The QuickSight integration is designed for flexible analysis: you can aggregate metrics across any fields, and when dealing with many jobs at once, QuickSight insights help you identify problematic jobs.

The QuickSight integration requires additional resources in your environments. The monitoring account needs an AWS Glue database, table, crawler, and S3 bucket, and the ability to run Athena queries to visualize metrics in QuickSight. Each source account needs one metric stream and one Firehose delivery stream. This can incur additional costs.

All the required resources are templatized in AWS CDK.

Conclusion

In this post, we explored how to visualize and analyze AWS Glue job observability metrics on QuickSight using CloudWatch metric streams and SPICE. By connecting the new observability metrics to interactive QuickSight dashboards, you can uncover daily, weekly, and monthly patterns to optimize AWS Glue job usage. The rich visualization capabilities of QuickSight allow you to analyze trends in metrics like worker utilization, error categories, throughput, and more. Aggregating metrics and slicing data by different dimensions such as job name can provide deeper insights.

The sample dashboard showed metrics over time, top errors, and comparative job analytics. These visualizations and reports can be securely shared with teams across the organization. With data-driven insights on the AWS Glue observability metrics, you can have deeper insights on performance bottlenecks, common errors, and more.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Chuhan Liu is a Software Development Engineer on the AWS Glue team. He is passionate about building scalable distributed systems for big data processing, analytics, and management. In his spare time, he enjoys playing tennis.

XiaoRun Yu is a Software Development Engineer on the AWS Glue team. He is working on building new features for AWS Glue to help customers. Outside of work, Xiaorun enjoys exploring new places in the Bay Area.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has a track record of more than 18 years innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems that give customers interactive and simple-to-use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, databases, and data warehouses in the cloud.

Enrich your customer data with geospatial insights using Amazon Redshift, AWS Data Exchange, and Amazon QuickSight

Post Syndicated from Tony Stricker original https://aws.amazon.com/blogs/big-data/enrich-your-customer-data-with-geospatial-insights-using-amazon-redshift-aws-data-exchange-and-amazon-quicksight/

It always pays to know more about your customers, and AWS Data Exchange makes it straightforward to use publicly available census data to enrich your customer dataset.

The United States Census Bureau conducts the US census every 10 years and gathers household survey data. This data is anonymized, aggregated, and made available for public use. The smallest geographic areas for which the Census Bureau collects and aggregates data are census blocks, which are formed by streets, roads, railroads, streams and other bodies of water, other visible physical and cultural features, and the legal boundaries shown on Census Bureau maps.

If you know the census block in which a customer lives, you are able to make general inferences about their demographic characteristics. With these new attributes, you are able to build a segmentation model to identify distinct groups of customers that you can target with personalized messaging. This data is available to subscribe to on AWS Data Exchange—and with data sharing, you don’t need to pay to store a copy of it in your account in order to query it.

In this post, we show how to use customer addresses to enrich a dataset with additional demographic details from the US Census Bureau dataset.

Solution overview

The solution includes the following high-level steps:

  1. Set up an Amazon Redshift Serverless endpoint and load customer data.
  2. Set up a place index in Amazon Location Service.
  3. Write an AWS Lambda user-defined function (UDF) to call Location Service from Amazon Redshift.
  4. Subscribe to census data on AWS Data Exchange.
  5. Use geospatial queries to tag addresses to census blocks.
  6. Create a new customer dataset in Amazon Redshift.
  7. Evaluate new customer data in Amazon QuickSight.

The following diagram illustrates the solution architecture.

architecture diagram

Prerequisites

You can use the following AWS CloudFormation template to deploy the required infrastructure. Before deployment, you need to sign up for QuickSight access through the AWS Management Console.

Load generic address data to Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Redshift Serverless makes it straightforward to run analytics workloads of any size without having to manage data warehouse infrastructure.

To load our address data, we first create a Redshift Serverless workgroup. Then we use Amazon Redshift Query Editor v2 to load customer data from Amazon Simple Storage Service (Amazon S3).

Create a Redshift Serverless workgroup

There are two primary components of the Redshift Serverless architecture:

  • Namespace – A collection of database objects and users. Namespaces group together all of the resources you use in Redshift Serverless, such as schemas, tables, users, datashares, and snapshots.
  • Workgroup – A collection of compute resources. Workgroups have network and security settings that you can configure using the Redshift Serverless console, the AWS Command Line Interface (AWS CLI), or the Redshift Serverless APIs.

To create your namespace and workgroup, refer to Creating a data warehouse with Amazon Redshift Serverless. For this exercise, name your workgroup sandbox and your namespace adx-demo.

Use Query Editor v2 to load customer data from Amazon S3

You can use Query Editor v2 to submit queries and load data to your data warehouse through a web interface. To configure Query Editor v2 for your AWS account, refer to Data load made easy and secure in Amazon Redshift using Query Editor V2. After it’s configured, complete the following steps:

  • Use the following SQL to create the customer_data schema within the dev database in your data warehouse:
CREATE SCHEMA customer_data;
  • Use the following SQL DDL to create your target table into which you’ll load your customer address data:
CREATE TABLE customer_data.customer_addresses (
    address character varying(256) ENCODE lzo,
    unitnumber character varying(256) ENCODE lzo,
    municipality character varying(256) ENCODE lzo,
    region character varying(256) ENCODE lzo,
    postalcode character varying(256) ENCODE lzo,
    country character varying(256) ENCODE lzo,
    customer_id integer ENCODE az64
) DISTSTYLE AUTO;

The file has no column headers and is pipe delimited (|). For information on how to load data from either Amazon S3 or your local desktop, refer to Loading data into a database.
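
If you prefer to script the load rather than use the Query Editor UI, a standard COPY command issued through the Redshift Data API is one option. The following sketch assumes the sandbox workgroup created earlier; the S3 path and IAM role ARN are placeholders you would replace with your own.

import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
COPY customer_data.customer_addresses
FROM 's3://your-bucket/customer_addresses.csv'
IAM_ROLE 'arn:aws:iam::111122223333:role/your-redshift-copy-role'
DELIMITER '|'
REGION 'us-east-1';
"""

# Run the COPY statement against the Redshift Serverless workgroup.
response = redshift_data.execute_statement(
    WorkgroupName="sandbox",
    Database="dev",
    Sql=copy_sql,
)
print("Statement ID:", response["Id"])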

Use Location Service to geocode and enrich address data

Location Service lets you add location data and functionality to applications, which includes capabilities such as maps, points of interest, geocoding, routing, geofences, and tracking.

Our data is in Amazon Redshift, so we need to access the Location Service APIs using SQL statements. Each row of data contains an address that we want to enrich and geotag using the Location Service APIs. Amazon Redshift allows developers to create UDFs using a SQL SELECT clause, Python, or Lambda.

Lambda is a compute service that lets you run code without provisioning or managing servers. With Lambda UDFs, you can write custom functions with complex logic and integrate with third-party components. Scalar Lambda UDFs return one result per invocation of the function—in this case, the Lambda function runs one time for each row of data it receives.

For this post, we write a Lambda function that uses the Location Service API to geotag and validate our customer addresses. Then we register this Lambda function as a UDF with our Redshift instance, allowing us to call the function from a SQL command.

For instructions to create a Location Service place index and create your Lambda function and scalar UDF, refer to Access Amazon Location Service from Amazon Redshift. For this post, we use ESRI as a provider and name the place index placeindex.redshift.
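
For context, a minimal sketch of what such a Lambda handler might look like follows. It assumes the Redshift Lambda UDF batching contract (rows arrive in the event's arguments list and one result is returned per row) and the placeindex.redshift place index named above; error handling is trimmed for brevity.

import json
import boto3

location = boto3.client("location")
PLACE_INDEX = "placeindex.redshift"

def handler(event, context):
    results = []
    # Each entry in "arguments" is one row of UDF inputs:
    # [address, municipality, region, postalcode, country]
    for address, municipality, region, postalcode, country in event["arguments"]:
        text = f"{address}, {municipality}, {region} {postalcode}, {country}"
        resp = location.search_place_index_for_text(
            IndexName=PLACE_INDEX, Text=text, MaxResults=1
        )
        if resp["Results"]:
            place = resp["Results"][0]["Place"]
            lon, lat = place["Geometry"]["Point"]
            results.append(json.dumps(
                {"Label": place.get("Label"), "Longitude": lon, "Latitude": lat}
            ))
        else:
            results.append(None)
    return json.dumps({"success": True, "results": results})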

Test your new function with the following code, which returns the coordinates of the White House in Washington, DC:

select public.f_geocode_address('1600 Pennsylvania Ave.','Washington','DC','20500','USA');

Subscribe to demographic data from AWS Data Exchange

AWS Data Exchange is a data marketplace with more than 3,500 products from over 300 providers delivered—through files, APIs, or Amazon Redshift queries—directly to the data lakes, applications, analytics, and machine learning models that use it.

First, we need to give our Redshift namespace permission via AWS Identity and Access Management (IAM) to access subscriptions on AWS Data Exchange. Then we can subscribe to our sample demographic data. Complete the following steps:

  1. On the IAM console, add the AWSDataExchangeSubscriberFullAccess managed policy to your Amazon Redshift commands access role you assigned when creating the namespace.
  2. On the AWS Data Exchange console, navigate to the dataset ACS – Sociodemographics (USA, Census Block Groups, 2019), provided by CARTO.
  3. Choose Continue to subscribe, then choose Subscribe.

The subscription may take a few minutes to configure.

  1. When your subscription is in place, navigate back to the Redshift Serverless console.
  2. In the navigation pane, choose Datashares.
  3. On the Subscriptions tab, choose the datashare that you just subscribed to.
  4. On the datashare details page, choose Create database from datashare.
  5. Choose the namespace you created earlier and provide a name for the new database that will hold the shared objects from the dataset you subscribed to.

In Query Editor v2, you should see the new database you just created and two new tables: one that holds the block group polygons and another that holds the demographic information for each block group.

Query Editor v2 data source explorer

Join geocoded customer data to census data with geospatial queries

There are two primary types of spatial data: raster and vector data. Raster data is represented as a grid of pixels and is beyond the scope of this post. Vector data is comprised of vertices, edges, and polygons. With geospatial data, vertices are represented as latitude and longitude points and edges are the connections between pairs of vertices. Think of the road connecting two intersections on a map. A polygon is a set of vertices with a series of connecting edges that form a continuous shape. A simple rectangle is a polygon, just as the state border of Ohio can be represented as a polygon. The geography_usa_blockgroup_2019 dataset that you subscribed to has 220,134 rows, each representing a single census block group and its geographic shape.

Amazon Redshift supports the storage and querying of vector-based spatial data with the GEOMETRY and GEOGRAPHY data types. You can use Redshift SQL functions to perform queries such as a point in polygon operation to determine if a given latitude/longitude point falls within the boundaries of a given polygon (such as state or county boundary). In this dataset, you can observe that the geom column in geography_usa_blockgroup_2019 is of type GEOMETRY.

Our goal is to determine which census block (polygon) each of our geotagged addresses falls within so we can enrich our customer records with details that we know about the census block. Complete the following steps:

  • Build a new table with the geocoding results from our UDF:
CREATE TABLE customer_data.customer_addresses_geocoded AS 
select address
    ,unitnumber
    ,municipality
    ,region
    ,postalcode
    ,country
    ,customer_id
    ,public.f_geocode_address(address||' '||unitnumber,municipality,region,postalcode,country) as geocode_result
FROM customer_data.customer_addresses;
  • Use the following code to extract the different address fields and latitude/longitude coordinates from the JSON column and create a new table with the results:
CREATE TABLE customer_data.customer_addresses_points AS
SELECT customer_id
    ,geo_address
    ,address
    ,unitnumber
    ,municipality
    ,region
    ,postalcode
    ,country
    ,longitude
    ,latitude
    ,ST_SetSRID(ST_MakePoint(Longitude, Latitude),4326) as address_point
            --create new geom column of type POINT, set new point SRID = 4326
FROM
(
select customer_id
    ,address
    ,unitnumber
    ,municipality
    ,region
    ,postalcode
    ,country
    ,cast(json_extract_path_text(geocode_result, 'Label', true) as VARCHAR) as geo_address
    ,cast(json_extract_path_text(geocode_result, 'Longitude', true) as float) as longitude
    ,cast(json_extract_path_text(geocode_result, 'Latitude', true) as float) as latitude
        --use json function to extract fields from geocode_result
from customer_data.customer_addresses_geocoded) a;

This code uses the ST_MakePoint function to create a new column from the latitude/longitude coordinates called address_point of type GEOMETRY and subtype POINT. It uses the ST_SetSRID geospatial function to set the spatial reference identifier (SRID) of the new column to 4326.

The SRID defines the spatial reference system to be used when evaluating the geometry data. It’s important when joining or comparing geospatial data that they have matching SRIDs. You can check the SRID of an existing geometry column by using the ST_SRID function. For more information on SRIDs and GEOMETRY data types, refer to Querying spatial data in Amazon Redshift.

  • Now that your customer addresses are geocoded as latitude/longitude points in a geometry column, you can use a join to identify which census block shape your new point falls within:
CREATE TABLE customer_data.customer_addresses_with_census AS
select c.*
    ,shapes.geoid as census_group_shape
    ,demo.*
from customer_data.customer_addresses_points c
inner join "carto_census_data"."carto".geography_usa_blockgroup_2019 shapes
on ST_Contains(shapes.geom, c.address_point)
    --join tables where the address point falls within the census block geometry
inner join carto_census_data.usa_acs.demographics_sociodemographics_usa_blockgroup_2019_yearly_2019 demo
on demo.geoid = shapes.geoid;

The preceding code creates a new table called customer_addresses_with_census, which joins the customer addresses to the census block in which they belong as well as the demographic data associated with that census block.

To do this, you used the ST_CONTAINS function, which accepts two geometry data types as an input and returns TRUE if the 2D projection of the first input geometry contains the second input geometry. In our case, we have census blocks represented as polygons and addresses represented as points. The join in the SQL statement succeeds when the point falls within the boundaries of the polygon.

Visualize the new demographic data with QuickSight

QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people who you work with, wherever they are. QuickSight connects to your data in the cloud and combines data from many different sources.

First, let’s build some new calculated fields that will help us better understand the demographics of our customer base. We can do this in QuickSight, or we can use SQL to build the columns in a Redshift view. The following is the code for a Redshift view:

CREATE VIEW customer_data.customer_features AS (
SELECT customer_id 
    ,postalcode
    ,region
    ,municipality
    ,geoid as census_geoid
    ,longitude
    ,latitude
    ,total_pop
    ,median_age
    ,white_pop/total_pop as perc_white
    ,black_pop/total_pop as perc_black
    ,asian_pop/total_pop as perc_asian
    ,hispanic_pop/total_pop as perc_hispanic
    ,amerindian_pop/total_pop as perc_amerindian
    ,median_income
    ,income_per_capita
    ,median_rent
    ,percent_income_spent_on_rent
    ,unemployed_pop/coalesce(pop_in_labor_force) as perc_unemployment
    ,(associates_degree + bachelors_degree + masters_degree + doctorate_degree)/total_pop as perc_college_ed
    ,(household_language_total - household_language_english)/coalesce(household_language_total) as perc_other_than_english
FROM "dev"."customer_data"."customer_addresses_with_census" t );

To get QuickSight to talk to our Redshift Serverless endpoint, you need a VPC connection configured in QuickSight (you reference it when you set up the data source below). With that in place, complete the following steps to create a new dataset in QuickSight:

  • On the QuickSight console, choose Datasets in the navigation pane.
  • Choose New dataset.

create a new dataset in quicksight

  • We want to create a dataset from a new data source and use the Redshift: Manual connect option.

Redshift manual connection

  • Provide the connection information for your Redshift Serverless workgroup.

You will need the endpoint for our workgroup and the user name and password that you created when you set up your workgroup. You can find your workgroup’s endpoint on the Redshift Serverless console by navigating to your workgroup configuration. The following screenshot is an example of the connection settings needed. Notice the connection type is the name of the VPC connection that you previously configured in QuickSight. When you copy the endpoint from the Redshift console, be sure to remove the database and port number from the end of the URL before entering it in the field.

Redshift edit data source

  • Save the new data source configuration.

You’ll be prompted to choose the table you want to use for your dataset.

  • Choose the new view that you created that has your new derived fields.

Quicksight choose your table

  • Select Directly query your data.

This will connect your visualizations directly to the data in the database rather than ingesting data into the QuickSight in-memory data store.

Directly query your data

  • To create a histogram of median income level, choose the blank visual on Sheet1 and then choose the histogram visual icon under Visual types.
  • Choose median_income under Fields list and drag it to the Value field well.

This builds a histogram showing the distribution of median_income for our customers based on the census block group in which they live.

QuickSight histogram

Conclusion

In this post, we demonstrated how companies can use open census data available on AWS Data Exchange to effortlessly gain a high-level understanding of their customer base from a demographic standpoint. This basic understanding of customers based on where they live can serve as the foundation for more targeted marketing campaigns and even influence product development and service offerings.

As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.


About the Author

Tony Stricker is a Principal Technologist on the Data Strategy team at AWS, where he helps senior executives adopt a data-driven mindset and align their people/process/technology in ways that foster innovation and drive towards specific, tangible business outcomes. He has a background as a data warehouse architect and data scientist and has delivered solutions in to production across multiple industries including oil and gas, financial services, public sector, and manufacturing. In his spare time, Tony likes to hang out with his dog and cat, work on home improvement projects, and restore vintage Airstream campers.

Enhance container software supply chain visibility through SBOM export with Amazon Inspector and QuickSight

Post Syndicated from Jason Ng original https://aws.amazon.com/blogs/security/enhance-container-software-supply-chain-visibility-through-sbom-export-with-amazon-inspector-and-quicksight/

In this post, I’ll show how you can export software bills of materials (SBOMs) for your containers by using an AWS native service, Amazon Inspector, and visualize the SBOMs through Amazon QuickSight, providing a single-pane-of-glass view of your organization’s software supply chain.

The concept of a bill of materials (BOM) originated in the manufacturing industry in the early 1960s. It was used to keep track of the quantities of each material used to manufacture a completed product. If parts were found to be defective, engineers could then use the BOM to identify products that contained those parts. An SBOM extends this concept to software development, allowing engineers to keep track of vulnerable software packages and quickly remediate the vulnerabilities.

Today, most software includes open source components. A Synopsys study, Walking the Line: GitOps and Shift Left Security, shows that 8 in 10 organizations reported using open source software in their applications. Consider a scenario in which you specify an open source base image in your Dockerfile but don’t know what packages it contains. Although this practice can significantly improve developer productivity and efficiency, the decreased visibility makes it more difficult for your organization to manage risk effectively.

It’s important to track the software components and their versions that you use in your applications, because a single affected component used across multiple organizations could result in a major security impact. According to a Gartner report titled Gartner Report for SBOMs: Key Takeaways You Should know, by 2025, 60 percent of organizations building or procuring critical infrastructure software will mandate and standardize SBOMs in their software engineering practice, up from less than 20 percent in 2022. This will help provide much-needed visibility into software supply chain security.

Integrating SBOM workflows into the software development life cycle is just the first step—visualizing SBOMs and being able to search through them quickly is the next step. This post describes how to process the generated SBOMs and visualize them with Amazon QuickSight. AWS also recently added SBOM export capability in Amazon Inspector, which offers the ability to export SBOMs for Amazon Inspector monitored resources, including container images.

Why is vulnerability scanning not enough?

Scanning and monitoring vulnerable components that pose cybersecurity risks is known as vulnerability scanning, and is fundamental to organizations for ensuring a strong and solid security posture. Scanners usually rely on a database of known vulnerabilities, the most common being the Common Vulnerabilities and Exposures (CVE) database.

Identifying vulnerable components with a scanner can prevent an engineer from deploying affected applications into production. You can embed scanning into your continuous integration and continuous delivery (CI/CD) pipelines so that images with known vulnerabilities don’t get pushed into your image repository. However, what if a new vulnerability is discovered but has not been added to the CVE records yet? A good example of this is the Apache Log4j vulnerability, which was first disclosed on Nov 24, 2021 and only added as a CVE on Dec 1, 2021. This means that for 7 days, scanners that relied on the CVE system weren’t able to identify affected components within their organizations. This issue is known as a zero-day vulnerability. Being able to quickly identify vulnerable software components in your applications in such situations would allow you to assess the risk and come up with a mitigation plan without waiting for a vendor or supplier to provide a patch.

In addition, it’s also good hygiene for your organization to track usage of software packages, which provides visibility into your software supply chain. This can improve collaboration between developers, operations, and security teams, because they’ll have a common view of every software component and can collaborate effectively to address security threats.

In this post, I present a solution that uses the new Amazon Inspector feature to export SBOMs from container images, process them, and visualize the data in QuickSight. This gives you the ability to search through your software inventory on a dashboard and to use natural language queries through QuickSight Q, in order to look for vulnerabilities.

Solution overview

Figure 1 shows the architecture of the solution. It is fully serverless, meaning there is no underlying infrastructure you need to manage. This post uses a newly released feature within Amazon Inspector that provides the ability to export a consolidated SBOM for Amazon Inspector monitored resources across your organization in commonly used formats, including CycloneDx and SPDX.

Figure 1: Solution architecture diagram

Figure 1: Solution architecture diagram

The workflow in Figure 1 is as follows:

  1. The image is pushed into Amazon Elastic Container Registry (Amazon ECR), which sends an Amazon EventBridge event.
  2. This invokes an AWS Lambda function, which starts the SBOM generation job for the specific image.
  3. When the job completes, Amazon Inspector deposits the SBOM file in an Amazon Simple Storage Service (Amazon S3) bucket.
  4. Another Lambda function is invoked whenever a new JSON file is deposited. The function performs the data transformation steps and uploads the new file into a new S3 bucket.
  5. Amazon Athena is then used to perform preliminary data exploration.
  6. A dashboard on Amazon QuickSight displays SBOM data.

Implement the solution

This section describes how to deploy the solution architecture.

In this post, you’ll perform the following tasks:

  • Create S3 buckets and AWS KMS keys to store the SBOMs
  • Create an Amazon Elastic Container Registry (Amazon ECR) repository
  • Deploy two AWS Lambda functions to initiate the SBOM generation and transformation
  • Set up Amazon EventBridge rules to invoke Lambda functions upon image push into Amazon ECR
  • Run AWS Glue crawlers to crawl the transformed SBOM S3 bucket
  • Run Amazon Athena queries to review SBOM data
  • Create QuickSight dashboards to identify libraries and packages
  • Use QuickSight Q to identify libraries and packages by using natural language queries

Deploy the CloudFormation stack

The AWS CloudFormation template we’ve provided provisions the S3 buckets that are required for the storage of raw SBOMs and transformed SBOMs, the Lambda functions necessary to initiate and process the SBOMs, and EventBridge rules to run the Lambda functions based on certain events. An empty repository is provisioned as part of the stack, but you can also use your own repository.

To deploy the CloudFormation stack

  1. Download the CloudFormation template.
  2. Browse to the CloudFormation service in your AWS account and choose Create Stack.
  3. Upload the CloudFormation template you downloaded earlier.
  4. For the next step, Specify stack details, enter a stack name.
  5. You can keep the default value of sbom-inspector for EnvironmentName.
  6. Specify the Amazon Resource Name (ARN) of the user or role to be the admin for the KMS key.
  7. Deploy the stack.

Set up Amazon Inspector

If this is the first time you’re using Amazon Inspector, you need to activate the service. In the Getting started with Amazon Inspector topic in the Amazon Inspector User Guide, follow Step 1 to activate the service. This will take some time to complete.

Figure 2: Activate Amazon Inspector

Figure 2: Activate Amazon Inspector

SBOM invocation and processing Lambda functions

This solution uses two Lambda functions written in Python to perform the invocation task and the transformation task.

  • Invocation task — This function is run whenever a new image is pushed into Amazon ECR. It takes in the repository name and image tag variables and passes those into the create_sbom_export function in the SPDX format. This prevents duplicated SBOMs, which helps to keep the S3 data size small.
  • Transformation task — This function is run whenever a new file with the suffix .json is added to the raw S3 bucket. It creates two files, as follows:
    1. It extracts information such as image ARN, account number, package, package version, operating system, and SHA from the SBOM and exports this data to the transformed S3 bucket under a folder named sbom/.
    2. Because each package can have more than one CVE, this function also extracts the CVE from each package and stores it in the same bucket in a directory named cve/. Both files are exported in Apache Parquet so that the file is in a format that is optimized for queries by Amazon Athena.
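
To illustrate the invocation task, the following is a minimal sketch of its handler. It assumes the ECR push event fields and the CreateSbomExport request shape shown here (the filter criteria structure in particular is an assumption), and the bucket name and KMS key ARN come from placeholder environment variables.

import os
import boto3

inspector = boto3.client("inspector2")

def handler(event, context):
    # EventBridge delivers the ECR push details in the event payload.
    detail = event["detail"]
    repository = detail["repository-name"]
    image_tag = detail["image-tag"]

    # Export an SPDX SBOM for just this repository and tag to the raw S3 bucket.
    response = inspector.create_sbom_export(
        reportFormat="SPDX_2_3",
        resourceFilterCriteria={
            "ecrRepositoryName": [{"comparison": "EQUALS", "value": repository}],
            "ecrImageTags": [{"comparison": "EQUALS", "value": image_tag}],
        },
        s3Destination={
            "bucketName": os.environ["RAW_SBOM_BUCKET"],
            "keyPrefix": "sbom-exports",
            "kmsKeyArn": os.environ["SBOM_KMS_KEY_ARN"],
        },
    )
    return {"reportId": response["reportId"]}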

Populate the AWS Glue Data Catalog

To populate the AWS Glue Data Catalog, you need to generate the SBOM files by using the Lambda functions that were created earlier.

To populate the AWS Glue Data Catalog

  1. You can use an existing image, or you can continue on to create a sample image.
  2. Open an AWS Cloudshell terminal.
  3. Run the following commands:
    # Pull the nginx image from a public repo
    docker pull public.ecr.aws/nginx/nginx:1.19.10-alpine-perl
    
    docker tag public.ecr.aws/nginx/nginx:1.19.10-alpine-perl <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl
    
    # Authenticate to ECR, fill in your account id
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com
    
    # Push the image into ECR
    docker push <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl

  4. An image is pushed into the Amazon ECR repository in your account. This invokes the Lambda functions that perform the SBOM export by using Amazon Inspector and converts the SBOM file to Parquet.
  5. Verify that the Parquet files are in the transformed S3 bucket:
    1. Browse to the S3 console and choose the bucket named sbom-inspector-<ACCOUNT-ID>-transformed. You can also track the invocation of each Lambda function in the Amazon CloudWatch log console.
    2. After the transformation step is complete, you will see two folders (cve/ and sbom/) in the transformed S3 bucket. Choose the sbom folder. You will see the transformed Parquet file in it. If there are CVEs present, a similar file will appear in the cve folder.

    The next step is to run an AWS Glue crawler to determine the format, schema, and associated properties of the raw data. You will need to crawl both folders in the transformed S3 bucket and store the schema in separate tables in the AWS Glue Data Catalog.

  6. On the AWS Glue Service console, on the left navigation menu, choose Crawlers.
  7. On the Crawlers page, choose Create crawler. This starts a series of pages that prompt you for the crawler details.
  8. In the Crawler name field, enter sbom-crawler, and then choose Next.
  9. Under Data sources, select Add a data source.
  10. Now you need to point the crawler to your data. On the Add data source page, choose the Amazon S3 data store. The solution in this post doesn’t use a connection, so leave the Connection field blank if it’s visible.
  11. For the option Location of S3 data, choose In this account. Then, for S3 path, enter the path where the crawler can find the sbom and cve data, which is s3://sbom-inspector-<ACCOUNT-ID>-transformed/sbom/ and s3://sbom-inspector-<ACCOUNT-ID>-transformed/cve/. Leave the rest as default and select Add an S3 data source.
     
    Figure 3: Data source for AWS Glue crawler

    Figure 3: Data source for AWS Glue crawler

  12. The crawler needs permissions to access the data store and create objects in the Data Catalog. To configure these permissions, choose Create an IAM role. The AWS Identity and Access Management (IAM) role name starts with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter sbomcrawler, and then choose Next.
  13. Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog. To create a database, choose Add database. In the pop-up window, enter sbom-db for the database name, and then choose Create.
  14. Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back to return to previous pages and make changes. After you’ve reviewed the information, choose Finish to create the crawler.
    Figure 4: Creation of the AWS Glue crawler

    Figure 4: Creation of the AWS Glue crawler

  15. Select the newly created crawler and choose Run.
  16. After the crawler runs successfully, verify that the table is created and the data schema is populated.
     
    Figure 5: Table populated from the AWS Glue crawler

    Figure 5: Table populated from the AWS Glue crawler

Set up Amazon Athena

Amazon Athena performs the initial data exploration and validation. Athena is a serverless interactive analytics service built on open source frameworks that supports open-table and file formats. Athena provides a simplified, flexible way to analyze data in sources like Amazon S3 by using standard SQL queries. If you are SQL proficient, you can query the data source directly; however, not everyone is familiar with SQL. In this section, you run a sample query and initialize the service so that it can be used in QuickSight later on.

To start using Amazon Athena

  1. In the AWS Management Console, navigate to the Athena console.
  2. For Database, select sbom-db (or select the database you created earlier in the crawler).
  3. Navigate to the Settings tab located at the top right corner of the console. For Query result location, select the Athena S3 bucket created from the CloudFormation template, sbom-inspector-<ACCOUNT-ID>-athena.
  4. Keep the defaults for the rest of the settings. You can now return to the Query Editor and start writing and running your queries on the sbom-db database.

You can use the following sample query.

select package, packageversion, cve, sha, imagearn from sbom
left join cve
using (sha, package, packageversion)
where cve is not null;

Your Athena console should look similar to the screenshot in Figure 6.

Figure 6: Sample query with Amazon Athena

Figure 6: Sample query with Amazon Athena

This query joins the two tables and selects only the packages with CVEs identified. Alternatively, you can choose to query for specific packages or identify the most common package used in your organization.

Sample output:

# package packageversion cve sha imagearn
<PACKAGE_NAME> <PACKAGE_VERSION> <CVE> <IMAGE_SHA> <ECR_IMAGE_ARN>
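
For example, a query for the most common packages across your images, run programmatically with the Athena API, might look like the following sketch; the database and output bucket names follow the conventions used earlier in this post.

import time
import boto3

athena = boto3.client("athena")

query = """
SELECT package, count(DISTINCT imagearn) AS image_count
FROM sbom
GROUP BY package
ORDER BY image_count DESC
LIMIT 10;
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sbom-db"},
    ResultConfiguration={"OutputLocation": "s3://sbom-inspector-<ACCOUNT-ID>-athena/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query completes (simplified; add error handling for real use).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print result rows, skipping the header row.
for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][1:]:
    print([col.get("VarCharValue") for col in row["Data"]])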

Visualize data with Amazon QuickSight

Amazon QuickSight is a serverless business intelligence service that is designed for the cloud. In this post, it serves as a dashboard that allows business users who are unfamiliar with SQL to identify zero-day vulnerabilities. This can also reduce the operational effort and time of having to look through several JSON documents to identify a single package across your image repositories. You can then share the dashboard across teams without having to share the underlying data.

QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) is an in-memory engine that QuickSight uses to perform advanced calculations. In a large organization where you could have millions of SBOM records stored in S3, importing your data into SPICE helps to reduce the time to process and serve the data. You can also use the feature to perform a scheduled refresh to obtain the latest data from S3.

QuickSight also has a feature called QuickSight Q. With QuickSight Q, you can use natural language to interact with your data. If this is the first time you are initializing QuickSight, subscribe to QuickSight and select Enterprise + Q. It will take roughly 20–30 minutes to initialize for the first time. Otherwise, if you are already using QuickSight, you will need to enable QuickSight Q by subscribing to it in the QuickSight console.

Finally, in QuickSight you can select different data sources, such as Amazon S3 and Athena, to create custom visualizations. In this post, we will use the two Athena tables as the data source to create a dashboard to keep track of the packages used in your organization and the resulting CVEs that come with them.

Prerequisites for setting up the QuickSight dashboard

This process will be used to create the QuickSight dashboard from a template already pre-provisioned through the command line interface (CLI). It also grants the necessary permissions for QuickSight to access the data source. You will need the following:

  • AWS Command Line Interface (AWS CLI) programmatic access with read and write permissions to QuickSight.
  • A QuickSight + Q subscription (only if you want to use the Q feature).
  • QuickSight permissions to Amazon S3 and Athena (enable these through the QuickSight security and permissions interface).
  • Set the default AWS Region where you want to deploy the QuickSight dashboard. This post assumes that you’re using the us-east-1 Region.

Create datasets

In QuickSight, create two datasets, one for the sbom table and another for the cve table.

  1. In the QuickSight console, select the Dataset tab.
  2. Choose Create dataset, and then select the Athena data source.
  3. Name the data source sbom and choose Create data source.
  4. Select the sbom table.
  5. Choose Visualize to complete the dataset creation. (Delete the analyses automatically created for you because you will create your own analyses afterwards.)
  6. Navigate back to the main QuickSight page and repeat steps 1–4 for the cve dataset.

Merge datasets

Next, merge the two datasets to create the combined dataset that you will use for the dashboard.

  1. On the Datasets tab, edit the sbom dataset and add the cve dataset.
  2. Set three join clauses, as follows:
    1. Sha : Sha
    2. Package : Package
    3. Packageversion : Packageversion
  3. Perform a left merge, which will append the cve ID to the package and package version in the sbom dataset.
     
    Figure 7: Combining the sbom and cve datasets

    Figure 7: Combining the sbom and cve datasets

Next, you will create a dashboard based on the combined sbom dataset.

Prepare configuration files

In your terminal, export the following variables. Substitute <QuickSight username> in the QS_USER_ARN variable with your own username, which can be found in the Amazon QuickSight console.

export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export TEMPLATE_ID="sbom_dashboard"
export QS_USER_ARN=$(aws quicksight describe-user --aws-account-id $ACCOUNT_ID --namespace default --user-name <QuickSight username> | jq .User.Arn)
export QS_DATA_ARN=$(aws quicksight search-data-sets --aws-account-id $ACCOUNT_ID --filters Name="DATASET_NAME",Operator="StringLike",Value="sbom" | jq .DataSetSummaries[0].Arn)

Validate that the variables are set properly. This is required for you to move on to the next step; otherwise you will run into errors.

echo ACCOUNT_ID is $ACCOUNT_ID || echo ACCOUNT_ID is not set
echo TEMPLATE_ID is $TEMPLATE_ID || echo TEMPLATE_ID is not set
echo QUICKSIGHT USER ARN is $QS_USER_ARN || echo QUICKSIGHT USER ARN is not set
echo QUICKSIGHT DATA ARN is $QS_DATA_ARN || echo QUICKSIGHT DATA ARN is not set

Next, use the following commands to create the dashboard from a predefined template and create the IAM permissions needed for the user to view the QuickSight dashboard.

cat << EOF > ./dashboard.json
{
    "SourceTemplate": {
      "DataSetReferences": [
        {
          "DataSetPlaceholder": "sbom",
          "DataSetArn": $QS_DATA_ARN
        }
      ],
      "Arn": "arn:aws:quicksight:us-east-1:293424211206:template/sbom_qs_template"
    }
}
EOF

cat << EOF > ./dashboardpermissions.json
[
    {
      "Principal": $QS_USER_ARN,
      "Actions": [
        "quicksight:DescribeDashboard",
        "quicksight:ListDashboardVersions",
        "quicksight:UpdateDashboardPermissions",
        "quicksight:QueryDashboard",
        "quicksight:UpdateDashboard",
        "quicksight:DeleteDashboard",
        "quicksight:DescribeDashboardPermissions",
        "quicksight:UpdateDashboardPublishedVersion"
      ]
    }
]
EOF

Run the following commands to create the dashboard in your QuickSight console.

aws quicksight create-dashboard --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID --name sbom-dashboard --source-entity file://dashboard.json

Note: Run the following describe-dashboard command, and confirm that the response contains a status code of 200. The 200-status code means that the dashboard exists.

aws quicksight describe-dashboard --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID

Use the following update-dashboard-permissions AWS CLI command to grant the appropriate permissions to QuickSight users.

aws quicksight update-dashboard-permissions --aws-account-id $ACCOUNT_ID --dashboard-id $ACCOUNT_ID --grant-permissions file://dashboardpermissions.json

You should now be able to see the dashboard in your QuickSight console, similar to the one in Figure 8. It’s an interactive dashboard that shows you the number of vulnerable packages you have in your repositories and the specific CVEs that come with them. You can navigate to the specific image by selecting the CVE (middle right bar chart) or list images with a specific vulnerable package (bottom right bar chart).

Note: You won’t see the exact same graph as in Figure 8. It will change according to the image you pushed in.

Figure 8: QuickSight dashboard containing SBOM information

Figure 8: QuickSight dashboard containing SBOM information

Alternatively, you can use QuickSight Q to extract the same information from your dataset through natural language. You will need to create a topic and add the dataset you added earlier. For detailed information on how to create a topic, see the Amazon QuickSight User Guide. After QuickSight Q has completed indexing the dataset, you can start to ask questions about your data.

Figure 9: Natural language query with QuickSight Q

Figure 9: Natural language query with QuickSight Q

Conclusion

This post discussed how you can use Amazon Inspector to export SBOMs to improve software supply chain transparency. Container SBOM export should be part of your supply chain mitigation strategy and monitored in an automated manner at scale.

Although it is a good practice to generate SBOMs, it would provide little value if there was no further analysis being done on them. This solution enables you to visualize your SBOM data through a dashboard and natural language, providing better visibility into your security posture. Additionally, this solution is also entirely serverless, meaning there are no agents or sidecars to set up.

To learn more about exporting SBOMs with Amazon Inspector, see the Amazon Inspector User Guide.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Jason Ng

Jason Ng

Jason is a Cloud Sales Center Solutions Architect at AWS. He works with enterprise and independent software vendor (ISV) greenfield customers in ASEAN countries and is part of the Containers Technical Field Community (TFC). He enjoys helping customers modernize their applications, drive growth, and reduce total cost of ownership.

Reference guide to analyze transactional data in near-real time on AWS

Post Syndicated from Jason Dalba original https://aws.amazon.com/blogs/big-data/reference-guide-to-analyze-transactional-data-in-near-real-time-on-aws/

Business leaders and data analysts use near-real-time transaction data to understand buyer behavior to help evolve products. The primary challenge businesses face with near-real-time analytics is getting the data prepared for analytics in a timely manner, which can often take days. Companies commonly maintain entire teams to facilitate the flow of data from ingestion to analysis.

The consequence of delays in your organization’s analytics workflow can be costly. As online transactions have gained popularity with consumers, the volume and velocity of data ingestion has led to challenges in data processing. Consumers expect more fluid changes to service and products. Organizations that can’t quickly adapt their business strategy to align with consumer behavior may experience loss of opportunity and revenue in competitive markets.

To overcome these challenges, businesses need a solution that can provide near-real-time analytics on transactional data with services that don’t lead to latent processing and bloat from managing the pipeline. With a properly deployed architecture using the latest technologies in artificial intelligence (AI), data storage, streaming ingestions, and cloud computing, data will become more accurate, timely, and actionable. With such a solution, businesses can make actionable decisions in near-real time, allowing leaders to change strategic direction as soon as the market changes.

In this post, we discuss how to architect a near-real-time analytics solution with AWS managed analytics, AI and machine learning (ML), and database services.

Solution overview

The most common workloads, agnostic of industry, involve transactional data. Transactional data volumes and velocity have continued to rapidly expand as workloads have been pushed online. Near-real-time data is data stored, processed, and analyzed on a continual basis. It generates information that is available for use almost immediately after being generated. With the power of near-real-time analytics, business units across an organization, including sales, marketing, and operations, can make agile, strategic decisions. Without the proper architecture to support near real-time analytics, organizations will be dependent on delayed data and will not be able to capitalize on emerging opportunities. Missed opportunities could impact operational efficiency, customer satisfaction, or product innovation.

Managed AWS Analytics and Database services allow for each component of the solution, from ingestion to analysis, to be optimized for speed, with little management overhead. It is crucial for critical business solutions to follow the six pillars of the AWS Well-Architected Framework. The framework helps cloud architects build the most secure, high performing, resilient, and efficient infrastructure for critical workloads.

The following diagram illustrates the solution architecture.

Solution architecture

By combining the appropriate AWS services, your organization can run near-real-time analytics off a transactional data store. In the following sections, we discuss the key components of the solution.

Transactional data storage

In this solution, we use Amazon DynamoDB as our transactional data store. DynamoDB is a managed NoSQL database solution that acts as a key-value store for transactional data. As a NoSQL solution, DynamoDB is optimized for compute (as opposed to storage) and therefore the data needs to be modeled and served up to the application based on how the application needs it. This makes DynamoDB good for applications with known access patterns, which is a property of many transactional workloads.

In DynamoDB, you can create, read, update, or delete items in a table through a partition key. For example, if you want to keep track of how many fitness quests a user has completed in your application, you can query the partition key of the user ID to find the item with an attribute that holds data related to completed quests, then update the relevant attribute to reflect a specific quest’s completion. There are also some added benefits of DynamoDB by design, such as the ability to scale to support massive global internet-scale applications while maintaining consistent single-digit millisecond latency performance, because the data is horizontally partitioned across the underlying storage nodes by the service itself through the partition keys. Modeling your data here is very important so DynamoDB can horizontally scale based on a partition key, which is again why it’s a good fit for a transactional store. In transactional workloads, when you know what the access patterns are, it’s easier to optimize a data model around those patterns than to create a data model to accept ad hoc requests. All that being said, DynamoDB doesn’t perform scans across many items as efficiently, so for this solution, we integrate DynamoDB with other services to help meet the data analysis requirements.
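
A hedged sketch of that access pattern with boto3 follows; the table name, partition key, and attribute names are assumptions for this example.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("fitness_app")  # hypothetical table with partition key user_id

# Atomically increment the user's completed-quest counter.
table.update_item(
    Key={"user_id": "user-123"},
    UpdateExpression="ADD quests_completed :inc",
    ExpressionAttributeValues={":inc": 1},
)

# Read the item back by its partition key.
item = table.get_item(Key={"user_id": "user-123"}).get("Item")
print(item)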

Data streaming

Now that we have stored our workload’s transactional data in DynamoDB, we need to move that data to another service that will be better suited for analysis of said data. The time to insights on this data matters, so rather than send data off in batches, we stream the data into an analytics service, which helps us get the near-real time aspect of this solution.

We use Amazon Kinesis Data Streams to stream the data from DynamoDB to Amazon Redshift for this specific solution. Kinesis Data Streams captures item-level modifications in DynamoDB tables and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. Additionally, with the enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Kinesis Data Streams also provides durability and elasticity. The delay between the time a record is put into the stream and the time it can be retrieved (put-to-get delay) is typically less than 1 second. In other words, a Kinesis Data Streams application can start consuming the data from the stream almost immediately after the data is added. The managed service aspect of Kinesis Data Streams relieves you of the operational burden of creating and running a data intake pipeline. The elasticity of Kinesis Data Streams enables you to scale the stream up or down, so you never lose data records before they expire.
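
Enabling this capture is a one-time configuration per table. The following sketch turns on a Kinesis data stream as the streaming destination for a DynamoDB table; the table and stream names are placeholders.

import boto3

kinesis = boto3.client("kinesis")
dynamodb = boto3.client("dynamodb")

# Create (or reuse) the data stream that will receive item-level changes.
kinesis.create_stream(StreamName="transactions-stream", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="transactions-stream")

stream_arn = kinesis.describe_stream_summary(StreamName="transactions-stream")[
    "StreamDescriptionSummary"]["StreamARN"]

# Replicate item-level modifications from the DynamoDB table into the stream.
dynamodb.enable_kinesis_streaming_destination(
    TableName="fitness_app",  # placeholder table name
    StreamArn=stream_arn,
)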

Analytical data storage

The next service in this solution is Amazon Redshift, a fully managed, petabyte-scale data warehouse service in the cloud. As opposed to DynamoDB, which is meant to update, delete, or read more specific pieces of data, Amazon Redshift is better suited for analytic queries where you are retrieving, comparing, and evaluating large amounts of data in multi-stage operations to produce a final result. Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes.

Beyond just the fact that Amazon Redshift is built for analytical queries, it can natively integrate with Amazon streaming engines. Amazon Redshift Streaming Ingestion ingests hundreds of megabytes of data per second, so you can query data in near-real time and drive your business forward with analytics. With this zero-ETL approach, Amazon Redshift Streaming Ingestion enables you to connect to multiple Kinesis data streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) data streams and pull data directly to Amazon Redshift without staging data in Amazon Simple Storage Service (Amazon S3). You can define a schema or choose to ingest semi-structured data with the SUPER data type. With streaming ingestion, a materialized view is the landing area for the data read from the Kinesis data stream, and the data is processed as it arrives. When the view is refreshed, Redshift compute nodes allocate each data shard to a compute slice. We recommend you enable auto refresh for this materialized view so that your data is continuously updated.
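
A minimal sketch of that setup, issued through the Redshift Data API, is shown below; the IAM role ARN, stream name, workgroup, and column list are placeholders that depend on your own stream and payload.

import time
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    # Map an external schema onto the Kinesis data stream.
    """
    CREATE EXTERNAL SCHEMA kds
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-streaming-role';
    """,
    # Land stream records in a materialized view that refreshes automatically.
    """
    CREATE MATERIALIZED VIEW transactions_stream AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           shard_id,
           sequence_number,
           json_parse(kinesis_data) AS payload
    FROM kds."transactions-stream";
    """,
]

for sql in statements:
    stmt = redshift_data.execute_statement(
        WorkgroupName="your-workgroup",  # placeholder Redshift Serverless workgroup
        Database="dev",
        Sql=sql,
    )
    # Wait for each statement to finish before running the next one.
    while True:
        status = redshift_data.describe_statement(Id=stmt["Id"])["Status"]
        if status in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)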

Data analysis and visualization

After the data pipeline is set up, the last piece is data analysis with Amazon QuickSight to visualize the changes in consumer behavior. QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people who you work with, wherever they are.

QuickSight connects to your data in the cloud and combines data from many different sources. In a single data dashboard, QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more. As a fully managed cloud-based service, QuickSight provides enterprise-grade security, global availability, and built-in redundancy. It also provides the user-management tools that you need to scale from 10 users to 10,000, all with no infrastructure to deploy or manage.

QuickSight gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. They have secure access to dashboards from any device on your network and from mobile devices. Connecting QuickSight to the rest of our solution will complete the flow of data from being initially ingested into DynamoDB to being streamed into Amazon Redshift. QuickSight can create a visual analysis of the data in near-real time because that data is relatively up to date, so this solution can support use cases for making quick decisions on transactional data.

Using AWS for data services allows for each component of the solution, from ingestion to storage to analysis, to be optimized for speed and with little management overhead. With these AWS services, business leaders and analysts can get near-real-time insights to drive immediate change based on customer behavior, enabling organizational agility and ultimately leading to customer satisfaction.

Next steps

The next step to building a solution to analyze transactional data in near-real time on AWS would be to go through the workshop Enable near real-time analytics on data stored in Amazon DynamoDB using Amazon Redshift. In the workshop, you will get hands-on with AWS managed analytics, AI/ML, and database services to dive deep into an end-to-end solution delivering near-real-time analytics on transactional data. By the end of the workshop, you will have gone through the configuration and deployment of the critical pieces that will enable users to perform analytics on transactional workloads.

Conclusion

Developing an architecture that serves transactional data to near-real-time analytics on AWS can help businesses become more agile in making critical decisions. By ingesting and processing transactional data delivered directly from the application on AWS, businesses can optimize their inventory levels, reduce holding costs, increase revenue, and enhance customer satisfaction.

The end-to-end solution is designed for individuals in various roles, such as business users, data engineers, data scientists, and data analysts, who are responsible for comprehending, creating, and overseeing processes related to retail inventory forecasting. Overall, being able to analyze near-real-time transactional data on AWS provides businesses with timely insights, allowing for quicker decision-making in fast-paced industries.


About the Authors

Jason D’Alba is an AWS Solutions Architect leader focused on database and enterprise applications, helping customers architect highly available and scalable database solutions.

Veerendra Nayak is a Principal Database Solutions Architect based in the Bay Area, California. He works with customers to share best practices on database migrations, resiliency, and integrating operational data with analytics and AI services.

Evan Day is a Database Solutions Architect at AWS, where he helps customers define technical solutions for business problems using the breadth of managed database services on AWS. He also focuses on building solutions that are reliable, performant, and cost efficient.

Automate AWS Clean Rooms querying and dashboard publishing using AWS Step Functions and Amazon QuickSight – Part 2

Post Syndicated from Venkata Kampana original https://aws.amazon.com/blogs/big-data/automate-aws-clean-rooms-querying-and-dashboard-publishing-using-aws-step-functions-and-amazon-quicksight-part-2/

Public health organizations need access to data insights that they can quickly act upon, especially in times of health emergencies, when data needs to be updated multiple times daily. For example, during the COVID-19 pandemic, access to timely data insights was critically important for public health agencies worldwide as they coordinated emergency response efforts. Up-to-date information and analysis empowered organizations to monitor the rapidly changing situation and direct resources accordingly.

This is the second post in this series; we recommend that you read the first post before diving deep into this solution. In our first post, Enable data collaboration among public health agencies with AWS Clean Rooms – Part 1, we showed how public health agencies can create AWS Clean Rooms collaborations, invite other stakeholders to join the collaboration, and run queries on their collective data without either party having to share or copy the underlying data. As mentioned in the previous post, AWS Clean Rooms enables multiple organizations to analyze their data and unlock insights they can act upon, without having to share sensitive, restricted, or proprietary records.

However, public health organization leaders and decision-making officials don’t directly access data collaboration outputs from their Amazon Simple Storage Service (Amazon S3) buckets. Instead, they rely on up-to-date dashboards that help them visualize data insights to make informed decisions quickly.

To ensure these dashboards showcase the most updated insights, the organization’s builders and data architects need to catalog and update AWS Clean Rooms collaboration outputs on an ongoing basis, which often involves repetitive and manual processes that, if not done well, could delay the organization’s access to the latest data insights.

Manually handling repetitive daily tasks at scale poses risks like delayed insights, miscataloged outputs, or broken dashboards. At a large volume, it would require around-the-clock staffing, straining budgets. This manual approach could expose decision-makers to inaccurate or outdated information.

Automating repetitive workflows, validation checks, and programmatic dashboard refreshes removes human bottlenecks and helps decrease inaccuracies. Automation helps ensure continuous, reliable processes that deliver the most current data insights to leaders without delays, all while streamlining resources.

In this post, we explain an automated workflow using AWS Step Functions and Amazon QuickSight to help organizations access the most current results and analyses, without delays from manual data handling steps. This workflow implementation will empower decision-makers with real-time visibility into the evolving collaborative analysis outputs, ensuring they have up-to-date, relevant insights that they can act upon quickly.

Solution overview

The following reference architecture illustrates some of the foundational components of clean rooms query automation and publishing dashboards using AWS services. We automate running queries using Step Functions with Amazon EventBridge schedules, build an AWS Glue Data Catalog on query outputs, and publish dashboards using QuickSight so they automatically refresh with new data. This allows public health teams to monitor the most recent insights without manual updates.

The architecture consists of the following components, as numbered in the preceding figure:

  1. A scheduled event rule on EventBridge triggers a Step Functions workflow.
  2. The Step Functions workflow initiates the run of a query using the StartProtectedQuery AWS Clean Rooms API (a minimal API sketch follows this list). The submitted query runs securely within the AWS Clean Rooms environment, ensuring data privacy and compliance. The results of the query are then stored in a designated S3 bucket, with a unique protected query ID serving as the prefix for the stored data. This unique identifier is generated by AWS Clean Rooms for each query run, maintaining clear segregation of results.
  3. When the AWS Clean Rooms query completes successfully, the Step Functions workflow calls the AWS Glue API to update the location of the table in the AWS Glue Data Catalog with the Amazon S3 location where the query results were uploaded in Step 2.
  4. Amazon Athena uses the catalog from the Data Catalog to query the information using standard SQL.
  5. QuickSight is used to query, build visualizations, and publish dashboards using the data from the query results.
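For illustration, the following minimal sketch in the AWS SDK for Python (Boto3) shows the query submission and status polling performed in steps 2 and 3 above; the membership ID, SQL text, bucket, and key prefix are placeholders, not values from this solution.

import time
import boto3

cleanrooms = boto3.client("cleanrooms")

MEMBERSHIP_ID = "your-membership-id"   # from the collaboration created in Part 1

# Submit the query and point the results at an S3 bucket (step 2).
start = cleanrooms.start_protected_query(
    type="SQL",
    membershipIdentifier=MEMBERSHIP_ID,
    sqlParameters={"queryString": "SELECT region, COUNT(*) FROM immunizations GROUP BY region"},
    resultConfiguration={
        "outputConfiguration": {
            "s3": {
                "resultFormat": "CSV",
                "bucket": "cleanrooms-query-execution-results-example",  # placeholder bucket
                "keyPrefix": "daily-run",
            }
        }
    },
)
query_id = start["protectedQuery"]["id"]

# Poll until the query reaches a terminal state; the state machine waits 30 seconds between checks.
while True:
    status = cleanrooms.get_protected_query(
        membershipIdentifier=MEMBERSHIP_ID,
        protectedQueryIdentifier=query_id,
    )["protectedQuery"]["status"]
    if status in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(30)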

Prerequisites

For this walkthrough, you need the following:

Launch the CloudFormation stack

In this post, we provide a CloudFormation template to create the following resources:

  • An EventBridge rule that triggers the Step Functions state machine on a schedule
  • An AWS Glue database and a catalog table
  • An Athena workgroup
  • Three S3 buckets:
    • For AWS Clean Rooms to upload the results of query runs
    • For Athena to upload the results for the queries
    • For storing access logs of other buckets
  • A Step Functions workflow designed to run the AWS Clean Rooms query, upload the results to an S3 bucket, and update the table location with the S3 path in the AWS Glue Data Catalog
  • An AWS Key Management Service (AWS KMS) customer-managed key to encrypt the data in S3 buckets
  • AWS Identity and Access Management (IAM) roles and policies with the necessary permissions

To create the necessary resources, complete the following steps:

  1. Choose Launch Stack:

Launch Button

  2. Enter cleanrooms-query-automation-blog for Stack name.
  3. Enter the membership ID from the AWS Clean Rooms collaboration you created in Part 1 of this series.
  4. Choose Next.
  5. Choose Next again.
  6. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  7. Choose Create stack.

After you run the CloudFormation template and create the resources, you can find the following information on the stack Outputs tab on the AWS CloudFormation console:

  • AthenaWorkGroup – The Athena workgroup
  • EventBridgeRule – The EventBridge rule triggering the Step Functions state machine
  • GlueDatabase – The AWS Glue database
  • GlueTable – The AWS Glue table storing metadata for AWS Clean Rooms query results
  • S3Bucket – The S3 bucket where AWS Clean Rooms uploads query results
  • StepFunctionsStateMachine – The Step Functions state machine

Test the solution

The EventBridge rule named cleanrooms_query_execution_Stepfunctions_trigger is scheduled to trigger every hour. When this rule is triggered, it initiates the run of the CleanRoomsBlogStateMachine-XXXXXXX Step Functions state machine. Complete the following steps to test the end-to-end flow of this solution:

  1. On the Step Functions console, navigate to the state machine you created.
  2. On the state machine details page, locate the latest query run.

The details page lists the completed steps:

  • The state machine submits a query to AWS Clean Rooms using the startProtectedQuery API. The output of the API includes the query run ID and its status.
  • The state machine waits for 30 seconds before checking the status of the query run.
  • After 30 seconds, the state machine checks the query status using the getProtectedQuery API. When the status changes to SUCCESS, it proceeds to the next step to retrieve the AWS Glue table metadata information. The output of this step contains the S3 location to which the query run results are uploaded.
  • The state machine retrieves the metadata of the AWS Glue table named patientimmunization, which was created via the CloudFormation stack.
  • The state machine updates the S3 location (the location to which AWS Clean Rooms uploaded the results) in the metadata of the AWS Glue table (see the API sketch after this procedure).
  • After a successful update of the AWS Glue table metadata, the state machine is complete.
  3. On the Athena console, switch the workgroup to CustomWorkgroup.
  4. Run the following query:

SELECT * FROM "cleanrooms_patientdb"."patientimmunization" LIMIT 10;
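For reference, the table-location update performed by the state machine corresponds to an AWS Glue UpdateTable call along the lines of the following sketch; the database and table names match the ones created by the CloudFormation stack, while the S3 path is a placeholder built from the protected query ID.

import boto3

glue = boto3.client("glue")

def point_table_at_new_results(database: str, table: str, new_s3_location: str) -> None:
    # Fetch the current table definition, swap the S3 location, and write it back.
    current = glue.get_table(DatabaseName=database, Name=table)["Table"]
    table_input = {
        "Name": current["Name"],
        "StorageDescriptor": {**current["StorageDescriptor"], "Location": new_s3_location},
        "PartitionKeys": current.get("PartitionKeys", []),
        "TableType": current.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": current.get("Parameters", {}),
    }
    glue.update_table(DatabaseName=database, TableInput=table_input)

point_table_at_new_results(
    "cleanrooms_patientdb",
    "patientimmunization",
    "s3://cleanrooms-query-execution-results-example/<protected-query-id>/",  # placeholder path
)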

Visualize the data with QuickSight

Now that you can query your data in Athena, you can use QuickSight to visualize the results. Let’s start by granting QuickSight access to the S3 bucket where your AWS Clean Rooms query results are stored.

Grant QuickSight access to Athena and your S3 bucket

First, grant QuickSight access to the S3 bucket:

  1. Sign in to the QuickSight console.
  2. Choose your user name, then choose Manage QuickSight.
  3. Choose Security and permissions.
  4. For QuickSight access to AWS services, choose Manage.
  5. For Amazon S3, choose Select S3 buckets, and choose the S3 bucket named cleanrooms-query-execution-results-XX-XXXX-XXXXXXXXXXXX (the Xs represent the AWS Region and account number where the solution is deployed).
  6. Choose Save.

Create your datasets and publish visuals

Before you can analyze and visualize the data in QuickSight, you must create datasets for your Athena tables.

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Select Athena.
  4. Enter a name for your dataset.
  5. Choose Create data source.
  6. Choose the AWS Glue database cleanrooms_patientdb and select the table PatientImmunization.
  7. Select Directly query your data.
  8. Choose Visualize.

  9. On the Analysis tab, choose the visual type of your choice and add visuals.

Clean up

Complete the following steps to clean up your resources when you no longer need this solution:

  1. Manually delete the S3 buckets and the data stored in the bucket.
  2. Delete the CloudFormation templates.
  3. Delete the QuickSight analysis.
  4. Delete the data source.

Conclusion

In this post, we demonstrated how to automate running AWS Clean Rooms queries using an API call from Step Functions. We also showed how to update the query results information on the existing AWS Glue table, query the information using Athena, and create visuals using QuickSight.

The automated workflow solution delivers real-time insights from AWS Clean Rooms collaborations to decision makers through automated checks for new outputs, processing, and Amazon QuickSight dashboard refreshes. This eliminates manual handling tasks, enabling faster data-driven decisions based on the latest analyses. Additionally, automation frees up staff resources to focus on more strategic initiatives rather than repetitive updates.

Contact the public sector team directly to learn more about how to set up this solution, or reach out to your AWS account team to engage on a proof of concept of this solution for your organization.

About AWS Clean Rooms

AWS Clean Rooms helps companies and their partners more easily and securely analyze and collaborate on their collective datasets—without sharing or copying one another’s underlying data. With AWS Clean Rooms, you can create a secure data clean room in minutes, and collaborate with any other company on the AWS Cloud to generate unique insights about advertising campaigns, investment decisions, and research and development.

The AWS Clean Rooms team is continually building new features to help you collaborate. Watch this video to learn more about privacy-enhanced collaboration with AWS Clean Rooms.

Check out more AWS Partners or contact an AWS Representative to learn how we can help accelerate your business.

Additional resources


About the Authors

Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Jim Daniel is the Public Health lead at Amazon Web Services. Previously, he held positions with the United States Department of Health and Human Services for nearly a decade, including Director of Public Health Innovation and Public Health Coordinator. Before his government service, Jim served as the Chief Information Officer for the Massachusetts Department of Public Health.

AWS Weekly Roundup — Amazon API Gateway, AWS Step Functions, Amazon ECS, Amazon EKS, Amazon LightSail, Amazon VPC, and more — January 29, 2024

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-amazon-api-gateway-aws-step-functions-amazon-ecs-amazon-eks-amazon-lightsail-amazon-vpc-and-more-january-29-2024/

This past week our service teams continue to innovate on your behalf, and a lot has happened in the Amazon Web Services (AWS) universe. I’ll also share the AWS Community events and initiatives that are happening around the world.

Let’s dive in!

Last week’s launches
Here are some launches that got my attention:

AWS Step Functions adds integration for 33 services including Amazon Q – AWS Step Functions is a visual workflow service capable of orchestrating more than 11,000 API actions from over 220 AWS services to help customers build distributed applications at scale. This week, AWS Step Functions expands its AWS SDK integrations with support for 33 additional AWS services, including Amazon Q, AWS B2B Data Interchange, and Amazon CloudFront KeyValueStore.

Amazon Elastic Container Service (Amazon ECS) Service Connect introduces support for automatic traffic encryption with TLS Certificates – Amazon ECS launches support for automatic traffic encryption with Transport Layer Security (TLS) certificates for its networking capability called ECS Service Connect. With this support, ECS Service Connect allows your applications to establish a secure connection by encrypting your network traffic.

Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon EKS Distro support Kubernetes version 1.29 – Kubernetes version 1.29 introduced several new features and bug fixes. You can create new EKS clusters using v1.29 and upgrade your existing clusters to v1.29 using the Amazon EKS console, the eksctl command line interface, or through an infrastructure-as-code (IaC) tool.

IPv6 instance bundles on Amazon Lightsail – With these new instance bundles, you can get up and running quickly on IPv6-only without the need for a public IPv4 address with the ease of use and simplicity of Amazon Lightsail. If you have existing Lightsail instances with a public IPv4 address, you can migrate your instances to IPv6-only in a few simple steps.

Amazon Virtual Private Cloud (Amazon VPC) supports idempotency for route table and network ACL creation – Idempotent creation of route tables and network ACLs is intended for customers that use network orchestration systems or automation scripts that create route tables and network ACLs as part of a workflow. It allows you to safely retry creation without additional side effects.

Amazon Interactive Video Service (Amazon IVS) announces audio-only pricing for Low-Latency Streaming – Amazon IVS is a managed live streaming solution that is designed to make low-latency or real-time video available to viewers around the world. It now offers audio-only pricing for its Low-Latency Streaming capability at 1/10th of the existing HD video rate.

Sellers can resell third-party professional services in AWS Marketplace – AWS Marketplace sellers, including independent software vendors (ISVs), consulting partners, and channel partners, can now resell third-party professional services in AWS Marketplace. Services can include implementation, assessments, managed services, training, or premium support.

Introducing the AWS Small and Medium Business (SMB) Competency – This is the first go-to-market AWS Specialization designed for partners who deliver to small and medium-sized customers. The SMB Competency provides enhanced benefits for AWS Partners to invest and focus on SMB customer business, such as becoming the go-to standard for participation in new pilots and sales initiatives and receiving unique access to scale demand generation engines.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

X in Y – We launched existing services and instance types in additional Regions:

Other AWS news
Here are some additional projects, programs, and news items that you might find interesting:

Export a Software Bill of Materials using Amazon Inspector – Generating an SBOM gives you critical security information that offers you visibility into specifics about your software supply chain, including the packages you use the most frequently and the related vulnerabilities that might affect your whole company. My colleague Varun Sharma in South Africa shows how to export a consolidated SBOM for the resources monitored by Amazon Inspector across your organization in industry standard formats, including CycloneDX and SPDX. The post also shares insights and approaches for analyzing SBOM artifacts using Amazon Athena.

AWS open source news and updates – My colleague Ricardo writes this weekly open source newsletter in which he highlights new open source projects, tools, and demos from the AWS Community.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS Innovate: AI/ML and Data Edition – Register now for the Asia Pacific & Japan AWS Innovate online conference on February 22, 2024, to explore, discover, and learn how to innovate with artificial intelligence (AI) and machine learning (ML). Choose from over 50 sessions in three languages and get hands-on with technical demos aimed at generative AI builders.

AWS Summit Paris 2024 – The AWS Summit Paris is an annual event that is held in Paris, France. It is a great opportunity for cloud computing professionals from all over the world to learn about the latest AWS technologies, network with other professionals, and collaborate on projects. The Summit is free to attend and features keynote presentations, breakout sessions, and hands-on labs. Registrations are open!

AWS Community re:Invent re:Caps – Join a Community re:Cap event organized by volunteers from AWS User Groups and AWS Cloud Clubs around the world to learn about the latest announcements from AWS re:Invent.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— seb

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

Post Syndicated from Arun Sudhir original https://aws.amazon.com/blogs/big-data/how-eightfold-ai-implemented-metadata-security-in-a-multi-tenant-data-analytics-environment-with-amazon-redshift/

This is a guest post co-written with Arun Sudhir from Eightfold AI.

Eightfold is transforming the world of work by providing solutions that empower organizations to recruit and retain a diverse global workforce. Eightfold is a leader in AI products for enterprises to build on their talent’s existing skills. From Talent Acquisition to Talent Management and talent insights, Eightfold offers a single AI platform that does it all.

The Eightfold Talent Intelligence Platform, powered by Amazon Redshift and Amazon QuickSight, provides a full-fledged analytics platform for Eightfold’s customers. It delivers analytics and enhanced insights about the customer’s Talent Acquisition and Talent Management pipelines, and much more. Customers can also implement their own custom dashboards in QuickSight. As part of the Talent Intelligence Platform, Eightfold also exposes a data hub where each customer can access their Amazon Redshift-based data warehouse and perform ad hoc queries as well as schedule queries for reporting and data export. Additionally, customers who have their own in-house analytics infrastructure can integrate their own analytics solutions with the Eightfold Talent Intelligence Platform by directly connecting to the Redshift data warehouse provisioned for them. Doing this gives them access to their raw analytics data, which can then be integrated into their analytics infrastructure irrespective of the technology stack they use.

Eightfold provides this analytics experience to hundreds of customers today. Securing customer data is a top priority for Eightfold. The company requires the highest security standards when implementing a multi-tenant analytics platform on Amazon Redshift.

The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to control the visibility of the data catalog listing, that is, the names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift.

In this post, we discuss how the Eightfold Talent Lake system team implemented the Amazon Redshift metadata security feature in their multi-tenant environment to enable access controls for the database catalog. By linking access to business-defined entitlements, they are able to enforce data access policies.

Amazon Redshift security controls address restricting data access to users who have been granted permission. This post discusses restricting the listing of data catalog metadata according to the granted permissions.

The Eightfold team needed to develop a multi-tenant application with the following features:

  • Enforce visibility of Amazon Redshift objects on a per-tenant basis, so that each tenant can only view and access their own schema
  • Implement tenant isolation and security so that tenants can only see and interact with their own data and objects

Metadata security in Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Many customers have implemented Amazon Redshift to support multi-tenant applications. One of the challenges with multi-tenant environments is that database objects are visible to all tenants even though tenants are only authorized to access certain objects. This visibility creates data privacy challenges because many customers want to hide objects that tenants can’t access.

The newly released metadata security feature in Amazon Redshift enables you to hide database objects from all other tenants and make objects only visible to tenants who are authorized to see and use them. Tenants can use SQL tools, dashboards, or reporting tools, and also query the database catalog, but they will only see the objects for which they have permissions.

Solution overview

Exposing a Redshift endpoint to all of Eightfold’s customers as part of the Talent Lake endeavor involved several design choices that had to be carefully considered. Eightfold has a multi-tenant Redshift data warehouse with individual schemas for each customer, which customers connect to using their own credentials to run queries on their data. Data in each customer tenant can only be accessed by the customer credentials that have access to that customer’s schema. Each customer accesses data under their analytics schema, which is named after the customer. For example, for a customer named A, the schema name would be A_analytics. The following diagram illustrates this architecture.

Although customer data was secured by restricting access to only the customer user, when customers used business intelligence (BI) tools like QuickSight, Microsoft Power BI, or Tableau to access their data, the initial connection showed all the customer schemas because it was performing a catalog query (which couldn’t be restricted). Therefore, Eightfold’s customers had concerns that other customers could discover that they were Eightfold’s customers by simply trying to connect to Talent Lake. This unrestricted database catalog access posed a privacy concern to several Eightfold customers. Although this could be avoided by provisioning one Redshift database per customer, that was a logistically difficult and expensive solution to implement.

The following screenshot shows what a connection from QuickSight to our data warehouse looked like without metadata security turned on. All other customer schemas were exposed even though the connection to QuickSight was made as customer_k_user.

Approach for implementing metadata access controls

To implement restricted catalog access and ensure it worked with Talent Lake, we cloned our production data warehouse with all the schemas and enabled the metadata security flag in the Redshift data warehouse by connecting with SQL tools. After it was enabled, we tested catalog queries by connecting to the data warehouse from BI tools like QuickSight, Microsoft Power BI, and Tableau and confirmed that only the connecting customer’s schema showed up in the catalog query results. We also tested by running catalog queries after connecting to the Redshift data warehouse from psql, to ensure that only the customer schema objects were surfaced. It’s important to validate this because tenants have direct access to the Redshift data warehouse.

The metadata security feature was tested by first turning on metadata security in our Redshift data warehouse by connecting using a SQL tool or Amazon Redshift Query Editor v2.0 and issuing the following command:

ALTER SYSTEM SET metadata_security = TRUE;

Note that the preceding command is set at the Redshift cluster level or Redshift Serverless endpoint level, which means it is applied to all databases and schemas in the cluster or endpoint.

In Eightfold’s scenario, data access controls are already in place for each of the tenants for their respective database objects.
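As an illustration only, and not Eightfold’s actual provisioning code, per-tenant access controls of this kind are commonly expressed as schema-level grants; the following sketch issues them through the Amazon Redshift Data API, with the cluster identifier, database, and user names as assumptions.

import boto3

redshift_data = boto3.client("redshift-data")

customer = "a"   # e.g., customer "A" maps to the a_analytics schema naming described earlier

statements = [
    f"CREATE SCHEMA IF NOT EXISTS {customer}_analytics;",
    f"CREATE USER {customer}_user PASSWORD DISABLE;",   # credentials would typically come from IAM or Secrets Manager
    f"GRANT USAGE ON SCHEMA {customer}_analytics TO {customer}_user;",
    f"GRANT SELECT ON ALL TABLES IN SCHEMA {customer}_analytics TO {customer}_user;",
]

redshift_data.batch_execute_statement(
    ClusterIdentifier="talent-lake-cluster",   # hypothetical cluster identifier
    Database="analytics",
    DbUser="admin",
    Sqls=statements,
)

With metadata security turned on, grants like these also determine which objects each tenant sees when querying the catalog.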

After turning on the metadata security feature in Amazon Redshift, Eightfold was able to restrict database catalog access so that each customer connecting to Amazon Redshift sees only their own schema, and further validated this by issuing catalog queries to access schema objects as well.

We also tested by connecting via psql and trying out various catalog queries. All of them yielded only the relevant customer schema of the logged-in user as the result. The following are some examples:

analytics=> select * from pg_user;
     usename     | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
-----------------+----------+-------------+----------+-----------+----------+----------+-----------
 customer_k_user |      377 | f           | f        | f         | ******** |          |
(1 row)

analytics=> select * from information_schema.schemata;
 catalog_name |     schema_name      |  schema_owner   | default_character_set_catalog | default_character_set_schema | default_character_set_name | sql_path
--------------+----------------------+-----------------+-------------------------------+------------------------------+----------------------------+----------
 analytics    | customer_k_analytics | customer_k_user |                               |                              |                            |
(1 row)
The following screenshot shows the UI after metadata security was enabled: only customer_k_analytics is seen when connecting to the Redshift data warehouse as customer_k_user.

This ensured that individual customer privacy was protected and increased customer confidence in Eightfold’s Talent Lake.

Customer feedback

“Being an AI-first platform for customers to hire and develop people to their highest potential, data and analytics play a vital role in the value provided by the Eightfold platform to its customers. We rely on Amazon Redshift as a multi-tenant Data Warehouse that provides rich analytics with data privacy and security through customer data isolation by using schemas. In addition to the data being secure as always, we layered on Redshift’s new metadata access control to ensure customer schemas are not visible to other customers. This feature truly made Redshift the ideal choice for a multi-tenant, performant, and secure Data Warehouse and is something we are confident differentiates our offering to our customers.”

– Sivasankaran Chandrasekar, Vice President of Engineering, Data Platform at Eightfold AI

Conclusion

In this post, we demonstrated how the Eightfold Talent Intelligence Platform team implemented a multi-tenant environment for hundreds of customers, using the Amazon Redshift metadata security feature. For more information about metadata security, refer to the Amazon Redshift documentation.

Try out the metadata security feature for your future Amazon Redshift implementations, and feel free to leave a comment about your experience!


About the authors

Arun Sudhir is a Staff Software Engineer at Eightfold AI. He has more than 15 years of experience in design and development of backend software systems in companies like Microsoft and AWS, and has a deep knowledge of database engines like Amazon Aurora PostgreSQL and Amazon Redshift.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using AWS Analytics services.

Anjali Vijayakumar is a Senior Solutions Architect at AWS focusing on EdTech. She is passionate about helping customers build well-architected solutions in the cloud.

New Amazon Q in QuickSight uses generative AI assistance for quicker, easier data insights (preview)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-amazon-q-in-quicksight-uses-generative-ai-assistance-for-quicker-easier-data-insights-preview/

Today, I’m happy to share that Amazon Q in QuickSight is available for preview. Now you can experience the Generative BI capabilities in Amazon QuickSight announced on July 26, as well as two additional capabilities for business users.

Turning insights into impact faster with Amazon Q in QuickSight
With this announcement, business users can now generate compelling sharable stories examining their data, see executive summaries of dashboards surfacing key insights from data in seconds, and confidently answer questions about data not answered by dashboards and reports with a reimagined Q&A experience.

Before we go deeper into each capability, here’s a quick summary:

  • Stories — This is a new and visually compelling way to present and share insights. Stories can be automatically generated in minutes using natural language prompts, customized using point-and-click options, and shared securely with others.
  • Executive summaries — With this new capability, Amazon Q helps you to understand key highlights in your dashboard.
  • Data Q&A — This capability provides a new and easy-to-use natural-language Q&A experience to help you get answers to questions beyond what is available in existing dashboards and reports.

To get started, you need to enable Preview Q Generative Capabilities in Preview manager.

Once enabled, you’re ready to experience what Amazon Q in QuickSight brings for business users and business analysts building dashboards.

Stories automatically builds formatted narratives
Business users often need to share their data findings with others to inform team decisions; this has historically involved taking data out of the business intelligence (BI) system. Stories are a new feature that enables business users to create beautifully formatted narratives that describe data and include visuals, images, and text in document or slide format, directly within QuickSight, where they can easily be shared with others.

Now, business users can use natural language to ask Amazon Q to build a story about their data by starting from the Amazon Q Build menu on an Amazon QuickSight dashboard. Amazon Q extracts data insights and statistics from selected visuals, then uses large language models (LLMs) to build a story in multiple parts, examining what the data may mean to the business and suggesting ideas to achieve specific goals.

For example, a sales manager can ask, “Build me a story about overall sales performance trends. Break down data by product and region. Suggest some strategies for improving sales.” Or, “Write a marketing strategy that uses regional sales trends to uncover opportunities that increase revenue.” Amazon Q will build a story exploring specific data insights, including strategies to grow sales.

Once built, business users get point-and-click tools augmented with artificial intelligence (AI)-driven rewriting capabilities to customize stories using a rich text editor to refine the message, add ideas, and highlight important details.

Stories can also be easily and securely shared with other QuickSight users by email.

Executive summaries deliver a quick snapshot of important information
Executive summaries are now available with a single click using the Amazon Q Build menu in Amazon QuickSight. Amazon QuickSight automatically determines interesting facts and statistics, then uses LLMs to write about interesting trends.

This new capability saves time in examining detailed dashboards by providing an at-a-glance view of key insights described using natural language.

The executive summaries feature provides two advantages. First, it helps business users generate all the key insights without the need to browse through tens of visuals on the dashboard and understand changes from each. Second, it enables readers to find key insights based on information in the context of dashboards and reports with minimal effort.

New data Q&A experience
Once an interesting insight is discovered, business users frequently need to dig in to understand data more deeply than they can from existing dashboards and reports. Natural language query (NLQ) solutions designed to solve this problem frequently expect that users already know what fields may exist or how they should be combined to answer business questions. However, business users aren’t always experts in underlying data schemas, and their questions frequently come in more general terms, like “How were sales last week in NY?” Or, “What’s our top campaign?”

The new Q&A experience, accessed within dashboards and reports, helps business users confidently answer questions about data. It includes AI-suggested questions, a profile of what data can be asked about, and automatically generated multi-visual answers with narrative summaries explaining data context.

Furthermore, Amazon Q brings the ability to answer vague questions and offer alternatives for specific data. For example, customers can ask a vague question, such as “Top products,” and Amazon Q will provide an answer that breaks down products by sales and offers alternatives for products by customer count and products by profit. Amazon Q explains answer context in a narrative summarizing total sales, number of products, and picking out the sales for the top product.

Customers can search for specific data values, even a single word such as the product name “contactmatcher.” Amazon Q returns a complete set of data related to that product and provides a natural language breakdown explaining important insights like total units sold. Specific visuals from the answers can also be added to a pinboard for easy future access.

Watch the demo
To see these new capabilities in action, have a look at the demo.

Things to Know
Here are a few additional things that you need to know:

Join the preview
Amazon Q in QuickSight product page

Happy building!
— Donnie

AWS Weekly Roundup – EC2 DL2q instances, PartyRock, Amplify’s 6th birthday, and more – November 20, 2023

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-ec2-dl2q-instances-partyrock-amplifys-6th-birthday-and-more-november-20-2023/

Last week I saw an astonishing 160+ new service launches. There were so many updates that we decided to publish a weekly roundup again. This continues the same innovative pace of the previous week as we are getting closer to AWS re:Invent 2023.

Our News Blog team is also finalizing new blog posts for re:Invent to introduce awesome launches with service teams for your reading pleasure. Jeff Barr shared The Road to AWS re:Invent 2023 to explain our blogging journey and process. Please stay tuned in the next week!

Last week’s launches
Here are some of the launches that caught my attention last week:

Amazon EC2 DL2q instances – New DL2q instances are powered by Qualcomm AI 100 Standard accelerators and are the first to feature Qualcomm’s AI technology in the public cloud. With eight Qualcomm AI 100 Standard accelerators and 128 GiB of total accelerator memory, you can run popular generative artificial intelligence (AI) applications and extend to edge devices across smartphones, autonomous driving, personal compute, and extended reality headsets to develop and validate these AI workloads before deploying.

PartyRock for Amazon Bedrock – We introduced PartyRock, a fun and intuitive hands-on, generative AI app-building playground powered by Amazon Bedrock. You can experiment, learn all about prompt engineering, build mini-apps, and share them with your friends—all without writing any code or creating an AWS account.

You also can now access the Meta Llama 2 Chat 13B foundation model and Cohere Command Light, Embed English, and multilingual models for Amazon Bedrock.

AWS Amplify celebrates its sixth birthday – We announced six new launches: a new documentation site, support for Next.js 14 with our hosting and JavaScript library, custom token providers and an automatic React Native social sign-in update for Amplify Auth, new ChangePassword and DeleteUser account settings components, and an update of all Amplify UI packages to use the new Amplify JavaScript v6. You can also use wildcard subdomains when using a custom domain with your Amplify application deployed to AWS Amplify Hosting.

Amplify docs site UI

Also check out other News Blog posts about major launches published in the past week:

Other AWS service launches
Here are some other bundled feature launches per AWS service:

Amazon Athena – You can use a new cost-based optimizer (CBO) to enhance query performance based on table and column statistics collected by the AWS Glue Data Catalog, and the Athena JDBC 3.x driver, a new alternative driver that supports almost all authentication plugins. You can also use Amazon EMR Studio to develop and run interactive queries on Amazon Athena.

Amazon CloudWatch – You can use a new CloudWatch metric called EBS Stalled I/O Check to monitor the health of your Amazon EBS volumes, the regular expression for Amazon CloudWatch Logs Live Tail filter pattern syntax to search and match relevant log events, observability of SAP Sybase ASE database in CloudWatch Application Insights, and up to two stats commands in a Log Insights query to perform aggregations on the results.

Amazon CodeCatalyst – You can connect to an Amazon Virtual Private Cloud (Amazon VPC) from CodeCatalyst Workflows, provision infrastructure using Terraform within CodeCatalyst Workflows, access CodeCatalyst with your workforce identities configured in IAM Identity Center, and create teams made up of members of the CodeCatalyst space.

Amazon Connect – You can use a pre-built queue performance dashboard and the Contact Lens conversational analytics dashboard to view and compare real-time and historical aggregated queue performance. You can use quick responses for chats (previously written formats, such as typing ‘/#greet’ to insert a personalized response) and scan attachments to detect malware or other unwanted content.

AWS Glue – AWS Glue for Apache Spark added six new database connectors: Teradata, SAP HANA, Azure SQL, Azure Cosmos DB, Vertica, and MongoDB, as well as native connectivity to Amazon OpenSearch Service.

AWS Lambda – You can see a single-pane view of metrics, logs, and traces in the AWS Lambda console and use advanced logging controls to natively capture logs in JSON structured format. You can view the SAM template on the Lambda console and export the function’s configuration to AWS Application Composer. AWS Lambda also supports the Java 21 and Node.js 20 runtimes built on the new Amazon Linux 2023 runtime.

AWS Local Zones in Dallas – You can enable the new Local Zone in Dallas, Texas, us-east-1-dfw-2a, with Amazon EC2 C6i, M6i, R6i, C6gn, and M6g instances and Amazon EBS volume types gp2, gp3, io1, sc1, and st1. You can also access Amazon ECS, Amazon EKS, Application Load Balancer, and AWS Direct Connect in this new Local Zone to support a broad set of workloads at the edge.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) – You can standardize access control to Kafka resources using AWS Identity and Access Management (IAM) and build Kafka clients for Amazon MSK Serverless written in all programming languages. These are open source client helper libraries and code samples for popular languages, including Java, Python, Go, and JavaScript. Also, Amazon MSK now supports an enhanced version of Apache Kafka 3.6.0 that offers generally available Tiered Storage and automatically sends you storage capacity alerts when you are at risk of exhausting your storage.

Amazon OpenSearch Service Ingestion – You can migrate your data from Elasticsearch version 7.x clusters to the latest versions of Amazon OpenSearch Service and use persistent buffering to protect the durability of incoming data.

Amazon RDS – Amazon RDS for MySQL now supports creating active-active clusters using the Group Replication plugin, upgrading MySQL 5.7 snapshots to MySQL 8.0, and the Innovation Release version of MySQL 8.1.

Amazon RDS Custom for SQL Server extends point-in-time recovery support to up to 1,000 databases, supports Service Master Key Retention to use transparent data encryption (TDE), table- and column-level encryption, DBMail, and linked servers, and supports SQL Server Developer edition with bring your own media (BYOM).

Additionally, Amazon RDS Multi-AZ deployments with two readable standbys now supports minor version upgrades and system maintenance updates with typically less than one second of downtime when using Amazon RDS Proxy.

AWS Partner Central – You can use an improved user experience in AWS Partner Central to build and promote your offerings and the new Investments tab in the Partner Analytics Dashboard to gain actionable insights. You can now link accounts and associated users between Partner Central and AWS Marketplace and use an enhanced co-sell experience with APN Customer Engagements (ACE) manager.

Amazon QuickSight – You can use APIs to programmatically manage user access and custom permissions for roles to restrict QuickSight functionality in QuickSight accounts that use IAM Identity Center or Active Directory. You can also use shared restricted folders, a Contributor role, support for data source asset types in folders, and the Custom Week Start feature, an addition designed to enhance the data analysis experience for customers across diverse industries and social contexts.

AWS Trusted Advisor – You can use new APIs to programmatically access Trusted Advisor best practices checks, recommendations, and prioritized recommendations and 37 new Amazon RDS checks that provide best practices guidance by analyzing DB instance configuration, usage, and performance data.

There’s a lot more launch news that I haven’t covered. See AWS What’s New for more details.

See you virtually in AWS re:Invent
Next week we’ll hear the latest from AWS, learn from experts, and connect with the global cloud community in Las Vegas. If you come, check out the agenda, session catalog, and attendee guides before your departure.

If you’re not able to attend re:Invent in person this year, we’re offering the option to livestream our Keynotes and Innovation Talks. With registration for an online pass, you will have access to on-demand keynotes, Innovation Talks, and selected breakout sessions after the event.

Channy

Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue

Post Syndicated from Antonio Samaniego Jurado original https://aws.amazon.com/blogs/big-data/visualize-amazon-dynamodb-insights-in-amazon-quicksight-using-the-amazon-athena-dynamodb-connector-and-aws-glue/

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools. The scalability and flexible data schema of DynamoDB make it well-suited for a variety of use cases. These include internet-scale web and mobile applications, low-latency metadata stores, high-traffic retail websites, Internet of Things (IoT) and time series data, online gaming, and more.

Data stored in DynamoDB is the basis for valuable business intelligence (BI) insights. To make this data accessible to data analysts and other consumers, you can use Amazon Athena. Athena is a serverless, interactive service that allows you to query data from a variety of sources in heterogeneous formats, with no provisioning effort. Athena accesses data stored in DynamoDB via the open source Amazon Athena DynamoDB connector. Table metadata, such as column names and data types, is stored using the AWS Glue Data Catalog.

Finally, to visualize BI insights, you can use Amazon QuickSight, a cloud-powered business analytics service. QuickSight makes it straightforward for organizations to build visualizations, perform ad hoc analysis, and quickly get business insights from their data, anytime, on any device. Its generative BI capabilities enable you to ask questions about your data using natural language, without having to write SQL queries or learn a BI tool.

This post shows how you can use the Athena DynamoDB connector to easily query data in DynamoDB with SQL and visualize insights in QuickSight.

Solution overview

The following diagram illustrates the solution architecture.

Architecture Diagram

  1. The Athena DynamoDB connector runs in a pre-built, serverless AWS Lambda function. You don’t need to write any code.
  2. AWS Glue provides supplemental metadata from the DynamoDB table. In particular, an AWS Glue crawler is run to infer and store the DynamoDB table format, schema, and associated properties in the Glue Data Catalog.
  3. The Athena editor is used to test the connector and perform analysis via SQL queries.
  4. QuickSight uses the Athena connector to visualize BI insights from DynamoDB.

This walkthrough uses data from the ProductCatalog table, part of the DynamoDB developer guide sample data files.

Prerequisites

Before you get started, you should meet the following prerequisites:

Set up the Athena DynamoDB connector

The Athena DynamoDB connector comprises a pre-built, serverless Lambda function provided by AWS that communicates with DynamoDB so you can query your tables with SQL using Athena. The connector is available in the AWS Serverless Application Repository, and is used to create the Athena data source for later use in data analysis and visualization. To set up the connector, complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. In the search bar, search for and choose Amazon DynamoDB.
  4. Choose Next.
  5. Under Data source details, enter a name. Note that this name should be unique and will be referenced in your SQL statements when you query your Athena data source.
  6. Under Connection details, choose Create Lambda function.

This will take you to the Lambda applications page on the Lambda console. Do not close the Athena data source creation tab; you will return to it in a later step.

  7. Scroll down to Application settings and enter a value for the following parameters (leave the other parameters as default):
    • SpillBucket – Specifies the Amazon Simple Storage Service (Amazon S3) bucket name for storing data that exceeds Lambda function response size limits. To create an S3 bucket, refer to Creating a bucket.
    • AthenaCatalogName – A lowercase name for the Lambda function to be created.
  8. Select the acknowledgement check box and choose Deploy.

Wait for deployment to complete before moving to the next step.

  9. Return to the Athena data source creation tab.
  10. Under Connection details, choose the refresh icon and choose the Lambda function you created.
  11. Choose Next.
  12. Review and choose Create data source.

Provide supplemental metadata via AWS Glue

The Athena connector already comes with a built-in inference capability to discover the schema and table properties of your data source. However, this capability is limited. To accurately discover the metadata of your DynamoDB table and centralize schema management as your data evolves over time, the connector integrates with AWS Glue.

To achieve this, an AWS Glue crawler is run to automatically determine the format, schema, and associated properties of the raw data stored in your DynamoDB table, writing the resulting metadata to a Glue database. Glue databases contain tables, which hold metadata from different data stores, independent from the actual location of the data. The Athena connector then references the Glue table and retrieves the corresponding DynamoDB metadata to enable queries.

Create the AWS Glue database

Complete the following steps to create the Glue database:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Databases.
  2. Choose Add database (you can also edit an existing database if you already have one).
  3. For Name, enter a database name.
  4. For Location, enter the string literal dynamo-db-flag. This keyword indicates that the database contains tables that the connector can use for supplemental metadata.
  5. Choose Create database.

Following security best practices, it is also recommended that you enable encryption at rest for your Data Catalog. For details, refer to Encrypting your Data Catalog.
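If you prefer to script this step, the following minimal sketch creates an equivalent database through the AWS Glue API; the database name is a placeholder, and the LocationUri carries the dynamo-db-flag keyword described above.

import boto3

glue = boto3.client("glue")

# The LocationUri keyword tells the Athena DynamoDB connector that this database
# holds tables it can use for supplemental metadata.
glue.create_database(
    DatabaseInput={
        "Name": "dynamodb_metadata",     # hypothetical database name
        "LocationUri": "dynamo-db-flag",
    }
)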

Create the AWS Glue crawler

Complete the following steps to create and run the Glue crawler:

  1. On the AWS Glue console, under Data Catalog in the navigation pane, choose Crawlers.
  2. Choose Create crawler.
  3. Enter a crawler name and choose Next.
  4. For Data sources, choose Add a data source.
  5. On the Data source drop-down menu, choose DynamoDB. For Table name, enter the name of your DynamoDB table (string literal).
  6. Choose Add a DynamoDB data source.
  7. Choose Next.
  8. For IAM Role, choose Create new IAM role.
  9. Enter a role name and choose Create. This will automatically create an IAM role that trusts AWS Glue and has permissions to access the crawler targets.
  10. Choose Next.
  11. For Target database, choose the database previously created.
  12. Choose Next.
  13. Review and choose Create crawler.
  14. On the newly created crawler page, choose Run crawler.

Crawler runtimes depend on your DynamoDB table size and properties. You can find crawler run details under Crawler runs.
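The crawler can also be created and started programmatically; the following is a minimal sketch in which the crawler name, IAM role, and database name are placeholders, and the DynamoDB target points at the ProductCatalog sample table.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="productcatalog-crawler",                          # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # role with read access to the DynamoDB table
    DatabaseName="dynamodb_metadata",                       # the Glue database created earlier
    Targets={"DynamoDBTargets": [{"Path": "ProductCatalog"}]},
)
glue.start_crawler(Name="productcatalog-crawler")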

Validate the output metadata

When your crawler run status shows as Completed, follow these steps to validate the output metadata:

  1. On the AWS Glue console, choose Tables in the navigation pane. Here, you can confirm a new table has been added to the database as a result of the crawler run.
  2. Navigate to the newly created table and take a look at the Schema tab. This tab shows the column names, data types, and other parameters inferred from your DynamoDB table.
  3. If needed, edit the schema by choosing Edit schema.
  4. Choose Advanced properties.
  5. Under Table properties, verify the crawler automatically created and set the classification key to dynamodb. This indicates to the Athena connector that the table can be used for supplemental metadata.
  6. Optionally, add the following properties to correctly catalog and reference DynamoDB data in AWS Glue and Athena queries. This is due to capital letters not being permitted in AWS Glue table and column names, but being permitted in DynamoDB table and attribute names.
    1. If your DynamoDB table name contains any capital letters, choose Actions and Edit Table and add an extra table property as follows:
      • Key: sourceTable
      • Value: YourDynamoDBTableName
    2. If your DynamoDB table has attributes that contain any capital letters, add an extra table property as follows:
      • Key: columnMapping
      • Value: yourcolumn1=YourColumn1, yourcolumn2=YourColumn2, …

Test the connector with the Athena SQL editor

After the Athena DynamoDB connector is deployed and the AWS Glue table is populated with supplemental metadata, the DynamoDB table is ready for analysis. The example in this post uses the Athena editor to make SQL queries to the ProductCatalog table. For further options to interact with Athena, see Accessing Athena.

Complete the following steps to test the connector:

  1. Open the Athena query editor.
  2. If this is your first time visiting the Athena console in your current AWS Region, complete the following steps. This is a prerequisite before you can run Athena queries. See Getting Started for more details.
    1. Choose Query editor in the navigation pane to open the editor.
    2. Navigate to Settings and choose Manage to set up a query result location in Amazon S3.
  3. Under Data, select the data source and database you created (you may need to choose the refresh icon for them to sync up with Athena).
  4. Tables belonging to the selected database appear under Tables. You can choose a table name for Athena to show the table column list and data types.
  5. Test the connector by pulling data from your table via a SELECT statement. When you run Athena queries, you can reference Athena data sources, databases, and tables as <datasource_name>.<database>.<table_name>. Retrieved records are shown under Results.

For increased security, refer to Encrypting Athena query results stored in Amazon S3 to encrypt query results at rest.

Athena Query Results

For this post, we run a SELECT statement to validate the process. You can refer to the SQL reference for Athena to build more complex queries and analyses.
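As a sketch, the same validation can also be run from the AWS CLI. The data source, database, and table names below are placeholders for the ones you created, and the workgroup must have a query result location configured:

aws athena start-query-execution \
  --query-string "SELECT * FROM <datasource_name>.<database>.<table_name> LIMIT 10" \
  --work-group primary

aws athena get-query-results --query-execution-id <query-execution-id>

The first command returns a query execution ID that you pass to the second command to retrieve the results.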

Visualize in QuickSight

QuickSight allows for building modern interactive dashboards, paginated reports, embedded analytics, and natural language queries through a unified BI solution. In this step, we use QuickSight to generate visual insights from the DynamoDB table by connecting to the Athena data source previously created.

Allow QuickSight to access resources

Complete the following steps to grant QuickSight access to resources:

  1. On the QuickSight console, choose the profile icon and choose Manage QuickSight.
  2. In the navigation pane, choose Security & Permissions.
  3. Under QuickSight access to AWS services, choose Manage.
  4. QuickSight may ask you to switch to the Region in which users and groups in your account are managed. To change the current Region, navigate to the profile icon on the QuickSight console and choose the Region you want to switch to.
  5. For IAM Role, choose Use QuickSight-managed role (default).

Subsequent instructions assume that the default QuickSight-managed role is being used. If this is not the case, make sure to update the existing role to the same effect.

  1. Under Allow access and autodiscovery for these resources, select IAM and Amazon S3.
  2. For Amazon S3, choose Select S3 buckets.
  3. Choose the spill bucket you specified earlier when deploying the Lambda function for the connector, and the bucket you specified as the Athena query result location in Amazon S3.
  4. For both buckets, select Write permission for Athena Workgroup.
  5. Choose Amazon Athena.
  6. In the pop-up window, choose Next.
  7. Choose Lambda and choose the Amazon Resource Name (ARN) of the Lambda function previously used for the Athena data source connector.
  8. Choose Finish.
  9. Choose Save.

Create the Athena dataset

To create the Athena dataset, complete the following steps:

  1. On the QuickSight console, choose the user profile and switch to the Region you deployed the Athena data source to.
  2. Return to the QuickSight home page.
  3. In the navigation pane, choose Datasets.
  4. Choose New dataset.
  5. For Create a Dataset, select Athena.
  6. For Data source name, enter a name and choose Validate connection.
  7. When the connection shows as Validated, choose Create data source.
  8. Under Catalog, Database, and Tables, select the Athena data source, AWS Glue database, and AWS Glue table previously created.
  9. Choose Select.
  10. On the Finish dataset creation page, select Import to SPICE for quicker analytics.
  11. Choose Visualize.

For additional information on QuickSight query modes, see Importing data into SPICE and Using SQL to customize data.
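If you want to create the Athena data source programmatically instead, a minimal AWS CLI sketch could look like the following. The account ID, data source ID, and workgroup are placeholders; creating the dataset itself additionally requires a physical table definition, which is simpler to do in the console as described above:

aws quicksight create-data-source \
  --aws-account-id 111122223333 \
  --data-source-id athena-dynamodb \
  --name athena-dynamodb \
  --type ATHENA \
  --data-source-parameters 'AthenaParameters={WorkGroup=primary}'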

Build QuickSight visualizations

Once the DynamoDB data is available in QuickSight via the Athena DynamoDB connector, it is ready to be visualized. The QuickSight analysis in the following example shows a vertical stacked bar chart with the average price per product category for the ProductCatalog sample dataset. In addition, it shows a donut chart with the proportion of products by product category, and a tree map containing the count of bicycles per bicycle type.

If you use data imported to SPICE in a QuickSight analysis, the dataset will only be available after the import is complete. For further details, see Using SPICE data in an analysis.

Quicksight Analysis

For comprehensive information on how to create and share visualizations in QuickSight, refer to Visualizing data in Amazon QuickSight and Sharing and subscribing to data in Amazon QuickSight.

Clean up

To avoid incurring continued AWS usage charges, make sure you delete all resources created as part of this walkthrough. You can follow the console steps below or use the equivalent AWS CLI commands shown after the list.

  • Delete the Athena data source:
    1. On the Athena console, switch to the Region you deployed your resources in.
    2. Choose Data sources in the navigation pane.
    3. Select the data source you created and on the Actions menu, choose Delete.
  • Delete the Lambda application:
    1. On the AWS CloudFormation console, switch to the Region you deployed your resources in.
    2. Choose Stacks in the navigation pane.
    3. Select serverlessrepo-AthenaDynamoDBConnector and choose Delete.
  • Delete the AWS Glue resources:
    1. On the AWS Glue console, switch to the Region you deployed your resources in.
    2. Choose Databases in the navigation pane.
    3. Select the database you created and choose Delete.
    4. Choose Crawlers in the navigation pane.
    5. Select the crawler you created and on the Action menu, choose Delete crawler.
  • Delete the QuickSight resources:
    1. On the QuickSight console, switch to the Region you deployed your resources in.
    2. Delete the analysis created for this walkthrough.
    3. Delete the Athena dataset created for this walkthrough.
    4. If you no longer need the Athena data source to create other datasets, delete the data source.
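If you prefer the AWS CLI, the following commands remove the same resources. The IDs and names are placeholders, and the commands must be run in the Region where you deployed:

aws athena delete-data-catalog --name <datasource_name>
aws cloudformation delete-stack --stack-name serverlessrepo-AthenaDynamoDBConnector
aws glue delete-crawler --name <your-crawler>
aws glue delete-database --name <your-glue-database>
aws quicksight delete-analysis --aws-account-id 111122223333 --analysis-id <analysis-id>
aws quicksight delete-data-set --aws-account-id 111122223333 --data-set-id <dataset-id>
aws quicksight delete-data-source --aws-account-id 111122223333 --data-source-id <data-source-id>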

Summary

This post demonstrated how you can use the Athena DynamoDB connector to query data in DynamoDB with SQL and build visualizations in QuickSight.

Learn more about the Athena DynamoDB connector in the Amazon Athena User Guide. Discover more available data source connectors to query and visualize a variety of data sources without setting up or managing any infrastructure while only paying for the queries you run.

For advanced QuickSight capabilities powered by AI, see Gaining insights with machine learning (ML) in Amazon QuickSight and Answering business questions with Amazon QuickSight Q.


About the Authors

Antonio Samaniego Jurado is a Solutions Architect at Amazon Web Services. With a strong passion for modern technology, Antonio helps customers build state-of-the-art applications on AWS. A creator at heart, he loves community-driven learning and sharing best practices across the AWS service portfolio to help customers make the most of their cloud journey.

Pascal Vogel is a Solutions Architect at Amazon Web Services. Pascal helps startups and enterprises build cloud-native solutions. As a cloud enthusiast, Pascal loves learning new technologies and connecting with like-minded customers who want to make a difference in their cloud journey.

BMW Cloud Efficiency Analytics powered by Amazon QuickSight and Amazon Athena

Post Syndicated from Phillip Karg original https://aws.amazon.com/blogs/big-data/bmw-cloud-efficiency-analytics-powered-by-amazon-quicksight-and-amazon-athena/

This post is written in collaboration with Philipp Karg and Alex Gutfreund  from BMW Group.

Bayerische Motoren Werke AG (BMW) is a motor vehicle manufacturer headquartered in Germany with 149,475 employees worldwide. In the 2022 financial year, it reported a profit before tax of €23.5 billion on revenues of €142.6 billion. BMW Group is one of the world’s leading premium manufacturers of automobiles and motorcycles, also providing premium financial and mobility services.

BMW Group uses 4,500 AWS Cloud accounts across the entire organization but is faced with the challenge of reducing unnecessary costs, optimizing spend, and having a central place to monitor costs. BMW Cloud Efficiency Analytics (CLEA) is a homegrown tool developed within the BMW FinOps CoE (Center of Excellence) aiming to optimize and reduce costs across all these accounts.

In this post, we explore how the BMW Group FinOps CoE implemented their Cloud Efficiency Analytics tool (CLEA), powered by Amazon QuickSight and Amazon Athena. With this tool, they effectively reduced costs and optimized spend across all their AWS Cloud accounts, utilizing a centralized cost monitoring system and using key AWS services. The CLEA dashboards were built on the foundation of the Well-Architected Lab. For more information on this foundation, refer to A Detailed Overview of the Cost Intelligence Dashboard.

CLEA gives full transparency into cloud costs, usage, and efficiency from a high-level overview to granular service, resource, and operational levels. It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer), AWS Trusted Advisor, and AWS Compute Optimizer. Additionally, it incorporates BMW Group’s internal system to integrate essential metadata, offering a comprehensive view of the data across various dimensions, such as group, department, product, and applications.

The ultimate goal is to raise awareness of cloud efficiency and optimize cloud utilization in a cost-effective and sustainable manner. The dashboards, which offer a holistic view together with a variety of cost and BMW Group-related dimensions, were successfully launched in May 2023 and became accessible to users within the BMW Group.

Overview of the BMW Cloud Data Hub

At the BMW Group, Cloud Data Hub (CDH) is the central platform for managing company-wide data and data solutions. It works as a bundle for resources that are bound to a specific staging environment and Region to store data on Amazon Simple Storage Service (Amazon S3), which is renowned for its industry-leading scalability, data availability, security, and performance. Additionally, it manages table definitions in the AWS Glue Data Catalog, containing references to data sources and targets of extract, transform, and load (ETL) jobs in AWS Glue.

Data providers and consumers are the two fundamental users of a CDH dataset. Providers create datasets within their assigned domain and, as dataset owners, are responsible for the actual content and for providing appropriate metadata. They can use their own toolsets or rely on provided blueprints to ingest the data from source systems. Once released, consumers use datasets from different providers for analysis, machine learning (ML) workloads, and visualization.

Each CDH dataset has three processing layers: source (raw data), prepared (transformed data in Parquet), and semantic (combined datasets). It is possible to define stages (DEV, INT, PROD) in each layer to allow structured release and test without affecting PROD. Within each stage, it’s possible to create resources for storing actual data. Two resource types are associated with each database in a layer:

  • File store – S3 buckets for data storage
  • Database – AWS Glue databases for metadata sharing

Overview of the CLEA Landscape

The following diagram is a high-level overview of some of the technologies used for the extract, load, and transform (ELT) stages, as well as the final visualization and analysis layer. You might notice that this differs slightly from traditional ETL. The difference lies in when and where data transformation takes place. In ETL, data is transformed before it’s loaded into the data warehouse. In ELT, raw data is loaded into the data warehouse first, then it’s transformed directly within the warehouse. The ELT process has gained popularity with the rise of cloud-based, high-performance data warehouses, where transformation can be done more efficiently after loading.

Regardless of the method used, the goal is to provide high-quality, reliable data that can be used to drive business decisions.

CLEA Architecture

In this section, we take a closer look at the three essential stages mentioned previously: extract, load and transform.

Extract

The extract stage plays a pivotal role in CLEA, serving as the initial step where data related to cost, usage, and optimization is collected from a diverse range of sources within AWS. These sources encompass the AWS Cost and Usage Reports, Cost Explorer (and forecasting with Cost Explorer), Trusted Advisor, and Compute Optimizer. Furthermore, it fetches essential metadata from BMW Group’s internal system, offering a comprehensive view of the data across various dimensions, such as group, department, product, and applications in the later stages of data transformation.

The following diagram illustrates one of the data collection architectures that we use to collect Trusted Advisor data from nearly 4,500 AWS accounts and subsequently load that into Cloud Data Hub.

Let’s go through each numbered step as outlined in the architecture:

  1. A time-based rule in Amazon EventBridge triggers the CLEA Shared Workflow AWS Step Functions state machine (a minimal CLI sketch of this schedule follows the list).
  2. Based on the inputs, the Shared Workflow state machine invokes the Account Collector AWS Lambda function to retrieve AWS account details from AWS Organizations.
  3. The Account Collector Lambda function assumes an AWS Identity and Access Management (IAM) role to access linked account details via the Organizations API and writes them to Amazon Simple Queue Service (Amazon SQS) queues.
  4. The SQS queues trigger the Data Collector Lambda function using SQS Lambda triggers.
  5. The Data Collector Lambda function assumes an IAM role in each linked account to retrieve the relevant data and load it into the CDH source S3 bucket.
  6. When data from all linked accounts has been collected, the Shared Workflow state machine triggers an AWS Glue job for further data transformation.
  7. The AWS Glue job reads raw data from the CDH source bucket and transforms it into a compact Parquet format.
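As a sketch of step 1, a time-based EventBridge rule that starts a Step Functions state machine can be created with the AWS CLI as follows. The rule name, state machine ARN, IAM role ARN, and schedule are hypothetical placeholders rather than the actual CLEA configuration:

aws events put-rule --name clea-shared-workflow-schedule --schedule-expression "rate(1 day)"

aws events put-targets --rule clea-shared-workflow-schedule \
  --targets "Id"="shared-workflow","Arn"="arn:aws:states:eu-central-1:111122223333:stateMachine:CleaSharedWorkflow","RoleArn"="arn:aws:iam::111122223333:role/clea-eventbridge-invoke-role"

The role passed as RoleArn must allow EventBridge to call states:StartExecution on the state machine.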

Load and transform

For the data transformations, we used an open-source data transformation tool called dbt (Data Build Tool), modifying and preprocessing the data through a number of abstract data layers:

  • Source – This layer contains the raw data the data source provides. The preferred data format is Parquet, but JSON, CSV, or plain text files are also allowed.
  • Prepared – The source layer is transformed and stored as the prepared layer in Parquet format for optimized columnar access. Initial cleaning, filtering, and basic transformations are performed in this layer.
  • Semantic – A semantic layer combines several prepared layer datasets into a single dataset that contains transformations, calculations, and business logic to deliver business-friendly insights.
  • QuickSight – QuickSight is the final presentation layer, which is directly ingested into QuickSight SPICE from Athena via incremental daily ingestion queries. These ingested datasets are used as a source in CLEA dashboards.

Overall, using dbt’s data modeling and the pay-as-you-go pricing of Athena, BMW Group can control costs by running efficient queries on demand. Furthermore, with the serverless architecture of Athena and dbt’s structured transformations, you can scale data processing without worrying about infrastructure management. In CLEA, there are currently more than 120 dbt models implemented with complex transformations. The semantic layer is incrementally materialized and partially ingested into QuickSight with up to 4 TB of SPICE capacity. For dbt deployment and scheduling, we use GitHub Actions, which allows us to introduce new dbt models and changes easily with automatic deployments and tests.
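As an illustration, a CI job in such a setup might build and test the semantic layer with the dbt CLI roughly as follows; the selector and target names are hypothetical, not the actual CLEA project configuration:

dbt deps
dbt run --select semantic --target prod
dbt test --select semantic --target prod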

CLEA Access control

In this section, we explain how we implemented access control using row-level security in QuickSight and QuickSight embedding for authentication and authorization.

RLS for QuickSight

Row-level security (RLS) is a key feature that governs data access and privacy, which we implemented for CLEA. RLS is a mechanism that allows us to control the visibility of data at the row level based on user attributes. In essence, it ensures that users can only access the data that they are authorized to view, adding an additional layer of data protection within the QuickSight environment.

Understanding the importance of RLS requires a broader view of the data landscape. In organizations where multiple users interact with the same datasets but require different access levels due to their roles, RLS becomes a pivotal tool. It ensures data security and compliance with privacy regulations, preventing unauthorized access to sensitive data. Additionally, it offers a tailored user experience by displaying only relevant data to the user, thereby enhancing the effectiveness of data analysis.

For CLEA, we collected BMW Group metadata such as department, application, and organization, which allows users to see only the accounts within their department, application, organization, and so on. This is achieved using both a user name and a group name for access control: the user name is used for user-specific access control, and the group name is used to add users to specific groups that extend their permissions for different use cases.
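For illustration, a user-based RLS rules dataset in QuickSight contains a UserName and/or GroupName column plus the columns you restrict on. A minimal sketch with hypothetical values could look like the following, where each row limits the matching user or group to the listed department:

UserName,GroupName,department
jdoe,,Department A
,CLEA-COST-ADMINS,Department B

QuickSight applies these rules to every dataset that the RLS rules dataset is attached to.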

Lastly, because CLEA includes many dashboards, we also control which dashboards a given user can see, as well as the data shown within them. This is done at the group level. By default, all users are assigned to CLEA-READER, which is granted access to the core dashboards that we want to share with all users, and additional groups allow users to see further dashboards after they're assigned to those groups.

The RLS dataset is refreshed daily to catch recent changes regarding new user additions, group changes, or any other user access changes. This dataset is also ingested to SPICE daily, which automatically updates all datasets restricted via this RLS dataset.

QuickSight embedding

CLEA is a cross-platform application that provides secure access to QuickSight embedded content with custom-built authentication and authorization logic that sits on top of BMW Group identity and role management services (referred to as BMW IAM).

CLEA provides access to sensitive data to multiple users with different permissions, which is why it is designed with fine-grained access control rules. It enforces access control using role-based access control (RBAC) and attribute-based access control (ABAC) models at two different levels:

  • At the dashboard level via QuickSight user groups (RBAC)
  • At the dashboard data level via QuickSight RLS (RBAC and ABAC)

Dashboard-level permissions define the list of dashboards users are able to visualize.

Dashboard data-level permissions define the subsets of dashboard data shown to the user and are applied using RLS with the user attributes mentioned earlier. Although the majority of roles defined in CLEA are used for dashboard-level permissions, some specific roles are strategically defined to grant permissions at the dashboard data level, taking priority over the ABAC model.

BMW has a defined set of guidelines suggesting the usage of their IAM services as the single source of truth for identity and access control, which the team took into careful consideration when designing the authentication and authorization processes for CLEA.

Upon their first login, users are automatically registered in CLEA and assigned a base role that grants them access to a basic set of dashboards.

The process of registering users in CLEA consists of mapping a user’s identity as retrieved from BMW’s identity provider (IdP) to a QuickSight user, then assigning the newly created user to the respective QuickSight user group.
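As a sketch, this mapping can be done with the QuickSight API. The following AWS CLI calls register an IAM-federated user and add them to a group; the account ID, role ARN, session name, email, and group name are hypothetical:

aws quicksight register-user \
  --aws-account-id 111122223333 \
  --namespace default \
  --identity-type IAM \
  --iam-arn arn:aws:iam::111122223333:role/clea-embedding-role \
  --session-name jdoe@example.com \
  --email jdoe@example.com \
  --user-role READER

aws quicksight create-group-membership \
  --aws-account-id 111122223333 \
  --namespace default \
  --group-name CLEA-READER \
  --member-name clea-embedding-role/jdoe@example.com

For role-based federation, the resulting QuickSight user name is the role name followed by the session name, which is why the member name takes that form.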

For users that require more extensive permissions (at one of the levels mentioned before), it is possible to order additional role assignments via BMW’s self-service portal for role management. Authorized reviewers will then review it and either accept or reject the role assignments.

Role assignments take effect the next time the user logs in, at which time the user’s assigned roles in BMW Group IAM are synced to the user’s QuickSight groups (internally referred to as the identity and permissions sync). As shown in the following diagram, the sync groups step calculates which of the user’s group memberships should be kept, created, or deleted.

Usage Insights

Amazon CloudWatch plays an indispensable role in enhancing the efficiency and usability of the CLEA dashboards. Not only does CloudWatch offer real-time monitoring of AWS resources, but it also allows us to track user activity and dashboard utilization. By analyzing usage metrics and logs, we can see who has logged in to the CLEA dashboards, which features are most frequently accessed, and how long users interact with various elements. These insights are invaluable for making data-driven decisions on how to improve the dashboards for a better user experience. Through the intuitive interface of CloudWatch, it’s possible to set up alarms that alert on abnormal activity or performance issues. Ultimately, employing CloudWatch for monitoring offers a comprehensive view of both system health and user engagement, helping us refine and enhance our dashboards continually.

Conclusion

BMW Group’s CLEA platform offers a comprehensive and effective solution to manage and optimize cloud resources. By providing full transparency into cloud costs, usage, and efficiency, CLEA offers insights from high-level overviews to granular details at the service, resource, and operational level.

CLEA aggregates data from various sources, enabling a detailed roadmap of the cloud operations, tracking footprints across primes, departments, products, applications, resources, and tags. This dynamic vision helps identify trends, anticipate future needs, and make strategic decisions.

Future plans for CLEA include enhancing capabilities with data consistency and accuracy, integrating additional sources like Amazon S3 Storage Lens for deeper insights, and introducing Amazon QuickSight Q for intelligent recommendations powered by machine learning, further streamlining cloud operations.

By following the practices here, you can unlock the potential of efficient cloud resource management by implementing Cloud Intelligence Dashboards, providing you with precise insights into costs, savings, and operational effectiveness.


About the Authors

Philipp Karg is Lead FinOps Engineer at BMW Group and founder of the CLEA platform. He focuses on boosting cloud efficiency initiatives and establishing a cost-aware culture within the company to ultimately leverage the cloud in a sustainable way.

Alex Gutfreund is Head of Product and Technology Integration at the BMW Group. He spearheads the digital transformation with a particular focus on platform ecosystems and efficiencies. With extensive experience at the interface of business and IT, he drives change and makes an impact in various organizations. His industry knowledge spans automotive, semiconductors, public transportation, and renewable energies.

Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for Cloud Native and DevOps, and in his free time, he also enjoys contributing to open-source projects.

Selman Ay is a Data Architect in the AWS Professional Services team. He has worked with customers from various industries such as e-commerce, pharma, automotive and finance to build scalable data architectures and generate insights from the data. Outside of work, he enjoys playing tennis and engaging in outdoor activities.

Nick McCarthy is a Senior Machine Learning Engineer in the AWS Professional Services team. He has worked with AWS clients across various industries including healthcare, finance, sports, telecoms and energy to accelerate their business outcomes through the use of AI/ML. Outside of work Nick loves to travel, exploring new cuisines and cultures in the process.

Miguel Henriques is a Cloud Application Architect in the AWS Professional Services team with 4 years of experience in the automotive industry delivering cloud native solutions. In his free time, he is constantly looking for advancements in the web development space and searching for the next great pastel de nata.

Managing AWS Lambda runtime upgrades

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/managing-aws-lambda-runtime-upgrades/

This post is written by Julian Wood, Principal Developer Advocate, and Dan Fox, Principal Specialist Serverless Solutions Architect.

AWS Lambda supports multiple programming languages through the use of runtimes. A Lambda runtime provides a language-specific execution environment, which provides the OS, language support, and additional settings, such as environment variables and certificates that you can access from your function code.

You can use managed runtimes that Lambda provides or build your own. Each major programming language release has a separate managed runtime, with a unique runtime identifier, such as python3.11 or nodejs20.x.

Lambda automatically applies patches and security updates to all managed runtimes and their corresponding container base images. Automatic runtime patching is one of the features customers love most about Lambda. When these patches are no longer available, Lambda ends support for the runtime. Over the next few months, Lambda is deprecating a number of popular runtimes, triggered by end of life of upstream language versions and of Amazon Linux 1.

The following runtimes reach deprecation on these dates:

  • Node.js 14: Nov 27, 2023
  • Node.js 16: Mar 11, 2024
  • Python 3.7: Nov 27, 2023
  • Java 8 (Amazon Linux 1): Dec 31, 2023
  • Go 1.x: Dec 31, 2023
  • Ruby 2.7: Dec 07, 2023
  • Custom Runtime (provided): Dec 31, 2023

Runtime deprecation is not unique to Lambda. You must upgrade code using Python 3.7 or Node.js 14 when those language versions reach end of life, regardless of which compute service your code is running on. Lambda can help make this easier by tracking which runtimes you are using and providing deprecation notifications.

This post contains considerations and best practices for managing runtime deprecations and upgrades when using Lambda. Adopting these techniques makes managing runtime upgrades easier, especially when working with a large number of functions.

Specifying Lambda runtimes

When you deploy your function as a .zip file archive, you choose a runtime when you create the function. To change the runtime, you can update your function’s configuration.

Lambda keeps each managed runtime up to date by taking on the operational burden of patching the runtimes with security updates, bug fixes, new features, performance enhancements, and support for minor version releases. These runtime updates are published as runtime versions. Lambda applies runtime updates to functions by migrating the function from an earlier runtime version to a new runtime version.

You can control how your functions receive these updates using runtime management controls. Runtime versions and runtime updates apply to patch updates for a given Lambda runtime. Lambda does not automatically upgrade functions between major language runtime versions, for example, from nodejs14.x to nodejs18.x.

For a function defined as a container image, you choose a runtime and the Linux distribution when you create the container image. Most customers start with one of the Lambda base container images, although you can also build your own images from scratch. To change the runtime, you create a new container image from a different base container image.

Why does Lambda deprecate runtimes?

Lambda deprecates a runtime when upstream runtime language maintainers mark their language end-of-life or security updates are no longer available.

In almost all cases, the end-of-life date of a language version or operating system is published well in advance. The Lambda runtime deprecation policy gives end-of-life schedules for each language that Lambda supports. Lambda notifies you by email and via your Personal Health Dashboard if you are using a runtime that is scheduled for deprecation.

Lambda runtime deprecation happens in several stages. Lambda first blocks creating new functions that use a given runtime. Lambda later also blocks updating existing functions using the unsupported runtime, except to update to a supported runtime. Lambda does not block invocations of functions that use a deprecated runtime. Function invocations continue indefinitely after the runtime reaches end of support.

Lambda is extending the deprecation notification period from 60 days before deprecation to 180 days. Previously, blocking new function creation happened at deprecation and blocking updates to existing functions 30 days later. Blocking creation of new functions now happens 30 days after deprecation, and blocking updates to existing functions 60 days after.

Lambda occasionally delays deprecation of a Lambda runtime for a limited period beyond the end of support date of the language version that the runtime supports. During this period, Lambda only applies security patches to the runtime OS. Lambda doesn’t apply security patches to programming language runtimes after they reach their end of support date.

Can Lambda automatically upgrade my runtime?

Moving from one major version of the language runtime to another has a significant risk of being a breaking change. Some libraries and dependencies within a language have deprecation schedules and do not support versions of a language past a certain point. Moving functions to new runtimes could potentially impact large-scale production workloads that customers depend on.

Since Lambda cannot guarantee backward compatibility between major language versions, upgrading the Lambda runtime used by a function is a customer-driven operation.

Lambda function versions

You can use function versions to manage the deployment of your functions. In Lambda, you make code and configuration changes to the default function version, which is called $LATEST. When you publish a function version, Lambda takes a snapshot of the code, runtime, and function configuration to maintain a consistent experience for users of that function version. When you invoke a function, you can specify the version to use or invoke the $LATEST version. Lambda function versions are required when using Provisioned Concurrency or SnapStart.

Some developers use an auto-versioning process by creating a new function version each time they deploy a change. This results in many versions of a function, with only a single version actually in use.

While Lambda applies runtime updates to published function versions, you cannot update the runtime major version for a published function version, for example from Node.js 16 to Node.js 20. To update the runtime for a function, you must update the $LATEST version, then create a new published function version if necessary. This means that different versions of a function can use different runtimes. The following shows the same function with version 1 using Node.js 14.x and version 2 using Node.js 18.x.

Version 1 using Node.js 14.x

Version 2 using Node.js 18.x
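As a sketch, upgrading the runtime on the $LATEST version and then publishing a new function version with the AWS CLI looks like the following; the function name is a placeholder:

aws lambda update-function-configuration --function-name <FUNCTION_NAME> --runtime nodejs20.x

aws lambda wait function-updated --function-name <FUNCTION_NAME>

aws lambda publish-version --function-name <FUNCTION_NAME> --description "Upgraded runtime to nodejs20.x"

The wait command ensures the configuration update has completed before the new version is published.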

Ensure you create a maintenance process for deleting unused function versions, which also count toward your Lambda storage quota.

Managing function runtime upgrades

Managing function runtime upgrades should be part of your software delivery lifecycle, in a similar way to how you treat dependencies and security updates. You need to understand which functions are being actively used in your organization. Organizations can create prioritization based on security profiles and/or function usage. You can use the same communication mechanisms you may already be using for handling security vulnerabilities.

Implement preventative guardrails to ensure that developers can only create functions using supported runtimes. Using infrastructure as code, CI/CD pipelines, and robust testing practices makes updating runtimes easier.

Identifying impacted functions

There are tools available to check Lambda runtime configuration and to identify which functions and what published function versions are actually in use. Deleting a function or function version that is no longer in use is the simplest way to avoid runtime deprecations.

You can identify functions using deprecated or soon to be deprecated runtimes using AWS Trusted Advisor. Use the AWS Lambda Functions Using Deprecated Runtimes check in the Security category, which provides 120 days’ notice.

AWS Trusted Advisor Lambda functions using deprecated runtimes

Trusted Advisor scans all versions of your functions, including $LATEST and published versions.

The AWS Command Line Interface (AWS CLI) can list all functions in a specific Region that are using a specific runtime. To find all functions in your account, repeat the following command for each AWS Region and account. Replace the <REGION> and <RUNTIME> parameters with your values. The --function-version ALL parameter causes all function versions to be returned; omit this parameter to return only the $LATEST version.

aws lambda list-functions --function-version ALL --region <REGION> --output text --query "Functions[?Runtime=='<RUNTIME>'].FunctionArn"

You can use AWS Config to create a view of the configuration of resources in your account and also store configuration snapshot data in Amazon S3. AWS Config queries do not support published function versions; they can only query the $LATEST version.
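For example, an AWS Config advanced query run from the CLI can list the functions that Config has recorded with a specific runtime; the runtime value here is just an example:

aws configservice select-resource-config \
  --expression "SELECT resourceId, configuration.runtime WHERE resourceType = 'AWS::Lambda::Function' AND configuration.runtime = 'nodejs14.x'"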

You can then use Amazon Athena and Amazon QuickSight to make dashboards to visualize AWS Config data. For more information, see the Implementing governance in depth for serverless applications learning guide.

Dashboard showing AWS Config data

There are a number of ways that you can track Lambda function usage.

You can use Amazon CloudWatch metrics explorer to view Lambda by runtime and track the Invocations metric within the default CloudWatch metrics retention period of 15 months.

Track invocations in Amazon CloudWatch metrics

You can turn on AWS CloudTrail data event logging to log an event every time Lambda functions are invoked. This helps you understand what identities are invoking functions and the frequency of their invocations.

AWS Cost and Usage Reports can show which functions are incurring cost and in use.

Limiting runtime usage

AWS CloudFormation Guard is an open-source evaluation tool to validate infrastructure as code templates. Create policy rules to ensure that developers only choose approved runtimes. For more information, see Preventative Controls with AWS CloudFormation Guard.

AWS Config rules allow you to check that Lambda function settings for the runtime match expected values. For more information on running these rules before deployment, see Preventative Controls with AWS Config. You can also reactively flag functions as non-compliant as your governance policies evolve. For more information, see Detective Controls with AWS Config.

Lambda does not currently support service control policies (SCPs) that block function creation based on the runtime.

Upgrade best practices

Use infrastructure as code tools to build and manage your Lambda functions, which can make it easier to manage upgrades.

Ensure you run tests against your functions when developing locally. Include automated tests as part of your CI/CD pipelines to provide confidence in your runtime upgrades. When rolling out function upgrades, you can use weighted aliases to shift traffic between two function versions as you monitor for errors and failures.
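For example, you can shift a small percentage of traffic to a newly published version by setting the routing configuration on a weighted alias; the function name, alias name, and version numbers below are placeholders:

aws lambda update-alias --function-name <FUNCTION_NAME> --name live \
  --function-version 1 --routing-config AdditionalVersionWeights={"2"=0.1}

Here the alias continues to route 90% of invocations to version 1 while sending 10% to version 2.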

Using runtimes after deprecation

AWS strongly advises you to upgrade your functions to a supported runtime before deprecation to continue to benefit from security patches, bug-fixes, and the latest runtime features. While deprecation does not affect function invocations, you will be using an unsupported runtime, which may have unpatched security vulnerabilities. Your function may eventually stop working, for example, due to a certificate expiry.

Lambda blocks function creation and updates for functions using deprecated runtimes. To create or update functions after these operations are blocked, contact AWS Support.

Conclusion

Lambda is deprecating a number of popular runtimes over the next few months, reflecting the end-of-life of upstream language versions and Amazon Linux 1. This post covers considerations for managing Lambda function runtime upgrades.

For more serverless learning resources, visit Serverless Land.