Tag Archives: Intermediate (200)

Fine-grained access control in Amazon EMR Serverless with AWS Lake Formation

2024-11-01 Anubhav Awasthi

Post Syndicated from Anubhav Awasthi original https://aws.amazon.com/blogs/big-data/fine-grained-access-control-in-amazon-emr-serverless-with-aws-lake-formation/

In today’s data-driven world , enterprises are increasingly reliant on vast amounts of data to drive decision-making and innovation. With this reliance comes the critical need for robust data security and access control mechanisms. Fine-grained access control restricts access to specific data subsets, protecting sensitive information and maintaining regulatory compliance. It allows organizations to set detailed permissions at various levels, including database, table, column, and row. This precise control mitigates risks of unauthorized access, data leaks, and misuse. In the unfortunate event of a security incident, fine-grained access control helps limit the scope of the breach, minimizing potential damage.
AWS is introducing general availability of fine-grained access control based on AWS Lake Formation for Amazon EMR Serverless on Amazon EMR 7.2. Enterprises can now significantly enhance their data governance and security frameworks. This new integration supports the implementation of modern data lake architectures, such as data mesh, by providing a seamless way to manage and analyze data. You can use EMR Serverless to enforce data access controls using Lake Formation when reading data from Amazon Simple Storage Service (Amazon S3), enabling robust data processing workflows and real-time analytics without the overhead of cluster management.

In this post, we discuss how to implement fine-grained access control in EMR Serverless using Lake Formation. With this integration, organizations can achieve better scalability, flexibility, and cost-efficiency in their data operations, ultimately driving more value from their data assets.

Key use cases for fine-grained access control in analytics

The following are key use cases for fine-grained access control in analytics:

Customer 360 – You can enable different departments to securely access specific customer data relevant to their functions. For example, the sales team can be granted access only to data such as customer purchase history, preferences, and transaction patterns. Meanwhile, the marketing team is limited to viewing campaign interactions, customer demographics, and engagement metrics.
Financial reporting – You can enable financial analysts to access the necessary data for reporting and analysis while restricting sensitive financial details to authorized executives.
Healthcare analytics – You can enable healthcare researchers and data scientists to analyze de-identified patient data for medical advancements and research, while making sure Protected Health Information (PHI) remains confidential and accessible only to authorized healthcare professionals and personnel.
Supply chain optimization – You can grant logistics teams visibility into inventory and shipment data while limiting access to pricing or supplier information to relevant stakeholders.

Solution overview

In this post, we explore how to implement fine-grained access control on Iceberg tables within an EMR Serverless application, using the capabilities of Lake Formation. If you’re interested in learning how to implement fine-grained access control on open table formats in Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) instances using Lake Formation, refer to Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation.
With the data access control features available in Lake Formation, you can enforce granular permissions and govern access to specific columns, rows, or cells within your Iceberg tables. This approach makes sure sensitive data remains secure and accessible only to authorized users or applications, aligning with your organization’s data governance policies and regulatory compliance requirements.

A cross-account modern data platform on AWS involves setting up a centralized data lake in a primary AWS account, while allowing controlled access to this data from secondary AWS accounts. This setup helps organizations maintain a single source of truth for their data, provides consistent data governance, and uses the robust security features of AWS across multiple business units or project teams.

To demonstrate how you can use Lake Formation to implement cross account fine-grained access control within an EMR Serverless environment, we use the TPC-DS dataset to create tables in the AWS Glue Data Catalog in the AWS producer account and provision different user personas to reflect various roles and access levels in the AWS consumer account, forming a secure and governed data lake.

The following diagram illustrates the solution architecture.

The producer account contains the following persona:

Data engineer – Tasks include data preparation, bulk updates, and incremental updates. The data engineer has the following access:
- Table-level access – Full read/write access to all TPC-DS tables.

The consumer account contains the following personas:

Finance analyst – We run a sample query that performs a sales data analysis to guide marketing, inventory, and promotion strategies based on demographic and geographic performance. The finance analyst has the following access:
- Table-level access – Full access to tables store_sales, catalog_sales, web_sales, item, and promotion for comprehensive financial analysis.
- Column-level access – Limited access to cost-related columns in the sales tables to avoid exposure to sensitive pricing strategies. Limited access to sensitive columns like credit_rating in the customer_demographics table.
- Row-level access – Access only to sales data from the current fiscal year or specific promotional periods.
Product analyst – We run a sample query that does a customer behavior analysis to tailor marketing, promotions, and loyalty programs based on purchase patterns and regional insights. The product analyst has the following access:
- Table-level access – Full access to tables item, store_sales, and customer tables to evaluate product and market trends.
- Column-level access – Restricted access to personal identifiers in the customer table, such as customer_address , email_address, and date of birth.

Prerequisites

You should have the following prerequisites:

Access to the producer account and consumer account with adequate permissions to create and deploy AWS CloudFormation stacks, upload files to S3 buckets, accept shared resources in AWS Resource Access Manager (AWS RAM), and other actions taken in this post.
Access to AWS Identity and Access Management (IAM) roles or users who are a Lake Formation data lake administrator in both the producer and consumer account. For instructions, refer to Create a data lake administrator.

Set up infrastructure in the producer account

We provide a CloudFormation template to deploy the data lake stack with the following resources:

Two S3 buckets: one for scripts and query results, and one for the data lake storage
An Amazon Athena workgroup
An EMR Serverless application
An AWS Glue database and tables on external public S3 buckets of TPC-DS data
An AWS Glue database for the data lake
An IAM role and polices

Set up Lake Formation for the data engineer in the producer account

Set up Lake Formation cross-account data sharing version settings:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
Under Data Catalog settings, pick Version 4 under Cross-account version settings.

To learn more about the differences between data sharing versions, refer to Updating cross-account data sharing version settings. Make sure Default permissions for newly created databases and tables is unchecked.

Register the Amazon S3 location as the data lake location

When you register an Amazon S3 location with Lake Formation, you specify an IAM role with read/write permissions on that location. After registering, when EMR Serverless requests access to this Amazon S3 location, Lake Formation will supply temporary credentials of the provided role to access the data. We already created the role LakeFormationServiceRole using the CloudFormation template. To register the Amazon S3 location as the data lake location, complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
In the navigation pane, choose Data lake locations under Administration.
Choose Register location.
For Amazon S3 path, enter s3://<DatalakeBucketName>. (Copy the bucket name from the CloudFormation stack’s Outputs tab.)
For IAM role, enter LakeFormationServiceRoleDatalake.
For Permission mode, select Lake Formation.
Choose Register location.

Generate TPC-DS tables in the producer account

In this section, we generate TPC-DS tables in Iceberg format in the producer account.
Grant database permissions to the data engineer
First, let’s grant database permissions to the data engineer IAM role Amazon-EMR-ExecutionRole_DE that we will use with EMR Serverless. Complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
Choose Databases and Create database.
Enter iceberg_db for Name and s3://<DatalakeBucketName> for Location.
Choose Create database.
In the navigation pane, choose Data lake permissions and choose Grant.
In the Principles section, select IAM users and roles and choose Amazon-EMR-ExecutionRole_DE.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and choose tpc-source and iceberg_db for Databases.
Select Super for both Database permissions and Grantable permissions and choose Grant.

Create an EMR Serverless application

Now, let’s log in to EMR Serverless using Amazon EMR Studio and complete the following steps:

On the Amazon EMR console, choose EMR Serverless.
Under Manage applications, choose my-emr-studio. You will be directed to the Create application page on EMR Studio. Let’s create a Lake Formation enabled EMR Serverless application
Under Application settings, provide the following information:
1. For Name, enter a name emr-fgac-application.
2. For Type, choose Spark.
3. For Release version, choose emr-7.2.0.
4. For Architecture, choose x86_64.
Under Application setup options, select Use custom settings.
Under Interactive endpoint, select Enable endpoint for EMR studio
Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
Under Network connections, choose emrs-vpc for the VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
Choose Create and start application.

Create a Workspace

Complete the following steps to create an EMR Workspace:

On the Amazon EMR console, choose Workspaces in the navigation pane and choose Create Workspace.
Enter the Workspace name emr-fgac-workspace.
Leave all other settings as default and choose Create Workspace.
Choose Launch Workspace. Your browser might request to allow pop-up permissions for the first time launching the Workspace.
After the Workspace is launched, in the navigation pane, choose Compute.
For Compute type¸ select EMR Serverless application and enter emr-fgac-application for the application and Amazon-EMR-ExecutionRole_DE as the runtime role.
Make sure the kernel attached to the Workspace is PySpark.
Navigate to the File browser section and choose Upload files.
Upload the file Iceberg-ingest-final_v2.ipynb.
Update the data lake bucket name, AWS account ID, and AWS Region accordingly.
Choose the double arrow icon to restart the kernel and rerun the notebook.

To verify that the data is generated, you can go to the AWS Glue console. Under Data Catalog, Databases, you should see TPC-DS tables ending with _iceberg for the database iceberg_db.

Share the database and TPC-DS tables to the consumer account

We now grant permissions to the consumer account, including grantable permissions. This allows the Lake Formation data lake administrator in the consumer account to control access to the data within the account.

Grant database permissions to the consumer account

Complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
In the navigation pane, choose Databases.
Select the database iceberg_db, and on the Actions menu, under Permissions, choose Grant.
In the Principles section, select External accounts and enter the consumer account.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and choose iceberg_db for Databases.
In the Database permissions section, select Describe for both Database permissions and Grantable permissions.

This allows the data lake administrator in the consumer account to describe the database and grant describe permissions to other principals in the consumer account.

Grant table permissions to the consumer account

Repeat the preceding steps to grant table permissions to the consumer account.

Choose All tables under Tables and provide select and describe permissions for Table permissions and Grantable permissions.

Set up Lake Formation in the consumer account

For the remaining section of the post, we focus on the consumer account. Deploy the following CloudFormation stack to set up resources:

The template will create the Amazon EMR runtime role for both analyst user personas.
Log in to the AWS consumer account and accept the AWS RAM invitation first:

Open the AWS RAM console with the IAM identity that has AWS RAM access.
In the navigation pane, choose Resource shares under Shared with me.
You should see two pending resource shares from the producer account.
Accept both invitations.

You should be able to see the iceberg_db database on the Lake Formation console.

Create a resource link for the shared database

To access the database and table resources that were shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database or table. After you create a resource link to a database or table, you can use the resource link name wherever you would use the database or table name. In this step, you grant permission on the resource links to the job runtime roles for EMR Serverless. The runtime roles will then access the data in shared databases and underlying tables through the resource link.
To create a resource link, complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
In the navigation pane, choose Databases.
Select the iceberg_db database, verify that the owner account ID is the producer account, and on the Actions menu, choose Create resource links.
For Resource link name, enter the name of the resource link (iceberg_db_shared).
For Shared database’s region, choose the Region of the iceberg_db database.
For Shared database, choose the iceberg_db database.
For Shared database’s owner ID, enter the account ID of the producer account.
Choose Create.

Grant permissions on the resource link to the EMR job runtime roles

Grant permissions on the resource link to Amazon-EMR-ExecutionRole_Finance and Amazon-EMR-ExecutionRole_Product using the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
In the navigation pane, choose Databases.
Select the resource link (iceberg_db_shared) and on the Actions menu, choose Grant.
In the Principles section, select IAM users and roles, and choose Amazon-EMR-ExecutionRole_Finance and Amazon-EMR-ExecutionRole_Product.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and for Databases, choose iceberg_db_shared.
In the Resource link permissions section, select Describe for Resource link permissions.

This allows the EMR Serverless job runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principles.
Choose Grant.

Grant table permissions for the finance analyst

Complete the following steps:

Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
In the navigation pane, choose Databases.
Select the resource link (iceberg_db_shared) and on the Actions menu, choose Grant on target.
In the Principles section, select IAM users and roles, then choose Amazon-EMR-ExecutionRole_Finance.
In the LF-Tags or catalog resources section, select Named Data Catalog resources and specify the following:
1. For Databases, choose iceberg_db.
2. For Tables¸ choose store_sales_iceberg.
In the Table permissions section, for Table permissions, select Select.
In the Data permissions section, select Column-based access.
Select Exclude columns and choose all cost-related columns (ss_wholesale_cost and ss_ext_wholesale_cost).
Choose Grant.
Similarly, grant access to table customer_demographics_iceberg and exclude the column cd_credit_rating.
Following the same steps, grant All data access for tables store_iceberg and item_iceberg.
For the table date_dim_iceberg, we provide selective row-level access.
Similar to the preceding table permissions, select date_dim_iceberg under Tables and in the Data filters section, choose Create new.
For Data filter name, enter FA_Filter_year.
Select Access to all columns under Column-level access.
Select Filter rows and for Row filter expression, enter d_year=2002 to only provide access to the 2002 year.
Choose Save changes.
Choose Create filter.
Make sure FA_Filter_year is selected under Data filters and grant select permissions on the filter.

Grant table permissions for the product analyst

You can provide permissions for the next set of tables required for the product analyst role using the Lake Formation console. Alternatively, you can use the AWS Command Line Interface (AWS CLI) to grant permissions. We provide grant on target permissions for the resource link iceberg_db_shared to IAM role Amazon-EMR-ExecutionRole_Product.

Similar to steps followed in previous sections, for table store_sales_iceberg, date_dim_iceberg, store_iceberg, and house_hold_demographics_iceberg, provide select permissions for All data access. Make sure the role selected is Amazon-EMR-ExecutionRole_Product.

For table customer_iceberg, we limit access to personally identifiable information (PII) columns.

Under Data permissions, select Column-based access and Exclude columns.
Choose columns c_birth_day, c_birth_month, c_birth_year, c_current_addr_sk, c_customer_id, c_email_address, and c_birth_country.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to test role access:

Log in to the AWS consumer account and open the Amazon EMR console.
Choose EMR Serverless in the navigation pane and choose an existing EMR Studio.
If you don’t have EMR Studio configured, choose Get Started and select Create and launch EMR Studio.
Create a Lake Formation enabled EMR Serverless application as described in previous sections.
Create an EMR Studio Workspace as described in previous sections.
Use emr-studio-service-role for Service role and datalake-resources-<account_id>-<region> for Workspace storage, then launch your Workspace.

Now, let’s verify access for the finance analyst.

Make sure the compute type inside your Workspace is pointing to the EMR Serverless application created in the prior step and Amazon-EMR-ExecutionRole_Finance as the interactive runtime role.
Go to File browser in the navigation pane, choose Upload files, and add Notebook_FA.ipynb to your Workspace.
Run all the cells to verify fine-grained access.

Now let’s test access for the product analyst.

In the same Workspace, detach and attach the same EMR Serverless application with Amazon-EMR-ExecutionRole_Product as the interactive runtime role.
Upload Notebook_PA.ipynb under the File browser section.
Run all the cells to verify fine-grained access for the product analyst.

In a real-world scenario, both analysts will likely have their own Workspace with restricted rights to assume only the authorized interactive runtime role.

Considerations and limitations

EMR Serverless with Lake Formation uses Spark resource profiles to create two profiles and two Spark drivers for access control. Read this white paper to learn about the feature details. The user profile runs the supplied code, and the system profile enforces Lake Formation policies. Therefore, it’s recommended that you have a minimum of two Spark drivers when pre-initialized capacity is used with Lake Formation enabled jobs. No change in executor count is required. Refer to Using EMR Serverless with AWS Lake Formation for fine-grained access control to learn more about the technical implementation of the Lake Formation integration with EMR Serverless.

You can expect a performance overhead after enabling Lake Formation. The level of access (table, column, or row) and the amount of data filtered will have significant impact on query performance.

Clean up

To avoid incurring ongoing costs, complete the following steps to clean up your resources:

In your secondary (consumer) account, log in to the Lake Formation console.
Drop the resource share table.
In your primary (producer) account, log in to the Lake Formation console.
Revoke the access you configured.
Drop the AWS Glue tables and database.
Delete the AWS Glue job.
Delete the S3 buckets and any other resources that you created as part of the prerequisites for this post.

Conclusion

In this post, we showed how to integrate Lake Formation with EMR Serverless to manage access to Iceberg tables. This solution showcases a modern way to enforce fine-grained access control in a multi-account open data lake setup. The approach simplifies data management in the main account while carefully controlling how users access data in other secondary accounts.

Try out the solution for your own use case, and let us know your feedback and questions in the comments section.

About the Authors

Anubhav Awasthi is a Sr. Big Data Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web services. He specializes in building Big-data applications and help customer to modernize their applications on Cloud. He thinks Data is new oil and spends most of his time in deriving insights out of the Data.

How Volkswagen Autoeuropa built a data mesh to accelerate digital transformation using Amazon DataZone

2024-10-31 Dhrubajyoti Mukherjee

Post Syndicated from Dhrubajyoti Mukherjee original https://aws.amazon.com/blogs/big-data/how-volkswagen-autoeuropa-built-a-data-mesh-to-accelerate-digital-transformation-using-amazon-datazone/

This is a joint blog post co-authored with Martin Mikoleizig from Volkswagen Autoeuropa.

Volkswagen Autoeuropa is a Volkswagen Group plant that produces the T-Roc. The plant is located near Lisbon, Portugal and produces about 934 cars per day. In 2023, Volkswagen Autoeuropa represented 1.3% of the national GDP of Portugal and 4% in national export of goods impact with a sales volume of 3.3511 billion Euros. Volkswagen Autoeuropa aims to become a data-driven factory and has been using cutting-edge technologies to enhance digitalization efforts.

In this post, we discuss how Volkswagen Autoeuropa used Amazon DataZone to build a data marketplace based on data mesh architecture to accelerate their digital transformation. The data mesh, built on Amazon DataZone, simplified data access, improved data quality, and established governance at scale to power analytics, reporting, AI, and machine learning (ML) use cases. As a result, the data solution offers benefits such as faster access to data, expeditious decision making, accelerated time to value for use cases, and enhanced data governance.

Understanding Volkswagen Autoeuropa’s challenges

At the time of writing this post, Volkswagen Autoeuropa has already implemented more than 15 successful digital use cases in the context of real-time visualization, business intelligence, industrial computer vision, and AI.

Before the AWS partnership, Volkswagen Autoeuropa faced the following challenges.

Long lead time to access data – The digital use cases launched by Volkswagen Autoeuropa spent most of their project time getting access to the data that was relevant to their use cases. After the right data for the use case was found, the IT team provided access to the data through manual configuration. The lead time to access data was often from several days to weeks.
Insufficient data governance and auditing – Data was shared directly to use cases by copying it. Therefore, the IT team connected the data manually from their sources to the desired destinations multiple times. This process wasn’t centrally tracked to discover any information on the data sharing process. For example, if the data was copied in the past, how many use cases have access to the data, when access was granted, and who granted the access.
Redundant effort to process the same information – Because the IT team copied the data sources based on the exact use case requirements, they shared specific columns of the tables from the data. As additional use cases requested access to the same data with different column requirements, even more copies of the data were created.
Repeated process to establish security and governance guardrails – Each time the IT and the security team provided a connection to a new data source, they had to set up the security and governance guardrails. This required repeated manual effort.
Data quality issues – Because the data was processed redundantly and shared multiple times, there was no guarantee of or control over the quality of the data. This led to reduced trust in the data.
Absence of data catalog and metadata management – Data didn’t have any metadata associated with it, and so use cases couldn’t consume the data without further explanation from the data source owners and specialists. Furthermore, no process to discover new data existed. Similar to the consumption process, use cases would consult specialists to understand the context of the data and if it could provide value.

Envisioning a data solution for Volkswagen Autoeuropa

To address these challenges, Volkswagen Autoeuropa embarked on a bold vision. They envisioned a seamless data consumption process, similar to an online shopping experience. They envisioned a data marketplace where data users could browse and access high-quality, secure data with clear specifications, business context, and relevant attributes. This vision materialized into a project aimed at transforming data accessibility and governance as the foundation for the digital ecosystem. The vision to be realized: Data as seamless as online shopping.

In collaboration with Amazon Web Services (AWS), Volkswagen Autoeuropa joined the Enhanced Plant Onboarding Program of the Global Volkswagen Group’s Digital Production Platform (DPP EPO) strategy. Through this partnership, AWS and Volkswagen Autoeuropa created a data marketplace that significantly improved data availability.

In the discovery phase of the project, Volkswagen Autoeuropa and AWS evaluated several options to build the data solution. In the end, Volkswagen Autoeuropa chose a solution based on data mesh architecture using Amazon DataZone. Being a managed service, Amazon DataZone provided the necessary speed and agility to build the solution. At the same time, it led to higher operational efficiencies and lower operational overhead. The team adopted a data mesh architecture because the principles of the data mesh aligned with Volkswagen Autoeuropa’s vision of being a data driven factory.

Solution overview

This section describes the key features and architecture of the Volkswagen Autoeuropa data solution. The solution is based on a data mesh architecture.

Data solution features

The following figure shows the key capabilities of the Volkswagen Autoeuropa data solution.

The key capabilities of the solution are:

Data quality – In the solution, we’ve built a data quality framework to streamline the process of data quality checks and publishing quality scores. It uses AWS Glue Data Quality to generate recommendation rulesets, run orchestrated jobs, store results, and send notifications to users. This framework can be seamlessly integrated into AWS Glue jobs, providing a quality score for data pipeline jobs. In addition, the quality score is published in the Amazon DataZone data portal, allowing consumers to subscribe to the data based on its quality score.Assigning a quality score to the data helps build trust in the data, and shifts the responsibility of maintaining data quality to the data owner. As a result, the quality of the results delivered by these use cases improves.
Data registration – The producers sign in to the Amazon DataZone data portal using their AWS Identity and Access Management (IAM) credentials or single sign-on with integration through AWS IAM Identity Center. They register their data assets, which are stored in Amazon Simple Storage Service (Amazon S3), in the Amazon DataZone data catalog. The metadata of the data assets is stored in an AWS Glue catalog and made available in the business data catalog of Amazon DataZone and in the Amazon DataZone data source. The producers add business context such as business unit name, data owner contact information, and data refresh frequency using Amazon DataZone glossaries and metadata forms. In addition, they use generative AI capabilities to generate business metadata. After the business metadata is generated, they review the changes and modify the metadata if needed.Because all data products in Volkswagen Autoeuropa are now registered in the same location, the likelihood of data duplication is significantly reduced. Moreover, the data producers are improving the quality of the data by adding business context to it.
Data discovery – The consumers sign in to the Amazon DataZone data portal using their IAM credentials or single sign-on with integration through IAM Identity Center and search the data using keywords in the search bar. After the results are returned, they can further filter the results using glossary terms and project names. Finally, they review the business metadata of the data assets to evaluate if the data is relevant to their business use cases. They can check the quality score of the data assets and the refresh schedule for their use cases.With a data discovery capability in place, consumers can gain information about the data without the need to consult the source system owners or specialists.
Data access management – When the consumers find a data asset that’s relevant to their use case, they request access to it using the subscription feature of Amazon DataZone. Data is classified as public, internal, and confidential. For public and internal data assets, the access request is automatically approved. For confidential data assets, the data producer team reviews the access request and either accepts or rejects the subscription request.With a central place to manage data access, data owners can view which use cases have access to their data and when the access request was granted. The fine-grained access control feature of Amazon DataZone gives data owners granular control of their data at the row and column levels.
Data consumption – Upon approval of the subscription request, Amazon DataZone provisions the backend infrastructure to make the data accessible to the corresponding consumers. After this process is complete, the consumers can access the data through Amazon Athena using the deep link feature of Amazon DataZone. The data consumption pattern in Volkswagen Autoeuropa supports two use cases:
- Cloud-to-cloud consumption – Both data assets and consumer teams or applications are hosted in the cloud.
- Cloud-to-on-premises consumption – Data assets are hosted in the cloud and consumer use cases or applications are hosted on-premises.

Requirements specific to a use case requires access to the relevant data assets; sharing data to use cases using Amazon DataZone doesn’t require creating multiple copies. As a result, duplication and processing of data. Furthermore, by reducing the number of copies of the data, the overall quality of the data products improves. In addition, the backend automation of Amazon DataZone to make data available to use cases reduces the manual effort and improves the lead time to access data.

Single collaborative environment – The Amazon DataZone data portal provides a single collaborative environment to the users in Volkswagen Autoeuropa. Data consumers such as use case owners, data engineers, data scientists, and ML engineers can browse and request access to data assets. At the same time, data producers, such as use case owners and source system owners, can publish and curate their data in the Amazon DataZone data portal. This collaborative experience promotes teamwork and accelerates the realization of business value. Furthermore, the security and governance guardrails scales across the organization as the number of use cases increases.

Data solution architecture

The following figure displays the reference architecture of the data solution at Volkswagen Autoeuropa. In the next part of the post, we discuss how we arrived at the solution.

The architecture includes:

The data from SAP applications, manufacturing execution systems (MES), and supervisory control and data acquisition (SCADA) systems is ingested into the producer accounts of Volkswagen Autoeuropa.
In the producer account, raw data is transformed using AWS Glue. The technical metadata of the data is stored in AWS Glue catalog. The data quality is measured using the data quality framework. The data stored in Amazon Simple Storage Service (Amazon S3) is registered as an asset in the Amazon DataZone data catalog hosted in the central governance account.
The central governance account hosts the Amazon DataZone domain and the related Amazon DataZone data portal. The AWS accounts of the data producers and consumers are associated with the Amazon DataZone domain. Amazon DataZone projects belonging to the data producers and consumers are created under the related Amazon DataZone domain units.
Consumers of the data products sign in to the Amazon DataZone data portal hosted in the central governance account using their IAM credentials or single sign-on with integration through IAM Identity Center. They search, filter, and view asset information (for example, data quality, business, and technical metadata).
After the consumer finds the asset they need, they request access to the asset using the subscription feature of Amazon DataZone. Based on the validity of the request, the asset owner approves or rejects the request.
After the subscription request is granted and fulfilled, the asset is accessed in the consumer account for a one-time query using Athena and Microsoft Power BI applications hosted on premises. This consumption pattern can be extended for AI and machine learning (AI/ML) model development using Amazon SageMaker and reporting purposes using Amazon QuickSight.

User journey

After discussing the desired system with the use case teams and stakeholders and analyzing the current workflow, Volkswagen Autoeuropa grouped the user personas of the data solution into three main categories: data producer, data consumer, and data solution administrator. This sets the foundation for the desired user experience and what’s needed to achieve the solution goals.

Data producer

Data producers create the data products in the data solution. There are two types of data producers.

Data source owners – Data source owners publish the raw data in the Amazon DataZone data portal. These data products are attributed as source-based data.
Use case owners – Use case owners publish data that’s fit for consumption by other use cases. These data products are called consumer-based data.

The following figure shows the user journey of a data producer:

A data producer’s journey includes:

Identify data of interest –
1. Identify data (Volkswagen Autoeuropa network).
2. Perform data quality checks (Volkswagen Autoeuropa network).
Connect data to the data solution –
1. Ingest data into the data solution (Amazon DataZone portal).
2. Start process to connect data using AWS Glue.
Locate the data source in the data solution –
1. Register data (Amazon DataZone portal).
2. Add data to the inventory in Amazon DataZone.
Add or edit metadata –
1. Add or edit metadata (Amazon DataZone portal).
2. Publish data assets (Amazon DataZone portal).
Approve or reject subscription request –
1. Review subscription requests.
Maintain data assets –
1. Manage data assets (Amazon DataZone portal).

Data consumer

Data consumers use data for business analytics, machine learning, AI, and business reporting. Data consumers are data engineers, data scientists, ML engineers, and business users. The following diagram shows the journey of a data consumer.

A data consumer’s journey includes:

Access Amazon DataZone portal –
1. Amazon DataZone portal – Access is granted based on the user’s assigned domain and projects.
Search for data assets –
1. Data assets in Amazon DataZone portal – Search for data and brows the results by glossary terms or the project name. Use additional filters to refine the results.
View business metadata –
1. Select a data asset to see additional information – Review the description, data quality score and metadata.
Request access to data (subscribe) –
1. Subscribe to request access.
2. After the subscription request is approved, review the data products that you have access to.
3. Query the data to view and consume the data.
Retrieve additional data –
1. Repeat the steps as needed to access and retrieve additional data.

Data solution administrator

Data solution administrators are responsible for performing administrative tasks on the data solution. The following figure shows the common tasks performed by the data solution administrator.

A data administrator’s journey includes:

Manage projects –
1. Manage Amazon DataZone domain.
2. Manage Amazon DataZone projects within the domain.
Manage environment –
1. Set up the environment to manage the infrastructure.
Manage business metadata glossary –
1. Manage and enable Amazon DataZone glossaries and metadata forms.
Manage data assets –
1. Manage assets.
2. Query the data to view and consume the data.
Manage access to data solution –
1. Monitor and revoke access when appropriate.

Conclusion

In this post, you learned how Volkswagen Autoeuropa embarked on a bold vision to become a data driven factory. It shows how this vision was put into action by building a data solution based on data mesh architecture using Amazon DataZone. It highlights the key features and architecture of the data solutions and presents the user journey. As of writing this post, Volkswagen Autoeuropa reduced the data discovery time from days to minutes using the data solution. The time to access data took several weeks before the Volkswagen Autoeuropa and AWS collaboration. Now, with the help of the data solution, the data access time has been reduced to several minutes.

In May 2024, the team achieved a major milestone by successfully offering data on the data solution and transporting it instantly to Power BI, a process that previously took several weeks.

“After one year of work, we did the full roundtrip from offering data on our new data marketplace built using Amazon DataZone to transporting it instantly to third-party tools, a process that previously took several weeks. This was a big achievement for our team.”

– Jorge Paulino, Product owner of the data solution. Volkswagen Autoeuropa.

The next post of the two-part series details discusses how we built the solution, its technical details, and the business value created.

If you want to harness the agility and scalability of a data mesh architecture and Amazon DataZone to accelerate innovation and drive business value for your organization, we have the resources to get you started. Be sure to check out the AWS Prescriptive Guidance: Strategies for building a data mesh-based enterprise solution on AWS. This comprehensive guide covers the key considerations and best practices for establishing a robust, well-governed data mesh on AWS. From aligning your data mesh with overall business strategy to scaling the data mesh across your organization, this Prescriptive Guidance provides a clear roadmap to help you succeed.

If you’re curious to get hands-on, see the GitHub repository: Building an enterprise Data Mesh with Amazon DataZone, Amazon DataZone, AWS CDK, and AWS CloudFormation. This open source project delivers a step-by-step guide to build a data mesh architecture using Amazon DataZone, AWS Cloud Development Kit (AWS CDK), and AWS CloudFormation.

About the Authors

Dhrubajyoti Mukherjee is a Cloud Infrastructure Architect with a strong focus on data strategy, data analytics, and data governance at Amazon Web Services (AWS). He uses his deep expertise to provide guidance to global enterprise customers across industries, helping them build scalable and secure AWS solutions that drive meaningful business outcomes. Dhrubajyoti is passionate about creating innovative, customer-centric solutions that enable digital transformation, business agility, and performance improvement. An active contributor to the AWS community, Dhrubajyoti authors AWS Prescriptive Guidance publications, blog posts, and open-source artifacts, sharing his insights and best practices with the broader community. Outside of work, Dhrubajyoti enjoys spending quality time with his family and exploring nature through his love of hiking mountains.

Ravi Kumar is a Data Architect and Analytics expert at Amazon Web Services; he finds immense fulfillment in working with data. His days are dedicated to designing and analyzing complex data systems, uncovering valuable insights that drive business decisions. Outside of work, he unwinds by listening to music and watching movies, activities that allow him to recharge after a long day of data wrangling.

Martin Mikoleizig studied mechanical engineering and production technology at the RWTH Aachen University before starting to work in Dr. h.c. Ing. F. Porsche AG 2015 as a production planner for the engine assembly. In several years as a Project Manager on Testing Technology for new engine models he also introduced several innovations like human-machine-collaborations and intelligent assistance systems. From 2017, he was responsible for the Shopfloor IT team of the module lines in Zuffenhausen before he became responsible for the Planning of the E-Drive assembly at Porsche. Beside this he was responsible for the Digitalisation Strategy of the Production Ressort at Porsche. Since October 2022, he has been assigned to Volkswagen Autoeuropa in Portugal in the role of a Digital Transformation Manager for the plant driving the Digital Transformation towards a Data Driven Factory.

Weizhou Sun is a Lead Architect at Amazon Web Services, specializing in digital manufacturing solutions and IoT. With extensive experience in Europe, she has enhanced operational efficiencies, reducing latency and increasing throughput. Weizhou’s expertise includes Industrial Computer Vision, predictive maintenance, and predictive quality, consistently delivering top performance and client satisfaction. A recognized thought leader in IoT and remote driving, she has contributed to business growth through innovations and open-source work. Committed to knowledge sharing, Weizhou mentors colleagues and contributes to practice development. Known for her problem-solving skills and customer focus, she delivers solutions that exceed expectations. In her free time, Weizhou explores new technologies and fosters a collaborative culture.

Shameka Almond is an Advisory Consultant at Amazon Web Services. She works closely with enterprise customers to help them better understand the business impact and value of implementing data solutions, including data governance best practices. Shameka has over a decade of wide-ranging IT experience in the manufacturing and aerospace industries, and the nonprofit sector. She has supported several data governance initiatives, helping both public and private organizations identify opportunities for improvement and increased efficiency. Outside of the office she enjoys hosting large family gatherings, and supporting community outreach events dedicated to introducing students in K-12 to STEM.

Adjoa Taylor has over 20 years of experience in industrial manufacturing, providing industry and technology consulting services, digital transformation, and solution delivery. Currently Adjoa leads Product Centric Digital Transformation, enabling customers to solve complex manufacturing problems by leveraging Smart Factory and Industry leading transformation mechanisms. Most recently driving value with AI/ML and generative AI use-cases for the plant floor. Adjoa is an experienced leader spending over 20 years of her career delivering projects in countries throughout North America, Latin America, Europe, and Asia. Through prior roles, Adjoa brings deep experience across multiple business segments with a focus on business outcome driven solutions. Adjoa is passionate about helping customers solve problems while realizing the art of the possible via the right impacting value-based solution.

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

2024-10-30 Ramesh H Singh

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/expanding-data-analysis-and-visualization-options-amazon-datazone-now-integrates-with-tableau-power-bi-and-more/

Amazon DataZone now launched authentication supports through the Amazon Athena JDBC driver, allowing data users to seamlessly query their subscribed data lake assets via popular business intelligence (BI) and analytics tools like Tableau, Power BI, Excel, SQL Workbench, DBeaver, and more. This integration empowers data users to access and analyze governed data within Amazon DataZone using familiar tools, boosting both productivity and flexibility.

Customers use Amazon DataZone to streamline data access and governance by enabling data users to locate and subscribe to data from multiple sources within a single project. Amazon DataZone natively integrates with Amazon-specific options like Amazon Athena, Amazon Redshift, and Amazon SageMaker, allowing users to analyze their project governed data. With this launch of JDBC connectivity, Amazon DataZone expands its support for data users, including analysts and scientists, allowing them to work in their preferred environments—whether it’s SQL Workbench, Domino, or Amazon-native solutions—while ensuring secure, governed access within Amazon DataZone.

Collaborating closely with our partners, we have tested and validated Amazon DataZone authentication via the Athena JDBC connection, providing an intuitive and secure connection experience for users. With this integration, you can now seamlessly query your governed data lake assets in Amazon DataZone using popular business intelligence (BI) and analytics tools, including partner solutions like Tableau.

Ali Tore, Senior Vice President of Advanced Analytics at Salesforce, highlighting the value of this integration, says

“We’re excited to partner with Amazon to bring Tableau’s powerful data exploration and AI-driven analytics capabilities to customers managing data across organizational boundaries with Amazon DataZone. This integration enables our customers to seamlessly explore data with AI in Tableau, build visualizations, and uncover insights hidden in their governed data, all while leveraging Amazon DataZone to catalog, discover, share, and govern data across AWS, on premises, and from third-party sources—enhancing both governance and decision-making.”

With this launch, Amazon DataZone strengthens its commitment to empowering enterprise customers with secure, governed access to data across the tools and platforms they rely on. For example, Guardant Health uses Amazon DataZone to democratize data access across its organization, enabling diverse teams to efficiently access, query, and analyze data tailored to their specific needs.

Rajesh Kucharlapati, Senior Director of Data, CRM, and Analytics at Guardant Health, says

“By harmonizing data across multiple business domains, we foster a culture of data sharing. Using Amazon DataZone lets us avoid building and maintaining an in-house platform, allowing our developers to focus on tailored solutions. Leveraging AWS’s managed service was crucial for us to access business insights faster, apply standardized data definitions, and tap into generative AI potential. We also needed an easy connection process for widely-used analytics tools like Tableau, DBeaver, and Domino, directly within Amazon DataZone projects. This new JDBC connectivity feature enables our governed data to flow seamlessly into these tools, supporting productivity across our teams.”

Getting started

To get started, download and install the latest Athena JDBC driver for your tool of choice. After installation, copy the JDBC connection string from the Amazon DataZone portal into the JDBC connection configuration to establish a connection from your tool. This will direct you to authenticate using single sign-on (SSO) with your corporate credentials. After connecting, you can query, visualize, and share data—governed by Amazon DataZone—within the tools you already know and trust.

In this post, we’ll guide you through connecting various analytics tools to Amazon DataZone using the Athena JDBC driver, enabling seamless access to your subscribed data within your Amazon DataZone projects.

Solution overview

To demonstrate these capabilities, consider a use case where your marketing team wants to drive a campaign that’s focused on product adoption. To achieve this, you need access to sales orders, shipment details, and customer data owned by the retail team. The retail team, acting as the data producer, publishes the necessary data assets to Amazon DataZone, allowing you, as a consumer, to discover and subscribe to these assets.

After the subscription is approved, the data assets become available within your marketing team’s project environment in Amazon DataZone. You can then use your preferred tool (for example, DBeaver, as shown in the following diagram) to perform data exploration.

Prerequisites

To follow along with this post, you need to have the following prerequisites in place:

AWS account – You must have an active AWS account. If you don’t have one, see How do I create and activate a new AWS account?.
Amazon DataZone resources – You need a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone project environment (DefaultDataLake environment with a DataLakeProfile).
Publish data assets – As the data producer from the retail team, you must ingest individual data assets into Amazon DataZone. For this use case, create a data source and import the technical metadata of four data assets—customers, order_items, orders, products, reviews, and shipments—from AWS Glue Data Catalog. Ensure the data assets are enriched with business descriptions and published to the catalog.
Subscribe data assets – As a data analyst from the marketing team, you must discover and subscribe to the data assets. The data producer from the retail team will review and approve your subscription. Upon successful fulfillment, the data assets will be added to your data lake environment. For detailed subscription instructions, see the Amazon DataZone User Guide.

The following figure shows the subscribed assets added to the data lake environment in your marketing project.

In the following sections, we will walk you through the steps to configure DBeaver to consume the subscribed assets from Amazon DataZone.

Configuring DBeaver to access subscribed data assets

In this section, you configure DBeaver to access the subscribed assets from the Marketing project

To configure DBeaver:

Connect with JDBC: In the Amazon DataZone portal, navigate to the Marketing project, select the Environments tab and select Connect with JDBC.
1. Select Marketing from the list in the top navigation are.
2. Choose Environments
3. Select Connect with JDBC.

A new screen will display the JDBC connection parameters. Make sure to capture these details for configuring the database connection in DBeaver, including the JDBC URL, Domain ID, Environment ID, Region, and IDC Issuer URL.
Download and install the latest Athena driver:
- If DBeaver has the Athena driver pre-installed, it might be the older (v2) version. To ensure compatibility with Amazon DataZone, you need the latest driver (v3), which includes the necessary authentication features.
- Download the latest JDBC driver—version 3.x.
- To install the latest driver:
  - Go to Database and then to Driver Manager in DBeaver.
  - Select the Athena driver and choose Edit.
  - Choose Download to fetch the latest driver version.
  - If prompted, select the appropriate version and confirm the download.

In the DBeaver SQL client, create a new database connection and select the Athena driver.
In the Driver Properties section, enter the parameters that you captured from Amazon DataZone:
- CredentialsProvider: The credentials provider to authenticate requests to AWS
- DataZoneDomainId: The ID of your Amazon DataZone domain
- DataZoneDomainRegion: The AWS Region where your domain is hosted.
- DataZoneEnvironmentId: The ID of your DefaultDataLake environment.
- IdentityCenterIssuerUrl: The issuer URL used by AWS IAM Identity Center for token issuance.
- OutputLocation: Amazon S3 path for storing query results.
- Region: The Region where the environment is created.
- Workgroup: Amazon Athena workgroup of the environment.

Choose Test connection.
You will be redirected to the IAM Identity Center sign-in portal. Sign in with your credentials. If you’re already signed in through single sign-on (SSO), this step will be skipped.
After you sign in, you will be prompted to authorize the DataZoneAuthPlugin. Choose Allow access to authorize access to Amazon DataZone from DBeaver.
After the connection is established, a success message will appear as shown in the screenshot
You can now view and query all subscribed assets directly within DBeaver.

These steps might also apply to other analytics tools and clients that support JDBC connections. If you’re using a different tool, you might need to adapt these instructions accordingly to ensure proper configuration and access to Amazon DataZone data assets.

Integration with other applications

You can use similar steps for other BI and analytics tools that support standard database connections.

Connect to Tableau Desktop

Use the Athena JDBC driver to connect Tableau to Amazon DataZone and visualize your subscribed data.

To connect to Tableau Desktop:

Make sure that you’re using the latest Athena JDBC 3.x driver.
Copy the JDBC driver file and place it in the appropriate folders for your operating system
- For Mac OS: ~/Library/Tableau/Drivers
- For Windows: C:\Program Files\Tableau\Drivers
Open Tableau Desktop. From the To a Server connection menu, select Other Databases (JDBC) to connect to Amazon DataZone.
Paste the JDBC connection string you copied from the DataZone portal into the URL Leave other fields such as Dialect, Username, and Password blank and choose Sign in.
This will redirect you to authenticate with IAM Identity Center. Enter the credentials of the Identity Center user that you used to sign in to the DataZone portal. Authorize the DataZoneAuthPlugin to access Amazon DataZone from Tableau. Once the connection is established with the success message, you now view your project’s subscribed data directly within Tableau and build dashboards.

See the Amazon DataZone and Tableau blog post for step-by-step instructions.

Connect to Microsoft Power BI

Now, let’s look at connecting Amazon DataZone with Microsoft Power BI on Windows.

While Amazon Athena provides a native ODBC driver for connecting to ODBC-compatible tools like Microsoft Power BI, it currently doesn’t support Amazon DataZone authentication. Therefore, in this post, we will use an ODBC-JDBC bridge to connect Amazon DataZone with Microsoft Power BI using the Athena JDBC driver, which supports DataZone authentication.

In this post, we’re using the ZappySys driver as the ODBC-JDBC bridge. This is a third-party solution that requires a separate licensing fee, which isn’t included in the AWS solution. You can choose to use any other solution for ODBC-JDBC bridge.

To connect to Power BI:

Make sure that you have administrator privileges to run the ODBC Data Source Administrator.
From the Windows Start menu, run the ODBC Data Source Administrator (the 64-bit version) using run as Administrator.
Create a New Data Source with the ZappySys JDBC Bridge Driver. You will be prompted to enter your connection details.
Paste the JDBC URL you copied from the DataZone portal in the Connection String, along with the driver class and JDBC driver file. Make sure that you’re using the latest Athena JDBC 3.x driver.
Choose Test Connection. A new dialog window will pop up after the connection is successful.
After configuring the data source, launch Power BI. Create a blank report or use an existing report to integrate the new visuals. Choose Get Data and select the name of the data source you created. This will open a new browser window to authenticate your credentials. Allow access to authorize the DataZone plugin. After authorization is complete, you can build your reports in Microsoft Power BI with the subscribed data assets.

Connect to SQL Workbench

Discover how SQL Workbench can connect to Amazon DataZone for users who prefer a SQL interface to query data lake tables and views subscribed through projects in Amazon DataZone.

To connect to SQL Workbench

Make sure that you’re using the latest Athena JDBC 3.x driver.
Open SQL Workbench/J and choose Manage Drivers.
Select the option to add a new driver. Enter a name for it, such as DatazoneAthenaJDBC, and import the driver you downloaded in the previous steps.
Create a new connection and enter a name it, such as datazone-profile. In the Driver option, select the driver you configured.
For the URL, enter the string jdbc:athena://region=us-east-1; (In the example, the Virginia Region is being used). Choose Extended Properties.
Under Extended Properties, add the following parameters that you copied from the DataZone portal and choose OK. You can also include these parameters in the JDBC (URL) connection string.
1. The parameters to add are:
  - Workgroup
  - DataZoneEndpointOverride
  - OutputLocation
  - DataZoneDomainId
  - IdentityCenterIssuerURL
  - CredentialsProvider
  - DatazoneEnvironmentId
  - DataZoneDomainRegain

You will be prompted to sign in and authenticate. Allow access and authorization to Amazon DataZone.
After successful connection, in SQL Workbench/J, under Database Explorer, select the desired database. For example, select the database that has access to the subscribed data asset orders. Select the data asset and execute the query.

Cleanup

To ensure no additional charges are incurred after testing, be sure to delete the Amazon DataZone domain. See Delete Amazon DataZone domains for instructions.

Conclusion

Amazon DataZone continues to expand its offerings, providing you with more flexibility to access, analyze, and visualize your subscribed data. With support for the Athena JDBC driver, you can now use a wide range of popular BI and analytics tools, making data accessed through Amazon DataZone more accessible than ever before. Whether you’re using Tableau, Power BI, or other familiar tools, the integration with Amazon DataZone ensures that your data remains secure and accessible to authorized users.

The feature is supported in all AWS commercial Regions where Amazon DataZone is currently available. Watch the video below to learn how to connect Amazon DataZone to external analytics tools via JDBC. Get started with our technical documentation.

About the Authors

Ramesh H Singh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon DataZone team. He is passionate about building high-performance ML/AI and analytics products that enable enterprise customers to achieve their critical goals using cutting-edge technology. Connect with him on LinkedIn.

Eric Fleishman is a software engineer at AWS in Seattle. He loves diving into cloud technology and solving complex problems to build impactful solutions. Outside of work, he is all about staying active—whether its snowboarding down the slopes or working out. He enjoys pushing his limits and embracing new challenges.

Theo Tolv is a Senior Analytics Architect based in Stockholm, Sweden. He’s worked with small and big data for most of his career, and has built applications running on AWS since 2008. In his spare time he likes to tinker with electronics and read space opera.

Joel Farvault is Principal Specialist SA Analytics for AWS with 25 years’ experience working on enterprise architecture, data governance and analytics, mainly in the financial services industry. Joel has led data transformation projects on fraud analytics, claims automation, and Master Data Management. He leverages his experience to advise customers on their data strategy and technology foundations.

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Fabricio Hamada is a Senior Data Strategy Solutions Architect at AWS.

Lionel Pulickal is Sr. Solutions Architect at AWS

Leverage Amazon Q Developer and AWS Chatbot within Slack

2024-10-29 Jonathan Wong

Post Syndicated from Jonathan Wong original https://aws.amazon.com/blogs/devops/leverage-amazon-q-developer-and-aws-chatbot-within-slack/

The release of Amazon Q Developer and its ability to be integrated into AWS Chatbot allows users who use Microsoft Teams or Slack to stay within their communication platform and interact with a conversational generative artificial intelligence (AI) AWS expert.

Amazon Q Developer is a conversational generative AI chatbot that provides AWS assistance in the form of best practices, documentation, and answers your AWS related questions. AWS Chatbot is a service that lets you interact with AWS services directly from your communications platform such as Microsoft Teams, Amazon Chime, or Slack. Users can ask Q about best practices, building solutions, troubleshooting issues, and more, creating a productive and collaborative environment. Users can also interface with Chatbot to run AWS CLI commands or open support cases all within Slack.

In this post, we show you how you can leverage Q Developer and Chatbot in your Slack workspace by highlighting a number of use cases along with solution screenshots that can enhance a company’s AWS productivity. We will also showcase an architecture diagram, detailing the flow of actions and the use of different services. To learn more about how to implement Q Developer and Chatbot in Slack, refer to this documentation.

Disclaimer: The information and solutions provided by Q Developer are based on patterns from AWS-related data and best practices. While we strive to offer accurate and helpful guidance, please note that the suggestions may not always be fully accurate or applicable to every situation. It is essential to conduct additional research and verify the information with official AWS documentation or consult with AWS support before implementing any recommendations. Always use your judgment and consider the specific requirements of your environment when making decisions based on AI-generated advice.

Leveraging Q Developer and Chatbot

Q Developer and Chatbot serve a wide range of personas across an organization, catering to both AWS-savvy users and those with limited cloud expertise. Software engineers, for instance, can leverage Q Developer to quickly locate documentation, troubleshoot issues, or find best practices, streamlining their workflow. Security engineers can interact with Chatbot to monitor incidents and receive real-time alerts. Even non-technical users, like project managers or operations staff, can benefit from these tools without needing deep cloud knowledge. Together, these tools enhance productivity and collaboration across the company, regardless of technical expertise.

Use Cases

The use cases section is split into two categories, one for Q Developer, and the other for Chatbot. Both services provide unique abilities to interact with AWS to get the response you are looking for and can be accessed by sending a message to @aws on Slack. Q Developer allows users to ask questions in natural language and responds back with a response and a list of sources. Chatbot allows users to open support cases and to run a number of AWS CLI commands for services such as S3, Lambda, and CloudWatch.

Q Developer Use Cases

Q Developer is a versatile tool designed to assist teams for a number of AWS related use cases. In this post, we will focus on training and onboarding, troubleshooting issues, and implementing AWS best practices.

Training and Onboarding

Benefit: Q Developer can act as a virtual learning assistant, providing personalized training and learning paths for users based on their role, skill level, and current projects. It helps team members stay updated with the latest AWS features and best practices, enhances their skills, and ensures that they can leverage AWS services effectively and efficiently. By offering targeted resources, Q Developer supports continuous learning and helps users prepare for AWS certifications or new roles.

Use Case: AWS Beginner Recommendations. When a new employee joins the team, Q Developer can help them get up to speed by suggesting beginner-level tutorials and essential AWS concepts based on the team’s current tech stack and projects.

The conversation covers recommendations for resources to learn more about AWS, including AWS Documentation, AWS Training and Certification, AWS Blogs and Community, and AWS re:Invent and other events.

Figure 1 – AWS Beginner Recommendations

Use Case: Certification Guidance. An employee aims to get another AWS certification. They can ask Q Developer to provide a structured learning path with recommended courses, study guides, whitepapers, and practice exams to prepare effectively.

The conversation discusses a structured learning path to prepare for the AWS Machine Learning Specialty Certification, covering topics like the AWS Certified Cloud Practitioner certification, the AWS Certified Machine Learning - Specialty certification, and recommended study materials and practices.

Figure 2 – Certification Guidance

Troubleshooting Issues

Benefit: Q Developer provides targeted troubleshooting guidance, helping users to diagnose and resolve issues efficiently. By leveraging AWS service documentation, best practices, and community discussions, Q Developer reduces the time spent on searching for solutions and allows users to focus on resolving issues faster. This improves operational efficiency and minimizes downtime or disruptions.

Use Case: Optimization Recommendations. A developer is facing an issue with running their application on EC2 during peak hours and is looking for recommendations to diagnose the issue.

The conversation provides recommendations to address performance issues with an EC2 instance, including EBS volume configuration, network optimization, system optimization, and cost-effective solutions.

Figure 3 – Optimizations Recommendations

Use Case: Service Troubleshooting. An engineer is working on configuring API Gateway with their application but receives a 504 Gateway Timeout error. Q Developer can look up HTTP response codes for specific services and recommend a plan to tackle the issue.

The conversation discusses troubleshooting a 504 Gateway Timeout error with an API Gateway, providing steps to check CloudWatch logs, review the Lambda function, optimize the Lambda function's performance, and implement client-side retry logic.

Figure 4 – Service Troubleshooting

Best Practices

Benefit: Q Developer provides access to AWS best practices, ensuring that users can build, manage, and maintain their cloud infrastructure effectively. By adhering to best practices, users can optimize their applications for performance, security, scalability, and cost-efficiency. Q Developer helps users stay informed about evolving best practices for using AWS services, ensuring their deployments are up-to-date and compliant with industry standards.

Use Case: Designing Resilient Architectures. A solutions architect is designing a new application on AWS and wants to ensure it’s highly available and fault-tolerant. By asking Q Developer for best practices, they can receive guidance on a number topics including region selection, software, architecture, and deployment strategies to maximize uptime and reliability.

The conversation covers best practices for designing a highly available and fault-tolerant application on AWS, including region selection, alignment to demand, software and architecture, data management, hardware and services, process and culture, deployment strategies, and monitoring and logging.

Figure 5 – Designing Resilient Architectures

Use Case: Deploying Applications for Operational Excellence. An engineer is looking for best practices to deploy an application onto AWS Elastic Beanstalk. Q Developer can assist with providing specific tips for the job that conforms with AWS’ operational excellence pillar found in the AWS Well-Architected Framework.

Recommends several best practices such as choosing the right deployment policy, using rolling updates, implementing auto scaling, and optimizing for content delivery.

Figure 6 – Operational Excellence

Chatbot Use Cases

Chatbot can be used to run AWS CLI commands, open support cases, and more within Slack. To learn more about how to get started with these commands, please visit Chatbot’s documentation and refer to this AWS Blog for additional information.

Using Chatbot and Q Developer Together

We can use Chatbot and Q Developer together to provide clarity in situations where an organization receives alerts on their Slack channel. For example, you can configure Chatbot to receive notifications using Amazon Simple Notification Service based off of rules set up within Amazon EventBridge and it will be delivered directly into your Slack channel. Given that an organization can have many types of notifications enabled for their AWS services, there may be times where the message that is being sent to Slack can be confusing and not well understood. You can take the message provided to you from the notification and provide that as context to Q Developer to help you dive deep into the situation and help figure out next steps. To learn more about setting up notifications and having them be sent to your Slack, please refer to this documentation.

Notification from Chatbot on Slack indicating to the user that there is an issue.

Figure 7 – Chatbot Error Notification

Q to address the issue, such as verifying the instance's health, ensuring the Auto Scaling group's configuration is correct, and reviewing the instance's configuration.

Figure 8 – Q Developer Deep Dive into Chatbot Notification

Architecture Diagram

Diagram illustrating the flow of information between a user, Slack Workspace, AWS Chatbot, and an Amazon Q Developer.

Figure 9 – Solution Overview

A user logs into Slack and can either ask a question, run AWS command(s), or open a support case.
Slack sends the request to Chatbot which then validates that it can be processed from the channel role and associated guardrail policies, both of which are setup through AWS Identity and Access Management. If the request follows the Chatbot use case(s), we can disregard step 3 and move to step 4.
The request is forwarded to Q Developer where it is processed and formulates a response which is then sent back to Chatbot. Chatbot will then relay the response back to Slack which is displayed to the user.
Logs are captured from the original message and the response and can be located within Amazon CloudWatch

Next Steps

Refer to these AWS documentation links that cover how to get started with setting up Q Developer and Chatbot in Slack. It is important to follow the order of the listed documents and to adhere to each of the steps listed to be able to get started with using the solution.

Integration Steps

Setting up AWS Chatbot
1. AWS Chatbot Getting Started documentation outlines the steps to set up AWS Chatbot for interacting with AWS infrastructure. It covers steps such as setting up an AWS account, configuring IAM permissions, and setting up Amazon SNS topics for notifications.
Configuring Slack with Chatbot
1. This documentation shows how to integrate AWS Chatbot with Slack, enabling AWS notifications and interactions in Slack channels. It covers Slack client and channel configuration and testing notifications from AWS services to Slack. Once completed with setting up Slack with Chatbot, refer back to the main Chatbot documentation where you can additional links on monitoring AWS services, customizing Chatbot and performing CLI commands on the lefthand side.
Setting up Q Developer with Chatbot
1. After following the previous documentation steps,you can now integrate Amazon Q Developer with AWS Chatbot in Slack, allowing users to ask questions about AWS services directly in chat. It includes IAM role setup with managed policies and necessary configuration steps. Once completed, this will allow you to use Q Developer through Chatbot’s interface on Slack.

Conclusion

This post highlights how using Q Developer and Chatbot within Slack can boost productivity for a number of use cases. Individuals, teams, and organizations can use these two services’ capabilities to navigate the intricacies of AWS, troubleshoot ongoing issues, and provide real-time guidance all without leaving the familiarity of Slack.

Control your AWS Glue Studio development interface with AWS Glue job mode API property

2024-10-29 Shovan Kanjilal

Post Syndicated from Shovan Kanjilal original https://aws.amazon.com/blogs/big-data/control-your-aws-glue-studio-development-interface-with-aws-glue-job-mode-api-property/

In recent years, as the importance of big data has grown, efficient data processing and analysis have become crucial factors in determining a company’s competitiveness. AWS Glue, a serverless data integration service for integrating data across multiple data sources at scale, addresses these data processing needs. Among its features, the AWS Glue Jobs API stands out as a particularly noteworthy tool.

The AWS Glue Jobs API is a robust interface that allows data engineers and developers to programmatically manage and run ETL jobs. By using this API, it becomes possible to automate, schedule, and monitor data pipelines, enabling efficient operation of large-scale data processing tasks.

To improve customer experience with the AWS Glue Jobs API, we added a new property describing the job mode corresponding to script, visual, or notebook. In this post, we explore how the updated AWS Glue Jobs API works in depth and demonstrate the new experience with the updated API.

JobMode property

A new property JobMode describes the mode of AWS Glue jobs (script, visual, or notebook) to improve your UI experience. AWS Glue users can use the mode that best fits your preference. Some extract, transform, and load (ETL) developers prefer to use visual mode and create visual jobs using AWS Glue Studio visual editor. Some data scientists prefer to use notebooks jobs and use AWS Glue Studio notebooks. Some data engineers and developers prefer to implement script through the AWS Glue Studio script editor or preferred integrated development environment (IDE). After the job is created with the preferred mode, you can search for it by filtering on the job mode within your saved AWS Glue jobs page and find it easily. Additionally, if you are migrating existing iPython notebook files to AWS Glue Studio notebook jobs, you can now choose and set the job mode and do so for multiple jobs using this new API property, as demonstrated in this post.

How CreateJob API works with the new JobMode property

You can use CreateJob API to create AWS Glue script or a visual or notebook job. The following is an example of how it works for a visual job using AWS SDK for Python (Boto3): (replace <your-bucket-name> with your S3 bucket)

CODE_GEN_JSON_STR = '''
{
  "node-1": {
    "S3ParquetSource": {
      "Name": "Amazon S3",
      "Paths": [
        "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/"
      ],
      "Exclusions": [],
      "Recurse": true,
      "AdditionalOptions": {
        "EnableSamplePath": false,
        "SamplePath": "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/73612da260b94159b705cf4df12364cb_0.snappy.parquet"
      },
      "OutputSchemas": [
        {
          "Columns": [
            {
              "Name": "marketplace",
              "Type": "string"
            },
            {
              "Name": "customer_id",
              "Type": "string"
            },
            {
              "Name": "review_id",
              "Type": "string"
            },
            {
              "Name": "product_id",
              "Type": "string"
            },
            {
              "Name": "product_title",
              "Type": "string"
            },
            {
              "Name": "star_rating",
              "Type": "bigint"
            },
            {
              "Name": "helpful_votes",
              "Type": "bigint"
            },
            {
              "Name": "total_votes",
              "Type": "bigint"
            },
            {
              "Name": "insight",
              "Type": "string"
            },
            {
              "Name": "review_headline",
              "Type": "string"
            },
            {
              "Name": "review_body",
              "Type": "string"
            },
            {
              "Name": "review_date",
              "Type": "timestamp"
            },
            {
              "Name": "review_year",
              "Type": "bigint"
            }
          ]
        }
      ]
    }
  },
  "node-2": {
    "DropFields": {
      "Name": "Drop Fields",
      "Inputs": [
        "node-1"
      ],
      "Paths": [
        [
          "review_headline"
        ],
        [
          "review_body"
        ],
        [
          "review_date"
        ]
      ]
    }
  },
  "node-3": {
    "S3DirectTarget": {
      "Name": "Amazon S3",
      "Inputs": [
        "node-2"
      ],
      "PartitionKeys": [],
      "Path": "s3://<your-bucket-name>/data/jobmode-blog/output/parquet/",
      "Compression": "snappy",
      "Format": "parquet",
      "SchemaChangePolicy": {
        "EnableUpdateCatalog": false
      }
    }
  }
}
'''

glue_client = boto3.client('glue')
codeGenJson = json.loads(constants.CODE_GEN_JSON_STR, strict=False)

# Call the create_job method
try:
    glue_client.create_job(
        Name="glue-visual-job",
        Description="Glue Visual ETL job",
        Command={'Name': 'glueetl', 'ScriptLocation': "s3://aws-glue-assets-<account-id>-<region>/scripts/glue-visual-job", 'PythonVersion': "3"},
        WorkerType=constants.WORKERTYPE,
        NumberOfWorkers="G.1X",
        Role=<role-arn>,  
        GlueVersion="4.0",        
        CodeGenConfigurationNodes=codeGenJson,
        JobMode="VISUAL"
    )
    print("Successfully created Glue job")
except Exception as e:
    print(f"Error creating Glue job: {str(e)}")

CODE_GEN_JSON_STR represents the visual nodes for the AWS Glue Job. There are three nodes: node-1 uses S3 source, node-2 does transformation, and node-3 uses S3 target. The script instantiates the AWS Glue Boto3 client, loads the JSON, and calls the create_job. JobMode is set to VISUAL.

After you run the Python script, a new job is created. The following screenshot shows how the created job looks in AWS Glue visual editor.

There are three nodes in the visual directed acyclic graph (DAG): node 1 sources product review data for the product_category book from the public S3 bucket, node-2 drops some of the fields that aren’t needed for downstream systems, and node-3 persists the transformed data in a local S3 bucket.

How CloudFormation works with the new JobMode property

You can use AWS CloudFormation to create different types of AWS Glue jobs by specifying the JobMode parameter with the AWS::Glue::Job resource. The supported job modes include:

SCRIPT
VISUAL
NOTEBOOK

In this example, you create a AWS Glue notebook job using AWS CloudFormation, which requires setting the JobMode parameter to NOTEBOOK.

Create a Jupyter Notebook file containing your logic and code, and save the notebook file with a descriptive name, such as my-glue-notebook.ipynb. Alternatively you can download the notebook file, and rename it to my-glue-notebook.ipynb.
Upload the Notebook file to the notebooks/ folder within the aws-glue-assets-<account-id>-<region> S3 bucket.

Create a new CloudFormation template to create a new AWS Glue job, specifying the NotebookJobName parameter as the same name as the Notebook file. Here’s the sample snippet of CloudFormation template:

AWSTemplateFormatVersion: '2010-09-09'
Description: CloudFormation template for creating an AWS Glue ETL job using a Jupyter Notebook

Parameters:
  NotebookJobName:
    Type: String
    Description: Name of the AWS Glue ETL Notebook job

Resources:
  GlueJobRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub ${AWS::StackName}-GlueJobRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      Policies:
        - PolicyName: GlueJobS3Access
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - iam:PassRole
                Resource:
                  - !Sub arn:aws:iam::${AWS::AccountId}:role/${AWS::StackName}-GlueJobRole

  ETLNotebookJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Ref NotebookJobName
      Description: ETL job using a Jupyter Notebook
      Role: !GetAtt GlueJobRole.Arn
      Command:
        Name: glueetl
        PythonVersion: '3'
        ScriptLocation: !Sub s3://aws-glue-assets-${AWS::AccountId}-${AWS::Region}/scripts/${NotebookJobName}.py
      DefaultArguments:
        '--job-bookmark-option': job-bookmark-enable
      JobMode: NOTEBOOK

Outputs:
  ETLNotebookJobName:
    Value: !Ref ETLNotebookJob
    Description: Name of the ETL Notebook job

Deploy the CloudFormation template. For NotebookJobName, enter same name as the notebook file.
Verify that the AWS Glue job you created is listed and that it has the name you specified in the CloudFormation template.

AWS Glue notebook shows the Notebook job that contains the existing cells that you had in the ipynb file. You can review the job details to confirm it’s configured correctly.

Console experience

On the AWS Glue console, in the navigation pane, choose ETL Jobs to observe all your ETL jobs listed. Here you have different columns Job name, Type, Created by, Last modified, and AWS Glue version. You can sort and filter by these columns. The following screenshot shows how it looks.

We also enhanced the console experience with the JobMode introduction. The Created by column on the console gives you information about JobMode of the job. You can filter access jobs created by VISUAL, NOTEBOOK, or SCRIPT, as shown in the following screenshot.

This new console experience helps you search and discover your jobs based on JobMode.

Conclusion

This post demonstrated how AWS Glue Job API works with the newly introduced job mode property. With the new property, you can explicitly choose the mode of each job. The steps instructed detailed usage in API, AWS SDK, and CloudFormation. Additionally, the property makes it straightforward to search and discover your jobs quickly on the AWS Glue console.

About the Authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure, and high-performance data solutions in the cloud.

Manoj Shunmugam is a DevOps Consultant in Professional Services at Amazon Web Services. He works with customers to establish infrastructures using cloud-centered and/or container-based platforms in the AWS Cloud.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.

How to mitigate bot traffic by implementing Challenge actions in your AWS WAF custom rules

2024-10-25 Javier Sanchez Navarro

Post Syndicated from Javier Sanchez Navarro original https://aws.amazon.com/blogs/security/how-to-mitigate-bot-traffic-by-implementing-challenge-actions-in-your-aws-waf-custom-rules/

If you are new to AWS WAF and are interested in learning how to mitigate bot traffic by implementing Challenge actions in your AWS WAF custom rules, here is a basic, cost-effective way of using this action to help you reduce the impact of bot traffic in your applications. We also cover the basics of using the Bot Control feature to implement Challenge actions as a more sophisticated and robust option for an additional cost.

AWS WAF is a web application firewall that helps you protect against common web exploits and bots that can affect availability, compromise security, or consume excessive resources. In AWS WAF, you can create web access control lists (ACLs) that you can set with managed or custom rules. There are five possible actions you can define to take when the rules are triggered: Allow, Count, Block, CAPTCHA, or Challenge. The Challenge action, which we focus on in this post, is useful for detecting requests from automated tools without affecting the user experience.

Why is it important to mitigate bot traffic?

In cybersecurity, we typically use the term bot to describe a range of tools that allow the automation or simulation of HTTP requests. These bots can be legitimate (such as search engine crawlers that index your app or site) or malicious (tools that are used to automate unwanted requests), but both types have the potential to impact your app availability and performance. By properly handling bot traffic, you can reduce this impact, which can help you optimize costs and improve the stability of your infrastructure and the availability of your business.

Before starting

If you’ve never used AWS WAF before, we recommend that you read Getting started with AWS WAF to learn the basics of how to set up this service.

How does the Challenge action work, and what are the benefits?

When a request matches your rule that contains a Challenge action, the HTTP client is presented with a challenge, which most web browsers or non-automated clients are able to process. After solving that challenge, the client receives a token that will be included in subsequent requests—that’s how AWS WAF considers the request to be non-automated and permits access. Using a Challenge action adds a protection layer because bots and other automated tools typically can’t process the challenge as a legitimate web browser would.

A more effective mechanism against bots is to present a CAPTCHA action, in which the user must solve a puzzle. However, this action affects the user experience because it requires human interaction, and it typically involves higher costs than the Challenge option. Challenge actions, which consist of a JavaScript function that most web browsers can support and process in the background, are a great first step to monitor web requests because they don’t affect the user experience directly and are more economic than CAPTCHA.

Implementation options

In this blog post, we discuss two options for you to start handling the traffic from bots. Although the focus of this post is implementing the Challenge action through a custom rule (a rule you can create and set yourself), we’ve also included basic instructions for implementing the Challenge action through Bot Control, which allows you to directly use client application integration for more sophisticated detections.

Option 1: Implementing the Challenge action through a custom rule

The first step in setting up a custom rule with the Challenge option is to understand and define clearly what the expected normal behaviors are from users who access your app. Specifically, you need to know the expected number of requests in a given period of time from a single IP and the maximum time length of a typical session.

How do I define the normal rate of requests?

Both the maximum session length and the rate of requests expected will vary depending on each webpage or app, and this information needs to come from the business and application teams. When a user browses a page, they might trigger several requests (for example, the user will trigger a separate request for each image the page contains). You can use this information to estimate and define how many requests per IP a valid user can generate in a given time.

Additionally, you can enable web ACL logging, which will allow you to query logs from Amazon CloudWatch Logs Insights. From these logs, you can get an understanding of the current behaviors and trends in your web traffic and compare that with the expected behavior that you defined.

What parameters should I use to trigger the challenge?

Implementing challenges based on the headers in the request or the user-agent isn’t a good idea. Although you can act based on either of these fields for valid crawlers like those used by search engines, malicious bots might evolve and tamper with these fields as their creators notice they are being stopped.
Filtering by static IP won’t always work. Only valid crawlers tend to use the same IP range over time and malicious bots often change IP ranges.
Filtering by path might be a good option. If there are parts of your app that shouldn’t be indexed or crawled, you can declare that in your robots.txt file. Bots that disregard these directives can be considered suspicious, and you can present them with the challenge. However, this approach isn’t always good enough: A bot might be set to respect the directives.
A rate-limit rule is an effective option for triggering your challenge when you’re attempting to handle malicious bots. You can define the normal rate of requests that you expect valid users to make as described earlier in this section. Users that go over that rate will be presented with the challenge. You should set this rule as the top priority in order for it to be more efficient.

To create and set the rate-limit rule

Open the AWS WAF & Shield console, choose Web ACLs, and then choose the web ACL to which you will add the rule.

Figure 1: AWS WAF web ACLs
On the Web ACL page, choose the Rules tab.
In the Add rules drop-down list, choose Add my own rules and rule groups. If you already have your rate-based custom rule in place, select the checkbox to the left of the rule, and then choose Edit.

Figure 2: Add your own rules and rule groups
For Rule type, choose Rule builder. For Rule, enter a name for your rule. For Type, choose Rate-based rule.

Figure 3: Start the rule builder
Under Rate-limiting criteria, set the rate limit and define the evaluation window, which is customizable.
For Request aggregation, choose Source IP address. For Scope of inspection and rate limiting, choose Consider all requests.

Figure 4: Set the rate-limiting criteria
For Action, choose Challenge.
For Immunity time (which specifies how long a Challenge token is valid before a new one is needed), choose a value according to the maximum time a normal session could last.
When you set a challenge through custom rules and the token expires, subsequent requests will include an invalid cookie and will therefore be rejected until a new session is started. For example, if a normal session’s maximum duration is 5 minutes, you can leave immunity set to the default, but if the maximum duration can be longer (as in an online shop), then you will need to increase the immunity time according to your use case. (Note that SDK application integration, which we cover in the next section, takes care of presenting a new token if the current one expires, without impacting human users.)

Figure 5: Set the action to Challenge
Choose Add rule.
Set the rule priority by selecting the rule and moving it up in the list. Note that we’re considering a scenario where you set a web ACL for a single account. In this case, remember to place the rate-limiting rule at the top of the list, so that you prevent undesired traffic from triggering additional rules. This is even more important if you have paid rules later in the list.

Figure 6: Set the rule priority

Option 2: Implementing the Challenge action by using Bot Control

Implementing Challenge actions through the Bot Control feature in AWS WAF is an easier, more robust and flexible solution than using a custom rule. However, it has extra costs associated that you should be aware of and evaluate.

Bot Control is a managed rules group that provides improved visibility and automated detection and mitigation mechanisms for bots. You are charged differently depending on the tier of Bot Control you use (Common or Targeted). The Common tier detects a variety of self-identifying bots by using static request data analysis. The Targeted tier adds active analysis of client blueprints and behavior as well as machine learning, and is capable of detection and mitigation of more sophisticated agents. You can read more about the Bot Control protection levels in the documentation.

Some of the main features of Bot Control include the following:

The availability of specific metrics for bot categories, giving you more granularity and improved visibility.
The ability to automatically identify bots based on known patterns and behaviors, such as user-agent, path, or behavior. (For more details, see the Bot Control documentation.)
The ability to be set through a managed rule.
The option to set client application integration through the AWS SDK, by providing you with a code snippet you can add to your code for improved efficiency and transparency. (For more details, see Why you should use the application integration SDKs with Bot Control.)

An in-depth explanation of how to use Bot Control is beyond the scope of this post, but we provide instructions on how to enable it here. For further recommendations, see the AWS WAF Bot Control main page and the topics in the AWS WAF Developer Guide.

To enable Bot Control and configure the rule

Open the WAF & Shield console, choose Web ACLs, and choose the web ACL you want Bot Control enabled on.
On the Rules tab, in the Add rules drop-down list, instead of adding your own rules, choose Add managed rule groups.

Figure 7: Add managed rule groups to your web ACL
On the Add managed rule groups page, expand the first option, AWS managed rule groups, and scroll down to find Bot Control. Then select the Add to web ACL toggle button to enable Bot Control.

Figure 8: Enable the Bot Control rule group
You will need to customize the configuration. To do so, choose Edit.
First, choose the level of inspection you want to use. Common detects a variety of self-identifying bots, but, in this example, we chose Targeted because it adds advanced detection for sophisticated bots by using machine learning and allows the challenge application integration that we mentioned earlier.
Choose the scope of inspection. You can keep the scope set to Inspect all web requests or choose to use scope-down statements if you want a more granular filtering.
Choose the action on a per-bot category basis or choose a single action for all the categories. In this example, we used the same settings for the Challenge action for all the categories.

Figure 9: Set Bot Control actions
Similar to the recommendations for Option 1 earlier in this post, we recommend that you define your use cases and how you want to handle each bot category in the Common section and each rule in the Targeted The settings need to be aligned with your business needs, with the understanding that your needs can change over time. The settings you choose might also be specific to each application—for example, in the case of search engine bots, you need to consider the impact of blocking or mitigating them on your search engine optimization (SEO) and find a balance with app performance.

Figure 10: Targeted rules
Choose Save rule and then choose Add rules.
On the Set rule priority page, set the rule priority by selecting the rule and moving it up or down in the list. Make sure you set the Bot Control managed rule group (AWS-AWSManagedRulesBotControlRuleSet) to be lower in priority than the free rules (both custom and managed). Because Bot Control rules pricing is based on the number of requests processed and the number of CAPTCHA or Challenge actions presented, putting Bot Control rules at the bottom of the list helps you to optimize your costs.
You can now integrate the challenge into your application by using the SDK. For more information, see AWS WAF client application integration.

Next steps

As your cloud infrastructure grows, you need to start managing your protection at scale and centrally. AWS Firewall Manager provides you with a single place to centrally configure, manage, and monitor your AWS WAF firewall, AWS Shield Advanced protections, Amazon Virtual Private Cloud (Amazon VPC) security groups, VPC network ACLs, AWS Network Firewall instances, and Amazon Route 53 Resolver DNS Firewall rules across multiple AWS accounts and resources.

For more information, see the Security Blog posts Centrally manage AWS WAF and AWS Managed Rules at scale with Firewall Manager and How to enforce a security baseline for an AWS WAF ACL across your organization using AWS Firewall Manager.

Conclusion

In this post, we reviewed how you can mitigate bot traffic by implementing Challenge actions. By implementing this action type through a custom rule, you can set up basic, cost-effective measures to handle basic bots and control automated traffic to your applications. As your business grows, you can achieve higher efficiency and better protection against more sophisticated bots by enabling Bot Control rules, which use machine learning for advanced detection. To learn more about these topics, see the following links.

Achieve the best price-performance in Amazon Redshift with elastic histograms for selectivity estimation

2024-10-25 Roger Kim

Post Syndicated from Roger Kim original https://aws.amazon.com/blogs/big-data/achieve-the-best-price-performance-in-amazon-redshift-with-elastic-histograms-for-selectivity-estimation/

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

Amazon Redshift continues to lead in data warehouse price-performance (for examples, see Amazon Redshift continues its price-performance leadership, Amazon Redshift: Lower price, higher performance, and Get up to 3x better price performance with Amazon Redshift than other cloud data warehouses). Amazon Redshift’s advanced Query Optimizer is a crucial part of that leading performance. The Query Optimizer is responsible for finding the fastest way (or plan) to execute a query. It does this by using statistics about the data together with the query to calculate a cost of executing the query for many different plans.

Amazon Redshift has built-in autonomics to collect statistics called automatic analyze (or auto analyze). Auto analyze is a background operation that runs automatically on Redshift tables to keep statistics up-to-date. Statistics collection, however, can be computationally expensive, making it a challenge to keep statistics up-to-date particularly when data is continuously being ingested. As data is ingested into the Redshift data warehouse over time, statistics could become stale, which in turn causes inaccurate selectivity estimations, leading to sub-optimal query plans that impact query performance.

Challenges with stale statistics

Based on Redshift fleet analysis of customer workloads, we found that the staleness of statistics is an especially important factor in the selectivity estimation of predicates with temporal columns such as those with DATE and TIMESTAMP data types. This is due to the following reasons: 1) DATE and TIMESTAMP represent about 11% of predicate columns in the queries in the Amazon Redshift fleet (see Figure 1); 2) More than 40% of query scan volume in the Amazon Redshift fleet have predicates on DATE or TIMESTAMP columns; and 3) Not surprisingly, customer workloads tend to query recent (hot) data more often than historical (cold) data. One such query pattern representative of these customer workloads, derived from the industry standard TPC-H analytics benchmark, is as follows:

SELECT ...
FROM   lineitem
       JOIN orders ON l_orderkey = o_orderkey
       JOIN customer ON ...
WHERE l_shipdate >= current_date - $1
  AND ...

Figure 1: Amazon Redshift fleet metrics on temporal vs non-temporal data types

Solution overview

Amazon Redshift introduced a new selectivity estimation technique in Amazon Redshift patch release P183 (v1.0.75379) to address the situation — having up-to-date statistics on temporal columns improving query plans and thereby performance. The new technique captures real-time statistical metadata gathered during data ingestion without incurring additional computational overhead. For queries with range predicates on temporal columns, the query optimizer uses this additional metadata fetched at runtime to complement the existing statistics, elastically adjusting the histogram boundaries, leading to improved selectivity estimations for temporal predicates. See Figures 2 & 3 for the performance improvements that elastic histograms for selectivity estimation delivers. This query processing optimization is enabled by default requiring no configuration changes or user intervention from users to realize the benefits of automatic optimization and improved query performance.

Benchmark evaluation

We evaluated the new selectivity estimation technique on variations of TPC-H queries. In one variation, the query performs an n-way join between lineitem, orders, and other tables with multiple predicates, including on l_shipdate.

When histogram statistics were stale, the selectivity estimations of predicates on l_shipdate were incorrectly predicted. This led to a sub-optimal query plan with a join order involving large network-heavy data redistributions among the compute resources of the Amazon Redshift provisioned cluster or serverless workgroup. With the new selectivity estimation technique, the prediction became much more accurate, leading to an optimal query plan with a join order that minimized the redistribution of results between join steps, resulting in a performance improvement shown in Figure 2.

Figure 2: Relative performance of TPC-H query variant (lower is better)

Figure 3: Query Plan comparison: Before enhancement (left), After enhancement (right)

Conclusion

In this post, we covered new performance optimizations in Redshift data warehouse query processing and how elastic histogram statistics help enhance selectivity estimation and the overall quality of query plans for Amazon Redshift data warehouse queries in the absence of fresh table statistics.

In summary, Amazon Redshift now offers enhanced query performance with optimizations such as Enhanced Histograms for Selectivity Estimation in the absence of fresh statistics by relying on metadata statistics gathered during ingestion. These optimizations are enabled by default and Amazon Redshift users will benefit with better query response times for their workloads. Amazon Redshift is on a mission to continuously improve performance and therefore overall price-performance. The new selectivity estimation enhancement has already improved the performance of hundreds of thousands of customer queries in the Amazon Redshift fleet since its introduction in the patch release P183. It’s worth noting that this is one of the many behind-the-scenes improvements we continually make to keep Redshift the industry leader in price-performance.

We invite you to try the numerous new features introduced in Amazon Redshift together with the new performance enhancements. For more information, reach out to your AWS account team to request a free consultation or a demo of Amazon Redshift. They will be happy to provide additional guidance and support on choosing the right analytics solution that meets your business needs.

About the authors

Roger Kim is a Software Development Engineer on the Amazon Redshift team focusing on query performance and optimization. He holds a BA in Computer Science and Mathematics from Cornell University.

Mohammed Alkateb is an Engineering Manager at Amazon Redshift. Prior to joining Amazon, Mohammed had 12 years of industry experience in query optimization and database internals as an Individual Contributor and Engineering Manager. Mohammed has 18 US patents, and he has publications in research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD and VLDB. Mohammed holds a PhD in Computer Science from The University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.

Mengchu Cai is a principal engineer on the Amazon Redshift team. Mengchu currently works on query optimization and data lake query performance. He also led the development of SQL language features. Mengchu received his PhD in Computer Science and Engineering from the University of Nebraska Lincoln.

Ravi Animi is a Senior Product Leader on the Amazon Redshift team and manages several functional areas of Amazon Redshift analytics, data, and AI, including spatial analytics, streaming analytics, query performance, Spark integration, and analytics business strategy. He has experience with relational databases, multi-dimensional databases, IoT technologies, storage and compute infrastructure services, and more recently, as a startup founder in the areas of AI and deep learning. Ravi holds dual Bachelors degrees in Physics and Electrical Engineering from Washington University, St. Louis, a Masters in Engineering from Stanford, and an MBA from Chicago Booth.

How to use interface VPC endpoints to meet your security objectives

2024-10-22 Joaquin Manuel Rinaudo

Post Syndicated from Joaquin Manuel Rinaudo original https://aws.amazon.com/blogs/security/how-to-use-interface-vpc-endpoints-to-meet-your-security-objectives/

Amazon Virtual Private Cloud (Amazon VPC) endpoints—powered by AWS PrivateLink—enable customers to establish private connectivity to supported AWS services, enterprise services, and third-party services by using private IP addresses. There are three types of VPC endpoints: interface endpoints, Gateway Load Balancer endpoints, and gateway endpoints. An interface VPC endpoint, in particular, allows customers to design applications that connect to AWS services privately, including the more than 130 AWS services that are available over AWS PrivateLink. For a complete list of services that integrate with PrivateLink, see the documentation for VPC endpoints.

The decision regarding when to use interface VPC endpoints to further secure your AWS infrastructure depends on your need for additional security controls or your preferred architecture patterns. In this blog post, we present four security objectives that VPC endpoints help you achieve. It’s important to note that other non-security benefits, such as reduced latency and cost management, are not covered in this post. For more information on those benefits, see these topics:

Improve the latency of sensitive applications with Amazon SageMaker real-time inference endpoints
Reduce network costs by centralizing access to AWS services through AWS Transit Gateway

Background

By default, network packets that originate in the AWS network with a destination on the AWS network (for example, public endpoints for AWS services) stay in the AWS network, except traffic to and from AWS China Regions. In addition, all data flowing across the AWS global network that interconnects AWS data centers and Regions is automatically encrypted at the physical layer before it leaves AWS secured facilities.

AWS PrivateLink VPC endpoints enable customers to further enhance the security posture of their applications by establishing private connectivity to supported AWS services, enterprise services, and third-party services by using a private IP address. You can find patterns for how to use the different types of endpoints in the Securely Access Services Over AWS PrivateLink whitepaper.

An interface VPC endpoint is a collection of one or more elastic network interfaces with private IP addresses. These endpoints can serve as an entry point for traffic destined to a supported AWS service in the same AWS Region as the VPC, without requiring an internet gateway, NAT device, VPN connection, AWS Direct Connect connection, or a public IP. Customers can then use interface VPC endpoints to help meet multiple security objectives, such as the following:

Implement networks that are isolated from the internet
Implement a data perimeter by using VPC endpoint policies
Enable private connectivity to AWS service API endpoints for on-premises environments
Align with specific compliance requirements

In the rest of this post, we’ll discuss each of these objectives in detail and how interface VPC endpoints can help you implement them.

Security objective 1: Implementing networks that are isolated from the internet

If you operate sensitive workloads, you might require that they run in private subnets that are isolated from the internet. In this scenario, the subnets in the network don’t have routes to an internet gateway and won’t be able to either send packets to the internet or receive packets from it.

In this case, you can use interface VPC endpoints to connect your VPC to AWS services in the same Region as if they were in your VPC, without configuring an internet gateway, NAT instance, or route tables. For information on how to configure a cross-Region VPC interface endpoint by using VPC peering, see this guidance.

Figure 1 shows an example architecture with an Amazon Elastic Compute Cloud (Amazon EC2) instance running in an isolated network and using interface VPC endpoints to send messages to Amazon Simple Queue Service (Amazon SQS).

Figure 1: Isolated subnet for EC2 server sending messages to Amazon SQS

Security objective 2: Implement a data perimeter using VPC endpoint policies

A data perimeter is a set of guardrails to help ensure that only your trusted identities are accessing trusted resources from expected networks. Learn more about data perimeters on AWS.

You can use VPC endpoint policies to implement a data perimeter by allowing access to only trusted entities and resources from your network, helping to prevent unintended access. This enables you to take advantage of the power of AWS Identity and Access Management (IAM) policy and flexibility to control access to your resources at a granular level.

In the VPC diagram in Figure 2, EC2 instance traffic flows out through a firewall endpoint, NAT gateway, and internet gateway to reach the S3 public API endpoint, remaining within the AWS network. However, this setup does not allow the implementation of a logical data perimeter to control the specific resources that the EC2 instance can access.

Figure 2: Before implementing a data perimeter

In contrast, in the diagram in Figure 3, you can see how VPC interface endpoints enable the use of VPC endpoint policies to enforce a data perimeter, such as only allowing certain S3 buckets to be accessed by the EC2 instance.

Figure 3: After implementing a data perimeter

For example, you can attach a policy, similar to the one below, to an Amazon S3 interface or gateway VPC endpoint to restrict access from the VPC to only S3 buckets that are owned by the same AWS Organizations organization. Make sure to replace <MY-ORG-ID> with your own information.

{
    "Version": "2012-10-17",
        "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "*",
            "Condition": { "StringEquals": { "aws:ResourceOrgID": <MY-ORG-ID>}}
        }
    ]
}

As a further example, the following policy shows how you can limit access to only trusted identities. You can attach this policy to an S3 interface VPC endpoint to permit access only to principals from your organization, to help mitigate the risk of unintended disclosure through non-corporate credentials. Make sure to replace <MY-ORG-ID> with your own information.

{    "Version": "2012-10-17",
        "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "*",
            "Condition": { "StringEquals": { "aws:PrincipalOrgID": "<MY-ORG-ID>"}}
        }
    ]
}

Finally, you can create resource policies for your resources to restrict access to only VPC interface endpoints. For example, you can use the following policy from our Amazon S3 User Guide for S3 buckets. Make sure to replace <MY-VPCE-ID> and <MY-BUCKET> with your own information.

 {  "Version": "2012-10-17",
        "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3::<MY-BUCKET>", "arn:aws:s3::<MY-BUCKET>/*"]
            "Condition": { "StringNotEquals": { "aws:SourceVpce": "<MY-VPCE-ID>"}}
        }
    ]
}

For more information on the use of these condition keys and how to implement a data perimeter, see this blog post.

Security objective 3: Enable private connectivity to AWS service API endpoints for on-premises environments

You might be required to run private connectivity to AWS only from your on-premises environments, such as when your on-premises firewalls are configured to limit the connectivity to the internet, including AWS public endpoints.

In this case, you can use interface VPC endpoints with Direct Connect private virtual interfaces (VIFs) or Site-to-Site VPN to extend private connectivity to your on-premises networks. With this setup, you can also enforce data perimeter rules like those shown earlier in this post.

For example, customers can use interface VPC endpoints from Amazon CloudWatch agents running on on-premises servers to CloudWatch through a private connection, as demonstrated in this blog post.

In the diagram in Figure 4, we show how you can extend this approach to include other services, such as Amazon S3, in a single VPC setup. To implement this pattern, you need to set up conditional forwarding on your on-premises DNS resolver to forward queries for amazonaws.com to an Amazon Route 53 Resolver’s inbound endpoint IPs.

The flow in this scenario is as follows:

The DNS query for your S3 endpoint from your on-premises host is routed to the locally configured on-premises DNS server.
The on-premises DNS server performs conditional forwarding to an Amazon Route 53 inbound resolver endpoint IP address.
The inbound resolver returns the IP address of the interface VPC endpoint, which allows the on-premises host to establish private connectivity through AWS VPN or AWS Direct Connect.

Figure 4: On-premises private connectivity to Amazon S3 and Amazon CloudWatch

You can extend this architecture to support a cross-Region and multi-VPC setup by using AWS Transit Gateway and Amazon Route 53 private hosted zones, as described in the Building a Scalable and Secure Multi-VPC AWS Network Infrastructure whitepaper. Keep in mind that a distributed VPC endpoint approach (one that uses one endpoint per VPC) will allow you to implement least-privilege policies in VPC endpoints. A centralized approach, while more cost-effective, can increase the complexity of maintaining least privilege in a single policy and increase the scope of impact of a security event.

Security objective 4: Align with specific compliance requirements

In certain cases, customers operating in industries such as financial services or healthcare need to maintain compliance with regulations or standards such as HIPAA, the EU-US Data Privacy Framework, and PCI DSS. Although all communication between instances and services hosted in AWS use the AWS private network, using an interface VPC endpoint can help prove to auditors that you’re applying a defense-in-depth approach. This approach includes designing your workloads to run in networks that are isolated from the internet or implementing additional conditions such as the example VPC endpoint policies shown earlier in this post.

You can use AWS Audit Manager to get started mapping your compliance requirements to industry and geographic frameworks, such as NIST SP 800-53 Rev. 5, FedRAMP, and PCI DSS, and to automate evidence collection for controls such as the use of VPC endpoints. If you also have custom compliance requirements, you can create your own custom controls by using the Configure Amazon Virtual Private Cloud (VPC) service endpoints core control in the AWS Audit Manager control library console.

If you want to know how the use of VPC endpoints can help you align with compliance requirements for your specific workload and require assistance beyond what is provided in the public documentation on the AWS Compliance Programs webpage, you can consult with AWS Security Assurance Services (AWS SAS). AWS SAS has expert consultants and advisors who can help you design your systems to achieve, maintain, and automate compliance in the cloud.

Conclusion

In this blog post, we presented four security objectives to consider when deciding whether to use AWS interface VPC endpoints. You can use this information when you design your architecture or create a threat model to help implement secure architectures for your AWS hosted workloads. If you want to learn more about AWS PrivateLink and interface endpoints, see the AWS PrivateLink documentation. If you’re interested in learning more about implementing data perimeter concepts by using VPC endpoints, we suggest this workshop.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

2024-10-22 Ramakant Joshi

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/big-data/demystify-data-sharing-and-collaboration-patterns-on-aws-choosing-the-right-tool-for-the-job/

Data is the most significant asset of any organization. However, enterprises often encounter challenges with data silos, insufficient access controls, poor governance, and quality issues. Embracing data as a product is the key to address these challenges and foster a data-driven culture.

In this context, the adoption of data lakes and the data mesh framework emerges as a powerful approach. By decentralizing data ownership and distribution, enterprises can break down silos and enable seamless data sharing. Cataloging data, making the data searchable, implementing robust security and governance, and establishing effective data sharing processes are essential to this transformation. AWS offers services like AWS Data Exchange, AWS Glue, AWS Clean Rooms and Amazon DataZone to help organizations unlock the full potential of their data.

Personas

Let’s identify the various roles involved in the data sharing process.

First of all, there are data producers, which might include internal teams/systems, third-party producers, and partners. The data consumers include internal stakeholders/systems, external partners, and end-customers. At the core of this ecosystem lies the enterprise data platform. When considering enterprises, numerous personas come into play:

Line of business users – These personas need to classify data, add business context, collaborate effectively with other lines of business, gain enhanced visibility into business key performance indicators (KPIs) for improved outcomes, and explore opportunities for monetizing data
Partners – Partners should be able to share data, collaborate with other partners and customers.
Data scientists and business analysts – These personas should be able to access the data, analyze it and generate actionable business insights
Data engineers – Data engineers are tasked with building the proper data pipeline and cataloging the data that meets the diverse needs of stakeholders, including business analysts, data scientists, partners, and line of business users
Data security and governance officers – Data security involves making sure producers and consumers have appropriate access to the data, implementing right access permissions, and maintaining compliance with industry regulations, particularly in highly regulated sectors like healthcare, life sciences, and financial services. This persona is also responsible for enhancing data governance by tracking lineage, and establishing data mesh policies

Choosing the right tool for the job

Now that you have identified the various personas, it’s important to select the appropriate tools for each role:

Starting with the producers, if your data source includes a software as a service (SaaS) platform, AWS Glue offers options to automate data flows between software service providers and AWS services.
For producers seeking collaboration with partners, AWS Clean Rooms facilitates secure collaboration and analysis of collective datasets without the need to share or duplicate underlying data.
When dealing with third-party data sources, AWS Data Exchange simplifies the discovery, subscription, and utilization of third-party data from a diverse range of producers or providers. As a producer, you can also monetize your data through the subscription model using AWS Data Exchange.
Within your organization, you can democratize data with governance, using Amazon DataZone, which offers built-in governance features.
For SaaS consumers, AWS Glue supports bidirectional transfer and serves both as a producer and consumer tool for various SaaS providers.

Let’s briefly describe the capabilities of the AWS services we referred above:

AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. It provides data catalog, automated crawlers, and visual job creation to streamline data integration across various data sources and targets.

AWS Data Exchange enables you to find, subscribe to, and use third-party datasets in the AWS Cloud. It also provides a platform through which a data producer can make their data available for consumption for subscribers. It is a data marketplace featuring over 300 providers offering thousands of datasets accessible through files, Amazon Redshift tables, and APIs. This service supports consolidated billing and subscription management, offering you the flexibility to explore 1,000 free datasets and samples. You don’t need to set up a separate billing mechanism or payment method specifically for AWS Data Exchange subscriptions.

AWS Clean Rooms is designed to assist companies and their partners in securely analyzing and collaborating on collective datasets without revealing or sharing underlying data. You can swiftly create a secure data clean room, fostering collaboration with other entities on the AWS Cloud to derive unique insights for initiatives such as advertising campaigns or research and development. This service protects underlying data through a comprehensive set of privacy-enhancing controls and flexible analysis rules tailored to specific business needs.

Amazon DataZone is a data management service that makes it fast and straightforward to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources. With Amazon DataZone, administrators and data stewards who oversee an organization’s data assets can manage and govern access to data using fine-grained controls. These controls are designed to grant access with the right level of privileges and context. Amazon DataZone makes it straightforward for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization so they can discover, use, and collaborate to derive data-driven insights.

Use cases

Let’s review some example use cases to understand how these diverse services can be effectively applied within a business context to achieve the desired outcomes. In this particular scenario, we focus on a company named AnyHealth, which operates in the healthcare and life sciences sector. This company encompasses multiple lines of businesses, specializing in the sale of various scientific equipment. Three key requirements have been identified:

Sales and customer visibility by line of business – AnyHealth wants to gain insights into the sales performance and customer demands specific to each line of business. This necessitates a comprehensive view of sales activities and customer requirements tailored to individual lines of business.
Cross-organization supply chain and inventory visibility – The company faces challenges related to supply chain and inventory management, especially in global crisis situations like a pandemic. They want to address instances where inventory items are idle in one line of business while there is demand for the same items in another. To overcome this, they want to establish cross-organizational visibility of supply chain and inventory data, breaking down silos and achieving prompt responses to business demands.
Cross-sell and up-sell opportunities – AnyHealth intends to boost sales by implementing cross-selling and up-selling strategies. To achieve this, they plan to use machine learning (ML) models to extract insights from data. These insights will then be provided to sales representatives and resellers, enabling them to identify and capitalize on opportunities effectively.

In the following sections, we discuss how to address each requirement in more detail and the AWS services that best fit each solution.

Sales and customer visibility by line of business

The first requirement involves obtaining visibility into sales and customer demand by line of business. The key consumers of this data include line of business leaders, business analysts, and various other business stakeholders.

The initial step is to ingest sales and order data into the platform. Currently, this data is centralized in the ERP system, specifically SAP. The objective is to regularly retrieve this data and capture any changes that occur. The data engineers are instrumental in building this pipeline. Given that we are dealing with a SaaS integration, AWS Glue is the logical choice for seamless data ingestion.

Next, we focus on building the enterprise data platform where the accumulated data will be hosted. This platform will incorporate robust cataloging, making sure the data is easily searchable, and will enforce the necessary security and governance measures for selective sharing among business stakeholders, data engineers, analysts, security and governance officers. In this context, Amazon DataZone is the optimal choice for managing the enterprise data platform.

As stated earlier, the first step involves data ingestion. Data is ingested from a third-party vendor SaaS solution (SAP), and the data engineer uses AWS Glue. Utilizing the SAP data connector, the data engineer establishes a connection with the SAP environment, running scheduled jobs.

The data lands in Amazon Simple Storage Service (Amazon S3). Additional AWS Glue jobs are created to transform and curate the data. The curated data is placed in a designated bucket and AWS Glue crawlers are run to catalog the data. This cataloged data is then managed through Amazon DataZone.

In Amazon DataZone, the data security officer creates the corporate domain. She/he creates producer projects and enables access to data engineers, and business analysts. Data engineers ensure sales and customer data is available from the source into the Amazon DataZone project. Business analysts enhance the data with business metadata/glossaries and publish the same as data assets or data products. The data security officer sets permissions in Amazon DataZone to allow users to access the data portal. Users can search for assets in the Amazon DataZone catalog, view the metadata assigned to them, and access the assets.

Amazon Athena is used to query, and explore the data. Amazon QuickSight is used to read from Amazon Athena and generate reports that is consumed by the line of business users and other stakeholders.

The following diagram illustrates the solution architecture using AWS services.

Cross-organization supply chain and inventory visibility

For the second requirement, the objective is to achieve visibility of supply chain and inventory across the organization. The key stakeholders remain line of business users. They would like to get a cross-organization visibility of supply chain and inventory data. The aim is to ingest supply chain and inventory information in a scheduled manner from the ERP system (SAP), and also capture any changes in the supply chain and inventory data. The persona involved in setting up the data ingestion pipeline is a data engineer. Given that we are extracting data from SAP, AWS Glue is the suitable choice for this requirement.

The next step involves obtaining economic indicators and weather information from third-party sources. AnyHealth, with its diverse lines of business, including one that manufactures medical equipment such as inhalers for asthma treatment, recognizes the significance of collecting weather information, particularly data about pollen, because it directly impacts the patient population. Additionally, socioeconomic conditions play a crucial role in government-assisted programs related to out-of-hospital care. To incorporate this third-party data, AWS Data Exchange is the logical choice.

Finally, all the accumulated data needs to be hosted on the enterprise data platform, with cataloging, and robust security and governance measures. In this context, Amazon DataZone is the preferred solution.

The pipeline begins with the ingestion of data from SAP, facilitated by AWS Glue. The data lands in Amazon S3, where AWS Glue jobs are used to curate the data, generate curated tables, and then AWS Glue crawlers are used to catalog the data.

AWS Data Exchange serves as the platform for collecting economic trends and weather information. The business analyst leverages AWS Data Exchange to retrieve data from various sources. In the AWS Data Exchange marketplace, they identify the data set, subscribe to the data, and subsequently consume it. Any changes in the source data invokes events, which updates the data object in the Amazon S3 bucket.

Amazon DataZone is used to manage and govern the datalake. Similar to the first use case, the data security officer creates a producer project. The data owner from LoB creates supply chain and inventory data assets in the producer project and publishes the same. From the consumer perspective, the data security officer also creates a consumer project, which allows the sales and marketing teams from different LoBs to search for the supply chain and inventory data published by the producer. Consumers request access to the published supply chain and inventory data, and the producer grants the necessary access. Amazon Athena is used to query, and explore the data. Amazon QuickSight is used to read from Amazon Athena and generate reports.

The following diagram illustrates this architecture.

Cross-sell and up-sell opportunities

The third requirement involves identifying cross-sell and up-sell opportunities. The key business consumers in this context are the sales representatives and resellers. AnyHealth operates globally, selling products in Europe, America, and Asia. Direct business transactions with consumers occur in America and Europe, and resellers facilitate sales in Asia, where AnyHealth lacks a direct relationship with the consumers.

The enterprise data platform is used to host and analyze the sales data and identify the customer demand. This data platform is managed by Amazon Data Zone. Cross-sell and up-sell opportunities, derived through ML models, are integrated into the customer relationship management (CRM) system, which in this case is Salesforce. Sales representatives access this data from Salesforce to engage with the market and collaborate with customers. AWS Glue is used for this integration.

Typically, resellers don’t provide their partners direct access to their customer data. Although AnyHealth doesn’t have direct access, understanding customer personas and profile information is essential to equip resellers with right offers to cross-sell and up-sell products. AWS Clean Rooms enables collaboration on collective datasets with stringent security controls, enabling insights without sharing the underlying data.

By addressing these requirements, AnyHealth can effectively identify and capitalize on cross-sell and up-sell opportunities, tailoring their approach based on the distinct dynamics of direct and reseller-based business models across various regions.

The initial step in the architecture involves a pipeline where SAP data is ingested into Amazon S3 and curated using AWS Glue job. The curated data is cataloged, governed and managed using Amazon DataZone.

In this scenario, where sales and customer information are acquired, data scientists build ML models to identify cross-sell and upsell opportunities. Using Amazon DataZone, these opportunities are shared with line of business users, providing transparency regarding the opportunities presented to sales reps and resellers. The cross-sell and upsell insights are pushed to Salesforce through AWS Glue, with an event-driven workflow for timely communication to sales reps. However, for resellers, a different pipeline is needed as AnyHealth doesn’t have direct access to the customer sales data. AnyHealth uses AWS Clean Rooms for this purpose.

With AWS Clean Rooms, the collaboration is started by AnyHealth (the collaboration initiator) who invites resellers to join. Resellers participate in the collaboration, and share the customer profile and segment information, while maintaining privacy by excluding customer names and contact details. AnyHealth uses the customer profile information and order trends to identify cross-sell and upsell opportunities. These opportunities are shared with the reseller to pursue further and position products in the market.

The following diagram illustrates this architecture.

Final architecture

Let’s now examine the complete architecture which covers all three use cases. In this architecture, purpose-built services like AWS Data Exchange, AWS Glue, AWS Clean Rooms and Amazon DataZone, have been used. The seamless integration of these services works cohesively to achieve end-to-end business objectives.

The following diagram illustrates this architecture.

To strengthen the security posture of your cloud infrastructure, we recommend using AWS Identity and Access Management (IAM), which allows you to manage access to AWS resources by creating users, groups, and roles with specific permissions. Additionally, you can use AWS Key Management Service (AWS KMS), which enables you to create, manage, and control encryption keys used to protect your data, so only authorized entities can access sensitive information. To provide an audit trail for compliance, you can use AWS CloudTrail, which records API calls made within your AWS account.

Conclusion

In this post, we discussed how to choose right tool for building an enterprise data platform and enabling data sharing, collaboration and access within your organization and with third-party providers. We addressed three business use cases using AWS Glue, AWS Data Exchange, AWS Clean Rooms, and Amazon DataZone through three different use cases.

To learn more about these services, check out the AWS Blogs for Amazon DataZone, AWS Glue, AWS Clean Rooms, and AWS Data Exchange.

About the authors

Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.

Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration and strategy.

How to build a Security Guardians program to distribute security ownership

2024-10-21 Mitch Beaumont

Post Syndicated from Mitch Beaumont original https://aws.amazon.com/blogs/security/how-to-build-your-own-security-guardians-program/

Welcome to the second post in our series on Security Guardians, a mechanism to distribute security ownership at Amazon Web Services (AWS) that trains, develops, and empowers builder teams to make security decisions about the software that they create. In the previous post, you learned the importance of building a culture of security ownership to scale security within your organization, and how AWS achieves this using the Security Guardians program. Since then, many customers have asked how they can build their own, similar program.

In this post, you will learn the steps to build your own Security Guardians program for your organization, including how to:

Set the vision, mission, and goals of your program
Identify developer teams that can pilot your new program
Define the expected behaviors for those teams
Develop training and create opportunities for career development to keep your teams engaged in the program

The guidance in this post is based on what we learned at AWS. Because every organization is different, the final version of the program you build is likely to look different from the one at AWS. Your program needs to reflect the current state of your organization’s culture of security and be designed to cultivate the security-related behaviors that are most important to your organization.

Security Guardians program mechanism

As discussed in the previous post, mechanisms form a key part of our business at AWS. Figure 1 demonstrates how a mechanism is a complete process, or virtuous cycle, that reinforces and improves itself as it operates. It takes controllable inputs and transforms them into ongoing outputs to address a recurring business challenge. In this case, the business challenge AWS faced was that security findings were being identified late in the development lifecycle, making it more expensive—in terms of time, money and effort—to remediate them. This led to bottlenecks in our security review process. The culture of security at AWS, specifically our culture of ownership, provides support to solve this challenge, but we needed the Security Guardians mechanism to actually do it.

Figure 1: AWS mechanism cycle

With most mechanisms, driving adoption is difficult, especially when the mechanism requires human participation to succeed. This is also true in the case of Security Guardians, and you can use our experience to help you avoid some of the challenges and growing pains of driving adoption.

Getting everyone aligned

“If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask, for once I know the proper question, I could solve the problem in less than five minutes.” – Albert Einstein

Getting alignment for the need to distribute security expertise starts with deeply understanding what problems need to be addressed. For example:

Is product delivery velocity being negatively impacted by delays in the security review process?
What business goal or metric are these delays negatively impacting?
Where in the security review process are those delays occurring?
What factors are contributing to those delays?
Is it a lack of time, people, or skills?

Thoroughly understanding the specific problems and their root causes, as identified by answering those questions, allows you to evaluate whether distributing security ownership is the appropriate solution. This in turn makes it easier to gain alignment and buy-in across the organization for the chosen approach.

A component of a culture of security

Building a strong culture of security requires support from executive leadership to set the direction for the rest of the organization. Executive support makes it easier for product leaders to secure the resources and finances needed for a Security Guardians program to be successful. To align with your organization’s leaders, you can reflect on the goals of your leaders and how the Security Guardians program can be built to meet those goals.

For example, if your business goal is to ship products 25 percent faster, understand how a particular resourcing effort from Security Guardians is going to help your organization meet that goal. AWS benefited from the program with a 26.9 percent reduction in the time to review a new service or feature when a Security Guardian was involved.

Our experience is that it’s challenging to establish a Security Guardians program without executive support. If you’re struggling to identify a business leader to sponsor the program and provide insight on the business problem, your AWS account team—including your account manager or solutions architect—can help. If you’re a business leader or executive reading this post, consider becoming that sponsor yourself.

One step at a time

A step-by-step approach to implementing the Security Guardians program helps overcome organizational challenges and avoid common pitfalls that could lead to failure. These steps, shown in Figure 2, are:

Set the vision
Choose innovators
Define behaviors
Maintain interest
Measure success

These steps support the activities that make a mechanism successful: adoption, inspection, and tools.

Figure 2: Steps for implementing a Security Guardians program

Set the vision

Now that you’ve identified the business problem or business goal, set the vision for the Security Guardians program by working backwards from this problem or goal to define the purpose of your program. For example, the vision of the AWS Security Guardians is “To nourish security ownership that consistently delights our customers with security-by-design throughout the development lifecycle.”

Craft an ambitious vision for your Security Guardians program. Think beyond easy wins and focus on bold, forward-thinking security outcomes for your organization. Make sure that each element of your vision aligns with a business problem or goal. The following table is an example of how the vision of the program is aligned with business goals:

Business goals	Security outcome	Long-term goals
Develop products faster and more efficiently.	To improve developer agility while reducing security risk.	Increase the number of threat models performed by Security Guardians (instead of by application security engineers). Over time, this goal could change to “increase the quality of threat models.”. Decrease the average monthly security issue rate. Train three new Security Guardians each quarter.
Reduce long-term security spend.	To identify and mitigate security risk as early as possible.
Increase customer trust.	To exceed customer security expectations by raising the security bar.

The next step is to define a clear mission that is supported with measurable goals. The mission and goals must be achievable and help to move the needle towards the long-term vision.

The final part is to name your program. We chose Security Guardians, like Marvel’s Guardians of the Galaxy. We’ve also heard customers using Security Champions, Security Advocates, Security Innovators, and Security Drivers. Have fun with it and make sure the name resonates with as many participants as possible.

After you’ve defined the vision, future state, mission, measurable goals, and name of the program, review them with your security and business leaders. It’s beneficial to include your innovators or Security Guardians who will be early adopters of the program in this review. In the next section, you’ll learn how to identify these innovators.

Choosing innovators

Just as you develop for and iterate with early adopters of the products you’re building, you should identify individuals and teams who will pilot the program with you. Before the AWS Security Guardians program, our application security engineering teams built relationships with product teams through security reviews.

This meant that they already knew which individuals within those product teams had an interest in security. This is where AWS started, but the success of your program isn’t dependent on whether you already know who these individuals are. Development teams will self-identify and nominate Security Guardians from their own teams. Figure 4 shows examples to help you get started understanding which development teams will be good early adopters for your program.

Figure 3: Example product teams for early program adopters

The examples in Figure 3 include:

Candidate A: Quick wins team

Early adopters typically share key traits, including existing security measures and a designated security role or team members with security expertise. Essentially, they already prioritize security at the team level.

Candidate B: High impact team

This is the team most impacted by the disparity between product development teams and security teams; the agility and time-related benefits of the Security Guardians program will be the highest for this team. For example, this team might be facing long delays in launching products because of the current security review process at your organization.

Candidate C: High risk team

This team owns a product that has a high security risk because of the nature of the product. This team will benefit the most from additional security scrutiny and from raising the security bar at your organization. For example, this team might be building a product that’s considered a critical asset, hosts sensitive data, or performs critical processes.

After you’ve identified one or more teams that could be good early adopters of the program, you need to identify at least one individual from each team to serve as the Security Guardian. Keep the vision and goals of your program in mind when selecting your Security Guardian. Your early Security Guardians should have at least the following characteristics:

Ability to exercise well-informed and decisive judgement
Maintain and showcase their knowledge
Not afraid to have their work be independently validated
Advocate for their security needs in internal discussions
Hold a high security bar
Thoughtful and assertive to make customer security a top priority on their team

In terms of time commitment, our experience is that each Security Guardian spends an average of 3.5 hours each month on activities such as answering general security questions, identifying security stories needed for sprints, diving deep into security related tasks and supporting security related tasks. Each application security review takes approximately 4 hours of effort.

The first post of this series contains even more details on the characteristics that make a good Security Guardian.

Defining behaviors

It’s important to set expectations on what behaviors you want Security Guardians, developers, and security teams to exhibit within the context of the program. These behaviors typically relate directly to the goals of the program. For example, if one of the goals is to increase the number of threat models created, then create threat modeling will be one of the defined behaviors. The behaviors need to be measurable with some flexibility for change as you improve the program.

At AWS, our Security Guardians have access to a runbook that lists the activities each Guardian should take when engaged as part of a review. With each of these activities understood, the program team will then make sure appropriate training is provided so that the Security Guardians are able to complete each of the activities. For example, AWS Security Guardians are asked to help develop threat models. To support this, the program team has developed and released training material to teach Security Guardians how to create a threat model.

With the defined behaviors, understand how the Security Guardian and product development team will engage with the security team. Although we’re clearly defining behaviors, the behaviors aren’t typically done in a silo for the successful launch of a secure product. At AWS, the Security Guardians and product developers engage with the security teams in key partnership areas. If you’re unsure of where to start in defining the behaviors of your program, Figure 4 shows an example of how teams interact at AWS, beginning with the creation of an initial threat model and going through review, remediation, and testing. Consider creating your own version of the model to help define the behaviors and key partnership areas at your organization.

Figure 4: Example behaviors and partnership areas at AWS

In the example of a threat model review, the Guardian and the central security team will jointly create and review the threat model. Specific activity examples include reviewing threats that have no documented mitigations and discussing additional threats that haven’t yet been considered.

As part of encouraging a culture of ownership, AWS recommends allowing Security Guardians to influence the role within a set of boundaries. An example of this is allowing the Security Guardians to be a part of recurring reviews of the program growth metrics, actively collecting their feedback, and encouraging them to host their own training sessions. Active Security Guardians are key to the success of the program and allowing them to influence the program will give them a sense of ownership and inclusion.

Maintaining interest

It’s important to not lose sight that a program like the AWS Security Guardians program is supported by volunteers. Most of your Security Guardians will be product developers who already have a full-time job developing products for your organization. The time and effort to find and onboard new Security Guardians will have a low return on investment if they stop engaging because the program owners didn’t keep them engaged. Keeping Security Guardians is just as important as finding them.

At AWS, we invest time to understand how to build trust with Security Guardians and provide value by working backwards from their wants and needs. Some Security Guardians joined the program to learn new skills and for career growth opportunities. AWS built training programs that were designed for Security Guardians and provide metrics that are used to document their impact to their managers and leaders.

AWS Security Guardians constantly tell us that they value recognition of their contributions by leadership. We work to build mechanisms to continuously surface the great work of our Security Guardians. We also recognize the contributions Security Guardians make through awards, gifts, and other incentives. For example, each quarter, the AWS Security Guardians team sends out a newsletter to senior leaders of the organization. This communication identifies the Guardians within their organization and highlights their contributions, including the number and impact of reviews they’ve completed.

Another way that AWS recognizes the contributions of our Security Guardians is through the Guardians Belt Program. The Guardians Belt Program is designed to recognize Security Guardians for their contributions and support them as they work to advance their security skills and expand their scope of impact. Security Guardians earn Black, Green, Yellow, and White belts with each belt corresponding to significant accomplishments that require consistent commitment to raising the security bar.

To make sure that Security Guardians value the program, your organization should provide and actively facilitate benefits. The benefits must be accessible without requiring additional time or effort from the Security Guardians, promoting immediate and direct gains. Consider the following examples of benefits to maintain Security Guardian interest and support:

Specialized training: Workshops, game days, challenges and contests.
Impact opportunities: Ability to impact multiple products by working with other teams in the organization, ability to help define patterns, best practices, and automation for the program.
Community: Collaborate, connect, share and learn from experts and individuals with similar interests.
Ownership opportunities: Ability to accelerate certain steps in the process.
Leadership opportunities: Active involvement in recurring program or business reviews.

The best ways to maintain interest are determined by the culture of your organization. What does your organization value the most, and how will the program provide that to your Security Guardians? Sometimes, the best way to answer these questions is to ask your early or potential Security Guardians.

Measuring success

The final step of building a successful Security Guardians program is to measure program success. Measuring success is equivalent to the inspection step from Figure 1. This verifies that your desired outcomes are being achieved and provides a jumping off point for iteration. Measuring success also gives you the opportunity to audit the output or results of the Security Guardians program and perform corrections and improvements.

Earlier in this post, we covered identifying the business problem and creating the vision and measurable goals for your Security Guardians program. Example metrics include:

Average time to release features
Average number of security issues per team
Average time spent by Security Guardians and builders doing security work
Percentage of Security Guardians who have taken required and non-required training

Measuring success includes steps to collect feedback and tune the program over time, shown in Figure 5.

Figure 5: Feedback and tuning steps for Security Guardians program.

The cycle to gather feedback and tune the program includes:

Report on metrics
Communicate wins
Measure outcome and cycle time
Identify trends
Review goals

Gathering feedback from Security Guardians is as important as providing feedback to them. One of the ways AWS collects feedback from Security Guardians is through an annual survey that collects feedback on their experiences of program and tooling. To help both builders and Security Guardians improve over time, our security review tooling captures feedback from security engineers on the inputs from Security Guardians. Combined, the data gathered through these surveys helps our security ownership mechanism reinforce and improve itself over time.

Figure 6 summarizes the steps that you can take to develop your program.

Figure 6: Security Guardians program steps

The broad steps to develop a program include:

Set the vision: Set your vision for the program and metrics for success. Get sponsorship from leadership. Choose a name for your program.
Choose innovators: Identify innovators who have a passion for security and foster a community with continuous knowledge sharing.
Define behaviors: Redefine your RACI (responsible, accountable, consulted, informed) and be clear on expectations from your security advocates.
Maintain interest: Provide clear training and learning paths and opportunities for career advancement.
Measure success: Gather feedback and measure the program’s effectiveness.

Conclusion

This post and the previous post covered numerous concepts, considerations, and ideas, including:

The initial intention of the Security Guardians program is to focus on training developers in product teams. This improves early security-focused design thinking.
An alternative approach is to embed or align security engineers directly with product development teams. This can be more effective in organizations where reporting structures and accountability are key considerations.
Some organizations draw Security Guardians from all job types. The program can also be used to focus on uplifting developers and broad security culture.
You must regularly inspect the outcomes delivered by the Security Guardians program and use the information to make incremental improvements as the program matures.

For additional support building a Security Guardians program, contact your AWS account representative and they will get you in touch with a specialist who can help you develop your program.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Single sign-on SSO for Amazon OpenSearch Service using SAML and Keycloak

2024-10-18 Sajeev Attiyil Bhaskaran

Post Syndicated from Sajeev Attiyil Bhaskaran original https://aws.amazon.com/blogs/big-data/single-sign-on-sso-for-amazon-opensearch-service-using-saml-and-keycloak/

A standard use case for customers is to integrate existing identity providers (IdPs) with Amazon OpenSearch Service. OpenSearch Service offers built-in support for single sign-on (SSO) authentication for OpenSearch Dashboards, and uses SAML protocol. The SAML authentication for OpenSearch Service lets you integrate your existing third-party IdPs, such as Okta, Ping Identity, OneLogin, Auth0, ADFS, Azure Active Directory, and Keycloak, with OpenSearch Service dashboards.

In this post, we walk you through how to configure service provider-initiated authentication for OpenSearch Dashboards by using OpenSearch Service and Keycloak. We also discuss how to set up users, groups, and roles in Keycloak and configure their access to OpenSearch Dashboards.

Solution overview

The following diagram illustrates the SAML authentication flow for this solution.

The sign-in flow consists of the following steps.

The user opens a browser to navigate to the OpenSearch Dashboards endpoint of OpenSearch Service in a virtual private cloud (VPC), for example https://vpc-abc123.us-east-1.es.amazonaws.com/_dashboards.
The service provider (OpenSearch Service) uses the information about the IdP (Keycloak) to generate a SAML authentication request. The service provider redirects SAML authentication requests back to the browser.
The browser relays the SAML authentication request to Keycloak. Keycloak parses the SAML authentication request and asks for the user to insert their login and password to authenticate.
After a successful authentication, Keycloak generates a SAML authentication response that includes authenticated user details from Keycloak and sends the encoded SAML response to the browser.
The browser relays the SAML response to OpenSearch Service Assertion Consumer Service (ACS) URL.
OpenSearch Service validates the SAML response. If the validation checks are passed, the user is redirected to the front page of OpenSearch Dashboards. The authorization is performed according to the roles mapped to the user.

Prerequisites

To complete this walkthrough, you should have the following set up:

An OpenSearch Service domain running OpenSearch or Elasticsearch version 6.7 or later with fine-grained access control enabled within a VPC.
Keycloak installed and configured. In this post, we created the IdP in the same VPC of the OpenSearch domain. There is no need for a direct connection between the IdP and the service provider, so you can have the IdP in a different network as well.
A properly configured security group for OpenSearch Service and Keycloak IdP server to receive inbound traffic from users.
A browser with network connectivity to both Keycloak and OpenSearch Dashboards.

Enable SAML authentication for OpenSearch Service

The first step is to enable SAML authentication for OpenSearch Service. Complete the following steps:

On the OpenSearch Service console, open the details page for your OpenSearch Service domain.
On the Security configuration tab, choose Edit.
Select Enable SAML authentication.

Enabling this option automatically populates different IdP URLs, which is required to configure SAML support in the Keycloak IdP. Note down the values under Service provider entity ID and SP-initiated SSO URL. The OpenSearch Dashboards login flow can be configured either as service provider-initiated or IdP-initiated. The service provider-initiated login flow is initiated by OpenSearch Service, and the IdP-initiated login flow is initiated by the IdP (for example, Keycloak). In this post, we use a service provider-initiated login flow.

Configure Keycloak as IdP

During the SAML authentication process, when the user is authenticated, the browser receives a SAML assertion token from Keycloak and forwards it to OpenSearch Service. The OpenSearch Service domain authorizes the user with backend roles according to the attributes presented in the token.

To configure Keycloak as IdP, complete the following steps:

Log in to the Keycloak IdP admin console with admin user privileges (for example, https://<Keycloak server>:8081/admin/).
Choose Create Realm.
For Realm name, enter a name (for example, Amazon_OpenSearch) and choose Create.

For managing OpenSearch Service specific roles, users, and groups, you first create a separate client realm that provides a logical space to manage objects.

In the navigation pane, choose your realm, then choose Clients.
Choose Create client.
In the General Settings window, for Client type, choose SAML
For Client ID, use the service provider entity ID you copied earlier, then choose Next
Under Login settings, enter the service provider-initiated SSO URL copied from earlier (for example, https://vpc-abc123.us-east-1.es.amazonaws.com/_dashboards/_opendistro/_security/saml/acs) and choose Save.
On the client settings tab, under Signature and Encryption, turn on Sign Assertions and keep all other options as default, then choose Save.
On the Keys tab, under Signing keys config, turn Client signature required off.

Configure Keycloak users, roles, and groups

After you have configured the Keycloak IdP client for OpenSearch Service, you can create roles, groups, and users on the IdP side. For this post, we create two roles, two groups, and two users, as listed in the following table.

Users	Groups	Roles
`super_user_1`	`super_user_group`	`super_user_role`
`readonly_user_1`	`readonly_user_group`	`readonly_user_role`

Complete the following steps:

In the navigation pane for your realm, choose Realm roles.
Choose Create role.
For Role name, enter a name (for this post, super_user_role) and choose Save.
Repeat these steps to create a second role, readonly_user_role.

Now let’s create groups, assign the roles to the groups, and map the users to the groups.

Under your realm, choose Groups in the navigation pane.
Choose Create group.
For Name, enter a group name (for example, super_user_group) and choose Save.
Repeat these steps to create a second group, readonly_user_group.

When the new groups are created, they will be listed on the Groups page.

On the details page for each group, on the Role mapping tab, choose Assign role.
For the group super_user_group, select the role super_user_role and choose Assign.

Repeat these steps to assign the role readonly_user_role to the group readonly_user_group.

The last step is to create users and assign them to groups so they automatically inherit group privileges. For this post, we create two users, super_user_1 and readonly_user_1, with dashboard admin and dashboard read-only privileges, respectively.

Under your realm, choose Users in the navigation pane.
Choose Create new user.
Under General, configure the user details, including user name, first name, last name, and email, then choose Create.

Set a temporary password on the Credentials tab after you create the user.
Choose Add user and repeat these steps to add your second user, readonly_user_1.
To join a user to a specific group, choose Join Group on the Groups tab of the respective user.

Select the group the user is joining and choose Join. For example, the user super_user_1 is joining the group super_user_group.

Repeat these steps for the user readonly_user_1 to join the group readonly_user_group.

Next, you can remove the default role mapping for the users because you already assigned the roles to their respective groups.

On the Role Mapping tab, select the default role.
Unassign the default role for the user by choosing Unassign and then Remove.
Repeat these steps for the other user.

Choose Client scopes in the navigation pane.
In the Name column, choose role_list.

On the Mappers tab, choose role list.

Turn on Single Role Attribute and choose Save.

Download SAML metadata from Keycloak

The configuration of Keycloak is now complete, so you can download the SAML metadata file from Keycloak. The SAML metadata is in XML format and is needed to configure SAML in the OpenSearch Service domain.

Under your realm, choose Realm settings in the navigation pane.
On the General tab, choose SAML 2.0 Identify Provider Metadata under Endpoints.

This will generate an IdP metadata file in another window. This XML file contains information on the provider, such as a TLS certificate, SSO endpoints, and the IdP entity ID.

Download this XML file locally so you can upload this file on the OpenSearch Service console in later steps.

Integrate OpenSearch Service SAML with Keycloak

To integrate OpenSearch Service with the Keycloak IDP, you need to upload the IdP metadata XML file on the OpenSearch Service console.

On the OpenSearch Service console, navigate to your domain.
Choose Security configuration, then choose Edit.
Under Metadata from IdP, choose Import from XML file to import the file and auto-populate the IdP entity ID.

Alternatively, you can copy and paste the contents of the entity ID property from the metadata file.

For SAML master backend role, enter super_user_role.

This means that a user with this role is provided manager user privileges to the cluster, but can only use permissions within OpenSearch Dashboards.

Expand the Additional settings section
For Roles key, enter an attribute from the assertion (in our case, Role) and choose Save Changes.

Test the OpenSearch Dashboards SAML authentication with Keycloak

You’re now ready to test the SAML integration with Keycloak as an IdP.

Choose the OpenSearch Dashboards URL provided on OpenSearch Service console.

It will automatically redirect you to the Keycloak sign-in page for authentication.

Enter the admin user name (super_user_1) and password and choose Sign In.

Upon successful authentication, it will redirect you to the home page of OpenSearch Dashboards. If you encounter issues at this step, refer to SAML troubleshooting for common issues.

Internally, the security plugin maps the backend role super_user_role to the reserved security roles all_access and security_manager. Therefore, Keycloak users with the backend role super_user_role are authorized with the privileges of the manager user in the domain. To grant read-only dashboard access to user readonly_user_1, log in to OpenSearch Dashboards as the user super_user_1. Then map the role readonly_user_role as a backend role for the reserved security role opensearch_dashboards_read_only.

When establishing access control for the cluster, it’s crucial to carefully manage the permissions granted to users, adhering to the principle of least privilege. By having both super_user_role with administrative capabilities and read-only readonly_user_role, you can strike a balance. This approach allows a small number of trusted users to have full administrative access within OpenSearch Dashboards, while also enabling read-only access for other stakeholders who require visibility but don’t need more access.

At the time of writing, if you specify the <SingleLogoutService /> details in the Keycloak metadata XML, when you sign out from OpenSearch Dashboards, it will call Keycloak directly and try to sign the user out. This doesn’t work currently with some of the versions of OpenSearch Service, because Keycloak expects the sign-out request to be signed with a certificate that OpenSearch Service doesn’t currently support. If you remove <SingleLogoutService /> from the metadata XML file, OpenSearch Service will use its own internal sign-out mechanism and sign the user out on the OpenSearch Service side. No calls will be made to Keycloak for signing out.

Clean up

If you don’t want to continue using the solution, delete the resources you created:

OpenSearch Service domain
VPN and Keycloak instance

Conclusion

In this post, you learned how to configure Keycloak as an IdP to access OpenSearch Dashboards using SAML. To learn more about OpenSearch Service and SAML integration, refer to SAML authentication for OpenSearch Dashboards. Stay tuned for a series of posts focusing on SAML integrations with OpenSearch Service and Amazon OpenSearch Serverless.

About the Author

Sajeev is a Senior Cloud Engineer (Big Data & Analytics) and a Subject Matter Expert for Amazon OpenSearch Service. He works closely with AWS customers to provide them architectural and engineering assistance and guidance. He dives deep into big data technologies and streaming solutions and leads onsite and online sessions for customers to design the best solutions for their use cases.

Five ways to optimize code with Amazon Q Developer

2024-10-16 Karthik Chemudupati

Post Syndicated from Karthik Chemudupati original https://aws.amazon.com/blogs/devops/five-ways-to-optimize-code-with-amazon-q-developer/

Practical improvement and optimization of software quality requires expert-level knowledge across various subjects. As such, in this blog we shall look at how Amazon Q Developer can help improve your development team productivity and application stability by enabling automation around code optimization by improving your code’s quality, performance, application infrastructure specifications.

The blog will also look at sample prompts that can be used to discover optimization options, control the scope of modifications, choose improvements and iterate through code changes. Being a generative AI–powered software development assistant that integrates with your integrated development environment (IDE), Amazon Q Developer supports in code explanation, code generation, and code improvements such as debugging and optimization. Amazon Q Developer can be configured for IDEs such as Visual Studio Code or Jet Brains IDEs, using AWS Identity and Access Management (IAM) Identity Center or AWS Builder ID.

To illustrate the optimization techniques, we will use the quant-trading sample application from the github aws-samples repo, to look at optimizations across the following domains – 1) Portability 2) Complexity 3) Code Performance 4) Infrastructure 5) Architecture and non-functionals 6) Running on AWS

Please note that as Amazon Q Developer continues to evolve, and due to the non-deterministic nature of Generative AI, the outputs you see when trying this yourself may differ from the examples shown in this blog post.

Amazon Q Developer can assess your code, provide recommendations, and generate an optimized version based on your prompts. A prompt is a natural language text that requests the generative AI to perform a specific task. Among areas you can optimize are portability and complexity.

Portability optimization

To assess portability of your code base, Let us use Portfolio Generator python code from quant-trading sample.

In the Integrated development environment (IDE), select the entire code in the file, open Amazon Q Chat and type your prompt: “Is the selected code portable?”

Amazon Q Developer will generate an assessment of portability of your code, as shown in Figure 1. Any specific improvements possible will also be specified.

This image shows two side-by-side screenshots of an Amazon Q chat interface discussing code portability. The left panel displays a question "Is the selected code portable?" followed by a detailed response outlining factors affecting code portability, including use of relative imports, hard-coded paths, external libraries, and AWS SDK integration. The right panel continues the discussion with suggestions on how to make the code more portable, including using absolute imports, avoiding hard-coded paths, isolating dependencies, and separating AWS-specific functionality. The interface has a dark theme with white text on a black background. At the bottom, there are suggested follow-up questions and a note about the AWS Responsible AI Policy.

Figure 1: Optimize code quality. Assessment and recommendations

Add code snippets directly to the prompt as context, for further response improvements by:
1. Right click on the IDE
2. choose “Send to Amazon Q”
3. Select “Send to Prompt”.

Now, the context includes the code, its portability assessment and recommendations for further improvements.

Ask – “Rewrite code for maximum portability”

However, such a generic prompt would likely result in numerous code modifications chosen by Amazon Q Developer, as shown in Figure 2. To achieve a more specific and higher quality output, in addition to enriched context, the prompt must be more precise and targeted.

This image shows three side-by-side panels of code and explanations in an Amazon Q chat interface. The left panel displays Python code with various import statements and function definitions. The middle panel contains a summary of key changes made to improve code portability, including the use of environment variables, absolute imports, argument parsing, decimal usage, error handling, and formatting. The right panel shows more detailed Python code with import statements, function definitions, and file path configurations. All panels have a dark theme with light-colored text on a dark background. The interface includes options for asking questions or accessing quick actions at the bottom of each panel.

Figure 2: Specific optimization – externalizing config.

Ask Amazon Q Developer to perform optimization addressing only hardcoded path values in a specific way.
- “Rewrite this code to be more portable. Move hardcoded file paths into a separate JSON configuration file under the node “file-paths”. Leave the rest of the file unchanged.”

Amazon Q Developer will now rewrite a few lines of the code and externalized configuration into a JSON file, as shown in Figure 3.

Figure 3: Specific optimization – externalizing config.

Note: Dialogue with Amazon Q Developer can span several iterations, allowing you to analyze and narrow down to a very specific aspect of your code. This approach will appear in line with pair programming, iteratively collaborating on a better solution.

Continue iterating for optimizations per your code. Examples are – ask “Use YAML format for config.” or “Use path names in config similar to their original values.” or “Add error handling when working with files.”

Such an iterative approach will allow you to gradually apply modifications while preserving control over the scope of changes.

Complexity Optimization

Now let’s analyze and reduce the complexity of the write_portfolio method:

Ask either:
- “Can the selected code be simplified?”
- “How can I reduce complexity of the selected code?”
Drill down into a specific, scoped optimization.
- “Simplify loops, conditions and variables of the selected code.”

Be specific about the kind of optimizations you want Amazon Q Developer to apply (see Figure 4). Example, ask direct prompts such as – “Replace portfolio dictionary with JSON.”

This image shows two panels of an Amazon Q chat interface discussing code simplification. The left panel displays a request to "Simplify loops, conditions and variables of the selected code" followed by a simplified Python function called write_portfolio. The function creates a portfolio dictionary with various keys and values, and includes simplified logic for selecting tickers and creating a positions list using list comprehension. The right panel shows the original Python code that is being simplified. This code includes the write_portfolio function definition with similar structure but more verbose implementation. The file path at the top indicates this is from a file named portfolio_generator.py. Both panels use a dark theme with syntax highlighting in various colors for better code readability. The interface includes an option to ask questions or enter commands at the bottom of the left panel.

Figure 4: Simplify code example

Code Performance optimization

To improve code performance, we shall leverage Amazon Q Developer’s “Optimize” feature. It initiates a dialogue for code performance optimization via the right-click menu or key shortcut (see Figure 5).

This image shows two main panels of an Amazon Q chat interface discussing code optimization. The left panel displays a request to optimize a specific part of the code, followed by suggestions for improvement. These suggestions include using generator expressions instead of list comprehensions, avoiding unnecessary conversions, using conditional assignment, and considering NumPy or Pandas for large numerical datasets. Each suggestion is accompanied by a code snippet demonstrating the optimization. The right panel shows the original Python code in a file editor, with the function calculate_weights highlighted. This appears to be the function targeted for optimization. The editor interface includes various options like "Go to Definition", "Find All References", and "Optimize" visible in a dropdown menu. Both panels use a dark theme with syntax highlighting in various colors for better code readability. The interface includes tabs at the top for different files or chat sessions, and an option to ask questions or enter commands at the bottom of the left panel.

Figure 5: IDE “built-in” feature for code improvement. Amazon Q -> Optimize

The selected code is sent to Amazon Q Developer, which then provides recommendations and generates optimized code.

Let’s now look at how we can use Amazon Q Developer to improve the calculate_weights method.

As shown in Figure 5, Amazon Q Developer explains step-by-step every optimization it suggests. You can further follow-up with a more precise prompt, targeting a specific optimization for a specific code block. For instance, “Optimize only selected method and only avoid unnecessary type conversions. Leave the rest of code unchanged.”

A screenshot of a code editor displaying Python code with a dark background theme. The image shows multiple functions and methods, including 'calculate_weights', 'get_final_payload', and 'add_parameter'. On the left side, there's a blue banner with instructions to optimize a selected method and avoid unnecessary type conversions. Below this, an explanation of the optimized 'calculate_weights' method is provided, highlighting changes made to improve performance. The code is syntax-highlighted, making different elements like functions, variables, and comments easily distinguishable.

Figure 6: Follow-up with a more specific prompt for performance optimization

You can copy-paste newly generated code or insert it directly at the cursor by choosing “Insert code”.

To achieve even higher precision, include in your prompt what not to do or to avoid.

Infrastructure optimization

Amazon Q Developer also supports Infrastructure as Code (IaC) out of the box, providing expert advice and code generation for CloudFormation, CDK, and Terraform. This allows you to leverage code optimization techniques and patterns for your infrastructure.

As a demonstration, let’s improve portability of the CDK code in lambda.ts by introducing environment variables to inject configurations into the runtime.

To begin,

Start a new chat with a broad question – “Could you recommend techniques to inject system variables into a Lambda container function?” Amazon Q Developer will generally provide options to inject environment variables into an AWS Lambda function.
Send function code to the prompt and ask Amazon Q Developer. This generates the code for injecting environment variables through Lambda runtime by using prompt – “Could you add some deployment variables into the tradingStartStopFunction function?”

This image shows three side-by-side screenshots of an Amazon Q chat interface and code editor. The left panel displays a conversation about injecting system variables into a Lambda container function, listing five techniques. The middle panel shows a code snippet for a 'tradingStartStopFunction' with a question about adding deployment variables. The right panel displays more detailed code for Lambda functions related to trading operations. All three panels have a dark theme with syntax-highlighted code in various colors.

Figure 7: Optimizing infrastructure code by introducing environment variables in a Lambda function

Architecture and non-functional optimization

With Amazon Q Developer, you can go beyond code and enhance your system architecture. Let’s consider lambda_function.py which interacts with Amazon DynamoDB and AWS Systems Manager Parameter Store.

Send the entire function to the prompt and ask the following in sequence.
- “What are the architecture implications if I call this lambda function daily?”
- “How do I optimize this function to be called daily.”
- Then, follow up with –“How do I optimize this function to be called every 1 second.”

A split-screen image showing two chat conversations and a code editor. The left panel discusses architectural implications of calling a Lambda function daily, covering topics like concurrency, idempotency, error handling, separation of concerns, monitoring, and security. The middle panel offers optimization strategies for calling a Lambda function every 1 second, including separating concerns, caching, batching, and scaling. The right panel shows Python code for a Lambda function, including imports and a function definition dealing with DynamoDB operations.A split-screen image showing two chat conversations and a code editor. The left panel discusses architectural implications of calling a Lambda function daily, covering topics like concurrency, idempotency, error handling, separation of concerns, monitoring, and security. The middle panel offers optimization strategies for calling a Lambda function every 1 second, including separating concerns, caching, batching, and scaling. The right panel shows Python code for a Lambda function, including imports and a function definition dealing with DynamoDB operations.

Figure 8: NFRs and business rules impact architecture enhancements

Compare Amazon Q’s outputs to see how each use case impacts the architectural recommendations, such as introducing caching, batch processing, queues, or concurrency mechanisms.

Following the techniques discussed earlier, you can dive in more specific implementations of suggested architecture enhancements. For example, ask “Implement a mechanism to execute only one instance of lambda function at any given moment of time. Implement cache for SSM Parameter store value, but not for Portfolio table.”

Optimize code to run on AWS

As a versatile developer assistant, Amazon Q Developer excels at helping you adhere to AWS best practices and recommendations.

Let’s examine if our sample – IntradayMomentum Lambda function handler can be improved.

Send the code to the Amazon Q Developer prompt and ask – “Is this lambda handler following AWS recommended best practices?”

This image shows a split-screen view of an Amazon Q chat interface on the left and a code editor on the right. The left side displays a conversation about AWS Lambda function best practices, listing 9 points of improvement for the provided code, including separation of concerns, environment variables usage, logging, error handling, dependency management, performance optimization, security, idempotency, and testing. The right side shows Python code for a Lambda function. The code includes a lambda_handler function with various operations like getting symbols, calculating updates and weights, and interacting with a DynamoDB table. The code is syntax-highlighted, indicating it's being viewed in a code editor. At the top of the code editor, there are tab names suggesting multiple files are open, including "lambda_function.py" and "portfolio_generator.py". The overall theme of the interface is dark, suggesting a dark mode IDE or development environment.

Figure 9: Optimize code to run on AWS. AWS-recommended best practices for the Lambda handler

The analysis generated by Amazon Q Developer is based on AWS code, best practices and documentation. Not only does it suggest improvements, but also highlights what’s been done correctly, reinforcing best practices.

Following an iterative technique described earlier, continue asking Amazon Q developer for further recommendations with more specific prompts. For example – “Add exception handling to the code.”

This image shows a split-screen interface with an Amazon Q chat on the left and a code editor on the right, both using a dark theme. The left side displays a chat conversation about adding exception handling to the code. It shows Python code for a Lambda function with newly added exception handling, including imports for logging and a try-except block. The right side shows the original Python code for the Lambda function in a code editor. The code includes functions for handling portfolio updates, interacting with DynamoDB, and processing various data elements. At the top of the screen, there are multiple tabs open in the code editor, including "lambda_function.py", "portfolio_generator.py", and "deploy_portfolio.py". The image demonstrates the process of improving the Lambda function code by adding error handling based on the chat conversation's recommendations.

Figure 10: Rewrite code with Best Practices in place. Adding Exception Handling.

Conclusion

In this blog post, we discussed approaches for code optimization with the help of Amazon Q Developer. We explored code optimization from various perspectives, such as code quality, performance, application infrastructure, following best practices, and enhancing architecture. We saw the importance of prompt engineering and context when optimizing code with Amazon Q Developer – a generative AI coding assistant. Starting with open, generic prompts helps build the necessary context and discover optimization options. In contrast, precise and specific follow-up prompts help define the scope of changes and incrementally generate optimized code.

It has never been easier for developers to have a development assistant and start improving code with the help of natural language dialogue, provided by Amazon Q.

About the authors

Roman Martynenko is a Senior Solutions Architect at Amazon Web Services with over 20 years of experience in Software Engineering, Architecture and Cloud technologies. Roman is helping Canadian public sector customers with their cloud journey. He focuses on next-generation developer experience, helping organizations re-imagine the entire Software Development Lifecycle. Outside of work, he enjoys hiking, home automation, and DIY projects.

Karthik Chemudupati is a Principal Technical Account Manager (TAM) with AWS, focused on helping customers achieve cost optimization and operational excellence. He has more than 20 years of IT experience in software engineering, cloud operations and automations. Karthik joined AWS in 2016 as a TAM and worked with more than dozen Enterprise Customers across US-West. Outside of work, he enjoys spending time with his family.

Shardul Vaidya is a Worldwide Partner Solutions Architect with AWS, focused on helping partners and customers build and effectively use Generative AI powered developer experiences. Shardul joined AWS in 2020 as part of their early career talent Solutions Architect team and worked with over a hundred modernization and DevOps partners across the world. Outside of work, he’s a music lover and collects records.

Code security scanning with Amazon Q Developer

2024-10-16 Surabhi Tandon

Post Syndicated from Surabhi Tandon original https://aws.amazon.com/blogs/devops/code-security-scanning-with-amazon-q-developer/

Enriching metadata for accurate text-to-SQL generation for Amazon Athena

2024-10-14 Naidu Rongali

Post Syndicated from Naidu Rongali original https://aws.amazon.com/blogs/big-data/enriching-metadata-for-accurate-text-to-sql-generation-for-amazon-athena/

Extracting valuable insights from massive datasets is essential for businesses striving to gain a competitive edge. Enterprise data is brought into data lakes and data warehouses to carry out analytical, reporting, and data science use cases using AWS analytical services like Amazon Athena, Amazon Redshift, Amazon EMR, and so on. Amazon Athena provides interactive analytics service for analyzing the data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift is used to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Amazon EMR provides a big data environment for data processing, interactive analysis, and machine learning using open source frameworks such as Apache Spark, Apache Hive, and Presto. These data processing and analytical services support Structured Query Language (SQL) to interact with the data.

Writing SQL queries requires not just remembering the SQL syntax rules, but also knowledge of the tables metadata, which is data about table schemas, relationships among the tables, and possible column values. Large language model (LLM)-based generative AI is a new technology trend for comprehending a large corpora of information and assisting with complex tasks. Can it also help write SQL queries? The answer is yes.

Generative AI models can translate natural language questions into valid SQL queries, a capability known as text-to-SQL generation. Although LLMs can generate syntactically correct SQL queries, they still need the table metadata for writing accurate SQL query. In this post, we demonstrate the critical role of metadata in text-to-SQL generation through an example implemented for Amazon Athena using Amazon Bedrock. We discuss the challenges in maintaining the metadata as well as ways to overcome those challenges and enrich the metadata.

Solution overview

This post demonstrates text-to-SQL generation for Athena using an example implemented using Amazon Bedrock. We use Anthropic’s Claude 2.1 foundation model (FM) in Amazon Bedrock as the LLM. Amazon Bedrock models are invoked using Amazon SageMaker. Working examples are designed to demonstrate how various details included in the metadata influences the SQL generated by the model. These examples use synthetic datasets created in AWS Glue and Amazon S3. After we review the significance of these metadata details, we’ll delve into the challenges encountered in gathering the required level of metadata. Subsequently, we’ll explore strategies for overcoming these challenges.

The examples implemented in the workflow are illustrated in the following diagram.

Figure 1. The solution architecture and workflow.

The workflow follows the following sequence:

A user asks a text-based question which can be answered by querying relevant AWS Glue tables through Athena.
Table metadata is fetched from AWS Glue.
The tables’ metadata and SQL generating instructions are added to the prompt template. The Claude AI model is invoked by passing the prompt and the model parameters.
The Claude AI model translates the user intent (question) to SQL based on the instructions and tables’ metadata.
The generated Athena SQL query is run.
The generated Athena SQL query and the SQL query results are returned to the user.

Prerequisites

These prerequisites are given If you want to try this example yourself. You can skip this prerequisites section if you want to understand the example without implementing it. The example centers on invoking Amazon Bedrock models using SageMaker, so we need to set up a few resources in an AWS Account. The relevant CloudFormation template, Jupyter Notebooks, and details of launching the necessary AWS services are covered in this section. The CloudFormation template creates the SageMaker instance with the necessary S3 bucket and IAM role permissions to run AWS Glue commands, Athena SQL, and invoke Amazon Bedrock AI models. The two Jupyter Notebooks (0_create_tables_with_metadata.ipynb and 1_text-to-sql-for-athena.ipynb) provide working code snippets to create the necessary tables and generate the SQL using the Claude AI model on Amazon Bedrock.

Granting Anthropic’s Claude permissions on Amazon Bedrock

Have an AWS account and sign in using the AWS Management Console.
Change the AWS Region to US West (Oregon).
Navigate to the AWS Service Catalog console and choose Amazon Bedrock.
On the Amazon Bedrock console, choose Model Access in the navigation pane.
Choose Manage model access.
Select the Claude
Choose Request model access if you’re requesting the model access for the first time. Otherwise choose Save Changes.

Deploying the CloudFormation stack

After launching the CloudFormation stack:

On the Create stack page, choose Next
On the Specify stack details page, choose Next
On the Configure stack options page, choose Next
On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources
Choose Submit

Downloading Jupyter Notebooks to SageMaker

In the AWS Management Console, choose the name of the currently displayed Region and change it to US West (Oregon).
Navigate to the AWS Service Catalog console and choose Amazon SageMaker.
On the Amazon SageMaker console, choose Notebook in the navigation pane.
Choose Notebook instances.
Select the SageMakerNotebookInstance created by the texttosqlmetadata CloudFormation stack.
Under Actions, choose Open Jupyter
Navigate to Jupyter console, select New, and then choose Console.

Run the following Shell script commands in the console to copy the Jupyter Notebooks.

cd /home/ec2-user/SageMaker
BASE_S3_PATH="s3://aws-blogs-artifacts-public/artifacts/BDB-4265"
aws s3 cp "${BASE_S3_PATH}/0_create_tables_with_metadata.ipynb" ./
aws s3 cp "${BASE_S3_PATH}/1_text_to_sql_for_athena.ipynb" ./

Open each downloaded Notebook and update the values of the athena_results_bucket, aws_region, and athena_workgroup variables based on the outputs from the texttosqlmetadata CloudFormation

Solution implementation

If you want to try this example yourself, try the CloudFormation template provided in the previous section. In the subsequent sections, we will illustrate how each element of the metadata included in the prompt influences the SQL query generated by the model.

The steps in the 0_create_tables_with_metadata.ipynb Jupyter Notebook create Amazon S3 files with synthetic data for employee and department datasets, creates employee_dtls and department_dtls Glue tables pointing to those S3 buckets, and extracts the following metadata for these two tables.

CREATE EXTERNAL TABLE employee_dtls (
	id int COMMENT 'Employee id',
	name string COMMENT 'Employee name',
	age int COMMENT 'Employee age',
	dept_id int COMMENT 'Employee Departments ID',
	emp_category string COMMENT 'Employee category. Contains TEMP For temporary, PERM for permanent, CONTR for contractors ',
	location_id int COMMENT 'Location identifier of the Employee',
	joining_date date COMMENT 'Joining date of the Employee',
	CONSTRAINT pk_1 PRIMARY KEY  (id) ,
	CONSTRAINT FK_1 FOREIGN KEY (dept_id) REFERENCES department_dtls(id)
) 
PARTITIONED BY (
	region_id string COMMENT 'Region identifier. Contains AMER for Americas, EMEA for Europe, the Middle East, and Africa, APAC for Asia Pacific countries'
);

CREATE EXTERNAL TABLE department_dtls (
	id int COMMENT 'Department id',
	name string COMMENT 'Department name',
	location_id int COMMENT 'Location identifier of the Department'
)

The metadata extracted in the previous step provides column descriptions. For the region_id partition column and emp_category column, the description provides possible values along with their meaning. The metadata also has foreign key constraint details. AWS Glue doesn’t provide a way to specify the primary key and foreign key constraints, so use custom keys in the AWS Glue table-level parameters as an alternative to gather primary key and foreign keys while creating the AWS Glue table.

# Define the table schema
employee_table_input = {
    'Name': employee_table_name,
    'PartitionKeys': [
        {'Name': 'region_id', 'Type': 'string', 'Comment': 'Region identifier. Contains AMER for Americas, EMEA for Europe, the Middle East, and Africa, APAC for Asia Pacific countries'}
    ],
    'StorageDescriptor': {
        'Columns': [
            {'Name': 'id', 'Type': 'int', 'Comment': 'Employee id'},
       …
        ],
        'Location': employee_s3_path,
     …
    'TableType': 'EXTERNAL_TABLE',
    'Parameters': {
        'classification': 'csv',
        'primary_key': 'CONSTRAINT pk_1 PRIMARY KEY  (id)',
        'foreign_key_1': 'CONSTRAINT FK_1 FOREIGN KEY (dept_id) REFERENCES department_dtls(id)'          
    }
}

# Create the table
response = glue_client.create_table(DatabaseName=database_name, TableInput=employee_table_input)

The steps in the 1_text-to-sql-for-athena.ipynb Jupyter notebook create the following wrapper function to interact with Claude FM on Amazon Bedrock to generate SQL based on user-provided text wrapped up in a prompt. This function hard codes the model’s parameters and model ID for demonstrating the basic functionality.

def interactWithClaude(prompt):

    body = json.dumps(
        {
            "prompt": prompt,
            "max_tokens_to_sample": 2048,
            "temperature": 1,
            "top_k": 250,
            "top_p": 0.999,
            "stop_sequences": [],
        }
    )
    modelId = "anthropic.claude-v2"  
    accept = "application/json"
    contentType = "application/json"
    response = bedrock_client.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())
    response_text_claude = response_body.get("completion")
    return response_text_claude

Define the following set of instructions for generating Athena SQL query. These SQL generating instructions specify which compute engine the SQL query should run on and other instructions to guide the model in generating the SQL query. These instructions are included in the prompt sent to the Bedrock model.

athena_sql_generating_instructions = """
Read database schema inside the <database_schema></database_schema> tags which contains a list of table names and their schemas to do the following:
    1. Create a syntactically correct AWS Athena query to answer the question.
    2. For tables with partitions, include the filters on the relevant partition columns.
    3. Include only relevant columns for the given question.
    4. Use only the column names that are listed in the schema description. 
    5. Qualify column names with the table name.
    6. Avoid joins to a table if there is no column required from the table.
    7. Convert Strings to Date type while filtering on Date type columns
    8. Return the sql query inside the <SQL></SQL> tab.
"""

Define different prompt templates for demonstrating the importance of metadata in text-to-SQL generation. These templates have placeholders for SQL query generating instructions and tables metadata.

athena_prompt1 = """
Human:  You are an AWS Athena query expert whose output is a valid sql query. You are given the following Instructions for building the AWS Athena query.
<Instructions>
{instruction_dtls}
</Instructions>
        
Only use the following tables defined within the database_schema and table_schema XML-style tags:

<database_schema>
<table_schema>
CREATE EXTERNAL TABLE employee_dtls (
  id int,
  name string,
  age int ,
  dept_id int,
  emp_category string ,
  location_id int ,
  joining_date date
) PARTITIONED BY (
  region_id string
  )
</table_schema>

<table_schema>
CREATE EXTERNAL TABLE department_dtls (
  id int,
  name string ,
  location_id int 
)
</table_schema>
</database_schema>

Question: {question}

Assistant: 
"""

Generate the final prompt by passing the question and instruction details as arguments to the prompt template. Then, invoke the model.

question_asked = "List of permanent employees who work in North America and  joined after Jan 1 2024"
prompt_template_for_query_generate = PromptTemplate.from_template(athena_prompt1)
prompt_data_for_query_generate = prompt_template_for_query_generate.format(question=question_asked,instruction_dtls=athena_sql_generating_instructions)
llm_generated_response = interactWithClaude(prompt_data_for_query_generate)
print(llm_generated_response.replace("<sql>", "").replace("</sql>", " ")  )

The model generates the SQL query for the user question by using the instructions and table details provided in the prompt.

SELECT employee_dtls.id, employee_dtls.name, employee_dtls.age, employee_dtls.dept_id, employee_dtls.emp_category
FROM employee_dtls 
WHERE employee_dtls.region_id = 'NA' 
  AND employee_dtls.emp_category = 'permanent'
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Significance of prompts and metadata in text-to-SQL generation

Understanding the details of tables and the data they contain is essential for both human SQL experts and generative AI-based text-to-SQL generation. These details, collectively known as metadata, provide crucial context for writing SQL queries. For the text-to-SQL example implemented in the previous section, we used prompts to convey specific instructions and table metadata to the model, enabling it to perform user tasks effectively. A question arises on what level of details we need to include in the table metadata. To clarify this point, we asked the model to generate SQL query for the same question three times with different prompts each time.

Prompt with no metadata

For the first test, we used a basic prompt containing just the SQL generating instructions and no table metadata. The basic prompt helped the model generate a SQL query for the given question, but it’s not helpful because the model made assumptions about table names, column names, and literal values used in the filter expressions.

Question: List of permanent employees who work in North America and joined after January 1, 2024.

Prompt definition:

Human: You are an Amazon Athena query expert whose output is a valid sql query. You are given the following Instructions for building the Amazon Athena query.
<Instructions>
{instruction_dtls}
</Instructions>

Question: {question}
Assistant:

SQL query generated:

SELECT emp.employee_id, emp.first_name, emp.last_name, emp.department_id
FROM employee emp
WHERE emp.contract = 'Permanent'
AND emp.region = 'North America'
AND CAST(emp.start_date AS  DATE) > CAST('2024-01-01' AS DATE)

Prompt with basic metadata

For solving the problem of assumed table names and column names, we added table metadata in DDL format in the second prompt. As a result, the model used the correct column names and data types and restricted the DATE casting to a literal string value. It got the SQL query syntactically correct, but one issue remains: the model assumed the literal values used in the filter expressions.

Question: List of permanent employees who work in North America and joined after January 1, 2024.

Prompt definition:

Human: You are an Amazon Athena query expert whose output is a valid sql query. You are given the following Instructions for building the Amazon Athena query.
<Instructions>
{instruction_dtls}
</Instructions>

Only use the following tables defined within the database_schema and table_schema XML-style tags:

<database_schema>
<table_schema>
CREATE EXTERNAL TABLE employee_dtls (
  id int,
  name string,
  age int ,
  dept_id int,
  emp_category string ,
  location_id int ,
  joining_date date
) PARTITIONED BY (
  region_id string
  )
</table_schema>

<table_schema>
CREATE EXTERNAL TABLE department_dtls (
  id int,
  name string ,
  location_id int 
)
</table_schema>
</database_schema>

Question: {question}
Assistant:

SQL query generated:

SELECT employee_dtls.id, employee_dtls.name, employee_dtls.age, employee_dtls.dept_id, employee_dtls.emp_category
FROM employee_dtls 
WHERE employee_dtls.region_id = 'NA' 
  AND employee_dtls.emp_category = 'permanent'
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Prompt with enriched metadata

Now we need to figure out how to provide the possible values of a column to the model. One way could be including metadata in the column for low cardinality columns. So we added column descriptions along with possible values in the third prompt. As a result, the model included the correct literal values in the filter expressions and gave accurate SQL query.

Question: List of permanent employees who work in North America and joined after Jan 1, 2024.

Prompt definition:

Human: You are an Amazon Athena query expert whose output is a valid sql query. You are given the following Instructions for building the Amazon Athena query.
<Instructions>
{instruction_dtls}
</Instructions>

Only use the following tables defined within the database_schema and table_schema XML-style tags:

<database_schema>
<table_schema>
CREATE EXTERNAL TABLE employee_dtls (
id int COMMENT 'Employee id',
name string COMMENT 'Employee name',
age int COMMENT 'Employee age',
dept_id int COMMENT 'Employee Departments ID',
emp_category string COMMENT 'Employee category. Contains TEMP For temporary, PERM for permanent, CONTR for contractors ',
location_id int COMMENT 'Location identifier of the Employee',
joining_date date  COMMENT 'Employee joining date',
CONSTRAINT pk_1 PRIMARY KEY  (id) ,
CONSTRAINT FK_1 FOREIGN KEY (dept_id) REFERENCES department_dtls(id)
) PARTITIONED BY (
region_id string COMMENT 'Region identifier. Contains AMER for Americas, EMEA for Europe, the Middle East, and Africa, APAC for Asia Pacific countries'
)
</table_schema>

<table_schema>
CREATE EXTERNAL TABLE department_dtls (
id int COMMENT 'Department id',
name string COMMENT 'Department name',
location_id int COMMENT 'Location identifier of the Department'
)
</table_schema>
</database_schema>

Question: {question}
Assistant:

SQL query generated:

SELECT employee_dtls.id, 
employee_dtls.name
FROM employee_dtls 
WHERE employee_dtls.emp_category = 'PERM'
  AND employee_dtls.region_id = 'AMER' 
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Prompt with foreign key constraints in the Metadata

Note that when we added the finer details to the metadata of the third prompt, we included foreign key constraints as well. This is done to help the model generate SQL for advanced queries that require joins. Adding foreign key constraints to the metadata helps the model identify the correct columns to be used in the join conditions. To demonstrate this point, we asked the model to write SQL for showing department details along with the employee details. For showing the department details, we need the department_dtls table. The model added department_dtls table to the SQL query and identified the right columns for the join condition based on foreign key constraint details included in the metadata.

Question: List of permanent employees who work in North America and joined after Jan 1, 2024.

SQL query generated:

SELECT
  employee_dtls.name AS employee_name,
  employee_dtls.age,
  department_dtls.name AS department_name
FROM employee_dtls 
JOIN department_dtls 
  ON employee_dtls.dept_id = department_dtls.id
WHERE 
  employee_dtls.emp_category = 'PERM'
  AND employee_dtls.region_id = 'AMER' 
  AND employee_dtls.joining_date > CAST('2024-01-01' AS DATE)

Additional observations

Though the model included relevant employee attributes in the SELECT clause, the exact list of attributes it included varied each time. Even for the same prompt definition, the model provided a varying list of attributes. The model randomly used one of the two approaches for casting the string literal value to date type. The first approach uses CAST('2024-01-01' AS DATE) and the second approach uses DATE '2024-01-01'.

Challenges in maintaining the metadata

Now that you understand how maintaining detailed metadata along with foreign key constraints helps the model in generating accurate SQL queries, let’s discuss how you can gather the necessary details of table metadata. The data lake and database catalogs support gathering and querying metadata, including table and column descriptions. However, making sure that these descriptions are accurate and up-to-date poses several practical challenges, such as:

Creating database objects with useful descriptions requires collaboration between technical and business teams to write detailed and meaningful descriptions. As tables undergo schema changes, updating metadata for each change can be time-consuming and requires effort.
Maintaining lists of possible values for the columns requires continuous updates.
Adding data transformation details to metadata can be challenging because of the dispersed nature of this information across data processing pipelines, making it difficult to extract and incorporate into table-level metadata.
Adding data lineage details to metadata faces challenges because of the fragmented nature of this information across data processing pipelines, making extraction and integration into table-level metadata complex.

Specific to the AWS Glue Data Catalog, more challenges arise, such as the following:

Creating AWS Glue tables through crawlers doesn’t automatically generate table or column descriptions, requiring manual updates to table definitions from the AWS Glue console.
Unlike traditional relational databases, AWS Glue tables don’t explicitly define or enforce primary keys or foreign keys. AWS Glue tables operate on a schema-on-read basis, where the schema is inferred from the data when querying. Therefore, there’s no direct support for specifying primary keys, foreign keys, or column descriptions in AWS Glue tables like there is in traditional databases.

Enriching the metadata

Listed here some ways that you can overcome the previously mentioned challenges in maintaining the metadata.

Enhance the table and column descriptions: Documenting table and column descriptions requires a good understanding of the business process, terminology, acronyms, and domain knowledge. The following are the different methods you can use to get these table and column descriptions into the AWS Glue Data Catalog.
- Use generative AI to generate better documentation: Enterprises often document their business processes, terminologies, and acronyms and make them accessible through company-specific portals. By following naming conventions for tables and columns, consistency in object names can be achieved, making them more relatable to business terminology and acronyms. Using generative AI models on Amazon Bedrock, you can enhance table and column descriptions by feeding the models with business terminology and acronym definitions along with the database schema objects. This approach reduces the time and effort required to generate detailed descriptions. The recently released metadata feature in Amazon DataZone, AI recommendations for descriptions in Amazon DataZone, is along these principles. After you generate the descriptions, you can update the column descriptions using any of the following options.
  - From the AWS Glue catalog UI
  - Using the AWS Glue SDK similar to Step 3a : Create employee_dtls Glue table for querying from Athena in the 0_create_tables_with_metadata.ipynb Jupyter Notebook
  - Add the COMMENTS in the DDL script of the table.
```
CREATE EXTERNAL TABLE <table_name> 
( column1 string COMMENT '<column_description>' ) 
PARTITIONED BY ( column2 string COMMENT '<column_description>' )
```

For AWS Glue tables cataloged from other databases:
- You can add table and column descriptions from the source databases using the crawler in AWS Glue.
- You can configure the EnableAdditionalMetadata Crawler option to crawl metadata such as comments and raw data types from the underlying data sources. The AWS Glue crawler will then populate the additional metadata in AWS Glue Data Catalog. This provides a way to document your tables and columns directly from the metadata defined in the underlying database.
Enhance the metadata with data profiling: As demonstrated in the previous section, providing the list of values in the employee category column and their meaning helped in generating the SQL query with more accurate filter conditions. We can provide such a list of values or data characteristics in the column descriptions with the help of data profiling. Data profiling is the process of analyzing and understanding the data and its characteristics as distinct values. By using data profiling insights, we can enhance column descriptions.
Enhance the metadata with details of partitions and a range of partition values: As demonstrated in the previous section, providing the list of partition values and their meaning in the partition column description helped in generating the SQL with more accurate filter conditions. For list partitions, we can add the list of the partition values and their meanings to the partition column description. For range partitions, we can add more context on the grain of the values like daily, monthly, and a specific range of values to the column description.

Enriching the prompt

You can enhance the prompts with query optimization rules like partition pruning. In the athena_sql_generating_instructions, defined as part of the 1_text-to-sql-for-athena.ipynb Jupyter Notebook, we added an instruction “For tables with partitions, include the filters on the relevant partition columns”. This instruction guides the model on how to handle partition pruning. In the example, we observed that the model added the relevant partition filter on the region_id partition column. These partition filters will speed up the SQL query execution and is one of the top query optimization techniques. You can add more such query optimization rules to the instructions. You can enhance these instructions with relevant SQL examples.

Cleanup

To clean up the resources, start by cleaning up the S3 bucket that was created by the CloudFormation stack. Then delete the CloudFormation stack using the following steps.

In the AWS Management Console, choose the name of the currently displayed Region and change it to US West (Oregon).
Navigate to AWS CloudFormation.
Choose Stacks.
Select texttosqlmetadata
Choose Delete.

Conclusion

The example presented in the post highlights the importance of enriched metadata in generating accurate SQL query using the text-to-SQL capabilities of Anthropic’s Claude model on Amazon Bedrock and discusses multiple ways to enrich the metadata. Amazon Bedrock is at the center of this text-to-SQL generation. Amazon Bedrock can help you build various generative AI applications including the metadata generation use case mentioned in the previous section. To get started with Amazon Bedrock, we recommend following the quick start in the GitHub repo and familiarizing yourself with building generative AI applications. After getting familiar with generative AI applications, see the GitHub Text-to-SQL workshop to learn more text-to-SQL techniques. See Build a robust Text-to-SQL solution and Best practices for Text-to-SQL for the recommended architecture and best practices to follow while implementing text-to-SQL generation.

About the author

Naidu Rongali is a Big Data and ML engineer at Amazon. He designs and develops data processing solutions for data intensive analytical systems supporting Amazon retail business. He has been working on integrating generative AI capabilities into the data lake and data warehouse systems using Amazon Bedrock AI models. Naidu has a PG diploma in Applied Statistics from the Indian Statistical Institute, Calcutta and BTech in Electrical and Electronics from NIT, Warangal. Outside of his work, Naidu practices yoga and goes trekking often.

Enhance Amazon EMR scaling capabilities with Application Master Placement

2024-10-14 Lorenzo Ripani

Post Syndicated from Lorenzo Ripani original https://aws.amazon.com/blogs/big-data/enhance-amazon-emr-scaling-capabilities-with-application-master-placement/

In today’s data-driven world, processing large datasets efficiently is crucial for businesses to gain insights and maintain a competitive edge. Amazon EMR is a managed big data service designed to handle these large-scale data processing needs across the cloud. It allows running applications built using open source frameworks on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), or AWS Outposts, or completely serverless. One of the key features of Amazon EMR on EC2 is managed scaling, which dynamically adjusts computing capacity in response to application demands, providing optimal performance and cost-efficiency.

Although managed scaling aims to optimize EMR clusters for best price-performance and elasticity, some use cases require more granular resource allocation. For example, when multiple applications are submitted to the same clusters, resource contention may occur, potentially impacting both performance and cost-efficiency. Additionally, allocating the Application Master (AM) container to non-reliable nodes like Spot can potentially result in loss of the container and immediate shutdown of the entire YARN application, resulting in wasted resources and additional costs for rescheduling the entire YARN application. These uses cases require more granular resource allocation and sophisticated scheduling policies to optimize resource utilization and maintain high performance.

Starting with the Amazon EMR 7.2 release, Amazon EMR on EC2 introduced a new feature called Application Master (AM) label awareness, which allows users to enable YARN node labels to allocate the AM containers within On-Demand nodes only. Because the AM container is responsible for orchestrating the overall job execution, it’s crucial to verify that it gets allocated to a reliable instance and not be subjected to shutdown due to Spot Instance interruption. Additionally, limiting AM containers to On-Demand helps maintain consistent application launch time, because the fulfillment of the On-Demand Instance isn’t prone to unavailable Spot capacity or bid price.

In this post, we explore the key features and use cases where this new functionality can provide significant benefits, enabling cluster administrators to achieve optimal resource utilization, improved application reliability, and cost-efficiency in your EMR on EC2 clusters.

Solution overview

The Application Master label awareness feature in Amazon EMR works in conjunction with YARN node labels, a functionality offered by Hadoop that empowers you to define labels to nodes within a Hadoop cluster. You can use these labels to determine which nodes of the cluster should host specific YARN containers (such as mappers vs. reducers in a MapReduce, or drivers vs. executors in Apache Spark).

This feature is enabled by default when a cluster is launched with Amazon EMR 7.2.0 and later using Amazon EMR managed scaling, and it has been configured to use YARN node labels. The following code is a basic configuration setup that enables this feature:

[
   {
     "Classification": "yarn-site",
     "Properties": {
      "yarn.node-labels.enabled": "true",
       "yarn.node-labels.am.default-node-label-expression": "ON_DEMAND"
     }
   }
]

Within this configuration snippet, we activate the Hadoop node label feature and define a value for the yarn.node-labels.am.default-node-label-expression property. This property defines the YARN node label that will be used to schedule the AM container of each YARN application submitted to the cluster. This specific container plays a key role in maintaining the lifecycle of the workflow, so verifying its placement on reliable nodes in production workloads is crucial, because the unexpected shutdown of this container can result in the shutdown and failure of the entire application.

Currently, the Application Master label awareness feature only supports two predefined node labels that can be specified to allocate the AM container of a YARN job: ON_DEMAND and CORE. When one of these labels is defined using Amazon EMR configurations (see the preceding example code), Amazon EMR automatically creates the corresponding node labels in YARN and labels the instances in the cluster accordingly.

To demonstrate how this feature works, we launch a sample cluster and run some Spark jobs to see how Amazon EMR managed scaling integrates with YARN node labels.

Launch an EMR cluster with Application Manager placement awareness

To perform some tests, you can launch the following AWS CloudFormation stack, which provisions an EMR cluster with managed scaling and the Application Manager placement awareness feature enabled. If this is your first time launching an EMR cluster, make sure to create the Amazon EMR default roles using the following AWS Command Line Interface (AWS CLI) command:

aws emr create-default-roles

To create the cluster, choose Launch Stack:

Provide the following required parameters:

VPC – An existing virtual private cloud (VPC) in your account where the cluster will be provisioned
Subnet – The subnet in your VPC where you want to launch the cluster
SSH Key Name – An EC2 key pair that you use to connect to the EMR primary node

After the EMR cluster has been provisioned, establish a tunnel to the Hadoop Resource Manager web UI to review the cluster configurations. To access the Resource Manager web UI, complete the following steps:

Set up an SSH tunnel to the primary node using dynamic port forwarding.
Point your browser to the URL http://<primary-node-public-dns>:8088/, using the public DNS name of your cluster’s primary node.

This will open the Hadoop Resource Manager web UI, where you can see how the cluster has been configured.

YARN node labels

In the CloudFormation stack, you launched a cluster specifying to allocate the AM containers on nodes labeled as ON_DEMAND. If you explore the Resource Manager web UI, you can see that Amazon EMR created two labels in the cluster: ON_DEMAND and SPOT. To review the YARN node labels present in your cluster, you can inspect the Node Labels page, as shown in the following screenshot.

On this page, you can see how the YARN labels were created in Amazon EMR:

During initial cluster creation, default node labels such as ON_DEMAND and SPOT are automatically generated as non-exclusive partitions
The DEFAULT_PARTITION label stays vacant because every node gets labeled based on its market type—either being an On-Demand or Spot Instance

In our example, because we launched a single core node as On-Demand, you can observe a single node assigned to the ON_DEMAND partition, and the SPOT partition remains empty. Because the labels are created as non-exclusive, nodes with these labels can run both containers launched with a specific YARN label and also containers that don’t specify a YARN label. For additional details on YARN node labels, see YARN Node Labels in the Hadoop documentation.

Now that we have discussed how the cluster was configured, we can perform some tests to validate and review the behavior of this feature when using it in combination with managed scaling.

Concurrent application submission with Spot Instances

To test the managed scaling capabilities, we submit a simple SparkPi job configured to utilize all available memory on the single core node initially launched in our cluster:

spark-example \
  --deploy-mode cluster \
  --driver-memory 10g \
  --executor-memory 10g \
  --conf spark.dynamicAllocation.maxExecutors=1 \
  --conf spark.yarn.executor.nodeLabelExpression=SPOT \
  SparkPi 800000

In the preceding snippet, we tuned specific Spark configurations to utilize all the resources of the cluster nodes launched (you could also achieve this using the maximizeResourceAllocation configuration while launching an EMR cluster). Because the cluster has been launched using m5.xlarge instances, we can launch individual containers up to 12 GB in terms of memory requirements. With these assumptions, the snippet configures the following:

The Spark driver and executors were configured with 10 GB of memory to utilize most of the available memory on the node, in order to have a single container running on each node of our cluster and simplify this example.
The node-labels.am.default-node-label-expression parameter was set to ON_DEMAND, making sure the Spark driver is automatically allocated to the ON_DEMAND partition of our cluster. Because we specified this configuration while launching the cluster, the AM containers are automatically requested to be scheduled on ON_DEMAND labeled instances, so we don’t need to specify it at the job level.
The yarn.executor.nodeLabelExpression=SPOT configuration verifies that the executors operated exclusively on TASK nodes using Spot Instances. Removing this configuration allows the Spark executors to be scheduled both on SPOT and ON_DEMAND labeled nodes.
The dynamicAllocation.maxExecutors setting was set to 1 to delay the processing time of the application and observe the scaling behavior when multiple YARN applications were submitted concurrently in the same cluster.

As the application transitioned to a RUNNING state, we can verify from the YARN Resource Manager UI that its driver placement was automatically assigned to the ON_DEMAND partition of our cluster (see the following screenshot).

Additionally, upon inspecting the YARN scheduler page, we can see that our SPOT partition doesn’t have a resource associated with it because the cluster was launched with just one On-Demand Instance.

Because the cluster didn’t have Spot Instances initially, you can observe from the Amazon EMR console that managed scaling generates a new Spot task group to accommodate the Spark executor requested to run on Spot nodes only (see the following screenshot) . Before this integration, managed scaling didn’t take into account the YARN labels requested by an application, potentially leading to unpredictable scaling behaviors. With this release, managed scaling now considers the YARN labels specified by applications, enabling more predictable and accurate scaling decisions.

While waiting for the launch of the new Spot node, we submitted another SparkPi job with identical specifications. However, because the memory required to allocate the new Spark Driver was 10 GB and such resources were currently unavailable in the ON_DEMAND partition, the application remained in a pending state until resources became available to schedule its container.

Upon detecting the lack of resources to allocate the new Spark driver, Amazon EMR managed scaling commenced scaling the core instance group (On-Demand Instances in our cluster) by launching a new core node. After the new core node was launched, YARN promptly allocated the pending container on the new node, enabling the application to start its processing. Subsequently, the application requested additional Spot nodes to allocate its own executors (see the following screenshot).

This example demonstrates how managed scaling and YARN labels work together to improve the resiliency of YARN applications, while using cost-effective job executions over Spot Instances.

When to use Application Manager placement awareness and managed scaling

You can use this placement awareness feature to improve cost-efficiency by using Spot Instances while protecting the Application Manager from being incorrectly shut down due to Spot interruptions. It’s particularly useful when you want to take advantage of the cost savings offered by Spot Instances while preserving the stability and reliability of your jobs running on the cluster. When working with managed scaling and the placement awareness feature, consider the following best practices:

Maximum cost-efficiency for non-critical jobs – If you have jobs that don’t have strict service level agreement (SLA) requirements, you can force all Spark executors to run on Spot Instances for maximum cost savings. This can be achieved by setting the following Spark configuration:
```
spark.yarn.executor.nodeLabelExpression=SPOT
```
Resilient execution for production jobs – For production jobs where you require a more resilient execution, you might consider not setting the yarn.executor.nodeLabelExpression parameter. When no label is specified, executors are dynamically allocated between both On-Demand and Spot nodes, providing a more reliable execution.
Limit dynamic allocation for concurrent applications – When working with managed scaling and clusters with multiple applications running concurrently (for example, an interactive cluster with concurrent user utilization), you should consider setting a maximum limit for Spark dynamic allocation using the dynamicAllocation.maxExecutors setting. This can help manage resources over-provisioning and facilitate predictable scaling behavior across applications running on the same cluster. For more details, see Dynamic Allocation in the Spark documentation.
Managed scaling configurations – Make sure your managed scaling configurations are set up correctly to facilitate efficient scaling of Spot Instances based on your workload requirements. For example, set an appropriate value for Maximum On-Demand instances in managed scaling based on the number of concurrent applications you want to run on the cluster. Additionally, if you’re planning to use your On-Demand Instances for running solely AM containers, we recommend setting scheduler.capacity.maximum-am-resource-percent to 1 using the Amazon EMR capacity-scheduler classification.
Improve startup time of the nodes – If your cluster is subject to frequent scaling events (for example, you have a long-running cluster that can run multiple concurrent EMR steps), you might want to optimize the startup time of your cluster nodes. When trying to get an efficient node startup, consider only installing the minimum required set of application frameworks in the cluster and, whenever possible, avoid installing non-YARN frameworks such as HBase or Trino, which might delay the startup of processing nodes dynamically attached by Amazon EMR managed scaling. Finally, whenever possible, don’t use complex and time-consuming EMR bootstrap actions to avoid increasing the startup time of nodes launched with managed scaling.

By following these best practices, you can take advantage of the cost savings of Spot Instances while maintaining the stability and reliability of your applications, particularly in scenarios where multiple applications are running concurrently on the same cluster.

Conclusion

In this post, we explored the benefits of the new integration between Amazon EMR managed scaling and YARN node labels, reviewed its implementation and usage, and defined a few best practices that can help you get started. Whether you’re running batch processing jobs, stream processing applications, or other YARN workloads on Amazon EMR, this feature can help you achieve substantial cost savings without compromising on performance or reliability.

As you embark on your journey to use Spot Instances in your EMR clusters, remember to follow the best practices outlined in this post, such as setting appropriate configurations for dynamic allocation, node label expressions, and managed scaling policies. By doing so, you can make sure that your applications run efficiently, reliably, and at the lowest possible cost.

About the authors

Lorenzo Ripani is a Big Data Solution Architect at AWS. He is passionate about distributed systems, open source technologies and security. He spends most of his time working with customers around the world to design, evaluate and optimize scalable and secure data pipelines with Amazon EMR.

Miranda Diaz is a Software Development Engineer for EMR at AWS. Miranda works to design and develop technologies that make it easy for customers across the world to automatically scale their computing resources to their needs, helping them achieve the best performance at the optimal cost.

Sajjan Bhattarai is a Senior Cloud Support Engineer at AWS, and specializes in BigData and Machine Learning workloads. He enjoys helping customers around the world to troubleshoot and optimize their data platforms.

Bezuayehu Wate is an Associate Big Data Specialist Solutions Architect at AWS. She works with customers to provide strategic and architectural guidance on designing, building, and modernizing their cloud-based analytics solutions using AWS.

Unleash deeper insights with Amazon Redshift data sharing for data lake tables

2024-10-10 Mohammed Alkateb

Post Syndicated from Mohammed Alkateb original https://aws.amazon.com/blogs/big-data/unleash-deeper-insights-with-amazon-redshift-data-sharing-for-data-lake-tables/

Amazon Redshift has established itself as a highly scalable, fully managed cloud data warehouse trusted by tens of thousands of customers for its superior price-performance and advanced data analytics capabilities. Driven primarily by customer feedback, the product roadmap for Amazon Redshift is designed to make sure the service continuously evolves to meet the ever-changing needs of its users.

Over the years, this customer-centric approach has led to the introduction of groundbreaking features such as zero-ETL, data sharing, streaming ingestion, data lake integration, Amazon Redshift ML, Amazon Q generative SQL, and transactional data lake capabilities. The latest innovation in Amazon Redshift data sharing capabilities further enhances the service’s flexibility and collaboration potential.

Amazon Redshift now enables the secure sharing of data lake tables—also known as external tables or Amazon Redshift Spectrum tables—that are managed in the AWS Glue Data Catalog, as well as Redshift views referencing those data lake tables. This breakthrough empowers data analytics to span the full breadth of shareable data, allowing you to seamlessly share local tables and data lake tables across warehouses, accounts, and AWS Regions—without the overhead of physical data movement or recreating security policies for data lake tables and Redshift views on each warehouse.

By using granular access controls, data sharing in Amazon Redshift helps data owners maintain tight governance over who can access the shared information. In this post, we explore powerful use cases that demonstrate how you can enhance cross-team and cross-organizational collaboration, reduce overhead, and unlock new insights by using this innovative data sharing functionality.

Overview of Amazon Redshift data sharing

Amazon Redshift data sharing allows you to securely share your data with other Redshift warehouses, without having to copy or move the data.

Data shared between warehouses doesn’t require the data to be physically copied or moved—instead, data remains in the original Redshift warehouse, and access is granted to other authorized users as part of a one-time setup. Data sharing provides granular access control, allowing you to control which specific tables or views are shared, and which users or services can access the shared data.

Since consumers access the shared data in-place, they always access the latest state of the shared data. Data sharing even allows for the automatic sharing of new tables created after that datashare was established.

You can share data across different Redshift warehouses within or across AWS accounts, and you can also do cross-region data sharing. This allows you to share data with partners, subsidiaries, or other parts of your organization, and enables the powerful workload isolation use case, as shown in the following diagram. With the seamless integration of Amazon Redshift with AWS Data Exchange, data can also be monetized and shared publicly, and public datasets such as census data can be added to a Redshift warehouse with just a few steps.

Figure 1: Amazon Redshift data sharing between producer and consumer warehouses

The data sharing capabilities in Amazon Redshift also enable the implementation of a data mesh architecture, as shown in the following diagram. This helps democratize data within the organization by reducing barriers to accessing and using data across different business units and teams. For datasets with multiple authors, Amazon Redshift data sharing supports both read and write use cases (write in preview at the time of writing). This enables the creation of 360-degree datasets, such as a customer dataset that receives contributions from multiple Redshift warehouses across different business units in the organization.

Figure 2: Data mesh architecture using Amazon Redshift data sharing

Overview of Redshift Spectrum and data lake tables

In the modern data organization, the data lake has emerged as a centralized repository—a single source of truth where all data within the organization ultimately resides at some point in its lifecycle. Redshift Spectrum enables seamless integration between the Redshift data warehouse and customers’ data lakes, as shown in the following diagram. With Redshift Spectrum, you can run SQL queries directly against data stored in Amazon Simple Storage Service (Amazon S3), without the need to first load that data into a Redshift warehouse. This allows you to maintain a comprehensive view of your data while optimizing for cost-efficiency.

Figure 3: Amazon Redshift bridges the data warehouse and data lake by enabling querying of data lake tables in-place

Redshift Spectrum supports a variety of open file formats, including Parquet, ORC, JSON, and CSV, as well as open table formats such as Apache Iceberg, all stored in Amazon S3. It runs these queries using a dedicated fleet of high-performance servers with low-latency connections to the S3 data lake. Data lake tables can be added to a Redshift warehouse either automatically through the Data Catalog, in the Amazon Redshift Query Editor, or manually using SQL commands.

From a user experience standpoint, there is little difference between querying a local Redshift table vs. a data lake table. SQL queries can be reused verbatim to perform the same aggregations and transformations on data residing in the data lake, as shown in the following examples. Additionally, by using columnar file formats like Parquet and pushing down query predicates, you can achieve further performance enhancements.

The following SQL is for a sample query against local Redshift tables:

SELECT top 10 mylocal_schema.sales.eventid, sum(mylocal_schema.sales.pricepaid) FROM mylocal_schema.sales, event
WHERE mylocal_schema.sales.eventid = event.eventid
AND mylocal_schema.sales.pricepaid > 30
GROUP BY mylocal_schema.sales.eventid
ORDER BY 2 DESC;

The following SQL is for the same query, but against data lake tables:

SELECT top 10 myspectrum_schema.sales.eventid, sum(myspectrum_schema.sales.pricepaid) FROM myspectrum_schema.sales, event
WHERE myspectrum_schema.sales.eventid = event.eventid
AND myspectrum_schema.sales.pricepaid > 30
GROUP BY myspectrum_schema.sales.eventid
ORDER BY 2 desc;

To maintain robust data governance, Redshift Spectrum integrates with AWS Lake Formation, enabling the consistent application of security policies and access controls across both the Redshift data warehouse and S3 data lake. When Lake Formation is used, Redshift producer warehouses first share their data with Lake Formation rather than directly with other Redshift consumer warehouses, and the data lake administrator grants fine-grained permissions for Redshift consumer warehouses to access the shared data. For more information, see Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation.

In the past, however, sharing data lake tables across Redshift warehouses presented challenges. It wasn’t possible to do so without having to mount the data lake tables on each individual Redshift warehouse and then recreate the related security policies.

This barrier has now been addressed with the introduction of data sharing support for data lake tables. You can now share data lake tables just like any other table, using the built-in data sharing capabilities of Amazon Redshift. By combining the power of Redshift Spectrum data lake integration with the flexibility of Amazon Redshift data sharing, organizations can unlock new levels of cross-team collaboration and insights, while maintaining robust data governance and security controls.

For more information about Redshift Spectrum, see Getting started with Amazon Redshift Spectrum.

Solution overview

In this post, we describe how to add data lake tables or views to a Redshift datashare, covering two key use cases:

Adding a late-binding view or materialized view to a producer datashare that references a data lake table
Adding a data lake table directly to a producer datashare

The first use case provides greater flexibility and convenience. Consumers can query the shared view without having to configure fine-grained permissions. The configuration, such as defining permissions on data stored in Amazon S3 with Lake Formation, is already handled on the producer side. You only need to add the view to the producer datashare one time, making it a convenient option for both the producer and the consumer.

An additional benefit of this approach is that you can add views to a datashare that join data lake tables with local Redshift tables. When these views are shared, you can relegate the trusted business logic to just the producer side.

Alternatively, you can add data lake tables directly to a datashare. In this case, consumers can query the data lake tables directly or join them with their own local tables, allowing them to add their own conditional logic as needed.

Add a view that references a data lake table to a Redshift datashare

When you create data lake tables that you intend to add to a datashare, the recommended and most common way to do this is to add a view to the datashare that references a data lake table or tables. There are three high-level steps involved:

Add the Redshift view’s schema (the local schema) to the Redshift datashare.
Add the Redshift view (the local view) to the Redshift datashare.
Add the Redshift external schemas (for the tables referenced by the Redshift view) to the Redshift datashare.

The following diagram illustrates the full workflow.

Figure 4: Sharing data lake tables via Amazon Redshift views

The workflow consists of the following steps:

Create a data lake table on the datashare producer. For more information on creating Redshift Spectrum objects, see External schemas for Amazon Redshift Spectrum. Data lake tables to be shared can include Lake Formation registered tables and Data Catalog tables, and if using the Redshift Query Editor, these tables are automatically mounted.
Create a view on the producer that references the data lake table that you created.
Create a datashare, if one doesn’t already exist, and add objects to your datashare, including the view you created that references the data lake table. For more information, see Creating datashares and adding objects (preview).
Add the external schema of the base Redshift table to the datashare (this is true of both local base tables and data lake tables). You don’t have to add a data lake table itself to the datashare.
On the consumer, the administrator makes the view available to consumer database users.
Database consumer users can write queries to retrieve data from the shared view and join it with other tables and views on the consumer.

After these steps are complete, database consumer users with access to the datashare views can reference them in their SQL queries. The following SQL queries are examples for achieving the preceding steps.

Create a data lake table on the producer warehouse:

CREATE EXTERNAL TABLE myspectrum_db.myspectrum_schema.test (c1 INT)
stored AS parquet
location 's3://amzn-s3-demo-bucket/myfolder/';

Create a view on the producer warehouse:

CREATE VIEW mylocal_db.mylocal_schema.myspectrumview AS SELECT c1 FROM myspectrum_db.myspectrum_schema.v_test
WITH no schema binding;

Add a view to the datashare on the producer warehouse:

ALTER datashare mydatashare ADD SCHEMA mylocal_db.mylocal_schema;
ALTER datashare mydatashare ADD VIEW myspectrumview;
ALTER datashare mydatashare ADD SCHEMA myspectrum_db.myspectrum_schema;

Create a consumer datashare and grant permissions for the view in the consumer warehouse:

CREATE database myspectrum_db FROM datashare myspectrumproducer OF account '123456789012' namespace 'p1234567-8765-4321-p10987654321';
GRANT usage ON database myspectrum_db TO usernames;

Add a data lake table directly to a Redshift datashare

Adding a data lake table to a datashare is similar to adding a view. This process works well for a case where the consumers want the raw data from the data lake table and they want to write queries and join it to tables in their own data warehouse. There are two high-level steps involved:

Add the Redshift external schemas (of the data lake tables to be shared) to the Redshift datashare.
Add the data lake table (the Redshift external table) to the Redshift datashare.

The following diagram illustrates the full workflow.

Figure 5: Sharing data lake tables directly in an Amazon Redshift datashare

The workflow consists of the following steps:

Create a data lake table on the datashare producer.
Add objects to your datashare, including the data lake table you created. In this case, you don’t have any abstraction over the table.
On the consumer, the administrator makes the table available.
Database consumer users can write queries to retrieve data from the shared table and join it with other tables and views on the consumer.

The following SQL queries are examples for achieving the preceding producer steps.

Create a data lake table on the producer warehouse:

CREATE EXTERNAL TABLE myspectrum_db.myspectrum_schema.test (c1 INT)
stored AS parquet
location 's3://amzn-s3-demo-bucket/myfolder/';

Add a data lake schema and table directly to the datashare on the producer warehouse:

ALTER datashare mydatashare ADD SCHEMA myspectrum_db.myspectrum_schema;
ALTER datashare mydatashare ADD TABLE myspectrum_db.myspectrum_schema.test;

Create a consumer datashare and grant permissions for the view in the consumer warehouse:

CREATE database myspectrum_db FROM datashare myspectrumproducer OF account '123456789012' namespace 'p1234567-8765-4321-p10987654321';
GRANT usage ON database myspectrum_db TO usernames;

Security considerations for sharing data lake tables and views

Data lake tables are stored outside of Amazon Redshift, in the data lake, and may not be owned by the Redshift warehouse, but are still referenced within Amazon Redshift. This setup requires special security considerations. Data lake tables operate under the security and governance of both Amazon Redshift and the data lake. For Lake Formation registered tables specifically, the Amazon S3 resources are secured by Lake Formation and made available to consumers using the provided credentials.

The data owner of the data in the data lake tables may want to impose restrictions on which external objects can be added to a datashare. To give data owners more control over whether warehouse users can share data lake tables, you can use session tags in AWS Identity and Access Management (IAM). These tags provide additional context about the user running the queries. For more details on tagging resources, refer to Tags for AWS Identity and Access Management resources.

Audit considerations for sharing data lake tables and views

When sharing data lake objects through a datashare, there are special logging considerations to keep in mind:

Access controls – You can also use CloudTrail log data in conjunction with IAM policies to control access to shared tables, including both Redshift datashare producers and consumers. The CloudTrail logs record details about who accesses shared tables. The identifiers in the log data are available in the ExternalId field under the AssumeRole CloudTrail logs. The data owner can configure additional limitations on data access in an IAM policy by means of actions. For more information about defining data access through policies, see Access to AWS accounts owned by third parties.
Centralized access – Amazon S3 resources such as data lake tables can be registered and centrally managed with Lake Formation. After they’re registered with Lake Formation, Amazon S3 resources are secured and governed by the associated Lake Formation policies and made available using the credentials provided by Lake Formation.

Billing considerations for sharing data lake tables and views

The billing model for Redshift Spectrum differs for Amazon Redshift provisioned and serverless warehouses. For provisioned warehouses, Redshift Spectrum queries (queries involving data lake tables) are billed based on the amount of data scanned during query execution. For serverless warehouses, data lake queries are billed the same as non-data-lake queries. Storage for data lake tables is always billed to the AWS account associated with the Amazon S3 data.

In the case of datashares involving data lake tables, costs are attributed for storing and scanning data lake objects in a datashare as follows:

When a consumer queries shared objects from a data lake, the cost of scanning is billed to the consumer:
- When the consumer is a provisioned warehouse, Amazon Redshift uses Redshift Spectrum to scan the Amazon S3 data. Therefore, the Redshift Spectrum cost is billed to the consumer account.
- When the consumer is an Amazon Redshift Serverless workgroup, there is no separate charge for data lake queries.
Amazon S3 costs for storage and operations, such as listing buckets, is billed to the account that owns each S3 bucket.

For detailed information on Redshift Spectrum billing, refer to Amazon Redshift pricing and Billing for storage.

Conclusion

In this post, we explored how Amazon Redshift enhanced data sharing capabilities, including support for sharing data lake tables and Redshift views that reference those data lake tables, empower organizations to unlock the full potential of their data by bringing the full breadth of data assets in scope for advanced analytics. Organizations are now able to seamlessly share local tables and data lake tables across warehouses, accounts, and Regions.

We outlined the steps to securely share data lake tables and views that reference those data lake tables across Redshift warehouses, even those in separate AWS accounts or Regions. Additionally, we covered some considerations and best practices to keep in mind when using this innovative feature.

Sharing data lake tables and views through Amazon Redshift data sharing champions the modern, data-driven organization’s goal to democratize data access in a secure, scalable, and efficient manner. By eliminating the need for physical data movement or duplication, this capability reduces overhead and enables seamless cross-team and cross-organizational collaboration. Unleashing the full potential of your data analytics to span the full breadth of your local tables and data lake tables is just a few steps away.

For more information on Amazon Redshift data sharing and how it can benefit your organization, refer to the following resources:

Please also reach out to your AWS technical account manager or AWS account Solutions Architect. They will be happy to provide additional guidance and support.

About the Authors

Mohammed Alkateb is an Engineering Manager at Amazon Redshift. Prior to joining Amazon, Mohammed had 12 years of industry experience in query optimization and database internals as an individual contributor and engineering manager. Mohammed has 18 US patents, and he has publications in research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD and VLDB. Mohammed holds a PhD in Computer Science from The University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.

Ramchandra Anil Kulkarni is a software development engineer who has been with Amazon Redshift for over 4 years. He is driven to develop database innovations that serve AWS customers globally. Kulkarni’s long-standing tenure and dedication to the Amazon Redshift service demonstrate his deep expertise and commitment to delivering cutting-edge database solutions that empower AWS customers worldwide.

Mark Lyons is a Principal Product Manager on the Amazon Redshift team. He works on the intersection of data lakes and data warehouses. Prior to joining AWS, Mark held product leadership roles with Dremio and Vertica. He is passionate about data analytics and empowering customers to change the world with their data.

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.

Extract insights in a 30TB time series workload with Amazon OpenSearch Serverless

2024-10-07 Satish Nandi

Post Syndicated from Satish Nandi original https://aws.amazon.com/blogs/big-data/extract-insights-in-a-30tb-time-series-workload-with-amazon-opensearch-serverless/

In today’s data-driven landscape, managing and analyzing vast amounts of data, especially logs, is crucial for organizations to derive insights and make informed decisions. However, handling large data while extracting insights is a significant challenge, prompting organizations to seek scalable solutions without the complexity of infrastructure management.

Amazon OpenSearch Serverless reduces the burden of manual infrastructure provisioning and scaling while still empowering you to ingest, analyze, and visualize your time-series data, simplifying data management and enabling you to derive actionable insights from data.

We recently announced a new capacity level of 30TB for time series data per account per AWS Region. The OpenSearch Serverless compute capacity for data ingestion and search/query is measured in OpenSearch Compute Units (OCUs), which are shared among various collections with the same AWS Key Management Service (AWS KMS) key. To accommodate larger datasets, OpenSearch Serverless now supports up to 500 OCUs per account per Region, each for indexing and search respectively, more than double from the previous limit of 200. You can configure the maximum OCU limits on search and indexing independently, giving you the reassurance of managing costs effectively. You can also monitor real-time OCU usage with Amazon CloudWatch metrics to gain a better perspective on your workload’s resource consumption. With the support for 30TB datasets, you can analyze data at the 30TB level to unlock valuable operational insights and make data-driven decisions to troubleshoot application downtime, improve system performance, or identify fraudulent activities.

This post discusses how you can analyze 30TB time series datasets with OpenSearch Serverless.

Innovations and optimizations to support larger data size and faster responses

Sufficient disk, memory, and CPU resources are crucial for handling extensive data effectively and conducting thorough analysis. These resources are not just beneficial but crucial for our operations. In time series collections, the OCU disk typically contains older shards that are not frequently accessed, referred to as warm shards. We have introduced a new feature called warm shard recovery prefetch. This feature actively monitors recently queried data blocks for a shard. It prioritizes them during shard movements, such as shard balancing, vertical scaling, and deployment activities. More importantly, it accelerates auto-scaling and provides faster readiness for varying search workloads, thereby significantly improving our system’s performance. The results provided later in this post provide details on the improvements.

A few select customers worked with us on early adoption prior to General Availability. In these trials, we observed up to 66% improvement in warm query performance for some customer workloads. This significant improvement shows the effectiveness of our new features. Additionally, we have enhanced the concurrency between coordinator and worker nodes, allowing more requests to be processed as the OCUs increases through auto scaling. This enhancement has resulted in up to a 10% improvement in query performance for hot and warm queries.

We have enhanced our system’s stability to handle time-series collections of up to 30 TB effectively. Our team is committed to improving system performance, as demonstrated by our ongoing enhancements to the auto-scaling system. These improvements comprised of enhanced shard distribution for optimal placement after rollover, auto-scaling policies based on queue length, and a dynamic sharding strategy that adjusts shard count based on ingestion rate.

In the following section we share an example test setup of a 30TB workload that we used internally, detailing the data being used and generated, along with our observations and results. Performance may vary depending on the specific workload.

Ingest the data

You can use the load generation scripts shared in the following workshop, or you can use your own application or data generator to create a load. You can run multiple instances of these scripts to generate a burst in indexing requests. As shown in the following screenshot, we tested with an index, sending approximately 30 TB of data over a period of 15 days. We used our load generator script to send the traffic to a single index, retaining data for 15 days using a data life cycle policy.

Test methodology

We set the deployment type to ‘Enable redundancy’ to enable data replication across Availability Zones. This deployment configuration will lead to 12-24 hours of data in hot storage (OCU disk memory) and the rest in Amazon Simple Storage Service (Amazon S3). With a defined set of search performance and the preceding ingestion expectation, we set the max OCUs to be 500 for both indexing and search.

As part of the testing, we observed auto-scaling behavior and graphed it. The indexing took around 8 hours to get stabilized at 80 OCU.

On the Search side, it took around 2 days to get stabilized at 80 OCU.

Observations:

Ingestion

The ingestion performance achieved was consistently over 2 TB per day

Search

Queries were of two types, with time ranging from 15 minutes to 15 days.

{"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}]}}}

For example

{"aggs":{"1":{"cardinality":{"field":"carrier.keyword"}}},"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1d","lte":"now"}}}]}}}

The following chart provides the various percentile performance on the aggregation query

The second query was

{"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}],"should":[{"match":{"originState":"State"}}]}}}

For example

{"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}}],"should":[{"match":{"originState":"California"}}]}}}

The following chart provides the various percentile performance on the search query

The following chart summarizes the time range for different queries

Time-range	Query	P50 (ms)	P90 (ms)	P95 (ms)	P99 (ms)
15 minutes	{“aggs”:{“1”:{“cardinality”:{“field”:”carrier.keyword”}}},”size”:0,”query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-15m”,”lte”:”now”}}}]}}}	325	403.867	441.917	514.75
1 day	{“aggs”:{“1”:{“cardinality”:{“field”:”carrier.keyword”}}},”size”:0,”query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-1d”,”lte”:”now”}}}]}}}	7,693.06	12,294	13,411.19	17,481.4
1 hour	{“aggs”:{“1”:{“cardinality”:{“field”:”carrier.keyword”}}},”size”:0,”query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-1h”,”lte”:”now”}}}]}}}	1,061.66	1,397.27	1,482.75	1,719.53
1 year	{“aggs”:{“1”:{“cardinality”:{“field”:”carrier.keyword”}}},”size”:0,”query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-1y”,”lte”:”now”}}}]}}}	2,758.66	10,758	12,028	22,871.4
4 hour	{“aggs”:{“1”:{“cardinality”:{“field”:”carrier.keyword”}}},”size”:0,”query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-4h”,”lte”:”now”}}}]}}}	3,870.79	5,233.73	5,609.9	6,506.22
7 day	{“aggs”:{“1”:{“cardinality”:{“field”:”carrier.keyword”}}},”size”:0,”query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-7d”,”lte”:”now”}}}]}}}	5,395.68	17,538.12	19,159.18	22,462.32
15 minutes	{“query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-15m”,”lte”:”now”}}}],”should”:[{“match”:{“originState”:”California”}}]}}}	139	190	234.55	6,071.96
1 day	{“query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-1d”,”lte”:”now”}}}],”should”:[{“match”:{“originState”:”California”}}]}}}	678.917	1,366.63	2,423	7,893.56
1 hour	{“query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-1h”,”lte”:”now”}}}],”should”:[{“match”:{“originState”:”Washington”}}]}}}	259.167	305.8	343.3	1,125.66
1 year	{“query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-1y”,”lte”:”now”}}}],”should”:[{“match”:{“originState”:”Washington”}}]}}}	2,166.33	2,469.7	4,804.9	9,440.11
4 hours	{“query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-4h”,”lte”:”now”}}}],”should”:[{“match”:{“originState”:”Washington”}}]}}}	462.933	653.6	725.3	1,583.37
7 days	{“query”:{“bool”:{“filter”:[{“range”:{“@timestamp”:{“gte”:”now-7d”,”lte”:”now”}}}],”should”:[{“match”:{“originState”:”Washington”}}]}}}	1,353	2,745.1	4,338.8	9,496.36

Conclusion

OpenSearch Serverless not only supports a larger data size than prior releases but also introduces performance improvements like warm shard pre-fetch and concurrency optimization for better query response. These features reduce the latency of warm queries and improve auto-scaling to handle varied workloads. We encourage you to take advantage of the 30TB index support and put it to the test! Migrate your data, explore the improved throughput, and take advantage of the enhanced scaling capabilities.

To get started, refer to Log analytics the easy way with Amazon OpenSearch Serverless. To get hands-on experience with OpenSearch Serverless, follow the Getting started with Amazon OpenSearch Serverless workshop, which has a step-by-step guide for configuring and setting up an OpenSearch Serverless collection.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

About the authors

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and has years of experience in networking, security and AI/ML. He holds a Bachelor’s degree in Computer Science and an MBA in Entrepreneurship. In his free time, he likes to fly airplanes and hang gliders and ride his motorcycle.

Milav Shah is an Engineering Leader with Amazon OpenSearch Service. He focuses on search experience for OpenSearch customers. He has extensive experience building highly scalable solutions in databases, real-time streaming and distributed computing. He also possesses functional domain expertise in verticals like Internet of Things, fraud protection, gaming and AI/ML. In his free time, he likes to ride cycle, hike, and play chess.

Qiaoxuan Xue is a Senior Software Engineer at AWS leading the search and benchmarking areas of the Amazon OpenSearch Serverless Project. His passion lies in finding solutions for intricate challenges within large-scale distributed systems. Outside of work, he enjoys woodworking, biking, playing basketball, and spending time with his family and dog.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

How Banfico built an Open Banking and Payment Services Directive (PSD2) compliance solution on AWS

2024-10-04 Otis Antoniou

Post Syndicated from Otis Antoniou original https://aws.amazon.com/blogs/architecture/how-banfico-built-an-open-banking-and-payment-services-directive-psd2-compliance-solution-on-aws/

This post was co-written with Paulo Barbosa, the COO of Banfico.

Introduction

Banfico is a London-based FinTech company, providing market-leading Open Banking regulatory compliance solutions. Over 185 leading Financial Institutions and FinTech companies use Banfico to streamline their compliance process and deliver the future of banking.

Under the EU’s revised PSD2, banks can use application programming interfaces (APIs) to securely share financial data with licensed and approved third-party providers (TPPs), when there is customer consent. For example, this can allow you to track your bank balances across multiple accounts in a single budgeting app.

PSD2 requires that all parties in the open banking system are identified in real time using secured certificates. Banks must also provide a service desk to TPPs, and communicate any planned or unplanned downtime that could impact the shared services.

In this blog post, you will learn how the Red Hat OpenShift Service on AWS helped Banfico deliver their highly secure, available, and scalable Open Banking Directory — a product that enables seamless and compliant connectivity between banks and FinTech companies.

Using this modular architecture, Banfico can also serve other use cases such as confirmation of payee, which is designed to help consumers verify that the name of the recipient account, or business, is indeed the name that they intended to send money to.

Design Considerations

Banfico prioritized the following design principles when building their product:

Scalability: Banfico needed their solution to be able to scale up seamlessly as more financial institutions and TPPs begin to utilize the solution, without any interruption to service.
Leverage Managed Solutions and Minimize Administrative Overhead: The Banfico team wanted to focus on their areas of core competency around the product, financial services regulation, and open banking. They wanted to leverage solutions that could minimize the amount of infrastructure maintenance they have to perform.
Reliability: Because the PSD2 regulations require real-time identification and up-to-date communication about planned or unplanned downtime, reliability was a top priority to enable stable communication channels between TPPs and banks. The Open Banking Directory therefore needed to reach availability of 99.95%.
Security and Compliance: The Open Banking Directory needed to be highly secure, ensuring that sensitive data is protected at all times. This was also important due to Banfico’s ISO27001 certification.

To address these requirements, Banfico decided to partner up with AWS and Red Hat and use the Red Hat OpenShift Service on AWS (ROSA). This is a service operated by Red Hat and jointly supported with AWS to provide fully managed Red Hat OpenShift platform that gives them a scalable, secure, and reliable way to build their product. They also leveraged other AWS Managed Services to minimize infrastructure management tasks and focus on delivering business value for their customers.

To understand how they were able to architect a solution that addressed their needs while following the design considerations, see the following reference architecture diagram.

Banfico’s Open Banking Directory Architecture Overview:

Banfico's open banking directory architecture overview diagram

Breakdown of key components:

Red Hat OpenShift Service on AWS (ROSA) cluster: The Banfico Open Banking SaaS key services are built on a ROSA cluster that is deployed across three Availability Zones for high availability and fault tolerance. These key services support the following fundamental business capabilities:

Their core aggregated API platform that integrates with, and provides access to banking information for TPPs.
Facilitating transactions and payment authorizations.
TPP authentication and authorization, more specifically:
- Checking if a certain TPP is authorized by each country’s central bank to check account information and initiate payments.
- Validating TPP certificates that are issued by Qualified Trust Service Provider (QTSPs), which are: “regulated (Qualified) to provide trusted digital certificates under the electronic Identification and Signature (eIDAS) regulation. PSD2 also requires specific types of eIDAS certificates to be issued.” – Planky Open Banking Glossary
Certificate issuing and management. Banfico is able to issue, manage, and store digital certificates that TPPs can use to interact with Open Banking APIs.
The collection of data from central banks across the world to collect regulated entity details.

Elastic Load Balancer (ELB): A load balancer helps Banfico deliver their highly-available and scalable product. It allows them to route traffic across their containers as they grow, and perform health checks accordingly, and it provides Banfico customers access to the application workloads running on ROSA through the ROSA router layers.

Amazon Elastic File System (Amazon EFS): During the collection of data from central banks, either through APIs or by scraping HTML, Banfico’s workloads and apps use the highly-scalable and durable Amazon EFS for shared storage. Amazon EFS automatically scales and provides high availability, simplifying operations and enabling Banfico to focus on application development and delivery.

Amazon Simple Storage Service (Amazon S3): To store digital certificates issued and managed by Banfico’s Open Banking Directory, they rely on Amazon S3, which is a highly-durable, available, and scalable object storage service.

Amazon Relational Database Service (Amazon RDS): The Open Banking Directory uses Amazon RDS PostgreSQL to store application data coming from their different containerized services. Using Amazon RDS, they are able to have a highly-available managed relational database which they also replicate to a secondary Region for disaster recovery purposes.

AWS Key Management Service (AWS KMS): Banfico uses AWS KMS to encrypt all data stored on the volumes used by Amazon RDS to make sure their data is secured.

AWS Identity and Access Management (IAM): Leveraging IAM with the principle of least privilege allows the product to follow security best practices.

AWS Shield: Banfico’s product relies on AWS Shield for DDoS protection, which helps in dynamic detection and automatic inline mitigation.

Amazon Route 53: Amazon Route 53 routes end users to Banfico’s site reliably with globally dispersed Domain Name System (DNS) servers and automatic scaling. They can set up in minutes, and having custom routing policies help Banfico maintain compliance.

Using this architecture and AWS technologies, Banfico is able to deliver their Open Banking Directory to their customers, through a SaaS frontend as shown in the following image.

Banfico's Open Banking Directory SaaS front-end

Conclusion

This AWS solution has proven instrumental in meeting Banfico’s critical business needs, delivering 99.95% availability and scalability. Through the utilization of AWS services, the Open Banking Directory product seamlessly accommodates the entirety of Banfico’s client traffic across Europe. This heightened agility not only facilitates rapid feature deployment (40% faster application development), but also enhances user satisfaction. Looking ahead, Banfico’s Open Banking Directory remains committed to fostering safety and trust within the open banking ecosystem, with AWS standing as a valued partner in Banfico’s journey toward sustained success. Customers who are looking to build their own secure and scalable products in the Financial Services Industry have access industry AWS Specialists; contact us for help in your cloud journey. You can also learn more about AWS services and solutions for financial services by visiting AWS for Financial Services.

Build a dynamic rules engine with Amazon Managed Service for Apache Flink

2024-10-03 Steven Carpenter

Post Syndicated from Steven Carpenter original https://aws.amazon.com/blogs/big-data/build-a-dynamic-rules-engine-with-amazon-managed-service-for-apache-flink/

Imagine you have some streaming data. It could be from an Internet of Things (IoT) sensor, log data ingestion, or even shopper impression data. Regardless of the source, you have been tasked with acting on the data—alerting or triggering when something occurs. Martin Fowler says: “You can build a simple rules engine yourself. All you need is to create a bunch of objects with conditions and actions, store them in a collection, and run through them to evaluate the conditions and execute the actions.”

A business rules engine (or simply rules engine) is a software system that executes many rules based on some input to determine some output. Simplistically, it’s a lot of “if then,” “and,” and “or” statements that are evaluated on some data. There are many different business rule systems, such as Drools, OpenL Tablets, or even RuleBook, and they all share a commonality: they define rules (collection of objects with conditions) that get executed (evaluate the conditions) to derive an output (execute the actions). The following is a simplistic example:

if (office_temperature) < 50 degrees => send an alert

if (office_temperature) < 50 degrees AND (occupancy_sensor) == TRUE => < Trigger action to turn on heat>

When a single condition or a composition of conditions evaluates to true, it is desired to send out an alert to potentially act on that event (trigger the heat to warm the 50 degrees room).

This post demonstrates how to implement a dynamic rules engine using Amazon Managed Service for Apache Flink. Our implementation provides the ability to create dynamic rules that can be created and updated without the need to change or redeploy the underlying code or implementation of the rules engine itself. We discuss the architecture, the key services of the implementation, some implementation details that you can use to build your own rules engine, and an AWS Cloud Development Kit (AWS CDK) project to deploy this in your own account.

Solution overview

The workflow of our solution starts with the ingestion of the data. We assume that we have some source data. It could be from a variety of places, but for this demonstration, we use streaming data (IoT sensor data) as our input data. This is what we will evaluate our rules on. For example purposes, let’s assume we are looking at data from our AnyCompany Home Thermostat. We’ll see attributes like temperature, occupancy, humidity, and more. The thermostat publishes the respective values every 1 minute, so we’ll base our rules around that idea. Because we’re ingesting this data in near real time, we need a service designed specifically for this use case. For this solution, we use Amazon Kinesis Data Streams.

In a traditional rules engine, there may be a finite list of rules. The creation of new rules would likely involve a revision and redeployment of the code base, a replacement of some rules file, or some overwriting process. However, a dynamic rules engine is different. Much like our streaming input data, our rules can also be streamed as well. Here we can use Kinesis Data Streams to stream our rules as they are created.

At this point, we have two streams of data:

The raw data from our thermostat
The business rules perhaps created through a user interface

The following diagram illustrates we can connect these streams together. Architecture Diagram

Connecting streams

A typical use case for Managed Service for Apache Flink is to interactively query and analyze data in real time and continuously produce insights for time-sensitive use cases. With this in mind, if you have a rule that corresponds to the temperature dropping below a certain value (especially in winter), it might be critical to evaluate and produce a result as timely as possible.

Apache Flink connectors are software components that move data into and out of a Managed Service for Apache Flink application. Connectors are flexible integrations that let you read from files and directories. They consist of complete modules for interacting with AWS services and third-party systems. For more details about connectors, see Use Apache Flink connectors with Managed Service for Apache Flink.

We use two types of connectors (operators) for this solution:

Sources – Provide input to your application from a Kinesis data stream, file, or other data source
Sinks – Send output from your application to a Kinesis data stream, Amazon Data Firehose stream, or other data destination

Flink applications are streaming dataflows that may be transformed by user-defined operators. These dataflows form directed graphs that start with one or more sources and end in one or more sinks. The following diagram illustrates an example dataflow (source). As previously discussed, we have two Kinesis data streams that can be used as sources for our Flink program.

Flink Data Flow

The following code snippet shows how we have our Kinesis sources set up within our Flink code:

/**
* Creates a DataStream of Rule objects by consuming rule data from a Kinesis
* stream.
*
* @param env The StreamExecutionEnvironment for the Flink job
* @return A DataStream of Rule objects
* @throws IOException if an error occurs while reading Kinesis properties
*/
private DataStream<Rule> createRuleStream(StreamExecutionEnvironment env, Properties sourceProperties)
                throws IOException {
        String RULES_SOURCE = KinesisUtils.getKinesisRuntimeProperty("kinesis", "rulesTopicName");
        FlinkKinesisConsumer<String> kinesisConsumer = new FlinkKinesisConsumer<>(RULES_SOURCE,
                        new SimpleStringSchema(),
                        sourceProperties);
        DataStream<String> rulesStrings = env.addSource(kinesisConsumer)
                        .name("RulesStream")
                        .uid("rules-stream");
        return rulesStrings.flatMap(new RuleDeserializer()).name("Rule Deserialization");
}

/**
* Creates a DataStream of SensorEvent objects by consuming sensor event data
* from a Kinesis stream.
*
* @param env The StreamExecutionEnvironment for the Flink job
* @return A DataStream of SensorEvent objects
* @throws IOException if an error occurs while reading Kinesis properties
*/
private DataStream<SensorEvent> createSensorEventStream(StreamExecutionEnvironment env,
            Properties sourceProperties) throws IOException {
    String DATA_SOURCE = KinesisUtils.getKinesisRuntimeProperty("kinesis", "dataTopicName");
    FlinkKinesisConsumer<String> kinesisConsumer = new FlinkKinesisConsumer<>(DATA_SOURCE,
                    new SimpleStringSchema(),
                    sourceProperties);
    DataStream<String> transactionsStringsStream = env.addSource(kinesisConsumer)
                    .name("EventStream")
                    .uid("sensor-events-stream");

    return transactionsStringsStream.flatMap(new JsonDeserializer<>(SensorEvent.class))
                    .returns(SensorEvent.class)
                    .flatMap(new TimeStamper<>())
                    .returns(SensorEvent.class)
                    .name("Transactions Deserialization");
}

We use a broadcast state, which can be used to combine and jointly process two streams of events in a specific way. A broadcast state is a good fit for applications that need to join a low-throughput stream and a high-throughput stream or need to dynamically update their processing logic. The following diagram illustrates an example how the broadcast state is connected. For more details, see A Practical Guide to Broadcast State in Apache Flink.

Broadcast State

This fits the idea of our dynamic rules engine, where we have a low-throughput rules stream (added to as needed) and a high-throughput transactions stream (coming in at a regular interval, such as one per minute). This broadcast stream allows us to take our transactions stream (or the thermostat data) and connect it to the rules stream as shown in the following code snippet:

// Processing pipeline setup
DataStream<Alert> alerts = sensorEvents
    .connect(rulesStream)
    .process(new DynamicKeyFunction())
    .uid("partition-sensor-data")
    .name("Partition Sensor Data by Equipment and RuleId")
    .keyBy((equipmentSensorHash) -> equipmentSensorHash.getKey())
    .connect(rulesStream)
    .process(new DynamicAlertFunction())
    .uid("rule-evaluator")
    .name("Rule Evaluator");

To learn more about the broadcast state, see The Broadcast State Pattern. When the broadcast stream is connected to the data stream (as in the preceding example), it becomes a BroadcastConnectedStream. The function applied to this stream, which allows us to process the transactions and rules, implements the processBroadcastElement method. The KeyedBroadcastProcessFunction interface provides three methods to process records and emit results:

processBroadcastElement() – This is called for each record of the broadcasted stream (our rules stream).
processElement() – This is called for each record of the keyed stream. It provides read-only access to the broadcast state to prevent modifications that result in different broadcast states across the parallel instances of the function. The processElement method retrieves the rule from the broadcast state and the previous sensor event of the keyed state. If the expression evaluates to TRUE (discussed in the next section), an alert will be emitted.
onTimer() – This is called when a previously registered timer fires. Timers can be registered in the processElement method and are used to perform computations or clean up states in the future. This is used in our code to make sure any old data (as defined by our rule) is evicted as necessary.

We can handle the rule in the broadcast state instance as follows:

@Override
public void processBroadcastElement(Rule rule, Context ctx, Collector<Alert> out) throws Exception {
   BroadcastState<String, Rule> broadcastState = ctx.getBroadcastState(RulesEvaluator.Descriptors.rulesDescriptor);
   Long currentProcessTime = System.currentTimeMillis();
   // If we get a new rule, we'll give it insufficient data rule op status
    if (!broadcastState.contains(rule.getId())) {
        outputRuleOpData(rule, OperationStatus.INSUFFICIENT_DATA, currentProcessTime, ctx);
    }
   ProcessingUtils.handleRuleBroadcast(rule, broadcastState);
}

static void handleRuleBroadcast(FDDRule rule, BroadcastState<String, FDDRule> broadcastState)
        throws Exception {
    switch (rule.getStatus()) {
        case ACTIVE:
            broadcastState.put(rule.getId(), rule);
            break;
        case INACTIVE:
            broadcastState.remove(rule.getId());
            break;
    }
}

Notice what happens in the code when the rule status is INACTIVE. This would remove the rule from the broadcast state, which would then no longer consider the rule to be used. Similarly, handling the broadcast of a rule that is ACTIVE would add or replace the rule within the broadcast state. This is allowing us to dynamically make changes, adding and removing rules as necessary.

Evaluating rules

Rules can be evaluated in a variety of ways. Although it’s not a requirement, our rules were created in a Java Expression Language (JEXL) compatible format. This allows us to evaluate rules by providing a JEXL expression along with the appropriate context (the necessary transactions to reevaluate the rule or key-value pairs), and simply calling the evaluate method:

JexlExpression expression = jexl.createExpression(rule.getRuleExpression());
Boolean isAlertTriggered = (Boolean) expression.evaluate(context);

A powerful feature of JEXL is that not only can it support simple expressions (such as those including comparison and arithmetic), it also has support for user-defined functions. JEXL allows you to call any method on a Java object using the same syntax. If there is a POJO with the name SENSOR_cebb1baf_2df0_4267_b489_28be562fccea that has the method hasNotChanged, you would call that method using the expression. You can find more of these user-defined functions that we used within our SensorMapState class.

Let’s look at an example of how this would work, using a rule expression exists that reads as follows:

"SENSOR_cebb1baf_2df0_4267_b489_28be562fccea.hasNotChanged(5)"

This rule, evaluated by JEXL, would be equivalent to a sensor that hasn’t changed in 5 minutes

The corresponding user-defined function (part of SensorMapState) that is exposed to JEXL (using the context) is as follows:

public Boolean hasNotChanged(Integer time) {
    Long minutesSinceChange = getMinutesSinceChange();
    log.debug("Time: " + time + " | Minutes since change: " + minutesSinceChange);
    return minutesSinceChange >  time;
}

Relevant data, like that below, would go into the context window, which would then be used to evaluate the rule.

{
    "id": "SENSOR_cebb1baf_2df0_4267_b489_28be562fccea",
    "measureValue": 10,
    "eventTimestamp": 1721666423000
}

In this case, the result (or value of isAlertTriggered) is TRUE.

Creating sinks

Much like how we previously created sources, we also can create sinks. These sinks will be used as the end to our stream processing where our analyzed and evaluated results will get emitted for future use. Like our source, our sink is also a Kinesis data stream, where a downstream Lambda consumer will iterate the records and process them to take the appropriate action. There are many applications of downstream processing; for example, we can persist this evaluation result, create a push notification, or update a rule dashboard.

Based on the previous evaluation, we have the following logic within the process function itself:

if (isAlertTriggered) {
    alert = new Alert(rule.getEquipmentName(), rule.getName(), rule.getId(), AlertStatus.START,
            triggeringEvents, currentEvalTime);
    log.info("Pushing {} alert for {}", AlertStatus.START, rule.getName());
}
out.collect(alert);

When the process function emits the alert, the alert response is sent to the sink, which then can be read and used downstream in the architecture:

alerts.flatMap(new JsonSerializer<>(Alert.class))
    .name("Alerts Deserialization").sinkTo(createAlertSink(sinkProperties))
    .uid("alerts-json-sink")
    .name("Alerts JSON Sink");

At this point, we can then process it. We have a Lambda function logging the records where we can see the following:

{
   "equipmentName":"THERMOSTAT_1",
   "ruleName":"RuleTest2",
   "ruleId":"cda160c0-c790-47da-bd65-4abae838af3b",
   "status":"START",
   "triggeringEvents":[
      {
         "equipment":{
            "id":"THERMOSTAT_1",
         },
         "id":"SENSOR_cebb1baf_2df0_4267_b489_28be562fccea",
         "measureValue":20.0,
         "eventTimestamp":1721672715000,
         "ingestionTimestamp":1721741792958
      }
   ],
   "timestamp":1721741792790
}

Although simplified in this example, these code snippets form the basis for taking the evaluation results and sending them elsewhere.

Conclusion

In this post, we demonstrated how to implement a dynamic rules engine using Managed Service for Apache Flink with both the rules and input data streamed through Kinesis Data Streams. You can learn more about it with the e-learning that we have available.

As companies seek to implement near real-time rules engines, this architecture presents a compelling solution. Managed Service for Apache Flink offers powerful capabilities for transforming and analyzing streaming data in real time, while simplifying the management of Flink workloads and seamlessly integrating with other AWS services.

To help you get started with this architecture, we’re excited to announce that we’ll be publishing our complete rules engine code as a sample on GitHub. This comprehensive example will go beyond the code snippets provided in our post, offering a deeper look into the intricacies of building a dynamic rules engine with Flink.

We encourage you to explore this sample code, adapt it to your specific use case, and take advantage of the full potential of real-time data processing in your applications. Check out the GitHub repository, and don’t hesitate to reach out with any questions or feedback as you embark on your journey with Flink and AWS!

About the Authors

Steven Carpenter is a Senior Solution Developer on the AWS Industries Prototyping and Customer Engineering (PACE) team, helping AWS customers bring innovative ideas to life through rapid prototyping on the AWS platform. He holds a master’s degree in Computer Science from Wayne State University in Detroit, Michigan. Connect with Steven on LinkedIn!

Aravindharaj Rajendran is a Senior Solution Developer within the AWS Industries Prototyping and Customer Engineering (PACE) team, based in Herndon, VA. He helps AWS customers materialize their innovative ideas by rapid prototyping using the AWS platform. Outside of work, he loves playing PC games, Badminton and Traveling.

Customer compliance and security during the post-quantum cryptographic migration

2024-10-03 Panos Kampanakis

Post Syndicated from Panos Kampanakis original https://aws.amazon.com/blogs/security/customer-compliance-and-security-during-the-post-quantum-cryptographic-migration/

Amazon Web Services (AWS) prioritizes the security, privacy, and performance of its services. AWS is responsible for the security of the cloud and the services it offers, and customers own the security of the hosts, applications, and services they deploy in the cloud. AWS has also been introducing quantum-resistant key exchange in common transport protocols used by our customers in order to provide long-term confidentiality. In this blog post, we elaborate how customer compliance and security configuration responsibility will operate in the post-quantum migration of secure connections to the cloud. We explain how customers are responsible for enabling quantum-resistant algorithms or having these algorithms enabled by default in their applications that connect to AWS. We also discuss how AWS will honor and choose these algorithms (if they are supported on the server side) even if that means the introduction of a small delay to the connection.

Secure connectivity

Security and compliance is a shared responsibility between AWS and the customer. This Shared Responsibility Model can help relieve the customer’s operational burden as AWS operates, manages, and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the service operates. The customer assumes responsibility and management of the guest operating system and other associated application software, as well as the configuration of the AWS provided security group firewall. AWS has released Customer Compliance Guides (CCGs) to support customers, partners, and auditors in their understanding of how compliance requirements from leading frameworks map to AWS service security recommendations.

In the context of secure connectivity, AWS makes available secure algorithms in encryption protocols (for example, TLS, SSH, and VPN) for customers that connect to its services. That way AWS is responsible for enabling and prioritizing modern cryptography in connections to the AWS Cloud. Customers, on the other hand, use clients that enable such algorithms and negotiate cryptographic ciphers when connecting to AWS. It is the responsibility of the customer to configure or use clients that only negotiate the algorithms the customer prefers and trusts when connecting.

Prioritizing quantum-resistance or performance?

AWS has been in the process of migrating to post-quantum cryptography in network connections to AWS services. New cryptographic algorithms are designed to protect against a future cryptanalytically relevant quantum computer (CRQC) which could threaten the algorithms we use today. Post-quantum cryptography involves introducing post-quantum (PQ) hybrid key exchanges in protocols like TLS 1.3 or SSH/SFTP. Because both classical and PQ-hybrid exchanges need to be supported for backwards compatibility, AWS will prioritize PQ-hybrid exchanges for clients that support it and classical for clients that have not been upgraded yet. We don’t want to switch a client to classical if it advertises support for PQ.

PQ-hybrid key establishment leverages quantum-resistant key encapsulation mechanisms (KEMs) used in conjunction with classical key exchange. The client and server still do an ECDH key exchange, which gets combined with the KEM shared secret when deriving the symmetric key. For example, clients could perform an ECDH key exchange with curve P256 and post-quantum Kyber-768 from NIST’s PQC Project Round 3 (TLS group identifier X25519Kyber768Draft00) when connecting to AWS Certificate Manager (ACM), AWS Key Management Service (AWS KMS), and AWS Secrets Manager. This strategy combines the high assurance of a classical key exchange with the quantum-resistance of the proposed post-quantum key exchanges, to help ensure that the handshakes are protected as long as the ECDH or the post-quantum shared secret cannot be broken. The introduction of the ML-KEM algorithm adds more data (2.3 KB) to be transferred and slightly more processing overhead. The processing overhead is comparable to the existing ECDH algorithm, which has been used in most TLS connections for years. As shown in the following table, the total overhead of hybrid key exchanges has been shown to be immaterial in typical handshakes over the Internet. (Sources: Blog posts How to tune TLS for hybrid post-quantum cryptography with Kyber and The state of the post-quantum Internet)

	Data transferred (bytes)	CPU processing (thousand ops/sec)
	Data transferred (bytes)	Client	Server
ECDH with P256	128	17	17
X25519	64	31	31
ML-KEM-768	2,272	13	25

The new key exchanges introduce some unique conceptual choices that we didn’t have before, which could lead to the peers negotiating classical-only algorithms. In the past, our cryptographic protocol configurations involved algorithms that were widely trusted to be secure. The client and server configured a priority for their algorithms of choice and they picked the more appropriate ones from their negotiated prioritized order. Now, the industry has two families of algorithms, the “trusted classical” and the “trusted post-quantum” algorithms. Given that a CRQC is not available, both classical and post-quantum algorithms are considered secure. Thus, there is a paradigm shift that calls for a decision in the priority vendors should enforce on the client and server configurations regarding the “secure classical” or “secure post-quantum” algorithms.

Figure 1 shows a typical PQ-hybrid key exchange in TLS.

Figure 1: A typical TLS 1.3 handshake

In the example in Figure 1, the client advertises support for PQ-hybrid algorithms with ECDH curve P256 and quantum-resistant ML-KEM-768, ECDH curve P256 and quantum-resistant Kyber-512 Round 3, and classical ECDH with P256. The client also sends a Keyshare value for classical ECDH with P256 and for PQ-hybrid P256+MLKEM768. The Keyshare values include the client’s public keys. The client does not include a Keyshare for P256+Kyber512, because that would increase the size of the ClientHello unnecessarily and because ML-KEM-768 is the ratified version of Kyber Round 3, and so the client chose to only generate and send a P256+MLKEM768 public key. Now let’s say that the server supports ECDH curve P256 and PQ-hybrid P256+Kyber512, but not P256+MLKEM768. Given the groups and the Keyshare values the client included in the ClientHello, the server has the following two options:

Use the client P256 Keyshare to negotiate a classical key exchange, as shown in Figure 1. Although one might assume that the P256+Kyber512 Keyshare could have been used for a quantum-resistant key exchange, the server can pick to negotiate only classical ECDH key exchange with P256, which is not resistant to a CRQC.
Send a Hello Retry Request (HRR) to tell the client to send a PQ-hybrid Keyshare for P256+Kyber512 in a new ClientHello (Figure 2). This introduces a round trip, but it also forces the peers to negotiate a quantum-resistant symmetric key.

Note: A round-trip could take 30-50 ms in typical Internet connections.

Previously, some servers were using the Keyshare value to pick the key exchange algorithm (option 1 above). This generally allowed for faster TLS 1.3 handshakes that did not require an extra round-trip (HRR), but in the post-quantum scenario described earlier, it would mean the server does not negotiate a quantum-resistant algorithm even though both peers support it.

Such scenarios could arise in cases where the client and server don’t deploy the same version of a new algorithm at the same time. In the example in Figure 1, the server could have been an early adopter of the post-quantum algorithm and added support for P256+Kyber512 Round 3. The client could subsequently have upgraded to the ratified post-quantum algorithm with ML-KEM (P256+MLKEM768). AWS doesn’t always control both the client and the server. Some AWS services have adopted the earlier versions of Kyber and others will deploy ML-KEM-768 from the start. Thus, such scenarios could arise while AWS is in the post-quantum migration phase.

Note: In these cases, there won’t be a connection failure; the side-effect is that the connection will use classical-only algorithms although it could have negotiated PQ-hybrid.

These intricacies are not specific to AWS. Other industry peers have been thinking about these issues, and they have been a topic of discussion in the Internet Engineering Task Force (IETF) TLS Working Group. The issue of potentially negotiating a classical key exchange although the client and server support quantum-resistant ones is discussed in the Security Considerations of the TLS Key Share Prediction draft (draft-davidben-tls-key-share-prediction). To address some of these concerns, the Transport Layer Security (TLS) Protocol Version 1.3 draft (draft-ietf-tls-rfc8446bis), which is the draft update of TLS 1.3 (RFC 8446), introduces text about client and server behavior when choosing key exchange groups and the use of Keyshare values in Section 4.2.8. The TLS Key Share Prediction draft also tries to address the issue by providing DNS as a mechanism for the client to use a proper Keyshare that the server supports.

Prioritizing quantum resistance

In a typical TLS 1.3 handshake, the ClientHello includes the client’s key exchange algorithm order of preferences. Upon receiving the ClientHello, the server responds by picking the algorithms based on its preferences.

Figure 2 shows how a server can send a HelloRetryRequest (HRR) to the client in the previous scenario (Figure 1) in order to request the negotiation of quantum-resistant keys by using P256+Kyber512. This approach introduces an extra round trip to the handshake.

Figure 2: An HRR from the server to request the negotiation of mutually supported quantum-resistant keys with the client

AWS services that terminate TLS 1.3 connections will take this approach. They will prioritize quantum resistance for clients that advertise support for it. If the AWS service has added quantum-resistant algorithms, it will honor a client-supported post-quantum key exchange even if that means that the handshake will take an extra round trip and the PQ-hybrid key exchange will include minor processing overhead (ML-KEM is almost performant as ECDH). A typical round trip in regionalized TLS connections today is usually under 50 ms and won’t have material impact to the connection performance. In the post-quantum transition, we consider clients that advertise support for quantum-resistant key exchange to be clients that take the CRQC risk seriously. Thus, the AWS server will honor that preference if the server supports the algorithm.

Pull Request 4526 introduces this behavior in s2n-tls, the AWS open source, efficient TLS library built over other crypto libraries like OpenSSL libcrypto or AWS libcrypto (AWS-LC). When built with s2n-tls, s2n-quic handshakes will also inherit the same behavior. s2n-quic is the AWS open source Rust implementation of the QUIC protocol.

What AWS customers can do to verify post-quantum key exchanges

AWS services that have already adopted the behavior described in this post include AWS KMS, ACM, and Secrets Manager TLS endpoints, which have been supporting post-quantum hybrid key exchange for a few years already. Other endpoints that will deploy quantum-resistant algorithms will inherit the same behavior.

AWS customers that want to take advantage of new quantum-resistant algorithms introduced in AWS services are expected to enable them on the client side or the server side of a customer-managed endpoint. For example, if you are using the AWS Common Runtime (CRT) HTTP client in the AWS SDK for Java v2, you would need to enable post-quantum hybrid TLS key exchanges with the following.

SdkAsyncHttpClient awsCrtHttpClient = AwsCrtAsyncHttpClient.builder()
            .postQuantumTlsEnabled(true)
            .build();

The AWS KMS and Secrets Manager documentation includes more details for using the AWS SDK to make HTTP API calls over quantum-resistant connections to AWS endpoints that support post-quantum TLS.

To confirm that a server endpoint properly prioritizes and enforces the PQ algorithms, you can use an “old” client that sends a PQ-hybrid Keyshare value that the PQ-enabled server does not support. For example, you could use s2n-tls built with AWS-LC (which supports the quantum-resistant KEMs). You could use a client TLS policy (PQ-TLS-1-3-2023-06-01) that is newer than the server’s policy (PQ-TLS-1-0-2021-05-24). That will lead the server to request the client by means of an HRR to send a new ClientHello that includes P256+MLKEM768, as shown following.

./bin/s2nd -c PQ-TLS-1-0-2021-05-24 localhost 4444
sudo tcpdump port 4444 -w hrr-capture.pcap
./bin/s2nc localhost 4444 -c PQ-TLS-1-3-2023-06-01 -i

The hrr-capture.pcap packet capture will show the negotiation and the HRR from the server.

To confirm that a server endpoint properly implements the post-quantum hybrid key exchanges, you can use a modern client that supports the key exchange and connect against the endpoint. For example, using the s2n-tls client built with AWS-LC (which supports the quantum-resistant KEMs), you could try connecting to a Secrets Manager endpoint by using a post-quantum TLS policy (for example, PQ-TLS-1-2-2023-12-15) and observe the PQ hybrid key exchange used in the output, as shown following.

./bin/s2nc -c PQ-TLS-1-2-2023-12-15 secretsmanager.us-east-1.amazonaws.com 443
CONNECTED:
Handshake: NEGOTIATED|FULL_HANDSHAKE|MIDDLEBOX_COMPAT
Client hello version: 33
Client protocol version: 34
Server protocol version: 34
Actual protocol version: 34
Server name: secretsmanager.us-east-1.amazonaws.com
Curve: NONE
KEM: NONE
KEM Group: SecP256r1Kyber768Draft00
Cipher negotiated: TLS_AES_128_GCM_SHA256
Server signature negotiated: RSA-PSS-RSAE+SHA256
Early Data status: NOT REQUESTED
Wire bytes in: 6699
Wire bytes out: 1674
s2n is ready
Connected to secretsmanager.us-east-1.amazonaws.com:443

An alternative would be using the Open Quantum Safe (OQS) for OpenSSL client to do the same.

As another example, if you want to transfer a file over a quantum-resistant SFTP connection with AWS Transfer Family, you would need to configure a PQ cryptography SSH security policy on your AWS File Transfer SFTP endpoint (for example, TransferSecurityPolicy-2024-01) and enable quantum-resistant SSH key exchange in the SFTP client. Note that in SSH/SFTP, although the AWS server side will advertise the quantum-resistant schemes as higher priority, the client picks the key exchange algorithm. So, a client that supports PQ cryptography would need to have the PQ algorithms configured with higher priority (as described in the Shared Responsibility Model). For more details, refer to the AWS Transfer Family documentation.

Conclusion

Cryptographic migrations can introduce intricacies to cryptographic negotiations between clients and servers. During the migration phase, AWS services will mitigate the risks of these intricacies by prioritizing post-quantum algorithms for customers that advertise support for these algorithms—even if that means a small slowdown in the initial negotiation phase. While in the post-quantum migration phase, customers who choose to enable quantum resistance have made a choice which shows that they consider the CRQC risk as important. To mitigate this risk, AWS will honor the customer’s choice, assuming that quantum resistance is supported on the server side.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support. For more details regarding AWS PQC efforts, refer to our PQC page.

Key use cases for fine-grained access control in analytics

Solution overview

Prerequisites

Set up infrastructure in the producer account

Set up Lake Formation for the data engineer in the producer account

Register the Amazon S3 location as the data lake location

Generate TPC-DS tables in the producer account

Create an EMR Serverless application

Create a Workspace

Share the database and TPC-DS tables to the consumer account

Grant database permissions to the consumer account

Grant table permissions to the consumer account

Set up Lake Formation in the consumer account

Create a resource link for the shared database

Grant permissions on the resource link to the EMR job runtime roles

Grant table permissions for the finance analyst

Grant table permissions for the product analyst

Verify access using interactive notebooks from EMR Studio

Considerations and limitations

Clean up

Conclusion

About the Authors

Understanding Volkswagen Autoeuropa’s challenges

Envisioning a data solution for Volkswagen Autoeuropa

Solution overview

Data solution features

Data solution architecture

User journey

Data producer

Data consumer

Data solution administrator

Conclusion

About the Authors

Getting started

Solution overview

Prerequisites

Configuring DBeaver to access subscribed data assets

Integration with other applications

Connect to Tableau Desktop

Connect to Microsoft Power BI

Connect to SQL Workbench

Cleanup

Conclusion

About the Authors

Use Cases

Q Developer Use Cases

Training and Onboarding

Troubleshooting Issues

Best Practices

Chatbot Use Cases

Using Chatbot and Q Developer Together

Architecture Diagram

Conclusion

JobMode property

How CreateJob API works with the new JobMode property

How CloudFormation works with the new JobMode property

Console experience

Conclusion

About the Authors

Why is it important to mitigate bot traffic?

Before starting

How does the Challenge action work, and what are the benefits?

Implementation options

Option 1: Implementing the Challenge action through a custom rule

Option 2: Implementing the Challenge action by using Bot Control

Next steps

Conclusion

Recommended reading

Challenges with stale statistics

Solution overview

Benchmark evaluation

Conclusion

About the authors

Background

Security objective 1: Implementing networks that are isolated from the internet

Security objective 2: Implement a data perimeter using VPC endpoint policies

Security objective 3: Enable private connectivity to AWS service API endpoints for on-premises environments

Security objective 4: Align with specific compliance requirements

Conclusion

Personas