Tag Archives: Analytics

Introducing Express brokers for Amazon MSK to deliver high throughput and faster scaling for your Kafka clusters

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/introducing-express-brokers-for-amazon-msk-to-deliver-high-throughput-and-faster-scaling-for-your-kafka-clusters/

Today, we’re announcing the general availability of Express brokers, a new broker type for Amazon Managed Streaming for Apache Kafka (Amazon MSK). It’s designed to deliver up to three times more throughput per-broker, scale up to 20 times faster, and reduce recovery time by 90 percent as compared to Standard brokers running Apache Kafka. Express brokers come preconfigured with Kafka best practices by default, support Kafka APIs, and provide the same low latency performance that Amazon MSK customers expect, so they can continue using existing client applications without any changes.

Express brokers provide improved compute and storage elasticity for Kafka applications when using Amazon MSK provisioned clusters. Amazon MSK is a fully-managed AWS service that makes it easier for you to build and run highly available and scalable applications based on Apache Kafka.

Let’s dive deeper into some of the key features that Express brokers have and the benefits they provide:

  • Easier operations with hands-free storage management – Express brokers offer unlimited storage without preprovisioning, eliminating disk-related bottlenecks. Cluster sizing is simpler, requiring only ingress and egress throughput divided by recommended per-broker throughput. This removes the need for proactive disk capacity monitoring and scaling, simplifying cluster management and improving resilience by eliminating a potential failure source.
  • Fewer brokers with up to three times throughput per broker – Higher throughput per broker allows for smaller clusters for the same workload. Standard brokers’ throughput must account for client traffic and background operations, with m7g.16xl Standard brokers safely handling 154 MBps ingress. Express brokers use opinionated settings and resource isolation, enabling m7g.16xl size instances to safely manage up to 500 MBps ingress without compromising performance or availability during cluster events.
  • Higher utilization with 20 times faster scaling – Express brokers reduce data movement during scaling, making them up to 20 times faster than Standard brokers. This allows for more quicker and reliable cluster resizing. You can monitor each broker’s ingress throughput capacity and add brokers within minutes, eliminating the need for over-provisioning in anticipation of traffic spikes.
  • Higher resilience with 90 percent faster recovery – Express brokers are designed for mission-critical applications requiring high resilience. They come preconfigured with best-practice defaults, including 3-way replication (RF=3), which reduce failures due to misconfiguration. Express brokers also recover 90 percent faster from transient failures compared to standard Apache Kafka brokers. Express brokers’ rebalancing and recovery use minimal cluster resources, simplifying capacity planning. This eliminates the risk of increased resource utilization and the need for continuous monitoring when right-sizing clusters.

You have choice options in Amazon MSK depending on your workload and preference:

MSK provisioned MSK Serverless
Standard brokers Express brokers
Configuration range Most flexible Flexible Least flexible
Cluster rebalancing Customer managed Customer managed
but up to 20x faster
MSK managed
Capacity management Yes Yes (compute only) No
Storage management Yes No No

Express brokers lower costs, provide higher resiliency, and lower operational overhead, making them the best choice for all Kafka workloads. If you prefer to use Kafka without managing any aspect of its capacity, its configuration, or how it scales, then you can choose Amazon MSK Serverless. This provides a fully abstracted Apache Kafka experience that eliminates the need for any infrastructure management, scales automatically, and charges you on a pay-per-use consumption model that doesn’t require you to optimize resource utilization.

Getting started with Express brokers in Amazon MSK
To get started with Express brokers, you can use the Sizing and Pricing worksheet that Amazon MSK provides. This worksheet helps you estimate the cluster size you’ll need to accommodate your workload and also gives you a rough estimate of the total monthly cost you’ll incur.

The throughput requirements of your workload are the primary factor in the size of your cluster. You should also consider other factors, such as partition and connection count to arrive at the size and number of brokers you’ll need for your cluster. For example, if your streaming application needs 30 MBps of data ingress (write) and 80 MBps data egress (read) capacity, you can use three express.m7g.large brokers to meet your throughput needs (assuming the partition count for your workload is within the maximum number of partitions that Amazon MSK recommends for an m7g.large instance).

The following table shows the recommended maximum ingress, egress, and partition counts per instance size for sustainable and safe operations. You can learn more about these recommendations in the Best practices section of Amazon MSK Developer Guide.

Instance size Ingress (MBps) Egress (MBps)
express.m7g.large 15.6 31.2
express.m7g.4xlarge 124.9 249.8
express.m7g.16xlarge 500.0 1000.0

Once you have decided the number and size of Express brokers you’ll need for your workload, go to the AWS Management Console or use the CreateCluster API to create an Amazon MSK provisioned cluster.

When you create a new cluster on the Amazon MSK console, in the Broker type option, choose Express brokers and then select the mount of compute capacity that you want to provision for the broker. As you can see in the screen shot, you can use Apache Kafka 3.6.0 version and Graviton-based instances for Express brokers. You don’t need to preprovision storage for Express brokers.

You can also customize some of these configurations to further fine-tune the performance of your clusters according to your own preferences. To learn more, visit Express broker configurations in the Amazon MSK developer guide.

To create an MSK cluster in the AWS Command Line Interface (AWS CLI), use the create-cluster command.

aws kafka create-cluster \
    --cluster-name "channy-express-cluster" \
    --kafka-version "3.6.0" \
    --number-of-broker-nodes 3 \
    --broker-node-group-info file://brokernodegroupinfo.json

A JSON file named brokernodegroupinfo.json specifies the three subnets over which you want Amazon MSK to distribute the broker nodes.

{
    "InstanceType": "express.m7g.large",
    "BrokerAZDistribution": "DEFAULT",
    "ClientSubnets": [
        "subnet-0123456789111abcd",
        "subnet-0123456789222abcd",
        "subnet-0123456789333abcd"
    ]
}

Once the cluster is created, you can use the bootstrap connection string to connect your clients to the cluster endpoints.

With Express brokers, you can scale vertically (changing instance size) or horizontally (adding brokers). Vertical scaling doubles throughput without requiring partition reassignment. Horizontal scaling adds brokers in sets of three and and allows you to create more partitions, but it requires partition reassignment for new brokers to serve traffic.

A major benefit of Express brokers is that you can add brokers and rebalance partitions within minutes. On the other hand, rebalancing partitions after adding Standard brokers can take several hours. The graph below shows the time it took to rebalance partitions after adding 3 Express brokers to a cluster and reassigning 2000 partitions to each of the new brokers.

As you can see, it took approximately 10 minutes to reassign these partitions and utilize the additional capacity of the new brokers. When we ran the same experiment on an equivalent cluster comprising of Standard brokers, partition reassignment took over 24hours.

To learn more about the partition reassignment, visit Expanding your cluster in the Apache Kafka documentation.

Things to know
Here are some things you should know about Express brokers:

  • Data migration – You can migrate the data in your existing Kafka or MSK cluster to a cluster composed of Express brokers using Amazon MSK Replicator, which copies both the data and the metadata of your cluster to a new cluster.
  • Monitoring – You can monitor your cluster composed of Express brokers in the cluster and at the broker level with Amazon CloudWatch metrics and enable open monitoring with Prometheus to expose metrics using the JMX Exporter and the Node Exporter.
  • Security – Just like with other broker types, Amazon MSK integrates with AWS Key Management Service (AWS KMS) to offer transparent server-side encryption for the storage in Express brokers. When you create an MSK cluster with Express brokers, you can specify the AWS KMS key that you want Amazon MSK to use to encrypt your data at rest. If you don’t specify a KMS key, Amazon MSK creates an AWS managed key for you and uses it on your behalf.

Now available
The Express broker type is available today in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland), and Europe (Stockholm) Regions.

You pay an hourly rate for Apache Kafka broker instance usage (billed at one-second resolution) for Express brokers, with varying fees depending on the size of the broker instance and active brokers in your MSK clusters. You also pay a per-GB rate for data written to an Express broker (billed at per-byte resolution). To learn more, visit the Amazon MSK pricing page.

Give Express brokers for Amazon MSK a try in the Amazon MSK console. For more information, visit the Amazon MSK Developer Guide and send feedback to AWS re:Post for Amazon MSK or through your usual AWS support contacts.

Channy

Write queries faster with Amazon Q generative SQL for Amazon Redshift

Post Syndicated from Raghu Kuppala original https://aws.amazon.com/blogs/big-data/write-queries-faster-with-amazon-q-generative-sql-for-amazon-redshift/

Amazon Redshift is a fully managed, AI-powered cloud data warehouse that delivers the best price-performance for your analytics workloads at any scale. Amazon Q generative SQL brings the capabilities of generative AI directly into the Amazon Redshift query editor. Amazon Q generative SQL for Amazon Redshift was launched in preview during AWS re:Invent 2023. With over 85,000 queries executed in preview, Amazon Redshift announced the general availability in September 2024.

Amazon Q generative SQL for Amazon Redshift uses generative AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly within Amazon Redshift, accelerating the query authoring process for users and reducing the time required to derive actionable data insights. It provides a conversational interface where users can submit queries in natural language within the scope of their current data permissions. Generative SQL uses query history for better accuracy, and you can further improve accuracy through custom context, such as table descriptions, column descriptions, foreign key and primary key definitions, and sample queries. Custom context enhances the AI model’s understanding of your specific data model, business logic, and query patterns, allowing it to generate more relevant and accurate SQL recommendations. It enables you to get insights faster without extensive knowledge of your organization’s complex database schema and metadata.

Within this feature, user data is secure and private. Your data is not shared across accounts. Your queries, data and database schemas are not used to train a generative AI foundational model (FM). Your input is used as contextual prompts to the FM to answer only your queries.

In this post, we show you how to enable the Amazon Q generative SQL feature in the Redshift query editor and use the feature to get tailored SQL commands based on your natural language queries. With Amazon Q, you can spend less time worrying about the nuances of SQL syntax and optimizations, allowing you to concentrate your efforts on extracting invaluable business insights from your data.

Solution overview

At a high level, the feature works as follows:

  1. For generating the SQL code, you can write your query request in plain English within the conversational interface in the Redshift query editor.
  2. The query editor sends the query context to the underlying Amazon Q generative SQL platform, which uses generative AI to generate SQL code recommendations based on your Redshift metadata.
  3. You receive the generated SQL code suggestions within the same chat interface.

The following diagram illustrates this workflow.

Your content processed by generative SQL is not stored or used by AWS for service improvement.

Amazon Q generative SQL uses a large language model (LLM) and Amazon Bedrock to generate the SQL query. AWS uses different techniques, such as prompt engineering and Retrieval Augmented Generation (RAG), to query the model based on your context:

  • The database you’re connected to
  • The schema you’re working on
  • Your query history
  • Optionally, the query history of other users connected to the same endpoint

Amazon Q generative SQL is conversational, and you can ask it to refine a previously generated query.

In the following sections, we demonstrate how to enable the generative SQL feature in the Redshift query editor and use it to generate SQL queries using natural language.

Prerequisites

To get started, you need an Amazon Redshift Serverless endpoint or an Amazon Redshift provisioned cluster. For this post, we use Redshift Serverless. Refer to Easy analytics and cost-optimization with Amazon Redshift Serverless to get started.

Enable the Amazon Q generative SQL feature in the Redshift query editor

If you’re using the feature for the first time, you need to enable the Amazon Q generative SQL feature in the Redshift query editor.

To enable the feature, complete the following steps:

  1. On the Amazon Redshift console, open the Redshift Serverless dashboard.
  2. Choose Query data.

You can also choose Query Editor V2 in the navigation pane of the Amazon Redshift console.

When you open the Redshift query editor, you will see the new icon for Amazon Q next to the database dropdown menu on the top of the query editor console.

If you choose the Amazon Q icon, you will see the message “Amazon Redshift query editor V2 now supports generative SQL functionality. Contact your administrator to activate this feature in Settings.” If you’re not the administrator, you need to work with the account administrator to enable this feature.

  1. If you’re the administrator, choose the hyperlink in the message, or go to the settings icon and choose Generative SQL settings.
  2. In the Generative SQL settings section, select Q generative SQL, which will turn on Amazon Q generative SQL for all users of the account.

Amazon Q generative SQL is personalized to your database and, based on the updates or conversations you have had with the feature, will apply those learnings to other user conversations who connect to the same database with their own credentials. In the generative SQL settings, you can see the instructions to grant the sys:monitor role to a user or role.

  1. Choose Save.

You will receive a confirmation that the Amazon Q generative SQL settings have been successfully updated.

Load notebooks with sample TPC-DS data

The Redshift query editor comes with sample data and SQL notebooks that you can load into a sample database and corresponding schema. For this post, we use TPC-DS for a decision support benchmark.

We start by loading the TPC-DS data into the Redshift database. When you load this data, the schema tpcds is updated with sample data. We also use the provided notebooks with the tpcds schema to run queries to build a query history.

Complete the following steps:

  1. Connect to your Redshift Serverless workgroup or Redshift provisioned cluster.
  2. Navigate to the sample_data_dev database to view the sample databases available for running the generative SQL feature.
  3. Hover over the tpcds schema and choose Open sample notebooks.
  4. In the Create sample database pop-up message, choose Create.

In a few seconds, you will see the notification that the database sample_data_dev is created successfully and tpcds sample data is loaded successfully. Two sample notebooks for the schema are also generated.

  1. Choose Run all on each notebook tab.

This will take a few minutes to run and will establish a query history for the tpcds data.

This step is not mandatory for using the feature for your organization’s data warehouse.

Use Amazon Q to generate SQL queries from natural language

Now that the Amazon Q generative SQL feature is enabled and ready for use, open a new notebook and choose the Amazon Q icon to open a chat pane in the Redshift query editor.

Amazon Q generative SQL is personalized to your schema. It uses metadata from database schemas to improve the SQL query suggestions. Optionally, administrators can allow the use of the account’s query history to further improve the generated SQL. This can be enabled by running the following GRANT commands to provide access to your query history to other roles or users:

GRANT ROLE SYS:MONITOR to "IAMR:role-name";
GRANT ROLE SYS:MONITOR to "IAM:user-name";
GRANT ROLE SYS:MONITOR to "database-username";

This optional step allows users to make query monitoring history available to other users connected to the same database.

Let’s get started with some query examples.

  1. First, make sure you’re connected to sample_data_dev
  2. Let’s ask the query “What are the top 10 stores in sales in 1998?”

This generates a SQL query. Amazon Q generative SQL is also personalized to your data domain. You will notice that it joins to the Store table to retrieve store_name.

  1. Choose Add to notebook under the query to add the generated SQL.

Our query runs successfully and shows that the store able has the most sales.

  1. Amazon Q is personalized to your conversation. Suppose you want to know what the top selling item was for store able. You can ask this question “What was the unique identifier of the top selling item for the store ‘able’?”

The results show the top selling item. However, the query didn’t filter on the year.

  1. Let’s ask Amazon Q to give us the top selling item for store able in 1998. Instead of repeating the whole question again, you can simply ask “Can you filter by the year 1998?”

Now we have the top selling item for store able for 1998.

  1. To display the item description, you can ask the query “Can you modify the query to include its name and description?”

Amazon Q added the join to the item table and the query ran successfully.

Now that we have done some basic queries, let’s do some deeper analysis.

  1. Let’s ask Amazon Q “Can you give me aggregated store sales, for each county by quarter for all years?”

The answer is correct, but let’s ask a follow-up to include the state.

  1. Ask the follow-up question: “Can you include state?”

This answer looks good; you can also add an ORDER BY clause if you want the data sorted or ask Amazon Q to add that.

So far, we have only been looking at store_sales data. The TPC-DS data contains data for other sales channels, including web_sales and catalog_sales.

  1. Let’s ask Amazon Q “Can you give me the total sales for 1998, from different sales channels, using a union of the sales data from different channels?”

Let’s dive deeper into some other capabilities of Amazon Q generative SQL.

  1. Let’s try logging in with a different user and see how Amazon Q generative SQL interacts with that user. We have created User3 and granted the sys:monitor
  2. Logged in as User3, let’s ask the original question of “What are the top 10 stores in sales in 1998?”

Amazon Q generative SQL is able to use the query history and provide SQL recommendations for User3’s prompts because they have access to the system metadata provided through the role sys:monitor.

Safety features

Amazon Q generative SQL has built-in safety features to warn if a generated SQL statement will modify data and will only run based on user permissions. To test this, let’s ask Amazon Q to “delete data from web_sales table.”

Amazon Q gives a message “I detected that this query changes your database. Only run this SQL command if that is appropriate.”

Now, still logged in as User3, choose Run to try to delete the web_sales data.

As expected, User3 gets a permission denied error, because they don’t have the necessary privileges to delete the web_sales table.

Custom context

Custom context is a feature that allows you to provide domain-specific knowledge and preferences, giving you fine-grained control over the SQL generation process.

The custom context is defined in a JSON file, which can be uploaded by the query editor administrator or can be added directly in the Custom context section in Amazon Q generative SQL settings.

This JSON file contains information that helps Amazon Q generative SQL better understand the specific requirements and constraints of your domain, enabling it to generate more targeted and relevant SQL queries.

By providing a custom context, you can influence factors such as:

  • The terminology and vocabulary used in the generated SQL
  • The level of complexity and optimization of the SQL queries
  • The formatting and structure of the SQL statements
  • The data sources and tables that should be considered

The custom context feature empowers you to take a more active role in shaping the SQL generation process, leading to SQL queries that are better suited to your data and business requirements.

In this post, we use the BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) sample dataset, consisting of three tables. BIRD represents a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing.

You can load the following BIRD sample dataset into your Redshift data warehouse to experiment with using custom contexts.

For this post, we demonstrate with three custom contexts.

TablesToInclude

TablesToInclude specifies a set of tables that are considered for SQL generation. This field is crucial when you want to limit the scope of SQL queries to a defined subset of available tables. It can help optimize the generation process by reducing unnecessary table references.

Let’s ask Amazon Q “List the distinct translated title and the set code of all cards translated into Spanish.”

This SQL unnecessarily uses the public.cards table. The public.set_translations table contains the data sufficient to answer the question.

We can add the following TablesToInclude custom context JSON:

{
  "resources": [
    {
      "ResourceId":"Serverless:Serverless-workgroup-name",
      "ResourceType": "REDSHIFT_WAREHOUSE",
      "TablesToInclude": [
        "bird.public.set_translations"
      ]
    }
  ]
}

After adding the custom context, the unwanted joins are eliminated and the correct SQL is generated.

ColumnAnnotations

ColumnAnnotations allows you to provide metadata or annotations specific to individual columns in your data tables. These annotations can offer valuable insights into the definitions and characteristics of the columns, which can be beneficial in guiding the SQL generation process.

Let’s ask Amazon Q to “Show me the unconverted mana cost and name for all the cards created by Rob Alexander.”

The generated SQL points to the column convertedmanacost, which doesn’t give a value for unconverted mana cost. The manacost column gives the unconverted mana cost.

Let’s add this using ColumnAnnotations in the custom context JSON:

{
  "resources": [
    {
      "ResourceId": "Serverless: Serverless-workgroup-name",
      "ResourceType": "REDSHIFT_WAREHOUSE",
      "ColumnAnnotations":
         {"bird.public.cards": { "manaCost": "manaCost is the unconverted mana"} }
    }
  ]
}

After the custom context is added, the correct SQL gets generated.

CuratedQueries

CuratedQueries provides a set of predefined question and answer pairs. In this set, the questions are written in natural language and the corresponding answers are the SQL queries that should be generated to address those questions.

These examples serve as a valuable reference point for Amazon Q generative SQL, helping it understand the types of queries it is expected to generate. You can guide Amazon Q generative SQL with the desired format, structure, and content of the SQL queries it should produce.

Let’s ask Amazon Q “List down the name of artists for cards in Chinese Simplified.”

Although the join key multiverseid exists, it is not correct.

Let’s add the following using CuratedQueries in the custom context JSON:

{
  "resources": [
    {
      "ResourceId": "Serverless: Serverless-workgroup-name",
      "ResourceType": "REDSHIFT_WAREHOUSE",
      "CuratedQueries": [
        {
          "Question": "List down the name of artists for cards in Spanish.",
          "Answer": "SELECT artist FROM public.cards c JOIN public.foreign_data f ON c.uuid = f.uuid WHERE f.language = 'Spanish';"
        }
      ]
    }
  ]
}

After the custom context is added, the correct SQL gets generated.

Additional features

In this section, we discuss the supporting features available with Amazon Q generative SQL feature for Redshift query editor:

Provide feedback

Amazon Q generative SQL allows you to provide feedback on the SQL queries it generates, helping improve the quality and relevance of the SQL over time. This feedback mechanism is accessible through the Amazon Q generative SQL interface, where you can indicate whether the generated SQL was helpful or not.

If you find the generated SQL to not be helpful, you can categorize the feedback into the following areas:

  • Incorrect Tables/Columns – This indicates that the SQL references the wrong tables or columns, or is missing essential tables or columns
  • Incorrect Predicates/Literals/Group By – This category covers issues with the SQL’s filter conditions, literal values, or grouping logic
  • Incorrect SQL Structure – This feedback suggests that the overall structure or syntax of the generated SQL is not correct
  • Other – This option allows you to provide feedback that doesn’t fit into the preceding categories

In addition to selecting the appropriate feedback category, you can also provide free text comments to elaborate on the specific issues or inaccuracies you found in the generated SQL. This additional information can be valuable for Amazon Q to better understand the problems and make improvements.

By actively providing this feedback, you play a crucial role in refining the generation capabilities of Amazon Q generative SQL. The feedback you provide helps the service learn from its mistakes, leading to more accurate and relevant SQL queries that better meet your needs over time.

This feedback loop is an important part of Amazon Q generative SQL’s continuous improvement, because it allows the service to adapt and evolve based on your specific requirements and use cases.

Regenerate SQL

The Regenerate SQL option will prompt Amazon Q to generate a new SQL query based on the same natural language prompt, using its learning and improvement capabilities to provide a potentially better-suited response.

Refresh database

By choosing Refresh database, you can instruct Amazon Q generative SQL to re-fetch and update the metadata information about the connected database.

This metadata includes:

  • Schema definitions – The structure and organization of your database schemas
  • Table definitions – The names, columns, and other properties of the tables in your database
  • Column definitions – The data types, names, and other characteristics of the columns within your database tables

Tips and techniques

To get more accurate SQL recommendations from Amazon Q generative SQL, keep in mind the following best practices:

  • Be as specific as possible. Instead of asking for total store sales, ask for total sales across all sales channels if that is what you need.
  • Add your schema to the path. For example:
    set search_path to tpcds;

  • Iterate when you have complex requests and verify the results. For example, ask which county has the most sales in 2000 and follow up with which item had the most sales.
  • Ask follow-up questions to make queries more specific.
  • If an incomplete response is generated, instead of rephrasing the entire request, provide specific instructions to Amazon Q as a continuation to the prior question.

Clean up

To avoid incurring future charges, delete the Redshift cluster you provisioned as part of this post.

Conclusion

Amazon Q generative SQL for Amazon Redshift simplifies query authoring and increases productivity by allowing you to express queries in natural language and receive SQL code recommendations. This post demonstrated how the Amazon Q generative SQL feature can accelerate data analysis by reducing the time required to write SQL queries. By using natural language processing and seamlessly converting it into SQL, you can boost productivity without requiring an in-depth understanding of your organization’s database structures. Importantly, the robust security measures of Amazon Redshift remain fully enforced, and the quality of the generated SQL continues to improve over time by enabling query history sharing across users.

Get started on your Amazon Q generative SQL journey with Amazon Redshift today by implementing the solution in this post or by referring to Interacting with Amazon Q generative SQL. For pricing information, refer to Amazon Q generative SQL pricing. Also, please try other Redshift generative AI features such as Amazon Redshift Integration with Amazon Bedrock and Amazon Redshift Serverless AI-driven scaling and optimization.


About the authors

Raghu Kuppala is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Sushmita Barthakur is a Senior Data Solutions Architect at Amazon Web Services (AWS), supporting Enterprise customers architect their data workloads on AWS. With a strong background in data analytics, she has extensive experience helping customers architect and build enterprise data lakes, ETL workloads, data warehouses and data analytics solutions, both on-premises and the cloud. Sushmita is based out of Tampa, FL and enjoys traveling, reading and playing tennis.

Xiao Qin is a senior applied scientist with the Learned Systems Group (LSG) at Amazon Web Services (AWS). He studies and applies machine learning techniques to solve data management problems. He is one of the developers that build the Amazon Q generative SQL capability.

Erol MurtezaogluErol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Phil Bates was a Senior Analytics Specialist Solutions Architect at AWS, before retiring, with over 25 years of data warehouse experience.

Amazon OpenSearch Service launches the next-generation OpenSearch UI

Post Syndicated from Hang Zuo original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-launches-the-next-generation-opensearch-ui/

Amazon OpenSearch Service launches a modernized operational analytics experience that can provide comprehensive observability spanning multiple data sources, so that you can gain insights from OpenSearch and other integrated data sources in one place. The launch also introduces OpenSearch Workspaces that provides tailored experience for popular use cases and supports access control, so that you can create a private space for your use case and share it only to your collaborators. With the next-generation user interface (UI), the Discover experience has been improved to simplify interactive analysis, so that you can easily utilize features such as natural language query generation to gain insights from your data.

Multiple Data Source: You might have already used OpenSearch Dashboards to provide an operational analytics experience for your OpenSearch clusters. OpenSearch Dashboards is co-located with a cluster, so that each OpenSearch Dashboards can only work with one cluster. And as you scale up your workload across multiple clusters, there is not a unified experience to analyze your data in one place. In comparison, the next-generation OpenSearch UI is designed to work across multiple OpenSearch clusters to aggregate the comprehensive insights in one place. An OpenSearch application is an instance of the next-generation OpenSearch UI. Currently, OpenSearch applications can be associated with multiple OpenSearch clusters (above version 1.3), Amazon OpenSearch Service Serverless collections, and integrated data sources such as Amazon S3. Each OpenSearch cluster can be associated with multiple OpenSearch applications, in addition to its co-located OpenSearch Dashboards that will remain functional.

Workspace: With workspaces, you can easily create your use case specific contents in a private space and manage the permissions in team collaboration. Workspace provide curated experiences for popular use cases such as Observability, Security Analytics and Search, so that you can find it straightforward to build contents for your use case. Workspace supports collaborator management, so that you can share your workspace only to your intended collaborators, and manage the permissions for each collaborator.

Discover: The improved Discover feature now provides a unified log exploration experience that adds the support for SQL and Piped Processing Language (PPL), in addition to the existing support for DQL and Lucene. Discover features a new data selector to support multiple data sources, a new visual design, query autocomplete and natural language query generation for improved usability. With the enhanced Discover interface, you can now analyze data from multiple sources without switching tools, reducing complexity and improving efficiency.

Solution Overview

The following diagram illustrates architecture of the OpenSearch Dashboards.

The following diagram illustrates the next-generation OpenSearch UI architecture.

In the following sections, we discuss the following topics

  1. The process of creating an application
  2. Setting up and using the new Workspaces functionality
  3. The enhanced Discover experience

We’ll demonstrate how these improvements streamline data analysis, foster collaboration, and empower you to extract insights more efficiently across various use cases.

Create an application:

To begin using the next-generation OpenSearch UI, you can first create an application. An application is an instance of the OpenSearch UI (Dashboards), and you have the flexibility to create multiple applications within a single account. To create a new application, complete the following steps:

  1. On the Amazon OpenSearch Service console, choose OpenSearch UI (Dashboards)under Central management in the navigation panel.
  2. Choose Create application.
  3. For application name, enter a descriptive name for your new application.
  4. AWS Identity and Access Management (IAM) is the default authentication mechanism. Optionally, you can select Authentication with IAM Identity Center (IDC), so that you can use credentials and access management from your existing identity providers to manage user access.
  5. For OpenSearch application admins, specify the IAM principals or IDC users that will have permissions to update or delete the application configuration. You will automatically be set as the first admin.

This page lists all the existing applications under your account in the current AWS region. You can create new application from this page.

This page is the create application workflow. You can specify the application name, enable/disable IDC and define application admins to create an application.

After you configured these settings and created an application, your new OpenSearch application will be ready for you to associate data sources and start using the enhanced UI capabilities.

Associating data sources:

After you create your new OpenSearch application, the next step is to associate the relevant data sources. This allows you to connect the application to the necessary OpenSearch domains, collections, and other data sources.

  1. On the application details page, choose Manage data sources.

You will be presented with a list of all the OpenSearch data sources you have access to, including managed domains and serverless collections.

  1. Select the data sources you want to associate with this application.

OpenSearch domains below version v1.3 will not be compatible with the next-generation UI, and will be grayed out in the list. Additionally, if you need to connect to a domain within a virtual private cloud (VPC), you will need to authorize OpenSearch application as a new principal under its security configurations. If you need to connect to a collection within a VPC, you will need to configure its network policy to Private, enable AWS service private access with OpenSearch application.

  1. Choose Save to finalize the data source association.

Your OpenSearch application is now ready to use, with access to the connected data.

Working with the OpenSearch application:

To access your new OpenSearch application, you can either choose the application URL or choose Launch application on the application details page. After you’ve successfully logged in either with IAM or IDC, you’ll be directed to the application’s homepage. From here, you can choose to create a new workspace or navigate to an existing one that you have access to.

Creating a new workspace:

A workspace is a tailored experience for your use case and team collaboration. There are five types of workspaces: Observability, Security Analytics, Search, Essentials, and Analytics. You can click on the info button to learn more about each workspace type. Existing workspaces will be listed on the homepage. To create a new workspace, complete the following steps:

  1. Choose Create workspace.
  2. Enter a name for your workspace.
  3. Optionally, you can select a different color for the workspace icon for easier identification.
  4. Select the type of workspace you want to create: Observability, Security Analytics, Search, Essentials, or Analytics
  5. Add at least one data source for this workspace (from the list of data sources you previously associated with the application).

For this post, we create an Observability workspace named MyWorkspace and associate it with one Amazon OpenSearch Serverless collection and one Amazon OpenSearch Service managed cluster. You can always manage the data sources associated with a workspace, even after it has been created.

Invite Collaborators

After you create your new workspace, you can add users or groups as collaborators. Workspace collaborators are the users you want to invite to work with you in this workspace, and there are three available permission levels for collaborators: admin, read/write, and read-only. Read/write permission allows a collaborator to create, edit and delete the dashboards, visualizations, and saved queries within the workspace, whereas collaborators with read-only access can only view the results. Admin level gives a collaborator the same permissions as you to not only read/write but also update the configurations of the workspace or delete it.

To add collaborators to your workspace, complete the following steps:

  1. Choose Collaborators in the navigation panel.
  2. Choose Add collaborators.
  3. Choose the type of users you want to add as collaborators. You can add collaborators by their IAM Amazon Resource Name (ARN) or IDC username.
  4. Select a permission level for the collaborator from the three options: Read only, Read and write, and Admin

If you do not know the ARN of your intended collaborator, follow the instruction to check for their ARN, for example.

Improved navigation:

The improved navigation in workspaces provides a more contextual and purpose-built interface, ensuring that each workspace includes only the tools and features relevant to its use case. With enhanced clarity and better organization, the new navigation system is tailored to help you find the features you need quickly, improving overall productivity and minimizing time spent searching through menus.

Revamped Discover experience

Discover is now revamped to offer improved usability and efficiency. You can access multiple data sources, natural language query generation, a new data selector, and polished design with optimized data density, allowing you to effortlessly navigate and analyze your data:

  • Unified language selector – Discover now offers a unified language selector, allowing users to choose from SQL, PPL, Dashboards Query Language (DQL), or Lucene, making it convenient to work with your preferred query languages in one place.

  • Natural language query generation – Discover now supports natural language query building for PPL. Enter your questions in plain language, and Discover converts them to PPL syntax, making data exploration simpler and more accessible. This new feature empowers users of different skill levels to get insights without needing to fully understand the PPL syntax.

  • Powerful query autocomplete – The enhanced query bar includes autocomplete functionality and natural language query generation support, simplifying query building by offering relevant suggestions as you type, making it faster and more efficient to write complex queries

  • New data selector– The new data selector makes it straightforward to connect to multiple data sources, bringing data from Amazon OpenSearch Service domains and serverless collections, and Amazon S3 into a unified view.

Conclusion

In this post, we discussed the features of the next-generation OpenSearch UI. These improvements streamline data analytics, foster collaboration, and empower you to extract insights more efficiently across various use cases.

You can create your own OpenSearch UI applications today in the US East (N. Virginia), US West (N. California, Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), South America (São Paulo), Europe (Frankfurt, Ireland, London, Paris) and Canada (Central) Regions.


About the Authors

Hang (Arthur) Zuo is a Senior Product Manager with Amazon OpenSearch Service. Arthur leads the core experience in the next-gen OpenSearch UI and data migration to Amazon OpenSearch Service. Arthur is passionate about cloud technologies and building data products that help users and businesses gain actionable insights and achieve operational excellence.

Rushabh Vora is a Principal Product Manager for the OpenSearch project of Amazon Web Services. Rushabh leads core experiences in data exploration, dashboards, visualizations, reporting, and data management to help organizations unlock insights at scale. Rushabh is passionate about cloud technologies and building products that enable businesses to make data-driven decisions and achieve operational excellence.

Sohaib Katariwala is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service based out of Chicago, IL. His interests are in all things data and analytics. More specifically he loves to help customers use AI in their data strategy to solve modern day challenges.

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He works closely with customers on their OpenSearch journey across various use cases including vector search, observability, and security analytics.

Xenia Tupitsyna is a UX Designer at OpenSearch. She is working on user experiences across security analytics solutions, anomaly detection, alerting, and core dashboards.

Build up-to-date generative AI applications with real-time vector embedding blueprints for Amazon MSK

Post Syndicated from Francisco Morillo original https://aws.amazon.com/blogs/big-data/build-up-to-date-generative-ai-applications-with-real-time-vector-embedding-blueprints-for-amazon-msk/

Businesses today heavily rely on advanced technology to boost customer engagement and streamline operations. Generative AI, particularly through the use of large language models (LLMs), has become a focal point for creating intelligent applications that deliver personalized experiences. However, static pre-trained models often struggle to provide accurate and up-to-date responses without real-time data.

To help address this, we’re introducing a real-time vector embedding blueprint, which simplifies building real-time AI applications by automatically generating vector embeddings using Amazon Bedrock from streaming data in Amazon Managed Streaming for Apache Kafka (Amazon MSK) and indexing them in Amazon OpenSearch Service.

In this post, we discuss the importance of real-time data for generative AI applications, typical architectural patterns for building Retrieval Augmented Generation (RAG) capabilities, and how to use real-time vector embedding blueprints for Amazon MSK to simplify your RAG architecture. We cover the key components required to ingest streaming data, generate vector embeddings, and store them in a vector database. This will enable RAG capabilities for your generative AI models.

The importance of real-time data with generative AI

The potential applications of generative AI extend well beyond chatbots, encompassing various scenarios such as content generation, personalized marketing, and data analysis. For example, businesses can use generative AI for sentiment analysis of customer reviews, transforming vast amounts of feedback into actionable insights. In a world where businesses continuously generate data—from Internet of Things (IoT) devices to application logs—the ability to process this data swiftly and accurately is paramount.

Traditional large language models (LLMs) are trained on vast datasets but are often limited by their reliance on static information. As a result, they can generate outdated or irrelevant responses, leading to user frustration. This limitation highlights the importance of integrating real-time data streams into AI applications. Generative AI applications need contextually rich, up-to-date information to make sure they provide accurate, reliable, and meaningful responses to end users. Without access to the latest data, these models risk delivering suboptimal outputs that fail to meet user needs. Using real-time data streams is crucial for powering next-generation generative AI applications.

Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks such as answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

At the core of RAG is the ability to fetch the most relevant information from a continuously updated vector database. Vector embeddings are numerical representations that capture the relationships and meanings of words, sentences, and other data types. They enable more nuanced and effective semantic searches than traditional keyword-based systems. By converting data into vector embeddings, organizations can build robust retrieval mechanisms that enhance the output of LLMs.

At the time of writing, many processes for creating and managing vector embeddings occur in batch mode. This approach can lead to stale data in the vector database, diminishing the effectiveness of RAG applications and the responses that AI applications generate. A streaming engine capable of invoking embedding models and writing directly to a vector database can help maintain an up-to-date RAG vector database. This helps make sure generative AI models can fetch the more relevant information in real time, providing timely and more contextually accurate outputs.

Solution overview

To build an efficient real-time generative AI application, we can divide the flow of the application into two main parts:

  • Data ingestion – This involves ingesting data from streaming sources, converting it to vector embeddings, and storing them in a vector database
  • Insights retrieval – This involves invoking an LLM with user queries to retrieve insights, employing the RAG technique

Data ingestion

The following diagram outlines the data ingestion flow.

The workflow includes the following steps:

  1. The application processes feeds from streaming sources such as social media platforms, Amazon Kinesis Data Streams, or Amazon MSK.
  2. The incoming data is converted to vector embeddings in real time.
  3. The vector embeddings are stored in a vector database for subsequent retrieval.

Data is ingested from a streaming source (for example, social media feeds) and processed using an Amazon Managed Service for Apache Flink application. Apache Flink is an open source stream processing framework that provides powerful streaming capabilities, enabling real-time processing, stateful computations, fault tolerance, high throughput, and low latency. It processes the streaming data, performs deduplication, and invokes an embedding model to create vector embeddings.

After the text data is converted into vectors, these embeddings are persisted in an OpenSearch Service domain, serving as a vector database. Unlike traditional relational databases, where data is organized in rows and columns, vector databases represent data points as vectors with a fixed number of dimensions. These vectors are clustered based on similarity, allowing for efficient retrieval.

OpenSearch Service offers scalable and efficient similarity search capabilities tailored for handling large volumes of dense vector data. With features like approximate k-Nearest Neighbor (k-NN) search algorithms, dense vector support, and robust monitoring through Amazon CloudWatch, OpenSearch Service alleviates the operational overhead of managing infrastructure. This makes it a suitable solution for applications requiring fast and accurate similarity-based retrieval tasks using vector embeddings.

Insights retrieval

The following diagram illustrates the flow from the user side, where the user submits a query through the frontend and receives a response from the LLM model using the retrieved vector database documents as context.

The workflow includes the following steps:

  1. A user submits a text query.
  2. The text query is converted into vector embeddings using the same model used for data ingestion.
  3. The vector embeddings are used to perform a semantic search in the vector database, retrieving related vectors and associated text.
  4. The retrieved information, along with any previous conversation history, and the user prompt are compiled into a single prompt for the LLM.
  5. The LLM is invoked to generate a response based on the enriched prompt.

This process helps make sure the generative AI application can use the most up-to-date context when responding to user queries, providing relevant and timely insights.

Real-time vector embedding blueprints for generative applications

To facilitate the adoption of real-time generative AI applications, we are excited to introduce real-time vector embedding blueprints. This new blueprint includes a Managed Service for Apache Flink application that receives events from an MSK cluster, processes the events, and calls Amazon Bedrock using your embedding model of choice, while storing the vectors in an OpenSearch Service cluster. This new blueprint simplifies the data ingestion piece of the architecture with a low-code approach to integrate MSK streams with OpenSearch Service and Amazon Bedrock.

Implement the solution

To use real-time data from Amazon MSK as an input for generative AI applications, you need to set up several components:

  • An MSK stream to provide the real-time data source
  • An Amazon Bedrock vector embedding model to generate embeddings from the data
  • An OpenSearch Service vector data store to store the generated embeddings
  • An application to orchestrate the data flow between these components

The real-time vector embedding blueprint packages all these components into a preconfigured solution that’s straightforward to deploy. This blueprint will generate embeddings for your real-time data, store the embeddings in an OpenSearch Service vector index, and make the data available for your generative AI applications to query and process. You can access this blueprint using either the Managed Service for Apache Flink or Amazon MSK console. To get started with this blueprint, complete the following steps:

  1. Use an existing MSK cluster or create a new one.
  2. Choose your preferred Amazon Bedrock embedding model and make sure you have access to the model.
  3. Choose an existing OpenSearch Service vector index to store all embeddings or create a new vector index.
  4. Choose Deploy blueprint.

After the Managed Service for Apache Flink blueprint is up and running, all real-time data is automatically vectorized and available for generative AI applications to process.

For the detailed setup steps, see real-time vector embedding blueprint documentation

If you want to include additional data processing steps before the creation of vector embeddings, you can use the GitHub source code for this blueprint.

The real-time vector embedding blueprint reduces the time required and the level of expertise needed to set up this data integration, so you can focus on building and improving your generative AI application.

Conclusion

By integrating streaming data ingestion, vector embeddings, and RAG techniques, organizations can enhance the capabilities of their generative AI applications. Using Amazon MSK, Managed Service for Apache Flink, and Amazon Bedrock provides a solid foundation for building applications that deliver real-time insights. The introduction of the real-time vector embedding blueprint further simplifies the development process, allowing teams to focus on innovation rather than writing custom code for integration. With just a few clicks, you can configure the blueprint to continuously generate vector embeddings using Amazon Bedrock embedding models, then index those embeddings in OpenSearch Service for your MSK data streams. This allows you to combine the context from real-time data with the powerful LLMs on Amazon Bedrock to generate accurate, up-to-date AI responses without writing custom code. You can also improve the efficiency of data retrieval using built-in support for data chunking techniques from LangChain, an open source library, supporting high-quality inputs for model ingestion.

As businesses continue to generate vast amounts of data, the ability to process this information in real time will be a crucial differentiator in today’s competitive landscape. Embracing this technology allows organizations to stay agile, responsive, and innovative, ultimately driving better customer engagement and operational efficiency. Real-time vector embedding blueprint is generally available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Paris), Europe (London), Europe (Ireland) and South America (Sao Paulo) AWS Regions. Visit the Amazon MSK documentation for the list of additional Regions, which will be supported over the next few weeks.


About the authors

Francisco MorilloFrancisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.

Anusha Dasarakothapalli is a Principal Software Engineer for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. She started her software engineering career with Amazon in 2015 and worked on products such as S3-Glacier and S3 Glacier Deep Archive, before transitioning to MSK in 2022. Her primary areas of focus lie in streaming technology, distributed systems, and storage.

Shakhi Hali is a Principal Product Manager for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. She is passionate about helping customers generate business value from real-time data. Before joining MSK, Shakhi was a PM with Amazon S3. In her free time, Shakhi enjoys traveling, cooking, and spending time with family.

Digish Reshamwala is a Software Development Manager for Amazon Managed Streaming for Apache Kafka (Amazon MSK) at AWS. He started his career with Amazon in 2022 and worked on product such as AWS Fargate, before transitioning to MSK in 2024. Before joining AWS, Digish worked at NortonLifelLock and Symantec in engineering roles. He holds an MS degree from University of Southern California. His primary areas of focus lie in streaming technology and distributed computing.

Reduce your compute costs for stream processing applications with Kinesis Client Library 3.0

Post Syndicated from Minu Hong original https://aws.amazon.com/blogs/big-data/reduce-your-compute-costs-for-stream-processing-applications-with-kinesis-client-library-3-0/

Amazon Kinesis Data Streams is a serverless data streaming service that makes it straightforward to capture and store streaming data at any scale. Kinesis Data Streams not only offers the flexibility to use many out-of-box integrations to process the data published to the streams, but also provides the capability to build custom stream processing applications that can be deployed on your compute fleet.

When building custom stream processing applications, developers typically face challenges with managing distributed computing at scale that is required to process high throughput data in real time. This is where Kinesis Client Library (KCL) comes in. Thousands of AWS customers use KCL to operate custom stream processing applications with Kinesis Data Streams without worrying about the complexities of distributed systems. KCL uses Kinesis Data Streams APIs to read data from the streams and handles the heavy lifting of balancing stream processing across multiple workers, managing failovers, and checkpointing processed records. By abstracting away these concerns, KCL allows developers to focus on what matters most—implementing their core business logic for processing streaming data.

As applications process more and more data over time, customers are looking to reduce the compute costs for their stream processing applications. We are excited to launch Kinesis Client Library 3.0, which enables you to reduce your stream processing cost by up to 33% compared to previous KCL versions. KCL 3.0 achieves this with a new load balancing algorithm that continuously monitors the resource utilization of workers and redistributes the load evenly to all workers. This allows you to process the same data with fewer compute resources.

In this post, we discuss load balancing challenges in stream processing using a sample workload, demonstrating how uneven load distribution across workers increases processing costs. We then show how KCL 3.0 addresses this challenge to reduce compute costs, and walk you through how to effortlessly upgrade from KCL 2.x to 3.0. Additionally, we cover additional benefits that KCL 3.0 provides. This includes using the AWS SDK for Java 2.x and removing the dependency on the AWS SDK for Java v1.x. Lastly, we provide a key checklist as you prepare to upgrade your stream processing application to use KCL 3.0.

Load balancing challenges with operating custom stream processing applications

Customers processing real-time data streams typically use multiple compute hosts such as Amazon Elastic Compute Cloud (Amazon EC2) to handle the high throughput in parallel. In many cases, data streams contain records that must be processed by the same worker. For example, a trucking company might use multiple EC2 instances, each running one worker, to process streaming data with real-time location coordinates published from thousands of vehicles. To accurately keep track of routes of vehicles, each truck’s location needs to be processed by the same worker. For such applications, customers specify the vehicle ID as a partition key for every record published to the data stream. Kinesis Data Streams writes data records belonging to the same partition key to a single shard (the base throughput unit of Kinesis Data Streams) so that they can be processed in order.

However, data in the stream is often unevenly distributed across shards due to varying traffic associated with partition keys. For instance, some vehicles may send more frequent location updates when operational, whereas others send less frequent updates when idle. With previous KCL versions, each worker in the stream processing application processed an equal number of shards in parallel. As a result, workers processing data-heavy shards might reach their data processing limits, whereas those handling lighter shards remain underutilized. This workload imbalance presents a challenge for customers seeking to optimize their resource utilization and stream processing efficiency.

Let’s look at a sample workload with uneven traffic across shards in the stream to elaborate how this leads to uneven utilization of the compute fleet with KCL 2.6, and why it results in higher costs.

In the sample workload, the producer application publishes 2.5MBps of data across four shards. However, two shards receive 1MBps each and the other two receive 0.25MBps based on the traffic pattern associated with partition keys. In our trucking company example, you can think of it as two shards storing data from actively operating vehicles and the other two shards storing data from idle vehicles. We used three EC2 instances, each running one worker, to process this data with KCL 2.6 for this sample workload.

Initially, the load was distributed across three workers with the CPU utilizations of 50%, 50%, and 25%, averaging 42% (as shown in the following figure in the 12:18–12:29 timeframe). Because the EC2 fleet is under-utilized, we removed one EC2 instance (worker) from the fleet to operate with two workers for better cost-efficiency. However, after we removed the worker (red vertical dotted line in the following figure), the CPU utilization of one EC2 instance went up to almost 100%.

This occurs because KCL 2.6 and earlier versions distribute the load to make sure each worker processes the same number of shards, regardless of throughput or CPU utilization of workers. In this scenario, one worker processed two high-throughput shards, reaching 100% CPU utilization, and another worker handled two low-throughput shards, operating at only 25% CPU utilization.

Due to this CPU utilization imbalance, the worker compute fleet can’t be scaled down because it can lead to processing delays due to over-utilization of some workers. Even though the entire fleet is under-utilized in aggregate, uneven distribution of the load prevents us from downsizing the fleet. This increases compute costs of the stream processing application.

Next, we explore how KCL 3.0 addresses these load balancing challenges.

Load balancing improvements with KCL 3.0

KCL 3.0 introduces a new load balancing algorithm that monitors CPU utilization of KCL workers and rebalances the stream processing load. When it detects a worker approaching data processing limits or high variance in CPU utilization across workers, it redistributes the load from over-utilized to underutilized workers. This balances the stream processing load across all workers. As a result, you can avoid over-provisioning of capacity due to imbalanced CPU utilization among workers and save costs by right-sizing your compute capacity.

The following figure shows the result for KCL 3.0 with the same simulation settings we had with KCL 2.6.

With three workers, KCL 3.0 initially distributed the load similarly to KCL 2.6, resulting in 42% average CPU utilization (20:35–20:55 timeframe). However, when we removed one worker (marked with the red vertical dotted line), KCL 3.0 rebalanced the load from one worker to other two workers considering the throughput variability in shards, not just equally distributing shards based on the number of shards. As a result, two workers ended up running at about 65% CPU utilization, allowing us to safely scaling down the compute capacity without any performance risk.

In this scenario, we were able to reduce the compute fleet size from three workers to two workers, resulting in 33% reduction in compute costs compared to KCL 2.6. Although this is a sample workload, imagine the potential savings you can achieve when streaming gigabytes of data per second with hundreds of EC2 instances processing them! You can realize the same cost saving benefit for your KCL 3.0 applications deployed in containerized environments such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, or your own self-managed Kubernetes clusters.

Other benefits in KCL 3.0

In addition to the stream processing cost savings, KCL 3.0 offers several other benefits:

  • Amazon DynamoDB read capacity unit (RCU) reduction – KCL 3.0 reduces the Amazon DynamoDB cost associated with KCL by optimizing read operations on the DynamoDB table storing metadata. KCL uses DynamoDB to store metadata such as shard-worker mapping and checkpoints.
  • Graceful handoff of shards from one worker to another – KCL 3.0 minimizes reprocessing of data when the shard processed by one worker is handed over to another worker during the rebalancing or during deployments. It allows the current worker to complete checkpointing the records that it has processed and the new worker taking over the work from the previous worker to pick up from the latest checkpoint.
  • Removal of the AWS SDK for Java 1.x dependency – KCL 3.0 has completely removed the dependency on the AWS SDK for Java 1.x, aligning with the AWS recommendation to use the latest SDK versions. This change improves overall performance, security, and maintainability of KCL applications. For details regarding AWS SDK for Java 2.x benefits, refer to Use features of the AWS SDK for Java 2.x.

Migrating to KCL 3.0

You may now be wondering how to migrate to KCL 3.0 and what code changes you’ll need to make to take advantage of its benefits. If you’re currently on KCL 2.x version, you don’t have to make any changes to your application code! Complete the following steps to migrate to KCL 3.0:

  1. Update your Maven (or build environment) dependency to KCL 3.0.
  2. Set the clientVersionConfig to CLIENT_VERSION_CONFIG_COMPATIBLE_WITH_2X.
  3. Build and deploy your code.

After all KCL workers are updated, KCL 3.0 automatically starts running the new load balancing algorithm to achieve even utilization of the workers. For detailed migration instructions, see Migrating from previous KCL versions.

Key checklists when you choose to use KCL 3.0

We recommend checking the following when you decide to use KCL 3.0 for your stream processing application:

  • Make sure you added proper permissions required for KCL 3.0. KCL 3.0 creates and manages two new metadata tables (worker metrics table, coordinator state table) and a global secondary index on the lease table in DynamoDB. See IAM permissions required for KCL consumer applications for detailed permission settings you need to add.
  • The new load balancing algorithm introduced in KCL 3.0 aims to achieve even CPU utilizations across workers, not an equal number of leases per worker. Setting the maxLeasesForWorker configuration too low may limit the KCL’s ability to balance the workload effectively. If you use the maxLeasesForWorker configuration, consider increasing its value to allow for optimal load distribution.
  • If you use automatic scaling for your KCL application, it’s important to review your scaling policy after upgrading to KCL 3.0. Specifically, if you’re using average CPU utilization as a scaling threshold, you should reassess this value. If you’re conservatively using a higher-than-needed threshold value to make sure your stream processing application won’t have some workers running hot due to the imbalanced load balancing, you might be able to adjust this now. KCL 3.0 introduces improved load balancing, which results in more evenly distributed workloads across workers. After deploying KCL 3.0, monitor your workers’ CPU utilization and see if you can lower your scaling threshold to optimize your resource usage and costs while maintaining performance. This step makes sure you’re taking full advantage of KCL 3.0’s enhanced load balancing capabilities.
  • To gracefully hand off leases, make sure you have implemented a checkpointing logic inside your shutdownRequested() method in the RecordProcessor class. Refer to Step 4 of Migrating from KCL 2.x to KCL 3.x for details.

Conclusion

The release of KCL 3.0 introduces significant enhancements that can help optimize the cost-efficiency and performance of KCL applications. The new load balancing algorithm enables more even CPU utilization across worker instances, potentially allowing for right-sized and more cost-effective stream processing fleets. By following the key checklists, you can take full advantage of KCL 3.0’s features to build efficient, reliable, and cost-optimized stream processing applications with Kinesis Data Streams.


About the Authors

Minu Hong is a Senior Product Manager for Amazon Kinesis Data Streams at AWS. He is passionate about understanding customer challenges around streaming data and developing optimized solutions for them. Outside of work, Minu enjoys traveling, playing tennis, skiing, and cooking.

Pratik Patel is a Senior Technical Account Manager and streaming analytics specialist. He works with AWS customers and provides ongoing support and technical guidance to help plan and build solutions using best practices and proactively helps in keeping customers’ AWS environments operationally healthy.

Priyanka Chaudhary is a Senior Solutions Architect and data analytics specialist. She works with AWS customers as their trusted advisor, providing technical guidance and support in building Well-Architected, innovative industry solutions.

Stream real-time data into Apache Iceberg tables in Amazon S3 using Amazon Data Firehose

Post Syndicated from Diego Garcia Garcia original https://aws.amazon.com/blogs/big-data/stream-real-time-data-into-apache-iceberg-tables-in-amazon-s3-using-amazon-data-firehose/

As businesses generate more data from a variety of sources, they need systems to effectively manage that data and use it for business outcomes—such as providing better customer experiences or reducing costs. We see these trends across many industries—online media and gaming companies providing recommendations and customized advertising, factories monitoring equipment for maintenance and failures, theme parks providing wait times for popular attractions, and many others.

To build such applications, engineering teams are increasingly adopting two trends. First, they’re replacing batch data processing pipelines with real-time streaming, so applications can derive insight and take action within seconds instead of waiting for daily or hourly batch exchange, transform, and load (ETL) jobs. Second, because traditional data warehousing approaches are unable to keep up with the volume, velocity, and variety of data, engineering teams are building data lakes and adopting open data formats such as Parquet and Apache Iceberg to store their data. Iceberg brings the reliability and simplicity of SQL tables to Amazon Simple Storage Service (Amazon S3) data lakes. By using Iceberg for storage, engineers can build applications using different analytics and machine learning frameworks such as Apache Spark, Apache Flink, Presto, Hive, or Impala, or AWS services such as Amazon SageMaker, Amazon Athena, AWS Glue, Amazon EMR, Amazon Managed Service for Apache Flink, or Amazon Redshift.

Iceberg is popular because first, it’s widely supported by different open-source frameworks and vendors. Second, it allows customers to read and write data concurrently using different frameworks. For example, you can write some records using a batch ETL Spark job and other data from a Flink application at the same time and into the same table. Third, it allows scenarios such as time travel and rollback, so you can run SQL queries on a point-in-time snapshot of your data, or rollback data to a previously known good version. Fourth, it supports schema evolution, so when your applications evolve, you can add new columns to your tables without having to rewrite data or change existing applications. To learn more, see Apache Iceberg.

In this post, we discuss how you can send real-time data streams into Iceberg tables on Amazon S3 by using Amazon Data Firehose. Amazon Data Firehose simplifies the process of streaming data by allowing users to configure a delivery stream, select a data source, and set Iceberg tables as the destination. Once set up, the Firehose stream is ready to deliver data. Firehose is integrated with over 20 AWS services, so you can deliver real-time data from Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka, Amazon CloudWatch Logs, AWS Internet of Things (AWS IoT), AWS WAF, Amazon Network Firewall Logs, or from your custom applications (by invoking the Firehose API) into Iceberg tables. It’s cost effective because Firehose is serverless, you only pay for the data sent and written to your Iceberg tables. You don’t have to provision anything or pay anything when your streams are idle during nights, weekends, or other non-use hours.

Firehose also simplifies setting up and running advanced scenarios. For example, if you want to route data to different Iceberg tables to have data isolation or better query performance, you can set up a stream to automatically route records into different tables based on what’s in your incoming data and distribute records from a single stream into dozens of Iceberg tables. Firehose automatically scales—so you don’t have to plan for how much data goes into which table—and has built-in mechanisms to handle delivery failures and guarantee exactly once delivery. Firehose supports updating and deleting records in a table based on the incoming data stream, so you can support scenarios such as GDPR and right-to-forget regulations. Because Firehose is fully compatible with Iceberg, you can write data using it and simultaneously use other applications to read and write to the same tables. Firehose integrates with the AWS Glue Data Catalog, so you can use features in AWS Glue such as managed compaction for Iceberg tables.

In the following sections, you’ll learn how to set up Firehose to deliver real-time streams into Iceberg tables to address four different scenarios:

  1. Deliver data from a stream into a single Iceberg table and insert all incoming records.
  2. Deliver data from a stream into a single Iceberg table and perform record inserts, updates, and deletes.
  3. Route records to different tables based on the content of the incoming data by specifying a JSON Query expression.
  4. Route records to different tables based on the content of the incoming data by using a Lambda function.

You will also learn how to query the data you have delivered to Iceberg tables using a standard SQL query in Amazon Athena. All of the AWS services used in these examples are serverless, so you don’t have to provision and manage any infrastructure.

Solution overview

The following diagram illustrates the architecture.

In our examples, we use Kinesis Data Generator, a sample application to generate and publish data streams to Firehose. You can also set up Firehose to use other data sources for your real-time streams. We set up Firehose to deliver the stream into Iceberg tables in the Data Catalog.

Walkthrough

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template performs the following operations:

  • Creates a Data Catalog database for the destination Iceberg tables
  • Creates four tables in the AWS Glue database that are configured to use the Apache Iceberg format
  • Specifies the S3 bucket locations for the destination tables
  • Creates a Lambda function (optional)
  • Sets up an AWS Identity and Access Management (IAM) role for Firehose
  • Creates resources for Kinesis Data Generator

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account. If you don’t have an account, you can create one.

Deploy the solution

The first step is to deploy the required resources into your AWS environment by using a CloudFormation template.

  1. Sign in to the AWS Management Console for CloudFormation.
  2. Choose Launch Stack.
    Launch Cloudformation Stack
  3. Choose Next.
  4. Leave the stack name as Firehose-Iceberg-Stack, and in the parameters, enter the username and password that you want to use for accessing Kinesis Data Generator.
  5. Go to the bottom of the page and select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.

  6. Review the deployment and choose Submit.

The stack can take 5–10 minutes to complete, after which you can view the deployed stack on the CloudFormation console. The following figure shows the deployed Firehose-Iceberg-stack details.

Before you set up Firehose to deliver streams, you must create the destination tables in the Data Catalog. For the examples discussed here, we use the CloudFormation template to automatically create the tables used in the examples. For your custom applications, you can create your tables using CloudFormation, or by using DDL commands in Athena or Glue. The following is the DDL command for creating a table used in our example:

CREATE TABLE firehose_iceberg_db.firehose_events_1 (
type struct<device: string, event: string, action: string>,
customer_id string,
event_timestamp timestamp,
region string)
LOCATION 's3://firehose-demo-iceberg-<account_id>-<region>/iceberg/events_1'
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='zstd'
);

Also note that the four tables that we use in the examples have the same schema, but you can have tables with different schemas in your application.

Use case 1: Deliver data from a stream into a single Iceberg table and insert all incoming records

Now that you have set up the source for your data stream and the destination tables, you’re ready to set up Firehose to deliver streams into the Iceberg tables.

Create a Firehose stream:

  1. Go to the Data Firehose console and choose Create Firehose stream.
  2. Select Direct PUT as the Source and Apache Iceberg Tables as the Destination.

This example uses Direct PUT as the source, but the same steps can be applied for other Firehose sources such as Kinesis Data Streams, and Amazon MSK.

  1. For the Firehose stream name, enter firehose-iceberg-events-1.
  2. In Destination settings, enable Inline parsing for routing information. Because all records from the stream are inserted into a single destination table, you specify a destination database and table. By default, Firehose inserts all incoming records into the specified destination table.
    1. Database expression: “firehose_iceberg_db
    2. Table expression: “firehose_events_1

Include double quotation marks to use the literal value for the database and table name. If you do not use double quotations marks, Firehose assumes that this is a JSON Query expression and will attempt to parse the expression when processing your stream and fail.

  1. Go to Buffer hints and reduce the Buffer size to 1 MiB and the Buffer interval to 60 You can fine tune these settings for your application.
  2. For Backup settings:
    • Select the S3 bucket created by the CloudFormation template. It has the following structure: s3://firehose-demo-iceberg-<account_id>-<region>
    • For error output prefix enter: error/events-1/

  3. In Advanced settings, enable CloudWatch error logging, and in Existing IAM roles, select the role that starts with Firehose-Iceberg-Stack-FirehoseIamRole-*, created by the CloudFormation template.
  4. Choose Create Firehose stream.

Generate streaming data:

Use Kinesis Data Generator to publish data records into your Firehose stream.

  1. Go to the CloudFormation stack, select the Nested stack for the generator, and choose Outputs.
  2. Choose the KinesisDataGenerator URL and enter the credentials that you defined when deploying the CloudFormation stack.
  3. Select the AWS Region where you deployed the CloudFormation stack and select your Firehose stream.
  4. For template, replace the values on the screen with the following:
    {
    "type": {
    "device": "{{random.arrayElement(["mobile", "desktop", "tablet"])}}",
    "event": "{{random.arrayElement(["firehose_events_1", "firehose_events_2"])}}",
    "action": "update"
    },
    "customer_id": "{{random.number({ "min": 1, "max": 1500})}}",
    "event_timestamp": "{{date.now("YYYY-MM-DDTHH:mm:ss.SSS")}}",
    "region": "{{random.arrayElement(["pdx", "nyc"])}}"
    }

  5. Before sending data, choose Test template to see an example payload.
  6. Choose Send data.

Querying with Athena:

You can query the data you’ve written to your Iceberg tables using different processing engines such as Apache Spark, Apache Flink, or Trino. In this example, we will show you how you can use Athena to query data that you’ve written to Iceberg tables.

  1. Go to the Athena console.
  2. Configure a Location of query result. You can use the same S3 bucket for this but add a suffix at the end.
    s3://firehose-demo-iceberg-<account_id>-<region>/athena/

  3. In the query editor, in Tables and views, select the options button next to firehose_events_1 and select Preview Table.

You should be able to see data in the Apache Iceberg tables by using Athena.

With that, you ‘ve delivered data streams into an Apache Iceberg table using Firehose and run a SQL query against your data.

Now let’s explore the other scenarios. We will follow the same procedure as before for creating the Firehose stream and querying Iceberg tables with Amazon Athena.

Use case 2: Deliver data from a stream into a single Iceberg table and perform record inserts, updates, and deletes

One of the advantages of using Apache Iceberg is that it allows you to perform row-level operations such as updates and deletes on tables in a data lake. Firehose can be set up to automatically apply record update and delete operations in your Iceberg tables.

Things to know:

  • When you apply an update or delete operation through Firehose, the data in Amazon S3 isn’t actually deleted. Instead, a marker record is written according to the Apache Iceberg format specification to indicate that the record is updated or deleted, so subsequent read and write operations get the latest record. If you want to purge (remove the underlying data from Amazon S3) the deleted records, you can use tools developed for purging records in Apache Iceberg.
  • If you attempt to update a record using Firehose and the underlying record doesn’t already exist in the destination table, Firehose will insert the record as a new row.

Create a Firehose stream:

  1. Go to the Amazon Data Firehose console.
  2. Choose Create Firehose stream.
  3. For Source, select Direct PUT. For Destination select Apache Iceberg Tables.
  4. For the Firehose stream name, enter firehose-iceberg-events-2.
  5. In the e, enable inline parsing for routing information and provide the required values as static values for Database expression and Table expression. Because you want to be able to update records, you also need to specify the Operation expression.
    1. Database expression: “firehose_iceberg_db
    2. Table expression: “firehose_events_2
    3. Operation expression: “update

Include double quotation marks to use the literal value for the database and table name. If you do not use double quotations marks, Firehose assumes that this is a JSON Query expression and will attempt to parse the expression when processing your stream and fail.

  1. Because you want to perform update and delete operations, you need to provide the columns in the destination table that will be used as unique keys to identify the record in the destination to be updated or deleted.
    • For DestinationDatabaseName: “firehose_iceberg_db
    • For DestinationTableName: “firehose_events_2
    • In UniqueKeys, replace the existing value with: “customer_id

  2. Change the Buffer hints to 1 MiB and 60
  3. In Backup settings, select the same bucket from the stack, but enter the following in the error output prefix:
    error/events-2/

  4. In Advanced settings, enable CloudWatch Error logging and select the existing role of the stack and create the new Firehose stream.

Use Kinesis Data Generator to publish records into your Firehose stream. You might need to refresh the page or change regions so that it refreshes and shows the newly created delivery stream.

Don’t make any changes to the template and start sending data to the firehose-iceberg-events-2 stream.

Run the following query in Athena to see data in the firehose_events_2 table. Note that you can send updated records for the same unique key (same customer_id value) into your Firehose stream, and Firehose automatically applies record updates in the destination table. Thus, when you query data in Athena, you will see only one record for each unique value of customer_id, even if you have sent multiple updates into your stream.

SELECT customer_id, count(*) 
FROM "firehose_iceberg_db"."firehose_events_2" 
GROUP BY customer_id LIMIT 10;

Use case 3: Route records to different tables based on the content of the incoming data by specifying a JSON Query expression

Until now, you provided the routing and operation information as static values to perform operations on a single table. However, you can specify JSON Query expressions to define how Firehose should retrieve the destination database, destination table, and operation from your incoming data stream, and accordingly route the record and perform the corresponding operation. Based on your specification, Firehose automatically routes and delivers each record into the appropriate destination table and applies the corresponding operation.

Create a Firehose stream:

  1. Go back to the Amazon Data Firehose console.
  2. Choose Create Firehose Stream.
  3. For Source, select Direct PUT. For Destination, select Apache Iceberg Tables.
  4. For the Firehose stream name, enter firehose-iceberg-events-3.
  5. In Destination settings, enable Inline parsing for routing information.
    • For Database expression, provide the same value as before as a static string: “firehose_iceberg_db
    • For Table expression, retrieve this value from the nested incoming record using JSON Query.
      .type.event

    • For Operation expression, we will also retrieve this value from our nested record using JSON Query.
      .type.action

If you have the following incoming events with different event values, With the preceding JSON Query expressions, Firehose will parse and get “firehose_event_3” or “firehose_event_4” as the table names, and “update” as the intended operation from the incoming records.

{ "type": {   "device": "tablet",  
"event": "firehose_events_3",   "action": "update" },
"customer_id": "112", "event_timestamp": "2024-10-02T15:46:52.901",
"region": "pdx"}
{ "type": {   "device": "tablet",  
"event": "firehose_events_4",   "action": "update" },
"customer_id": "112", "event_timestamp": "2024-10-02T15:46:52.901",
"region": "pdx"}

  1. Because this is an update operation, you need to configure unique keys for each table. Also, because you want to deliver records to multiple Iceberg tables, you need to provide configurations for each of the two destination tables that records can be written to.
  2. Change the Buffer hints to 1 MiB and 60
  3. In Backup settings, select the same bucket from the stack, but in the error output prefix enter the following:
    error/events-3/

  4. In Advanced settings, select the existing IAM role created by the CloudFormation stack and create the new Firehose stream.
  5. In Kinesis Data Generator, refresh the page and select the newly created Firehose stream: firehose-iceberg-events-3

If you query the firehose_events_3 and firehose_events_4 tables using Athena, you should find the data routed to right tables by Firehose using the routing information retrieved using JSON Query expressions.

Table below showing  events with event “firehose_events_3

The following figure shows Firehose Events Table 4.

Use Case 4: Route records to different tables based on the content of the incoming data by using a Lambda function

There might be scenarios where routing information isn’t readily available in the input record. You might want to parse and process incoming records or perform a lookup to determine where to deliver the record and whether to perform an update or delete operation. For such scenarios, you can use a Lambda function to generate the routing information and operation specification. Firehose automatically invokes your Lambda function for a batch of records (with a configurable batch size). You can process incoming records in your Lambda function and provide the routing information and operation in the output of the function. To learn more about how to process Firehose records using Lambda, see Transform source data in Amazon Data Firehose. After executing your Lambda function, Firehose looks for routing information and operations in the metadata fields (in the following format) provided by your Lambda function.

    "metadata":{
        "otfMetadata":{
            "destinationTableName":"firehose_iceberg_db",
            "destinationDatabaseName":"firehose_events_*",
            "operation":"insert"
        }

So, in this use case, you will explore how you can create custom routing rules based on other values of your records. Specifically, for this use case, you will route all records with a value for Region of ‘pdx’ to table 3 and all records with a region value of ‘nyc’ to table 4.

The CloudFormation template has already created the custom processing Lambda function for you, which has the following code:

import base64
import json
print('Loading function')

def lambda_handler(event, context):
    firehose_records_output = {'records': []}

    for record in event['records']:
        payload = base64.b64decode(record['data']).decode('utf-8')
        # Process the payload based on region
        payload_json = json.loads(payload)
        region = payload_json.get('region', '')
        firehose_record_output = {}
        if region == 'pdx':
            firehose_record_output['metadata'] = {
                'otfMetadata': {
                    'destinationDatabaseName': 'firehose_iceberg_db',
                    'destinationTableName': 'firehose_events_3',
                    'operation': 'insert'
                }
            }
        elif region == 'nyc':
            firehose_record_output['metadata'] = {
                'otfMetadata': {
                    'destinationDatabaseName': 'firehose_iceberg_db',
                    'destinationTableName': 'firehose_events_4',
                    'operation': 'insert'
                }
            }

        # Create output with proper record ID, output data, result, and metadata
        firehose_record_output['recordId'] = record['recordId']
        firehose_record_output['result'] = 'Ok'
        firehose_record_output['data'] = base64.b64encode(payload.encode('utf-8'))
        firehose_records_output['records'].append(firehose_record_output)

    return firehose_records_output

Configure the Firehose stream:

  1. Go back to the Data Firehose console.
  2. Choose Create Firehose stream.
  3. For Source, select Direct PUT. For Destination, select Apache Iceberg Tables.
  4. For the Firehose stream name, enter firehose-iceberg-events-4.
  5. In Transform records, select Turn on data transformation.
  6. Browse and select the function created by the CloudFormation stack:
    • Firehose-Iceberg-Stack-FirehoseProcessingLambda-*.
    • For Version select $LATEST.
  7. You can leave the Destination Settings as default because the Lambda function will provide the required metadata for routing.
  8. Change the Buffer hints to 1 MiB and 60 seconds.
  9. In Backup settings, select the same S3 bucket from the stack, but in the error output prefix, enter the following:
    error/events-4/

  10. In Advanced settings, select the existing role of the stack and create the new Firehose stream.
  11. In Kinesis Data Generator, refresh the page and select the newly created firehose stream: firehose-iceberg-events-4.

If you run the following query, you will see that the last records that were inserted into the table are only in the Region of ‘nyc’.

SELECT * FROM "firehose_iceberg_db"."firehose_events_4" 
order by event_timestamp desc 
limit 10;

Considerations and limitations

Before using Data Firehose with Apache Iceberg, it’s important to be aware of considerations and limitations. For more information, see Considerations and limitations.

Clean up

To avoid future charges, delete the resources you created in AWS Glue, Data Catalog, and the S3 bucket used for storage.

Conclusion

It’s straightforward to set up Firehose streams to deliver streaming records into Apache Iceberg tables in Amazon S3. We hope that this post helps you get started with building some amazing applications without having to worry about writing and managing complex application code or having to manage infrastructure.

To learn more about using Amazon Data Firehose with Apache Iceberg, see the Firehose Developer Guide or try the Immersion day workshop.


About the authors

Diego Garcia Garcia is a Specialist SA Manager for Analytics at AWS. His expertise spans across Amazon’s analytics services, with a particular focus on real-time data processing and advanced analytics architectures. Diego leads a team of specialist solutions architects across EMEA, collaborating closely with customers spanning across multiple industries and geographies to design and implement solutions to their data analytics challenges.

Francisco MorilloFrancisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.

Fine-grained access control in Amazon EMR Serverless with AWS Lake Formation

Post Syndicated from Anubhav Awasthi original https://aws.amazon.com/blogs/big-data/fine-grained-access-control-in-amazon-emr-serverless-with-aws-lake-formation/

In today’s data-driven world , enterprises are increasingly reliant on vast amounts of data to drive decision-making and innovation. With this reliance comes the critical need for robust data security and access control mechanisms. Fine-grained access control restricts access to specific data subsets, protecting sensitive information and maintaining regulatory compliance. It allows organizations to set detailed permissions at various levels, including database, table, column, and row. This precise control mitigates risks of unauthorized access, data leaks, and misuse. In the unfortunate event of a security incident, fine-grained access control helps limit the scope of the breach, minimizing potential damage.
AWS is introducing general availability of fine-grained access control based on AWS Lake Formation for Amazon EMR Serverless on Amazon EMR 7.2. Enterprises can now significantly enhance their data governance and security frameworks. This new integration supports the implementation of modern data lake architectures, such as data mesh, by providing a seamless way to manage and analyze data. You can use EMR Serverless to enforce data access controls using Lake Formation when reading data from Amazon Simple Storage Service (Amazon S3), enabling robust data processing workflows and real-time analytics without the overhead of cluster management.

In this post, we discuss how to implement fine-grained access control in EMR Serverless using Lake Formation. With this integration, organizations can achieve better scalability, flexibility, and cost-efficiency in their data operations, ultimately driving more value from their data assets.

Key use cases for fine-grained access control in analytics

The following are key use cases for fine-grained access control in analytics:

  • Customer 360 – You can enable different departments to securely access specific customer data relevant to their functions. For example, the sales team can be granted access only to data such as customer purchase history, preferences, and transaction patterns. Meanwhile, the marketing team is limited to viewing campaign interactions, customer demographics, and engagement metrics.
  • Financial reporting – You can enable financial analysts to access the necessary data for reporting and analysis while restricting sensitive financial details to authorized executives.
  • Healthcare analytics – You can enable healthcare researchers and data scientists to analyze de-identified patient data for medical advancements and research, while making sure Protected Health Information (PHI) remains confidential and accessible only to authorized healthcare professionals and personnel.
  • Supply chain optimization – You can grant logistics teams visibility into inventory and shipment data while limiting access to pricing or supplier information to relevant stakeholders.

Solution overview

In this post, we explore how to implement fine-grained access control on Iceberg tables within an EMR Serverless application, using the capabilities of Lake Formation. If you’re interested in learning how to implement fine-grained access control on open table formats in Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) instances using Lake Formation, refer to Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation.
With the data access control features available in Lake Formation, you can enforce granular permissions and govern access to specific columns, rows, or cells within your Iceberg tables. This approach makes sure sensitive data remains secure and accessible only to authorized users or applications, aligning with your organization’s data governance policies and regulatory compliance requirements.

A cross-account modern data platform on AWS involves setting up a centralized data lake in a primary AWS account, while allowing controlled access to this data from secondary AWS accounts. This setup helps organizations maintain a single source of truth for their data, provides consistent data governance, and uses the robust security features of AWS across multiple business units or project teams.

To demonstrate how you can use Lake Formation to implement cross account fine-grained access control within an EMR Serverless environment, we use the TPC-DS dataset to create tables in the AWS Glue Data Catalog in the AWS producer account and provision different user personas to reflect various roles and access levels in the AWS consumer account, forming a secure and governed data lake.

The following diagram illustrates the solution architecture.

The producer account contains the following persona:

  • Data engineer – Tasks include data preparation, bulk updates, and incremental updates. The data engineer has the following access:
    • Table-level access – Full read/write access to all TPC-DS tables.

The consumer account contains the following personas:

  • Finance analyst – We run a sample query that performs a sales data analysis to guide marketing, inventory, and promotion strategies based on demographic and geographic performance. The finance analyst has the following access:
    • Table-level access – Full access to tables store_sales, catalog_sales, web_sales, item, and promotion for comprehensive financial analysis.
    • Column-level access – Limited access to cost-related columns in the sales tables to avoid exposure to sensitive pricing strategies. Limited access to sensitive columns like credit_rating in the customer_demographics table.
    • Row-level access – Access only to sales data from the current fiscal year or specific promotional periods.
  • Product analyst – We run a sample query that does a customer behavior analysis to tailor marketing, promotions, and loyalty programs based on purchase patterns and regional insights. The product analyst has the following access:
    • Table-level access – Full access to tables item, store_sales, and customer tables to evaluate product and market trends.
    • Column-level access – Restricted access to personal identifiers in the customer table, such as customer_address , email_address, and date of birth.

Prerequisites

You should have the following prerequisites:

Set up infrastructure in the producer account

We provide a CloudFormation template to deploy the data lake stack with the following resources:

  • Two S3 buckets: one for scripts and query results, and one for the data lake storage
  • An Amazon Athena workgroup
  • An EMR Serverless application
  • An AWS Glue database and tables on external public S3 buckets of TPC-DS data
  • An AWS Glue database for the data lake
  • An IAM role and polices

Set up Lake Formation for the data engineer in the producer account

Set up Lake Formation cross-account data sharing version settings:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
  2. Under Data Catalog settings, pick Version 4 under Cross-account version settings.

To learn more about the differences between data sharing versions, refer to Updating cross-account data sharing version settings. Make sure Default permissions for newly created databases and tables is unchecked.

Register the Amazon S3 location as the data lake location

When you register an Amazon S3 location with Lake Formation, you specify an IAM role with read/write permissions on that location. After registering, when EMR Serverless requests access to this Amazon S3 location, Lake Formation will supply temporary credentials of the provided role to access the data. We already created the role LakeFormationServiceRole using the CloudFormation template. To register the Amazon S3 location as the data lake location, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
  2. In the navigation pane, choose Data lake locations under Administration.
  3. Choose Register location.
  4. For Amazon S3 path, enter s3://<DatalakeBucketName>. (Copy the bucket name from the CloudFormation stack’s Outputs tab.)
  5. For IAM role, enter LakeFormationServiceRoleDatalake.
  6. For Permission mode, select Lake Formation.
  7. Choose Register location.

Generate TPC-DS tables in the producer account

In this section, we generate TPC-DS tables in Iceberg format in the producer account.
Grant database permissions to the data engineer
First, let’s grant database permissions to the data engineer IAM role Amazon-EMR-ExecutionRole_DE that we will use with EMR Serverless. Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
  2. Choose Databases and Create database.
  3. Enter iceberg_db for Name and s3://<DatalakeBucketName> for Location.
  4. Choose Create database.
  5. In the navigation pane, choose Data lake permissions and choose Grant.
  6. In the Principles section, select IAM users and roles and choose Amazon-EMR-ExecutionRole_DE.
  7. In the LF-Tags or catalog resources section, select Named Data Catalog resources and choose tpc-source and iceberg_db for Databases.
  8. Select Super for both Database permissions and Grantable permissions and choose Grant.

Create an EMR Serverless application

Now, let’s log in to EMR Serverless using Amazon EMR Studio and complete the following steps:

  1. On the Amazon EMR console, choose EMR Serverless.
  2. Under Manage applications, choose my-emr-studio. You will be directed to the Create application page on EMR Studio. Let’s create a Lake Formation enabled EMR Serverless application
  3.  Under Application settings, provide the following information:
    1. For Name, enter a name emr-fgac-application.
    2. For Type, choose Spark.
    3. For Release version, choose emr-7.2.0.
    4. For Architecture, choose x86_64.
  4. Under Application setup options, select Use custom settings.
  5. Under Interactive endpoint, select Enable endpoint for EMR studio
  6. Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
  7. Under Network connections, choose emrs-vpc for the VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
  8. Choose Create and start application.

Create a Workspace

Complete the following steps to create an EMR Workspace:

  1. On the Amazon EMR console, choose Workspaces in the navigation pane and choose Create Workspace.
  2. Enter the Workspace name emr-fgac-workspace.
  3. Leave all other settings as default and choose Create Workspace.
  4. Choose Launch Workspace. Your browser might request to allow pop-up permissions for the first time launching the Workspace.
  5. After the Workspace is launched, in the navigation pane, choose Compute.
  6. For Compute type¸ select EMR Serverless application and enter emr-fgac-application for the application and Amazon-EMR-ExecutionRole_DE as the runtime role.
  7. Make sure the kernel attached to the Workspace is PySpark.
  8. Navigate to the File browser section and choose Upload files.
  9. Upload the file Iceberg-ingest-final_v2.ipynb.
  10. Update the data lake bucket name, AWS account ID, and AWS Region accordingly.
  11. Choose the double arrow icon to restart the kernel and rerun the notebook.


To verify that the data is generated, you can go to the AWS Glue console. Under Data Catalog, Databases, you should see TPC-DS tables ending with _iceberg for the database iceberg_db.

Share the database and TPC-DS tables to the consumer account

We now grant permissions to the consumer account, including grantable permissions. This allows the Lake Formation data lake administrator in the consumer account to control access to the data within the account.

Grant database permissions to the consumer account

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the producer account.
  2. In the navigation pane, choose Databases.
  3. Select the database iceberg_db, and on the Actions menu, under Permissions, choose Grant.
  4. In the Principles section, select External accounts and enter the consumer account.
  5. In the LF-Tags or catalog resources section, select Named Data Catalog resources and choose iceberg_db for Databases.
  6. In the Database permissions section, select Describe for both Database permissions and Grantable permissions.

This allows the data lake administrator in the consumer account to describe the database and grant describe permissions to other principals in the consumer account.

Grant table permissions to the consumer account

Repeat the preceding steps to grant table permissions to the consumer account.

Choose All tables under Tables and provide select and describe permissions for Table permissions and Grantable permissions.

Set up Lake Formation in the consumer account

For the remaining section of the post, we focus on the consumer account. Deploy the following CloudFormation stack to set up resources:

The template will create the Amazon EMR runtime role for both analyst user personas.
Log in to the AWS consumer account and accept the AWS RAM invitation first:

  1. Open the AWS RAM console with the IAM identity that has AWS RAM access.
  2. In the navigation pane, choose Resource shares under Shared with me.
  3. You should see two pending resource shares from the producer account.
  4. Accept both invitations.

You should be able to see the iceberg_db database on the Lake Formation console.

Create a resource link for the shared database

To access the database and table resources that were shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database or table. After you create a resource link to a database or table, you can use the resource link name wherever you would use the database or table name. In this step, you grant permission on the resource links to the job runtime roles for EMR Serverless. The runtime roles will then access the data in shared databases and underlying tables through the resource link.
To create a resource link, complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the iceberg_db database, verify that the owner account ID is the producer account, and on the Actions menu, choose Create resource links.
  4. For Resource link name, enter the name of the resource link (iceberg_db_shared).
  5. For Shared database’s region, choose the Region of the iceberg_db database.
  6. For Shared database, choose the iceberg_db database.
  7. For Shared database’s owner ID, enter the account ID of the producer account.
  8. Choose Create.

Grant permissions on the resource link to the EMR job runtime roles

Grant permissions on the resource link to Amazon-EMR-ExecutionRole_Finance and Amazon-EMR-ExecutionRole_Product using the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (iceberg_db_shared) and on the Actions menu, choose Grant.
  4. In the Principles section, select IAM users and roles, and choose Amazon-EMR-ExecutionRole_Finance and Amazon-EMR-ExecutionRole_Product.
  5. In the LF-Tags or catalog resources section, select Named Data Catalog resources and for Databases, choose iceberg_db_shared.
  6. In the Resource link permissions section, select Describe for Resource link permissions.

This allows the EMR Serverless job runtime roles to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principles.
Choose Grant.

Grant table permissions for the finance analyst

Complete the following steps:

  1. Open the Lake Formation console with the Lake Formation data lake administrator in the consumer account.
  2. In the navigation pane, choose Databases.
  3. Select the resource link (iceberg_db_shared) and on the Actions menu, choose Grant on target.
  4. In the Principles section, select IAM users and roles, then choose Amazon-EMR-ExecutionRole_Finance.
  5. In the LF-Tags or catalog resources section, select Named Data Catalog resources and specify the following:
    1. For Databases, choose iceberg_db.
    2. For Tables¸ choose store_sales_iceberg.
  6. In the Table permissions section, for Table permissions, select Select.
  7. In the Data permissions section, select Column-based access.
  8. Select Exclude columns and choose all cost-related columns (ss_wholesale_cost and ss_ext_wholesale_cost).
  9. Choose Grant.
  10. Similarly, grant access to table customer_demographics_iceberg and exclude the column cd_credit_rating.
  11. Following the same steps, grant All data access for tables store_iceberg and item_iceberg.
  12. For the table date_dim_iceberg, we provide selective row-level access.
  13. Similar to the preceding table permissions, select date_dim_iceberg under Tables and in the Data filters section, choose Create new.
  14. For Data filter name, enter FA_Filter_year.
  15. Select Access to all columns under Column-level access.
  16. Select Filter rows and for Row filter expression, enter d_year=2002 to only provide access to the 2002 year.
  17. Choose Save changes.
  18. Choose Create filter.
  19. Make sure FA_Filter_year is selected under Data filters and grant select permissions on the filter.

Grant table permissions for the product analyst

You can provide permissions for the next set of tables required for the product analyst role using the Lake Formation console. Alternatively, you can use the AWS Command Line Interface (AWS CLI) to grant permissions. We provide grant on target permissions for the resource link iceberg_db_shared to IAM role Amazon-EMR-ExecutionRole_Product.

  1. Similar to steps followed in previous sections, for table store_sales_iceberg, date_dim_iceberg, store_iceberg, and house_hold_demographics_iceberg, provide select permissions for All data access. Make sure the role selected is Amazon-EMR-ExecutionRole_Product.

For table customer_iceberg, we limit access to personally identifiable information (PII) columns.

  1. Under Data permissions, select Column-based access and Exclude columns.
  2. Choose columns c_birth_day, c_birth_month, c_birth_year, c_current_addr_sk, c_customer_id, c_email_address, and c_birth_country.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to test role access:

  1. Log in to the AWS consumer account and open the Amazon EMR console.
  2. Choose EMR Serverless in the navigation pane and choose an existing EMR Studio.
  3. If you don’t have EMR Studio configured, choose Get Started and select Create and launch EMR Studio.
  4. Create a Lake Formation enabled EMR Serverless application as described in previous sections.
  5. Create an EMR Studio Workspace as described in previous sections.
  6. Use emr-studio-service-role for Service role and datalake-resources-<account_id>-<region> for Workspace storage, then launch your Workspace.

Now, let’s verify access for the finance analyst.

  1. Make sure the compute type inside your Workspace is pointing to the EMR Serverless application created in the prior step and Amazon-EMR-ExecutionRole_Finance as the interactive runtime role.
  2. Go to File browser in the navigation pane, choose Upload files, and add Notebook_FA.ipynb to your Workspace.
  3. Run all the cells to verify fine-grained access.

Now let’s test access for the product analyst.

  1. In the same Workspace, detach and attach the same EMR Serverless application with Amazon-EMR-ExecutionRole_Product as the interactive runtime role.
  2. Upload Notebook_PA.ipynb under the File browser section.
  3. Run all the cells to verify fine-grained access for the product analyst.

In a real-world scenario, both analysts will likely have their own Workspace with restricted rights to assume only the authorized interactive runtime role.

Considerations and limitations

EMR Serverless with Lake Formation uses Spark resource profiles to create two profiles and two Spark drivers for access control. Read this white paper to learn about the feature details. The user profile runs the supplied code, and the system profile enforces Lake Formation policies. Therefore, it’s recommended that you have a minimum of two Spark drivers when pre-initialized capacity is used with Lake Formation enabled jobs. No change in executor count is required. Refer to Using EMR Serverless with AWS Lake Formation for fine-grained access control to learn more about the technical implementation of the Lake Formation integration with EMR Serverless.

You can expect a performance overhead after enabling Lake Formation. The level of access (table, column, or row) and the amount of data filtered will have significant impact on query performance.

Clean up

To avoid incurring ongoing costs, complete the following steps to clean up your resources:

  1. In your secondary (consumer)  account, log in to the Lake Formation console.
  2. Drop the resource share table.
  3. In your primary (producer) account, log in to the Lake Formation console.
  4. Revoke the access you configured.
  5. Drop the AWS Glue tables and database.
  6. Delete the AWS Glue job.
  7. Delete the S3 buckets and any other resources that you created as part of the prerequisites for this post.

Conclusion

In this post, we showed how to integrate Lake Formation with EMR Serverless to manage access to Iceberg tables. This solution showcases a modern way to enforce fine-grained access control in a multi-account open data lake setup. The approach simplifies data management in the main account while carefully controlling how users access data in other secondary accounts.

Try out the solution for your own use case, and let us know your feedback and questions in the comments section.


About the Authors

Anubhav Awasthi is a Sr. Big Data Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web services. He specializes in building Big-data applications and help customer to modernize their applications on Cloud. He thinks Data is new oil and spends most of his time in deriving insights out of the Data.

Integrate Amazon Bedrock with Amazon Redshift ML for generative AI applications

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/integrate-amazon-bedrock-with-amazon-redshift-ml-for-generative-ai-applications/

Amazon Redshift has enhanced its Redshift ML feature to support integration of large language models (LLMs). As part of these enhancements, Redshift now enables native integration with Amazon Bedrock. This integration enables you to use LLMs from simple SQL commands alongside your data in Amazon Redshift, helping you to build generative AI applications quickly. This powerful combination enables customers to harness the transformative capabilities of LLMs and seamlessly incorporate them into their analytical workflows.

With this new integration, you can now perform generative AI tasks such as language translation, text summarization, text generation, customer classification, and sentiment analysis on your Redshift data using popular foundation models (FMs) such as Anthropic’s Claude, Amazon Titan, Meta’s Llama 2, and Mistral AI. You can use the CREATE EXTERNAL MODEL command to point to a text-based model in Amazon Bedrock, requiring no model training or provisioning. You can invoke these models using familiar SQL commands, making it more straightforward than ever to integrate generative AI capabilities into your data analytics workflows.

Solution overview

To illustrate this new Redshift machine learning (ML) feature, we will build a solution to generate personalized diet plans for patients based on their conditions and medications. The following figure shows the steps to build the solution and the steps to run it.

The steps to build and run the solution are the following:

  1. Load sample patients’ data
  2. Prepare the prompt
  3. Enable LLM access
  4. Create a model that references the LLM model on Amazon Bedrock
  5. Send the prompt and generate a personalized patient diet plan

Pre-requisites

  1. An AWS account.
  2. An Amazon Redshift Serverless workgroup or provisioned data warehouse. For setup instructions, see Creating a workgroup with a namespace or Create a sample Amazon Redshift data warehouse, respectively. The Amazon Bedrock integration feature is supported in both Amazon Redshift provisioned and serverless.
  3. Create or update an AWS Identity and Access Management (IAM role) for Amazon Redshift ML integration with Amazon Bedrock.
  4. Associate the IAM role to a Redshift instance.
  5. Users should have the required permissions to create models.

Implementation

The following are the solution implementation steps. The sample data used in the implementation is for illustration only. The same implementation approach can be adapted to your specific data sets and use cases.

You can download a SQL notebook to run the implementation steps in Redshift Query Editor V2. If you’re using another SQL editor, you can copy and paste the SQL queries either from the content of this post or from the notebook.

Load sample patients’ data:

  1. Open Amazon Redshift Query Editor V2 or another SQL editor of your choice and connect to the Redshift data warehouse.
  2. Run the following SQL to create the patientsinfo table and load sample data.
-- Create table

CREATE TABLE patientsinfo (
pid integer ENCODE az64,
pname varchar(100),
condition character varying(100) ENCODE lzo,
medication character varying(100) ENCODE lzo
);
  1. Download the sample file, upload it into your S3 bucket, and load the data into the patientsinfo table using the following COPY command.
-- Load sample data
COPY patientsinfo
FROM 's3://<<your_s3_bucket>>/sample_patientsinfo.csv'
IAM_ROLE DEFAULT
csv
DELIMITER ','
IGNOREHEADER 1;

Prepare the prompt:

  1. Run the following SQL to aggregate patient conditions and medications.
SELECT
pname,
listagg(distinct condition,',') within group (order by pid) over (partition by pid) as conditions,
listagg(distinct medication,',') within group (order by pid) over (partition by pid) as medications
FROM patientsinfo

The following is the sample output showing aggregated conditions and medications. The output includes multiple rows, which will be grouped in the next step.

  1. Build the prompt to combine patient, conditions, and medications data.
SELECT
pname || ' has ' || conditions || ' taking ' || medications as patient_prompt
FROM (
    SELECT pname, 
    listagg(distinct condition,',') within group (order by pid) over (partition by pid) as conditions,
    listagg(distinct medication,',') within group (order by pid) over (partition by pid) as medications
    FROM patientsinfo) 
GROUP BY 1

The following is the sample output showing the results of the fully built prompt concatenating the patients, conditions, and medications into single column value.

  1. Create a materialized view with the preceding SQL query as the definition. This step isn’t mandatory; you’re creating the table for readability. Note that you might see a message indicating that materialized views with column aliases won’t be incrementally refreshed. You can safely ignore this message for the purpose of this illustration.
CREATE MATERIALIZED VIEW mv_prompts AUTO REFRESH YES
AS
(
SELECT pid,
pname || ' has ' || conditions || ' taking ' || medications as patient_prompt
FROM (
SELECT pname, pid,
listagg(distinct condition,',') within group (order by pid) over (partition by pid) as conditions,
listagg(distinct medication,',') within group (order by pid) over (partition by pid) as medications
FROM patientsinfo)
GROUP BY 1,2
)
  1. Run the following SQL to review the sample output.
SELECT * FROM mv_prompts limit 5;

The following is a sample output with a materialized view.

Enable LLM model access:

Perform the following steps to enable model access in Amazon Bedrock.

  1. Navigate to the Amazon Bedrock console.
  2. In the navigation pane, choose Model Access.

  1. Choose Enable specific models.
    You must have the required IAM permissions to enable access to available Amazon Bedrock FMs.

  1. For this illustration, use Anthropic’s Claude model. Enter Claude in the search box and select Claude from the list. Choose Next to proceed.

  1. Review the selection and choose Submit.

Create a model referencing the LLM model on Amazon Bedrock:

  1. Navigate back to Amazon Redshift Query Editor V2 or, if you didn’t use Query Editor V2, to the SQL editor you used to connect with Redshift data warehouse.
  2. Run the following SQL to create an external model referencing the anthropic.claude-v2 model on Amazon Bedrock. See Amazon Bedrock model IDs for how to find the model ID.
CREATE EXTERNAL MODEL patient_recommendations
FUNCTION patient_recommendations_func
IAM_ROLE '<<provide the arn of IAM role created in pre-requisites>>'
MODEL_TYPE BEDROCK
SETTINGS (
    MODEL_ID 'anthropic.claude-v2',
    PROMPT 'Generate personalized diet plan for following patient:');

Send the prompt and generate a personalized patient diet plan:

  1. Run the following SQL to pass the prompt to the function created in the previous step.
SELECT patient_recommendations_func(patient_prompt) 
FROM mv_prompts limit 2;
  1. You will get the output with the generated diet plan. You can copy the cells and paste in a text editor or export the output to view the results in a spreadsheet if you’re using Redshift Query Editor V2.

You will need to expand the row size to see the complete text.

Additional customization options

The previous example demonstrates a straightforward integration of Amazon Redshift with Amazon Bedrock. However, you can further customize this integration to suit your specific needs and requirements.

  • Inference functions as leader-only functions: Amazon Bedrock model inference functions can run as leader node-only when the query doesn’t reference tables. This can be helpful if you want to quickly ask an LLM a question.

You can run following SQL with no FROM clause. This will run as leader-node only function because it doesn’t need data to fetch and pass to the model.

SELECT patient_recommendations_func('Generate diet plan for pre-diabetes');

This will return a generic 7-day diet plan for pre-diabetes. The following figure is an output sample generated by the preceding function call.

  • Inference with UNIFIED request type models: In this mode, you can pass additional optional parameters along with input text to customize the response. Amazon Redshift passes these parameters to the corresponding parameters for the Converse API.

In the following example, we’re setting the temperature parameter to a custom value. The parameter temperature affects the randomness and creativity of the model’s outputs. The default value is 1 (the range is 0–1.0).

SELECT patient_recommendations_func(patient_prompt,object('temperature', 0.2)) 
FROM mv_prompts
WHERE pid=101;

The following is a sample output with a temperature of 0.2. The output includes recommendations to drink fluids and avoid certain foods.

Regenerate the predictions, this time setting the temperature to 0.8 for the same patient.

SELECT patient_recommendations_func(patient_prompt,object('temperature', 0.8)) 
FROM mv_prompts
WHERE pid=101;

The following is a sample output with a temperature of 0.8. The output still includes recommendations on fluid intake and foods to avoid, but is more specific in those recommendations.

Note that the output won’t be the same every time you run a particular query. However, we want to illustrate that the model behavior is influenced by changing parameters.

  • Inference with RAW request type models: CREATE EXTERNAL MODEL supports Amazon Bedrock-hosted models, even those that aren’t supported by the Amazon Bedrock Converse API. In those cases, the request_type needs to be raw and the request needs to be constructed during inference. The request is a combination of a prompt and optional parameters.

Make sure that you enable access to the Titan Text G1 – Express model in Amazon Bedrock before running the following example. You should follow the same steps as described previously in Enable LLM model access to enable access to this model.

-- Create model with REQUEST_TYPE as RAW

CREATE EXTERNAL MODEL titan_raw
FUNCTION func_titan_raw
IAM_ROLE '<<provide the arn of IAM role created in pre-requisites>>'
MODEL_TYPE BEDROCK
SETTINGS (
MODEL_ID 'amazon.titan-text-express-v1',
REQUEST_TYPE RAW,
RESPONSE_TYPE SUPER);

-- Need to construct the request during inference.
SELECT func_titan_raw(object('inputText', 'Generate personalized diet plan for following: ' || patient_prompt, 'textGenerationConfig', object('temperature', 0.5, 'maxTokenCount', 500)))
FROM mv_prompts limit 1;

The following figure shows the sample output.

  • Fetch run metrics with RESPONSE_TYPE as SUPER: If you need more information about an input request such as total tokens, you can request the RESPONSE_TYPE to be super when you create the model.
-- Create Model specifying RESPONSE_TYPE as SUPER.

CREATE EXTERNAL MODEL patient_recommendations_v2
FUNCTION patient_recommendations_func_v2
IAM_ROLE '<<provide the arn of IAM role created in pre-requisites>>'
MODEL_TYPE BEDROCK
SETTINGS (
MODEL_ID 'anthropic.claude-v2',
PROMPT 'Generate personalized diet plan for following patient:',
RESPONSE_TYPE SUPER);

-- Run the inference function
SELECT patient_recommendations_func_v2(patient_prompt)
FROM mv_prompts limit 1;

The following figure shows the output, which includes the input tokens, output tokens, and latency metrics.

Considerations and best practices

There are a few things to keep in mind when using the methods described in this post:

  • Inference queries might generate throttling exceptions because of the limited runtime quotas for Amazon Bedrock. Amazon Redshift retries requests multiple times, but queries can still be throttled because throughput for non-provisioned models might be variable.
  • The throughput of inference queries is limited by the runtime quotas of the different models offered by Amazon Bedrock in different AWS Regions. If you find that the throughput isn’t enough for your application, you can request a quota increase for your account. For more information, see Quotas for Amazon Bedrock.
  • If you need stable and consistent throughput, consider getting provisioned throughput for the model that you need from Amazon Bedrock. For more information, see Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock.
  • Using Amazon Redshift ML with Amazon Bedrock incurs additional costs. The cost is model- and Region-specific and depends on the number of input and output tokens that the model will process. For more information, see Amazon Bedrock Pricing.

Cleanup

To avoid incurring future charges, delete the Redshift Serverless instance or Redshift provisioned data warehouse created as part of the prerequisite steps.

Conclusion

In this post, you learned how to use the Amazon Redshift ML feature to invoke LLMs on Amazon Bedrock from Amazon Redshift. You were provided with step-by-step instructions on how to implement this integration, using illustrative datasets. Additionally, read about various options to further customize the integration to help meet your specific needs. We encourage you to try Redshift ML integration with Amazon Bedrock and share your feedback with us.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data services, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data services for banking and insurance clients across the globe.

Nikos Koulouris is a Software Development Engineer at AWS. He received his PhD from University of California, San Diego and he has been working in the areas of databases and analytics.

Unlock the potential of your supply chain data and gain actionable insights with AWS Supply Chain Analytics

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/unlock-the-potential-of-your-supply-chain-data-and-gain-actionable-insights-with-aws-supply-chain-analytics/

Today, we’re announcing the general availability of AWS Supply Chain Analytics powered by Amazon QuickSight. This new feature helps you to build custom report dashboards using your data in AWS Supply Chain. With this feature, your business analysts or supply chain managers can perform custom analyses, visualize data, and gain actionable insights for your supply chain management operations.

Here’s how it looks:

AWS Supply Chain Analytics leverages the AWS Supply Chain data lake and provides Amazon QuickSight embedded authoring tools directly into the AWS Supply Chain user interface. This integration provides you with a unified and configurable experience for creating custom insights, metrics, and key performance indicators (KPIs) for your operational analytics.

In addition, AWS Supply Chain Analytics provides prebuilt dashboards that you can use as-is or modify based on your needs. At launch, you will have the following prebuilt dashboards:

  1. Plan-Over-Plan Variance: Presents a comparison between two demand plans, showcasing variances in both units and values across key dimensions such as product, site, and time periods.
  2. Seasonality Analytics: Presents a year-over-year view of demand, illustrating trends in average demand quantities and highlighting seasonality patterns through heatmaps at both monthly and weekly levels.

Let’s get started
Let me walk you through the features of AWS Supply Chain Analytics.

The first step is to enable AWS Supply Chain Analytics. To do this, navigate to Settings, then select Organizations and choose Analytics. Here, I can Enable data access for Analytics.

Now I can edit existing roles or create a new role with analytics access. To learn more, visit User permission roles.

Once this feature is enabled, when I log in to AWS Supply Chain I can access the AWS Supply Chain Analytics feature by selecting either the Connecting to Analytics card or Analytics on the left navigation menu.

Here, I have an embedded Amazon QuickSight interface ready for me to use. To get started, I navigate to Prebuilt Dashboards.

Then, I can select the prebuilt dashboards I need in the Supply Chain Function dropdown list:

What I like the most about this prebuilt dashboards is I can easily get started. AWS Supply Chain Analytics will prepare all the datasets, analysis, and even a dashboard for me. I select Add to begin.

Then, I navigate to the dashboard page, and I can see the results. I can also share this dashboard with my team, which improves the collaboration aspect.

If I need to include other datasets for me to build a custom dashboard, I can navigate to Datasets and select New dataset.

Here, I have AWS Supply Chain data lake as an existing dataset for me to use.

Next, I need to select Create dataset.

Then, I can select a table that I need to include into my analysis. On the Data section, I can see all available fields. All data sets that start with asc_ are generated by AWS Supply Chain, such as data from Demand Planning, Insights, Supply Planning, and others.

I can also find all the datasets I have ingested into AWS Supply Chain. To learn more on data entities, visit the AWS Supply Chain documentation page. One thing to note here is if I have not ingested data into AWS Supply Chain Data Lake, I need to ingest data before using AWS Supply Chain Analytics. To learn how to ingest data into the data lake, visit the data lake page.

At this stage, I can start my analysis. 

Now available
AWS Supply Chain Analytics is now generally available in all regions where AWS Supply Chain is offered. Give it a try to experience and transform your operations with the AWS Supply Chain Analytics.

Happy building,
— Donnie

Streamline AI-driven analytics with governance: Integrating Tableau with Amazon DataZone

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/streamline-ai-driven-analytics-with-governance-integrating-tableau-with-amazon-datazone/

Amazon DataZone is a data management service that makes it faster and easier for customers to catalog, discover, share, and govern data stored across AWS, on premises, and from third-party sources. Amazon DataZone recently announced the expansion of data analysis and visualization options for your project-subscribed data within Amazon DataZone using the Amazon Athena JDBC driver.

Collaborating closely with our partners, we have tested and validated Amazon DataZone authentication via the Athena JDBC connection, providing an intuitive and secure connection experience for users. With this integration, you can now seamlessly query your governed data lake assets in Amazon DataZone using popular business intelligence (BI) and analytics tools, including partner solutions like Tableau.

Ali Tore, Senior Vice President of Advanced Analytics at Salesforce, highlighting the value of this integration, says

“We’re excited to partner with Amazon to bring Tableau’s powerful data exploration and AI-driven analytics capabilities to customers managing data across organizational boundaries with Amazon DataZone. This integration enables our customers to seamlessly explore data with AI in Tableau, build visualizations, and uncover insights hidden in their governed data, all while leveraging Amazon DataZone to catalog, discover, share, and govern data across AWS, on premises, and from third-party sources—enhancing both governance and decision-making.”

With this launch, Amazon DataZone strengthens its commitment to empowering enterprise customers with secure, governed access to data across the tools and platforms they rely on. For example, Guardant Health uses Amazon DataZone to democratize data access across its organization, enabling diverse teams to efficiently access, query, and analyze data tailored to their specific needs.

Rajesh Kucharlapati, Senior Director of Data, CRM, and Analytics at Guardant Health, says

“By harmonizing data across multiple business domains, we foster a culture of data sharing. Using Amazon DataZone lets us avoid building and maintaining an in-house platform, allowing our developers to focus on tailored solutions. Leveraging AWS’s managed service was crucial for us to access business insights faster, apply standardized data definitions, and tap into generative AI potential. We also needed an easy connection process for widely-used analytics tools like Tableau, DBeaver, and Domino, directly within Amazon DataZone projects. This new JDBC connectivity feature enables our governed data to flow seamlessly into these tools, supporting productivity across our teams.”

Use case

Amazon DataZone addresses your data sharing challenges and optimizes data availability. Here’s how:

  • Data product creation – As a data producer, you can create and catalog data products while enforcing governance, making your data findable, accessible, interoperable, and reusable (FAIR).
  • Streamlined access – As a data consumer, you can easily locate and subscribe to data from multiple sources within a single project. You can analyze this data using a variety of tools, including built-in AWS options such as Amazon Athena, Amazon Redshift, and Amazon SageMaker.
  • Integration with partner tools – The addition of support for partner analytics tools offers you greater flexibility and efficiency in your workflows. You can now use your tool of choice, including Tableau, to quickly derive business insights from your data while using standardized definitions and decentralized ownership. Refer to the detailed blog post on how you can use this to connect through various other tools.

Prerequisites

To get started, complete these steps:

  1. Download and install the latest Athena JDBC driver for Tableau.
  2. Copy the JDBC connection string from the Amazon DataZone portal into the JDBC connection configuration to establish a connection from Tableau. This will direct you to authenticate using single sign-on with your corporate credentials.

When you’re connected, you can query, visualize, and share data—governed by Amazon DataZone—within Tableau.

The following diagram shows the high-level architecture of the Tableau integration.

Solution walkthrough: Configure Tableau to access project-subscribed data assets

To configure Tableau to access project-subscribed data assets, follow these detailed steps:

  1. Download the latest Athena driver. If Tableau has the Athena driver preinstalled, it could be the older (v2) version. To confirm compatibility with Amazon DataZone, you’ll need the latest (v3) driver that includes the necessary authentication features. To download the latest JDBC driver version x, visit Athena JDBC 3.x driver.
  2. Install the driver. Copy the JDBC driver file to the appropriate folder for your operating system:
    • For macOS: ~/Library/Tableau/Drivers
    • For Windows: C:\Program Files\Tableau\Drivers
  3. On the Amazon DataZone console, select your project, as shown in the following screenshot of DataZone Console.
  4. To capture the JDBC connection parameters, follow these steps:
    1. On the project page, review the connection options under ANALYTICS TOOLS. Choose Connect with JDBC.
    2. In the JDBC parameters dialog box, select Using IDC auth and copy the JDBC URL. Optionally, you can use Using IAM auth to connect with your Amazon DataZone project as an AWS Identity and Access Management (IAM) role (from a server), provided that you are added as a project member within that project. The following screenshot shows the dialog box.
  5. To configure the Tableau desktop for connection, follow these steps:
    1. On the To a Server connection menu, select Other Databases (JDBC).
    2. Paste the copied JDBC URL into the URL field, leaving the other fields (Dialect, Username, Password) unchanged.
  6. To sign in with single sign-on, choose Sign in, as shown in the following screenshot. You’ll be redirected to authenticate with AWS IAM Identity Center. Use the credentials for your AWS single sign-on account.
  7. After you’re signed in, you’ll be prompted to authorize the DataZoneAuthPlugin. Choose Allow access to authorize access to Amazon DataZone from Tableau, as shown in the following screenshot.
  8. After the connection is established, a success message will appear, as shown in the following screenshot.

You can now view your project’s subscribed data directly within Tableau and build dashboards.

Conclusion

Amazon DataZone continues to expand its offerings, providing you with more flexibility in how you access, analyze, and visualize your subscribed data. With support for the Athena JDBC driver, you can now use a wide range of popular BI and analytics tools including Tableau, making governed data within Amazon DataZone more accessible than ever before.

In this post, you learned how the recent enhancements in Amazon DataZone facilitate a seamless connection with Tableau. By integrating Tableau with the comprehensive data governance capabilities of Amazon DataZone, we’re empowering data consumers to quickly and seamlessly explore and analyze their governed data. This integration helps organizations break down silos, foster collaboration, and make informed decisions, all while maintaining the security and control needed in today’s complex, distributed data landscape.

The feature is supported in all AWS commercial Regions where Amazon DataZone is currently available. Check out the video below and the detailed blog post to learn how to connect Amazon DataZone to external analytics tools via JDBC. Get started with our technical documentation.

Related blog posts


About the Authors

Ramesh H Singh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon DataZone team. He is passionate about building high-performance ML/AI and analytics products that enable enterprise customers to achieve their critical goals using cutting-edge technology. Connect with him on LinkedIn.

Adiascar Cisneros is a Tableau Senior Product Manager based in Atlanta, GA. He focuses on the integration of the Tableau Platform with AWS services to amplify the value users get from our products and accelerate their journey to valuable, actionable insights. His background includes analytics, infrastructure, network security, and migrations. Follow him on LinkedIn.

Joel Farvault is Principal Specialist SA Analytics for AWS with 25 years’ experience working on enterprise architecture, data governance and analytics, mainly in the financial services industry. Joel has led data transformation projects on fraud analytics, claims automation, and Master Data Management. He leverages his experience to advise customers on their data strategy and technology foundations.

Yogesh Dhimate is a Sr. Partner Solutions Architect at AWS, leading technology partnership with Tableau. Prior to joining AWS, Yogesh worked with leading companies including Salesforce driving their industry solution initiatives. With over 20 years of experience in product management and solutions architecture Yogesh brings unique perspective in cloud computing and artificial intelligence.

Ariana Rahgozar is a Sr. Senior Solutions Architect at AWS, leading customers design and implement technical solutions as part of their cloud journey.

Expanding data analysis and visualization options: Amazon DataZone now integrates with Tableau, Power BI, and more

Post Syndicated from Ramesh H Singh original https://aws.amazon.com/blogs/big-data/expanding-data-analysis-and-visualization-options-amazon-datazone-now-integrates-with-tableau-power-bi-and-more/

Amazon DataZone  now launched authentication supports through the  Amazon Athena JDBC driver, allowing data users to seamlessly query their subscribed data lake assets via popular business intelligence (BI) and analytics tools like Tableau, Power BI, Excel, SQL Workbench, DBeaver, and more. This integration empowers data users to access and analyze governed data within Amazon DataZone using familiar tools, boosting both productivity and flexibility.

Customers use Amazon DataZone to streamline data access and governance by enabling data users to locate and subscribe to data from multiple sources within a single project. Amazon DataZone natively integrates with Amazon-specific options like Amazon Athena, Amazon Redshift, and Amazon SageMaker, allowing users to analyze their project governed data. With this launch of JDBC connectivity, Amazon DataZone expands its support for data users, including analysts and scientists, allowing them to work in their preferred environments—whether it’s SQL Workbench, Domino, or Amazon-native solutions—while ensuring secure, governed access within Amazon DataZone.

Collaborating closely with our partners, we have tested and validated Amazon DataZone authentication via the Athena JDBC connection, providing an intuitive and secure connection experience for users. With this integration, you can now seamlessly query your governed data lake assets in Amazon DataZone using popular business intelligence (BI) and analytics tools, including partner solutions like Tableau.

Ali Tore, Senior Vice President of Advanced Analytics at Salesforce, highlighting the value of this integration, says

“We’re excited to partner with Amazon to bring Tableau’s powerful data exploration and AI-driven analytics capabilities to customers managing data across organizational boundaries with Amazon DataZone. This integration enables our customers to seamlessly explore data with AI in Tableau, build visualizations, and uncover insights hidden in their governed data, all while leveraging Amazon DataZone to catalog, discover, share, and govern data across AWS, on premises, and from third-party sources—enhancing both governance and decision-making.”

With this launch, Amazon DataZone strengthens its commitment to empowering enterprise customers with secure, governed access to data across the tools and platforms they rely on. For example, Guardant Health uses Amazon DataZone to democratize data access across its organization, enabling diverse teams to efficiently access, query, and analyze data tailored to their specific needs.

Rajesh Kucharlapati, Senior Director of Data, CRM, and Analytics at Guardant Health, says

“By harmonizing data across multiple business domains, we foster a culture of data sharing. Using Amazon DataZone lets us avoid building and maintaining an in-house platform, allowing our developers to focus on tailored solutions. Leveraging AWS’s managed service was crucial for us to access business insights faster, apply standardized data definitions, and tap into generative AI potential. We also needed an easy connection process for widely-used analytics tools like Tableau, DBeaver, and Domino, directly within Amazon DataZone projects. This new JDBC connectivity feature enables our governed data to flow seamlessly into these tools, supporting productivity across our teams.”

Getting started

To get started, download and install the latest Athena JDBC driver for your tool of choice. After installation, copy the JDBC connection string from the Amazon DataZone portal into the JDBC connection configuration to establish a connection from your tool. This will direct you to authenticate using single sign-on (SSO) with your corporate credentials. After connecting, you can query, visualize, and share data—governed by Amazon DataZone—within the tools you already know and trust.

In this post, we’ll guide you through connecting various analytics tools to Amazon DataZone using the Athena JDBC driver, enabling seamless access to your subscribed data within your Amazon DataZone projects.

Solution overview

To demonstrate these capabilities, consider a use case where your marketing team wants to drive a campaign that’s focused on product adoption. To achieve this, you need access to sales orders, shipment details, and customer data owned by the retail team. The retail team, acting as the data producer, publishes the necessary data assets to Amazon DataZone, allowing you, as a consumer, to discover and subscribe to these assets.

After the subscription is approved, the data assets become available within your marketing team’s project environment in Amazon DataZone. You can then use your preferred tool (for example, DBeaver, as shown in the following diagram) to perform data exploration.

Prerequisites

To follow along with this post, you need to have the following prerequisites in place:

  1. AWS account – You must have an active AWS account. If you don’t have one, see How do I create and activate a new AWS account?.
  2. Amazon DataZone resources – You need a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone project environment (DefaultDataLake environment with a DataLakeProfile).
  3. Publish data assets – As the data producer from the retail team, you must ingest individual data assets into Amazon DataZone. For this use case, create a data source and import the technical metadata of four data assets—customers, order_items, orders, products, reviews, and shipments—from AWS Glue Data Catalog. Ensure the data assets are enriched with business descriptions and published to the catalog.
  4. Subscribe data assets – As a data analyst from the marketing team, you must discover and subscribe to the data assets. The data producer from the retail team will review and approve your subscription. Upon successful fulfillment, the data assets will be added to your data lake environment. For detailed subscription instructions, see the Amazon DataZone User Guide.

The following figure shows the subscribed assets added to the data lake environment in your marketing project.

In the following sections, we will walk you through the steps to configure DBeaver to consume the subscribed assets from Amazon DataZone.

Configuring DBeaver to access subscribed data assets

In this section, you configure DBeaver to access the subscribed assets from the Marketing project

To configure DBeaver:

  1. Connect with JDBC: In the Amazon DataZone portal, navigate to the Marketing project, select the Environments tab and select Connect with JDBC.
    1. Select Marketing from the list in the top navigation are.
    2. Choose Environments
    3. Select Connect with JDBC.

  1. A new screen will display the JDBC connection parameters. Make sure to capture these details for configuring the database connection in DBeaver, including the JDBC URL, Domain ID, Environment ID, Region, and IDC Issuer URL.
  2. Download and install the latest Athena driver:
    • If DBeaver has the Athena driver pre-installed, it might be the older (v2) version. To ensure compatibility with Amazon DataZone, you need the latest driver (v3), which includes the necessary authentication features.
    • Download the latest JDBC driver—version 3.x.
    • To install the latest driver:
      • Go to Database and then to Driver Manager in DBeaver.
      • Select the Athena driver and choose Edit.
      • Choose Download to fetch the latest driver version.
      • If prompted, select the appropriate version and confirm the download.
  1. In the DBeaver SQL client, create a new database connection and select the Athena driver.
  2. In the Driver Properties section, enter the parameters that you captured from Amazon DataZone:
    • CredentialsProvider: The credentials provider to authenticate requests to AWS
    • DataZoneDomainId: The ID of your Amazon DataZone domain
    • DataZoneDomainRegion: The AWS Region where your domain is hosted.
    • DataZoneEnvironmentId: The ID of your DefaultDataLake environment.
    • IdentityCenterIssuerUrl: The issuer URL used by AWS IAM Identity Center for token issuance.
    • OutputLocation: Amazon S3 path for storing query results.
    • Region: The Region where the environment is created.
    • Workgroup: Amazon Athena workgroup of the environment.

  1. Choose Test connection.
  2. You will be redirected to the IAM Identity Center sign-in portal. Sign in with your credentials. If you’re already signed in through single sign-on (SSO), this step will be skipped.
  3. After you sign in, you will be prompted to authorize the DataZoneAuthPlugin. Choose Allow access to authorize access to Amazon DataZone from DBeaver.
  4. After the connection is established, a success message will appear as shown in the screenshot
  5. You can now view and query all subscribed assets directly within DBeaver.

These steps might also apply to other analytics tools and clients that support JDBC connections. If you’re using a different tool, you might need to adapt these instructions accordingly to ensure proper configuration and access to Amazon DataZone data assets.

Integration with other applications

You can use similar steps for other BI and analytics tools that support standard database connections.

Connect to Tableau Desktop

Use the Athena JDBC driver to connect Tableau to Amazon DataZone and visualize your subscribed data.

To connect to Tableau Desktop:

  1. Make sure that you’re using the latest Athena JDBC 3.x driver.
  2. Copy the JDBC driver file and place it in the appropriate folders for your operating system
    • For Mac OS: ~/Library/Tableau/Drivers
    • For Windows: C:\Program Files\Tableau\Drivers 
  3. Open Tableau Desktop. From the To a Server connection menu, select Other Databases (JDBC) to connect to Amazon DataZone.
  4. Paste the JDBC connection string you copied from the DataZone portal into the URL Leave other fields such as Dialect, Username, and Password blank and choose Sign in.
  5. This will redirect you to authenticate with IAM Identity Center. Enter the credentials of the Identity Center user that you used to sign in to the DataZone portal. Authorize the DataZoneAuthPlugin to access Amazon DataZone from Tableau. Once the connection is established with the success message, you now view your project’s subscribed data directly within Tableau and build dashboards.

See the Amazon DataZone and Tableau blog post for step-by-step instructions.

Connect to Microsoft Power BI

Now, let’s look at connecting Amazon DataZone with Microsoft Power BI on Windows.

While Amazon Athena provides a native ODBC driver for connecting to ODBC-compatible tools like Microsoft Power BI, it currently doesn’t support Amazon DataZone authentication. Therefore, in this post, we will use an ODBC-JDBC bridge to connect Amazon DataZone with Microsoft Power BI using the Athena JDBC driver, which supports DataZone authentication.

In this post, we’re using the ZappySys driver as the ODBC-JDBC bridge. This is a third-party solution that requires a separate licensing fee, which isn’t included in the AWS solution. You can choose to use any other solution for ODBC-JDBC bridge.

To connect to Power BI:

  1. Make sure that you have administrator privileges to run the ODBC Data Source Administrator.
  2. From the Windows Start menu, run the ODBC Data Source Administrator (the 64-bit version) using run as Administrator.
  3. Create a New Data Source with the ZappySys JDBC Bridge Driver. You will be prompted to enter your connection details.
  4. Paste the JDBC URL you copied from the DataZone portal in the Connection String, along with the driver class and JDBC driver file. Make sure that you’re using the latest Athena JDBC 3.x driver.
  5. Choose Test Connection. A new dialog window will pop up after the connection is successful.
  6. After configuring the data source, launch Power BI. Create a blank report or use an existing report to integrate the new visuals. Choose Get Data and select the name of the data source you created. This will open a new browser window to authenticate your credentials. Allow access to authorize the DataZone plugin. After authorization is complete, you can build your reports in Microsoft Power BI with the subscribed data assets.

Connect to SQL Workbench

Discover how SQL Workbench can connect to Amazon DataZone for users who prefer a SQL interface to query data lake tables and views subscribed through projects in Amazon DataZone.

To connect to SQL Workbench

  1. Make sure that you’re using the latest Athena JDBC 3.x driver.
  2. Open SQL Workbench/J and choose Manage Drivers.
  3. Select the option to add a new driver. Enter a name for it, such as DatazoneAthenaJDBC, and import the driver you downloaded in the previous steps.
  4. Create a new connection and enter a name it, such as datazone-profile. In the Driver option, select the driver you configured.
  5. For the URL, enter the string jdbc:athena://region=us-east-1; (In the example, the Virginia Region is being used). Choose Extended Properties.
  6. Under Extended Properties, add the following parameters that you copied from the DataZone portal and choose OK. You can also include these parameters in the JDBC (URL) connection string.

    1. The parameters to add are:
      • Workgroup
      • DataZoneEndpointOverride
      • OutputLocation
      • DataZoneDomainId
      • IdentityCenterIssuerURL
      • CredentialsProvider
      • DatazoneEnvironmentId
      • DataZoneDomainRegain

  1. You will be prompted to sign in and authenticate. Allow access and authorization to Amazon DataZone.
  2. After successful connection, in SQL Workbench/J, under Database Explorer, select the desired database. For example, select the database that has access to the subscribed data asset orders. Select the data asset and execute the query.

Cleanup

To ensure no additional charges are incurred after testing, be sure to delete the Amazon DataZone domain. See Delete Amazon DataZone domains for instructions.

Conclusion

Amazon DataZone continues to expand its offerings, providing you with more flexibility to access, analyze, and visualize your subscribed data. With support for the Athena JDBC driver, you can now use a wide range of popular BI and analytics tools, making data accessed through Amazon DataZone more accessible than ever before. Whether you’re using Tableau, Power BI, or other familiar tools, the integration with Amazon DataZone ensures that your data remains secure and accessible to authorized users.

The feature is supported in all AWS commercial Regions where Amazon DataZone is currently available. Watch the video below to learn how to connect Amazon DataZone to external analytics tools via JDBC. Get started with our technical documentation.


About the Authors

Ramesh H Singh is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon DataZone team. He is passionate about building high-performance ML/AI and analytics products that enable enterprise customers to achieve their critical goals using cutting-edge technology. Connect with him on LinkedIn.

Eric Fleishman is a software engineer at AWS in Seattle. He loves diving into cloud technology and solving complex problems to build impactful solutions. Outside of work, he is all about staying active—whether its snowboarding down the slopes or working out. He enjoys pushing his limits and embracing new challenges.

Theo Tolv is a Senior Analytics Architect based in Stockholm, Sweden. He’s worked with small and big data for most of his career, and has built applications running on AWS since 2008. In his spare time he likes to tinker with electronics and read space opera.

Joel Farvault is Principal Specialist SA Analytics for AWS with 25 years’ experience working on enterprise architecture, data governance and analytics, mainly in the financial services industry. Joel has led data transformation projects on fraud analytics, claims automation, and Master Data Management. He leverages his experience to advise customers on their data strategy and technology foundations.

Lakshmi Nair is a Senior Analytics Specialist Solutions Architect at AWS. She specializes in designing advanced analytics systems across industries. She focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Fabricio Hamada is a Senior Data Strategy Solutions Architect at AWS.

Lionel Pulickal is Sr. Solutions Architect at AWS

Modernize your legacy databases with AWS data lakes, Part 3: Build a data lake processing layer

Post Syndicated from Anoop Kumar K M original https://aws.amazon.com/blogs/big-data/modernize-your-legacy-databases-with-aws-data-lakes-part-3-build-a-data-lake-processing-layer/

This is the final part of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to process data with Amazon Redshift Spectrum and create the gold (consumption) layer. To review the first two parts of the series where we load data from SQL Server into Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS) and load the data into the silver layer of the data lake, see the following:

Solution overview

Choosing the right tools and technology stack to build the data lake in order to build a scalable solution and have shorter time to market is critical. In this post, we go over the process of building a data lake, providing rationale behind the different decisions, and share best practices when building such a data solution.

The following diagram illustrates the different layers of the data lake.

The data lake is designed to serve a multitude of use cases. In the silver layer of the data lake, the data is stored as it is loaded from sources, preserving the table and schema structure. In the gold layer, we create data marts by combining, aggregating, and enriching data as required by our use cases. The gold layer is the consumption layer for the data lake. In this post, we describe how you can use Redshift Spectrum as an API to query data.

To create data marts, we use Amazon Redshift Query Editor. It provides a web-based analyst workbench to create, explore, and share SQL queries. In our use case, we use Redshift Query Editor to create data marts using SQL code. We also use Redshift Spectrum, which allows you to efficiently query and retrieve structured and semi-structured data from files stored on Amazon S3 without having to load the data into the Redshift tables. The Apache Iceberg tables, which we created and cataloged in Part 2, can be queried using Redshift Spectrum. For the latest information on Redshift Spectrum integration with Iceberg, see Using Apache Iceberg tables with Amazon Redshift.

We also show how to use RedshiftDataAPIService to run SQL commands to query the data mart using a Boto3 Python SDK. You can use the Redshift Data API to create the resulting datasets on Amazon S3, and then use the datasets in use cases such as business intelligence dashboards and machine learning (ML).

In this post, we walk through the following steps:

  1. Set up a Redshift cluster.
  2. Set up a data mart.
  3. Query the data mart.

Prerequisites

To follow the solution, you need to set up certain access rights and resources:

  • An AWS Identity and Access Management (IAM) role for the Redshift cluster with access to an external data catalog in AWS Glue and data files in Amazon S3 (these are the data files populated by the silver layer in Part 2). The role also needs Redshift cluster permissions. This policy must include permissions to do the following:
    • Run SQL commands to copy, unload, and query data with Amazon Redshift.
    • Grant permissions to run SELECT statements for related services, such as Amazon S3, Amazon CloudWatch logs, Amazon SageMaker, and AWS Glue.
    • Manage AWS Lake Formation permissions (in case the AWS Glue Data Catalog is managed by Lake Formation).
  • An IAM execution role for AWS Lambda with permissions to access Amazon Redshift and AWS

For more information about setting up IAM roles for Redshift Spectrum, see Getting started with Amazon Redshift Spectrum.

Set up a Redshift cluster

Redshift Spectrum is a feature of Amazon Redshift that queries data stored in Amazon S3 directly, without having to load it into Amazon Redshift. In our use case, we use Redshift Spectrum to query Iceberg data stored as Parquet files on Amazon S3. To use Redshift Spectrum, we first need a Redshift cluster to run the Redshift Spectrum compute jobs. Complete the following steps to provision a Redshift cluster:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose Create cluster.
  3. For Cluster identifier, enter a name for your cluster.
  4. For Choose the size of the cluster, select I’ll choose.
  5. For Node type, choose xlplus.
  6. For Number of nodes, enter 1.

can

  1. For Admin password, select Manage admin credentials in AWS Secrets Manager if you want to use Secrets Manager, otherwise you can generate and store the credentials manually.

  1. For the IAM role, choose the IAM role created in the prerequisites.
  2. Choose Create cluster.

We chose the cluster Availability Zone, number of nodes, compute type, and size for this post to minimize costs. If you’re working on larger datasets, we recommend reviewing the different instance types offered by Amazon Redshift to select the one that is appropriate for your workloads.

Set up a data mart

A data mart is a collection of data organized around a specific business area or use case, providing focused and quickly accessible data for analysis or consumption by applications or users. Unlike a data warehouse, which serves the entire organization, a data mart is tailored to the specific needs of a particular department, allowing for more efficient and targeted data analysis. In our use case, we use data marts to create aggregated data from the silver layer and store it in the gold layer for consumption. For our use case, we use the schema HumanResources in the AdventureWorks sample database we loaded in Part 1 (FIX LINK). This database contains a factory’s employee shift information for different departments. We use this database to create a summary of the shift rate changes for different departments, years, and shifts to see which years had the most rate changes.

We recommend using the auto mount feature in Redshift Spectrum. This feature removes the need to create an external schema in Amazon Redshift to query tables cataloged in the Data Catalog.

Complete the following steps to create a data mart:

  1. On the Amazon Redshift console, choose Query editor v2 in the navigation pane.
  2. Choose the cluster you created and choose AWS Secrets Manager or Database username and password depending on how you chose to store the credentials.
  3. After you’re connected, open a new query editor.

You will be able to see the AdventureWorks database under awsdatacatalog. You can now start querying the Iceberg database in the query editor.

query-editor

If you encounter permission issues, choose the options menu (three dots) next to the cluster, choose Edit connection, and connect using Secrets Manager or your database user name and password. Then grant privileges for the IAM user or role with the following command, and reconnect with your IAM identity:

GRANT USAGE ON DATABASE awsdatacatalog to "IAMR:MyRole"

For more information, see Querying the AWS Glue Data Catalog.

Next, you create a local schema to store the definition and data for the view.

  1. On the Create menu, choose Schema.
  2. Provide a name and set the type as local.
  3. For the data mart, create a dataset that combines different tables in the silver layer to generate a report of the total shift rate changes by department, year, and shift. The following SQL code will return the required dataset:
SELECT dep.name AS "Department Name",
extract(year from emp_pay_hist.ratechangedate) AS "Rate Change Year",
shift.name AS "Shift",
COUNT(emp_pay_hist.rate) AS "Rate Changes"
FROM "dev"."{redshift_schema_name}"."department" dep
INNER JOIN "dev"."{redshift_schema_name}"."employeedepartmenthistory" emp_hist
ON dep.departmentid = emp_hist.departmentid
INNER JOIN "dev"."{redshift_schema_name}"."employeepayhistory" emp_pay_hist
ON emp_pay_hist.businessentityid = emp_hist.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."employee" emp
ON emp_hist.businessentityid = emp.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."shift" shift
ON emp_hist.shiftid = shift.shiftid
WHERE emp.currentflag = 'true'
GROUP BY dep.name, extract(year from emp_pay_hist.ratechangedate), shift.name;
  1. Create an internal schema where you want Amazon Redshift to store the view definition:

CREATE SCHEMA IF NOT EXISTS {internal_schema_name};

  1. Create a view in Amazon Redshift that you can query to get the dataset:
CREATE OR REPLACE VIEW {internal_schema_name}.rate_changes_by_department_year AS
SELECT dep.name AS "Department Name",
extract(year from emp_pay_hist.ratechangedate) AS "Rate Change Year",
shift.name AS "Shift",
COUNT(emp_pay_hist.rate) AS "Rate Changes"
FROM "dev"."{redshift_schema_name}"."department" dep
INNER JOIN "dev"."{redshift_schema_name}"."employeedepartmenthistory" emp_hist
ON dep.departmentid = emp_hist.departmentid
INNER JOIN "dev"."{redshift_schema_name}"."employeepayhistory" emp_pay_hist
ON emp_pay_hist.businessentityid = emp_hist.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."employee" emp
ON emp_hist.businessentityid = emp.businessentityid
INNER JOIN "dev"."{redshift_schema_name}"."shift" shift
ON emp_hist.shiftid = shift.shiftid
WHERE emp.currentflag = 'true'
GROUP BY dep.name, extract(year from emp_pay_hist.ratechangedate), shift.name
WITH NO SCHEMA BINDING;

If the SQL takes a long time to run or produces a large result set, consider using Redshift Unlike regular views, which are computed in the moment, the results from materialized views can be pre-computed and stored on Amazon S3. When the data is requested, Amazon Redshift can point to an Amazon S3 location where the results are stored. Materialized views can be refreshed on demand and on a schedule.

Query the data mart

Lastly, we query the data mart using a Lambda function to show how the data can be retrieved using an API. The Lambda function requires an IAM role to access Secrets Manager where the Redshift user credentials are stored. We use the Redshift Data API to retrieve the dataset we created in the previous step. First, we call the execute_statement() command to run the view. Next , we check the status of the run by calling the describe_statement() call. Finally , when the statement has successfully run, we use the get_statement_result() call to get the result set. The Lambda function shown in the following code implements this logic and returns the result set from querying the view rate_changes_by_department_year:

import json
import boto3
import time

def lambda_handler(event, context):
	client = boto3.client('redshift-data')

	# Use the Redshift execute statement api to query the data mart
	response = client.execute_statement(
	ClusterIdentifier='{redshift cluster name}',
	Database='dev',
	SecretArn='{redshift cluster secrets manager secret arn}',
	Sql='select * from {internal_schema_name}.rate_changes_by_department_year',
	StatementName='query data mart'
	)

	statement_id = response["Id"]
	query_status = True
	resultSet = []

	# Check the status of the sql statement, once the statement has finished executing we can retrive the resultset
	while query_status:
	if client.describe_statement(Id=statement_id)["Status"] == "FINISHED":

	print("SQL statement has finished successfully and we can get the resultset")

	response = client.get_statement_result(
	Id=statement_id
	)
	columns = response["ColumnMetadata"]
	results = response["Records"]
	while "NextToken" in response:
	response = client.get_servers(NextToken=response["NextToken"])
	results.extend(response["Records"])

	resultSet.append(str(columns[0].get("label")) + "," + str(columns[1].get("label")) + "," + str(columns[2].get("label")) + "," + str(columns[3].get("label")))

	for result in results:
	resultSet.append(str(result[0].get("stringValue")) + "," + str(result[1].get("longValue")) + "," + str(result[2].get("stringValue")) + "," + str(result[3].get("longValue")))

	query_status = False

	# In case the statement runs into errors we abort the resultset retrival
	if client.describe_statement(Id=statement_id)["Status"] == "ABORTED" or client.describe_statement(Id=statement_id)["Status"] == "FAILED":
	query_status = False
	print("SQL statement has failed or aborted")

	# To avoid spamming the API with requests on the status of the statement, we introduce a 2 second wait between calls
	else:
	print("Query Status ::" + client.describe_statement(Id=statement_id)["Status"])
	time.sleep(2)

	return {
	'statusCode': 200,
	'body': resultSet
	}

The Redshift Data API allows you to access data from many different types of traditional, cloud-based, containerized, web service-based, and event-driven applications. The API is available in many programming languages and environments supported by the AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++. For larger datasets that don’t fit into memory, such as ML training datasets, you can use the Redshift UNLOAD command to move the results of the query to an Amazon S3 location.

Clean up

In this post, you created an IAM role, Redshift cluster, and Lambda function. To clean up your resources, complete the following steps:

  1. Delete the IAM role:
    1. On the IAM console, choose Roles in the navigation pane.
    2. Select the role and choose Delete.
  2. Delete the Redshift cluster:
    1. On the Amazon Redshift console, choose Clusters in the navigation pane.
    2. Select the cluster you created and on the Actions menu, choose Delete.
  3. Delete the Lambda function:
    1. On the Lambda console, choose Functions in the navigation pane.
    2. Select the function you created and on the Actions menu, choose Delete.

Conclusion

In this post, we showed how you can use Redshift Spectrum to create data marts on top of the data in your data lake. Redshift Spectrum can query Iceberg data stored in Amazon S3 and cataloged in AWS Glue. You can create views in Amazon Redshift that compute the results from the underlying data on demand, or pre-compute results and store them (using materialized views). Lastly, the Redshift Data API is a great tool for running SQL queries on the data lake from a wide variety of sources.

For more insights into the Redshift Data API and how to use it, refer to Using the Amazon Redshift Data API to interact with Amazon Redshift clusters. To continue to learn more about building a modern data architecture, refer to Analytics on AWS.


About the Authors

Shaheer Mansoor is a Senior Machine Learning Engineer at AWS, where he specializes in developing cutting-edge machine learning platforms. His expertise lies in creating scalable infrastructure to support advanced AI solutions. His focus areas are MLOps, feature stores, data lakes, model hosting, and generative AI.

Anoop Kumar K M is a Data Architect at AWS with focus in the data and analytics area. He helps customers in building scalable data platforms and in their enterprise data strategy. His areas of interest are data platforms, data analytics, security, file systems and operating systems. Anoop loves to travel and enjoys reading books in the crime fiction and financial domains.

Sreenivas Nettem is a Lead Database Consultant at AWS Professional Services. He has experience working with Microsoft technologies with a specialization in SQL Server. He works closely with customers to help migrate and modernize their databases to AWS.

Improve OpenSearch Service cluster resiliency and performance with dedicated coordinator nodes

Post Syndicated from Akshay Zade original https://aws.amazon.com/blogs/big-data/improve-opensearch-service-cluster-resiliency-and-performance-with-dedicated-coordinator-nodes/

Today, we are announcing dedicated coordinator nodes for Amazon OpenSearch Service domains deployed on managed clusters. When you use Amazon OpenSearch Service to create OpenSearch domains, the data nodes serve dual roles of coordinating data-related requests like indexing requests, and search requests, and of doing the work of processing the requests – indexing documents and responding to search queries. Additionally, data nodes also serve the OpenSearch Dashboards. Because of these multiple responsibilities, data nodes can become a hot spot in the OpenSearch Service domain, leading to resource scarcity, and ultimately node failures. Dedicated coordinator nodes help you mitigate this problem by limiting the request coordination and Dashboards to the coordinator nodes, and request processing to the data nodes. This leads to more resilient, scalable domains.

Amazon OpenSearch Service is a managed service that you can use to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. The service allows you to configure clusters with different types of nodes such as data nodes, dedicated cluster manager nodes, and UltraWarm nodes. When you send requests to your OpenSearch Service domain, the request is broadcast to the nodes with shards that will process that request. By assigning roles through deploying dedicated nodes, like dedicated cluster manager nodes, you concentrate the processing of those kinds of requests and remove that processing from nodes in other roles.

OpenSearch Service has recently expanded its node type options to include dedicated coordinator nodes, alongside data nodes, dedicated cluster manager nodes, and UltraWarm nodes. These dedicated coordinator nodes offload coordination tasks and dashboard hosting from data nodes, freeing up CPU and memory resources. By provisioning dedicated coordinator nodes, you can improve a cluster’s overall performance and resiliency. Dedicated coordinator nodes also let you scale the coordination capacity of your cluster independently of the data storage capacity. Dedicated coordinator nodes are available in Amazon OpenSearch Service for all OpenSearch engine versions. See the documentation for engine and version support.

A brief introduction to coordination

OpenSearch operates as a distributed system, where data is stored in multiple shards across various nodes. Consequently, a node handling a request must coordinate with several other nodes to store or retrieve data.

Here are a few examples of coordination operations performed to successfully serve different user requests:

  • A bulk indexing request might contain data that belongs to multiple shards. The coordination process splits such a request into multiple shard-specific subrequests and routes them to the corresponding shards for indexing.
  • A search request might require querying various shards that are present in different nodes. The coordination process splits the request into multiple shard level search requests and sends those requests to the corresponding data nodes holding the data. Each of those data nodes processes the data locally and returns a shard-level response. The coordination process gathers these responses and builds the final response.
  • For queries with aggregations, the coordination process performs the additional computation of re-aggregating the aggregation responses from data nodes.

In OpenSearch Service, each data node is implicitly capable of coordination. In the absence of dedicated coordinator nodes, the data node receiving the request will perform the coordinating duties, though it might not have the relevant shards for the request. By adding dedicated coordinator nodes to a cluster, you can reduce the burden on data nodes. The following sections walk through some of the improvements.

Higher indexing and search throughput

In an OpenSearch cluster, each indexing request goes through three broad phases: coordination, primary, and replica. With coordination responsibilities offloaded to dedicated coordinators, the data nodes have more resources at their disposal for the primary and replica phases. By adding coordinator nodes, we observed as much as 15% higher indexing throughput in workloads such as Stack Overflow and Big5.

A search request in OpenSearch can involve something as trivial as looking up a single document by ID or something complex, such as bucketing a large amount of data and performing aggregations on each of the buckets. The impact of adding dedicated coordinator nodes can vary widely depending on the query. In a query workload containing date histograms with multiple aggregations such as average, p50, p99, and so on, we were able to achieve about 20% higher throughput. The term and multi-term aggregations also benefit from the addition of coordinator nodes. Depending on the key composition throughput improvement of 15% to 20% was observed.

More resilient clusters

Dedicated coordinator nodes provide a separation of responsibilities that prevents data nodes from being overwhelmed by complex queries or sudden spikes in request volume. In the case of complex aggregations, the coordinator nodes absorb the CPU impact ensuring that the data nodes focus on filtering, matching, scoring, sorting, and returning the search response, and maintaining the integrity of the data. In addition to coordination responsibilities, coordinator nodes also serve the OpenSearch Dashboards frontend. This ensures that the dashboards stay responsive even during high loads, ensuring a smooth user experience.

Complex aggregations consume a lot of memory. Memory intensive operations can lead to out of memory (OOM) errors causing node crashes and data loss. By adding dedicated coordinator nodes in a cluster, you can isolate the impact away from the data nodes. Coordinator nodes can greatly improve performance by significantly reducing or even completely eliminating query-induced OOM errors on data nodes. Because coordinator nodes don’t hold any data, the cluster still remains functional even if one of the coordinator nodes fails.

Efficient scaling

Dedicated coordinator nodes separate a cluster’s coordination capacity from data storage capacity. This allows you to choose the amount of memory and CPU required for your workload without impacting the stored data. For example, a cluster with high throughput might require a lot of lightweight nodes while a cluster with complex aggregations should have fewer but larger nodes.

Having a dedicated coordinator node allows you to adjust the number of nodes according to anticipated traffic patterns. For example, you can scale up the number of coordinators in high traffic hours and scale them down during low traffic hours.

Smaller IP reservations for VPC domains

With dedicated coordinator nodes, you can achieve up to 90% reduction in the number of IP addresses reserved by the service in your VPC. This reduction allows deployments of larger clusters that might otherwise face resource constraints.

When you create a virtual private cloud (VPC) domain without dedicated coordinator nodes, OpenSearch Service places an elastic network interface (ENI) in the VPC for each data node. Each ENI is assigned an IP address. At the time of domain creation, the service reserves three IP addresses for each data node. See Architecture for more information. When dedicated coordinator nodes are used, the ENIs are attached to the coordinator nodes instead of the data nodes. Because there are typically fewer coordinator nodes than data nodes fewer IP addresses are reserved. The following diagram shows the domain architecture of a VPC domain with dedicated coordinator nodes.

Picking the right configuration

OpenSearch Service offers two key parameters for managing dedicated coordinator nodes:

  1. Instance type, which determines the memory and compute capacity of each coordinator node.
  2. Instance count, which specifies the number of coordinator nodes.

Identify your use case

To get the most benefits out of coordinator nodes, you must pick the right type as well as the right count. As a general rule, we recommend that you set the count to 10% of the number of data nodes and choose a size that’s similar to the size of the data nodes. See the documentation to find out the supported instance types for dedicated coordinator nodes. The following guidelines should help tailor the configuration further to specific workloads:

  • Indexing: Indexing requires compute power to split the bulk upload request payload into shard-specific chunks. We recommend using CPU optimized instances of a size similar to that of the data nodes. While the count is dependent on the indexing throughput that you want to achieve, 10% of the number of data nodes is a good starting point.
  • High search throughput: Achieving high search throughput requires a lot of network capacity. Increasing the number of coordinator nodes will sustain the traffic load while providing high availability. We recommend setting the coordinator node count at from 10% to 15% of the number of data nodes.
  • Complex aggregations: Aggregations are memory intensive. For example, to calculate a p50 value, a coordinator node must first gather the entire dataset in memory. Moreover, crunching these numbers requires CPU cycles. We recommend that you use general purpose coordinator nodes that are one size larger than the data nodes. While the node count can be tuned by the use case, 8% to 10% of the number of data nodes is a good start.

Coordinator metrics

While the guidelines above are a good start, every use case is unique. To arrive at an optimal configuration, you must experiment with your own workload, observe the performance, and identify the bottlenecks. OpenSearch Service provides some key metrics and APIs to observe how coordinator nodes are doing.

  • CoordinatorCPUUtilization metric: This metric provides information about how much CPU is being consumed on the coordinator nodes. This metric is available at both the node and the cluster levels. If you see CPU consistently breaching the 80% mark, it might be a time to use larger coordinator nodes.
  • CoordinatorJVMMemoryPressure, CoordinatorJVMGCOldCollectionCount and CoordinatorJVMGCOldCollectionTime metrics: The CoordinatorJVMMemoryPressure metric indicates the percentage of JVM memory used by the OpenSearch process. This metric is available at both the cluster and node levels. Consistently high JVM memory pressure suggests that coordination tasks are using memory efficiently. It’s important to assess this metric alongside the JVM garbage collection (GC) metrics, which show how many old generation GC runs have been triggered and how long they lasted. In a properly scaled cluster, GC runs should be infrequent and short. If GC runs occur too often, they might also negatively impact CPU performance.
  • CoordinatingWriteRejected metric: This metric should be evaluated alongside other metrics, such as PrimaryWriteRejected and ReplicaWriteRejected. An increase in primary or replica write rejections suggests that the data nodes are underscaled and unable to process requests quickly enough. However, if the CoordinatingWriteRejected metric rises independently of the other two, it indicates that the coordinating node is struggling to handle the indexing coordination process, preventing it from processing queued requests. Indexing requires many resources, any of which could be a bottleneck. You can alleviate indexing pressure where the CPU is the bottleneck with more or larger instances that have more vCPUs.
  • Circuit breaker statistics API: Circuit breakers prevent OpenSearch from causing a Java OutOfMemoryError. The circuit breaker statistics for coordinator nodes can be retrieved with following API:
    _nodes/coordinating_only:true/stats/breaker
    Every time a circuit breaker trips for a request the client receives a 429 error with the circuit_breaking_exception message. These indicate that the result size of the request was too big to fit on a coordinator node. To avoid these errors, it’s recommended to use an instance with more memory.

Provision a dedicated coordinator node

You can add one or more dedicated coordinator nodes by updating the domain configuration with the appropriate options for coordinator nodes. This will trigger a blue/green deployment, and the domain will have dedicated coordinator nodes once the deployment is complete. Alternatively, you can create a new domain with dedicated coordinator nodes.

In either scenario, you can expand or reduce the number of coordinator nodes without requiring a blue/green deployment, giving you the flexibility to experiment.

Conclusion

In real-world production environments, dedicated coordinator nodes in Amazon OpenSearch Service provide an effective way to separate coordination tasks from data processing. This shift enhances resource efficiency, often delivering up to a 15% increase in indexing throughput and a 20% improvement in query performance, depending on workload demands. By offloading coordination tasks, you reduce the risk of node overloads, improve system stability, and gain better cost control by scaling coordination and data tasks independently.

For workloads with complex queries and high traffic, dedicated coordinator nodes help ensure that your cluster maintains optimal performance and is prepared to handle future growth with greater resilience. Start experimenting with dedicated coordinator nodes today to unlock more efficient resource management and enhanced performance in your OpenSearch clusters.


About the author

Akshay Zade is a Senior SDE working for Amazon OpenSearch Service, passionate about solving real-world problems with the power of large-scale distributed systems. Outside of work, he enjoys drawing, painting, and diving into fantasy books.

Control your AWS Glue Studio development interface with AWS Glue job mode API property

Post Syndicated from Shovan Kanjilal original https://aws.amazon.com/blogs/big-data/control-your-aws-glue-studio-development-interface-with-aws-glue-job-mode-api-property/

In recent years, as the importance of big data has grown, efficient data processing and analysis have become crucial factors in determining a company’s competitiveness. AWS Glue, a serverless data integration service for integrating data across multiple data sources at scale, addresses these data processing needs. Among its features, the AWS Glue Jobs API stands out as a particularly noteworthy tool.

The AWS Glue Jobs API is a robust interface that allows data engineers and developers to programmatically manage and run ETL jobs. By using this API, it becomes possible to automate, schedule, and monitor data pipelines, enabling efficient operation of large-scale data processing tasks.

To improve customer experience with the AWS Glue Jobs API, we added a new property describing the job mode corresponding to script, visual, or notebook. In this post, we explore how the updated AWS Glue Jobs API works in depth and demonstrate the new experience with the updated API.

JobMode property

A new property JobMode describes the mode of AWS Glue jobs (script, visual, or notebook) to improve your UI experience. AWS Glue users can use the mode that best fits your preference. Some extract, transform, and load (ETL) developers prefer to use visual mode and create visual jobs using AWS Glue Studio visual editor. Some data scientists prefer to use notebooks jobs and use AWS Glue Studio notebooks. Some data engineers and developers prefer to implement script through the AWS Glue Studio script editor or preferred integrated development environment (IDE). After the job is created with the preferred mode, you can search for it by filtering on the job mode within your saved AWS Glue jobs page and find it easily. Additionally, if you are migrating existing iPython notebook files to AWS Glue Studio notebook jobs, you can now choose and set the job mode and do so for multiple jobs using this new API property, as demonstrated in this post.

How CreateJob API works with the new JobMode property

You can use CreateJob API to create AWS Glue script or a visual or notebook job. The following is an example of how it works for a visual job using AWS SDK for Python (Boto3): (replace <your-bucket-name> with your S3 bucket)

CODE_GEN_JSON_STR = '''
{
  "node-1": {
    "S3ParquetSource": {
      "Name": "Amazon S3",
      "Paths": [
        "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/"
      ],
      "Exclusions": [],
      "Recurse": true,
      "AdditionalOptions": {
        "EnableSamplePath": false,
        "SamplePath": "s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Books/73612da260b94159b705cf4df12364cb_0.snappy.parquet"
      },
      "OutputSchemas": [
        {
          "Columns": [
            {
              "Name": "marketplace",
              "Type": "string"
            },
            {
              "Name": "customer_id",
              "Type": "string"
            },
            {
              "Name": "review_id",
              "Type": "string"
            },
            {
              "Name": "product_id",
              "Type": "string"
            },
            {
              "Name": "product_title",
              "Type": "string"
            },
            {
              "Name": "star_rating",
              "Type": "bigint"
            },
            {
              "Name": "helpful_votes",
              "Type": "bigint"
            },
            {
              "Name": "total_votes",
              "Type": "bigint"
            },
            {
              "Name": "insight",
              "Type": "string"
            },
            {
              "Name": "review_headline",
              "Type": "string"
            },
            {
              "Name": "review_body",
              "Type": "string"
            },
            {
              "Name": "review_date",
              "Type": "timestamp"
            },
            {
              "Name": "review_year",
              "Type": "bigint"
            }
          ]
        }
      ]
    }
  },
  "node-2": {
    "DropFields": {
      "Name": "Drop Fields",
      "Inputs": [
        "node-1"
      ],
      "Paths": [
        [
          "review_headline"
        ],
        [
          "review_body"
        ],
        [
          "review_date"
        ]
      ]
    }
  },
  "node-3": {
    "S3DirectTarget": {
      "Name": "Amazon S3",
      "Inputs": [
        "node-2"
      ],
      "PartitionKeys": [],
      "Path": "s3://<your-bucket-name>/data/jobmode-blog/output/parquet/",
      "Compression": "snappy",
      "Format": "parquet",
      "SchemaChangePolicy": {
        "EnableUpdateCatalog": false
      }
    }
  }
}
'''

glue_client = boto3.client('glue')
codeGenJson = json.loads(constants.CODE_GEN_JSON_STR, strict=False)

# Call the create_job method
try:
    glue_client.create_job(
        Name="glue-visual-job",
        Description="Glue Visual ETL job",
        Command={'Name': 'glueetl', 'ScriptLocation': "s3://aws-glue-assets-<account-id>-<region>/scripts/glue-visual-job", 'PythonVersion': "3"},
        WorkerType=constants.WORKERTYPE,
        NumberOfWorkers="G.1X",
        Role=<role-arn>,  
        GlueVersion="4.0",        
        CodeGenConfigurationNodes=codeGenJson,
        JobMode="VISUAL"
    )
    print("Successfully created Glue job")
except Exception as e:
    print(f"Error creating Glue job: {str(e)}")

CODE_GEN_JSON_STR represents the visual nodes for the AWS Glue Job. There are three nodes: node-1 uses S3 source, node-2 does transformation, and node-3 uses S3 target. The script instantiates the AWS Glue Boto3 client, loads the JSON, and calls the create_job. JobMode is set to VISUAL.

After you run the Python script, a new job is created. The following screenshot shows how the created job looks in AWS Glue visual editor.

There are three nodes in the visual directed acyclic graph (DAG): node 1 sources product review data for the product_category book from the public S3 bucket, node-2 drops some of the fields that aren’t needed for downstream systems, and node-3 persists the transformed data in a local S3 bucket.

How CloudFormation works with the new JobMode property

You can use AWS CloudFormation to create different types of AWS Glue jobs by specifying the JobMode parameter with the AWS::Glue::Job resource. The supported job modes include:

  • SCRIPT
  • VISUAL
  • NOTEBOOK

In this example, you create a AWS Glue notebook job using AWS CloudFormation, which requires setting the JobMode parameter to NOTEBOOK.

  1. Create a Jupyter Notebook file containing your logic and code, and save the notebook file with a descriptive name, such as my-glue-notebook.ipynb. Alternatively you can download the notebook file, and rename it to my-glue-notebook.ipynb.
  2. Upload the Notebook file to the notebooks/ folder within the aws-glue-assets-<account-id>-<region> S3 bucket.
  3. Create a new CloudFormation template to create a new AWS Glue job, specifying the NotebookJobName parameter as the same name as the Notebook file. Here’s the sample snippet of CloudFormation template:
    AWSTemplateFormatVersion: '2010-09-09'
    Description: CloudFormation template for creating an AWS Glue ETL job using a Jupyter Notebook
    
    Parameters:
      NotebookJobName:
        Type: String
        Description: Name of the AWS Glue ETL Notebook job
    
    Resources:
      GlueJobRole:
        Type: AWS::IAM::Role
        Properties:
          RoleName: !Sub ${AWS::StackName}-GlueJobRole
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                  Service:
                    - glue.amazonaws.com
                Action:
                  - sts:AssumeRole
          ManagedPolicyArns:
            - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
          Policies:
            - PolicyName: GlueJobS3Access
              PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Effect: Allow
                    Action:
                      - iam:PassRole
                    Resource:
                      - !Sub arn:aws:iam::${AWS::AccountId}:role/${AWS::StackName}-GlueJobRole
    
      ETLNotebookJob:
        Type: AWS::Glue::Job
        Properties:
          Name: !Ref NotebookJobName
          Description: ETL job using a Jupyter Notebook
          Role: !GetAtt GlueJobRole.Arn
          Command:
            Name: glueetl
            PythonVersion: '3'
            ScriptLocation: !Sub s3://aws-glue-assets-${AWS::AccountId}-${AWS::Region}/scripts/${NotebookJobName}.py
          DefaultArguments:
            '--job-bookmark-option': job-bookmark-enable
          JobMode: NOTEBOOK
    
    Outputs:
      ETLNotebookJobName:
        Value: !Ref ETLNotebookJob
        Description: Name of the ETL Notebook job

  4. Deploy the CloudFormation template. For NotebookJobName, enter same name as the notebook file.
  5. Verify that the AWS Glue job you created is listed and that it has the name you specified in the CloudFormation template.

AWS Glue notebook shows the Notebook job that contains the existing cells that you had in the ipynb file. You can review the job details to confirm it’s configured correctly.

Console experience

On the AWS Glue console, in the navigation pane, choose ETL Jobs to observe all your ETL jobs listed. Here you have different columns Job name, Type, Created by, Last modified, and AWS Glue version. You can sort and filter by these columns. The following screenshot shows how it looks.

We also enhanced the console experience with the JobMode introduction. The Created by column on the console gives you information about JobMode of the job. You can filter access jobs created by VISUAL, NOTEBOOK, or SCRIPT, as shown in the following screenshot.

This new console experience helps you search and discover your jobs based on JobMode.

Conclusion

This post demonstrated how AWS Glue Job API works with the newly introduced job mode property. With the new property, you can explicitly choose the mode of each job. The steps instructed detailed usage in API, AWS SDK, and CloudFormation. Additionally, the property makes it straightforward to search and discover your jobs quickly on the AWS Glue console.


About the Authors

Shovan Kanjilal is a Senior Analytics and Machine Learning Architect with Amazon Web Services. He is passionate about helping customers build scalable, secure, and high-performance data solutions in the cloud.

Manoj Shunmugam is a DevOps Consultant in Professional Services at Amazon Web Services. He works with customers to establish infrastructures using cloud-centered and/or container-based platforms in the AWS Cloud.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Gal HeyneGal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.

Achieve the best price-performance in Amazon Redshift with elastic histograms for selectivity estimation

Post Syndicated from Roger Kim original https://aws.amazon.com/blogs/big-data/achieve-the-best-price-performance-in-amazon-redshift-with-elastic-histograms-for-selectivity-estimation/

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data. Tens of thousands of customers use Amazon Redshift to process large amounts of data, modernize their data analytics workloads, and provide insights for their business users.

Amazon Redshift continues to lead in data warehouse price-performance (for examples, see Amazon Redshift continues its price-performance leadership, Amazon Redshift: Lower price, higher performance, and Get up to 3x better price performance with Amazon Redshift than other cloud data warehouses). Amazon Redshift’s advanced Query Optimizer is a crucial part of that leading performance. The Query Optimizer is responsible for finding the fastest way (or plan) to execute a query. It does this by using statistics about the data together with the query to calculate a cost of executing the query for many different plans.

Amazon Redshift has built-in autonomics to collect statistics called automatic analyze (or auto analyze). Auto analyze is a background operation that runs automatically on Redshift tables to keep statistics up-to-date. Statistics collection, however, can be computationally expensive, making it a challenge to keep statistics up-to-date particularly when data is continuously being ingested. As data is ingested into the Redshift data warehouse over time, statistics could become stale, which in turn causes inaccurate selectivity estimations, leading to sub-optimal query plans that impact query performance.

Challenges with stale statistics

Based on Redshift fleet analysis of customer workloads, we found that the staleness of statistics is an especially important factor in the selectivity estimation of predicates with temporal columns such as those with DATE and TIMESTAMP data types. This is due to the following reasons: 1) DATE and TIMESTAMP represent about 11% of predicate columns in the queries in the Amazon Redshift fleet (see Figure 1); 2) More than 40% of query scan volume in the Amazon Redshift fleet have predicates on DATE or TIMESTAMP columns; and 3) Not surprisingly, customer workloads tend to query recent (hot) data more often than historical (cold) data. One such query pattern representative of these customer workloads, derived from the industry standard TPC-H analytics benchmark, is as follows:

SELECT ...
FROM   lineitem
       JOIN orders ON l_orderkey = o_orderkey
       JOIN customer ON ...
WHERE l_shipdate >= current_date - $1
  AND ...
Figure 1: Amazon Redshift fleet metrics on temporal vs non-temporal data types

Figure 1: Amazon Redshift fleet metrics on temporal vs non-temporal data types

Solution overview

Amazon Redshift introduced a new selectivity estimation technique in Amazon Redshift patch release P183 (v1.0.75379) to address the situation — having up-to-date statistics on temporal columns improving query plans and thereby performance. The new technique captures real-time statistical metadata gathered during data ingestion without incurring additional computational overhead. For queries with range predicates on temporal columns, the query optimizer uses this additional metadata fetched at runtime to complement the existing statistics, elastically adjusting the histogram boundaries, leading to improved selectivity estimations for temporal predicates. See Figures 2 & 3 for the performance improvements that elastic histograms for selectivity estimation delivers. This query processing optimization is enabled by default requiring no configuration changes or user intervention from users to realize the benefits of automatic optimization and improved query performance.

Benchmark evaluation

We evaluated the new selectivity estimation technique on variations of TPC-H queries. In one variation, the query performs an n-way join between lineitem, orders, and other tables with multiple predicates, including on l_shipdate.

When histogram statistics were stale, the selectivity estimations of predicates on l_shipdate were incorrectly predicted. This led to a sub-optimal query plan with a join order involving large network-heavy data redistributions among the compute resources of the Amazon Redshift provisioned cluster or serverless workgroup. With the new selectivity estimation technique, the prediction became much more accurate, leading to an optimal query plan with a join order that minimized the redistribution of results between join steps, resulting in a performance improvement shown in Figure 2.

Figure 2: Relative performance of TPC-H query variant (lower is better)

Figure 2: Relative performance of TPC-H query variant (lower is better)

Figure 3: Query Plan comparison: Before enhancement (left), After enhancement (right)

Figure 3: Query Plan comparison: Before enhancement (left), After enhancement (right)

Conclusion

In this post, we covered new performance optimizations in Redshift data warehouse query processing and how elastic histogram statistics help enhance selectivity estimation and the overall quality of query plans for Amazon Redshift data warehouse queries in the absence of fresh table statistics.

In summary, Amazon Redshift now offers enhanced query performance with optimizations such as Enhanced Histograms for Selectivity Estimation in the absence of fresh statistics by relying on metadata statistics gathered during ingestion.  These optimizations are enabled by default and Amazon Redshift users will benefit with better query response times for their workloads. Amazon Redshift is on a mission to continuously improve performance and therefore overall price-performance. The new selectivity estimation enhancement has already improved the performance of hundreds of thousands of customer queries in the Amazon Redshift fleet since its introduction in the patch release P183. It’s worth noting that this is one of the many behind-the-scenes improvements we continually make to keep Redshift the industry leader in price-performance.

We invite you to try the numerous new features introduced in Amazon Redshift together with the new performance enhancements. For more information, reach out to your AWS account team to request a free consultation or a demo of Amazon Redshift. They will be happy to provide additional guidance and support on choosing the right analytics solution that meets your business needs.


About the authors

Roger Kim is a Software Development Engineer on the Amazon Redshift team focusing on query performance and optimization. He holds a BA in Computer Science and Mathematics from Cornell University.

Mohammed Alkateb is an Engineering Manager at Amazon Redshift. Prior to joining Amazon, Mohammed had 12 years of industry experience in query optimization and database internals as an Individual Contributor and Engineering Manager. Mohammed has 18 US patents, and he has publications in research and industrial tracks of premier database conferences including EDBT, ICDE, SIGMOD and VLDB. Mohammed holds a PhD in Computer Science from The University of Vermont, and MSc and BSc degrees in Information Systems from Cairo University.

Mengchu Cai is a principal engineer on the Amazon Redshift team. Mengchu currently works on query optimization and data lake query performance. He also led the development of SQL language features. Mengchu received his PhD in Computer Science and Engineering from the University of Nebraska Lincoln.

Ravi Animi is a Senior Product Leader on the Amazon Redshift team and manages several functional areas of Amazon Redshift analytics, data, and AI, including spatial analytics, streaming analytics, query performance, Spark integration, and analytics business strategy. He has experience with relational databases, multi-dimensional databases, IoT technologies, storage and compute infrastructure services, and more recently, as a startup founder in the areas of AI and deep learning. Ravi holds dual Bachelors degrees in Physics and Electrical Engineering from Washington University, St. Louis, a Masters in Engineering from Stanford, and an MBA from Chicago Booth.

How to implement access control and auditing on Amazon Redshift using Immuta

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/how-to-implement-access-control-and-auditing-on-amazon-redshift-using-immuta/

This post is co-written with Matt Vogt from Immuta. 

Organizations are looking for products that let them spend less time managing data and more time on core business functions. Data security is one of the key functions in managing a data warehouse. With Immuta integration with Amazon Redshift, user and data security operations are managed using an intuitive user interface. This blog post describes how to set up the integration, access control, governance, and user and data policies.

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that makes it fast and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift. Amazon Redshift natively supports coarse-grained and fine-grained access control with features such as role-based access control, scoped permissionsrow-level security, column-level access control and dynamic data masking.

Immuta enables organizations to break down the silos that exist between data engineering teams, business users, and security by providing a centralized platform for creating and managing policy. Access and security policies are inherently technical, forcing data engineering teams to take responsibility for creating and managing these policies. Immuta empowers business users to effectively manage access to their own datasets and it enables business users to create tag and attribute-based policies. Through Immuta’s natural language policy builder, users can create and deploy data access policies without needing help from data engineers. This distribution of policies to the business enables organizations to rapidly access their data while ensuring that the right people use it for the right reasons.

Solution overview

In this blog, we describe how data in Redshift can be protected by defining the right level of access using Immuta. Let’s consider the following example datasets and user personas. These datasets, groups, and access policies are for illustration only and have been simplified to illustrate the implementation approach.

Datasets:

  • patients: Contains patients’ personal information such as name, address, date of birth (DOB), phone number, gender, and doctor ID
  • conditions: Contains the history of patients’ medical conditions
  • immunization: Contains patients’ immunization records
  • encounters: Contains patients’ medical visits and the associated payment and coverage costs

Groups:

  • Doctor: Groups users who are doctors
  • Nurse: Groups users who are nurses
  • Admin: Groups the administrative users

Following are the four permission policies to enforce.

  • Doctor should have access to all four datasets. However, each doctor should see only the data for their own patients. They should not be able to see all the patients
  • Nurse can access only the patients and immunization And can see all patients data.
  • Admin can access only the patients and encounters And can see all patients data.
  • Patients’ social security numbers and passport information should be masked for all users.

Pre-requisites

Complete the following steps before starting the solution implementation.

  1. Create Redshift data warehouse to load sample data and create users.
  2. Create users in a Redshift Use the following names for the implementation described in this post.
    • david, chris, jon, ema, jane
  3. Create user in Immuta as described in the documentation. You can also integrate your identify manager with Immuta to share user names. For the example in this post, you will use local users.
    • David Mill, Dr Chris, Dr Jon King, Ema Joseph, Jane D

Users

  1. Immuta SaaS deployment is used for this post. However, you can use either software as a service (SaaS) deployment or self-managed deployment.
  2. Download the sample datasets and upload them to your own Amazon Simple Storage Service (Amazon S3) This data is synthetic and doesn’t include real data.
  3. Download the SQL commands and replace the Amazon S3 file path in the COPY command with the file path of the uploaded files in your account.

Implementation

The following diagram describes the high-level steps in the following sections, which you will use to build the solution.

Solution Overview

1. Map users

  1. In the Immuta portal, navigate to People and choose Users. Select a user name to map to an Amazon Redshift user name.
  2. Choose Edit for the Amazon Redshift user name and enter the corresponding Redshift username.

Map Users

  1. Repeat the steps for the other users.

2. Set up native integration

To use Immuta, you must configure Immuta native integration, which requires privileged access to administer policies in your Redshift data warehouse. See the Immuta documentation for detailed requirements.

Use the following steps to create native integration between Amazon Redshift and Immuta.

  1. In Immuta, choose App Settings from the navigation pane.
  2. Click on Integrations.
  3. Click on Add Native Integration.
  4. Enter the Redshift data warehouse endpoint name, port number, and a database name where Immuta will create policies.
  5. Enter privileged user credentials to connect with administrative privileges. These credentials aren’t stored on the Immuta platform and are used for one-time setup.
  6. You should see a successful integration with a status of Enabled.

3. Create a connection

The next step is to create a connection to the Redshift data warehouse and select specific data sources to import.

  1. In Immuta, choose Data Sources and then New Data sources in the navigation pane and choose New Data Source.
  2. Select Redshift as the Data Platform.
    Create Data Source
  3. Enter the Redshift data warehouse endpoint as the Server and the credentials to connect. Ensure the Redshift security group has inbound rules created to open access from Immuta IP addresses.
    Create Data Source2
  4. Immuta will show the schemas available on the connected database.
  5. Choose Edit under Schema/Table section.
    Schemas
  6. Select pschema from the list of schemas displayed.
    pschema
  7. Leave the values for the remaining options as the default and choose Create. This will import the metadata of the datasets and run default data discovery. In 2 to 5 minutes, you should see the table imported with status as Healthy.
    Healthy Source

4. Tag the data fields

Immuta automatically tags the data members using a default framework. It’s a starter framework that contains all the built-in and custom defined identifiers. However, you might want to add custom tags to the data fields to fit your use case. In this section, you will create custom tags and attach them to data fields. Optionally, you can also integrate with an external data catalog such as Alation, or Colibra. For this post, you will use custom tags.

Create tags

  1. In Immuta, choose Governance from the navigation pane, and then choose Tags.
  2. Choose Add Tags to open the Tag Builder dialog box
    Tags
  3. Enter Sensitive as a custom tag and choose Save.

Tags

  1. Repeat steps 1–3 to create the following tags.
    • Doctor ID: Tag to mark the doctor ID field. It will be used for defining an attribute bases access policy (ABAC).
    • Doctor Datasets: Tag to mark data sources accessible to Doctors.
    • Admin Datasets: Tag to mark data sources accessible to Admins.
    • Nurse Datasets: Tag to mark data sources accessible to Nurses.

Add tags

Now add the Sensitive tag to the ssn and passport fields in the Pschema Patient data source.

  1. In Immuta, choose Data and then Data Sources in the navigation pane and select Pschema Patient as the data source.
  2. Choose the Data Dictionary tab
  3. Find ssn in the list and choose Add Tags.

Tags

  1. Search for Sensitive tag and choose Add.

Tags

  1. Repeat the same step for the passport
  2. You should see tags applied to the fields.

Tags

  1. Using the same procedure, add the Doctor ID tag to the drid (doctor ID) field in the Pschema Patients data source.

Attributes

Now tag the data sources as required by the access policy you’re building.

  1. Choose Data and then Data Sources and select Pschema Patients as the data source.
  2. Scroll down to Tags and choose Add Tags
  3. Add Doctor Datasets, Nurse Datasets, and Admin Datasets tags to the patients data source (because this data source should be accessible by the Doctors, Nurses, and Admins groups).
Data Source Tags
Patients Doctor Datasets, Nurse Datasets, Admin Datasets
Conditions Doctor Datasets
Immunizations Doctor Datasets, Nurse Datasets
Encounters Doctor Datasets, Admin Datasets

You can create more tags and tag fields as required by your organization’s data classification rules. The Immuta data source page is where stewards and governors will spend a lot of time.

5. Create groups and add users

You must create user groups before you define policies.

  1. In Immuta, choose People and then Groups from the navigation pane and then choose New Group.
  2. Provide doctor as the group name and select Save.
  3. Repeat step1 and step2 to create the following groups:
    • nurse
    • admin
  4. You should see three groups created.

Groups

Next, you need to add users to these groups.

  1. Choose People and then Groups in the navigation pane.
  2. Select the doctor
  3. Choose Settings and choose Add Members in the Members
  4. Search for Dr Jon King in the search bar and select the user from the results. Choose close to add the user and exit the screen.
  5. You should see Dr Jon King added to the doctor.

Groups

  1. Repeat to add additional users as shown in the following table.
Group Users
Doctor Dr Jon King, Dr Chris
Nurse Jane D
admin David Mill, Ema Joseph

6. Add attributes to users

One of the security requirements is that doctors can only see the data of their patients. They shouldn’t be able to see other doctors’ patient data. To implement this requirement, you must define attributes for users who are doctors.

  1. Choose People and then Users in the navigation pane, and then select Dr Chris.
  2. Choose Settings and scroll down to the Attributes
  3. Choose Add Attributes. Enter drid as the Attribute and d1001 as the Attribute value.
  4. This will assign the attribute value of d1001 to Dr Chris. In Step 8 Define data policies, you will define a policy to show data with the matching drid attribute value.

Group Attributes

  1. Repeat steps 1–4; selecting Dr Jon King and entering d1002 as the Attribute value

7. Create subscription policy

In this section, you will provide data sources access to groups as required by the permission policy.

  • Doctors can access all four datasets: Patients, Conditions, Immunizations, and Encounters.
  • Nurses can access only Patients and Immunizations.
  • Admins can access only Patients and Encounters.

In 4. Tag the data fields, you added tags to the datasets as shown in the following table. You will now use the tags to define subscription policies.

Data source Tags
Patients Doctor Datasets, Nurse Datasets, Admin Datasets
Conditions Doctor Datasets
Immunizations Doctor Datasets, Nurse Datasets
Encounters Doctor Datasets, Admin Datasets
  1. In Immuta, choose Policies and then Subscription Policies from the navigation pane, and then choose Add Subscription Policy.
  2. Enter Doctor Access as the policy name.
  3. For the Subscription level, select Allow users with specific groups/attributes.
  4. Under Allow users to subscribe when user, select doctor. This allows only users who are members of the doctor group to access data sources accessible by doctor group.

Subscription Policy

  1. Scroll down and select Share Responsibility. This will ensure users aren’t blocked from accessing datasets even if they don’t meet all the subscription policies, which isn’t required.

Shared Responsibility

  1. Scroll further down and under Where should this policy be applied, choose On data sources, tagged and Doctor Dataset as options. It selects the datasets tagged as Doctor Dataset. You can notice that this policy applies all 4 data sources as all four data sources are tagged as Doctor Datasets.

Subscription Policy

  1. Next, create the policy by choose Activate This will create the view and policies in Redshift and enforce the permission policy.
  2. Repeat the same steps to define Nurse Access and Admin Access
    • For the Nurse Access policy, select users who are a member of the Nurse group and data sources that are tagged as Nurse Datasets.
    • For the Admin Access policy, select users who are member of the Admin group and data sources that are tagged as Admin Datasets.
  3. In Subscription policies, you should see all three policies in Active Notice the Data Sources count for how many data sources the policy is applied to.

Subscription Policy

8. Define data policies

 So far, you have defined permission policies at the data sources level. Now, you will define row and column level access using data policies. The fine-grained permission policy that you should define to restrict rows and columns is:

  • Doctors can see only the data of their own patients. In other words, when a doctor queries the patients table, then they should see only patients that match their doctor ID (drid).
  • Sensitive fields, such as ssn or passport, should be masked for everyone.
  1. In Immuta, Choose Policies and then Data Policies in the navigation pane and then choose Add Data Policy.
  2. Enter Filter by Doctor ID as the Policy name.
  3. Under How should this policy protect the data?, choose options as Only show rows , where, user possesses an attribute in drid that matches the value in column tagged Doctor ID. These settings will enforce that a doctor can see only the data of patients that have a matching Doctor ID. All other users (members of the nurse and admin groups) can see all of the patients

Data Policy

  1. Scroll down and under Where should this policy be applied?, choose On data sources, with columns tagged, Doctor ID as options. It selects the data sources that have columns tagged as Doctor ID. Notice the number of data sources it selected. It applied the policy to one data source out of the four available. Remember that you added the Doctor ID tag to the drid field for the Patients data source. So, this policy identified the Patients data source as a match and applied the policy.
    Policy
  2. Choose Activate Policy to create the policy.
  3. Similarly, create another policy to mask sensitive data for everyone.
    • Provide Mask Sensitive Data as policy name.
    • Under How should this policy protect the data?, choose Mask, columns tagged, Sensitive, using hashtag, for, everyone.
    • Under Where should this policy be applied?, choose on data sources, with columns tagged, Sensitive.

Data Policy

  1. In the Data Policies screen, you should now see both data policies in Active

Data Policy

9. Query the data to validate policies

The required permission policies are now in place. Sign in to the Redshift Query Editor as different users to see the permission policies in effect.

For example,

  1. Sign in as Dr. Jon King using the Redshift user ID jon. You should see all four tables, and if you query the patients table, you should see only the patients of Dr. Jon King; that is, patients with the Doctor ID d10002.
  2. Sign in as Ema Joseph using the Redshift user ID ema. You should see only two tables, Patients and Encounters, which are Admin datasets.
  3. You will also notice that ssn and passport are masked for both users.

Audit

 Immuta’s comprehensive auditing capabilities provide organizations with detailed visibility and control over data access and usage within their environment. The platform generates rich audit logs that capture a wealth of information about user activities, including:

  • Who’s subscribing to each data source and the reasons behind their access
  • When users are accessing the data
  • The specific SQL queries and blob fetches they are executing
  • The individual files they are accessing

The following is an example screenshot.

Audit

Industry use cases

The following are example industry use cases where Immuta and Amazon Redshift integration adds value to customer business objectives. Consider enabling the following use cases on Amazon Redshift and using Immuta.

Patient records management

In the healthcare and life sciences (HCLS) industry, efficient access to quality data is mission critical. Disjointed tools can hinder the delivery of real-time insights that are critical for healthcare decisions. These delays negatively impact patient care, as well as the production and delivery of pharmaceuticals. Streamlining access in a secure and scalable manner is vital for timely and accurate decision-making.

Data from disparate sources can easily become siloed, lost, or neglected if not stored in an accessible manner. This makes data sharing and collaboration difficult, if not impossible, for teams who rely on this data to make important treatment or research decisions. Fragmentation issues lead to incomplete or inaccurate patient records, unreliable research results, and ultimately slow down operational efficiency.

Maintaining regulatory compliance

HCLS organizations are subject to a range of industry-specific regulations and standards, such as Good Practices (GxP) and HIPAA, that ensure data quality, security, and privacy. Maintaining data integrity and traceability is fundamental, and requires robust policies and continuous monitoring to secure data throughout its lifecycle. With diverse data sets and large amounts of sensitive personal health information (PHI), balancing regulatory compliance with innovation is a significant challenge.

Complex advanced health analytics

Limited machine learning and artificial intelligence capabilities—hindered by legitimate privacy and security concerns—restrict HCLS organizations from using more advanced health analytics. This constraint affects the development of next-generation, data-driven tactics, including patient care models and predictive analytics for drug research and development. Enhancing these capabilities in a secure and compliant manner is key to unlocking the potential of health data.

Conclusion

In this post, you learned how to apply security policies on Redshift datasets using Immuta with an example use case. That includes enforcing data-set level access, attribute-level access and data masking policies. We also covered implementation step by step. Consider adopting simplified Redshift access management using Immuta and let us know your feedback.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Matt Vogt is a seasoned technology professional with over two decades of diverse experience in the tech industry, currently serving as the Vice President of Global Solution Architecture at Immuta. His expertise lies in bridging business objectives with technical requirements, focusing on data privacy, governance, and data access within Data Science, AI, ML, and advanced analytics.

Navneet Srivastava is a Principal Specialist and Analytics Strategy Leader, and develops strategic plans for building an end-to-end analytical strategy for large biopharma, healthcare, and life sciences organizations. His expertise spans across data analytics, data governance, AI, ML, big data, and healthcare-related technologies.

Somdeb Bhattacharjee is a Senior Solutions Architect specializing on data and analytics. He is part of the global Healthcare and Life sciences industry at AWS, helping his customer modernize their data platform solutions to achieve their business outcomes.

Ashok Mahajan is a Senior Solutions Architect at Amazon Web Services. Based in NYC Metropolitan area, Ashok is a part of Global Startup team focusing on Security ISV and helps them design and develop secure, scalable, and innovative solutions and architecture using the breadth and depth of AWS services and their features to deliver measurable business outcomes. Ashok has over 17 years of experience in information security, is CISSP and Access Management and AWS Certified Solutions Architect, and have diverse experience across finance, health care and media domains.

Simplify your query performance diagnostics in Amazon Redshift with Query profiler

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/simplify-your-query-performance-diagnostics-in-amazon-redshift-with-query-profiler/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that lets you analyze your data at scale. Amazon Redshift Serverless lets you access and analyze data without the usual configurations of a provisioned data warehouse. Resources are automatically provisioned and data warehouse capacity is intelligently scaled to deliver fast performance for even the most demanding and unpredictable workloads. If you prefer to manage your Amazon Redshift resources manually, you can create provisioned clusters for your data querying needs. For more information, refer to Amazon Redshift clusters.

Amazon Redshift provides performance metrics and data so you can track the health and performance of your provisioned clusters, serverless workgroups, and databases. The performance data you can use on the Amazon Redshift console falls into two categories:

  • Amazon CloudWatch metrics – Helps you monitor the physical aspects of your cluster or serverless, such as resource utilization, latency, and throughput.
  • Query and load performance data – Helps you monitor database activity, inspect and diagnose query performance problems.

Amazon Redshift has introduced a new feature called the Query profiler. The Query profiler is a graphical tool that helps users analyze the components and performance of a query. This feature is part of the Amazon Redshift console and provides a visual and graphical representation of the query’s run order, execution plan, and various statistics. The Query profiler makes it easier for users to understand and troubleshoot their queries.

In this post, we cover two common use cases for troubleshooting query performance. We show you step-by-step how to analyze and troubleshoot long-running queries using the Query profiler.

Overview

For Amazon Redshift Serverless, the Query profiler can be accessed by going to the Serverless console. Choose Query and database monitoring, select a query, and then navigate to the Query plan tab. If a query plan is available, you will observe a list of child queries. Choose a query to view it in Query profiler.

For Amazon Redshift provisioned, the Query profiler can be accessed by going to the provisioned clusters dashboard. Choose Query and loads, and choose a query. Navigate to the Query plan tab. If a query plan is available, you will observe a list of child queries. Choose a query to view it in Query profiler.

Prerequisites

  • You can use the following sample AWS Identity and Access Management (IAM) policy to configure your IAM user or role with minimum privileges to access Query profiler from the AWS console. If your IAM user or role already has access to Query and loads section of Redshift provisioned cluster dashboard or Query and database monitoring section of Redshift serverless dashboard, then no additional permissions are needed:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters",
                "redshift-serverless:ListNamespaces",
                "redshift-serverless:ListWorkgroups",
                "redshift-data:ExecuteStatement",
                "redshift-data:DescribeStatement",
                "redshift-data:GetStatementResult"
            ],
            "Resource": [
                "arn:aws:redshift-serverless:<your-namespace>",
                "arn:aws:redshift-serverless:<your-workgroupname>",
                "arn:aws:redshift:<your-clustername>"
            ]
        }
    ]
}
  • You can choose to use Query profiler in your account with an existing Amazon Redshift data warehouse and queries. However, if you would like to implement this demo in your existing Amazon Redshift data warehouse, download Redshift query editor v2 notebook, Redshift Query profiler demo, and refer to the Data Loading section later in this post.
  • You must connect to the cluster using database credentials and grant the sys:operator or sys:monitor role to the database user to view queries run by users.

Data loading

Amazon Redshift Query Editor v2 comes with sample data that can be loaded into a sample database and corresponding schema. To test Query profiler against the sample data, load the tpcds sample data and run queries.

  1. To load the tpcds sample data, launch Redshift query editor v2 and expand the database sample_data_dev.
  2. Choose the icon associated with the tpcds.
  3. The query editor v2 then loads the data into a schema tpcds in the database sample_data_dev.

The following screenshot shows these steps.
Load Data

  1. Verify the data by running the following sample query, as shown in the following screenshot.
select count(*) from sample_data_dev.tpcds.customer;

Verify Data

Use cases

In this post, we describe two common uses cases around query performance and how to use Query profiler to troubleshoot the performance issues:

  1. Nested loop joins – This join type is the slowest of the possible join types. Nested loop joins are the cross-joins without a join condition that result in the Cartesian product of two tables.
  2. Suboptimal data distribution – If data distribution is suboptimal, you might notice a large broadcast or redistribution of data across compute nodes when two large tables are joined together.

Use case 1: Nested loop joins

To troubleshoot performance issues with nest loop joins using Query profiler, follow these steps:

  1. Import notebook downloaded previously in prerequisites section of the blog into Redshift query editor v2.
  2. Set the context of database to sample_data_dev in Query Editor v2, as shown in the following screenshot.
    Set the database context
  3. Run cell #3 from demo notebook to diagnose a query performance issue related to nested loop joins.
    Step 3

The query takes around 12 seconds to run, as shown in the Query Editor v2 results panel in the following screenshot.

Step 4 results

  1. Run cell #5 to capture the query id from the SYS_QUERY_HISTORY system view filtering based on the query label you set in the preceding step.Cell 5
  2. On the Amazon Redshift console, in the navigation pane, select Query and loads and choose the cluster name where the query was originally executed, as shown in the following screenshot.
    Query and loads
  3. This will open the new Query profiler. Under the Query history section, choose Connect to database.After successful connection to the database, you will observe the Status showing as Connected and displaying the query history, as shown in the following screenshot.
    Connec to database
  4. You can find your queries either by Query ID or Process ID. Enter the Query ID captured in the preceding step to filter the long-running query for further analysis and choose the corresponding Query ID, as shown in the following screenshot.
    Search query
  5. Under the Query plan section, choose Child query 1, as shown in the following screenshot. If there are multiple child queries, you will have to inspect each one for performance issues.
    Child queryThis will open the query plan in a tree view along with additional metrics on the side panel. This allows you to quickly analyze the query streams, segments and steps. For more information about streams, segments, and steps, refer to Query planning and execution workflow in the Amazon Redshift Database Developer Guide.
  6. Turn on View streams and, in the Streams side panel, investigate and identify which stream has the highest execution time. In this case, Streams ID 5 is where the query spends the majority of time, as shown in the following screenshot
    Enable view stream
  7. In the Streams side panel, under ID, select 5 to focus on Stream 5 for further analysis. Stream 5 shows a step of Nestloop, as shown in the following screenshot.
    Nestloop step
  8. Choose the Nestloop step to further analyze. The side panel will change with step details and additional metrics about the nested loop join.
  9. By looking at Step details – nestloop, we can inspect the Input rows and compare that with the Output rows, as shown in the following screenshot. In this case, due to the cross-joining with the Store_returns table, 287,514 input rows explodes to 950,233,770 rows, thus causing our query to run slower.
    Nestloop step details
  10. Fix the query by introducing a join condition between the store_sales and store_returns. Run cell #7 from Query editor v2 demo notebook.The re-written query runs in just 307 milliseconds.Cell 7

Use case 2: Suboptimal data distribution

  1. To demonstrate suboptimal data distribution, change the distribution style of tables web_sales and web_returns to EVEN by running cell #10 of Query editor v2 demo notebook.Cell 10
  1. Run cell #12. The query takes 409 milliseconds to run, as shown by the elapsed time in the following screenshot of the Query editor v2.Cell 12
  2. Follow steps 3–10 from use case 1 to locate the query_id and to open the Query profiler view for the preceding query.
  3. On the Query profiler page for the preceding query, turn on View streams. In the Streams side panel, investigate and identify which stream has the highest execution time. In this case, Stream ID 6 is where the query spends a majority of the time, as shown in the following screenshot.
    View streams
  4. Under ID, select 6 from the Streams side panel for further analysis.
    Streams side panel

Stream 6 shows a step of hash join, which involves a hash join of two tables that are both redistributed. This can be inferred from Hash Right Join DS_DIST_BOTH under Explain plan node information in the following screenshot. Usually, these redistributions occur because the tables aren’t joined on their distribution keys, or they don’t have the correct distribution style. In the case of large tables, these redistributions can lead to significant performance degradation and, hence, it is important to identify and fix such steps to optimize query performance.

Hashjoin step

  1. Fix this suboptimal data distribution pattern by choosing the appropriate distribution keys on the tables involved: web_sales and web_returns. To change the distribution styles, run cell #14 of demo notebook to alter table commands.
    Cell 14
  2. After the preceding commands finish running, run cell #16 to re-execute the select query. As shown in the Query Editor in the following screenshot, now the same query finished in 244 milliseconds after updating the distribution style to key for tables web_sales and web_returns.
    Cell 16

  3. In the Query profiler view, turn on View streams and notice that Streams 5 now took the most time. It took 8 milliseconds to finish, as compared to 13 milliseconds in the preceding step.
    View streams
  4. In the Streams side panel, under ID, select 5 to drill down further, then choose the Hashjoin As the following screenshot shows, after changing the distribution style to key for both web_sales and web_return tables, none of the tables need to be redistributed at the query runtime, resulting in optimized performance.
    Hashjoin step

Considerations

Consider the following details while using Query profiler:

  1. Query profiler displays information returned by the SYS_QUERY_HISTORY, SYS_QUERY_EXPLAIN, SYS_QUERY_DETAIL, and SYS_CHILD_QUERY_TEXT views.
  2. Query profiler only displays query information for queries that have recently run on the database. If a query completes using a prepopulated resultset cache, Query profiler won’t have information about it because Amazon Redshift doesn’t generate a query plan for such queries.
  3. Queries run by Query profiler to return the query information run on the same data warehouse as the user-defined queries.

Clean Up

To avoid unexpected costs, complete the following action to delete the resources you created:

Drop all the tables in the sample_data_dev under tpcds schema.

Conclusion

In this post, we discussed how to use Amazon Redshift Query profiler to monitor and troubleshoot long-running queries. We demonstrated a step-by-step approach to analyze query performance by examining the query execution plan and statistics and identifying the root cause of query slowness. Try this feature in your environment and share your feedback with us.


About the Authors

Raks KhareRaks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.

Blessing Bamiduro is part of the Amazon Redshift Product Management team. She works with customers to help explore the use of Amazon Redshift ML in their data warehouse. In her spare time, Blessing loves travels and adventures.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

How Getir unleashed data democratization using a data mesh architecture with Amazon Redshift

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/how-getir-unleashed-data-democratization-using-a-data-mesh-architecture-with-amazon-redshift/

This blog post is co-written with Pinar Yasar from Getir.

Amazon Redshift is a fully managed cloud data warehouse that’s used by tens of thousands of customers for price-performance, scale, and advanced data analytics. Amazon Redshift enables data warehousing by seamlessly integrating with other data stores and services in the modern data organization through features such as Zero-ETL, data sharing, streaming ingestion, data lake integration, and Redshift ML.

In this post, we explain how ultrafast delivery pioneer, Getir, unleashed the power of data democratization on a large scale through their data mesh architecture using Amazon Redshift.

We start by introducing Getir and their vision—to seamlessly, securely, and efficiently share business data across different teams within the organization for BI, extract, transform, and load (ETL), and other use cases. We’ll then explore how Amazon Redshift data sharing powered the data mesh architecture that allowed Getir to achieve this transformative vision. We will also explain how Getir’s data mesh architecture enabled data democratization, shorter time-to-market, and cost-efficiencies. Next, we’ll provide a broader overview of modern data trends reinforced by Getir’s vision. In conclusion, we’ll offer some thoughts on how you can apply a similar approach to eliminate costly and barrier-inducing data silos using Amazon Redshift.

Who is Getir?

Getir is an ultrafast delivery pioneer that revolutionized last-mile delivery in 2015 with its 10-minute grocery delivery proposition.Getir’s story started in Istanbul, and they have launched multiple products since inception: GetirFood, GetirMore, GetirWater, GetirLocals, GetirBitaksi (taxi service), GetirDrive (car rental service), and GetirJobs (recruitment).

Getir serves dozens of cities throughout the world with more than 30,000 employees. The following figure shows the Getir app.

Figure 1: Getir app

Figure 1: Getir app

Overview of Getir’s main use case

Getir’s business is characterized by a tremendous volume of data generation and growth, in addition to ample opportunities to gain valuable insights. However, siloing this data and creating friction for teams trying to access the information they needed wasn’t a viable option. Allowing teams to duplicate data wherever required can be an anti-pattern, leading to operational complexity, cost overruns, and fragile data storage bloat.

Similarly, relying on dedicated teams to create data extracts or insights for downstream consumers introduces bottlenecks, stifles innovation, and increases the time-to-market. This approach isn’t optimal for a data-driven organization like Getir, which needs to empower its teams with seamless access to the information they require to drive the business forward. The various business lines within the organization made it abundantly clear that they wanted unfettered access to the company’s entire data ecosystem in a secure, cost-efficient, near real-time, and well-governed manner.

Furthermore, the organization was anticipating the emergence of data-as-a-serviceservice and generative AI use cases in the near future. This would necessitate the ability to securely share and potentially monetize the company’s data with external partners, such as franchises.

Overview of Getir’s use of Amazon Redshift and modern data architecture

To strike a balance that addresses these concerns and enables Getir teams to effectively use the wealth of data to generate meaningful insights and drive strategic decision-making across the organization, we chose a data mesh architecture.

Getir’s data analytics environment encompasses hundreds of terabytes of data, thousands of tables, and billions upon billions of data rows. Additionally, it processes millions of messaging events daily, all of which must be ingested, refined, and made available to analysts querying multiple Amazon Redshift warehouses. The end-to-end service level agreements (SLAs) for this data ecosystem can be extremely aggressive, with requirements that can be as stringent as single-digit minutes to single-digit seconds. This underscores the scale and complexity of Getir’s data analytics capabilities, which must operate with the utmost efficiency and responsiveness to meet the demands of the business. We were able to easily implement the envisioned data mesh architecture using Amazon Redshift’s native data sharing capabilities.

Figure 2: Data mesh architecture using Amazon Redshift data sharing

Figure 2: Data mesh architecture using Amazon Redshift data sharing

As the preceding diagram shows, at the heart of Getir’s architecture, was an ETL Redshift data warehouse that was used for various data sets from all over the organization, creating a refined 360-degree view of critical assets. It also was a producer for downstream Redshift data warehouses.

The demand was quite heavy on this main ETL cluster, so we relied on data sharing to isolate noisy workloads on a different Redshift data warehouse without having to duplicate the data on the main ETL cluster.

Using Redshift data sharing, individual business line teams could now rely solely on their dedicated Redshift cluster to provide them with their own data and analytics capabilities, but also the refined 360-degree views of data generated from all over the organization—without any data duplication or overstepping compute boundaries. BI analysts gained access to all of the data they needed to power their most complex dashboards with consistent performance free of noisy jobs. Additional warehouses were integrated into the data mesh for visualization, reporting, and machine learning.

Another benefit of Amazon Redshift data sharing and the data mesh architecture, was the relative ease with which we were able to maintain a chargeback model for ensuring costs were spread fairly across different teams.

Finally, the data sharing capability also enabled the seamless propagation of newly created tables within a schema to the subscribed consumers.

Modern data trends reinforced by Getir’s case study

Getir’s case study showcases the strategic uses of a data mesh architecture and Amazon Redshift, but more importantly provides tremendous insights into five key trends across all industries as modern data organizations move away from costly data silos that hinder collaboration, business insights, and time-to-market. As highlighted in the following diagram, those trends are 1/interconnected, purpose-built data stores that enable users to access data regardless of its physical location, 2/data democratization empowering users with self-service analytics capabilities, 3/real-time insights to drive greater value from data, 4/resilient data services ensuring business continuity, 5/leveraging generative AI to extract even deeper insights from data more expeditiously.

Figure 3: Key trends in the modern data organization reinforced by Getir's use case and solution

Figure 3: Key trends in the modern data organization reinforced by Getir’s use case and solution

As Getir showed, the modern data organization is adopting data architectures that democratize data securely and enable self-service analytics. To realize data’s true potential, the modern data organization has progressed beyond basic dashboarding and reporting on limited, point-in-time data sets, and evolved to use more sophisticated ETL processes that can ingest data from diverse sources. Near real-time analytics in addition to predictive models have become standard fare, significantly reducing the time to actionable insights.

Furthermore, the data landscape has been democratized to empower analysts in numerous ways through the rise of transactional data lakes powered by open table formats such as Apache Iceberg and the assistance of generative AI. This holistic approach has elevated data organizations’ capabilities well beyond traditional reporting, unlocking greater business value from the wealth of data available.

Using generative AI with data mesh architecture

In addition to the five key trends previously mentioned, the present-day data landscape is characterized by three key facts that are leading data organizations like Getir to increasingly harness the power of generative AI to drive the next evolution of data-informed decision-making.

Data is an organization’s most valuable asset and the ability to effectively use data is central to an organization’s success and growth. Data analytics and insights are absolutely crucial to strengthening and expanding the business. Deriving meaningful insights from data is essential for making informed, strategic decisions. Democratizing data and enabling self-service analytics can greatly expand the range of business insights, while reducing the time to market for those insights. Empowering users across the organization to access and analyze data can unlock tremendous value. Generative AI’s ability to respond to natural language prompts, explore and analyze complex data, and summarize lengthy content makes it a valuable tool for translating large amounts of data into valuable insights. However, the true potential of generative AI for organizations lies in Retrieval Augmented Generation (RAG).

Out of the box, generative AI models start with a relatively generic knowledge base, which can lead to unreliable or inaccurate information. RAG addresses this by introducing the model to additional datasets that are specific to the organization or context. This allows generative AI models to produce far more accurate, attributable, and highly contextualized outputs to support decision-making.

Data mesh architecture can play a crucial role in enabling and facilitating RAG. By facilitating access to multiple data sources within the organization, the data mesh provides the necessary fuel for the generative AI model to draw from, resulting in more reliable and insightful information. This, in turn, empowers data-driven decision-making and helps organizations harness the full potential of their data assets.

Conclusion

In this post, we examined how Getir implemented a data mesh architecture and Amazon Redshift data sharing to meet their evolving data requirements. This entailed dedicated data warehouses tailored to different business lines and needs, while maintaining robust data governance and secure data access. Additionally, we highlighted the key industry trends that Getir’s case study reinforces across the broader data landscape. For more information, contact AWS or connect with your AWS Technical Account Manager or Solutions Architect, who will be happy to provide more detailed guidance and support.


About the Authors

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas, USA. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.

Pinar Yasar is the Data Engineering Manager at Getir. Her passion is to accelerate self-service analytics for her internal customers and build highly scalable and cost-effective solutions in the cloud.

Demystify data sharing and collaboration patterns on AWS: Choosing the right tool for the job

Post Syndicated from Ramakant Joshi original https://aws.amazon.com/blogs/big-data/demystify-data-sharing-and-collaboration-patterns-on-aws-choosing-the-right-tool-for-the-job/

Data is the most significant asset of any organization. However, enterprises often encounter challenges with data silos, insufficient access controls, poor governance, and quality issues. Embracing data as a product is the key to address these challenges and foster a data-driven culture.

In this context, the adoption of data lakes and the data mesh framework emerges as a powerful approach. By decentralizing data ownership and distribution, enterprises can break down silos and enable seamless data sharing. Cataloging data, making the data searchable, implementing robust security and governance, and establishing effective data sharing processes are essential to this transformation. AWS offers services like AWS Data Exchange, AWS Glue, AWS Clean Rooms and Amazon DataZone to help organizations unlock the full potential of their data.

Personas

Let’s identify the various roles involved in the data sharing process.

First of all, there are data producers, which might include internal teams/systems, third-party producers, and partners. The data consumers include internal stakeholders/systems, external partners, and end-customers. At the core of this ecosystem lies the enterprise data platform. When considering enterprises, numerous personas come into play:

  • Line of business users – These personas need to classify data, add business context, collaborate effectively with other lines of business, gain enhanced visibility into business key performance indicators (KPIs) for improved outcomes, and explore opportunities for monetizing data
  • Partners – Partners should be able to share data, collaborate with other partners and customers.
  • Data scientists and business analysts – These personas should be able to access the data, analyze it and generate actionable business insights
  • Data engineers – Data engineers are tasked with building the proper data pipeline and cataloging the data that meets the diverse needs of stakeholders, including business analysts, data scientists, partners, and line of business users
  • Data security and governance officers – Data security involves making sure producers and consumers have appropriate access to the data, implementing right access permissions, and maintaining compliance with industry regulations, particularly in highly regulated sectors like healthcare, life sciences, and financial services. This persona is also responsible for enhancing data governance by tracking lineage, and establishing data mesh policies

Choosing the right tool for the job

Now that you have identified the various personas, it’s important to select the appropriate tools for each role:

  • Starting with the producers, if your data source includes a software as a service (SaaS) platform, AWS Glue offers options to automate data flows between software service providers and AWS services.
  • For producers seeking collaboration with partners, AWS Clean Rooms facilitates secure collaboration and analysis of collective datasets without the need to share or duplicate underlying data.
  • When dealing with third-party data sources, AWS Data Exchange simplifies the discovery, subscription, and utilization of third-party data from a diverse range of producers or providers. As a producer, you can also monetize your data through the subscription model using AWS Data Exchange.
  • Within your organization, you can democratize data with governance, using Amazon DataZone, which offers built-in governance features.
  • For SaaS consumers, AWS Glue supports bidirectional transfer and serves both as a producer and consumer tool for various SaaS providers.

Let’s briefly describe the capabilities of the AWS services we referred above:

AWS Glue is a fully managed, serverless, and scalable extract, transform, and load (ETL) service that simplifies the process of discovering, preparing, and loading data for analytics. It provides data catalog, automated crawlers, and visual job creation to streamline data integration across various data sources and targets.

AWS Data Exchange enables you to find, subscribe to, and use third-party datasets in the AWS Cloud. It also provides a platform through which a data producer can make their data available for consumption for subscribers. It is a data marketplace featuring over 300 providers offering thousands of datasets accessible through files, Amazon Redshift tables, and APIs. This service supports consolidated billing and subscription management, offering you the flexibility to explore 1,000 free datasets and samples. You don’t need to set up a separate billing mechanism or payment method specifically for AWS Data Exchange subscriptions.

AWS Clean Rooms is designed to assist companies and their partners in securely analyzing and collaborating on collective datasets without revealing or sharing underlying data. You can swiftly create a secure data clean room, fostering collaboration with other entities on the AWS Cloud to derive unique insights for initiatives such as advertising campaigns or research and development. This service protects underlying data through a comprehensive set of privacy-enhancing controls and flexible analysis rules tailored to specific business needs.

Amazon DataZone is a data management service that makes it fast and straightforward to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources. With Amazon DataZone, administrators and data stewards who oversee an organization’s data assets can manage and govern access to data using fine-grained controls. These controls are designed to grant access with the right level of privileges and context. Amazon DataZone makes it straightforward for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization so they can discover, use, and collaborate to derive data-driven insights.

Use cases

Let’s review some example use cases to understand how these diverse services can be effectively applied within a business context to achieve the desired outcomes. In this particular scenario, we focus on a company named AnyHealth, which operates in the healthcare and life sciences sector. This company encompasses multiple lines of businesses, specializing in the sale of various scientific equipment. Three key requirements have been identified:

  • Sales and customer visibility by line of business – AnyHealth wants to gain insights into the sales performance and customer demands specific to each line of business. This necessitates a comprehensive view of sales activities and customer requirements tailored to individual lines of business.
  • Cross-organization supply chain and inventory visibility – The company faces challenges related to supply chain and inventory management, especially in global crisis situations like a pandemic. They want to address instances where inventory items are idle in one line of business while there is demand for the same items in another. To overcome this, they want to establish cross-organizational visibility of supply chain and inventory data, breaking down silos and achieving prompt responses to business demands.
  • Cross-sell and up-sell opportunities – AnyHealth intends to boost sales by implementing cross-selling and up-selling strategies. To achieve this, they plan to use machine learning (ML) models to extract insights from data. These insights will then be provided to sales representatives and resellers, enabling them to identify and capitalize on opportunities effectively.

In the following sections, we discuss how to address each requirement in more detail and the AWS services that best fit each solution.

Sales and customer visibility by line of business

The first requirement involves obtaining visibility into sales and customer demand by line of business. The key consumers of this data include line of business leaders, business analysts, and various other business stakeholders.

The initial step is to ingest sales and order data into the platform. Currently, this data is centralized in the ERP system, specifically SAP. The objective is to regularly retrieve this data and capture any changes that occur. The data engineers are instrumental in building this pipeline. Given that we are dealing with a SaaS integration, AWS Glue is the logical choice for seamless data ingestion.

Next, we focus on building the enterprise data platform where the accumulated data will be hosted. This platform will incorporate robust cataloging, making sure the data is easily searchable, and will enforce the necessary security and governance measures for selective sharing among business stakeholders, data engineers, analysts, security and governance officers. In this context, Amazon DataZone is the optimal choice for managing the enterprise data platform.

As stated earlier, the first step involves data ingestion. Data is ingested from a third-party vendor SaaS solution (SAP), and the data engineer uses AWS Glue. Utilizing the SAP data connector, the data engineer establishes a connection with the SAP environment, running scheduled jobs.

The data lands in Amazon Simple Storage Service (Amazon S3). Additional AWS Glue jobs are created to transform and curate the data. The curated data is placed in a designated bucket and AWS Glue crawlers are run to catalog the data. This cataloged data is then managed through Amazon DataZone.

In Amazon DataZone, the data security officer creates the corporate domain. She/he creates producer projects and enables access to data engineers, and business analysts. Data engineers ensure sales and customer data is available from the source into the Amazon DataZone project. Business analysts enhance the data with business metadata/glossaries and publish the same as data assets or data products. The data security officer sets permissions in Amazon DataZone to allow users to access the data portal. Users can search for assets in the Amazon DataZone catalog, view the metadata assigned to them, and access the assets.

Amazon Athena is used to query, and explore the data. Amazon QuickSight is used to read from Amazon Athena and generate reports that is consumed by the line of business users and other stakeholders.

The following diagram illustrates the solution architecture using AWS services.

Cross-organization supply chain and inventory visibility

For the second requirement, the objective is to achieve visibility of supply chain and inventory across the organization. The key stakeholders remain line of business users. They would like to get a cross-organization visibility of supply chain and inventory data. The aim is to ingest supply chain and inventory information in a scheduled manner from the ERP system (SAP), and also capture any changes in the supply chain and inventory data. The persona involved in setting up the data ingestion pipeline is a data engineer. Given that we are extracting data from SAP, AWS Glue is the suitable choice for this requirement.

The next step involves obtaining economic indicators and weather information from third-party sources. AnyHealth, with its diverse lines of business, including one that manufactures medical equipment such as inhalers for asthma treatment, recognizes the significance of collecting weather information, particularly data about pollen, because it directly impacts the patient population. Additionally, socioeconomic conditions play a crucial role in government-assisted programs related to out-of-hospital care. To incorporate this third-party data, AWS Data Exchange is the logical choice.

Finally, all the accumulated data needs to be hosted on the enterprise data platform, with cataloging, and robust security and governance measures. In this context, Amazon DataZone is the preferred solution.

The pipeline begins with the ingestion of data from SAP, facilitated by AWS Glue. The data lands in Amazon S3, where AWS Glue jobs are used to curate the data, generate curated tables, and then AWS Glue crawlers are used to catalog the data.

AWS Data Exchange serves as the platform for collecting economic trends and weather information. The business analyst leverages AWS Data Exchange to retrieve data from various sources. In the AWS Data Exchange marketplace, they identify the data set, subscribe to the data, and subsequently consume it. Any changes in the source data invokes events, which updates the data object in the Amazon S3 bucket.

Amazon DataZone is used to manage and govern the datalake. Similar to the first use case, the data security officer creates a producer project. The data owner from LoB creates supply chain and inventory data assets in the producer project and publishes the same. From the consumer perspective, the data security officer also creates a consumer project, which allows the sales and marketing teams from different LoBs to search for the supply chain and inventory data published by the producer. Consumers request access to the published supply chain and inventory data, and the producer grants the necessary access. Amazon Athena is used to query, and explore the data. Amazon QuickSight is used to read from Amazon Athena and generate reports.

The following diagram illustrates this architecture.

Cross-sell and up-sell opportunities

The third requirement involves identifying cross-sell and up-sell opportunities. The key business consumers in this context are the sales representatives and resellers. AnyHealth operates globally, selling products in Europe, America, and Asia. Direct business transactions with consumers occur in America and Europe, and resellers facilitate sales in Asia, where AnyHealth lacks a direct relationship with the consumers.

The enterprise data platform is used to host and analyze the sales data and identify the customer demand. This data platform is managed by Amazon Data Zone. Cross-sell and up-sell opportunities, derived through ML models, are integrated into the customer relationship management (CRM) system, which in this case is Salesforce. Sales representatives access this data from Salesforce to engage with the market and collaborate with customers. AWS Glue is used for this integration.

Typically, resellers don’t provide their partners direct access to their customer data. Although AnyHealth doesn’t have direct access, understanding customer personas and profile information is essential to equip resellers with right offers to cross-sell and up-sell products. AWS Clean Rooms enables collaboration on collective datasets with stringent security controls, enabling insights without sharing the underlying data.

By addressing these requirements, AnyHealth can effectively identify and capitalize on cross-sell and up-sell opportunities, tailoring their approach based on the distinct dynamics of direct and reseller-based business models across various regions.

The initial step in the architecture involves a pipeline where SAP data is ingested into Amazon S3 and curated using AWS Glue job. The curated data is cataloged, governed and managed using Amazon DataZone.

In this scenario, where sales and customer information are acquired, data scientists build ML models to identify cross-sell and upsell opportunities. Using Amazon DataZone, these opportunities are shared with line of business users, providing transparency regarding the opportunities presented to sales reps and resellers. The cross-sell and upsell insights are pushed to Salesforce through AWS Glue, with an event-driven workflow for timely communication to sales reps. However, for resellers, a different pipeline is needed as AnyHealth doesn’t have direct access to the customer sales data. AnyHealth uses AWS Clean Rooms for this purpose.

With AWS Clean Rooms, the collaboration is started by AnyHealth (the collaboration initiator) who invites resellers to join. Resellers participate in the collaboration, and share the customer profile and segment information, while maintaining privacy by excluding customer names and contact details. AnyHealth uses the customer profile information and order trends to identify cross-sell and upsell opportunities. These opportunities are shared with the reseller to pursue further and position products in the market.

The following diagram illustrates this architecture.

Final architecture

Let’s now examine the complete architecture which covers all three use cases. In this architecture, purpose-built services like AWS Data Exchange, AWS Glue, AWS Clean Rooms and Amazon DataZone, have been used. The seamless integration of these services works cohesively to achieve end-to-end business objectives.

The following diagram illustrates this architecture.

To strengthen the security posture of your cloud infrastructure, we recommend using AWS Identity and Access Management (IAM), which allows you to manage access to AWS resources by creating users, groups, and roles with specific permissions. Additionally, you can use AWS Key Management Service (AWS KMS), which enables you to create, manage, and control encryption keys used to protect your data, so only authorized entities can access sensitive information. To provide an audit trail for compliance, you can use AWS CloudTrail, which records API calls made within your AWS account.

Conclusion

In this post, we discussed how to choose right tool for building an enterprise data platform and enabling data sharing, collaboration and access within your organization and with third-party providers. We addressed three business use cases using AWS Glue, AWS Data Exchange, AWS Clean Rooms, and Amazon DataZone through three different use cases.

To learn more about these services, check out the AWS Blogs for Amazon DataZone, AWS Glue, AWS Clean Rooms, and AWS Data Exchange.


About the authors

Ramakant Joshi is an AWS Solutions Architect, specializing in the analytics and serverless domain. He has a background in software development and hybrid architectures, and is passionate about helping customers modernize their cloud architecture.

Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration and strategy.