Tag Archives: Analytics

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

Post Syndicated from Raymond Lai original https://aws.amazon.com/blogs/big-data/enforce-fine-grained-access-control-on-open-table-formats-via-amazon-emr-integrated-with-aws-lake-formation/

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. This allows you to simplify security and governance over transactional data lakes by providing access controls at table-, column-, and row-level permissions with your Apache Spark jobs. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making. You can build a lake house architecture using Amazon EMR integrated with Lake Formation for FGAC. This combination of services allows you to conduct data analysis on your transactional data lake while ensuring secure and controlled access.

The Amazon EMR record server component supports table-, column-, row-, cell-, and nested attribute-level data filtering functionality. It extends support to Hive, Apache Hudi, Apache Iceberg, and Delta lake formats for both reading (including time travel and incremental query) and write operations (on DML statements such as INSERT). Additionally, with version 6.15, Amazon EMR introduces access control protection for its application web interface such as on-cluster Spark History Server, Yarn Timeline Server, and Yarn Resource Manager UI.

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR integrated with Lake Formation.

Transaction data lake use case

Amazon EMR customers often use Open Table Formats to support their ACID transaction and time travel needs in a data lake. By preserving historical versions, data lake time travel provides benefits such as auditing and compliance, data recovery and rollback, reproducible analysis, and data exploration at different points in time.

Another popular transaction data lake use case is incremental query. Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. The key idea behind incremental queries is to use metadata or change tracking mechanisms to identify the new or modified data since the last query. By identifying these changes, the query engine can optimize the query to process only the relevant data, significantly reducing the processing time and resource requirements.

Solution overview

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) integrated with Lake Formation. Apache Hudi is an open source transactional data lake framework that greatly simplifies incremental data processing and the development of data pipelines. This new FGAC feature supports all OTF. Besides demonstrating with Hudi here, we will follow up with other OTF tables with other blogs. We use notebooks in Amazon SageMaker Studio to read and write Hudi data via different user access permissions through an EMR cluster. This reflects real-world data access scenarios—for example, if an engineering user needs full data access to troubleshoot on a data platform, whereas data analysts may only need to access a subset of that data that doesn’t contain personally identifiable information (PII). Integrating with Lake Formation via the Amazon EMR runtime role further enables you to improve your data security posture and simplifies data control management for Amazon EMR workloads. This solution ensures a secure and controlled environment for data access, meeting the diverse needs and security requirements of different users and roles in an organization.

The following diagram illustrates the solution architecture.

Solution architecture

We conduct a data ingestion process to upsert (update and insert) a Hudi dataset to an Amazon Simple Storage Service (Amazon S3) bucket, and persist or update the table schema in the AWS Glue Data Catalog. With zero data movement, we can query the Hudi table governed by Lake Formation via various AWS services, such as Amazon Athena, Amazon EMR, and Amazon SageMaker.

When users submit a Spark job through any EMR cluster endpoints (EMR Steps, Livy, EMR Studio, and SageMaker), Lake Formation validates their privileges and instructs the EMR cluster to filter out sensitive data such as PII data.

This solution has three different types of users with different levels of permissions to access the Hudi data:

  • hudi-db-creator-role – This is used by the data lake administrator who has privileges to carry out DDL operations such as creating, modifying, and deleting database objects. They can define data filtering rules on Lake Formation for row-level and column-level data access control. These FGAC rules ensure that data lake is secured and fulfills the data privacy regulations required.
  • hudi-table-pii-role – This is used by engineering users. The engineering users are capable of carrying out time travel and incremental queries on both Copy-on-Write (CoW) and Merge-on-Read (MoR). They also have privilege to access PII data based on any timestamps.
  • hudi-table-non-pii-role – This is used by data analysts. Data analysts’ data access rights are governed by FGAC authorized rules controlled by data lake administrators. They do not have visibility on columns containing PII data like names and addresses. Additionally, they can’t access rows of data that don’t fulfill certain conditions. For example, the users only can access data rows that belong to their country.

Prerequisites

You can download the three notebooks used in this post from the GitHub repo.

Before you deploy the solution, make sure you have the following:

Complete the following steps to set up your permissions:

  1. Log in to your AWS account with your admin IAM user.

Make sure you are in theus-east-1Region.

  1. Create a S3 bucket in the us-east-1 Region (for example,emr-fgac-hudi-us-east-1-<ACCOUNT ID>).

Next, we enable Lake Formation by changing the default permission model.

  1. Sign in to the Lake Formation console as the administrator user.
  2. Choose Data Catalog settings under Administration in the navigation pane.
  3. Under Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. Choose Save.

Data Catalog settings

Alternatively, you need to revoke IAMAllowedPrincipals on resources (databases and tables) created if you started Lake Formation with the default option.

Finally, we create a key pair for Amazon EMR.

  1. On the Amazon EC2 console, choose Key pairs in the navigation pane.
  2. Choose Create key pair.
  3. For Name, enter a name (for exampleemr-fgac-hudi-keypair).
  4. Choose Create key pair.

Create key pair

The generated key pair (for this post, emr-fgac-hudi-keypair.pem) will save to your local computer.

Next, we create an AWS Cloud9 interactive development environment (IDE).

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Choose Create environment.
  3. For Name¸ enter a name (for example,emr-fgac-hudi-env).
  4. Keep the other settings as default.

Cloud9 environment

  1. Choose Create.
  2. When the IDE is ready, choose Open to open it.

cloud9 environment

  1. In the AWS Cloud9 IDE, on the File menu, choose Upload Local Files.

Upload local file

  1. Upload the key pair file (emr-fgac-hudi-keypair.pem).
  2. Choose the plus sign and choose New Terminal.

new terminal

  1. In the terminal, input the following command lines:
#Create encryption certificates for EMR in transit encryption
openssl req -x509 \
-newkey rsa:1024 \
-keyout privateKey.pem \
-out certificateChain.pem \
-days 365 \
-nodes \
-subj '/C=US/ST=Washington/L=Seattle/O=MyOrg/OU=MyDept/CN=*.compute.internal'
cp certificateChain.pem trustedCertificates.pem

# Zip certificates
zip -r -X my-certs.zip certificateChain.pem privateKey.pem trustedCertificates.pem

# Upload the certificates zip file to S3 bucket
# Replace <ACCOUNT ID> with your AWS account ID
aws s3 cp ./my-certs.zip s3://emr-fgac-hudi-us-east-1-<ACCOUNT ID>/my-certs.zip

Note that the example code is a proof of concept for demonstration purposes only. For production systems, use a trusted certification authority (CA) to issue certificates. Refer to Providing certificates for encrypting data in transit with Amazon EMR encryption for details.

Deploy the solution via AWS CloudFormation

We provide an AWS CloudFormation template that automatically sets up the following services and components:

  • An S3 bucket for the data lake. It contains the sample TPC-DS dataset.
  • An EMR cluster with security configuration and public DNS enabled.
  • EMR runtime IAM roles with Lake Formation fine-grained permissions:
    • <STACK-NAME>-hudi-db-creator-role – This role is used to create Apache Hudi database and tables.
    • <STACK-NAME>-hudi-table-pii-role – This role provides permission to query all columns of Hudi tables, including columns with PII.
    • <STACK-NAME>-hudi-table-non-pii-role – This role provides permission to query Hudi tables that have filtered out PII columns by Lake Formation.
  • SageMaker Studio execution roles that allow the users to assume their corresponding EMR runtime roles.
  • Networking resources such as VPC, subnets, and security groups.

Complete the following steps to deploy the resources:

  1. Choose Quick create stack to launch the CloudFormation stack.
  2. For Stack name, enter a stack name (for example,rsv2-emr-hudi-blog).
  3. For Ec2KeyPair, enter the name of your key pair.
  4. For IdleTimeout, enter an idle timeout for the EMR cluster to avoid paying for the cluster when it’s not being used.
  5. For InitS3Bucket, enter the S3 bucket name you created to save the Amazon EMR encryption certificate .zip file.
  6. For S3CertsZip, enter the S3 URI of the Amazon EMR encryption certificate .zip file.

CloudFormation template

  1. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  2. Choose Create stack.

The CloudFormation stack deployment takes around 10 minutes.

Set up Lake Formation for Amazon EMR integration

Complete the following steps to set up Lake Formation:

  1. On the Lake Formation console, choose Application integration settings under Administration in the navigation pane.
  2. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  3. Choose Amazon EMR for Session tag values.
  4. Enter your AWS account ID for AWS account IDs.
  5. Choose Save.

LF - Application integration settings

  1. Choose Databases under Data Catalog in the navigation pane.
  2. Choose Create database.
  3. For Name, enter default.
  4. Choose Create database.

LF - create database

  1. Choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. Select IAM users and roles.
  4. Choose your IAM roles.
  5. For Databases, choose default.
  6. For Database permissions, select Describe.
  7. Choose Grant.

LF - Grant data permissions

Copy Hudi JAR file to Amazon EMR HDFS

To use Hudi with Jupyter notebooks, you need to complete the following steps for the EMR cluster, which includes copying a Hudi JAR file from the Amazon EMR local directory to its HDFS storage, so that you can configure a Spark session to use Hudi:

  1. Authorize inbound SSH traffic (port 22).
  2. Copy the value for Primary node public DNS (for example, ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com) from the EMR cluster Summary section.

EMR cluster summary

  1. Go back to previous AWS Cloud9 terminal you used to create the EC2 key pair.
  2. Run the following command to SSH into the EMR primary node. Replace the placeholder with your EMR DNS hostname:
chmod 400 emr-fgac-hudi-keypair.pem
ssh -i emr-fgac-hudi-keypair.pem [email protected]
  1. Run the following command to copy the Hudi JAR file to HDFS:
hdfs dfs -mkdir -p /apps/hudi/lib
hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar

Create the Hudi database and tables in Lake Formation

Now we’re ready to create the Hudi database and tables with FGAC enabled by the EMR runtime role. The EMR runtime role is an IAM role that you can specify when you submit a job or query to an EMR cluster.

Grant database creator permission

First, let’s grant the Lake Formation database creator permission to<STACK-NAME>-hudi-db-creator-role:

  1. Log in to your AWS account as an administrator.
  2. On the Lake Formation console, choose Administrative roles and tasks under Administration in the navigation pane.
  3. Confirm that your AWS login user has been added as a data lake administrator.
  4. In the Database creator section, choose Grant.
  5. For IAM users and roles, choose<STACK-NAME>-hudi-db-creator-role.
  6. For Catalog permissions, select Create database.
  7. Choose Grant.

Register the data lake location

Next, let’s register the S3 data lake location in Lake Formation:

  1. On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
  2. Choose Register location.
  3. For Amazon S3 path, Choose Browse and choose the data lake S3 bucket. (<STACK_NAME>s3bucket-XXXXXXX) created from the CloudFormation stack.
  4. For IAM role, choose<STACK-NAME>-hudi-db-creator-role.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.

LF - Register location

Grant data location permission

Next, we need to grant<STACK-NAME>-hudi-db-creator-rolethe data location permission:

  1. On the Lake Formation console, choose Data locations under Permissions in the navigation pane.
  2. Choose Grant.
  3. For IAM users and roles, choose<STACK-NAME>-hudi-db-creator-role.
  4. For Storage locations, enter the S3 bucket (<STACK_NAME>-s3bucket-XXXXXXX).
  5. Choose Grant.

LF - Grant permissions

Connect to the EMR cluster

Now, let’s use a Jupyter notebook in SageMaker Studio to connect to the EMR cluster with the database creator EMR runtime role:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain<STACK-NAME>-Studio-EMR-LF-Hudi.
  3. On the Launch menu next to the user profile<STACK-NAME>-hudi-db-creator, choose Studio.

SM - Domain details

  1. Download the notebook rsv2-hudi-db-creator-notebook.
  2. Choose the upload icon.

SM Studio - Upload

  1. Choose the downloaded Jupyter notebook and choose Open.
  2. Open the uploaded notebook.
  3. For Image, choose SparkMagic.
  4. For Kernel, choose PySpark.
  5. Leave the other configurations as default and choose Select.

SM Studio - Change environment

  1. Choose Cluster to connect to the EMR cluster.

SM Studio - connect EMR cluster

  1. Choose the EMR on EC2 cluster (<STACK-NAME>-EMR-Cluster) created with the CloudFormation stack.
  2. Choose Connect.
  3. For EMR execution role, choose<STACK-NAME>-hudi-db-creator-role.
  4. Choose Connect.

Create database and tables

Now you can follow the steps in the notebook to create the Hudi database and tables. The major steps are as follows:

  1. When you start the notebook, configure“spark.sql.catalog.spark_catalog.lf.managed":"true"to inform Spark that spark_catalog is protected by Lake Formation.
  2. Create Hudi tables using the following Spark SQL.
%%sql 
CREATE TABLE IF NOT EXISTS ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}(
    c_customer_id string,
    c_birth_country string,
    c_customer_sk integer,
    c_email_address string,
    c_first_name string,
    c_last_name string,
    ts bigint
) USING hudi
LOCATION '${cow_table_location_sql}'
OPTIONS (
  type = 'cow',
  primaryKey = '${hudi_primary_key}',
  preCombineField = '${hudi_pre_combined_field}'
 ) 
PARTITIONED BY (${hudi_partitioin_field});

  1. Insert data from the source table to the Hudi tables.
%%sql
INSERT OVERWRITE ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
SELECT 
    c_customer_id ,  
    c_customer_sk,
    c_email_address,
    c_first_name,
    c_last_name,
    unix_timestamp(current_timestamp()) AS ts,
    c_birth_country
FROM ${src_df_view}
WHERE c_birth_country = 'HONG KONG' OR c_birth_country = 'CHINA' 
LIMIT 1000
  1. Insert data again into the Hudi tables.
%%sql
INSERT INTO ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
SELECT 
    c_customer_id ,  
    c_customer_sk,
    c_email_address,
    c_first_name,
    c_last_name,
    unix_timestamp(current_timestamp()) AS ts,
    c_birth_country
FROM ${insert_into_view}

Query the Hudi tables via Lake Formation with FGAC

After you create the Hudi database and tables, you’re ready to query the tables using fine-grained access control with Lake Formation. We have created two types of Hudi tables: Copy-On-Write (COW) and Merge-On-Read (MOR). The COW table stores data in a columnar format (Parquet), and each update creates a new version of files during a write. This means that for every update, Hudi rewrites the entire file, which can be more resource-intensive but provides faster read performance. MOR, on the other hand, is introduced for cases where COW may not be optimal, particularly for write- or change-heavy workloads. In a MOR table, each time there is an update, Hudi writes only the row for the changed record, which reduces cost and enables low-latency writes. However, the read performance might be slower compared to COW tables.

Grant table access permission

We use the IAM role<STACK-NAME>-hudi-table-pii-roleto query Hudi COW and MOR containing PII columns. We first grant the table access permission via Lake Formation:

  1. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. Choose<STACK-NAME>-hudi-table-pii-rolefor IAM users and roles.
  4. Choose thersv2_blog_hudi_db_1database for Databases.
  5. For Tables, choose the four Hudi tables you created in the Jupyter notebook.

LF - Grant data permissions

  1. For Table permissions, select Select.
  2. Choose Grant.

LF - table permissions

Query PII columns

Now you’re ready to run the notebook to query the Hudi tables. Let’s follow similar steps to the previous section to run the notebook in SageMaker Studio:

  1. On the SageMaker console, navigate to the<STACK-NAME>-Studio-EMR-LF-Hudidomain.
  2. On the Launch menu next to the<STACK-NAME>-hudi-table-readeruser profile, choose Studio.
  3. Upload the downloaded notebook rsv2-hudi-table-pii-reader-notebook.
  4. Open the uploaded notebook.
  5. Repeat the notebook setup steps and connect to the same EMR cluster, but use the role<STACK-NAME>-hudi-table-pii-role.

In the current stage, FGAC-enabled EMR cluster needs to query Hudi’s commit time column for performing incremental queries and time travel. It does not support Spark’s “timestamp as of” syntax and Spark.read(). We are actively working on incorporating support for both actions in future Amazon EMR releases with FGAC enabled.

You can now follow the steps in the notebook. The following are some highlighted steps:

  1. Run a snapshot query.
%%sql 
SELECT c_birth_country, count(*) FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql} GROUP BY c_birth_country;
  1. Run an incremental query.
incremental_df = spark.sql(f"""
SELECT * FROM {HUDI_CATALOG}.{HUDI_DATABASE}.{COW_TABLE_NAME_SQL} WHERE _hoodie_commit_time >= {commit_ts[-1]}
""")

incremental_df.createOrReplaceTempView("incremental_view")
%%sql
SELECT 
    c_birth_country, 
    count(*) 
FROM incremental_view
GROUP BY c_birth_country;
  1. Run a time travel query.
%%sql
SELECT
    c_birth_country, COUNT(*) as count
FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
WHERE _hoodie_commit_time IN
(
    SELECT DISTINCT _hoodie_commit_time FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql} ORDER BY _hoodie_commit_time LIMIT 1 
)
GROUP BY c_birth_country
  1. Run MOR read-optimized and real-time table queries.
%%sql
SELECT
    a.email_label,
    count(*)
FROM (
    SELECT
        CASE
            WHEN c_email_address = 'UNKNOWN' THEN 'UNKNOWN'
            ELSE 'NOT_UNKNOWN'
        END AS email_label
    FROM ${hudi_catalog}.${hudi_db}.${mor_table_name_sql}_ro
    WHERE c_birth_country = 'HONG KONG'
) a
GROUP BY a.email_label;
%%sql
SELECT *  
FROM ${hudi_catalog}.${hudi_db}.${mor_table_name_sql}_ro
WHERE 
    c_birth_country = 'INDIA' OR c_first_name = 'MASKED'

Query the Hudi tables with column-level and row-level data filters

We use the IAM role<STACK-NAME>-hudi-table-non-pii-roleto query Hudi tables. This role is not allowed to query any columns containing PII. We use the Lake Formation column-level and row-level data filters to implement fine-grained access control:

  1. On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
  2. Choose Create new filter.
  3. For Data filter name, entercustomer-pii-filter.
  4. Choosersv2_blog_hudi_db_1for Target database.
  5. Choosersv2_blog_hudi_mor_sql_dl_customer_1for Target table.
  6. Select Exclude columns and choose thec_customer_id,c_email_address, andc_last_namecolumns.
  7. Enterc_birth_country != 'HONG KONG'for Row filter expression.
  8. Choose Create filter.

LF - create data filter

  1. Choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. Choose<STACK-NAME>-hudi-table-non-pii-rolefor IAM users and roles.
  4. Choosersv2_blog_hudi_db_1for Databases.
  5. Choosersv2_blog_hudi_mor_sql_dl_tpc_customer_1for Tables.
  6. Choosecustomer-pii-filterfor Data filters.
  7. For Data filter permissions, select Select.
  8. Choose Grant.

LF - Grant data permissions

Let’s follow similar steps to run the notebook in SageMaker Studio:

  1. On the SageMaker console, navigate to the domainStudio-EMR-LF-Hudi.
  2. On the Launch menu for thehudi-table-readeruser profile, choose Studio.
  3. Upload the downloaded notebook rsv2-hudi-table-non-pii-reader-notebook and choose Open.
  4. Repeat the notebook setup steps and connect to the same EMR cluster, but select the role<STACK-NAME>-hudi-table-non-pii-role.

You can now follow the steps in the notebook. From the query results, you can see that FGAC via the Lake Formation data filter has been applied. The role can’t see the PII columnsc_customer_id,c_last_name, andc_email_address. Also, the rows fromHONG KONGhave been filtered.

filtered query result

Clean up

After you’re done experimenting with the solution, we recommend cleaning up resources with the following steps to avoid unexpected costs:

  1. Shut down the SageMaker Studio apps for the user profiles.

The EMR cluster will be automatically deleted after the idle timeout value.

  1. Delete the Amazon Elastic File System (Amazon EFS) volume created for the domain.
  2. Empty the S3 buckets created by the CloudFormation stack.
  3. On the AWS CloudFormation console, delete the stack.

Conclusion

In this post, we used Apachi Hudi, one type of OTF tables, to demonstrate this new feature to enforce fine-grained access control on Amazon EMR. You can define granular permissions in Lake Formation for OTF tables and apply them via Spark SQL queries on EMR clusters. You also can use transactional data lake features such as running snapshot queries, incremental queries, time travel, and DML query. Please note that this new feature covers all OTF tables.

This feature is launched starting from Amazon EMR release 6.15 in all Regions where Amazon EMR is available. With the Amazon EMR integration with Lake Formation, you can confidently manage and process big data, unlocking insights and facilitating informed decision-making while upholding data security and governance.

To learn more, refer to Enable Lake Formation with Amazon EMR and feel free to contact your AWS Solutions Architects, who can be of assistance alongside your data journey.


About the Author

Raymond LaiRaymond Lai is a Senior Solutions Architect who specializes in catering to the needs of large enterprise customers. His expertise lies in assisting customers with migrating intricate enterprise systems and databases to AWS, constructing enterprise data warehousing and data lake platforms. Raymond excels in identifying and designing solutions for AI/ML use cases, and he has a particular focus on AWS Serverless solutions and Event Driven Architecture design.

Bin Wang, PhD, is a Senior Analytic Specialist Solutions Architect at AWS, boasting over 12 years of experience in the ML industry, with a particular focus on advertising. He possesses expertise in natural language processing (NLP), recommender systems, diverse ML algorithms, and ML operations. He is deeply passionate about applying ML/DL and big data techniques to solve real-world problems.

Aditya Shah is a Software Development Engineer at AWS. He is interested in Databases and Data warehouse engines and has worked on performance optimisations, security compliance and ACID compliance for engines like Apache Hive and Apache Spark.

Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

Power neural search with AI/ML connectors in Amazon OpenSearch Service

Post Syndicated from Aruna Govindaraju original https://aws.amazon.com/blogs/big-data/power-neural-search-with-ai-ml-connectors-in-amazon-opensearch-service/

With the launch of the neural search feature for Amazon OpenSearch Service in OpenSearch 2.9, it’s now effortless to integrate with AI/ML models to power semantic search and other use cases. OpenSearch Service has supported both lexical and vector search since the introduction of its k-nearest neighbor (k-NN) feature in 2020; however, configuring semantic search required building a framework to integrate machine learning (ML) models to ingest and search. The neural search feature facilitates text-to-vector transformation during ingestion and search. When you use a neural query during search, the query is translated into a vector embedding and k-NN is used to return the nearest vector embeddings from the corpus.

To use neural search, you must set up an ML model. We recommend configuring AI/ML connectors to AWS AI and ML services (such as Amazon SageMaker or Amazon Bedrock) or third-party alternatives. Starting with version 2.9 on OpenSearch Service, AI/ML connectors integrate with neural search to simplify and operationalize the translation of your data corpus and queries to vector embeddings, thereby removing much of the complexity of vector hydration and search.

In this post, we demonstrate how to configure AI/ML connectors to external models through the OpenSearch Service console.

Solution Overview

Specifically, this post walks you through connecting to a model in SageMaker. Then we guide you through using the connector to configure semantic search on OpenSearch Service as an example of a use case that is supported through connection to an ML model. Amazon Bedrock and SageMaker integrations are currently supported on the OpenSearch Service console UI, and the list of UI-supported first- and third-party integrations will continue to grow.

For any models not supported through the UI, you can instead set them up using the available APIs and the ML blueprints. For more information, refer to Introduction to OpenSearch Models. You can find blueprints for each connector in the ML Commons GitHub repository.

Prerequisites

Before connecting the model via the OpenSearch Service console, create an OpenSearch Service domain. Map an AWS Identity and Access Management (IAM) role by the name LambdaInvokeOpenSearchMLCommonsRole as the backend role on the ml_full_access role using the Security plugin on OpenSearch Dashboards, as shown in the following video. The OpenSearch Service integrations workflow is pre-filled to use the LambdaInvokeOpenSearchMLCommonsRole IAM role by default to create the connector between the OpenSearch Service domain and the model deployed on SageMaker. If you use a custom IAM role on the OpenSearch Service console integrations, make sure the custom role is mapped as the backend role with ml_full_access permissions prior to deploying the template.

Deploy the model using AWS CloudFormation

The following video demonstrates the steps to use the OpenSearch Service console to deploy a model within minutes on Amazon SageMaker and generate the model ID via the AI connectors. The first step is to choose Integrations in the navigation pane on the OpenSearch Service AWS console, which routes to a list of available integrations. The integration is set up through a UI, which will prompt you for the necessary inputs.

To set up the integration, you only need to provide the OpenSearch Service domain endpoint and provide a model name to uniquely identify the model connection. By default, the template deploys the Hugging Face sentence-transformers model, djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2.

When you choose Create Stack, you are routed to the AWS CloudFormation console. The CloudFormation template deploys the architecture detailed in the following diagram.

The CloudFormation stack creates an AWS Lambda application that deploys a model from Amazon Simple Storage Service (Amazon S3), creates the connector, and generates the model ID in the output. You can then use this model ID to create a semantic index.

If the default all-MiniLM-L6-v2 model doesn’t serve your purpose, you can deploy any text embedding model of your choice on the chosen model host (SageMaker or Amazon Bedrock) by providing your model artifacts as an accessible S3 object. Alternatively, you can select one of the following pre-trained language models and deploy it to SageMaker. For instructions to set up your endpoint and models, refer to Available Amazon SageMaker Images.

SageMaker is a fully managed service that brings together a broad set of tools to enable high-performance, low-cost ML for any use case, delivering key benefits such as model monitoring, serverless hosting, and workflow automation for continuous training and deployment. SageMaker allows you to host and manage the lifecycle of text embedding models, and use them to power semantic search queries in OpenSearch Service. When connected, SageMaker hosts your models and OpenSearch Service is used to query based on inference results from SageMaker.

View the deployed model through OpenSearch Dashboards

To verify the CloudFormation template successfully deployed the model on the OpenSearch Service domain and get the model ID, you can use the ML Commons REST GET API through OpenSearch Dashboards Dev Tools.

The GET _plugins REST API now provides additional APIs to also view the model status. The following command allows you to see the status of a remote model:

GET _plugins/_ml/models/<modelid>

As shown in the following screenshot, a DEPLOYED status in the response indicates the model is successfully deployed on the OpenSearch Service cluster.

Alternatively, you can view the model deployed on your OpenSearch Service domain using the Machine Learning page of OpenSearch Dashboards.

This page lists the model information and the statuses of all the models deployed.

Create the neural pipeline using the model ID

When the status of the model shows as either DEPLOYED in Dev Tools or green and Responding in OpenSearch Dashboards, you can use the model ID to build your neural ingest pipeline. The following ingest pipeline is run in your domain’s OpenSearch Dashboards Dev Tools. Make sure you replace the model ID with the unique ID generated for the model deployed on your domain.

PUT _ingest/pipeline/neural-pipeline
{
  "description": "Semantic Search for retail product catalog ",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "sfG4zosBIsICJFsINo3X",
        "field_map": {
           "description": "desc_v",
           "name": "name_v"
        }
      }
    }
  ]
}

Create the semantic search index using the neural pipeline as the default pipeline

You can now define your index mapping with the default pipeline configured to use the new neural pipeline you created in the previous step. Ensure the vector fields are declared as knn_vector and the dimensions are appropriate to the model that is deployed on SageMaker. If you have retained the default configuration to deploy the all-MiniLM-L6-v2 model on SageMaker, keep the following settings as is and run the command in Dev Tools.

PUT semantic_demostore
{
  "settings": {
    "index.knn": true,  
    "default_pipeline": "neural-pipeline",
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "desc_v": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "cosinesimil"
        }
      },
      "name_v": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "cosinesimil"
        }
      },
      "description": {
        "type": "text" 
      },
      "name": {
        "type": "text" 
      } 
    }
  }
}

Ingest sample documents to generate vectors

For this demo, you can ingest the sample retail demostore product catalog to the new semantic_demostore index. Replace the user name, password, and domain endpoint with your domain information and ingest raw data into OpenSearch Service:

curl -XPOST -u 'username:password' 'https://domain-end-point/_bulk' --data-binary @semantic_demostore.json -H 'Content-Type: application/json'

Validate the new semantic_demostore index

Now that you have ingested your dataset to the OpenSearch Service domain, validate if the required vectors are generated using a simple search to fetch all fields. Validate if the fields defined as knn_vectors have the required vectors.

Compare lexical search and semantic search powered by neural search using the Compare Search Results tool

The Compare Search Results tool on OpenSearch Dashboards is available for production workloads. You can navigate to the Compare search results page and compare query results between lexical search and neural search configured to use the model ID generated earlier.

Clean up

You can delete the resources you created following the instructions in this post by deleting the CloudFormation stack. This will delete the Lambda resources and the S3 bucket that contain the model that was deployed to SageMaker. Complete the following steps:

  1. On the AWS CloudFormation console, navigate to your stack details page.
  2. Choose Delete.

  1. Choose Delete to confirm.

You can monitor the stack deletion progress on the AWS CloudFormation console.

Note that, deleting the CloudFormation stack doesn’t delete the model deployed on the SageMaker domain and the AI/ML connector created. This is because these models and the connector can be associated with multiple indexes within the domain. To specifically delete a model and its associated connector, use the model APIs as shown in the following screenshots.

First, undeploy the model from the OpenSearch Service domain memory:

POST /_plugins/_ml/models/<model_id>/_undeploy

Then you can delete the model from the model index:

DELETE /_plugins/_ml/models/<model_id>

Lastly, delete the connector from the connector index:

DELETE /_plugins/_ml/connectors/<connector_id>

Conclusion

In this post, you learned how to deploy a model in SageMaker, create the AI/ML connector using the OpenSearch Service console, and build the neural search index. The ability to configure AI/ML connectors in OpenSearch Service simplifies the vector hydration process by making the integrations to external models native. You can create a neural search index in minutes using the neural ingestion pipeline and the neural search that use the model ID to generate the vector embedding on the fly during ingest and search.

To learn more about these AI/ML connectors, refer to Amazon OpenSearch Service AI connectors for AWS services, AWS CloudFormation template integrations for semantic search, and Creating connectors for third-party ML platforms.


About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Dagney Braun is a Principal Product Manager at AWS focused on OpenSearch.

Amazon OpenSearch Serverless now supports automated time-based data deletion 

Post Syndicated from Satish Nandi original https://aws.amazon.com/blogs/big-data/amazon-opensearch-serverless-now-supports-automated-time-based-data-deletion/

We recently announced a new enhancement to OpenSearch Serverless for managing data retention of Time Series collections and Indexes. OpenSearch Serverless for Amazon OpenSearch Service makes it straightforward to run search and analytics workloads without having to think about infrastructure management. With the new automated time-based data deletion feature, you can specify how long they want to retain data and OpenSearch Serverless automatically manages the lifecycle of the data based on this configuration.

To analyze time series data such as application logs and events in OpenSearch, you must create and ingest data into indexes. Typically, these logs are generated continuously and ingested frequently, such as every few minutes, into OpenSearch. Large volumes of logs can consume a lot of the available resources such as storage in the clusters and therefore need to be managed efficiently to maximize optimum performance. You can manage the lifecycle of the indexed data by using automated tooling to create daily indexes. You can then use scripts to rotate the indexed data from the primary storage in clusters to a secondary remote storage to maintain performance and control costs, and then delete the aged data after a certain retention period.

The new automated time-based data deletion feature in OpenSearch Serverless minimizes the need to manually create and manage daily indexes or write data lifecycle scripts. You can now create a single index and OpenSearch Serverless will handle creating a timestamped collection of indexes under one logical grouping automatically. You only need to configure the desired data retention policies for your time series data collections. OpenSearch Serverless will then efficiently roll over indexes from primary storage to Amazon Simple Storage Service(Amazon S3) as they age, and automatically delete aged data per the configured retention policies, reducing the operational overhead and saving costs.

In this post we discuss the new data lifecycle polices and how to get started with these polices in OpenSearch Serverless

Solution Overview

Consider a use case where the fictitious  company Octank Broker collects logs from its web services and ingests them into OpenSearch Serverless for service availability analysis. The company is interested in tracking web access and root cause when failures are seen with error types 4xx and 5xx. Generally, the server issues are of interest within an immediate timeframe, say in a few days. After 30 days, these logs are no longer of interest.

Octank wants to retain their log data for 7 days. If the collections or indexes are configured for 7 days’ data retention, then after 7 days, OpenSearch Serverless deletes the data. The indexes are no longer available for search. Note: Document counts in search results might reflect data that is marked for deletion for a short time.

You can configure data retention by creating a data lifecycle policy. The retention time can be unlimited, or a you can provide a specific time length in Days and Hours with a minimum retention of 24 hours and a maximum of 10 years. If the retention time is unlimited, as the name suggests, no data is deleted.

To start using data lifecycle policies in OpenSearch Serverless, you can follow the steps outlined in this post.

Prerequisites

This post assumes that you have already set up an OpenSearch Serverless collection. If not, refer to Log analytics the easy way with Amazon OpenSearch Serverless for instructions.

Create a data lifecycle policy

You can create a data lifecycle policy from the AWS Management Console, the AWS Command Line Interface (AWS CLI), AWS CloudFormation, AWS Cloud Development Kit (AWS CDK), and Terraform. To create a data lifecycle policy via the console, complete the following steps:

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Choose Create data lifecycle policy.
  • For Data lifecycle policy name, enter a name (for example, web-logs-policy).
  • Choose Add under Data lifecycle.
  • Under Source Collection, choose the collection to which you want to apply the policy (for example, web-logs-collection).
  • Under Indexes, enter the index or index patterns to apply the retention duration (for example, web-logs).
  • Under Data retention, disable Unlimited (to set up the specific retention for the index pattern you defined).
  • Enter the hours or days after which you want to delete data from Amazon S3.
  • Choose Create.

The following graphic gives a quick demonstration of creating the OpenSearch Serverless Data lifecycle policies via the preceding steps.

View the data lifecycle policy

After you have created the data lifecycle policy, you can view the policy by completing the following steps:

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Select the policy you want to view (for example, web-logs-policy).
  • Choose the hyperlink under Policy name.

This page will show you the details such as the index pattern and its retention period for a specific index and collection. The following graphic gives a quick demonstration of viewing the OpenSearch Serverless data lifecycle policies via the preceding steps.

Update the data lifecycle policy

After you have created the data lifecycle policy, you can modify and update it to add more rules. For example, you can add another index pattern or add a new collection with a new index pattern to set up the retention. The following example shows the steps to add another rule in the policy for syslog index under syslogs-collection.

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Select the policy you want to edit (for example, web-logs-policy), then choose Edit.
  • Choose Add under Data lifecycle.
  • Under Source Collection, choose the collection you are going to use for setting up the data lifecycle policy (for example, syslogs-collection).
  • Under Indexes, enter index or index patterns you are going to set retention for (for example, syslogs).
  • Under Data retention, disable Unlimited (to set up specific retention for the index pattern you defined).
  • Enter the hours or days after which you want to delete data from Amazon S3.
  • Choose Save.

The following graphic gives a quick demonstration of updating existing data lifecycle policies via the preceding steps.

Delete the data lifecycle policy

Delete the existing data lifecycle policy with the following steps:

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Select the policy you want to edit (for example, web-logs-policy).
  • Choose Delete.

Data lifecycle policy rules

In a data lifecycle policy, you specify a series of rules. The data lifecycle policy lets you manage the retention period of data associated to indexes or collections that match these rules. These rules define the retention period for data in an index or group of indexes. Each rule consists of a resource type (index), a retention period, and a list of resources (indexes) that the retention period applies to.

You define the retention period with one of the following formats:

  • “MinIndexRetention”: “24h” – OpenSearch Serverless retains the index data for a specified period in hours or days. You can set this period to be from 24 hours (24h) to 3,650 days (3650d).
  • “NoMinIndexRetention”: true – OpenSearch Serverless retains the index data indefinitely.

When data lifecycle policy rules overlap, within or across policies, the rule with a more specific resource name or pattern for an index overrides a rule with a more general resource name or pattern for any indexes that are common to both rules. For example, in the following policy, two rules apply to the index index/sales/logstash. In this situation, the second rule takes precedence because index/sales/log* is the longest match to index/sales/logstash. Therefore, OpenSearch Serverless sets no retention period for the index.

Summary

Data lifecycle policies provide a consistent and straightforward way to manage indexes in OpenSearch Serverless. With data lifecycle policies, you can automate data management and avoid human errors. Deleting non-relevant data without manual intervention reduces your operational load, saves storage costs, and helps keep the system performant for search.


About the authors

Prashant Agrawal is a Senior Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and has years of experience in networking, security and ML/AI. He holds a Bachelor degree in Computer Science and an MBA in Entrepreneurship. In his free time, he likes to fly airplanes, hang gliders and ride his motorcycle.

How smava makes loans transparent and affordable using Amazon Redshift Serverless

Post Syndicated from Alex Naumov original https://aws.amazon.com/blogs/big-data/how-smava-makes-loans-transparent-and-affordable-using-amazon-redshift-serverless/

This is a guest post co-written by Alex Naumov, Principal Data Architect at smava.

smava GmbH is one of the leading financial services companies in Germany, making personal loans transparent, fair, and affordable for consumers. Based on digital processes, smava compares loan offers from more than 20 banks. In this way, borrowers can choose the deals that are most favorable to them in a fast, digitalized, and efficient way.

smava believes in and takes advantage of data-driven decisions in order to become the market leader. The Data Platform team is responsible for supporting data-driven decisions at smava by providing data products across all departments and branches of the company. The departments include teams from engineering to sales and marketing. Branches range by products, namely B2C loans, B2B loans, and formerly also B2C mortgages. The data products used inside the company include insights from user journeys, operational reports, and marketing campaign results, among others. The data platform serves on average 60 thousand queries per day. The data volume is in double-digit TBs with steady growth as business and data sources evolve.

smava’s Data Platform team faced the challenge to deliver data to stakeholders with different SLAs, while maintaining the flexibility to scale up and down while staying cost-efficient. It took up to 3 hours to generate daily reporting, which impacted business decision-making when re-calculations needed to happen during the day. To speed up the self-service analytics and foster innovation based on data, a solution was needed to provide ways to allow any team to create data products on their own in a decentralized manner. To create and manage the data products, smava uses Amazon Redshift, a cloud data warehouse.

In this post, we show how smava optimized their data platform by using Amazon Redshift Serverless and Amazon Redshift data sharing to overcome right-sizing challenges for unpredictable workloads and further improve price-performance. Through the optimizations, smava achieved up to 50% cost savings and up to three times faster report generation compared to the previous analytics infrastructure.

Overview of solution

As a data-driven company, smava relies on the AWS Cloud to power their analytics use cases. To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.

smava ingests data from various external and internal data sources into a landing stage on the data lake based on Amazon Simple Storage Service (Amazon S3). To ingest the data, smava uses a set of popular third-party customer data platforms complemented by custom scripts.

After the data lands in Amazon S3, smava uses the AWS Glue Data Catalog and crawlers to automatically catalog the available data, capture the metadata, and provide an interface that allows querying all data assets.

Data analysts who require access to the raw assets on the data lake use Amazon Athena, a serverless, interactive analytics service for exploration with ad hoc queries. For the downstream consumption by all departments across the organization, smava’s Data Platform team prepares curated data products following the extract, load, and transform (ELT) pattern. smava uses Amazon Redshift as their cloud data warehouse to transform, store, and analyze data, and uses Amazon Redshift Spectrum to efficiently query and retrieve structured and semi-structured data from the data lake using SQL.

smava follows the data vault modeling methodology with the Raw Vault, Business Vault, and Data Mart stages to prepare the data products for end consumers. The Raw Vault describes objects loaded directly from the data sources and represents a copy of the landing stage in the data lake. The Business Vault is populated with data sourced from the Raw Vault and transformed according to the business rules. Finally, the data is aggregated into specific data products oriented to a specific business line. This is the Data Mart stage. The data products from the Business Vault and Data Mart stages are now available for consumers. smava decided to use Tableau for business intelligence, data visualization, and further analytics. The data transformations are managed with dbt to simplify the workflow governance and team collaboration.

The following diagram shows the high-level data platform architecture before the optimizations.

High-level Data Platform architecture before the optimizations

Evolution of the data platform requirements

smava started with a single Redshift cluster to host all three data stages. They chose provisioned cluster nodes of the RA3 type with Reserved Instances (RIs) for cost optimization. As data volumes grew 53% year over year, so did the complexity and requirements from various analytic workloads.

smava quickly addressed the growing data volumes by right-sizing the cluster and using Amazon Redshift Concurrency Scaling for peak workloads. Furthermore, smava wanted to give all teams the option to create their own data products in a self-service manner to increase the pace of innovation. To avoid any interference with the centrally managed data products, the decentralized product development environments needed to be strictly isolated. The same requirement was also applied for the isolation of different product stages curated by the Data Platform team.

Optimizing the architecture with data sharing and Redshift Serverless

To meet the evolved requirements, smava decided to separate the workload by splitting the single provisioned Redshift cluster into multiple data warehouses, with each warehouse serving a different stage. In addition, smava added new staging environments in the Business Vault to develop new data products without the risk of interfering with existing product pipelines. To avoid any interference with the centrally managed data products of the Data Platform team, smava introduced an additional Redshift cluster, isolating the decentralized workloads.

smava was looking for an out-of-the-box solution to achieve workload isolation without managing a complex data replication pipeline.

Right after the launch of Redshift data sharing capabilities in 2021, the Data Platform team recognized that this was the solution they had been looking for. smava adopted the data sharing feature to have the data from producer clusters available for read access on different consumer clusters, with each of those consumer clusters serving a different stage.

Redshift data sharing enables instant, granular, and fast data access across Redshift clusters without the need to copy data. It provides live access to data so that users always see the most up-to-date and consistent information as it’s updated in the data warehouse. With data sharing, you can securely share live data with Redshift clusters in the same or different AWS accounts and across Regions.

With Redshift data sharing, smava was able to optimize the data architecture by separating the data workloads to individual consumer clusters without having to replicate the data. The following diagram illustrates the high-level data platform architecture after splitting the single Redshift cluster into multiple clusters.

High-level Data Platform architecture after splitting the single Redshift cluster in multiple clusters

By providing a self-service data mart, smava increased data democratization by providing users with access to all aspects of the data. They also provided teams with a set of custom tools for data discovery, ad hoc analysis, prototyping, and operating the full lifecycle of mature data products.

After collecting operational data from the individual clusters, the Data Platform team identified further potential optimizations: the Raw Vault cluster was under steady load 24/7, but the Business Vault clusters were only updated nightly. To optimize for costs, smava used the pause and resume capabilities of Redshift provisioned clusters. These capabilities are useful for clusters that need to be available at specific times. While the cluster is paused, on-demand billing is suspended. Only the cluster’s storage incurs charges.

The pause and resume feature helped smava optimize for cost, but it required additional operational overhead to trigger the cluster operations. Additionally, the development clusters remained subject to idle times during working hours. These challenges were finally solved by adopting Redshift Serverless in 2022. The Data Platform team decided to move the Business Data Vault stage clusters to Redshift Serverless, which allows them to pay for the data warehouse only when in use, reliably and efficiently.

Redshift Serverless is ideal for cases when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. Additionally, as usage demand evolves with new workloads and more concurrent users, Redshift Serverless automatically provisions the right compute resources, and the data warehouse scales seamlessly and automatically, without the need for manual intervention. Data sharing is supported in both directions between Redshift Serverless and provisioned Redshift clusters with RA3 nodes, so no changes to the smava architecture were needed. The following diagram shows the high-level architecture setup after the move to Redshift Serverless.

High-level Data Platform architecture after introducing Redshift Serverless for Business Vault clusters

smava combined the benefits of Redshift Serverless and dbt through a seamless CI/CD pipeline, adopting a trunk-based development methodology. Changes on the Git repository are automatically deployed to a test stage and validated using automated integration tests. This approach increased the efficiency of developers and decreased the average time to production from days to minutes.

smava adopted an architecture that utilizes both provisioned and serverless Redshift data warehouses, together with the data sharing capability to isolate the workloads. By choosing the right architectural patterns for their needs, smava was able to accomplish the following:

  • Simplify the data pipelines and reduce operational overhead
  • Reduce the feature release time from days to minutes
  • Increase price-performance by reducing idle times and right-sizing the workload
  • Achieve up to three times faster report generation (faster calculations and higher parallelization) at 50% of the original setup costs
  • Increase agility of all departments and support data-driven decision-making by democratizing access to data
  • Increase the speed of innovation by exposing self-service data capabilities for teams across all departments and strengthening the A/B test capabilities to cover the complete customer journey

Now, all departments at smava are using the available data products to make data-driven, accurate, and agile decisions.

Future vision

For the future, smava plans to continue to optimize the Data Platform based on operational metrics. They’re considering switching more provisioned clusters like the Self-Service Data Mart cluster to serverless. Additionally, smava is optimizing the ELT orchestration toolchain to increase the number of parallel data pipelines to be run. This will increase the utilization of provisioned Redshift resources and allow for cost reductions.

With the introduction of the decentralized, self-service for data product creation, smava made a step forward towards a data mesh architecture. In the future, the Data Platform team plans to further evaluate the needs of their service users and establish further data mesh principles like federated data governance.

Conclusion

In this post, we showed how smava optimized their data platform by isolating environments and workloads using Redshift Serverless and data sharing features. Those Redshift environments are well integrated with their infrastructure, flexible in scaling on demand, and highly available, and they require minimum administration efforts. Overall, smava has increased performance by three times while reducing the total platform costs by 50%. Additionally, they reduced operational overhead to a minimum while maintaining the existing SLAs for report generation times. Moreover, smava has strengthened the culture of innovation by providing self-service data product capabilities to speed up their time to market.

If you’re interested in learning more about Amazon Redshift capabilities, we recommend watching the most recent What’s new with Amazon Redshift session in the AWS Events channel to get an overview of the features recently added to the service. You can also explore the self-service, hands-on Amazon Redshift labs to experiment with key Amazon Redshift functionalities in a guided manner.

You can also dive deeper into Redshift Serverless use cases and data sharing use cases. Additionally, check out the data sharing best practices and discover how other customers optimized for cost and performance with Redshift data sharing to get inspired for your own workloads.

If you prefer books, check out Amazon Redshift: The Definitive Guide by O’Reilly, where the authors detail the capabilities of Amazon Redshift and provide you with insights on corresponding patterns and techniques.


About the Authors

Blog author: Alex NaumovAlex Naumov is a Principal Data Architect at smava GmbH, and leads the transformation projects at the Data department. Alex previously worked 10 years as a consultant and data/solution architect in a wide variety of domains, such as telecommunications, banking, energy, and finance, using various tech stacks, and in many different countries. He has a great passion for data and transforming organizations to become data-driven and the best in what they do.

Blog author: Lingli ZhengLingli Zheng works as a Business Development Manager in the AWS worldwide specialist organization, supporting customers in the DACH region to get the best value out of Amazon analytics services. With over 12 years of experience in energy, automation, and the software industry with a focus on data analytics, AI, and ML, she is dedicated to helping customers achieve tangible business results through digital transformation.

Blog author: Alexander SpivakAlexander Spivak is a Senior Startup Solutions Architect at AWS, focusing on B2B ISV customers across EMEA North. Prior to AWS, Alexander worked as a consultant in financial services engagements, including various roles in software development and architecture. He is passionate about data analytics, serverless architectures, and creating efficient organizations.


This post was reviewed for technical accuracy by David Greenshtein, Senior Analytics Solutions Architect.

AWS re:Invent 2023 Amazon Redshift Sessions Recap

Post Syndicated from Mia Heard original https://aws.amazon.com/blogs/big-data/aws-reinvent-2023-amazon-redshift-sessions-recap/

Amazon Redshift powers data-driven decisions for tens of thousands of customers every day with a fully managed, AI-powered cloud data warehouse, delivering the best price-performance for your analytics workloads. Customers use Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning (ML), data sharing and monetization, and more.

This year’s AWS re:Invent conference, held in Las Vegas from November 27 through December 1, showcased the advancements of Amazon Redshift to help you further accelerate your journey towards modernizing your cloud analytics environments. To learn more about the latest and greatest advancements and how customers are powering data-driven decision-making using Amazon Redshift, watch the re:Invent sessions available on demand listed in this post.

Keynotes

Adam Selipsky, Chief Executive Officer of Amazon Web Services

Watch Adam Selipsky, CEO of Amazon Web Services, as he shares his perspective on cloud transformation. He highlights innovations in data, infrastructure, and artificial intelligence and machine learning that are helping AWS customers achieve their goals faster, mine untapped potential, and create a better future. Learn more about the AWS zero-ETL future with newly launched AWS databases integrations with Amazon Redshift.

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning

Watch Swami Sivasubramanian, Vice President of Data and AI at AWS, to discover how you can use your company data to build differentiated generative AI applications and accelerate productivity for employees across your organization. Learn more about these new generative AI features to increase productivity including Amazon Q generative SQL in Amazon Redshift.

Peter DeSantis, Senior Vice President of AWS Utility Computing

Watch Peter DeSantis, Senior Vice President of AWS Utility Computing, as he deep dives into the engineering that powers AWS services. Get a closer look at how scaling for data warehousing works in AWS with the latest introduction of AI driven scaling and optimizations in Amazon Redshift Serverless to enable better price-performance for your workloads.

Innovation Talks

Data drives transformation: Data foundations with AWS analytics with G2 Krishnamoorthy, Vice President of AWS Analytics

G2’s session discusses strategies for embedding analytics into your applications and ideas for building a data foundation that supports your business initiatives. With new capabilities for self-service and straightforward builder experiences, you can democratize data access for line of business users, analysts, scientists, and engineers. Hear also from Adidas, GlobalFoundries, and University of California, Irvine.

Sessions

ANT203 | What’s new in Amazon Redshift

Watch this session to learn about the newest innovations within Amazon Redshift—the petabyte-scale AWS Cloud data warehousing solution. Amazon Redshift empowers users to extract powerful insights by securely and cost-effectively analyzing data across data warehouses, operational databases, data lakes, third-party data stores, and streaming sources using zero-ETL approaches. Easily build and train machine learning models using SQL within Amazon Redshift to generate predictive analytics and propel data-driven decision-making. Learn about Amazon Redshift’s newest functionality to increase reliability and speed to insights through near-real-time data access, ML, and more—all with impressive price-performance.

ANT322 | Modernize analytics by moving your data warehouse to Amazon Redshift

Watch this session to hear from AWS customers as they share their journeys moving to a modern cloud data warehouse and analytics with Amazon Redshift. Learn best practices for building powerful analytics and ML applications and operating at scale while keeping costs low.

ANT211 | Powering self-service & near real-time analytics with Amazon Redshift

To stay competitive, allowing data citizens across your organization to see near-real-time analytics without worrying about data infrastructure management is crucial for your business. In this session, learn how your data users can get to near-real-time insights on streaming data with Amazon Redshift and AWS streaming data services. Explore a solution using flexible querying tools and a serverless architecture, which brings intelligent automation and scaling capabilities, and maintains consistently high performance for even the most demanding and volatile workloads.

ANT325 | Amazon Redshift: A decade of innovation in cloud data warehousing

Exponential data growth has created unique challenges for data practitioners to manage data warehouses that can support high performance workloads at scale within cost constraints. Amazon Redshift has been constantly innovating over the last decade to give you a modern, massively parallel processing cloud data warehouse that delivers the best price-performance, ease of use, scalability, and reliability. In this session, learn about Amazon Redshift’s technical innovations including serverless, AI/ML-powered autonomics, and zero-ETL data integrations. Discover how you can use Amazon Redshift to build a data mesh architecture to analyze your data.

ANT326 | Set up a zero-ETL-based analytics architecture for your organizations

ETL (extract, transform, and load data) can be challenging, time-consuming, and costly. AWS is building a zero-ETL future with capabilities like streaming ingestion into the data warehouse, federated querying, and connectors that access data in place across databases, data lakes, and third-party data sources without data movement. In this session, learn how zero-ETL investments such as Amazon Aurora zero-ETL integration with Amazon Redshift drive direct integration between AWS data services to allow data engineers to focus on creating value from data instead of spending time and resources building pipelines.

ANT351 | [NEW LAUNCH] Multi-data warehouse writes through Amazon Redshift data sharing

Organizations want simple and secure ways for their teams to meet their ETL SLAs, optimize costs, and collaborate on live data. With multi-data warehouse writes available through Amazon Redshift data sharing, you can write to the same databases with multiple warehouses at the same time. Join this session to learn how you can keep your ETL jobs completing predictably and on time, collaborate on live data with multiple teams, and better optimize your costs with this newly launched capability.

ANT 352 | [NEW LAUNCH] Amazon Q generative SQL in Amazon Redshift Query Editor

SQL, the industry standard language for data analytics, often requires users to spend a lot of time understanding an organization’s complex metadata in order to write and carry out complex SQL queries for data insights. Join this session to learn how you can help SQL users of all skill levels within your organization derive insights faster with the new Amazon Q generative SQL capability in Amazon Redshift Query Editor. This session demonstrates how this functionality works and how to use text prompts in plain English to build effective queries, including complex multi-table join or nested queries.

ANT 354 | [NEW LAUNCH] AI-powered scaling and optimization for Amazon Redshift Serverless

Amazon Redshift Serverless makes it easier to run analytics workloads of any size without having to manage data warehouse infrastructure. Redshift Serverless helps developers, data scientists, and analysts work across various data sources to build reports, applications, machine learning models, and more. In this session, learn about Redshift Serverless new AI-driven scaling and optimization functionality. This new functionality proactively adapts to workload changes and applies tailored performance optimizations by intelligently predicting query patterns and using machine learning, increasing consistent price-performance.

SEC245 | Simplify and improve access control for your AWS analytics services

As organizations adopt new AWS services, end users need more access to data across a full range of AWS analytics services to extract value and insights. Data end users are accustomed to seamless authentication to their AWS applications, and cloud administrators want more granular, user-based access control over their data. Join this session to learn how to simplify and improve access control using a new AWS IAM Identity Center feature, known as trusted identity propagation, along with supported AWS analytics services. Also learn how to audit user and group-based access activity across interconnected AWS managed applications so that you can align better with regulatory and sovereignty requirements.

What’s new with Amazon Redshift

Want to learn more about the most recent features launched in Amazon Redshift? Refer to Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data to learn about all of the Amazon Redshift launch announcements made at 2023 AWS re:Invent

_______________________________________________________________________

About the Author

Mia Heard is a product marketing manager for Amazon Redshift, a fully managed, AI-powered cloud data warehouse with the best price-performance for analytic workloads.

Federate IAM-based single sign-on to Amazon Redshift role-based access control with Okta

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/federate-iam-based-single-sign-on-to-amazon-redshift-role-based-access-control-with-okta/

Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries.

You can use your preferred SQL clients to analyze your data in an Amazon Redshift data warehouse. Connect seamlessly by leveraging an identity provider (IdP) or single sign-on (SSO) credentials to connect to the Redshift data warehouse to reuse existing user credentials and avoid additional user setup and configuration. Using role-based access control (RBAC), you can simplify managing user privileges, database permissions and management of security permissions in Amazon Redshift. You can also use redshift database roles to define a set of elevated permissions, such as for a system monitor or database administrator.

Using AWS Identity and Access Management (IAM) with RBAC, organizations can simplify user management because you no longer need to create users and map them to database roles manually. You can define the mapped database roles as a principal tag for the IdP groups or IAM role, so Redshift database roles and users who are members of those IdP groups are granted to the database roles automatically.

Earlier in 2023, we launched support for Okta integration with Amazon Redshift Serverless using database roles. In this post, we focus on Okta as the IdP and provide step-by-step guidance to integrate a Redshift provisioned cluster with Okta using the Redshift Query Editor v2 and with SQL clients like SQL Workbench/J. You can use this mechanism with other IdP providers such as Azure Active Directory or Ping with any applications or tools using Amazon’s JDBC, ODBC, or Python driver.

Recently we also announced Amazon Redshift integration with AWS IAM Identity Center, supporting trusted identity propagation, allowing you to use third-party Identity Providers (IdP) such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin. This integration simplifies the authentication and authorization process for Amazon Redshift users using Query Editor V2 or Amazon Quicksight, making it easier for them to securely access your data warehouse. AWS IAM Identity Center offers automatic user and group provisioning from Okta to itself by utilizing the System for Cross-domain Identity Management (SCIM) 2.0 protocol. This integration allows for seamless synchronization of information between two services, ensuring accurate and up-to-date information in AWS IAM Identity Center. Refer to Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On blog post to learn more about setting up single sign-on (SSO) to Amazon Redshift using integration with IdC and Okta as the Identity Provider.

If you are interested in using IAM-based single sign-on with Amazon Redshift database roles then you can continue reading this blog.

Solution overview

The following diagram illustrates the authentication flow of Okta with a Redshift provisioned cluster using federated IAM roles and automatic database role mapping.

Architecture Diagram

The workflow contains the following steps:

  1. Either the user chooses an IdP app in their browser, or the SQL client initiates a user authentication request to the IdP (Okta).
  2. Upon a successful authentication, Okta submits a request to the AWS federation endpoint with a SAML assertion containing the principal tags.
  3. The AWS federation endpoint validates the SAML assertion and invokes the AWS Security Token Service (AWS STS) API AssumeRoleWithSAML. The SAML assertion contains the IdP user and group information that is stored in the RedshiftDbUser and RedshiftDbRoles principal tags, respectively. Temporary IAM credentials are returned to the SQL client or, if using the Query Editor v2, the user’s browser is redirected to the Query Editor v2 console using the temporary IAM credentials.
  4. The temporary IAM credentials are used by the SQL client or Query Editor v2 to call the Redshift API GetClusterCredentialsWithIAM. This API uses the principal tags to determine the user and database roles that the user belongs to. An associated database user is created if the user is signing in for the first time and is granted the matching database roles automatically. A temporary password is returned to the SQL client.
  5. Using the database user and temporary password, the SQL client or Query Editor v2 connects to Amazon Redshift. Upon login, the user is authorized based on the Redshift database roles that were assigned in Step 4.

Prerequisites

You need the following prerequisites to set up this solution:

Connect with a Redshift provisioned cluster as a federated user using Query Editor v2

To connect using Query Editor v2, complete the following steps:

  1. Follow all the steps described in the sections Set up your Okta application and Set up AWS configuration in the following post.
  2. For the Amazon Redshift access IAM policy, replace the policy with the following JSON to use the GetClusterCredentialsWithIAM API:
    {
    	"Version": "2012-10-17",
    	"Statement": [
    					{
    						"Sid": "VisualEditor0",
    						"Effect": "Allow",
    						"Action": "redshift:GetClusterCredentialsWithIAM",
    						"Resource": "arn:aws:redshift:us-west-2:123456789012:dbname:redshift-cluster-1/dev"
    					}
    				]
    }

Now you’re ready to connect to your Redshift provisioned cluster using Query Editor v2 and federated login.

  1. Use the SSO URL from Okta and log in to your Okta account with your user credentials. For this demo, we log in with user Ethan.
  2. In Query Editor v2, choose your Redshift provisioned cluster (right-click) and choose Create connection.
  3. For Authentication, select Temporary credentials using your IAM identity.
  4. For Database, enter the database name you want to connect to.
  5. Choose Create connection.
  6. Run the following command to validate that you are logged in as a federated user and also to get the list of roles associated with that user for the current session:
SELECT current_user,* FROM pg_get_session_roles() eff_ro(name name, roleid integer);

Because Ethan is part of the sales group and has been granted permissions to access tables in the sales_schema, he should be able to access those tables without any issues. However, if he tries to access tables in the finance_schema, he would receive a permission denied error because Ethan is not part of the finance group in Okta.

Okta-QEV2-Federation

Connect with a Redshift provisioned cluster as a federated user via a third-party client

To connect as a federated user via a third-party client, complete the following steps:

  1. Follow steps 1 and 2 which are described in above section (Connect with a Redshift provisioned cluster as a federated user using Query Editor v2).
  2. Use the Redshift JDBC driver v2.1.0.18 and above because it supports authentication with IAM group federation. For the URL, enter jdbc:redshift:iam://<cluster endpoint>:<port>:<databasename>?groupfederation=true. For example, jdbc:redshift:iam://redshift-cluster-1.abdef0abc0ab.us-west-2.redshift.amazonaws.com:5439/dev?groupfederation=true

In the preceding URL, groupfederation is a mandatory parameter that allows you to authenticate with the IAM credentials for the Redshift provisioned cluster. Without the groupfederation parameter, it will not use Redshift database roles.

  1. For Username and Password, enter your Okta credentials.

SQL Workbench/J - Connection

  1. To set up extended properties, follow Steps 4–9 in the section Configure the SQL client (SQL Workbench/J) in the following post.

User Ethan will be able to access the sales_schema tables. If Ethan tries to access the tables in the finance_schema, he will get a permission denied error.

SQL Workbench/J Demo

Troubleshooting

If your connection didn’t work, consider the following:

  • Enable logging in the driver. For instructions, see Configure logging.
  • Make sure to use the latest Amazon Redshift JDBC driver version.
  • If you’re getting errors while setting up the application on Okta, make sure you have admin access.
  • If you can authenticate via the SQL client but get a permission issue or can’t see objects, grant the relevant permission to the role.

Clean up

When you’re done testing the solution, clean up the resources to avoid incurring future charges:

  1. Delete the Redshift provisioned cluster.
  2. Delete the IAM roles, IAM IdPs, and IAM policies.

Conclusion

In this post, we provided step-by-step instructions to integrate a Redshift provisioned cluster with Okta using the Redshift Query Editor v2 and SQL Workbench/J with the help of federated IAM roles and automatic database role mapping. You can use a similar setup with other SQL clients (such as DBeaver or DataGrip). We also showed how Okta group membership is mapped automatically with Redshift provisioned cluster roles to use role-based authentication seamlessly.

If you have any feedback or questions, please leave them in the comments.


About the Authors

Debu-PandaDebu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

How FanDuel adopted a modern Amazon Redshift architecture to serve critical business workloads

Post Syndicated from Sreenivasa Munagala original https://aws.amazon.com/blogs/big-data/how-fanduel-adopted-a-modern-amazon-redshift-architecture-to-serve-critical-business-workloads/

This post is co-written with Sreenivasa Mungala and Matt Grimm from FanDuel.

In this post, we share how FanDuel moved from a DC2 nodes architecture to a modern Amazon Redshift architecture, which includes Redshift provisioned clusters using RA3 instances, Amazon Redshift data sharing, and Amazon Redshift Serverless.

About FanDuel

Part of Flutter Entertainment, FanDuel Group is a gaming company that offers sportsbooks, daily fantasy sports, horse racing, and online casinos. The company operates sportsbooks in a number of US and Canadian states. Fanduel first carved out a niche in the US through daily fantasy sports, such as their most popular fantasy sport: NFL football.

As FanDuel’s business footprint grew, so too did the complexity of their analytical needs. More and more of FanDuel’s community of analysts and business users looked for comprehensive data solutions that centralized the data across the various arms of their business. Their individual, product-specific, and often on-premises data warehouses soon became obsolete. FanDuel’s data team solved the problem of creating a new massive data store for centralizing the data in one place, with one version of the truth. At the heart of this new Global Data Platform was Amazon Redshift, which fast became the trusted data store from which all analysis was derived. Users could now assess risk, profitability, and cross-sell opportunities not only for piecemeal divisions or products, but also globally for the business as a whole.

FanDuel’s journey on Amazon Redshift

FanDuel’s first Redshift cluster was launched using Dense Compute (DC2) nodes. This was chosen over Dense Storage (DS2) nodes in order to take advantage of the greater compute power for the complex queries in their organization. As FanDuel grew, so did their data workloads. This meant that there was a constant challenge to scale and overcome contention while providing the performance their user community needed for day-to-day decision-making. FanDuel met this challenge initially by continuously adding nodes and experimenting with workload management (WLM), but it became abundantly obvious that they needed to take a more significant step to meet the needs of their users.

In 2021, FanDuel’s workloads almost tripled since they first started using Amazon Redshift in 2018, and they started evaluating Redshift RA3 nodes vs. DC2 nodes to take advantage of the storage and compute separation and deliver better performance at lower costs. FanDuel wanted to make the move primarily to separate storage and compute, and evaluate data sharing in the hopes of bringing different compute to the data to alleviate user contention on their primary cluster. FanDuel decided to launch a new RA3 cluster when they were satisfied that the performance matched that of their existing DC2 architecture, providing them the ability to scale storage and compute independently.

In 2022, FanDuel shifted their focus to using data sharing. Data sharing allows you to share live data securely across Redshift data warehouses for read and write (in preview) purposes. This means that workloads can be isolated to individual clusters, allowing for a more streamlined schema design, WLM configuration, and right-sizing for cost optimization. The following diagram illustrates this architecture.

To achieve a data sharing architecture, the plan was to first spin up consumer clusters for development and testing environments for their data engineers that were moving key legacy code to dbt. FanDuel wanted their engineers to have access to production datasets to test their new models and match the results from their legacy SQL-based code sets. They also wanted to ensure that they had adequate compute to run many jobs concurrently. After they saw the benefits of data sharing, they spun up their first production consumer cluster in the spring of 2022 to handle other analytics use cases. This was sharing most of the schemas and their tables from the main producer cluster.

Benefits of moving to a data sharing architecture

FanDuel saw a lot of benefits from the data sharing architecture, where data engineers had access to real production data to test their jobs without impacting the producer’s performance. Since splitting the workloads through a data sharing architecture, FanDuel has doubled their query concurrency and reduced the query queuing, resulting in a better end-to-end query time. FanDuel received positive feedback on the new environment and soon reaped the rewards of increased engineering velocity and reduced performance issues in production after deployments. Their initial venture into the world of data sharing was definitely considered a success.

Given the successful rollout of their first consumer in a data sharing architecture, they looked for opportunities to meet other users’ needs with new targeted consumers. With the assistance of AWS, FanDuel initiated the development of a comprehensive strategy aimed at safeguarding their extract, load, and transform (ELT) jobs. This approach involved implementing workload isolation and allocating dedicated clusters for these workloads, designated as the producer cluster within the data sharing architecture. Simultaneously, they planned to migrate all other activities onto one or more consumer clusters, apart from the existing cluster utilized by their data engineering team.

They spun up a second consumer in the summer of 2022 with the hopes of moving some of their more resource-intensive analytical processes off the main cluster. In order to empower their analysts over time, they had allowed a pattern in which users other than data engineers could create and share their own objects.

As the calendar flipped from 2022 to 2023, several developments changed the landscape of architecture at FanDuel. For one, FanDuel launched their initial event-based streaming work for their sportsbook data, which allowed them to micro-batch data into Amazon Redshift at a much lower latency than their previous legacy batch approach. This allowed them to generate C-Suite revenue reports at a much earlier SLA, which was a big win for the data team, because this was never achieved before the Super Bowl.

FanDuel introduced a new internal KPI called Query Efficiency, a measure to capture the amount of time users spent waiting for their queries to run. As the workload started increasing exponentially, FanDuel also noticed an increase in this KPI, specifically for risk and trading workloads.

Working with AWS Enterprise Support and the Amazon Redshift service team, FanDuel soon realized that the risk and trading use case was a perfect opportunity to move it to Amazon Redshift Serverless. Redshift Serverless offers scalability across dimensions such a data volume changes, concurrent users and query complexity, enabling you to automatically scale compute up or down to manage demanding and unpredictable workloads. Because billing is only accrued while queries are run, it also means that you no longer need to cover costs for compute you’re not utilizing. Redshift Serverless also manages workload management (WLM) entirely, allowing you to focus only on the query monitoring rules (QMRs) you want and usage limits, further limiting the need for you to manage your data warehouses. This adoption also complimented data sharing, where Redshift Serverless endpoints can read and write (in preview) from provisioned clusters during peak hours, offering flexible compute scalability and workload isolation and avoiding the impact on other mission-critical workloads. Seeing the benefits of what Redshift Serverless offers for their risk and trading workloads, they also moved some of their other workloads like business intelligence (BI) dashboards and risk and trading (RT) to a Redshift Serverless environment.

Benefits of introducing Redshift Serverless in a data sharing architecture

Through a combination of data sharing and a serverless architecture, FanDuel could elastically scale their most critical workloads on demand. Redshift Serverless Automatic WLM allowed users to get started without the need to configure WLM. With the intelligent and automated scaling capabilities of Redshift Serverless, FanDuel could focus on their business objectives without worrying about the data warehouse capacity. This architecture alleviated the constraints of a single predefined Redshift provisioned cluster and reduced the need for FanDuel to manage data warehouse capacity and any WLM configuration.

In terms of cost, Redshift Serverless enabled FanDuel to elegantly handle the most demanding workloads with a pay-as-you-go model, paying only when the data warehouse is in use, along with complete separation of compute and storage.

Having now introduced workload isolation and Redshift Serverless, FanDuel is able to achieve a more granular understanding of each team’s compute requirements without the noise of ELT and contending workloads all in the same environment. This allowed comprehensive analytics workloads to be conducted on consumers with vastly minimized contention while also being serviced with the most cost-efficient configuration possible.

The following diagram illustrates the updated architecture.

Results

FanDuel’s re-architecting efforts for workload isolation with risk and trading (RT) workloads using Redshift data sharing and Redshift Serverless resulted in the most critical business SLAs finishing three times faster, along with an increase in average query efficiency of 55% for overall workloads. These SLA improvements have resulted into an overall saving of tenfold in business cost, and they have been able to deliver business insights to other verticals such as product, commercial, and marketing much faster.

Conclusion

By harnessing the power of Redshift provisioned clusters and serverless endpoints with data sharing, FanDuel has been able to better scale and run analytical workloads without having to manage any data warehouse infrastructure. FanDuel is looking forward to future Amazon partnerships and is excited to embark on a journey of new innovation with Redshift Serverless and continued enhancements such as machine learning optimization and auto scaling.

If you’re new to Amazon Redshift, you can explore demos, other customer stories, and the latest features at Amazon Redshift. If you’re already using Amazon Redshift, reach out to your AWS account team for support, and learn more about what’s new with Amazon Redshift.


About the authors

Sreenivasa Munagala is a Principal Data Architect at FanDuel Group. He defines their Amazon Redshift optimization strategy and works with the data analytics team to provide solutions to their key business problems.

Matt Grimm is a Principal Data Architect at FanDuel Group, moving the company to an event-based, data-driven architecture using the integration of both streaming and batch data, while also supporting their Machine Learning Platform and development teams.

Luke Shearer is a Cloud Support Engineer at Amazon Web Services for the Data Insight Analytics profile, where he is engaged with AWS customers every day and is always working to identify the best solution for each customer.

Dhaval Shah is Senior Customer Success Engineer at AWS and specializes in bringing the most complex and demanding data analytics workloads to Amazon Redshift. He has more then 20 years of experiences in different databases and data warehousing technologies. He is passionate about efficient and scalable data analytics cloud solutions that drive business value for customers.

Ranjan Burman is an Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 17 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for cost, reliability, performance, and operational excellence at scale in their cloud journey. He has a keen interest in data analytics as well.

Integrate Okta with Amazon Redshift Query Editor V2 using AWS IAM Identity Center for seamless Single Sign-On

Post Syndicated from Maneesh Sharma original https://aws.amazon.com/blogs/big-data/integrate-okta-with-amazon-redshift-query-editor-v2-using-aws-iam-identity-center-for-seamless-single-sign-on/

AWS IAM Identity Center (IdC) allows you to manage single sign-on (SSO) access to all your AWS accounts and applications from a single location. We are pleased to announce that Amazon Redshift now integrates with AWS IAM Identity Center, and supports trusted identity propagation, allowing you to use third-party Identity Providers (IdP) such as Microsoft Entra ID (Azure AD), Okta, Ping, and OneLogin. This integration simplifies the authentication and authorization process for Amazon Redshift users using Query Editor V2 or Amazon Quicksight, making it easier for them to securely access your data warehouse. Additionally, this integration positions Amazon Redshift as an IdC-managed application, enabling you to use database role-based access control on your data warehouse for enhanced security.

AWS IAM Identity Center offers automatic user and group provisioning from Okta to itself by utilizing the System for Cross-domain Identity Management (SCIM) 2.0 protocol. This integration allows for seamless synchronization of information between two services, ensuring accurate and up-to-date information in AWS IAM Identity Center.

In this post, we’ll outline a comprehensive guide for setting up SSO to Amazon Redshift using integration with IdC and Okta as the Identity Provider. This guide shows how you can SSO onto Amazon Redshift for Amazon Redshift Query Editor V2 (QEV2).

Solution overview

Using IAM IdC with Amazon Redshift can benefit your organization in the following ways:

  • Users can connect to Amazon Redshift without requiring an administrator to set up AWS IAM roles with complex permissions.
  • IAM IdC integration allows mapping of IdC groups with Amazon Redshift database roles. Administrators can then assign different privileges to different roles and assigning these roles to different users, giving organizations granular control for user access.
  • IdC provides a central location for your users in AWS. You can create users and groups directly in IdC or connect your existing users and groups that you manage in a standards-based identity provider like Okta, Ping Identity, or Microsoft Entra ID (i.e., Azure Active Directory [AD]).
  • IdC directs authentication to your chosen source of truth for users and groups, and it maintains a directory of users and groups for access by Amazon Redshift.
  • You can share one IdC instance with multiple Amazon Redshift data warehouses with a simple auto-discovery and connect capability. This makes it fast to add clusters without the extra effort of configuring the IdC connection for each, and it ensures that all clusters and workgroups have a consistent view of users, their attributes, and groups. Note: Your organization’s IdC instance must be in the same region as the Amazon Redshift data warehouse you’re connecting to.
  • Because user identities are known and logged along with data access, it’s easier for you to meet compliance regulations through auditing user access in AWS CloudTrail authorizes access to data.
    Architecture Diagram

Amazon Redshift Query Editor V2 workflow:

  1. End user initiates the flow using AWS access portal URL (this URL would be available on IdC dashboard console). A browser pop-up triggers and takes you to the Okta Login page where you enter Okta credentials. After successful authentication, you’ll be logged into the AWS Console as a federated user. Click on your AWS Account and choose the Amazon Redshift Query Editor V2 application. Once you federate to Query Editor V2, select the IdC authentication method.
  2. QEv2 invokes browser flow where you re-authenticate, this time with their AWS IdC credentials. Since Okta is the IdP, you enter Okta credentials, which are already cached in browser. At this step, federation flow with IdC initiates and at the end of this flow, the Session token and Access token is available to the QEv2 console in browser as cookies.
  3. Amazon Redshift retrieves your authorization details based on session token retrieved and fetches user’s group membership.
  4. Upon a successful authentication, you’ll be redirected back to QEV2, but logged in as an IdC authenticated user.

This solution covers following steps:

  1. Integrate Okta with AWS IdC to sync user and groups.
  2. Setting up IdC integration with Amazon Redshift
  3. Assign Users or Groups from IdC to Amazon Redshift Application.
  4. Enable IdC integration for a new Amazon Redshift provisioned or Amazon Redshift Serverless endpoint.
  5. Associate an IdC application with an existing provisioned or serverless data warehouse.
  6. Configure Amazon Redshift role-based access.
  7. Create a permission set.
  8. Assign permission set to AWS accounts.
  9. Federate to Redshift Query Editor V2 using IdC.
  10. Troubleshooting

Prerequisites

Walkthrough

Integrate Okta with AWS IdC to sync user and groups

Enable group and user provisioning from Okta with AWS IdC by following this documentation here.

If you see issues while syncing users and groups, then refer to this section these considerations for using automatic provisioning.

Setting up IAM IdC integration with Amazon Redshift

Amazon Redshift cluster administrator or Amazon Redshift Serverless administrator must perform steps to configure Redshift as an IdC-enabled application. This enables Amazon Redshift to discover and connect to the IdC automatically to receive sign-in and user directory services.

After this, when the Amazon Redshift administrator creates a cluster or workgroup, they can enable the new data warehouse to use IdC and its identity-management capabilities. The point of enabling Amazon Redshift as an IdC-managed application is so you can control user and group permissions from within the IdC, or from a source third-party identity provider that’s integrated with it.

When your database users sign in to an Amazon Redshift database, for example an analyst or a data scientist, it checks their groups in IdC and these are mapped to roles in Amazon Redshift. In this manner, a group can map to an Amazon Redshift database role that allows read access to a set of tables.

The following steps show how to make Amazon Redshift an AWS-managed application with IdC:

  1. Select IAM Identity Center connection from Amazon Redshift console menu.
    Redshift Application Creation
  2. Choose Create application
    Redshift IdC Create Application
  3. The IAM Identity Center connection opens. Choose Next.
  4. In IAM Identity Center integration setup section, for:
    1. IAM Identity Center display name – Enter a unique name for Amazon Redshift’s IdC-managed application.
    2. Managed application name – You can enter the managed Amazon Redshift application name or use the assigned value as it is.
  5. In Connection with third-party identity providers section, for:
    1. Identity Provider Namespace – Specify the unique namespace for your organization. This is typically an abbreviated version of your organization’s name. It’s added as a prefix for your IdC-managed users and roles in the Amazon Redshift database.
  6. In IAM role for IAM Identity Center access – Select an IAM role to use. You can create a new IAM role if you don’t have an existing one. The specific policy permissions required are the following:
    • sso:DescribeApplication – Required to create an identity provider (IdP) entry in the catalog.
    • sso:DescribeInstance – Used to manually create IdP federated roles or users.
    • redshift:DescribeQev2IdcApplications – Used to detect capability for IDC authentication from Redshift Query Editor V2.

The following screenshot is from the IAM role:
IAM IdC Role

IAM IdC Role Trust Relationship

  1. We won’t enable Trusted identity propagation because we are not integrating with AWS Lake Formation in this post.
  2. Choose Next.
    Redshift IdC Connection
  3. In Configure client connections that use third-party IdPs section, choose Yes if you want to connect Amazon Redshift with a third-party application. Otherwise, choose No. For this post we chose No because we’ll be integrating only with Amazon Redshift Query Editor V2.
  4. Choose Next.
    Redshift IdC No Third Party App
  5. In the Review and create application section, review all the details you have entered before and choose Create application.
    Redshift IdC Application - Review and Create

After the Amazon Redshift administrator finishes the steps and saves the configuration, the IdC properties appear in the Redshift Console. Completing these tasks makes Redshift an IdC-enabled application.
Redshift IdC Application Created

After you select the managed application name, the properties in the console includes the integration status. It says Success when it’s completed. This status indicates if IdC configuration is completed.
Amazon Redshift IdC Application Status

Assigning users or groups from IdC to Amazon Redshift application

In this step, Users or groups synced to your IdC directory are available to assign to your application where the Amazon Redshift administrator can decide which users or groups from IDC need to be included as part of Amazon Redshift application.

For example, if you have total 20 groups from your IdC and you don’t want all the groups to include as part of Amazon Redshift application, then you have options to choose which IdC groups to include as part of Amazon Redshift-enabled IdC application. Later, you can create two Redshift database roles as part of IDC integration in Amazon Redshift.

The following steps assign groups to Amazon Redshift-enabled IdC application:

  1. On IAM Identity Center properties in the Amazon Redshift Console, select Assign under Groups tab.
  2. If this is the first time you’re assigning groups, then you’ll see a notification. Select Get started.
  3. Enter which groups you want to synchronize in the application. In this example, we chose the groups wssso-sales and awssso-finance.
  4. Choose Done.
    Redshift Application - Assigning User and Groups

Enabling IdC integration for a new Amazon Redshift provisioned cluster or Amazon Redshift Serverless

After completing steps under section (Setting up IAM Identity Center integration with Amazon Redshift) — Amazon Redshift database administrator needs to configure new Redshift resources to work in alignment with IdC to make sign-in and data access easier. This is performed as part of the steps to create a provisioned cluster or a Serverless workgroup. Anyone with permissions to create Amazon Redshift resources can perform these IdC integration tasks. When you create a provisioned cluster, you start by choosing Create Cluster in the Amazon Redshift Console.

  1. Choose Enable for the cluster (recommended) in the section for IAM Identity Center connection in the create-cluster steps.
  2. From the drop down, choose the redshift application which you created in above steps.

Note that when a new data warehouse is created, the IAM role specified for IdC integration is automatically attached to the provisioned cluster or Serverless Namespace. After you finish entering the required cluster metadata and create the resource, you can check the status for IdC integration in the properties.
Amazon Redshift - Create Cluster

Associating an IdC application with an existing provisioned cluster or Serverless endpoint

If you have an existing provisioned cluster or serverless workgroup that you would like to enable for IdC integration, then you can do that by running a SQL command. You run the following command to enable integration. It’s required that a database administrator run the query.

CREATE IDENTITY PROVIDER "<idc_application_name>" TYPE AWSIDC
NAMESPACE '<idc_namespace_name>'
APPLICATION_ARN '<idc_application_arn>'
IAM_ROLE '<IAM role for IAM Identity Center access>';

Example:

CREATE IDENTITY PROVIDER "redshift-idc-app" TYPE AWSIDC
NAMESPACE 'awsidc'
APPLICATION_ARN 'arn:aws:sso::123456789012:application/ssoins-12345f67fe123d4/apl-a0b0a12dc123b1a4'
IAM_ROLE 'arn:aws:iam::123456789012:role/EspressoRedshiftRole';

To alter the IdP, use the following command (this new set of parameter values completely replaces the current values):

ALTER IDENTITY PROVIDER "<idp_name>"
NAMESPACE '<idc_namespace_name>'|
IAM_ROLE default | 'arn:aws:iam::<AWS account-id-1>:role/<role-name>';

Few of the examples are:

ALTER IDENTITY PROVIDER "redshift-idc-app"
NAMESPACE 'awsidctest';

ALTER IDENTITY PROVIDER "redshift-idc-app"
IAM_ROLE 'arn:aws:iam::123456789012:role/administratoraccess';

Note: If you update the idc-namespace value, then all the new cluster created afterwards will be using the updated namespace.

For existing clusters or serverless workgroups, you need to update the namespace manually on each Amazon Redshift cluster using the previous command. Also, all the database roles associated with identity provider will be updated with new namespace value.

You can disable or enable the identity provider using the following command:

ALTER IDENTITY PROVIDER <provider_name> disable|enable;

Example:

ALTER IDENTITY PROVIDER <provider_name> disable;

You can drop an existing identity provider. The following example shows how CASCADE deletes users and roles attached to the identity provider.

DROP IDENTITY PROVIDER <provider_name> [ CASCADE ]

Configure Amazon Redshift role-based access

In this step, we pre-create the database roles in Amazon Redshift based on the groups that you synced in IdC. Make sure the role name matches with the IdC Group name.

Amazon Redshift roles simplify managing privileges required for your end-users. In this post, we create two database roles, sales and finance, and grant them access to query tables with sales and finance data, respectively. You can download this sample SQL Notebook and import into Redshift Query Editor v2 to run all cells in the notebook used in this example. Alternatively, you can copy and enter the SQL into your SQL client.

Below is the syntax to create role in Amazon Redshift:

create role "<IDC_Namespace>:<IdP groupname>;

For example:

create role "awsidc:awssso-sales";
create role "awsidc:awssso-finance";

Create the sales and finance database schema:

create schema sales_schema;
create schema finance_schema;

Creating the tables:

CREATE TABLE IF NOT EXISTS finance_schema.revenue
(
account INTEGER   ENCODE az64
,customer VARCHAR(20)   ENCODE lzo
,salesamt NUMERIC(18,0)   ENCODE az64
)
DISTSTYLE AUTO
;

insert into finance_schema.revenue values (10001, 'ABC Company', 12000);
insert into finance_schema.revenue values (10002, 'Tech Logistics', 175400);
insert into finance_schema.revenue values (10003, 'XYZ Industry', 24355);
insert into finance_schema.revenue values (10004, 'The tax experts', 186577);

CREATE TABLE IF NOT EXISTS sales_schema.store_sales
(
ID INTEGER   ENCODE az64,
Product varchar(20),
Sales_Amount INTEGER   ENCODE az64
)
DISTSTYLE AUTO
;

Insert into sales_schema.store_sales values (1,'product1',1000);
Insert into sales_schema.store_sales values (2,'product2',2000);
Insert into sales_schema.store_sales values (3,'product3',3000);
Insert into sales_schema.store_sales values (4,'product4',4000);

Below is the syntax to grant permission to the Amazon Redshift Serverless role:

GRANT { { SELECT | INSERT | UPDATE | DELETE | DROP | REFERENCES } [,...]| ALL [ PRIVILEGES ] } ON { [ TABLE ] table_name [, ...] | ALL TABLES IN SCHEMA schema_name [, ...] } TO role <IdP groupname>;

Grant relevant permission to the role as per your requirement. In following example, we grant full permission to role sales on sales_schema and only select permission on finance_schema to role finance.

For example:

grant usage on schema sales_schema to role "awsidc:awssso-sales";
grant select on all tables in schema sales_schema to role "awsidc:awssso-sales";

grant usage on schema finance_schema to role "awsidc:awssso-finance";
grant select on all tables in schema finance_schema to role "awsidc:awssso-finance";

Create a permission set

A permission set is a template that you create and maintain that defines a collection of one or more IAM policies. Permission sets simplify the assignment of AWS account access for users and groups in your organization. We’ll create a permission set to allow federated user to access Query Editor V

The following steps to create permission set:

  1. Open the IAM Identity Center Console.
  2. In the navigation pane, under Multi-Account permissions, choose Permission sets.
  3. Choose Create permission set.
    IdC-Create Permission Set
  4. Choose Custom permission set and then choose Next.
  5. Under AWS managed policies, choose AmazonRedshiftQueryEditorV2ReadSharing.
  6. Under Customer managed policies, provide the policy name which you created in step 4 under section – Setting up IAM Identity Center integration with Amazon Redshift.
  7. Choose Next.
  8. Enter permission set name. For example, Amazon Redshift-Query-Editor-V2.
  9. Under Relay state – optional – set default relay state to the Query Editor V2 URL, using the format : https://<region>.console.aws.amazon.com/sqlworkbench/home.
    For this post, we use: https://us-east-1.console.aws.amazon.com/sqlworkbench/home.
  10. Choose Next.
  11. On the Review and create screen, choose Create. The console displays the following message: The permission set Redshift-Query-Editor-V2 was successfully created.
    IdC- Review Create Permission

Assign permission set to AWS accounts

  1. Open the IAM Identity Center Console.
  2. In the navigation pane, under Multi-account permissions, choose AWS accounts.
  3. On the AWS accounts page, select one or more AWS accounts that you want to assign single sign-on access to.
  4. Choose Assign users or groups.
  5. On the Assign users and groups to AWS-account-name, choose the groups that you want to create the permission set for. Then, choose Next.
  6. On the Assign permission sets to AWS-account-name, choose the permission set you created in the section – Create a permission set. Then, choose Next.
  7. On the Review and submit assignments to AWS-account-name page, for Review and submit, choose Submit. The console displays the following message: We reprovisioned your AWS account successfully and applied the updated permission set to the account.
    Idc-Review and Submit AssignmentsIdC-Assignment Permission Created

Federate to Amazon Redshift using Query Editor V2 using IdC

Now you’re ready to connect to Amazon Redshift Query Editor V2 and federated login using IdC authentication:

  1. Open the IAM Identity Center Console.
  2. Go to dashboard and select the AWS access portal URL.
  3. A browser pop-up triggers and takes you to the Okta Login page where you enter your Okta credentials.
  4. After successful authentication, you’ll be logged into the AWS console as a federated user.
  5. Select your AWS Account and choose the Amazon Redshift Query Editor V2 application.
  6. Once you federate to Query Editor V2, choose your Redshift instance (i.e., right-click) and choose Create connection.
  7. To authenticate using IdC, choose the authentication method IAM Identity Center.
  8. It will show a pop-up and since your Okta credentials is already cached, it utilizes the same credentials and connects to Amazon Redshift Query Editor V2 using IdC authentication.

The following demonstration shows a federated user (Ethan) used the AWS access portal URL to access Amazon Redshift using IdC authentication. User Ethan accesses the sales_schema tables. If User Ethan tries to access the tables in finance_schema, then the user gets a permission denied error.
QEV2-IdC-Demo

Troubleshooting

  • If you get the following error:
    ERROR: registered identity provider does not exist for "<idc-namespace:redshift_database_role_name"

This means that you are trying to create a role with a wrong namespace. Please check current namespace using the command select * from identity_providers;

  • If you get below error:
    ERROR: (arn:aws:iam::123456789012:role/<iam_role_name>) Insufficient privileges to access AWS IdC. [ErrorId: 1-1234cf25-1f23ea1234c447fb12aa123e]

This means that an IAM role doesn’t have sufficient privileges to access to the IdC. Your IAM role should contain a policy with following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "redshift:DescribeQev2IdcApplications",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "sso:DescribeApplication",
                "sso:DescribeInstance"
            ],
            "Resource": "arn:aws:sso::726034267621:application/ssoins-722324fe82a3f0b8/*"
        }
    ]
}
  • If you get below error:
    FATAL: Invalid scope. User's credentials are not authorized to connect to Redshift. [SQL State=28000]

Please make sure that the user and group are added to the Amazon Redshift IdC application.

Clean up

Complete the following steps to clean up your resources:

  1. Delete the Okta Applications which you have created to integrate with IdC.
  2. Delete IAM Identity Center configuration.
  3. Delete the Redshift application and the Redshift provisioned cluster which you have created for testing.
  4. Delete the IAM role which you have created for IdC and Redshift integration.

Conclusion

In this post, we showed you a detailed walkthrough of how you can integrate Okta with the IdC and Amazon Redshift Query Editor version 2 to simplify your SSO setup. This integration allows you to use role-based access control with Amazon Redshift. We encourage you to try out this integration.

To learn more about IdC with Amazon Redshift, visit the documentation.


About the Authors

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Harshida Patel is a Principal Solutions Architect, Analytics with AWS.

Praveen Kumar Ramakrishnan is a Senior Software Engineer at AWS. He has nearly 20 years of experience spanning various domains including filesystems, storage virtualization and network security. At AWS, he focuses on enhancing the Redshift data security.

Karthik Ramanathan is a Sr. Software Engineer with AWS Redshift and is based in San Francisco. He brings close to two decades of development experience across the networking, data storage and IoT verticals prior to Redshift. When not at work, he is also a writer and loves to be in the water.

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

Post Syndicated from Arun Sudhir original https://aws.amazon.com/blogs/big-data/how-eightfold-ai-implemented-metadata-security-in-a-multi-tenant-data-analytics-environment-with-amazon-redshift/

This is a guest post co-written with Arun Sudhir from Eightfold AI.

Eightfold is transforming the world of work by providing solutions that empower organizations to recruit and retain a diverse global workforce. Eightfold is a leader in AI products for enterprises to build on their talent’s existing skills. From Talent Acquisition to Talent Management and talent insights, Eightfold offers a single AI platform that does it all.

The Eightfold Talent Intelligence Platform powered by Amazon Redshift and Amazon QuickSight provides a full-fledged analytics platform for Eightfold’s customers. It delivers analytics and enhanced insights about the customer’s Talent Acquisition, Talent Management pipelines, and much more. Customers can also implement their own custom dashboards in QuickSight. As part of the Talent Intelligence Platform Eightfold also exposes a data hub where each customer can access their Amazon Redshift-based data warehouse and perform ad hoc queries as well as schedule queries for reporting and data export. Additionally, customers who have their own in-house analytics infrastructure can integrate their own analytics solutions with Eightfold Talent Intelligence Platform by directly connecting to the Redshift data warehouse provisioned for them. Doing this gives them access to their raw analytics data, which can then be integrated into their analytics infrastructure irrespective of the technology stack they use.

Eightfold provides this analytics experience to hundreds of customers today. Securing customer data is a top priority for Eightfold. The company requires the highest security standards when implementing a multi-tenant analytics platform on Amazon Redshift.

The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to implement visibility of data catalog listing of names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift.

In this post, we discuss how the Eightfold Talent Lake system team implemented the Amazon Redshift metadata security feature in their multi-tenant environment to enable access controls for the database catalog. By linking access to business-defined entitlements, they are able to enforce data access policies.

Amazon Redshift security controls addresses restricting data access to users who have been granted permission. This post discusses restricting listing of data catalog metadata as per the granted permissions.

The Eightfold team needed to develop a multi-tenant application with the following features:

  • Enforce visibility of Amazon Redshift objects on a per-tenant basis, so that each tenant can only view and access their own schema
  • Implement tenant isolation and security so that tenants can only see and interact with their own data and objects

Metadata security in Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Many customers have implemented Amazon Redshift to support multi-tenant applications. One of the challenges with multi-tenant environments is that database objects are visible to all tenants even though tenants are only authorized to access certain objects. This visibility creates data privacy challenges because many customers want to hide objects that tenants can’t access.

The newly released metadata security feature in Amazon Redshift enables you to hide database objects from all other tenants and make objects only visible to tenants who are authorized to see and use them. Tenants can use SQL tools, dashboards, or reporting tools, and also query the database catalog, but they will only see appropriate objects for which they have permissions to see.

Solution overview

Exposing a Redshift endpoint to all of Eightfold’s customers as part of the Talent Lake endeavor involved several design choices that had to be carefully considered. Eightfold has a multi-tenant Redshift data warehouse that had individual customer schemas for customers, which they could connect to using their own customer credentials to perform queries on their data. Data in each customer tenant can only be accessed by the customer credentials that had access to the customer schema. Each customer could access data under their analytics schema, which was named after the customer. For example, for a customer named A, the schema name would be A_analytics. The following diagram illustrates this architecture.

Although customer data was secured by restricting access to only the customer user, when customers used business intelligence (BI) tools like QuickSight, Microsoft Power BI, or Tableau to access their data, the initial connection showed all the customer schemas because it was performing a catalog query (which couldn’t be restricted). Therefore, Eightfold’s customers had concerns that other customers could discover that they were Eightfold’s customers by simply trying to connect to Talent Lake. This unrestricted database catalog access posed a privacy concern to several Eightfold customers. Although this could be avoided by provisioning one Redshift database per customer, that was a logistically difficult and expensive solution to implement.

The following screenshot shows what a connection from QuickSight to our data warehouse looked like without metadata security turned on. All other customer schemas were exposed even though the connection to QuickSight was made as customer_k_user.

Approach for implementing metadata access controls

To implement restricted catalog access, and ensure it worked with Talent Lake, we cloned our production data warehouse with all the schemas and enabled the metadata security flag in the Redshift data warehouse by connecting to SQL tools. After it was enabled, we tested the catalog queries by connecting to the data warehouse from BI tools like QuickSight, Microsoft Power BI, and Tableau and ensured that only the customer schemas show up as a result of the catalog query. We also tested by running catalog queries after connecting to the Redshift data warehouse from psql, to ensure that only the customer schema objects were surfaced—It’s important to validate that given tenants have access to the Redshift data warehouse directly.

The metadata security feature was tested by first turning on metadata security in our Redshift data warehouse by connecting using a SQL tool or Amazon Redshift Query Editor v2.0 and issuing the following command:

ALTER SYSTEM SET metadata_security = TRUE;

Note that the preceding command is set at the Redshift cluster level or Redshift Serverless endpoint level, which means it is applied to all databases and schemas in the cluster or endpoint.

In Eightfold’s scenario, data access controls are already in place for each of the tenants for their respective database objects.

After turning on the metadata security feature in Amazon Redshift, Eightfold was able to restrict database catalog access to only show individual customer schemas for each customer that was trying to connect to Amazon Redshift and further validated by issuing a catalog query to access schema objects as well.

We also tested by connecting via psql and trying out various catalog queries. All of them yielded only the relevant customer schema of the logged-in user as the result. The following are some examples:

analytics=> select * from pg_user;
usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig 
------------------------+----------+-------------+----------+-----------+----------+----------+-------------------------------------------
customer_k_user | 377 | f | f | f | ******** | | 
(1 row)

analytics=> select * from information_schema.schemata;
catalog_name | schema_name | schema_owner | default_character_set_catalog | default_character_set_schema | default_character_set_name | sql_path 
--------------+----------------------+------------------------+-------------------------------+------------------------------+----------------------------+----------
analytics | customer_k_analytics | customer_k_user | | | | 
(1 row)

The following screenshot shows the UI after metadata security was enabled: only customer_k_analytics is seen when connecting to the Redshift data warehouse as customer_k_user.

This ensured that individual customer privacy was protected and increased customer confidence in Eightfold’s Talent Lake.

Customer feedback

“Being an AI-first platform for customers to hire and develop people to their highest potential, data and analytics play a vital role in the value provided by the Eightfold platform to its customers. We rely on Amazon Redshift as a multi-tenant Data Warehouse that provides rich analytics with data privacy and security through customer data isolation by using schemas. In addition to the data being secure as always, we layered on Redshift’s new metadata access control to ensure customer schemas are not visible to other customers. This feature truly made Redshift the ideal choice for a multi-tenant, performant, and secure Data Warehouse and is something we are confident differentiates our offering to our customers.”

– Sivasankaran Chandrasekar, Vice President of Engineering, Data Platform at Eightfold AI

Conclusion

In this post, we demonstrated how the Eightfold Talent Intelligence Platform team implemented a multi-tenant environment for hundreds of customers, using the Amazon Redshift metadata security feature. For more information about metadata security, refer to the Amazon Redshift documentation.

Try out the metadata security feature for your future Amazon Redshift implementations, and feel free to leave a comment about your experience!


About the authors

Arun Sudhir is a Staff Software Engineer at Eightfold AI. He has more than 15 years of experience in design and development of backend software systems in companies like Microsoft and AWS, and has a deep knowledge of database engines like Amazon Aurora PostgreSQL and Amazon Redshift.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using AWS Analytics services.

Anjali Vijayakumar is a Senior Solutions Architect at AWS focusing on EdTech. She is passionate about helping customers build well-architected solutions in the cloud.

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/build-and-manage-your-modern-data-stack-using-dbt-and-aws-glue-through-dbt-glue-the-new-trusted-dbt-adapter/

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt focuses on the transform layer of extract, load, transform (ELT) or extract, transform, load (ETL) processes across data warehouses and databases through specific engine adapters to achieve extract and load functionality. It enables data engineers, data scientists, and analytics engineers to define the business logic with SQL select statements and eliminates the need to write boilerplate data manipulation language (DML) and data definition language (DDL) expressions. dbt lets data engineers quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, continuous integration and continuous delivery (CI/CD), and documentation.

dbt is predominantly used by data warehouses (such as Amazon Redshift) customers who are looking to keep their data transform logic separate from storage and engine. We have seen a strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities.

In 2022, AWS published a dbt adapter called dbt-glue—the open source, battle-tested dbt AWS Glue adapter that allows data engineers to use dbt for cloud-based data lakes along with data warehouses and databases, paying for just the compute they need. The dbt-glue adapter democratized access for dbt users to data lakes, and enabled many users to effortlessly run their transformation workloads on the cloud with the serverless data integration capability of AWS Glue. From the launch of the adapter, AWS has continued investing into dbt-glue to cover more requirements.

Today, we are pleased to announce that the dbt-glue adapter is now a trusted adapter based on our strategic collaboration with dbt Labs. Trusted adapters are adapters not maintained by dbt Labs, but adaptors that that dbt Lab is comfortable recommending to users for use in production.

The key capabilities of the dbt-glue adapter are as follows:

  • Runs SQL as Spark SQL on AWS Glue interactive sessions
  • Manages table definitions on the AWS Glue Data Catalog
  • Supports open table formats such as Apache Hudi, Delta Lake, and Apache Iceberg
  • Supports AWS Lake Formation permissions for fine-grained access control

In addition to those capabilities, the dbt-glue adapter is designed to optimize resource utilization with several techniques on top of AWS Glue interactive sessions.

This post demonstrates how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter.

Common use cases

One common use case for using dbt-glue is if a central analytics team at a large corporation is responsible for monitoring operational efficiency. They ingest application logs into raw Parquet tables in an Amazon Simple Storage Service (Amazon S3) data lake. Additionally, they extract organized data from operational systems capturing the company’s organizational structure and costs of diverse operational components that they stored in the raw zone using Iceberg tables to maintain the original schema, facilitating easy access to the data. The team uses dbt-glue to build a transformed gold model optimized for business intelligence (BI). The gold model joins the technical logs with billing data and organizes the metrics per business unit. The gold model uses Iceberg’s ability to support data warehouse-style modeling needed for performant BI analytics in a data lake. The combination of Iceberg and dbt-glue allows the team to efficiently build a data model that’s ready to be consumed.

Another common use case is when an analytics team in a company that has an S3 data lake creates a new data product in order to enrich its existing data from its data lake with medical data. Let’s say that this company is located in Europe and the data product must comply with the GDPR. For this, the company uses Iceberg to meet needs such as the right to be forgotten and the deletion of data. The company uses dbt to model its data product on its existing data lake due to its compatibility with AWS Glue and Iceberg and the simplicity that the dbt-glue adapter brings to the use of this storage format.

How dbt and dbt-glue work

The following are key dbt features:

  • Project – A dbt project enforces a top-level structure on the staging, models, permissions, and adapters. A project can be checked into a GitHub repo for version control.
  • SQL – dbt relies on SQL select statements for defining data transformation logic. Instead of raw SQL, dbt offers templatized SQL (using Jinja) that allows code modularity. Instead of having to copy/paste SQL in multiple places, data engineers can define modular transforms and call those from other places within the project. Having a modular pipeline helps data engineers collaborate on the same project.
  • Models – dbt models are primarily written as a SELECT statement and saved as a .sql file. Data engineers define dbt models for their data representations. To learn more, refer to About dbt models.
  • Materializations – Materializations are strategies for persisting dbt models in a warehouse. There are five types of materializations built into dbt: table, view, incremental, ephemeral, and materialized view. To learn more, refer to Materializations and Incremental models.
  • Data lineage – dbt tracks data lineage, allowing you to understand the origin of data and how it flows through different transformations. dbt also supports impact analysis, which helps identify the downstream effects of changes.

The high-level data flow is as follows:

  1. Data engineers ingest data from data sources to raw tables and define table definitions for the raw tables.
  2. Data engineers write dbt models with templatized SQL.
  3. The dbt adapter converts dbt models to SQL statements compatible in a data warehouse.
  4. The data warehouse runs the SQL statements to create intermediate tables or final tables, views, or materialized views.

The following diagram illustrates the architecture.

dbt-glue works with the following steps:

  1. The dbt-glue adapter converts dbt models to SQL statements compatible in Spark SQL.
  2. AWS Glue interactive sessions run the SQL statements to create intermediate tables or final tables, views, or materialized views.
  3. dbt-glue supports csv, parquet, hudi, delta, and iceberg as fileformat.
  4. On the dbt-glue adapter, table or incremental are commonly used for materializations at the destination. There are three strategies for incremental materialization. The merge strategy requires hudi, delta, or iceberg. With the other two strategies, append and insert_overwrite, you can use csv, parquet, hudi, delta, or iceberg.

The following diagram illustrates this architecture.

Example use case

In this post, we use the data from the New York City Taxi Records dataset. This dataset is available in the Registry of Open Data on AWS (RODA), which is a repository containing public datasets from AWS resources. The raw Parquet table records in this dataset stores trip records.

The objective is to create the following three tables, which contain metrics based on the raw table:

  • silver_avg_metrics – Basic metrics based on NYC Taxi Open Data for the year 2016
  • gold_passengers_metrics – Metrics per passenger based on the silver metrics table
  • gold_cost_metrics – Metrics per cost based on the silver metrics table

The final goal is to create two well-designed gold tables that store already aggregated results in Iceberg format for ad hoc queries through Amazon Athena.

Prerequisites

The instruction requires following prerequisites:

  • An AWS Identity and Access Management (IAM) role with all the mandatory permissions to run an AWS Glue interactive session and the dbt-glue adapter
  • An AWS Glue database and table to store the metadata related to the NYC taxi records dataset
  • An S3 bucket to use as output and store the processed data
  • An Athena configuration (a workgroup and S3 bucket to store the output) to explore the dataset
  • An AWS Lambda function (created as an AWS CloudFormation custom resource) that updates all the partitions in the AWS Glue table

With these prerequisites, we simulate the situation that data engineers have already ingested data from data sources to raw tables, and defined table definitions for the raw tables.

For ease of use, we prepared a CloudFormation template. This template deploys all the required infrastructure. To create these resources, choose Launch Stack in the us-east-1 Region, and follow the instructions:

Install dbt, the dbt CLI, and the dbt adaptor

The dbt CLI is a command line interface for running dbt projects. It’s free to use and available as an open source project. Install dbt and the dbt CLI with the following code:

$ pip3 install --no-cache-dir dbt-core

For more information, refer to How to install dbt, What is dbt?, and Viewpoint.

Install the dbt adapter with the following code:

$ pip3 install --no-cache-dir dbt-glue

Create a dbt project

Complete the following steps to create a dbt project:

  1. Run the dbt init command to create and initialize a new empty dbt project:
    $ dbt init

  2. For the project name, enter dbt_glue_demo.
  3. For the database, choose glue.

Now the empty project has been created. The directory structure is shown as follows:

$ cd dbt_glue_demo 
$ tree .
.
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds
├── snapshots
└── tests

Create a source

The next step is to create a source table definition. We add models/source_tables.yml with the following contents:

version: 2

sources:
  - name: data_source
    schema: nyctaxi

    tables:
      - name: records

This source definition corresponds to the AWS Glue table nyctaxi.records, which we created in the CloudFormation stack.

Create models

In this step, we create a dbt model that represents the average values for trip duration, passenger count, trip distance, and total amount of charges. Complete the following steps:

  1. Create the models/silver/ directory.
  2. Create the file models/silver/silver_avg_metrics.sql with the following contents:
    WITH source_avg as ( 
        SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
        , avg(passenger_count) as avg_passenger_count 
        , avg(trip_distance) as avg_trip_distance 
        , avg(total_amount) as avg_total_amount
        , year
        , month 
        , type
        FROM {{ source('data_source', 'records') }} 
        WHERE year = "2016"
        AND dropoff_datetime is not null 
        GROUP BY year, month, type
    ) 
    SELECT *
    FROM source_avg

  3. Create the file models/silver/schema.yml with the following contents:
    version: 2
    
    models:
      - name: silver_avg_metrics
        description: This table has basic metrics based on NYC Taxi Open Data for the year 2016
    
        columns:
          - name: avg_duration
            description: The average duration of a NYC Taxi trip
    
          - name: avg_passenger_count
            description: The average number of passenger per NYC Taxi trip
    
          - name: avg_trip_distance
            description: The average NYC Taxi trip distance
    
          - name: avg_total_amount
            description: The avarage amount of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi 

  4. Create the models/gold/ directory.
  5. Create the file models/gold/gold_cost_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_trip_distance) as avg_cost_per_distance
    , (avg_total_amount/avg_duration) as avg_cost_per_minute
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  6. Create the file models/gold/gold_passengers_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_passenger_count) as avg_cost_per_passenger
    , (avg_duration/avg_passenger_count) as avg_duration_per_passenger
    , (avg_trip_distance/avg_passenger_count) as avg_trip_distance_per_passenger
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  7. Create the file models/gold/schema.yml with the following contents:
    version: 2
    
    models:
      - name: gold_cost_metrics
        description: This table has metrics per cost based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_distance
            description: The average cost per distance of a NYC Taxi trip
    
          - name: avg_cost_per_minute
            description: The average cost per minute of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip
    
          - name: type
            description: The type of the NYC Taxi
    
      - name: gold_passengers_metrics
        description: This table has metrics per passenger based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_passenger
            description: The average cost per passenger for a NYC Taxi trip
    
          - name: avg_duration_per_passenger
            description: The average number of passenger per NYC Taxi trip
    
          - name: avg_trip_distance_per_passenger
            description: The average NYC Taxi trip distance
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi

  8. Remove the models/example/ folder, because it’s just an example created in the dbt init command.

Configure the dbt project

dbt_project.yml is a key configuration file for dbt projects. It contains the following code:

models:
  dbt_glue_demo:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

We configure dbt_project.yml to replace the preceding code with the following:

models:
  dbt_glue_demo:
    silver:
      +materialized: table

This is because that we want to materialize the models under silver as Parquet tables.

Configure a dbt profile

A dbt profile is a configuration that specifies how to connect to a particular database. The profiles are defined in the profiles.yml file within a dbt project.

Complete the following steps to configure a dbt profile:

  1. Create the profiles directory.
  2. Create the file profiles/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"

  3. Create the profiles/iceberg/ directory.
  4. Create the file profiles/iceberg/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"
          datalake_formats: "iceberg"
          conf: --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse="{{ env_var('DBT_S3_LOCATION') }}"warehouse/ --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"

The last two lines are added for setting Iceberg configurations on AWS Glue interactive sessions.

Run the dbt project

Now it’s time to run the dbt project. Complete the following steps:

  1. To run the project dbt, you should be in the project folder:
    $ cd dbt_glue_demo

  2. The project requires you to set environment variables in order to run on the AWS account:
    $ export DBT_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query "Account" --output text):role/GlueInteractiveSessionRole"
    $ export DBT_S3_LOCATION="s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1"

  3. Make sure the profile is set up correctly from the command line:
    $ dbt debug --profiles-dir profiles
    ...
    05:34:22 Connection test: [OK connection ok]
    05:34:22 All checks passed!

If you see any failures, check if you provided the correct IAM role ARN and S3 location in Step 2.

  1. Run the models with the following code:
    $ dbt run -m silver --profiles-dir profiles
    $ dbt run -m gold --profiles-dir profiles/iceberg/

Now the tables are successfully created in the AWS Glue Data Catalog, and the data is materialized in the Amazon S3 location.

You can verify those tables by opening the AWS Glue console, choosing Databases in the navigation pane, and opening dbt_glue_demo_nyc_metrics.

Query materialized tables through Athena

Let’s query the target table using Athena to verify the materialized tables. Complete the following steps:

  1. On the Athena console, switch the workgroup to athena-dbt-glue-aws-blog.
  2. If the workgroup athena-dbt-glue-aws-blog settings dialog box appears, choose Acknowledge.
  3. Use the following query to explore the metrics created by the dbt project:
    SELECT cm.avg_cost_per_minute
        , cm.avg_cost_per_distance
        , pm.avg_cost_per_passenger
        , cm.year
        , cm.month
        , cm.type
    FROM "dbt_glue_demo_nyc_metrics"."gold_passengers_metrics" pm
    LEFT JOIN "dbt_glue_demo_nyc_metrics"."gold_cost_metrics" cm
        ON cm.type = pm.type
        AND cm.year = pm.year
        AND cm.month = pm.month
    WHERE cm.type = 'yellow'
        AND cm.year = '2016'
        AND cm.month = '6'

The following screenshot shows the results of this query.

Review dbt documentation

Complete the following steps to review your documentation:

  1. Generate the following documentation for the project:
    $ dbt docs generate --profiles-dir profiles/iceberg
    11:41:51  Running with dbt=1.7.1
    11:41:51  Registered adapter: glue=1.7.1
    11:41:51  Unable to do partial parsing because profile has changed
    11:41:52  Found 3 models, 1 source, 0 exposures, 0 metrics, 478 macros, 0 groups, 0 semantic models
    11:41:52  
    11:41:53  Concurrency: 1 threads (target='dev')
    11:41:53  
    11:41:53  Building catalog
    11:43:32  Catalog written to /Users/username/Documents/workspace/dbt_glue_demo/target/catalog.json

  2. Run the following command to open the documentation on your browser:
    $ dbt docs serve --profiles-dir profiles/iceberg

  3. In the navigation pane, choose gold_cost_metrics under dbt_glue_demo/models/gold.

You can see the detailed view of the model gold_cost_metrics, as shown in the following screenshot.

  1. To see the lineage graph, choose the circle icon at the bottom right.

Clean up

To clean up your environment, complete the following steps:

  1. Delete the database created by dbt:
    $ aws glue delete-database —name dbt_glue_demo_nyc_metrics

  2. Delete all generated data:
    $ aws s3 rm s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity —query "Account" —output text)-us-east-1/ —recursive
    $ aws s3 rm s3://aws-athena-dbt-glue-query-results-$(aws sts get-caller-identity —query "Account" —output text)-us-east-1/ —recursive

  3. Delete the CloudFormation stack:
    $ aws cloudformation delete-stack —stack-name dbt-demo

Conclusion

This post demonstrated how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter. You learned the end-to-end operations and data flow for data engineers to build and manage a data stack using dbt and the dbt-glue adapter. To report issues or request a feature enhancement, feel free to open an issue on GitHub.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team at Amazon Web Services. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Benjamin Menuet is a Senior Data Architect on the AWS Professional Services team at Amazon Web Services. He helps customers develop data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some iconic races like the UTMB.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Kinshuk Pahare is a Principal Product Manager on the AWS Glue team at Amazon Web Services.

Jason Ganz is the manager of the Developer Experience (DX) team at dbt Labs

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/amazon-redshift-announcements-at-aws-reinvent-2023-to-enable-analytics-on-all-your-data/

In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully-managed, petabyte-scale, enterprise-grade cloud data warehouse. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Since then, customer demands for better scale, higher throughput, and agility in handling a wide variety of changing, but increasingly business critical analytics and machine learning use cases has exploded, and we have been keeping pace. Today, tens of thousands of customers use Amazon Redshift in AWS global infrastructure to collectively process exabytes of data daily and employs Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning, data sharing and monetization, and more

The advancements to Amazon Redshift announced at AWS re:Invent 2023 further accelerates modernization of your cloud analytics environments, keeping our core tenet to help you achieve the best price-performance at any scale. These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations. From price-performance improvements to zero-ETL, to generative AI capabilities, we have something for everyone. Let’s dive into the highlights.

Modernizing analytics for scale, performance, and reliability

“Our migration from legacy on-premises platform to Amazon Redshift allows us to ingest data 88% faster, query data 3x faster, and load daily data to the cloud 6x faster. Amazon Redshift enabled us to optimize performance, availability, and reliability—significantly easing operational complexity, while increasing the velocity of our end-users’ decision-making experience on the Fab floor.”

– Sunil Narayan, Sr Dir, Analytics at GlobalFoundries

Diligently driving the best price-performance at scale with new improvements

Since day 1, Amazon Redshift has been building innovative capabilities to help you get to optimum performance, while keeping costs lower. Amazon Redshift continues to lead on the price-performance front with up to 6x better price-performance than alternative cloud data warehouse and for dash boarding applications with high concurrency and low latency. We closely analyze query patterns in the fleet and look for opportunities to drive customer-focused innovation. For example, earlier in the year, we announced speed ups for string-based data processing up to 63x compared to alternative compression encodings such as LZO (Lempel-Ziv-Oberhumer) or ZStandard. At AWS re:Invent 2023, we introduced more performance enhancements in query planning and execution such as enhanced bloom filters , query rewrites, and support for write operations in auto scaling . For more information about performance improvement capabilities, refer to the list of announcements below.

Amazon Redshift Serverless is more intelligent than ever with new AI-driven scaling and optimizations

Speaking of price-performance, new next generation AI-driven scaling and optimizations capabilities in Amazon Redshift Serverless can deliver up to 10x better price-performance for variable workloads (based on internal testing), without manual intervention. Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse. Since GA, Redshift Serverless executed over a billion queries to power data insights for thousands of customers. With these new AI optimizations, Amazon Redshift Serverless scales proactively and automatically with workload changes across all key dimensions —such as data volume, concurrent users, and query complexity. You just specify your desired price-performance targets to either optimize for cost or optimize for performance or balanced and serverless does the rest. Learn more about additional improvements in Redshift Serverless, under the list of announcements below.

Multi-data warehouse writes through data sharing

Data sharing is a widely adopted feature in Amazon Redshift with customers running tens of millions of queries on shared data daily. Customers share live transactionally consistent data within and across organizations and regions for read purposes without data copies or data movement. Customers are using data sharing to modernize their analytics architectures from monolithic architectures to multi-cluster, data mesh deployments that enable seamless and secure access across organizations to drive data collaboration and powerful insights. At AWS re:Invent 2023, we extended data sharing capabilities to launch multi-data warehouse writes in preview. You can now start writing to Redshift databases from other Redshift data warehouses in just a few clicks, further enabling data collaboration, flexible scaling of compute for ETL/data processing workloads by adding warehouses of different types and sizes based on price-performance needs. Experience greater transparency of compute usage as each warehouse is billed for its own compute and consequently keep your costs under control.

Multidimensional data layouts

Amazon Redshift offers industry leading predictive optimizations that continuously monitor your workloads and seamlessly accelerate performance and maximize concurrency by adjusting data layout and compute management as you use the data warehouse more. In addition to the powerful optimizations Redshift already offers, such as Automatic Table Sort, Automatic sort and distribution keys, we are introducing Multidimensional Data Layouts, a new powerful table sorting mechanism that improves performance of repetitive queries by automatically sorting data based on the incoming query filters (for example: Sales in a specific region). This method significantly accelerates the performance of table scans compared to traditional methods.

Unifying all your data with zero-ETL approaches

“Using the Aurora MySQL zero-ETL integration, we experience near real-time data synchronization between Aurora MySQL databases and Amazon Redshift, making it possible to build an analysis environment in just three hours instead of the month of developer time it used to take before”

– MoneyForward

JOYME uses Amazon Redshift’s streaming ingestion and other Amazon services for risk control over users’ financial activity such as recharge, refund, and rewards.

“With Redshift, we are able to view risk counterparts and data in near real time—
instead of on an hourly basis. Redshift significantly improved our business ROI efficiency.”

– PengBo Yang, CTO, JOYME

Data pipelines can be challenging and costly to build and manage and can create hours-long delays to obtain transactional data for analytics. These delays can lead to missed business opportunities, especially when the insights derived from analytics on transactional data are relevant for only a limited amount of time. Amazon Redshift employs AWS’s zero-ETL approach that enables interoperability and integration between the data warehouse and operational databases and even your streaming data services, so that the data is easily and automatically ingested into the warehouse for you, or you can access the data in place, where it lives.

Zero-ETL integrations with operational databases

We delivered zero-ETL integration between Amazon Aurora MySQL Amazon Redshift (general availability) this year, to enable near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from Amazon Aurora. Within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift, so you don’t have to build and maintain complex data pipelines to perform extract, transform, and load (ETL) operations. At AWS re:Invent, we extended zero-ETL integration to additional sources specifically Aurora PostgreSQL, Dynamo DB, and Amazon RDS MySQL. Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications.

Data lake querying with support for Apache Iceberg tables

Amazon Redshift allows customers to run a wide range of workloads on data warehouse and data lakes using its support for various open file and table formats. At AWS re:Invent, we announced the general availability of support for Apache Iceberg tables, so you can easily access your Apache Iceberg tables on your data lake from Amazon Redshift and join it with the data in your data warehouse when needed. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience. We have improved data lake query performance by integrating with AWS Glue statistics and introduce preview of incremental refresh for materialized views on data lake data to accelerate repeated queries.

Learn more about the zero-ETL integrations, data lake performance enhancements, and other announcements below.

Maximize value with comprehensive analytics and ML capabilities

“Amazon Redshift is one of the most important tools we had in growing Jobcase as a company.”

– Ajay Joshi, Distinguished Engineer, Jobcase

With all your data integrated and available, you can easily build and run near real-time analytics to AI/ML/Generative AI applications. Here’s a couple of highlights from this week and for the full list, see below.

Amazon Q Generative SQL capability

Query Editor, an out-of-the-box web-based SQL experience in Amazon Redshift is a popular tool for data exploration, visual analysis, and data collaboration. At AWS re:Invent, we introduced Amazon Q Generative SQL capabilities in Amazon Redshift Query Editor (preview), to simplify query authoring and increase your productivity by allowing you to express queries in natural language and receive SQL code recommendations. Generative SQL uses AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns directly allowing you to get insights faster in a conversational format without extensive knowledge of your organization’s complex database metadata.

Amazon Redshift ML large language model (LLM) integration

Amazon Redshift ML enables customers to create, train, and deploy machine learning models using familiar SQL commands. Customers use Redshift ML to run an average of over 10 billion predictions a day within their data warehouses. At AWS re:Invent, we announced support for LLMs as preview. Now, you can use pre-trained open source LLMs in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. For example, you can make inferences on your product feedback data in Amazon Redshift, use LLMs to summarize feedback, perform entity extraction, sentiment analysis and product feedback classification.

Innovate faster with secure data collaboration within and across the organizations

“Millions of companies use Stripe’s software and APIs to accept payments, send payouts, and manage their businesses online.  Access to their Stripe data via leading data warehouses like Amazon Redshift has been a top request from our customers. Our customers needed secure, fast, and integrated analytics at scale without building complex data pipelines or moving and copying data around. With Stripe Data Pipeline for Amazon Redshift, we’re helping our customers set up a direct and reliable data pipeline in a few clicks. Stripe Data Pipeline enables our customers to automatically share their complete, up-to-date Stripe data with their Amazon Redshift data warehouse, and take their business analytics and reporting to the next level.”

– Tony Petrossian, Head of Engineering, Revenue & Financial Management at Stripe

With Amazon Redshift, you can easily and securely share data and collaborate no matter where your teams or data is located. And have the confidence that your data is secure no matter where you operate or how highly regulated your industries are. We have enabled fine grained permissions, an easy authentication experience with single sign-on for your organizational identity—all provided at no additional cost to you.

Unified identity with IAM identity center integration

We announced Amazon Redshift integration with AWS IAM Identity Center to enable organizations to support trusted identity propagation between Amazon QuickSight,, Amazon Redshift Query Editor, and Amazon Redshift,  . Customers can use their organization identities to access Amazon Redshift in a single sign-on experience using third party identity providers (IdP), such as Microsoft Entra ID, Okta, Ping, OneLogin, etc. from Amazon QuickSight and Amazon Redshift Query Editor. Administrators can use third-party identity provider users and groups to manage fine grained access to data across services and audit user level access in AWS CloudTrail. With trusted identity propagation, a user’s identity is passed seamlessly between Amazon QuickSight, Amazon Redshift reducing time to insights and enabling a friction free analytics experience.

For the full set of announcements, see the following:

Learn more: https://aws.amazon.com/redshift


About the authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

Sunaina AbdulSalah leads product marketing for Amazon Redshift. She focuses on educating customers about the impact of data warehousing and analytics and sharing AWS customer stories. She has a deep background in marketing and GTM functions in the B2B technology and cloud computing domains. Outside of work, she spends time with her family and friends and enjoys traveling.

Unlocking the value of data as your differentiator

Post Syndicated from G2 Krishnamoorthy original https://aws.amazon.com/blogs/big-data/unlocking-the-value-of-data-as-your-differentiator/

Today on the AWS re:Invent keynote stage, Swami Sivasubramanian, VP of Data and AI, AWS, spoke about the beneficial relationship among data, generative AI, and humans—all working together to unleash new possibilities in efficiency and creativity. There has never been a more exciting time in modern technology. Innovation is accelerating everywhere, and the future is rife with possibility. While Swami explored many facets of this beneficial relationship in the keynote today, one area that is especially critical for our customers to get right if they want to see success in generative AI is data. When you want to build generative AI applications that are unique to your business needs, data is the differentiator. This week, we launched many new tools to help you turn your data into your differentiator. This includes tools to help you customize your foundation models, and new services and features to build a strong data foundation to fuel your generative AI applications.

Customizing foundation models

The need for data is quite obvious if you are building your own foundation models (FMs). These models need vast amounts of data. But data is necessary even when you are building on top of FMs. If you think about it, everyone has access to the same models for building generative AI applications. It’s data that is the key to moving from generic applications to generative AI applications that create real value for your customers and your business. For instance, Intuit’s new generative AI-powered assistant, Intuit Assist, uses relevant contextual datasets spanning small business, consumer finance, and tax information to deliver personalized financial insights to their customers. With Amazon Bedrock, you can privately customize FMs for your specific use case using a small set of your own labeled data through a visual interface without writing any code. Today, we announced the ability to fine-tune Cohere Command and Meta Llama 2 in addition to Amazon Titan. In addition to fine-tuning, we’re also making it easier for you to provide models with up-to-date and contextually relevant information from your data sources using Retrieval Augmented Generation (RAG). Amazon Bedrock’s Knowledge Bases feature, which went to general availability today, supports the entire RAG workflow, from ingestion, to retrieval, and prompt augmentation. Knowledge Bases works with popular vector databases and engines including Amazon OpenSearch Serverless, Redis Enterprise Cloud, and Pinecone, with support for Amazon Aurora and MongoDB coming soon.

Building a strong data foundation

To produce the high-quality data that you need to build or customize FMs for generative AI, you need a strong data foundation. Of course, the value of a strong data foundation is not new and the need for one spans well beyond generative AI. Across all types of use cases, from generative AI to business intelligence (BI), we’ve found that a strong data foundation includes a comprehensive set of services to meet all your use case needs, integrations across those services to break down data silos, and tools to govern data across the end-to-end data workflow so you can innovate more quickly. These tools also need to be intelligent to remove the heavy lifting from data management.

Comprehensive

First, you need a comprehensive set of data services so you can get the price/performance, speed, flexibility, and capabilities for any use case. AWS offers a broad set of tools that enable you to store, organize, access, and act upon various types of data. We have the broadest selection of database services, including relational databases like Aurora and Amazon Relational Database Service (Amazon RDS)—and on Monday, we introduced the newest addition to the RDS family: Amazon RDS for Db2. Now Db2 customers can easily set up, operate, and scale highly available Db2 databases in the cloud. We also offer non-relational databases like Amazon DynamoDB, used by over 1 million customers for its serverless, single-digit millisecond performance at any scale. You also need services to store data for analysis and machine learning (ML) like Amazon Simple Storage Service (Amazon S3). Customers have created hundreds of thousands of data lakes on Amazon S3. It also includes our data warehouse, Amazon Redshift, which delivers more than 6 times better price/performance than other cloud data warehouses. We also have tools that enable you to act on your data, including Amazon QuickSight for BI, Amazon SageMaker for ML, and of course, Amazon Bedrock for generative AI.

Serverless enhancements

The dynamic nature of data makes it perfectly suited to serverless technologies, which is why AWS offers a broad range of serverless database and analytics offerings that help support our customers’ most demanding workloads. This week, we made even more improvements to our serverless options in this area, including a new Aurora capability that automatically scales to millions of write transactions per second and manages petabytes of data while maintaining the simplicity of operating a single database. We also released a new serverless option for Amazon ElastiCache, which makes it faster and easier to create highly available caches and instantly scales to meet application demand. Finally, we announced new AI-driven scaling and optimizations for Amazon Redshift Serverless that enable the service to learn from your patterns and proactively scale on multiple dimensions, including concurrent users, data variability, and query complexity. It does all of this while factoring in your price/performance targets so you can optimize between cost and performance.

Vector capabilities across more databases

Your data foundation also needs to include services to store, index, retrieve, and search vector data. As our customers need vector embeddings as part as part of their generative AI application workflows, they told us they want to use vector capabilities in their existing databases to eliminate the steep learning curve for new programming tools, APIs, and SDKs. They also feel more confident knowing their existing databases are proven in production and meet requirements for scalability, availability, and storage and compute. And when your vectors and business data are stored in the same place, your applications will run faster—and there’s no data sync or data movement to worry about.

For all of these reasons, we’ve invested in adding vector capabilities to some of our most popular data services, including Amazon OpenSearch Service and OpenSearch Serverless, Aurora, and Amazon RDS. Today, we added four more to that list, with the addition of vector support in Amazon MemoryDB for Redis, Amazon DocumentDB (with MongoDB compatibility), DynamoDB, and Amazon Neptune. Now you can use vectors and generative AI with your database of choice.

Integrated

Another key to your data foundation is integrating data across your data sources for a more complete view of your business. Typically, connecting data across different data sources requires complex extract, transform, and load (ETL) pipelines, which can take hours—if not days—to build. These pipelines also have to be continuously maintained and can be brittle. AWS is investing in a zero-ETL future so you can quickly and easily connect and act on all your data, no matter where it lives. We’re delivering on this vision in a number of ways, including zero-ETL integrations between our most popular data stores. Earlier this year, we brought you our fully managed zero-ETL integration between Amazon Aurora MySQL-Compatible Edition and Amazon Redshift. Within seconds of data being written into Aurora, you can use Amazon Redshift to do near-real-time analytics and ML on petabytes of data. Woolworths, a pioneer in retail who helped build the retail model of today, was able to reduce development time for analysis of promotions and other events from 2 months to 1 day using the Aurora zero-ETL integration with Amazon Redshift.

More zero-ETL options

At re:Invent, we announced three more zero-ETL integrations with Amazon Redshift, including Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and DynamoDB, to make it easier for you to take advantage of near-real-time analytics to improve your business outcomes. In addition to Amazon Redshift, we’ve also expanded our zero ETL support to OpenSearch Service, which tens of thousands of customers use for real-time search, monitoring, and analysis of business and operational data. This includes zero-ETL integrations with DynamoDB and Amazon S3. With all of these zero-ETL integrations, we’re making it even easier to leverage relevant data for your applications, including generative AI.

Governed

Finally, your data foundation needs to be secure and governed to ensure the data that’s used throughout the development cycle of your generative AI applications is high quality and compliant. To help with this, we launched Amazon DataZone last year. Amazon DataZone is being used by companies like Guardant Health and Bristol Meyers Squibb to catalog, discover, share, and govern data across their organization. Amazon DataZone uses ML to automatically add metadata to your data catalog, making all of your data more discoverable. This week, we added a new feature to Amazon DataZone that uses generative AI to automatically create business descriptions and context for your datasets with just a few clicks, making data even easier to understand and apply. While Amazon DataZone helps you share data in a governed way within your organization, many customers also want to securely share data with their partners.

Infusing intelligence across the data foundation

Not only have we added generative AI to Amazon DataZone, but we’re leveraging intelligent technology across our data services to make data easier to use, more intuitive to work with, and more accessible. Amazon Q, our new generative AI assistant, helps you in QuickSight to author dashboards and create compelling visual stories from your dashboard data using natural language. We also announced that Amazon Q can help you create data integration pipelines using natural language. For example, you can ask Q to “read JSON files from S3, join on ‘accountid’, and load into DynamoDB,” and Q will return an end-to-end data integration job to perform this action. Amazon Q is also making it easier to query data in your data warehouse with generative AI SQL in Amazon Redshift Query Editor (in preview). Now data analysts, scientists, and engineers can be more productive using generative AI text-to-code functionality. You can also improve accuracy by enabling query history access to specific users—without compromising data privacy.

These new innovations are going to make it easy for you to leverage data to differentiate your generative AI applications and create new value for your customers and your business. We look forward to seeing what you create!


About the authors

G2 Krishnamoorthy is VP of Analytics, leading AWS data lake services, data integration, Amazon OpenSearch Service, and Amazon QuickSight. Prior to his current role, G2 built and ran the Analytics and ML Platform at Facebook/Meta, and built various parts of the SQL Server database, Azure Analytics, and Azure ML at Microsoft.

Rahul Pathak is VP of Relational Database Engines, leading Amazon Aurora, Amazon Redshift, and Amazon QLDB. Prior to his current role, he was VP of Analytics at AWS, where he worked across the entire AWS database portfolio. He has co-founded two companies, one focused on digital media analytics and the other on IP-geolocation.

AWS Clean Rooms Differential Privacy enhances privacy protection of your users data (preview)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-clean-rooms-differential-privacy-enhances-privacy-protection-of-your-users-data-preview/

Starting today, you can use AWS Clean Rooms Differential Privacy (preview) to help protect the privacy of your users with mathematically backed and intuitive controls in a few steps. As a fully managed capability of AWS Clean Rooms, no prior differential privacy experience is needed to help you prevent the reidentification of your users.

AWS Clean Rooms Differential Privacy obfuscates the contribution of any individual’s data in generating aggregate insights in collaborations so that you can run a broad range of SQL queries to generate insights about advertising campaigns, investment decisions, clinical research, and more.

Quick overview on differential privacy
Differential privacy is not new. It is a strong, mathematical definition of privacy compatible with statistical and machine learning based analysis, and has been used by the United States Census Bureau as well as companies with vast amounts of data.

Differential privacy helps with a wide variety of use cases involving large datasets, where adding or removing a few individuals has a small impact on the overall result, such as population analyses using count queries, histograms, benchmarking, A/B testing, and machine learning.

The following illustration shows how differential privacy works when it is applied to SQL queries.

When an analyst runs a query, differential privacy adds a carefully calibrated amount of error (also referred to as noise) to query results at run-time, masking the contribution of individuals while still keeping the query results accurate enough to provide meaningful insights. The noise is carefully fine-tuned to mask the presence or absence of any possible individual in the dataset.

Differential privacy also has another component called privacy budget. The privacy budget is a finite resource consumed each time a query is run and thus controls the number of queries that can be run on your datasets, helping ensure that the noise cannot be averaged out to reveal any private information about an individual. When the privacy budget is fully exhausted, no more queries can be run on your tables until it is increased or refreshed.

However, differential privacy is not easy to implement because this technique requires an in-depth understanding of mathematically rigorous formulas and theories to apply it effectively. Configuring differential privacy is also a complex task because customers need to calculate the right level of noise in order to preserve the privacy of their users without negatively impacting the utility of query results.

Customers also want to enable their partners to conduct a wide variety of analyses including highly complex and customized queries on their data. This requirement is hard to support with differential privacy because of the intricate nature of the calculations involved in calibrating the noise while processing various query components such as aggregations, joins, and transformations.

We created AWS Clean Rooms Differential Privacy to help you protect the privacy of your users with mathematically backed controls in a few clicks.

How differential privacy works in AWS Clean Rooms
While differential privacy is quite a sophisticated technique, AWS Clean Rooms Differential Privacy makes it easy for you to apply it and protect the privacy of your users with mathematically backed, flexible, and intuitive controls. You can begin using it with just a few steps after starting or joining an AWS Clean Rooms collaboration as a member with abilities to contribute data.

You create a configured table, which is a reference to your table in the AWS Glue Data Catalog, and choose to turn on differential privacy while adding a custom analysis rule to the configured table.

Next, you associate the configured table to your AWS Clean Rooms collaboration and configure a differential privacy policy in the collaboration to make your table available for querying. You can use a default policy to quickly complete the setup or customize it to meet your specific requirements. As part of this step, you will configure the following:

Privacy budget
Quantified as a value that we call epsilon, the privacy budget controls the level of privacy protection. It is a common, finite resource that is applied for all of your tables protected with differential privacy in the collaboration because the goal is to preserve the privacy of your users whose information can be present in multiple tables. The privacy budget is consumed every time a query is run on your tables. You have the flexibility to increase the privacy budget value any time during the collaboration and automatically refresh it each calendar month.

Noise added per query
Measured in terms of the number of users whose contributions you want to obscure, this input parameter governs the rate at which the privacy budget is depleted.

In general, you need to balance your privacy needs against the number of queries you want to permit and the accuracy of those queries. AWS Clean Rooms makes it easy for you to complete this step by helping you understand the resulting utility you are providing to your collaboration partner. You can also use the interactive examples to understand how your chosen settings would impact the results for different types of SQL queries.

Now that you have successfully enabled differential privacy protection for your data, let’s see AWS Clean Rooms Differential Privacy in action. For this demo, let’s assume I am your partner in the AWS Clean Rooms collaboration.

Here, I’m running a query to count the number of overlapping customers and the result shows there are 3,227,643 values for tv.customer_id.

Now, if I run the same query again after removing records about an individual from coffee_customers table, it shows a different result, 3,227,604 tv.customer_id. This variability in the query results prevents me from identifying the individuals from observing the difference in query results.

I can also see the impact of differential privacy, including the remaining queries I can run.

Available for preview
Join this preview and start protecting the privacy of your users with AWS Clean Rooms Differential Privacy. During this preview period, you can use AWS Clean Rooms Differential Privacy wherever AWS Clean Rooms is available. To learn more on how to get started, visit the AWS Clean Rooms Differential Privacy page.

Happy collaborating!
Donnie

AWS Clean Rooms ML helps customers and partners apply ML models without sharing raw data (preview)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-clean-rooms-ml-helps-customers-and-partners-apply-ml-models-without-sharing-raw-data-preview/

Today, we’re introducing AWS Clean Rooms ML (preview), a new capability of AWS Clean Rooms that helps you and your partners apply machine learning (ML) models on your collective data without copying or sharing raw data with each other. With this new capability, you can generate predictive insights using ML models while continuing to protect your sensitive data.

During this preview, AWS Clean Rooms ML introduces its first model specialized to help companies create lookalike segments for marketing use cases. With AWS Clean Rooms ML lookalike, you can train your own custom model, and you can invite partners to bring a small sample of their records to collaborate and generate an expanded set of similar records while protecting everyone’s underlying data.

In the coming months, AWS Clean Rooms ML will release a healthcare model. This will be the first of many models that AWS Clean Rooms ML will support next year.

AWS Clean Rooms ML helps you to unlock various opportunities for you to generate insights. For example:

  • Airlines can take signals about loyal customers, collaborate with online booking services, and offer promotions to users with similar characteristics.
  • Auto lenders and car insurers can identify prospective auto insurance customers who share characteristics with a set of existing lease owners.
  • Brands and publishers can model lookalike segments of in-market customers and deliver highly relevant advertising experiences.
  • Research institutions and hospital networks can find candidates similar to existing clinical trial participants to accelerate clinical studies (coming soon).

AWS Clean Rooms ML lookalike modeling helps you apply an AWS managed, ready-to-use model that is trained in each collaboration to generate lookalike datasets in a few clicks, saving months of development work to build, train, tune, and deploy your own model.

How to use AWS Clean Rooms ML to generate predictive insights
Today I will show you how to use lookalike modeling in AWS Clean Rooms ML and assume you have already set up a data collaboration with your partner. If you want to learn how to do that, check out the AWS Clean Rooms Now Generally Available — Collaborate with Your Partners without Sharing Raw Data post.

With your collective data in the AWS Clean Rooms collaboration, you can work with your partners to apply ML lookalike modeling to generate a lookalike segment. It works by taking a small sample of representative records from your data, creating a machine learning (ML) model, then applying the particular model to identify an expanded set of similar records from your business partner’s data.

The following screenshot shows the overall workflow for using AWS Clean Rooms ML.

By using AWS Clean Rooms ML, you don’t need to build complex and time-consuming ML models on your own. AWS Clean Rooms ML trains a custom, private ML model, which saves months of your time while still protecting your data.

Eliminating the need to share data
As ML models are natively built within the service, AWS Clean Rooms ML helps you protect your dataset and customer’s information because you don’t need to share your data to build your ML model.

You can specify the training dataset using the AWS Glue Data Catalog table, which contains user-item interactions.

Under Additional columns to train, you can define numerical and categorical data. This is useful if you need to add more features to your dataset, such as the number of seconds spent watching a video, the topic of an article, or the product category of an e-commerce item.

Applying custom-trained AWS-built models
Once you have defined your training dataset, you can now create a lookalike model. A lookalike model is a machine learning model used to find similar profiles in your partner’s dataset without either party having to share their underlying data with each other.

When creating a lookalike model, you need to specify the training dataset. From a single training dataset, you can create many lookalike models. You also have the flexibility to define the date window in your training dataset using Relative range or Absolute range. This is useful when you have data that is constantly updated within AWS Glue, such as articles read by users.

Easy-to-tune ML models
After you create a lookalike model, you need to configure it to use in AWS Clean Rooms collaboration. AWS Clean Rooms ML provides flexible controls that enable you and your partners to tune the results of the applied ML model to garner predictive insights.

On the Configure lookalike model page, you can choose which Lookalike model you want to use and define the Minimum matching seed size you need. This seed size defines the minimum number of profiles in your seed data that overlap with profiles in the training data.

You also have the flexibility to choose whether the partner in your collaboration receives metrics in Metrics to share with other members.

With your lookalike models properly configured, you can now make the ML models available for your partners by associating the configured lookalike model with a collaboration.

Creating lookalike segments
Once the lookalike models have been associated, your partners can now start generating insights by selecting Create lookalike segment and choosing the associated lookalike model for your collaboration.

Here on the Create lookalike segment page, your partners need to provide the Seed profiles. Examples of seed profiles include your top customers or all customers who purchased a specific product. The resulting lookalike segment will contain profiles from the training data that are most similar to the profiles from the seed.

Lastly, your partner will get the Relevance metrics as the result of the lookalike segment using the ML models. At this stage, you can use the Score to make a decision.

Export data and use programmatic API
You also have the option to export the lookalike segment data. Once it’s exported, the data is available in JSON format and you can process this output by integrating with AWS Clean Rooms API and your applications.

Join the preview
AWS Clean Rooms ML is now in preview and available via AWS Clean Rooms in US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Seoul, Singapore, Sydney, Tokyo), and Europe (Frankfurt, Ireland, London). Support for additional models is in the works.

Learn how to apply machine learning with your partners without sharing underlying data on the AWS Clean Rooms ML page.

Happy collaborating!
— Donnie

Announcing Amazon OpenSearch Service zero-ETL integration with Amazon S3 (preview)

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-opensearch-service-zero-etl-integration-with-amazon-s3-preview/

Today we are announcing a preview of Amazon OpenSearch Service zero-ETL integration with Amazon S3, a new way to query operational logs in Amazon S3 and S3-based data lakes without needing to switch between services. You can now analyze infrequently queried data in cloud object stores and simultaneously use the operational analytics and visualization capabilities of OpenSearch Service.

Amazon OpenSearch Service direct queries with Amazon S3 provides a zero-ETL integration to reduce the operational complexity of duplicating data or managing multiple analytics tools by enabling customers to directly query their operational data, reducing costs and time to action. This zero-ETL integration will be configurable within OpenSearch Service, where you can take advantage of various log type templates, including predefined dashboards, and configure data accelerations tailored to that log type. Templates include VPC Flow Logs, Elastic Load Balancing logs, and NGINX logs, and accelerations include skipping indexes, materialized views, and covered indexes.

With direct queries with Amazon S3, you can perform complex queries critical to security forensic and threat analysis that correlate data across multiple data sources, which aids teams in investigating service downtime and security events. After creating an integration, you can start querying their data directly from the OpenSearch Dashboards or OpenSearch API. You can easily audit connections to ensure that they are set up in a scalable, cost-efficient, and secure way.

Getting started with direct queries with Amazon S3
You can easily get started by creating a new Amazon S3 direct query data source for OpenSearch Service through the AWS Management Console or the API. Each new data source uses AWS Glue Data Catalog to manage tables that represent S3 buckets. Once you create a data source, you can configure Amazon S3 tables and data indexing and query data in OpenSearch Dashboards.

1. Create a data source in OpenSearch Service
Before you create a data source, you should have an OpenSearch Service domain with version 2.11 or later and a target Amazon S3 table in AWS Glue Data Catalog with the appropriate IAM permissions. IAM will need access to the desired S3 bucket(s) and read and write access to AWS Glue Data Catalog. To learn more about IAM prerequisites, see Creating a data source in the AWS documentation.

Go to the OpenSearch Service console and choose the domain you want to set up a new data source for. In the domain details page, choose the Connections tab below the general information and see the Direct Query section.

To create a new data source, choose Create, input the name of your new data source, select the data source type as Amazon S3 with AWS Glue Data Catalog, and choose the IAM role for your data source.

Once you create a data source, you can go to the OpenSearch Dashboards of the domain, which you use to configure access control, define tables, set up log type–based dashboards for popular log types, and query your data.

2. Configuring your data source in OpenSearch Dashboards
To configure data source in OpenSearch Dashboards, choose Configure in the console and go to OpenSearch Dashboards. In the left-hand navigation of OpenSearch Dashboards, under Management, choose Data sources. Under Manage data sources, choose the name of the data source you created in the console.

Direct queries from OpenSearch Service to Amazon S3 use Spark tables within AWS Glue Data Catalog. To create a new table you want to direct query, go to the Query Workbench in the Open Search Plugins menu.

Now run as in the following SQL statement to create http_logs table and run MSCK REPAIR TABLE mys3.default.http_logs command to update the metadata in the catalog

CREATE EXTERNAL TABLE IF NOT EXISTS mys3.default.http_logs (
   `@timestamp` TIMESTAMP,
    clientip STRING,
    request STRING, 
    status INT, 
    size INT, 
    year INT, 
    month INT, 
    day INT) 
USING json PARTITIONED BY(year, month, day) OPTIONS (path 's3://mys3/data/http_log/http_logs_partitioned_json_bz2/', compression 'bzip2')

To ensure a fast experience with your data in Amazon S3, you can set up any of three different types of accelerations to index data into OpenSearch Service, such as skipping indexes, materialized views, and covering indexes. To create OpenSearch indexes from external data connections for better performance, choose the Accelerate Table.

  • Skipping indexes allow you to index only the metadata of the data stored in Amazon S3. Skipping indexes help quickly identify data stored by narrowing down a specific location of where the data is stored.
  • Materialized views enable you to use complex queries such as aggregations, which can be used for querying or powering dashboard visualizations. Materialized views ingest data into OpenSearch Service for anomaly detection or geospatial capabilities.
  • Covering indexes will ingest all the data from the specified table column. Covering indexes are the most performant of the three indexing types.

3. Query your data source in OpenSearch Dashboards
After you set up your tables, you can query your data using Discover. You can run a sample SQL query for the http_logs table you created in AWS Glue Data Catalog tables.

To learn more, see Working with Amazon OpenSearch Service direct queries with Amazon S3 in the AWS documentation.

Join the preview
Amazon OpenSearch Service zero-ETL integration with Amazon S3 is now previewed in the AWS US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland) Regions.

OpenSearch Service separately charges for only the compute needed as OpenSearch Compute Units to query your external data as well as maintain indexes in OpenSearch Service. For more information, see Amazon OpenSearch Service Pricing.

Give it a try and send feedback to the AWS re:Post for Amazon OpenSearch Service or through your usual AWS Support contacts.

Channy

Analyze large amounts of graph data to get insights and find trends with Amazon Neptune Analytics

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/introducing-amazon-neptune-analytics-a-high-performance-graph-analytics/

I am happy to announce the general availability of Amazon Neptune Analytics, a new analytics database engine that makes it faster for data scientists and application developers to quickly analyze large amounts of graph data. With Neptune Analytics, you can now quickly load your dataset from Amazon Neptune or your data lake on Amazon Simple Storage Service (Amazon S3), run your analysis tasks in near real time, and optionally terminate your graph afterward.

Graph data enables the representation and analysis of intricate relationships and connections within diverse data domains. Common applications include social networks, where it aids in identifying communities, recommending connections, and analyzing information diffusion. In supply chain management, graphs facilitate efficient route optimization and bottleneck identification. In cybersecurity, they reveal network vulnerabilities and identify patterns of malicious activity. Graph data finds application in knowledge management, financial services, digital advertising, and network security, performing tasks such as identifying money laundering networks in banking transactions and predicting network vulnerabilities.

Since the launch of Neptune in May 2018, thousands of customers have embraced the service for storing their graph data and performing updates and deletion on specific subsets of the graph. However, analyzing data for insights often involves loading the entire graph into memory. For instance, a financial services company aiming to detect fraud may need to load and correlate all historical account transactions.

Performing analyses on extensive graph datasets, such as running common graph algorithms, requires specialized tools. Utilizing separate analytics solutions demands the creation of intricate pipelines to transfer data for processing, which is challenging to operate, time-consuming, and prone to errors. Furthermore, loading large datasets from existing databases or data lakes to a graph analytic solution can take hours or even days.

Neptune Analytics offers a fully managed graph analytics experience. It takes care of the infrastructure heavy lifting, enabling you to concentrate on problem-solving through queries and workflows. Neptune Analytics automatically allocates compute resources according to the graph’s size and quickly loads all the data in memory to run your queries in seconds. Our initial benchmarking shows that Neptune Analytics loads data from Amazon S3 up to 80x faster than existing AWS solutions.

Neptune Analytics supports 5 families of algorithms covering 15 different algorithms, each with multiple variants. For example, we provide algorithms for path-finding, detecting communities (clustering), identifying important data (centrality), and quantifying similarity. Path-finding algorithms are used for use cases such as route planning for supply chain optimization. Centrality algorithms like page rank identify the most influential sellers in a graph. Algorithms like connected components, clustering, and similarity algorithms can be used for fraud-detection use cases to determine whether the connected network is a group of friends or a fraud ring formed by a set of coordinated fraudsters.

Neptune Analytics facilitates the creation of graph applications using openCypher, presently one of the widely adopted graph query languages. Developers, business analysts, and data scientists appreciate openCypher’s SQL-inspired syntax, finding it familiar and structured for composing graph queries.

Let’s see it at work
As we usually do on the AWS News blog, let’s show how it works. For this demo, I first navigate to Neptune in the AWS Management Console. There is a new Analytics section on the left navigation pane. I select Graphs and then Create graph.

Neptune Analytics - create graph 1

On the Create graph page, I enter the details of my graph analytics database engine. I won’t detail each parameter here; their names are self-explanatory.

Neptune Analytics - Create graph 1

Pay attention to Allow from public because, the vast majority of the time, you want to keep your graph only available from the boundaries of your VPC. I also create a Private endpoint to allow private access from machines and services inside my account VPC network.

Neptune Analytics - Create graph 2

In addition to network access control, users will need proper IAM permissions to access the graph.

Finally, I enable Vector search to perform similarity search using embeddings in the dataset. The dimension of the vector depends on the large language model (LLM) that you use to generate the embedding.

Neptune Analytics - Create graph 3

When I am ready, I select Create graph (not shown here).

After a few minutes, my graph is available. Under Connectivity & security, I take note of the Endpoint. This is the DNS name I will use later to access my graph from my applications.

I can also create Replicas. A replica is a warm standby copy of the graph in another Availability Zone. You might decide to create one or more replicas for high availability. By default, we create one replica, and depending on your availability requirements, you can choose not to create replicas.

Neptune Analytics - create graph 3

Business queries on graph data
Now that the Neptune Analytics graph is available, let’s load and analyze data. For the rest of this demo, imagine I’m working in the finance industry.

I have a dataset obtained from the US Securities and Exchange Commission (SEC). This dataset contains the list of positions held by investors that have more than $100 million in assets. Here is a diagram to illustrate the structure of the dataset I use in this demo.

Nuptune graph analytics - dataset structure

I want to get a better understanding of the positions held by one investment firm (let’s name it “Seb’s Investments LLC”). I wonder what its top five holdings are and who else holds more than $1 billion in the same companies. I am also curious to know what are other investment companies that have a similar portfolio as Seb’s Investments LLC.

To start my analysis, I create a Jupyter notebook in the Neptune section of the AWS Management Console. In the notebook, I first define my analytics endpoint and load the data set from an S3 bucket. It takes only 18 seconds to load 17 million records.

Neptune Analytics - load data

Then, I start to explore the dataset using openCypher queries. I start by defining my parameters:

params = {'name': "Seb's Investments LLC", 'quarter': '2023Q4'}

First, I want to know what the top five holdings are for Seb’s Investments LLC in this quarter and who else holds more than $1 billion in the same companies. In openCypher, it translates to the query hereafter. The $name parameter’s value is “Seb’s Investment LLC” and the $quarter parameter’s value is 2023Q4.

MATCH p=(h:Holder)-->(hq1)-[o:owns]->(holding)
WHERE h.name = $name AND hq1.name = $quarter
WITH DISTINCT holding as holding, o ORDER BY o.value DESC LIMIT 5
MATCH (holding)<-[o2:owns]-(hq2)<--(coholder:Holder)
WHERE hq2.name = '2023Q4'
WITH sum(o2.value) AS totalValue, coholder, holding
WHERE totalValue > 1000000000
RETURN coholder.name, collect(holding.name)

Neptune Analytics - query 1

Then, I want to know what the other top five companies are that have similar holdings as “Seb’s Investments LLC.” I use the topKByNode() function to perform a vector search.

MATCH (n:Holder)
WHERE n.name = $name
CALL neptune.algo.vectors.topKByNode(n)
YIELD node, score
WHERE score >0
RETURN node.name LIMIT 5

This query identifies a specific Holder node with the name “Seb’s Investments LLC.” Then, it utilizes the Neptune Analytics custom vector similarity search algorithm on the embedding property of the Holder node to find other nodes in the graph that are similar. The results are filtered to include only those with a positive similarity score, and the query finally returns the names of up to five related nodes.

Neptune Analytics - query 2

Pricing and availability
Neptune Analytics is available today in seven AWS Regions: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Singapore, Tokyo), and Europe (Frankfurt, Ireland).

AWS charges for the usage on a pay-as-you-go basis, with no recurring subscriptions or one-time setup fees.

Pricing is based on configurations of memory-optimized Neptune capacity units (m-NCU). Each m-NCU corresponds to one hour of compute and networking capacity and 1 GiB of memory. You can choose configurations starting with 128 m-NCUs and up to 4096 m-NCUs. In addition to m-NCU, storage charges apply for graph snapshots.

I invite you to read the Neptune pricing page for more details

Neptune Analytics is a new analytics database engine to analyze large graph datasets. It helps you discover insights faster for use cases such as fraud detection and prevention, digital advertising, cybersecurity, transportation logistics, and bioinformatics.

Get started
Log in to the AWS Management Console to give Neptune Analytics a try.

— seb

Vector engine for Amazon OpenSearch Serverless is now available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/vector-engine-for-amazon-opensearch-serverless-is-now-generally-available/

Today we are announcing the general availability of the vector engine for Amazon OpenSearch Serverless with new features. In July 2023, we introduced the preview release of the vector engine for Amazon OpenSearch Serverless, a simple, scalable, and high-performing similarity search capability. The vector engine makes it easy for you to build modern machine learning (ML) augmented search experiences and generative artificial intelligence (generative AI) applications without needing to manage the underlying vector database infrastructure.

You can now store, update, and search billions of vector embeddings with thousands of dimensions in milliseconds. The highly performant similarity search capability of vector engine enables generative AI-powered applications to deliver accurate and reliable results with consistent milliseconds-scale response times.

The vector engine also enables you to optimize and tune results with hybrid search by combining vector search and full-text search in the same query, removing the need to manage and maintain separate data stores or a complex application stack. The vector engine provides a secure, reliable, scalable, and enterprise-ready platform to cost effectively build a prototyping application and then seamlessly scale to production.

You can now get started in minutes with the vector engine by creating a specialized vector engine–based collection, which is a logical grouping of embeddings that works together to support a workload.

The vector engine uses OpenSearch Compute Units (OCUs), compute capacity unit, to ingest and run similarity search queries. One OCU can handle up to 2 million vectors for 128 dimensions or 500,000 for 768 dimensions at 99 percent recall rate.

The vector engine built on OpenSearch Serverless is a highly available service by default. It requires a minimum of four OCUs (2 OCUs for the ingest, including primary and standby, and 2 OCUs for the search with two active replicas across Availability Zones) for the first collection in an account. All subsequent collections using the same AWS Key Management Service (AWS KMS) key can share those OCUs.

What’s new at GA?
Since the preview, the vector engine for Amazon OpenSearch Serverless became one of the vector database options in the knowledge base of Amazon Bedrock to build generative AI applications using a Retrieval Augmented Generation (RAG) concept.

Here are some new or improved features for this GA release:

Disable redundant replica (development and test focused) option
As we announced in our preview blog post, this feature eliminates the need to have redundant OCUs in another Availability Zone solely for availability purposes. A collection can be deployed with two OCUs – one for indexing and one for search. This cuts the costs in half compared to default deployment with redundant replicas. The reduced cost makes this configuration suitable and economical for development and testing workloads.

With this option, we will still provide durability guarantees since the vector engine persists all the data in Amazon S3, but single-AZ failures would impact your availability.

If you want to disable a redundant replica, uncheck Enable redundancy when creating a new vector search
collection.

Fractional OCU for the development and test focused option
Support for fractional OCU billing for development and test focused workloads (that is, no redundant replica option) reduces the floor price for vector search collection. The vector engine will initially deploy smaller 0.5 OCUs while providing the same capabilities at lower scale and will scale up to a full OCU and beyond to meet your workload demand. This option will further reduce the monthly costs when experimenting with using the vector engine.

Automatic scaling for a billion scale
With vector engine’s seamless auto-scaling, you no longer have to reindex for scaling purposes. At preview, we were supporting about 20 million vector embeddings. With the general availability of vector engine, we have raised the limits to support a billion vector scale.

Now available
The vector engine for Amazon OpenSearch Serverless is now available in all AWS Regions where Amazon OpenSearch Serverless is available.

To get started, you can refer to the following resources:

Give it a try and send feedback to AWS re:Post for Amazon OpenSearch Service or through your usual AWS support contacts.

Channy

Announcing zero-ETL integrations with AWS Databases and Amazon Redshift

Post Syndicated from Jyoti Aggarwal original https://aws.amazon.com/blogs/big-data/announcing-zero-etl-integrations-with-aws-databases-and-amazon-redshift/

As customers become more data driven and use data as a source of competitive advantage, they want to easily run analytics on their data to better understand their core business drivers to grow sales, reduce costs, and optimize their businesses. To run analytics on their operational data, customers often build solutions that are a combination of a database, a data warehouse, and an extract, transform, and load (ETL) pipeline. ETL is the process data engineers use to combine data from different sources.

Through customer feedback, we learned that lot of undifferentiated time and resources go towards building and managing ETL pipelines between transactional databases and data warehouses. At Amazon Web Services (AWS), our goal is to make it easier for our customers to connect to and use all of their data and to do it with the speed and agility they need. We think that by automating the undifferentiated parts, we can help our customers increase the pace of their data-driven innovation by breaking down data silos and simplifying data integration.

Bringing operational data closer to analytics workflows

Customers want flexible data architectures that let them integrate data across their organization to give them a better picture of their customers, streamline operations, and help teams make better, faster decisions. But integrating data isn’t easy. Today, building these pipelines and assembling the architecture to interconnect all the data sources and optimize analytics results is complex, requires highly skilled resources, and renders data that can be erroneous or is often inconsistent.

Amazon Redshift powers data driven decisions for tens of thousands of customers every day with a fully managed, artificial intelligence (AI)-powered cloud data warehouse that delivers the best price-performance for your analytics workloads.

Zero-ETL is a set of integrations that eliminates the need to build ETL data pipelines. Zero-ETL integrations with Amazon Redshift enable customers to access their data in place using federated queries or ingest it into Amazon Redshift with a fully managed solution from across their databases. With newer features, such as support for autocopy that simplifies and automates file ingestion from Amazon Simple Storage Service (Amazon S3), Redshift Streaming Ingestion capabilities to continuously ingest any amount of streaming data directly into the warehouse, and multi-cluster data sharing architectures that minimize data movement and even provide access to third-party data, Amazon Redshift enables data integration and quick access to data without building manual pipelines.

With all the data integrated and available, Amazon Redshift empowers every data user to run analytics and build AI, machine learning (ML), and generative AI applications. Developers can run Apache Spark applications directly on the data in their warehouse from AWS analytics services, such as Amazon EMR and AWS Glue. They can enrich their datasets by joining operational data replicated through zero-ETL integrations with other sources such as sales and marketing data from SaaS applications and can even create Amazon QuickSight dashboards on top of this data to track key metrics across sales, website analytics, operations, and more—all in one place.

Customers can also use Amazon Redshift data sharing to securely share this data with multiple consumer clusters used by different teams—both within and across AWS accounts—driving a unified view of business and facilitating self-service access to application data within team clusters while maintaining governance over sensitive operational data.

Furthermore, customers can build machine learning models directly on their operational data in Amazon Redshift ML (native integration into Amazon SageMaker) without needing to build any data pipelines and use them to run billions of predictions with SQL commands. Or they can build complex transformations and aggregations on the integrated data using Amazon Redshift materialized views.

We’re excited to share four AWS database zero-ETL integrations with Amazon Redshift:

By bringing different database services closer to analytics, AWS is streamlining access to data and enabling companies to accelerate innovation, create competitive advantage, and maximize the business value extracted from their data assets.

Amazon Aurora zero-ETL integration with Amazon Redshift

The Amazon Aurora zero-ETL integration with Amazon Redshift unifies transactional data from Amazon Aurora with near real-time analytics in Amazon Redshift. This eliminates the burden of building and maintaining custom ETL pipelines between the two systems. Unlike traditional siloed databases that force a tradeoff between performance and analytics, the zero-ETL integration replicates data from multiple Aurora clusters into the same Amazon Redshift warehouse. This enables holistic insights across applications without impacting production workloads. The entire system can be serverless and can auto-scale to handle fluctuations in data volume without infrastructure management.

Amazon Aurora MySQL zero-ETL integration with Amazon Redshift processes over 1 million transactions per minute (an equivalent of 17.5 million insert/update/delete row operations per minute) from multiple Aurora databases and makes them available in Amazon Redshift in less than 15 seconds (p50 latency lag). Figure 1 shows how the Aurora MySQL zero-ETL integration with Amazon Redshift works at a high level.

Figure 1: High level working of Aurora MySQL zero-ETL integration with Amazon Redshift

In their own words, see how one of our customers is using Aurora MySQL zero-ETL integration with Amazon Redshift.

In the retail industry, for example, Infosys wanted to gain faster insights about their business, such as best-selling products and high-revenue stores, based on transactions in a store management system. They used Amazon Aurora MySQL zero-ETL integration with Amazon Redshift to achieve this. With this integration, Infosys replicated Aurora data to Amazon Redshift and created Amazon QuickSight dashboards for product managers and channel leaders in just a few seconds, instead of several hours. Now, as part of Infosys Cobalt and Infosys Topaz blueprints, enterprises can have near real-time analytics on transactional data, which can help them make informed decisions related to store management.

– Sunil Senan, SVP and Global Head of Data, Analytics, and AI, Infosys

To learn more, see Aurora Docs, Amazon Redshift Docs, and the AWS News Blog.

Amazon RDS for MySQL zero-ETL integration with Amazon Redshift

The new Amazon RDS for MySQL integration with Amazon Redshift empowers customers to easily perform analytics on their RDS for MySQL data. With a few clicks, it seamlessly replicates RDS for MySQL data into Amazon Redshift, automatically handling initial data loads, ongoing change synchronization, and schema replication. This eliminates the complexity of traditional ETL jobs. The zero-ETL integration enables workload isolation for optimal performance; RDS for MySQL focuses on high-speed transactions while Amazon Redshift handles analytical workloads. Customers can also consolidate data from multiple sources into Amazon Redshift, such as Aurora MySQL-Compatible Edition and Aurora PostgreSQL-Compatible Edition. This unified view provides holistic insights across applications in one place, delivering significant cost and operational efficiencies.

Figure 2 shows how a customer can use the AWS Management Console for Amazon RDS to get started with creating a zero-ETL integration from RDS for MySQL, Aurora MySQL-Compatible Edition, and Aurora PostgreSQL-Compatible Edition to Amazon Redshift.

Figure 2: How to create a zero-ETL integration using Amazon RDS.

This integration is currently in public preview, visit the getting started guide to learn more.

Amazon DynamoDB zero-ETL integration with Amazon Redshift

The Amazon DynamoDB zero-ETL integration with Amazon Redshift (limited preview) provides a fully managed solution for making data from DynamoDB available for analytics in Amazon Redshift. With minimal configuration, customers can replicate DynamoDB data into Amazon Redshift for analytics without consuming the DynamoDB Read Capacity Units (RCU). This zero-ETL integration unlocks powerful Amazon Redshift capabilities on DynamoDB data such as high-speed SQL queries, machine learning integrations, materialized views for fast aggregations, and secure data sharing.

This integration is currently in limited preview, use this link to request access.

Integrated services bring us closer to zero-ETL

Our mission is to help customers get the most value from their data, and integrated services are key to this. That’s why we’re building towards a zero-ETL future today. By automating complex ETL processes, data engineers can redirect their focus on creating value from the data. With this modern approach to data management, organizations can accelerate their use of data to streamline operations and fuel business growth.


About the author

Jyoti Aggarwal is a Product Management lead for Amazon Redshift zero-ETL. She brings along an expertise in cloud compute and storage, data warehouse, and B2B/B2C customer experience.

New generative AI capabilities for Amazon DataZone to further simplify data cataloging and discovery (preview)

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-generative-ai-capabilities-for-amazon-datazone-to-further-simplify-data-cataloging-and-discovery-preview/

Today, we are announcing a preview of an automation feature backed by generative artificial intelligence (AI) for Amazon DataZone that will dramatically decrease the amount of time needed to provide context for organizational data. The new feature can automate the traditionally labor-intensive process of data cataloging. Powered by the large language models (LLMs) of Amazon Bedrock, it generates detailed descriptions of data assets and their schemas, and suggests analytical use cases. You can generate a comprehensive business context with a single click.

We heard from customers that data consumers such as data analysts, scientists, and engineers in organizations struggle to understand the data’s relevance with little metadata. As a result, they either spend more time interpreting the data, or they return to data producers with continued questions. So, data producers such as data owners, engineers, and analysts who own the data and make it available for consumers need to manually enter detailed context for higher-priority data to make data shareable and discoverable. This is time-consuming and the number one problem customers have when trying to collate their data in a system for self-service by consumers.

When we launched the general availability of Amazon DataZone in October 2023, we introduced the first feature that brings generative AI capabilities to automate the generation of the table name and column names of a business catalog asset. In the data portal of Amazon DataZone, the green brain icon indicates automatically generated metadata suggestions. You could accept, edit, or reject each suggestion recommended by Amazon DataZone.

What’s new with today’s preview announcement?
Now, in addition to column and table names, you can automatically generate more detailed descriptions of the table and schema, as well as suggested uses.

In the Business Metadata tab in the data portal, when you choose Generate summary, new content will be generated to explain the table and its metadata.

You can also accept, edit, and reject this recommendation.

When you choose the Schema tab, you can also see new Description recommendations as well as the Name. You can review generated metadata and choose to accept, edit, or reject the recommendation.

This new feature will enhance data discoverability and reduce on back-and-forth communications between data consumers and producers. You will have a richer search experience based on extensive data insights in the future.

Join the preview
The new metadata generation ability is now previewed in the AWS US East (N. Virginia) and US West (Oregon) Regions. With this new generative AI capability, you can reduce time-to-insight by accelerating data cataloging and boosting data discovery. To learn more, visit the Amazon DataZone: Automate Data Discovery.

Give it a try and send feedback to AWS re:Post for Amazon DataZone or through your usual AWS Support contacts.

Channy

Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service is now available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-dynamodb-zero-etl-integration-with-amazon-opensearch-service-is-now-generally-available/

Today, we are announcing the general availability of Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service, which lets you perform a search on your DynamoDB data by automatically replicating and transforming it without custom code or infrastructure. This zero-ETL integration reduces the operational burden and cost involved in writing code for a data pipeline architecture, keeping the data in sync, and updating code with frequent application changes, enabling you to focus on your application.

With this zero-ETL integration, Amazon DynamoDB customers can now use the powerful search features of Amazon OpenSearch Service, such as full-text search, fuzzy search, auto-complete, and vector search for machine learning (ML) capabilities to offer new experiences that boost user engagement and improve satisfaction with their applications.

This zero-ETL integration uses Amazon OpenSearch Ingestion to synchronize the data between Amazon DynamoDB and Amazon OpenSearch Service. You choose the DynamoDB table whose data needs to be synchronized and Amazon OpenSearch Ingestion synchronizes the data to an Amazon OpenSearch managed cluster or serverless collection within seconds of it being available.

You can also specify index mapping templates to ensure that your Amazon DynamoDB fields are mapped to the correct fields in your Amazon OpenSearch Service indexes. Also, you can synchronize data from multiple DynamoDB tables into one Amazon OpenSearch Service managed cluster or serverless collection to offer holistic insights across several applications.

Getting started with this zero-ETL integration
With a few clicks, you can synchronize data from DynamoDB to OpenSearch Service. To create an integration between DynamoDB and OpenSearch Service, choose the Integrations menu in the left pane of the DynamoDB console and the DynamoDB table whose data you want to synchronize.

You must turn on point-in-time recovery (PITR) and the DynamoDB Streams feature. This feature allows you to capture item-level changes in your table and push the changes to a stream. Choose Turn on for PITR and enable DynamoDB Streams in the Exports and streams tab.

After turning on PITR and DynamoDB Stream, choose Create to set up an OpenSearch Ingestion pipeline in your account that replicates the data to an OpenSearch Service managed domain.

In the first step, enter a unique pipeline name and set up pipeline capacity and compute resources to automatically scale your pipeline based on the current ingestion workload.

Now you can configure the pre-defined pipeline configuration in YAML file format. You can browse resources to look up and paste information to build the pipeline configuration. This pipeline is a combination of a source part from DyanmoDB settings and a sink part for OpenSearch Service.

You must set multiple IAM roles (sts_role_arn) with the necessary permissions to read data from the DynamoDB table and write to an OpenSearch domain. This role is then assumed by OpenSearch Ingestion pipelines to ensure that the right security posture is always maintained when moving the data from source to destination. To learn more, see Setting up roles and users in Amazon OpenSearch Ingestion in the AWS documentation.

After entering all required values, you can validate the pipeline configuration to ensure that your configuration is valid. To learn more, see Creating Amazon OpenSearch Ingestion pipelines in the AWS documentation.

Take a few minutes to set up the OpenSearch Ingestion pipeline, and you can see your integration is completed in the DynamoDB table.

Now you can search synchronized items in the OpenSearch Dashboards.

Things to know
Here are a couple of things that you should know about this feature:

  • Custom schema – You can specify your custom data schema along with the index mappings used by OpenSearch Ingestion when writing data from Amazon DynamoDB to OpenSearch Service. This experience is added to the console within Amazon DynamoDB so that you have full control over the format of indices that are created on OpenSearch Service.
  • Pricing – There will be no additional cost to use this feature apart from the cost of the existing underlying components. Note that Amazon OpenSearch Ingestion charges OpenSearch Compute Units (OCUs) which will be used to replicate data between Amazon DynamoDB and Amazon OpenSearch Service. Furthermore, this feature uses Amazon DynamoDB streams for the change data capture (CDC) and you will incur the standard costs for Amazon DynamoDB Streams.
  • Monitoring – You can monitor the state of the pipelines by checking the status of the integration on the DynamoDB console or using the OpenSearch Ingestion dashboard. Additionally, you can use Amazon CloudWatch to provide real-time metrics and logs, which lets you to set up alerts in case of a breach of user-defined thresholds.

Now available
Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service is now generally available in all AWS Regions where OpenSearch Ingestion is available today.

Channy