
How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

Post Syndicated from Arun Sudhir original https://aws.amazon.com/blogs/big-data/how-eightfold-ai-implemented-metadata-security-in-a-multi-tenant-data-analytics-environment-with-amazon-redshift/

This is a guest post co-written with Arun Sudhir from Eightfold AI.

Eightfold is transforming the world of work by providing solutions that empower organizations to recruit and retain a diverse global workforce. Eightfold is a leader in AI products for enterprises to build on their talent’s existing skills. From Talent Acquisition to Talent Management and talent insights, Eightfold offers a single AI platform that does it all.

The Eightfold Talent Intelligence Platform, powered by Amazon Redshift and Amazon QuickSight, provides a full-fledged analytics platform for Eightfold’s customers. It delivers analytics and enhanced insights about the customer’s Talent Acquisition and Talent Management pipelines, and much more. Customers can also implement their own custom dashboards in QuickSight. As part of the Talent Intelligence Platform, Eightfold also exposes a data hub where each customer can access their Amazon Redshift-based data warehouse to run ad hoc queries as well as schedule queries for reporting and data export. Additionally, customers who have their own in-house analytics infrastructure can integrate their own analytics solutions with the Eightfold Talent Intelligence Platform by connecting directly to the Redshift data warehouse provisioned for them. Doing this gives them access to their raw analytics data, which can then be integrated into their analytics infrastructure irrespective of the technology stack they use.

Eightfold provides this analytics experience to hundreds of customers today. Securing customer data is a top priority for Eightfold. The company requires the highest security standards when implementing a multi-tenant analytics platform on Amazon Redshift.

The Eightfold Talent Intelligence Platform uses Amazon Redshift metadata security to control which data catalog entries (the names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift) are visible to each tenant.

In this post, we discuss how the Eightfold Talent Lake system team implemented the Amazon Redshift metadata security feature in their multi-tenant environment to enable access controls for the database catalog. By linking access to business-defined entitlements, they are able to enforce data access policies.

Amazon Redshift security controls restrict data access to users who have been granted permission. This post focuses on restricting the listing of data catalog metadata according to those same permissions.

The Eightfold team needed to develop a multi-tenant application with the following features:

  • Enforce visibility of Amazon Redshift objects on a per-tenant basis, so that each tenant can only view and access their own schema
  • Implement tenant isolation and security so that tenants can only see and interact with their own data and objects

Metadata security in Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Many customers have implemented Amazon Redshift to support multi-tenant applications. One of the challenges with multi-tenant environments is that database objects are visible to all tenants even though tenants are only authorized to access certain objects. This visibility creates data privacy challenges because many customers want to hide objects that tenants can’t access.

The newly released metadata security feature in Amazon Redshift enables you to hide database objects from all other tenants and make objects visible only to the tenants who are authorized to see and use them. Tenants can use SQL tools, dashboards, or reporting tools, and also query the database catalog, but they only see the objects they have permission to access.

Solution overview

Exposing a Redshift endpoint to all of Eightfold’s customers as part of the Talent Lake endeavor involved several design choices that had to be carefully considered. Eightfold has a multi-tenant Redshift data warehouse with an individual schema per customer, which each customer connects to with their own credentials to query their data. Data in each customer tenant can only be accessed by the credentials that have been granted access to that customer’s schema, and each customer’s data lives under an analytics schema named after the customer. For example, for a customer named A, the schema name would be A_analytics. The following diagram illustrates this architecture.

Although customer data was secured by restricting access to only the customer user, when customers used business intelligence (BI) tools like QuickSight, Microsoft Power BI, or Tableau to access their data, the initial connection showed all the customer schemas because it was performing a catalog query (which couldn’t be restricted). Therefore, Eightfold’s customers had concerns that other customers could discover that they were Eightfold’s customers by simply trying to connect to Talent Lake. This unrestricted database catalog access posed a privacy concern to several Eightfold customers. Although this could be avoided by provisioning one Redshift database per customer, that was a logistically difficult and expensive solution to implement.
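
For illustration, the schema discovery that a BI client performs on connection boils down to a catalog query along these lines (a simplified sketch; the exact query varies by tool):

SELECT nspname AS schema_name
FROM pg_namespace
WHERE nspname NOT LIKE 'pg_%'
  AND nspname <> 'information_schema'
ORDER BY schema_name;

Without metadata security, a query like this returns every customer schema in the database, regardless of the caller’s grants.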

The following screenshot shows what a connection from QuickSight to our data warehouse looked like without metadata security turned on. All other customer schemas were exposed even though the connection to QuickSight was made as customer_k_user.

Approach for implementing metadata access controls

To implement restricted catalog access and ensure it worked with Talent Lake, we cloned our production data warehouse with all its schemas and enabled the metadata security flag in the Redshift data warehouse by connecting with SQL tools. After it was enabled, we tested catalog queries by connecting to the data warehouse from BI tools like QuickSight, Microsoft Power BI, and Tableau and confirmed that only the connecting customer’s schemas showed up in the catalog results. We also ran catalog queries after connecting to the Redshift data warehouse from psql, to confirm that only the customer’s schema objects were surfaced; this validation matters because tenants can also connect to the Redshift data warehouse directly.

The metadata security feature was tested by first turning on metadata security in our Redshift data warehouse by connecting using a SQL tool or Amazon Redshift Query Editor v2.0 and issuing the following command:

ALTER SYSTEM SET metadata_security = TRUE;

Note that the preceding command is set at the Redshift cluster level or Redshift Serverless endpoint level, which means it is applied to all databases and schemas in the cluster or endpoint.

In Eightfold’s scenario, data access controls are already in place for each of the tenants for their respective database objects.
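
These controls follow the standard Redshift grant model. A minimal sketch of what the per-tenant grants might look like for one tenant (the user and schema names mirror the example in this post; the password is a placeholder):

-- Tenant user and schema-level access (illustrative)
CREATE USER customer_k_user PASSWORD '<StrongPlaceholderPassw0rd>';
GRANT USAGE ON SCHEMA customer_k_analytics TO customer_k_user;
GRANT SELECT ON ALL TABLES IN SCHEMA customer_k_analytics TO customer_k_user;
-- Make future tables in the schema visible to the tenant as well
ALTER DEFAULT PRIVILEGES IN SCHEMA customer_k_analytics GRANT SELECT ON TABLES TO customer_k_user;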

After turning on the metadata security feature in Amazon Redshift, Eightfold was able to restrict database catalog access so that each customer connecting to Amazon Redshift saw only their own schema, and further validated the behavior by issuing catalog queries against the schema objects as well.

We also tested by connecting via psql and trying out various catalog queries. All of them yielded only the relevant customer schema of the logged-in user as the result. The following are some examples:

analytics=> select * from pg_user;
usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig 
------------------------+----------+-------------+----------+-----------+----------+----------+-------------------------------------------
customer_k_user | 377 | f | f | f | ******** | | 
(1 row)

analytics=> select * from information_schema.schemata;
catalog_name | schema_name | schema_owner | default_character_set_catalog | default_character_set_schema | default_character_set_name | sql_path 
--------------+----------------------+------------------------+-------------------------------+------------------------------+----------------------------+----------
analytics | customer_k_analytics | customer_k_user | | | | 
(1 row)

The following screenshot shows the UI after metadata security was enabled: only customer_k_analytics is seen when connecting to the Redshift data warehouse as customer_k_user.

This ensured that individual customer privacy was protected and increased customer confidence in Eightfold’s Talent Lake.

Customer feedback

“Being an AI-first platform for customers to hire and develop people to their highest potential, data and analytics play a vital role in the value provided by the Eightfold platform to its customers. We rely on Amazon Redshift as a multi-tenant Data Warehouse that provides rich analytics with data privacy and security through customer data isolation by using schemas. In addition to the data being secure as always, we layered on Redshift’s new metadata access control to ensure customer schemas are not visible to other customers. This feature truly made Redshift the ideal choice for a multi-tenant, performant, and secure Data Warehouse and is something we are confident differentiates our offering to our customers.”

– Sivasankaran Chandrasekar, Vice President of Engineering, Data Platform at Eightfold AI

Conclusion

In this post, we demonstrated how the Eightfold Talent Intelligence Platform team implemented a multi-tenant environment for hundreds of customers, using the Amazon Redshift metadata security feature. For more information about metadata security, refer to the Amazon Redshift documentation.

Try out the metadata security feature for your future Amazon Redshift implementations, and feel free to leave a comment about your experience!


About the authors

Arun Sudhir is a Staff Software Engineer at Eightfold AI. He has more than 15 years of experience in design and development of backend software systems in companies like Microsoft and AWS, and has a deep knowledge of database engines like Amazon Aurora PostgreSQL and Amazon Redshift.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using AWS Analytics services.

Anjali Vijayakumar is a Senior Solutions Architect at AWS focusing on EdTech. She is passionate about helping customers build well-architected solutions in the cloud.

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/build-and-manage-your-modern-data-stack-using-dbt-and-aws-glue-through-dbt-glue-the-new-trusted-dbt-adapter/

dbt is an open source, SQL-first templating engine that allows you to write repeatable and extensible data transforms in Python and SQL. dbt focuses on the transform layer of extract, load, transform (ELT) or extract, transform, load (ETL) processes across data warehouses and databases through specific engine adapters to achieve extract and load functionality. It enables data engineers, data scientists, and analytics engineers to define the business logic with SQL select statements and eliminates the need to write boilerplate data manipulation language (DML) and data definition language (DDL) expressions. dbt lets data engineers quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, continuous integration and continuous delivery (CI/CD), and documentation.

dbt is predominantly used by customers of data warehouses (such as Amazon Redshift) who are looking to keep their data transform logic separate from storage and engine. We have seen strong customer demand to expand its scope to cloud-based data lakes, because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities.

In 2022, AWS published a dbt adapter called dbt-glue—the open source, battle-tested dbt AWS Glue adapter that allows data engineers to use dbt for cloud-based data lakes along with data warehouses and databases, paying for just the compute they need. The dbt-glue adapter democratized access to data lakes for dbt users, and enabled many users to effortlessly run their transformation workloads on the cloud with the serverless data integration capability of AWS Glue. Since the launch of the adapter, AWS has continued investing in dbt-glue to cover more requirements.

Today, we are pleased to announce that the dbt-glue adapter is now a trusted adapter based on our strategic collaboration with dbt Labs. Trusted adapters are adapters not maintained by dbt Labs, but that dbt Labs is comfortable recommending to users for use in production.

The key capabilities of the dbt-glue adapter are as follows:

  • Runs SQL as Spark SQL on AWS Glue interactive sessions
  • Manages table definitions on the AWS Glue Data Catalog
  • Supports open table formats such as Apache Hudi, Delta Lake, and Apache Iceberg
  • Supports AWS Lake Formation permissions for fine-grained access control

In addition to those capabilities, the dbt-glue adapter is designed to optimize resource utilization with several techniques on top of AWS Glue interactive sessions.

This post demonstrates how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter.

Common use cases

One common use case for dbt-glue is a central analytics team at a large corporation that is responsible for monitoring operational efficiency. They ingest application logs into raw Parquet tables in an Amazon Simple Storage Service (Amazon S3) data lake. Additionally, they extract organized data from operational systems that captures the company’s organizational structure and the costs of diverse operational components, storing it in the raw zone as Iceberg tables to maintain the original schema and facilitate easy access to the data. The team uses dbt-glue to build a transformed gold model optimized for business intelligence (BI). The gold model joins the technical logs with billing data and organizes the metrics per business unit. The gold model uses Iceberg’s ability to support the data warehouse-style modeling needed for performant BI analytics in a data lake. The combination of Iceberg and dbt-glue allows the team to efficiently build a data model that’s ready to be consumed.

Another common use case is when an analytics team in a company that has an S3 data lake creates a new data product in order to enrich its existing data from its data lake with medical data. Let’s say that this company is located in Europe and the data product must comply with the GDPR. For this, the company uses Iceberg to meet needs such as the right to be forgotten and the deletion of data. The company uses dbt to model its data product on its existing data lake due to its compatibility with AWS Glue and Iceberg and the simplicity that the dbt-glue adapter brings to the use of this storage format.

How dbt and dbt-glue work

The following are key dbt features:

  • Project – A dbt project enforces a top-level structure on the staging, models, permissions, and adapters. A project can be checked into a GitHub repo for version control.
  • SQL – dbt relies on SQL select statements for defining data transformation logic. Instead of raw SQL, dbt offers templatized SQL (using Jinja) that allows code modularity. Instead of having to copy and paste SQL in multiple places, data engineers can define modular transforms and call them from other places within the project (see the short model sketch after this list). Having a modular pipeline helps data engineers collaborate on the same project.
  • Models – dbt models are primarily written as a SELECT statement and saved as a .sql file. Data engineers define dbt models for their data representations. To learn more, refer to About dbt models.
  • Materializations – Materializations are strategies for persisting dbt models in a warehouse. There are five types of materializations built into dbt: table, view, incremental, ephemeral, and materialized view. To learn more, refer to Materializations and Incremental models.
  • Data lineage – dbt tracks data lineage, allowing you to understand the origin of data and how it flows through different transformations. dbt also supports impact analysis, which helps identify the downstream effects of changes.
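
To make that modularity concrete, here is a minimal, hypothetical pair of dbt models in which one model builds on another through ref() (the source, model, and column names are illustrative):

-- models/staging/stg_orders.sql (hypothetical)
SELECT order_id, customer_id, order_total
FROM {{ source('shop', 'raw_orders') }}

-- models/marts/customer_revenue.sql (hypothetical)
SELECT customer_id, SUM(order_total) AS total_revenue
FROM {{ ref('stg_orders') }}
GROUP BY customer_id

Because the dependency is expressed with ref(), dbt infers the build order and the lineage between the two models automatically.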

The high-level data flow is as follows:

  1. Data engineers ingest data from data sources to raw tables and define table definitions for the raw tables.
  2. Data engineers write dbt models with templatized SQL.
  3. The dbt adapter converts dbt models to SQL statements that run in a data warehouse.
  4. The data warehouse runs the SQL statements to create intermediate tables or final tables, views, or materialized views.

The following diagram illustrates the architecture.

dbt-glue works with the following steps:

  1. The dbt-glue adapter converts dbt models to SQL statements that run as Spark SQL.
  2. AWS Glue interactive sessions run the SQL statements to create intermediate tables or final tables, views, or materialized views.
  3. dbt-glue supports csv, parquet, hudi, delta, and iceberg as file formats.
  4. On the dbt-glue adapter, table or incremental are the commonly used materializations at the destination. There are three strategies for incremental materialization: the merge strategy requires hudi, delta, or iceberg, while the other two strategies, append and insert_overwrite, work with csv, parquet, hudi, delta, or iceberg (a short configuration sketch follows the architecture diagram below).

The following diagram illustrates this architecture.
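
As a hedged illustration of the append strategy mentioned above, an incremental model writing Parquet might be configured as follows (the source, columns, and timestamp filter are hypothetical; the merge strategy with Iceberg is shown later in this post’s gold models):

{{ config(
    materialized='incremental',
    incremental_strategy='append',
    file_format='parquet'
) }}

SELECT event_id, event_time, payload
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- Only process rows newer than what is already in the target table
  WHERE event_time > (SELECT max(event_time) FROM {{ this }})
{% endif %}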

Example use case

In this post, we use the data from the New York City Taxi Records dataset. This dataset is available in the Registry of Open Data on AWS (RODA), which is a repository containing public datasets from AWS resources. The raw Parquet table in this dataset, named records, stores the trip records.

The objective is to create the following three tables, which contain metrics based on the raw table:

  • silver_avg_metrics – Basic metrics based on NYC Taxi Open Data for the year 2016
  • gold_passengers_metrics – Metrics per passenger based on the silver metrics table
  • gold_cost_metrics – Metrics per cost based on the silver metrics table

The final goal is to create two well-designed gold tables that store already aggregated results in Iceberg format for ad hoc queries through Amazon Athena.

Prerequisites

This walkthrough requires the following prerequisites:

  • An AWS Identity and Access Management (IAM) role with all the mandatory permissions to run an AWS Glue interactive session and the dbt-glue adapter
  • An AWS Glue database and table to store the metadata related to the NYC taxi records dataset
  • An S3 bucket to use as output and store the processed data
  • An Athena configuration (a workgroup and S3 bucket to store the output) to explore the dataset
  • An AWS Lambda function (created as an AWS CloudFormation custom resource) that updates all the partitions in the AWS Glue table

With these prerequisites, we simulate the situation that data engineers have already ingested data from data sources to raw tables, and defined table definitions for the raw tables.

For ease of use, we prepared a CloudFormation template. This template deploys all the required infrastructure. To create these resources, choose Launch Stack in the us-east-1 Region, and follow the instructions.

Install dbt, the dbt CLI, and the dbt adapter

The dbt CLI is a command line interface for running dbt projects. It’s free to use and available as an open source project. Install dbt and the dbt CLI with the following code:

$ pip3 install --no-cache-dir dbt-core

For more information, refer to How to install dbt, What is dbt?, and Viewpoint.

Install the dbt adapter with the following code:

$ pip3 install --no-cache-dir dbt-glue

Create a dbt project

Complete the following steps to create a dbt project:

  1. Run the dbt init command to create and initialize a new empty dbt project:
    $ dbt init

  2. For the project name, enter dbt_glue_demo.
  3. For the database, choose glue.

Now the empty project has been created. The directory structure is shown as follows:

$ cd dbt_glue_demo 
$ tree .
.
├── README.md
├── analyses
├── dbt_project.yml
├── macros
├── models
│   └── example
│       ├── my_first_dbt_model.sql
│       ├── my_second_dbt_model.sql
│       └── schema.yml
├── seeds
├── snapshots
└── tests

Create a source

The next step is to create a source table definition. We add models/source_tables.yml with the following contents:

version: 2

sources:
  - name: data_source
    schema: nyctaxi

    tables:
      - name: records

This source definition corresponds to the AWS Glue table nyctaxi.records, which we created in the CloudFormation stack.

Create models

In this step, we create a dbt model that represents the average values for trip duration, passenger count, trip distance, and total amount of charges. Complete the following steps:

  1. Create the models/silver/ directory.
  2. Create the file models/silver/silver_avg_metrics.sql with the following contents:
    WITH source_avg as ( 
        SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
        , avg(passenger_count) as avg_passenger_count 
        , avg(trip_distance) as avg_trip_distance 
        , avg(total_amount) as avg_total_amount
        , year
        , month 
        , type
        FROM {{ source('data_source', 'records') }} 
        WHERE year = "2016"
        AND dropoff_datetime is not null 
        GROUP BY year, month, type
    ) 
    SELECT *
    FROM source_avg

  3. Create the file models/silver/schema.yml with the following contents:
    version: 2
    
    models:
      - name: silver_avg_metrics
        description: This table has basic metrics based on NYC Taxi Open Data for the year 2016
    
        columns:
          - name: avg_duration
            description: The average duration of a NYC Taxi trip
    
          - name: avg_passenger_count
            description: The average number of passengers per NYC Taxi trip
    
          - name: avg_trip_distance
            description: The average NYC Taxi trip distance
    
          - name: avg_total_amount
            description: The average total amount of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi 

  4. Create the models/gold/ directory.
  5. Create the file models/gold/gold_cost_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_trip_distance) as avg_cost_per_distance
    , (avg_total_amount/avg_duration) as avg_cost_per_minute
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  6. Create the file models/gold/gold_passengers_metrics.sql with the following contents:
    {{ config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key=["year", "month", "type"],
        file_format='iceberg',
        iceberg_expire_snapshots='False',
        table_properties={'format-version': '2'}
    ) }}
    SELECT (avg_total_amount/avg_passenger_count) as avg_cost_per_passenger
    , (avg_duration/avg_passenger_count) as avg_duration_per_passenger
    , (avg_trip_distance/avg_passenger_count) as avg_trip_distance_per_passenger
    , year
    , month 
    , type 
    FROM {{ ref('silver_avg_metrics') }}

  7. Create the file models/gold/schema.yml with the following contents:
    version: 2
    
    models:
      - name: gold_cost_metrics
        description: This table has metrics per cost based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_distance
            description: The average cost per distance of a NYC Taxi trip
    
          - name: avg_cost_per_minute
            description: The average cost per minute of a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip
    
          - name: type
            description: The type of the NYC Taxi
    
      - name: gold_passengers_metrics
        description: This table has metrics per passenger based on NYC Taxi Open Data
    
        columns:
          - name: avg_cost_per_passenger
            description: The average cost per passenger for a NYC Taxi trip
    
          - name: avg_duration_per_passenger
            description: The average trip duration per passenger for a NYC Taxi trip
    
          - name: avg_trip_distance_per_passenger
            description: The average trip distance per passenger for a NYC Taxi trip
    
          - name: year
            description: The year of the NYC Taxi trip
    
          - name: month
            description: The month of the NYC Taxi trip 
    
          - name: type
            description: The type of the NYC Taxi

  8. Remove the models/example/ folder, because it’s just an example created by the dbt init command.

Configure the dbt project

dbt_project.yml is a key configuration file for dbt projects. It contains the following code:

models:
  dbt_glue_demo:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

We configure dbt_project.yml to replace the preceding code with the following:

models:
  dbt_glue_demo:
    silver:
      +materialized: table

This is because we want to materialize the models under silver as Parquet tables.

Configure a dbt profile

A dbt profile is a configuration that specifies how to connect to a particular database. The profiles are defined in the profiles.yml file within a dbt project.

Complete the following steps to configure a dbt profile:

  1. Create the profiles directory.
  2. Create the file profiles/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"

  3. Create the profiles/iceberg/ directory.
  4. Create the file profiles/iceberg/profiles.yml with the following contents:
    dbt_glue_demo:
      target: dev
      outputs:
        dev:
          type: glue
          query-comment: demo-nyctaxi
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 5
          worker_type: G.1X
          schema: "dbt_glue_demo_nyc_metrics"
          database: "dbt_glue_demo_nyc_metrics"
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"
          datalake_formats: "iceberg"
          conf: --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse="{{ env_var('DBT_S3_LOCATION') }}warehouse/" --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

The last two lines are added for setting Iceberg configurations on AWS Glue interactive sessions.

Run the dbt project

Now it’s time to run the dbt project. Complete the following steps:

  1. To run the dbt project, you must be in the project folder:
    $ cd dbt_glue_demo

  2. The project requires you to set environment variables in order to run on the AWS account:
    $ export DBT_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query "Account" --output text):role/GlueInteractiveSessionRole"
    $ export DBT_S3_LOCATION="s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1"

  3. Make sure the profile is set up correctly from the command line:
    $ dbt debug --profiles-dir profiles
    ...
    05:34:22 Connection test: [OK connection ok]
    05:34:22 All checks passed!

If you see any failures, check if you provided the correct IAM role ARN and S3 location in Step 2.

  4. Run the models with the following code:
    $ dbt run -m silver --profiles-dir profiles
    $ dbt run -m gold --profiles-dir profiles/iceberg/

Now the tables are successfully created in the AWS Glue Data Catalog, and the data is materialized in the Amazon S3 location.

You can verify those tables by opening the AWS Glue console, choosing Databases in the navigation pane, and opening dbt_glue_demo_nyc_metrics.
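
If you prefer SQL, you can also list the new tables from Athena (which we use in the next section):

SHOW TABLES IN dbt_glue_demo_nyc_metrics;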

Query materialized tables through Athena

Let’s query the target table using Athena to verify the materialized tables. Complete the following steps:

  1. On the Athena console, switch the workgroup to athena-dbt-glue-aws-blog.
  2. If the workgroup athena-dbt-glue-aws-blog settings dialog box appears, choose Acknowledge.
  3. Use the following query to explore the metrics created by the dbt project:
    SELECT cm.avg_cost_per_minute
        , cm.avg_cost_per_distance
        , pm.avg_cost_per_passenger
        , cm.year
        , cm.month
        , cm.type
    FROM "dbt_glue_demo_nyc_metrics"."gold_passengers_metrics" pm
    LEFT JOIN "dbt_glue_demo_nyc_metrics"."gold_cost_metrics" cm
        ON cm.type = pm.type
        AND cm.year = pm.year
        AND cm.month = pm.month
    WHERE cm.type = 'yellow'
        AND cm.year = '2016'
        AND cm.month = '6'

The following screenshot shows the results of this query.

Review dbt documentation

Complete the following steps to review your documentation:

  1. Generate the documentation for the project:
    $ dbt docs generate --profiles-dir profiles/iceberg
    11:41:51  Running with dbt=1.7.1
    11:41:51  Registered adapter: glue=1.7.1
    11:41:51  Unable to do partial parsing because profile has changed
    11:41:52  Found 3 models, 1 source, 0 exposures, 0 metrics, 478 macros, 0 groups, 0 semantic models
    11:41:52  
    11:41:53  Concurrency: 1 threads (target='dev')
    11:41:53  
    11:41:53  Building catalog
    11:43:32  Catalog written to /Users/username/Documents/workspace/dbt_glue_demo/target/catalog.json

  2. Run the following command to open the documentation on your browser:
    $ dbt docs serve --profiles-dir profiles/iceberg

  3. In the navigation pane, choose gold_cost_metrics under dbt_glue_demo/models/gold.

You can see the detailed view of the model gold_cost_metrics, as shown in the following screenshot.

  4. To see the lineage graph, choose the circle icon at the bottom right.

Clean up

To clean up your environment, complete the following steps:

  1. Delete the database created by dbt:
    $ aws glue delete-database --name dbt_glue_demo_nyc_metrics

  2. Delete all generated data:
    $ aws s3 rm s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive
    $ aws s3 rm s3://aws-athena-dbt-glue-query-results-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive

  3. Delete the CloudFormation stack:
    $ aws cloudformation delete-stack --stack-name dbt-demo

Conclusion

This post demonstrated how the dbt-glue adapter helps your workload, and how you can build a modern data stack using dbt and AWS Glue using the dbt-glue adapter. You learned the end-to-end operations and data flow for data engineers to build and manage a data stack using dbt and the dbt-glue adapter. To report issues or request a feature enhancement, feel free to open an issue on GitHub.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team at Amazon Web Services. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Benjamin Menuet is a Senior Data Architect on the AWS Professional Services team at Amazon Web Services. He helps customers develop data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some iconic races like the UTMB.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Kinshuk Pahare is a Principal Product Manager on the AWS Glue team at Amazon Web Services.

Jason Ganz is the manager of the Developer Experience (DX) team at dbt Labs.

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

Post Syndicated from Neeraja Rentachintala original https://aws.amazon.com/blogs/big-data/amazon-redshift-announcements-at-aws-reinvent-2023-to-enable-analytics-on-all-your-data/

In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully managed, petabyte-scale, enterprise-grade cloud data warehouse. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Since then, customer demand for better scale, higher throughput, and agility in handling a wide variety of changing, but increasingly business-critical, analytics and machine learning use cases has exploded, and we have been keeping pace. Today, tens of thousands of customers use Amazon Redshift on AWS global infrastructure to collectively process exabytes of data daily, and employ Amazon Redshift as a key component of their data architecture to drive use cases from typical dashboarding to self-service analytics, real-time analytics, machine learning, data sharing and monetization, and more.

The advancements to Amazon Redshift announced at AWS re:Invent 2023 further accelerate the modernization of your cloud analytics environments, in keeping with our core tenet of helping you achieve the best price-performance at any scale. These announcements drive forward the AWS zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations. From price-performance improvements to zero-ETL to generative AI capabilities, we have something for everyone. Let’s dive into the highlights.

Modernizing analytics for scale, performance, and reliability

“Our migration from legacy on-premises platform to Amazon Redshift allows us to ingest data 88% faster, query data 3x faster, and load daily data to the cloud 6x faster. Amazon Redshift enabled us to optimize performance, availability, and reliability—significantly easing operational complexity, while increasing the velocity of our end-users’ decision-making experience on the Fab floor.”

– Sunil Narayan, Sr Dir, Analytics at GlobalFoundries

Diligently driving the best price-performance at scale with new improvements

Since day 1, Amazon Redshift has been building innovative capabilities to help you get to optimum performance while keeping costs lower. Amazon Redshift continues to lead on the price-performance front, with up to 6x better price-performance than alternative cloud data warehouses, including for dashboarding applications with high concurrency and low latency. We closely analyze query patterns in the fleet and look for opportunities to drive customer-focused innovation. For example, earlier in the year, we announced speedups of up to 63x for string-based data processing compared to alternative compression encodings such as LZO (Lempel-Ziv-Oberhumer) or ZStandard. At AWS re:Invent 2023, we introduced more performance enhancements in query planning and execution, such as enhanced bloom filters, query rewrites, and support for write operations in auto scaling. For more information about performance improvement capabilities, refer to the list of announcements below.

Amazon Redshift Serverless is more intelligent than ever with new AI-driven scaling and optimizations

Speaking of price-performance, new next-generation AI-driven scaling and optimization capabilities in Amazon Redshift Serverless can deliver up to 10x better price-performance for variable workloads (based on internal testing), without manual intervention. Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse. Since GA, Redshift Serverless has executed over a billion queries to power data insights for thousands of customers. With these new AI optimizations, Amazon Redshift Serverless scales proactively and automatically with workload changes across all key dimensions, such as data volume, concurrent users, and query complexity. You simply specify your desired price-performance target (optimize for cost, optimize for performance, or balanced) and Redshift Serverless does the rest. Learn more about additional improvements in Redshift Serverless under the list of announcements below.

Multi-data warehouse writes through data sharing

Data sharing is a widely adopted feature in Amazon Redshift, with customers running tens of millions of queries on shared data daily. Customers share live, transactionally consistent data within and across organizations and Regions for read purposes without data copies or data movement. Customers are using data sharing to modernize their analytics architectures from monolithic architectures to multi-cluster, data mesh deployments that enable seamless and secure access across organizations to drive data collaboration and powerful insights. At AWS re:Invent 2023, we extended data sharing capabilities to launch multi-data warehouse writes in preview. You can now start writing to Redshift databases from other Redshift data warehouses in just a few clicks, further enabling data collaboration and flexible scaling of compute for ETL and data processing workloads by adding warehouses of different types and sizes based on price-performance needs. Because each warehouse is billed for its own compute, you also get greater transparency of compute usage and can keep your costs under control.
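
For reference, read-only data sharing is set up today with a handful of SQL statements on the producer and consumer warehouses; the following is a minimal sketch (the datashare, schema, table, and namespace IDs are placeholders), and the new write capability extends this same model:

-- On the producer warehouse
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA sales;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer warehouse
CREATE DATABASE sales_share_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-id>';
SELECT count(*) FROM sales_share_db.sales.orders;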

Multidimensional data layouts

Amazon Redshift offers industry-leading predictive optimizations that continuously monitor your workloads and seamlessly accelerate performance and maximize concurrency by adjusting data layout and compute management as you use the data warehouse more. In addition to the powerful optimizations Redshift already offers, such as automatic table sort and automatic sort and distribution keys, we are introducing multidimensional data layouts, a powerful new table sorting mechanism that improves the performance of repetitive queries by automatically sorting data based on the incoming query filters (for example, sales in a specific region). This method significantly accelerates the performance of table scans compared to traditional methods.

Unifying all your data with zero-ETL approaches

“Using the Aurora MySQL zero-ETL integration, we experience near real-time data synchronization between Aurora MySQL databases and Amazon Redshift, making it possible to build an analysis environment in just three hours instead of the month of developer time it used to take before”

– MoneyForward

JOYME uses Amazon Redshift’s streaming ingestion and other AWS services for risk control over users’ financial activity such as recharge, refund, and rewards.

“With Redshift, we are able to view risk counterparts and data in near real time—instead of on an hourly basis. Redshift significantly improved our business ROI efficiency.”

– PengBo Yang, CTO, JOYME

Data pipelines can be challenging and costly to build and manage and can create hours-long delays to obtain transactional data for analytics. These delays can lead to missed business opportunities, especially when the insights derived from analytics on transactional data are relevant for only a limited amount of time. Amazon Redshift employs AWS’s zero-ETL approach that enables interoperability and integration between the data warehouse and operational databases and even your streaming data services, so that the data is easily and automatically ingested into the warehouse for you, or you can access the data in place, where it lives.

Zero-ETL integrations with operational databases

We delivered zero-ETL integration between Amazon Aurora MySQL and Amazon Redshift (general availability) this year, to enable near real-time analytics and machine learning (ML) using Amazon Redshift on petabytes of transactional data from Amazon Aurora. Within seconds of transactional data being written into Aurora, the data is available in Amazon Redshift, so you don’t have to build and maintain complex data pipelines to perform extract, transform, and load (ETL) operations. At AWS re:Invent, we extended zero-ETL integration to additional sources, specifically Aurora PostgreSQL, Amazon DynamoDB, and Amazon RDS for MySQL. Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications.

Data lake querying with support for Apache Iceberg tables

Amazon Redshift allows customers to run a wide range of workloads on data warehouses and data lakes using its support for various open file and table formats. At AWS re:Invent, we announced the general availability of support for Apache Iceberg tables, so you can easily access your Apache Iceberg tables on your data lake from Amazon Redshift and join them with the data in your data warehouse when needed. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience. We have improved data lake query performance by integrating with AWS Glue statistics, and introduced a preview of incremental refresh for materialized views on data lake data to accelerate repeated queries.
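
As a hedged example of what this looks like in practice, a query can join an Iceberg table exposed through the auto-mounted data catalog with a local Redshift table (the lake database, schema, and table names below are placeholders):

SELECT c.customer_segment,
       SUM(s.amount) AS total_sales
FROM awsdatacatalog.lake_db.iceberg_sales s   -- Iceberg table in the data lake, via the auto-mounted AWS Glue Data Catalog
JOIN analytics.customers c                    -- table in the Redshift data warehouse
  ON s.customer_id = c.customer_id
GROUP BY c.customer_segment;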

Learn more about the zero-ETL integrations, data lake performance enhancements, and other announcements below.

Maximize value with comprehensive analytics and ML capabilities

“Amazon Redshift is one of the most important tools we had in growing Jobcase as a company.”

– Ajay Joshi, Distinguished Engineer, Jobcase

With all your data integrated and available, you can easily build and run everything from near real-time analytics to AI/ML and generative AI applications. Here are a couple of highlights from this week; for the full list, see below.

Amazon Q Generative SQL capability

Query Editor, an out-of-the-box web-based SQL experience in Amazon Redshift, is a popular tool for data exploration, visual analysis, and data collaboration. At AWS re:Invent, we introduced Amazon Q generative SQL capabilities in Amazon Redshift Query Editor (preview) to simplify query authoring and increase your productivity by allowing you to express queries in natural language and receive SQL code recommendations. Generative SQL uses AI to analyze user intent, query patterns, and schema metadata to identify common SQL query patterns, allowing you to get insights faster in a conversational format without extensive knowledge of your organization’s complex database metadata.
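
For example, a prompt such as “total sales by month for 2023” might yield a recommendation along these lines (purely illustrative; the table and columns are hypothetical):

SELECT DATE_TRUNC('month', sale_date) AS sale_month,
       SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY 1
ORDER BY 1;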

Amazon Redshift ML large language model (LLM) integration

Amazon Redshift ML enables customers to create, train, and deploy machine learning models using familiar SQL commands. Customers use Redshift ML to run an average of over 10 billion predictions a day within their data warehouses. At AWS re:Invent, we announced support for LLMs in preview. Now, you can use pre-trained open source LLMs in Amazon SageMaker JumpStart as part of Redshift ML, allowing you to bring the power of LLMs to analytics. For example, you can make inferences on your product feedback data in Amazon Redshift, and use LLMs to summarize feedback, perform entity extraction, sentiment analysis, and product feedback classification.
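
Redshift ML’s existing remote-inference pattern exposes a model as a SQL function; the following is a rough, hypothetical sketch of that pattern against a SageMaker endpoint (the endpoint name, function signature, and table are made up, and the LLM preview syntax may differ):

CREATE MODEL feedback_summarizer
FUNCTION summarize_feedback(varchar)
RETURNS varchar
SAGEMAKER 'jumpstart-llm-endpoint-placeholder'
IAM_ROLE default;

SELECT product_id,
       summarize_feedback(feedback_text) AS feedback_summary
FROM product_feedback
LIMIT 10;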

Innovate faster with secure data collaboration within and across organizations

“Millions of companies use Stripe’s software and APIs to accept payments, send payouts, and manage their businesses online.  Access to their Stripe data via leading data warehouses like Amazon Redshift has been a top request from our customers. Our customers needed secure, fast, and integrated analytics at scale without building complex data pipelines or moving and copying data around. With Stripe Data Pipeline for Amazon Redshift, we’re helping our customers set up a direct and reliable data pipeline in a few clicks. Stripe Data Pipeline enables our customers to automatically share their complete, up-to-date Stripe data with their Amazon Redshift data warehouse, and take their business analytics and reporting to the next level.”

– Tony Petrossian, Head of Engineering, Revenue & Financial Management at Stripe

With Amazon Redshift, you can easily and securely share data and collaborate no matter where your teams or data are located, and have confidence that your data is secure no matter where you operate or how highly regulated your industry is. We have enabled fine-grained permissions and an easy single sign-on authentication experience for your organizational identity, all provided at no additional cost to you.

Unified identity with AWS IAM Identity Center integration

We announced Amazon Redshift integration with AWS IAM Identity Center to enable organizations to support trusted identity propagation between Amazon QuickSight, Amazon Redshift Query Editor, and Amazon Redshift. Customers can use their organization identities to access Amazon Redshift in a single sign-on experience using third-party identity providers (IdPs), such as Microsoft Entra ID, Okta, Ping, and OneLogin, from Amazon QuickSight and Amazon Redshift Query Editor. Administrators can use third-party identity provider users and groups to manage fine-grained access to data across services and audit user-level access in AWS CloudTrail. With trusted identity propagation, a user’s identity is passed seamlessly between Amazon QuickSight and Amazon Redshift, reducing time to insights and enabling a friction-free analytics experience.

For the full set of announcements and to learn more, see https://aws.amazon.com/redshift.


About the authors

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned product management and GTM leader, bringing over 20 years of experience in product vision, strategy, and leadership roles in data products and platforms. Neeraja has delivered products in analytics, databases, data integration, application integration, AI/machine learning, and large-scale distributed systems across on-premises and cloud environments, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica, and Expedia.com.

Sunaina AbdulSalah leads product marketing for Amazon Redshift. She focuses on educating customers about the impact of data warehousing and analytics and sharing AWS customer stories. She has a deep background in marketing and GTM functions in the B2B technology and cloud computing domains. Outside of work, she spends time with her family and friends and enjoys traveling.

Unlocking the value of data as your differentiator

Post Syndicated from G2 Krishnamoorthy original https://aws.amazon.com/blogs/big-data/unlocking-the-value-of-data-as-your-differentiator/

Today on the AWS re:Invent keynote stage, Swami Sivasubramanian, VP of Data and AI, AWS, spoke about the beneficial relationship among data, generative AI, and humans—all working together to unleash new possibilities in efficiency and creativity. There has never been a more exciting time in modern technology. Innovation is accelerating everywhere, and the future is rife with possibility. While Swami explored many facets of this beneficial relationship in the keynote today, one area that is especially critical for our customers to get right if they want to see success in generative AI is data. When you want to build generative AI applications that are unique to your business needs, data is the differentiator. This week, we launched many new tools to help you turn your data into your differentiator. This includes tools to help you customize your foundation models, and new services and features to build a strong data foundation to fuel your generative AI applications.

Customizing foundation models

The need for data is quite obvious if you are building your own foundation models (FMs). These models need vast amounts of data. But data is necessary even when you are building on top of FMs. If you think about it, everyone has access to the same models for building generative AI applications. It’s data that is the key to moving from generic applications to generative AI applications that create real value for your customers and your business. For instance, Intuit’s new generative AI-powered assistant, Intuit Assist, uses relevant contextual datasets spanning small business, consumer finance, and tax information to deliver personalized financial insights to their customers. With Amazon Bedrock, you can privately customize FMs for your specific use case using a small set of your own labeled data through a visual interface without writing any code. Today, we announced the ability to fine-tune Cohere Command and Meta Llama 2 in addition to Amazon Titan. In addition to fine-tuning, we’re also making it easier for you to provide models with up-to-date and contextually relevant information from your data sources using Retrieval Augmented Generation (RAG). Amazon Bedrock’s Knowledge Bases feature, which went to general availability today, supports the entire RAG workflow, from ingestion, to retrieval, and prompt augmentation. Knowledge Bases works with popular vector databases and engines including Amazon OpenSearch Serverless, Redis Enterprise Cloud, and Pinecone, with support for Amazon Aurora and MongoDB coming soon.

Building a strong data foundation

To produce the high-quality data that you need to build or customize FMs for generative AI, you need a strong data foundation. Of course, the value of a strong data foundation is not new and the need for one spans well beyond generative AI. Across all types of use cases, from generative AI to business intelligence (BI), we’ve found that a strong data foundation includes a comprehensive set of services to meet all your use case needs, integrations across those services to break down data silos, and tools to govern data across the end-to-end data workflow so you can innovate more quickly. These tools also need to be intelligent to remove the heavy lifting from data management.

Comprehensive

First, you need a comprehensive set of data services so you can get the price/performance, speed, flexibility, and capabilities for any use case. AWS offers a broad set of tools that enable you to store, organize, access, and act upon various types of data. We have the broadest selection of database services, including relational databases like Aurora and Amazon Relational Database Service (Amazon RDS)—and on Monday, we introduced the newest addition to the RDS family: Amazon RDS for Db2. Now Db2 customers can easily set up, operate, and scale highly available Db2 databases in the cloud. We also offer non-relational databases like Amazon DynamoDB, used by over 1 million customers for its serverless, single-digit millisecond performance at any scale. You also need services to store data for analysis and machine learning (ML) like Amazon Simple Storage Service (Amazon S3). Customers have created hundreds of thousands of data lakes on Amazon S3. Our portfolio also includes our data warehouse, Amazon Redshift, which delivers more than 6 times better price/performance than other cloud data warehouses. We also have tools that enable you to act on your data, including Amazon QuickSight for BI, Amazon SageMaker for ML, and of course, Amazon Bedrock for generative AI.

Serverless enhancements

The dynamic nature of data makes it perfectly suited to serverless technologies, which is why AWS offers a broad range of serverless database and analytics offerings that help support our customers’ most demanding workloads. This week, we made even more improvements to our serverless options in this area, including a new Aurora capability that automatically scales to millions of write transactions per second and manages petabytes of data while maintaining the simplicity of operating a single database. We also released a new serverless option for Amazon ElastiCache, which makes it faster and easier to create highly available caches and instantly scales to meet application demand. Finally, we announced new AI-driven scaling and optimizations for Amazon Redshift Serverless that enable the service to learn from your patterns and proactively scale on multiple dimensions, including concurrent users, data variability, and query complexity. It does all of this while factoring in your price/performance targets so you can optimize between cost and performance.

Vector capabilities across more databases

Your data foundation also needs to include services to store, index, retrieve, and search vector data. As our customers need vector embeddings as part of their generative AI application workflows, they told us they want to use vector capabilities in their existing databases to eliminate the steep learning curve for new programming tools, APIs, and SDKs. They also feel more confident knowing their existing databases are proven in production and meet requirements for scalability, availability, and storage and compute. And when your vectors and business data are stored in the same place, your applications will run faster—and there’s no data sync or data movement to worry about.

For all of these reasons, we’ve invested in adding vector capabilities to some of our most popular data services, including Amazon OpenSearch Service and OpenSearch Serverless, Aurora, and Amazon RDS. Today, we added four more to that list, with the addition of vector support in Amazon MemoryDB for Redis, Amazon DocumentDB (with MongoDB compatibility), DynamoDB, and Amazon Neptune. Now you can use vectors and generative AI with your database of choice.
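
As one concrete illustration of keeping vectors next to business data, Aurora PostgreSQL-Compatible Edition and Amazon RDS for PostgreSQL expose vector search through the pgvector extension; the following is a minimal sketch (the table, column, and toy three-dimensional embedding are illustrative):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE product_docs (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(3)  -- real embeddings typically have hundreds or thousands of dimensions
);

-- Nearest-neighbor lookup by Euclidean distance
SELECT id, content
FROM product_docs
ORDER BY embedding <-> '[0.12, 0.34, 0.56]'
LIMIT 5;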

Integrated

Another key to your data foundation is integrating data across your data sources for a more complete view of your business. Typically, connecting data across different data sources requires complex extract, transform, and load (ETL) pipelines, which can take hours—if not days—to build. These pipelines also have to be continuously maintained and can be brittle. AWS is investing in a zero-ETL future so you can quickly and easily connect and act on all your data, no matter where it lives. We’re delivering on this vision in a number of ways, including zero-ETL integrations between our most popular data stores. Earlier this year, we brought you our fully managed zero-ETL integration between Amazon Aurora MySQL-Compatible Edition and Amazon Redshift. Within seconds of data being written into Aurora, you can use Amazon Redshift to do near-real-time analytics and ML on petabytes of data. Woolworths, a pioneer in retail who helped build the retail model of today, was able to reduce development time for analysis of promotions and other events from 2 months to 1 day using the Aurora zero-ETL integration with Amazon Redshift.

More zero-ETL options

At re:Invent, we announced three more zero-ETL integrations with Amazon Redshift, including Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and DynamoDB, to make it easier for you to take advantage of near-real-time analytics to improve your business outcomes. In addition to Amazon Redshift, we’ve also expanded our zero-ETL support to OpenSearch Service, which tens of thousands of customers use for real-time search, monitoring, and analysis of business and operational data. This includes zero-ETL integrations with DynamoDB and Amazon S3. With all of these zero-ETL integrations, we’re making it even easier to leverage relevant data for your applications, including generative AI.

Governed

Finally, your data foundation needs to be secure and governed to ensure the data that’s used throughout the development cycle of your generative AI applications is high quality and compliant. To help with this, we launched Amazon DataZone last year. Amazon DataZone is being used by companies like Guardant Health and Bristol Myers Squibb to catalog, discover, share, and govern data across their organization. Amazon DataZone uses ML to automatically add metadata to your data catalog, making all of your data more discoverable. This week, we added a new feature to Amazon DataZone that uses generative AI to automatically create business descriptions and context for your datasets with just a few clicks, making data even easier to understand and apply. While Amazon DataZone helps you share data in a governed way within your organization, many customers also want to securely share data with their partners.

Infusing intelligence across the data foundation

Not only have we added generative AI to Amazon DataZone, but we’re leveraging intelligent technology across our data services to make data easier to use, more intuitive to work with, and more accessible. Amazon Q, our new generative AI assistant, helps you author dashboards in QuickSight and create compelling visual stories from your dashboard data using natural language. We also announced that Amazon Q can help you create data integration pipelines using natural language. For example, you can ask Q to “read JSON files from S3, join on ‘accountid’, and load into DynamoDB,” and Q will return an end-to-end data integration job to perform this action. Amazon Q is also making it easier to query data in your data warehouse with generative AI SQL in Amazon Redshift Query Editor (in preview). Now data analysts, scientists, and engineers can be more productive using generative AI text-to-code functionality. You can also improve accuracy by enabling query history access to specific users—without compromising data privacy.
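To give a sense of what such a generated job could look like, here is a rough, hypothetical sketch in AWS Glue PySpark form. The bucket paths, dataset names, and DynamoDB table name are placeholders, and the job Amazon Q actually returns may be structured differently.

# Hypothetical sketch of the kind of Glue job the prompt above might produce.
# Paths, dataset names, and the DynamoDB table name are placeholders.
from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read two JSON datasets from S3 as DynamicFrames.
orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/orders/"]},
    format="json",
)
accounts = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/accounts/"]},
    format="json",
)

# Join the two datasets on 'accountid'.
joined = Join.apply(orders, accounts, "accountid", "accountid")

# Write the joined records to a DynamoDB table.
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "example-joined-table"},
)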

These new innovations are going to make it easy for you to leverage data to differentiate your generative AI applications and create new value for your customers and your business. We look forward to seeing what you create!


About the authors

G2 Krishnamoorthy is VP of Analytics, leading AWS data lake services, data integration, Amazon OpenSearch Service, and Amazon QuickSight. Prior to his current role, G2 built and ran the Analytics and ML Platform at Facebook/Meta, and built various parts of the SQL Server database, Azure Analytics, and Azure ML at Microsoft.

Rahul Pathak is VP of Relational Database Engines, leading Amazon Aurora, Amazon Redshift, and Amazon QLDB. Prior to his current role, he was VP of Analytics at AWS, where he worked across the entire AWS database portfolio. He has co-founded two companies, one focused on digital media analytics and the other on IP-geolocation.

AWS Clean Rooms Differential Privacy enhances privacy protection of your users data (preview)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-clean-rooms-differential-privacy-enhances-privacy-protection-of-your-users-data-preview/

Starting today, you can use AWS Clean Rooms Differential Privacy (preview) to help protect the privacy of your users with mathematically backed and intuitive controls in a few steps. As a fully managed capability of AWS Clean Rooms, no prior differential privacy experience is needed to help you prevent the reidentification of your users.

AWS Clean Rooms Differential Privacy obfuscates the contribution of any individual’s data in generating aggregate insights in collaborations so that you can run a broad range of SQL queries to generate insights about advertising campaigns, investment decisions, clinical research, and more.

Quick overview on differential privacy
Differential privacy is not new. It is a strong, mathematical definition of privacy compatible with statistical and machine learning based analysis, and has been used by the United States Census Bureau as well as companies with vast amounts of data.

Differential privacy helps with a wide variety of use cases involving large datasets, where adding or removing a few individuals has a small impact on the overall result, such as population analyses using count queries, histograms, benchmarking, A/B testing, and machine learning.

The following illustration shows how differential privacy works when it is applied to SQL queries.

When an analyst runs a query, differential privacy adds a carefully calibrated amount of error (also referred to as noise) to query results at run-time, masking the contribution of individuals while still keeping the query results accurate enough to provide meaningful insights. The noise is carefully fine-tuned to mask the presence or absence of any possible individual in the dataset.

Differential privacy also has another component called privacy budget. The privacy budget is a finite resource consumed each time a query is run and thus controls the number of queries that can be run on your datasets, helping ensure that the noise cannot be averaged out to reveal any private information about an individual. When the privacy budget is fully exhausted, no more queries can be run on your tables until it is increased or refreshed.
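To build intuition for how the noise and the privacy budget interact, here is a small Python sketch. It is not how AWS Clean Rooms implements differential privacy internally; the epsilon values and data are illustrative only.

# Intuition-building sketch of differential privacy for a COUNT query.
# Not the AWS Clean Rooms implementation; epsilon values and data are illustrative.
import numpy as np

class PrivateCounter:
    def __init__(self, total_epsilon: float):
        self.remaining_budget = total_epsilon  # the finite privacy budget

    def noisy_count(self, records: list, epsilon_per_query: float) -> int:
        if epsilon_per_query > self.remaining_budget:
            raise RuntimeError("Privacy budget exhausted: no more queries allowed")
        self.remaining_budget -= epsilon_per_query
        # A count changes by at most 1 when one individual is added or removed
        # (sensitivity = 1), so Laplace noise with scale 1/epsilon masks any
        # single contribution.
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon_per_query)
        return round(len(records) + noise)

counter = PrivateCounter(total_epsilon=1.0)
customers = list(range(3_227_643))                 # toy stand-in for the real table
print(counter.noisy_count(customers, 0.25))        # close to the true count, but noisy
print(counter.noisy_count(customers[:-1], 0.25))   # removing one individual is hidden in the noise
print(counter.remaining_budget)                    # 0.5 of the budget left for future queries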

However, differential privacy is not easy to implement because this technique requires an in-depth understanding of mathematically rigorous formulas and theories to apply it effectively. Configuring differential privacy is also a complex task because customers need to calculate the right level of noise in order to preserve the privacy of their users without negatively impacting the utility of query results.

Customers also want to enable their partners to conduct a wide variety of analyses including highly complex and customized queries on their data. This requirement is hard to support with differential privacy because of the intricate nature of the calculations involved in calibrating the noise while processing various query components such as aggregations, joins, and transformations.

We created AWS Clean Rooms Differential Privacy to help you protect the privacy of your users with mathematically backed controls in a few clicks.

How differential privacy works in AWS Clean Rooms
While differential privacy is quite a sophisticated technique, AWS Clean Rooms Differential Privacy makes it easy for you to apply it and protect the privacy of your users with mathematically backed, flexible, and intuitive controls. You can begin using it with just a few steps after starting or joining an AWS Clean Rooms collaboration as a member with abilities to contribute data.

You create a configured table, which is a reference to your table in the AWS Glue Data Catalog, and choose to turn on differential privacy while adding a custom analysis rule to the configured table.

Next, you associate the configured table to your AWS Clean Rooms collaboration and configure a differential privacy policy in the collaboration to make your table available for querying. You can use a default policy to quickly complete the setup or customize it to meet your specific requirements. As part of this step, you will configure the following:

Privacy budget
Quantified as a value that we call epsilon, the privacy budget controls the level of privacy protection. It is a common, finite resource that is applied for all of your tables protected with differential privacy in the collaboration because the goal is to preserve the privacy of your users whose information can be present in multiple tables. The privacy budget is consumed every time a query is run on your tables. You have the flexibility to increase the privacy budget value any time during the collaboration and automatically refresh it each calendar month.

Noise added per query
Measured in terms of the number of users whose contributions you want to obscure, this input parameter governs the rate at which the privacy budget is depleted.

In general, you need to balance your privacy needs against the number of queries you want to permit and the accuracy of those queries. AWS Clean Rooms makes it easy for you to complete this step by helping you understand the resulting utility you are providing to your collaboration partner. You can also use the interactive examples to understand how your chosen settings would impact the results for different types of SQL queries.

Now that you have successfully enabled differential privacy protection for your data, let’s see AWS Clean Rooms Differential Privacy in action. For this demo, let’s assume I am your partner in the AWS Clean Rooms collaboration.

Here, I’m running a query to count the number of overlapping customers and the result shows there are 3,227,643 values for tv.customer_id.

Now, if I run the same query again after removing records about an individual from the coffee_customers table, it shows a different result: 3,227,604 values for tv.customer_id. This variability in the query results prevents me from identifying individuals by observing the difference between query results.

I can also see the impact of differential privacy, including the remaining queries I can run.

Available for preview
Join this preview and start protecting the privacy of your users with AWS Clean Rooms Differential Privacy. During this preview period, you can use AWS Clean Rooms Differential Privacy wherever AWS Clean Rooms is available. To learn more on how to get started, visit the AWS Clean Rooms Differential Privacy page.

Happy collaborating!
Donnie

AWS Clean Rooms ML helps customers and partners apply ML models without sharing raw data (preview)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-clean-rooms-ml-helps-customers-and-partners-apply-ml-models-without-sharing-raw-data-preview/

Today, we’re introducing AWS Clean Rooms ML (preview), a new capability of AWS Clean Rooms that helps you and your partners apply machine learning (ML) models on your collective data without copying or sharing raw data with each other. With this new capability, you can generate predictive insights using ML models while continuing to protect your sensitive data.

During this preview, AWS Clean Rooms ML introduces its first model specialized to help companies create lookalike segments for marketing use cases. With AWS Clean Rooms ML lookalike, you can train your own custom model, and you can invite partners to bring a small sample of their records to collaborate and generate an expanded set of similar records while protecting everyone’s underlying data.

In the coming months, AWS Clean Rooms ML will release a healthcare model. This will be the first of many models that AWS Clean Rooms ML will support next year.

AWS Clean Rooms ML helps you unlock a variety of opportunities to generate insights. For example:

  • Airlines can take signals about loyal customers, collaborate with online booking services, and offer promotions to users with similar characteristics.
  • Auto lenders and car insurers can identify prospective auto insurance customers who share characteristics with a set of existing lease owners.
  • Brands and publishers can model lookalike segments of in-market customers and deliver highly relevant advertising experiences.
  • Research institutions and hospital networks can find candidates similar to existing clinical trial participants to accelerate clinical studies (coming soon).

AWS Clean Rooms ML lookalike modeling helps you apply an AWS managed, ready-to-use model that is trained in each collaboration to generate lookalike datasets in a few clicks, saving months of development work to build, train, tune, and deploy your own model.

How to use AWS Clean Rooms ML to generate predictive insights
Today I will show you how to use lookalike modeling in AWS Clean Rooms ML and assume you have already set up a data collaboration with your partner. If you want to learn how to do that, check out the AWS Clean Rooms Now Generally Available — Collaborate with Your Partners without Sharing Raw Data post.

With your collective data in the AWS Clean Rooms collaboration, you can work with your partners to apply ML lookalike modeling to generate a lookalike segment. It works by taking a small sample of representative records from your data, creating a machine learning (ML) model, and then applying that model to identify an expanded set of similar records from your business partner’s data.
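The managed model inside AWS Clean Rooms ML is not exposed, so purely to illustrate the concept, here is a toy Python sketch of lookalike expansion: score a large pool of profiles by similarity to a small seed and keep the top matches. The embeddings and sizes are made up.

# Toy illustration of lookalike expansion (not the Clean Rooms ML model itself):
# score a large pool of profiles against a small seed and keep the top matches.
import numpy as np

rng = np.random.default_rng(42)
pool = rng.normal(size=(10_000, 16))                      # partner's profiles as feature vectors
seed = pool[:25] + rng.normal(scale=0.05, size=(25, 16))  # your small sample of representative records

# Score every profile in the pool against the seed centroid (cosine similarity).
centroid = seed.mean(axis=0)
scores = pool @ centroid / (np.linalg.norm(pool, axis=1) * np.linalg.norm(centroid))

# The lookalike segment is the top-scoring slice of the pool.
segment_size = 500
lookalike_segment = np.argsort(scores)[::-1][:segment_size]
print(lookalike_segment[:10])  # indices of the most similar profiles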

The following screenshot shows the overall workflow for using AWS Clean Rooms ML.

By using AWS Clean Rooms ML, you don’t need to build complex and time-consuming ML models on your own. AWS Clean Rooms ML trains a custom, private ML model, which saves months of your time while still protecting your data.

Eliminating the need to share data
As ML models are natively built within the service, AWS Clean Rooms ML helps you protect your dataset and your customers’ information because you don’t need to share your data to build your ML model.

You can specify the training dataset using the AWS Glue Data Catalog table, which contains user-item interactions.

Under Additional columns to train, you can define numerical and categorical data. This is useful if you need to add more features to your dataset, such as the number of seconds spent watching a video, the topic of an article, or the product category of an e-commerce item.
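For a feel of the expected shape, here is a hypothetical slice of a user-item interactions dataset with extra numerical and categorical columns. The column names are invented; in practice this data is registered as an AWS Glue Data Catalog table that you point the service at.

# Hypothetical rows of a user-item interactions training dataset.
# Column names are invented; in practice this data lives in an AWS Glue Data Catalog table.
interactions = [
    {"user_id": "u-1001", "item_id": "video-42", "timestamp": 1700000000,
     "seconds_watched": 812, "item_category": "documentary"},
    {"user_id": "u-1002", "item_id": "video-17", "timestamp": 1700000350,
     "seconds_watched": 95, "item_category": "sports"},
    {"user_id": "u-1001", "item_id": "article-9", "timestamp": 1700000900,
     "seconds_watched": 0, "item_category": "news"},
]

# "seconds_watched" (numerical) and "item_category" (categorical) are the kind of
# extra features you would declare under Additional columns to train.
for row in interactions:
    print(row["user_id"], row["item_id"], row["seconds_watched"], row["item_category"])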

Applying custom-trained AWS-built models
Once you have defined your training dataset, you can now create a lookalike model. A lookalike model is a machine learning model used to find similar profiles in your partner’s dataset without either party having to share their underlying data with each other.

When creating a lookalike model, you need to specify the training dataset. From a single training dataset, you can create many lookalike models. You also have the flexibility to define the date window in your training dataset using Relative range or Absolute range. This is useful when you have data that is constantly updated within AWS Glue, such as articles read by users.

Easy-to-tune ML models
After you create a lookalike model, you need to configure it to use in AWS Clean Rooms collaboration. AWS Clean Rooms ML provides flexible controls that enable you and your partners to tune the results of the applied ML model to garner predictive insights.

On the Configure lookalike model page, you can choose which Lookalike model you want to use and define the Minimum matching seed size you need. This seed size defines the minimum number of profiles in your seed data that overlap with profiles in the training data.

You also have the flexibility to choose whether the partner in your collaboration receives metrics in Metrics to share with other members.

With your lookalike models properly configured, you can now make the ML models available for your partners by associating the configured lookalike model with a collaboration.

Creating lookalike segments
Once the lookalike models have been associated, your partners can now start generating insights by selecting Create lookalike segment and choosing the associated lookalike model for your collaboration.

Here on the Create lookalike segment page, your partners need to provide the Seed profiles. Examples of seed profiles include your top customers or all customers who purchased a specific product. The resulting lookalike segment will contain profiles from the training data that are most similar to the profiles from the seed.

Lastly, your partner will get the Relevance metrics as the result of the lookalike segment using the ML models. At this stage, you can use the Score to make a decision.

Export data and use programmatic API
You also have the option to export the lookalike segment data. Once it’s exported, the data is available in JSON format and you can process this output by integrating with AWS Clean Rooms API and your applications.

Join the preview
AWS Clean Rooms ML is now in preview and available via AWS Clean Rooms in US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Seoul, Singapore, Sydney, Tokyo), and Europe (Frankfurt, Ireland, London). Support for additional models is in the works.

Learn how to apply machine learning with your partners without sharing underlying data on the AWS Clean Rooms ML page.

Happy collaborating!
— Donnie

Announcing Amazon OpenSearch Service zero-ETL integration with Amazon S3 (preview)

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-opensearch-service-zero-etl-integration-with-amazon-s3-preview/

Today we are announcing a preview of Amazon OpenSearch Service zero-ETL integration with Amazon S3, a new way to query operational logs in Amazon S3 and S3-based data lakes without needing to switch between services. You can now analyze infrequently queried data in cloud object stores and simultaneously use the operational analytics and visualization capabilities of OpenSearch Service.

Amazon OpenSearch Service direct queries with Amazon S3 provides a zero-ETL integration to reduce the operational complexity of duplicating data or managing multiple analytics tools by enabling customers to directly query their operational data, reducing costs and time to action. This zero-ETL integration will be configurable within OpenSearch Service, where you can take advantage of various log type templates, including predefined dashboards, and configure data accelerations tailored to that log type. Templates include VPC Flow Logs, Elastic Load Balancing logs, and NGINX logs, and accelerations include skipping indexes, materialized views, and covering indexes.

With direct queries with Amazon S3, you can perform complex queries critical to security forensics and threat analysis that correlate data across multiple data sources, which aids teams in investigating service downtime and security events. After creating an integration, you can start querying your data directly from the OpenSearch Dashboards or OpenSearch API. You can easily audit connections to ensure that they are set up in a scalable, cost-efficient, and secure way.

Getting started with direct queries with Amazon S3
You can easily get started by creating a new Amazon S3 direct query data source for OpenSearch Service through the AWS Management Console or the API. Each new data source uses AWS Glue Data Catalog to manage tables that represent S3 buckets. Once you create a data source, you can configure Amazon S3 tables and data indexing and query data in OpenSearch Dashboards.

1. Create a data source in OpenSearch Service
Before you create a data source, you should have an OpenSearch Service domain with version 2.11 or later and a target Amazon S3 table in AWS Glue Data Catalog with the appropriate IAM permissions. The IAM role will need access to the desired S3 bucket(s) and read and write access to AWS Glue Data Catalog. To learn more about IAM prerequisites, see Creating a data source in the AWS documentation.

Go to the OpenSearch Service console and choose the domain you want to set up a new data source for. On the domain details page, choose the Connections tab below the general information and find the Direct Query section.

To create a new data source, choose Create, input the name of your new data source, select the data source type as Amazon S3 with AWS Glue Data Catalog, and choose the IAM role for your data source.

Once you create a data source, you can go to the OpenSearch Dashboards of the domain, which you use to configure access control, define tables, set up log type–based dashboards for popular log types, and query your data.

2. Configuring your data source in OpenSearch Dashboards
To configure your data source in OpenSearch Dashboards, choose Configure in the console and go to OpenSearch Dashboards. In the left-hand navigation of OpenSearch Dashboards, under Management, choose Data sources. Under Manage data sources, choose the name of the data source you created in the console.

Direct queries from OpenSearch Service to Amazon S3 use Spark tables within AWS Glue Data Catalog. To create a new table that you want to query directly, go to the Query Workbench in the OpenSearch Plugins menu.

Now run the following SQL statement to create the http_logs table, and then run the MSCK REPAIR TABLE mys3.default.http_logs command to update the metadata in the catalog:

CREATE EXTERNAL TABLE IF NOT EXISTS mys3.default.http_logs (
    `@timestamp` TIMESTAMP,
    clientip STRING,
    request STRING,
    status INT,
    size INT,
    year INT,
    month INT,
    day INT)
USING json PARTITIONED BY(year, month, day) OPTIONS (path 's3://mys3/data/http_log/http_logs_partitioned_json_bz2/', compression 'bzip2')

To ensure a fast experience with your data in Amazon S3, you can set up any of three different types of accelerations to index data into OpenSearch Service, such as skipping indexes, materialized views, and covering indexes. To create OpenSearch indexes from external data connections for better performance, choose the Accelerate Table.

  • Skipping indexes allow you to index only the metadata of the data stored in Amazon S3. They help queries run quickly by narrowing down the specific location where the relevant data is stored.
  • Materialized views enable you to use complex queries such as aggregations, which can be used for querying or powering dashboard visualizations. Materialized views ingest data into OpenSearch Service for anomaly detection or geospatial capabilities.
  • Covering indexes ingest all the data from the specified table columns. Covering indexes are the most performant of the three indexing types.

3. Query your data source in OpenSearch Dashboards
After you set up your tables, you can query your data using Discover. You can run a sample SQL query for the http_logs table you created in AWS Glue Data Catalog tables.

To learn more, see Working with Amazon OpenSearch Service direct queries with Amazon S3 in the AWS documentation.

Join the preview
Amazon OpenSearch Service zero-ETL integration with Amazon S3 is now available in preview in the AWS US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland) Regions.

OpenSearch Service charges separately only for the compute, measured in OpenSearch Compute Units (OCUs), needed to query your external data and to maintain indexes in OpenSearch Service. For more information, see Amazon OpenSearch Service Pricing.

Give it a try and send feedback to the AWS re:Post for Amazon OpenSearch Service or through your usual AWS Support contacts.

Channy

Analyze large amounts of graph data to get insights and find trends with Amazon Neptune Analytics

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/introducing-amazon-neptune-analytics-a-high-performance-graph-analytics/

I am happy to announce the general availability of Amazon Neptune Analytics, a new analytics database engine that makes it faster for data scientists and application developers to quickly analyze large amounts of graph data. With Neptune Analytics, you can now quickly load your dataset from Amazon Neptune or your data lake on Amazon Simple Storage Service (Amazon S3), run your analysis tasks in near real time, and optionally terminate your graph afterward.

Graph data enables the representation and analysis of intricate relationships and connections within diverse data domains. Common applications include social networks, where it aids in identifying communities, recommending connections, and analyzing information diffusion. In supply chain management, graphs facilitate efficient route optimization and bottleneck identification. In cybersecurity, they reveal network vulnerabilities and identify patterns of malicious activity. Graph data finds application in knowledge management, financial services, digital advertising, and network security, performing tasks such as identifying money laundering networks in banking transactions and predicting network vulnerabilities.

Since the launch of Neptune in May 2018, thousands of customers have embraced the service for storing their graph data and performing updates and deletions on specific subsets of the graph. However, analyzing data for insights often involves loading the entire graph into memory. For instance, a financial services company aiming to detect fraud may need to load and correlate all historical account transactions.

Performing analyses on extensive graph datasets, such as running common graph algorithms, requires specialized tools. Utilizing separate analytics solutions demands the creation of intricate pipelines to transfer data for processing, which is challenging to operate, time-consuming, and prone to errors. Furthermore, loading large datasets from existing databases or data lakes to a graph analytic solution can take hours or even days.

Neptune Analytics offers a fully managed graph analytics experience. It takes care of the infrastructure heavy lifting, enabling you to concentrate on problem-solving through queries and workflows. Neptune Analytics automatically allocates compute resources according to the graph’s size and quickly loads all the data in memory to run your queries in seconds. Our initial benchmarking shows that Neptune Analytics loads data from Amazon S3 up to 80x faster than existing AWS solutions.

Neptune Analytics supports 5 families of algorithms covering 15 different algorithms, each with multiple variants. For example, we provide algorithms for path-finding, detecting communities (clustering), identifying important data (centrality), and quantifying similarity. Path-finding algorithms are used for use cases such as route planning for supply chain optimization. Centrality algorithms like page rank identify the most influential sellers in a graph. Algorithms like connected components, clustering, and similarity algorithms can be used for fraud-detection use cases to determine whether the connected network is a group of friends or a fraud ring formed by a set of coordinated fraudsters.

Neptune Analytics facilitates the creation of graph applications using openCypher, presently one of the most widely adopted graph query languages. Developers, business analysts, and data scientists appreciate openCypher’s SQL-inspired syntax, finding it familiar and structured for composing graph queries.

Let’s see it at work
As we usually do on the AWS News blog, let’s show how it works. For this demo, I first navigate to Neptune in the AWS Management Console. There is a new Analytics section on the left navigation pane. I select Graphs and then Create graph.

On the Create graph page, I enter the details of my graph analytics database engine. I won’t detail each parameter here; their names are self-explanatory.

Pay attention to Allow from public because, the vast majority of the time, you want to keep your graph available only within the boundaries of your VPC. I also create a Private endpoint to allow private access from machines and services inside my account VPC network.

In addition to network access control, users will need proper IAM permissions to access the graph.

Finally, I enable Vector search to perform similarity search using embeddings in the dataset. The dimension of the vector depends on the large language model (LLM) that you use to generate the embedding.

When I am ready, I select Create graph (not shown here).

After a few minutes, my graph is available. Under Connectivity & security, I take note of the Endpoint. This is the DNS name I will use later to access my graph from my applications.

I can also create Replicas. A replica is a warm standby copy of the graph in another Availability Zone. You might decide to create one or more replicas for high availability. By default, we create one replica, and depending on your availability requirements, you can choose not to create replicas.

Business queries on graph data
Now that the Neptune Analytics graph is available, let’s load and analyze data. For the rest of this demo, imagine I’m working in the finance industry.

I have a dataset obtained from the US Securities and Exchange Commission (SEC). This dataset contains the list of positions held by investors that have more than $100 million in assets. Here is a diagram to illustrate the structure of the dataset I use in this demo.

I want to get a better understanding of the positions held by one investment firm (let’s name it “Seb’s Investments LLC”). I wonder what its top five holdings are and who else holds more than $1 billion in the same companies. I am also curious to know which other investment companies have a portfolio similar to Seb’s Investments LLC.

To start my analysis, I create a Jupyter notebook in the Neptune section of the AWS Management Console. In the notebook, I first define my analytics endpoint and load the data set from an S3 bucket. It takes only 18 seconds to load 17 million records.

Then, I start to explore the dataset using openCypher queries. I start by defining my parameters:

params = {'name': "Seb's Investments LLC", 'quarter': '2023Q4'}

First, I want to know what the top five holdings are for Seb’s Investments LLC in this quarter and who else holds more than $1 billion in the same companies. In openCypher, it translates to the query hereafter. The $name parameter’s value is “Seb’s Investments LLC” and the $quarter parameter’s value is 2023Q4.

MATCH p=(h:Holder)-->(hq1)-[o:owns]->(holding)
WHERE h.name = $name AND hq1.name = $quarter
WITH DISTINCT holding as holding, o ORDER BY o.value DESC LIMIT 5
MATCH (holding)<-[o2:owns]-(hq2)<--(coholder:Holder)
WHERE hq2.name = '2023Q4'
WITH sum(o2.value) AS totalValue, coholder, holding
WHERE totalValue > 1000000000
RETURN coholder.name, collect(holding.name)

Then, I want to know what the other top five companies are that have similar holdings as “Seb’s Investments LLC.” I use the topKByNode() function to perform a vector search.

MATCH (n:Holder)
WHERE n.name = $name
CALL neptune.algo.vectors.topKByNode(n)
YIELD node, score
WHERE score > 0
RETURN node.name LIMIT 5

This query identifies a specific Holder node with the name “Seb’s Investments LLC.” Then, it utilizes the Neptune Analytics custom vector similarity search algorithm on the embedding property of the Holder node to find other nodes in the graph that are similar. The results are filtered to include only those with a positive similarity score, and the query finally returns the names of up to five related nodes.

Pricing and availability
Neptune Analytics is available today in seven AWS Regions: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Singapore, Tokyo), and Europe (Frankfurt, Ireland).

AWS charges for the usage on a pay-as-you-go basis, with no recurring subscriptions or one-time setup fees.

Pricing is based on configurations of memory-optimized Neptune capacity units (m-NCU). Each m-NCU corresponds to one hour of compute and networking capacity and 1 GiB of memory. You can choose configurations starting with 128 m-NCUs and up to 4096 m-NCUs. In addition to m-NCU, storage charges apply for graph snapshots.

I invite you to read the Neptune pricing page for more details.

Neptune Analytics is a new analytics database engine to analyze large graph datasets. It helps you discover insights faster for use cases such as fraud detection and prevention, digital advertising, cybersecurity, transportation logistics, and bioinformatics.

Get started
Log in to the AWS Management Console to give Neptune Analytics a try.

— seb

Vector engine for Amazon OpenSearch Serverless is now available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/vector-engine-for-amazon-opensearch-serverless-is-now-generally-available/

Today we are announcing the general availability of the vector engine for Amazon OpenSearch Serverless with new features. In July 2023, we introduced the preview release of the vector engine for Amazon OpenSearch Serverless, a simple, scalable, and high-performing similarity search capability. The vector engine makes it easy for you to build modern machine learning (ML) augmented search experiences and generative artificial intelligence (generative AI) applications without needing to manage the underlying vector database infrastructure.

You can now store, update, and search billions of vector embeddings with thousands of dimensions in milliseconds. The highly performant similarity search capability of vector engine enables generative AI-powered applications to deliver accurate and reliable results with consistent milliseconds-scale response times.

The vector engine also enables you to optimize and tune results with hybrid search by combining vector search and full-text search in the same query, removing the need to manage and maintain separate data stores or a complex application stack. The vector engine provides a secure, reliable, scalable, and enterprise-ready platform to cost effectively build a prototyping application and then seamlessly scale to production.

You can now get started in minutes with the vector engine by creating a specialized vector engine–based collection, which is a logical grouping of embeddings that works together to support a workload.

The vector engine uses OpenSearch Compute Units (OCUs), a unit of compute capacity, to ingest data and run similarity search queries. One OCU can handle up to 2 million vectors at 128 dimensions, or 500,000 vectors at 768 dimensions, at a 99 percent recall rate.

The vector engine built on OpenSearch Serverless is a highly available service by default. It requires a minimum of four OCUs (2 OCUs for the ingest, including primary and standby, and 2 OCUs for the search with two active replicas across Availability Zones) for the first collection in an account. All subsequent collections using the same AWS Key Management Service (AWS KMS) key can share those OCUs.
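Using only the figures above, a back-of-the-envelope sizing sketch looks like the following. Real capacity planning depends on more factors, so treat this strictly as a rough illustration.

# Back-of-the-envelope OCU estimate using only the figures quoted above
# (2 million vectors per OCU at 128 dimensions, 500,000 at 768 dimensions).
# Real sizing depends on more factors; treat this as a rough illustration.
VECTORS_PER_OCU = {128: 2_000_000, 768: 500_000}

def estimate_search_ocus(num_vectors: int, dimensions: int) -> int:
    """Rough count of search OCUs needed to hold the vectors, before redundancy."""
    capacity = VECTORS_PER_OCU[dimensions]
    return -(-num_vectors // capacity)  # ceiling division

print(estimate_search_ocus(1_000_000, 128))   # 1 OCU before replicas
print(estimate_search_ocus(10_000_000, 768))  # 20 OCUs before replicas
# The service also needs ingest OCUs and, by default, redundant replicas across
# Availability Zones, hence the 4-OCU minimum for the first collection.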

What’s new at GA?
Since the preview, the vector engine for Amazon OpenSearch Serverless became one of the vector database options in the knowledge base of Amazon Bedrock to build generative AI applications using a Retrieval Augmented Generation (RAG) concept.

Here are some new or improved features for this GA release:

Disable redundant replica (development and test focused) option
As we announced in our preview blog post, this feature eliminates the need to have redundant OCUs in another Availability Zone solely for availability purposes. A collection can be deployed with two OCUs – one for indexing and one for search. This cuts costs in half compared to the default deployment with redundant replicas. The reduced cost makes this configuration suitable and economical for development and testing workloads.

With this option, we will still provide durability guarantees since the vector engine persists all the data in Amazon S3, but single-AZ failures would impact your availability.

If you want to disable a redundant replica, uncheck Enable redundancy when creating a new vector search collection.

Fractional OCU for the development and test focused option
Support for fractional OCU billing for development and test focused workloads (that is, the no redundant replica option) reduces the floor price for vector search collections. The vector engine will initially deploy smaller 0.5 OCUs while providing the same capabilities at lower scale and will scale up to a full OCU and beyond to meet your workload demand. This option will further reduce the monthly costs when experimenting with the vector engine.

Automatic scaling for a billion scale
With vector engine’s seamless auto-scaling, you no longer have to reindex for scaling purposes. At preview, we were supporting about 20 million vector embeddings. With the general availability of vector engine, we have raised the limits to support a billion vector scale.

Now available
The vector engine for Amazon OpenSearch Serverless is now available in all AWS Regions where Amazon OpenSearch Serverless is available.

To get started, you can refer to the following resources:

Give it a try and send feedback to AWS re:Post for Amazon OpenSearch Service or through your usual AWS support contacts.

Channy

Announcing zero-ETL integrations with AWS Databases and Amazon Redshift

Post Syndicated from Jyoti Aggarwal original https://aws.amazon.com/blogs/big-data/announcing-zero-etl-integrations-with-aws-databases-and-amazon-redshift/

As customers become more data driven and use data as a source of competitive advantage, they want to easily run analytics on their data to better understand their core business drivers to grow sales, reduce costs, and optimize their businesses. To run analytics on their operational data, customers often build solutions that are a combination of a database, a data warehouse, and an extract, transform, and load (ETL) pipeline. ETL is the process data engineers use to combine data from different sources.

Through customer feedback, we learned that a lot of undifferentiated time and resources go toward building and managing ETL pipelines between transactional databases and data warehouses. At Amazon Web Services (AWS), our goal is to make it easier for our customers to connect to and use all of their data and to do it with the speed and agility they need. We think that by automating the undifferentiated parts, we can help our customers increase the pace of their data-driven innovation by breaking down data silos and simplifying data integration.

Bringing operational data closer to analytics workflows

Customers want flexible data architectures that let them integrate data across their organization to give them a better picture of their customers, streamline operations, and help teams make better, faster decisions. But integrating data isn’t easy. Today, building these pipelines and assembling the architecture to interconnect all the data sources and optimize analytics results is complex, requires highly skilled resources, and produces data that can be erroneous or inconsistent.

Amazon Redshift powers data driven decisions for tens of thousands of customers every day with a fully managed, artificial intelligence (AI)-powered cloud data warehouse that delivers the best price-performance for your analytics workloads.

Zero-ETL is a set of integrations that eliminates the need to build ETL data pipelines. Zero-ETL integrations with Amazon Redshift enable customers to access their data in place using federated queries or ingest it into Amazon Redshift with a fully managed solution from across their databases. With newer features, such as support for autocopy that simplifies and automates file ingestion from Amazon Simple Storage Service (Amazon S3), Redshift Streaming Ingestion capabilities to continuously ingest any amount of streaming data directly into the warehouse, and multi-cluster data sharing architectures that minimize data movement and even provide access to third-party data, Amazon Redshift enables data integration and quick access to data without building manual pipelines.

With all the data integrated and available, Amazon Redshift empowers every data user to run analytics and build AI, machine learning (ML), and generative AI applications. Developers can run Apache Spark applications directly on the data in their warehouse from AWS analytics services, such as Amazon EMR and AWS Glue. They can enrich their datasets by joining operational data replicated through zero-ETL integrations with other sources such as sales and marketing data from SaaS applications and can even create Amazon QuickSight dashboards on top of this data to track key metrics across sales, website analytics, operations, and more—all in one place.

Customers can also use Amazon Redshift data sharing to securely share this data with multiple consumer clusters used by different teams—both within and across AWS accounts—driving a unified view of business and facilitating self-service access to application data within team clusters while maintaining governance over sensitive operational data.

Furthermore, customers can build machine learning models directly on their operational data in Amazon Redshift ML (native integration into Amazon SageMaker) without needing to build any data pipelines and use them to run billions of predictions with SQL commands. Or they can build complex transformations and aggregations on the integrated data using Amazon Redshift materialized views.

We’re excited to share four AWS database zero-ETL integrations with Amazon Redshift:

By bringing different database services closer to analytics, AWS is streamlining access to data and enabling companies to accelerate innovation, create competitive advantage, and maximize the business value extracted from their data assets.

Amazon Aurora zero-ETL integration with Amazon Redshift

The Amazon Aurora zero-ETL integration with Amazon Redshift unifies transactional data from Amazon Aurora with near real-time analytics in Amazon Redshift. This eliminates the burden of building and maintaining custom ETL pipelines between the two systems. Unlike traditional siloed databases that force a tradeoff between performance and analytics, the zero-ETL integration replicates data from multiple Aurora clusters into the same Amazon Redshift warehouse. This enables holistic insights across applications without impacting production workloads. The entire system can be serverless and can auto-scale to handle fluctuations in data volume without infrastructure management.

Amazon Aurora MySQL zero-ETL integration with Amazon Redshift processes over 1 million transactions per minute (an equivalent of 17.5 million insert/update/delete row operations per minute) from multiple Aurora databases and makes them available in Amazon Redshift in less than 15 seconds (p50 latency lag). Figure 1 shows how the Aurora MySQL zero-ETL integration with Amazon Redshift works at a high level.

Figure 1: How the Aurora MySQL zero-ETL integration with Amazon Redshift works at a high level

In their own words, see how one of our customers is using Aurora MySQL zero-ETL integration with Amazon Redshift.

In the retail industry, for example, Infosys wanted to gain faster insights about their business, such as best-selling products and high-revenue stores, based on transactions in a store management system. They used Amazon Aurora MySQL zero-ETL integration with Amazon Redshift to achieve this. With this integration, Infosys replicated Aurora data to Amazon Redshift and created Amazon QuickSight dashboards for product managers and channel leaders in just a few seconds, instead of several hours. Now, as part of Infosys Cobalt and Infosys Topaz blueprints, enterprises can have near real-time analytics on transactional data, which can help them make informed decisions related to store management.

– Sunil Senan, SVP and Global Head of Data, Analytics, and AI, Infosys

To learn more, see Aurora Docs, Amazon Redshift Docs, and the AWS News Blog.

Amazon RDS for MySQL zero-ETL integration with Amazon Redshift

The new Amazon RDS for MySQL integration with Amazon Redshift empowers customers to easily perform analytics on their RDS for MySQL data. With a few clicks, it seamlessly replicates RDS for MySQL data into Amazon Redshift, automatically handling initial data loads, ongoing change synchronization, and schema replication. This eliminates the complexity of traditional ETL jobs. The zero-ETL integration enables workload isolation for optimal performance; RDS for MySQL focuses on high-speed transactions while Amazon Redshift handles analytical workloads. Customers can also consolidate data from multiple sources into Amazon Redshift, such as Aurora MySQL-Compatible Edition and Aurora PostgreSQL-Compatible Edition. This unified view provides holistic insights across applications in one place, delivering significant cost and operational efficiencies.

Figure 2 shows how a customer can use the AWS Management Console for Amazon RDS to get started with creating a zero-ETL integration from RDS for MySQL, Aurora MySQL-Compatible Edition, and Aurora PostgreSQL-Compatible Edition to Amazon Redshift.

Figure 2: How to create a zero-ETL integration using Amazon RDS.

This integration is currently in public preview. Visit the getting started guide to learn more.

Amazon DynamoDB zero-ETL integration with Amazon Redshift

The Amazon DynamoDB zero-ETL integration with Amazon Redshift (limited preview) provides a fully managed solution for making data from DynamoDB available for analytics in Amazon Redshift. With minimal configuration, customers can replicate DynamoDB data into Amazon Redshift for analytics without consuming the DynamoDB Read Capacity Units (RCU). This zero-ETL integration unlocks powerful Amazon Redshift capabilities on DynamoDB data such as high-speed SQL queries, machine learning integrations, materialized views for fast aggregations, and secure data sharing.

This integration is currently in limited preview; use this link to request access.

Integrated services bring us closer to zero-ETL

Our mission is to help customers get the most value from their data, and integrated services are key to this. That’s why we’re building towards a zero-ETL future today. By automating complex ETL processes, data engineers can redirect their focus on creating value from the data. With this modern approach to data management, organizations can accelerate their use of data to streamline operations and fuel business growth.


About the author

Jyoti Aggarwal is a Product Management lead for Amazon Redshift zero-ETL. She brings along an expertise in cloud compute and storage, data warehouse, and B2B/B2C customer experience.

New generative AI capabilities for Amazon DataZone to further simplify data cataloging and discovery (preview)

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-generative-ai-capabilities-for-amazon-datazone-to-further-simplify-data-cataloging-and-discovery-preview/

Today, we are announcing a preview of an automation feature backed by generative artificial intelligence (AI) for Amazon DataZone that will dramatically decrease the amount of time needed to provide context for organizational data. The new feature can automate the traditionally labor-intensive process of data cataloging. Powered by the large language models (LLMs) of Amazon Bedrock, it generates detailed descriptions of data assets and their schemas, and suggests analytical use cases. You can generate a comprehensive business context with a single click.

We heard from customers that data consumers such as data analysts, scientists, and engineers in organizations struggle to understand the data’s relevance with little metadata. As a result, they either spend more time interpreting the data, or they return to data producers with continued questions. So, data producers such as data owners, engineers, and analysts who own the data and make it available for consumers need to manually enter detailed context for higher-priority data to make data shareable and discoverable. This is time-consuming and the number one problem customers have when trying to collate their data in a system for self-service by consumers.

When we launched the general availability of Amazon DataZone in October 2023, we introduced the first feature that brings generative AI capabilities to automate the generation of the table name and column names of a business catalog asset. In the data portal of Amazon DataZone, the green brain icon indicates automatically generated metadata suggestions. You could accept, edit, or reject each suggestion recommended by Amazon DataZone.

What’s new with today’s preview announcement?
Now, in addition to column and table names, you can automatically generate more detailed descriptions of the table and schema, as well as suggested uses.

In the Business Metadata tab in the data portal, when you choose Generate summary, new content will be generated to explain the table and its metadata.

You can also accept, edit, and reject this recommendation.

When you choose the Schema tab, you can also see new Description recommendations as well as the Name. You can review generated metadata and choose to accept, edit, or reject the recommendation.

This new feature will enhance data discoverability and reduce back-and-forth communication between data consumers and producers. In the future, you will have a richer search experience based on these extensive data insights.

Join the preview
The new metadata generation capability is now available in preview in the AWS US East (N. Virginia) and US West (Oregon) Regions. With this new generative AI capability, you can reduce time-to-insight by accelerating data cataloging and boosting data discovery. To learn more, visit the Amazon DataZone: Automate Data Discovery page.

Give it a try and send feedback to AWS re:Post for Amazon DataZone or through your usual AWS Support contacts.

Channy

Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service is now available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-dynamodb-zero-etl-integration-with-amazon-opensearch-service-is-now-generally-available/

Today, we are announcing the general availability of Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service, which lets you perform a search on your DynamoDB data by automatically replicating and transforming it without custom code or infrastructure. This zero-ETL integration reduces the operational burden and cost involved in writing code for a data pipeline architecture, keeping the data in sync, and updating code with frequent application changes, enabling you to focus on your application.

With this zero-ETL integration, Amazon DynamoDB customers can now use the powerful search features of Amazon OpenSearch Service, such as full-text search, fuzzy search, auto-complete, and vector search for machine learning (ML) capabilities to offer new experiences that boost user engagement and improve satisfaction with their applications.

This zero-ETL integration uses Amazon OpenSearch Ingestion to synchronize the data between Amazon DynamoDB and Amazon OpenSearch Service. You choose the DynamoDB table whose data needs to be synchronized and Amazon OpenSearch Ingestion synchronizes the data to an Amazon OpenSearch managed cluster or serverless collection within seconds of it being available.

You can also specify index mapping templates to ensure that your Amazon DynamoDB fields are mapped to the correct fields in your Amazon OpenSearch Service indexes. In addition, you can synchronize data from multiple DynamoDB tables into one Amazon OpenSearch Service managed cluster or serverless collection to offer holistic insights across several applications.
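As an example of what such a template can express, here is a sketch of an index mapping using standard OpenSearch mapping syntax, written as a Python dict. The field names are hypothetical; you would supply your own so that the replicated DynamoDB attributes land in OpenSearch Service with the types you expect.

# Hypothetical index mapping, expressed as a Python dict using standard OpenSearch
# mapping syntax. Field names are examples only.
import json

index_mapping = {
    "mappings": {
        "properties": {
            "order_id":      {"type": "keyword"},                        # exact-match lookups
            "customer_name": {"type": "text"},                           # full-text searchable
            "order_total":   {"type": "double"},
            "created_at":    {"type": "date", "format": "epoch_second"},
        }
    }
}

print(json.dumps(index_mapping, indent=2))  # JSON to plug into your template configuration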

Getting started with this zero-ETL integration
With a few clicks, you can synchronize data from DynamoDB to OpenSearch Service. To create an integration between DynamoDB and OpenSearch Service, choose the Integrations menu in the left pane of the DynamoDB console and the DynamoDB table whose data you want to synchronize.

You must turn on point-in-time recovery (PITR) and the DynamoDB Streams feature. DynamoDB Streams captures item-level changes in your table and pushes the changes to a stream. Choose Turn on for PITR and enable DynamoDB Streams in the Exports and streams tab.
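If you prefer to script this step instead of using the console, a minimal sketch with boto3 looks like the following; the table name is a placeholder.

# Minimal boto3 sketch of the same step: enable point-in-time recovery (PITR)
# and DynamoDB Streams on the source table. The table name is a placeholder.
import boto3

dynamodb = boto3.client("dynamodb")
table_name = "my-source-table"

# Turn on point-in-time recovery.
dynamodb.update_continuous_backups(
    TableName=table_name,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Turn on DynamoDB Streams with old and new images so item-level changes can be replicated.
dynamodb.update_table(
    TableName=table_name,
    StreamSpecification={"StreamEnabled": True, "StreamViewType": "NEW_AND_OLD_IMAGES"},
)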

After turning on PITR and DynamoDB Streams, choose Create to set up an OpenSearch Ingestion pipeline in your account that replicates the data to an OpenSearch Service managed domain.

In the first step, enter a unique pipeline name and set up pipeline capacity and compute resources to automatically scale your pipeline based on the current ingestion workload.

Now you can configure the pipeline using the pre-defined pipeline configuration in YAML format. You can browse resources to look up and paste information to build the pipeline configuration. The pipeline is a combination of a source part for the DynamoDB settings and a sink part for OpenSearch Service.

You must set multiple IAM roles (sts_role_arn) with the necessary permissions to read data from the DynamoDB table and write to an OpenSearch domain. This role is then assumed by OpenSearch Ingestion pipelines to ensure that the right security posture is always maintained when moving the data from source to destination. To learn more, see Setting up roles and users in Amazon OpenSearch Ingestion in the AWS documentation.

After entering all required values, you can validate the pipeline configuration to ensure that your configuration is valid. To learn more, see Creating Amazon OpenSearch Ingestion pipelines in the AWS documentation.

Take a few minutes to set up the OpenSearch Ingestion pipeline, and you can see your integration is completed in the DynamoDB table.

Now you can search synchronized items in the OpenSearch Dashboards.

Things to know
Here are a couple of things that you should know about this feature:

  • Custom schema – You can specify your custom data schema along with the index mappings used by OpenSearch Ingestion when writing data from Amazon DynamoDB to OpenSearch Service. This experience is added to the console within Amazon DynamoDB so that you have full control over the format of indices that are created on OpenSearch Service.
  • Pricing – There will be no additional cost to use this feature apart from the cost of the existing underlying components. Note that Amazon OpenSearch Ingestion charges OpenSearch Compute Units (OCUs) which will be used to replicate data between Amazon DynamoDB and Amazon OpenSearch Service. Furthermore, this feature uses Amazon DynamoDB streams for the change data capture (CDC) and you will incur the standard costs for Amazon DynamoDB Streams.
  • Monitoring – You can monitor the state of the pipelines by checking the status of the integration on the DynamoDB console or using the OpenSearch Ingestion dashboard. Additionally, you can use Amazon CloudWatch to provide real-time metrics and logs, which lets you set up alerts when user-defined thresholds are breached.

Now available
Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service is now generally available in all AWS Regions where OpenSearch Ingestion is available today.

Channy

New Amazon Q in QuickSight uses generative AI assistance for quicker, easier data insights (preview)

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-amazon-q-in-quicksight-uses-generative-ai-assistance-for-quicker-easier-data-insights-preview/

Today, I’m happy to share that Amazon Q in QuickSight is available for preview. Now you can experience the Generative BI capabilities in Amazon QuickSight announced on July 26, as well as two additional capabilities for business users.

Turning insights into impact faster with Amazon Q in QuickSight
With this announcement, business users can now generate compelling, sharable stories examining their data, see executive summaries of dashboards that surface key insights from data in seconds, and confidently answer questions about data not answered by dashboards and reports with a reimagined Q&A experience.

Before we go deeper into each capability, here’s a quick summary:

  • Stories — This is a new and visually compelling way to present and share insights. Stories can be automatically generated in minutes using natural language prompts, customized using point-and-click options, and shared securely with others.
  • Executive summaries — With this new capability, Amazon Q helps you to understand key highlights in your dashboard.
  • Data Q&A — This capability provides a new and easy-to-use natural language Q&A experience to help you get answers to questions beyond what is available in existing dashboards and reports.

To get started, you need to enable Preview Q Generative Capabilities in Preview manager.

Once enabled, you’re ready to experience what Amazon Q in QuickSight brings for business users and business analysts building dashboards.

Stories automatically builds formatted narratives
Business users often need to share their data findings with others to inform team decisions; this has historically involved taking data out of the business intelligence (BI) system. Stories are a new feature that enables business users to create beautifully formatted narratives that describe data and include visuals, images, and text in document or slide format, directly within QuickSight, where they can easily be shared with others.

Now, business users can use natural language to ask Amazon Q to build a story about their data by starting from the Amazon Q Build menu on an Amazon QuickSight dashboard. Amazon Q extracts data insights and statistics from selected visuals, then uses large language models (LLMs) to build a story in multiple parts, examining what the data may mean to the business and suggesting ideas to achieve specific goals.

For example, a sales manager can ask, “Build me a story about overall sales performance trends. Break down data by product and region. Suggest some strategies for improving sales.” Or, “Write a marketing strategy that uses regional sales trends to uncover opportunities that increase revenue.” Amazon Q will build a story exploring specific data insights, including strategies to grow sales.

Once built, business users get point-and-click tools, augmented with artificial intelligence (AI)-driven rewriting capabilities, to customize stories using a rich text editor: refining the message, adding ideas, and highlighting important details.

Stories can also be easily and securely shared with other QuickSight users by email.

Executive summaries deliver a quick snapshot of important information
Executive summaries are now available with a single click using the Amazon Q Build menu in Amazon QuickSight. Amazon QuickSight automatically determines interesting facts and statistics, then uses LLMs to write about interesting trends.

This new capability saves time in examining detailed dashboards by providing an at-a-glance view of key insights described using natural language.

The executive summaries feature provides two advantages. First, it helps business users surface key insights without having to browse through tens of visuals on the dashboard and interpret the changes in each. Second, it enables readers to find key insights in the context of dashboards and reports with minimal effort.

New data Q&A experience
Once an interesting insight is discovered, business users frequently need to dig in to understand data more deeply than they can from existing dashboards and reports. Natural language query (NLQ) solutions designed to solve this problem frequently expect that users already know what fields may exist or how they should be combined to answer business questions. However, business users aren’t always experts in underlying data schemas, and their questions frequently come in more general terms, like “How were sales last week in NY?” Or, “What’s our top campaign?”

The new Q&A experience, accessed from within dashboards and reports, helps business users confidently answer questions about their data. It includes AI-suggested questions, a profile of what data can be asked about, and automatically generated multi-visual answers with narrative summaries that explain the data context.

Furthermore, Amazon Q brings the ability to answer vague questions and offer alternatives for specific data. For example, customers can ask a vague question, such as "Top products," and Amazon Q will provide an answer that breaks down products by sales and offers alternative breakdowns by customer count and by profit. Amazon Q explains the answer context in a narrative that summarizes total sales and the number of products and calls out the sales for the top product.

Customers can search for specific data values, or even a single word such as the product name "contactmatcher." Amazon Q returns a complete set of data related to that product and provides a natural language breakdown explaining important insights, like total units sold. Specific visuals from the answers can also be added to a pinboard for easy future access.

Watch the demo
To see these new capabilities in action, have a look at the demo.

Things to Know
Here are a few additional things that you need to know:

Join the preview
Amazon Q in QuickSight product page

Happy building!
— Donnie

Improve performance of workloads containing repetitive scan filters with multidimensional data layout sort keys in Amazon Redshift

Post Syndicated from Yanzhu Ji original https://aws.amazon.com/blogs/big-data/improve-performance-of-workloads-containing-repetitive-scan-filters-with-multidimensional-data-layout-sort-keys-in-amazon-redshift/

Amazon Redshift, the most widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post covers one such new feature—the multidimensional data layout sort key.

Amazon Redshift now improves your query performance by supporting multidimensional data layout sort keys, which is a new type of sort key that sorts a table’s data by filter predicates instead of physical columns of the table. Multidimensional data layout sort keys will significantly improve the performance of table scans, especially when your query workload contains repetitive scan filters.

Amazon Redshift already provides the capability of automatic table optimization (ATO), which automatically optimizes the design of tables by applying sort and distribution keys without the need for administrator intervention. In this post, we introduce multidimensional data layout sort keys as an additional capability offered by ATO and fortified by Amazon Redshift’s sort key advisor algorithm.

Multidimensional data layout sort keys

When you define a table with the AUTO sort key, Amazon Redshift ATO will analyze your query history and automatically select either a single-column sort key or multidimensional data layout sort key for your table, based on which option is better for your workload. When multidimensional data layout is selected, Amazon Redshift will construct a multidimensional sort function that co-locates rows that are typically accessed by the same queries, and the sort function is subsequently used during query runs to skip data blocks and even skip scanning the individual predicate columns.

Consider the following user query, which is a dominant query pattern in the user’s workload:

SELECT season, sum(metric2) AS "__measure__0"
FROM titles
WHERE lower(subregion) like '%United States%'
GROUP BY 1
ORDER BY 1;

Amazon Redshift stores data for each column in 1 MB disk blocks and stores the minimum and maximum values in each block as part of the table’s metadata. If a query uses a range-restricted predicate, Amazon Redshift can use the minimum and maximum values to rapidly skip over large numbers of blocks during table scans. However, this query’s filter on the subregion column can’t be used to determine which blocks to skip based on minimum and maximum values, and as a result, Amazon Redshift scans all rows from the titles table:

SELECT table_name, input_rows, step_attribute
FROM sys_query_detail
WHERE query_id = 123456789;

When the user’s query was run with titles using a single-column sort key on subregion, the result of the preceding query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 2164081640 | 
(1 row)

This shows that the table scan read 2,164,081,640 rows.

To improve scans on the titles table, Amazon Redshift might automatically decide to use a multidimensional data layout sort key. All rows that satisfy the lower(subregion) like '%United States%' predicate would be co-located to a dedicated region of the table, and therefore Amazon Redshift will only scan data blocks that satisfy the predicate.

When the user’s query is run with titles using a multidimensional data layout sort key that includes lower(subregion) like '%United States%' as a predicate, the result of the sys_query_detail query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 152324046  | multi-dimensional
(1 row)

This shows that the table scan read 152,324,046 rows, which is only 7% of the original, and it used the multidimensional data layout sort key.

Note that this example uses a single query to showcase the multidimensional data layout feature, but Amazon Redshift will consider all the queries running against the table and can create multiple regions to satisfy the most commonly run predicates.

Let’s take another example, with more complex predicates and multiple queries this time.

Imagine having a table items (cost int, available int, demand int) with four rows as shown in the following example.

#id  cost  available  demand
 1     4       3        3
 2     2      23        6
 3     5       4        5
 4     1       1        2

Your dominant workload consists of two queries:

  • 70% of queries follow this pattern:
    select * from items where cost > 3 and available < demand

  • 20% of queries follow this pattern:
    select avg(cost) from items where available < demand

With traditional sorting techniques, you might choose to sort the table over the cost column, such that the evaluation of cost > 3 will benefit from the sort. So, the items table after sorting using a single cost column will look like the following.

#id  cost  available  demand
Region #1, with cost <= 3
 4     1       1        2
 2     2      23        6
Region #2, with cost > 3
 1     4       3        3
 3     5       4        5

By using this traditional sort, we can immediately exclude the top two rows, with ID 4 and ID 2, because they don't satisfy cost > 3.

On the other hand, with a multidimensional data layout sort key, the table will be sorted based on a combination of the two commonly occurring predicates in the user’s workload, which are cost > 3 and available < demand. As a result, the table’s rows are sorted into four regions.

#id  cost  available  demand
Region #1, with cost <= 3 and available < demand
 4     1       1        2
Region #2, with cost <= 3 and available >= demand
 2     2      23        6
Region #3, with cost > 3 and available < demand
 3     5       4        5
Region #4, with cost > 3 and available >= demand
 1     4       3        3

This concept is even more powerful when applied to entire blocks instead of single rows, when applied to complex predicates that use operators not suitable for traditional sorting techniques (such as like), and when applied to more than two predicates.
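To make the region assignment concrete, here is a small standalone Python sketch that applies the same two predicates to the four sample rows and groups them by predicate outcome; it is an illustration of the idea only, not of Amazon Redshift's internal sort function.

# Illustration only: co-locate rows by the outcomes of the two common predicates.
rows = [
    {"id": 1, "cost": 4, "available": 3, "demand": 3},
    {"id": 2, "cost": 2, "available": 23, "demand": 6},
    {"id": 3, "cost": 5, "available": 4, "demand": 5},
    {"id": 4, "cost": 1, "available": 1, "demand": 2},
]

def region(row):
    # Each distinct combination of predicate outcomes maps to its own region.
    return (row["cost"] > 3, row["available"] < row["demand"])

for row in sorted(rows, key=region):
    print(region(row), row)

# A scan with "WHERE cost > 3 AND available < demand" only needs the rows
# (and, in practice, the 1 MB blocks) whose region is (True, True).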

System tables

The following Amazon Redshift system tables will show users if multidimensional data layouts are used on their tables and queries:

  • To determine if a particular table is using a multidimensional data layout sort key, you can check whether sortkey1 in svv_table_info is equal to AUTO(SORTKEY(padb_internal_mddl_key_col)).
  • To determine if a particular query uses multidimensional data layout to accelerate table scans, you can check step_attribute in the sys_query_detail view. The value will be equal to multi-dimensional if the table’s multidimensional data layout sort key was used during the scan.
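If you prefer to run these checks programmatically, the following sketch submits both queries through the Amazon Redshift Data API with boto3; the workgroup and database names are placeholders for your environment.

# Minimal sketch: run the two checks above through the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data")

checks = {
    "table_sort_key": "SELECT \"table\", sortkey1 FROM svv_table_info WHERE \"table\" = 'titles';",
    "scan_step_attribute": ("SELECT table_name, input_rows, step_attribute "
                            "FROM sys_query_detail WHERE query_id = 123456789;"),
}

for name, sql in checks.items():
    resp = rsd.execute_statement(
        WorkgroupName="my-serverless-workgroup",  # or ClusterIdentifier="my-cluster"
        Database="dev",
        Sql=sql,
    )
    print(name, resp["Id"])  # poll describe_statement / get_statement_result for the output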

Performance benchmarks

We performed internal benchmark testing for multiple workloads with repetitive scan filters and saw that introducing multidimensional data layout sort keys produced the following results:

  • A 74% total runtime reduction compared to having no sort key.
  • A 40% total runtime reduction compared to having the best single-column sort key on each table.
  • An 80% reduction in total rows read from tables compared to having no sort key.
  • A 47% reduction in total rows read from tables compared to having the best single-column sort key on each table.

Feature comparison

With the introduction of multidimensional data layout sort keys, your tables can now be sorted by expressions based on the commonly occurring filter predicates in your workload. The following table provides a feature comparison of Amazon Redshift against two competitors.

Feature                                                              Amazon Redshift   Competitor A   Competitor B
Support for sorting on columns                                       Yes               Yes            Yes
Support for sorting by expression                                    Yes               Yes            No
Automatic column selection for sorting                               Yes               No             Yes
Automatic expression selection for sorting                           Yes               No             No
Automatic selection between column sorting and expression sorting    Yes               No             No
Automatic use of sorting properties for expressions during scans     Yes               No             No

Considerations

Keep in mind the following when using a multidimensional data layout:

  • Multidimensional data layout is enabled when you set your table as SORTKEY AUTO.
  • Amazon Redshift Advisor will automatically choose either a single-column sort key or multidimensional data layout for the table by analyzing your historical workload.
  • Amazon Redshift ATO adjusts the multidimensional data layout sorting results based on how your ongoing query workload interacts with the table.
  • Amazon Redshift ATO maintains multidimensional data layout sort keys the same way as it currently does for existing sort keys. Refer to Working with automatic table optimization for more details on ATO.
  • Multidimensional data layout sort keys will work with both provisioned clusters and serverless workgroups.
  • Multidimensional data layout sort keys will work with your existing data as long as the AUTO SORTKEY is enabled on your table and a workload with repetitive scan filters is detected. The table will be reorganized based on the results of the multidimensional sort function.
  • To disable multidimensional data layout sort keys for a table, use alter table: ALTER TABLE table_name ALTER SORTKEY NONE. This disables the AUTO sort key feature on the table.
  • Multidimensional data layout sort keys are preserved when restoring or migrating your provisioned cluster to a serverless cluster or vice versa.

Conclusion

In this post, we showed that multidimensional data layout sort keys can significantly improve query runtime performance for workloads where dominant queries have repetitive scan filters.

To create a preview cluster from the Amazon Redshift console, navigate to the Clusters page and choose Create preview cluster. You can create a cluster in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm) Regions and test your workloads.

We would love to hear your feedback on this new feature and look forward to your comments on this post.


About the authors

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Jialin Ding is an Applied Scientist in the Learned Systems Group, specializing in applying machine learning and optimization techniques to improve the performance of data systems such as Amazon Redshift.

Amazon MSK now provides up to 29% more throughput and up to 24% lower costs with AWS Graviton3 support

Post Syndicated from Sai Maddali original https://aws.amazon.com/blogs/big-data/amazon-msk-now-provides-up-to-29-more-throughput-and-up-to-24-lower-costs-with-aws-graviton3-support/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data.

Today, we’re excited to bring the benefits of Graviton3 to Kafka workloads, with Amazon MSK now offering M7g instances for new MSK provisioned clusters. AWS Graviton processors are custom Arm-based processors built by AWS to deliver the best price-performance for your cloud workloads. For example, when running an MSK provisioned cluster using M7g.4xlarge instances, you can achieve up to 27% reduction in CPU usage and up to 29% higher write and read throughput compared to M5.4xlarge instances. These performance improvements, along with M7g’s lower prices provide up to 24% in compute cost savings over M5 instances.

In February 2023, AWS launched new Graviton3-based M7g instances. M7g instances are equipped with DDR5 memory, which provides up to 50% higher memory bandwidth than the DDR4 memory used in previous generations. M7g instances also deliver up to 25% higher storage throughput and up to 88% increase in network throughput compared to similar sized M5 instances to deliver price-performance benefits for Kafka workloads. You can read more about M7g features in New Graviton3-Based General Purpose (m7g) and Memory-Optimized (r7g) Amazon EC2 Instances.

Here are the specs for the M7g instances on MSK:

Name vCPUs Memory Network Bandwidth Storage Bandwidth
M7g.large 2 8 GiB up to 12.5 Gbps up to 10 Gbps
M7g.xlarge 4 16 GiB up to 12.5 Gbps up to 10 Gbps
M7g.2xlarge 8 32 GiB up to 15 Gbps up to 10 Gbps
M7g.4xlarge 16 64 GiB up to 15 Gbps up to 10 Gbps
M7g.8xlarge 32 128 GiB 15 Gbps 10 Gbps
M7g.12xlarge 48 192 GiB 22.5 Gbps 15 Gbps
M7g.16xlarge 64 256 GiB 30 Gbps 20 Gbps

M7g instances on Amazon MSK

Organizations are adopting Amazon MSK to capture and analyze data in real time, run machine learning (ML) workflows, and build event-driven architectures. Amazon MSK enables you to reduce operational overhead and run your applications with higher availability and durability. It also offers consistent price-performance improvements with capabilities such as Tiered Storage. With compute making up a large portion of Kafka costs, customers wanted a way to optimize it further and saw Graviton instances as the quickest path. Amazon MSK has fully tested and validated M7g on all Kafka versions starting with version 2.8.2, making it ready for you to run critical workloads and benefit from Graviton3 cost savings.

You can get started by provisioning new clusters with the Graviton3-based M7g instances as the broker type using the AWS Management Console, APIs via the AWS SDK, and the AWS Command Line Interface (AWS CLI). M7g instances support all Amazon MSK and Kafka features, making it straightforward for you to run all your existing Kafka workloads with minimal changes. Amazon MSK supports Graviton3-based M7g instances from large through 16xlarge sizes to run all Kafka workloads.
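For example, here is a minimal boto3 sketch that provisions a new cluster on M7g brokers; the Kafka version, subnets, security group, and storage size are placeholders that you would replace with values from your own environment.

# Minimal sketch: create an MSK provisioned cluster with Graviton3 (m7g) brokers.
import boto3

msk = boto3.client("kafka")

response = msk.create_cluster_v2(
    ClusterName="m7g-demo-cluster",
    Provisioned={
        "KafkaVersion": "3.5.1",                      # any supported version from 2.8.2 onward
        "NumberOfBrokerNodes": 3,
        "BrokerNodeGroupInfo": {
            "InstanceType": "kafka.m7g.4xlarge",
            "ClientSubnets": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],   # placeholders
            "SecurityGroups": ["sg-0123456789abcdef0"],                    # placeholder
            "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 1000}},       # GiB per broker
        },
    },
)
print(response["ClusterArn"])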

Let’s take the M7g instances on MSK provisioned clusters for a test drive and see how it compares with Amazon MSK M5 instances.

M7g instances in action

Customers run a wide variety of workloads on Amazon MSK; some are latency sensitive, and some are throughput bound. In this post, we focus on M7g performance impact on throughput-bound workloads. M7g comes with an increase in network and storage throughput, providing a higher throughput per broker compared to an M5-based cluster.

To understand the implications, let's look at how Kafka uses available throughput for writing or reading data. Every broker in the MSK cluster comes with a bounded storage and network throughput entitlement. Writes in Kafka predominantly consume both storage and network throughput, whereas reads consume mostly network throughput. This is because a Kafka consumer typically reads real-time data from the page cache and only occasionally goes to disk to process older data. Therefore, the overall throughput gains also vary based on the workload's write-to-read throughput ratio.

Let's look at the throughput gains with an example. Our setup includes one MSK cluster with M7g.4xlarge instances and another with M5.4xlarge instances, each with three nodes across three different Availability Zones. We enabled TLS encryption and AWS Identity and Access Management (IAM) authentication and used a replication factor of 3 on both the M7g and M5 MSK clusters. We also applied Amazon MSK best practices for broker configurations, including num.network.threads = 8 and num.io.threads = 16. On the client side for writes, we optimized the batch size with appropriate linger.ms and batch.size configurations. For the workload, we used 6 topics, each with 64 partitions (384 per broker). For ingestion, we generated load with an average message size of 512 bytes and one consumer group per topic. The amount of load sent to the two clusters was identical.
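To show what those client-side settings look like in practice, here is a hedged sketch of a producer using the kafka-python library; the broker address is a placeholder, the values should be tuned for your workload, and the additional configuration needed for IAM authentication is omitted for brevity.

# Sketch of the client-side batching settings referenced above (kafka-python).
import os
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9094"],  # placeholder
    security_protocol="SSL",   # TLS in transit, as in the benchmark setup
    acks="all",                # wait for all in-sync replicas (replication factor 3)
    linger_ms=5,               # allow a short delay so batches can fill
    batch_size=256 * 1024,     # larger batches for throughput-bound workloads
)

payload = os.urandom(512)      # ~512-byte messages, matching the test workload
for _ in range(10_000):
    producer.send("benchmark-topic-1", payload)
producer.flush()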

As we ingest more data into the MSK cluster, the M7g.4xlarge instance supports higher throughput per broker, as shown in the following graph. After an hour of consistent writes, M7g.4xlarge brokers support up to 54 MB/s of write throughput vs. 40 MB/s with M5-based brokers, which represents a 29% increase.

We also made another important observation: M7g-based brokers consume far less CPU than M5-based brokers, even though they support 29% higher throughput. As seen in the following chart, CPU utilization of an M7g-based broker averages 40%, whereas an M5-based broker averages 47%.

As covered previously, customers may see different performance improvements based on the number of consumer groups, batch sizes, and instance size. We recommend referring to MSK Sizing and Pricing to calculate M7g performance gains for your use case, or creating a cluster based on M7g instances and benchmarking the gains on your own.

Lower costs, less operational burden, and higher resiliency

Since its launch, Amazon MSK has made it cost-effective to run your Kafka workloads, while still improving overall resiliency. Since day 1, you have been able to run brokers in multiple Availability Zones without worrying about additional networking costs. In October 2022, we launched Tiered Storage, which provides virtually unlimited storage at up to 50% lower costs. When you use Tiered Storage, you not only save on overall storage cost but also improve the overall availability and elasticity of your cluster.

Continuing down this path, we are now reducing compute costs for customers while still providing performance improvements. With M7g instances, Amazon MSK provides up to 24% savings on compute costs compared to similar sized M5 instances. When you move to Amazon MSK, you can not only lower your operational overhead using features such as Amazon MSK Connect, Amazon MSK Replicator, and automatic Kafka version upgrades, but also improve overall resiliency and reduce your infrastructure costs.

Pricing and Regions

M7g instances on Amazon MSK are available today in the US (Ohio, N. Virginia, N. California, Oregon), Asia Pacific (Hyderabad, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), and EU (Ireland, London, Spain, Stockholm) Regions.

Refer to Amazon MSK pricing to learn more about pricing for Graviton3-based instances on Amazon MSK.

Summary

In this post, we discussed the performance gains achieved while using Graviton-based M7g instances. These instances can provide significant improvement in read and write throughput compared to similar sized M5 instances for Amazon MSK workloads. To get started, create a new cluster with M7g brokers using the AWS Management Console, and read our documentation for more information.


About the Authors

Sai Maddali is a Senior Manager Product Management at AWS who leads the product team for Amazon MSK. He is passionate about understanding customer needs, and using technology to deliver services that empowers customers to build innovative applications. Besides work, he enjoys traveling, cooking, and running.

Umesh is a Streaming Solutions Architect at AWS. He works with AWS customers to design and build real-time data processing systems. He has 13 years of experience in software engineering, including architecting, designing, and developing data analytics systems.

Lanre Afod is a Solutions Architect focused on Global Financial Services at AWS, passionate about helping customers deploy secure, scalable, highly available, and resilient architectures within the AWS Cloud.

Use Amazon EMR with S3 Access Grants to scale Spark access to Amazon S3

Post Syndicated from Damon Cortesi original https://aws.amazon.com/blogs/big-data/use-amazon-emr-with-s3-access-grants-to-scale-spark-access-to-amazon-s3/

Amazon EMR is pleased to announce integration with Amazon Simple Storage Service (Amazon S3) Access Grants that simplifies Amazon S3 permission management and allows you to enforce granular access at scale. With this integration, you can scale job-based Amazon S3 access for Apache Spark jobs across all Amazon EMR deployment options and enforce granular Amazon S3 access for better security posture.

In this post, we'll walk through a few different scenarios of how to use Amazon S3 Access Grants. Before we get started on walking through the Amazon EMR and Amazon S3 Access Grants integration, we'll set up and configure S3 Access Grants. Then, we'll use the AWS CloudFormation template below to create an Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) cluster, an EMR Serverless application, and two different job roles.

After the setup, we’ll run a few scenarios of how you can use Amazon EMR with S3 Access Grants. First, we’ll run a batch job on EMR on Amazon EC2 to import CSV data and convert to Parquet. Second, we’ll use Amazon EMR Studio with an interactive EMR Serverless application to analyze the data. Finally, we’ll show how to set up cross-account access for Amazon S3 Access Grants. Many customers use different accounts across their organization and even outside their organization to share data. Amazon S3 Access Grants make it easy to grant cross-account access to your data even when filtering by different prefixes.

Besides this post, you can learn more about Amazon S3 Access Grants from Scaling data access with Amazon S3 Access Grants.

Prerequisites

Before you launch the AWS CloudFormation stack, ensure you have the following:

  • An AWS account that provides access to AWS services
  • The latest version of the AWS Command Line Interface (AWS CLI)
  • An AWS Identity and Access Management (AWS IAM) user with an access key and secret key to configure the AWS CLI, and permissions to create an IAM role, IAM policies, and stacks in AWS CloudFormation
  • A second AWS account if you wish to test the cross-account functionality

Walkthrough

Create resources with AWS CloudFormation

In order to use Amazon S3 Access Grants, you’ll need a cluster with Amazon EMR 6.15.0 or later. For more information, see the documentation for using Amazon S3 Access Grants with an Amazon EMR cluster, an Amazon EMR on EKS cluster, and an Amazon EMR Serverless application. For the purpose of this post, we’ll assume that you have two different types of data access users in your organization—analytics engineers with read and write access to the data in the bucket and business analysts with read-only access. We’ll utilize two different AWS IAM roles, but you can also connect your own identity provider directly to IAM Identity Center if you like.

Here’s the architecture for this first portion. The AWS CloudFormation stack creates the following AWS resources:

  • A Virtual Private Cloud (VPC) stack with private and public subnets to use with EMR Studio, route tables, and Network Address Translation (NAT) gateway.
  • An Amazon S3 bucket for EMR artifacts like log files, Spark code, and Jupyter notebooks.
  • An Amazon S3 bucket with sample data to use with S3 Access Grants.
  • An Amazon EMR cluster configured to use runtime roles and S3 Access Grants.
  • An Amazon EMR Serverless application configured to use S3 Access Grants.
  • An Amazon EMR Studio where users can login and create workspace notebooks with the EMR Serverless application.
  • Two AWS IAM roles we’ll use for our EMR job runs: one for Amazon EC2 with write access and another for Serverless with read access.
  • One AWS IAM role that will be used by S3 Access Grants to access bucket data (i.e., the Role to use when registering a location with S3 Access Grants. S3 Access Grants use this role to create temporary credentials).

To get started, complete the following steps:

  1. Choose Launch Stack:
  2. Accept the defaults and select I acknowledge that this template may create IAM resources.

The AWS CloudFormation stack takes approximately 10–15 minutes to complete. Once the stack is finished, go to the outputs tab where you will find information necessary for the following steps.

Create Amazon S3 Access Grants resources

First, we're going to create the Amazon S3 Access Grants resources in our account. We create an S3 Access Grants instance, an S3 Access Grants location that refers to the data bucket created by the AWS CloudFormation stack (and is only accessible by our data bucket AWS IAM role), and grants that give different levels of access to our reader and writer roles.

To create the necessary S3 Access Grants resources, run the following AWS CLI commands as an administrative user and replace the values between the angle brackets (<>) with the outputs from your CloudFormation stack.

aws s3control create-access-grants-instance \
  --account-id <YOUR_ACCOUNT_ID>

Next, we create a new S3 Access Grants location. What is a location? Amazon S3 Access Grants works by vending AWS IAM credentials with access scoped to a particular S3 prefix. An S3 Access Grants location is associated with an AWS IAM role from which these temporary sessions are created.

In our case, we're going to scope the AWS IAM role to the bucket created with our AWS CloudFormation stack and give access to the data bucket role created by the stack. Go to the Outputs tab to find the values to replace in the following code snippet:

aws s3control create-access-grants-location \
  --account-id <YOUR_ACCOUNT_ID> \
  --location-scope "s3://<DATA_BUCKET>/" \
  --iam-role-arn <DATA_BUCKET_ROLE>

Note the AccessGrantsLocationId value in the response. We’ll need that for the next steps where we’ll walk through creating the necessary S3 Access Grants to limit read and write access to your bucket.

  • For the read/write user, use s3control create-access-grant to allow READWRITE access to the "output/*" prefix:
    aws s3control create-access-grant \
      --account-id <YOUR_ACCOUNT_ID> \
      --access-grants-location-id <LOCATION_ID_FROM_PREVIOUS_COMMAND> \
      --access-grants-location-configuration S3SubPrefix="output/*" \
      --permission READWRITE \
      --grantee GranteeType=IAM,GranteeIdentifier=<DATA_WRITER_ROLE>

  • For the read user, use s3control create-access-grant again to allow only READ access to the same prefix:
    aws s3control create-access-grant \
      --account-id <YOUR_ACCOUNT_ID> \
      --access-grants-location-id <LOCATION_ID_FROM_PREVIOUS_COMMAND> \
      --access-grants-location-configuration S3SubPrefix="output/*" \
      --permission READ \
      --grantee GranteeType=IAM,GranteeIdentifier=<DATA_READER_ROLE>

Demo Scenario 1: Amazon EMR on EC2 Spark Job to generate Parquet data

Now that we've got our Amazon EMR environments set up and granted access to our roles via S3 Access Grants, it's important to note that the two AWS IAM roles for our EMR cluster and EMR Serverless application have an IAM policy that only allows access to our EMR artifacts bucket. They have no IAM access to our S3 data bucket and instead use S3 Access Grants to fetch short-lived credentials scoped to the bucket and prefix. Specifically, the roles are granted s3:GetDataAccess and s3:GetDataAccessGrantsInstanceForPrefix permissions to request access via the specific S3 Access Grants instance created in our region. This allows you to easily manage your S3 access in one place in a highly scoped and granular fashion that enhances your security posture. By combining S3 Access Grants with job roles on EMR on Amazon Elastic Kubernetes Service (Amazon EKS) and EMR Serverless as well as runtime roles for Amazon EMR steps beginning with EMR 6.7.0, you can easily manage access control for individual jobs or queries. S3 Access Grants are available on EMR 6.15.0 and later. Let's first run a Spark job on EMR on EC2 as our analytics engineer to convert some sample data into Parquet.
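To illustrate the mechanism (you normally never call this yourself, because the EMR integration does it for you), the following sketch shows the credential-vending request that corresponds to the s3:GetDataAccess permission; the account ID and target prefix are placeholders.

# Illustration of S3 Access Grants credential vending (done automatically by EMR).
import boto3

s3control = boto3.client("s3control")

resp = s3control.get_data_access(
    AccountId="111122223333",                  # placeholder account ID
    Target="s3://my-data-bucket/output/*",     # placeholder prefix
    Permission="READ",
    DurationSeconds=3600,
)

creds = resp["Credentials"]                    # short-lived, scoped to the matched grant
print(resp["MatchedGrantTarget"])
print(creds["AccessKeyId"], creds["Expiration"])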

For this, use the sample code provided in converter.py. Download the file and copy it to the EMR_ARTIFACTS_BUCKET created by the AWS CloudFormation stack. We'll submit our job with the ReadWrite AWS IAM role. Note that for the EMR cluster, we configured S3 Access Grants to fall back to the IAM role if access is not provided by S3 Access Grants. The DATA_WRITER_ROLE has read access to the EMR artifacts bucket through an IAM policy so it can read our script. As before, replace all the values between the <> symbols with the values from the Outputs tab of your CloudFormation stack.

aws s3 cp converter.py s3://<EMR_ARTIFACTS_BUCKET>/code/
aws emr add-steps --cluster-id <EMR_CLUSTER_ID> \
    --execution-role-arn <DATA_WRITER_ROLE> \
    --steps '[
        {
            "Type": "CUSTOM_JAR",
            "Name": "converter",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Args": [
                    "spark-submit",
                    "--deploy-mode",
                    "client",
                    "s3://<EMR_ARTIFACTS_BUCKET>/code/converter.py",
                    "s3://<DATA_BUCKET>/output/weather-data/"
            ]
        }
    ]'

Once the job finishes, we should see some Parquet data in s3://<DATA_BUCKET>/output/weather-data/. You can see the status of the job in the Steps tab of the EMR console.
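If you would rather poll the step from a script than watch the console, a minimal boto3 sketch follows; the cluster and step IDs are placeholders taken from the add-steps output.

# Minimal sketch: poll the EMR step status programmatically.
import time
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"   # placeholder
step_id = "s-XXXXXXXXXXXXX"      # placeholder from the add-steps response

while True:
    state = emr.describe_step(ClusterId=cluster_id, StepId=step_id)["Step"]["Status"]["State"]
    print("step state:", state)
    if state in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(30)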

Demo Scenario 2: EMR Studio with an interactive EMR Serverless application to analyze data

Now let's log in to EMR Studio and connect to the EMR Serverless application with the ReadOnly runtime role to analyze the data from scenario 1. First, we need to enable the interactive endpoint on your Serverless application.

  • Select the EMRStudioURL in the Outputs tab of your AWS CloudFormation stack.
  • Select Applications under the Serverless section on the left-hand side.
  • Select the EMRBlog application, then the Action dropdown, and Configure.
  • Expand the Interactive endpoint section and make sure that Enable interactive endpoint is checked.
  • Scroll down and click Configure application to save your changes.
  • Back on the Applications page, select EMRBlog application, then the Start application button.

Next, create a new workspace in our Studio.

  • Choose Workspaces on the left-hand side, then the Create workspace button.
  • Enter a Workspace name, leave the remaining defaults, and choose Create Workspace.
  • After creating the workspace, it should launch in a new tab in a few seconds.

Now connect your Workspace to your EMR Serverless application.

  • Select the EMR Compute button on the left-hand side.
  • Choose EMR Serverless as the compute type.
  • Choose the EMRBlog application and the runtime role that starts with EMRBlog.
  • Choose Attach. The window will refresh and you can open a new PySpark notebook and follow along below. To execute the code yourself, download the AccessGrantsReadOnly.ipynb notebook and upload it into your workspace using the Upload Files button in the file browser.

Let’s do a quick read of the data.

df = spark.read.parquet(f"s3://{DATA_BUCKET}/output/weather-data/")
df.createOrReplaceTempView("weather")
df.show()

We’ll do a simple count(*):

spark.sql("SELECT year, COUNT(*) FROM weather GROUP BY 1").show()


You can also see that if we try to write data into the output location, we get an Amazon S3 error.

df.write.format("csv").mode("overwrite").save("s3://<DATA_BUCKET>/output/weather-data-2/")

While you can also grant similar access via AWS IAM policies, Amazon S3 Access Grants can be useful for situations where your organization has outgrown managing access via IAM, wants to map S3 Access Grants to IAM Identity Center principals or roles, or has previously used EMR File System (EMRFS) role mappings. S3 Access Grants credentials are also temporary, providing more secure access to your data. In addition, as shown below, cross-account access also benefits from the simplicity of S3 Access Grants.

Demo Scenario 3 – Cross-account access

One of the other more common access patterns is accessing data across accounts. This pattern has become increasingly common with the emergence of data mesh, where data producers and consumers are decentralized across different AWS accounts.

Previously, cross-account access required setting up complex cross-account assume role actions and custom credentials providers when configuring your Spark job. With S3 Access Grants, we only need to do the following:

  • Create an Amazon EMR job role and cluster in a second data consumer account
  • The data producer account grants access to the data consumer account with a new instance resource policy
  • The data producer account creates an access grant for the data consumer job role

And that’s it! If you have a second account handy, go ahead and deploy this AWS CloudFormation stack in the data consumer account, to create a new EMR Serverless application and job role. If not, just follow along below. The AWS CloudFormation stack should finish creating in under a minute. Next, let’s go ahead and grant our data consumer access to the S3 Access Grants instance in our data producer account.

  • Replace <DATA_PRODUCER_ACCOUNT_ID> and <DATA_CONSUMER_ACCOUNT_ID> with the relevant 12-digit AWS account IDs.
  • You may also need to change the region in the command and policy.
    aws s3control put-access-grants-instance-resource-policy \
        --account-id <DATA_PRODUCER_ACCOUNT_ID> \
        --region us-east-2 \
        --policy '{
        "Version": "2012-10-17",
        "Id": "S3AccessGrantsPolicy",
        "Statement": [
            {
                "Sid": "AllowAccessToS3AccessGrants",
                "Principal": {
                    "AWS": "<DATA_CONSUMER_ACCOUNT_ID>"
                },
                "Effect": "Allow",
                "Action": [
                    "s3:ListAccessGrants",
                    "s3:ListAccessGrantsLocations",
                    "s3:GetDataAccess"
                ],
                "Resource": "arn:aws:s3:us-east-2:<DATA_PRODUCER_ACCOUNT_ID>:access-grants/default"
            }
        ]
    }'

  • And then grant READ access to the output folder to our EMR Serverless job role in the data consumer account.
    aws s3control create-access-grant \
        --account-id <DATA_PRODUCER_ACCOUNT_ID> \
        --region us-east-2 \
        --access-grants-location-id default \
        --access-grants-location-configuration S3SubPrefix="output/*" \
        --permission READ \
        --grantee GranteeType=IAM,GranteeIdentifier=arn:aws:iam::<DATA_CONSUMER_ACCOUNT_ID>:role/<EMR_SERVERLESS_JOB_ROLE>

Now that we’ve done that, we can read data in the data consumer account from the bucket in the data producer account. We’ll just run a simple COUNT(*) again. Replace the <APPLICATION_ID>, <DATA_CONSUMER_JOB_ROLE>, and <DATA_CONSUMER_LOG_BUCKET> with the values from the Outputs tab on the AWS CloudFormation stack created in your second account.

And replace <DATA_PRODUCER_BUCKET> with the bucket from your first account.

aws emr-serverless start-job-run \
  --application-id <APPLICATION_ID> \
  --execution-role-arn <DATA_CONSUMER_JOB_ROLE> \
  --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://<DATA_CONSUMER_LOG_BUCKET>/logs/"
            }
        }
    }' \
  --job-driver '{
    "sparkSubmit": {
        "entryPoint": "SELECT COUNT(*) FROM parquet.`s3://<DATA_PRODUCER_BUCKET>/output/weather-data/`",
        "sparkSubmitParameters": "--class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver -e"
    }
  }'

Wait for the job to reach a completed state, and then fetch the stdout log from your bucket, replacing the <APPLICATION_ID>, <JOB_RUN_ID> from the job above, and <DATA_CONSUMER_LOG_BUCKET>.

aws emr-serverless get-job-run --application-id <APPLICATION_ID> --job-run-id <JOB_RUN_ID>
{
    "jobRun": {
        "applicationId": "00feq2s6g89r2n0d",
        "jobRunId": "00feqnp2ih45d80e",
        "state": "SUCCESS",
        ...
}

If you are on a unix-based machine and have gunzip installed, then you can use the following command as your administrative user.

Note that this command only uses AWS IAM Role Policies, not Amazon S3 Access Grants.

aws s3 cp s3://<DATA_CONSUMER_LOG_BUCKET>/logs/applications/<APPLICATION_ID>/jobs/<JOB_RUN_ID>/SPARK_DRIVER/stdout.gz - | gunzip

Otherwise, you can use the get-dashboard-for-job-run command and open the resulting URL in your browser to view the Driver stdout logs in the Executors tab of the Spark UI.

aws emr-serverless get-dashboard-for-job-run --application-id <APPLICATION_ID> --job-run-id <JOB_RUN_ID>

Cleaning up

In order to avoid incurring future costs for the example resources in your AWS accounts, be sure to take the following steps:

  • You must manually delete the Amazon EMR Studio workspace created in the first part of the post
  • Empty the Amazon S3 buckets created by the AWS CloudFormation stacks
  • Make sure you delete the Amazon S3 Access Grants, resource policies, and S3 Access Grants location created in the steps above using the delete-access-grant, delete-access-grants-instance-resource-policy, delete-access-grants-location, and delete-access-grants-instance commands.
  • Delete the AWS CloudFormation Stacks created in each account

Comparison to AWS IAM Role Mapping

In 2018, EMR introduced EMRFS role mapping as a way to provide storage-level authorization by configuring EMRFS with multiple IAM roles. While effective, role mapping required managing users or groups locally on your EMR cluster in addition to maintaining the mappings between those identities and their corresponding IAM roles. In combination with runtime roles on EMR on EC2 and job roles for EMR on EKS and EMR Serverless, it is now easier to grant access to your data on S3 directly to the relevant principal on a per-job basis.

Conclusion

In this post, we showed you how to set up and use Amazon S3 Access Grants with Amazon EMR in order to easily manage data access for your Amazon EMR workloads. With S3 Access Grants and EMR, you can easily configure access to data on S3 for IAM identities or using your corporate directory in IAM Identity Center as your identity source. S3 Access Grants is supported across EMR on EC2, EMR on EKS, and EMR Serverless starting in EMR release 6.15.0.

To learn more, see the S3 Access Grants and EMR documentation and feel free to ask any questions in the comments!


About the author

Damon Cortesi is a Principal Developer Advocate with Amazon Web Services. He builds tools and content to help make the lives of data engineers easier. When not hard at work, he still builds data pipelines and splits logs in his spare time.

Amazon Transcribe Call Analytics adds new generative AI-powered call summaries (preview)

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/amazon-transcribe-call-analytics-adds-new-generative-ai-powered-call-summaries-preview/

We are announcing generative artificial intelligence (AI)-powered call summarization in Amazon Transcribe Call Analytics in preview. Powered by Amazon Bedrock, this feature helps businesses improve customer experience, and agent and supervisor productivity by automatically summarizing customer service calls. Amazon Transcribe Call Analytics provides machine learning (ML)-powered analytics that allows contact centers to understand the sentiment, trends, and policy compliance of customer conversations to improve their experience and identify crucial feedback. A single API call is all it takes to extract transcripts, rich insights, and summaries from your customer conversations.

We understand that as a business, you want to maintain an accurate historical record of key conversation points, including action items associated with each conversation. To do this, agents summarize notes after the conversation has ended and enter these in their CRM system, a process that is time-consuming and subject to human error. Now imagine the customer trust erosion that follows when the agent fails to correctly capture and act upon important action items discussed during conversations.

How it works
Starting today, to assist agents and supervisors with the summarization of customer conversations, Amazon Transcribe Call Analytics will generate a concise summary of a contact center interaction that captures key components such as why the customer called, how the issue was addressed, and what follow-up actions were identified. After completing a customer interaction, agents can directly proceed to help the next customer since they don’t have to summarize a conversation, resulting in reduced customer wait times and improved agent productivity. Further, supervisors can review the summary when investigating a customer issue to get a gist of the conversation, without having to listen to the entire call recording or read the transcript.
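If you want to start a post-call analytics job programmatically rather than from the console, a minimal boto3 sketch is shown below. The bucket, role ARN, and job name are placeholders, and the exact Settings key that enables generative call summarization is an assumption; confirm it against the current Amazon Transcribe API reference.

# Minimal sketch: start a post-call analytics job with call summarization.
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_call_analytics_job(
    CallAnalyticsJobName="demo-call-summary-job",                               # placeholder
    Media={"MediaFileUri": "s3://my-call-recordings/agent-call-001.wav"},       # placeholder
    OutputLocation="s3://my-call-recordings/analytics-output/",                 # placeholder
    DataAccessRoleArn="arn:aws:iam::111122223333:role/TranscribeCallAnalyticsRole",  # placeholder
    Settings={"Summarization": {"GenerateAbstractiveSummary": True}},           # assumed setting name
)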

Exploring Amazon Transcribe Call Analytics in the console
To see how this works visually, I first create an Amazon Simple Storage Service (Amazon S3) bucket in the relevant AWS Region. I then upload the audio file to the S3 bucket.

Audio file in S3 bucket

To create an analytics job that transcribes the audio and provides additional analytics about the conversation that the customer and the agent were having, I go to the Amazon Transcribe Call Analytics console. I select Post-call Analytics in the left hand navigation bar and then choose Create job.

Create Post-call analytics job

Next, I enter a job name, making sure the language settings match the language in the audio file.

Job settings

In the Amazon S3 URI path, I provide the link to the audio file uploaded in the first screenshot shown in this post.

Audio file details

In Role name, I select Create an IAM role which will have access to the Amazon S3 bucket, then choose Next.

Create IAM Role

I enable Generative call summarization, and then choose Create job.

Configure job

After a few minutes, the job’s status will change from In progress to Complete, indicating that it was completed successfully.

Job status

Select the job, and the next screen will show the transcript and a new tab, Generative call summarization – preview.

You can also download the transcript to view the analytics and summary.
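If you created the job programmatically, you can also check its status and locate the output with a short sketch like the following (the job name matches the placeholder used earlier).

# Minimal sketch: check job status and find the output once it completes.
import boto3

transcribe = boto3.client("transcribe")

job = transcribe.get_call_analytics_job(CallAnalyticsJobName="demo-call-summary-job")
status = job["CallAnalyticsJob"]["CallAnalyticsJobStatus"]
print("status:", status)

if status == "COMPLETED":
    # The transcript JSON, including the analytics and the summary, is written here.
    print(job["CallAnalyticsJob"]["Transcript"]["TranscriptFileUri"])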

Now available
Generative call summarization in Amazon Transcribe Call Analytics is available today in English in US East (N. Virginia) and US West (Oregon).

With generative call summarization in Amazon Transcribe Call Analytics, you pay as you go and are billed monthly based on tiered pricing. For more information, see Amazon Transcribe pricing.

Learn more:

Veliswa

Large Language Models for sentiment analysis with Amazon Redshift ML (Preview)

Post Syndicated from Blessing Bamiduro original https://aws.amazon.com/blogs/big-data/large-language-models-for-sentiment-analysis-with-amazon-redshift-ml-preview/

Amazon Redshift ML empowers data analysts and database developers to integrate the capabilities of machine learning and artificial intelligence into their data warehouse. Amazon Redshift ML helps to simplify the creation, training, and application of machine learning models through familiar SQL commands.

You can further enhance Amazon Redshift’s inferencing capabilities by Bringing Your Own Models (BYOM). There are two types of BYOM: 1) remote BYOM for remote inferences, and 2) local BYOM for local inferences. With local BYOM, you utilize a model trained in Amazon SageMaker for in-database inference within Amazon Redshift by importing Amazon SageMaker Autopilot and Amazon SageMaker trained models into Amazon Redshift. Alternatively, with remote BYOM you can invoke remote custom ML models deployed in SageMaker. This enables you to use custom models in SageMaker for churn, XGBoost, linear regression, multi-class classification and now LLMs.

Amazon SageMaker JumpStart is a SageMaker feature that helps deploy pretrained, publicly available large language models (LLM) for a wide range of problem types, to help you get started with machine learning. You can access pretrained models and use them as-is or incrementally train and fine-tune these models with your own data.

Previously, Amazon Redshift ML exclusively supported BYOMs that accepted text or CSV as the data input and output format. Now, it has added support for the SUPER data type for both input and output. With this support, you can use LLMs in Amazon SageMaker JumpStart, which offers numerous proprietary and publicly available foundation models from various model providers.

LLMs have diverse use cases. Amazon Redshift ML supports available LLM models in SageMaker, including models for sentiment analysis. In sentiment analysis, the model can analyze text, such as product feedback, and determine its sentiment. This capability is particularly valuable for understanding product reviews, feedback, and overall sentiment.

Overview of solution

In this post, we use Amazon Redshift ML for sentiment analysis on reviews stored in an Amazon Redshift table. The model takes the reviews as input and returns a sentiment classification as output. We use an out-of-the-box LLM in SageMaker JumpStart. The following diagram shows the solution overview.

Walkthrough

Follow the steps below to perform sentiment analysis using Amazon Redshift’s integration with SageMaker JumpStart to invoke LLM models:

  1. Deploy LLM model using foundation models in SageMaker JumpStart and create an endpoint
  2. Using Amazon Redshift ML, create a model referencing the SageMaker JumpStart LLM endpoint
  3. Load a sample reviews dataset into your Amazon Redshift data warehouse
  4. Create a user-defined function (UDF) that engineers the prompt for sentiment analysis
  5. Make a remote inference to the LLM model to generate sentiment analysis for input dataset
  6. Analyze the output

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An Amazon Redshift Serverless preview workgroup or an Amazon Redshift provisioned preview cluster. Refer to creating a preview workgroup or creating a preview cluster documentation for steps.
  • For the preview, your Amazon Redshift data warehouse should be on the preview_2023 track in one of these Regions – US East (N. Virginia), US West (Oregon), EU-West (Ireland), US-East (Ohio), AP-Northeast (Tokyo), or EU-North-1 (Stockholm).

Solution Steps

Follow the steps below.

1. Deploy LLM Model using Foundation models in SageMaker JumpStart and create an endpoint

  1. Navigate to Foundation Models in Amazon SageMaker Jumpstart
  2. Search for the foundation model by typing Falcon 7B Instruct BF16 in the search box
  3. Choose View Model

  4. On the Model details page, choose Open notebook in Studio

  5. When Select domain and user profile dialog box opens up, choose the profile you like from the drop down and choose Open Studio

  6. When the notebook opens, a prompt Set up notebook environment pops open. Choose ml.g5.2xlarge or any other instance type recommended in the notebook and choose Select

  7. Scroll to Deploying Falcon model for inference section of the notebook and run the three cells in that section
  8. Once the third cell execution is complete, expand the Deployments section in the left pane and choose Endpoints to see the endpoint created. Note the endpoint name; it will be used in the next steps
  9. Select Finish.

2. Using Amazon Redshift ML, create a model referencing the SageMaker JumpStart LLM endpoint

Create a model using Amazon Redshift ML bring your own model (BYOM) capability. After the model is created, you can use the output function to make remote inference to the LLM model. To create a model in Amazon Redshift for the LLM endpoint created previously, follow the below steps.

  1. Log in to your Amazon Redshift endpoint. You can use Query Editor V2 to log in
  2. Import this notebook into Query Editor V2. It has all the SQLs used in this blog.
  3. Ensure you have the below IAM policy added to your IAM role. Replace <endpointname> with the SageMaker JumpStart endpoint name captured earlier
    {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": "sagemaker:InvokeEndpoint",
              "Effect": "Allow",
              "Resource": "arn:aws:sagemaker:<region>:<AccountNumber>:endpoint/<endpointname>"
          }
      ]
    }

  4. Create the model in Amazon Redshift using the CREATE MODEL statement given below. Replace <endpointname> with the endpoint name captured earlier. The input and output data types for the model need to be SUPER.
    CREATE MODEL falcon_7b_instruct_llm_model
    FUNCTION falcon_7b_instruct_llm_model(super)
    RETURNS super
    SAGEMAKER '<endpointname>'
    IAM_ROLE default;

3. Load sample reviews data set into your Amazon Redshift data warehouse

In this blog post, we use a sample fictitious reviews dataset for the walkthrough.

  1. Log in to Amazon Redshift using Query Editor V2
  2. Create the sample_reviews table using the below SQL statement. This table will store the sample reviews dataset:
    CREATE TABLE sample_reviews
    (
    review varchar(4000)
    );

  3. Download the sample file, upload it into your S3 bucket, and load the data into the sample_reviews table using the below COPY command
    COPY sample_reviews
    FROM 's3://<<your_s3_bucket>>/sample_reviews.csv'
    IAM_ROLE DEFAULT
    csv
    DELIMITER ','
    IGNOREHEADER 1;

4. Create a UDF that engineers the prompt for sentiment analysis

The input to the LLM consists of two main parts – the prompt and the parameters.

The prompt is the guidance or set of instructions you want to give to the LLM. The prompt should be clear and provide proper context and direction for the LLM. Generative AI systems rely heavily on the prompts provided to determine how to generate a response. If the prompt does not provide enough context and guidance, it can lead to unhelpful responses. Prompt engineering helps avoid these pitfalls.

Finding the right words and structure for a prompt is challenging and often requires trial and error. Prompt engineering allows experimenting to find prompts that reliably produce the desired output.  Prompt engineering helps shape the input to best leverage the capabilities of the Generative-AI model being used. Well-constructed prompts allow generative AI to provide more nuanced, high-quality, and helpful responses tailored to the specific needs of the user.

The parameters allow configuring and fine-tuning the model’s output. This includes settings like maximum length, randomness levels, stopping criteria, and more. Parameters give control over the properties and style of the generated text and are model specific.

The UDF below takes varchar data from your data warehouse and parses it into SUPER (JSON format) for the LLM. This flexibility allows you to store your data as varchar in your data warehouse, without performing a data type conversion to SUPER, and still use LLMs in Amazon Redshift ML; it also makes prompt engineering easy. If you want to try a different prompt, you can just replace the UDF.

The UDF given below has both the prompt and a parameter.

  • Prompt: “Classify the sentiment of this sentence as Positive, Negative, Neutral. Return only the sentiment nothing else” – This instructs the model to classify the review into 3 sentiment categories.
  • Parameter: “max_new_tokens”:1000 – This allows the model to return up to 1000 tokens.
CREATE FUNCTION udf_prompt_eng_sentiment_analysis (varchar)
  returns super
stable
as $$
  select json_parse(
  '{"inputs":"Classify the sentiment of this sentence as Positive, Negative, Neutral. Return only the sentiment nothing else.' || $1 || '","parameters":{"max_new_tokens":1000}}')
$$ language sql;
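As an optional sanity check, you can send the same JSON payload the UDF builds directly to the SageMaker JumpStart endpoint from Python before wiring it into a query; the endpoint name placeholder matches the one captured earlier, and the sample review text is illustrative.

# Optional check: invoke the Falcon endpoint with the same payload shape the UDF builds.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Classify the sentiment of this sentence as Positive, Negative, Neutral. "
              "Return only the sentiment nothing else. The delivery was late and the box was damaged.",
    "parameters": {"max_new_tokens": 1000},
}

resp = runtime.invoke_endpoint(
    EndpointName="<endpointname>",         # SageMaker JumpStart endpoint captured earlier
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(resp["Body"].read()))     # Falcon returns [{"generated_text": "..."}]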

5. Make a remote inference to the LLM model to generate sentiment analysis for input dataset

The output of this step is stored in a newly created table called sentiment_analysis_for_reviews. Run the below SQL statement to create the table with output from the LLM model:

CREATE table sentiment_analysis_for_reviews
as
(
    SELECT 
        review, 
        falcon_7b_instruct_llm_model
            (
                udf_prompt_eng_sentiment_analysis(review)
        ) as sentiment
    FROM sample_reviews
);

6. Analyze the output

The output of the LLM is of the SUPER data type. For the Falcon model, the output is available in the attribute named generated_text. Each LLM has its own output payload format. Refer to the documentation for the LLM you would like to use for its output format.

Run the below query to extract the sentiment from the output of the LLM model. For each review, you can see its sentiment analysis:

SELECT review, sentiment[0]."generated_text" :: varchar as sentiment 
FROM sentiment_analysis_for_reviews;

Cleaning up

To avoid incurring future charges, delete the resources.

  1. Delete the LLM endpoint in SageMaker Jumpstart
  2. Drop the model, tables, and UDF in Amazon Redshift using the below queries

    DROP MODEL falcon_7b_instruct_llm_model;
    DROP TABLE sentiment_analysis_for_reviews;
    DROP TABLE sample_reviews;
    DROP FUNCTION udf_prompt_eng_sentiment_analysis(varchar);

  3. If you have created an Amazon Redshift endpoint, delete the endpoint as well

Conclusion

In this post, we showed you how to perform sentiment analysis on data stored in Amazon Redshift using Falcon, a large language model (LLM) in SageMaker JumpStart, and Amazon Redshift ML. Falcon is used as an example; you can use other LLM models with Amazon Redshift ML as well. Sentiment analysis is just one of the many use cases that are possible with LLM support in Amazon Redshift ML. You can address other use cases such as data enrichment, content summarization, knowledge graph development, and more. LLM support broadens the analytical capabilities of Amazon Redshift ML as it continues to empower data analysts and developers to incorporate machine learning into their data warehouse workflow with streamlined processes driven by familiar SQL commands. The addition of the SUPER data type enhances Amazon Redshift ML capabilities, allowing smooth integration of large language models (LLMs) from SageMaker JumpStart for remote BYOM inferences.


About the Authors

Blessing Bamiduro is part of the Amazon Redshift Product Management team. She works with customers to help explore the use of Amazon Redshift ML in their data warehouse. In her spare time, Blessing loves travels and adventures.

Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.

Enhance query performance using AWS Glue Data Catalog column-level statistics

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. Queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets. When talking with our customers, we learned that one of the more challenging aspects of data lake performance is optimizing these analytics queries to run faster.

Data lake performance optimization is especially important for queries with multiple joins, and that is where cost-based optimizers help the most. For CBO to work, column statistics need to be collected and kept up to date as the data changes. We’re launching the capability to generate column-level statistics, such as the number of distinct values, number of nulls, maximum, and minimum, on AWS Glue tables backed by file formats such as Parquet, ORC, JSON, Amazon Ion, CSV, and XML. With this launch, customers now have an integrated end-to-end experience where statistics on AWS Glue tables are collected, stored in the AWS Glue Data Catalog, and made available to analytics services for improved query planning and execution.

Using these statistics, the cost-based optimizer improves query run plans and boosts the performance of queries run in Amazon Athena and Amazon Redshift Spectrum. For example, CBO can use column statistics such as the number of distinct values and the number of nulls to improve row prediction, which is the estimated number of rows a table returns at a given step of the query plan. The more accurate the row predictions are, the more efficient the query execution steps are. This leads to faster query execution and potentially reduced cost. Some of the specific optimizations that CBO can employ include join reordering and push-down of aggregations, based on the statistics available for each table and column.
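If you want to see this effect yourself, Athena’s EXPLAIN command shows the query plan, including cost and row estimates where available, which change once column statistics exist. The following is an illustrative sketch only; the database, table, and column names assume the TPC-DS tables cataloged later in this post.

-- Compare the plan (and its row estimates) before and after generating statistics
EXPLAIN
SELECT i.i_category, SUM(ss.ss_net_paid) AS total_net_paid
FROM tpcdsdbwithstats.store_sales ss
JOIN tpcdsdbwithstats.item i ON ss.ss_item_sk = i.i_item_sk
GROUP BY i.i_category;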

For customers using a data mesh with AWS Lake Formation permissions, tables from different data producers are cataloged in the centralized governance account. As producers generate statistics on tables in the centralized catalog and share those tables with consumers, queries on those tables in consumer accounts automatically see query performance improvements. In this post, we demonstrate the capability of the AWS Glue Data Catalog to generate column statistics for our sample tables.

Solution overview

To demonstrate the effectiveness of this capability, we employ the industry-standard TPC-DS 3 TB dataset stored in an Amazon Simple Storage Service (Amazon S3) public bucket. We compare the query performance before and after generating column statistics for the tables by running queries in Amazon Athena and Amazon Redshift Spectrum. We provide the queries used in this post and encourage you to try your own queries by following the workflow described below.

The workflow consists of the following high-level steps:

  1. Cataloging the Amazon S3 bucket: Use an AWS Glue crawler to crawl the designated Amazon S3 bucket, extract metadata, and store it in the AWS Glue Data Catalog. We’ll query these tables using Amazon Athena and Amazon Redshift Spectrum.
  2. Generating column statistics: Employ the enhanced capabilities of AWS Glue Data Catalog to generate comprehensive column statistics for the crawled data, thereby providing valuable insights into the dataset.
  3. Querying with Amazon Athena and Amazon Redshift Spectrum: Evaluate the impact of column statistics on query performance by utilizing Amazon Athena and Amazon Redshift Spectrum to execute queries on the dataset.

The following diagram illustrates the solution architecture.

Walkthrough

To implement the solution, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Run the AWS Glue crawlers on the public Amazon S3 bucket to catalog the 3 TB TPC-DS dataset.
  3. Run queries on Amazon Athena and Amazon Redshift and note down the query duration.
  4. Generate statistics for the AWS Glue Data Catalog tables.
  5. Run queries on Amazon Athena and Amazon Redshift and compare the query duration with the previous run.
  6. Optional: Schedule AWS Glue column statistics jobs using AWS Lambda and the Amazon EventBridge Scheduler.

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. The template generates the following resources:

  • An Amazon Virtual Private Cloud (Amazon VPC), a public subnet, private subnets, and route tables.
  • An Amazon Redshift Serverless workgroup and namespace.
  • An AWS Glue crawler to crawl the public Amazon S3 bucket and create tables in the AWS Glue Data Catalog for the TPC-DS dataset.
  • AWS Glue Data Catalog databases and tables.
  • An Amazon S3 bucket to store Athena results.
  • AWS Identity and Access Management (AWS IAM) users and policies.
  • An AWS Lambda function and an Amazon EventBridge Scheduler schedule to run the AWS Glue column statistics jobs.

To launch the AWS CloudFormation stack, complete the following steps:

Note: The AWS Glue data catalog tables are generated using the public bucket s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/, hosted in the us-east-1 region. If you intend to deploy this AWS CloudFormation template in a different region, it is necessary to either copy the data to the corresponding region or share the data within your deployed region for it to be accessible from Amazon Redshift.

  1. Log in to the AWS Management Console as an AWS Identity and Access Management (AWS IAM) administrator.
  2. Choose Launch Stack to deploy an AWS CloudFormation template.
  3. Choose Next.
  4. On the next page, keep all options as default or make appropriate changes based on your requirements, then choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.

Run the AWS Glue Crawlers created by the AWS CloudFormation stack

To run your crawlers, complete the following steps:

  1. On the AWS Glue console, choose Crawlers under Data Catalog in the navigation pane.
  2. Locate and run the two crawlers tpcdsdb-without-stats and tpcdsdb-with-stats. They may take a few minutes to complete.

Once the crawlers complete successfully, they create two identical databases: tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats will have no statistics, and we’ll use them as a reference. We’ll generate statistics on the tables in tpcdsdbwithstats. Verify that you can see both databases and their underlying tables on the AWS Glue console. The tpcdsdbnostats database will look like the following. At this time, no statistics have been generated on these tables.

Run the provided queries using Amazon Athena on the no-stats tables

To run the queries in Amazon Athena on the tables without statistics, complete the following steps:

  1. Download the Athena queries from here.
  2. On the Amazon Athena console, run the provided queries one at a time against the tables in the tpcdsdbnostats database.
  3. Note down the run time of each query.

Run the provided queries using Amazon Redshift Spectrum on the no-stats tables

To run the queries in Amazon Redshift, complete the following steps:

  1. Download the Amazon Redshift queries from here.
  2. In the Redshift query editor v2, run the queries from the Redshift Query for tables without stats section of the downloaded file. (If your Redshift Serverless workgroup is not yet connected to the AWS Glue databases, see the external schema sketch after this list.)
  3. Note down the execution time of each query.
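Amazon Redshift Spectrum queries AWS Glue Data Catalog tables through external schemas. The following is a minimal sketch of that mapping in case your workgroup does not have it yet; the schema names are arbitrary choices, and IAM_ROLE default assumes a default IAM role is associated with the workgroup (the CloudFormation stack may already take care of this).

CREATE EXTERNAL SCHEMA IF NOT EXISTS tpcds_nostats
FROM DATA CATALOG
DATABASE 'tpcdsdbnostats'
IAM_ROLE default;

CREATE EXTERNAL SCHEMA IF NOT EXISTS tpcds_withstats
FROM DATA CATALOG
DATABASE 'tpcdsdbwithstats'
IAM_ROLE default;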

Generate statistics on AWS Glue Catalog tables

To generate statistics on AWS Glue Catalog tables, complete the following steps:

  1. On the AWS Glue console, choose Databases under Data Catalog.
  2. Choose the tpcdsdbwithstats database to list all of its tables.
  3. Select any of these tables (for example, call_center).
  4. Go to the Column statistics – new tab and choose Generate statistics.
  5. Keep the default options: under Choose columns, keep Table (All columns); under Row sampling options, keep All rows; under IAM role, choose AWSGluestats-blog. Then choose Generate statistics.

You’ll be able to see the status of the statistics generation run, as shown in the following illustration:

After generating statistics on the AWS Glue Data Catalog tables, you should be able to see detailed column statistics for that table:

Repeat steps 2–5 to generate statistics for all necessary tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you can follow the Schedule AWS Glue statistics runs section near the end of this post to generate statistics for all tables. Once done, assess the query performance of each query.

Run the provided queries using the Athena console on the stats tables

  1. On the Amazon Athena console, run the queries from the Athena Query for tables with stats section of the downloaded file.
  2. Note down the execution time of each query.

In our sample run of the queries, we observed the query execution times shown in the following table. We saw a clear improvement in query performance, ranging from 13% to 55%.

Athena query time improvement

TPC-DS 3 TB query | Without Glue stats (sec) | With Glue stats (sec) | Performance improvement (%)
Query 2 | 33.62 | 15.17 | 55%
Query 4 | 132.11 | 72.94 | 45%
Query 14 | 134.77 | 91.48 | 32%
Query 28 | 55.99 | 39.36 | 30%
Query 38 | 29.32 | 25.58 | 13%

Run the provided queries using Amazon Redshift Spectrum on the stats tables

  1. In the Amazon Redshift query editor v2, run the queries from the Redshift Query for tables with stats section of the downloaded file.
  2. Note down the execution time of each query.

In our sample run of the queries, we observed the query execution times shown in the following table. We saw a clear improvement in query performance, ranging from 13% to 89%.

Amazon Redshift Spectrum query time improvement

TPC-DS 3 TB query | Without Glue stats (sec) | With Glue stats (sec) | Performance improvement (%)
Query 40 | 124.156 | 13.12 | 89%
Query 60 | 29.52 | 16.97 | 42%
Query 66 | 18.914 | 16.39 | 13%
Query 95 | 308.806 | 200 | 35%
Query 99 | 20.064 | 16 | 20%

Schedule AWS Glue statistics Runs

In this segment of the post, we’ll guide you through the steps of scheduling AWS Glue column statistics runs using AWS Lambda and the Amazon EventBridge Scheduler. To streamline this process, an AWS Lambda function and an Amazon EventBridge scheduler were created as part of the CloudFormation stack deployment.

  1. AWS Lambda function setup:

To begin, we utilize an AWS Lambda function to trigger the execution of the AWS Glue column statistics job. The AWS Lambda function invokes the start_column_statistics_task_run API through the boto3 (AWS SDK for Python) library. This sets the groundwork for automating the column statistics update.

Let’s explore the AWS Lambda function:

    • Go to the AWS Lambda console.
    • Select Functions and locate the GlueTableStatisticsFunctionv1.
    • For a clearer understanding of the AWS Lambda function, we recommend reviewing the code in the Code section and examining the environment variables under Configuration.
  2. Amazon EventBridge scheduler configuration

The next step involves scheduling the AWS Lambda function invocation using the Amazon EventBridge Scheduler. The scheduler is configured to trigger the AWS Lambda function daily at a specific time – in this case, 08:00 PM. This ensures that the AWS Glue column statistics job runs on a regular and predictable basis.

You can review and update the schedule on the Amazon EventBridge console under Scheduler.

Cleaning up

To avoid unwanted charges to your AWS account, delete the AWS resources:

  1. Sign in to the AWS CloudFormation console as the AWS IAM administrator used to create the AWS CloudFormation stack.
  2. Delete the AWS CloudFormation stack you created.

Conclusion

In this post, we showed you how to use the AWS Glue Data Catalog to generate column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings. Refer to the documentation for the support of AWS Glue Data Catalog statistics across the various AWS analytics services.

If you have questions or suggestions, submit them in the comments section.


About the Authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Navnit Shukla serves as an AWS Specialist Solution Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled Data Wrangling on AWS. He can be reached via LinkedIn.

Introducing Apache Hudi support with AWS Glue crawlers

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-apache-hudi-support-with-aws-glue-crawlers/

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Data engineers use Apache Hudi for streaming workloads as well as to create efficient incremental data pipelines. Hudi provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping your data in open source file formats. Hudi’s advanced performance optimizations make analytical workloads faster with any of the popular query engines including Apache Spark, Presto, Trino, Hive, and so on.

Many AWS customers adopted Apache Hudi on their data lakes built on top of Amazon S3 using AWS Glue, a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue Crawler is a component of AWS Glue, which allows you to create table metadata from data content automatically without requiring manual definition of the metadata.

AWS Glue crawlers now support Apache Hudi tables, simplifying the adoption of the AWS Glue Data Catalog as the catalog for Hudi tables. One typical use case is to register Hudi tables that do not yet have a catalog table definition. Another typical use case is migration from other Hudi catalogs, such as a Hive metastore. When migrating from other Hudi catalogs, you can create and schedule an AWS Glue crawler and provide one or more Amazon S3 paths where the Hudi table files are located. You have the option to provide the maximum depth of Amazon S3 paths that the AWS Glue crawler can traverse. With each run, AWS Glue crawlers extract schema and partition information and update the AWS Glue Data Catalog with schema and partition changes. Crawlers also update the latest metadata file location in the AWS Glue Data Catalog so that AWS analytics engines can use it directly.

With this launch, you can create and schedule an AWS Glue crawler to register Hudi tables in AWS Glue Data Catalog. You can then provide one or multiple Amazon S3 paths where the Hudi tables are located. You have the option to provide the maximum depth of Amazon S3 paths that crawlers can traverse. With each crawler run, the crawler inspects each of the S3 paths and catalogs the schema information, such as new tables, deletes, and updates to schemas in the AWS Glue Data Catalog. Crawlers inspect partition information and add newly added partitions to AWS Glue Data Catalog. Crawlers also update the latest metadata file location in the AWS Glue Data Catalog that AWS analytical engines can directly use.

This post demonstrates how this new capability to crawl Hudi tables works.

How AWS Glue crawler works with Hudi tables

Hudi tables have two categories, with specific implications for each:

  • Copy on write (CoW) – Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write.
  • Merge on read (MoR) – Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files.

With CoW datasets, each time there is an update to a record, the file that contains the record is rewritten with the updated values. With a MoR dataset, each time there is an update, Hudi writes only the row for the changed record. MoR is better suited for write- or change-heavy workloads with fewer reads. CoW is better suited for read-heavy workloads on data that change less frequently.

Hudi provides three query types for accessing the data:

  • Snapshot queries – Queries that see the latest snapshot of the table as of a given commit or compaction action. For MoR tables, snapshot queries expose the most recent state of the table by merging the base and delta files of the latest file slice at the time of the query.
  • Incremental queries – Queries see only new data written to the table since a given commit or compaction. This effectively provides change streams to enable incremental data pipelines.
  • Read optimized queries – For MoR tables, queries see the latest data compacted. For CoW tables, queries see the latest data committed.

For copy-on-write tables, crawlers create a single table in the AWS Glue Data Catalog with the ReadOptimized Serde org.apache.hudi.hadoop.HoodieParquetInputFormat.

For merge-on-read tables, crawlers create two tables in AWS Glue Data Catalog for the same table location:

  • A table with suffix _ro, which uses the ReadOptimized Serde org.apache.hudi.hadoop.HoodieParquetInputFormat
  • A table with suffix _rt, which uses the RealTime Serde allowing for Snapshot queries: org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat

During each crawl, for each Hudi path provided, crawlers make an Amazon S3 list API call, filter based on the .hoodie folders, and find the most recent metadata file under that Hudi table metadata folder.
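Once a crawl completes, you can confirm which serde, input format, and metadata location were registered by inspecting the table definition, for example with SHOW CREATE TABLE in Athena. This is a sketch that assumes the database and table names created in the walkthroughs below; run each statement separately in the query editor.

-- Shows the serde, input format, location, and table properties the crawler registered
SHOW CREATE TABLE hudi_crawler_blog.sample_hudi_cow_table;
SHOW CREATE TABLE hudi_crawler_blog.sample_hudi_mor_table_ro;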

Crawl a Hudi CoW table using AWS Glue crawler

In this section, let’s go through how to crawl a Hudi CoW table using AWS Glue crawlers.

Prerequisites

Here are the prerequisites for this tutorial:

  1. Install and configure AWS Command Line Interface (AWS CLI).
  2. Create your S3 bucket if you do not have it.
  3. Create your IAM role for AWS Glue if you do not have it. You need s3:GetObject for s3://your_s3_bucket/data/sample_hudi_cow_table/.
  4. Run the following command to copy the sample Hudi table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)
$ aws s3 sync s3://aws-bigdata-blog/artifacts/hudi-crawler/product_cow/ s3://your_s3_bucket/data/sample_hudi_cow_table/

This instruction guides you to copy sample data, but you can create any Hudi tables easily using AWS Glue. Learn more in Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor.

Create a Hudi crawler

In this tutorial, we create the crawler through the console. Complete the following steps to create a Hudi crawler:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Create crawler.
  3. For Name, enter hudi_cow_crawler. Choose Next.
  4. Under Data source configuration,  choose Add data source.
    1. For Data source, choose Hudi.
    2. For Include hudi table paths, enter s3://your_s3_bucket/data/sample_hudi_cow_table/. (Replace your_s3_bucket with your S3 bucket name.)
    3. Choose Add Hudi data source.
  5. Choose Next.
  6. For Existing IAM role, choose your IAM role, then choose Next.
  7. For Target database, choose Add database, then the Add database dialog appears. For Database name, enter hudi_crawler_blog, then choose Create. Choose Next.
  8. Choose Create crawler.

Now a new Hudi crawler has been successfully created. The crawler can be run through the console, or through the AWS SDK or AWS CLI using the StartCrawler API. It can also be scheduled through the console to run at specific times. In this tutorial, run the crawler through the console:

  1. Choose Run crawler.
  2. Wait for the crawler to complete.

After the crawler has run, you can see the Hudi table definition in the AWS Glue console:

You have successfully crawled the Hudi CoW table with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. After you create the table definition in the AWS Glue Data Catalog, AWS analytics services such as Amazon Athena are able to query the Hudi table.

Complete the following steps to start queries on Athena:

  1. Open the Amazon Athena console.
  2. Run the following query.
SELECT * FROM "hudi_crawler_blog"."sample_hudi_cow_table" limit 10;

The following screenshot shows our output:
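By default, Hudi also maintains bookkeeping columns (such as _hoodie_commit_time) in every table, and the crawler registers them as part of the schema. As an optional check, you can query them from Athena; this is a sketch that assumes the table created above.

SELECT "_hoodie_commit_time", "_hoodie_record_key", "_hoodie_file_name"
FROM "hudi_crawler_blog"."sample_hudi_cow_table"
ORDER BY "_hoodie_commit_time" DESC
LIMIT 10;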

Crawl a Hudi MoR table using AWS Glue crawler with AWS Lake Formation data permissions

In this section, let’s go through how to crawl a Hudi MoR table using an AWS Glue crawler. This time, you use AWS Lake Formation data permissions for crawling the Amazon S3 data source instead of IAM and Amazon S3 permissions. This is optional, but it simplifies permission configuration when your data lake is managed by AWS Lake Formation permissions.

Prerequisites

Here are the prerequisites for this tutorial:

  1. Install and configure AWS Command Line Interface (AWS CLI).
  2. Create your S3 bucket if you do not have it.
  3. Create your IAM role for AWS Glue if you do not have it. You need lakeformation:GetDataAccess. But you do not need s3:GetObject for s3://your_s3_bucket/data/sample_hudi_mor_table/ because we use Lake Formation data permission to access the files.
  4. Run the following command to copy the sample Hudi table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)
$ aws s3 sync s3://aws-bigdata-blog/artifacts/hudi-crawler/product_mor/ s3://your_s3_bucket/data/sample_hudi_mor_table/

In addition to the prerequisite steps, complete the following steps to update the AWS Glue Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as a data lake administrator.
    1. If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
  2. Under Administration, choose Data catalog settings.
  3. For Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. For Cross account version setting, choose Version 3.
  5. Choose Save.

The next step is to register your S3 bucket in Lake Formation data lake locations:

  1. On the Lake Formation console, choose Data lake locations, and choose Register location.
  2. For Amazon S3 path, enter s3://your_s3_bucket/. (Replace your_s3_bucket with your S3 bucket name.)
  3. Choose Register location.

Then, grant the AWS Glue crawler role access to the data location so that the crawler can use Lake Formation permissions to access the data and create tables in that location:

  1. On the Lake Formation console, choose Data locations and choose Grant.
  2. For IAM users and roles, select the IAM role you used for the crawler.
  3. For Storage location, enter s3://your_s3_bucket/data/. (Replace your_s3_bucket with your S3 bucket name.)
  4. Choose Grant.

Then, grant the crawler role permission to create tables under the database hudi_crawler_blog:

  1. On the Lake Formation console, choose Data lake permissions.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles, and choose the crawler role.
  4. For LF tags or catalog resources, choose Named data catalog resources.
  5. For Database, choose the database hudi_crawler_blog.
  6. Under Database permissions, select Create table.
  7. Choose Grant.

Create a Hudi crawler with Lake Formation data permissions

Complete the following steps to create a Hudi crawler:

  1. On the AWS Glue console, choose Crawlers.
  2. Choose Create crawler.
  3. For Name, enter hudi_mor_crawler. Choose Next.
  4. Under Data source configuration,  choose Add data source.
    1. For Data source, choose Hudi.
    2. For Include hudi table paths, enter s3://your_s3_bucket/data/sample_hudi_mor_table/. (Replace your_s3_bucket with your S3 bucket name.)
    3. Choose Add Hudi data source.
  5. Choose Next.
  6. For Existing IAM role, choose your IAM role.
  7. Under Lake Formation configuration – optional, select Use Lake Formation credentials for crawling S3 data source.
  8. Choose Next.
  9. For Target database, choose hudi_crawler_blog. Choose Next.
  10. Choose Create crawler.

Now a new Hudi crawler has been successfully created. The crawler uses Lake Formation credentials for crawling Amazon S3 files. Let’s run the new crawler:

  1. Choose Run crawler.
  2. Wait for the crawler to complete.

After the crawler has run, you can see two tables of the Hudi table definition in the AWS Glue console:

  • sample_hudi_mor_table_ro (read optimized table)
  • sample_hudi_mor_table_rt (real time table)

You registered the data lake bucket with Lake Formation and enabled crawling access to the data lake using Lake Formation permissions. You have successfully crawled the Hudi MoR table with data on Amazon S3 and created AWS Glue Data Catalog tables with the schema populated. After you create the table definitions in the AWS Glue Data Catalog, AWS analytics services such as Amazon Athena are able to query the Hudi table.

Complete the following steps to start queries on Athena:

  1. Open the Amazon Athena console.
  2. Run the following query.
    SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_rt" limit 10;

The following screenshot shows our output:

  1. Run the following query.
    SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_ro" limit 10;

The following screenshot shows our output:

Fine-grained access control using AWS Lake Formation permissions

To apply fine-grained access control on the Hudi table, you can benefit from AWS Lake Formation permissions. Lake Formation permissions allow you to restrict access to specific tables, columns, or rows and then query the Hudi tables through Amazon Athena with fine-grained access control. Let’s configure Lake Formation permission for the Hudi MoR table.

Prerequisites

Here are the prerequisites for this tutorial:

  1. Complete the previous section Crawl a Hudi MoR table using AWS Glue crawler with AWS Lake Formation data permissions.
  2. Create an IAM user DataAnalyst, who has AWS managed policy AmazonAthenaFullAccess.

Create a Lake Formation data cell filter

Let’s first set up a filter for the MoR read optimized table.

  1. Sign in to the Lake Formation console as a data lake administrator.
  2. Choose Data filters.
  3. Choose Create new filter.
  4. For Data filter name, enter exclude_product_price.
  5. For Target database, choose the database hudi_crawler_blog.
  6. For Target table, choose the table sample_hudi_mor_table_ro.
  7. For Column-level access, select Exclude columns, and choose the column price.
  8. For Row filter expression, enter true.
  9. Choose Create filter.

Grant Lake Formation permissions to the DataAnalyst user

Complete the following steps to grant Lake Formation permissions to the DataAnalyst user:

  1. On the Lake Formation console, choose Data lake permissions.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles, and choose the user DataAnalyst.
  4. For LF tags or catalog resources, choose Named data catalog resources.
  5. For Database, choose the database hudi_crawler_blog.
  6. For Table – optional, choose the table sample_hudi_mor_table_ro.
  7. For Data filters – optional, select exclude_product_price.
  8. For Data filter permissions, select Select.
  9. Choose Grant.

You granted Lake Formation permissions on the database hudi_crawler_blog and the table sample_hudi_mor_table_ro, excluding the price column, to the DataAnalyst user. Now let’s validate user access to the data using Athena.

  1. Sign in to the Athena console as the DataAnalyst user.
  2. On the query editor, run the following query:
    SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_ro" limit 10;

The following screenshot shows our output:

You have now validated that the price column is not shown, while the other columns (product_id, product_name, update_at, and category) are.

Clean up

To avoid unwanted charges to your AWS account, delete the following AWS resources:

  1. Delete AWS Glue database hudi_crawler_blog.
  2. Delete AWS Glue crawlers hudi_cow_crawler and hudi_mor_crawler.
  3. Delete Amazon S3 files under s3://your_s3_bucket/data/sample_hudi_cow_table/ and s3://your_s3_bucket/data/sample_hudi_mor_table/.

Conclusion

This post demonstrated how AWS Glue crawlers work for Hudi tables. With support for Hudi crawlers, you can quickly move to using the AWS Glue Data Catalog as your primary Hudi table catalog. You can start building your serverless transactional data lake using Hudi on AWS with AWS Glue, the AWS Glue Data Catalog, and Lake Formation fine-grained access controls for tables and formats supported by AWS analytics engines.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Kyle Duong is a Software Development Engineer on the AWS Glue and Lake Formation team. He is passionate about building big data technologies and distributed systems.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.