Tag Archives: AWS Analytics

Export JMX metrics from Kafka connectors in Amazon Managed Streaming for Apache Kafka Connect with a custom plugin

Post Syndicated from Jaydev Nath original https://aws.amazon.com/blogs/big-data/export-jmx-metrics-from-kafka-connectors-in-amazon-managed-streaming-for-apache-kafka-connect-with-a-custom-plugin/

Organizations use streaming applications to process and analyze data in real time and adopt the Amazon MSK Connect feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK) to run fully managed Kafka Connect workloads on AWS. Message brokers like Apache Kafka allow applications to handle large volumes and diverse types of data efficiently and enable timely decision-making and instant insights. It’s crucial to monitor the performance and health of each component to help ensure the seamless operation of data streaming pipelines.

Amazon MSK is a fully managed service that simplifies the deployment and operation of Apache Kafka clusters on AWS. It simplifies building and running applications that use Apache Kafka to process streaming data. Amazon MSK Connect simplifies the deployment, monitoring, and automatic scaling of connectors that transfer data between Apache Kafka clusters and external systems such as databases, file systems, and search indices. Amazon MSK Connect is fully compatible with Kafka Connect and supports Amazon MSK, Apache Kafka, and Apache Kafka compatible clusters. Amazon MSK Connect uses a custom plugin as the container for connector implementation logic.

Custom MSK connect plugins use Java Management Extensions (JMX) to expose runtime metrics. While Amazon MSK Connect sends a set of connect metrics to Amazon CloudWatch, it currently does not support exporting the JMX metrics emitted by the connector plugins natively. These metrics can be exported by modifying the custom connect plugin code directly, but it requires maintenance overhead because the plugin code needs to be modified every time it’s updated. In this post, we demonstrate an optimal approach by extending a custom connect plugin with additional modules to export JMX metrics and publish them to CloudWatch as custom metrics. These additional JMX metrics emitted by the custom connectors provide rich insights into their performance and health of the connectors. In this post, we demonstrate how you can export the JMX metrics for Debezium connector when used with MSK Connect.

Understanding JMX

Before we dive deep into exporting JMX metrics, let’s understand how JMX works. JMX is a technology that you can use to monitor and manage Java applications. Key components involved in JMX monitoring are:

  • Managed beans (MBeans) are Java objects that represent the metrics of the Java application being monitored. They contain the actual data points of the resources being monitored.
  • JMX server creates and registers the MBeans with the PlatformMBeanServer. The Java application that is being monitored acts as the JMX server and exposes the MBeans.
  • MBeanServer or JMX registry is the central registry that keeps track of all the registered MBeans in the JMX server. It is the access point for all the MBeans within the Java virtual machine (JVM).
  • JMXConnectorServer acts as a bridge between the JMX client and the JMX server and enables remote access to the exposed MBeans. JMXConnectorServerFactory creates and manages the JMXConnectorServer. It allows for the customization of the server’s properties and uses the JMXServiceURL to define the endpoint where the JMX client can connect to the JMX server.
  • JMXServiceURL provides the necessary information such as the protocol, host, and port for the client to connect to the JMX server and access the desired MBeans.
  • JMX client is an external application or tool that connect to the JMX server to access and monitor the exposed metrics.

JMX monitoring involves the steps shown in the following figure:

JMX architecture diagram showing connection flow from client to server with MBeans

JMX monitoring steps include:

  1. The Java application acting as the JMX server creates and configures MBeans for the desired metrics.
  2. JMX server registers the MBeans with the JMX registry.
  3. JMXConnectorServerFactory creates the JMXConnectorServer that defines the JMXServiceURL that provides the entry point details for the JMX client.
  4. JMXClient connects to the JMX registry in the JMX server using the JMXServiceURL and the JMXConnectorServer.
  5. The JMX server handles client requests, interacting with the JMX registry to retrieve the MBean data.

Solution overview

This method of wrapping supported Kafka connectors with custom code that exposes connector-specific operational metrics enables teams to get better insights by correlating various connector metrics with cloud-centered metrics in monitoring systems such as Amazon CloudWatch. This approach enables consistent monitoring across different components of the change data capture (CDC) pipeline, ultimately feeding metrics into unified dashboards while respecting each connector’s architectural philosophy. The consolidated metrics can be delivered to CloudWatch or the monitoring tool of your choice including partner specific application performance management (APM) tools such as Datadog, New Relic, and so on.

We have the working implementation of this same approach with two popular connectors: Debezium source connector and MongoDB Sink Connector. You can find the Github sample and ready to use plugins built for each in the repository. Review the README file for this custom implementation for more details.

For example, our custom implementation for the MongoDB Sink Connector adds a metrics export layer that calculates critical performance indicators such as latest-kafka-time-difference-ms – which measures the latency between Kafka message timestamps and connector processing time by subtracting the connector’s current clock time from the last received record’s timestamp. This custom wrapper around the MongoDB Sink Connector enables exporting relevant JMX metrics and publishing them as custom metrics to CloudWatch. We’ve open sourced this solution on GitHub, along with a ready-to-use plugin and detailed configuration guidance in the README.

CDC is the process of identifying and capturing changes made in a database and delivering those changes in real time to a downstream system. Debezium is an open source distributed platform built on top of Apache Kafka that provides CDC functionality. It provides a set of connectors to track and stream changes from databases to Kafka.

In the next section, we dive deep into the implementation details of how to export JMX metrics from Debezium MySQL Connector deployed as a custom plugin in Amazon MSK Connect. The connector plugin takes care of creating and configuring the MBeans and registering them with the JMX registry.

The following diagram shows the workflow of using Debezium MySQL Connector as a custom plugin in Amazon MSK Connect for CDC from an Amazon Aurora MySQL-Compatible Edition data source.

Data flow diagram illustrating custom Amazon MSK Connect plugin integrating Aurora, Kafka, and CloudWatch metrics

  1. MySQL binary log (binlog) is enabled in Amazon Aurora for MySQL to record all the operations in the order in which they are committed to the database.
  2. The Debezium connector plugin component of the MSK Connect custom plugin continuously monitors the MySQL database, captures the row-level changes by reading the MySQL bin logs, and streams them as change events to Kafka topics in Amazon MSK.
  3. We’ll build a custom module to enable JMX monitoring on the Debezium connector. This module will act as a JMX client to retrieve the JMX metrics from the connector and publish them as custom metrics to CloudWatch.

The Debezium connector provides three types of metrics in addition to the built-in support for default Kafka and Kafka Connect JMX metrics.

  • Snapshot metrics provide information about connector operation while performing a snapshot.
  • Streaming metrics provide information about connector operation when the connector is reading the binlog.
  • Schema history metrics provide information about the status of the connector’s schema history.

In this solution, we export the MilliSecondsBehindSource streaming metrics emitted by the Debezium MySQL connector. This metric provides the number of milliseconds that the connector is lagging behind the change events in the database.

Prerequisites

Following are the prerequisites you need:

  • Access to the AWS account where you want to set up this solution.
  • You have set up the source database and MSK cluster by following this setup instructions in the MSK Connect workshop.

Create a custom plugin

Creating a custom plugin for Amazon MSK Connect for the solution involves the following steps:

  1. Create a custom module: Create a new Maven module or project that will contain your custom code to:
    1. Enable JMX monitoring in the connector application by starting the JMX server.
    2. Create a Remote Method Invocation (RMI) registry to enable the access to the JMX metrics to the clients.
    3. Create a JMX metrics exporter to query the JMX metrics by connecting to the JMX server and push the metrics to CloudWatch as custom metrics.
    4. Schedule to run the JMX metrics exporter at a configured interval.
  2. Package and deploy the custom module as an MSK Connect custom plugin.
  3. Create a connector using the custom plugin to capture CDC from the source, stream it and validate the metrics in Amazon CloudWatch.

This custom module extends the connector functionality to export the JMX metrics without requiring any changes in the underlying connector implementation. This helps ensure that upgrading the custom plugin requires only upgrading the plugin version in the pom.xml of the custom module.

Let’s deep dive and understand the implementation of each step mentioned above.

1. Create a custom module

Create a new Maven project with dependencies on Debezium MySQL Connector to enable JMX monitoring, Kafka Connect API for configuration, and CloudWatch AWS SDK to push the metrics to CloudWatch.
Set up a JMX connector server to enable JMX monitoring: To enable JMX monitoring, the JMX server needs to be started at the time of initializing the connector. This is usually done by setting the environment variables with JMX options as described in Monitoring Debezium. In the case of an Amazon MSK Connect custom plugin, JMX monitoring is enabled programmatically at the time of connector plugin initialization. To achieve this:

  • Extend the MySqlConnector class and override the start which is the connector’s entry point to execute custom code.
public class DebeziumMySqlMetricsConnector extends MySqlConnector{
@Override
	public void start(Map<String, String> props) {
  • In the start method of the custom connector class (DebeziumMySqlMetricsConnector) that we are creating, set the following parameters to allow customization of the JMX Server properties by retrieving connector configuration from a config file.

connect.jmx.port – The port number on which the RMI registry needs to be created. JMXConnectorServer would listen to the incoming connections on this port.

database.server.name – Name of the database that is the source for the CDC.

It also retrieves the CloudWatch configuration related properties that will be used while pushing the JMX metrics to CloudWatch.

cloudwatch.namespace.name – CloudWatch NameSpace to which the metrics need to be pushed as custom metrics

cloudwatch.region – CloudWatch Region where the custom namespace is created in your AWS account

connectJMXPort = Integer.parseInt(props.getOrDefault(CONNECT_JMX_PORT_KEY, String.valueOf(DEFAULT_JMX_PORT)));
databaseServerName = props.getOrDefault(DATABASE_SERVER_NAME_KEY, "");
cwNameSpace = props.getOrDefault(CW_NAMESPACE_KEY, DEFAULT_CW_NAMESPACE);
cwRegion = props.getOrDefault(CW_REGION_KEY, null);
  • Create an RMI registry on the specified port (connectJMXPort). This registry is used by the JMXConnectorServer to store the RMI objects corresponding to the MBeans in the JMX registry. This allows the JMX clients to look up and access the MBeans on the PlatformMBeanServer.

LocateRegistry.createRegistry(connectJMXPort);

  • Retrieve the PlatformMBeanServer and construct the JMXServiceURL which is in the format service:jmx:rmi://localhost/jndi/rmi://localhost:<<jmx.port>>/jmxrmi. Create a new JMXConnectorServer instance using the JMXConnectorServerFactory and the JMXServiceURL and start the JMXConnectorServer instance.
MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
String jmxServiceURL = String.format(JMX_URL_TEMPLATE, connectJMXPort);
JMXServiceURL url = new JMXServiceURL(jmxServiceURL);
JMXConnectorServer svr = JMXConnectorServerFactory.newJMXConnectorServer(url, null, mbs);
svr.start();

Implement JMX metrics exporter: Create a JMX client to connect to the JMX server, query the MilliSecondBehindSource metric from the JMX server, convert it into the required format, and export it to CloudWatch.

  • Connect to the JMX Server using the JMXConnectorFactory and JMXServiceURL
JMXServiceURL jmxUrl = new JMXServiceURL(String.format(JMX_URL_TEMPLATE,DebeziumMySqlMetricsConnector.getConnectJMXPort()));
JMXConnector jmxConnector = JMXConnectorFactory.connect(jmxUrl, null);
jmxConnector.connect();
  • Query the MBean object that holds the corresponding metric, for example, MilliSecondsBehindSource, and retrieve the metric value using sample code provided in msk-connect-custom-plugin-jmx. (you can choose one or more metrics).
  • Schedule the execution of your JMX metrics exporter at regular intervals.

getScheduler().schedule(new JMXMetricsExporter(), SCHEDULER_INITIAL_DELAY, SCHEDULER_PERIOD);

Export metrics to CloudWatch: Implement the logic to push relevant JMX metrics to CloudWatch. You can use the AWS SDK for Java to interact with the CloudWatch PutMetricData API or use the CloudWatch Logs subscription filter to ingest the metrics from a dedicated Kafka topic.

Dimension dimension = Dimension.builder()
.name("DBServerName")
.value(DebeziumMySqlMetricsConnector.getDatabaseServerName())
.build();
MetricDatum datum = MetricDatum.builder()
	     .metricName("MilliSecondsBehindSource")
	     .unit(StandardUnit.NONE)
	     .value(Double.valueOf(msBehindSource))
	     .timestamp(instant)
	     .dimensions(dimension).build();
PutMetricDataRequest request = PutMetricDataRequest.builder()
	  .namespace(DebeziumMySqlMetricsConnector.getCWNameSpace())
	  .metricData(datum).build();
cw.putMetricData(request);

For more information, see the sample implementation for the custom module in aws-samples in GitHub. This sample also provides custom plugins packaged with two different versions of Debezium MySQL connector (debezium-connector-mysql-2.5.2.Final-plugin and debezium-connector-mysql-2.7.3.Final-plugin) and the following steps would explain the steps to build a custom plugin using your custom code.

2. Package the custom module and Debezium MySQL connector as a custom plugin

Build and package the Maven project with the custom code as a JAR file and include the JAR file in the debezium-connector-mysql-2.5.2.Final-plugin folder downloaded from maven repo. Package the updated debezium-connector-mysql-2.5.2.Final-plugin as a ZIP file (Amazon MSK Connect accepts custom plugins in ZIP or JAR format). Alternatively, you can use the prebuiltcustom-debezium-mysql-connector-plugin.zip available in GitHub.

Choose the Debezium connector version (2.5 or 2.7) that fits your requirement.

When you have to upgrade to a new version of the Debezium MySQL connector, you can update the version of the dependency and build the custom module and deploy it. By doing this, you can maintain the custom plugin without modifying the original connector code. The GitHub samples provide ready-to-use plugins for two Debezium connector versions. However, you can follow the same approach to upgrade to the latest connector version as well.

Create a custom plugin in Amazon MSK

  1. If you have set up your AWS resources by following the Getting Started lab, open Amazon S3 console and locate the bucket msk-lab-${ACCOUNT_ID}-plugins-bucket/debezium .
  2. Upload the custom plugin created in the previous section custom-debezium-mysql-connector-plugin.zip to msk-lab-${ACCOUNT_ID}-plugins-bucket/debezium, as shown in the following figure.

msk-lab-s3-plugin-bucket

  1. Switch to the Amazon MSK console and choose Custom plugins in the navigation pane. Choose Create custom plugin and, browse the S3 bucket that you created above and select the custom plugin ZIP file you just uploaded.

custom-connector-plugin-s3-object

  1. Enter custom-debezium-mysql-connector-plugin for the plugin name. Optionally, enter a description and choose Create Custom Plugin.

msk-connect-create-custom-plugin-console

  1. After a few seconds you should see the plugin is created and the status is Active.
  2. Customize the worker configuration for the connector by following the instructions in the Customize worker configuration lab.

3. Create an Amazon MSK connector

The next step is to create an MSK connector.

  1. From the MSK section choose Connectors, then choose Create connector. Choose custom-debezium-mysql-connector-plugin from the list of Custom plugins, then choose Next.

msk-plugin-create

  1. Enter custom-debezium-mysql-connector in the Name textbox, and a description for the connector.

connector-properties-console-in-MSK-connect

  1. Select the MSKCluster-msk-connect-lab from the listed MSK clusters. From the Authentication dropdown, select IAM.
  2. Copy the following configuration and paste it in the connector configuration textbox.
  • Replace the <Your Aurora MySQL database endpoint>, <Your Database Password>, <Your MSK Bootstrap Server Address>, and <Your CloudWatch Region>placeholders with the corresponding details for the resources in your account.
  • Review the topic.prefix, database.user, topic.prefix, database.server.id, database.server.name, database.port, database.include.listparameters in the configuration. These parameters are configured with the values used in the workshop. Update them with the details corresponding to your configuration if you have customized it in your account.
  • Note that the connector.classparameter is updated with the qualified name of the subclass of MySqlConnector class that you created in the custom module.
  • The connect.jmx.portparameter specifies the default port to start the JMX server. You can configure this to any available port.
connector.class=com.amazonaws.msk.debezium.mysql.connect.DebeziumMySqlMetricsConnector tasks.max=1
include.schema.changes=true
topic.prefix=salesdb
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
database.user=master
database.server.id=123456
database.server.name=salesdb
database.port=3306
key.converter.schemas.enable=false
database.hostname=<Your Aurora MySQL database endpoint>
database.password=<Your Database Password>
value.converter.schemas.enable=false
database.include.list=salesdb
schema.history.internal.kafka.topic=internal.dbhistory.salesdb
schema.history.internal.kafka.bootstrap.servers=<Your MSK Bootstrap Server Address>
schema.history.internal.producer.sasl.mechanism=AWS_MSK_IAM
schema.history.internal.consumer.sasl.mechanism=AWS_MSK_IAM
schema.history.internal.producer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
schema.history.internal.consumer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
schema.history.internal.producer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
schema.history.internal.consumer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
schema.history.internal.consumer.security.protocol=SASL_SSL
schema.history.internal.producer.security.protocol=SASL_SSL
connect.jmx.port=7098
cloudwatch.namespace.name=MSK_Connect
cloudwatch.region=<Your CloudWatch Region>

connector-properties-configuration-settings

5. Follow the remaining instructions from the Create MSK Connector lab and create the connector. Verify that the connector status changes to Running.

Debezium MySQL custom connector version (2.7.3) provides additional flexibility to configure optional properties that can be added to your MSK connector configuration and selectively include and exclude metrics to emit to CloudWatch. The following are the example configuration properties that can be used with version 2.7.3 :

  • cloudwatch.debezium.streaming.metrics.include – A comma-separated list of streaming metrics type that must be exported to CloudWatch as custom metrics.
  • cloudwatch.debezium.streaming.metrics.exclude – Specify a comma-separated list of streaming metrics types to exclude from being sent to CloudWatch as custom metrics.
  • Similarly include and exclude properties for snapshot metrics type are cloudwatch.debezium.snapshot.metrics.include and cloudwatch.debezium.snapshot.metrics.exclude
  • Include and exclude properties for schema history metrics type are cloudwatch.debezium.schema.history.metrics.include and cloudwatch.debezium.schema.history.metrics.exclude

The following is a sample configuration excerpt.

  "cloudwatch.debezium.streaming.metrics.include": "LastTransactionId, TotalNumberOfEventsSeen, MilliSecondsBehindSource,CapturedTables",
  "cloudwatch.debezium.streaming.metrics.exclude": "LastTransactionId",
  "cloudwatch.debezium.schema.history.metrics.exclude": "MilliSecondsSinceLastAppliedChange",

Review the GitHub README file for more details on the use of these properties with MSK connector configurations.

Verify the replication in the Kafka cluster and CloudWatch metrics

Follow the instructions in the Verify the replication in the Kafka cluster lab to set up a client and make changes to the source DB and verify that the changes are captured and sent to Kafka topics by the connector.

To verify that the connector has published the JMX metrics to CloudWatch, go to the CloudWatch console and choose Metrics in the navigation pane, then choose All Metrics. Under Custom namespace, you can see MSK_Connect with database name as the dimension. Select the database name to view the metrics.

Amazon CloudWatch interface with time series graph and MSK Connect metric details

Select the MilliSecondBehindSource metric with statistic as Average in the Graphed Metric to plot the graph. You can verify that the MilliSecondBehindSource metric value is greater than zero whenever any operation is being performed on the source database and returns to zero during the idle time.

 Amazon CloudWatch console showing custom metric visualization with detailed controls and timeline analysis

Clean up

Delete the resources that you created such as the Aurora DB, Amazon MSK Cluster and connectors by following the instructions at Cleanup in the Amazon MSK Connect lab if you have been following along to set up the solution on your account.

Conclusion

In this post, we showed you how to extend the Debezium MySQL connector plugin with an additional module to export the JMX metrics to CloudWatch as custom metrics. As a next step, you can create a CloudWatch alarm to monitor the metrics and take remediation actions when the alarm is triggered. In addition to exporting the JMX metrics to CloudWatch, you can export these metrics to third-party applications such as Prometheus or DataDog using CloudWatch Metric Streams. You can follow a similar approach to export the JMX metrics of other connectors from MSK Connect. You can learn more about creating your own connectors by visiting the Connector Developer Guide and how to deploy them as custom plugins in the MSK Connect documentation.


About the authors

Jaydev NathJaydev Nath is a Solutions Architect at AWS, where he works with ISV customers to build secure, scalable, reliable, and cost-efficient cloud solutions. He brings strong expertise in building SaaS architecture on AWS with a focus on Generative AI and data analytics technologies to help deliver practical, valuable business outcomes for customers.

David John Chakram is a Principal Solutions Architect at AWS. He specializes in building data platforms and architecting seamless data ecosystems. With a profound passion for databases, data analytics, and machine learning, he excels at transforming complex data challenges into innovative solutions and driving businesses forward with data-driven insights.

Sharmila Shanmugam is a Solutions Architect at Amazon Web Services. She is passionate about solving the customers’ business challenges with technology and automation and reduce the operational overhead. In her current role, she helps customers across industries in their digital transformation journey and build secure, scalable, performant and optimized workloads on AWS.

Use Databricks Unity Catalog Open APIs for Spark workloads on Amazon EMR

Post Syndicated from Venkat Viswanathan original https://aws.amazon.com/blogs/big-data/use-databricks-unity-catalog-open-apis-for-spark-workloads-on-amazon-emr/

This post was written with John Spencer, Sreeram Thoom, and Dipankar Kushari from Databricks.

Organizations need seamless access to data across multiple platforms and business units. A common scenario involves one team using Amazon EMR for data processing while needing to access data that another team manages in Databricks Unity Catalog. Traditionally, this would require data duplication or complex manual setups.

Although both Amazon EMR and Databricks Unity Catalog are powerful tools on their own, integrating them effectively is crucial for maintaining strong data governance, security, and operational efficiency. In this post, we demonstrate how to achieve this integration using Amazon EMR Serverless, though the approach works well with other Amazon EMR deployment options and Unity Catalog OSS.

EMR Serverless makes running big data analytics frameworks straightforward by offering a serverless option that automatically provisions and manages the infrastructure required to run big data applications. Teams can run Apache Spark and other workloads without the complexity of cluster management, while providing cost-effective scaling based on actual workload demands and seamless integration with AWS services and security controls.

Databricks Unity Catalog serves as a unified governance solution for data and AI assets, providing centralized access control and auditing capabilities. It enables fine-grained permissions across workspaces and cloud platforms, while supporting comprehensive metadata management and data discovery across the organization, and can complement governance tools like AWS Lake Formation.

To enable Amazon EMR to process data maintained in Unity Catalog, the data team traditionally copies data products across the platforms to a location accessible by Amazon EMR. The practice of data duplication not only leads to increased storage costs, but also severely impacts data quality and makes it challenging to effectively enforce same governance policies across different systems, track data lineage, enforce data retention policies, and maintain consistent access controls across the organization.

Now using Unity Catalog’s Open REST APIs, Amazon EMR customers can read from and write to Databricks Unity Catalog and Unity Catalog OSS tables using Spark, enabling cross-platform interoperability while maintaining governance and access controls across Amazon EMR and Unity Catalog.

Solution overview

In this post, we will provide an overview of EMR Spark workload integration with Databricks Unity Catalog and walk through the end-to-end process of reading from and writing to Databricks Unity Catalog tables using Amazon EMR and Spark. We show you how to configure EMR Serverless to interact with Databricks Unity Catalog, run an interactive Spark workload to access the data, and run an analysis to derive insights.

The following diagram illustrates the solution architecture.

Prerequisites

You must have the following prerequisites:

In the following sections, we walk through the process of reading and writing to Unity Catalog with EMR Serverless.

Enable Unity Catalog for external access

Log in to your workspace as a Databricks admin and complete the following steps to configure external access to read Databricks objects:

  1. Enable external data access for your metastore. For instructions, see Enable external data access on the metastore.
  2. Set up a principal that will be configured with Amazon EMR for data access.
  3. Grant the principal the privilege to configure the integration of the EXTERNAL USE SCHEMA privilege on the schema containing the objects. For instructions, see Grant a principal EXTERNAL USE SCHEMA.
  4. For this post, we generate a Databricks personal access token (PAT) for the principal and note it down. For instructions, refer to Authorizing access to Databricks resources and Databricks personal access token authentication.

For a production deployment, store the PAT in AWS Secrets Manager. You can use it in a later step to read and write to Unity Catalog with Amazon EMR.

Configure EMR Spark to access Unity Catalog

In this walkthrough, we run PySpark interactive queries through notebooks using EMR Studio. Complete the following steps:

  1. Open the AWS Management Console with administrator permission.
  2. Create an EMR Studio to run interactive workloads. To create a workspace, you need to specify the S3 bucket created in the prerequisites and the minimum service role for EMR Serverless. For instructions, see Set up an EMR Studio.
  3. For this post, we create two EMR Serverless applications. For instructions, see Creating an EMR Serverless application from the EMR Studio console.
    1. For Iceberg tables, create an EMR Serverless application called dbx-demo-application-iceberg with version 7.8.0 or higher. Make sure to deselect Use AWS Glue Data Catalog as Metastore under Additional Configurations, Metastore configuration. Add the following Spark configuration (see Configure applications). Provide the name of the catalog in Unity Catalog that contains your tables and the URL of the Databricks workspace.
      {
          "runtimeConfiguration": [
              {
                  "classification": "spark-defaults",
                  "properties": {
                      "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
                      "spark.jars.packages": "io.unitycatalog:unitycatalog-spark_2.12:0.2.0",
                      "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
                      "spark.sql.defaultCatalog": "<uc-catalog-name>",
                      "spark.sql.catalog.<uc-catalog-name>": "org.apache.iceberg.spark.SparkCatalog",
                      "spark.sql.catalog.<uc-catalog-name>.uri": "https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest",
                      "spark.sql.catalog.<uc-catalog-name>.type": "rest",
                      "spark.sql.catalog.<uc-catalog-name>.warehouse": "<uc-catalog-name>"
                  }
              }
          ]
      }

    2. For Delta tables, create an EMR Serverless application called dbx-demo-application and version 7.8.0 or higher. Make sure to deselect Use AWS Glue Data Catalog as Metastore under Additional Configurations, Metastore configuration. Add the following Spark configuration (see Configure applications). Provide the name of the catalog in Unity Catalog that contains your tables and the URL of the Databricks workspace.
      {
          "runtimeConfiguration": [
              {
                  "classification": "spark-defaults",
                  "properties": {
                      "spark.jars": "/usr/share/aws/delta/lib/delta-spark.jar,/usr/share/aws/delta/lib/delta- storage.jar",
                      "spark.jars.packages": "io.unitycatalog:unitycatalog-spark_2.12:0.2.0",
                      "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
                      "spark.sql.defaultCatalog": "<uc-catalog-name>",
                      "spark.sql.catalog.spark_catalog": "io.unitycatalog.spark.UCSingleCatalog",
                      "spark.sql.catalog.<uc-catalog-name>": "io.unitycatalog.spark.UCSingleCatalog",
                      "spark.sql.catalog.<uc-catalog-name>.uri": "https://<workspace-url>/api/2.1/unity-catalog"
                  }
              }
          ]
      }

  4. To set up your interactive workload with a runtime role, see Run interactive workloads with EMR Serverless through EMR Studio.

Read and write to Unity Catalog with Amazon EMR

Launch the workspace created in the previous step. Download the notebooks create-delta-table and create-iceberg-table and upload them to the EMR Studio workspace.

The create-delta-table.ipynb notebook configures the metastore properties to work with Delta tables. The create-iceberg-table.ipynb notebook configures the metastore properties to work with Iceberg tables.

Add the generated token to the session.

For a production deployment, store the PAT in Secrets Manager.

For Iceberg tables, connect to the EMR Serverless application dbx-demo-application-iceberg with the runtime role created in earlier steps under compute and run the notebook (create-iceberg-table). Select PySpark as the kernel and execute each cell in the notebook by choosing the run icon. Refer to Submit a job run or interactive workload for further details about how to run an interactive notebook.

We use the following code to create an external Iceberg table in the catalog:

CREATE SCHEMA IF NOT EXISTS customerschema;
USE SCHEMA customerschema;
CREATE TABLE IF NOT EXISTS iceberg_customer (id string, name string, country string) USING iceberg; 
insert into iceberg customer values('1','Alice','US');

For Delta tables, connect to the EMR Serverless application dbx-demo-application with the runtime role created in earlier steps and run the notebook (create-delta-table). Select PySpark as the kernel and execute each cell in the notebook by choosing the run icon. Refer to Submit a job run or interactive workload for further details about how to run an interactive notebook.

We use the following code to create an external Delta table in the catalog:

CREATE SCHEMA IF NOT EXISTS customerschema;
USE SCHEMA customerschema;
CREATE TABLE IF NOT EXISTS delta_customer (id int, name string, country string) USING delta LOCATION ‘s3://<bucket_name>/emr-dbx/external/customerschema/delta_customer’;
insert into delta_customer values(1,'Bob','US'); 

Verify in Databricks for both Iceberg and Delta tables

Now you can run queries in Databricks Unity Catalog to show the records inserted into the Iceberg and Delta tables from EMR Serverless:

  1. Log in to your Databricks workspace.
  2. Choose SQL Editor in the navigation pane.
  3. Run queries for both Iceberg and Delta tables.
  4. Verify the results show the same as what you saw in the Jupyter notebook in EMR Studio.

The following screenshot shows an example of querying the Iceberg table.

The following screenshot shows an example of querying the Delta table.

Clean up

Clean up the resources used in this post to avoid additional charges:

  1. Delete the IAM roles for this post.
  2. Delete the EMR applications and EMR Studio setup created for this post.
  3. Delete the resources created in Unity Catalog.
  4. Empty and then delete the S3 bucket.

Summary

In this post, we demonstrated the powerful interoperability between Amazon EMR and Databricks Unity Catalog by walking through how to enable external access to Unity Catalog, configure EMR Spark to connect seamlessly with Unity Catalog, and perform DML and DDL operations on Unity Catalog tables using EMR Serverless.

To learn more about using EMR Serverless, see Getting started with Amazon EMR Serverless. To learn more about using tools like EMR Spark with Unity Catalog, see Unity Catalog integrations.


About the authors

Venkatavaradhan (Venkat) Viswanathan is a Global Partner Solutions Architect at Amazon Web Services. Venkat is a Technology Strategy Leader in Data, AI, ML, generative AI, and Advanced Analytics. Venkat is a Global SME for Databricks and helps AWS customers design, build, secure, and optimize Databricks workloads on AWS.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.

Ramkumar Nottath is a Principal Solutions Architect at AWS focusing on Analytics services. He enjoys working with various customers to help them build scalable, reliable big data and analytics solutions. His interests extend to various technologies such as analytics, data warehousing, streaming, data governance, and machine learning. He loves spending time with his family and friends.

John Spencer is a Product Manager at Databricks, dedicated to making Unity Catalog work seamlessly with customers’ ecosystems of tools and platforms so they can easily access, govern, and use their data.

Sreeram Thoom is a Specialist Solutions Architect at Databricks helping customers design secure, scalable applications on the Data Lakehouse.

Dipankar Kushari is a specialist solutions architect at Databricks helping customer architect and build secured applications on Data Lakehouse.

Read and write Apache Iceberg tables using AWS Lake Formation hybrid access mode

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/read-and-write-apache-iceberg-tables-using-aws-lake-formation-hybrid-access-mode/

Enterprises are adopting Apache Iceberg table format for its multitude of benefits. The change data capture (CDC), ACID compliance, and schema evolution features cater to representing big datasets that receive new records at a fast pace. In an earlier blog post, we discussed how to implement fine-grained access control in Amazon EMR Serverless using AWS Lake Formation for reads. Lake Formation helps you centrally manage and scale fine-grained data access permissions and share data with confidence within and outside your organization.

In this post, we demonstrate how to use Lake Formation for read access while continuing to use AWS Identity and Access Management (IAM) policy-based permissions for write workloads that update the schema and upsert (insert and update combined) data records into the Iceberg tables. The bimodal permissions are needed to support existing data pipelines that use only IAM and Amazon Simple Storage Service (Amazon) S3 bucket policy-based permissions and to support table operations that are not yet available in the analytics engines. The two-way permission is achieved by registering the Amazon S3 data location of the Iceberg table with Lake Formation in hybrid access mode. Lake Formation hybrid access mode allows you to onboard new users with Lake Formation permissions to access AWS Glue Data Catalog tables with minimal interruptions to existing IAM policy-based users. With this solution, organizations can use the Lake Formation permissions to scale the access of their existing Iceberg tables in Amazon S3 to new readers. You can extend the methodology to other open table formats, such as Linux Foundation Delta Lake tables and Apache Hudi tables.

Key use cases for Lake Formation hybrid access mode

Lake Formation hybrid access mode is useful in the following use cases:

  • Avoiding data replication – Hybrid access mode helps onboard new users with Lake Formation permissions on existing Data Catalog tables. For example, you can enable a subset of data access (coarse vs. fine-grained access) for various user personas, such as data scientists and data analysts, without making multiple copies of the data. This also helps maintain a single source of truth for production and business insights.
  • Minimal interruption to existing IAM policy-based user access – With hybrid access mode, you can add new Lake Formation managed users with minimal disruptions to your existing IAM and Data Catalog policy-based user access. Both access methods can coexist for the same catalog table, but each user can have only one mode of permissions.
  • Transactional table writes – Certain write operations like insert, update, and delete are not supported by Amazon EMR for Lake Formation managed Iceberg tables. Refer to Considerations and limitations for additional details. Although you could use Lake Formation permissions for Iceberg table read operations, you could manage the write operations as the table owners with IAM policy-based access.

Solution overview

An example Enterprise Corp has a large number of Iceberg tables based on Amazon S3. They are currently managing the Iceberg tables manually with IAM policy, Data Catalog resource policy, and S3 bucket policy-based access in their organization. They want to share their transactional data of Iceberg tables across different teams, such as data analysts and data scientists, asking for read access across a few lines of business. While maintaining the ownership of the table’s updates to their single team, they want to provide restricted read access to certain columns of their tables. This is achieved by using the hybrid access mode feature of Lake Formation.

In this post, we illustrate the scenario with a data engineer team and a new data analyst team. The data engineering team owns the extract, transform, and load (ETL) application that will process the raw data to create and maintain the Iceberg tables. The data analyst team will query the tables to gather business insights from those tables. The ETL application will use IAM role-based access to the Iceberg table, and the data analyst gets Lake Formation permissions to query the same tables.

The solution can be visually represented in the following diagram.

Solution Overview

For ease of illustration, we use only one AWS account in this post. Enterprise use cases typically have multiple accounts or cross-account access requirements. The setup of the Iceberg tables, Lake Formation permissions, and IAM based permissions are similar for multiple and cross-account scenarios.

The high-level steps involved in the permissions setup are as follows:

  1. Make sure that IAMAllowedPrincipals has Super access to the database and tables in Lake Formation. IAMAllowedPrincipals is a virtual group that represents any IAM principal permissions. Super access to this virtual group is required to make sure that IAM policy-based permissions to any IAM principal continues to work.
  2. Register the data location with Lake Formation in hybrid access mode.
  3. Grant DATA LOCATION permission to the IAM role that manages the table with IAM policy-based permissions. Without the DATA LOCATION permission, write workloads will fail. Test the access to the table by writing new records to the table as the IAM role.
  4. Add SELECT table permissions to the Data-Analyst role in Lake Formation.
  5. Opt-in the Data-Analyst to the Iceberg table, making the Lake Formation permissions effective for the analyst.
  6. Test access to the table as the Data-Analyst by running SELECT queries in Athena.
  7. Test the table write operations by adding new records to the table as ETL-application-role using EMR Serverless.
  8. Read the latest update, again, as Data-Analyst.

Prerequisites

You should have the following prerequisites:

  • An AWS account with a Lake Formation administrator configured. Refer to Data lake administrator permissions and Set up AWS Lake Formation. You can also refer to Simplify data access for your enterprise using Amazon SageMaker Lakehouse for the Lake Formation admin setup in your AWS account. For ease of demonstration, we have used an IAM admin role added as a Lake Formation administrator.
  • An S3 bucket to host the sample Iceberg table data and metadata.
  • An IAM role to register your Iceberg table Amazon S3 location with Lake Formation. Follow the policy and trust policy details for a user-defined role creation from Requirements for roles used to register locations.
  • An IAM role named ETL-application-role, which will be the runtime role to execute jobs in EMR Serverless. The minimum policy required is shown in the following code snippet. Replace the Amazon S3 data location of the Iceberg table, database name, and AWS Key Management Service (AWS KMS) key ID with your own. For additional details on the role setup, refer to Job runtime roles for Amazon EMR Serverless. This role can insert, update, and delete data in the table.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "IcebergDataAccessInS3",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:ListAllMyBuckets",
                    "s3:Get*",
                    "s3:Put*",
                    "s3:Delete*"
                ],
                "Resource": [
                    "arn:aws:s3:::your-iceberg-data-bucket-name",
                    "arn:aws:s3:::your-iceberg-data-bucket-name/*"
                ]
            },
            {
                "Sid": "GlueCatalogApiPermissions",
                "Effect": "Allow",
                "Action": [
                    "glue:*"
                ],
                "Resource": [
                    "arn:aws:glue:your-Region:account-id:catalog",
                    "arn:aws:glue:your-Region:account-id:database/iceberg-database-name",
                    "arn:aws:glue:your-Region:account-id:database/default",
                    "arn:aws:glue:your-Region:account-id:table/*/*"
                ]
            },
            {
                "Sid": "KmsKeyPermissions",
                "Effect": "Allow",
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:ReEncrypt*",
                    "kms:GenerateDataKey",
                    "kms:DescribeKey",
                    "kms:ListKeys",
                    "kms:ListAliases"
                ],
                "Resource": [
                    "arn:aws:kms:your-Region:account-id:key/your-key-id"
                ]
            }
        ]
    }

    Add the following trust policy to the role:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "emr-serverless.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  • An IAM role called Data-Analyst, to represent the data analyst access. Use the following policy to create the role. Also attach the AWS managed policy arn:aws:iam::aws:policy/AmazonAthenaFullAccess to the role, to allow querying the Iceberg table using Amazon Athena. Refer to Data engineer permissions for additional details about this role.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFBasicUser",
                "Effect": "Allow",
                "Action": [
                    "glue:GetCatalog",
                    "glue:GetCatalogs",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetTableVersion",
                    "glue:GetTableVersions",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                    "lakeformation:GetDataAccess"
                ],
                "Resource": "*"
            },
            {
                "Sid": "AthenaResultsBucket",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:Put*",
                    "s3:Get*",
                    "s3:Delete*"
                ],
                "Resource": [
                    "arn:aws:s3:::your-bucket-name-prefix",
                    "arn:aws:s3:::your-bucket-name-prefix/*"
                ]
            }
        ]
    }

    Add the following trust policy to the role:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<your_account_id>:root"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

Create the Iceberg table

Complete the following steps to create the Iceberg table:

  1. Sign in to the Lake Formation console as the admin role.
  2. In the navigation pane under Data Catalog, choose Databases.
  3. From the Create dropdown menu, create a database named iceberg_db. You can leave the Amazon S3 location property empty for the database.
  4. On the Athena console, run the following provided queries. The queries perform the following operations:
    1. Create a table called customer_csv, pointing to the customer dataset in the public S3 bucket.
    2. Create an Iceberg table called customer_iceberg, pointing to your S3 bucket location that will host the Iceberg table data and metadata.
    3. Insert data from the CSV table to the Iceberg table.
      CREATE EXTERNAL TABLE `iceberg_db`.`customer_csv`(
        `c_customer_sk` int,
        `c_customer_id` string,
        `c_current_cdemo_sk` int,
        `c_current_hdemo_sk` int,
        `c_current_addr_sk` int,
        `c_first_shipto_date_sk` int,
        `c_first_sales_date_sk` int,
        `c_salutation` string,
        `c_first_name` string,
        `c_last_name` string,
        `c_preferred_cust_flag` string,
        `c_birth_day` int,
        `c_birth_month` int,
        `c_birth_year` int,
        `c_birth_country` string,
        `c_login` string,
        `c_email_address` string,
        `c_last_review_date` string)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '|'
      STORED AS INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION
        ' s3://redshift-downloads/TPC-DS/2.13/10GB/customer/'
      TBLPROPERTIES (
        'classification'='csv');   
      
       SELECT * FROM customer_csv LIMIT 5; //verifies table data  
      
      CREATE TABLE IF NOT EXISTS iceberg_db.customer_iceberg (
              c_customer_sk             int,
              c_customer_id             string,
              c_current_cdemo_sk        int,
              c_current_hdemo_sk        int,
              c_current_addr_sk         int,
              c_first_shipto_date_sk    int,
              c_first_sales_date_sk     int,
              c_salutation              string,
              c_first_name              string,
              c_last_name               string,
              c_preferred_cust_flag     string,
              c_birth_day               int,
              c_birth_month             int,
              c_birth_year              int,
              c_birth_country           string,
              c_login                   string,
              c_email_address           string,
              c_last_review_date        string
          )
      LOCATION 's3://your-iceberg-data-bucket-name/path/'
      TBLPROPERTIES ( 'table_type' = 'ICEBERG' );
      
      INSERT INTO customer_iceberg
      SELECT *
      FROM customer_csv;  
      
      SELECT * FROM customer_iceberg LIMIT 5; //verifies table data

Set up the Iceberg table as a hybrid access mode resource

Complete the following steps to set up the Iceberg table’s Amazon S3 data location as hybrid access mode in Lake Formation:

  1. Register your table location with Lake Formation:
    1. Sign in to the Lake Formation console as data lake administrator.
    2. In the navigation pane, choose Data lake Locations.
    3. For Amazon S3 path, provide the S3 prefix of your Iceberg table location that holds both the data and metadata of the table.
    4. For IAM role, provide the user-defined role that has permissions to your Iceberg table’s Amazon S3 location and that you created according to the prerequisites. For more details, refer to Registering an Amazon S3 location.
    5. For Permission mode, select Hybrid access mode.
    6. Choose Register location to register your Iceberg table Amazon S3 location with Lake Formation.

  1. Add data location permission to ETL-application-role:
    1. In the navigation pane, choose Data locations.
    2. For IAM users and roles, choose ETL-application-role.
    3. For Storage location, provide the S3 prefix of your Iceberg table.
    4. Choose Grant.

Data location permission is required for write operations to the Iceberg table location only if the Iceberg table’s S3 prefix is a child location of the database’s Amazon S3 location property.

  1. Grant Super access on the Iceberg database and table to IAMAllowedPrincipals:
    1. In the navigation pane, choose Data permissions.
    2. Choose IAM users and roles and choose IAMAllowedPrincipals.
    3. For LF-Tags or catalog resources, choose Named Data Catalog resources.
    4. Under Databases, select the name of your Iceberg table’s database.
    5. Under Database permissions, select Super.
    6. Choose Grant.

    7. Repeat the preceding steps and for Tables – optional, choose the Iceberg table.
    8. Under Table permissions, select Super.
    9. Choose Grant.

  1. Add database and table permissions to the Data-Analyst role:
    1. Repeat the steps in Step 3 to grant permissions for the Data-Analyst role, once for database-level permission and once for table-level permission.
    2. Select Describe permissions for the Iceberg database.
    3. Select Select permissions for the Iceberg table.
    4. Under Hybrid access mode, select Make Lake Formation permissions effective immediately.
    5. Choose Grant.

The following screenshots show the database permissions for Data-Analyst.

The following screenshots show the table permissions for Data-Analyst.

  1. Verify Lake Formation permissions on the Iceberg table and database to both Data-Analyst and IAMAllowedPrincipals:
    1. In the navigation pane, choose Data permissions.
    2. Filter by Table= customer_iceberg.
      You should see IAMAllowedPrincipals with All permission and Data-Analyst with Select permission.
    3. Similarly, verify permissions for the database by filtering database=iceberg_db.

You should see IAMAllowedPrincipals with All permission and Data-Analyst with Describe permission.

  1. Verify Lake Formation opt-in for Data-Analyst:
    1. In the navigation pane, choose Hybrid access mode.

You should see Data-Analyst opted-in for both database and table level permissions.

Query the table as the Data-Analyst role in Athena

While you are logged in to the AWS Management Console as admin, set up the Athena query results bucket:

  1. On the console navigation bar, choose your user name.
  2. Choose Switch role to switch to the Data-Analyst role.
  3. Enter your account ID, IAM role name (Data-Analyst), and choose Switch Role.
  4. Now that you’re logged in as the Data-Analyst role, open the Athena console and set up the Athena query results bucket.
  5. Run the following query to read the Iceberg table. This verifies the Select permission granted to the Data-Analyst role in Lake Formation.
SELECT * FROM "iceberg_db"."customer_iceberg"
WHERE c_customer_sk = 247

Upsert data as ETL-application-role using Amazon EMR

To upsert data to Lake Formation enabled Iceberg tables, we will use Amazon EMR Studio, which is an integrated development environment (IDE) that makes it straightforward for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. EMR Studio will be our web-based IDE to run our notebooks, and we will use EMR Serverless as the compute engine. EMR Serverless is a deployment option for Amazon EMR that provides a serverless runtime environment. For the steps to run an interactive notebook, see Submit a job run or interactive workload.

  1. Sign out of the AWS console as Data-Analyst and log back or switch the user to admin.
  2. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
  3. Choose Get started.
  4. For first-time users, Amazon EMR allows creation of an EMR Studio without a virtual private cloud (VPC). Create an EMR Serverless application as follows:
    1. Provide a name for the EMR Serverless application, such as DemoHybridAccess.
    2. Under Application setup, choose Use default settings for interactive workloads.
    3. Choose Create and start application.

The next step is to create an EMR Studio.

  1. On the Amazon EMR console, choose Studio under EMR Studio in the navigation pane.
  2. Choose Create Studio.
  3. Select Interactive workloads.
  4. You should see a default pre-populated section. Keep these default settings and choose Create Studio and launch Workspace.

  1. After the workspace is launched, attach the EMR Serverless application created earlier and select ETL-application-role as the runtime role under Compute.

  1. Download the notebook Iceberg-hybridaccess_final.ipynb and upload it to EMR Studio workspace.

This notebook configures the metastore properties to work with Iceberg tables. (For more details, see Using Apache Iceberg with EMR Serverless.) Then it performs insert, update, and delete operations in the Iceberg table. It also verifies if the operations are successful by reading the newly added data.

  1. Select PySpark as the kernel and execute each cell in the notebook by choosing the run icon.

Refer to Submit a job run or interactive workload for further details about how to run an interactive notebook.

The following screenshot shows that the Iceberg table insert operation completed successfully.

The following screenshot illustrates running the update statement on the Iceberg table in the notebook.

The following screenshot shows that the Iceberg table delete operation completed successfully.

Query the table again as Data-Analyst using Athena

Complete the following steps:

  1. Switch your role to Data-Analyst on the AWS console.
  2. Run the following query on the Iceberg table and read the row that was updated by the EMR cluster:
    SELECT * FROM "iceberg_db"."customer_iceberg"
    WHERE c_customer_sk = 247

The following screenshot shows the results. As we can see, ‘c_first_name’ column is updated with new value.

Clean up

To avoid incurring costs, clean up the resources you used for this post:

  1. Revoke the Lake Formation permissions and hybrid access mode opt-in granted to the Data-Analyst role and IAMAllowedPrincipals.
  2. Revoke the registration of the S3 bucket to Lake Formation.
  3. Delete the Athena query results from your S3 bucket.
  4. Delete the EMR Serverless resources.
  5. Delete Data-Analyst role and ETL-application-role from IAM.

Conclusion

In this post, we demonstrated how to scale the adoption and use of Iceberg tables using Lake Formation permissions for read workloads, while maintaining full control over table schema and data updates through IAM policy-based permissions for the table owners. The methodology also applies to other open table formats and standard Data Catalog tables, but the Apache Spark configuration for each open table format will vary.

Hybrid access mode in Lake Formation is an option you could use to adopt Lake Formation permissions gradually and scale those use cases that support Lake Formation permissions while using IAM based permissions for the use cases that don’t. We encourage you to try out this setup in your environment. Please share your feedback and any additional topics you would like to see in the comments section.


About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She collaborates with the service team to enhance product features, works with AWS customers and partners to architect lake house solutions, and establishes best practices.

Parul Saxena is a Senior Big Data Specialist Solutions Architect in AWS. She helps customers and partners build highly optimized, scalable, and secure solutions. She specializes in Amazon EMR, Amazon Athena, and AWS Lake Formation, providing architectural guidance for complex big data workloads and assisting organizations in modernizing their architectures and migrating analytics workloads to AWS.

Unlock the power of optimization in Amazon Redshift Serverless

Post Syndicated from Ricardo Serafim original https://aws.amazon.com/blogs/big-data/unlock-the-power-of-optimization-in-amazon-redshift-serverless/

Amazon Redshift Serverless automatically scales compute capacity to match workload demands, measuring this capacity in Redshift Processing Units (RPUs). Although traditional scaling primarily responds to query queue times, the new AI-driven scaling and optimization feature offers a more sophisticated approach by considering multiple factors including query complexity and data volume. Intelligent scaling addresses key data warehouse challenges by preventing both over-provisioning of resources for performance and under-provisioning to save costs, particularly for workloads that fluctuate based on daily patterns or monthly cycles.

Amazon Redshift serverless now offers enhanced flexibility in configuring workgroups through two primary methods. Users can either set a base capacity, specifying the baseline RPUs for query execution, with options ranging from 8 to 1024 RPUs and each RPU providing 16 GB of memory, or they can opt for the price-performance target. Amazon Redshift Serverless AI-driven scaling and optimization can adapt more precisely to diverse workload requirements and employs intelligent resource management, automatically adjusting resources during query execution for optimal performance. Consider using AI-driven scaling and optimization if your current workload requires 32 to 512 base RPUs. We don’t recommend using this feature for less than 32 base RPU or more than 512 base RPU workloads.

In this post, we demonstrate how Amazon Redshift Serverless AI-driven scaling and optimization impacts performance and cost across different optimization profiles.

Options in AI-driven scaling and optimization

Amazon Redshift Serverless AI-driven scaling and optimization offers an intuitive slider interface, letting you balance price and performance goals. You can select from five optimization profiles, ranging from Optimized for Cost to Optimized for Performance, as shown in the following diagram. Your slider position determines how Amazon Redshift allocates resources and implements AI-driven scaling and optimizations, to achieve your desired price-performance target.

Sliding bar

The slider offers the following options:

  1. Optimized for Cost (1)
    • Prioritizes cost savings over performance
    • Allocates minimum resources in favor of saving on costs
    • Best for workloads where performance isn’t time-critical
  2. Cost-Balanced (25)
    • Balances towards cost savings while maintaining reasonable performance
    • Allocates moderate resources
    • Suitable for mixed workloads with some flexibility in query time
  3. Balanced (50)
    • Provides equal emphasis on cost efficiency and performance
    • Allocates optimal resources for most use cases
    • Ideal for general-purpose workloads
  4. Performance-Balanced (75)
    • Favors performance while maintaining some cost control
    • Allocates additional resources when needed
    • Suitable for workloads requiring consistently fast query elapsed time
  5. Optimized for Performance (100)
    • Maximizes performance regardless of cost
    • Provides maximum available resources
    • Best for time-critical workloads requiring fastest possible query delivery

Which workloads to consider for AI-driven scaling and optimizations

The Amazon Redshift Serverless AI-driven scaling and optimization capabilities can be applied to almost every analytical workload. Amazon Redshift will assess and apply optimizations according to your price-performance target—cost, balance, or performance.

Most analytical workloads operate on millions or even billions of rows and generate aggregations and complex calculations. These workloads have high variability for query patterns and number of queries. The Amazon Redshift Serverless AI-driven scaling and optimization will improve the price, performance, or both because it learns the patterns (the repeatability of your workload) and will allocate more resources towards performance improvements if you’re performance-focused or fewer resources if you’re cost-focused.

Cost-effectiveness of AI-driven scaling and optimization

To effectively determine the effectiveness of Amazon Redshift Serverless AI-driven scaling and optimization we need to be able to measure your current state of price-performance. We encourage you to measure your current price-performance by using sys_query_history to calculate the total elapsed time of your workload and note the start time and end time. Then use sys_serverless_usage to calculate the cost. You can use the query from the Amazon Redshift documentation and add the same start and end times. This will establish your current price performance, and now you have a baseline to compare against.

If such measurement isn’t practical because your workloads are continuously running and it’s impractical for you to determine a fixed start and end time, then another way is to compare holistically, check your month over month cost, check your user sentiment towards performance, towards system stability, improvements in data delivery, or reduction in overall monthly processing times.

Benchmark conducted and results

We evaluated the optimization options using the TPCDS 3TB dataset from the AWS Labs GitHub repository (amazon-redshift-utils). We deployed this dataset across three Amazon Redshift Serverless workgroups configured as Optimized for Cost, Balanced, and Optimized for Performance. To create a realistic reporting environment, we configured three Amazon Elastic Compute Cloud (Amazon EC2) instances with JMeter (one per endpoint) and ran 15 selected TPCDS queries concurrently for approximately 1 hour, as shown in the following screenshot.

We disabled the result cache to make sure Amazon Redshift Serverless ran all queries directly, providing accurate measurements. This setup helped us capture authentic performance characteristics across each optimization profile. Also, we designed our test environment without setting the Amazon Redshift Serverless workgroup max capacity parameter—a key configuration that controls the maximum RPUs available to your data warehouse. By removing this limit, we could clearly showcase how different configurations affect scaling behavior in our test endpoints.

Jmeter

Our comprehensive test plan included running each of the 15 queries 355 times, generating 5,325 queries per test cycle. The AI-driven scaling and optimization needs multiple iterations to identify patterns and optimize RPUs, so we ran this workload 10 times. Through these repetitions, the AI learned and adapted its behavior, processing a total of 53,250 queries throughout our testing period.

The testing revealed how the AI-driven scaling and optimization system adapts and optimizes performance across three distinct configuration profiles: Optimized for Cost, Balanced, and Optimized for Performance.

Queries and elapsed time

Although we ran the same core workload repeatedly, we used variable parameters in JMeter to generate different values for the WHERE clause conditions. This approach created similar but not identical workloads, introducing natural variations that showed how the system handles real-world scenarios with varying query patterns.

Our elapsed time analysis demonstrates how each configuration achieved its performance objectives, as shown by the average consumption metrics for each endpoint, as shown in the following screenshot.

Average Elapsed Time per Endpoint

The results matched our expectations: the Optimized for Performance configuration delivered significant speed improvements, running queries approximately two times as the Balanced configuration and four times as the Optimized for Cost setup.

The following screenshots show the elapsed time breakdown for each test.

Optimized for Cost - Elapsed Time Balanced - Elapsed Time Optimized for Performance - Elapsed Time

The following screenshot shows tenth and final test iteration demonstrates distinct performance differences across configurations.

Per Configuration - Elapsed Time

To clarify more, we categorized our query elapsed times into three groups:

  • Short queries – Less than 10 seconds
  • Medium queries – From 10 seconds to 10 minutes
  • Long queries: More than 10 minutes

Considering our last test, the analysis shows:

Duration per configuration Optimized for Cost Balanced Optimized for Performance
Short queries (<10 sec) 1488 1743 3290
Medium queries (10 sec – 10 min) 3633 3579 2035
Long queries (>10 min) 204 3 0
TOTAL 5325 5325 5325

The configuration’s capacity directly impacts query elapsed time. The Optimized for Cost configuration limits resources to save money, resulting in longer query times, making it best suited for workloads that aren’t time critical, where cost savings are prioritized. The Balanced configuration provides moderate resource allocation, striking a middle ground by effectively handling medium-duration queries and maintaining reasonable performance for short queries while nearly eliminating long-running queries. In contrast, the Optimized for Performance configuration allocates more resources, which increases costs but delivers faster query results, making it best for latency-sensitive workloads where query speed is critical.

Capacity used during the tests

Our comparison of the three configurations reveals how Amazon Redshift Serverless AI-driven scaling and optimization technology adapts resource allocation to meet user expectations. The monitoring showed both Base RPU variations and distinct scaling patterns across configurations—scaling up aggressively for faster performance or maintaining lower RPUs to optimize costs.

The Optimized for Cost configuration starts at 128 RPUs and increases to 256 RPUs after three tests. To maintain cost-efficiency, this setup limits the maximum RPU allocation during scaling, even when facing query queuing.

In the following table, we can observe the costs for this Optimized for Cost configuration.

Test# Starting RPUs Scaled up to Cost incurred
1 128 1408  $254.17
2 128 1408  $258.39
3 128 1408  $261.92
4 256 1408  $245.57
5 256 1408  $247.11
6 256 1408  $257.25
7 256 1408  $254.27
8 256 1408  $254.27
9 256 1408  $254.11
10 256 1408  $256.15

The strategic RPU allocation by Amazon Redshift Serverless helps optimize costs, as demonstrated in tests 3 and 4, where we observed significant cost savings. This is shown in the following graph.

Optimized for Cost - Cost Average

Although the optimization for cost changed the base RPU, the balanced configuration didn’t change the base RPUs but scaled up to 2176, further than the 1408 RPUs that were the maximum used by the cost optimization setup. The following table shows the figures for the Balanced configuration.

Test# Starting RPUs Scaled up to Cost incurred
1 192 2176  $261.48
2 192 2112  $270.90
3 192 2112  $265.26
4 192 2112  $260.20
5 192 2112  $262.12
6 192 2112  $253.18
7 192 2112  $272.80
8 192 2112  $272.80
9 192 2112  $263.72
10 192 2112  $243.28

The Balanced configuration, averaging $262.57 per test, delivered significantly better performance while costing only 3% more than the Optimized for Cost configuration, which averaged $254.32 per test. As demonstrated in the previous section, this performance advantage is evident in the elapsed time comparisons. The following graph shows the costs for the Balanced configuration.

Balanced - Cost Average

As expected from the Optimized for Performance configuration, the usage of resources was higher to attend the high performance. In this configuration, we can also observe that after two tests, the engine adapted itself to start with a higher number of RPUs to attend the queries faster.

Test# Starting RPUs Scaled Up to Cost incurred
1 512 2753  $295.07
2 512 2327  $280.29
3 768 2560  $333.52
4 768 2991  $295.36
5 768 2479  $308.72
6 768 2816  $324.08
7 768 2413  $300.45
8 768 2413  $300.45
9 768 2107  $321.07
10 768 2304  $284.93

Despite a 19% cost increase in the third test, most subsequent tests remained below the $304.39 average cost.

Optimized for Performance - Cost Average

The Optimized for Performance configuration maximizes resource usage to achieve faster query times, prioritizing speed over cost efficiency.

The final cost-performance analysis reveals compelling results:

  • The Balanced configuration delivered twofold better performance while costing only 3.25% more than the Optimized for Cost setup
  • The Optimized for Performance configuration achieved fourfold faster elapsed time with a 19.39% cost increase compared to the Optimized for Cost option.

The following chart illustrates our cost-performance findings:

Average Billing and Elapsed Time per Endpoint

It’s important to note that these results reflect our specific test scenario. Each workload has unique characteristics, and the performance and cost differences between configurations might vary significantly in other use cases. Our findings serve as a reference point rather than a universal benchmark. Additionally, we didn’t test two intermediate configurations available in Amazon Redshift Serverless: one between Optimized for Cost and Balanced, and another between Balanced and Optimized for Performance.

Conclusion

The test results demonstrate the effectiveness of Amazon Redshift Serverless AI-driven scaling and optimization across different workload requirements. These findings highlight how Amazon Redshift Serverless AI-driven scaling and optimization can help organizations find their ideal balance between cost and performance. Although our test results serve as a reference point, each organization should evaluate their specific workload requirements and price-performance targets. The flexibility of five different optimization profiles, combined with intelligent resource allocation, enables teams to fine-tune their data warehouse operations for optimal efficiency.

To get started with Amazon Redshift Serverless AI-driven scaling and optimization, we recommend:

  1. Establishing your current price-performance baseline
  2. Identifying your workload patterns and requirements
  3. Testing different optimization profiles with your specific workloads
  4. Monitoring and adjusting based on your results

By using these capabilities, organizations can achieve better resource utilization while meeting their specific performance and cost objectives.

Ready to optimize your Amazon Redshift Serverless workloads? Visit the AWS Management Console today to create your own Amazon Redshift Serverless AI-driven scaling and optimization to start exploring the different optimization profiles. For more information, check out our documentation on Amazon Redshift Serverless AI-driven scaling and optimization, or contact your AWS account team to discuss your specific use case.


About the Authors

Ricardo Serafim Ricardo Serafim is a Senior Analytics Specialist Solutions Architect at AWS. He has been helping companies with Data Warehouse solutions since 2007.

Milind Oke Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Andre HassAndre Hass is a Senior Technical Account Manager at AWS, specialized in AWS Data Analytics workloads. With more than 20 years of experience in databases and data analytics, he helps customers optimize their data solutions and navigate complex technical challenges. When not immersed in the world of data, Andre can be found pursuing his passion for outdoor adventures. He enjoys camping, hiking, and exploring new destinations with his family on weekends or whenever an opportunity arises.

How Getir unleashed data democratization using a data mesh architecture with Amazon Redshift

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/how-getir-unleashed-data-democratization-using-a-data-mesh-architecture-with-amazon-redshift/

This blog post is co-written with Pinar Yasar from Getir.

Amazon Redshift is a fully managed cloud data warehouse that’s used by tens of thousands of customers for price-performance, scale, and advanced data analytics. Amazon Redshift enables data warehousing by seamlessly integrating with other data stores and services in the modern data organization through features such as Zero-ETL, data sharing, streaming ingestion, data lake integration, and Redshift ML.

In this post, we explain how ultrafast delivery pioneer, Getir, unleashed the power of data democratization on a large scale through their data mesh architecture using Amazon Redshift.

We start by introducing Getir and their vision—to seamlessly, securely, and efficiently share business data across different teams within the organization for BI, extract, transform, and load (ETL), and other use cases. We’ll then explore how Amazon Redshift data sharing powered the data mesh architecture that allowed Getir to achieve this transformative vision. We will also explain how Getir’s data mesh architecture enabled data democratization, shorter time-to-market, and cost-efficiencies. Next, we’ll provide a broader overview of modern data trends reinforced by Getir’s vision. In conclusion, we’ll offer some thoughts on how you can apply a similar approach to eliminate costly and barrier-inducing data silos using Amazon Redshift.

Who is Getir?

Getir is an ultrafast delivery pioneer that revolutionized last-mile delivery in 2015 with its 10-minute grocery delivery proposition.Getir’s story started in Istanbul, and they have launched multiple products since inception: GetirFood, GetirMore, GetirWater, GetirLocals, GetirBitaksi (taxi service), GetirDrive (car rental service), and GetirJobs (recruitment).

Getir serves dozens of cities throughout the world with more than 30,000 employees. The following figure shows the Getir app.

Figure 1: Getir app

Figure 1: Getir app

Overview of Getir’s main use case

Getir’s business is characterized by a tremendous volume of data generation and growth, in addition to ample opportunities to gain valuable insights. However, siloing this data and creating friction for teams trying to access the information they needed wasn’t a viable option. Allowing teams to duplicate data wherever required can be an anti-pattern, leading to operational complexity, cost overruns, and fragile data storage bloat.

Similarly, relying on dedicated teams to create data extracts or insights for downstream consumers introduces bottlenecks, stifles innovation, and increases the time-to-market. This approach isn’t optimal for a data-driven organization like Getir, which needs to empower its teams with seamless access to the information they require to drive the business forward. The various business lines within the organization made it abundantly clear that they wanted unfettered access to the company’s entire data ecosystem in a secure, cost-efficient, near real-time, and well-governed manner.

Furthermore, the organization was anticipating the emergence of data-as-a-serviceservice and generative AI use cases in the near future. This would necessitate the ability to securely share and potentially monetize the company’s data with external partners, such as franchises.

Overview of Getir’s use of Amazon Redshift and modern data architecture

To strike a balance that addresses these concerns and enables Getir teams to effectively use the wealth of data to generate meaningful insights and drive strategic decision-making across the organization, we chose a data mesh architecture.

Getir’s data analytics environment encompasses hundreds of terabytes of data, thousands of tables, and billions upon billions of data rows. Additionally, it processes millions of messaging events daily, all of which must be ingested, refined, and made available to analysts querying multiple Amazon Redshift warehouses. The end-to-end service level agreements (SLAs) for this data ecosystem can be extremely aggressive, with requirements that can be as stringent as single-digit minutes to single-digit seconds. This underscores the scale and complexity of Getir’s data analytics capabilities, which must operate with the utmost efficiency and responsiveness to meet the demands of the business. We were able to easily implement the envisioned data mesh architecture using Amazon Redshift’s native data sharing capabilities.

Figure 2: Data mesh architecture using Amazon Redshift data sharing

Figure 2: Data mesh architecture using Amazon Redshift data sharing

As the preceding diagram shows, at the heart of Getir’s architecture, was an ETL Redshift data warehouse that was used for various data sets from all over the organization, creating a refined 360-degree view of critical assets. It also was a producer for downstream Redshift data warehouses.

The demand was quite heavy on this main ETL cluster, so we relied on data sharing to isolate noisy workloads on a different Redshift data warehouse without having to duplicate the data on the main ETL cluster.

Using Redshift data sharing, individual business line teams could now rely solely on their dedicated Redshift cluster to provide them with their own data and analytics capabilities, but also the refined 360-degree views of data generated from all over the organization—without any data duplication or overstepping compute boundaries. BI analysts gained access to all of the data they needed to power their most complex dashboards with consistent performance free of noisy jobs. Additional warehouses were integrated into the data mesh for visualization, reporting, and machine learning.

Another benefit of Amazon Redshift data sharing and the data mesh architecture, was the relative ease with which we were able to maintain a chargeback model for ensuring costs were spread fairly across different teams.

Finally, the data sharing capability also enabled the seamless propagation of newly created tables within a schema to the subscribed consumers.

Modern data trends reinforced by Getir’s case study

Getir’s case study showcases the strategic uses of a data mesh architecture and Amazon Redshift, but more importantly provides tremendous insights into five key trends across all industries as modern data organizations move away from costly data silos that hinder collaboration, business insights, and time-to-market. As highlighted in the following diagram, those trends are 1/interconnected, purpose-built data stores that enable users to access data regardless of its physical location, 2/data democratization empowering users with self-service analytics capabilities, 3/real-time insights to drive greater value from data, 4/resilient data services ensuring business continuity, 5/leveraging generative AI to extract even deeper insights from data more expeditiously.

Figure 3: Key trends in the modern data organization reinforced by Getir's use case and solution

Figure 3: Key trends in the modern data organization reinforced by Getir’s use case and solution

As Getir showed, the modern data organization is adopting data architectures that democratize data securely and enable self-service analytics. To realize data’s true potential, the modern data organization has progressed beyond basic dashboarding and reporting on limited, point-in-time data sets, and evolved to use more sophisticated ETL processes that can ingest data from diverse sources. Near real-time analytics in addition to predictive models have become standard fare, significantly reducing the time to actionable insights.

Furthermore, the data landscape has been democratized to empower analysts in numerous ways through the rise of transactional data lakes powered by open table formats such as Apache Iceberg and the assistance of generative AI. This holistic approach has elevated data organizations’ capabilities well beyond traditional reporting, unlocking greater business value from the wealth of data available.

Using generative AI with data mesh architecture

In addition to the five key trends previously mentioned, the present-day data landscape is characterized by three key facts that are leading data organizations like Getir to increasingly harness the power of generative AI to drive the next evolution of data-informed decision-making.

Data is an organization’s most valuable asset and the ability to effectively use data is central to an organization’s success and growth. Data analytics and insights are absolutely crucial to strengthening and expanding the business. Deriving meaningful insights from data is essential for making informed, strategic decisions. Democratizing data and enabling self-service analytics can greatly expand the range of business insights, while reducing the time to market for those insights. Empowering users across the organization to access and analyze data can unlock tremendous value. Generative AI’s ability to respond to natural language prompts, explore and analyze complex data, and summarize lengthy content makes it a valuable tool for translating large amounts of data into valuable insights. However, the true potential of generative AI for organizations lies in Retrieval Augmented Generation (RAG).

Out of the box, generative AI models start with a relatively generic knowledge base, which can lead to unreliable or inaccurate information. RAG addresses this by introducing the model to additional datasets that are specific to the organization or context. This allows generative AI models to produce far more accurate, attributable, and highly contextualized outputs to support decision-making.

Data mesh architecture can play a crucial role in enabling and facilitating RAG. By facilitating access to multiple data sources within the organization, the data mesh provides the necessary fuel for the generative AI model to draw from, resulting in more reliable and insightful information. This, in turn, empowers data-driven decision-making and helps organizations harness the full potential of their data assets.

Conclusion

In this post, we examined how Getir implemented a data mesh architecture and Amazon Redshift data sharing to meet their evolving data requirements. This entailed dedicated data warehouses tailored to different business lines and needs, while maintaining robust data governance and secure data access. Additionally, we highlighted the key industry trends that Getir’s case study reinforces across the broader data landscape. For more information, contact AWS or connect with your AWS Technical Account Manager or Solutions Architect, who will be happy to provide more detailed guidance and support.


About the Authors

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas, USA. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.

Pinar Yasar is the Data Engineering Manager at Getir. Her passion is to accelerate self-service analytics for her internal customers and build highly scalable and cost-effective solutions in the cloud.

Apply fine-grained access and transformation on the SUPER data type in Amazon Redshift

Post Syndicated from Ritesh Sinha original https://aws.amazon.com/blogs/big-data/apply-fine-grained-access-and-transformation-on-the-super-data-type-in-amazon-redshift/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

Amazon Redshift, a cloud data warehouse service, supports attaching dynamic data masking (DDM) policies to paths of SUPER data type columns, and uses the OBJECT_TRANSFORM function with the SUPER data type. SUPER data type columns in Amazon Redshift contain semi-structured data like JSON documents. Previously, data masking in Amazon Redshift only worked with regular table columns, but now you can apply masking policies specifically to elements within SUPER columns. For example, you could apply a masking policy to mask sensitive fields like credit card numbers within JSON documents stored in a SUPER column. This allows for more granular control over data masking in Amazon Redshift. Amazon Redshift gives you more flexibility in how you apply data masking to protect sensitive information stored in SUPER columns containing semi-structured data.

With DDM support in Amazon Redshift, you can do the following:

  • Define masking policies that apply custom obfuscation policies, such as masking policies to handle credit card, personally identifiable information (PII) entries, HIPAA or GDPR needs, and more
  • Transform the data at query time to apply masking policies
  • Attach masking policies to roles or users
  • Attach multiple masking policies with varying levels of obfuscation to the same column in a table and assign them to different roles with priorities to avoid conflicts
  • Implement cell-level masking by using conditional columns when creating your masking policy
  • Use masking policies to partially or completely redact data, or hash it by using user-defined functions (UDFs)

In this post, we demonstrate how a retail company can control the access of PII data stored in the SUPER data type to users based on their access privilege without duplicating the data.

Solution overview

For our use case, we have the following data access requirements:

  • Users from the Customer Service team should be able to view the order data but not PII information
  • Users from the Sales team should be able to view customer IDs and all order information
  • Users from the Executive team should be able to view all the data
  • Staff should not be able to view any data

The following diagram illustrates how DDM support in Amazon Redshift policies works with roles and users for our retail use case.

The solution encompasses creating masking policies with varying masking rules and attaching one or more to the same role and table with an assigned priority to remove potential conflicts. These policies may pseudonymize results or selectively nullify results to comply with retailers’ security requirements. We refer to multiple masking policies being attached to a table as a multi-modal masking policy. A multi-modal masking policy consists of three parts:

  • A data masking policy that defines the data obfuscation rules
  • Roles with different access levels depending on the business case
  • The ability to attach multiple masking policies on a user or role and table combination with priority for conflict resolution

Prerequisites

To implement this solution, you need the following prerequisites:

Prepare the data

To set up our use case, complete the following steps:

  1. On the Amazon Redshift console, choose Query editor v2 under Explorer in the navigation pane.

If you’re familiar with SQL Notebooks, you can download the SQL notebook for the demonstration and import it to quickly get started.

  1. Create the table and populate contents:
    -- 1- Create the orders table
    drop table if exists public.order_transaction;
    create table public.order_transaction (
     data_json super
    );
    
    -- 2- Populate the table with sample values
    INSERT INTO public.order_transaction
    VALUES
        (
            json_parse('
            {
            "c_custkey": 328558,
            "c_name": "Customer#000328558",
            "c_phone": "586-436-7415",
            "c_creditcard": "4596209611290987",
            "orders":{
              "o_orderkey": 8014018,
              "o_orderstatus": "F",
              "o_totalprice": 120857.71,
              "o_orderdate": "2024-01-01"
              }
            }'
            )
        ),
        (
            json_parse('
            {
            "c_custkey": 328559,
            "c_name": "Customer#000328559",
            "c_phone": "789-232-7421",
            "c_creditcard": "8709000219329924",
            "orders":{
              "o_orderkey": 8014019,
              "o_orderstatus": "S",
              "o_totalprice": 9015.98,
              "o_orderdate": "2024-01-01"
              }
            }'
            )
        ),
        (
            json_parse('
            {
            "c_custkey": 328560,
            "c_name": "Customer#000328560",
            "c_phone": "276-564-9023",
            "c_creditcard": "8765994378650090",
            "orders":{
              "o_orderkey": 8014020,
              "o_orderstatus": "C",
              "o_totalprice": 18765.56,
              "o_orderdate": "2024-01-01"
              }
            }
            ')
        );

Implement the solution

To satisfy the security requirements, we need to make sure that each user sees the same data in different ways based on their granted privileges. To do that, we use user roles combined with masking policies as follows:

  1. Create users and roles, and add users to their respective roles:
    --create four users
    set session authorization admin;
    CREATE USER Kate_cust WITH PASSWORD disable;
    CREATE USER Ken_sales WITH PASSWORD disable;
    CREATE USER Bob_exec WITH PASSWORD disable;
    CREATE USER Jane_staff WITH PASSWORD disable;
    
    -- 1. Create User Roles
    CREATE ROLE cust_srvc_role;
    CREATE ROLE sales_srvc_role;
    CREATE ROLE executives_role;
    CREATE ROLE staff_role;
    
    -- note that public role exists by default.
    -- Grant Roles to Users
    GRANT ROLE cust_srvc_role to Kate_cust;
    GRANT ROLE sales_srvc_role to Ken_sales;
    GRANT ROLE executives_role to Bob_exec;
    GRANT ROLE staff_role to Jane_staff;
    
    -- note that regualr_user is attached to public role by default.
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE cust_srvc_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE sales_srvc_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE executives_role;
    GRANT ALL ON ALL TABLES IN SCHEMA "public" TO ROLE staff_role;

  2. Create masking policies:
    -- Mask Full Data
    CREATE MASKING POLICY mask_full
    WITH(pii_data VARCHAR(256))
    USING ('000000XXXX0000'::TEXT);
    
    -- This policy rounds down the given price to the nearest 10.
    CREATE MASKING POLICY mask_price
    WITH(price INT)
    USING ( (FLOOR(price::FLOAT / 10) * 10)::INT );
    
    -- This policy converts the first 12 digits of the given credit card to 'XXXXXXXXXXXX'.
    CREATE MASKING POLICY mask_credit_card
    WITH(credit_card TEXT)
    USING ( 'XXXXXXXXXXXX'::TEXT || SUBSTRING(credit_card::TEXT FROM 13 FOR 4) );
    
    -- This policy mask the given date
    CREATE MASKING POLICY mask_date
    WITH(order_date TEXT)
    USING ( 'XXXX-XX-XX'::TEXT);
    
    -- This policy mask the given phone number
    CREATE MASKING POLICY mask_phone
    WITH(phone_number TEXT)
    USING ( 'XXX-XXX-'::TEXT || SUBSTRING(phone_number::TEXT FROM 9 FOR 4) );

  3. Attach the masking policies:
    • Attach the masking policy for the customer service use case:
      --customer_support (cannot see customer PHI/PII data but can see the order id , order details and status etc.)
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_full
      ON public.order_transaction(data_json.c_custkey)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_phone
      ON public.order_transaction(data_json.c_phone)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_credit_card
      ON public.order_transaction(data_json.c_creditcard)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_price
      ON public.order_transaction(data_json.orders.o_totalprice)
      TO ROLE cust_srvc_role;
      
      ATTACH MASKING POLICY mask_date
      ON public.order_transaction(data_json.orders.o_orderdate)
      TO ROLE cust_srvc_role;

    • Attach the masking policy for the sales use case:
      --sales —> can see the customer ID (non phi data) and all order info
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_phone
      ON public.order_transaction(data_json.customer.c_phone)
      TO ROLE sales_srvc_role;

    • Attach the masking policy for the staff use case:
      --Staff — > cannot see any data about the order. all columns masked for them ( we can hand pick some columns) to show the functionality
      
      set session authorization admin;
      
      ATTACH MASKING POLICY mask_full
      ON public.order_transaction(data_json.orders.o_orderkey)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_pii_full
      ON public.order_transaction(data_json.orders.o_orderstatus)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_pii_price
      ON public.order_transaction(data_json.orders.o_totalprice)
      TO ROLE staff_role;
      
      ATTACH MASKING POLICY mask_date
      ON public.order_transaction(data_json.orders.o_orderdate)
      TO ROLE staff_role;

Test the solution

Let’s confirm that the masking policies are created and attached.

  1. Check that the masking policies are created with the following code:
    -- 1.1- Confirm the masking policies are created
    SELECT * FROM svv_masking_policy;

  2. Check that the masking policies are attached:
    -- 1.2- Verify attached masking policy on table/column to user/role.
    SELECT * FROM svv_attached_masking_policy;

Now you can test that different users can see the same data masked differently based on their roles.

  1. Test that the customer support can’t see customer PHI/PII data but can see the order ID, order details, and status:
    set session authorization Kate_cust;
    select * from order_transaction;

  2. Test that the sales team can see the customer ID (non PII data) and all order information:
    set session authorization Ken_sales;
    select * from order_transaction;

  3. Test that the executives can see all data:
    set session authorization Bob_exec;
    select * from order_transaction;

  4. Test that the staff can’t see any data about the order. All columns should masked for them.
    set session authorization Jane_staff;
    select * from order_transaction;

Object_Transform function

In this section, we dive into the capabilities and benefits of the OBJECT_TRANSFORM function and explore how it empowers you to efficiently reshape your data for analysis. The OBJECT_TRANSFORM function in Amazon Redshift is designed to facilitate data transformations by allowing you to manipulate JSON data directly within the database. With this function, you can apply transformations to semi-structured or SUPER data types, making it less complicated to work with complex data structures in a relational database environment.

Let’s look at some usage examples.

First, create a table and populate contents:

--1- Create the customer table 

DROP TABLE if exists customer_json;

CREATE TABLE customer_json (
    col_super super,
    col_text character varying(100) ENCODE lzo
) DISTSTYLE AUTO;

--2- Populate the table with sample data 

INSERT INTO customer_json
VALUES
    (
        
        json_parse('
            {
                "person": {
                    "name": "GREGORY HOUSE",
                    "salary": 120000,
                    "age": 17,
                    "state": "MA",
                    "ssn": ""
                }
            }
        ')
        ,'GREGORY HOUSE'
    ),
    (
        json_parse('
              {
                "person": {
                    "name": "LISA CUDDY",
                    "salary": 180000,
                    "age": 30,
                    "state": "CA",
                    "ssn": ""
                }
            }
        ')
        ,'LISA CUDDY'
    ),
     (
        json_parse('
              {
                "person": {
                    "name": "JAMES WILSON",
                    "salary": 150000,
                    "age": 35,
                    "state": "WA",
                    "ssn": ""
                }
            }
        ')
        ,'JAMES WILSON'
    )
;
-- 3 select the data 

SELECT * FROM customer_json;

Apply the transformations with the OBJECT_TRANSFORM function:

SELECT
    OBJECT_TRANSFORM(
        col_super
        KEEP
            '"person"."name"',
            '"person"."age"',
            '"person"."state"'
           
        SET
            '"person"."name"', LOWER(col_super.person.name::TEXT),
            '"person"."salary"',col_super.person.salary + col_super.person.salary*0.1
    ) AS col_super_transformed
FROM customer_json;

As you can see in the example, by applying the transformation with OBJECT_TRANSFORM, the person name is formatted in lowercase and the salary is increased by 10%. This demonstrates how the transformation makes is less complicated to work with semi-structured or nested data types.

Clean up

When you’re done with the solution, clean up your resources:

  1. Detach the masking policies from the table:
    -- Cleanup
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;

  2. Drop the masking policies:
    DROP MASKING POLICY mask_pii_data CASCADE;

  3. Revoke or drop the roles and users:
    REVOKE ROLE cust_srvc_role from Kate_cust;
    REVOKE ROLE sales_srvc_role from Ken_sales;
    REVOKE ROLE executives_role from Bob_exec;
    REVOKE ROLE staff_role from Jane_staff;
    DROP ROLE cust_srvc_role;
    DROP ROLE sales_srvc_role;
    DROP ROLE executives_role;
    DROP ROLE staff_role;
    DROP USER Kate_cust;
    DROP USER Ken_sales;
    DROP USER Bob_exec;
    DROP USER Jane_staff;

  4. Drop the table:
    DROP TABLE order_transaction CASCADE;
    DROP TABLE if exists customer_json;

Considerations and best practices

Consider the following when implementing this solution:

  • When attaching a masking policy to a path on a column, that column must be defined as the SUPER data type. You can only apply masking policies to scalar values on the SUPER path. You can’t apply masking policies to complex structures or arrays.
  • You can apply different masking policies to multiple scalar values on a single SUPER column as long as the SUPER paths don’t conflict. For example, the SUPER paths a.b and a.b.c conflict because they’re on the same path, with a.b being the parent of a.b.c. The SUPER paths a.b.c and a.b.d don’t conflict.

Refer to Using dynamic data masking with SUPER data type paths for more details on considerations.

Conclusion

In this post, we discussed how to use DDM support for the SUPER data type in Amazon Redshift to define configuration-driven, consistent, format-preserving, and irreversible masked data values. With DDM support in Amazon Redshift, you can control your data masking approach using familiar SQL language. You can take advantage of the Amazon Redshift role-based access control capability to implement different levels of data masking. You can create a masking policy to identify which column needs to be masked, and you have the flexibility of choosing how to show the masked data. For example, you can completely hide all the information of the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or Lambda UDFs. Additionally, you can apply conditional masking based on other columns, which selectively protects the column data in a table based on the values in one or more columns.

We encourage you to create your own user-defined functions for various use cases and achieve your desired security posture using dynamic data masking support in Amazon Redshift.


About the Authors

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

Detect and handle data skew on AWS Glue

Post Syndicated from Salim Tutuncu original https://aws.amazon.com/blogs/big-data/detect-and-handle-data-skew-on-aws-glue/

AWS Glue is a fully managed, serverless data integration service provided by Amazon Web Services (AWS) that uses Apache Spark as one of its backend processing engines (as of this writing, you can use Python Shell, Spark, or Ray).

Data skew occurs when the data being processed is not evenly distributed across the Spark cluster, causing some tasks to take significantly longer to complete than others. This can lead to inefficient resource utilization, longer processing times, and ultimately, slower performance. Data skew can arise from various factors, including uneven data distribution, skewed join keys, or uneven data processing patterns. Even though the biggest issue is often having nodes running out of disk during shuffling, which leads to nodes falling like dominoes and job failures, it’s also important to mention that data skew is hidden. The stealthy nature of data skew means it can often go undetected because monitoring tools might not flag an uneven distribution as a critical issue, and logs don’t always make it evident. As a result, a developer may observe that their AWS Glue jobs are completing without apparent errors, yet the system could be operating far from its optimal efficiency. This hidden inefficiency not only increases operational costs due to longer runtimes but can also lead to unpredictable performance issues that are difficult to diagnose without a deep dive into the data distribution and task run patterns.

For example, in a dataset of customer transactions, if one customer has significantly more transactions than the others, it can cause a skew in the data distribution.

Identifying and handling data skew issues is key to having good performance on Apache Spark and therefore on AWS Glue jobs that use Spark as a backend. In this post, we show how you can identify data skew and discuss the different techniques to mitigate data skew.

How to detect data skew

When an AWS Glue job has issues with local disks (split disk issues), doesn’t scale with the number of workers, or has low CPU usage (you can enable Amazon CloudWatch metrics for your job to be able to see this), you may have a data skew issue. You can detect data skew with data analysis or by using the Spark UI. In this section, we discuss how to use the Spark UI.

The Spark UI provides a comprehensive view of Spark applications, including the number of tasks, stages, and their duration. To use it you need to enable Spark UI event logs for your job runs. It is enabled by default on Glue console and once enabled, Spark event log files will be created during the job run and stored in your S3 bucket. Then, those logs are parsed, and you can use the AWS Glue serverless Spark UI to visualize them. You can refer to this blogpost for more details. In those jobs where the AWS Glue serverless Spark UI does not work as it has a limit of 512 MB of logs, you can set up the Spark UI using an EC2 instance.

You can use the Spark UI to identify which tasks are taking longer to complete than others, and if the data distribution among partitions is balanced or not (remember that in Spark, one partition is mapped to one task). If there is data skew, you will see that some partitions have significantly more data than others. The following figure shows an example of this. We can see that one task is taking a lot more time than the others, which can indicate data skew.

Another thing that you can use is the summary metrics for each stage. The following screenshot shows another example of data skew.

These metrics represent the task-related metrics below which a certain percentage of tasks completed. For example, the 75th percentile task duration indicates that 75% of tasks completed in less time than this value. When the tasks are evenly distributed, you will see similar numbers in all the percentiles. When there is data skew, you will see very biased values in each percentile. In the preceding example, it didn’t write many shuffle files (less than 50 MiB) in Min, 25th percentile, Median, and 75th percentile. However, in Max, it wrote 460 MiB, 10 times the 75th percentile. It means there was at least one task (or up to 25% of tasks) that wrote much bigger shuffle files than the rest of the tasks. You can also see that the duration of the tax in Max is 46 seconds and the Median is 2 seconds. These are all indicators that your dataset may have data skew.

AWS Glue interactive sessions

You can use interactive sessions to load your data from the AWS Glue Data Catalog or just use Spark methods to load the files such as Parquet or CSV that you want to analyze. You can use a similar script to the following to detect data skew from the partition size perspective; the more important issue is related to data skew while shuffling, and this script does not detect that kind of skew:

from pyspark.sql.functions import spark_partition_id, asc, desc
#input_dataframe being the dataframe where you want to check for data skew
partition_sizes_df=input_dataframe\
    .withColumn("partitionId", spark_partition_id())\
    .groupBy("partitionId")\
    .count()\
    .orderBy(asc("count"))\
    .withColumnRenamed("count","partition_size")
#calculate average and standar deviation for the partition sizes
avg_size = partition_sizes_df.agg({"partition_size": "avg"}).collect()[0][0]
std_dev_size = partition_sizes_df.agg({"partition_size": "stddev"}).collect()[0][0]

""" 
 the code calculates the absolute difference between each value in the "partition_size" column and the calculated average (avg_size).
 then, calculates twice the standard deviation (std_dev_size) and use 
 that as a boolean mask where the condition checks if the absolute difference is greater than twice the standard deviation
 in order to mark a partition 'skewed'
"""
skewed_partitions_df = partition_sizes_df.filter(abs(partition_sizes_df["partition_size"] - avg_size) > 2 * std_dev_size)
if skewed_partitions_df.count() > 0:
    skewed_partitions = [row["partition_id"] for row in skewed_partitions_df.collect()]
    print(f"The following partitions have significantly different sizes: {skewed_partitions}")
else:
    print("No data skew detected.")

You can calculate the average and standard deviation of partition sizes using the agg() function and identify partitions with significantly different sizes using the filter() function, and you can print their indexes if any skewed partitions are detected. Otherwise, the output prints that no data skew is detected.

This code assumes that your data is structured, and you may need to modify it if your data is of a different type.

How to handle data skew

You can use different techniques in AWS Glue to handle data skew; there is no single universal solution. The first thing to do is confirm that you’re using latest AWS Glue version, for example AWS Glue 4.0 based on Spark 3.3 has enabled by default some configs like Adaptative Query Execution (AQE) that can help improve performance when data skew is present.

The following are some of the techniques that you can employ to handle data skew:

  • Filter and perform – If you know which keys are causing the skew, you can filter them out, perform your operations on the non-skewed data, and then handle the skewed keys separately.
  • Implementing incremental aggregation – If you are performing a large aggregation operation, you can break it up into smaller stages because in large datasets, a single aggregation operation (like sum, average, or count) can be resource-intensive. In those cases, you can perform intermediate actions. This could involve filtering, grouping, or additional aggregations. This can help distribute the workload across the nodes and reduce the size of intermediate data.
  • Using a custom partitioner – If your data has a specific structure or distribution, you can create a custom partitioner that partitions your data based on its characteristics. This can help make sure that data with similar characteristics is in the same partition and reduce the size of the largest partition.
  • Using broadcast join – If your dataset is small but exceeds the spark.sql.autoBroadcastJoinThreshold value (default is 10 MB), you have the option to either provide a hint to use broadcast join or adjust the threshold value to accommodate your dataset. This can be an effective strategy to optimize join operations and mitigate data skew issues resulting from shuffling large amounts of data across nodes.
  • Salting – This involves adding a random prefix to the key of skewed data. By doing this, you distribute the data more evenly across the partitions. After processing, you can remove the prefix to get the original key values.

These are just a few techniques to handle data skew in PySpark; the best approach will depend on the characteristics of your data and the operations you are performing.

The following is an example of joining skewed data with the salting technique:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, ceil, rand, concat, col

# Define the number of salt values
num_salts = 3

# Function to identify skewed keys
def identify_skewed_keys(df, key_column, threshold):
    key_counts = df.groupBy(key_column).count()
    return key_counts.filter(key_counts['count'] > threshold).select(key_column)

# Identify skewed keys
skewed_keys = identify_skewed_keys(skewed_data, "key", skew_threshold)

# Splitting the dataset
skewed_data_subset = skewed_data.join(skewed_keys, ["key"], "inner")
non_skewed_data_subset = skewed_data.join(skewed_keys, ["key"], "left_anti")

# Apply salting to skewed data
skewed_data_subset = skewed_data_subset.withColumn("salt", ceil((rand() * 10) % num_salts))
skewed_data_subset = skewed_data_subset.withColumn("salted_key", concat(col("key"), lit("_"), col("salt")))

# Replicate skewed rows in non-skewed dataset
def replicate_skewed_rows(df, keys, multiplier):
    replicated_df = df.join(keys, ["key"]).crossJoin(spark.range(multiplier).withColumnRenamed("id", "salt"))
    replicated_df = replicated_df.withColumn("salted_key", concat(col("key"), lit("_"), col("salt")))
    return replicated_df.drop("salt")

replicated_non_skewed_data = replicate_skewed_rows(non_skewed_data, skewed_keys, num_salts)

# Perform the JOIN operation on the salted keys for skewed data
result_skewed = skewed_data_subset.join(replicated_non_skewed_data, "salted_key")

# Perform regular join on non-skewed data
result_non_skewed = non_skewed_data_subset.join(non_skewed_data, "key")

# Combine results
final_result = result_skewed.union(result_non_skewed)

In this code, we first define a salt value, which can be a random integer or any other value. We then add a salt column to our DataFrame using the withColumn() function, where we set the value of the salt column to a random number using the rand() function with a fixed seed. The function replicate_salt_rows is defined to replicate each row in the non-skewed dataset (non_skewed_data) num_salts times. This ensures that each key in the non-skewed data has matching salted keys. Finally, a join operation is performed on the salted_key column between the skewed and non-skewed datasets. This join is more balanced compared to a direct join on the original key, because salting and replication have mitigated the data skew.

The rand() function used in this example generates a random number between 0–1 for each row, so it’s important to use a fixed seed to achieve consistent results across different runs of the code. You can choose any fixed integer value for the seed.

The following figures illustrate the data distribution before (left) and after (right) salting. Heavily skewed key2 identified and salted into key2_0, key2_1, and key2_2, balancing the data distribution and preventing any single node from being overloaded. After processing, the results can be aggregated back, so that that the final output is consistent with the unsalted key values.

Other techniques to use on skewed data during the join operation

When you’re performing skewed joins, you can use salting or broadcasting techniques, or divide your data into skewed and regular parts before joining the regular data and broadcasting the skewed data.

If you are using Spark 3, there are automatic optimizations for trying to optimize Data Skew issues on joins. Those can be tuned because they have dedicated configs on Apache Spark.

Conclusion

This post provided details on how to detect data skew in your data integration jobs using AWS Glue and different techniques for handling it. Having a good data distribution is key to achieving the best performance on distributed processing systems like Apache Spark.

Although this post focused on AWS Glue, the same concepts apply to jobs you may be running on Amazon EMR using Apache Spark or Amazon Athena for Apache Spark.

As always, AWS welcomes your feedback. Please leave your comments and questions in the comments section.


About the Authors

Salim Tutuncu is a Sr. PSA Specialist on Data & AI, based from Amsterdam with a focus on the EMEA North and EMEA Central regions. With a rich background in the technology sector that spans roles as a Data Engineer, Data Scientist, and Machine Learning Engineer, Salim has built a formidable expertise in navigating the complex landscape of data and artificial intelligence. His current role involves working closely with partners to develop long-term, profitable businesses leveraging the AWS Platform, particularly in Data and AI use cases.

Angel Conde Manjon is a Sr. PSA Specialist on Data & AI, based in Madrid, and focuses on EMEA South and Israel. He has previously worked on research related to Data Analytics and Artificial Intelligence in diverse European research projects. In his current role, Angel helps partners develop businesses centered on Data and AI.

GoDaddy benchmarking results in up to 24% better price-performance for their Spark workloads with AWS Graviton2 on Amazon EMR Serverless

Post Syndicated from Mukul Sharma original https://aws.amazon.com/blogs/big-data/godaddy-benchmarking-results-in-up-to-24-better-price-performance-for-their-spark-workloads-with-aws-graviton2-on-amazon-emr-serverless/

This is a guest post co-written with Mukul Sharma, Software Development Engineer, and Ozcan IIikhan, Director of Engineering from GoDaddy.

GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 22 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, and manage their work.

GoDaddy is a data-driven company, and getting meaningful insights from data helps us drive business decisions to delight our customers. At GoDaddy, we embarked on a journey to uncover the efficiency promises of AWS Graviton2 on Amazon EMR Serverless as part of our long-term vision for cost-effective intelligent computing.

In this post, we share the methodology and results of our benchmarking exercise comparing the cost-effectiveness of EMR Serverless on the arm64 (Graviton2) architecture against the traditional x86_64 architecture. EMR Serverless on Graviton2 demonstrated an advantage in cost-effectiveness, resulting in significant savings in total run costs. We achieved 23.85% improvement in price-performance for sample production Spark workloads—an outcome that holds tremendous potential for businesses striving to maximize their computing efficiency.

Solution overview

GoDaddy’s intelligent compute platform envisions simplification of compute operations for all personas, without limiting power users, to ensure out-of-box cost and performance optimization for data and ML workloads. As a part of this vision, GoDaddy’s Data & ML Platform team plans to use EMR Serverless as one of the compute solutions under the hood.

The following diagram shows a high-level illustration of the intelligent compute platform vision.

Benchmarking EMR Serverless for GoDaddy

EMR Serverless is a serverless option in Amazon EMR that eliminates the complexities of configuring, managing, and scaling clusters when running big data frameworks like Apache Spark and Apache Hive. With EMR Serverless, businesses can enjoy numerous benefits, including cost-effectiveness, faster provisioning, simplified developer experience, and improved resilience to Availability Zone failures.

At GoDaddy, we embarked on a comprehensive study to benchmark EMR Serverless using real production workflows at GoDaddy. The purpose of the study was to evaluate the performance and efficiency of EMR Serverless and develop a well-informed adoption plan. The results of the study have been extremely promising, showcasing the potential of EMR Serverless for our workloads.

Having achieved compelling results in favor of EMR Serverless for our workloads, our attention turned to evaluating the utilization of the Graviton2 (arm64) architecture on EMR Serverless. In this post, we focus on comparing the performance of Graviton2 (arm64) with the x86_64 architecture on EMR Serverless. By conducting this apples-to-apples comparative analysis, we aim to gain valuable insights into the benefits and considerations of using Graviton2 for our big data workloads.

By using EMR Serverless and exploring the performance of Graviton2, GoDaddy aims to optimize their big data workflows and make informed decisions regarding the most suitable architecture for their specific needs. The combination of EMR Serverless and Graviton2 presents an exciting opportunity to enhance the data processing capabilities and drive efficiency in our operations.

AWS Graviton2

The Graviton2 processors are specifically designed by AWS, utilizing powerful 64-bit Arm Neoverse cores. This custom-built architecture provides a remarkable boost in price-performance for various cloud workloads.

In terms of cost, Graviton2 offers an appealing advantage. As indicated in the following table, the pricing for Graviton2 is 20% lower compared to the x86 architecture option.

   x86_64  arm64 (Graviton2) 
per vCPU per hour $0.052624 $0.042094
per GB per hour $0.0057785 $0.004628
per storage GB per hour* $0.000111

*Ephemeral storage: 20 GB of ephemeral storage is available for all workers by default—you pay only for any additional storage that you configure per worker.

For specific pricing details and current information, refer to Amazon EMR pricing.

AWS benchmark

The AWS team performed benchmark tests on Spark workloads with Graviton2 on EMR Serverless using the TPC-DS 3 TB scale performance benchmarks. The summary of their analysis are as follows:

  • Graviton2 on EMR Serverless demonstrated an average improvement of 10% for Spark workloads in terms of runtime. This indicates that the runtime for Spark-based tasks was reduced by approximately 10% when utilizing Graviton2.
  • Although the majority of queries showcased improved performance, a small subset of queries experienced a regression of up to 7% on Graviton2. These specific queries showed a slight decrease in performance compared to the x86 architecture option.
  • In addition to the performance analysis, the AWS team considered the cost factor. Graviton2 is offered at a 20% lower cost than the x86 architecture option. Taking this cost advantage into account, the AWS benchmark set yielded an overall 27% better price-performance for workloads. This means that by using Graviton2, users can achieve a 27% improvement in performance per unit of cost compared to the x86 architecture option.

These findings highlight the significant benefits of using Graviton2 on EMR Serverless for Spark workloads, with improved performance and cost-efficiency. It showcases the potential of Graviton2 in delivering enhanced price-performance ratios, making it an attractive choice for organizations seeking to optimize their big data workloads.

GoDaddy benchmark

During our initial experimentation, we observed that arm64 on EMR Serverless consistently outperformed or performed on par with x86_64. One of the jobs showed a 7.51% increase in resource usage on arm64 compared to x86_64, but due to the lower price of arm64, it still resulted in a 13.48% cost reduction. In another instance, we achieved an impressive 43.7% reduction in run cost, attributed to both the lower price and reduced resource utilization. Overall, our initial tests indicated that arm64 on EMR Serverless delivered superior price-performance compared to x86_64. These promising findings motivated us to conduct a more comprehensive and rigorous study.

Benchmark results

To gain a deeper understanding of the value of Graviton2 on EMR Serverless, we conducted our study using real-life production workloads from GoDaddy, which are scheduled to run at a daily cadence. Without any exceptions, EMR Serverless on arm64 (Graviton2) is significantly more cost-effective compared to the same jobs run on EMR Serverless on the x86_64 architecture. In fact, we recorded an impressive 23.85% improvement in price-performance across the sample GoDaddy jobs using Graviton2.

Like the AWS benchmarks, we observed slight regressions of less than 5% in the total runtime of some jobs. However, given that these jobs will be migrated from Amazon EMR on EC2 to EMR Serverless, the overall total runtime will still be shorter due to the minimal provisioning time in EMR Serverless. Additionally, across all jobs, we observed an average speed up of 2.1% in addition to the cost savings achieved.

These benchmarking results provide compelling evidence of the value and effectiveness of Graviton2 on EMR Serverless. The combination of improved price-performance, shorter runtimes, and overall cost savings makes Graviton2 a highly attractive option for optimizing big data workloads.

Benchmarking methodology

As an extension of a larger benchmarking EMR Serverless for GoDaddy study, where we divided Spark jobs into brackets based on total runtime (quick-run, medium-run, long-run), we measured effect of architecture (arm64 vs. x86_64) on total cost and total runtime. All other parameters were kept the same to achieve an apples-to-apples comparison.

The team followed these steps:

  1. Prepare the data and environment.
  2. Choose two random production jobs from each job bracket.
  3. Make necessary changes to avoid inference with actual production outputs.
  4. Run tests to execute scripts over multiple iterations to collect accurate and consistent data points.
  5. Validate input and output datasets, partitions, and row counts to ensure identical data processing.
  6. Gather relevant metrics from the tests.
  7. Analyze results to draw insights and conclusions.

The following table shows the summary of an example Spark job.

Metric  EMR Serverless (Average) – X86_64  EMR Serverless (Average) – Graviton  X86_64 vs Graviton (% Difference) 
Total Run Cost $2.76 $1.85 32.97%

Total Runtime

(hh:mm:ss)

00:41:31 00:34:32 16.82%
EMR Release Label emr-6.9.0
Job Type Spark
Spark Version Spark 3.3.0
Hadoop Distribution Amazon 3.3.3
Hive/HCatalog Version Hive 3.1.3, HCatalog 3.1.3

Summary of results

The following table presents a comparison of job performance between EMR Serverless on arm64 (Graviton2) and EMR Serverless on x86_64. For each architecture, every job was run at least three times to obtain the accurate average cost and runtime.

 Job  Average x86_64 Cost Average arm64 Cost Average x86_64 Runtime (hh:mm:ss) Average arm64 Runtime (hh:mm:ss)  Average Cost Savings %  Average Performance Gain % 
1 $1.64 $1.25 00:08:43 00:09:01 23.89% -3.24%
2 $10.00 $8.69 00:27:55 00:28:25 13.07% -1.79%
3 $29.66 $24.15 00:50:49 00:53:17 18.56% -4.85%
4 $34.42 $25.80 01:20:02 01:24:54 25.04% -6.08%
5 $2.76 $1.85 00:41:31 00:34:32 32.97% 16.82%
6 $34.07 $24.00 00:57:58 00:51:09 29.57% 11.76%
Average  23.85% 2.10%

Note that the improvement calculations are based on higher-precision results for more accuracy.

Conclusion

Based on this study, GoDaddy observed a significant 23.85% improvement in price-performance for sample production Spark jobs utilizing the arm64 architecture compared to the x86_64 architecture. These compelling results have led us to strongly recommend internal teams to use arm64 (Graviton2) on EMR Serverless, except in cases where there are compatibility issues with third-party packages and libraries. By adopting an arm64 architecture, organizations can achieve enhanced cost-effectiveness and performance for their workloads, contributing to more efficient data processing and analytics.


About the Authors

Mukul Sharma is a Software Development Engineer on Data & Analytics (DnA) organization at GoDaddy. He is a polyglot programmer with experience in a wide array of technologies to rapidly deliver scalable solutions. He enjoys singing karaoke, playing various board games, and working on personal programming projects in his spare time.

Ozcan Ilikhan is a Director of Engineering on Data & Analytics (DnA) organization at GoDaddy. He is passionate about solving customer problems and increasing efficiency using data and ML/AI. In his spare time, he loves reading, hiking, gardening, and working on DIY projects.

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in analytics. He has over 6 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Ramesh Kumar Venkatraman is a Senior Solutions Architect at AWS who is passionate about containers and databases. He works with AWS customers to design, deploy, and manage their AWS workloads and architectures. In his spare time, he loves to play with his two kids and follows cricket.

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

Post Syndicated from Tom Romano original https://aws.amazon.com/blogs/big-data/empower-your-jira-data-in-a-data-lake-with-amazon-appflow-and-aws-glue/

In the world of software engineering and development, organizations use project management tools like Atlassian Jira Cloud. Managing projects with Jira leads to rich datasets, which can provide historical and predictive insights about project and development efforts.

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. Companies often take a data lake approach to their analytics, bringing data from many different systems into one place to simplify how the analytics are done.

This post shows you how to use Amazon AppFlow and AWS Glue to create a fully automated data ingestion pipeline that will synchronize your Jira data into your data lake. Amazon AppFlow provides software as a service (SaaS) integration with Jira Cloud to load the data into your AWS account. AWS Glue is a serverless data discovery, load, and transformation service that will prepare data for consumption in BI and AI/ML activities. Additionally, this post strives to achieve a low-code and serverless solution for operational efficiency and cost optimization, and the solution supports incremental loading for cost optimization.

Solution overview

This solution uses Amazon AppFlow to retrieve data from the Jira Cloud. The data is synchronized to an Amazon Simple Storage Service (Amazon S3) bucket using an initial full download and subsequent incremental downloads of changes. When new data arrives in the S3 bucket, an AWS Step Functions workflow is triggered that orchestrates extract, transform, and load (ETL) activities using AWS Glue crawlers and AWS Glue DataBrew. The data is then available in the AWS Glue Data Catalog and can be queried by services such as Amazon Athena, Amazon QuickSight, and Amazon Redshift Spectrum. The solution is completely automated and serverless, resulting in low operational overhead. When this setup is complete, your Jira data will be automatically ingested and kept up to date in your data lake!

The following diagram illustrates the solution architecture.

The Jira Appflow Architecture is shown. The Jira Cloud data is retrieved by Amazon AppFlow and is stored in Amazon S3. This triggers an Amazon EventBridge event that runs an AWS Step Functions workflow. The workflow uses AWS Glue to catalog and transform the data, The data is then queried with QuickSight.

The Step Functions workflow orchestrates the following ETL activities, resulting in two tables:

  • An AWS Glue crawler collects all downloads into a single AWS Glue table named jira_raw. This table is comprised of a mix of full and incremental downloads from Jira, with many versions of the same records representing changes over time.
  • A DataBrew job prepares the data for reporting by unpacking key-value pairs in the fields, as well as removing depreciated records as they are updated in subsequent change data captures. This reporting-ready data will available in an AWS Glue table named jira_data.

The following figure shows the Step Functions workflow.

A diagram represents the AWS Step Functions workflow. It contains the steps to run an AWS Crawler, wait for it's completion, and then run a AWS Glue DataBrew data transformation job.

Prerequisites

This solution requires the following:

  • Administrative access to your Jira Cloud instance, and an associated Jira Cloud developer account.
  • An AWS account and a login with access to the AWS Management Console. Your login will need AWS Identity and Access Management (IAM) permissions to create and access the resources in your AWS account.
  • Basic knowledge of AWS and working knowledge of Jira administration.

Configure the Jira Instance

After logging in to your Jira Cloud instance, you establish a Jira project with associated epics and issues to download into a data lake. If you’re starting with a new Jira instance, it helps to have at least one project with a sampling of epics and issues for the initial data download, because it allows you to create an initial dataset without errors or missing fields. Note that you may have multiple projects as well.

An image show a Jira Cloud example, with several issues arranged in a Kansan board.

After you have established your Jira project and populated it with epics and issues, ensure you also have access to the Jira developer portal. In later steps, you use this developer portal to establish authentication and permissions for the Amazon AppFlow connection.

Provision resources with AWS CloudFormation

For the initial setup, you launch an AWS CloudFormation stack to create an S3 bucket to store data, IAM roles for data access, and the AWS Glue crawler and Data Catalog components. Complete the following steps:

  1. Sign in to your AWS account.
  2. Click Launch Stack:
  3. For Stack name, enter a name for the stack (the default is aws-blog-jira-datalake-with-AppFlow).
  4. For GlueDatabaseName, enter a unique name for the Data Catalog database to hold the Jira data table metadata (the default is jiralake).
  5. For InitialRunFlag, choose Setup. This mode will scan all data and disable the change data capture (CDC) features of the stack. (Because this is the initial load, the stack needs an initial data load before you configure CDC in later steps.)
  6. Under Capabilities and transforms, select the acknowledgement check boxes to allow IAM resources to be created within your AWS account.
  7. Review the parameters and choose Create stack to deploy the CloudFormation stack. This process will take around 5–10 minutes to complete.
    An image depicts the Amazon CloudFormation configuration steps, including setting a stack name, setting parameters to "jiralake" and "Setup" mode, and checking all IAM capabilities requested.
  8. After the stack is deployed, review the Outputs tab for the stack and collect the following values to use when you set up Amazon AppFlow:
    • Amazon AppFlow destination bucket (o01AppFlowBucket)
    • Amazon AppFlow destination bucket path (o02AppFlowPath)
    • Role for Amazon AppFlow Jira connector (o03AppFlowRole)
      An image demonstrating the Amazon Cloudformation "Outputs" tab, highlighting the values to add to the Amazon AppFlow configuration.

Configure Jira Cloud

Next, you configure your Jira Cloud instance for access by Amazon AppFlow. For full instructions, refer to Jira Cloud connector for Amazon AppFlow. The following steps summarize these instructions and discuss the specific configuration to enable OAuth in the Jira Cloud:

  1. Open the Jira developer portal.
  2. Create the OAuth 2 integration from the developer application console by choosing Create an OAuth 2.0 Integration. This will provide a login mechanism for AppFlow.
  3. Enable fine-grained permissions. See Recommended scopes for the permission settings to grant AppFlow appropriate access to your Jira instance.
  4. Add the following permission scopes to your OAuth app:
    1. manage:jira-configuration
    2. read:field-configuration:jira
  5. Under Authorization, set the Call Back URL to return to Amazon AppFlow with the URL https://us-east-1.console.aws.amazon.com/AppFlow/oauth.
  6. Under Settings, note the client ID and secret to use in later steps to set up authentication from Amazon AppFlow.

Create the Amazon AppFlow Jira Cloud connection

In this step, you configure Amazon AppFlow to run a one-time full data fetch of all your data, establishing the initial data lake:

  1. On the Amazon AppFlow console, choose Connectors in the navigation pane.
  2. Search for the Jira Cloud connector.
  3. Choose Create flow on the connector tile to create the connection to your Jira instance.
    An image of Amazon AppFlor, showing the search for the "Jira Cloud" connector.
  4. For Flow name, enter a name for the flow (for example, JiraLakeFlow).
  5. Leave the Data encryption setting as the default.
  6. Choose Next.
    The Amazon AppFlow Jira connector configuration, showing the Flow name set to "JiraLakeFlow" and clicking the "next" button.
  7. For Source name, keep the default of Jira Cloud.
  8. Choose Create new connection under Jira Cloud connection.
  9. In the Connect to Jira Cloud section, enter the values for Client ID, Client secret, and Jira Cloud Site that you collected earlier. This provides the authentication from AppFlow to Jira Cloud.
  10. For Connection Name, enter a connection name (for example, JiraLakeCloudConnection).
  11. Choose Connect. You will be prompted to allow your OAuth app to access your Atlassian account to verify authentication.
    An image of the Amazon AppFlow conflagration, reflecting the completion of the prior steps.
  12. In the Authorize App window that pops up, choose Accept.
  13. With the connection created, return to the Configure flow section on the Amazon AppFlow console.
  14. For API version, choose V2 to use the latest Jira query API.
  15. For Jira Cloud object, choose Issue to query and download all issues and associated details.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  16. For Destination Name in the Destination Details section, choose Amazon S3.
  17. For Bucket details, choose the S3 bucket name that matches the Amazon AppFlow destination bucket value that you collected from the outputs of the CloudFormation stack.
  18. Enter the Amazon AppFlow destination bucket path to complete the full S3 path. This will send the Jira data to the S3 bucket created by the CloudFormation script.
  19. Leave Catalog your data in the AWS Glue Data Catalog unselected. The CloudFormation script uses an AWS Glue crawler to update the Data Catalog in a different manner, grouping all the downloads into a common table, so we disable the update here.
  20. For File format settings, select Parquet format and select Preserve source data types in Parquet output. Parquet is a columnar format to optimize subsequent querying.
  21. Select Add a timestamp to the file name for Filename preference. This will allow you to easily find data files downloaded at a specific date and time.
    An image of the Amazon AppFlow configuration, reflecting the completion of the prior steps.
  22. For now, select Run on Demand for the Flow trigger to run the full load flow manually. You will schedule downloads in a later step when implementing CDC.
  23. Choose Next.
    An image of the Amazon AppFlow Flow Trigger configuration, reflecting the completion of the prior steps.
  24. On the Map data fields page, select Manually map fields.
  25. For Source to destination field mapping, choose the drop-down box under Source field name and select Map all fields directly. This will bring down all fields as they are received, because we will instead implement data preparation in later steps.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 24 & 25.
  26. Under Partition and aggregation settings, you can set up the partitions in a way that works for your use case. For this example, we use a daily partition, so select Date and time and choose Daily.
  27. For Aggregation settings, leave it as the default of Don’t aggregate.
  28. Choose Next.
    An image of the Amazon AppFlow configuration, reflecting the completion of steps 26-28.
  29. On the Add filters page, you can create filters to only download specific data. For this example, you download all the data, so choose Next.
  30. Review and choose Create flow.
  31. When the flow is created, choose Run flow to start the initial data seeding. After some time, you should receive a banner indicating the run finished successfully.
    An image of the Amazon AppFlow configuration, reflecting the completion of step 31.

Review seed data

At this stage in the process, you now have data in your S3 environment. When new data files are created in the S3 bucket, it will automatically run an AWS Glue crawler to catalog the new data. You can see if it’s complete by reviewing the Step Functions state machine for a Succeeded run status. There is a link to the state machine on the CloudFormation stack’s Resources tab, which will redirect you to the Step Functions state machine.

A image showing the CloudFormation resources tab of the stack, with a link to the AWS Step Functions workflow.

When the state machine is complete, it’s time to review the raw Jira data with Athena. The database is as you specified in the CloudFormation stack (jiralake by default), and the table name is jira_raw. If you kept the default AWS Glue database name of jiralake, the Athena SQL is as follows:

SELECT * FROM "jiralake"."jira_raw" limit 10;

If you explore the data, you’ll notice that most of the data you would want to work with is actually packed into a column called fields. This means the data is not available as columns in your Athena queries, making it harder to select, filter, and sort individual fields within an Athena SQL query. This will be addressed in the next steps.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_raw" limit 10;

Set up CDC and unpack the fields columns

To add the ongoing CDC and reformat the data for analytics, we introduce a DataBrew job to transform the data and filter to the most recent version of each record as changes come in. You can do this by updating the CloudFormation stack with a flag that includes the CDC and data transformation steps.

  1. On the AWS CloudFormation console, return to the stack.
  2. Choose Update.
  3. Select Use current template and choose Next.
    An image showing Amazon CloudFormation, with steps 1-3 complete.
  4. For SetupOrCDC, choose CDC, then choose Next. This will enable both the CDC steps and the data transformation steps for the Jira data.
    An image showing Amazon CloudFormation, with step 4 complete.
  5. Continue choosing Next until you reach the Review section.
  6. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
    An image showing Amazon CloudFormation, with step 5-6 complete.
  7. Return to the Amazon AppFlow console and open your flow.
  8. On the Actions menu, choose Edit flow. We will now edit the flow trigger to run an incremental load on a periodic basis.
  9. Select Run flow on schedule.
  10. Configure the desired repeats, as well as start time and date. For this example, we choose Daily for Repeats and enter 1 for the number of days you’ll have the flow trigger. For Starting at, enter 01:00.
  11. Select Incremental transfer for Transfer mode.
  12. Choose Updated on the drop-down menu so that changes will be captured based on when the records were updated.
  13. Choose Save. With these settings in our example, the run will happen nightly at 1:00 AM.
    An image showing the Flow Trigger, with incremental transfer selected.

Review the analytics data

When the next incremental load occurs that results in new data, the Step Functions workflow will start the DataBrew job and populate a new staged analytical data table named jira_data in your Data Catalog database. If you don’t want to wait, you can trigger the Step Functions workflow manually.

The DataBrew job performs data transformation and filtering tasks. The job unpacks the key-values from the Jira JSON data and the raw Jira data, resulting in a tabular data schema that facilitates use with BI and AI/ML tools. As Jira items are changed, the changed item’s data is resent, resulting in multiple versions of an item in the raw data feed. The DataBrew job filters the raw data feed so that the resulting data table only contains the most recent version of each item. You could enhance this DataBrew job to further customize the data for your needs, such as renaming the generic Jira custom field names to reflect their business meaning.

When the Step Functions workflow is complete, we can query the data in Athena again using the following query:

SELECT * FROM "jiralake"."jira_data" limit 10;

You can see that in our transformed jira_data table, the nested JSON fields are broken out into their own columns for each field. You will also notice that we’ve filtered out obsolete records that have been superseded by more recent record updates in later data loads so the data is fresh. If you want to rename custom fields, remove columns, or restructure what comes out of the nested JSON, you can modify the DataBrew recipe to accomplish this. At this point, the data is ready to be used by your analytics tools, such as Amazon QuickSight.

An image demonstrating the Amazon Athena query SELECT * FROM "jiralake"."jira_data" limit 10;

Clean up

If you would like to discontinue this solution, you can remove it with the following steps:

  1. On the Amazon AppFlow console, deactivate the flow for Jira, and optionally delete it.
  2. On the Amazon S3 console, select the S3 bucket for the stack, and empty the bucket to delete the existing data.
  3. On the AWS CloudFormation console, delete the CloudFormation stack that you deployed.

Conclusion

In this post, we created a serverless incremental data load process for Jira that will synchronize data while handling custom fields using Amazon AppFlow, AWS Glue, and Step Functions. The approach uses Amazon AppFlow to incrementally load the data into Amazon S3. We then use AWS Glue and Step Functions to manage the extraction of the Jira custom fields and load them in a format to be queried by analytics services such as Athena, QuickSight, or Redshift Spectrum, or AI/ML services like Amazon SageMaker.

To learn more about AWS Glue and DataBrew, refer to Getting started with AWS Glue DataBrew. With DataBrew, you can take the sample data transformation in this project and customize the output to meet your specific needs. This could include renaming columns, creating additional fields, and more.

To learn more about Amazon AppFlow, refer to Getting started with Amazon AppFlow. Note that Amazon AppFlow supports integrations with many SaaS applications in addition to the Jira Cloud.

To learn more about orchestrating flows with Step Functions, see Create a Serverless Workflow with AWS Step Functions and AWS Lambda. The workflow could be enhanced to load the data into a data warehouse, such as Amazon Redshift, or trigger a refresh of a QuickSight dataset for analytics and reporting.

In future posts, we will cover how to unnest parent-child relationships within the Jira data using Athena and how to visualize the data using QuickSight.


About the Authors

Tom Romano is a Sr. Solutions Architect for AWS World Wide Public Sector from Tampa, FL, and assists GovTech and EdTech customers as they create new solutions that are cloud native, event driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics, and is an Analytics Specialist. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.

Shane Thompson is a Sr. Solutions Architect based out of San Luis Obispo, California, working with AWS Startups. He works with customers who use AI/ML in their business model and is passionate about democratizing AI/ML so that all customers can benefit from it. In his free time, Shane loves to spend time with his family and travel around the world.

Top Amazon QuickSight features launched in Q2 2022

Post Syndicated from Sindhu Chandra original https://aws.amazon.com/blogs/big-data/top-amazon-quicksight-features-launched-in-q2-2022/

Amazon QuickSight is a serverless, cloud-based business intelligence (BI) service that brings data insights to your teams and end-users through machine learning (ML)-powered dashboards and data visualizations, which can be accessed via QuickSight or embedded in apps and portals that your users access. This post shares the top QuickSight features and updates launched in Q2 2022 categorized into embedding, Amazon QuickSight Q, BI, and admin features.

Embedding

QuickSight offers a new embedding feature:

  • 1-click public embedding – QuickSight now allows you to embed your dashboards into public applications, wikis, and portals without any coding or development. Once enabled, anyone on the internet can start accessing these embedded dashboards with up-to-date information instantly, without server deployments or infrastructure licensing needed! To learn how to empower your end-users with access to insights, visit Amazon QuickSight 1-click public embedding.

An embedded dashboard example showing metrics for a contact center

QuickSight Q

You can take advantage of the following updates in Q:

  • Programmatic question submission – Q can now accept full questions as input without requiring users to enter them when used in embedded mode. This new feature allows developers to create questions as widgets at appropriate placements on their web applications, making it easy for users to discover the capability to ask questions about data within the current context of their user journey. To learn more, see Amazon QuickSight Embedding SDK.
  • Experience Q before signing up – QuickSight authors can now try, learn, and experience Q before signing up. You can choose from six different sample topics to explore relevant dashboard visualizations and ask questions about data in the context of exploration to fully explore Q’s capability before signing up. Get started with a free trial for QuickSight Q.

User inputs a question in natural language about sales numbers for the month by segment and gets answers on the embedded dashboard.

Business intelligence

QuickSight now offers the following BI features:

  • Table row and column size control – QuickSight now provides the flexibility for both authors and readers to use drag controller to resize rows and columns in a table or pivot table visual. You can adjust both row height and column width. To learn more, see Resizing rows and columns in tables and pivot tables.

Animation showing how to use drag controllers to resize rows and columns in a table

  • Level-aware calculations – QuickSight now supports a suite of functions called level-aware calculations (LAC). The new calculation capability brings flexibility and simplification for users to build advanced calculations and powerful analyses. LAC enables you to specify the level of granularity you want the window functions or aggregate functions to be conducted at. For more information, refer to Using level-aware calculations in Amazon QuickSight.
  • Show or hide fields on pivot tables – QuickSight now provides authors the ability to show or hide any column, row, or value fields from the field well context menu on pivot table visuals. With the show/hide column feature, you can hide unwanted columns that are often used for custom actions for interactivity and provide a better visual presentation. For further details, visit Showing and hiding pivot table columns in Amazon QuickSight.
  • Rolling date functionality – QuickSight now enables authors to set up rolling dates to dynamically generate dashboards for end-users. You can set up rolling rules to fetch a date, such as today, yesterday, or different combinations of (start/end) of (this/previous/next) (year/quarter/month/week/day), and dynamically update the dashboard content To learn how to create date filters, visit Creating date filters in analyses.
  • Bookmarks in dashboards – QuickSight now supports bookmarks in dashboards. Bookmarks allow QuickSight readers to save customized dashboard preferences into a list of bookmarks for easy one-click access to specific views of the dashboard without having to manually make multiple filter and parameter changes every time you want to access your dashboard. For further details, visit Bookmarking views of a dashboard.
  • Custom subtotals at all levels – QuickSight now enables custom subtotals at all levels on pivot tables. QuickSight authors can now customize how subtotals are displayed in a pivot table, with options to display subtotals for last level, all levels, or selected level. This customization is available for both rows and columns. To learn more about custom subtotals, refer to Displaying Totals and Subtotals.

Admin

QuickSight offers the following new admin features:

  • Monitor deployments in real time – QuickSight now supports monitoring of QuickSight assets by sending metrics to Amazon CloudWatch. QuickSight developers and administrators can use these metrics to observe and respond to the availability and performance of their QuickSight ecosystem in near-real time. To learn how to monitor your QuickSight deployments in real time, visit Monitor your Amazon QuickSight deployments using the new Amazon CloudWatch integration.
  • Public API for account provisioning – QuickSight now supports APIs for QuickSight account creation. Administrators and developers can automate deployment of QuickSight accounts in their organization at scale. You can now programmatically create accounts with QuickSight Enterprise and Enterprise + Q editions. For more information on account creation, visit CreateAccountSubscription.
  • API for account creation – QuickSight now supports API-based allow listing of domains where QuickSight data visualizations can be embedded. With this new capability, developers can easily scale their embedded analytics offerings across different applications for different customers quickly without any infrastructure setup or management. To learn more, visit Scale Amazon QuickSight embedded analytics with new API-based domain allow listing.

Conclusion

QuickSight serves millions of dashboard views weekly, enabling data-driven decision-making in organizations of all sizes, including customers like the NFL, 3M, Accenture, and more.

To stay up to date on all things new with QuickSight, visit What’s New with Analytics!


About the Author

Sindhu Chandra is a Senior Product Marketing Manager for Amazon QuickSight, AWS’ cloud-native, business intelligence (BI) service that delivers easy-to-understand insights to anyone, wherever they are.

From centralized architecture to decentralized architecture: How data sharing fine-tunes Amazon Redshift workloads

Post Syndicated from Jingbin Ma original https://aws.amazon.com/blogs/big-data/from-centralized-architecture-to-decentralized-architecture-how-data-sharing-fine-tunes-amazon-redshift-workloads/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, Amazon Redshift has become the most widely used cloud data warehouse.

With the significant growth of data for big data analytics over the years, some customers have asked how they should optimize Amazon Redshift workloads. In this post, we explore how to optimize workloads on Amazon Redshift clusters using Amazon Redshift RA3 nodes, data sharing, and pausing and resuming clusters. For more cost-optimization methods, refer to Getting the most out of your analytics stack with Amazon Redshift.

Key features of Amazon Redshift

First, let’s review some key features:

  • RA3 nodes – Amazon Redshift RA3 nodes are backed by a new managed storage model that gives you the power to separately optimize your compute power and your storage. They bring a few very important features, one of which is data sharing. RA3 nodes also support the ability to pause and resume, which allows you to easily suspend on-demand billing while the cluster is not being used.
  • Data sharing – Amazon Redshift data sharing offers you to extend the ease of use, performance, and cost benefits of Amazon Redshift in a single cluster to multi-cluster deployments while being able to share data. Data sharing enables instant, granular, and fast data access across Redshift clusters without the need to copy or move it. You can securely share live data with Amazon Redshift clusters in the same or different AWS accounts, and across regions. You can share data at many levels, including schemas, tables, views, and user-defined functions. You can also share the most up-to-date and consistent information as it’s updated in Amazon Redshift Serverless. It also provides fine-grained access controls that you can tailor for different users and businesses that all need access to the data. However, data sharing in Amazon Redshift has a few limitations.

Solution overview

In this use case, our customer is heavily using Amazon Redshift as their data warehouse for their analytics workloads, and they have been enjoying the possibility and convenience that Amazon Redshift brought to their business. They mainly use Amazon Redshift to store and process user behavioral data for BI purposes. The data has increased by hundreds of gigabytes daily in recent months, and employees from departments continuously run queries against the Amazon Redshift cluster on their BI platform during business hours.

The company runs four major analytics workloads on a single Amazon Redshift cluster, because some data is used by all workloads:

  • Queries from the BI platform – Various queries run mainly during business hours.
  • Hourly ETL – This extract, transform, and load (ETL) job runs in the first few minutes of each hour. It generally takes about 40 minutes.
  • Daily ETL – This job runs twice a day during business hours, because the operation team needs to get daily reports before the end of the day. Each job normally takes between 1.5–3 hours. It’s the second-most resource-heavy workload.
  • Weekly ETL – This job runs in the early morning every Sunday. It’s the most resource-heavy workload. The job normally takes 3–4 hours.

The analytics team has migrated to the RA3 family and increased the number of nodes of the Amazon Redshift cluster to 12 over the years to keep the average runtime of queries from their BI tool within an acceptable time due to the data size, especially when other workloads are running.

However, they have noticed that performance is reduced while running ETL tasks, and the duration of ETL tasks is long. Therefore, the analytics team wants to explore solutions to optimize their Amazon Redshift cluster.

Because CPU utilization spikes appear while the ETL tasks are running, the AWS team’s first thought was to separate workloads and relevant data into multiple Amazon Redshift clusters with different cluster sizes. By reducing the total number of nodes, we hoped to reduce the cost of Amazon Redshift.

After a series of conversations, the AWS team found that one of the reasons that the customer keeps all workloads on the 12-node Amazon Redshift cluster is to manage the performance of queries from their BI platform, especially while running ETL workloads, which have a big impact on the performance of all workloads on the Amazon Redshift cluster. The obstacle is that many tables in the data warehouse are required to be read and written by multiple workloads, and only the producer of a data share can update the shared data.

The challenge of dividing the Amazon Redshift cluster into multiple clusters is data consistency. Some tables need to be read by ETL workloads and written by BI workloads, and some tables are the opposite. Therefore, if we duplicate data into two Amazon Redshift clusters or only create a data share from the BI cluster to the reporting cluster, the customer will have to develop a data synchronization process to keep the data consistent between all Amazon Redshift clusters, and this process could be very complicated and unmaintainable.

After more analysis to gain an in-depth understanding of the customer’s workloads, the AWS team found that we could put tables into four groups, and proposed a multi-cluster, two-way data sharing solution. The purpose of the solution is to divide the workloads into separate Amazon Redshift clusters so that we can use Amazon Redshift to pause and resume clusters for periodic workloads to reduce the Amazon Redshift running costs, because clusters can still access a single copy of data that is required for workloads. The solution should meet the data consistency requirements without building a complicated data synchronization process.

The following diagram illustrates the old architecture (left) compared to the new multi-cluster solution (right).

Improve the old architecture (left) to the new multi-cluster solution (right)

Dividing workloads and data

Due to the characteristics of the four major workloads, we categorized workloads into two categories: long-running workloads and periodic-running workloads.

The long-running workloads are for the BI platform and hourly ETL jobs. Because the hourly ETL workload requires about 40 minutes to run, the gain is small even if we migrate it to an isolated Amazon Redshift cluster and pause and resume it every hour. Therefore, we leave it with the BI platform.

The periodic-running workloads are the daily and weekly ETL jobs. The daily job generally takes about 1 hour and 40 minutes to 3 hours, and the weekly job generally takes 3–4 hours.

Data sharing plan

The next step is identifying all data (tables) access patterns of each category. We identified four types of tables:

  • Type 1 – Tables are only read and written by long-running workloads
  • Type 2 – Tables are read and written by long-running workloads, and are also read by periodic-running workloads
  • Type 3 – Tables are read and written by periodic-running workloads, and are also read by long-running workloads
  • Type 4 – Tables are only read and written by periodic-running workloads

Fortunately, there is no table that is required to be written by all workloads. Therefore, we can separate the Amazon Redshift cluster into two Amazon Redshift clusters: one for the long-running workloads, and the other for periodic-running workloads with 20 RA3 nodes.

We created a two-way data share between the long-running cluster and the periodic-running cluster. For type 2 tables, we created a data share on the long-running cluster as the producer and the periodic-running cluster as the consumer. For type 3 tables, we created a data share on the periodic-running cluster as the producer and the long-running cluster as the consumer.

The following diagram illustrates this data sharing configuration.

The long-running cluster (producer) shares type 2 tables to the periodic-running cluster (consumer). The periodic-running cluster (producer’) shares type 3 tables to the long-running cluster (consumer’)

Build two-way data share across Amazon Redshift clusters

In this section, we walk through the steps to build a two-way data share across Amazon Redshift clusters. First, let’s take a snapshot of the original Amazon Redshift cluster, which became the long-running cluster later.

Take a snapshot of the long-running-cluster from the Amazon Redshift console

Now, let’s create a new Amazon Redshift cluster with 20 RA3 nodes for periodic-running workloads. Then we migrate the type 3 and type 4 tables to the periodic-running cluster. Make sure you choose the ra3 node type. (Amazon Redshift Serverless supports data sharing too, and it becomes generally available in July 2022, so it is also an option now.)

Create the periodic-running-cluster. Make sure you select the ra3 node type.

Create the long-to-periodic data share

The next step is to create the long-to-periodic data share. Complete the following steps:

  1. On the periodic-running cluster, get the namespace by running the following query:
SELECT current_namespace;

Make sure record the namespace.

  1. On the long-running cluster, we run queries similar to the following:
CREATE DATASHARE ltop_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ltop_share ADD SCHEMA public_long;
ALTER DATASHARE ltop_share ADD ALL TABLES IN SCHEMA public_long;
GRANT USAGE ON DATASHARE ltop_share TO NAMESPACE '[periodic-running-cluster-namespace]';
  1. We can validate the long-to-periodic data share using the following command:
SHOW datashares;
  1. After we validate the data share, we get the long-running cluster namespace with the following query:
SELECT current-namespace;

Make sure record the namespace.

  1. On the periodic-running cluster, run the following command to load the data from the long-to-periodic data share with the long-running cluster namespace:
CREATE DATABASE ltop FROM DATASHARE ltop_share OF NAMESPACE '[long-running-cluster-namespace]';
  1. Confirm that we have read access to tables in the long-to-periodic data share.

Create the periodic-to-long data share

The next step is to create the periodic-to-long data share. We use the namespaces of the long-running cluster and the periodic-running cluster that we collected in the previous step.

  1. On the periodic-running cluster, run queries similar to the following to create the periodic-to-long data share:
CREATE DATASHARE ptol_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ptol_share ADD SCHEMA public_periodic;
ALTER DATASHARE ptol_share ADD ALL TABLES IN SCHEMA public_periodic;
GRANT USAGE ON DATASHARE ptol_share TO NAMESPACE '[long-running-cluster-namespace]';
  1. Validate the data share using the following command:
SHOW datashares;
  1. On the long-running cluster, run the following command to load the data from the periodic-to-long data using the periodic-running cluster namespace:
CREATE DATABASE ptol FROM DATASHARE ptol_share OF NAMESPACE '[periodic-running-cluster-namespace]';
  1. Check that we have read access to the tables in the periodic-to-long data share.

At this stage, we have separated workloads into two Amazon Redshift clusters and built a two-way data share across two Amazon Redshift clusters.

The next step is updating the code of different workloads to use the correct endpoints of two Amazon Redshift clusters and perform consolidated tests.

Pause and resume the periodic-running Amazon Redshift cluster

Let’s update the crontab scripts, which run periodic-running workloads. We make two updates.

  1. When the scripts start, call the Amazon Redshift check and resume cluster APIs to resume the periodic-running Amazon Redshift cluster when the cluster is paused:
    aws redshift resume-cluster --cluster-identifier [periodic-running-cluster-id]

  2. After the workloads are finished, call the Amazon Redshift pause cluster API with the cluster ID to pause the cluster:
    aws redshift pause-cluster --cluster-identifier [periodic-running-cluster-id]

Results

After we migrated the workloads to the new architecture, the company’s analytics team ran some tests to verify the results.

According to tests, the performance of all workloads improved:

  • The BI workload is about 100% faster during the ETL workload running periods
  • The hourly ETL workload is about 50% faster
  • The daily workload duration reduced to approximately 40 minutes, from a maximum of 3 hours
  • The weekly workload duration reduced to approximately 1.5 hours, from a maximum of 4 hours

All functionalities work properly, and cost of the new architecture only increased approximately 13%, while over 10% of new data had been added during the testing period.

Learnings and limitations

After we separated the workloads into different Amazon Redshift clusters, we discovered a few things:

  • The performance of the BI workloads was 100% faster because there was no resource competition with daily and weekly ETL workloads anymore
  • The duration of ETL workloads on the periodic-running cluster was reduced significantly because there were more nodes and no resource competition from the BI and hourly ETL workloads
  • Even when over 10% new data was added, the overall cost of the Amazon Redshift clusters only increased by 13%, due to using the cluster pause and resume function of the Amazon Redshift RA3 family

As a result, we saw a 70% price-performance improvement of the Amazon Redshift cluster.

However, there are some limitations of the solution:

  • To use the Amazon Redshift pause and resume function, the code for calling the Amazon Redshift pause and resume APIs must be added to all scheduled scripts that run ETL workloads on the periodic-running cluster
  • Amazon Redshift clusters require several minutes to finish pausing and resuming, although you’re not charged during these processes
  • The size of Amazon Redshift clusters can’t automatically scale in and out depending on workloads

Next steps

After improving performance significantly, we can explore the possibility of reducing the number of nodes of the long-running cluster to reduce Amazon Redshift costs.

Another possible optimization is using Amazon Redshift Spectrum to reduce the cost of Amazon Redshift on cluster storage. With Redshift Spectrum, multiple Amazon Redshift clusters can concurrently query and retrieve the same structured and semistructured dataset in Amazon Simple Storage Service (Amazon S3) without the need to make copies of the data for each cluster or having to load the data into Amazon Redshift tables.

Amazon Redshift Serverless was announced for preview in AWS re:Invent 2021 and became generally available in July 2022. Redshift Serverless automatically provisions and intelligently scales your data warehouse capacity to deliver best-in-class performance for all your analytics. You only pay for the compute used for the duration of the workloads on a per-second basis. You can benefit from this simplicity without making any changes to your existing analytics and BI applications. You can also share data for read purposes across different Amazon Redshift Serverless instances within or across AWS accounts.

Therefore, we can explore the possibility of removing the need to script for pausing and resuming the periodic-running cluster by using Redshift Serverless to make the management easier. We can also explore the possibility of improving the granularity of workloads.

Conclusion

In this post, we discussed how to optimize workloads on Amazon Redshift clusters using RA3 nodes, data sharing, and pausing and resuming clusters. We also explored a use case implementing a multi-cluster two-way data share solution to improve workload performance with a minimum code change. If you have any questions or feedback, please leave them in the comments section.


About the authors

Jingbin Ma

Jingbin Ma is a Sr. Solutions Architect at Amazon Web Services. He helps customers build well-architected applications using AWS services. He has many years of experience working in the internet industry, and his last role was CTO of a New Zealand IT company before joining AWS. He is passionate about serverless and infrastructure as code.

Chao PanChao Pan is a Data Analytics Solutions Architect at Amazon Web Services. He’s responsible for the consultation and design of customers’ big data solution architectures. He has extensive experience in open-source big data. Outside of work, he enjoys hiking.

Optimize Federated Query Performance using EXPLAIN and EXPLAIN ANALYZE in Amazon Athena

Post Syndicated from Nishchai JM original https://aws.amazon.com/blogs/big-data/optimize-federated-query-performance-using-explain-and-explain-analyze-in-amazon-athena/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. In 2019, Athena added support for federated queries to run SQL queries across data stored in relational, non-relational, object, and custom data sources.

In 2021, Athena added support for the EXPLAIN statement, which can help you understand and improve the efficiency of your queries. The EXPLAIN statement provides a detailed breakdown of a query’s run plan. You can analyze the plan to identify and reduce query complexity and improve its runtime. You can also use EXPLAIN to validate SQL syntax prior to running the query. Doing so helps prevent errors that would have occurred while running the query.

Athena also added EXPLAIN ANALYZE, which displays the computational cost of your queries alongside their run plans. Administrators can benefit from using EXPLAIN ANALYZE because it provides a scanned data count, which helps you reduce financial impact due to user queries and apply optimizations for better cost control.

In this post, we demonstrate how to use and interpret EXPLAIN and EXPLAIN ANALYZE statements to improve Athena query performance when querying multiple data sources.

Solution overview

To demonstrate using EXPLAIN and EXPLAIN ANALYZE statements, we use the following services and resources:

Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in your AWS account. The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query. We use Athena data source connectors to connect to data sources external to Amazon S3.

Prerequisites

To deploy the CloudFormation template, you must have the following:

Provision resources with AWS CloudFormation

To deploy the CloudFormation template, complete the following steps:

  1. Choose Launch Stack:

  1. Follow the prompts on the AWS CloudFormation console to create the stack.
  2. Note the key-value pairs on the stack’s Outputs tab.

You use these values when configuring the Athena data source connectors.

The CloudFormation template creates the following resources:

  • S3 buckets to store data and act as temporary spill buckets for Lambda
  • AWS Glue Data Catalog tables for the data in the S3 buckets
  • A DynamoDB table and Amazon RDS for MySQL tables, which are used to join multiple tables from different sources
  • A VPC, subnets, and endpoints, which are needed for Amazon RDS for MySQL and DynamoDB

The following figure shows the high-level data model for the data load.

Create the DynamoDB data source connector

To create the DynamoDB connector for Athena, complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Data sources, select Amazon DynamoDB.
  4. Choose Next.

  1. For Data source name, enter DDB.

  1. For Lambda function, choose Create Lambda function.

This opens a new tab in your browser.

  1. For Application name, enter AthenaDynamoDBConnector.
  2. For SpillBucket, enter the value from the CloudFormation stack for AthenaSpillBucket.
  3. For AthenaCatalogName, enter dynamodb-lambda-func.
  4. Leave the remaining values at their defaults.
  5. Select I acknowledge that this app creates custom IAM roles and resource policies.
  6. Choose Deploy.

You’re returned to the Connect data sources section on the Athena console.

  1. Choose the refresh icon next to Lambda function.
  2. Choose the Lambda function you just created (dynamodb-lambda-func).

  1. Choose Next.
  2. Review the settings and choose Create data source.
  3. If you haven’t already set up the Athena query results location, choose View settings on the Athena query editor page.

  1. Choose Manage.
  2. For Location of query result, browse to the S3 bucket specified for the Athena spill bucket in the CloudFormation template.
  3. Add Athena-query to the S3 path.
  4. Choose Save.

  1. In the Athena query editor, for Data source, choose DDB.
  2. For Database, choose default.

You can now explore the schema for the sportseventinfo table; the data is the same in DynamoDB.

  1. Choose the options icon for the sportseventinfo table and choose Preview Table.

Create the Amazon RDS for MySQL data source connector

Now let’s create the connector for Amazon RDS for MySQL.

  1. On the Athena console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Data sources, select MySQL.
  4. Choose Next.

  1. For Data source name, enter MySQL.

  1. For Lambda function, choose Create Lambda function.

  1. For Application name, enter AthenaMySQLConnector.
  2. For SecretNamePrefix, enter AthenaMySQLFederation.
  3. For SpillBucket, enter the value from the CloudFormation stack for AthenaSpillBucket.
  4. For DefaultConnectionString, enter the value from the CloudFormation stack for MySQLConnection.
  5. For LambdaFunctionName, enter mysql-lambda-func.
  6. For SecurityGroupIds, enter the value from the CloudFormation stack for RDSSecurityGroup.
  7. For SubnetIds, enter the value from the CloudFormation stack for RDSSubnets.
  8. Select I acknowledge that this app creates custom IAM roles and resource policies.
  9. Choose Deploy.

  1. On the Lambda console, open the function you created (mysql-lambda-func).
  2. On the Configuration tab, under Environment variables, choose Edit.

  1. Choose Add environment variable.
  2. Enter a new key-value pair:
    • For Key, enter MYSQL_connection_string.
    • For Value, enter the value from the CloudFormation stack for MySQLConnection.
  3. Choose Save.

  1. Return to the Connect data sources section on the Athena console.
  2. Choose the refresh icon next to Lambda function.
  3. Choose the Lambda function you created (mysql-lamdba-function).

  1. Choose Next.
  2. Review the settings and choose Create data source.
  3. In the Athena query editor, for Data Source, choose MYSQL.
  4. For Database, choose sportsdata.

  1. Choose the options icon by the tables and choose Preview Table to examine the data and schema.

In the following sections, we demonstrate different ways to optimize our queries.

Optimal join order using EXPLAIN plan

A join is a basic SQL operation to query data on multiple tables using relations on matching columns. Join operations affect how much data is read from a table, how much data is transferred to the intermediate stages through networks, and how much memory is needed to build up a hash table to facilitate a join.

If you have multiple join operations and these join tables aren’t in the correct order, you may experience performance issues. To demonstrate this, we use the following tables from difference sources and join them in a certain order. Then we observe the query runtime and improve performance by using the EXPLAIN feature from Athena, which provides some suggestions for optimizing the query.

The CloudFormation template you ran earlier loaded data into the following services:

AWS Storage Table Name Number of Rows
Amazon DynamoDB sportseventinfo 657
Amazon S3 person 7,025,585
Amazon S3 ticketinfo 2,488

Let’s construct a query to find all those who participated in the event by type of tickets. The query runtime with the following join took approximately 7 mins to complete:

SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."person" p, 
"AwsDataCatalog"."athenablog"."ticketinfo" t 
WHERE 
t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

Now let’s use EXPLAIN on the query to see its run plan. We use the same query as before, but add explain (TYPE DISTRIBUTED):

EXPLAIN (TYPE DISTRIBUTED)
SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."person" p, 
"AwsDataCatalog"."athenablog"."ticketinfo" t 
WHERE 
t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

The following screenshot shows our output

Notice the cross-join in Fragment 1. The joins are converted to a Cartesian product for each table, where every record in a table is compared to every record in another table. Therefore, this query takes a significant amount of time to complete.

To optimize our query, we can rewrite it by reordering the joining tables as sportseventinfo first, ticketinfo second, and person last. The reason for this is because the WHERE clause, which is being converted to a JOIN ON clause during the query plan stage, doesn’t have the join relationship between the person table and sportseventinfo table. Therefore, the query plan generator converted the join type to cross-joins (a Cartesian product), which less efficient. Reordering the tables aligns the WHERE clause to the INNER JOIN type, which satisfies the JOIN ON clause and runtime is reduced from 7 minutes to 10 seconds.

The code for our optimized query is as follows:

SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."ticketinfo" t, 
"AwsDataCatalog"."athenablog"."person" p 
WHERE 
t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

The following is the EXPLAIN output of our query after reordering the join clause:

EXPLAIN (TYPE DISTRIBUTED) 
SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"DDB"."default"."sportseventinfo" e, 
"AwsDataCatalog"."athenablog"."ticketinfo" t, 
"AwsDataCatalog"."athenablog"."person" p 
WHERE t.sporting_event_id = cast(e.eventid as double) 
AND t.ticketholder_id = p.id

The following screenshot shows our output.

The cross-join changed to INNER JOIN with join on columns (eventid, id, ticketholder_id), which results in the query running faster. Joins between the ticketinfo and person tables converted to the PARTITION distribution type, where both left and right tables are hash-partitioned across all worker nodes due to the size of the person table. The join between the sportseventinfo table and ticketinfo are converted to the REPLICATED distribution type, where one table is hash-partitioned across all worker nodes and the other table is replicated to all worker nodes to perform the join operation.

For more information about how to analyze these results, refer to Understanding Athena EXPLAIN statement results.

As a best practice, we recommend having a JOIN statement along with an ON clause, as shown in the following code:

SELECT t.id AS ticket_id, 
e.eventid, 
p.first_name 
FROM 
"AwsDataCatalog"."athenablog"."person" p 
JOIN "AwsDataCatalog"."athenablog"."ticketinfo" t ON t.ticketholder_id = p.id 
JOIN "ddb"."default"."sportseventinfo" e ON t.sporting_event_id = cast(e.eventid as double)

Also as a best practice when you join two tables, specify the larger table on the left side of join and the smaller table on the right side of the join. Athena distributes the table on the right to worker nodes, and then streams the table on the left to do the join. If the table on the right is smaller, then less memory is used and the query runs faster.

In the following sections, we present examples of how to optimize pushdowns for filter predicates and projection filter operations for the Athena data source using EXPLAIN ANALYZE.

Pushdown optimization for the Athena connector for Amazon RDS for MySQL

A pushdown is an optimization to improve the performance of a SQL query by moving its processing as close to the data as possible. Pushdowns can drastically reduce SQL statement processing time by filtering data before transferring it over the network and filtering data before loading it into memory. The Athena connector for Amazon RDS for MySQL supports pushdowns for filter predicates and projection pushdowns.

The following table summarizes the services and tables we use to demonstrate a pushdown using Aurora MySQL.

Table Name Number of Rows Size in KB
player_partitioned 5,157 318.86
sport_team_partitioned 62 5.32

We use the following query as an example of a filtering predicate and projection filter:

SELECT full_name,
name 
FROM "sportsdata"."player_partitioned" a 
JOIN "sportsdata"."sport_team_partitioned" b ON a.sport_team_id=b.id 
WHERE a.id='1.0'

This query selects the players and their team based on their ID. It serves as an example of both filter operations in the WHERE clause and projection because it selects only two columns.

We use EXPLAIN ANALYZE to get the cost for the running this query:

EXPLAIN ANALYZE 
SELECT full_name,
name 
FROM "MYSQL"."sportsdata"."player_partitioned" a 
JOIN "MYSQL"."sportsdata"."sport_team_partitioned" b ON a.sport_team_id=b.id 
WHERE a.id='1.0'

The following screenshot shows the output in Fragment 2 for the table player_partitioned, in which we observe that the connector has a successful pushdown filter on the source side, so it tries to scan only one record out of the 5,157 records in the table. The output also shows that the query scan has only two columns (full_name as the projection column and sport_team_id and the join column), and uses SELECT and JOIN, which indicates the projection pushdown is successful. This helps reduce the data scan when using Athena data source connectors.

Now let’s look at the conditions in which a filter predicate pushdown doesn’t work with Athena connectors.

LIKE statement in filter predicates

We start with the following example query to demonstrate using the LIKE statement in filter predicates:

SELECT * 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name LIKE '%Aar%'

We then add EXPLAIN ANALYZE:

EXPLAIN ANALYZE 
SELECT * 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name LIKE '%Aar%'

The EXPLAIN ANALYZE output shows that the query performs the table scan (scanning the table player_partitioned, which contains 5,157 records) for all the records even though the WHERE clause only has 30 records matching the condition %Aar%. Therefore, the data scan shows the complete table size even with the WHERE clause.

We can optimize the same query by selecting only the required columns:

EXPLAIN ANALYZE 
SELECT sport_team_id,
full_name 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name LIKE '%Aar%'

From the EXPLAIN ANALYZE output, we can observe that the connector supports the projection filter pushdown, because we select only two columns. This brought the data scan size down to half of the table size.

OR statement in filter predicates

We start with the following query to demonstrate using the OR statement in filter predicates:

SELECT id,
first_name 
FROM "MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name = 'Aaron' OR id ='1.0'

We use EXPLAIN ANALYZE with the preceding query as follows:

EXPLAIN ANALYZE 
SELECT * 
FROM 
"MYSQL"."sportsdata"."player_partitioned" 
WHERE first_name = 'Aaron' OR id ='1.0'

Similar to the LIKE statement, the following output shows that query scanned the table instead of pushing down to only the records that matched the WHERE clause. This query outputs only 16 records, but the data scan indicates a complete scan.

Pushdown optimization for the Athena connector for DynamoDB

For our example using the DynamoDB connector, we use the following data:

Table Number of Rows Size in KB
sportseventinfo 657 85.75

Let’s test the filter predicate and project filter operation for our DynamoDB table using the following query. This query tries to get all the events and sports for a given location. We use EXPLAIN ANALYZE for the query as follows:

EXPLAIN ANALYZE 
SELECT EventId,
Sport 
FROM "DDB"."default"."sportseventinfo" 
WHERE Location = 'Chase Field'

The output of EXPLAIN ANALYZE shows that the filter predicate retrieved only 21 records, and the project filter selected only two columns to push down to the source. Therefore, the data scan for this query is less than the table size.

Now let’s see where filter predicate pushdown doesn’t work. In the WHERE clause, if you apply the TRIM() function to the Location column and then filter, predicate pushdown optimization doesn’t apply, but we still see the projection filter optimization, which does apply. See the following code:

EXPLAIN ANALYZE 
SELECT EventId,
Sport 
FROM "DDB"."default"."sportseventinfo" 
WHERE trim(Location) = 'Chase Field'

The output of EXPLAIN ANALYZE for this query shows that the query scans all the rows but is still limited to only two columns, which shows that the filter predicate doesn’t work when the TRIM function is applied.

We’ve seen from the preceding examples that the Athena data source connector for Amazon RDS for MySQL and DynamoDB do support filter predicates and projection predicates for pushdown optimization, but we also saw that operations such as LIKE, OR, and TRIM when used in the filter predicate don’t support pushdowns to the source. Therefore, if you encounter unexplained charges in your federated Athena query, we recommend using EXPLAIN ANALYZE with the query and determine whether your Athena connector supports the pushdown operation or not.

Please note that running EXPLAIN ANALYZE incurs cost because it scans the data.

Conclusion

In this post, we showcased how to use EXPLAIN and EXPLAIN ANALYZE to analyze Athena SQL queries for data sources on AWS S3 and Athena federated SQL query for data source like DynamoDB and Amazon RDS for MySQL. You can use this as an example to optimize queries which would also result in cost savings.


About the Authors

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web services. He specializes in building Big-data applications and help customer to modernize their applications on Cloud. He thinks Data is new oil and spends most of his time in deriving insights out of the Data.

Varad Ram is Senior Solutions Architect in Amazon Web Services. He likes to help customers adopt to cloud technologies and is particularly interested in artificial intelligence. He believes deep learning will power future technology growth. In his spare time, he like to be outdoor with his daughter and son.

Simplify and optimize Python package management for AWS Glue PySpark jobs with AWS CodeArtifact

Post Syndicated from Ashok Padmanabhan original https://aws.amazon.com/blogs/big-data/simplify-and-optimize-python-package-management-for-aws-glue-pyspark-jobs-with-aws-codeartifact/

Data engineers use various Python packages to meet their data processing requirements while building data pipelines with AWS Glue PySpark Jobs. Languages like Python and Scala are commonly used in data pipeline development. Developers can take advantage of their open-source packages or even customize their own to make it easier and faster to perform use cases, such as data manipulation and analysis. However, managing standardized packages can be cumbersome with multiple teams using different versions of packages, installing non-approved packages, and causing duplicate development effort due to the lack of visibility of what is available at the enterprise level. This can be especially challenging in large enterprises with multiple data engineering teams.

ETL Developers have requirements to use additional packages for their AWS Glue ETL jobs. With security being job zero for customers, many will restrict egress traffic from their VPC to the public internet, and they need a way to manage the packages used by applications including their data processing pipelines.

Our proposed solution will enable you with network egress restrictions to manage packages centrally with AWS CodeArtifact and use their favorite libraries in their AWS Glue ETL PySpark code. In this post, we’ll describe how CodeArtifact can be used for managing packages and modules for AWS Glue ETL jobs, and we’ll demo a solution using Glue PySpark jobs that run within VPC Subnets that have no internet access.

Solution overview

The solution uses CodeArtifact as a tool to make it easier for organizations of any size to securely store, publish, and share software packages used in their ETL with AWS Glue. VPC Endpoints will be enabled for CodeArtifact and Glue to enable private link connections. AWS Step Functions makes it easy to coordinate the orchestration of components used in the data processing pipeline. Native integrations with both CodeArtifact and AWS Glue enable the workflow to both authenticate the request to CodeArtifact and start the AWS Glue ETL job.

The following architecture shows an implementation of a solution using AWS Glue, CodeArtifact, and Step Functions to use additional Python modules without egress internet access. The solution is deployed using AWS Cloud Development Kit (AWS CDK), an open-source software development framework to define your cloud application resources using familiar programming languages.

Solution Architecture for the blog post

Fig 1: Architecture Diagram for the Solution

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploying an AWS CDK stack to provision the following AWS Resources
    1. CodeArtifact
    2. An AWS Glue job
    3. Step Functions workflow
    4. Amazon Simple Storage Service (Amazon S3) bucket
    5. A VPC with a private Subnet and VPC Endpoints to Amazon S3 and CodeArtifact
  2. Validate the Deployment.
  3. Run a Sample Workflow – This workflow will run an AWS Glue PySpark job that uses a custom Python library, and an upgraded version of boto3.
  4. Cleaning up your resources.

Prerequisites

Make sure that you complete the following steps as prerequisites:

The solution

Launching your AWS CDK Stack

Step 1: Using your device’s command line, check out our Git repository to a local directory on your device:

git clone https://github.com/aws-samples/python-lib-management-without-internet-for-aws-glue-in-private-subnets.git

Step 2: Change directories to the new directory Amazon S3 script location:

cd python-lib-management-without-internet-for-aws-glue-in-private-subnets/scripts/s3

Step 3: Download the following CSV, which contains New York City Taxi and Limousine Commission (TLC) Trip weekly trips. This will serve as the input source for the AWS Glue Job:

aws s3 cp s3://nyc-tlc/misc/FOIL_weekly_trips_apps.csv .

Step 4: Change the directories to the path where the app.py file is located (in reference to the previous step, execute the following step):

cd ../..

Step 5: Create a virtual environment:

macOS/Linux:
python3 -m venv .env

Windows:
python -m venv .env

Step 6: Activate the virtual environment after the init process completes and the virtual environment is created:

macOS/Linux:
source .env/bin/activate

Windows:
.env\Scripts\activate.bat

Step 7: Install the required dependencies:

pip3 install -r requirements.txt

Step 8: Make sure that your AWS profile is setup along with the region that you want to deploy as mentioned in the prerequisite. Synthesize the templates. AWS CDK apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:

cdk synthesize

Step 9: BootStrap the cdk app using the following command:

cdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>

Replace the place holder AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the region to be deployed.

This step provisions the initial resources, including an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.

Step 10: Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the --require-approval never flag:

cdk deploy "*" --require-approval never

While the AWS CDK deploys the CloudFormation stacks, you can follow the deployment progress in your terminal:

AWS CDK Deployment progress in terminal

Fig 2: AWS CDK Deployment progress in terminal

Once the deployment is successful, you’ll see the successful status as follows:

AWS CDK Deployment completion success

Fig 3: AWS CDK Deployment completion success

Step 11: Log in to the AWS Console, go to CloudFormation, and see the output of the ApplicationStack stack:

AWS CloudFormation stack output

Fig 4: AWS CloudFormation stack output

Note the values of the DomainName and RepositoryName variables. We’ll use them in the next step to upload our artifacts

Step 12: We will upload a custom library into the repo that we created. This will be used by our Glue ETL job.

  • Install twine using pip:
python3 -m pip install twine

The custom python package glueutils-0.2.0.tar.gz can be found under this folder of the cloned repo:

cd scripts/custom_glue_library
  • Configure twine with the login command (additional details here ). Refer to step 11 for the DomainName and RepositoryName from the CloudFormation output:
aws codeartifact login --tool twine --domain <DomainName> --domain-owner <AWS_ACCOUNTID> --repository <RepositoryName>
  • Publish Python package assets:
twine upload --repository codeartifact glueutils-0.2.0.tar.gz
Python package publishing using twine

Fig 5: Python package publishing using twine

Validate the Deployment

The AWS CDK stack will deploy the following AWS resources:

  1. Amazon Virtual Private Cloud (Amazon VPC)
    1. One Private Subnet
  2. AWS CodeArtifact
    1. CodeArtifact Repository
    2. CodeArtifact Domain
    3. CodeArtifact Upstream Repository
  3. AWS Glue
    1. AWS Glue Job
    2. AWS Glue Database
    3. AWS Glue Connection
  4. AWS Step Function
  5. Amazon S3 Bucket for AWS CDK and also for storing scripts and CSV file
  6. IAM Roles and Policies
  7. Amazon Elastic Compute Cloud (Amazon EC2) Security Group

Step 1: Browse to the AWS account and region via the AWS Console to which the resources are deployed.

Step 2: Browse the Subnet page (https://<region> .console.aws.amazon.com/vpc/home?region=<region> #subnets:) (*Replace region with actual AWS Region to which your resources are deployed)

Step 3: Select the Subnet with name as ApplicationStack/enterprise-repo-vpc/Enterprise-Repo-Private-Subnet1

Step 4: Select the Route Table and validate that there are no Internet Gateway or NAT Gateway for routes to Internet, and that it’s similar to the following image:

Route table validation

Fig 6: Route table validation

Step 5: Navigate to the CodeArtifact console and review the repositories created. The enterprise-repo is your local repository, and pypi-store is the upstream repository connected to the PyPI, providing artifacts from pypi.org.

AWS CodeArifact repositories created

Fig 7: AWS CodeArifact repositories created

Step 6: Navigate to enterprise-repo and search for glueutils. This is the custom python package that we published.

AWS CodeArifact custom python package published

Fig 8: AWS CodeArifact custom python package published

Step 7: Navigate to Step Functions Console and review the enterprise-repo-step-function as follows:

AWS Step Functions workflow

Fig 9: AWS Step Functions workflow

The diagram shows how the Step Functions workflow will orchestrate the pattern.

  1. The first step CodeArtifactGetAuthorizationToken calls the getAuthorizationToken API to generate a temporary authorization token for accessing repositories in the domain (this token is valid for 15 mins.).
  2. The next step GenerateCodeArtifactURL takes the authorization token from the response and generates the CodeArtifact URL.
  3. Then, this will move into the GlueStartJobRun state, which makes a synchronous API call to run the AWS Glue job.

Step 8: Navigate to the AWS Glue Console and select the Jobs tab, then select enterprise-repo-glue-job.

The AWS Glue job is created with the following script and AWS Glue Connection enterprise-repo-glue-connection. The AWS Glue connection is a Data Catalog object that enables the job to connect to sources and APIs from within the VPC. The network type connection runs the job from within the private subnet to make requests to Amazon S3 and CodeArtifact over the VPC endpoint connection. This enables the job to run without any traffic through the internet.

Note the connections section in the AWS Glue PySpark Job, which makes the Glue job run on the private subnet in the VPC provisioned.

AWS Glue network connections

Fig 10: AWS Glue network connections

The job takes an Amazon S3 bucket, Glue Database, Python Job Installer Option, and Additional Python Modules as job parameters. The parameters --additional-python-modules and --python-modules-installer-option are passed to install the selected Python module from a PyPI repository hosted in AWS CodeArtifact.

The script itself first reads the Amazon S3 input path of the taxi data in the CSV format. A light transformation to sum the total trips by year, week, and app is performed. Then the output is written to an Amazon S3 path as parquet . A partitioned table in the AWS Glue Data Catalog will either be created or updated if it already exists .

You can find the Glue PySpark script here.

Run a sample workflow

The following steps will demonstrate how to run a sample workflow:

Step 1: Navigate to the Step Functions Console and select the enterprise-repo-step-function.

Step 2: Select Start execution and input the following: We’re including the glueutils and latest boto3 libraries as part of the job run. It is always recommended to pin your python dependencies to avoid any breaking change due to a future version of dependency . In the below example, the latest available version of boto3, and the 0.2.0 version of glueutils will be installed. To pin it to a specific release you may add  boto3==1.24.2   (Current latest release at the time of publishing this post).

{"pythonmodules": "boto3,glueutils==0.2.0"}

Step 3: Select Start execution and wait until Execution Status is Succeeded. This may take a few minutes.

Step 4: Navigate to the CodeArtifact Console to review the enterprise-repo repository. You’ll see the cached PyPi packages and all of their dependencies pulled down from PyPi.

Step 5: In the Glue Console under the Runs section of the enterprise-glue-job, you’ll see the parameters passed:

Fig 11 : AWS Glue job execution history

Fig 11 : AWS Glue job execution history

Note the --index-url which was passed as a parameter to the glue ETL job. The token is valid only for 15 minutes.

Step 6: Navigate to the Amazon CloudWatch Console and go to the /aws/glue-jobs log group to verify that the packages were installed from the local repo.

You will see that the 2 package names passed as parameters are installed with the corresponding versions.

Fig 12 : Amazon CloudWatch logs details for the Glue job

Fig 12 : Amazon CloudWatch logs details for the Glue job

Step 7: Navigate to the Amazon Athena console and select Query Editor.

Step 8: Run the following query to validate the output of the AWS Glue job:

SELECT year, app, SUM(total_trips) as sum_of_total_trips 
FROM 
"codeartifactblog_glue_db"."taxidataparquet" 
GROUP BY year, app;

Clean up

Make sure that you clean up all of the other AWS resources that you created in the AWS CDK Stack deployment. You can delete these resources via the AWS CDK Destroy command as follows or the CloudFormation console.

To destroy the resources using AWS CDK, follow these steps:

  1. Follow Steps 1-6 from the ‘Launching your CDK Stack’ section.
  2. Destroy the app by executing the following command:
    cdk destroy

Conclusion

In this post, we demonstrated how CodeArtifact can be used for managing Python packages and modules for AWS Glue jobs that run within VPC Subnets that have no internet access. We also demonstrated how the versions of existing packages can be updated (i.e., boto3) and a custom Python library (glueutils) that is developed locally is also managed through CodeArtifact.

This post enables you to use your favorite Python packages with AWS Glue ETL PySpark jobs by modifying the input to the AWS StepFunctions workflow (Step 2 in the Run a Sample workflow section).


About the Authors

Bret Pontillo is a Data & ML Engineer with AWS Professional Services. He works closely with enterprise customers building data lakes and analytical applications on the AWS platform. In his free time, Bret enjoys traveling, watching sports, and trying new restaurants.

Gaurav Gundal is a DevOps consultant with AWS Professional Services, helping customers build solutions on the customer platform. When not building, designing, or developing solutions, Gaurav spends time with his family, plays guitar, and enjoys traveling to different places.

Ashok Padmanabhan is a Sr. IOT Data Architect with AWS Professional Services, helping customers build data and analytics platform and solutions. When not helping customers build and design data lakes, Ashok enjoys spending time at the beach near his home in Florida.

Choose the right storage tier for your needs in Amazon OpenSearch Service

Post Syndicated from Changbin Gong original https://aws.amazon.com/blogs/big-data/choose-the-right-storage-tier-for-your-needs-in-amazon-opensearch-service/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) enables organizations to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open-source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions).

In this post, we present three storage tiers of Amazon OpenSearch Service—hot, UltraWarm, and cold storage—and discuss how to effectively choose the right storage tier for your needs. This post can help you understand how these storage tiers integrate together and what the trade-off is for each storage tier. To choose a storage tier of Amazon OpenSearch Service for your use case, you need to consider the performance, latency, and cost of these storage tiers in order to make the right decision.

Amazon OpenSearch Service storage tiers overview

There are three different storage tiers for Amazon OpenSearch Service: hot, UltraWarm, and cold. The following diagram illustrates these three storage tiers.

Hot storage

Hot storage for Amazon OpenSearch Service is used for indexing and updating, while providing fast access to data. Standard data nodes use hot storage, which takes the form of instance store or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node. Hot storage provides the fastest possible performance for indexing and searching new data.

You get the lowest latency for reading data in the hot tier, so you should use the hot tier to store frequently accessed data driving real-time analysis and dashboards. As your data ages, you access it less frequently and can tolerate higher latency, so keeping data in the hot tier is no longer cost-efficient.

If you want to have low latency and fast access to the data, hot storage is a good choice for you.

UltraWarm storage

UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) with related caching solutions to improve performance. UltraWarm offers significantly lower costs per GiB for read-only data that you query less frequently and don’t need the same performance as hot storage. Although you can’t modify the data while in UltraWarm, you can move the data to the hot storage tier for edits before moving it back.

When calculating UltraWarm storage requirements, you consider only the size of the primary shards. When you query for the list of shards in UltraWarm, you still see the primary and replicas listed. Both shards are stubs for the same, single copy of the data, which is in Amazon S3. The durability of data in Amazon S3 removes the need for replicas, and Amazon S3 abstracts away any operating system or service considerations. In the hot tier, accounting for one replica, 20 GB of index uses 40 GB of storage. In the UltraWarm tier, it’s billed at 20 GB.

The UltraWarm tier acts like a caching layer on top of the data in Amazon S3. UltraWarm moves data from Amazon S3 onto the UltraWarm nodes on demand, which speeds up access for subsequent queries on that data. For that reason, UltraWarm works best for use cases that access the same, small slice of data multiple times. You can add or remove UltraWarm nodes to increase or decrease the amount of cache against your data in Amazon S3 to optimize your cost per GB. To dial in your cost, be sure to test using a representative dataset. To monitor performance, use the WarmCPUUtilization and WarmJVMMemoryPressure metrics. See UltraWarm metrics for a complete list of metrics.

The combined CPU cores and RAM allocated to UltraWarm nodes affects performance for simultaneous searches across shards. We recommend deploying enough UltraWarm instances so that you store no more than 400 shards per ultrawarm1.medium.search node and 1,000 shards per ultrawarm1.large.search node (including both primaries and replicas). We recommend a maximum shard size of 50 GB for both hot and warm tiers. When you query UltraWarm, each shard uses a CPU and moves data from Amazon S3 to local storage. Running single or concurrent queries that access many indexes can overwhelm the CPU and local disk resources. This can cause longer latencies through inefficient use of local storage, and even cause cluster failures.

UltraWarm storage requires OpenSearch 1.0 or later, or Elasticsearch version 6.8 or later.

If you have large amounts of read-only data and want to balance the cost and performance, use UltraWarm for your infrequently accessed, older data.

Cold storage

Cold storage is optimized to store infrequently accessed or historical data at $0.024 per GB per month. When you use cold storage, you detach your indexes from the UltraWarm tier, making them inaccessible. You can reattach these indexes in a few seconds when you need to query that data. Cold storage is a great fit for scenarios in which a low ROI necessitates an archive or delete action on historical data, or if you need to conduct research or perform forensic analysis on older data with Amazon OpenSearch Service.

Cold storage doesn’t have specific instance types because it doesn’t have any compute capacity attached to it. You can store any amount of data in cold storage.

Cold storage requires OpenSearch 1.0 or later, or Elasticsearch version 7.9 or later and UltraWarm.

Manage storage tiers in OpenSearch Dashboards

OpenSearch Dashboards installed on your Amazon OpenSearch Service domain provides a useful UI for managing indexes in different storage tiers on your domain. From the OpenSearch Dashboards main menu, you can view all indexes in hot, UltraWarm, and cold storage. You can also see the indexes managed by Index State Management (ISM) policies. OpenSearch Dashboards enables you to migrate indexes between UltraWarm and cold storage, and monitor index migration status, without using the AWS Command Line Interface (AWS CLI) or configuration API. For more information on OpenSearch Dashboards, see Using OpenSearch Dashboards with Amazon OpenSearch Service.

Cost considerations

The hot tier requires you to pay for what is provisioned, which includes the hourly rate for the instance type. Storage is either Amazon EBS or a local SSD instance store. For Amazon EBS-only instance types, additional EBS volume pricing applies. You pay for the amount of storage you deploy.

UltraWarm nodes charge per hour just like other node types, but you only pay for the storage actually stored in Amazon S3. For example, although the instance type ultrawarm1.large.elasticsearch provides up to 20 TiB addressable storage on Amazon S3, if you only store 2 TiB of data, you’re only billed for 2 TiB. Like the standard data node types, you also pay an hourly rate for each UltraWarm node. For more information, see Pricing for Amazon OpenSearch Service.

Cold storage doesn’t incur compute costs, and like UltraWarm, you’re only billed for the amount of data stored in Amazon S3. There are no additional transfer charges when moving data between cold and UltraWarm storage.

Example use case

Let’s look at an example with 1 TB of source data per day, 7 days hot, 83 days warm, 365 days cold. For more information on sizing the cluster, see Sizing Amazon OpenSearch Service domains.

For hot storage, you can go through a baseline estimation with the calculation as: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention. With the best practice for two replicas, we should use two replicas here. The minimum storage requirement to retain 7 TB of data on the hot tier is (7TB*1.25)*(2+1)= 26.25 TB. For this amount of storage, we need 6x R6g.4xlarge.search instances given the Amazon EBS size limit.

We also need to verify from the CPU side, we need 25 primary shards (1TB*1.25/50GB) =25. We have two replicas. With that, we have total 75 active shards. With that, the total vCPU needed is 75*1.5=112.5 vCPU. This means 8x R6g.4xlarge.search instances. This also requires three dedicated c6g.xlarge.search leader nodes.

When calculating UltraWarm storage requirements, you consider only the size of the primary shards, because that’s the amount of data stored in Amazon S3. For this example, the total primary shard size for warm storage is 83*1.25=103.75 TB. Each ultrawarm1.large.search instance has 16 CPU cores and can address up to 20 TiB of storage on Amazon S3. A minimum of six ultrawarm1.large.search nodes is recommended. You’re charged for the actual storage, which is 103.75 TB.

For cold storage, you only pay for the cost of storing 365*1.25=456.25 TB on Amazon S3. The following table contains a breakdown of the monthly costs (USD) you’re likely to incur. This assumes a 1-year reserved instance for the cluster instances with no upfront payment in the US East (N. Virgina) Region.

Cost Type Pricing Usage Cost per month
Instance Usage R6g.4xlarge.search = $0.924 per hour 8 instances * 730 hours in a month = 5,840 hours 5,840 hours * $0.924 = $5,396.16
c6g.xlarge.search = $0.156 per hour 3 instances (leader nodes) * 730 hours in a month = 2,190 hours 2,190 hours * $0.156 = $341.64
ultrawarm1.large.search = $2.68 per hour 6 instances * 730 hours = 4,380 hours 4,380 hours * $2.68 = $11,738.40
Storage Cost Hot storage cost (Amazon EBS) EBS general purpose SSD (gp3) = $0.08 per GB per month 7 days host = 26.25TB 26,880 GB * $0.08 = $2,150.40
UltraWarm managed storage cost = $0.024 per GB per month 83 days warm = 103.75 TB per month 106,240 GB * $0.024 = $2,549.76
Cold storage cost on Amazon S3 = $0.022 per GB per month 365 days cold = 456.25 TB per month 467,200 GB * $0.022 = $10,278.40

The total monthly cost is $32,454.76. The hot tier costs $7,888.20, UltraWarm costs $14,288.16, and cold storage is $10,278.40. UltraWarm allows 83 days of additional retention for slightly more cost than the hot tier, which only provides 7 days. For nearly the same cost as the hot tier, the cold tier stores the primary shards for up to 1 year.

Conclusion

Amazon OpenSearch Service supports three integrated storage tiers: hot, UltraWarm, and cold storage. Based on your data retention, query latency, and budgeting requirements, you can choose the best strategy to balance cost and performance. You can also migrate data between different storage tiers. To start using these storage tiers, sign in to the AWS Management Console, use the AWS SDK, or AWS CLI, and enable the corresponding storage tier.


About the Author

Changbin Gong is a Senior Solutions Architect at Amazon Web Services (AWS). He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.

Rich Giuli is a Principal Solutions Architect at Amazon Web Service (AWS). He works within a specialized group helping ISVs accelerate adoption of cloud services. Outside of work Rich enjoys running and playing guitar.

Accelerate self-service analytics with Amazon Redshift Query Editor V2

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/accelerate-self-service-analytics-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics platform. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Query Editor V2 lets users explore, analyze, and collaborate on data. You can use Query Editor V2 to create databases, schemas, tables, and load data from Amazon Simple Storage Service (S3) either using COPY command or using a wizard . You can browse multiple databases and run queries on your Amazon Redshift data warehouse, data lake, or federated query to operational databases such as Amazon Aurora.

From the smallest start-ups to worldwide conglomerates, customers across the spectrum tell us they want to promote self-service analytics by empowering their end-users, such as data analysts and business analysts, to load data into their analytics platform. Analysts at these organizations create tables and load data in their own workspace, and they join that with the curated data available from the data warehouse to gain insight. This post will discuss how Query Editor V2 accelerates self-service analytics by enabling users to create tables and load data with simple wizards.

The Goal to Accelerate and Empower Data Analysts

A common practice that we see across enterprises today is that more and more enterprises are letting data analysts or business analysts load data into their user or group workspaces that co-exist on data warehouse platforms. Enterprise calls these personal workspaces, departmental schemas, project-based schemas or labs, and so on. The idea of this approach is to empower data analysts to load data sets by themselves and join curated data sets on a data warehouse platform to accelerate the data analysis process.

Amazon Redshift Query Editor V2 makes it easy for administrators to create the workspaces, and it enables data analysts to create and load data into the tables. Query Editor V2 lets you easily create external schemas in Redshift Cluster to extend the data warehouse to a data lake, thereby accelerating analytics.

An example Use case

Let’s assume that an organization has a marketing department with some power users and regular users. In this example, let’s also consider that the organization already has an Enterprise Data Warehouse (EDW) powered by Amazon Redshift. The marketing department would like to have a workspace created for their team members.

A visual depiction of a Data Warehouse Environment may look like the following figure. Enterprises let user/group schemas be created along with an EDW, which contains curated data sets. Analysts can create and load exploratory data sets into user schemas, and then join curated data sets available in the EDW.

ScopeofSolution

Amazon Redshift provides several options to isolate your users’ data from the enterprise data warehouse data,. Amazon Redshift data sharing lets you share data from your EDW cluster with a separate consumer cluster. Your users can consume the EDW data and create their own workspace in the consumer cluster. Alternatively, you can create a separate database for your users’ group workspace in the same cluster, and then isolate each user group to have their own schema. Amazon Redshift supports queries of data joining across databases, and then users can join their tables with the curated data in the EDW. We recommend you use the data sharing option that lets you isolate both compute and data. Query Editor v2 supports both scenarios.

Once you have enabled your data analysts to have their own workspace and provided the relevant privileges, then they can easily create Schema, table, and load data.

Prerequisites

  1.  You have an Amazon Redshift cluster, and you have configured the Query Editor V2. You can view the Simplify Data Analysis with Amazon Redshift Query Editor V2 post for instructions on setting up Query Editor V2.
  2. For loading your data from Amazon S3 into Amazon Redshift, you will start by creating an IAM role to provide permissions to access Amazon S3 and grant that role to the Redshift cluster. By default, Redshift users assume that the IAM role is attached to the Redshift cluster. You can find the instructions in the Redshift getting started guide.
  3. For users who want to load data from Amazon S3, Query Editor V2 provides an option to browse S3 buckets. To use this feature, users should have List permission on the S3 bucket.

Create Schemas

The Query Editor V2 supports the schema creation actions. Likewise, admins can create both native and external schemas by creating Schema wizard.

CreateSchemas

As a user, you can easily create a “schema” by accessing Create Schema wizard available from the Create button, and then selecting “Schema” from the drop-down list, as shown in the following screenshot.

If you select the Schema from the drop-down list, then the Create Schema wizard similar to the following screenshot is displayed. You can choose a local schema and provide a schema name.

Optionally, you can authorize a user to authorize users to create objects in the Schema. When the Authorize user check box is selected, then Create and Usage access are granted to the user. Now, Janedoe can create objects in this Schema.

Let’s assume that the analyst user Janedoe logs in to Query Editor V2 and logs in to the database and wants to create table and load data into their personal workspace.

Creating Tables

The Query Editor V2 provides a Create table wizard for users to create a table quickly. It allows power users to auto-create the table as based on a data file. Users can upload the file from their local machine and let Query Editor V2 figure out the data types and column widths. Optionally, you can change the column definition, such as encoding and table properties.

Below is a sample CSV file with a row header and sample rows from the MarketingCampaign.csv file. We will demonstrate how to create a table based on this file in the following steps.

SampleData

The following screenshot shows the uploading of the MarketingCampaing.csv file into Query Editor V2.

Create Table Wizard has two sections:

  1. Columns

The Columns tab lets users select a file from their local desktop and upload it to Query Editor V2. Users can choose Schema from the drop-down option and provide a table name.

Query Editor V2 automatically infers columns and some data types for each column. It has several options to choose from to set as column properties. For example, you have the option to change column names, data types, and encoding. If you do not choose any encoding option, then the encoding choice will be selected automatically. You can also add new fields, for example, an auto-increment ID column, and define properties for that particular identity column.

  1. Table Details

You can use the Create Table wizard to create a temporary table or regular table with the option of including it in automatic backups. The temporary table is available until the end of the session and is used in queries. A temporary table is automatically dropped once a user session is terminated.

The “Table Details” is optional, as Amazon Redshift’s Automatic Table Optimization feature takes care of Distribution Key and Sort Key on behalf of users.

  1. Viewing Create Table Statement

Once the column and table level detail is set, Query Editor V2 gives an option to view the Create table statement in Query Editor tab. This lets users save the definition for later use or share it with other users. Once the user reviews the create table definition, then the user can hit the “Run” button to run the query. Users can also directly create a table from the Create table wizard.

The following screenshot shows the Create table definition for the marketing campaign data set.

CreateTable3

Query Editor V2 lets users view table definitions in a table format. The following screenshot displays the table that we created earlier. Note that Redshift automatically inferred encoding type for each column. As the best practice, it skipped for “Dt_Customer“, as it was set as the sort key. When creating the table, we did not set the encodings for columns, as Redshift will automatically set the best compression methods for each column.

Query Editor V2 distinguishes columns by data types in a table by using distinct icons for them.

You can also view the table definition by right-clicking on the table and selecting the show definition option. You can also generate a template select command, and drop or truncate the table by right-clicking on a table.

Loading Data

Now that we have created a schema and a table, let’s learn how to upload the data to the table that we created earlier.

Query Editor V2 provides you with the ability to load data for S3 buckets to Redshift tables. The COPY command is recommended to load data in Amazon Redshift. The COPY command leverages the massively parallel processing capabilities of Redshift.

The Load Data wizard in the Query Editor V2 loads data into Redshift by generating the COPY command. As a data analyst, you don’t have to remember the intricacies of the COPY command.

You can quickly load data from CSV, JSON, ORC, or Parquet files to an existing table using the Load Data Wizard. It supports all of the options in the COPY command. The Load Data Wizard lets Data analysts build a COPY command with an easy-to-use GUI.

The following screenshot shows an S3 bucket that has our MarketingCampaign.csv file. This is a much larger file that we used to create the table using Create table wizard. We will use this file to walk you through the Load Data wizard.

LoadData1

The Load Data wizard lets you browse your available S3 bucket and select a file or folder from the S3 bucket. You can also use a manifest file. A manifest file lets you make sure that all of the files are loaded using the COPY command. You can find more information about manifest files here.

The Load Data Wizard lets you enter several properties, such as the Redshift Cluster IAM role and whether data is encrypted. You can also set file options. For example, in the case of CSV, you can set delimiter and quote parameters. If the file is compressed, then you can provide compression settings.

With the Data Conversion Parameters, you can select options like Escape Characters, time format, and if you want to ignore the header in your data file. The Load Operations option lets you set compression encodings and error handling options.

Query Editor V2 lets you browse S3 objects, thereby making it easier to navigate buckets, folders, and files. Below screens displays the flow

Query Editor V2 supports loading data of many open formats, such as JSON, Delimiter, FixedWidth, AVRO, Parquet, ORC, and Shapefile.

In our example, we are loading CSV files. As you can see, we have selected our MarketingCampaing.csv file and set the Region, and then selected the Resfhift cluster IAM Role.

For the CSV file, under additional File Options, Delimiter Character and Quote Character are set with “;” and an empty quote in the below screen.

Once the required parameters are set, continue to next step to load data. Load Data operation builds a copy command and automatically loads it into Query Editor Tab, and then invokes the query.

LoadData5

Data is loaded into the target table successfully, and now you can run a query to view that data. The following screen shows the result of the select query executed on our target table:

LoadData6

Viewing load errors

If your COPY command fails, then these are logged into STL_LOAD_ERRORS system table. Query Editor v2 simplifies the viewing of the common errors by showing the errors in-place as shown in the following screenshot:

LoadData7

Saving and reusing the queries

You can save the load queries for future usage by clicking on the saved query and providing a name in the saved query.

SavingQ1You would probably like to reuse the load query in the future to load data in from another S3 location. In that case, you can use the parameterized query by replacing the S3 URL of the as shown in the following screenshot:

SavingQ2

You can save the query, and then share the query with another user.

When you or other users run the query, a prompt for the parameter will appear as in the following screenshot:

SavingQ3

We discussed how data analysts could load data into their own or the group’s workspace.

We will now discuss using Query Editor V2 to create an external schema to extend your data warehouse to the data lake.

Extending the Data Warehouse to the Data Lake

Extending Data warehouses to Data lakes is part of modern data architecture practices. Amazon Redshift enables this with seamless integration through Data lake running on AWS. Redshift uses Spectrum to allow this extension. You can access data lakes from the Redshift Data warehouse by creating Redshift external schemas.

Query Editor V2 lets you create an external schema referencing an external database in AWS Glue Data Catalogue.

To extend your Data Warehouse to Data Lake, you should have an S3 data lake and AWS Glue Data Catalog database defined for the data lake. Grant permission on AWS Glue to Redshift Cluster Role. You can find more information about external Schema here.

You can navigate to the Create External Schema by using Create Schema wizard, and then selecting the External Schema as shown in the following screenshot:

The Query Editor V2 makes the schema creation experience very easy by hiding the intricacies of the create external schema syntax. You can use the simple interface and provide the required parameters, such as Glue data regions, external database name, and the IAM role. You can browse the Glue Catalog and view the database name.

After you use the create schema option, you can see the schemas in the tree-view. The Query Editor V2 uses distinct icons to distinguish between native Schema and external Schema.

Viewing External Table Definitions

The Query Editor V2 lets data analysts quickly view objects available in external databases and understand their metadata.

You can view tables and columns for a given table by clicking on external Schema and then on a table. When a particular table is selected, its metadata information is displayed in the bottom portion below the tree-view panel. This is a powerful feature, as an analyst can easily understand the data residing externally in the data lake.

You can now run queries against external tables in the external Schema.

In our fictitious enterprise, Marketing Department team members can load data in their own workspace and join the data from their own user/group workspace with the curated data in the enterprise data warehouse or data lake.

Conclusion

This post demonstrated how the Query Editor V2 enabled data analysts to create tables and load data from Amazon S3 easily with a simple wizard.

We also discussed how Query Editor V2 lets you extend the data warehouse to a data lake. The data analysts can easily browse tables in your local data warehouse, data shared from another cluster, or tables in the data lake. You can run queries that can join tables in your data warehouse and data lake. The Query Editor V2 also provides several features for the collaboration of query authoring. You can view the earlier blog to learn more about how the Query Editor V2 simplifies data analysis.

These features let organizations accelerate self-service analytics and end-users deliver the insights faster.

Happy querying!


About the Authors

Bhanu Pittampally is Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse – architecture, development and administration. He is in data and analytical field for over 13 years. His Linkedin profile is here.

Debu-PandaDebu Panda  is a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

cansuaCansu Aksu is a Front End Engineer at AWS, has a several years of experience in developing user interfaces. She is detail oriented, eager to learn and passionate about delivering products and features that solve customer needs and problems

chengyangwangChengyang Wang is a Frontend Engineer in Redshift Console Team. He worked on a number of new features delivered by redshift in the past 2 years. He thrives to deliver high quality products and aim to improve customer experience from UI

Architecting Persona-centric Data Platform with On-premises Data Sources

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/architecting-persona-centric-data-platform-with-on-premises-data-sources/

Many organizations are moving their data from silos and aggregating it in one location. Collecting this data in a data lake enables you to perform analytics and machine learning on that data. You can store your data in purpose-built data stores, like a data warehouse, to get quick results for complex queries on structured data.

In this post, we show how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi. We will also discuss Lake House architecture on AWS, which is the next evolution from data warehouse and data lake-based solutions.

Data movement services

AWS provides a wide variety of services to bring data into a data lake:

You may want to bring on-premises data into the AWS Cloud to take advantage of AWS purpose-built analytics services, derive insights, and make timely business decisions. Apache NiFi is an open source tool that enables you to move and process data using a graphical user interface.

For this use case and solution architecture, we use Apache NiFi to ingest data into Amazon S3 and AWS purpose-built analytics services, based on user personas.

Building persona-centric data platform on AWS

When you are building a persona-centric data platform for analytics and machine learning, you must first identify your user personas. Who will be using your platform? Then choose the appropriate purpose-built analytics services. Envision a data platform analytics architecture as a stack of seven layers:

  1. User personas: Identify your user personas for data engineering, analytics, and machine learning
  2. Data ingestion layer: Bring the data into your data platform and data lineage lifecycle view, while ingesting data into your storage layer
  3. Storage layer: Store your structured and unstructured data
  4. Cataloging layer: Store your business and technical metadata about datasets from the storage layer
  5. Processing layer: Create data processing pipelines
  6. Consumption layer: Enable your user personas for purpose-built analytics
  7. Security and Governance: Protect your data across the layers

Reference architecture

The following diagram illustrates how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi.

Figure 1. Example architecture for persona-centric data platform with on-premises data sources

Figure 1. Example architecture for persona-centric data platform with on-premises data sources

Architecture flow:

    1. Identify user personas: You must first identify user personas to derive insights from your data platform. Let’s start with identifying your users:
      • Enterprise data service users who would like to consume data from your data lake into their respective applications.
      • Business users who would like to like create business intelligence dashboards by using your data lake datasets.
      • IT users who would like to query data from your data lake by using traditional SQL queries.
      • Data scientists who would like to run machine learning algorithms to derive recommendations.
      • Enterprise data warehouse users who would like to run complex SQL queries on your data warehouse datasets.
    2. Data ingestion layer: Apache NiFi scans the on-premises data stores and ingest the data into your data lake (Amazon S3). Apache NiFi can also transform the data in transit. It supports both Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) data transformations. Apache NiFi also supports data lineage lifecycle while ingesting data into Amazon S3.
    3. Storage layer: For your data lake storage, we recommend using Amazon S3 to build a data lake. It has unmatched 11 nines of durability and 99.99% availability. You can also create raw, transformed, and enriched storage layers depending upon your use case.
    4. Cataloging layer: AWS Lake Formation provides the central catalog to store and manage metadata for all datasets hosted in the data lake by AWS Glue Data Catalog. AWS services such as AWS Glue, Amazon EMR, and Amazon Athena natively integrate with Lake Formation. They automate discovering and registering dataset metadata into the Lake Formation catalog.
    5. Processing layer: Amazon EMR processes your raw data and places them into a new S3 bucket. Use AWS Glue DataBrew and AWS Glue to process the data as needed.
    6. Consumption layer or persona-centric analytics: Once data is transformed:
      • AWS Lambda and Amazon API Gateway will allow you to develop data services for enterprise data service users
      • You can develop user-friendly dashboards for your business users using Amazon QuickSight
      • Use Amazon Athena to query transformed data for your IT users
      • Your data scientists can utilize AWS Glue DataBrew to clean and normalize the data and Amazon SageMaker for machine learning models
      • Your enterprise data warehouse users can use Amazon Redshift to derive business intelligence
    7. Security and governance layer: AWS IAM provides users, groups, and role-level identity, in addition to the ability to configure coarse-grained access control for resources managed by AWS services in all layers. AWS Lake Formation provides fine-grained access controls and you can grant/revoke permissions at the database- or table- or column-level access.

Lake House architecture on AWS

The vast majority of data lakes are built on Amazon S3. At the same time, customers are leveraging purpose-built analytics stores that are optimized for specific use cases. Customers want the freedom to move data between their centralized data lakes and the surrounding purpose-built analytics stores. And they want to get insights with speed and agility in a seamless, secure, and compliant manner. We call this modern approach to analytics the Lake House architecture.

Figure 2. Lake House architecture on AWS

Figure 2. Lake House architecture on AWS

Refer to the whitepaper Derive Insights from AWS Lake house for various design patterns to derive persona-centric analytics by using the AWS Lake House approach. Check out the blog post Build a Lake House Architecture on AWS  for a Lake House reference architecture on AWS.

Conclusion

In this post, we show you how to build a persona-centric data platform on AWS with a seven-layered approach. This uses Apache NiFi as a data ingestion tool and AWS purpose-built analytics services for persona-centric analytics and machine learning. We have also shown how to build persona-centric analytics by using the AWS Lake House approach.

With the information in this post, you can now build your own data platform on AWS to gain faster and deeper insights from your data. AWS provides you the broadest and deepest portfolio of purpose-built analytics and machine learning services to support your business needs.

Read more and get started on building a data platform on AWS: