Tag Archives: database

Query a Teradata database using Amazon Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-a-teradata-database-using-amazon-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Teradata as your transactional data store, you may need to join the data in your data lake with Teradata in the cloud, Teradata running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises Teradata database, for example to build a dashboard or create consolidated reporting.

In these use cases, the Amazon Athena Federated Query feature allows you to seamlessly access the data from Teradata database without having to move the data to your S3 data lake. This removes the overhead in managing such jobs.

In this post, we will walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Teradata database running on premises.

For this post, we will be using the Oracle Athena Federated Query connector developed by Trianz. The runtime includes a Teradata instance on premises. Your Teradata instance can be on the cloud, on Amazon EC2, or on premises. You can deploy the Trianz Oracle Athena Federated Query connector from the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena Federated Query works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into Lambda RAM runtime memory, it spills the data to Amazon S3 and is later accessed by Athena.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Teradata instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Teradata instance.
  4. Run federated queries with Athena.

Prerequisite

Before you start this walkthrough, make sure your Teradata database is up and running.

Create a secret for the Teradata instance

Our first step is to create a secret for the Teradata instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Set the credentials as key-value pairs (username, password) for your Teradata instance.

  1. For Secret name, enter a name for your secret. Use the prefix TeradataAFQ so it’s easy to find.
  2. Leave the remaining fields at their defaults and choose Next.
  3. Complete your secret creation.

Set up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we create athena-accelerator/teradata.

Configure Athena federation with the Teradata instance

To configure Athena federation with Teradata instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. Select Show apps that create custom IAM roles or resource policies.
  3. In the search field, enter TrianzTeradataAthenaJDBC.
  4. Choose the application.

  1. For SecretNamePrefix, enter TeradataAFQ.
  2. For SpillBucket, enter Athena-accelerator/teradata.
  3. For JDBCConnectorConfig, use the format teradata://jdbc:teradata://hostname/user=testUser&password=testPassword.
  4. For DisableSpillEncryption, enter false.
  5. For LambdaFunctionName, enter teradataconnector.
  6. For SecurityGroupID, enter the security group ID where the Teradata instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, use the subnets where the Teradata instance is running with comma separation.

Make sure the subnet is in a VPC and has NAT gateway and internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, Amazon CloudTrail, Secrets Manager, Lambda, and Athena. For more information about Athena IAM access, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your queries using lambda:teradataconnector to run against tables in the Teradata database. teradataconnector is the name of lambda function which we have created in step 7 of previous section of this blog.

lambda:teradataconnector references a data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot shows the query that joins the dataset between Teradata and the S3 data lake.

Key performance best practices

If you’re considering Athena Federated Query with Teradata, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Teradata database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Teradata database to Athena (which could lead to query timeouts or slow performance), you may consider moving data from Teradata to your S3 data lake.
  • The star schema is a commonly used data model in Teradata. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Teradata. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Teradata database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Teradata database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated Query with Teradata. Now you don’t need to wait for all the data in your Teradata data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practices outlined in the post to help minimize the data transferred from Teradata for better performance. When queries are well written for Athena Federated Query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Field Notes: How Sportradar Accelerated Data Recovery Using AWS Services

Post Syndicated from Mithil Prasad original https://aws.amazon.com/blogs/architecture/field-notes-how-sportradar-accelerated-data-recovery-using-aws-services/

This post was co-written by Mithil Prasad, AWS Senior Customer Solutions Manager, Patrick Gryczkat, AWS Solutions Architect, Ben Burdsall, CTO at Sportradar and Justin Shreve, Director of Engineering at Sportradar. 

Ransomware is a type of malware which encrypts data, effectively locking those affected by it out of their own data and requesting a payment to decrypt the data.  The frequency of ransomware attacks has increased over the past year, with local governments, hospitals, and private companies experiencing cases of ransomware.

For Sportradar, providing their customers with access to high quality sports data and insights is central to their business. Ensuring that their systems are designed securely and in a way which minimizes the possibility of a ransomware attack is top priority.  While ransomware attacks can occur both on premises and in the cloud, AWS services offer increased visibility and native encryption and back up capabilities. This helps prevent and minimize the likelihood and impact of a ransomware attack.

Recovery, backup, and the ability to go back to a known good state is best practice. To further expand their defense and diminish the value of ransom, the Sportradar architecture team set out to leverage their AWS Step Functions expertise to minimize recovery time. The team’s strategy centered on achieving a short deployment process. This process commoditized their production environment, allowing them to spin up interchangeable environments in new isolated AWS accounts, pulling in data from external and isolated sources, and diminishing the value of a production environment as a ransom target. This also minimized the impact of a potential data destruction event.

By partnering with AWS, Sportradar was able to build a secure and resilient infrastructure to provide timely recovery of their service in the event of data destruction by an unauthorized third party. Sportradar automated the deployment of their application to a new AWS account and established a new isolation boundary from an account with compromised resources. In this blog post, we show how the Sportradar architecture team used a combination of AWS CodePipeline and AWS Step Functions to automate and reduce their deployment time to less than two hours.

Solution Overview

Sportradar’s solution uses AWS Step Functions to orchestrate the deployment of resources, the recovery of data, and the deployment of application code, and to navigate all necessary dependencies for order of deployment. While deployment can be orchestrated through CodePipeline, Sportradar used their familiarity with Step Functions to create a quick and repeatable deployment process for their environment.

Sportradar’s solution to a ransomware Disaster Recovery scenario has also provided them with a reliable and accelerated process for deploying development and testing environments. Developers are now able to scale testing and development environments up and down as needed.  This has allowed their Development and QA teams to follow the pace of feature development, versus weekly or bi-weekly feature release and testing schedules tied to a single testing environment.

Reference Architecture Showing How Sportradar Accelerated Data Recovery

Figure 1 – Reference Architecture Diagram showing Automated Deployment Flow

Prerequisites

The prerequisites for implementing this deployment strategy are:

  • An implemented database backup policy
  • Ideally data should be backed up to a data bunker AWS account outside the scope of the environment you are looking to protect. This is so that in the event of a ransomware attack, your backed up data is isolated from your affected environment and account
  • Application code within a GitHub repository
  • Separation of duties
  • Access and responsibility for the backups and GitHub repository should be separated to different stakeholders in order to reduce the likelihood of both being impacted by a security breach

Step 1: New Account Setup 

Once data destruction is identified, the first step in Sportradar’s process is to use a pre-created runbook to create a new AWS account.  A new account is created in case the malicious actors who have encrypted the application’s data have access to not just the application, but also to the AWS account the application resides in.

The runbook sets up a VPC for a selected Region, as well as spinning up the following resources:

  • Security Groups with network connectivity to their git repository (in this case GitLab), IAM Roles for their resources
  • KMS Keys
  • Amazon S3 buckets with CloudFormation deployment templates
  • CodeBuild, CodeDeploy, and CodePipeline

Step 2: Deploying Secrets

It is a security best practice to ensure that no secrets are hard coded into your application code. So, after account setup is complete, the new AWS accounts Access Keys and the selected AWS Region are passed into CodePipeline variables. The application secrets are then deployed to the AWS Parameter Store.

Step 3: Deploying Orchestrator Step Function and In-Memory Databases

To optimize deployment time, Sportradar decided to leave the deployment of their in-memory databases running on Amazon EC2 outside of their orchestrator Step Function.  They deployed the database using a CloudFormation template from their CodePipeline. This was in parallel with the deployment of the Step Function, which orchestrates the rest of their deployment.

Step 4: Step Function Orchestrates the Deployment of Microservices and Alarms

The AWS Step Functions orchestrate the deployment of Sportradar’s microservices solutions, deploying 10+ Amazon RDS instances, and restoring each dataset from DB snapshots. Following that, 80+ producer Amazon SQS queues and  S3 buckets for data staging were deployed. After the successful deployment of the SQS queues, the Lambda functions for data ingestion and 15+ data processing Step Functions are deployed to begin pulling in data from various sources into the solution.

Then the API Gateways and Lambda functions which provide the API layer for each of the microservices are deployed in front of the restored RDS instances. Finally, 300+ Amazon CloudWatch Alarms are created to monitor the environment and trigger necessary alerts. In total Sportradar’s deployment process brings online: 15+ Step Functions for data processing, 30+ micro-services, 10+ Amazon RDS instances with over 150GB of data, 80+ SQS Queues, 180+ Lambda functions, CDN for UI, Amazon Elasticache, and 300+ CloudWatch alarms to monitor the applications. In all, that is over 600 resources deployed with data restored consistently in less than 2 hours total.

Reference Architecture Diagram for How Sportradar Accelerated Data Recovery Using AWS Services

Figure 2 – Reference Architecture Diagram of the Recovered Application

Conclusion

In this blog, we showed how Sportradar’s team used Step Functions to accelerate their deployments, and a walk-through of an example disaster recovery scenario. Step Functions can be used to orchestrate the deployment and configuration of a new environment, allowing complex environments to be deployed in stages, and for those stages to appropriately wait on their dependencies.

For examples of Step Functions being used in different orchestration scenarios, check out how Step Functions acts as an orchestrator for ETLs in Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda and Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy. For migrations of Amazon EC2 based workloads, read more about CloudEndure, Migrating workloads across AWS Regions with CloudEndure Migration.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Ben Burdsall

Ben Burdsall

Ben is currently the chief technology officer of Sportradar, – a data provider to the sporting industry, where he leads a product and engineering team of more than 800. Before that, Ben was part of the global leadership team of Worldpay.

Justin Shreve

Justin Shreve

Justin is Director of Engineering at Sportradar, leading an international team to build an innovative enterprise sports analytics platform.

Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-snowflake-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Snowflake as your data warehouse solution, you may need to join your data in your data lake with Snowflake. For example, you may want to build a dashboard by joining historical data in your Amazon S3 data lake and the latest data in your Snowflake data warehouse or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data from Snowflake without building ETL pipelines to copy or unload the data to the S3 data lake or Snowflake. This removes the overhead of creating additional extract, transform, and load (ETL) processes and shortens the development cycle.

In this post, we will walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Snowflake data warehouse.

For this post, we are using the Snowflake connector for Amazon Athena developed by Trianz.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data Federation refers to the capability to query data in another data store using a single interface (Amazon Athena). The following diagram depicts how a single Amazon Athena federated query uses Lambda to query the underlying data source and parallelizes execution across many workers.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into Lambda RAM runtime memory, it spills the data to Amazon S3 and is later accessed by Athena.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Snowflake instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Snowflake instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have a Snowflake data warehouse up and running.

Create a secret for the Snowflake instance

Our first step is to create a secret for the Snowflake instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Enter the credentials as key-value pairs (username, password) for your Snowflake instance.
  5. For Secret name, enter a name for your secret. Use the prefix snowflake so it’s easy to find.

  1. Leave the remaining fields at their defaults and choose Next.
  2. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use athena-accelerator/snowflake.

Configure Athena federation with the Snowflake instance

To configure Athena data source connector for Snowflake with your Snowflake instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzSnowflakeAthenaJDBC.

  1. For Application name, enter TrianzSnowflakeAthenaJDBC.
  2. For SecretNamePrefix, enter trianz-snowflake-athena.
  3. For SpillBucket, enter Athena-accelerator/snowflake.
  4. For JDBCConnectorConfig, use the format snowflake://jdbc:snowflake://{snowflake_instance_url}/?warehouse={warehousename}&db={databasename}&schema={schemaname}&${secretname}

For example, we enter snowflake://jdbc:snowflake://trianz.snowflakecomputing.com/?warehouse=ATHENA_WH&db=ATHENA_DEV&schema=ATHENA&${trianz-snowflake-athena}DisableSpillEncyption – False

  1. For LambdaFunctionName, enter trsnowflake.
  2. For SecurityGroupID, enter the security group ID where the Snowflake instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, use the subnets where the Snowflake instance is running with comma separation.

Make sure the subnet is in a VPC and has NAT gateway and internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Before running your federated query, be sure that you have selected Athena engine version 2. The current Athena engine version for any workgroup can be found in the Athena console page.

Run your federated queries using lambda:trsnowflake to run against tables in the Snowflake database. This is the name of lambda function which we have created in step 7 of previous section of this blog.

lambda:trsnowflake is a reference data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot is a unionall query example of data in Amazon S3 with a table in the AWS Glue Data Catalog and a table in Snowflake.

Key performance best practices

If you’re considering Athena Federated Query with Snowflake, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Snowflake database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from Snowflake to Athena (which could lead to query timeouts or slow performance), you may consider copying data from Snowflake to your S3 data lake.
  • The Snowflake schema, which is an extension of the star schema, is used as a data model in Snowflake. In the Snowflake schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Snowflake. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Snowflake database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Snowflake database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena federated with Snowflake using Lambda. With Athena Federated query user can leverage all of their data to produce analytics, derive business value without building ETL pipelines to bring data from different datastore such as Snowflake to Data Lake.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Snowflake for better performance. When queries are well written for federation, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Work with semistructured data using Amazon Redshift SUPER

Post Syndicated from Satish Sathiya original https://aws.amazon.com/blogs/big-data/work-with-semistructured-data-using-amazon-redshift-super/

With the new SUPER data type and the PartiQL language, Amazon Redshift expands data warehouse capabilities to natively ingest, store, transform, and analyze semi-structured data. Semi-structured data (such as weblogs and sensor data) fall under the category of data that doesn’t conform to a rigid schema expected in relational databases. It often contain complex values such as arrays and nested structures that are associated with serialization formats, such as JSON.

The schema of the JSON can evolve over time according to the business use case. Traditional SQL users who are experienced in handling structured data often find it challenging to deal with semi-structured data sources such as nested JSON documents due to lack of SQL support, the need to learn multiple complex functions, and the need to use third-party tools.

This post is part of a series that talks about ingesting and querying semi-structured data in Amazon Redshift using the SUPER data type.

With the introduction of the SUPER data type, Amazon Redshift provides a rapid and flexible way to ingest JSON data and query it without the need to impose a schema. This means that you don’t need to worry about the schema of the incoming document, and can load it directly into Amazon Redshift without any ETL to flatten the data. The SUPER data type is stored in an efficient binary encoded Amazon Redshift native format.

The SUPER data type can represent the following types of data:

  • An Amazon Redshift scalar value:
    • A null
    • A Boolean
    • Amazon Redshift numbers, such as SMALLINT, INTEGER, BIGINT, DECIMAL, or floating point (such as FLOAT4 or FLOAT8)
    • Amazon Redshift string values, such as VARCHAR and CHAR
  • Complex values:
    • An array of values, including scalar or complex
    • A structure, also known as tuple or object, that is a map of attribute names and values (scalar or complex)

For more information about the SUPER type, see Ingesting and querying semistructured data in Amazon Redshift.

After the semi-structured and nested data is loaded into the SUPER data type, you can run queries on it by using the PartiQL extension of SQL. PartiQL is backward-compatible to SQL. It enables seamless querying of semi-structured and structured data and is used by multiple AWS services, such as Amazon DynamoDB, Amazon Quantum Ledger Database (Amazon QLDB), and AWS Glue Elastic Views. With PartiQL, the query engine can work with schema-less SUPER data that originated in serialization formats, such as JSON. With the use of PartiQL, familiar SQL constructs seamlessly combine access to both the classic, tabular SQL data and the semi-structured data in SUPER. You can perform object and array navigation and also unnesting with simple and intuitive extensions to SQL semantics.

For more information about PartiQL, see Announcing PartiQL: One query language for all your data.

Use cases

The SUPER data type is useful when processing and querying semi-structured or nested data such as web logs, data from industrial Internet of Things (IoT) machines and equipment, sensors, genomics, and so on. To explain the different features and functionalities of SUPER, we use sample industrial IoT data from manufacturing.

The following diagram shows the sequence of events in which the data is generated, collected, and finally stored in Amazon Redshift as the SUPER data type.

Manufacturers are embracing the cloud solutions with connected machines and factories to transform and use data to make data-driven decisions so they can optimize operations, increase productivity, and improve availability while reducing costs.

As an example, the following diagram depicts a common asset hierarchy of a fictional smart manufacturing company.

This data is typically semi-structured and hierarchical in nature. To find insights from this data using traditional methods and tools, you need to extract, preprocess, load, and transform it into the proper structured format to run typical analytical queries using a data warehouse. The time to insight is delayed because of the initial steps required for data cleansing and transformation typically performed using third-party tools or other AWS services.

Amazon Redshift SUPER handles these use cases by helping manufacturers extract, load, and query (without any transformation) a variety of data sources collected from edge computing and industrial IoT devices. Let’s use this sample dataset to drive our examples to explain the capabilities of Amazon Redshift SUPER.

The following is an example subscription dataset for assets (such as crowder, gauge, and pneumatic cylinder) in the workshop, which collects metrics on different properties (such as air pressure, machine state, finished parts count, and failures). Data on these metrics is generated continuously in a time series fashion and pushed to the AWS Cloud using the AWS IoT SiteWise connector. This specific example collects the Machine State property for an asset. The following sample data called subscriptions.json has scalar columns like topicId and qos, and a nested array called messages, which has metrics on the assets.

The following is another dataset called asset_metadata.json, which describes different assets and their properties, more like a dimension table. In this example, the asset_name is IAH10 Line 1, which is a processing plant line for a specific site.

Load data into SUPER columns

Now that we covered the basics behind Amazon Redshift SUPER and the industrial IoT use case, let’s look at different ways to load this dataset.

Copy the JSON document into multiple SUPER data columns

In this use case for handling semi-structured data, the user knows the incoming data’s top-level schema and structure, but some of the columns are semi-structured and nested in structures or arrays. You can choose to shred a JSON document into multiple columns that can be a combination of the SUPER data type and Amazon Redshift scalar types. The following code shows what the table DDL looks like if we translate the subscriptions.json semi-structured schema into a SUPER equivalent:

CREATE TABLE subscription_auto (
topicId INT
,topicFilter VARCHAR
,qos INT
,messages SUPER
);

To load data into this table, specify the auto option along with FORMAT JSON in the COPY command to split the JSON values across multiple columns. The COPY matches the JSON attributes with column names and allows nested values, such as JSON arrays and objects, to be ingested as SUPER values. Run the following command to ingest data into the subscription_auto table. Replace the AWS Identity and Access Management (IAM) role with your own credentials. The ignorecase option passed along with auto is required only if your JSON attribute names are in CamelCase. In our case, our columns topicId and topicFilter scalar columns are in CamelCase.

COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'auto ignorecase';

A select * from subscription_auto command looks like the following code. The messages SUPER column holds the entire array in this case.

-- A sample output
SELECT * FROM subscription_auto;
[ RECORD 1 ]
topicid     | 1001
topicfilter | $aws/sitewise/asset-models/+/assets/+/properties/+
qos         | 0
messages    | [{"format":"json","topic":"$aws\/sitewise\/asset-models\/8926cf44-14ea-4cd8-a7c6-e61af641dbeb\/assets\/0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c\/properties\/3ff67d41-bf69-4d57-b461-6f1513e127a4","timestamp":1616381297183,"payload":{"type":"PropertyValueUpdate","payload":{"assetId":"0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c","propertyId":"3ff67d41-bf69-4d57-b461-6f1513e127a4","values":[{"timestamp":{"timeInSeconds":1616381289,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381292,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381286,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381293,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381287,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":925000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381294,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381285,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381288,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}}]}}}]

Alternatively, you can specify jsonpaths to load the same data, as in the following code. The jsonpaths option is helpful if you want to load only selective columns from your JSON document. For more information, see COPY from JSON format.

COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 's3://redshift-downloads/semistructured/super-blog/jsonpaths/subscription_jsonpaths.json';

subscription_jsonpaths.json looks like the following:

{"jsonpaths": [
    "$.topicId",
    "$.topicFilter",
    "$.qos",
    "$.messages"
   ]
}

While we’re in the section about loading, let’s also create the asset_metadata table and load it with relevant data, which we need in our later examples. The asset_metadata table has more information about industry shop floor assets and their properties like asset_name, property_name, and model_id.

CREATE TABLE asset_metadata (
asset_id VARCHAR
,asset_name VARCHAR
,asset_model_id VARCHAR
,asset_property_id VARCHAR
,asset_property_name VARCHAR
,asset_property_data_type VARCHAR
,asset_property_unit VARCHAR
,asset_property_alias VARCHAR
);

COPY asset_metadata FROM 's3://redshift-downloads/semistructured/super-blog/asset_metadata/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'auto';

Copy the JSON document into a single SUPER column

You can also load a JSON document into a single SUPER data column. This is typical when the schema of the incoming JSON is unknown and evolving. SUPER’s schema-less ingestion comes to the forefront here, letting you load the data in a flexible fashion.

For this use case, assume that we don’t know the names of the columns in subscription.json and want to load it into Amazon Redshift. It’s as simple as the following code:

CREATE TABLE subscription_noshred (s super);

After the table is created, we can use the COPY command to ingest data from Amazon Simple Storage Service (Amazon S3) into the single SUPER column. The noshred is a required option to go along with FORMAT JSON, which tells the COPY parser not to shred the document but load it into a single column.

COPY subscription_noshred FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 'noshred';

After the COPY has successfully ingested the JSON, the subscription_noshred table has a SUPER column s that contains the data of the entire JSON object. The ingested data maintains all the properties of the JSON nested structure but in a SUPER data type.

The following code shows how select star (*) into subscription_noshred looks; the entire JSON structure is in SUPER column s:

--A sample output
SELECT * FROM subscription_noshred;
[ RECORD 1 ]
s | {"topicId":1001,"topicFilter":"$aws\/sitewise\/asset-models\/+\/assets\/+\/properties\/+","qos":0,"messages":[{"format":"json","topic":"$aws\/sitewise\/asset-models\/8926cf44-14ea-4cd8-a7c6-e61af641dbeb\/assets\/0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c\/properties\/3ff67d41-bf69-4d57-b461-6f1513e127a4","timestamp":1616381297183,"payload":{"type":"PropertyValueUpdate","payload":{"assetId":"0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c","propertyId":"3ff67d41-bf69-4d57-b461-6f1513e127a4","values":[{"timestamp":{"timeInSeconds":1616381289,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381292,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381286,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381293,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381287,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381290,"offsetInNanos":925000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381294,"offsetInNanos":425000000},"quality":"FAIR","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381285,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}},{"timestamp":{"timeInSeconds":1616381288,"offsetInNanos":425000000},"quality":"GOOD","value":{"doubleValue":0}}]}}}]}

Similar to the noshred option, we can also use jsonpaths to load complete documents. This can be useful in cases where we want to extract additional columns, such as distribution and sort keys, while still loading the complete document as a SUPER column. In the following example, we map our first column to the root JSON object, while also mapping a distribution key to the second column, and a sort key to the third column.

subscription_sorted_jsonpaths.json looks like the following:

{"jsonpaths": [
"$",
"$.topicId",
"$.topicFilter"
]
}

CREATE TABLE subscription_sorted (
s super
,topicId INT
,topicFilter VARCHAR
) DISTKEY(topicFilter) SORTKEY(topicId);

COPY subscription_sorted FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions/' 
IAM_ROLE '<< your IAM role >>' 
FORMAT JSON 's3://redshift-downloads/semistructured/superblog/jsonpaths/subscription_sorted_jsonpaths.json';

Copy data from columnar formats like Parquet and ORC

If your semi-structured or nested data is already available in either Apache Parquet or Apache ORC formats, you can use the COPY command with the SERIALIZETOJSON option to ingest data into Amazon Redshift. The Amazon Redshift table structure should match the number of columns and the column data types of the Parquet or ORC files. Amazon Redshift can replace any Parquet or ORC column, including structure and array types, with SUPER data columns. The following are the COPY examples to load from Parquet and ORC format:

--Parquet 
COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions_parquet/' 
IAM_ROLE '<< your IAM role >>' FORMAT PARQUET SERIALIZETOJSON;

--ORC 
COPY subscription_auto FROM 's3://redshift-downloads/semistructured/super-blog/subscriptions_orc/' 
IAM_ROLE '<< your IAM role >>' FORMAT ORC SERIALIZETOJSON;

Apart from COPY, you can load the same data in columnar format in Amazon S3 using Amazon Redshift Spectrum and the INSERT command. The following example assumes the Redshift Spectrum external schema super_workshop and the external table subscription_parquet is already created and available in the database.

We need to set an important session-level configuration parameter for this to work: SET json_serialization_enable TO true. For more information, see Serializing complex nested JSON. This session-level parameter allows you to query top-level nested collection columns as serialized JSON.

/* -- Set the GUC for JSON serialization -- */ 
SET json_serialization_enable TO true;

INSERT INTO subscription_parquet 
SELECT topicId
,topicFilter
,qos
,JSON_PARSE(messages) FROM super_workshop.subscription_parquet;

As of this writing, we can’t use the SUPER column as such as a distribution or sort key, but if we need to use one of the attributes in a SUPER column as a distribution a sort key and keep the entire SUPER column intact as a separate column, we can use the following code as a workaround (apart from the jsonpaths example described earlier). For example, let’s assume the column format within the array messages needs to be used a distribution key for the Amazon Redshift table:

SET json_serialization_enable TO true;

DROP TABLEIF EXISTS subscription_messages;

CREATE TABLE subscription_messages (
m_format VARCHAR
,m_messages super
) distkey (m_format);

--'format' scalar column is extracted from 'messages' super column 
INSERT INTO subscription_messages 
SELECT element.format::VARCHAR
,supercol FROM (
SELECT JSON_PARSE(messages) AS supercol FROM super_workshop.subscription_parquet
) AS tbl,tbl.supercol element;

Apart from using the Amazon Redshift COPY command, we can also ingest JSON-formatted data into Amazon Redshift using the traditional SQL INSERT command:

--Insert example 
INSERT INTO subscription_auto VALUES (
0000
,'topicFiltervalue'
,0000
,JSON_PARSE('[{"format":"json"},{"topic": "sample topic"},[1,2,3]]')
); 

The JSON_PARSE() function parses the incoming data in proper JSON format and helps convert it into the SUPER data type. Without the JSON_PARSE() function, Amazon Redshift treats and ingests the value as a single string into SUPER instead of a JSON-formatted value.

Query SUPER columns

Amazon Redshift uses the PartiQL language to offer SQL-compatible access to relational, semi-structured, and nested data. PartiQL’s extensions to SQL are straightforward to understand, treat nested data as first-class citizens, and seamlessly integrate with SQL. The PartiQL syntax uses dot notation and array subscript for path navigation when accessing nested data.

The Amazon Redshift implementation of PartiQL supports dynamic typing for querying semi-structured data. This enables automatic filtering, joining, and aggregation on the combination of structured, semi-structured, and nested datasets. PartiQL allows the FROM clause items to iterate over arrays and use unnest operations.

The following sections focus on different query access patterns that involve navigating the SUPER data type with path and array navigation, unnest, or joins.

Navigation

We may want to find an array element by its index, or we may want to find the positions of certain elements in their arrays, such as the first element or the last element. We can reference a specific element simply by using the index of the element within square brackets (the index is 0-based) and within that element we can reference an object or struct by using the dot notation.

Let’s use our subscription_auto table to demonstrate the examples. We want to access the first element of the array, and within that we want to know the value of the attribute format:

SELECT messages[0].format FROM subscription_auto ;
--A sample output
 format
--------
 "json"
 "json"
 "json"
(3 rows)

Amazon Redshift can also use a table alias as a prefix to the notation. The following example is the same query as the previous example:

SELECT sub.messages[0].format FROM subscription_auto AS sub;
--A sample output
 format
--------
 "json"
 "json"
 "json"
(3 rows)

You can use the dot and bracket notations in all types of queries, such as filtering, join, and aggregation. You can use them in a query in which there are normally column references. The following example uses a SELECT statement that filters results:

SELECT COUNT(*) FROM subscription_auto WHERE messages[0].format IS NOT NULL;
-- A sample output
 count
-------
    11
(1 row)

Unnesting and flattening

To unnest or flatten an array, Amazon Redshift uses the PartiQL syntax to iterate over SUPER arrays by using the FROM clause of a query. We continue with the previous example in the following code, which iterates over the array attribute messages values:

SELECT c.*, o FROM subscription_auto c, c.messages AS o WHERE o.topic='sample topic';
-- A sample output
                             messages                 |        o
------------------------------------------------------+--------------------------
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | {"topic":"sample topic"}
(1 row)

The preceding query has one extension over standard SQL: the c.messages AS o, where o is the alias for the SUPER array messages and serves to iterate over it. In standard SQL, the FROM clause x (AS) y means “for each tuple y in table x.” Similarly, the FROM clause x (AS) y, if x is a SUPER value, translates to “for each (SUPER) value y in (SUPER) array value x.” The projection list can also use the dot and bracket notation for regular navigation.

When unnesting an array, if we want to get the array subscripts (starting from 0) as well, we can specify AT some_index right after unnest. The following examples iterates over both the array values and array subscripts:

SELECT o, index FROM subscription_auto c, c.messages AS o AT index 
WHERE o.topic='sample topic';

-- A sample output 
       o                  | index
--------------------------+-------
{"topic":"sample topic"}  | 1
(1 row)

If we have an array of arrays, we can do multiple unnestings to iterate into the inner arrays. The following example shows how we do it. Note that unnesting non-array expressions (the objects inside c.messages) are ignored, only the arrays are unnested.

SELECT messages, array, inner_array_element FROM subscription_auto c, c.messages AS array, array AS inner_array_element WHERE array = json_parse('[1,2,3]');

-- A sample output 
    messages                                          | array   | inner_array_element
------------------------------------------------------+---------+---------------------
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 1
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 2
[{"format":"json"},{"topic":"sample topic"},[1,2,3]]  | [1,2,3] | 3
(3 rows)

Dynamic typing

With the schema-less nature of semi-structured data, the same attributes within the JSON might have values of different types. For example, asset_id from our example might have initially started with integers and then because of a business decision changed into alphanumeric (string) values before finally settling on array type. Amazon Redshift SUPER handles this situation by using dynamic typing. Schema-less SUPER data is processed without the need to statically declare the data types before using them in a query. Dynamic typing is most useful in joins and GROUP BY clauses. Although deep comparing of SUPER column is possible, we recommend restricting the joins and aggregations to use the leaf-level scalar attribute for optimal performance. The following example uses a SELECT statement that requires no explicit casting of the dot and bracket expressions to the usual Amazon Redshift types:

SELECT messages[0].format,
        messages[0].topic
FROM subscription_auto
WHERE messages[0].payload.payload."assetId" > 0;
 
--A sample output
 format | topic
--------+-------
"json"  | "sample topic" 
(1 rows)

When your JSON attribute names are in mixed case or CamelCase, the following session parameter is required to query the data. In the preceding query, assetId is in mixed case. For more information, see SUPER configurations.

SET enable_case_sensitive_identifier to TRUE;

The greater than (>) sign in this query evaluates to true when messages[0].payload.payload."assetId" is greater than 0. In all other cases, the equality sign evaluates to false, including the cases where the arguments of the equality are different types. The messages[0].payload.payload."assetId" attribute can potentially contain a string, integer, array, or structure, and Amazon Redshift only knows that it is a SUPER data type. This is where dynamic typing helps in processing the schemaless data by evaluating the types at runtime based on the query. For example, processing the first record of “assetId” may result in an integer, whereas the second record can be a string, and so on. When using an SQL operator or function with dot and bracket expressions that have dynamic types, Amazon Redshift produces results similar to using a standard SQL operator or function with the respective static types. In this example, when the dynamic type of the path expression is an integer, the comparison with the integer 0 is meaningful. Whenever the dynamic type of “assetId” is any other data type except being an integer, the equality returns false.

For queries with joins and aggregations, dynamic typing automatically matches values with different dynamic types without performing a long CASE WHEN analysis to find out what data types may appear. The following code is an example of an aggregation query in which we count the number of topics:

SELECT messages[0].format,
count(messages[0].topic)
FROM subscription_auto WHERE messages[0].payload.payload."assetId" > 'abc' GROUP BY 1;

-- a sample output
 format | count
--------+-------
"json"  |   2
(1 row)

For the next join query example, we unnest and flatten the messages array from subscription_auto and join with the asset_metadata table to get the asset_name and property_name based on the asset_id and property_id, which we use as join keys.

Joins on SUPER should preferably be on an extracted path and avoid deep compare of the entire nested field for performance. In the following examples, the join keys used are on extracted path keys and not on the whole array:

SELECT c.topicId
,c.qos
,a.asset_name
,o.payload.payload."assetId"
,a.asset_property_name
,o.payload.payload."propertyId"
FROM subscription_auto AS c
,c.messages AS o -- o is the alias for messages array
,asset_metadata AS a
WHERE o.payload.payload."assetId" = a.asset_id AND o.payload.payload."propertyId" = a.asset_property_id AND o.payload.type = 'PropertyValueUpdate';
 
--A sample output 
 topicid | qos |  asset_name  |                assetId                 | asset_property_name |               propertyId
---------+-----+--------------+----------------------------------------+---------------------+----------------------------------------
    1001 |   0 | IAH10 Line 1 | "0aaf2aa2-0299-442a-b2ea-ecf3d62f2a2c" | stop                | "3ff67d41-bf69-4d57-b461-6f1513e127a4"
(1 row)

The following code is another join query that is looking for a count on the quality of the metrics collected for a specific asset (in this case IAH10 Line) and it is property (Machine State) and categorizes it based on the quality:

SELECT v.quality::VARCHAR
,count(*)
FROM subscription_auto AS c
,c.messages AS o -- o is the alias for messages array
,o.payload.payload.VALUES AS v -- v is the alias for values array 
,asset_metadata AS a 
WHERE o."assetId" = a.asset_name 
AND o."propertyId" = a.asset_property_name 
AND a.asset_name = 'IAH10 Line 1' 
AND a.asset_property_name = 'Machine State' 
GROUP BY v.quality;

-- A sample output 
quality   | count
----------+-------
CRITICAL  | 152 
GOOD      | 2926 
FAIR      | 722

Summary

This post discussed the benefits of the new SUPER data type and use cases in which nested data types can help improve storage efficiency, performance, or simplify analysis. Amazon Redshift SUPER along with PartiQL enables you to write queries over relational and hierarchical data model with ease. The next post in this series will focus on how to speed up frequent queries on SUPER columns using materialized views. Try it out and share your experience!


About the Authors

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

 

 

 

Runyao Chen is a Software Development Engineer at Amazon Redshift. He is passionate about MPP databases and has worked on SQL language features such as querying semistructured data using SUPER. In his spare time, he enjoys reading and exploring new restaurants.

 

 

 

Cody Cunningham is a Software Development Engineer at Amazon Redshift. He works on Redshift’s data ingestion, implementing new features and performance improvements, and recently worked on the ingestion support for SUPER.

Streaming Amazon DynamoDB data into a centralized data lake

Post Syndicated from Praveen Krishnamoorthy Ravikumar original https://aws.amazon.com/blogs/big-data/streaming-amazon-dynamodb-data-into-a-centralized-data-lake/

For organizations moving towards a serverless microservice approach, Amazon DynamoDB has become a preferred backend database due to its fully managed, multi-Region, multi-active durability with built-in security controls, backup and restore, and in-memory caching for internet-scale application. , which you can then use to derive near-real-time business insights. The data lake provides capabilities to business teams to plug in BI tools for analysis, and to data science teams to train models. .

This post demonstrates two common use cases of streaming a DynamoDB table into an Amazon Simple Storage Service (Amazon S3) bucket using Amazon Kinesis Data Streams, AWS Lambda, and Amazon Kinesis Data Firehose via Amazon Virtual Private Cloud (Amazon VPC) endpoints in the same AWS Region. We explore two use cases based on account configurations:

  • DynamoDB and Amazon S3 in same AWS account
  • DynamoDB and Amazon S3 in different AWS accounts

We use the following AWS services:

  • Kinesis Data Streams for DynamoDBKinesis Data Streams for DynamoDB captures item-level modifications in any DynamoDB table and replicates them to a Kinesis data stream of your choice. Your applications can access the data stream and view the item-level changes in near-real time. Streaming your DynamoDB data to a data stream enables you to continuously capture and store terabytes of data per hour. Kinesis Data Streams enables you to take advantage of longer data retention time, enhanced fan-out capability to more than two simultaneous consumer applications, and additional audit and security transparency. Kinesis Data Streams also gives you access to other Kinesis services such as Kinesis Data Firehose and Amazon Kinesis Data Analytics. This enables you to build applications to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and perform sophisticated data analytics, such as applying machine learning (ML) algorithms.
  • Lambda – Lambda lets you run code without provisioning or managing servers. It provides the capability to run code for virtually any type of application or backend service without managing servers or infrastructure. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.
  • Kinesis Data Firehose – Kinesis Data Firehose helps to reliably load streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to Amazon S3 and other destinations. It’s a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt your data streams before loading, which minimizes the amount of storage used and increases security.

Security is the primary focus of our use cases, so the services used in both use server-side encryption at rest and VPC endpoints for securing the data in transit.

Use case 1: DynamoDB and Amazon S3 in same AWS account

In our first use case, our DynamoDB table and S3 bucket are in the same account. We have the following resources:

  • A Kinesis data stream is configured to use 10 shards, but you can change this as needed.
  • A DynamoDB table with Kinesis streaming enabled is a source to the Kinesis data stream, which is configured as a source to a Firehose delivery stream.
  • The Firehose delivery stream is configured to use a Lambda function for record transformation along with data delivery into an S3 bucket. The Firehose delivery stream is configured to batch records for 2 minutes or 1 MiB, whichever occurs first, before delivering the data to Amazon S3. The batch window is configurable for your use case. For more information, see Configure settings.
  • The Lambda function used for this solution transforms the DynamoDB item’s multi-level JSON structure to a single-level JSON structure. It’s configured to run in a private subnet of an Amazon VPC, with no internet access. You can extend the function to support more complex business transformations.

The following diagram illustrates the architecture of the solution.

The architecture uses the DynamoDB feature to capture item-level changes in DynamoDB tables using Kinesis Data Streams. This feature provides capabilities to securely stream incremental updates without any custom code or components.

Prerequisites

To implement this architecture, you need the following:

  • An AWS account
  • Admin access to deploy the needed resources

Deploy the solution

In this step, we create a new Amazon VPC along with the rest of the components.

We also create an S3 bucket with the following features:

You can extend the template to enable additional S3 bucket features as per your requirements.

For this post, we use an AWS CloudFormation template to deploy the resources. As part of best practices, consider organizing resources by lifecycle and ownership as needed.

We use an AWS Key Management Service (AWS KMS) key for server-side encryption to encrypt the data in Kinesis Data Streams, Kinesis Data Firehose, Amazon S3, and DynamoDB.

The Amazon CloudWatch log group data is always encrypted in CloudWatch Logs. If required, you can extend this stack to encrypt log groups using KMS CMKs.

  1. Click on Launch Stack button below to create a CloudFormation :
  2. On the CloudFormation console, accept default values for the parameters.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  4. Choose Create stack.

After stack creation is complete, note the value of the BucketName output variable from the stack’s Outputs tab. This is the S3 bucket name that is created as part of the stack. We use this value later to test the solution.

Test the solution

To test the solution, we insert a new item and then update the item in the DynamoDB table using AWS CloudShell and the AWS Command Line Interface (AWS CLI). We will also use the AWS Management Console to monitor and verify the solution.

  1. On the CloudShell console, verify that you’re in the same Region as the DynamoDB table (the default is us-east-1).
  2. Enter the following AWS CLI command to insert an item:
    aws dynamodb put-item \ 
    --table-name blog-srsa-ddb-table \ 
    --item '{ "id": {"S": "864732"}, "name": {"S": "Adam"} , "Designation": {"S": "Architect"} }' \ 
    --return-consumed-capacity TOTAL

  3. Enter the following command to update the item: We are updating the Designation from “Architect” to ” Senior Architect
    aws dynamodb put-item \ 
    --table-name blog-srsa-ddb-table \ 
    --item '{ "id": {"S": "864732"}, "name": {"S": "Adam"} , "Designation": {"S": "Senior Architect"} }' \ 
    --return-consumed-capacity TOTAL

All item-level modifications from the DynamoDB table are sent to a Kinesis data stream (blog-srsa-ddb-table-data-stream), which delivers the data to a Firehose delivery stream (blog-srsa-ddb-table-delivery-stream).

You can monitor the processing of updated records in the Firehose delivery stream on the Monitoring tab of the delivery stream.

You can verify the delivery of the updates to the data lake by checking the objects in the S3 bucket (BucketName value from the stack Outputs tab).

The Firehose delivery stream is configured to write records to Amazon S3 using a custom prefix which is based on the date the records are delivered to the delivery stream. This partitions the delivered records by date which helps improve query performance by limiting the amount of data that query engines need to scan in order to return the results for a specific query. For more information, see Custom Prefixes for Amazon S3 Objects.

The file is in JSON format. You can verify the data in the following ways:

Use case 2: DynamoDB and Amazon S3 in different AWS accounts

The solution for this use case uses two CloudFormation stacks: the producer stack (deployed in Account A) and the consumer stack (deployed in Account B).

The producer stack (Account A) deploys the following:

  • A Kinesis data stream is configured to use 10 shards, but you can change this as needed.
  • A DynamoDB table with Kinesis streaming is enabled as a source to the Kinesis data stream, and the data stream is configured as a source to a Firehose delivery stream.
  • The Firehose delivery stream is configured to use a Lambda function for record transformation along with data delivery into an S3 bucket in Account B. The delivery stream is configured to batch records for 2 minutes or 1 MiB, whichever occurs first, before delivering the data to Amazon S3. The batch window is configurable for your use case.
  • The Lambda function is configured to run in a private subnet of an Amazon VPC, with no internet access. For this solution, the function transforms the multi-level JSON structure to a single-level JSON structure. You can extend the function to support more complex business transformations.

The consumer stack (Account B) deploys an S3 bucket configured to receive the data from the Firehose delivery stream in Account A.

The following diagram illustrates the architecture of the solution.

The architecture uses the DynamoDB feature to capture item-level changes in DynamoDB tables using Kinesis Data Streams. This feature provides capabilities to securely stream incremental updates without any custom code or components. 

Prerequisites

For this use case, you need the following:

  • Two AWS accounts (for the producer and consumer)
    • If you already deployed the architecture for the first use case and want to use the same account, delete the stack from the previous use case before proceeding with this section
  • Admin access to deploy needed resources

Deploy the components in Account B (consumer)

This step creates an S3 bucket with the following features:

  • Encryption at rest using CMKs
  • Block Public Access
  • Bucket versioning

You can extend the template to enable additional S3 bucket features as needed.

We deploy the resources with a CloudFormation template. As part of best practices, consider organizing resources by lifecycle and ownership as needed.

We use the KMS key for server-side encryption to encrypt the data in Amazon S3.

The CloudWatch log group data is always encrypted in CloudWatch Logs. If required, you can extend the stack to encrypt log group data using KMS CMKs.

  1. Choose Launch Stack to create a CloudFormation stack in your account:
  2. For DDBProducerAccountID, enter Account A’s account ID.
  3. For KMSKeyAlias, the KMS key used for server-side encryption to encrypt the data in Amazon S3 is populated by default.
  4. Choose Create stack.

After stack creation is complete, note the value of the BucketName output variable. We use this value later to test the solution.

Deploy the components in Account A (producer)

In this step, we sign in to the AWS Management Console with Account A to deploy the producer stack. We use the KMS key for server-side encryption to encrypt the data in Kinesis Data Streams, Kinesis Data Firehose, Amazon S3, and DynamoDB. As with other stacks, the CloudWatch log group data is always encrypted in CloudWatch Logs, but you can extend the stack to encrypt log group data using KMS CMKs.

  1. Choose Launch Stack to create a CloudFormation stack in your account:
  2. For ConsumerAccountID, enter the ID of Account B.
  3. For CrossAccountDatalakeBucket, enter the bucket name for Account B, which you created in the previous step.
  4. For ArtifactBucket, the S3 bucket containing the artifacts required for deployment is populated by default.
  5. For KMSKeyAlias, the KMS key used for server-side encryption to encrypt the data in Amazon S3 is populated by default.
  6. For BlogTransformationLambdaFile, the Amazon S3 key for the Lambda function code to perform Amazon Firehose Data transformation is populated by default.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  8. Choose Create stack.

Test the solution

To test the solution, we sign in as Account A, insert a new item in the DynamoDB table, and then update that item. Make sure you’re in the same Region as your table.

  1. On the CloudShell console, enter the following AWS CLI command to insert an item:
    aws dynamodb put-item \ 
    --table-name blog-srca-ddb-table \ 
    --item '{ "id": {"S": "864732"}, "name": {"S": "Chris"} , "Designation": {"S": "Senior Consultant"} }' \ 
    --return-consumed-capacity TOTAL

  2. Update the existing item with the following code:
    aws dynamodb put-item \ 
    --table-name blog-srca-ddb-table \ 
    --item '{ "id": {"S": "864732"}, "name": {"S": "Chris"} , "Designation": {"S": "Principal Consultant"} }' \ 
    --return-consumed-capacity TOTAL

  3. Sign out of Account A and sign in as Account B to verify the delivery of records into the data lake.

All item-level modifications from an DynamoDB table are sent to a Kinesis data stream (blog-srca-ddb-table-data-stream), which delivers the data to a Firehose delivery stream (blog-srca-ddb-table-delivery-stream) in Account A.

You can monitor the processing of the updated records on the Monitoring tab of the Firehose delivery stream.

You can verify the delivery of updates to the data lake by checking the objects in the S3 bucket that you created in Account B.

The Firehose delivery stream is configured similarly to the previous use case.

You can verify the data (in JSON format) in the same ways:

  • Download the files
  • Run an AWS Glue crawler to create a table to query in Athena
  • Query the data using Amazon S3 Select

Clean up

To avoid incurring future charges, clean up all the AWS resources that you created using AWS CloudFormation. You can delete these resources on the console or via the AWS CLI. For this post, we walk through the steps using the console.

Clean up resources from use case 1

To clean up the DynamoDB and Amazon S3 resources in the same account, complete the following steps:

  1. On the Amazon S3 console, empty the S3 bucket and remove any previous versions of S3 objects.
  2. On the AWS CloudFormation console, delete the stack bdb1040-ddb-lake-single-account-stack.

You must delete the Amazon S3 resources before deleting the stack, or the deletion fails.

Clean up resources from use case 2

To clean up the DynamoDB and Amazon S3 resources in different accounts, complete the following steps:

  1. Sign in to Account A.
  2. On the AWS CloudFormation console, delete the stack bdb1040-ddb-lake-multi-account-stack.
  3. Sign in to Account B.
  4. On the Amazon S3 console, empty the S3 bucket and remove any pervious versions of S3 objects.
  5. On the AWS CloudFormation console, delete the stack bdb1040-ddb-lake-multi-account-stack.

Extend the solution

You can extend this solution to stream DynamoDB table data into cross-Region S3 buckets by setting up cross-Region replication (using the Amazon secured private channel) on the bucket where Kinesis Data Firehose delivers the data.

You can also perform a point-in-time initial load of the DynamoDB table into the data lake before setting up DynamoDB Kinesis streams. DynamoDB provides a no-coding required feature to achieve this. For more information, see Export Amazon DynamoDB Table Data to Your Data Lake in Amazon S3, No Code Writing Required.

To extend the usability scope of DynamoDB data in S3 buckets, you can crawl the location to create AWS Glue Data Catalog database tables. Registering the locations with AWS Lake Formation helps simplify permission management and allows you to implement fine-grained access control. You can also use Athena, Amazon Redshift, Amazon SageMaker, and Amazon QuickSight for data analysis, ML, and reporting services.

Conclusion

In this post, we demonstrated two solutions for streaming DynamoDB table data into Amazon S3 to build a data lake using a secured Amazon private channel.

The CloudFormation template gives you an easy way to set up the process, which you can be further modify to meet your specific use case needs.

Please let us know if you have comments about this post!


About the Authors

Praveen Krishnamoorthy Ravikumar is a Data Architect with AWS Professional Services Intelligence Practice. He helps customers implement big data and analytics platform and solutions.

 

 

 

Abhishek Gupta is a Data and ML Engineer with AWS Professional Services Practice.

 

 

 

 

Ashok Yoganand Sridharan is a Big Data Consultant with AWS Professional Services Intelligence Practice. He helps customers implement big data and analytics platform and solutions

 

 

 

Forwarding emails automatically based on content with Amazon Simple Email Service

Post Syndicated from Murat Balkan original https://aws.amazon.com/blogs/messaging-and-targeting/forwarding-emails-automatically-based-on-content-with-amazon-simple-email-service/

Introduction

Email is one of the most popular channels consumers use to interact with support organizations. In its most basic form, consumers will send their email to a catch-all email address where it is further dispatched to the correct support group. Often, this requires a person to inspect content manually. Some IT organizations even have a dedicated support group that handles triaging the incoming emails before assigning them to specialized support teams. Triaging each email can be challenging, and delays in email routing and support processes can reduce customer satisfaction. By utilizing Amazon Simple Email Service’s deep integration with Amazon S3, AWS Lambda, and other AWS services, the task of categorizing and routing emails is automated. This automation results in increased operational efficiencies and reduced costs.

This blog post shows you how a serverless application will receive emails with Amazon SES and deliver them to an Amazon S3 bucket. The application uses Amazon Comprehend to identify the dominant language from the message body.  It then looks it up in an Amazon DynamoDB table to find the support group’s email address specializing in the email subject. As the last step, it forwards the email via Amazon SES to its destination. Archiving incoming emails to Amazon S3 also enables further processing or auditing.

Architecture

By completing the steps in this post, you will create a system that uses the architecture illustrated in the following image:

Architecture showing how to forward emails by content using Amazon SES

The flow of events starts when a customer sends an email to the generic support email address like [email protected]. This email is listened to by Amazon SES via a recipient rule. As per the rule, incoming messages are written to a specified Amazon S3 bucket with a given prefix.

This bucket and prefix are configured with S3 Events to trigger a Lambda function on object creation events. The Lambda function reads the email object, parses the contents, and sends them to Amazon Comprehend for language detection.

Amazon DynamoDB looks up the detected language code from an Amazon DynamoDB table, which includes the mappings between language codes and support group email addresses for these languages. One support group could answer English emails, while another support group answers French emails. The Lambda function determines the destination address and re-sends the same email address by performing an email forward operation. Suppose the lookup does not return any destination address, or the language was not be detected. In that case, the email is forwarded to a catch-all email address specified during the application deployment.

In this example, Amazon SES hosts the destination email addresses used for forwarding, but this is not a requirement. External email servers will also receive the forwarded emails.

Prerequisites

To use Amazon SES for receiving email messages, you need to verify a domain that you own. Refer to the documentation to verify your domain with Amazon SES console. If you do not have a domain name, you will register one from Amazon Route 53.

Deploying the Sample Application

Clone this GitHub repository to your local machine and install and configure AWS SAM with a test AWS Identity and Access Management (IAM) user.

You will use AWS SAM to deploy the remaining parts of this serverless architecture.

The AWS SAM template creates the following resources:

  • An Amazon DynamoDB mapping table (language-lookup) contains information about language codes and associates them with destination email addresses.
  • An AWS Lambda function (BlogEmailForwarder) that reads the email content parses it, detects the language, looks up the forwarding destination email address, and sends it.
  • An Amazon S3 bucket, which will store the incoming emails.
  • IAM roles and policies.

To start the AWS SAM deployment, navigate to the root directory of the repository you downloaded and where the template.yaml AWS SAM template resides. AWS SAM also requires you to specify an Amazon Simple Storage Service (Amazon S3) bucket to hold the deployment artifacts. If you haven’t already created a bucket for this purpose, create one now. You will refer to the documentation to learn how to create an Amazon S3 bucket. The bucket should have read and write access by an AWS Identity and Access Management (IAM) user.

At the command line, enter the following command to package the application:

sam package --template template.yaml --output-template-file output_template.yaml --s3-bucket BUCKET_NAME_HERE

In the preceding command, replace BUCKET_NAME_HERE with the name of the Amazon S3 bucket that should hold the deployment artifacts.

AWS SAM packages the application and copies it into this Amazon S3 bucket.

When the AWS SAM package command finishes running, enter the following command to deploy the package:

sam deploy --template-file output_template.yaml --stack-name blogstack --capabilities CAPABILITY_IAM --parameter-overrides [email protected] YOUR_DOMAIN_NAME_HERE [email protected] YOUR_DOMAIN_NAME_HERE

In the preceding command, change the YOUR_DOMAIN_NAME_HERE with the domain name you validated with Amazon SES. This domain also applies to other commands and configurations that will be introduced later.

This example uses “blogstack” as the stack name, you will change this to any other name you want. When you run this command, AWS SAM shows the progress of the deployment.

Configure the Sample Application

Now that you have deployed the application, you will configure it.

Configuring Receipt Rules

To deliver incoming messages to Amazon S3 bucket, you need to create a Rule Set and a Receipt rule under it.

Note: This blog uses Amazon SES console to create the rule sets. To create the rule sets with AWS CloudFormation, refer to the documentation.

  1. Navigate to the Amazon SES console. From the left navigation choose Rule Sets.
  2. Choose Create a Receipt Rule button at the right pane.
  3. Add [email protected]YOUR_DOMAIN_NAME_HERE as the first recipient addresses by entering it into the text box and choosing Add Recipient.

 

 

Choose the Next Step button to move on to the next step.

  1. On the Actions page, select S3 from the Add action drop-down to reveal S3 action’s details. Select the S3 bucket that was created by the AWS SAM template. It is in the format of your_stack_name-inboxbucket-randomstring. You will find the exact name in the outputs section of the AWS SAM deployment under the key name InboxBucket or by visiting the AWS CloudFormation console. Set the Object key prefix to info/. This tells Amazon SES to add this prefix to all messages destined to this recipient address. This way, you will re-use the same bucket for different recipients.

Choose the Next Step button to move on to the next step.

In the Rule Details page, give this rule a name at the Rule name field. This example uses the name info-recipient-rule. Leave the rest of the fields with their default values.

Choose the Next Step button to move on to the next step.

  1. Review your settings on the Review page and finalize rule creation by choosing Create Rule

  1. In this example, you will be hosting the destination email addresses in Amazon SES rather than forwarding the messages to an external email server. This way, you will be able to see the forwarded messages in your Amazon S3 bucket under different prefixes. To host the destination email addresses, you need to create different rules under the default rule set. Create three additional rules for [email protected]YOUR_DOMAIN_NAME_HERE , [email protected] YOUR_DOMAIN_NAME_HERE and [email protected]YOUR_DOMAIN_NAME_HERE email addresses by repeating the steps 2 to 5. For Amazon S3 prefixes, use catchall/, english/, and french/ respectively.

 

Configuring Amazon DynamoDB Table

To configure the Amazon DynamoDB table that is used by the sample application

  1. Navigate to Amazon DynamoDB console and reach the tables view. Inspect the table created by the AWS SAM application.

language-lookup table is the table where languages and their support group mappings are kept. You need to create an item for each language, and an item that will hold the default destination email address that will be used in case no language match is found. Amazon Comprehend supports more than 60 different languages. You will visit the documentation for the supported languages and add their language codes to this lookup table to enhance this application.

  1. To start inserting items, choose the language-lookup table to open table overview page.
  2. Select the Items tab and choose the Create item From the dropdown, select Text. Add the following JSON content and choose Save to create your first mapping object. While adding the following object, replace Destination attribute’s value with an email address you own. The email messages will be forwarded to that address.

{

  “language”: “en”,

  “destination”: “[email protected]_DOMAIN_NAME_HERE”

}

Lastly, create an item for French language support.

{

  “language”: “fr”,

  “destination”: “[email protected]_DOMAIN_NAME_HERE”

}

Testing

Now that the application is deployed and configured, you will test it.

  1. Use your favorite email client to send the following email to the domain name [email protected] email address.

Subject: I need help

Body:

Hello, I’d like to return the shoes I bought from your online store. How can I do this?

After the email is sent, navigate to the Amazon S3 console to inspect the contents of the Amazon S3 bucket that is backing the Amazon SES Rule Sets. You will also see the AWS Lambda logs from the Amazon CloudWatch console to confirm that the Lambda function is triggered and run successfully. You should receive an email with the same content at the address you defined for the English language.

  1. Next, send another email with the same content, this time in French language.

Subject: j’ai besoin d’aide

Body:

Bonjour, je souhaite retourner les chaussures que j’ai achetées dans votre boutique en ligne. Comment puis-je faire ceci?

 

Suppose a message is not matched to a language in the lookup table. In that case, the Lambda function will forward it to the catchall email address that you provided during the AWS SAM deployment.

You will inspect the new email objects under english/, french/ and catchall/ prefixes to observe the forwarding behavior.

Continue experimenting with the sample application by sending different email contents to [email protected] YOUR_DOMAIN_NAME_HERE address or adding other language codes and email address combinations into the mapping table. You will find the available languages and their codes in the documentation. When adding a new language support, don’t forget to associate a new email address and Amazon S3 bucket prefix by defining a new rule.

Cleanup

To clean up the resources you used in your account,

  1. Navigate to the Amazon S3 console and delete the inbox bucket’s contents. You will find the name of this bucket in the outputs section of the AWS SAM deployment under the key name InboxBucket or by visiting the AWS CloudFormation console.
  2. Navigate to AWS CloudFormation console and delete the stack named “blogstack”.
  3. After the stack is deleted, remove the domain from Amazon SES. To do this, navigate to the Amazon SES Console and choose Domains from the left navigation. Select the domain you want to remove and choose Remove button to remove it from Amazon SES.
  4. From the Amazon SES Console, navigate to the Rule Sets from the left navigation. On the Active Rule Set section, choose View Active Rule Set button and delete all the rules you have created, by selecting the rule and choosing Action, Delete.
  5. On the Rule Sets page choose Disable Active Rule Set button to disable listening for incoming email messages.
  6. On the Rule Sets page, Inactive Rule Sets section, delete the only rule set, by selecting the rule set and choosing Action, Delete.
  7. Navigate to CloudWatch console and from the left navigation choose Logs, Log groups. Find the log group that belongs to the BlogEmailForwarderFunction resource and delete it by selecting it and choosing Actions, Delete log group(s).
  8. You will also delete the Amazon S3 bucket you used for packaging and deploying the AWS SAM application.

 

Conclusion

This solution shows how to use Amazon SES to classify email messages by the dominant content language and forward them to respective support groups. You will use the same techniques to implement similar scenarios. You will forward emails based on custom key entities, like product codes, or you will remove PII information from emails before forwarding with Amazon Comprehend.

With its native integrations with AWS services, Amazon SES allows you to enhance your email applications with different AWS Cloud capabilities easily.

To learn more about email forwarding with Amazon SES, you will visit documentation and AWS blogs.

Out-of-the-box database monitoring

Post Syndicated from Renats Valiahmetovs original https://blog.zabbix.com/out-of-the-box-database-monitoring/13957/

From this post and the video, you’ll learn about the possibilities of database monitoring using out-of-the-box Zabbix functionality without having to install additional tools, additional applications, or additional software that might not be allowed by your company.

Contents

I. Classic ODBC monitoring (0:22)

II. Synthetic MySQL monitoring (11:13)
III. DB monitoring with Zabbix Agent 2 (13:48)

IV. LLD for DB monitoring (17:03)

V. Questions & Answers (21:09)

Classic ODBC monitoring

What is ODBC?

ODBC stands for open database connectivity. There are a couple of ODBC drivers available for different database management systems (DBMS):

    • Oracle,
    • PostgreSQL,
    • MySQL,
    • Microsoft SQL Server,
    • Sybase ASE,
    • SAP HANA,
    • DB2.

All of these databases have different ODBCs specifically tailored for them. They offer slightly different functionality. So, even if you have set up the database monitoring for one database it might not necessarily work just as good for the other, as the functionality used to monitor one database might not exist for the other. In addition, as different technologies have different capabilities, most ODBC drivers do not implement all functionality defined in the ODBC standard.

What to monitor?

When we are planning to use ODBC for monitoring, what kind of data we can expect to receive? The answer ultimately depends on your own preferences, needs, or your proficiency in a specific database. You can monitor any possible database performance metrics and incidents using Zabbix templates.

Generally, monitoring of the following areas is of interest:

    • database performance
    • engine availability
    • configuration changes that you need to be aware of

To make the process easier, we provide ready-to-use templates, which can be applied to a host where your database is deployed. You can browse a full list of available metrics in these templates’ descriptions. So, you don’t have to perform configuration completely from scratch, which is good news.

How does it work?

Without diving too deep into the transport layer and all of the technical details, the ODBC driver accesses the database over the network using the database API. So, there is no direct connection between Zabbix and the database. Zabbix only creates a query passed to the ODBC manager for processing, which then moves the request over to the ODBC driver that connects to the database management system and then executes the query. Here, Zabbix does not limit the query execution timeout, and the timeout parameter is used as the ODBC login timeout.

Chain of processes

ODBC configuration is based on two files:

  • odbc.ini — holds a list of installed ODBC database drivers, which are used for specific communication.
  • odbcinst.ini — holds the definitions of data sources so that we know to which database we are going to connect.

Where to start?

What do we need to do in order to start using this ODBC monitoring approach?

  1. First, we will need to install the ODBC driver relevant to the database we are going to monitor. A simple yum command will suffice if we’re working with CentOS.
# yum -y install unixODBC unixODBC-devel
  1. Then we need to specify the package (driver) we want to install and modify the ODBC driver files.
  • odbc.ini:
[[email protected] ~]# cat /etc/odbc.ini
[MySQL]
Description=NewDatabase
Driver=MariaDB
Server=localhost
User=root
Password=VerySecurePassword
Port=3306
Database=DatabaseName
  • odbcinst.ini:
[[email protected] ~]# cat /etc/odbcinst.ini
[MySQL]
Description=ODBC for MySQL
Driver=/usr/lib/libmyodbc5.so
Setup=/usr/lib/libodbcmyS.so
Driver64=/usr/lib64/libmyodbc8a.so
Setup64=/usr/lib64/libmyodbc8a.so
FileUsage=1

Then we need to populate them with the necessary information. So, in this case, DSN (data source name) is used to call a specific connection. We need to get this part correctly, otherwise, the connection will not work out, for instance, in case of a typo.

  1. After we have installed the ODBC driver and configured the configuration files, we don’t really need to go ahead into Zabbix to create a new item and see if it works. We can test the ODBC configuration using isql to connect or at least attempt to connect to a particular database using the specified configuration.

Using isql to test ODBC configuration

If we receive an output that you have been connected then the communication is correct. You can also execute a sort of query, for instance, select some information from the database. If you get the result, then you do have the necessary permissions to access that data, and the connection, that is the ODBC driver, is working fine. Then you can proceed to the frontend.

  1. In the frontend, we will need to create an item of the ‘Database monitor’ type on a particular host or a template and specify one of the two keys available for ODBC monitoring: db.odbc.select or db.odbc.get.

Creating ‘Database monitor’ item

The difference between these item keys is pretty simple — select will return only one value and get will return values in bulk. So, get is more efficient and allows for reducing the load on the database if we are working with a lot of data. Within the key parameters, we need to specify the same DSN that we have defined in our odbc.ini file.

We need to make sure that the first parameter is unique so that this particular item key is unique and does not duplicate anything else, and the second parameter is the DSN.

  1. After we have specified everything, we specify the query, which is a part of the item configuration.
  2. We test the item using the test form in the Zabbix frontend. If the test form returns a value or does not return an error message, then everything is fine and we can proceed with this item or create more items.

Testing the item

ODBC templates

  1. There are a couple of built-in templates. If the metrics obtained through these templates are sufficient, we obviously don’t need to create these items from scratch or configure them. We can simply assign the templates we need to the host, on which we are monitoring the database. All we need to do is to tweak a little, if necessary, modify the macro related to the DSN, and then start monitoring.

Assigning a template

NOTE. The easiest way to get the templates is to upgrade to the latest Zabbix with our official templates already built in. If you don’t have the needed templates for any reason, you can download them from Zabbix official repository or Zabbix integrations. If you still need a specific template, you can definitely check out the community-created templates.

  1. Finally, we can execute discovery rules:

and check the Latest data:

Synthetic MySQL monitoring

Synthetic MySQL monitoring approach is using capabilities of the Zabbix Agent. Though that is not something that Zabbix Agent is doing out of the box, still we don’t need to install anything or perform some super difficult manipulations to make it work as it is a part of Zabbix functionality.

As you might already know, the Zabbix Agent functionality can be extended using custom UserParameters and then used for database monitoring.

  1. So, we can create new UserParameters, which invoke native MySQL administration client commands providing output, which can then be used to calculate performance metrics.
UserParameter=mysql.ping[*], mysqladmin -h"$1" -P"$2" ping
UserParameter=mysql.get_status_variables[*], mysql -h"$1" -P"$2" -sNX -e "show
global status"
UserParameter=mysql.version[*], mysqladmin -s -h"$1" -P"$2" version
UserParameter=mysql.db.discovery[*], mysql -h"$1" -P"$2" -sN -e "show
databases"
  1. It is a good practice to test the commands themselves to make sure that they work and to test the UserParameter keys, for instance using the zabbix_get utility.
  2. Then you might want to use our official MySQL monitoring template by creating an additional file .my.cnf under /var/lib/zabbix (default location) as follows:
[client] 
user='zbx_monitor' 
password='<password>'
  1. Then we need to provide credentials for the user to confirm that the user has the necessary permissions to access the database.
  2. If everything is working, assign MySQL by Zabbix agent template.

In this case, we are not actually logging in to the database. We execute commands from the terminal by using Zabbix Agent and extending the functionality beyond the built-in functions.

DB monitoring with Zabbix Agent 2

Why Zabbix Agent 2?

What are the benefits of Zabbix Agent 2 in relation to database monitoring?

  • Zabbix Agent 2 is the improved version of our original Zabbix Agent, which is now written in Go.
  • Zabbix Agent 2 is more efficient and supports some new functions that Zabbix Agent 1 does not, for instance, custom intervals with active checks as Zabbix Agent 2 is using the Scheduler plugin and is capable of keeping track of time when certain checks need to be executed;
  • Older configuration is also supported. So, if we switch from Zabbix Agent 1 to Zabbix Agent 2, we do not need to rewrite the whole configuration file in order for Zabbix Agent 2 to work.
  • Zabbix Agent 2 is installed simply with one-line command just like Zabbix Agent 1, we need just to specify a different package.
# yum -y install zabbix-agent2
  • Zabbix Agent 2 is based on plugins, so you do not need to install it with ODBC drivers, as plugins do the work, or anything extra as Zabbix Agent 2 has out-of-the-box database-specific plugins to monitor your database, including MySQL, Oracle, and PostgreSQL.
  • Plugins are also written in Go.
  • We have created Zabbix Agent 2-specific templates, which we can assign to the host. So, if you decide to use Zabbix Agent 2, you need to perform even fewer manipulations in order to get your database monitored by Zabbix.

Built-in Zabbix Agent 2 templates

Configuration

The configuration is very simple. We need to decide whether we specify the necessary parameters within the item keys or, if we prefer named sessions, we edit the configuration file of Zabbix Agent 2 to define those and use the session name as the first parameter of the key.

  1. So, we specify the key according to the documentation page. In the first case, we can specify essentially the location of our database and provide the credentials.

In the second case, we simply need to provide the DSN in order to connect to the database using Zabbix Agent 2 built-in plugins.

Plugins.Mysql.Sessions.Prod.Uri=tcp://192.168.1.1:3306

Plugins.Mysql.Sessions.Prod.User=<UserForProd>

Plugins.Mysql.Sessions.Prod.Password=<PasswordForProd>
  1. After we have created these items or applied a template, we can definitely test them out and see whether they are working fine.

NOTE. Check available MySQL-related item keys documentation page.

LLD for DB monitoring

Why LLD?

Finally, you can definitely use low-level discovery for database monitoring. LLD is a very efficient and powerful tool within Zabbix. You can definitely use either built-in discovery keys, which utilize Zabbix Agent, or other sources such as custom scripts to pass the payload to your low-level discovery rule.

LLD:

    • Automatically creates items, triggers, and graphs from different entities on a host.
    • Parses data received in Zabbix-specific JSON format.
    • Different sources for LLD can be used, such as:
      • Built-in discovery keys,
      • Dependent on a built-in item key,
      • Dependent on a custom script/custom UserParameter.

Here we have a script providing our JSON-formatted payload, which is sent by the Data sender Zabbix utility to the Master trapper item within our Zabbix instance, while our LLD rule depends on this particular Master trapper item.

So, we just populate this trapper item with the JSON payload, LLD rule creates new entities based on the prototypes, and then the items created by those prototypes are collecting the data from that master trapper item each time a new payload comes in.

How to configure custom LLD?

In general, to create LLD from scratch:

  1. First, you will need to decide on the actual payload delivery method (Zabbix Agent, script, Zabbix sender, or UserParameter).
  2. Make sure that your payload is in JSON that is structurally sound so that Zabbix can accept and parse it.
[{"{#DATABASE}":"information_schema"},{"{#DATABASE}":"mysql"},{"{#DATABASE}":"p erformance_schema"},{"{#DATABASE}":"sys"},{"{#DATABASE}":"zabbix"}]
  1. Create LLD rule with type according to delivery method.
  2. Test the rule (if available for passive checks) to see JSON you receive.
  3. Create filters or overrides, if necessary.
  4. Create prototypes, based on which your entities will be created.

If we don’t want to create LLD rules from scratch, we can definitely modify the built-in templates without wasting time creating custom LLD rules:

    • Modify/create new entities;
    • Clone the templates;
    • Refer to templated discovery rule configuration.

Modifying LLD rules of official templates

Questions & Answers

Question. Can we monitor the database using active checks or passive checks?

Answer. As I have mentioned, everything depends on your preferences and, ultimately, on the way you want to pass this output to Zabbix Server. If we’re talking about active checks, you can utilize Zabbix sender, for instance. So, it will be a trapper item on the Zabbix Server side waiting for data. In case of passive checks, we can use Zabbix Agent. So, we can use both types of checks for database monitoring.

Question. Can we establish a secure connection between the ODBC gateway and the database, which is somewhere on a distant machine?

Answer. Yes, this can be done though it does require a little bit of finesse. It is an extensive topic, and the security of the connection is highly dependent on the driver, which should support a secure connection. Some older databases might not have this functionality.

Question. Are ODBC checks influencing the performance of the master server?

Answer. It depends on what kind of data you are collecting. If you have a lot of items utilizing db.odbc.get item key, which retrieves just one value from the database, this might impact your database performance. You might not notice this impact if your hardware is powerful enough. However, it is advisable to use the odbc.select key in order to collect this information in bulk. Otherwise, you might be locking up some entries within your database that could potentially lead to problems.

Question. So, we provide two solutions with one of them using ODBC agentless checks ODBC. In addition, we have the agent tool. Will you briefly describe the advantages of ODBC and Agent checks?

Answer. If we’re talking about the ODBC database monitoring method, the most obvious difference is that you don’t need to install an agent. From the data collection perspective, there is not much difference. Everything depends on your specific needs.

 

Create a serverless feedback collector application using Amazon Pinpoint’s two-way SMS functionality

Post Syndicated from Murat Balkan original https://aws.amazon.com/blogs/messaging-and-targeting/create-a-serverless-feedback-collector-application-by-using-amazon-pinpoints-two-way-sms-functionality/

Introduction

Two-way SMS communication is used by many companies to create interactive engagements with their customers. Traditional SMS notifications are one-way. While this is valid for many different use cases like one-time passwords (OTP) notifications and security notifications or reminders, some other use-cases may benefit from collecting information from the same channel. Two-way SMS allows customers to create this feedback mechanism and enhance business interactions and overall customer experience.

SMS is chosen for its simplicity and availability across different sets of devices. By combining the two-way SMS mechanism with the vast breadth of services Amazon Web Services (AWS) offers, companies can create effective architectures to better interact and serve their customers.

This blog post shows you how a serverless online appointment application can use Amazon Pinpoint’s two-way SMS functionality to collect customer feedback for completed appointments. You will learn how Amazon Pinpoint interacts with other AWS serverless services with its out-of-the-box integrations to create a scalable messaging application.

Architecture

By completing the steps in this post, you can create a system that uses the architecture illustrated in the following image:

The architecture of a feedback collector application that is composed of serverless AWS services

The flow of events starts when a Amazon DynamoDB table item, representing an online appointment, changes its status to COMPLETED. An AWS Lambda function which is subscribed to these changes over DynamoDB Streams detects this change and sends an SMS to the customer by using Amazon Pinpoint API’s sendMessages operation.

Amazon Pinpoint delivers the SMS to the recipient and generates a unique message ID to the AWS Lambda function. The Lambda function then adds this message ID to a DynamoDB table called “message-lookup”. This table is used for tracking different feedback requests sent during a multi-step conversation and associate them with the appointment ids. At this stage, the Lambda function also populates another table “feedbacks” which will hold the feedback responses that will be sent as SMS reply messages.

Each time a recipient replies to an SMS, Amazon Pinpoint publishes this reply event to an Amazon SNS topic which is subscribed by an Amazon SQS queue. Amazon Pinpoint will also add a messageId to this event which allows you to bind it to a sendMessages operation call.

A second AWS Lambda function polls these reply events from the Amazon SQS queue. It checks whether the reply is in the correct format (i.e. a number) and also associated with a previous request. If all conditions are met, the AWS Lambda function checks the ConversationStage attribute’s value from its message-lookup table. According to the current stage and the SMS answer received, AWS Lambda function will determine the next step.

For example, if the feedback score received is less than 5, a follow-up SMS is sent to the user asking if they’ll be happy to receive a call from the customer support team.

All SMS replies from the users are reflected to “feedbacks” table for further analysis.

Deploying the Sample Application

  1. Clone this GitHub repository to your local machine and install and configure AWS SAM with a test AWS IAM user.

You will use AWS SAM to deploy the remaining parts of this serverless architecture.

The AWS SAM template creates the following resources:

    • An Amazon DynamoDB table (appointments) that contains information about appointments, customers and their appointment status.
    • An Amazon DynamoDB table (feedbacks) that holds the received feedbacks from customers.
    • An Amazon DynamoDB table (message-lookup) that holds the Amazon Pinpoint message ids and associate them to appointments to track a multi-step conversation.
    • Two AWS Lambda functions (FeedbackSender and FeedbackReceiver)
    • An Amazon SNS topic that collects state change events from Amazon Pinpoint.
    • An Amazon SQS queue that queues the incoming messages.
    • An Amazon Pinpoint Application with an associated SMS channel.

This architecture consists of two Lambda functions, which are represented as two different apps in the AWS SAM template. These functions are named FeedbackSender and FeedbackReceiver. The FeedbackSender function listens the Amazon DynamoDB Stream associated with the appointments table and sends the SMS message requesting a feedback. Second Lambda function, FeedbackReceiver, polls the Amazon SQS queue and updates the feedbacks table in Amazon DynamoDB. (pinpoint-two-way-sms)

          Note: You’ll incur some costs by deploying this stack into your account.

  1. To start the SAM deployment, navigate to the root directory of the repository you downloaded and where the template.yaml AWS SAM template resides. AWS SAM also requires you to specify an Amazon Simple Storage Service (Amazon S3) bucket to hold the deployment artifacts. If you haven’t already created a bucket for this purpose, create one now. The bucket should have read and write access by an AWS Identity and Access Management (IAM) user.

At the command line, enter the following command to package the application:

sam package --template template.yaml --output-template-file output_template.yaml --s3-bucket BUCKET_NAME_HERE

In the preceding command, replace BUCKET_NAME_HERE with the name of the Amazon S3 bucket that should hold the deployment artifacts.

AWS SAM packages the application and copies it into this Amazon S3 bucket.

When the AWS SAM package command finishes running, enter the following command to deploy the package:

sam deploy --template-file output_template.yaml --stack-name BlogStackPinpoint --capabilities CAPABILITY_IAM

When you run this command, AWS SAM shows the progress of the deployment. When the deployment finishes, navigate to the Amazon Pinpoint console and choose the project named “BlogApplication”. This example uses “BlogStackPinpoint” as the stack name, you can change this to any other name you want.

  1. From the left navigation, choose Settings, SMS and voice. On the SMS and voice settings page, choose the Request phone number button under Number settings

Screenshot of request phone number screen

  1. Choose a target country. Set the Default message type as Transactional, and click on the Request long codes button to buy a long code.

Note: In United States, you can also request a Toll Free Number(TFN)

Screenshot showing long code additio

A long code will be added to the Number settings list.

  1. Choose the newly added number to reach the SMS Settings page and enable the option Enable two-way-SMS. At the Incoming messages destination, select Choose an existing SNS topic, and from the drop down select the Amazon SNS topic that was created by the BlogStackPinpoint stack.

Choose Save to save your SMS settings.

 

Testing the Sample Application

Now that the application is deployed and configured, test it by creating sample records in the Amazon DynamoDB table. Navigate to Amazon DynamoDB console and reach the tables view. Inspect the tables that were created by the AWS SAM application.

Here, appointments table is the table where the appointments and their statuses are kept. It tracks the appointment lifecycle events with items identified by unique ids. In this sample scenario, we are assuming that an appointment application creates a record with ‘CREATED’ status when a new appointment is planned. After the appointment is finished, same application updates the status to ‘COMPLETED’ which will trigger the feedback collection process. Feedback results are collected in the feedbacks table. Amazon Pinpoint message id’s, conversation stage and appointment id’s are kept in the message-lookup table.

  1. To start testing the end-to-end flow, choose the appointments table to open table overview page.
  2. Next, select the Items tab and choose the Create item From the dropdown, select Text. Add the following and choose Save to create your first appointment object. While adding the following object, replace CustomerPhone attribute’s value with a phone number you own. The feedback request messages will be delivered to that number. Note: This number should match the country number for the long code you provisioned.

{

"CustomerName": "Customer A",

"CustomerPhone": "+12345678900",

"AppointmentStatus":"CREATED",

"id": "1"

}

  1. To trigger sending the feedback SMS, you need to set an existing item’s status to “COMPLETED” To do this, select the item and click Edit from the Actions menu.

Replace the item’s current JSON with the following.

{

"AppointmentStatus": "COMPLETED",

"CustomerName": "Customer A",

"CustomerPhone": "+12345678900",

"id": "1"

}

  1. Before choosing the Save button, double check that you have set CustomerPhone attribute’s value to a valid phone number.

After the change, you should receive an SMS message asking for a feedback. Provide a numeric reply of that is less than five to this message. This will trigger a follow up question asking for a consent to receive an in-person callback.

 

During your SMS conversation with the application, inspect the feedbacks table. The feedback you have given over this two-way SMS channel should have been reflected into the table.

If you want to repeat the process, make sure to increment the AppointmentId field for any additional appointment records.

Cleanup

To clean up the resources you used in your account, simply navigate to AWS Cloudformation console and delete the stack named “BlogStackPinpoint”.

After the stack is deleted, you also need to delete the Long code from the Pinpoint Console by choosing the number and pressing Remove phone number button. You can also delete the Amazon S3 bucket you used for packaging and deploying the AWS SAM application.

Conclusion

This architecture shows how Amazon Pinpoint can be used to make two-way SMS communication with your customers. You can implement Two-way SMS functionality in other use cases such as appointment reminders, polls, Q&A services, and more.

To learn more about Pinpoint and it’s two-way SMS mechanism, you can visit the Pinpoint documentation.

 

MySQL performance tuning 101 for Zabbix

Post Syndicated from Vittorio Cioe original https://blog.zabbix.com/mysql-performance-tuning-101-for-zabbix/13899/

In this post and the video, you will learn about a proper approach to getting the most out of Zabbix and optimizing the underlying MySQL Database configuration to improve performance while working with a database-intensive application such as Zabbix.

Contents

I. Zabbix and MySQL (1:12)
II. Optimizing MySQL for Zabbix (2:09)

III. Conclusion (15:43)

Zabbix and MySQL

Zabbix and MySQL love each other. Half of the Zabbix installations are running on MySQL. However, Zabbix is quite a write-intensive application, so we need to optimize the database configuration and usage to work smoothly with Zabbix that reads the database and writes to the database a lot.

Optimizing MySQL for Zabbix

Balancing the load on several disks

So, how can we optimize MySQL configuration to work with Zabbix? First of all, it is very important to balance the load on several hard drives by using:

    • datadir to specify the default location, that is to dedicate the hard drives to the data directory;
    • datadir innodb_data_file_path to define size, and attributes of InnoDB system tablespace data files;
    • innodb_undo_directory to specify the path to the InnoDB undo tablespaces;
    • innodb_log_group_home_dir to specify the path to the InnoDB redo log files;
    • log-bin to enable binary logging and set path/file name prefix (dual functionality); and
    • tmpdir (Random, SSD, tmpfs).

The key here is to split the load as much as possible across different hard drives in order to avoid different operations fighting for resources.

Viewing your MySQL configuration

Now, we can jump straight to MySQL configuration. It is important to start from your current configuration and check who and when has changed this configuration.

SELECT t1.*, VARIABLE_VALUE FROM performance_schema.variables_info t1 JOIN
performance_schema.global_variables t2 ON t2.VARIABLE_NAME=t1.VARIABLE_NAME WHERE
t1.VARIABLE_SOURCE not like "COMPILED"

This query can help you to understand who has changed the configuration. However, when the configuration is changing is also important to keep track of these changes.

Viewing MySQL configuration

MySQL key variables to optimize in your configuration

InnoDB buffer pool

The king of all of the variables to be optimized is InnoDB buffer pool, which is the main parameter determining the memory for storing the DB pages — MySQL buffer pool — an area in main memory MySQL where InnoDB caches table and index data as it is accessed.

  • InnoDB default value is to log, for production 50-75% of available memory on the dedicated database server.
  • Since MySQL 5.7, innodb_buffer_pool_size can be changed dynamically.

Judging from experience, 50 percent of available memory will be enough for the majority of databases with a lot of connections or activities, as many other indicators are used, which occupy memory. So, 50 percent is a good though conservative parameter.

To check InnoDB Buffer Pool usage (in %) and if you need to allocate more memory for the InnoDB Buffer Pool, you can use the query, which allows you to see the current usage as a percentage (though there are many queries to monitor the InnoDB Buffer Pool).

SELECT CONCAT(FORMAT(DataPages*100.0/TotalPages,2),
' %') BufferPoolDataPercentage
FROM (SELECT variable_value DataPages FROM information_schema.global_status
WHERE variable_name = 'Innodb_buffer_pool_pages_data') A,
(SELECT variable_value TotalPages FROM information_schema.global_status
WHERE variable_name = 'Innodb_buffer_pool_pages_total') B;

Binary logs

Binary logs contain events that describe changes, provide data changes sent to replicas, and are used for data recovery operations.

If you work with replication, you might know that binary logs require special attention apart from having them on a separate disk. You should size the binary logs properly, set the proper expiration time (1 month by default), and the maximum size, for instance, of 1 GB so that you will be able to write 1 GB of data per day.

We can have about 30 log files in the binary logs. However, you should check the activities of your system to consider increasing this number, as well as the expiration of the binary logs, if you need to keep more data for operations, such as finding time recovery, for instance.

How to control binary logs:

    • log_bin, max_binlog_size, binlog_expire_logs_seconds, etc.
    • PURGE BINARY LOGS TO|BEFORE to delete all the binary log files listed in the log index file prior to the specified log file name or date.
    • In addition, consider using GTID for replication to keep track of transactions.

InnoDB redo logs

This is yet another beast, which we want to keep control of — the redo and undo logs, which get written prior to flushing the data to the disk.

    • innodb_log_file_size

– The size of redo logs will impact the writing speed over the time to recover.
– The default value is too low, so consider using at least 512 MB for production.
– Total redo log capacity is determined by innodb_log_files_in_group (default value 2). For write-intensive systems, consider increasing innodb_log_files_in_group and keeping them on in a separate disk.

NOTE. Here, the related parameters are innodb_log_file_size and innodb_log_files_in_group.

Trading performance over consistency (ACID)

Associated with the redo and undo log discussion is the trading performance over consistency discussion about when InnoDB should flush/sync committed truncations.

innodb_flush_log_at_trx_commit defines how ofter InnoDB flushes the logs to the disk. This variable can have different values:

    • 0 — transactions are written to redo logs once per second;
    • 1 — (default value) fully ACID-compliant with redo logs written and flushed to disk at transaction commit;
    • 2 — transactions are written to redo logs at commit, and redo logs are flushed once per second.

If the system is write-intensive, you might consider setting this value to 2 to keep redo logs at every commit with the data written to disk once per second. This is a very good compromise between data integrity and performance successfully used in a number of write-intensive setups. This is a relief for the disk subsystem allowing you to gain that extra performance.

NOTE. I recommend using default (1) settings unless you are bulk-loading data, set session variable to 2 during load, experiencing an unforeseen peak in workload (hitting your disk system) and need to survive until you can
solve the problem, or you use the latest MySQL 8.0. You can also disable redo-logging completely. 

table_open_cache and max_connections

Opening the cache discussions, we will start from the max_connections parameter, which sets the maximum number of connections that we want to accept on the MySQL server, and the table_open_cache parameter, which sets the value of the cache of open tables we want to keep. Both parameters affect the maximum number of files the server keeps open:

    • table_open_cache value — 2,000 (default), which means that by default you can keep 2,000 tables open per connection.
    • max_connections value — 151 (default).

If you increase both values too much, you may easily run out of memory. So, the total number of open tables in MySQL is:

N of opened tables = N of connections x N (max number of tables per join)

NOTE. This number is related to the joins operated by your database per connection.

So, having an insight into what Zabbix does and which queries it executes can help you fine-tune this parameter. In addition, you can go by the rule of thumb checking if the table_open_cache sheets are full. To do that, you can check the global status like ‘opened_tables‘ to understand what is going on.

In addition, if you are going to increase the table up and cache on the maximum number of connections, you can check open_files_limit in MySQL and ulimit — the maximum number of open files in the operating system, as new connections are kept as open files in Linux. So, this is a parameter to fine-tune as well.

Open buffers per client connection

There are other buffers that depend on the number of connections (max_connections), such as:

    • read_buffer_size,
    • read_rnd_buffer_size,
    • join_buffer_size,
    • sort_buffer_size,
    • binlog_cache_size (if binary logging is enabled),
    • net_buffer_length.

Depending on how often you get connections to the Zabbix database, you might want to increase these parameters. It is recommended to monitor your database to see how these buffers are being filled up.

You also need to reserve some extra memory for these buffers if you have many connections. That is why it is recommended to reserve 50 percent of available memory for InnoDB buffer pool, so that you can use these spare 25 percent for extra buffers.

However, there might be another solution.

Enabling Automatic Configuration for a Dedicated MySQL Server

In MySQL 8.0, innodb_dedicated_server automatically configures the following variables:

    • innodb_buffer_pool_size,
    • innodb_log_file_size,
    • innodb_log_files_in_group, and
    • innodb_flush_method.

I would enable this variable as it configures the innodb_flush_ method which has a dependency with the file system.

NOTE. Enabling innodb_dedicated_server is not recommended if the MySQL instance shares system resources with other applications, as this variable enabled implicitly means that we are running only MySQL on the machine.

Conclusion

Now, you are ready to fine-tune your configuration step by step, starting from innodb_buffer_pool, max_connections, and table_open_cache, and see if your performance improves. Eventually, you can do further analysis and go further to really fine-tune your system up to your needs.

In general, 3-5 core parameters would be enough for operating with Zabbix in the vast majority of cases. If you tune those parameters keeping in mind dealing with a write-intensive application, you can achieve good results, especially if you separate the resources at a hardware level or at a VM level.

Performance tuning dos and don’ts

  • For a high-level performance tuning 101, think carefully and consider the whole stack together with the application.
  • In addition, think methodically:
    1. define what you are trying to solve, starting from the core of variables, which you want to fine-tune;
    2. argue why the proposed change will work;
    3. create an action plan; and
    4. verify the change worked.
  • To make things work:

— don’t micromanage;
— do not optimize too much;
— do not optimize everything; and, most importantly,
— do not take best practices as gospel truth, but try to adjust any practices to your particular environment.

 

Build a data lake using Amazon Kinesis Data Streams for Amazon DynamoDB and Apache Hudi

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/build-a-data-lake-using-amazon-kinesis-data-streams-for-amazon-dynamodb-and-apache-hudi/

Amazon DynamoDB helps you capture high-velocity data such as clickstream data to form customized user profiles and online order transaction data to develop customer order fulfillment applications, improve customer satisfaction, and get insights into sales revenue to create a promotional offer for the customer. It’s essential to store these data points in a centralized data lake, which can be transformed, analyzed, and combined with diverse organizational datasets to derive meaningful insights and make predictions.

A popular use case in order management is receiving, tracking, and fulfilling customer orders. The order management process begins when an order is placed and ends when the customer receives their package. When storing high-velocity order transaction data in DynamoDB, you can use Amazon Kinesis streaming to extract data and store it in a centralized data lake built on Amazon Simple Storage Service (Amazon S3).

Amazon Kinesis Data Streams for DynamoDB helps you to publish item-level changes in any DynamoDB table to a Kinesis data stream of your choice. Additionally, you can take advantage of this feature for use cases that require longer data retention on the stream and fan out to multiple concurrent stream readers. You also can integrate with Amazon Kinesis Data Analytics or Amazon Kinesis Data Firehose to publish data to downstream destinations such as Amazon Elasticsearch Service (Amazon ES), Amazon Redshift, or Amazon S3.

In this post, you use Kinesis Data Streams for DynamoDB and take advantage of managed streaming delivery of DynamoDB data to other Kinesis Data Stream by simply enabling Kinesis streaming connection from Amazon DynamoDB console. To process DynamoDB events from Kinesis, you have multiple options: Amazon Kinesis Client Library (KCL) applications, Lambda, Kinesis Data Analytics for Apache Flink, and Kinesis Data Firehose. In this post, you use Kinesis Data Firehose to save the raw data in the S3 data lake and Apache Hudi to batch process the data.

Architecture

The following diagram illustrates the order processing system architecture.

In this architecture, users buy products in online retail shops and internally create an order transaction stored in DynamoDB. The order transaction data is ingested to the data lake and stored in the raw data layer. To achieve this, you enable Kinesis Data Streams for DynamoDB and use Kinesis Data Firehose to store data in Amazon S3. You use Lambda to transform the data from the delivery stream to remove unwanted data and finally store it in Parquet format. Next, you batch process the raw data and store it back in the Hudi dataset in the S3 data lake. You can then use Amazon Athena to do sales analysis. You build this entire data pipeline in a serverless manner.

Prerequisites

Complete the following steps to create AWS resources to build a data pipeline as mentioned in the architecture. For this post, we use the AWS Region us-west-1.

  1. On the Amazon Elastic Compute Cloud (Amazon EC2) console, create a keypair.
  2. Download the data files, Amazon EMR cluster, and Athena DDL code from GitHub.
  3. Deploy the necessary Amazon resources using the provided AWS CloudFormation template.
  4. For Stack name, enter a stack name of your choice.
  5. For Keypair name, choose a key pair.

A key pair is required to connect to the EMR cluster nodes. For more information, see Use an Amazon EC2 Key Pair for SSH Credentials.

  1. Keep the remaining default parameters.
  2. Acknowledge that AWS CloudFormation might create AWS Identity and Access Management (IAM) resources.

For more information about IAM, see Resources to learn more about IAM.

  1. Choose Create stack.

You can check the Resources tab for the stack after the stack is created.

The following table summarizes the resources that you created, which you use to build the data pipeline and analysis.

Logical ID Physical ID Type
DeliveryPolicy kines-Deli-* AWS::IAM::Policy
DeliveryRole kinesis-hudi-DeliveryRole-* AWS::IAM::Role
Deliverystream kinesis-hudi-Deliverystream-* AWS::KinesisFirehose::DeliveryStream
DynamoDBTable order_transaction_* AWS::DynamoDB::Table
EMRClusterServiceRole kinesis-hudi-EMRClusterServiceRole-* AWS::IAM::Role
EmrInstanceProfile kinesis-hudi-EmrInstanceProfile-* AWS::IAM::InstanceProfile
EmrInstanceRole kinesis-hudi-EmrInstanceRole-* AWS::IAM::Role
GlueDatabase gluedatabase-* AWS::Glue::Database
GlueTable gluetable-* AWS::Glue::Table
InputKinesisStream order-data-stream-* AWS::Kinesis::Stream
InternetGateway igw-* AWS::EC2::InternetGateway
InternetGatewayAttachment kines-Inter-* AWS::EC2::VPCGatewayAttachment
MyEmrCluster AWS::EMR::Cluster
ProcessLambdaExecutionRole kinesis-hudi-ProcessLambdaExecutionRole-* AWS::IAM::Role
ProcessLambdaFunction kinesis-hudi-ProcessLambdaFunction-* AWS::Lambda::Function
ProcessedS3Bucket kinesis-hudi-processeds3bucket-* AWS::S3::Bucket
PublicRouteTable AWS::EC2::RouteTable
PublicSubnet1 AWS::EC2::Subnet
PublicSubnet1RouteTableAssociation AWS::EC2::SubnetRouteTableAssociation
PublicSubnet2 AWS::EC2::Subnet
PublicSubnet2RouteTableAssociation AWS::EC2::SubnetRouteTableAssociation
RawS3Bucket kinesis-hudi-raws3bucket-* AWS::S3::Bucket
S3Bucket kinesis-hudi-s3bucket-* AWS::S3::Bucket
SourceS3Bucket kinesis-hudi-sources3bucket-* AWS::S3::Bucket
VPC vpc-* AWS::EC2::VPC

Enable Kinesis streaming for DynamoDB

AWS recently launched Kinesis Data Streams for DynamoDB so you can send data from DynamoDB to Kinesis data streams. You can use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to enable this feature.

To enable this feature from the console, complete the following steps:

  1. On the DynamoDB console, choose the table you created in the CloudFormation stack earlier (it begins with the prefix order_transaction_).
  2. On the Overview tab, choose Manage streaming to Kinesis.
  3. Choose your input stream (it starts with order-data-stream-).
  4. Choose Enable.
  5. Choose Close.
  6. Make sure that stream enabled is set to Yes.

Populate the sales order transaction dataset

To replicate a real-life use case, you need an online retail application. For this post, you upload raw data files in the S3 bucket and use a Lambda function to upload the data in DynamoDB. You can download the order data CSV files from the AWS Sample GitHub repository. Complete the following steps to upload the data in DynamoDB:

  1. On the Amazon S3 console, choose the bucket <stack-name>-sourcess3bucket-*.
  2. Choose Upload.
  3. Choose Add files.
  4. Choose the order_data_09_02_2020.csv and order_data_10_02_2020.csv files.
  5. Choose Upload.
  6. On the Lambda console, choose the function <stack-name>-CsvToDDBLambdaFunction-*.
  7. Choose Test.
  8. For Event template, enter an event name.
  9. Choose Create.
  10. Choose Test.

This runs the Lambda function and loads the CSV file order_data_09_02_2020.csv to the DynamoDB table.

  1. Wait until the message appears that the function ran successfully.

You can now view the data on the DynamoDB console, in the details page for your table.

Because you enabled the Kinesis data stream in the DynamoDB table, it starts streaming the data to Amazon S3. You can check the data by viewing the bucket on the Amazon S3 console. The following screenshot shows that a Parquet file is under the prefix in the bucket.

Use Apache Hudi with Amazon EMR

Now it’s time to process the streaming data using Hudi.

  1. Log in to the Amazon EMR leader node.

You can use the key pair you chose in the security options to SSH into the leader node.

  1. Use the following bash command to start the Spark shell to use it with Apache Hudi:
spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

The Amazon EMR instance looks like the following screenshot.

  1. You can use the following Scala code to import the order transaction data from the S3 data lake to a Hudi dataset using the copy-on-write storage type. Change inputDataPath as per file path in <stack-name>-raws3bucket-* in your environment, and replace the bucket name in hudiTablePath as <stack-name>- processeds3bucket-*.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor

//Set up various input values as variables
val inputDataPath = "s3://kinesis-hudi-raws3bucket-1p6nszvqd9awz/2021/02/01/15/"
val hudiTableName = "order_hudi_cow"
val hudiTablePath = "s3://kinesis-hudi-processeds3bucket-yfc6843vmq1o/" + hudiTableName

// Set up our Hudi Data Source Options
val hudiOptions = Map[String,String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "order_id",
	DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "time_stamp",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "transaction_date", 
    HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
    DataSourceWriteOptions.OPERATION_OPT_KEY ->
        DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "time_stamp", 
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "transaction_date", 
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
        classOf[MultiPartKeysValueExtractor].getName)

// Read data from S3 and create a DataFrame with Partition and Record Key
val inputDF = spark.read.format("parquet").load(inputDataPath)

// Write data into the Hudi dataset
inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiTablePath)

For more information about DataSourceWriteOptions, see Work with a Hudi Dataset.

  1. In the Spark shell, you can now count the total number of records in the Apache Hudi dataset:
scala> inputDF.count()
res1: Long = 1000

You can check the processed Apache Hudi dataset in the S3 data lake via the Amazon S3 console. The following screenshot shows the prefix order_hudi_cow is in <stack-name>- processeds3bucket-*.

When navigating into the order_hudi_cow prefix, you can find a list of Hudi datasets that are partitioned using the transaction_date key—one for each date in our dataset.

Let’s analyze the data stored in Amazon S3 using Athena.

Analyze the data with Athena

To analyze your data, complete the following steps:

  1. On the Athena console, create the database order_db using the following command:
create database order_db;

You use this database to create all the Athena tables.

  1. Create your table using the following command (replace the S3 bucket name with <stack-name>- processeds3bucket* created in your environment):
    CREATE EXTERNAL TABLE order_transaction_cow (
      `_hoodie_commit_time` string,
      `_hoodie_commit_seqno` string,
      `_hoodie_record_key` string,
      `_hoodie_partition_path` string,
      `_hoodie_file_name` string,
      `order_id` string,
      `item_id` string,
      `customer_id` string,
      `product` string,
      `amount` decimal(3,1),
      `currency` string,
      `time_stamp` string
      )
      PARTITIONED BY ( 
      `transaction_date` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://kinesis-hudi-processeds3bucket-yfc6843vmq1o/order_hudi_cow'

  2. Add partitions by running the following query on the Athena console:
    ALTER TABLE order_transaction_cow ADD
    PARTITION (transaction_date = '2020-09-02') LOCATION 's3://kinesis-hudi-processeds3bucket-yfc6843vmq1o/order_hudi_cow/2020-09-02/';

  3. Check the total number of records in the Hudi dataset with the following query:
    SELECT count(*) FROM "order_db"."order_transaction_cow";

It should return a single row with a count of 1,000.

Now check the record that you want to update.

 

  1. Run the following query on the Athena console:
SELECT * FROM "order_db"."order_transaction_cow"
where order_id ='3801'
and item_id ='23'
and transaction_date ='2020-09-02';

The output should look like the following screenshot. Note down the value of product and amount.

Analyze the change data capture

Now let’s test the change data capture (CDC) in streaming. Let’s take an example where the customer changed an existing order. We load the order_data_10_02_2020.csv file, where order_id 3801 has a different product and amount.

To test the CDC feature, complete the following steps:

  1. On the Lambda console, choose the stack <stack-name>-CsvToDDBLambdaFunction-*.
  2. In the Environment variables section, choose Edit.
  3. For key, enter order_data_10_02_2020.csv.
  4. Choose Save.

You can see another prefix has been created in <stack-name>-raws3bucket-*.

  1. In Amazon EMR, run the following code in the Scala shell prompt to update the data (change inputDataPath to the file path in <stack-name>-raws3bucket-* and hudiTablePath to <stack-name>- processeds3bucket-*):
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    
    //Set up various input values as variables
    val inputDataPath = "s3://kinesis-hudi-raws3bucket-1p6nszvqd9awz/2021/02/01/18/"
    val hudiTableName = "order_hudi_cow"
    val hudiTablePath = "s3://kinesis-hudi-processeds3bucket-yfc6843vmq1o/" + hudiTableName
    
    // Set up our Hudi Data Source Options
    val hudiOptions = Map[String,String](
        DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "order_id",
    	DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "time_stamp",
        DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "transaction_date", 
        HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
        DataSourceWriteOptions.OPERATION_OPT_KEY ->
            DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
        DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "time_stamp", 
        DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
        DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
        DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "transaction_date", 
        DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
            classOf[MultiPartKeysValueExtractor].getName)
    
    // Read data from S3 and create a DataFrame with Partition and Record Key
    val inputDF = spark.read.format("parquet").load(inputDataPath)
    
    // Write data into the Hudi dataset
    inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiTablePath
    

  2. Run the following query on the Athena console to check for the change to the total number of records as 1,000:
    SELECT count(*) FROM "order_db"."order_transaction_cow";

  3. Run the following query on the Athena console to test for the update:
SELECT * FROM "order_db"."order_transaction_cow"
where order_id ='3801'
and item_id ='23'
and transaction_date ='2020-09-02';

The following screenshot shows that the product and amount values for the same order are updated.

In a production workload, you can trigger the updates on a schedule or by S3 modification events. A fully automated data lake makes sure your business analysts are always viewing the latest available data.

Clean up the resources

To avoid incurring future charges, follow these steps to remove the example resources:

  1. Delete the resources you created earlier in the pre-requisite section by deleting the stack instances from your stack set, if you created the EMR cluster with the CloudFormation template,.
  2. Stop the cluster via the Amazon EMR console, if you launched the EMR cluster manually.
  3. Empty all the relevant buckets via the Amazon S3 console.

Conclusion

You can build an end-to-end serverless data lake to get real-time insights from DynamoDB by using Kinesis Data Streams—all without writing any complex code. It allows your team to focus on solving business problems by getting useful insights immediately. Application developers have various use cases for moving data quickly through an analytics pipeline, and you can make this happen by enabling Kinesis Data Streams for DynamoDB.

If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available in the GitHub repository for you to use and extend. Contributions are always welcome!


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to guide enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

 

 

 

Saurabh Shrivastava is a solutions architect leader and analytics/ML specialist working with global systems integrators. He works with AWS Partners and customers to provide them with architectural guidance for building scalable architecture in hybrid and AWS environments. He enjoys spending time with his family outdoors and traveling to new destinations to discover new cultures.

 

 

 

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on data analytics, AI/ML, and DevOps.

 

Better performance for less: AWS continues to beat Azure on SQL Server price/performance

Post Syndicated from Fred Wurden original https://aws.amazon.com/blogs/compute/sql-server-runs-better-on-aws/

By Fred Wurden, General Manager, AWS Enterprise Engineering (Windows, VMware, RedHat, SAP, Benchmarking)

AWS R5b.8xlarge delivers better performance at lower cost than Azure E64_32s_v4 for a SQL Server workload

In this blog, we will review a recent benchmark that Principled Technologies published on 2/25. The benchmark found that an Amazon Elastic Compute Cloud (Amazon EC2) R5b.8xlarge instance delivered better performance for a SQL Server workload at a lower cost when directly tested against an Azure E64_32s_v4 VM.

Behind the study: Understanding how SQL Server performed better, for a lower cost with an AWS EC2 R5b instance

Principled Technologies tested an online transaction processing (OLTP) workload for SQL Server 2019 on both an R5b instance on Amazon EC2 with Amazon Elastic Block Store (EBS) as storage and Azure E64_32s_v4. This particular Azure VM was chosen as an equivalent to the R5b instance, as both instances have comparable specifications for input/output operations per second (IOPS) performance, use Intel Xeon processors from the same generation (Cascade Lake), and offer the same number of cores (32). For storage, Principled Technologies mirrored storage configurations across the Azure VM and the EC2 instance (which used Amazon Elastic Block Store (EBS)), maxing out the IOPS specs on each while offering a direct comparison between instances.

Test Configurations

Source: Principled Technologies

When benchmarking, Principled Technologies ran a TPC-C-like OLTP workload from HammerDB v3.3 on both instances, testing against new orders per minute (NOPM) performance. NOPM shows the number of new-order transactions completed in one minute as part of a serialized business workload. HammerDB claims that because NOPM is “independent of any particular database implementation [it] is the recommended primary metric to use.”

The results: SQL Server on AWS EC2 R5b delivered 2x performance than the Azure VM and 62% less expensive 

Graphs that show AWS instance outperformed the Azure instance

Source: Principled Technologies

These test results from the Principled Technologies report show the price/performance and performance comparisons. The performance metric is New Orders Per Minute (NOPM); faster is better. The price/performance calculations are based on the cost of on-demand, License Included SQL Server instances and storage to achieve 1,000 NOPM performance, smaller is better.

An EC2 r5b.8xlarge instance powered by an Intel Xeon Scalable processor delivered better SQL Server NOPM performance on the HammerDB benchmark and a lower price per 1,000 NOPM than an Azure E64_32s_v4 VM powered by similar Intel Xeon Scalable processors.

On top of that, AWS’s storage price-performance exceeded Azure’s. The Azure managed disks offered 53 percent more storage than the EBS storage, but the EC2 instance with EBS storage cost 24 percent less than the Azure VM with managed disks. Even by reducing Azure storage by the difference in storage, something customers cannot do, EBS would have cost 13 percent less per storage GB than the Azure managed disks.

Why AWS is the best cloud to run your Windows and SQL Server workloads

To us, these results aren’t surprising. In fact, they’re in line with the success that customers find running Windows on AWS for over 12 years. Customers like Pearson and Expedia have all found better performance and enhanced cost savings by moving their Windows, SQL Server, and .NET workloads to AWS. In fact, RepricerExpress migrated its Windows and SQL Server environments from Azure to AWS to slash outbound bandwidth costs while gaining agility and performance.

Not only do we offer better price-performance for your Windows workloads, but we also offer better ways to run Windows in the cloud. Whether you want to rehost your databases to EC2, move to managed with Amazon Relational Database for SQL Server (RDS), or even modernize to cloud-native databases, AWS stands ready to help you get the most out of the cloud.

 


To learn more on migrating Windows Server or SQL Server, visit Windows on AWS. For more stories about customers who have successfully migrated and modernized SQL Server workloads with AWS, visit our Customer Success page. Contact us to start your migration journey today.

Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS

Post Syndicated from Islam Ghanim original https://aws.amazon.com/blogs/architecture/field-notes-stopping-an-automatically-started-database-instance-with-amazon-rds/

Customers needing to keep an Amazon Relational Database Service (Amazon RDS) instance stopped for more than 7 days, look for ways to efficiently re-stop the database after being automatically started by Amazon RDS. If the database is started and there is no mechanism to stop it; customers start to pay for the instance’s hourly cost. Moreover, customers with database licensing agreements could incur penalties for running beyond their licensed cores/users.

Stopping and starting a DB instance is faster than creating a DB snapshot, and then restoring the snapshot. However, if you plan to keep the Amazon RDS instance stopped for an extended period of time, it is advised to terminate your Amazon RDS instance and recreate it from a snapshot when needed.

This blog provides a step-by-step approach to automatically stop an RDS instance once the auto-restart activity is complete. This saves any costs incurred once the instance is turned on. The proposed architecture is fully serverless and requires no management overhead. It relies on AWS Step Functions and a set of Lambda functions to monitor RDS instance state and stop the instance when required.

Architecture overview

Given the autonomous nature of the architecture and to avoid management overhead, the architecture leverages serverless components.

  • The architecture relies on RDS event notifications. Once a stopped RDS instance is started by AWS due to exceeding the maximum time in the stopped state; an event (RDS-EVENT-0154) is generated by RDS.
  • The RDS event is pushed to a dedicated SNS topic rds-event-notifications-topic.
  • The Lambda function start-statemachine-execution-lambda is subscribed to the SNS topic rds-event-notifications-topic.
    • The function filters messages with event code: RDS-EVENT-0154. In order to restrict the ‘force shutdown’ activity further, the function validates that the RDS instance is tagged with auto-restart-protection and that the tag value is set to ‘yes’.
    • Once all conditions are met, the Lambda function starts the AWS Step Functions state machine execution.
  • The AWS Step Functions state machine integrates with two Lambda functions in order to retrieve the instance state, as well as attempt to stop the RDS instance.
    • In case the instance state is not ‘available’, the state machine waits for 5 minutes and then re-checks the state.
    • Finally, when the Amazon RDS instance state is ‘available’; the state machine will attempt to stop the Amazon RDS instance.

Prerequisites

In order to implement the steps in this post, you need an AWS account as well as an IAM user with permissions to provision and delete resources of the following AWS services:

  • Amazon RDS
  • AWS Lambda
  • AWS Step Functions
  • AWS CloudFormation
  • AWS SNS
  • AWS IAM

Architecture implementation

You can implement the architecture using the AWS Management Console or AWS CLI.  For faster deployment, the architecture is available on GitHub. For more information on the repo, visit GitHub.

The steps below explain how to build the end-to-end architecture from within the AWS Management Console:

Create an SNS topic

  • Open the Amazon SNS console.
  • On the Amazon SNS dashboard, under Common actions, choose Create Topic.
  • In the Create new topic dialog box, for Topic name, enter a name for the topic (rds-event-notifications-topic).
  • Choose Create topic.
  • Note the Topic ARN for the next task (for example, arn:aws:sns:us-east-1:111122223333:my-topic).

Configure RDS event notifications

Amazon RDS uses Amazon Simple Notification Service (Amazon SNS) to provide notification when an Amazon RDS event occurs. These notifications can be in any notification form supported by Amazon SNS for an AWS Region, such as an email, a text message, or a call to an HTTP endpoint.

For this architecture, RDS generates an event indicating that instance has automatically restarted because it exceed the maximum duration to remain stopped. This specific RDS event (RDS-EVENT-0154) belongs to ‘notification’ category. For more information, visit Using Amazon RDS Event Notification.

To subscribe to an RDS event notification

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose Event subscriptions.
  • In the Event subscriptions pane, choose Create event subscription.
  • In the Create event subscription dialog box, do the following:
    • For Name, enter a name for the event notification subscription (RdsAutoRestartEventSubscription).
    • For Send notifications to, choose the SNS topic created in the previous step (rds-event-notifications-topic).
    • For Source type, choose ‘Instances’. Since our source will be RDS instances.
    • For Instances to include, choose ‘All instances’. Instances are included or excluded based on the tag, auto-restart-protection. This is to keep the architecture generic and to avoid regular configurations moving forward.
    • For Event categories to include, choose ‘Select specific event categories’.
    • For Specific event, choose ‘notification’. This is the category under which the RDS event of interest falls. For more information, review Using Amazon RDS Event Notification.
    •  Choose Create.
    • The Amazon RDS console indicates that the subscription is being created.

Create Lambda functions

Following are the three Lambda functions required for the architecture to work:

  1. start-statemachine-execution-lambda, the function will subscribe to the newly created SNS topic (rds-event-notifications-topic) and starts the AWS Step Functions state machine execution.
  2. retrieve-rds-instance-state-lambda, the function is triggered by AWS Step Functions state machine to retrieve an RDS instance state (example, available or stopped)
  3. stop-rds-instance-lambda, the function is triggered by AWS Step Functions state machine in order to attempt to stop an RDS instance.

First, create the Lambda functions’ execution role.

To create an execution role

  • Open the roles page in the IAM console.
  • Choose Create role.
  • Create a role with the following properties.
    • Trusted entity – Lambda.
    • Permissions – AWSLambdaBasicExecutionRole.
    • Role namerds-auto-restart-lambda-role.
    • The AWSLambdaBasicExecutionRole policy has the permissions that the function needs to write logs to CloudWatch Logs.

Now, create a new policy and attach to the role in order to allow the Lambda function to: start an AWS StepFunctions state machine execution, stop an Amazon RDS instance, retrieve RDS instance status, list tags and add tags.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "rds:AddTagsToResource",
                "rds:ListTagsForResource",
                "rds:DescribeDBInstances",
                "states:StartExecution",
                "rds:StopDBInstance"
            ],
            "Resource": "*"
        }
    ]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name (rds-auto-restart-lambda-policy) and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.

To link the new policy to the AWS Lambda execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role created for the three Lambda functions. After selecting the identities, choose Attach policy.

Given the principle of least privilege, it is recommended to create 3 different roles restricting a function’s access to the needed resources only. 

Repeat the following step 3 times to create 3 new Lambda functions. Differences between the 3 Lambda functions are: (1) code and (2) triggers:

  • Open the Lambda console.
  • Choose Create function.
  • Configure the following settings:
    • Name
      • start-statemachine-execution-lambda
      • retrieve-rds-instance-state-lambda
      • stop-rds-instance-lambda
    • Runtime – Python 3.8.
    • Role – Choose an existing role.
    • Existing role – rds-auto-restart-lambda-role.
    • Choose Create function.
    • To configure a test event, choose Test.
    • For Event name, enter test.
  • Choose Create.
  • For the Lambda function —  start-statemachine-execution-lambda, use the following Python 3.8 sample code:
import json
import boto3
import logging
import os

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):

    #log input event
    LOGGER.info("RdsAutoRestart Event Received, now checking if event is eligible. Event Details ==> ", event)

    #Input event from the SNS topic originated from RDS event notifications
    snsMessage = json.loads(event['Records'][0]['Sns']['Message'])
    rdsInstanceId = snsMessage['Source ID']
    stepFunctionInput = {"rdsInstanceId": rdsInstanceId}
    rdsEventId = snsMessage['Event ID']

    #Retrieve RDS instance ARN
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceArn = db_instance['DBInstanceArn']

    # Filter on the Auto Restart RDS Event. Event code: RDS-EVENT-0154. 

    if 'RDS-EVENT-0154' in rdsEventId:

        #log input event
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        #Verify that instance is tagged with auto-restart-protection tag. The tag is used to classify instances that are required to be terminated once started. 

        tagCheckPass = 'false'
        rdsInstanceTags = rdsClient.list_tags_for_resource(ResourceName=rdsInstanceArn)
        for rdsInstanceTag in rdsInstanceTags["TagList"]:
            if 'auto-restart-protection' in rdsInstanceTag["Key"]:
                if 'yes' in rdsInstanceTag["Value"]:
                    tagCheckPass = 'true'
                    #log instance tags
                    LOGGER.info("RdsAutoRestart verified that the instance is tagged auto-restart-protection = yes, now starting the Step Functions Flow")
                else:
                    tagCheckPass = 'false'


        #log instance tags
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        if 'true' in tagCheckPass:

            #Initialise StepFunctions Client
            stepFunctionsClient = boto3.client('stepfunctions')

            # Start StepFunctions WorkFlow
            # StepFunctionsArn is stored in an environment variable
            stepFunctionsArn = os.environ['STEPFUNCTION_ARN']
            stepFunctionsResponse = stepFunctionsClient.start_execution(
            stateMachineArn= stepFunctionsArn,
            name=event['Records'][0]['Sns']['MessageId'],
            input= json.dumps(stepFunctionInput)

        )

    else:

        LOGGER.info("RdsAutoRestart Event detected, and event is not eligible")

    return {
            'statusCode': 200
        }

And then, configure an SNS source trigger for the function start-statemachine-execution-lambda. RDS event notifications will be published to this SNS topic:

  • In the Designer pane, choose Add trigger.
  • In the Trigger configurations pane, select SNS as a trigger.
  • For SNS topic, choose the SNS topic previously created (rds-event-notifications-topic)
  • For Enable trigger, keep it checked.
  • Choose Add.
  • Choose Save.

For the Lambda function — retrieve-rds-instance-state-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    

    #log input event
    LOGGER.info(event)
    
    #rdsInstanceId is passed as input to the lambda function from the AWS StepFunctions state machine.  
    rdsInstanceId = event['rdsInstanceId']
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceState = db_instance['DBInstanceStatus']
    return {
        'statusCode': 200,
        'rdsInstanceState': rdsInstanceState,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

For the Lambda function, stop-rds-instance-lambda, use the following Python 3.8 sample code:

import json
import logging
import boto3

#Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

#Initialise Boto3 for RDS
rdsClient = boto3.client('rds')


def lambda_handler(event, context):
    
    #log input event
    LOGGER.info(event)
    
    rdsInstanceId = event['rdsInstanceId']
    
    #Stop RDS instance
    rdsClient.stop_db_instance(DBInstanceIdentifier=rdsInstanceId)
    
    #Tagging
    
    
    return {
        'statusCode': 200,
        'rdsInstanceId': rdsInstanceId
    }

Choose Save.

Create a Step Function

AWS Step Functions will execute the following service logic:

  1. Retrieve RDS instance state by calling Lambda function, retrieve-rds-instance-state-lambda. The Lambda function then returns the parameter, rdsInstanceState.
  2. If the rdsInstanceState parameter value is ‘available’, then the state machine will step into the next action calling the Lambda function, stop-rds-instance-lambda. If the rdsInstanceState is not ‘available’, the state machine will then wait for 5 minutes and then re-check the RDS instance state again.
  3. Stopping an RDS instance is an asynchronous operation and accordingly the state machine will keep polling the instance state once every 5 minutes until the rdsInstanceState parameter value becomes ‘stopped’. Only then, the state machine execution will complete successfully.

  • An RDS instance path to ‘available’ state may vary depending on the various maintenance activities scheduled for the instance.
  • Once the RDS notification event is generated, the instance will go through multiple states till it becomes ‘available’.
  • The use of the 5 minutes timer is to make sure that the automation flow will keep attempting to stop the instance once it becomes available.
  • The second part will make sure that the flow doesn’t end till the instance status is changed to ‘stopped’ and hence notifying the system administrator.

To create an AWS Step Functions state machine

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In the navigation pane, choose State machines.
  • In the State machines pane, choose Create state machine.
  • On the Define state machine page, choose Author with code snippets. For Type, choose Standard.
  • Enter a Name for your state machine, stop-rds-instance-statemachine.
  • In the State machine definition pane, add the following state machine definition using the ARNs of the two Lambda function created earlier, as shown in the following code sample:
{
  "Comment": "stop-rds-instance-statemachine: Automatically shutting down RDS instance after a forced Auto-Restart",
  "StartAt": "retrieveRdsInstanceState",
  "States": {
    "retrieveRdsInstanceState": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceAvailable"
    },
    "isInstanceAvailable": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "available",
          "Next": "stopRdsInstance"
        }
      ],
      "Default": "waitFiveMinutes"
    },
    "waitFiveMinutes": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRdsInstanceState"
    },
    "stopRdsInstance": {
      "Type": "Task",
      "Resource": "stop-rds-instance-lambda Arn",
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "retrieveRDSInstanceStateStopping": {
      "Type": "Task",
      "Resource": "retrieve-rds-instance-state-lambda Arn",
      "Next": "isInstanceStopped"
    },
    "isInstanceStopped": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.rdsInstanceState",
          "StringEquals": "stopped",
          "Next": "notifyDatabaseAdmin"
        }
      ],
      "Default": "waitFiveMinutesStopping"
    },
    "waitFiveMinutesStopping": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "retrieveRDSInstanceStateStopping"
    },
    "notifyDatabaseAdmin": {
      "Type": "Pass",
      "Result": "World",
      "End": true
    }
  }
}

This is a definition of the state machine written in Amazon States Language which is used to describe the execution flow of an AWS Step Function.

Choose Next.

  • In the Name pane, enter a name for your state machine, stop-rds-instance-statemachine.
  • In the Permissions pane, choose Create new role. Take note of the the new role’s name displayed at the bottom of the page (example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd).
  • Choose Create state machine
  • By default, the created role only grants the state machine access to CloudWatch logs. Since the state machine will have to make Lambda calls, then another IAM policy has to be associated with the new role.

Use the JSON policy editor to create a policy

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane on the left, choose Policies.
  • Choose Create policy.
  • Choose the JSON tab.
  • Paste the following JSON policy document:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": "lambda:InvokeFunction",
"Resource": "*"
}
]
}
  • When you are finished, choose Review policy. The Policy Validator reports any syntax errors.
  • On the Review policy page, type a Name rds-auto-restart-stepfunctions-policy and a Description (optional) for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy.
  • Choose Create policy to save your work.

To link the new policy to the AWS Step Functions execution role

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose Policies.
  • In the list of policies, select the check box next to the name of the policy to attach. You can use the Filter menu and the search box to filter the list of policies.
  • Choose Policy actions, and then choose Attach.
  • Select the IAM role create for the state machine (example, StepFunctions-stop-rds-instance-statemachine-role-231ffecd). After selecting the identities, choose Attach policy.

 

Testing the architecture

In order to test the architecture, create a test RDS instance, tag it with auto-restart-protection tag and set the tag value to yes. While the RDS instance is still in creation process, test the Lambda function —  start-statemachine-execution-lambda with a sample event that simulates that the instance was started as it exceeded the maximum time to remain stopped (RDS-EVENT-0154).

To invoke a function

  • Sign in to the AWS Management Console and open the Lambda console.
  • In navigation pane, choose Functions.
  • In Functions pane, choose start-statemachine-execution-lambda.
  • In the upper right corner, choose Test.
  • In the Configure test event page, choose Create new test event and in Event template, leave the default Hello World option.
    {
    "Records": [
        {
        "EventSource": "aws:sns",
        "EventVersion": "1.0",
        "EventSubscriptionArn": "<RDS Event Subscription Arn>",
        "Sns": {
            "Type": "Notification",
            "MessageId": "10001-2d55da-9a73-5e42d46748c0",
            "TopicArn": "<SNS Topic Arn>",
            "Subject": "RDS Notification Message",
            "Message": "{\"Event Source\":\"db-instance\",\"Event Time\":\"2020-07-09 15:15:03.031\",\"Identifier Link\":\"https://console.aws.amazon.com/rds/home?region=<region>#dbinstance:id=<RDS instance id>\",\"Source ID\":\"<RDS instance id>\",\"Event ID\":\"http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154\",\"Event Message\":\"DB instance started\"}",
            "Timestamp": "2020-07-09T15:15:03.991Z",
            "SignatureVersion": "1",
            "Signature": "YsuM+L6N8rk+pBPBWoWeRcSuYqo/BN5v9D2lyoSg0B0uS46Q8NZZSoZWaIQi25TXfHY3RYXCXF9WbVGXiWa4dJs2Mjg46anM+2j6z9R7BDz0vt25qCrCyWhmWtc7yeETrlwa0jCtR/wxXFFexRwynqlZeDfvQpf/x+KNLrnJlT61WZ2FMTHYs124RwWU8NY3pm1Os0XOIvm8rfv3ywm1ccZfP4rF7Lfn+2EK6a0635Z/5aiyIlldNZxbgRYTODJYroO9INTlF7NPzVV1Y/K0E9aaL/wQgLZNquXQGCAxPFWy5lxJKeyUocOWcG48KJGIBUC36JJaqVdIilbZ9HvxTg==",
            "SigningCertUrl": "https://sns.<region>.amazonaws.com/SimpleNotificationService-a86cb10b4e1f29c941702d737128f7b6.pem",
            "UnsubscribeUrl": "https://sns.<region>.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=<arn>",
            "MessageAttributes": {}
        }
        }
    ]
    }
start-statemachine-execution-lambda uses the SNS MessageId parameter as name for the AWS Step Functions execution. The name field is unique for a certain period of time, accordingly, with every test run the MessageId parameter value must be changed. 
  • Choose Create and then choose Test. Each user can create up to 10 test events per function. Those test events are not available to other users.
  • AWS Lambda executes your function on your behalf. The handler in your Lambda function receives and then processes the sample event.
  • Upon successful execution, view results in the console.
  • The Execution result section shows the execution status as succeeded and also shows the function execution results, returned by the return statement. Following is a sample response of the test execution:

Now, verify the execution of the AWS Step Functions state machine:

  • Sign in to the AWS Management Console and open the Amazon RDS console.
  • In navigation pane, choose State machines.
  • In the State machine pane, choose stop-rds-instance-statemachine.
  • In the Executions pane, choose the execution with the Name value passed in the test event MessageId parameter.
  • In the Visual workflow pane, the real-time execution status is displayed:

  • Under the Step details tab, all details related to inputs, outputs and exceptions are displayed:

Monitoring

It is recommended to use Amazon CloudWatch to monitor all the components in this architecture. You can use AWS Step Functions to log the state of the execution, inputs and outputs of each step in the flow. So when things go wrong, you can diagnose and debug problems quickly.

Cost

When you build the architecture using serverless components, you pay for what you use with no upfront infrastructure costs. Cost will depend on the number of RDS instances tagged to be protected against an automatic start.

Architectural considerations

This architecture has to be deployed per AWS Account per Region.

Conclusion

The blog post demonstrated how to build a fully serverless architecture that monitors and stops RDS instances restarted by AWS. This helps to avoid falling behind on any required maintenance updates. This architecture helps you save cost incurred by started instances’ running hours and licensing implications.  Feel free to submit enhancements to the GitHub repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers

Implementing multi-tenant patterns in Amazon Redshift using data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/implementing-multi-tenant-patterns-in-amazon-redshift-using-data-sharing/

Software service providers offer subscription-based analytics capabilities in the cloud with Analytics as a Service (AaaS), and increasingly customers are turning to AaaS for business insights. A multi-tenant storage strategy allows the service providers to build a cost-effective architecture to meet increasing demand.

Multi-tenancy means a single instance of software and its supporting infrastructure is shared to serve multiple customers. For example, a software service provider could generate data that is housed in a single data warehouse cluster, but accessed securely by multiple customers. This storage strategy offers an opportunity to centralize management of data, simplify ETL processes, and optimize costs. However, service providers have to constantly balance between cost and providing a better user experience for their customers.

With the new data sharing feature, you can use Amazon Redshift to scale and meet both objectives of managing costs by simplifying storage and ETL pipelines while still providing consistent performance to customers.  You can ingest data into a cluster designated as a producer cluster, and share this live data with one or more consumer clusters. Clusters accessing this shared data are isolated from each other, therefore performance of a producer cluster isn’t impacted by workloads on consumer clusters. This enables consuming clusters to get consistent performance based on individual compute capacity.

In this post, we focus on various AaaS patterns, and discuss how you can use data sharing in a multi-tenant architecture to scale for virtually unlimited users. We discuss detailed steps to use data sharing with different storage strategies.

Multi-tenant storage patterns

Multi-tenant storage patterns help simplify the architecture and long-term maintenance of the analytics platform. In a multi-tenant strategy, data is stored centrally in a single cluster for all tenants, enabling simplification of the ETL ingestion pipeline and data management. In the previously published whitepaper SaaS Storage Strategies, various models of storage and benefits are covered for a single cluster scenario.

The three strategies you can choose from are:

  • Pool model – Data is stored in a single database schema for all tenants, and a new column (tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using views built on the tables.
  • Bridge model – Storage and access to data for each tenant is controlled at individual schema level in the same database.
  • Silo model – Storage and access control to data for each tenant is maintained in separate databases

The following diagram illustrates the architecture of these multi-tenant storage strategies.

The following diagram illustrates the architecture of these multi-tenant storage strategies.

In the following sections, we will discuss how these multi-tenant strategies can be implemented using Amazon Redshift data sharing feature with a multi-cluster architecture.

Scaling your multi-tenant architecture using data sharing

AaaS providers implementing multi-tenant architectures were previously limited to resources of a single cluster to meet the compute and concurrency requirements of users across all the tenants. As the number of tenants increased, you could either turn on concurrency scaling or create additional clusters. However, the addition of new clusters means additional ingestion pipelines and increased operational overhead.

With data sharing in Amazon Redshift, you can easily and securely share data across clusters. Data ingested into the producer cluster is shared with one or more consumer clusters, which allows total separation of ETL and BI workloads. Several consumer clusters can read data from the managed storage of a producer cluster. This enables instant, granular, and high-performance access without data copies and movement. Workloads accessing shared data are isolated from each other and the producer. You can distribute workloads across multiple clusters while simplifying and consolidating the ETL ingestion pipeline into one main producer cluster, providing optimal price for performance.

Consumer clusters can in turn be producers for the data sets they own. Customers can optimize costs even further by collocating multiple tenants on the same consumer cluster. For instance, you can group low volume tier 3 tenants into a single consumer cluster to provider a lower cost offering, while high volume tier 1 tenants get their own isolated compute clusters. Consumer clusters can be created in the same account as producer or in a different AWS account. With this you can have separate billing for the consumer clusters, where you can chargeback to the business group that uses the consumer cluster or even allow your customers to use their own Redshift cluster in their account, so they pay for usage of the consumer cluster. The following diagram shows the difference in ETL and consumer access patterns in a multi-tenant architecture using data sharing versus a single cluster approach without data sharing.

Consumer clusters can in turn be producers for the data sets they own.

Multi-tenant architecture with data sharing compared to single cluster approach

Creating a multi-tenant architecture for an AaaS solution

For this post, we use a simple data model with a fact and a dimension table to demonstrate how to leverage data sharing to design a scalable multi-tenant AaaS solution. We cover detailed steps involved for each storage strategy using this data model. The tables are as follows:

  • Customer – dimension table containing customer details
  • Sales – fact table containing sales transactions

We use two Amazon Redshift ra3.4xl clusters, with 2 nodes each, and designate one cluster as producer and other as consumer.

The high-level steps involved in enabling data sharing across clusters are as follows:

  1. Create a data share in the producer cluster and assign database objects to the data share.
  2. From the producer cluster, grant usage on the data share to consumer clusters, identified by namespace or AWS account.
  3. From the consumer cluster, create an external database using the data share from the producer
  4. Query the tables in the data share through the external shared database in the consumer cluster. Grant access to other users to access this shared database and objects.

Creating producer and consumer Amazon Redshift clusters

Let us start by creating two Amazon Redshift ra3.4xl clusters with 2-nodes each, one for the producer and other for consumer.

  1. On the Amazon Redshift cluster, create two clusters of RA3 instance type, and name them ds-producer and ds-consumer-c1, respectively.
  1. Next, log in to Amazon Redshift using the query editor. You can also use a SQL client tool like DBeaver, SQL Workbench, or Aginity Workbench. For configuration information, see Connecting to an Amazon Redshift cluster using SQL client tools.

Get the cluster namespace of the producer and consumer clusters from the console. We will use the namespaces to create the tenant table and to create and access the data shares. You can also get the cluster namespaces by logging into each of the clusters and executing the SELECT CURRENT_NAMESPACE statement in the query editor.

Please note to replace the corresponding namespaces in the code sections wherever producercluster_namespace, consumercluster1_namespace, and consumercluster_namespace is referenced.

The following screenshot shows the namespace on the Amazon Redshift console.

Now that we have the clusters created, we will go through the detailed steps for the three models. First, we will cover the Pool model, followed by Bridge model and finally the Silo model.

Pool model

The pool model represents an all-in, multi-tenant model where all tenants share the same storage constructs and provides the most benefit in simplifying the AaaS solution.

With this model, data storage is centralized in one cluster database, and data is stored for all tenants in the same set of data models. To scope and control access to tenant data, we introduce a column (tenant_id) that serves as a unique identifier for each tenant.

Security management to prevent cross-tenant access is one of the main aspects to address with the pool model. We can implement row-level security and provide secure access to the data by creating database views and set application-level policies by creating groups with specific access and assigning users to the groups. The following diagram illustrates the pool model architecture.

The following diagram illustrates the pool model architecture.

To create a multi-tenant solution using the pool model, you create data shares for the pool model in the producer cluster, and share data with the consumer cluster. We provide more detail on these steps in the following sections.

Creating data shares for the pool model in the producer cluster

To create data shares for the pool model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and run the following script.

Note that we have a tenant table to store unique identifiers for each tenant or consumer (tenant).

We add a column (tenant_id) to the sales and customer tables to uniquely identify tenant data. This tenant_id references the tenant_id in the tenant table to uniquely identify the tenant and consumer records. See the following code:

/**********************************************/
/*       datasharing datasetup pool model     */
/**********************************************/

-- Create Schema
create schema sales;

CREATE TABLE IF NOT EXISTS sales.tenant ( 
t_tenantid int8 not null,
t_name varchar(50) not null,
t_namespace varchar(50),
t_account varchar(16)
)
DISTSTYLE AUTO
SORTKEY AUTO;

-- Create tables for multi-tenant sales schema
drop table sales.customer;
CREATE TABLE IF NOT EXISTS sales.customer(
  c_tenantid int8 not null,
  c_custid int8 not null,
  c_name varchar(25) not null,
  c_region varchar(40) not null,
  Primary Key(c_tenantid, c_custid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;

CREATE TABLE IF NOT EXISTS sales.sales (
  s_tenantid int8 not null,
  s_orderid int8 not null,
  s_custid int8 not null,
  s_totalprice numeric(12,2) not null,
  s_orderdate date not null,
  Primary Key(s_tenantid, s_orderid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;
  1. Set up the tenant table with the details for each consumer cluster, and ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use INSERT statements:
    -- Ingest data 
    insert into sales.tenant values
    (0, 'primary', '<producercluster_namespace>',''),
    (1, 'tenant1', '<consumercluster1_namespace>',''),
    (2, 'tenant2', '<consumercluster2_namespace>','');
    
    insert into sales.customer values
    (1, 1, 'Customer 1', 'NorthEast'),
    (1, 2, 'Customer 2', 'SouthEast'),
    (2, 1, 'Customer 3', 'NorthWest'),
    (2, 2, 'Customer 4', 'SouthEast');
    
    truncate table sales.sales;
    insert into sales.sales values
    (1, 1, 1, 2434.33, '2020-11-21'),
    (1, 2, 2, 54.90, '2020-5-5'),
    (1, 3, 2, 9678.99, '2020-3-8'),
    (2, 1, 2, 452.24, '2020-1-23'),
    (2, 2, 1, 76523.10, '2020-11-3'),
    (2, 3, 1, 6745.20, '2020-10-01');
    
    select count(*) from sales.tenant;
    select count(*) from sales.customer;
    select count(*) from sales.sales;

Securing data on the producer cluster by restricting access

In the pool model, no external user has direct access to underlying tables. All access is restricted using views.

  1. Create a view for each of the fact and dimension tables to include a condition to filter records from the consumer tenant’s namespace. In our example, we create v_customersales to combine sales fact and customer dimension tables with a restrictive filter for tenant.namespace = current_namespace. See the following code:
    /**********************************************/
    /* We will create late binding views          */
    /* but materialized views could also be used  */
    /**********************************************/
    
    create or replace view sales.v_customer as
    select * 
    from sales.customer c, sales.tenant t
    where c.c_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_sales as
    select * 
    from sales.sales s, sales.tenant t
    where s.s_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_customersales as 
    select c_tenantid, c_name, c_region, 
    	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
    	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
    	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
    	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
    	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
    	t.t_namespace
    from sales.tenant t, sales.customer c, sales.sales s
    where t.t_tenantid = c.c_tenantid 
    and c.c_tenantid = s.s_tenantid 
    and c.c_custid = s.s_custid 
    and t.t_namespace = current_namespace 
    WITH NO SCHEMA BINDING;
    
    select * from sales.v_customersales;
          
    

Now that we have database objects created in the producer cluster, we can share the data with the consumer clusters.

Sharing data with the consumer cluster

To share data with the consumer cluster, complete the following steps:

  1. Create a data share for the sales data:
    /***************************************************/
    /* Create Datashare and add objects to the share    */
    /****************************************************/
    CREATE DATASHARE salesshare;
    

  1. Enter the following code to alter the data share, add the sales schema to be shared with the consumer clusters, and add all tables in the sales schema to be shared with the consumer cluster:
    /************************************************************/
    /* Add objects at desired granularities: schemas, tables,   */
    /* views include materialized, and SQL UDFs                 */
    /************************************************************/
    ALTER DATASHARE salesshare ADD SCHEMA sales;  -- New addition to create SCHEMA first
    
    /*For pool model, we share only the views and not tables */
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customer;
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customersales;
    

For the pool model, we share only the views with the consumer cluster and not the tables. The ALTER statement ADD TABLE is used to add both views and tables.

  1. Grant usage on the sales data share to the namespace of the BI consumer cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster. See the following code:
    /********************************************************************/
    /* Grant access to consumer clusters                                */
    /* login to Consumer BI Cluster and get the Namespace from          */
    /* the Redshift console or using SELECT CURRENT_NAMESPACE           */
    /********************************************************************/
    SELECT CURRENT_NAMESPACE;
    
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE salesshare TO NAMESPACE '<consumercluster1_namespace>'
    --Account numbers are 12 digit long
    GRANT USAGE ON DATASHARE salesshare TO ACCOUNT 'Consumer_AWSAccount';
    

  1. View data shares that are shared from the producer cluster:
    SELECT * FROM SVV_DATASHARES;

The following screenshot shows the output.

The following screenshot shows the output.

You can also see the data shares and their detailed objects and consumers using the following commands:

SHOW DATASHARES;
DESC DATASHARE salesshare;
select * from SVV_DATASHARE_OBJECTS;
select * from SVV_DATASHARE_CONSUMERS;

Viewing and querying data shares for the pool model from the consumer cluster

To view and query data shares from the consumer cluster, complete the following steps:

  1. Log in to the consumer cluster as an admin user and view the data share objects:
    /**********************************************************/
    /* Login to Consumer cluster as awsuser:                  */
    /* View datashares and create local database for querying */
    /**********************************************************/
    select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the results.

The following screenshot shows the results.

  1. Create a new database from the data share of the producer cluster:
    /**********************************************************/
    /* Create a local database and schema reference           */
    /**********************************************************/
    CREATE DATABASE sales_db FROM DATASHARE salesshare
    OF NAMESPACE '<producercluster_namespace>';

  1. Optionally, you can create an external schema in the consumer cluster pointing to the schema in the database of the producer cluster.

Creating a local external schema in the consumer cluster allows schema-level access controls within the consumer cluster, and uses a two-part notation when referencing shared data objects (localschema.table; vs. external_db.producerschema.table). See the following code:

/*********************************************************/
/* Create External Schema - Optional                     */
/* reason for schema: give specific access to schema     */
/* using shared alias get access to a secondary database */
/*********************************************************/
CREATE EXTERNAL SCHEMA sales_schema 
FROM REDSHIFT DATABASE 'sales_db' SCHEMA 'sales';
  1. Now you can query the shared data from the producer cluster by using the syntax tenant.schema.table:
    select * from sales_db.sales.customer;
    select * from sales_db.sales.v_customersales;

  1. From the tenant1 consumer cluster, you can view the databases and the tenants that are accessible to tenant1. tenant1_schema is as follows:
    select * from SVV_REDSHIFT_DATABASES;
    

The following screenshot shows the results.

The following screenshot shows the results.

Creating local consumer users and controlling access

You can control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create tenant1_group, grant usage on the local database sales_db and schema sales_schema to the group, and assign the user tenant1_user to the tenant1_group:
    /********************************************************/
    /* Consumers can create own users and assign privileges */
    /* Create tenant1_group and assign privileges to read   */
    /* sales_db and the sales_schema                        */
    /* Create tenant1_user in tenant1_group                 */
    /********************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE sales_db TO tenant1_group;
    GRANT USAGE ON SCHEMA sales_schema TO GROUP tenant1_group;
    

  1. Now, login as tenant1_user to consumer cluster 1 and select data from the views v_customer and v_customersales:
    /*******************************************************/
    /* select from view returns only sales records related */
    /* to Consumer A namespace                             */
    /*******************************************************/
    select * from sales_db.sales.v_customer;
    select * from sales_db.sales.v_customersales;
    

You should see only the data relevant to tenant 1 and not the data that is associated with tenant 2.

You should see only the data relevant to tenant 1 and not the data that is associated with tenant 2.

 

Create Materialized views to optimize performance

Consumer clusters can have their own database objects which are local to the consumer. You can also create materialized views on the datashare objects and control when to refresh the dataset for your consumers. This provides another level of isolation from the producer cluster, and will ensure the consumer clusters go against their local dataset.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create a materialized view for customersales. This will create a local view that can be periodically refreshed from the consumer cluster.

 

/*******************************************************/
/* Create materialized view in consumer cluster        */
/*******************************************************/
create MATERIALIZED view tenant1_sales.mv_customersales as 
select c_tenantid, c_name, c_region, 
	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
	t.t_namespace
from sales_db.tenant t, sales_db.customer c, sales_db.sales s
where t.t_tenantid = c.c_tenantid 
and c.c_tenantid = s.s_tenantid 
and c.c_custid = s.s_custid 
and t.t_namespace = current_namespace;

select * from tenant1_sales.mv_customersales top 100;

REFRESH MATERIALIZED VIEW tenant1_sales.mv_customersales;

 

With the preceding steps, we have demonstrated how you can control access to the tenant data in the same datastore using views. We also reviewed how data shares help efficiently share data between producer and consumer clusters with transaction consistency. We also saw how a local materialized view can be created to further isolate your BI workloads for your customers and provide a consistent, performant user experience. In the next section we will discuss the Bridge model.

Bridge model

In the bridge model, data for each tenant is stored in its own schema in a database and contains a similar set of tables. Data shares are created for each schema and shared with the corresponded consumer. This is an appealing balance between silo and pool model, providing both data isolation and ETL consolidation. With Amazon Redshift, you can create up to 9,900 schemas. For more information, see Quotas and limits in Amazon Redshift.

With data sharing, separate consumer clusters can be provisioned to use the same managed storage from producer cluster. Consumer clusters have all the capabilities of a producer cluster, and can in turn be producer clusters for data objects they own. Consumers can’t share data that is already shared with them. Without data sharing, queries from all customers are directed to a single cluster. The following diagram illustrates the bridge model.

The following diagram illustrates the bridge model.

To create a multi-tenant architecture using bridge model, complete the steps in the following sections.

Creating database schemas and tables for the bridge model in the producer cluster

As we did in the pool model, the first step is to create the database schema and tables. We log in to the producer cluster as an admin user and create separate schemas for each tenant. For our post, we create two schemas, tenant1 and tenant2, to store data for two tenants.

  1. Log in to the producer cluster as the admin user.
  1. Use the script below to create two schemas, tenant1 and tenant2, and create tables for customer dimension and sales facts under each of the two schemas. See the following code:
    /****************************************/
    /* Bridge -  Data Model */
    /****************************************/
    -- Create schemas tenant1 and tenant2
    create schema tenant1;
    create schema tenant2;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
      
    
    -- Create tables for tenant2
    CREATE TABLE IF NOT EXISTS tenant2.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant2.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    

  1. Ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use the INSERT statement:
    -- ingest data for tenant1
    -- ingest customer data
    insert into tenant1.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    -- ingest sales data
    insert into tenant1.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    
    select count(*) from tenant1.customer;
    select count(*) from tenant1.sales;
    
    
    -- ingest data for tenant2
    -- ingest customer data
    insert into tenant2.customer values
    (1, 'Customer 3', 'NorthWest'),
    (2, 'Customer 4', 'SouthEast');
    
    -- ingest sales data
    truncate table tenant2.sales;
    insert into tenant2.sales values
    (1, 2, 452.24, '2020-1-23'),
    (2, 1, 76523.10, '2020-11-3'),
    (3, 1, 6745.20, '2020-10-01');
    
    
    select count(*) from tenant2.customer;
    select count(*) from tenant2.sales;
    

Creating data shares and granting usage to the consumer cluster

In the following code, we create two data shares, tenant1share and tenant2share, to share the database objects under the two schemas to the respective consumer clusters.

  1. Create two datashares tenant1share and tenant2share to share the database objects under the two schemas to the respective consumer clusters.
    /******************************************************************/
    /*   Create Datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1share;
    CREATE DATASHARE tenant2share;

  1. Alter the datashare and add the schema(s) for respective tenants to be shared with the consumer cluster
    ALTER DATASHARE tenant1share ADD SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD SCHEMA tenant2;

  1. Alter the datashare and add all tables in the schema(s) to be shared with the consumer cluster
    ALTER DATASHARE tenant1share ADD ALL TABLES IN SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD ALL TABLES IN SCHEMA tenant2;

Getting the namespace of the first consumer cluster

  1. Log in to the consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant1 schema for Tenant1 BI Cluster */
    /* login to tenant1 BI Cluster and get the Namespace 
     * or get the Namespace from the Redshift console */
    SELECT CURRENT_NAMESPACE;

 

  1. Grant usage on the data share for the first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    -- Grant usage on the datashare to the first consumer cluster
    -- Namespace refers to the namespace GUID of the consumer cluster 
    GRANT USAGE ON DATASHARE tenant1share TO NAMESPACE '<consumercluster1_namespace>'
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant1share TO ACCOUNT '<Consumer_AWSAccount>';

Getting the namespace of the second consumer cluster

  1. Log in to the second consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant2 schema for Tenant2 BI Cluster */
    /*login to tenant1 BI Cluster and get the Namespace      */
    SELECT CURRENT_NAMESPACE;
    

  1. Grant usage on the data share for the second tenant to the namespace of the second consumer cluster you just got from the previous step:
    -- Grant usage on the datashare to the second consumer cluster
    GRANT USAGE ON DATASHARE tenant2share TO NAMESPACE '<consumercluster2_namespace>'
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant2share TO ACCOUNT '<Consumer_AWSAccount>';

  1. To view data shares from the producer cluster, enter the following code:
    /******************************************************************/
    /*   View Datashares created, Datashare objects and consumers     */
    /******************************************************************/
    select * from SVV_DATASHARES;
    select * from SVV_DATASHARE_OBJECTS;
    select * from SVV_DATASHARE_CONSUMERS;

The following screenshot shows the commands in the query editor.

The following screenshot shows the commands in the query editor.

The following screenshot shows the query results.

The following screenshot shows the query results.

Accessing data using the consumer cluster from the data share

To access data using the consumer cluster, complete the following steps:

  1. Log in to the first consumer cluster ds-consumer-c1, as an admin user.
  1. View data share objects from the SVV_DATASHARE_OBJECTS system view:
    /*****************************************************/
    /* Consumer cluster as adminuser
    /* List the shares available and review contents for each 
    /********************************************************/
    -- You can view datashare objects associated with the cluster
    -- using either of the two commands
    SHOW DATASHARES;
    select * from SVV_DATASHARES;

The following screenshot shows the query results.

The following screenshot shows the query results.

--View objects shared in inbound share for consumer
select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the query results.

The following screenshot shows the query results.

--View namespace or clusters granted usage to a datashare
select * from svv_datashare_consumers;
  1. Create a local database in the first consumer cluster, and an external schema to be able to provide controlled access to the specific schema to the consumer clusters:
    /*******************************************************/
    /* Create a local database and schema reference        */
    /* to the share objects                                */
    /*******************************************************/
    CREATE DATABASE tenant1_db FROM DATASHARE tenant1share
    OF NAMESPACE '<producercluster_namespace>';

  1. Query the database tables using the three-part notation db.tenant.table:
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;

  1. Optionally, you can create an external schema.

There are two reasons to create an external schema: either to enable two-part notation access to the tables from the consumer cluster, or to provide restricted access to the specific schemas for selected users, when multiple schemas are shared from the producer cluster. See the following code for our external schema:

/* Create External Schema */
CREATE EXTERNAL SCHEMA tenant1_schema FROM REDSHIFT DATABASE 'tenant1_db' SCHEMA 'tenant1';
  1. If you created the local schemas, you can use the following two-part notation to query the database tables:
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;

  1. You can view the shared databases by querying the SVV_REDSHIFT_DATABASES table:
    select * from SVV_REDSHIFT_DATABASES;

The following screenshot shows the query results.

The following screenshot shows the query results.

Creating consumer users for managing access

Still logged in as an admin user to the consumer cluster, you can create other users who have access to the database objects.

  1. Create users and groups, and assign users and object privileges to the groups with the following code:
    /*******************************************************/
    /* Consumer can create own users and assign privileges */
    /* Create tenant1_user and assign privileges to        */
    /* read datashare from tenant1 schema                  */
    /*******************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE tenant1_db TO tenant1_user;
    GRANT USAGE ON SCHEMA tenant1_schema TO GROUP tenant1_group;
    

Now tenant1_user can log in and query the shared tables from tenant schema.

  1. Log in to the consumer cluster as tenant1_user and query the tables:
    /************************************************************/
    /* Consumer cluster as tenant1_user - As a consumer cluster */
    /* administrator, you must follow these steps:              */
    /************************************************************/
    
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;
    
    
    /************************************************************/
    /* If you have created and External Schema                  */
    /* you can use the two-part notation to query tables.       */
    /************************************************************/
    
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;
    

Revoking access to a data share (optional)

  1. At any point, if you want to revoke access to the data share, you can the REVOKE USAGE command:
    /*************************************************************/
    /* To revoke access at any time use the REVOKE USAGE command */
    /*************************************************************/
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    REVOKE USAGE ON DATASHARE Salesshare FROM NAMESPACE '<consumercluster1_namespace>'
    --Account numbers are 12 digit long
    REVOKE USAGE ON DATASHARE Salesshare FROM ACCOUNT '<Consumer_AWSAccount>';
    

Silo model

The third option is to store data for each tenant in separate databases within a cluster. If you need your data isolated from other tenants, you can use the silo model and each database may have distinct data models, monitoring, management, and security footprints.

Amazon Redshift supports cross-database queries across databases, which allow you to simplify data organization. You can store common or granular datasets used across all tenants in a centralized database, and use the cross-database query capability to join relevant data for each tenant.

The steps to create a data share in a silo model is similar to a bridge model; however, unlike a bridge model (where data share is for each schema), the silo model has a data share created for each database. The following diagram illustrates the architecture of the silo model.

The following diagram illustrates the architecture of the silo model.

Creating data shares for the silo model in the producer cluster

To create data shares for the silo model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and create separate databases for each tenant:
    /*****************************************************/
    /** Silo Model – Create databases for the 2 tenants **/
    /*****************************************************/
    create database tenant1_silodb;
    
    create database tenant2_silodb;
    

  1. Log in again to the producer cluster with the database name and user ID for the database that you want to share (tenant1_silodb) and create the schema and tables:
    /***********************************************************/
    /* login to tenant1_db and create schema tenant1 and tables*/
    /***********************************************************/
    create schema tenant1_siloschema;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
    insert into tenant1_siloschema.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    truncate table tenant1_siloschema.sales;
    insert into tenant1.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    

  1. Create a data share with a name for the first tenant (for example, tenant1dbshare):
    /******************************************************************/
    /*   Create datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1_silodbshare;
    

  1. Run Alter datashare commands to add the schemas to be shared with the consumer cluster and add all tables in the schemas to be shared with the consumer cluster:
    ALTER DATASHARE tenant1_silodbshare ADD SCHEMA tenant1_siloschema;
    ALTER DATASHARE tenant1_silodbshare ADD ALL TABLES IN SCHEMA tenant1_siloschema;

  1. Grant usage on the data share for first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE tenant1_silodbshare TO NAMESPACE ‘<consumercluster1_namespace>’
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant1_silodbshare TO ACCOUNT '<AWS-Account>';
    

Viewing and querying data shares for the silo model from the consumer cluster

To view and query your data shares, complete the following steps:

  1. Log in to the consumer cluster as an admin user.
  2. Create a new database from the data share of the producer cluster:
    CREATE DATABASE tenant1_silodb FROM DATASHARE tenant1_silodbshare
    OF NAMESPACE ‘<producercluster_namespace>’;

Now you can start querying the shared data from the producer cluster by using the syntax – tenant.schema.table. If you created an external schema, then you can also use the two-part notation to query the tables.

  1. Query the data with the following code:
    select * from tenant1_silodb.tenant1.customer;
    select * from tenant1_silodb.tenant1.sales;
    

  1. Optionally, you can create an external schema pointing to the schema in the database of the producer cluster. This allows you query shared tables using a two-part notation. See the following code:
    CREATE EXTERNAL SCHEMA tenant1_siloschema FROM REDSHIFT DATABASE 'tenant1_silodb' SCHEMA 'tenant1';
    
    --With this Schema, you can access using two-part notation to select from data share tables
    select * from tenant1_siloschema.customer;
    select * from tenant1_siloschema.sales;
    

  1. You can repeat the same steps for tenant2 to share the tenant2 database with tenant2 You can also control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.

System views to view data shares

We have introduced new system tables and views to easily identify the data shares and related objects. You can use three different groups of system views to view the data share objects:

  • Views starting with SVV_DATASHARES – has detail of datashares and objects in a datashare.
View Name Purpose
SVV_DATASHARES View a list of data shares created on the cluster and data shares shared with the cluster
SVV_DATASHARE_OBJECTS View a list of objects in all data shares created on the cluster or shared with the cluster
SVV_DATASHARE_CONSUMERS View a list of consumers for data share created on the cluster
  • Views starting with SVV_REDSHIFT – contains details on both local and remote Redshift databases.
View Name Purpose
SVV_REDSHIFT_DATABASES List of all databases that a user has access to
SVV_REDSHIFT_SCHEMAS List of all schemas that user has access to
SVV_REDSHIFT_TABLES List of all tables that a user has access to
SVV_REDSHIFT_COLUMNS List of all columns that a user has access to
SVV_REDSHIFT_FUNCTIONS List of all functions that user has access to
  • Views starting with SVV_ALL– contain local and remote databases, external schemas including  spectrum and federated query, and external schema references to shared data.  If you create external schemas in consumer cluster, you need to use the SVV_ALL views to look at the objects.
View Name Purpose
SVV_ ALL _SCHEMAS Union of list of all schemas from SVV_REDSHIFT_SCHEMA view and consolidated list of all external tables and schemas that user has access to
SVV_ ALL _TABLES List of all tables that a user has access to
SVV_ ALL _COLUMNS List of all columns that a user has access to
SVV_ ALL _FUNCTIONS List of all functions that user has access to

Considerations for choosing a storage strategy

You can adopt a storage strategy or choose a hybrid approach based on business, technical, and operational requirements. Before deciding on a strategy, consider the quotas and limits for various objects in Amazon Redshift, and the number of databases per cluster or number of schemas per database to check if it meets your requirements. The following table summarizes these considerations.

  Pool Bridge Silo
Separation of tenant data Views Schema Database
ETL pipeline complexity Low Low Medium
Limits 100,000 tables (RA3 – 4x, 16x large clusters) 9,900 schemas per database 60 databases per cluster
Chargeback to consumer accounts Yes Yes Yes
Scalability High High High

Conclusion

In this post, we discussed how you can use the new data sharing feature of Amazon Redshift to implement an AaaS solution with a multi-tenant architecture while meeting SLAs for consumers using separate Amazon Redshift clusters. We demonstrated three types of models providing various levels of isolation for the tenant data. We compared and contrasted the models and provided guidance on when to choose an implementation model. We encourage you to try the data sharing feature to build your AaaS or software as a service (SaaS) solutions.


About the Authors

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable Analytic solutions.

 

 

 

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned Product Management and GTM leader, bringing over 20 years of experience in product vision, strategy and leadership roles in data products and platforms. Neeraja delivered products in analytics, databases, data Integration, application integration, AI/Machine Learning, large scale distributed systems across On-Premise and Cloud, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica and Expedia.com.

 

Jeetesh Srivastva is a Sr. Analytics specialist solutions architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions leveraging Redshift and other AWS Analytic services. He has worked to deliver on premises and cloud based analytic solutions for customers in banking & finance and hospitality industry verticals.

Updating opt-in status for Amazon Pinpoint channels

Post Syndicated from Varinder Dhanota original https://aws.amazon.com/blogs/messaging-and-targeting/updating-opt-in-status-for-amazon-pinpoint-channels/

In many real-world scenarios, customers are using home-grown or 3rd party systems to manage their campaign related information. This includes user preferences, segmentation, targeting, interactions, and more. To create customer-centric engagement experiences with such existing systems, migrating or integrating into Amazon Pinpoint is needed. Luckily, many AWS services and mechanisms can help to streamline this integration in a resilient and cost-effective way.

In this blog post, we demonstrate a sample solution that captures changes from an on-premises application’s database by utilizing AWS Integration and Transfer Services and updates Amazon Pinpoint in real-time.

If you are looking for a serverless, mobile-optimized preference center allowing end users to manage their Pinpoint communication preferences and attributes, you can also check the Amazon Pinpoint Preference Center.

Architecture

Architecture

In this scenario, users’ SMS opt-in/opt-out preferences are managed by a home-grown customer application. Users interact with the application over its web interface. The application, saves the customer preferences on a MySQL database.

This solution’s flow of events is triggered with a change (insert / update / delete) happening in the database. The change event is then captured by AWS Database Migration Service (DMS) that is configured with an ongoing replication task. This task continuously monitors a specified database and forwards the change event to an Amazon Kinesis Data Streams stream. Raw events that are buffered in this stream are polled by an AWS Lambda function. This function transforms the event, and makes it ready to be passed to Amazon Pinpoint API. This API call will in turn, change the opt-in/opt-out subscription status of the channel for that user.

Ongoing replication tasks are created against multiple types of database engines, including Oracle, MS-SQL, Postgres, and more. In this blog post, we use a MySQL based RDS instance to demonstrate this architecture. The instance will have a database we name pinpoint_demo and one table we name optin_status. In this sample, we assume the table is holding details about a user and their opt-in preference for SMS messages.

userid phone optin lastupdate
user1 +12341111111 1 1593867404
user2 +12341111112 1 1593867404
user2 +12341111113 1 1593867404

Prerequisites

  1. AWS CLI is configured with an active AWS account and appropriate access.
  2. You have an understanding of Amazon Pinpoint concepts. You will be using Amazon Pinpoint to create a segment, populate endpoints, and validate phone numbers. For more details, see the Amazon Pinpoint product page and documentation.

Setup

First, you clone the repository that contains a stack of templates to your local environment. Make sure you have configured your AWS CLI with AWS credentials. Follow the steps below to deploy the CloudFormation stack:

  1. Clone the git repository containing the CloudFormation templates:
    git clone https://github.com/aws-samples/amazon-pinpoint-rds-integration.git
    cd amazon-pinpoint-rds-integration
  2. You need an S3 Bucket to hold the template:
    aws s3 create-bucket –bucket <YOUR-BUCKET-NAME>
  3. Run the following command to package the CloudFormation templates:
    aws cloudformation package --template-file template_stack.yaml --output-template-file template_out.yaml --s3-bucket <YOUR-BUCKET-NAME>
  4. Deploy the stack with the following command:
    aws cloudformation deploy --template-file template_out.yaml --stack-name pinpointblogstack --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM

The AWS CloudFormation stack will create and configure resources for you. Some of the resources it will create are:

  • Amazon RDS instance with MySQL
  • AWS Database Migration Service replication instance
  • AWS Database Migration Service source endpoint for MySQL
  • AWS Database Migration Service target endpoint for Amazon Kinesis Data Streams
  • Amazon Kinesis Data Streams stream
  • AWS Lambda Function
  • Amazon Pinpoint Application
  • A Cloud9 environment as a bastion host

The deployment can take up to 15 minutes. You can track its progress in the CloudFormation console’s Events tab.

Populate RDS data

A CloudFormation stack will output the DNS address of an RDS endpoint and Cloud9 environment upon completion. The Cloud9 environment acts as a bastion host and allows you to reach the RDS instance endpoint deployed into the private subnet by CloudFormation.

  1. Open the AWS Console and navigate to the Cloud9 service.
    Cloud9Console
  2. Click on the Open IDE button to reach your IDE environment.
    Cloud9Env
  3. At the console pane of your IDE, type the following to login to your RDS instance. You can find the RDS Endpoint address at the outputs section of the CloudFormation stack. It is under the key name RDSInstanceEndpoint.
    mysql -h <YOUR_RDS_ENDPOINT> -uadmin -pmypassword
    use blog_db;
  4. Issue the following command to create a table that holds the user’s opt-in status:
    create table optin_status (
      userid varchar(50) not null,
      phone varchar(50) not null,
      optin tinyint default 1,
      lastupdate TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
  5. Next, load sample data into the table. The following inserts nine users for this demo:
    
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user1', '+12341111111', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user2', '+12341111112', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user3', '+12341111113', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user4', '+12341111114', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user5', '+12341111115', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user6', '+12341111116', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user7', '+12341111117', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user8', '+12341111118', 1);
    INSERT INTO optin_status (userid, phone, optin) VALUES ('user9', '+12341111119', 1);
  6. The table’s opt-in column holds the SMS opt-in status and phone number for a specific user.

Start the DMS Replication Task

Now that the environment is ready, you can start the DMS replication task and start watching the changes in this table.

  1. From the AWS DMS Console, go to the Database Migration Tasks section.
    DMSMigTask
  2. Select the Migration task named blogreplicationtask.
  3. From the Actions menu, click on Restart/Resume to start the migration task. Wait until the task’s Status transitions from Ready to Starting and Replication ongoing.
  4. At this point, all the changes on the source database are replicated into a Kinesis stream. Before introducing the AWS Lambda function that will be polling this stream, configure the Amazon Pinpoint application.

Inspect the AWS Lambda Function

An AWS Lambda function has been created to receive the events. The Lambda function uses Python and Boto3 to read the records delivered by Kinesis Data Streams. It then performs the update_endpoint API calls in order to add, update, or delete endpoints in the Amazon Pinpoint application.

Lambda code and configuration is accessible through the Lambda Functions Console. In order to inspect the Python code, click the Functions item on the left side. Select the function starting with pinpointblogstack-MainStack by clicking on the function name.

Note: The PINPOINT_APPID under the Environment variables section. This variable provides the Lambda function with the Amazon Pinpoint application ID to make the API call.

LambdaPPAPPID

Inspect Amazon Pinpoint Application in Amazon Pinpoint Console

A Pinpoint application is needed by the Lambda Function to update the endpoints. This application has been created with an SMS Channel by the CloudFormation template. Once the data from the RDS database has been imported into Pinpoint as SMS endpoints, you can validate this import by creating a segment in Pinpoint.

PinpointProject

Testing

With the Lambda function ready, you now test the whole solution.

  1. To initiate the end-to-end test, go to the Cloud9 terminal. Perform the following SQL statement on the optin_table:
    UPDATE optin_status SET optin=0 WHERE userid='user1';
    UPDATE optin_status SET optin=0 WHERE userid='user2';
    UPDATE optin_status SET optin=0 WHERE userid='user3';
    UPDATE optin_status SET optin=0 WHERE userid='user4';
  2. This statement will cause four changes in the database which is collected by DMS and passed to Kinesis Data Streams stream.
  3. This triggers the Lambda function that construct an update_endpoint API call to the Amazon Pinpoint application.
  4. The update_endpoint operation is an upsert operation. Therefore, if the endpoint does not exist on the Amazon Pinpoint application, it creates one. Otherwise, it updates the current endpoint.
  5. In the initial dataset, all the opt-in values are 1. Therefore, these endpoints will be created with an OptOut value of NONE in Amazon Pinpoint.
  6. All OptOut=NONE typed endpoints are considered as active endpoints. Therefore, they are available to be used within segments.

Create Amazon Pinpoint Segment

  1. In order to see these changes, go to the Pinpoint console. Click on PinpointBlogApp.
    PinpointConsole
  2. Click on Segments on the left side. Then click Create a segment.
    PinpointSegment
  3. For the segment name, enter US-Segment.
  4. Select Endpoint from the Filter dropdown.
  5. Under the Choose an endpoint attribute dropdown, select Country.
  6. For Choose values enter US.
    Note: As you do this, the right panel Segment estimate will refresh to show the number of endpoints eligible for this segment filter.
  7. Click Create segment at the bottom of the page.
    PinpointSegDetails
  8. Once the new segment is created, you are directed to the newly created segment with configuration details. You should see five eligible endpoints corresponding to database table rows.
    PinpointSegUpdate
  9. Now, change one row by issuing the following SQL statement. This simulates a user opting out from SMS communication for one of their numbers.
    UPDATE optin_status SET optin=0 WHERE userid='user5';
  10. After the update, go to the Amazon Pinpoint console. Check the eligible endpoints again. You should only see four eligible endpoints.

PinpointSegUpdate

Cleanup

If you no longer want to incur further charge, delete the Cloudformation stack named pinpointblogstack. Select it and click Delete.

PinpointCleanup

Conclusion

This solution walks you through how opt-in change events are delivered from Amazon RDS to Amazon Pinpoint. You can use this solution in other use cases as well. Some examples are importing segments from a 3rd party application like Salesforce and importing other types of channels like e-mail, push, and voice. To learn more about Amazon Pinpoint, visit our website.

The best new features for data analysts in Amazon Redshift in 2020

Post Syndicated from Helen Anderson original https://aws.amazon.com/blogs/big-data/the-best-new-features-for-data-analysts-in-amazon-redshift-in-2020/

This is a guest post by Helen Anderson, data analyst and AWS Data Hero

Every year, the Amazon Redshift team launches new and exciting features, and 2020 was no exception. New features to improve the data warehouse service and add interoperability with other AWS services were rolling out all year.

I am part of a team that for the past 3 years has used Amazon Redshift to store source tables from systems around the organization and usage data from our software as a service (SaaS) product. Amazon Redshift is our one source of truth. We use it to prepare operational reports that support the business and for ad hoc queries when numbers are needed quickly.

When AWS re:Invent comes around, I look forward to the new features, enhancements, and functionality that make things easier for analysts. If you haven’t tried Amazon Redshift in a while, or even if you’re a longtime user, these new capabilities are designed with analysts in mind to make it easier to analyze data at scale.

Amazon Redshift ML

The newly launched preview of Amazon Redshift ML lets data analysts use Amazon SageMaker over datasets in Amazon Redshift to solve business problems without the need for a data scientist to create custom models.

As a data analyst myself, this is one of the most interesting announcements to come out in re:Invent 2020. Analysts generally use SQL to query data and present insights, but they don’t often do data science too. Now there is no need to wait for a data scientist or learn a new language to create predictive models.

For information about what you need to get started with Amazon Redshift ML, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

For information about what you need to get started with Amazon Redshift ML, see the Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML blog post.

Federated queries

As analysts, we often have to join datasets that aren’t in the same format and sometimes aren’t ready for use in the same place. By using federated queries to access data in other databases or Amazon Simple Storage Service (Amazon S3), you don’t need to wait for a data engineer or ETL process to move data around.

re:Invent 2019 featured some interesting talks from Amazon Redshift customers who were tackling this problem. Now that federated queries over operational databases like Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL are generally available and querying Amazon RDS for MySQL and Amazon Aurora MySQL is in preview, I’m excited to hear more.

For a step-by-step example to help you get started, see Build a Simplified ETL and Live Data Query Solution Using Redshift Federated Query.

SUPER data type

Another problem we face as analysts is that the data we need isn’t always in rows and columns. The new SUPER data type makes JSON data easy to use natively in Amazon Redshift with PartiQL.

PartiQL is an extension that helps analysts get up and running quickly with structured and semistructured data so you can unnest and query using JOINs and aggregates. This is really exciting for those who deal with data coming from applications that store data in JSON or unstructured formats.

For use cases and a quickstart, see Ingesting and querying semistructured data in Amazon Redshift (preview).

Partner console integration

The preview of the native console integration with partners announced at AWS re:Invent 2020 will also make data analysis quicker and easier. Although analysts might not be doing the ETL work themselves, this new release makes it easier to move data from platforms like Salesforce, Google Analytics, and Facebook Ads into Amazon Redshift.

Matillion, Sisense, Segment, Etleap, and Fivetran are launch partners, with other partners coming soon. If you’re an Amazon Redshift partner and would like to integrate into the console, contact [email protected].

RA3 nodes with managed storage

Previously, when you added Amazon Redshift nodes to a cluster, both storage and compute were scaled up. This all changed with the 2019 announcement of RA3 nodes, which upgrade storage and compute independently.

In 2020, the Amazon Redshift team introduced RA3.xlplus nodes, which offer even more compute sizing options to address a broader set of workload requirements.

AQUA for Amazon Redshift

As analysts, we want our queries to run quickly so we can spend more time empowering the users of our insights and less time watching data slowly return. AQUA, the Advanced Query Accelerator for Amazon Redshift tackles this problem at an infrastructure level by bringing the stored data closer to the compute power

This hardware-accelerated cache enables Amazon Redshift to run up to 10 times faster as it scales out and processes data in parallel across many nodes. Each node accelerates compression, encryption, and data processing tasks like scans, aggregates, and filtering. Analysts should still try their best to write efficient code, but the power of AQUA will speed up the return of results considerably.

AQUA is available on Amazon Redshift RA3 instances at no additional cost. To get started with AQUA, sign up for the preview.

The following diagram shows Amazon Redshift architecture with an AQUA layer.

AQUA is available on Amazon Redshift RA3 instances at no additional cost.

Figure 1: Amazon Redshift architecture with AQUA layer

Automated performance tuning

For analysts who haven’t used sort and distribution keys, the learning curve can be steep. A table created with the wrong keys can mean results take much longer to return.

Automatic table optimization tackles this problem by using machine learning to select the best keys and tune the physical design of tables. Letting Amazon Redshift determine how to improve cluster performance reduces manual effort.

Summary

These are just some of the Amazon Redshift announcements made in 2020 to help analysts get query results faster. Some of these features help you get access to the data you need, whether it’s in Amazon Redshift or somewhere else. Others are under-the-hood enhancements that make things run smoothly with less manual effort.

For more information about these announcements and a complete list of new features, see What’s New in Amazon Redshift.


About the Author

Helen Anderson is a Data Analyst based in Wellington, New Zealand. She is well known in the data community for writing beginner-friendly blog posts, teaching, and mentoring those who are new to tech. As a woman in tech and a career switcher, Helen is particularly interested in inspiring those who are underrepresented in the industry.

Bringing machine learning to more builders through databases and analytics services

Post Syndicated from Swami Sivasubramanian original https://aws.amazon.com/blogs/big-data/bringing-machine-learning-to-more-builders-through-databases-and-analytics-services/

Machine learning (ML) is becoming more mainstream, but even with the increasing adoption, it’s still in its infancy. For ML to have the broad impact that we think it can have, it has to get easier to do and easier to apply. We launched Amazon SageMaker in 2017 to remove the challenges from each stage of the ML process, making it radically easier and faster for everyday developers and data scientists to build, train, and deploy ML models. SageMaker has made ML model building and scaling more accessible to more people, but there’s a large group of database developers, data analysts, and business analysts who work with databases and data lakes where much of the data used for ML resides. These users still find it too difficult and involved to extract meaningful insights from that data using ML.

This group is typically proficient in SQL but not Python, and must rely on data scientists to build the models needed to add intelligence to applications or derive predictive insights from data. And even when you have the model in hand, there’s a long and involved process to prepare and move data to use the model. The result is that ML isn’t being used as much as it can be.

To meet the needs of this large and growing group of builders, we’re integrating ML into AWS databases, analytics, and business intelligence (BI) services.

AWS customers generate, process, and collect more data than ever to better understand their business landscape, market, and customers. And you don’t just use one type of data store for all your needs. You typically use several types of databases, data warehouses, and data lakes, to fit your use case. Because all these use cases could benefit from ML, we’re adding ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can train models on their data or add inference results right from their database, without having to export and process their data or write large amounts of ETL code.

Machine Learning for database developers

At re:Invent last year, we announced ML integrated inside Amazon Aurora for developers working with relational databases. Previously, adding ML using data from Aurora to an application was a very complicated process. First, a data scientist had to build and train a model, then write the code to read data from the database. Next, you had to prepare the data so it can be used by the ML model. Then, you called an ML service to run the model, reformat the output for your application, and finally load it into the application.

Now, with a simple SQL query in Aurora, you can add ML to an enterprise application. When you run an ML query in Aurora using SQL, it can directly access a wide variety of ML models from Amazon SageMaker and Amazon Comprehend. The integration between Aurora and each AWS ML service is optimized, delivering up to 100 times better throughput when compared to moving data between Aurora and SageMaker or Amazon Comprehend without this integration. Because the ML model is deployed separately from the database and the application, each can scale up or scale out independently of the other.

In addition to making ML available in relational databases, combining ML with certain types of non-relational database models can also lead to better predictions. For example, database developers use Amazon Neptune, a purpose-built, high-performance graph database, to store complex relationships between data in a graph data model. You can query these graphs for insights and patterns and apply the results to implement capabilities such as product recommendations or fraud detection.

However, human intuition and analyzing individual queries is not enough to discover the full breadth of insights available from large graphs. ML can help, but as was the case with relational databases it requires you to do a significant amount of heavy lifting upfront to prepare the graph data and then select the best ML model to run against that data. The entire process can take weeks.

To help with this, today we announced the general availability of Amazon Neptune ML to provide database developers access to ML purpose-built for graph data. This integration is powered by SageMaker and uses the Deep Graph Library (DGL), a framework for applying deep learning to graph data. It does the hard work of selecting the graph data needed for ML training, automatically choosing the best model for the selected data, exposing ML capabilities via simple graph queries, and providing templates to allow you to customize ML models for advanced scenarios. The following diagram illustrates this workflow.

And because the DGL is purpose-built to run deep learning on graph data, you can improve accuracy of most predictions by over 50% compared to that of traditional ML techniques.

Machine Learning for data analysts

At re:Invent last year, we announced ML integrated inside Amazon Athena for data analysts. With this integration, you can access more than a dozen built-in ML models or use your own models in SageMaker directly from ad-hoc queries in Athena. As a result, you can easily run ad-hoc queries in Athena that use ML to forecast sales, detect suspicious logins, or sort users into customer cohorts.

Similarly, data analysts also want to apply ML to the data in their Amazon Redshift data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day. These Amazon Redshift users want to run ML on their data in Amazon Redshift without having to write a single line of Python. Today we announced the preview of Amazon Redshift ML to do just that.

Amazon Redshift now enables you to run ML algorithms on Amazon Redshift data without manually selecting, building, or training an ML model. Amazon Redshift ML works with Amazon SageMaker Autopilot, a service that automatically trains and tunes the best ML models for classification or regression based on your data while allowing full control and visibility.

When you run an ML query in Amazon Redshift, the selected data is securely exported from Amazon Redshift to Amazon Simple Storage Service (Amazon S3). SageMaker Autopilot then performs data cleaning and preprocessing of the training data, automatically creates a model, and applies the best model. All the interactions between Amazon Redshift, Amazon S3, and SageMaker are abstracted away and automatically occur. When the model is trained, it becomes available as a SQL function for you to use. The following diagram illustrates this workflow.

Rackspace Technology – a leading end-to-end multicloud technology services company, and Slalom –  a modern consulting firm focused on strategy, technology, and business transformation are both users of Redshift ML in preview.

Nihar Gupta, General Manager for Data Solutions at Rackspace Technology says “At Rackspace Technology, we help companies elevate their AI/ML operationsthe seamless integration with Amazon SageMaker will empower data analysts to use data in new ways, and provide even more insight back to the wider organization.”

And Marcus Bearden, Practice Director at Slalom shared “We hear from our customers that they want to have the skills and tools to get more insight from their data, and Amazon Redshift is a popular cloud data warehouse that many of our customers depend on to power their analytics, the new Amazon Redshift ML feature will make it easier for SQL users to get new types of insight from their data with machine learning, without learning new skills.”

Machine Learning for business analysts

To bring ML to business analysts, we launched new ML capabilities in Amazon QuickSight earlier this year called ML Insights. ML Insights uses SageMaker Autopilot to enable business analysts to perform ML inference on their data and visualize it in BI dashboards with just a few clicks. You can get results for different use cases that require ML, such as anomaly detection to uncover hidden insights by continuously analyzing billions of data points, to do forecasting, to predict growth, and other business trends. In addition, QuickSight can also give you an automatically generated summary in plain language (a capability we call auto-narratives), which interprets and describes what the data in your dashboard means. See the following screenshot for an example.

Customers like Expedia Group, Tata Consultancy Services, and Ricoh Company are already benefiting from ML out of the box with QuickSight. These human-readable narratives enable you to quickly interpret the data in a shared dashboard and focus on the insights that matter most.

In addition, customers have also been interested in asking questions of their business data in plain language and receiving answers in near-real time. Although some BI tools and vendors have attempted to solve this challenge with Natural Language Query (NLQ), the existing approaches require that you first spend months in advance preparing and building a model on a pre-defined set of data, and even then, you still have no way of asking ad hoc questions when those questions require a new calculation that wasn’t pre-defined in the data model. For example, the question “What is our year-over-year growth rate?” requires that “growth rate” be pre-defined as a calculation in the model. With today’s BI tools, you need to work with your BI teams to create and update the model to account for any new calculation or data, which can take days or weeks of effort.

Last week, we announced Amazon QuickSight Q. ‘Q’ gives business analysts the ability to ask any question of all their data and receive an accurate answer in seconds. To ask a question, you simply type it into the QuickSight Q search bar using natural language and business terminology that you’re familiar with. Q uses ML (natural language processing, schema understanding, and semantic parsing for SQL code generation) to automatically generate a data model that understands the meaning of and relationships between business data, so you can get answers to your business questions without waiting weeks for a data model to be built. Because Q eliminates the need to build a data model, you’re also not limited to asking only a specific set of questions. See the following screenshot for an example.

Best Western Hotels & Resorts is a privately-held hotel brand with a global network of approximately 4,700 hotels in over 100 countries and territories worldwide. “With Amazon QuickSight Q, we look forward to enabling our business partners to self-serve their ad hoc questions while reducing the operational overhead on our team for ad hoc requests,” said Joseph Landucci, Senior Manager of Database and Enterprise Analytics at Best Western Hotels & Resorts. “This will allow our partners to get answers to their critical business questions quickly by simply typing and searching their questions in plain language.”

Summary

For ML to have a broad impact, we believe it has to get easier to do and easier to apply. Database developers, data analysts, and business analysts who work with databases and data lakes have found it too difficult and involved to extract meaningful insights from their data using ML. To meet the needs of this large and growing group of builders, we’ve added ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can all use ML more easily without the need to be an ML expert. These capabilities put ML in the hands of every data professional so that they can get the most value from their data.


About the Authors

Swami Sivasubramanian is Vice President at AWS in charge of all Amazon AI and Machine Learning services. His team’s mission is “to put machine learning capabilities in the hands on every developer and data scientist.” Swami and the AWS AI and ML organization work on all aspects of machine learning, from ML frameworks (Tensorflow, Apache MXNet and PyTorch) and infrastructure, to Amazon SageMaker (an end-to-end service for building, training and deploying ML models in the cloud and at the edge), and finally AI services (Transcribe, Translate, Personalize, Forecast, Rekognition, Textract, Lex, Comprehend, Kendra, etc.) that make it easier for app developers to incorporate ML into their apps with no ML experience required.

Previously, Swami managed AWS’s NoSQL and big data services. He managed the engineering, product management, and operations for AWS database services that are the foundational building blocks for AWS: DynamoDB, Amazon ElastiCache (in-memory engines), Amazon QuickSight, and a few other big data services in the works. Swami has been awarded more than 250 patents, authored 40 referred scientific papers and journals, and participates in several academic circles and conferences.

 

Herain Oberoi leads Product Marketing for AWS’s Databases, Analytics, BI, and Blockchain services. His team is responsible for helping customers learn about, adopt, and successfully use AWS services. Prior to AWS, he held various product management and marketing leadership roles at Microsoft and a successful startup that was later acquired by BEA Systems. When he’s not working, he enjoys spending time with his family, gardening, and exercising.

 

 

 

Accessing and visualizing external tables in an Apache Hive metastore with Amazon Athena and Amazon QuickSight

Post Syndicated from James Sun original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-external-tables-in-an-apache-hive-metastore-with-amazon-athena-and-amazon-quicksight/

Many organizations have an Apache Hive metastore that stores the schemas for their data lake. You can use Amazon Athena due to its serverless nature; Athena makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas hosted in an external Hive metastore. In response to customers’ requests, AWS announced Athena’s support for Hive metastore in early 2020. This extends the ability in Athena to query external data stores, with a combined benefit of better performance and lower cost.

In this post, we provide an AWS CloudFormation template that configures a remote Hive metastore based on Amazon Relational Database Service (Amazon RDS) and MySQL with Amazon EMR located in a private subnet to perform ETL tasks. We then demonstrate how you can use a Spark step to pull COVID-19 datasets from a public repository and transform the data into a performant Parquet columnar storage format. We also walk through the steps to query the data with Athena and visualize it with Amazon QuickSight. QuickSight is a fully managed data visualization service; it lets the you easily create and publish interactive dashboards by analyzing data from various data sources, including Athena.

Solution walkthrough

The following diagram shows the architecture for our solution.


As shown in the preceding architecture, we have an Availability Zone within a VPC in an AWS Region. The Availability Zone hosts subnets that are either public or private. A multi-master EMR cluster that has Hive software components installed is launched in a private subnet with egress internet traffic through a NAT gateway to download data from public sites for analysis. The multi-master feature also ensures that the primary nodes are highly available. The Hive metastore is backed by a remote RDS for MySQL instance located in the same private subnet.

We also have an Amazon Simple Storage Service (Amazon S3)-based data lake. A Spark step in Amazon EMR retrieves the data in CSV format, saves it in the provisioned S3 bucket, and transforms the data into Parquet format. The Spark step creates an external Hive table referencing the Parquet data and is ready for Athena to query.

With Athena, you can create a data source connector based on an AWS Lambda function to access the Hive metastore hosted on the EMR cluster by using the Apache Thrift interface.

The connector is called a catalog, which when invoked in a SQL statement with Athena, invokes the Lambda function. The function exits if the connector is not active for 15 minutes. For queries that run longer than 15 minutes, it’s recommended to let the query complete before retrieving the query results in an Amazon S3 location you can specify.

In Athena, you can compose a SQL statement against the Hive tables with predicates to further limit the size of the query result for faster visualization by QuickSight.

Deploying the resources with AWS CloudFormation

To demonstrate our solution, we provide a CloudFormation template that you can download to easily deploy the necessary AWS resources. The template creates the following resources to simulate the environment:

  • VPC and subnets – A VPC with one public and one private subnets. A NAT gateway is also created to allow outbound internet access from the EMR cluster to download public COVID-19 datasets from the Johns Hopkins GitHub repo.
  • EMR cluster – A multi-master EMR cluster with Hive, running on three primary nodes (m5.xlarge) and two core nodes (m5.xlarge), is launched to support the thrift connection required by the Athena Lambda connectors.
  • Amazon RDS for MySQL database – An RDS for MySQL primary instance is launched in the same subnet as the EMR cluster. The RDS instance serves as the Hive metastore backend data store.
  • S3 bucket – An S3 bucket stores files in Parquet format by Amazon EMR and is accessed later by Athena.
  • AWS IAM users – Two AWS Identity and Access Management (IAM) users belonging to different user groups. The first user, the data engineer, has permissions to access the Lambda-based Athena data source connector. The other user, the salesperson, does not have permissions to access the connector.

To get started, you need to have an AWS account. If you don’t have one, go to aws.amazon.com to sign up for one. Then complete the following steps:

  1. Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
  2. Choose Launch Stack to launch the CloudFormation template:

This template has been tested in the US East (N. Virginia) Region.

  1. Choose Next.

You’re prompted to enter a few launch parameters.

  1. For Stack name, enter a name for the stack (for example, athena-hms).
  2. For Hive Metastore Host Number, choose the Amazon EMR primary node to connect to (or use the default value).
  3. Continue to choose Next and leave other parameters at their default.
  4. On the review page, select the three check boxes to confirm that AWS CloudFormation might create resources.
  5. Choose Create stack.

The stack takes 15–20 minutes to complete.

  1. On the Outputs tab of the stack details, save the key-value pairs to use later.

When the EMR cluster is provisioned, it uses a bootstrap action to install the necessary Python libraries. It runs an Amazon EMR Spark step (a PySpark script) that downloads three COVID-19 datasets for confirmed, recovered, and death cases from the John Hopkins GitHub repo in CSV format and stores them in the csv subfolder of the S3 bucket created by the CloudFormation stack. Lastly, the final transformed data is converted to Parquet format and external Hive tables are created referencing the Parquet data located in the parquet subfolder of the S3 bucket.

The following are the source codes for the bootstrap and Spark step actions for your reference:

To validate the data, on the Amazon S3 console, choose the bucket name from the CloudFormation template outputs. You should see a covid_data folder in the bucket. The folder contains the two subfolders, csv and parquet, which store the raw CSV and transformed Parquet data, respectively.

Querying the data in Athena

The CloudFormation template creates two users belonging to two different AWS IAM groups. The de_user_us-east-1_athena-hms user is a data engineer with permissions to access the Lambda function to communicate with the Hive metastore using the Athena data source connector. They belong to the group athena-hms-DataEngineerGroup-xxxxxxx. The sales_user_us-east-1_athena-hms user is a salesperson with no access to the connector. They belong to the group athena-hms-SalesGroup-xxxxx. 

To query the data, first retrieve your secret values in AWS Secrets Manager:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose DataEngineerUserCreds.

  1. Choose Retrieve secret value.

 

  1. Save the username and password.

 

  1. Repeat these steps for SalesUserCreds. 

With the data in place, we can now use Athena to query it.

Accessing data as the data engineer

To query the data as the de_user_us-east-1_athena-hms user, complete the following steps:

  1. Sign in to the Athena console as the de_user_us-east-1_athena-hms user with the credentials retrieved from Secrets Manager.

After logging in, we need to create a data source for Hive metastore.

  1. On the navigation bar, choose Data sources.
  2. Choose Connect data source.

  1. For Choose where you data is located, select Query data in Amazon S3.
  2. For Choose a metadata catalog, select Apache Hive metastore.
  3. Chose Next.

  1. For Lambda function, choose the function the CloudFormation template created with the key HiveMetastoreFunctionName.
  2. For Catalog name, enter a name for the Athena catalog (for example, demo_hive_metastore).
  3. Choose Connect.

You should now have a catalog named demo_hive_metastore with the catalog type Hive metastore.

  1. On the navigation bar, choose Query editor.
  2. Enter the following SQL statement:
    SELECT *
    FROM demo_hive_metastore.default.covid_confirmed_cases
    WHERE country = 'US'
            OR country = 'Italy'
            OR country = 'India'
            OR country = 'Brazil'
            OR country = 'Spain'
            OR country = 'United Kingdom'

This SQL statement selects all the columns in the covid_confirmed_cases table with predicates to only include a few countries of interest. We use a table name with the pattern <catalog name>.<hive database name>.<hive table name in the database>, which for this post translates to demo_hive_metastore.default.covid_confirmed_cases.

  1. Choose Run query.

The following screenshot shows your query results.

Make sure you completely sign out of the console before moving on to the next steps.

Accessing data as the salesperson

Sign in to the console as sales_user_us-east-1_athena-hms user. Because the salesperson user doesn’t have the appropriate IAM policies to access the Hive metastore connection, you can’t see the tables.

 

Permissions policies

The data engineer has additional policies attached to their IAM group in addition to the managed AmazonAthenaFullAccess policy: the <stack-name>-DataBucketReadAccessPolicy-xxxxx and <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policies created by the CloudFormation template. Therefore, the data engineer can view the tables, but the salesperson can’t.

These policies are available on the IAM console, on the Permissions tab for the group <stack-name>-DataEngineerGroup-xxxxx.

The following sample JSON code is the <stack-name>-DataBucketReadWriteAccessPolicy-xxxxx policy to allow access to the provisioned S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::<provisioned bucket name>",
                "arn:aws:s3:::<provisioned bucket name>/",
                "arn:aws:s3:::<provisioned bucket name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

The following sample JSON code is the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy to allow access to the Lambda Hive metastore function:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "lambda:GetFunction",
                "lambda:GetLayerVersion",
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:us-east-1:<account id>:function:athena-hms-HiveMetastoreFunction"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
               "arn:aws:s3:::<provisioned bucket name>/hms_spill"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Next, we walk through using QuickSight to visualize the results. Make sure you completely sign out of the console as the salesperson user before proceeding to the next steps. 

Signing up for QuickSight

You can skip this section if you have already signed up for QuickSight previously.

  1. Sign in to the console as the IAM power user who deployed the CloudFormation template or any user with enough IAM privileges to set up QuickSight.
  2. On the QuickSight console, choose Sign up for QuickSight.
  3. Select either the Standard or Enterprise edition for QuickSight.
  4. Choose Continue.

  1. For QuickSight account name, enter your AWS account ID.
  2. For Notification email address, enter your email.
  3. Select Amazon S3.

  1. Select the provisioned S3 bucket to grant QuickSight permission to access.
  2. Choose Finish.

Your QuickSight account should now be set up.

Attaching the Hive metastore access policy

Before you can use QuickSight, you have to attach the Hive metastore access policy to the QuickSight service role.

  1. On the IAM console, search for the service role aws-quicksight-service-role-v0.
  2. Choose the role.

  1. Search for the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy.
  2. Select the policy and attach it to the QuickSight service role.

Creating your data source and performing data conversions

Before we can create visualizations, we need to set up our data source.

  1. Download the SQL script covid-19.sql.
  2. On the QuickSight console, choose Datasets in the navigation pane.

  1. Choose New dataset.

  1. Choose Athena.

  1. In the pop-up window, for Data source name, enter demo_hive_metastore.
  2. Choose Validate.
  3. When the connection is validated, choose Create data source.

  1. In the next window, choose Use custom SQL.

  1. Enter the content of the covid-19.sql script in the query window.
  2. Choose Confirm query.

  1. Leave Import to SPICE for quicker analytics
  2. Choose Visualize.

Now we perform a few data type conversions before visualizing the data.

  1. Choose the Edit icon next to Data set on the menu bar.

  1. Choose the … icon.
  2. Choose Edit.

  1. Expand date and choose Change data type.
  2. Choose Date.

  1. Enter yyyy-MM-dd to convert the date format.
  2. Choose Update.

Now we create a coordinate using the latitude and longitude values.r

  1. Expand lat and choose Ad to coordinates.

  1. Leave Create new geospatial coordinates
  2. Chose Add.

  1. In the pop-up window, for Name your coordinates, enter coordinate.
  2. For Field to use for longitude, choose lon.
  3. Choose Create coordinates.

  1. Choose Save and visualize on the menu bar.

Creating visualizations in QuickSight

Now we can visualize our data. For this post, we create a map visualization.

  1. For Visual types, choose the map

  1. For Geospatial, drag coordinates.
  2. For Size, drag confirmed.
  3. For Color, drag country.

This world map shows the accumulated confirmed cases for selected countries over time; you need to use a filter to look at confirmed cases on a specific date.

  1. In the navigation pane, choose Filter.
  2. Choose the + icon.
  3. For Filter type, choose Time range.
  4. Choose start and end dates, such as 2020-09-10 00:00 and 2020-09-11 00:00, respectively.
  5. Choose Apply.

This plots the confirmed cases on September 10, 2020, for these countries.

Similarly, you can choose other visual types, such as a line chart, and generate the mortality rate for selected countries over time.

Using highly available primary nodes of the Amazon EMR cluster

The EMR cluster has a multi-master configuration with three primary nodes running to meet high availability requirements. At any time, the Lambda function communicates with one of these three EMR primary nodes. In the rare event that this node goes down, you can quickly re-establish the Athena data source connector to the external Hive metastore by failing over to another active primary node.

To perform this failover, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose athena-hms.
  3. Choose update.
  4. Choose Use current update.
  5. Choose Next.
  6. For Hive Metastore Host Number, choose a host other than the current one you’re using.

  1. Choose Next.
  2. Acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Update stack.

In less than a minute, you should be able to access the Hive metastore and continue to query on the Athena console.

Cleaning up

You may want to clean up the demo environment when you’re done. To do so, on the AWS CloudFormation console, select the template and choose Delete.

This action also deletes the S3 bucket and any data in it. If you want to retain the data for future use, you should make a copy of the bucket before you delete it.

Summary

In this post, we walked through a solution using Athena to query external Hive tables with public COVID-19 datasets hosted in an S3 bucket and visualizing the data with QuickSight. We provided a CloudFormation template to automate the deployment of necessary AWS services for the demo. We encourage you to use these managed and scalable services for your specific use cases in production.


About the Authors

James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.

 

 

Chinmayi Narasimhadevara is a Solutions Architect with Amazon Web Services. Chinmayi has over 15 years of experience in information technology and has worked with organizations ranging from large enterprises to mid-sized startups. She helps AWS customers leverage the correct mix of AWS services to achieve success for their business goals.

 

 

Gagan Brahmi is a Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

New – Export Amazon DynamoDB Table Data to Your Data Lake in Amazon S3, No Code Writing Required

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/

Hundreds of thousands of AWS customers have chosen Amazon DynamoDB for mission-critical workloads since its launch in 2012. DynamoDB is a nonrelational managed database that allows you to store a virtually infinite amount of data and retrieve it with single-digit-millisecond performance at any scale. To get the most value out of this data, customers had […]

My Advice To Developers About Working With Databases: Make It Secure

Post Syndicated from Bozho original https://techblog.bozho.net/my-advice-to-developers-about-working-with-databases-make-it-secure/

Last month Ben Brumm asked me for the one advice I’d like to give to developers that are working with databases (in reality – almost all of us). He published mine as well as many others’ answers here, but I’d like to share it with my readers as well.

If I had to give developers working with databases one advice, it would be: make it secure. Every other thing you’ll figure in time – how to structure your tables, how to use ORM, how to optimize queries, how to use indexes, how to do multitenancy. But security may not be on the list of requirements and it may be too late when the need becomes obvious.

So I’d focus on several things:

  • Prevent SQL injections – make sure you use an ORM or prepared statements rather than building queries with string concatenation. Otherwise a malicious actor can inject anything in your queries and turn them into a DROP DATABASE query, or worse – one that exfiltrates all the data.
  • Support encryption in transit – this often has to be supported by the application’s driver configuration, e.g. by trusting a particular server certificate. Unencrypted communication, even within the same datacenter, is a significant risk and that’s why databases support encryption in transit. (You should also think about encryption at rest, but that’s more of an Ops task)
  • Have an audit log at the application level – “who did what” is a very important question from a security and compliance point of view. And no native database functionality can consistently answer the question “who” – it’s the application that manages users. So build an audit trail layer that records who did what changes to what entities/tables.
  • Consider record-level encryption for sensitive data – a database can be dumped in full by those who have access (or gain access maliciously). This is how data breaches happen. Sensitive data (like health data, payment data, or even API keys, secrets or tokens) benefits from being encrypted with an application-managed key, so that access to the database alone doesn’t reveal that data. Another option, often used for credit cards, is tokenization, which shifts the encryption responsibility to the tokenization providers. Managing the keys is hard, but even a basic approach is better than nothing.

Security is often viewed as an “operations” responsibility, and this has lead to a lot of tools that try to solve the above problem without touching the application – web application firewalls, heuristics for database access monitoring, trying to extract the current user, etc. But the application is the right place for many of these protections (although certainly not the only place), and as developers we need to be aware of the risks and best practices.

The post My Advice To Developers About Working With Databases: Make It Secure appeared first on Bozho's tech blog.

Accessing external components using Amazon Redshift Lambda UDFs

Post Syndicated from Rajiv Gupta original https://aws.amazon.com/blogs/big-data/accessing-external-components-using-amazon-redshift-lambda-udfs/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse. It makes it simple and cost-effective to analyze all your data using standard SQL, your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads.

Previously, you could create custom scalar user-defined functions (UDFs) using either a SQL SELECT statement or a Python program. With the general availability of the Amazon Redshift Lambda UDF, you can also create functions backed by AWS Lambda code. These Lambda UDFs enable you to interact and integrate with external components outside of Amazon Redshift. You can write Lambda UDFs in any supported programming language, such as Java, Go PowerShell, Node.js, C#, Python, Ruby, or a custom runtime. This functionality enables new Amazon Redshift use cases, including but not limited to the following:

This post describes how the Amazon Redshift Lambda UDF works and walks you through creating your first Amazon Redshift Lambda UDF.

Solution architecture

Amazon Redshift Lambda UDFs are architected to perform efficiently and securely. When you run an Amazon Redshift Lambda UDF, each slice in the Amazon Redshift cluster accumulates the applicable rows of arguments and sends those rows in batches to your Lambda function in parallel. The data is transferred between the Amazon Redshift compute nodes and function in a separate isolated network that is inaccessible by clients. For Amazon Redshift to reconstruct the results, the function receives the input data in an ordered nested array of values and is expected to return the results in a similarly ordered array. Because each Lambda invocation runs in parallel, the result returns to the compute nodes quickly. The compute nodes further process the data (if needed) and return the results to the leader node and SQL client. The following diagram illustrates this architecture.

Solution overview

For this post, we explore the use case of data enrichment from an external data source, using DynamoDB. In our function, we can pass the DynamoDB table, key field, and value. The function should return a JSON string containing the document associated to that key. In our use case, the transaction data is loaded into Amazon Redshift via a pipeline that is batch loaded from the POS system but contains only the CustomerId. The customer metadata is used by the web application and is stored in DynamoDB. The goal is to run the following SQL statement to look up customer details by CustomerId and retrieve the most up-to-date information:

select
  CustomerId,
  udf_dynamodb_lookup ('Customer', 'id', CustomerId) Customer
from transactions;

The following table shows the expected results, in which the customer information is returned when it’s available, and shows as NULL when it’s not.

CustomerId Customer
0
1 {“lname”: “Doe”, “id”: “1”, “fname”: “John”}
2 {“lname”: “Doe”, “id”: “2”, “fname”: “Jane”}
3 {“lname”: “Salazar”, “id”: “3”, “fname”: “Carlos”}
4 {“lname”: “Ramirez”, “id”: “4”, “fname”: “Diego”}
12

To implement this solution, we create the following:

  • The DynamoDB and Amazon Redshift tables
  • The Lambda function with an AWS Identity and Access Management (IAM) role that has access to the DynamoDB table.
  • The Amazon Redshift UDF mapped to the Lambda function with an IAM role that has access to run the function

Creating the DynamoDB and Amazon Redshift tables

Create a DynamoDB table containing the following customer data:

id fname lname
1 John Doe
2 Jane Doe
3 Carlos Salazar
4 Diego Ramirez
5 Mary Major
6 Richard Roe

To create a table in Amazon Redshift with transactions data that refers to a CustomerId, enter the following code:

create table transactions (CustomerId varchar, StoreId varchar, TransactionAmount decimal(10,4));
insert into transactions values 
('0', '123', '10.34'),
('1', '123', '9.99'),
('2', '234', '10.34'),
('3', '123', '4.15'),
('4', '234', '17.25'),
('12', '123', '9.99');

Creating the Lambda function

The Lambda function receives an input event with metadata about the invocation. The arguments attribute is an ordered nested array of input values. For our use case, you can expect the arguments to be sent as follows:

{
 "arguments": [
  ["Customer", "id", "0"],
  ["Customer", "id", "1"],
  ["Customer", "id", "2"],
  ["Customer", "id", "3"],
  ["Customer", "id", "4"],
  ["Customer", "id", "12"]
 ]
}

This function is written generically so that any table and field combination can be used to lookup. For example, if I need to enrich my data with Store metadata, I run the same function: udf_dynamodb_lookup ('Store', 'id', StoreId).

The code within the Lambda function needs to traverse through the input parameters and for each row, and retrieve the corresponding record from DynamoDB. The function is expected to return an output containing metadata such as success, results, and error_msg. The success attribute determines if the function was successful. In the case of failure, you can use error_msg to pass a custom message to the user about the error. The results attribute is an ordered array of output values. For our use case, the output is as follows:

{
 "success": true,
 "results": [
   "NULL",
   "{\"lname\": \"Doe\", \"id\": \"1\", \"fname\": \"John\"}",
   "{\"lname\": \"Doe\", \"id\": \"2\", \"fname\": \"Jane\"}",
   "{\"lname\": \"Doe\", \"id\": \"3\", \"fname\": \"Bob\"}",
   "{\"lname\": \"Doe\", \"id\": \"4\", \"fname\": \"Frank\"}",
   "NULL"
 ]
}

The following code is of the Lambda function Lambda_DynamoDB_Lookup, which takes the preceding input arguments containing the CustomerId values and produces the output results, containing the corresponding customer metadata when available. The error handling uses the error_msg if the table doesn’t exist, but populates None if a particular key doesn’t exist.

import json
import boto3
dynamodb = boto3.resource('dynamodb')

def lambda_handler(event, context):
 ret = dict()
 try: 
  tableName = event["arguments"][0][0]
  columnName = event["arguments"][0][1]

  table = dynamodb.Table(tableName)
  table.item_count 
  res = []
  for argument in event['arguments']:
   try:
    columnValue = argument[2]
    response = table.get_item(Key={columnName: columnValue })
    res.append(json.dumps(response["Item"]))
   except: 
    res.append(None)
  ret['success'] = True
  ret['results'] = res
 except Exception as e:
  ret['success'] = False
  ret['error_msg'] = str(e)
 return json.dumps(ret)

For the Lambda function to query the DynamoDB Customer table, you need to modify the execution role and grant DescribeTable and GetItem privileges. Add the following policy to the IAM role, replacing the account number with your AWS account number:

{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Action": [
     "dynamodb:DescribeTable",
     "dynamodb:GetItem"
   ],
   "Resource": "arn:aws:dynamodb:*:xxxxxxxxx999:table/Customer"
  }
 ]
}

For the Amazon Redshift cluster to invoke the Lambda function you created, you need to associate the function to an IAM role that Amazon Redshift can assume and has the InvokeFunction privilege. Add the following policy to the IAM role associated to Amazon Redshift, replacing the account number with your AWS account number. If you don’t already have an IAM role associated to Amazon Redshift, you can create one.

{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Effect": "Allow",
   "Action": "lambda:InvokeFunction",
   "Resource": "arn:aws:lambda:*:xxxxxxxxx999:function:Lambda_DynamoDB_Lookup"
  }
 ]
}

Creating the Amazon Redshift UDF

You can now register this function to Amazon Redshift. The following statement creates the function udf_dynamodb_lookup within Amazon Redshift mapped to the earlier Lambda function. Be sure to associate the function to the IAM role modified earlier.

CREATE OR REPLACE EXTERNAL FUNCTION udf_dynamodb_lookup (tableName varchar, columnName varchar, columnValue varchar)
RETURNS varchar STABLE
LAMBDA 'Lambda_DynamoDB_Lookup'
IAM_ROLE '<Role ARN>';

Finally, run the function and you should receive the expected results from earlier:

select
  CustomerId,
  udf_dynamodb_lookup ('Customer', 'id', CustomerId) Customer
from transactions;

Summary

In this post, I introduced you to Amazon Redshift Lambda UDFs. I also provided a step-by-step guide for creating your first function, which enriches data in Amazon Redshift using DynamoDB. For more information about creating an Amazon Redshift Lambda UDF, see online documentation. If you want to share the UDFs you’ve created with other Amazon Redshift customers, please reach out to us.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Author

Rajiv Gupta is a data warehouse specialist solutions architect with Amazon Web Services.