Tag Archives: database

Graviton Fast Start – A New Program to Help Move Your Workloads to AWS Graviton

2022-08-03 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/graviton-fast-start-a-new-program-to-help-move-your-workloads-to-aws-graviton/

With the Graviton Challenge last year, we helped customers migrate to Graviton-based EC2 instances and get up to 40 percent price performance benefit in as little as 4 days. Tens of thousands of customers, including 48 of the top 50 Amazon Elastic Compute Cloud (Amazon EC2) customers, use AWS Graviton processors for their workloads. In addition to EC2, many AWS managed services can run their workloads on Graviton. For most customers, adoption is easy, requiring minimal code changes. However, the effort and time required to move workloads to Graviton depends on a few factors including your software development environment and the technology stack on which your application is built.

This year, we want to take it a step further and make it even easier for customers to adopt Graviton not only through EC2, but also through managed services. Today, we are launching AWS Graviton Fast Start, a new program that makes it even easier to move your workloads to AWS Graviton by providing step-by-step directions for EC2 and other managed services that support the Graviton platform:

Amazon Elastic Compute Cloud (Amazon EC2) – EC2 provides the most flexible environment for a migration and can support many kinds of workloads, such as web apps, custom databases, or analytics. You have full control over the interpreted or compiled code running in the EC2 instance. You can also use many open-source and commercial software products that support the Arm64 architecture.
AWS Lambda – Migrating your serverless functions can be really easy, especially if you use an interpreted runtime such as Node.js or Python. Most of the time, you only have to check the compatibility of your software dependencies. I have shown a few examples in this blog post.
AWS Fargate – Fargate works best if your applications are already running in containers or if you are planning to containerize them. By using multi-architecture container images or images that have Arm64 in their image manifest, you get the serverless benefits of Fargate and the price-performance advantages of Graviton.
Amazon Aurora – Relational databases are at the core of many applications. If you need a database compatible with PostgreSQL or MySQL, you can use Amazon Aurora to have a highly performant and globally available database powered by Graviton.
Amazon Relational Database Service (RDS) – Similarly to Aurora, Amazon RDS engines such as PostgreSQL, MySQL, and MariaDB can provide a fully managed relational database service using Graviton-based instances.
Amazon ElastiCache – When your workload requires ultra-low latency and high throughput, you can speed up your applications with ElastiCache and have a fully managed in-memory cache running on Graviton and compatible with Redis or Memcached.
Amazon EMR – With Amazon EMR, you can run large-scale distributed data processing jobs, interactive SQL queries, and machine learning applications on Graviton using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.

Here’s some feedback we got from customers running their workloads on Graviton:

Formula 1 racing told us that Graviton2-based C6gn instances provided the best price performance benefits for some of their computational fluid dynamics (CFD) workloads. More recently, they found that Graviton3 C7g instances are 40 percent faster for the same simulations and expect Graviton3-based instances to become the optimal choice to run all of their CFD workloads.
Honeycomb has 100 percent of their production workloads running on Graviton using EC2 and Lambda. They have tested the high-throughput telemetry ingestion workload they use for their observability platform against early preview instances of Graviton3 and have seen a 35 percent performance increase for their workload over Graviton2. They were able to run 30 percent fewer instances of C7g than C6g serving the same workload and with 30 percent reduced latency. With these instances in production, they expect over 50 percent price performance improvement over x86 instances.
Twitter is working on a multi-year project to leverage Graviton-based EC2 instances to deliver Twitter timelines. As part of their ongoing effort to drive further efficiencies, they tested the new Graviton3-based C7g instances. Across a number of benchmarks representative of their workloads, they found Graviton3-based C7g instances deliver 20-80 percent higher performance compared to Graviton2-based C6g instances, while also reducing tail latencies by as much as 35 percent. They are excited to utilize Graviton3-based instances in the future to realize significant price performance benefits.

With all these options, getting the benefits of running all or part of your workload on AWS Graviton can be easier than you expect. To help you get started, there’s also a free trial on the Graviton-based T4g instances for up to 750 hours per month through December 31st, 2022.

Visit AWS Graviton Fast Start to get step-by-step directions on how to move your workloads to AWS Graviton.

— Danilo

Amazon Redshift Serverless – Now Generally Available with New Capabilities

2022-07-12 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-serverless-now-generally-available-with-new-capabilities/

Last year at re:Invent, we introduced the preview of Amazon Redshift Serverless, a serverless option of Amazon Redshift that lets you analyze data at any scale without having to manage data warehouse infrastructure. You just need to load and query your data, and you pay only for what you use. This allows more companies to build a modern data strategy, especially for use cases where analytics workloads are not running 24-7 and the data warehouse is not active all the time. It is also applicable to companies where the use of data expands within the organization and users in new departments want to run analytics without having to take ownership of data warehouse infrastructure.

Today, I am happy to share that Amazon Redshift Serverless is generally available and that we added many new capabilities. We are also reducing Amazon Redshift Serverless compute costs compared to the preview.

You can now create multiple serverless endpoints per AWS account and Region using namespaces and workgroups:

A namespace is a collection of database objects and users, such as database name and password, permissions, and encryption configuration. This is where your data is managed and where you can see how much storage is used.
A workgroup is a collection of compute resources, including network and security settings. Each workgroup has a serverless endpoint to which you can connect your applications. When configuring a workgroup, you can set up private or publicly accessible endpoints.

Each namespace can have only one workgroup associated with it. Conversely, each workgroup can be associated with only one namespace. You can have a namespace without any workgroup associated with it, for example, to use it only for sharing data with other namespaces in the same or another AWS account or Region.

In your workgroup configuration, you can now use query monitoring rules to help keep your costs under control. Also, the way Amazon Redshift Serverless automatically scales data warehouse capacity is more intelligent to deliver fast performance for demanding and unpredictable workloads.

Let’s see how this works with a quick demo. Then, I’ll show you what you can do with namespaces and workgroups.

Using Amazon Redshift Serverless
In the Amazon Redshift console, I select Redshift serverless in the navigation pane. To get started, I choose Use default settings to configure a namespace and a workgroup with the most common options. For example, I’ll be able to connect using my default VPC and default security group.

With the default settings, the only option left to configure is Permissions. Here, I can specify how Amazon Redshift can interact with other services such as S3, Amazon CloudWatch Logs, Amazon SageMaker, and AWS Glue. To load data later, I give Amazon Redshift access to an S3 bucket. I choose Manage IAM roles and then Create IAM role.

When creating the IAM role, I select the option to give access to specific S3 buckets and pick an S3 bucket in the same AWS Region. Then, I choose Create IAM role as default to complete the creation of the role and to automatically use it as the default role for the namespace.

I choose Save configuration and after a few minutes the database is ready for use. In the Serverless dashboard, I choose Query data to open the Redshift query editor v2. There, I follow the instructions in the Amazon Redshift Database Developer guide to load a sample database. If you want to do a quick test, a few sample databases (including the one I am using here) are already available in the sample_data_dev database. Note also that loading data into Amazon Redshift is not required for running queries. I can use data from an S3 data lake in my queries by creating an external schema and an external table.

The sample database consists of seven tables and tracks sales activity for a fictional “TICKIT” website, where users buy and sell tickets for sporting events, shows, and concerts.

To configure the database schema, I run a few SQL commands to create the users, venue, category, date, event, listing, and sales tables.

Then, I download the tickitdb.zip file that contains the sample data for the database tables. I unzip and load the files to a tickit folder in the same S3 bucket I used when configuring the IAM role.

Now, I can use the COPY command to load the data from the S3 bucket into my database. For example, to load data into the users table:

copy users from 's3://MYBUCKET/tickit/allusers_pipe.txt' iam_role default;

The file containing the data for the sales table uses tab-separated values:

copy sales from 's3://MYBUCKET/tickit/sales_tab.txt' iam_role default delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS';

After I load data in all tables, I start running some queries. For example, the following query joins five tables to find the top five sellers for events based in California (note that the sample data is for the year 2008):

select sellerid, username, (firstname ||' '|| lastname) as sellername, venuestate, sum(qtysold)
from sales, date, users, event, venue
where sales.sellerid = users.userid
and sales.dateid = date.dateid
and sales.eventid = event.eventid
and event.venueid = venue.venueid
and year = 2008
and venuestate = 'CA'
group by sellerid, username, sellername, venuestate
order by 5 desc
limit 5;

Now that my database is ready, let’s see what I can do by configuring Amazon Redshift Serverless namespaces and workgroups.

Using and Configuring Namespaces
Namespaces are collections of database data and their security configurations. In the navigation pane of the Amazon Redshift console, I choose Namespace configuration. In the list, I choose the default namespace that I just created.

In the Data backup tab, I can create or restore a snapshot or restore data from one of the recovery points that are automatically created every 30 minutes and kept for 24 hours. That can be useful to recover data in case of accidental writes or deletes.

In the Security and encryption tab, I can update permissions and encryption settings, including the AWS Key Management Service (AWS KMS) key used to encrypt and decrypt my resources. In this tab, I can also enable audit logging and export the user, connection, and user activity logs.

In the Datashares tab, I can create a datashare to share data with other namespaces and AWS accounts in the same or different Regions. In this tab, I can also create a database from a share I receive from other namespaces or AWS accounts, and I can see the subscriptions for datashares managed by AWS Data Exchange.

When I create a datashare, I can select which objects to include. For example, here I want to share only the date and event tables because they don’t contain sensitive data.

Using and Configuring Workgroups
Workgroups are collections of compute resources and their network and security settings. They provide the serverless endpoint for the namespace they are configured for. In the navigation pane of the Amazon Redshift console, I choose Workgroup configuration. In the list, I choose the default namespace that I just created.

In the Data access tab, I can update the network and security settings (for example, change the VPC, the subnets, or the security group) or make the endpoint publicly accessible. In this tab, I can also enable Enhanced VPC routing to route network traffic between my serverless database and the data repositories I use (for example, the S3 buckets used to load or unload data) through a VPC instead of the internet. To access serverless endpoints that are in another VPC or subnet, I can create a VPC endpoint managed by Amazon Redshift.

In the Limits tab, I can configure the base capacity (expressed in Redshift processing units, or RPUs) used to process my queries. Amazon Redshift Serverless scales the capacity to deal with a higher number of users. Here I also have the option to increase the base capacity to speed up my queries or decrease it to reduce costs.

In this tab, I can also set Usage limits to configure daily, weekly, and monthly thresholds to keep my costs predictable. For example, I configured a daily limit of 200 RPU-hours, and a monthly limit of 2,000 RPU-hours for my compute resources. To control the data-transfer costs for cross-Region datashares, I configured a daily limit of 3 TB and a weekly limit of 10 TB. Finally, to limit the resources used by each query, I use Query limits to time out queries running for more than 60 seconds.

Availability and Pricing
Amazon Redshift Serverless is generally available today in the US East (Ohio), US East (N. Virginia), US East (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Stockholm), and Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo) AWS Regions.

You can connect to a workgroup endpoint using your favorite client tools via JDBC/ODBC or with the Amazon Redshift query editor v2, a web-based SQL client application available on the Amazon Redshift console. When using web services-based applications (such as AWS Lambda functions or Amazon SageMaker notebooks), you can access your database and perform queries using the built-in Amazon Redshift Data API.

With Amazon Redshift Serverless, you pay only for the compute capacity your database consumes when active. The compute capacity scales up or down automatically based on your workload and shuts down during periods of inactivity to save time and costs. Your data is stored in managed storage, and you pay a GB-month rate.

To give you improved price performance and the flexibility to use Amazon Redshift Serverless for an even broader set of use cases, we are lowering the price from $0.5 to $0.375 per RPU-hour for the US East (N. Virginia) Region. Similarly, we are lowering the price in other Regions by an average of 25 percent from the preview price. For more information, see the Amazon Redshift pricing page.

To help you get practice with your own use cases, we are also providing $300 in AWS credits for 90 days to try Amazon Redshift Serverless. These credits are used to cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

Get insights from your data in seconds with Amazon Redshift Serverless.

— Danilo

Making your Go workloads up to 20% faster with Go 1.18 and AWS Graviton

2022-05-31 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/

This blog post was written by Syl Taylor, Professional Services Consultant.

In March 2022, the highly anticipated Go 1.18 was released. Go 1.18 brings to the language some long-awaited features and additions, such as generics. It also brings significant performance improvements for Arm’s 64-bit architecture used in AWS Graviton server processors. In this post, we show how migrating Go workloads from Go 1.17.8 to Go 1.18 can help you run your applications up to 20% faster and more cost-effectively. To achieve this goal, we selected a series of realistic and relatable workloads to showcase how they perform when compiled with Go 1.18.

Overview

Go is an open-source programming language which can be used to create a wide range of applications. It’s developer-friendly and suitable for designing production-grade workloads in areas such as web development, distributed systems, and cloud-native software.

AWS Graviton2 processors are custom-built by AWS using 64-bit Arm Neoverse cores to deliver the best price-performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2). They provide up to 40% better price/performance over comparable x86-based instances for a wide variety of workloads and they can run numerous applications, including those written in Go.

Web service throughput

For web applications, the number of HTTP requests that a server can process in a window of time is an important measurement to determine scalability needs and reduce costs.

To demonstrate the performance improvements for a Go-based web service, we selected the popular Caddy web server. To perform the load testing, we selected the hey application, which was also written in Go. We deployed these packages in a client/server scenario on m6g Graviton instances.

The Caddy web server compiled with Go 1.18 brings a 7-8% throughput improvement as compared with the variant compiled with Go 1.17.8.

We conducted a second test where the client downloads a dynamic page on which the request handler performs some additional processing to write the HTTP response content. The performance gains were also noticeable at 10-11%.

Regular expression searches

Searching through large amounts of text is where regular expression patterns excel. They can be used for many use cases, such as:

Checking if a string has a valid format (e.g., email address, domain name, IP address),
Finding all of the occurrences of a string (e.g., date) in a text document,
Identifying a string and replacing it with another.

However, despite their efficiency in search engines, text editors, or log parsers, regular expression evaluation is an expensive operation to run. We recommend identifying optimizations to reduce search time and compute costs.

The following example uses the Go regexp package to compile a pattern and search for the presence of a standard date format in a large generated string. We observed a 13.5% increase in completed executions with a 12% reduction in execution time.

In a second example, we used the Go regexp package to find all of the occurrences of a pattern for character sequences in a string, and then replace them with a single character. We observed a 12% increase in evaluation rate with an 11% reduction in execution time.

As with most workloads, the improvements will vary depending on the input data, the hardware selected, and the software stack installed. Furthermore, with this use case, the regular expression usage will have an impact on the overall performance. Given the importance of regex patterns in modern applications, as well as the scale at which they’re used, we recommend upgrading to Go 1.18 for any software that relies heavily on regular expression operations.

Database storage engines

Many database storage engines use a key-value store design to benefit from simplicity of use, faster speed, and improved horizontal scalability. Two implementations commonly used are B-trees and LSM (log-structured merge) trees. In the age of cloud technology, building distributed applications that leverage a suitable database service is important to make sure that you maximize your business outcomes.

B-trees are seen in many database management systems (DBMS), and they’re used to efficiently perform queries using indexes. When we tested a sample program for inserting and deleting in a large B-tree structure, we observed a 10.5% throughput increase with a 10% reduction in execution time.

On the other hand, LSM trees can achieve high rates of write throughput, thus making them useful for big data or time series events, such as metrics and real-time analytics. They’re used in modern applications due to their ability to handle large write workloads in a time of rapid data growth. The following are examples of databases that use LSM trees:

InfluxDB is a powerful database used for high-speed read and writes on time series data. It’s written in Go and its storage engine uses a variation of LSM called the Time-Structured Merge Tree (TSM).
CockroachDB is a popular distributed SQL database written in Go with its own LSM tree implementation.
Badger is written in Go and is the engine behind Dgraph, a graph database. Its design leverages LSM trees.

When we tested an LSM tree sample program, we observed a 13.5% throughput increase with a 9.5% reduction in execution time.

We also tested InfluxDB using comparison benchmarks to analyze writes and reads to the database server. On the load stress test, we saw a 10% increase of insertion throughput and a 14.5% faster rate when querying at a large scale.

In summary, for databases with an engine written in Go, you’ll likely observe better performance when upgrading to a version that has been compiled with Go 1.18.

Machine learning training

A popular unsupervised machine learning (ML) algorithm is K-Means clustering. It aims to group similar data points into k clusters. We used a dataset of 2D coordinates to train K-Means and obtain the cluster distribution in a deterministic manner. The example program uses an OOP design. We noticed an 18% improvement in execution throughput and a 15% reduction in execution time.

A widely-used and supervised ML algorithm for both classification and regression is Random Forest. It’s composed of numerous individual decision trees, and it uses a voting mechanism to determine which prediction to use. It’s a powerful method for optimizing ML models.

We ran a deterministic example to train a dense Random Forest. The program uses an OOP design and we noted a 20% improvement in execution throughput and a 15% reduction in execution time.

Recursion

An efficient, general-purpose method for sorting data is the merge sort algorithm. It works by repeatedly breaking down the data into parts until it can compare single units to each other. Then, it decides their order in the intermediary steps that will merge repeatedly until the final sorted result. To implement this divide-and-conquer approach, merge sort must use recursion. We ran the program using a large dataset of numbers and observed a 7% improvement in execution throughput and a 4.5% reduction in execution time.

Depth-first search (DFS) is a fundamental recursive algorithm for traversing tree or graph data structures. Many complex applications rely on DFS variants to solve or optimize hard problems in various areas, such as path finding, scheduling, or circuit design. We implemented a standard DFS traversal in a fully-connected graph. Then we observed a 14.5% improvement in execution throughput and a 13% reduction in execution time.

Conclusion

In this post, we’ve shown that a variety of applications, not just those primarily compute-bound, can benefit from the 64-bit Arm CPU performance improvements released in Go 1.18. Programs with an object-oriented design, recursion, or that have many function calls in their implementation will likely benefit more from the new register ABI calling convention.

By using AWS Graviton EC2 instances, you can benefit from up to a 40% price/performance improvement over other instance types. Furthermore, you can save even more with Graviton through the additional performance improvements by simply recompiling your Go applications with Go 1.18.

To learn more about Graviton, see the Getting started with AWS Graviton guide.

Chaos experiments on Amazon RDS using AWS Fault Injection Simulator

2022-05-01 Anup Sivadas

Post Syndicated from Anup Sivadas original https://aws.amazon.com/blogs/devops/chaos-experiments-on-amazon-rds-using-aws-fault-injection-simulator/

Performing controlled chaos experiments on your Amazon Relational Database Service (RDS) database instances and validating the application behavior is essential to making sure that your application stack is resilient. How does the application behave when there is a database failover? Will the connection pooling solution or tools being used gracefully connect after a database failover is successful? Will there be a cascading failure if the database node gets rebooted for a few seconds? These are some of the fundamental questions that you should consider when evaluating the resiliency of your database stack. Chaos engineering is a way to effectively answer these questions.

Traditionally, database failure conditions, such as a failover or a node reboot, are often triggered using a script or 3rd party tools. However, at scale, these external dependencies often become a bottleneck and are hard to maintain and manage. Scripts and 3rd party tools can fail when called, whereas a web service is highly available. The scripts and 3rd party tools also tend to require elevated permissions to work, which is a management overhead and insecure from a least privilege access model perspective. This is where AWS Fault Injection Simulator (FIS) comes to the rescue.

AWS Fault Injection Simulator (AWS FIS) is a fully managed service for running fault injection experiments on AWS that makes it easier to improve an application’s performance, observability, and resiliency. Fault injection experiments are used in chaos engineering, which is the practice of stressing an application in testing or production environments by creating disruptive events, such as a sudden increase in CPU or memory consumption, database failover and observing how the system responds, and implementing improvements.

We can define the key phases of chaos engineering as identifying the steady state of the workload, defining a hypothesis, running the experiment, verifying the experiment results and making necessary improvements based on the experiment results. These phases will confirm that you are injecting failures in a controlled environment through well-planned experiments in order to build confidence in the workloads and tools we are using to withstand turbulent conditions.

This diagram explains the phases of chaos engineering. We start with identifying the steady state, defining a hypothesis, run the experiment, verify the experiment results and improve. This is a cycle.

Example—

Baseline: we have a managed database with a replica and automatic failover enabled.
Hypothesis: failure of a single database instance / replica may slow down a few requests but will not adversely affect our application.
Run experiment: trigger a DB failover.
Verify: confirm/dis-confirm the hypothesis by looking at KPIs for the application (e.g., via CloudWatch metric/alarm).

Methodology and Walkthrough

Let’s look at how you can configure AWS FIS to perform failure conditions for your RDS database instances. For this walkthrough, we’ll look at injecting a cluster failover for Amazon Aurora PostgreSQL. You can leverage an existing Aurora PostgreSQL cluster or you can launch a new cluster by following the steps in the Create an Aurora PostgreSQL DB Cluster documentation.

Step 1: Select the Aurora Cluster.

The Aurora PostgreSQL instance that we’ll use for this walkthrough is provisioned in us-east-1 (N. Virginia), and it’s a cluster with two instances. There is one writer instance and another reader instance (Aurora replica). The cluster is named chaostest, the writer instance is named chaostest-instance-1, and the reader is named chaostest-intance-1-us-east-1a.

Under RDS Databases Section, the cluster named chaostest is selected. Under the cluster there are two instances which is available. Chaostest-instance-1 is the writer instance and chaostest-instance-1-us-east-1a is the reader instance.

The goal is to simulate a failover for this Aurora PostgreSQL cluster so that the existing chaostest-intance-1-us-east-1a reader instance will switch roles and then be promoted as the writer, and the existing chaostest-instance-1 will become the reader.

Step 2: Navigate to the AWS FIS console.

We will now navigate to the AWS FIS console to create an experiment template. Select Create experiment template.

Under FIS console, Create experiment template needs to be selected.

Step 3: Complete the AWS FIS template pre-requisites.

Enter a Description, Name, and select the AWS IAM Role for the experiment template.

Under create experiment template section, Simulate Database Failover is entered for the Description field. DBFailover is entered for the Name field(Optional). FISWorkshopServiceRole is selected for the IAM role drop down field.

The IAM role selected above was pre-created. To use AWS FIS, you must create an IAM role that grants AWS FIS the permissions required so that the service can run experiments on your behalf. The role follows the least privileged model and includes permissions to act on your database clusters like trigger a failover. AWS FIS only uses the permissions that have been delegated explicitly for the role. To learn more about how to create an IAM role with the required permissions for AWS FIS, refer to the FIS documentation.

Step 4: Navigate to the Actions, Target, Stop Condition section of the template.

The next key section of AWS FIS is Action, Target, and Stop Condition.

Action, Target and Stop Conditions section is highlighted in the image.

Action—An action is an activity that AWS FIS performs on an AWS resource during an experiment. AWS FIS provides a set of pre-configured actions based on the AWS resource type. Each Action runs for a specified duration during an experiment, or until you stop the experiment. An action can run sequentially or in parallel.

For our experiment, the Action will be aws:rds:failover-db-cluster.

Target—A target is one or more AWS resources on which AWS FIS performs an action during an experiment. You can choose specific resources or select a group of resources based on specific criteria, such as tags or state.

For our experiment, the target will be the chaostest Aurora PostgreSQL cluster.

Stop Condition—AWS FIS provides the controls and guardrails that you need to run experiments safely on your AWS workloads. A stop condition is a mechanism to stop an experiment if it reaches a threshold that you define as an Amazon CloudWatch alarm. If a stop condition is triggered while the experiment is running, then AWS FIS stops the experiment.

For our experiment, we won’t be defining a stop condition. This is because this simple experiment contains only one action. Stop conditions are especially useful for experiments with a series of actions, to prevent them from continuing if something goes wrong.

Step 5: Configure Action.

Now, let’s configure the Action and Target for our experiment template. Under the Actions section, we will select Add action to get the New action window.

The action section displays Name,Description, Action Type and Start After fields. There is a Add action button that needs to be selected.

Enter a Name, a Description, and select Action type aws:rds:failover-db-cluster. Start after is an optional setting. This setting allows you to specify an action that should precede the one we are currently configuring.

Under Actions section, DBFailover is entered for the Name field. DB Failover Action is entered for the Description field. aws:rds:failover-db-cluster is entered for the Action Type field. Start after field is left blank and Clusters-Target-1 is selected for the Target field. The save button will save the info entered for the respective fields.

Step 6: Configure Target.

Note that a Target has been automatically created with the name Clusters-Target-1. Select Save to save the action.

Next, you will edit the Clusters-Target-1 target to select the target, i.e., the Aurora PostgreSQL cluster.

Under Targets section, the edit button for Clusters-Target-1 is highlighted.

Select Target method as Resource IDs, and select the chaostest cluster. If you are interested to select a group of resources, then select Resource tags, filters and parameters option.

Under Edit Target section, Clusters-Target-1 is selected for the Name field. aws:rds:cluster is selected for the Resource type field. Action is set as DBFailover. Resource IDs is selected for the Target Method field. Chaostest cluster is selected for the Resource IDs drop down box. Save button is also available in this section to save the configuration.

Step 7: Create the experiment template to complete this stage.

We will wrap up the process by selecting the create experiment template.

Create experiment template option is highlighted. The user will click this button to proceed creating a template.

We will get a warning stating that a stop condition isn’t defined. We’ll enter create in the provided field to create the template.

After selecting the create experiment template option in the previous screen, the user is prompted to enter "create" in the field to proceed.

We will get a success message if the entries are correct and the template will be successfully created.

"You successfully created experiment template" success message is displayed in the screen and its highlighted in green color.

Step 8: Verify the Aurora Cluster.

Before we run the experiment, let’s double-check the chaostest Aurora Cluster to confirm which instance is the writer and which is the reader.

Under the RDS section, chaos cluster is listed. The user is confirming that chaostest-instance-1 is the writer and chaostest-instance-1-us-east-1a is the reader.

We confirmed that chaostest-instance-1 is the writer and chaostest-instance-1-us-east-1a is the reader.

Step 9: Run the AWS FIS experiment.

Now we’ll run the FIS experiment. Select Actions, and then select Start for the experiment template.

Under the experiment template section, the Simulate Database Failover template is selection. Under the actions section, option Start is selected to start the experiment. The other options under Actions section includes Update, Manage tags and Delete.

Select Start experiment and you’ll get another warning to confirm if you really want to start this experiment. Confirm by entering start say Start experiment.

Under the Start Experiement section, user is promoted to enter "start" in the field to start the experiement.

Step 10: Observe the various stages of the experiment.

The experiment will be in initiating, running and will eventually be in completed states.

The experiment is in initiating state.

The experiment is in Complete state.

Step 11: Verify the Aurora Cluster to confirm failover.

Now let’s look at the chaostest Aurora PostgreSQL cluster to check the state. Note that a failover was indeed triggered by FIS and chaostest-instance-1-us-east-1a is the newly promoted writer and chaostest-instance-1 is the reader now.

Under RDS Section, Chaostest cluster is shown. This time, the writer is chaostest-instance-1-us-east-1a.

Step 12: Verify the Aurora Cluster logs.

We can also confirm the failover action by looking at the Logs and events section of the Aurora Cluster.

Under the Recent Events section of the chaos-test cluster, the failover messages is displayed. One of the messages lists "Started cross AZ failover to DB instance:chaostest-instance-1-us-east-1a. This confirms that the experiment was successful.

Clean up

If you created a new Aurora PostgreSQL cluster for this walkthrough, then you can terminate the cluster to optimize the costs by following the steps in the Deleting an Aurora DB cluster documentation.

You can also delete the AWS FIS experiment template by following the steps in the Delete an experiment template documentation.

You can refer to the AWS FIS documentation to learn more about the service. If you want to know more about chaos engineering, check out the AWS re:Invent session Testing resiliency using chaos engineering and The Chaos Engineering Collection. Finally, check out the following GitHub repo for additional example experiments, and how you can work with AWS FIS using the AWS Cloud Development Kit (AWS CDK).

Conclusion

In this walkthrough, you learned how you can leverage AWS FIS to inject failures into your RDS Instances. To get started with AWS Fault Injection Service for Amazon RDS, refer to the service documentation.

Author:

Financial Crime Discovery using Amazon EKS and Graph Databases

2022-02-01 Severin Gassauer-Fleissner

Post Syndicated from Severin Gassauer-Fleissner original https://aws.amazon.com/blogs/architecture/financial-crime-discovery-using-amazon-eks-and-graph-databases/

Discovering and solving financial crimes has become a challenge due to an increasing amount of financial data. While storing transactional payment data in a structured table format is useful for searching, filtering, and calculations, it is not always an ideal way to represent transactional data. For example, determining if there is a suspicious financial relationship between entity A and entity B is difficult to visualize in a table. Using tables, we would have to do SQL joins for every possible transaction from entity A to every possible subsequent transaction. We would then have to iterate this process until we found a relationship to entity B. Moreover, certain queries are challenging to run on a relational database management system (RDBMS). For example, it can be quite time consuming to discover which account received a minimum amount of $10,000,000 from other accounts.

Graph databases such as Amazon Neptune can be helpful with performing queries, because they can traverse the data and perform calculations simultaneously. Graphs enable us to represent transactions and parties over a multi-connected network, and discover patterns and chains of connections. It is common to use them in anti-money laundering (AML) applications, as they can help find patterns of suspicious transactions.

We needed a solution that could scale and process millions of transactions, by effectively using high memory and CPU configurations to perform complex queries quickly. As part of our customer demonstration to show how graph databases can help discover financial crimes, we also sourced a large dataset on which to test the solution. We used a graph database, Amazon Elastic Kubernetes Service (EKS), and Amazon Neptune, to search for suspicious financial chains across large amounts of transactional data in minutes.

Overview of our conceptual financial crime discovery solution

Figure 1. Workflow for financial crime discovery

We first needed to find a rules engine that could perform transaction inferencing and reasoning. It had to be able to process various rules on our data, be efficient, and able to scale. Next, we needed a straightforward way to ingest data into the solution. Once we had the data available, we needed to initiate a task to begin our inference job. Finally, we needed a location to store the result for further analysis and persistence, see Figure 1.

Using the AWS Cloud to scale up a graph database

We used multiple AWS services to create a fully automated end-to-end batch-based transaction process, shown in Figure 2. We used RDFox, which is an AWS Marketplace product, created by Oxford Semantic Technologies. RDFox is a high-performance in-memory graph database and semantic reasoner. To orchestrate RDFox, we used Amazon EKS Autoscaler to spin up a cluster to instantiate the RDFox container. Amazon EKS can spin up multiple containers for difference inference jobs and recycle the resources when the job finishes. We also used Amazon Neptune, an Amazon managed distributed graph database that can store up to 64 TiB of the results for diagnosis and long-term retention.

The data is stored in Amazon S3 buckets, which provide a streamlined way to feed a large dataset for processing.

Figure 2. Architecture diagram for financial crime discovery

Financial crimes rules

The power of graphs can help discover financial crimes that are reflected in relationships and monetary transactions. To demonstrate this, we will write two rules to detect two scenarios:

Given two suspicious parties X and Y, find out if there is a transactional relationship between them, and if so, provide the chain that connects them. This is a common scenario that financial institutions must detect.
Given a chain identifying suspicious behavior, find out if the minimum transaction amount that reached the beneficiary exceeds a threshold of Z dollars.

Generating data

To generate data for testing, we used a synthetic data generator written in Python (see GitHub repo in References). The generator built two sets of graph artifacts – parties and transactions. Every transaction is being paired with two random parties, and this iterative process creates a network of connected transactions and parties.

We created a dataset with a small percentage (0.01%) of parties tagged as “Suspicious Party,” to simulate the preceding business scenario. Note that those parties will have transactions going both in and out. In some cases, this will collide with other suspicious parties and establish a chain. This method enables us to get simulated data without engineering the suspicious chains.
The test dataset used with this solution comprises 1M transactions and thousands of parties.
For more information on generating data, see GitHub: Transaction Chains Data Generator.

Walkthrough of the financial investigator workflow

Once deployed, this solution can assist investigators as follows:

An investigator places the transactions and party data (nt triples) in a subdirectory within the input bucket. Typically, subdirectories can be named as a date or range of dates. In addition, the investigator uploads the particular rules (dlog files) and queries (rq files) that must be processed on the data.
Once the data is ready, the investigator uploads a job spec file (simple JSON, see References section). This contains the description of what resources the job requires (CPUs and memory), along with other configurations for the job.
Once the job spec has been uploaded to the bucket’s subdirectory, the job is automatically initiated. The Kubernetes scheduler will allocate enough resources to initiate the RDFox pod. The containers in the pod will then load, process, query results, and upload them to the output bucket.
Once the data reaches the output bucket, an AWS Lambda function is initiated. This invokes the Amazon Neptune Bulk Loader, which asynchronously loads the results to the Neptune cluster in a parallelized manner.
Once the load completes, the investigator gets an email notification that the job has been completed, and the results are ready for view.

Additionally:

The investigator can upload multiple rules and queries, they will all be processed automatically.
The investigator can launch multiple jobs with different/same data, and with different rules and queries at the same time.
All jobs outputs are saved in a unique job ID subdirectory in the output bucket.

To create the solution in your account, follow the instruction here: GitHub repo

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
RDFox license, request a free trial
Terraform
Familiarity with Kubernetes, Amazon Neptune, and SPARQL queries

Rules

We create two materialization rule sets to fetch the two scenarios described.

1. detect-suspicious-parties-pair.dlog

The purpose of this first set of rules is to detect chains that might exist between two suspicious parties. The idea of these chains is to represent all the possible relationships that contain monetary transfers between a suspicious originator and the beneficiary. This will include non-suspicious parties in the chain. The rule tags these chains with the “SuspiciousChain” flag.

2. detect-chains-exceed-100-dollars.dlog

This set of rules is designed to tag the chains identified by the first set of rules. It also contains a minimum amount of $100 passed to the beneficiary. We can change the amount to check for different compliance requirements. We tag those chains as “HighValueChain.”

Run the job and check your results

Now we can run our job, with the given data, rules, and two additional queries (to extract “SuspiciousChain” and “HighValueChain” respectively). The result of the queries will be loaded to Amazon Neptune automatically for persistent storage, and is made durably available for further analysis.

Let’s look at the results. The following query can be initiated against RDFox console or Amazon Neptune.

Visualization

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>

PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>

PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>

PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT ?S ?P ?O WHERE {

?S a type:SuspiciousChain .

?S ?P ?O .

}

Figure 3. Visualizing suspicious chains

Whoa! Figure 3 might look complicated at first, but this is because we are visualizing every pair of suspicious parties that have a relationship with another suspicious party. Let’s filter the query to look only at a single particular chain, which exceeds a minimum of $100 to the beneficiary. The following query can be executed against RDFox console or Amazon Neptune.

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>
PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>
PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>
PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT * WHERE {
?S ?P ?O
{
SELECT ?S WHERE {
?S a type:HighValueChain .

} Limit 1
}
}

Figure 4. Visualizing a particular chain

In Figure 4, we can see that Allison, the suspicious originator of the chain, has sent a transaction to Troy. Troy, who is not suspicious, sent the transaction to Karina. Karina is the suspicious beneficiary. We can also see additional information, such as the transaction amount that Karina received, and the chain length of 2 in this case.

In our testing, we were able to scale up to 500M transactions with 50M parties and process this in less than two hours! And we performed this at a significant lower cost when compared to running constant, fixed similar hardware.

Cleaning up

Follow Terraform cleanup instructions.

Conclusion

Graph databases are a powerful tool to apply reasoning on complex financial relationships. The combination of Amazon Web Services and the RDFox engine results in an automated, scalable, and cost-effective, thanks to the dynamic Kubernetes Cluster Autoscaler. Customers can use this solution and provide their investigators with a tool they can experiment and reason on financial transactions. This solution simplifies the process, and makes it easier to try different rules and queries on complex large data collections.

This blog post is written with Oxford Semantic.

References

Solution:

Solution GitHub repository
Data generator repository
Low-level design documents and architecture
Examples for submitting a job

New DynamoDB Table Class – Save Up To 60% in Your DynamoDB Costs

2021-12-01 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/new-dynamodb-table-class-save-up-to-60-in-your-dynamodb-costs/

Today we are announcing Amazon DynamoDB Standard-Infrequent Access (DynamoDB Standard-IA). A new table class for DynamoDB that reduces storage costs by 60 percent compared to existing DynamoDB Standard tables, and that delivers the same performance, durability, and scaling.

Nowadays, many customers are moving their infrequently accessed data between DynamoDB and Amazon Simple Storage Service (Amazon S3). This means that customers are developing a process to migrate the data and build complex applications that must support two different APIs—one for DynamoDB and another for Amazon S3.

DynamoDB Standard-IA table class is designed for customers who want a cost-optimized solution for storing infrequently accessed data in DynamoDB without changing any application code. Using this new table class, you get the single-digit millisecond read and write performance from DynamoDB and use all of the same APIs.

When you use DynamoDB Standard-IA table class, you will save up to 60 percent in storage costs as compared to using the DynamoDB Standard table class. However, DynamoDB reads and writes for this new table class are priced higher than the Standard tables. Therefore, it is important to understand your use cases before applying this new table class to your tables.

DynamoDB Standard-IA is a great solution if you must store terabytes of data for several years where the data must be highly available, but it is not frequently accessed. An example is social media applications where end users rarely access their old posts. However, these posts remain stored, because if someone scrolls on a profile to see an old photo from 2009, they should be able to retrieve it as fast as if it was a newer post.

E-commerce sites are another good use case. These sites might have a lot of products that are not frequently accessed, but administrators of the site still want to have them available in their store just in case someone wants to buy them. Furthermore, this is a good solution for storing a customer’s previous orders. DynamoDB Standard-IA table offers the ability to retain historical orders at a lower cost.

Get started using DynamoDB Standard-IA
Get started using DynamoDB Standard-IA by evaluating the best class for your existing tables.

Go to the table page and select Update the table class in the Actions dropdown to change the table class. Then, choose the new table class and save the changes. You can change the table class for an existing table to be Standard-IA or Standard twice every 30-days with no impact on performance or availability. All of the features of DynamoDB are available when using a table in the Standard-IA table class.

Moreover, you can also create a new table with the DynamoDB Standard-IA table class.

Availability and Pricing
DynamoDB Standard-IA is available in all of the AWS Regions, except the China Regions and AWS GovCloud.

For example, DynamoDB Standard-IA storage pricing in US East (N. Virginia) is now $0.10 per GB (60 percent less than DynamoDB Standard), while reads and writes are 25 percent higher.

For more information about this feature and its pricing, see the DynamoDB Standard-IA Feature page and the DynamoDB pricing page.

— Marcia

Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure

2021-11-30 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/

We’re seeing the use of data analytics expanding among new audiences within organizations, for example with users like developers and line of business analysts who don’t have the expertise or the time to manage a traditional data warehouse. Also, some customers have variable workloads with unpredictable spikes, and it can be very difficult for them to constantly manage capacity.

With Amazon Redshift, you use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Today, I am happy to introduce the public preview of Amazon Redshift Serverless, a new capability that makes it super easy to run analytics in the cloud with high performance at any scale. Just load your data and start querying. There is no need to set up and manage clusters. You pay for the duration in seconds when your data warehouse is in use, for example, while you are querying or loading data. There is no charge when your data warehouse is idle.

Amazon Redshift Serverless automatically provisions the right compute resources for you to get started. As your demand evolves with more concurrent users and new workloads, your data warehouse scales seamlessly and automatically to adapt to the changes. You can optionally specify the base data warehouse size to have additional control on cost and application-specific SLAs.

With the new serverless option, you can continue to query data in other AWS data stores, such as Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Aurora and Amazon Relational Database Service (RDS) databases.

Amazon Redshift Serverless is ideal when it is difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. This approach is also a good fit for ad-hoc analytics needs that need to get started quickly and for test and development environments.

Let’s see how this works in practice.

Using Amazon Redshift Serverless
I go to the Amazon Redshift console and choose the new serverless option. The first time, I set up the serverless endpoint and configure networking and security.

I confirm the default settings that use all subnets in my default Amazon Virtual Private Cloud (VPC) and its default security group. Data is always encrypted, and I use the default AWS-owned key. Optionally, I can customize all settings. I can associate now or later the AWS Identity and Access Management (IAM) roles to give permissions to access other AWS resources, for example, to be able to load data from an S3 bucket. The configuration of the serverless endpoint will be shared by all my serverless data warehouses in the same AWS account and Region.

To query data, I use Amazon Redshift Query Editor V2, a new free web-based tool that we made available a few months back. The query editor provides quick access to a few sample datasets to make it easy to learn Amazon Redshift’s SQL capabilities: TPC-H, TPC-DS, and tickit, a dataset containing information on ticket sales for events.

For a quick test, I use the tickit sample dataset so I don’t need to load any data. I prepare a query to get the list of tickets sold per date, sorted to see the dates with more sales first:

SELECT caldate, sum(qtysold) as sumsold
FROM   tickit.sales, tickit.date
WHERE  sales.dateid = date.dateid 
GROUP BY caldate
ORDER BY sumsold DESC;

By using the web-based query editor, I don’t need to configure a SQL client or set up the network permissions to reach the serverless endpoint. Instead, I just write my SQL query and run it.

I am a visual person. I enable the Chart option on the right of the result table and select a bar chart.

Satisfied with the clarity of the chart, I export it as an image file. In this way, I can quickly share it or include it in a report.

Amazon Redshift Serverless supports all rich SQL functionality of Amazon Redshift such as semi-structured data support. I can use any JDBC/ODBC-compliant tool or the Amazon Redshift Data API to query my data. To migrate data, I can take a snapshot of an Amazon Redshift provisioned cluster and restore it as serverless. Then, I just need to update my SQL applications to use the new serverless endpoint.

Availability and Pricing
Amazon Redshift Serverless is available in public preview in the following AWS Regions: US East (N. Virginia), US West (N. California, Oregon), Europe (Frankfurt, Ireland), Asia Pacific (Tokyo).

With Amazon Redshift Serverless, you pay separately for the compute and storage you use. Compute capacity is measured in Redshift Processing Units (RPUs), and you pay for the workloads in RPU-hours with per-second billing. For storage, you pay for data stored in Amazon Redshift-managed storage and storage used for snapshots, similar to what you’d pay with a provisioned cluster using RA3 instances.

To control your costs, you can specify usage limits and define actions that Amazon Redshift automatically takes if those limits are reached. You can specify usage limits in RPU-hours and associated with a daily, weekly, or monthly duration. Setting higher usage limits can improve the overall throughput of the system, especially for workloads that need to handle high concurrency while maintaining consistently high performance.

Compute resources automatically shutdown behind the scenes when there is no activity and resume when you are loading data, or there are queries coming in. When accessing your S3 data lake via the new serverless endpoint, you do not pay for Amazon Redshift Spectrum separately. You have a unified serverless experience and pay for data lake queries also in RPU-seconds. For more information, see the Amazon Redshift pricing page.

The serverless end point is configured at the AWS account level. If you have multiple teams or projects and want to manage costs separately, you can use separate AWS accounts. You can share data between your provisioned clusters and serverless endpoint and between serverless endpoints across accounts.

To help you get practice, we provide you upfront with $500 in AWS credits to try the Amazon Redshift Serverless public preview. You get the credits when you first create a database with Amazon Redshift Serverless. These credits are used to cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

Start using Amazon Redshift Serverless today to run and scale analytics without having to provision and manage data warehouse clusters.

— Danilo

Handy Tips #12: Optimizing Zabbix database size with custom data storage periods

2021-11-18 Arturs Lontons

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/handy-tips-12-optimizing-zabbix-database-size-with-custom-data-storage-periods/17396/

Zabbix allows its users to configure custom data retention periods for different types of data – from history and trend storage periods to user session storage periods.

Data retention requirements can vary a lot between different environments. With considerations to data storage footprint and company policies, some environments might require storing months of historical data, while others are fine with storing mostly trends.

Use housekeeping settings to define custom data storage periods:

Storage periods can be defined for history, trends, events, and more
Unique storage periods can be defined for each individual item

TimescaleDB backends support native data partitioning and compression
Housekeeping for individual data types can be disabled – not recommended in production environments

Check out the video to learn how to define data storage periods on your Zabbix instance.

How to define data storage periods on your Zabbix instance:

Navigate to Configuration → Hosts and click on the Items button next to an existing host
Select any integer or float item
Set the History storage period to 30d, Trend storage period to 180d
Save the item
Navigate to Administration → General → Housekeeping
Set the Trigger data storage period to 90d
Tick the checkbox next to the Override item history period option
Set the History storage period to 90d
Navigate back to Configuration → Hosts and click on your host
Click on the Items next to your host and find the previously modified item
Click on the green i next to the History storage period
Read the override notification

Tips and best practices::

Usually, long term storage of internal, network discovery, and autoregistration events is not required
Item and trend storage periods can be overridden by global settings
Storage period will not be overridden for items that have Do not keep history or Do not keep trends enabled
An event will not be removed until the associated problem is resolved

Accelerate self-service analytics with Amazon Redshift Query Editor V2

2021-11-08 Bhanu Pittampally

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/accelerate-self-service-analytics-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics platform. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Query Editor V2 lets users explore, analyze, and collaborate on data. You can use Query Editor V2 to create databases, schemas, tables, and load data from Amazon Simple Storage Service (S3) either using COPY command or using a wizard . You can browse multiple databases and run queries on your Amazon Redshift data warehouse, data lake, or federated query to operational databases such as Amazon Aurora.

From the smallest start-ups to worldwide conglomerates, customers across the spectrum tell us they want to promote self-service analytics by empowering their end-users, such as data analysts and business analysts, to load data into their analytics platform. Analysts at these organizations create tables and load data in their own workspace, and they join that with the curated data available from the data warehouse to gain insight. This post will discuss how Query Editor V2 accelerates self-service analytics by enabling users to create tables and load data with simple wizards.

The Goal to Accelerate and Empower Data Analysts

A common practice that we see across enterprises today is that more and more enterprises are letting data analysts or business analysts load data into their user or group workspaces that co-exist on data warehouse platforms. Enterprise calls these personal workspaces, departmental schemas, project-based schemas or labs, and so on. The idea of this approach is to empower data analysts to load data sets by themselves and join curated data sets on a data warehouse platform to accelerate the data analysis process.

Amazon Redshift Query Editor V2 makes it easy for administrators to create the workspaces, and it enables data analysts to create and load data into the tables. Query Editor V2 lets you easily create external schemas in Redshift Cluster to extend the data warehouse to a data lake, thereby accelerating analytics.

An example Use case

Let’s assume that an organization has a marketing department with some power users and regular users. In this example, let’s also consider that the organization already has an Enterprise Data Warehouse (EDW) powered by Amazon Redshift. The marketing department would like to have a workspace created for their team members.

A visual depiction of a Data Warehouse Environment may look like the following figure. Enterprises let user/group schemas be created along with an EDW, which contains curated data sets. Analysts can create and load exploratory data sets into user schemas, and then join curated data sets available in the EDW.

ScopeofSolution

Amazon Redshift provides several options to isolate your users’ data from the enterprise data warehouse data,. Amazon Redshift data sharing lets you share data from your EDW cluster with a separate consumer cluster. Your users can consume the EDW data and create their own workspace in the consumer cluster. Alternatively, you can create a separate database for your users’ group workspace in the same cluster, and then isolate each user group to have their own schema. Amazon Redshift supports queries of data joining across databases, and then users can join their tables with the curated data in the EDW. We recommend you use the data sharing option that lets you isolate both compute and data. Query Editor v2 supports both scenarios.

Once you have enabled your data analysts to have their own workspace and provided the relevant privileges, then they can easily create Schema, table, and load data.

Prerequisites

You have an Amazon Redshift cluster, and you have configured the Query Editor V2. You can view the Simplify Data Analysis with Amazon Redshift Query Editor V2 post for instructions on setting up Query Editor V2.
For loading your data from Amazon S3 into Amazon Redshift, you will start by creating an IAM role to provide permissions to access Amazon S3 and grant that role to the Redshift cluster. By default, Redshift users assume that the IAM role is attached to the Redshift cluster. You can find the instructions in the Redshift getting started guide.
For users who want to load data from Amazon S3, Query Editor V2 provides an option to browse S3 buckets. To use this feature, users should have List permission on the S3 bucket.

Create Schemas

The Query Editor V2 supports the schema creation actions. Likewise, admins can create both native and external schemas by creating Schema wizard.

CreateSchemas

As a user, you can easily create a “schema” by accessing Create Schema wizard available from the Create button, and then selecting “Schema” from the drop-down list, as shown in the following screenshot.

If you select the Schema from the drop-down list, then the Create Schema wizard similar to the following screenshot is displayed. You can choose a local schema and provide a schema name.

Optionally, you can authorize a user to authorize users to create objects in the Schema. When the Authorize user check box is selected, then Create and Usage access are granted to the user. Now, Janedoe can create objects in this Schema.

Let’s assume that the analyst user Janedoe logs in to Query Editor V2 and logs in to the database and wants to create table and load data into their personal workspace.

Creating Tables

The Query Editor V2 provides a Create table wizard for users to create a table quickly. It allows power users to auto-create the table as based on a data file. Users can upload the file from their local machine and let Query Editor V2 figure out the data types and column widths. Optionally, you can change the column definition, such as encoding and table properties.

Below is a sample CSV file with a row header and sample rows from the MarketingCampaign.csv file. We will demonstrate how to create a table based on this file in the following steps.

SampleData

The following screenshot shows the uploading of the MarketingCampaing.csv file into Query Editor V2.

Create Table Wizard has two sections:

Columns

The Columns tab lets users select a file from their local desktop and upload it to Query Editor V2. Users can choose Schema from the drop-down option and provide a table name.

Query Editor V2 automatically infers columns and some data types for each column. It has several options to choose from to set as column properties. For example, you have the option to change column names, data types, and encoding. If you do not choose any encoding option, then the encoding choice will be selected automatically. You can also add new fields, for example, an auto-increment ID column, and define properties for that particular identity column.

Table Details

You can use the Create Table wizard to create a temporary table or regular table with the option of including it in automatic backups. The temporary table is available until the end of the session and is used in queries. A temporary table is automatically dropped once a user session is terminated.

The “Table Details” is optional, as Amazon Redshift’s Automatic Table Optimization feature takes care of Distribution Key and Sort Key on behalf of users.

Viewing Create Table Statement

Once the column and table level detail is set, Query Editor V2 gives an option to view the Create table statement in Query Editor tab. This lets users save the definition for later use or share it with other users. Once the user reviews the create table definition, then the user can hit the “Run” button to run the query. Users can also directly create a table from the Create table wizard.

The following screenshot shows the Create table definition for the marketing campaign data set.

CreateTable3

Query Editor V2 lets users view table definitions in a table format. The following screenshot displays the table that we created earlier. Note that Redshift automatically inferred encoding type for each column. As the best practice, it skipped for “Dt_Customer“, as it was set as the sort key. When creating the table, we did not set the encodings for columns, as Redshift will automatically set the best compression methods for each column.

Query Editor V2 distinguishes columns by data types in a table by using distinct icons for them.

You can also view the table definition by right-clicking on the table and selecting the show definition option. You can also generate a template select command, and drop or truncate the table by right-clicking on a table.

Loading Data

Now that we have created a schema and a table, let’s learn how to upload the data to the table that we created earlier.

Query Editor V2 provides you with the ability to load data for S3 buckets to Redshift tables. The COPY command is recommended to load data in Amazon Redshift. The COPY command leverages the massively parallel processing capabilities of Redshift.

The Load Data wizard in the Query Editor V2 loads data into Redshift by generating the COPY command. As a data analyst, you don’t have to remember the intricacies of the COPY command.

You can quickly load data from CSV, JSON, ORC, or Parquet files to an existing table using the Load Data Wizard. It supports all of the options in the COPY command. The Load Data Wizard lets Data analysts build a COPY command with an easy-to-use GUI.

The following screenshot shows an S3 bucket that has our MarketingCampaign.csv file. This is a much larger file that we used to create the table using Create table wizard. We will use this file to walk you through the Load Data wizard.

LoadData1

The Load Data wizard lets you browse your available S3 bucket and select a file or folder from the S3 bucket. You can also use a manifest file. A manifest file lets you make sure that all of the files are loaded using the COPY command. You can find more information about manifest files here.

The Load Data Wizard lets you enter several properties, such as the Redshift Cluster IAM role and whether data is encrypted. You can also set file options. For example, in the case of CSV, you can set delimiter and quote parameters. If the file is compressed, then you can provide compression settings.

With the Data Conversion Parameters, you can select options like Escape Characters, time format, and if you want to ignore the header in your data file. The Load Operations option lets you set compression encodings and error handling options.

Query Editor V2 lets you browse S3 objects, thereby making it easier to navigate buckets, folders, and files. Below screens displays the flow

Query Editor V2 supports loading data of many open formats, such as JSON, Delimiter, FixedWidth, AVRO, Parquet, ORC, and Shapefile.

In our example, we are loading CSV files. As you can see, we have selected our MarketingCampaing.csv file and set the Region, and then selected the Resfhift cluster IAM Role.

For the CSV file, under additional File Options, Delimiter Character and Quote Character are set with “;” and an empty quote in the below screen.

Once the required parameters are set, continue to next step to load data. Load Data operation builds a copy command and automatically loads it into Query Editor Tab, and then invokes the query.

LoadData5

Data is loaded into the target table successfully, and now you can run a query to view that data. The following screen shows the result of the select query executed on our target table:

LoadData6

Viewing load errors

If your COPY command fails, then these are logged into STL_LOAD_ERRORS system table. Query Editor v2 simplifies the viewing of the common errors by showing the errors in-place as shown in the following screenshot:

LoadData7

Saving and reusing the queries

You can save the load queries for future usage by clicking on the saved query and providing a name in the saved query.

You would probably like to reuse the load query in the future to load data in from another S3 location. In that case, you can use the parameterized query by replacing the S3 URL of the as shown in the following screenshot:

SavingQ2

You can save the query, and then share the query with another user.

When you or other users run the query, a prompt for the parameter will appear as in the following screenshot:

SavingQ3

We discussed how data analysts could load data into their own or the group’s workspace.

We will now discuss using Query Editor V2 to create an external schema to extend your data warehouse to the data lake.

Extending the Data Warehouse to the Data Lake

Extending Data warehouses to Data lakes is part of modern data architecture practices. Amazon Redshift enables this with seamless integration through Data lake running on AWS. Redshift uses Spectrum to allow this extension. You can access data lakes from the Redshift Data warehouse by creating Redshift external schemas.

Query Editor V2 lets you create an external schema referencing an external database in AWS Glue Data Catalogue.

To extend your Data Warehouse to Data Lake, you should have an S3 data lake and AWS Glue Data Catalog database defined for the data lake. Grant permission on AWS Glue to Redshift Cluster Role. You can find more information about external Schema here.

You can navigate to the Create External Schema by using Create Schema wizard, and then selecting the External Schema as shown in the following screenshot:

The Query Editor V2 makes the schema creation experience very easy by hiding the intricacies of the create external schema syntax. You can use the simple interface and provide the required parameters, such as Glue data regions, external database name, and the IAM role. You can browse the Glue Catalog and view the database name.

After you use the create schema option, you can see the schemas in the tree-view. The Query Editor V2 uses distinct icons to distinguish between native Schema and external Schema.

Viewing External Table Definitions

The Query Editor V2 lets data analysts quickly view objects available in external databases and understand their metadata.

You can view tables and columns for a given table by clicking on external Schema and then on a table. When a particular table is selected, its metadata information is displayed in the bottom portion below the tree-view panel. This is a powerful feature, as an analyst can easily understand the data residing externally in the data lake.

You can now run queries against external tables in the external Schema.

In our fictitious enterprise, Marketing Department team members can load data in their own workspace and join the data from their own user/group workspace with the curated data in the enterprise data warehouse or data lake.

Conclusion

This post demonstrated how the Query Editor V2 enabled data analysts to create tables and load data from Amazon S3 easily with a simple wizard.

We also discussed how Query Editor V2 lets you extend the data warehouse to a data lake. The data analysts can easily browse tables in your local data warehouse, data shared from another cluster, or tables in the data lake. You can run queries that can join tables in your data warehouse and data lake. The Query Editor V2 also provides several features for the collaboration of query authoring. You can view the earlier blog to learn more about how the Query Editor V2 simplifies data analysis.

These features let organizations accelerate self-service analytics and end-users deliver the insights faster.

Happy querying!

About the Authors

Bhanu Pittampally is Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse – architecture, development and administration. He is in data and analytical field for over 13 years. His Linkedin profile is here.

Debu Panda is a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Cansu Aksu is a Front End Engineer at AWS, has a several years of experience in developing user interfaces. She is detail oriented, eager to learn and passionate about delivering products and features that solve customer needs and problems

Chengyang Wang is a Frontend Engineer in Redshift Console Team. He worked on a number of new features delivered by redshift in the past 2 years. He thrives to deliver high quality products and aim to improve customer experience from UI

Goodbye Microsoft SQL Server, Hello Babelfish

2021-10-28 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/goodbye-microsoft-sql-server-hello-babelfish/

Many of our customers are telling us they want to move away from commercial database vendors to avoid expensive costs and burdensome licensing terms. But migrating away from commercial and legacy databases can be time-consuming and resource-intensive. When migrating your databases, you can automate the migration of your database schema and data using the AWS Schema Conversation Tool and AWS Database Migration Service. But there is always more work to do to migrate the application itself, including rewriting application code that interacts with the database. Motivation is there, but costs and risks are often limiting factors.

Today, we are making Babelfish for Aurora PostgreSQL available. Babelfish allows Amazon Aurora PostgreSQL-Compatible Edition to understand the SQL Server wire protocol. It allows you to migrate your SQL Server applications to PostgreSQL cheaper, faster, and with less risks involved with such change.

You can migrate your application in a fraction of the time that a traditional migration would require. You continue to use the existing queries and drivers your application uses today. Just point the application to an Amazon Aurora PostgreSQL database with Babelfish activated. Babelfish adds the capability to Amazon Aurora PostgreSQL to understand the SQL Server wire protocol Tabular Data Stream (TDS), as well as extending PostgreSQL to understand commonly used T-SQL commands used by SQL Server. Support for T-SQL includes elements such as the SQL dialect, static cursors, data types, triggers, stored procedures, and functions. Babelfish reduces the risk associated with database migration projects by significantly reducing the number of changes required to the application. When adopting Babelfish, you save on licensing costs of using SQL Server. Amazon Aurora provides the security, availability, and reliability of commercial databases at 1/10th the cost.

SQL Server has evolved over more than 30 years, and we do not expect to support all functionalities right away. Instead, we focused on the most common T-SQL commands and returning the correct response or an error message. For example, the MONEY datatype has different characteristics in SQL Server (with four decimals precision) and PostgreSQL (with two decimals precision). Such a subtle difference might lead to rounding errors and have a significant impact on downstream processes, such as financial reporting. In this case, and many others, Babelfish ensures the semantics of SQL Server data types and T-SQL functionality are preserved: we created a MONEY datatype that behaves as SQL Server apps would expect. When you create a table with this datatype through the Babelfish connection, you get this compatible datatype and behaviors that a SQL Server app would expect.

Create a Babelfish Cluster Using the Console
To show you how Babelfish works, let’s first connect to the console and create a new Amazon Aurora PostgreSQL cluster. The procedure is no different than for the regular Amazon Aurora database. In the RDS launch wizard, I first make sure I select an Aurora version compatible with PostgreSQL 13.4, or more recent. The updated console has additional filters to help you select the versions that are compatible with Babelfish.

Then, lower on the page, I select the option Turn on Babelfish.

Under Monitoring section, I also make sure I turn off Enable Enhanced monitoring. This option requires additional IAM permissions and preparation that are not relevant for this demo.

After a couple of minutes, my cluster is created, it has two instances, one writer and one reader.

Create a Babelfish Cluster Using the CLI
Alternatively, I may use the CLI to create a cluster. I first create a parameter group to activate Babelfish (the console does it automatically):

aws rds create-db-cluster-parameter-group             \
    --db-cluster-parameter-group-name myapp-babelfish \
    --db-parameter-group-family aurora-postgresql13   \
    --description "babelfish APG 13"
aws rds modify-db-cluster-parameter-group             \
    --db-cluster-parameter-group-name myapp-babelfish \
    --parameters "ParameterName=rds.babelfish_status,ParameterValue=on,ApplyMethod=pending-reboot" \

Then I create the database cluster (when using the command below, adjust the security group id and the subnet group name) :

aws rds create-db-cluster \
    --db-cluster-identifier awsnewblog-cli-demo \
    --master-username postgres \  
    --master-user-password Passw0rd \
    --engine aurora-postgresql \
    --engine-version 13.4 \
    --vpc-security-group-ids sg-abcd1234 \
    --db-subnet-group-name default-vpc-1234abcd \
    --db-cluster-parameter-group-name myapp-babelfish
{
    "DBCluster": {
        "AllocatedStorage": 1,
        "AvailabilityZones": [
            "us-east-1c",
            "us-east-1d",
            "us-east-1a"
        ],
        "BackupRetentionPeriod": 1,
        "DBClusterIdentifier": "awsnewblog-cli-demo",
        "Status": "creating",
        ... <redacted for brevity> ...
    }
}

Once the cluster is created, I create an instance using

aws rds create-db-instance \
    --db-instance-identifier myapp-db1 \
    --db-instance-class db.r5.4xlarge \
    --db-subnet-group-name default-vpc-1234abcd \
    --db-cluster-identifier awsnewblog-cli-demo \
    --engine aurora-postgresql
{
    "DBInstance": {
        "DBInstanceIdentifier": "myapp-db1",
        "DBInstanceClass": "db.r5.4xlarge",
        "Engine": "aurora-postgresql",
        "DBInstanceStatus": "creating",
        ... <redacted for brevity> ...

Connect to the Babelfish Cluster
Once the cluster and instances are ready, I connect to the writer instance to create the database itself. I may connect to the instance using SQL Server Management Studio (SSMS) or other SQL client such as sqlcmd. The Windows client must be able to connect to the Babelfish cluster, I made sure the RDS security group authorizes connections from the Windows host.

Using SSMS on Windows, I select New Query in the toolbar, I enter the database DNS name as Server name. I select SQL Server Authentication and I enter the database Login and Password. I click on Connect.

Important: Do not connect via the SSMS Object Explorer. Be sure to connect using the query editor via the New Query button. At this time, Babelfish supports the query editor, but not the Object Explorer.

Once connected, I check the version with select @@version statement and click the green Execute button in the toolbar. I can read the statement result on the bottom part of the screen.

Finally, I create the database on the instance with the create database demo statement.

By default, Babelfish runs in single-db mode. Using this mode, you can have maximum one user database per instance. It allows to have a close mapping of schema names between SQL Server and PostgreSQL. Alternatively, you may turn on multi-db mode at cluster creation time. This allows you to create multiple user databases per instance. In PostgreSQL, user databases will be mapped to multiple schemas with the database name as a prefix.

Run an Application
For the purpose of this demo, I use a database schema provided by SQLServerTutorial.net as part of their SQL Server Tutorial to create a schema and populate it with data. The SQL script and application C# code I use in this demo are available on my GitHub repository. A big thanks to my colleague Anuja for providing me with a C# demo application.

In SQL Server Management Studio, I open the create_objects.sql script and I choose the green execute icon on the top toolbar. A confirmation message tells me the database schema is created.

I repeat the operation with the load_data.sql script to load data in the newly created tables. Data loading takes a few minutes to run.

Now the database is loaded, let’s open Anuja‘s C# application developed to access a SQL Server database. I modify two lines of code:

line 12 : I type the DNS name of the Babelfish cluster I created earlier. Note that I use the DNS name of a “write” node from my cluster.
line 15 : I type the password I entered when I created the database cluster.

And that’s it! No other modification is required on this app. This code written to query and interact with SQL Server is just working “as-is” on Aurora PostgreSQL with Babelfish.

Open Source Transparency
We decided to open-source the technology behind Babelfish to create the Babelfish for PostgreSQL open source project. It uses the permissive Apache 2.0 and PostgreSQL licenses, meaning you can modify or tweak or distribute Babelfish in whatever fashion you see fit. Over time, we are shifting Babelfish to fully open development on GitHub, so there is transparency from the start. Now, anyone, whether you are an AWS customer or not, can use Babelfish to leave behind SQL Server and quickly, easily, and cost-effectively migrate your applications to open source PostgreSQL. We believe Babelfish is going to make PostgreSQL accessible to a much wider group of customers and developers than ever before, particularly those with large numbers of complex applications originally written for SQL Server.

Availability
Babelfish for Aurora PostgreSQL is available today in all publicly available AWS Regions at no additional cost. Start your application migration today.

— seb

PS : if you wonder where the name Babelfish comes from, just remember the answer is 42. (Or you can read this slightly longer answer.)

Amazon RDS Custom for Oracle – New Control Capabilities in Database Environment

2021-10-26 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-rds-custom-for-oracle-new-control-capabilities-in-database-environment/

Managing databases in self-managed environments such as on premises or Amazon Elastic Compute Cloud (Amazon EC2) requires customers to spend time and resources doing database administration tasks such as provisioning, scaling, patching, backups, and configuring for high availability. So, hundreds of thousands of AWS customers use Amazon Relational Database Service (Amazon RDS) because it automates these undifferentiated administration tasks.

However, there are some legacy and packaged applications that require customers to make specialized customizations to the underlying database and the operating system (OS), such as Oracle industry specialized applications for healthcare and life sciences, telecom, retail, banking, and hospitality. Customers with these specific customization requirements cannot get the benefits of a fully managed database service like Amazon RDS, and they end up deploying their databases on premises or on EC2 instances.

Today, I am happy to announce the general availability of Amazon RDS Custom for Oracle, new capabilities that enable database administrators to access and customize the database environment and operating system. With RDS Custom for Oracle, you can now access and customize your database server host and operating system, for example by applying special patches and changing the database software settings to support third-party applications that require privileged access.

You can easily move your existing self-managed database for these applications to Amazon RDS and automate time-consuming database management tasks, such as software installation, patching, and backups. Here is a simple comparison of features and responsibilities between Amazon EC2, RDS Custom for Oracle, and RDS.

Features and Responsibilities	Amazon EC2	RDS Custom for Oracle	Amazon RDS
Application optimization	Customer	Customer	Customer
Scaling/high availability	Customer	Shared	AWS
DB backups	Customer	Shared	AWS
DB software maintenance	Customer	Shared	AWS
OS maintenance	Customer	Shared	AWS
Server maintenance	AWS	AWS	AWS

The shared responsibility model of RDS Custom for Oracle gives you more control than in RDS, but also more responsibility, similar to EC2. So, if you need deep control of your database environment where you take responsibility for changes that you make and want to offload common administration tasks to AWS, RDS Custom for Oracle is the recommended deployment option over self-managing databases on EC2.

Getting Started with Amazon RDS Custom for Oracle
To get started with RDS Custom for Oracle, you create a custom engine version (CEV), the database installation files of supported Oracle database versions and upload the CEV to Amazon Simple Storage Service (Amazon S3). This launch includes Oracle Enterprise Edition allowing Oracle customers to use their own licensed software with bring your own license (BYOL).

Then with just a few clicks in the AWS Management Console, you can deploy an Oracle database instance in minutes. Then, you can connect to it using SSH or AWS Systems Manager.

Before creating and connecting your DB instance, make sure that you meet some prerequisites such as configuring the AWS Identity and Access Management (IAM) role and Amazon Virtual Private Cloud (VPC) using the pre-created AWS CloudFormation template in the Amazon RDS User Guide.

A symmetric AWS Key Management Service (KMS) key is required for RDS Custom for Oracle. If you don’t have an existing symmetric KMS key in your account, create a KMS key by following the instructions in Creating keys in the AWS KMS Developer Guide.

The Oracle Database installation files and patches are hosted on Oracle Software Delivery Cloud. If you want to create a CEV, search and download your preferred version under the Linux x86/64 platform and upload it to Amazon S3.

$ aws s3 cp install-or-patch-file.zip \ s3://my-oracle-db-files

To create CEV for creating a DB instance, you need a CEV manifest, a JSON document that describes installation .zip files stored in Amazon S3. RDS Custom for Oracle will apply the patches in the order in which they are listed when creating the instance by using this CEV.

{
    "mediaImportTemplateVersion": "2020-08-14",
    "databaseInstallationFileNames": [
        "V982063-01.zip"
    ],
    "opatchFileNames": [
        "p6880880_190000_Linux-x86-64.zip"
    ],
    "psuRuPatchFileNames": [
        "p32126828_190000_Linux-x86-64.zip"
    ],
    "otherPatchFileNames": [
        "p29213893_1910000DBRU_Generic.zip",
        "p29782284_1910000DBRU_Generic.zip",
        "p28730253_190000_Linux-x86-64.zip",
        "p29374604_1910000DBRU_Linux-x86-64.zip",
        "p28852325_190000_Linux-x86-64.zip",
        "p29997937_190000_Linux-x86-64.zip",
        "p31335037_190000_Linux-x86-64.zip",
        "p31335142_190000_Generic.zip"
] }

To create a CEV in the AWS Management Console, choose Create custom engine version in the Custom engine version menu.

You can set Engine type to Oracle, choose your preferred database edition and version, and enter CEV manifest, the location of the S3 bucket that you specified. Then, choose Create custom engine version. Creation takes approximately two hours.

To create your DB instance with the prepared CEV, choose Create database in the Databases menu. When you choose a database creation method, select Standard create. You can set Engine options to Oracle and choose Amazon RDS Custom in the database management type.

In Settings, enter a unique name for the DB instance identifier and your master username and password. By default, the new instance uses an automatically generated password for the master user. To learn more in the remaining setting, see Settings for DB instances in the Amazon RDS User Guide. Choose Create database.

Alternatively, you can create a CEV by running create-custom-db-engine-version command in the AWS Command Line Interface (AWS CLI).

$ aws rds create-db-instances \
      --engine my-oracle-ee \
      --db-instance-identifier my-oracle-instance \ 
      --engine-version 19.my_cev1 \ 
      --allocated-storage 250 \ 
      --db-instance-class db.m5.xlarge \ 
      --db-subnet-group mydbsubnetgroup \ 
      --master-username masterawsuser \ 
      --master-user-password masteruserpassword \ 
      --backup-retention-period 3 \ 
      --no-multi-az \ 
              --port 8200 \
      --license-model bring-your-own-license \
      --kms-key-id my-kms-key

After you create your DB instance, you can connect to this instance using an SSH client. The procedure is the same as for connecting to an Amazon EC2 instance. To connect to the DB instance, you need the key pair associated with the instance. RDS Custom for Oracle creates the key pair on your behalf. The pair name uses the prefix do-not-delete-ssh-privatekey-db-. AWS Secrets Manager stores your private key as a secret.

For more information, see Connecting to your Linux instance using SSH in the Amazon EC2 User Guide.

You can also connect to it using AWS Systems Manager Session Manager, a capability that lets you manage EC2 instances through a browser-based shell. To learn more, see Connecting to your RDS Custom DB instance using SSH and AWS Systems Manager in the Amazon RDS User Guide.

Things to Know
Here are a couple of things to keep in mind about managing your DB instance:

High Availability (HA): To configure replication between DB instances in different Availability Zones to be resilient to Availability Zone failures, you can create read replicas for RDS Custom for Oracle DB instances. Read replica creation is similar to Amazon RDS, but with some differences. Not all options are supported when creating RDS Custom read replicas. To learn how to configure HA, see Working with RDS Custom for Oracle read replicas in the AWS Documentation.

Backup and Recovery: Like Amazon RDS, RDS Custom for Oracle creates and saves automated backups during the backup window of your DB instance. You can also back up your DB instance manually. The procedure is identical to taking a snapshot of an Amazon RDS DB instance. The first snapshot contains the data for the full DB instance just like in Amazon RDS. RDS Custom also includes a snapshot of the OS image, and the EBS volume that contains the database software. Subsequent snapshots are incremental. With backup retention enabled, RDS Custom also uploads transaction logs into an S3 bucket in your account to be used with the RDS point-in-time recovery feature. Restore DB snapshots, or restore DB instances to a specific point in time using either the AWS Management Console or the AWS CLI. To learn more, see Backing up and restoring an Amazon RDS Custom for Oracle DB instance in the Amazon RDS User Guide.

Monitoring and Logging: RDS Custom for Oracle provides a monitoring service called the support perimeter. This service ensures that your DB instance uses a supported AWS infrastructure, operating system, and database. Also, all changes and customizations to the underlying operating system are automatically logged for audit purposes using Systems Manager and AWS CloudTrail. To learn more, see Troubleshooting an Amazon RDS Custom for DB instance in the Amazon RDS User Guide.

Now Available
Amazon RDS Custom for Oracle is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Frankfurt), EU (Ireland), EU (Stockholm), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo) regions.

To learn more, take a look at the product page and documentations of Amazon RDS Custom for Oracle. Please send us feedback either in the AWS forum for Amazon RDS or through your usual AWS support contacts.

– Channy

Designing products and services based on Jobs to be Done

2021-10-21 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/designing-products-and-services-based-on-jtbd

Introduction

In 2016, Clayton Christensen, a Harvard Business School professor, wrote a book called Competing Against Luck. In his book, he talked about the kind of jobs that exist in our everyday life and how we can uncover hidden jobs through the act of non-consumption. Non-consumption is the inability for a consumer to fulfil an important Job to be Done (JTBD).

JTBD is a framework; it is a different way of looking at consumer goals and is based on the notion that people buy products and services to get a job done. In this article, we will walk through what the JTBD framework is, look at an example of a popular JTBD, and look at how we use the JTBD framework in one of Grab’s services.

JTBD framework

In his book, Clayton Christensen gives the example of the milkshake, as a JTBD example. In the mid-90s, a fast food chain was trying to understand how to improve the milkshakes they were selling and how they could sell more milkshakes. To sell more, they needed to improve the product. To understand the job of the milkshake, they interviewed their customers. They asked their customers why they were buying the milkshakes, and what progress the milkshake would help them make.

Job 1: To fill their stomachs

One of the key insights was the first job, the customers wanted something that could fill their stomachs during their early morning commute to the office. Usually, these car drives would take one to two hours, so they needed something to keep them awake and to keep themselves full.

In this scenario, the competition could be a banana, but think about the properties of a banana. A banana could fill your stomach but your hands get dirty and sticky after peeling it. Bananas cannot do a good job here. Another competitor could be a Snickers bar, but it is rather unhealthy, and depending on how many bites you take, you could finish it in one minute.

By understanding the job the milkshake was performing, the restaurant now had a specific way of improving the product. The milkshake could be made milkier so it takes time to drink through a straw. The customer can then enjoy the milkshake throughout the journey; the milkshake is optimised for the job.

Job 2: To make children happy

As part of the study, they also interviewed parents who came to buy milkshakes in the afternoon, around 3:00 PM. They found out that the parents were buying the milkshakes to make their children happy.

By knowing this, they were able to optimise the job by offering a smaller version of the milkshake which came in different flavours like strawberry and chocolate. From this milkshake example, we learn that multiple jobs can exist for one product. From that, we can make changes to a product to meet those different jobs.

JTBD at GrabFood

A team at GrabFood wanted to prioritise which features or products to build, and performed a prioritisation exercise. However, there was a lack of fundamental understanding of why our consumers were using GrabFood or any other food delivery services. To gain deeper insights on this, we conducted a JTBD study.

We applied the JTBD framework in our research investigation. We used the force diagram framework to find out what job a consumer wanted to achieve and the corresponding push and pull factors driving the consumer’s decision. A job here is defined as the progress that the consumer is trying to make in a particular context.

There were four key points in the force diagram:

What jobs are people using GrabFood for?
What did people use prior to GrabFood to get the jobs done?
What pushed them to seek a new solution? What is attractive about this new solution?
What are the things that will make them go back to the old product? What are the anxieties of the new product?

By applying this framework, we progressively asked these questions in our interview sessions:

Can you remind us of the last time you used GrabFood? — This was to uncover the situation or the circumstances.
Why did you order this food? — This was to get down to the core of the need.
Can you tell us, before GrabFood, what did you use to get the same job done?

From the interview sessions, we were able to uncover a number of JTBDs, one example was working parents buying food for their families. Before GrabFood, most of them were buying from food vendors directly, but that is a time consuming activity and it adds additional friction to an already busy day. This led them in search of a new solution and GrabFood provided that solution.

Let’s look at this JTBD in more depth. One anxiety that parents had when ordering GrabFood was the sheer number of choices they had to make in order to check out their order:

There was already a solution for this problem: bundles! Food bundles is a well-known concept from the food and beverage industry; items that complement each other are bundled together for a more efficient checkout experience.

However, not all GrabFood merchants created bundles to solve this problem for their consumers. This was an untapped opportunity for the merchants to solve a critical problem for their consumers. Eureka! We knew that we needed to help merchants create bundles in an efficient way to solve for the consumer’s JTBD.

We decided to add a functionality to the GrabMerchant app that allowed merchants to create bundles. We built an algorithm that matched complementary items and automatically suggested these bundles to merchants. The merchant only had to tap a button to create a bundle instantly.

The feature was released and thousands of restaurants started adding bundles to their menu. Our JTBD analysis proved to be correct: food and beverage entrepreneurs were now equipped with an essential tool to drive growth and we removed an obstacle for parents to choose GrabFood to solve for their JTBD.

Conclusion

At Grab, we understand the importance of research. We educate designers and other non-researcher employees to conduct research studies. We also encourage the sharing of research findings, and we ensure that research insights are consumable. By using the JTBD framework and asking questions specifically to understand the job of our consumers and partners, we are able to gain fundamental understanding of why our consumers are using our products and services. This helps us improve our products and services, and optimise it for the jobs that need to be done throughout Southeast Asia.

This article was written based on an episode of the Grab Design Podcast – a conversation with Grab Lead Researcher Soon Hau Chua. Want to listen to the Grab Design Podcast? Join the team, we’re hiring!

Special thanks to Amira Khazali and Irene from Tech Learning.

Join us

Grab is a leading superapp in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across over 400 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Monitoring MongoDB nodes and clusters with Zabbix

2021-10-05 Dmitry Lambert

Post Syndicated from Dmitry Lambert original https://blog.zabbix.com/monitoring-mongodb-nodes-and-clusters-with-zabbix/16031/

Zabbix Agent 2 enables our users to monitor a whole set of new systems with minimal configuration required on the monitored systems. Forget about writing custom monitoring scripts, deploying additional packages, or configuring ODBC. A great use-case for Zabbix Agent 2 is monitoring one of the most popular NoSQL DB backends – MongoDB. Below, you can read a detailed description and step-by-step guide through the use case or refer to the video available here.

Zabbix MongoDB template

For this example, we will be using Zabbix version 5.4, but MongoDB monitoring by Zabbix Agent 2 is supported starting from version 5.0. If you have a fresh deployment of Zabbix version 5.0 or newer, you will be able to find the MongoDB template in your ‘Configuration‘ – ‘Templates‘ section.

MongoDB Node and Cluster templates

On the other hand, if you have an instance that you deployed before the release of Zabbix 5.0 and then upgraded to Zabbix 5.0 or newer, you will have to import the template manually from our git page. Let’s remember that Zabbix DOES NOT apply new templates or modify existing templates during an upgrade. Therefore, newly released templates have to be IMPORTED MANUALLY!

We can see that we have two MongoDB templates – ‘MongoDB cluster by Zabbix Agent 2’ and ‘MongoDB node by Zabbix agent 2’. Depending on your MongoDB setup – individual nodes or a cluster, apply the corresponding template. Note that the MongoDB cluster template can automatically create hosts for your config servers and shards and apply the MongoDB node template on these hosts.

Host prototypes for config servers and shards

Deploying Zabbix Agent 2 on your Host

Since the data collection is done by Zabbix Agent 2, first, let’s deploy Zabbix Agent 2 on our MongoDB node or cluster host. Let’s start with adding the Zabbix 5.4 repository and install the Zabbix Agent 2 via a package.

Add the Zabbix 5.4 repository:

rpm -Uvh https://repo.zabbix.com/zabbix/5.4/rhel/8/x86_64/zabbix-release-5.4-1.el8.noarch.rpm

Install Zabbix Agent 2:

yum install zabbix-agent2

What if you already have the regular Zabbix Agent running on this machine? In this case, we have two options for how we can proceed. We can simply remove the regular Zabbix Agent and Deploy Zabbix Agent 2. In this case, make sure you make a backup of the Zabbix Agent configuration file and migrate all of the changes to the Zabbix Agent 2 configuration file.

The second option is running both of the Zabbix Agents in parallel. In this case, we need to make sure that both agents – Zabbix Agent and Zabbix Agent 2 are listening on their own specific ports because, by default, both agents are listening for connections on port 10050. This can be configured in the Zabbix Agent configuration file by changing the ‘ListenPort’ parameter.

Don’t forget to specify the ‘Server‘ parameter in the Zabbix Agent 2 configuration file. This parameter should contain your Zabbix Server address or DNS name. By defining it here, you will allow Zabbix Agent 2 to accept the metric poll requests from Zabbix Server.

After you have made the configuration changes in the Zabbix Agent 2 configuration file, don’t forget to restart Zabbix Agent 2 to apply the changes:

systemctl restart zabbix-agent2

Creating a MongoDB user for monitoring

Once the agent has been deployed and configured, you need to ensure that you have a MongoDB database user that we can use for monitoring purposes. Below you can find a brief example of how you can create a MongoDB user:

Access the MongoDB shell:

mongosh

Switch to the MongoDB admin database:

use admin

Create a user with ‘userAdminAnyDatabase‘ permissions:

db.createUser(
... {
..... user: "zabbix_mon",
..... pwd: "zabbix_mon",
..... roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
..... }
... )

The username for the newly created user is ‘zabbix_mon’. The password is also ‘zabbix_mon‘ – feel free to change these as per your security policy.

Creating and configuring a MongoDB host

Next, you need to open your Zabbix frontend and create a new host representing your MongoDB node. You can see that in our example, we called our node ‘MongoDB’ and assigned it to a ‘MongoDB Servers’ host group. You can use more detailed naming in a production environment and use your own host group assignment logic. But remember – a host needs to belong to AT LEAST a single host group!

Since the metrics are collected by Zabbix Agent 2, you must also create an Agent interface on the host. Zabbix Server will connect to this interface and request the metrics from the Zabbix Agent 2. Define the IP address or DNS name of your MongoDB host, where you previously deployed Zabbix Agent 2. Mind the port – by default, we have port 10050 defined over here, but if you have modified the ‘ListenPort’ parameter in the Zabbix Agent 2 config and changed the value from the default one (10050) to something else, you also need to use the same port number here.

MongoDB host configuration example

Next, navigate to the ‘Templates’ tab and assign the corresponding template – either ‘MongoDB node by Zabbix agent 2’ or ‘MongoDB cluster by Zabbix Agent 2’. In our example, we will assign the MongoDB node template.

Before adding the host, you also need to provide authentication and connection parameters by editing the corresponding User Macros. These User Macros are used by the items that specify which metrics should we be collecting. Essentially, we are forwarding the connectivity and authentication information to Zabbix Agent 2, telling it to use these values when collecting the metrics from our MongoDB instance.

To do this, navigate to the ‘Macros’ tab in the host configuration screen. Then, select ‘Inherited and host macros’ to display macros inherited from the MongoDB template.

We can see a bunch of macros here – some of them are related to trigger thresholds and discovery filters, but what we’re interested in right now are the following macros:

{$MONGODB.PASSWORD} – MongoDB username. For our example, we will set this to zabbix_mon
{$MONGODB.USER} – MongoDB password. For our example, we will set this to zabbix_mon
{$MONGODB.CONNSTRING} – MongoDB connection string. Specify the MongoDB address and port here to which the Zabbix Agent 2 should connect and perform the metric collection

Now we are ready to add the host. Once the host has been added, we might have to wait for a minute or so before Zabbix begins to monitor the host. This is because Zabbix Server doesn’t instantly pick up the configuration changes. By default, Zabbix Server updates the Configuration Cache once a minute.

Fine-tuning MongoDB monitoring

At this point, we should see a green ZBX Icon next to our MongoDB host.

Data collection on the MongoDB host has started – note the green ‘ZBX’ icon.

This means that the Zabbix Server has successfully connected to our Zabbix Agent 2, and the metric collection has begun. You can now navigate to the ‘Monitoring’ – ‘Latest data’ section, filter the view by your MongoDB host, and you should see all of the collected metrics here.

MongoDB metrics in ‘Monitoring’ – ‘Latest data’

The final task is to tune the MongoDB monitoring on your hosts, collecting only the required metrics. Navigate to ‘Configuration’ – ‘Hosts’, find your MongoDB hosts, and go through the different entity types on the host – items, triggers, discovery rules. See an item that you don’t wish to collect metrics for? Feel free to disable it. Open up the discovery rules – change the update intervals on them or disable the unnecessary ones.

Note: Be careful not to disable master items. Many of the items and discovery rules here are of type ‘Dependent item’ which means, that they require a so-called ‘Master item’. Feel free to read more about dependent items here.

Remember the ‘Macros’ section in the host configuration? Let’s return to it. here we can see some macros which are used in our trigger thresholds, like:

{$MONGODB.REPL.LAG.MAX.WARN} – Maximum replication lag in seconds
{$MONGODB.CURSOR.OPEN.MAX.WARN} – Maximum number of open cursors

Feel free to change these as per your problem threshold requirements.

One last thing here – we can filter which elements get discovered by our discovery rules. This is once again defined by user macros like:

{$MONGODB.LLD.FILTER.DB.MATCHES} – Databases that should be discovered (By default, the value here is ‘.*’, which will match everything)
{$MONGODB.LLD.FILTER.DB.NOT_MATCHES} – Databases that should be excluded from the discovery

And that’s it! After some additional tuning has been applied, we are good to go – our MongoDB entities are being discovered, metrics are getting collected, and problem thresholds have been defined. And all of it has been done with the native Zabbix Agent 2 functionality and an out-of-the-box MongoDB template!

Amazon QuickSight deployment models for cross-account and cross-Region access to Amazon Redshift and Amazon RDS

2021-09-29 Lotfi Mouhib

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/amazon-quicksight-deployment-models-for-cross-account-and-cross-region-access-to-amazon-redshift-and-amazon-rds/

Many AWS customers use multiple AWS accounts and Regions across different departments and applications within the same company. However, you might deploy services like Amazon QuickSight using a single-account approach to centralize users, data source access, and dashboard management. This post explores how you can use different Amazon Virtual Private Cloud (Amazon VPC) private connectivity features to connect QuickSight to Amazon RDS or Amazon Redshift deployed in a different AWS account or Region.

Prepare the QuickSight environment

To implement any of the options discussed in this post, you need QuickSight with an Enterprise Edition subscription. One of the differences between this edition and the Standard Edition is the ability to connect QuickSight to a VPC—through an Elastic Network Interface (ENI)—and keep network traffic private within the AWS network. In this section, we take you through creating a VPC and enabling a VPC connection for QuickSight. For the rest of this post, you use AWS CloudShell from the AWS Management Console to run AWS Command Line Interface (AWS CLI) commands.

Before you complete the following steps, make sure you have sufficient AWS Identity and Access Management (IAM) privileges associated with your IAM user or role to create VPC endpoints, security groups, route tables, and a VPC peering connection.

If you have Amazon RDS or Amazon Redshift in the same Region and your organization uses cross-account resource sharing for your VPC, you can skip the next steps and go to the section on VPC sharing.

If you already have a VPC and subnet you want to use, you can skip to step 3.

Create a VPC and a subnet in the Region where your QuickSight account is deployed. See the following code:

aws ec2 create-vpc \
    --cidr-block CIDR block different than where redshift and rds are deployed \
    --tag-specifications \ 
 'ResourceType=vpc,Tags=[{Key=Name,Value=QuicksightVPC}]

Create a subnet where QuickSight can deploy an ENI:

aws ec2 create-subnet \
    --vpc-id ID of the VPC you created above\
    --cidr-block subnet cidr block within your VPC range \
    --tag-specifications \ 
    'ResourceType=subnet,Tags=[{Key=Name,Value=QuicksightSubnet}]'

Create a security group and allow a self-reference for all TCP traffic from the VPC CIDR range:

aws ec2 create-security-group \
    --group-name QuicksightSG \
    --description "Quicksight security group" \
    --vpc-id ID of the VPC you created above \
    --tag-specifications 'ResourceType=security-group,Tags=[{Key=Name,Value=QuicksightSG}]'

Create a route table and associate it to the subnet you created:

aws ec2 create-route-table \
    --vpc-id ID of the VPC you created above \
    --tag-specifications 'ResourceType=route- \ table,Tags=[{Key=Name,Value=QuicksightRouteTable}]'

aws ec2 associate-route-table \ 
    --route-table-id id of route table created above \
    --subnet-id subnet id created above

Now you configure QuickSight to create a VPC connection in the subnet you just created.

Sign in to the QuickSight console with administrator privileges.
Choose your profile icon and choose Manage QuickSight.
On the Manage QuickSight console, in the left panel, choose Manage VPC Connections.
Choose Add VPC Connection.
Provide the VPC ID, subnet ID, and security group ID you created earlier.
You can leave DNS resolver endpoints empty unless you have a private DNS deployment in the VPC.

You have now enabled QuickSight to access a subnet in your VPC. The following diagram shows the infrastructure you deployed.

Quicksight-ENI-VPC-Deployment

Set up the data source infrastructure

You can choose from multiple possible deployment models when connecting Amazon Redshift or Amazon RDS. In the following sections, we walk you through how you can achieve each solution and when to use each one. For the rest of this post, we refer to Amazon Redshift and Amazon RDS as the data source.

VPC peering

A VPC peering connection is a networking connection between two VPCs that enables you to route traffic between them using private addresses. Instances in either VPC can communicate with each other as if they’re within the same network. This model is suitable for when the two accounts aren’t in the same AWS Organizations organization or Region, or are in the same organization but there is no AWS Transit Gateway attachment to the data source VPC. In this section, we create a network path between the QuickSight VPC and the data source VPC.

Create a VPC peering between the two VPCs. For instructions, see Creating and accepting a VPC peering connection.
To use the hostname of the data source, you need to enable the DNS resolution for the peering connection at the requester and accepter of the peering connection. The DNS resolution must also be enabled at bot VPCs for this to work.

Run the following commands to enable the DNS resolution (depending on which account or Region the peering connection is initiated in, you need to run one command in the requester account or Region and the other in the accepter account or Region)

aws ec2 modify-vpc-peering-connection-options \
    --vpc-peering-connection-id your peering connection id \
    --accepter-peering-connection-options \ 
'{"AllowDnsResolutionFromRemoteVpc":true}'

aws ec2 modify-vpc-peering-connection-options \
    --vpc-peering-connection-id your peering connection id \
    --requester-peering-connection-options \ 
'{"AllowDnsResolutionFromRemoteVpc":true}'

Update the route tables in both the QuickSight VPC and data source VPC to route network traffic between them:

aws ec2 create-route \
    --route-table-id amazon quicksight subnet route table id\
    --destination-cidr-block data source vpc cidr \
    --vpc-peering-connection-id the id of the peering you created above

aws ec2 create-route \
    --route-table-id data source subnet route table id\
    --destination-cidr-block quicksight vpc cidr \
    --vpc-peering-connection-id the id of the peering you created above

Finally, you update the security groups for both QuickSight and the data source to allow traffic. If both VPCs are in the same Region, you can reference the security group instead of the CIDR of either data source or the QuickSight subnet.

In the QuickSight AWS account, run the following command:

aws ec2 authorize-security-group-ingress \
    --group-id quicksight security group \
    --ip-permissions \ 
IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges=[{CidrIp=data source subnet CIDR}]

In the data source AWS account, run the following command:
```
aws ec2 authorize-security-group-ingress \
    --group-id data source security group \
    --ip-permissions \ 
IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges=[{CidrIp=quicksight subnet CIDR}]
```
You have now created and configured a network path between QuickSight and your data source. The following diagram shows the infrastructure you deployed.

Quiclsight-VPC-Peering-Deployment-Architecture

You can skip the next sections and proceed to connecting QuickSight to the data source.

AWS Transit Gateway

Transit Gateway connects VPCs through a central hub. It’s usually used in large deployments because it simplifies your network and reduces complex peering relationships. The transit gateway acts as a cloud router—each new connection is only made one time. A deployment model for the VPC of QuickSight with Transit Gateway is suitable when your network or infrastructure team has a policy on limiting VPC peering connections and enforces all new connections between VPCs through the transit gateway.

The Transit Gateway deployment model only works if the two AWS accounts are in the same organization and Region. This method also works if you have two transit gateway in different region that are peered.

Complete the following steps to create a network path between the QuickSight VPC and the data source VPC:

Share the transit gateway with the AWS account where QuickSight is deployed.

Run the following command to attach the QuickSight VPC to the shared transit gateway:

aws ec2 create-transit-gateway-vpc-attachment \
    --transit-gateway-id transit gateway id \
    --vpc-id quicksight vpc id \
    --subnet-id quicksight subnet id \
    --options {"DnsSupport": "enable"}

Run the following commands to accept the transit gateway attachment you created:
```
aws ec2 accept-transit-gateway-vpc-attachment \
    --transit-gateway-attachment-id your transit gateway attachment id
```
Now you need to update the transit gateway attachment with same route table that is used by the data source VPC transit gateway attachment.

Get the route table ID (TransitGatewayRouteTableId) with the following command:

aws ec2 describe-transit-gateway-attachments \
    --transit-gateway-attachment-ids your transit gateway id

Run the following command to attach the route table to the transit gateway attachment of QuickSight:
```
aws ec2 associate-transit-gateway-route-table \
    --transit-gateway-route-table-id your transit gateway id \
    --transit-gateway-attachment-id your transit gateway attachment id
```
Lastly, we update the security groups for both QuickSight and the data source to allow traffic.

In the AWS account where QuickSight is deployed, run the following command:

aws ec2 authorize-security-group-ingress \
    --group-id quicksight security group \
    --ip-permissions \ 
IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges=[{CidrIp=data source subnet CIDR}]

In the AWS account where the data source is deployed, run the following command:
```
aws ec2 authorize-security-group-ingress \
    --group-id data source security group \
    --ip-permissions \
IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges=[{CidrIp=quicksight subnet CIDR}]
```
You have now created and configured a network path between QuickSight and your data source. The following diagram shows the infrastructure you deployed.

You can skip the next sections and proceed to connecting QuickSight to the data source.

AWS PrivateLink

AWS PrivateLink provides private connectivity between VPCs, AWS services, and your on-premises networks. It simplifies your network architecture and makes it easy to connect AWS services across different accounts and VPCs.

You can use AWS PrivateLink as a self-managed endpoint to connect two VPCs or as a managed VPC endpoint for some services like Amazon Redshift. With an Amazon Redshift-managed VPC endpoint, you can privately access your Amazon Redshift data warehouse in your VPC from your client applications in another VPC within the same AWS account or another AWS account.

A self-managed AWS PrivateLink deployment is a solution for cross-account access; however, I don’t discuss it in this post. The solution relies on the IP of the data source ENI, which may change, which causes the AWS PrivateLink endpoint to lose connectivity to the data source, because the AWS PrivateLink Network Load Balancer uses IP-as-a-target to refer to the data source. This results in QuickSight being unable to refresh the dashboards and reports until you update the AWS PrivateLink configuration with the new IP.

The following steps walk you through creating a network path—using an Amazon Redshift managed VPC endpoint—between the QuickSight VPC and the data source VPC. This method only works with the RA3-instance of Amazon Redshift and not the DS2 instances.

Set up an Amazon Redshift-managed VPC endpoint between the Amazon Redshift cluster VPC and QuickSight VPC. For instructions, see Connecting to Amazon Redshift using an interface VPC endpoint.
Then you update the security groups for both QuickSight and Amazon Redshift to allow traffic.

In the AWS account where QuickSight is deployed, run the following command:

aws ec2 authorize-security-group-ingress \
    --group-id quicksight security group \
    --ip-permissions \ 
IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges=[{CidrIp=data source subnet CIDR}]

In the AWS account where the data source is deployed, run the following command:
```
aws ec2 authorize-security-group-ingress \
    --group-id data source security group \
    --ip-permissions \
IpProtocol=tcp,FromPort=0,ToPort=65535,IpRanges=[{CidrIp=quicksight subnet CIDR}]
```
You have now created and configured a network path between QuickSight and your Amazon Redshift cluster. The following diagram shows the infrastructure you deployed.

You can skip the next section and proceed to connecting QuickSight to the data source.

VPC sharing

VPC sharing allows multiple AWS accounts to create their application resources, such as Amazon Elastic Compute Cloud (Amazon EC2) or RDS instances, into a shared, centrally managed VPC. In this model, the account that owns the VPC where the data source is deployed (owner) shares one or more subnets with other accounts (participants) that belong to the same organization. After a subnet is shared, the participants can use QuickSight to create the VPC connection in a subnet shared with them. Participants can’t view, modify, or delete resources that belong to other participants or the VPC owner.

This model is the most cost-effective because there’s no underlying cost for VPC peering, Transit Gateway, or AWS PrivateLink. It also simplifies network topologies and reduces the number of VPCs that you create and manage, while using separate accounts for billing and access control for each service used.

The following steps walk you through creating a VPC share and deploying a QuickSight ENI in the VPC to connect to the data source.

Share the data source VPC with the AWS account where QuickSight is deployed. For instructions, see Work with shared VPCs.
Configure QuickSight to create a VPC connection in the VPC you shared. For instructions, see Setting Up a VPC to Use with Amazon QuickSight.

You have now created and configured a network path between QuickSight and your data source. The following diagram shows the infrastructure you deployed.

Quicksight-VPC-sharing-deployment-architecture

You can now connect QuickSight to the data source.

Connect QuickSight to the data source

Now that you have established the network link and configured both QuickSight and the data source to accept incoming and outgoing network traffic, you set up QuickSight to connect to the data source.

On the QuickSight console, choose Datasets in the navigation pane.
Choose Add a dataset.
Choose your database engine (such as PostgreSQL, MySQL or Redshift Manual connect).
Do not choose Amazon RDS or Amazon Redshift auto-discover.
For Connection type, choose the VPC connection you created.
Provide the necessary information about your database server.
Choose Validate connection button to make sure QuickSight can connect to the data source.
Choose Create data source.
Choose the database you want to use and select the table.

You’re now ready to build your dashboards and reports.

Clean up

After following these steps, you should have successfully connected QuickSight to your data source. To avoid incurring any extra charges from the VPC configurations you made and depending on the use case you followed, you should delete the VPC peering, AWS PrivateLink endpoint, or transit gateway attachment.

Conclusion

In this post, we walked through reference access patterns that connect QuickSight to either Amazon RDS or Amazon Redshift running either on a different account, Region, or both. These methods aren’t limited to Amazon RDS or Amazon Redshift; you can use them with any database engine that is self-managed on Amazon EC2 and supported by QuickSight.

The following table summarizes the different deployment models and the preferred way to connect QuickSight to the data source.

	Same Region		Different Region
	Data Source		Data Source
	Amazon Redshift RA3	Amazon RDS or Amazon Redshift DC2	Amazon Redshift RA3	Amazon RDS or Amazon Redshift DC2
Different AWS accounts same AWS organization	Amazon Redshift managed VPC endpoint or VPC sharing	Transit Gateway or VPC sharing	VPC peering + Amazon Redshift managed VPC endpoint	VPC peering
Different AWS accounts not in the same AWS organization	Amazon Redshift managed VPC endpoint	VPC peering	VPC peering + Amazon Redshift managed VPC endpoint	VPC peering

After you set up your various data sources, you can join your data across various sources. You can also use these new data sources to gain further insights from your data by setting up ML Insights in QuickSight and setting graphical representations of your data using QuickSight visuals.

About the author

author-picture Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Search indexing optimisation

2021-09-27 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/search-indexing-optimisation

Modern applications commonly utilise various database engines, with each serving a specific need. At Grab Deliveries, MySQL database (DB) is utilised to store canonical forms of data, and ElasticSearch (ES) to provide advanced search capabilities. MySQL serves as the primary data storage for raw data, and ES as the derived storage.

Efforts have been made to synchronise data between MySQL and ES. In this post, a series of techniques will be introduced on how to optimise incremental search data indexing.

Background

The synchronisation of data from the primary data storage to the derived data storage is handled by Food-Puxian, a Data Synchronisation Platform (DSP). In a search service context, it is the synchronisation of data between MySQL and ES.
The data synchronisation process is triggered on every real-time data update to MySQL, which will streamline the updated data to Kafka. DSP consumes the list of Kafka streams and incrementally updates the respective search indexes in ES. This process is also known as Incremental Sync.

Kafka to DSP

DSP uses Kafka streams to implement Incremental Sync. A stream represents an unbounded, continuously updating data set. A stream is ordered, replayable and is a fault-tolerant sequence of immutable.

*Data synchronisation process using Kafka*

The above diagram depicts the process of data synchronisation using Kafka. The Data Producer creates a Kafka stream for every operation done on MySQL and sends it to Kafka in real-time. DSP creates a stream consumer for each Kafka stream and the consumer reads data updates from respective Kafka streams and synchronises them to ES.

MySQL to ES

Indexes in ES correspond to tables in MySQL. MySQL data is stored in tables, while ES data is stored in indexes. Multiple MySQL tables are joined to form an ES index. The below snippet shows the Entity-Relationship mapping in MySQL and ES. Entity A has a one-to-many relationship with entity B. Entity A has multiple associated tables in MySQL, table A1 and A2, and they are joined into a single ES index A.

Sometimes a search index contains both entity A and entity B. In a keyword search query on this index, e.g. “Burger”, objects from both entity A and entity B whose name contains “Burger” are returned in the search response.

Original Incremental Sync

Original Kafka streams

The Data Producers create a Kafka stream for every MySQL table in the ER diagram above. Every time there is an insert/update/delete operation on the MySQL tables, a copy of the data after the operation executes is sent to its Kafka stream. DSP creates different stream consumers for every Kafka stream since their data structures are different.

Stream Consumer infrastructure

Stream Consumer consists of 3 components.

Event Dispatcher: Listens and fetches events from the Kafka stream, pushes them to the Event Buffer and starts a goroutine to run Event Handler for every event whose ID does not exist in the Event Buffer
Event Buffer: Caches events in memory by the primary key (aID, bID, etc). An event is cached in the Buffer until it is picked by a goroutine or replaced when a new event with the same primary key is pushed into the Buffer.
Event Handler: Reads an event from the Event Buffer and the goroutine started by the Event Dispatcher handles it.

Event Buffer procedure

Event Buffer consists of many sub buffers, each with a unique ID which is the primary key of the event cached in it. The maximum size of a sub buffer is 1. This allows Event Buffer to deduplicate events having the same ID in the buffer.
The below diagram shows the procedure of pushing an event to Event Buffer. When a new event is pushed to the buffer, the old event sharing the same ID will be replaced. The replaced event is therefore not handled.

Event Handler procedure

The below flowchart shows the procedures executed by the Event Handler. It consists of the common handler flow (in white), and additional procedures for object B events (in green). After creating a new ES document by data loaded from the database, it will get the original document from ES to compare if any field is changed and decide whether it is necessary to send the new document to ES.
When object B event is being handled, on top of the common handler flow, it also cascades the update to the related object A in the ES index. We name this kind of operation Cascade Update.

Issues in the original infrastructure

Data in an ES index can come from multiple MySQL tables as shown below.

The original infrastructure came with a few issues.

Heavy DB load: Consumers read from Kafka streams, treat stream events as notifications then use IDs to load data from the DB to create a new ES document. Data in the stream events are not well utilised. Loading data from the DB every time to create a new ES document results in heavy traffic to the DB. The DB becomes a bottleneck.
Data loss: Producers send data copies to Kafka in application code. Data changes made via MySQL command-line tool (CLT) or other DB management tools are lost.
Tight coupling with MySQL table structure: If producers add a new column to an existing table in MySQL and this column needs to be synchronised to ES, DSP is not able to capture the data changes of this column until the producers make the code change and add the column to the related Kafka Stream.
Redundant ES updates: ES data is a subset of MySQL data. Producers publish data to Kafka streams even if changes are made on fields that are not relevant to ES. These stream events that are irrelevant to ES would still be picked up.
Duplicate cascade updates: Consider a case where the search index contains both object A and object B. A large number of updates to object B are created within a short span of time. All the updates will be cascaded to the index containing both objects A and B. This will bring heavy traffic to the DB.

Optimised Incremental Sync

MySQL Binlog

MySQL binary log (Binlog) is a set of log files that contain information about data modifications made to a MySQL server instance. It contains all statements that update data. There are two types of binary logging:

Statement-based logging: Events contain SQL statements that produce data changes (inserts, updates, deletes).
Row-based logging: Events describe changes to individual rows.

The Grab Caspian team (Data Tech) has built a Change Data Capture (CDC) system based on MySQL row-based Binlog. It captures all the data modifications made to MySQL tables.

Current Kafka streams

The Binlog stream event definition is a common data structure with three main fields: Operation, PayloadBefore and PayloadAfter. The Operation enums are Create, Delete, and Update. Payloads are the data in JSON string format. All Binlog streams follow the same stream event definition. Leveraging PayloadBefore and PayloadAfter in the Binlog event, optimisations of incremental sync on DSP becomes possible.

Stream Consumer optimisations

Event Handler optimisations

Optimisation 1

Remember that there was a redundant ES updates issue mentioned above where the ES data is a subset of the MySQL data. The first optimisation is to filter out irrelevant stream events by checking if the fields that are different between PayloadBefore and PayloadAfter are in the ES data subset.
Since the payloads in the Binlog event are JSON strings, a data structure only with fields that are present in ES data is defined to parse PayloadBefore and PayloadAfter. By comparing the parsed payloads, it is easy to know whether the change is relevant to ES.
The below diagram shows the optimised Event Handler flows. As shown in the blue flow, when an event is handled, PayloadBefore and PayloadAfter are compared first. An event will be processed only if there is a difference between PayloadBefore and PayloadAfter. Since the irrelevant events are filtered, it is unnecessary to get the original document from ES.

Achievements

No data loss: changes made via MySQL CLT or other DB manage tools can be captured.
No dependency on MySQL table definition: All the data is in JSON string format.
No redundant ES updates and DB reads.
ES reads traffic reduced by 90%: Not a need to get the original document from ES to compare with the newly created document anymore.
55% irrelevant stream events filtered out
DB load reduced by 55%

Optimisation 2

The PayloadAfter in the event provides updated data. This makes us think about whether a completely new ES document is needed each time, with its data read from several MySQL tables. The second optimisation is to change to a partial update using data differences from the Binlog event.
The below diagram shows the Event Handler procedure flow with a partial update. As shown in the red flow, instead of creating a new ES document for each event, a check on whether the document exists will be performed first. If the document exists, which happens for the majority of the time, the data is changed in this event, provided the comparison between PayloadBefore and PayloadAfter is updated to the existing ES document.

Achievements

Change most ES relevant events to partial update: Use data in stream events to update ES.
ES load reduced: Only fields that have been changed will be sent to ES.
DB load reduced: DB load reduced by 80% based on Optimisation 1.

Event Buffer optimisation

Instead of replacing the old event, we merge the new event with the old event when the new event is pushed to the Event Buffer.
The size of each sub buffer in Event Buffer is 1. In this optimisation, the stream event is not treated as a notification anymore. We use the Payloads in the event to perform Partial Updates. The old procedure of replacing old events is no longer suitable for the Binlog stream.
When the Event Dispatcher pushes a new event to a non-empty sub buffer in the Event Buffer, it will merge event A in the sub buffer and the new event B into a new Binlog event C, whose PayloadBefore is from Event A and PayloadAfter is from Event B.

Cascade Update optimisation

Optimisation

Use a new stream to handle cascade update events.
When the producer sends data to the Kafka stream, data sharing the same ID will be stored at the same partition. Every DSP service instance has only one stream consumer. When Kafka streams are consumed by consumers, one partition will be consumed by only one consumer. So the Cascade Update events sharing the same ID will be consumed by one stream consumer on the same EC2 instance. With this special mechanism, the in-memory Event Buffer is able to deduplicate most of the Cascade Update events sharing the same ID.
The flowchart below shows the optimised Event Handler procedure. Highlighted in green is the original flow while purple highlights the current flow with Cascade Update events.
When handling an object B event, instead of cascading update the related object A directly, the Event Handler will send a Cascade Update event to the new stream. The consumer of the new stream will handle the Cascade Update event and synchronise the data of object A to the ES.

*Event Handler with Cascade Update events*

Achievements

Cascade Update events deduplicated by 80%
DB load introduced by cascade update reduced

Summary

In this article four different DSP optimisations are explained. After switching to MySQL Binlog streams provided by the Coban team and optimising Stream Consumer, DSP has saved about 91% DB reads and 90% ES reads, and the average queries per second (QPS) of stream traffic processed by Stream Consumer increased from 200 to 800. The max QPS at peak hours could go up to 1000+. With a higher QPS, the duration of processing data and the latency of synchronising data from MySQL to ES was reduced. The data synchronisation ability of DSP has greatly improved after optimisation.

Special thanks to Jun Ying Lim and Amira Khazali for proofreading this article.

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Field Notes: Set Up a Highly Available Database on AWS with IBM Db2 Pacemaker

2021-09-24 Sai Parthasaradhi

Post Syndicated from Sai Parthasaradhi original https://aws.amazon.com/blogs/architecture/field-notes-set-up-a-highly-available-database-on-aws-with-ibm-db2-pacemaker/

Many AWS customers need to run mission-critical workloads—like traffic control system, online booking system, and so forth—using the IBM Db2 LUW database server. Typically, these workloads require the right high availability (HA) solution to make sure that the database is available in the event of a host or Availability Zone failure.

This HA solution for the Db2 LUW database with automatic failover is managed using IBM Tivoli System Automation for Multiplatforms (Tivoli SA MP) technology with IBM Db2 high availability instance configuration utility (db2haicu). However, this solution is not supported on AWS Cloud deployment because the automatic failover may not work as expected.

In this blog post, we will go through the steps to set up an HA two-host Db2 cluster with automatic failover managed by IBM Db2 Pacemaker with quorum device setup on a third EC2 instance. We will also set up an overlay IP as a virtual IP pointing to a primary instance initially. This instance is used for client connections and in case of failover, the overlay IP will automatically point to a new primary instance.

IBM Db2 Pacemaker is an HA cluster manager software integrated with Db2 Advanced Edition and Standard Edition on Linux (RHEL 8.1 and SLES 15). Pacemaker can provide HA and disaster recovery capabilities on AWS, and an alternative to Tivoli SA MP technology.

Note: The IBM Db2 v11.5.5 database server implemented in this blog post is a fully featured 90-day trial version. After the trial period ends, you can select the required Db2 edition when purchasing and installing the associated license files. Advanced Edition and Standard Edition are supported by this implementation.

Overview of solution

For this solution, we will go through the steps to install and configure IBM Db2 Pacemaker along with overlay IP as virtual IP for the clients to connect to the database. This blog post also includes prerequisites, and installation and configuration instructions to achieve an HA Db2 database on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1. Cluster management using IBM Db2 Pacemaker

Prerequisites for installing Db2 Pacemaker

To set up IBM Db2 Pacemaker on a two-node HADR (high availability disaster recovery) cluster, the following prerequisites must be met.

Set up instance user ID and group ID.

Instance user id and group id’s must be set up as part of Db2 Server installation which can be verified as follows:

grep db2iadm1 /etc/group
grep db2inst1 /etc/group

Set up host names for all the hosts in /etc/hosts file on all the hosts in the cluster.

For both of the hosts in the HADR cluster, ensure that the host names are set up as follows.

Format: ipaddress fully_qualified_domain_name alias

Install kornshell (ksh) on both of the hosts.

sudo yum install ksh -y

Ensure that all instances have TCP/IP connectivity between their ethernet network interfaces.
Enable password less secure shell (ssh) for the root and instance user IDs across both instances.After the password less root ssh is enabled, verify it using the “ssh <host name> -l root ls” command (hostname is either an alias or fully-qualified domain name).

ssh <host name> -l root ls

Activate HADR for the Db2 database cluster.
Make available the IBM Db2 Pacemaker binaries in the /tmp folder on both hosts for installation. The binaries can be downloaded from IBM download location (login required).

Installation steps

After completing all prerequisites, run the following command to install IBM Db2 Pacemaker on both primary and standby hosts as root user.

cd /tmp
tar -zxf Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64.tar.gz
cd Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/RPMS/

dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm -y
dnf install */*.rpm -y

cp /tmp/Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/Db2/db2cm /home/db2inst1/sqllib/adm

chmod 755 /home/db2inst1/sqllib/adm/db2cm

Run the following command by replacing the -host parameter value with the alias name you set up in prerequisites.

/home/db2inst1/sqllib/adm/db2cm -copy_resources
/tmp/Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/Db2agents -host <host>

After the installation is complete, verify that all required resources are created as shown in Figure 2.

ls -alL /usr/lib/ocf/resource.d/heartbeat/db2*

Figure 2. List of heartbeat resources

Configuring Pacemaker

After the IBM Db2 Pacemaker is installed on both primary and standby hosts, initiate the following configuration commands from only one of the hosts (either primary or standby hosts) as root user.

Create the cluster using db2cm utility.Create the Pacemaker cluster using db2cm utility using the following command. Before running the command, replace the -domain and -host values appropriately.

/home/db2inst1/sqllib/adm/db2cm -create -cluster -domain <anydomainname> -publicEthernet eth0 -host <primary host alias> -publicEthernet eth0 -host <standby host alias>

Note: Run ifconfig to get the –publicEthernet value and replace in the former command.

Create instance resource model using the following commands.Modify -instance and -host parameter values in the following command before running.

/home/db2inst1/sqllib/adm/db2cm -create -instance db2inst1 -host <primary host alias>
/home/db2inst1/sqllib/adm/db2cm -create -instance db2inst1 -host <standby host alias>

Create the database instance using db2cm utility. Modify -db parameter value accordingly.

/home/db2inst1/sqllib/adm/db2cm -create -db TESTDB -instance db2inst1

After configuring Pacemaker, run crm status command from both the primary and standby hosts to check if the Pacemaker is running with automatic failover activated.

Figure 3. Pacemaker cluster status

Quorum device setup

Next, we shall set up a third lightweight EC2 instance that will act as a quorum device (QDevice) which will act as a tie breaker avoiding a potential split-brain scenario. We need to install only corsync-qnetd* package from the Db2 Pacemaker cluster software.

Prerequisites (quorum device setup)

Update /etc/hosts file on Db2 primary and standby instances to include the host details of QDevice EC2 instance.
Set up password less root ssh access between Db2 instances and the QDevice instance.
Ensure TCP/IP connectivity between the Db2 instances and the QDevice instance on port 5403.

Steps to set up quorum device

Run the following commands on the quorum device EC2 instance.

cd /tmp
tar -zxf Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64.tar.gz
cd Db2_v11.5.5.0_Pacemaker_20201118_RHEL8.1_x86_64/RPMS/
dnf install */corosync-qnetd* -y

Run the following command from one of the Db2 instances to join the quorum device to the cluster by replacing the QDevice value appropriately.

/home/db2inst1/sqllib/adm/db2cm -create -qdevice <hostnameofqdevice>

Verify the setup using the following commands.

From any Db2 servers:

/home/db2inst1/sqllib/adm/db2cm -list

From QDevice instance:

corosync-qnetd-tool -l

Figure 4. Quorum device status

Setting up overlay IP as virtual IP

For HADR activated databases, virtual IP provides a common connection point for the clients so that in case of failovers there is no need to update the connection strings with the actual IP address of the hosts. Furtermore, the clients can continue to establish the connection to the new primary instance.

We can use the overlay IP address routing on AWS to send the network traffic to HADR database servers within Amazon Virtual Private Cloud (Amazon VPC) using a route table so that the clients can connect to the database using the overlay IP from the same VPC (any Availability Zone) where the database exists. aws-vpc-move-ip is a resource agent from AWS which is available along with the Pacemaker software that helps to update the route table of the VPC.

If you need to connect to the database using overlay IP from on-premises or outside of the VPC (different VPC than database servers), then additional setup is needed using either AWS Transit Gateway or Network Load Balancer.

Prerequisites (setting up overlay IP as virtual IP)

Choose the overlay IP address range which needs to be configured. This IP should not be used anywhere in the VPC or on-premises, and should be a part of the private IP address range as defined in RFC 1918. If the VPC is conﬁgured in the range of 0.0.0.0/8 or 172.16.0.0/12, we can use the overlay IP from the range of 192.168.0.0/16.We will use the following IP and ethernet settings.

192.168.1.81/32
eth0

To route traffic through overlay IP, we need to disable source and target destination checks on the primary and standby EC2 instances.

aws ec2 modify-instance-attribute –profile <AWS CLI profile> –instance-id EC2-instance-id –no-source-dest-check

Steps to configure overlay IP

The following commands can be run as root user on the primary instance.

Create the following AWS Identity and Access Management (IAM) policy and attach it to the instance profile. Update region, account_id, and routetableid values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt0",
      "Effect": "Allow",
      "Action": "ec2:ReplaceRoute",
      "Resource": "arn:aws:ec2:<region>:<account_id>:route-table/<routetableid>"
    },
    {
      "Sid": "Stmt1",
      "Effect": "Allow",
      "Action": "ec2:DescribeRouteTables",
      "Resource": "*"
    }
  ]
}

Add the overlay IP on the primary instance.

ip address add 192.168.1.81/32 dev eth0

Update the route table (used in Step 1) with the overlay IP specifying the node with the Db2 primary instance. The following command returns True.

aws ec2 create-route –route-table-id <routetableid> –destination-cidr-block 192.168.1.81/32 –instance-id <primrydb2instanceid>

Create a file overlayip.txt with the following command to create the resource manager for overlay ip.

overlayip.txt

primitive db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP ocf:heartbeat:aws-vpc-move-ip \
params ip=192.168.1.81 routing_table=<routetableid> interface=<ethernet> profile=<AWS CLI profile name> \
op start interval=0 timeout=180s \
op stop interval=0 timeout=180s \
op monitor interval=30s timeout=60s

eifcolocation db2_db2inst1_db2inst1_TESTDB_AWS_primary-colocation inf:

db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP:Started
db2_db2inst1_db2inst1_TESTDB-clone:Master
order order-rule-db2_db2inst1_db2inst1_TESTDB-then-primary-oip Mandatory:

db2_db2inst1_db2inst1_TESTDB-clone db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP
location prefer-node1_db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP 100: <primaryhostname>
location prefer-node2_db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP 100: <standbyhostname>

The following parameters must be replaced in the resource manager create command in the file.

- Name of the database resource agent (This can be found through crm config show | grep primitive | grep DBNAME command. For this example, we will use: db2_db2inst1_db2inst1_TESTDB)
- Overlay IP address (created earlier)
- Routing table ID (used earlier)
- AWS command-line interface (CLI) profile name
- Primary and standby host names

After the file with commands is ready, run the following command to create the overlay IP resource manager.

crm configure load update overlayip.txt

Next, create the VIP resource manager—not in managed state. Run the following command to manage and start the resource.

crm resource manage db2_db2inst1_db2inst1_TESTDB_AWS_primary-OIP

Validate the setup with crm status command.

Figure 5. Pacemaker cluster status along with overlay IP resource

Test failover with client connectivity

For the purpose of this testing, launch another EC2 instance with Db2 client installed, and catalog the Db2 database server using overlay IP.

Figure 6. Database directory list

Establish a connection with the Db2 primary instance using the cataloged alias (created earlier) using overlay IP address.

Figure 7. Connect to database

If we connect to the primary instance and check the applications connected, we can see the active connection from the client’s IP as shown in Figure 8.

Figure 8. Check client connections before failover

Next, let’s stop the primary Db2 instance and check if the Pacemaker cluster promoted the standby to primary and we can still connect to the database using the overlay IP, which now points to the new primary instance.

If we check the CRM status from the new primary instance, we can see that the Pacemaker cluster has promoted the standby database to new primary database as shown in Figure 9.

Figure 9. Automatic failover to standby

Let’s go back to our client and reestablish the connection using the cataloged DB alias created using overlay IP.

Figure 10. Database reconnection after failover

If we connect to the new promoted primary instance and check the applications connected, we can see the active connection from the client’s IP as shown in Figure 11.

Figure 11. Check client connections after failover

Cleaning up

To avoid incurring future charges, terminate all EC2 instances which were created as part of the setup referencing this blog post.

Conclusion

In this blog post, we have set up automatic failover using IBM Db2 Pacemaker with overlay (virtual) IP to route traffic to secondary database instance during failover, which helps to reconnect to the database without any manual intervention. In addition, we can also enable automatic client reroute using the overlay IP address to achieve a seamless failover connectivity to the database for mission-critical workloads.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

How MEDHOST’s cardiac risk prediction successfully leveraged AWS analytic services

2021-08-23 Pandian Velayutham

Post Syndicated from Pandian Velayutham original https://aws.amazon.com/blogs/big-data/how-medhosts-cardiac-risk-prediction-successfully-leveraged-aws-analytic-services/

MEDHOST has been providing products and services to healthcare facilities of all types and sizes for over 35 years. Today, more than 1,000 healthcare facilities are partnering with MEDHOST and enhancing their patient care and operational excellence with its integrated clinical and financial EHR solutions. MEDHOST also offers a comprehensive Emergency Department Information System with business and reporting tools. Since 2013, MEDHOST’s cloud solutions have been utilizing Amazon Web Services (AWS) infrastructure, data source, and computing power to solve complex healthcare business cases.

MEDHOST can utilize the data available in the cloud to provide value-added solutions for hospitals solving complex problems, like predicting sepsis, cardiac risk, and length of stay (LOS) as well as reducing re-admission rates. This requires a solid foundation of data lake and elastic data pipeline to keep up with multi-terabyte data from thousands of hospitals. MEDHOST has invested a significant amount of time evaluating numerous vendors to determine the best solution for its data needs. Ultimately, MEDHOST designed and implemented machine learning/artificial intelligence capabilities by leveraging AWS Data Lab and an end-to-end data lake platform that enables a variety of use cases such as data warehousing for analytics and reporting.

Since you’re reading this post, you may also be interested in the following:

MEDHOST expands multi-tenant cloud EHR capabilities with AWS (AWS for Industries)

Getting started

MEDHOST’s initial objectives in evaluating vendors were to:

Build a low-cost data lake solution to provide cardiac risk prediction for patients based on health records
Provide an analytical solution for hospital staff to improve operational efficiency
Implement a proof of concept to extend to other machine learning/artificial intelligence solutions

The AWS team proposed AWS Data Lab to architect, develop, and test a solution to meet these objectives. The collaborative relationship between AWS and MEDHOST, AWS’s continuous innovation, excellent support, and technical solution architects helped MEDHOST select AWS over other vendors and products. AWS Data Lab’s well-structured engagement helped MEDHOST define clear, measurable success criteria that drove the implementation of the cardiac risk prediction and analytical solution platform. The MEDHOST team consisted of architects, builders, and subject matter experts (SMEs). By connecting MEDHOST experts directly to AWS technical experts, the MEDHOST team gained a quick understanding of industry best practices and available services allowing MEDHOST team to achieve most of the success criteria at the end of a four-day design session. MEDHOST is now in the process of moving this work from its lower to upper environment to make the solution available for its customers.

Solution

For this solution, MEDHOST and AWS built a layered pipeline consisting of ingestion, processing, storage, analytics, machine learning, and reinforcement components. The following diagram illustrates the Proof of Concept (POC) that was implemented during the four-day AWS Data Lab engagement.

Ingestion layer

The ingestion layer is responsible for moving data from hospital production databases to the landing zone of the pipeline.

The hospital data was stored in an Amazon RDS for PostgreSQL instance and moved to the landing zone of the data lake using AWS Database Migration Service (DMS). DMS made migrating databases to the cloud simple and secure. Using its ongoing replication feature, MEDHOST and AWS implemented change data capture (CDC) quickly and efficiently so MEDHOST team could spend more time focusing on the most interesting parts of the pipeline.

Processing layer

The processing layer was responsible for performing extract, tranform, load (ETL) on the data to curate them for subsequent uses.

MEDHOST used AWS Glue within its data pipeline for crawling its data layers and performing ETL tasks. The hospital data copied from RDS to Amazon S3 was cleaned, curated, enriched, denormalized, and stored in parquet format to act as the heart of the MEDHOST data lake and a single source of truth to serve any further data needs. During the four-day Data Lab, MEDHOST and AWS targeted two needs: powering MEDHOST’s data warehouse used for analytics and feeding training data to the machine learning prediction model. Even though there were multiple challenges, data curation is a critical task which requires an SME. AWS Glue’s serverless nature, along with the SME’s support during the Data Lab, made developing the required transformations cost efficient and uncomplicated. Scaling and cluster management was addressed by the service, which allowed the developers to focus on cleaning data coming from homogenous hospital sources and translating the business logic to code.

Storage layer

The storage layer provided low-cost, secure, and efficient storage infrastructure.

MEDHOST used Amazon S3 as a core component of its data lake. AWS DMS migration tasks saved data to S3 in .CSV format. Crawling data with AWS Glue made this landing zone data queryable and available for further processing. The initial AWS Glue ETL job stored the parquet formatted data to the data lake and its curated zone bucket. MEDHOST also used S3 to store the .CSV formatted data set that will be used to train, test, and validate its machine learning prediction model.

Analytics layer

The analytics layer gave MEDHOST pipeline reporting and dashboarding capabilities.

The data was in parquet format and partitioned in the curation zone bucket populated by the processing layer. This made querying with Amazon Athena or Amazon Redshift Spectrum fast and cost efficient.

From the Amazon Redshift cluster, MEDHOST created external tables that were used as staging tables for MEDHOST data warehouse and implemented an UPSERT logic to merge new data in its production tables. To showcase the reporting potential that was unlocked by the MEDHOST analytics layer, a connection was made to the Redshift cluster to Amazon QuickSight. Within minutes MEDHOST was able to create interactive analytics dashboards with filtering and drill-down capabilities such as a chart that showed the number of confirmed disease cases per US state.

Machine learning layer

The machine learning layer used MEDHOST’s existing data sets to train its cardiac risk prediction model and make it accessible via an endpoint.

Before getting into Data Lab, the MEDHOST team was not intimately familiar with machine learning. AWS Data Lab architects helped MEDHOST quickly understand concepts of machine learning and select a model appropriate for its use case. MEDHOST selected XGBoost as its model since cardiac prediction falls within regression technique. MEDHOST’s well architected data lake enabled it to quickly generate training, testing, and validation data sets using AWS Glue.

Amazon SageMaker abstracted underlying complexity of setting infrastructure for machine learning. With few clicks, MEDHOST started Jupyter notebook and coded the components leading to fitting and deploying its machine learning prediction model. Finally, MEDHOST created the endpoint for the model and ran REST calls to validate the endpoint and trained model. As a result, MEDHOST achieved the goal of predicting cardiac risk. Additionally, with Amazon QuickSight’s SageMaker integration, AWS made it easy to use SageMaker models directly in visualizations. QuickSight can call the model’s endpoint, send the input data to it, and put the inference results into the existing QuickSight data sets. This capability made it easy to display the results of the models directly in the dashboards. Read more about QuickSight’s SageMaker integration here.

Reinforcement layer

Finally, the reinforcement layer guaranteed that the results of the MEDHOST model were captured and processed to improve performance of the model.

The MEDHOST team went beyond the original goal and created an inference microservice to interact with the endpoint for prediction, enabled abstracting of the machine learning endpoint with the well-defined domain REST endpoint, and added a standard security layer to the MEDHOST application.

When there is a real-time call from the facility, the inference microservice gets inference from the SageMaker endpoint. Records containing input and inference data are fed to the data pipeline again. MEDHOST used Amazon Kinesis Data Streams to push records in real time. However, since retraining the machine learning model does not need to happen in real time, the Amazon Kinesis Data Firehose enabled MEDHOST to micro-batch records and efficiently save them to the landing zone bucket so that the data could be reprocessed.

Conclusion

Collaborating with AWS Data Lab enabled MEDHOST to:

Store single source of truth with low-cost storage solution (data lake)
Complete data pipeline for a low-cost data analytics solution
Create an almost production-ready code for cardiac risk prediction

The MEDHOST team learned many concepts related to data analytics and machine learning within four days. AWS Data Lab truly helped MEDHOST deliver results in an accelerated manner.

About the Authors

Pandian Velayutham is the Director of Engineering at MEDHOST. His team is responsible for delivering cloud solutions, integration and interoperability, and business analytics solutions. MEDHOST utilizes modern technology stack to provide innovative solutions to our customers. Pandian Velayutham is a technology evangelist and public cloud technology speaker.

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Introducing Amazon MemoryDB for Redis – A Redis-Compatible, Durable, In-Memory Database Service

2021-08-19 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-memorydb-for-redis-a-redis-compatible-durable-in-memory-database-service/

Interactive applications need to process requests and respond very quickly, and this requirement extends to all the components of their architecture. That is even more important when you adopt microservices and your architecture is composed of many small independent services that communicate with each other.

For this reason, database performance is critical to the success of applications. To reduce read latency to microseconds, you can put an in-memory cache in front of a durable database. For caching, many developers use Redis, an open-source in-memory data structure store. In fact, according to Stack Overflow’s 2021 Developer Survey, Redis has been the most loved database for five years.

To implement this setup on AWS, you can use Amazon ElastiCache for Redis, a fully managed in-memory caching service, as a low latency cache in front of a durable database service such as Amazon Aurora or Amazon DynamoDB to minimize data loss. However, this setup requires you to introduce custom code in your applications to keep the cache in sync with the database. You’ll also incur costs for running both a cache and a database.

Introducing Amazon MemoryDB for Redis
Today, I am excited to announce the general availability of Amazon MemoryDB for Redis, a new Redis-compatible, durable, in-memory database. MemoryDB makes it easy and cost-effective to build applications that require microsecond read and single-digit millisecond write performance with data durability and high availability.

Instead of using a low-latency cache in front of a durable database, you can now simplify your architecture and use MemoryDB as a single, primary database. With MemoryDB, all your data is stored in memory, enabling low latency and high throughput data access. MemoryDB uses a distributed transactional log that stores data across multiple Availability Zones (AZs) to enable fast failover, database recovery, and node restarts with high durability.

MemoryDB maintains compatibility with open-source Redis and supports the same set of Redis data types, parameters, and commands that you are familiar with. This means that the code, applications, drivers, and tools you already use today with open-source Redis can be used with MemoryDB. As a developer, you get immediate access to many data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams. You also get access to advanced features such as built-in replication, least recently used (LRU) eviction, transactions, and automatic partitioning. MemoryDB is compatible with Redis 6.2 and will support newer versions as they are released in open source.

One question you might have at this point is how MemoryDB compares to ElastiCache because both services give access to Redis data structures and API:

MemoryDB can safely be the primary database for your applications because it provides data durability and microsecond read and single-digit millisecond write latencies. With MemoryDB, you don’t need to add a cache in front of the database to achieve the low latency you need for your interactive applications and microservices architectures.
On the other hand, ElastiCache provides microsecond latencies for both reads and writes. It is ideal for caching workloads where you want to accelerate data access from your existing databases. ElastiCache can also be used as a primary datastore for use cases where data loss might be acceptable (for example, because you can quickly rebuild the database from another source).

Creating an Amazon MemoryDB Cluster
In the MemoryDB console, I follow the link on the left navigation pane to the Clusters section and choose Create cluster. This opens Cluster settings where I enter a name and a description for the cluster.

All MemoryDB clusters run in a virtual private cloud (VPC). In Subnet groups I create a subnet group by selecting one of my VPCs and providing a list of subnets that the cluster will use to distribute its nodes.

In Cluster settings, I can change the network port, the parameter group that controls the runtime properties of my nodes and clusters, the node type, the number of shards, and the number of replicas per shard. Data stored in the cluster is partitioned across shards. The number of shards and the number of replicas per shard determine the number of nodes in my cluster. Considering that for each shard there is a primary node plus the replicas, I expect this cluster to have eight nodes.

For Redis version compatibility, I choose 6.2. I leave all other options to their default and choose Next.

In the Security section of Advanced settings I add the default security group for the VPC I used for the subnet group and choose an access control list (ACL) that I created before. MemoryDB ACLs are based on Redis ACLs and provide user credentials and permissions to connect to the cluster.

In the Snapshot section, I leave the default to have MemoryDB automatically create a daily snapshot and select a retention period of 7 days.

For Maintenance, I leave the defaults and then choose Create. In this section I can also provide an Amazon Simple Notification Service (SNS) topic to be notified of important cluster events.

After a few minutes, the cluster is running and I can connect using the Redis command line interface or any Redis client.

Using Amazon MemoryDB as Your Primary Database
Managing customer data is a critical component of many business processes. To test the durability of my new Amazon MemoryDB cluster, I want to use it as a customer database. For simplicity, let’s build a simple microservice in Python that allows me to create, update, delete, and get one or all customer data from a Redis cluster using a REST API.

Here’s the code of my server.py implementation:

from flask import Flask, request
from flask_restful import Resource, Api, abort
from rediscluster import RedisCluster
import logging
import os
import uuid

host = os.environ['HOST']
port = os.environ['PORT']
db_host = os.environ['DBHOST']
db_port = os.environ['DBPORT']
db_username = os.environ['DBUSERNAME']
db_password = os.environ['DBPASSWORD']

logging.basicConfig(level=logging.INFO)

redis = RedisCluster(startup_nodes=[{"host": db_host, "port": db_port}],
            decode_responses=True, skip_full_coverage_check=True,
            ssl=True, username=db_username, password=db_password)

if redis.ping():
    logging.info("Connected to Redis")

app = Flask(__name__)
api = Api(app)


class Customers(Resource):

    def get(self):
        key_mask = "customer:*"
        customers = []
        for key in redis.scan_iter(key_mask):
            customer_id = key.split(':')[1]
            customer = redis.hgetall(key)
            customer['id'] = customer_id
            customers.append(customer)
            print(customer)
        return customers

    def post(self):
        print(request.json)
        customer_id = str(uuid.uuid4())
        key = "customer:" + customer_id
        redis.hset(key, mapping=request.json)
        customer = request.json
        customer['id'] = customer_id
        return customer, 201


class Customers_ID(Resource):

    def get(self, customer_id):
        key = "customer:" + customer_id
        customer = redis.hgetall(key)
        print(customer)
        if customer:
            customer['id'] = customer_id
            return customer
        else:
            abort(404)

    def put(self, customer_id):
        print(request.json)
        key = "customer:" + customer_id
        redis.hset(key, mapping=request.json)
        return '', 204

    def delete(self, customer_id):
        key = "customer:" + customer_id
        redis.delete(key)
        return '', 204


api.add_resource(Customers, '/customers')
api.add_resource(Customers_ID, '/customers/<customer_id>')


if __name__ == '__main__':
    app.run(host=host, port=port)

This is the requirements.txt file, which lists the Python modules required by the application:

redis-py-cluster
Flask
Flask-RESTful

The same code works with MemoryDB, ElastiCache, or any Redis Cluster database.

I start a Linux Amazon Elastic Compute Cloud (Amazon EC2) instance in the same VPC as the MemoryDB cluster. To be able to connect to the MemoryDB cluster, I assign the default security group. I also add another security group that gives me SSH access to the instance.

I copy the server.py and requirements.txt files onto the instance and then install the dependencies:

pip3 install --user -r requirements.txt

Now, I start the microservice:

python3 server.py

In another terminal connection, I use curl to create a customer in my database with an HTTP POST on the /customers resource:

curl -i --header "Content-Type: application/json" --request POST \
     --data '{"name": "Danilo", "address": "Somewhere in London",
              "phone": "+1-555-2106","email": "[email protected]", "balance": 1000}' \
     http://localhost:8080/customers

The result confirms that the data has been stored and a unique ID (a UUIDv4 generated by the Python code) has been added to the fields:

HTTP/1.0 201 CREATED
Content-Type: application/json
Content-Length: 172
Server: Werkzeug/2.0.1 Python/3.7.10
Date: Wed, 11 Aug 2021 18:16:58 GMT

{"name": "Danilo", "address": "Somewhere in London",
 "phone": "+1-555-2106", "email": "[email protected]",
 "balance": 1000, "id": "3894e683-1178-4787-9f7d-118511686415"}

All the fields are stored in a Redis Hash with a key formed as customer:<id>.

I repeat the previous command a couple of times to create three customers. The customer data is the same, but each one has a unique ID.

Now, I get a list of all customer with an HTTP GET to the /customers resource:

curl -i http://localhost:8080/customers

In the code there is an iterator on the matching keys using the SCAN command. In the response, I see the data for the three customers:

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 526
Server: Werkzeug/2.0.1 Python/3.7.10
Date: Wed, 11 Aug 2021 18:20:11 GMT

[{"name": "Danilo", "address": "Somewhere in London",
"phone": "+1-555-2106", "email": "[email protected]",
"balance": "1000", "id": "1d734b6a-56f1-48c0-9a7a-f118d52e0e70"},
{"name": "Danilo", "address": "Somewhere in London",
"phone": "+1-555-2106", "email": "[email protected]",
"balance": "1000", "id": "89bf6d14-148a-4dfa-a3d4-253492d30d0b"},
{"name": "Danilo", "address": "Somewhere in London",
"phone": "+1-555-2106", "email": "[email protected]",
"balance": "1000", "id": "3894e683-1178-4787-9f7d-118511686415"}]

One of the customers has just spent all his balance. I update the field with an HTTP PUT on the URL of the customer resource that includes the ID (/customers/<id>):

curl -i --header "Content-Type: application/json" \
     --request PUT \
     --data '{"balance": 0}' \
     http://localhost:8080/customers/3894e683-1178-4787-9f7d-118511686415

The code is updating the fields of the Redis Hash with the data of the request. In this case, it’s setting the balance to zero. I verify the update by getting the customer data by ID:

curl -i http://localhost:8080/customers/3894e683-1178-4787-9f7d-118511686415

In the response, I see that the balance has been updated:

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 171
Server: Werkzeug/2.0.1 Python/3.7.10
Date: Wed, 11 Aug 2021 18:32:15 GMT

{"name": "Danilo", "address": "Somewhere in London",
"phone": "+1-555-2106", "email": "[email protected]",
"balance": "0", "id": "3894e683-1178-4787-9f7d-118511686415"}

That’s the power of Redis! I was able to create the skeleton of a microservice with just a few lines of code. On top of that, MemoryDB gives me the durability and the high availability I need in production without the need to add another database in the backend.

Depending on my workload, I can scale my MemoryDB cluster horizontally, by adding or removing nodes, or vertically, by moving to larger or smaller node types. MemoryDB supports write scaling with sharding and read scaling by adding replicas. My cluster continues to stay online and support read and write operations during resizing operations.

Availability and Pricing
Amazon MemoryDB for Redis is available today in US East (N. Virginia), EU (Ireland), Asia Pacific (Mumbai), and South America (Sao Paulo) with more AWS Regions coming soon.

You can create a MemoryDB cluster in minutes using the AWS Management Console, AWS Command Line Interface (CLI), or AWS SDKs. AWS CloudFormation support will be coming soon. For the nodes, MemoryDB currently supports R6g Graviton2 instances.

To migrate from ElastiCache for Redis to MemoryDB, you can take a backup of your ElastiCache cluster and restore it to a MemoryDB cluster. You can also create a new cluster from a Redis Database Backup (RDB) file stored on Amazon Simple Storage Service (Amazon S3).

With MemoryDB, you pay for what you use based on on-demand instance hours per node, volume of data written to your cluster, and snapshot storage. For more information, see the MemoryDB pricing page.

Learn More
Check out the video below for a quick overview.

Start using Amazon MemoryDB for Redis as your primary database today.

— Danilo

Query SAP HANA using Athena Federated Query and join with data in your Amazon S3 data lake

2021-08-18 Navnit Shukla

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-sap-hana-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use SAP HANA as your transactional data store, you may need to join the data in your data lake with SAP HANA in the cloud, SAP HANA running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises SAP HANA, for example to build a dashboard or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data from SAP HANA database without building ETL pipelines to copy or unload the data to the S3 data lake or SAP HANA. This removes the overhead of creating additional extract, transform, and load (ETL) processes and shortens the development cycle.

In this post, we walk you through a step-by-step configuration to set up Amazon Athena Federated Query using AWS Lambda to access data in a SAP HANA database running on AWS.

For this post, we will be using the SAP HANA Athena Federated query connector developed by Trianz. You can deploy the Athena Federated query connector developed by Trianz available in the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into Lambda RAM runtime memory, it spills the data to Amazon S3 and is later accessed by Athena.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

Create a secret for the SAP HANA instance using AWS Secrets Manager.
Create an S3 bucket and subfolder for Lambda to use.
Configure Athena federation with the SAP HANA instance.
Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have a SAP HANA database up and running on AWS.

Create a secret for the SAP HANA instance

Our first step is to create a secret for the SAP HANA instance with a username and password using Secrets Manager.

On the Secrets Manager console, choose Secrets.
Choose Store a new secret.
Select Other types of secrets.
Set the credentials as key-value pairs (username, password) for your SAP HANA instance.
For Secret name, enter a name for your secret. Use the prefix SAP HANAAFQ so it’s easy to find.
Leave the remaining fields at their defaults and choose Next.
Complete your secret creation.

Setting up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use

For this post, we have used (Amazon S3 bucket name/folder) athena-accelerator/saphana.

Configure Athena federation with the SAP HANA instance

To configure Athena federation with your SAP HANA instance, complete the following steps:

On the AWS Serverless Application Repository console, choose Available applications.
In the search field, enter TrianzSAPHANAAthenaJDBC.

In the Application settings section, provide the following details:

For Application name, enter TrianzSAPHANAAthenaJDBC.
For SecretNamePrefix, enter trianz-saphana-athena-jdbc.
For SpillBucket, enter Athena-accelerator/saphana.

For JDBCConnectorConfig, use the format saphana://jdbc:sap://{saphana_instance_url}/?${secretname}.

For DisableSpillEncyption, choose False.
For LambdaFunctionName, enter trsaphana.
For SecurityGroupID, use the security group id using which lambda can connect to the SAP HANA

Make sure to apply valid inbound and outbound rules based on your connection.

For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
For Subnetids – Use the subnets using which lambda can connect to SAP HANA instance with comma separation.

Make sure the subnet is in a VPC and has a NAT gateway and internet gateway attached.

Select the I acknowledge check box.
Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, Amazon CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your federated queries using lambda:trsaphana to run against tables in the SAP HANA database. trsaphana is the name of lambda function which we have created in step 7 of previous section of this blog.

lambda:trsaphana is a reference data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot demonstrates joining the dataset between SAP HANA and the data lake.

Key performance best practice considerations

If you’re considering Athena federation with a SAP HANA database, we recommend the following best practices:

Athena federation works great for queries with predicate filtering because the predicates are pushed down to the SAP HANA database. Use filter and limited-range scans in your queries to avoid full table scans.
If your SQL query requires returning a large volume of data from the SAP HANA database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from SAP HANA to your S3 data lake.
Star schema is a commonly used data model in SAP HANA databases. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in your SAP HANA database. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the SAP HANA database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your SAP HANA database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated query with SAP HANA using Lambda. Now you don’t need to wait for all the data in your SAP HANA data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from SAP HANA for better performance. When queries are well-written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.

About the Author

Navnit Shukla is AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Query a Teradata database using Amazon Athena Federated Query and join with data in your Amazon S3 data lake

2021-07-20 Navnit Shukla

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-a-teradata-database-using-amazon-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Teradata as your transactional data store, you may need to join the data in your data lake with Teradata in the cloud, Teradata running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises Teradata database, for example to build a dashboard or create consolidated reporting.

In these use cases, the Amazon Athena Federated Query feature allows you to seamlessly access the data from Teradata database without having to move the data to your S3 data lake. This removes the overhead in managing such jobs.

In this post, we will walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Teradata database running on premises.

For this post, we will be using the Oracle Athena Federated Query connector developed by Trianz. The runtime includes a Teradata instance on premises. Your Teradata instance can be on the cloud, on Amazon EC2, or on premises. You can deploy the Trianz Oracle Athena Federated Query connector from the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena Federated Query works by using Lambda to integrate with a federated data source.

To implement this solution, we complete the following steps:

Create a secret for the Teradata instance using AWS Secrets Manager.
Create an S3 bucket and subfolder for Lambda to use.
Configure Athena federation with the Teradata instance.
Run federated queries with Athena.

Prerequisite

Before you start this walkthrough, make sure your Teradata database is up and running.

Create a secret for the Teradata instance

Our first step is to create a secret for the Teradata instance with a username and password using Secrets Manager.

On the Secrets Manager console, choose Secrets.
Choose Store a new secret.
Select Other types of secrets.
Set the credentials as key-value pairs (username, password) for your Teradata instance.

For Secret name, enter a name for your secret. Use the prefix TeradataAFQ so it’s easy to find.
Leave the remaining fields at their defaults and choose Next.
Complete your secret creation.

Set up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we create athena-accelerator/teradata.

Configure Athena federation with the Teradata instance

To configure Athena federation with Teradata instance, complete the following steps:

On the AWS Serverless Application Repository console, choose Available applications.
Select Show apps that create custom IAM roles or resource policies.
In the search field, enter TrianzTeradataAthenaJDBC.
Choose the application.

For SecretNamePrefix, enter TeradataAFQ.
For SpillBucket, enter Athena-accelerator/teradata.
For JDBCConnectorConfig, use the format teradata://jdbc:teradata://hostname/user=testUser&password=testPassword.
For DisableSpillEncryption, enter false.
For LambdaFunctionName, enter teradataconnector.
For SecurityGroupID, enter the security group ID where the Teradata instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
For Subnetids, use the subnets where the Teradata instance is running with comma separation.

Make sure the subnet is in a VPC and has NAT gateway and internet gateway attached.

Select the I acknowledge check box.
Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, Amazon CloudTrail, Secrets Manager, Lambda, and Athena. For more information about Athena IAM access, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your queries using lambda:teradataconnector to run against tables in the Teradata database. teradataconnector is the name of lambda function which we have created in step 7 of previous section of this blog.

lambda:teradataconnector references a data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot shows the query that joins the dataset between Teradata and the S3 data lake.

Key performance best practices

If you’re considering Athena Federated Query with Teradata, we recommend the following best practices:

Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Teradata database. Use filter and limited-range scans in your queries to avoid full table scans.
If your SQL query requires returning a large volume of data from the Teradata database to Athena (which could lead to query timeouts or slow performance), you may consider moving data from Teradata to your S3 data lake.
The star schema is a commonly used data model in Teradata. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Teradata. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Teradata database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Teradata database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated Query with Teradata. Now you don’t need to wait for all the data in your Teradata data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practices outlined in the post to help minimize the data transferred from Teradata for better performance. When queries are well written for Athena Federated Query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.

About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Overview

Web service throughput

Regular expression searches

Database storage engines

Machine learning training

Recursion

Conclusion

Example—

Methodology and Walkthrough

Clean up

Conclusion

Author:

Overview of our conceptual financial crime discovery solution

Using the AWS Cloud to scale up a graph database

Financial crimes rules

Generating data

Walkthrough of the financial investigator workflow

Prerequisites

Rules

Run the job and check your results

Visualization

Cleaning up

Conclusion

References

Further reading

Check out the video to learn how to define data storage periods on your Zabbix instance.

Tips and best practices::

The Goal to Accelerate and Empower Data Analysts

An example Use case

Prerequisites

Create Schemas

Creating Tables

Loading Data

Viewing load errors

Saving and reusing the queries

Extending the Data Warehouse to the Data Lake

Viewing External Table Definitions

Conclusion

About the Authors

Introduction

JTBD framework

Job 1: To fill their stomachs

Job 2: To make children happy

JTBD at GrabFood

Conclusion

Join us

Zabbix MongoDB template

Deploying Zabbix Agent 2 on your Host

Creating a MongoDB user for monitoring

Creating and configuring a MongoDB host

Fine-tuning MongoDB monitoring

Prepare the QuickSight environment

Set up the data source infrastructure

VPC peering

AWS Transit Gateway

AWS PrivateLink

VPC sharing

Connect QuickSight to the data source

Clean up

Conclusion

About the author

Background

Kafka to DSP

MySQL to ES

Original Incremental Sync

Original Kafka streams

Stream Consumer infrastructure

Event Buffer procedure

Event Handler procedure

Issues in the original infrastructure

Optimised Incremental Sync

MySQL Binlog

Current Kafka streams

Stream Consumer optimisations

Event Handler optimisations

Optimisation 1

Achievements

Optimisation 2

Achievements

Event Buffer optimisation