
Monitor Apache HBase on Amazon EMR using Amazon Managed Service for Prometheus and Amazon Managed Grafana

2023-02-13 Anubhav Awasthi

Post Syndicated from Anubhav Awasthi original https://aws.amazon.com/blogs/big-data/monitor-apache-hbase-on-amazon-emr-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/

Amazon EMR provides a managed Apache Hadoop framework that makes it straightforward, fast, and cost-effective to run Apache HBase. Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database that runs on top of the Apache Hadoop Distributed File System (HDFS). It’s built for random, strictly consistent, real-time access for tables with billions of rows and millions of columns. Monitoring HBase clusters is critical in order to identify stability and performance bottlenecks and proactively preempt them. In this post, we discuss how you can use Amazon Managed Service for Prometheus and Amazon Managed Grafana to monitor, alert, and visualize HBase metrics.

HBase has built-in support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX. You can use either AWS Distro for OpenTelemetry or Prometheus JMX exporters to collect the metrics exposed by HBase. In this post, we show how to use Prometheus exporters. These exporters behave like small web servers that convert internal application metrics to the Prometheus format and serve them at the /metrics path. A Prometheus server running on an Amazon Elastic Compute Cloud (Amazon EC2) instance collects these metrics and remote writes them to an Amazon Managed Service for Prometheus workspace. We then use Amazon Managed Grafana to create dashboards and view these metrics, using the Amazon Managed Service for Prometheus workspace as its data source.
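
To make the exporter output concrete, here is a minimal Python sketch that parses a payload in the Prometheus text format, as a /metrics endpoint might return it. The metric names are taken from the tables later in this post; the values, the host label, and the parse_metrics helper are made up for illustration, and the parser assumes sample lines carry no trailing timestamp.

```python
# Sample payload in the Prometheus text exposition format. Metric names come
# from this post; the values and the "host" label are illustrative only.
SAMPLE = """\
# HELP hadoop_HBase_numregionservers Number of live region servers
# TYPE hadoop_HBase_numregionservers gauge
hadoop_HBase_numregionservers 2.0
hadoop_HBase_numdeadregionservers 0.0
hadoop_HBase_regioncount{host="core-1"} 12.0
"""


def parse_metrics(text):
    """Return {metric name (including labels): value} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token (no timestamps assumed).
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
print(metrics["hadoop_HBase_numregionservers"])
```

In practice the Prometheus server does this parsing itself on every scrape; the sketch only shows what travels over the wire.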

This solution can be extended to other big data platforms such as Apache Spark and Apache Presto that also use JMX to expose their metrics.

Solution overview

The following diagram illustrates our solution architecture.

Solution Architecture

This post uses an AWS CloudFormation template to perform the following actions:

Install an open-source Prometheus server on an EC2 instance.
Create appropriate AWS Identity and Access Management (IAM) roles and a security group for the EC2 instance running the Prometheus server.
Create an EMR cluster with an HBase on Amazon S3 configuration.
Install JMX exporters on all EMR nodes.
Create additional security groups for the EMR master and worker nodes to connect with the Prometheus server running on the EC2 instance.
Create a workspace in Amazon Managed Service for Prometheus.

Prerequisites

To implement this solution, make sure you have the following prerequisites:

An AWS account that provides access to AWS services.
AWS IAM Identity Center (successor to AWS Single Sign-On) enabled in your account and an IAM Identity Center user to use with Amazon Managed Grafana. For instructions, refer to Enable IAM Identity Center.
A key pair to SSH into the EMR master node. For instructions, refer to Create a key pair using Amazon EC2.
Amazon EMR default roles (EMR_DefaultRole and EMR_EC2_DefaultRole). For instructions, refer to Configure IAM service roles for Amazon EMR permissions to AWS services and resources, or run the following command from the terminal or AWS Cloud9 to create the default roles:

aws emr create-default-roles

Deploy the CloudFormation template

Deploy the CloudFormation template in the us-east-1 Region:

It will take 15–20 minutes for the template to complete. The template requires the following fields:

Stack Name – Enter a name for the stack
VPC – Choose an existing VPC
Subnet – Choose an existing subnet
EMRClusterName – Use EMRHBase
HBaseRootDir – Provide a new HBase root directory (for example, s3://hbase-root-dir/).
MasterInstanceType – Use m5.xlarge
CoreInstanceType – Use m5.xlarge
CoreInstanceCount – Enter 2
SSHIPRange – Use <your ip address>/32 (you can go to https://checkip.amazonaws.com/ to check your IP address)
EMRKeyName – Choose a key pair for the EMR cluster
EMRRleaseLabel – Use emr-6.9.0
InstanceType – Use the EC2 instance type for installing the Prometheus server

cloud formation parameters

Enable remote writes on the Prometheus server

The Prometheus server is running on an EC2 instance. You can find the instance hostname in the CloudFormation stack’s Outputs tab for key PrometheusServerPublicDNSName.

SSH into the EC2 instance using the key pair:

ssh -i <sshKey.pem> ec2-user@<Public IPv4 DNS of EC2 instance running Prometheus server>

Copy the value for Endpoint – remote write URL from the Amazon Managed Service for Prometheus workspace console.

Edit remote_write url in /etc/prometheus/conf/prometheus.yml:

sudo vi /etc/prometheus/conf/prometheus.yml

It should look like the following code:
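
As a rough sketch of what such a stanza typically contains, the following Python helper renders a remote_write block that signs requests with SigV4. The workspace ID in the URL is a placeholder, the remote_write_stanza helper is hypothetical, and the file generated by the template may contain additional settings, so treat this as an assumption rather than the exact expected content.

```python
def remote_write_stanza(workspace_url, region):
    """Render a remote_write block that signs requests with SigV4."""
    return (
        "remote_write:\n"
        f"  - url: {workspace_url}\n"
        "    sigv4:\n"
        f"      region: {region}\n"
    )


# Placeholder URL: substitute the value copied from the workspace console.
URL = (
    "https://aps-workspaces.us-east-1.amazonaws.com"
    "/workspaces/<workspace-id>/api/v1/remote_write"
)
stanza = remote_write_stanza(URL, "us-east-1")
print(stanza)
```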

Now we need to restart the Prometheus server to pick up the changes:

sudo systemctl restart prometheus

Enable Amazon Managed Grafana to read from an Amazon Managed Service for Prometheus workspace

We need to add the Amazon Managed Service for Prometheus workspace as a data source in Amazon Managed Grafana. You can skip directly to step 3 if you already have an existing Amazon Managed Grafana workspace and want to use it for HBase metrics.

First, let’s create a workspace on Amazon Managed Grafana. You can follow the appendix to create a workspace using the Amazon Managed Grafana console or run the following command from your terminal (provide your role ARN):

aws grafana create-workspace \
--account-access-type CURRENT_ACCOUNT \
--authentication-providers AWS_SSO \
--permission-type CUSTOMER_MANAGED \
--workspace-data-sources PROMETHEUS \
--workspace-name emr-metrics \
--workspace-role-arn <role-ARN> \
--workspace-notification-destinations SNS

On the Amazon Managed Grafana console, choose Configure users and select a user you want to allow to log in to Grafana dashboards.

Make sure your IAM Identity Center user type is admin. We need this to create dashboards. You can assign the viewer role to all the other users.

Log in to the Amazon Managed Grafana workspace URL using your admin credentials.
Choose AWS Data Sources in the navigation pane.

For Service, choose Amazon Managed Service for Prometheus.

For Regions, choose US East (N. Virginia).

Create an HBase dashboard

Grafana Labs has an open-source dashboard that you can use. For example, you can follow the guidance from the following HBase dashboard. Start creating your dashboard and choose the Import option. Provide the URL of the dashboard or enter 12722 and choose Load. Make sure your Prometheus workspace is selected on the next page. You should see HBase metrics showing up on the dashboard.

Key HBase metrics to monitor

HBase has a wide range of metrics for HMaster and RegionServer. The following are a few important metrics to keep in mind.

HMASTER	Metric Name	Metric Description
.	hadoop_HBase_numregionservers	Number of live region servers
.	hadoop_HBase_numdeadregionservers	Number of dead region servers
.	hadoop_HBase_ritcount	Number of regions in transition
.	hadoop_HBase_ritcountoverthreshold	Number of regions that have been in transition longer than a threshold time (default: 60 seconds)
.	hadoop_HBase_ritduration_99th_percentile	99th percentile of the time regions spend in the transition state

REGIONSERVER	Metric Name	Metric Description
.	hadoop_HBase_regioncount	Number of regions hosted by the region server
.	hadoop_HBase_storefilecount	Number of store files currently managed by the region server
.	hadoop_HBase_storefilesize	Aggregate size of the store files
.	hadoop_HBase_hlogfilecount	Number of write-ahead logs not yet archived
.	hadoop_HBase_hlogfilesize	Size of all write-ahead log files
.	hadoop_HBase_totalrequestcount	Total number of requests received
.	hadoop_HBase_readrequestcount	Number of read requests received
.	hadoop_HBase_writerequestcount	Number of write requests received
.	hadoop_HBase_numopenconnections	Number of open connections at the RPC layer
.	hadoop_HBase_numactivehandler	Number of RPC handlers actively servicing requests
Memstore	.	.
.	hadoop_HBase_memstoresize	Total memstore memory size of the region server
.	hadoop_HBase_flushqueuelength	Current depth of the memstore flush queue (if increasing, we are falling behind with clearing memstores out to Amazon S3)
.	hadoop_HBase_flushtime_99th_percentile	99th percentile latency for flush operation
.	hadoop_HBase_updatesblockedtime	Number of milliseconds updates have been blocked so the memstore can be flushed
Block Cache	.	.
.	hadoop_HBase_blockcachesize	Block cache size
.	hadoop_HBase_blockcachefreesize	Block cache free size
.	hadoop_HBase_blockcachehitcount	Number of block cache hits
.	hadoop_HBase_blockcachemisscount	Number of block cache misses
.	hadoop_HBase_blockcacheexpresshitpercent	Percentage of the time that requests with the cache turned on hit the cache
.	hadoop_HBase_blockcachecounthitpercent	Percentage of block cache hits
.	hadoop_HBase_blockcacheevictioncount	Number of block cache evictions in the region server
.	hadoop_HBase_l2cachehitratio	Local disk-based bucket cache hit ratio
.	hadoop_HBase_l2cachemissratio	Bucket cache miss ratio
Compaction	.	.
.	hadoop_HBase_majorcompactiontime_99th_percentile	Time in milliseconds taken for major compaction
.	hadoop_HBase_compactiontime_99th_percentile	Time in milliseconds taken for minor compaction
.	hadoop_HBase_compactionqueuelength	Current depth of the compaction request queue (if increasing, we are falling behind with storefile compaction)
.	hadoop_HBase_flushqueuelength	Number of flush operations waiting to be processed in the region server (a higher number indicates flush operations are slow)
IPC Queues	.	.
.	hadoop_HBase_queuesize	Total data size of all RPC calls in the RPC queues in the region server
.	hadoop_HBase_numcallsingeneralqueue	Number of RPC calls in the general processing queue in the region server
.	hadoop_HBase_processcalltime_99th_percentile	99th percentile latency for RPC calls to be processed in the region server
.	hadoop_HBase_queuecalltime_99th_percentile	99th percentile latency for RPC calls to stay in the RPC queue in the region server
JVM and GC	.	.
.	hadoop_HBase_memheapusedm	Heap used
.	hadoop_HBase_memheapmaxm	Total heap
.	hadoop_HBase_pausetimewithgc_99th_percentile	Pause time in milliseconds
.	hadoop_HBase_gccount	Garbage collection count
.	hadoop_HBase_gctimemillis	Time spent in garbage collection, in milliseconds
Latencies	.	.
.	HBase.regionserver.<op>_<measure>	Operation latencies, where <op> is Append, Delete, Mutate, Get, Replay, or Increment, and <measure> is min, max, mean, median, 75th_percentile, 95th_percentile, or 99th_percentile
.	HBase.regionserver.slow<op>Count	Number of operations we thought were slow, where <op> is one of the preceding list
Bulk Load	.	.
.	hadoop_HBase_bulkload_99th_percentile	99th percentile latency for bulk load operations (Bulkload_99th_percentile)
I/O	.	.
.	hadoop_HBase_fswritetime_99th_percentile	99th percentile of file system write time (FsWriteTime_99th_percentile)
.	hadoop_HBase_fsreadtime_99th_percentile	99th percentile of file system read time (FsReadTime_99th_percentile)
Exceptions	.	.
.	exceptions.RegionTooBusyException	.
.	exceptions.callQueueTooBig	.
.	exceptions.NotServingRegionException	.
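
As a small illustration of how a few of these metrics can be combined, the following Python sketch derives a block cache hit ratio from the hit and miss counters and flags two conditions worth alerting on. The function names, the flush queue threshold of 10, and the sample values are hypothetical; suitable thresholds depend on your workload.

```python
def block_cache_hit_ratio(hits, misses):
    """Fraction of block cache lookups that were served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0


def region_server_warnings(sample, max_flush_queue=10):
    """Return warnings for one scraped sample (a dict of metric values).

    max_flush_queue is an illustrative threshold, not a recommended value.
    """
    warnings = []
    if sample.get("hadoop_HBase_numdeadregionservers", 0) > 0:
        warnings.append("dead region servers reported")
    if sample.get("hadoop_HBase_flushqueuelength", 0) > max_flush_queue:
        warnings.append("memstore flush queue is backing up")
    return warnings


# Made-up sample values for demonstration.
sample = {
    "hadoop_HBase_blockcachehitcount": 900,
    "hadoop_HBase_blockcachemisscount": 100,
    "hadoop_HBase_numdeadregionservers": 1,
    "hadoop_HBase_flushqueuelength": 3,
}
ratio = block_cache_hit_ratio(
    sample["hadoop_HBase_blockcachehitcount"],
    sample["hadoop_HBase_blockcachemisscount"],
)
print(ratio, region_server_warnings(sample))
```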

Considerations and limitations

Note the following when using this solution:

You can set up alerts on Amazon Managed Service for Prometheus and visualize them in Amazon Managed Grafana.
This architecture can be easily extended to include other open-source frameworks such as Apache Spark, Apache Presto, and Apache Hive.
Refer to the pricing details for Amazon Managed Service for Prometheus and Amazon Managed Grafana.
These scripts are for guidance purposes only and aren’t ready for production deployments. Make sure to perform thorough testing.

Clean up

To avoid ongoing charges, delete the CloudFormation stack and workspaces created in Amazon Managed Grafana and Amazon Managed Service for Prometheus.

Conclusion

In this post, you learned how to monitor EMR HBase clusters and set up dashboards to visualize key metrics. This solution can serve as a unified monitoring platform for multiple EMR clusters and other applications. For more information on EMR HBase, see the Release Guide and the HBase Migration whitepaper.

Appendix

Complete the following steps to create a workspace on Amazon Managed Grafana:

For Authentication access, select AWS IAM Identity Center.

If you don’t have IAM Identity Center enabled, refer to Enable IAM Identity Center.

Optionally, to view Prometheus alerts in your Grafana workspace, select Turn Grafana alerting on.

On the next page, select Amazon Managed Service for Prometheus as the data source.

After the workspace is created, assign users to access Amazon Managed Grafana.

For a first-time setup, assign admin privileges to the user.

You can add other users with only viewer access.

Make sure you are able to log in to the Grafana workspace URL using your IAM Identity Center user credentials.

About the Author

Anubhav Awasthi is a Sr. Big Data Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics

2023-02-13 Miguel Chin

Post Syndicated from Miguel Chin original https://aws.amazon.com/blogs/big-data/how-olx-group-migrated-to-amazon-redshift-ra3-for-simpler-faster-and-more-cost-effective-analytics/

This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group and David Greenshtein, Specialist Solutions Architect for Analytics, AWS.

OLX Group is one of the world’s fastest-growing networks of online marketplaces, operating in over 30 countries around the world. We help people buy and sell cars, find housing, get jobs, buy and sell household goods, and much more.

We live in a data-producing world, and as companies want to become data driven, there is the need to analyze more and more data. These analyses are often done using data warehouses. However, a common data warehouse issue with ever-growing volumes of data is storage limitations and the degrading performance that comes with it. This scenario is very familiar to us in OLX Group. Our data warehouse is built using Amazon Redshift and is used by multiple internal teams to power their products and data-driven business decisions. As such, it’s crucial to maintain a cluster with high availability and performance while also being storage cost-efficient.

In this post, we share how we modernized our Amazon Redshift data warehouse by migrating to RA3 nodes and how it enabled us to achieve our business expectations. Hopefully you can learn from our experience in case you are considering doing the same.

Status quo before migration

Here at OLX Group, Amazon Redshift has been our choice for data warehouse for over 5 years. We started with a small Amazon Redshift cluster of 7 DC2.8xlarge nodes, and as its popularity and adoption increased inside the OLX Group data community, this cluster naturally grew.

Before migrating to RA3, we were using a cluster of 16 DC2.8xlarge nodes with a highly tuned workload management (WLM) configuration, and performance wasn’t an issue at all. However, we kept facing challenges with storage demand due to having more users, more data sources, and more prepared data. Almost every day we would get an alert that our disk space was close to 100%, which was about 40 TB worth of data.

Our usual method to solve storage problems used to be to simply increase the number of nodes. Overall, we reached a cluster size of 18 nodes. However, this solution wasn’t cost-efficient enough because we were adding compute capacity to the cluster even though computation power was underutilized. We saw this as a temporary solution, and we mainly did it to buy some time to explore other cost-effective alternatives, such as RA3 nodes.

Amazon Redshift RA3 nodes along with Redshift Managed Storage (RMS) provided separation of storage and compute, enabling us to scale storage and compute separately to better meet our business requirements.

Our data warehouse had the following configuration before the migration:

18 x DC2.8xlarge nodes
250 monthly active users, consistently increasing
10,000 queries per hour, 30 queries in parallel
40 TB of data, consistently increasing
100% disk space utilization

This cluster’s performance was generally good: ETL (extract, transform, and load) and interactive queries barely had any queue time, and 80% of them would finish in under 5 minutes.

Evaluating the performance of Amazon Redshift clusters with RA3 nodes

In this section, we discuss how we conducted a performance evaluation of RA3 nodes with an Amazon Redshift cluster.

Test environment

To be confident in the performance of the RA3 nodes, we decided to stress test them in a controlled environment before making the decision to migrate. To assess the nodes and find an optimal RA3 cluster configuration, we collaborated with AllCloud, an AWS Premier Consulting Partner. The following figures illustrate the approach we took to evaluate the performance of RA3.

Test setup

This strategy aims to replicate a realistic workload in different RA3 cluster configurations and compare them with our DC2 configuration. To do this, we required the following:

A reference cluster snapshot – This ensures that we can replay any tests starting from the same state.
A set of queries from the production cluster – This set can be reconstructed from the Amazon Redshift logs (STL_QUERYTEXT) and enriched by metadata (STL_QUERY). It should be noted that we only took into consideration SELECT and FETCH query types (to simplify this first stage of performance tests). The following chart shows what the profile of our test set looked like.

A replay tool to orchestrate all the query operations – AllCloud developed a Python application for us for this purpose.
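
To illustrate the second requirement, here is a minimal Python sketch of rebuilding statements from rows shaped like STL_QUERYTEXT, where a long statement is split into chunks ordered by a sequence number, and then keeping only SELECT and FETCH statements. It is not the AllCloud tool; the rebuild_queries and is_read_query helpers, the table names in the SQL, and the sample rows are hypothetical.

```python
from collections import defaultdict


def rebuild_queries(rows):
    """Reassemble statements from (query_id, sequence, text) chunks.

    rows may arrive in any order; chunks are joined by ascending sequence.
    """
    chunks = defaultdict(list)
    for query_id, sequence, text in rows:
        chunks[query_id].append((sequence, text))
    return {
        query_id: "".join(text for _, text in sorted(parts))
        for query_id, parts in chunks.items()
    }


def is_read_query(sql):
    """Keep only SELECT and FETCH statements, as in the first test stage."""
    return sql.lstrip().upper().startswith(("SELECT", "FETCH"))


# Made-up rows standing in for an export of the query log.
rows = [
    (101, 1, "FROM listings LIMIT 10"),
    (101, 0, "SELECT id, title "),
    (102, 0, "INSERT INTO audit VALUES (1)"),
]
queries = rebuild_queries(rows)
replay_set = {qid: sql for qid, sql in queries.items() if is_read_query(sql)}
print(replay_set)
```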

For more details about the approach we used, including using the Amazon Redshift Simple Replay utility, refer to Compare different node types for your workload using Amazon Redshift.

Next, we picked which cluster configurations we wanted to test, which RA3 type, and how many nodes. For the specifications of each node type, refer to Amazon Redshift pricing.

First, we decided to test the same DC2 cluster we had in production as a way to validate our test environment, followed by RA3 clusters using RA3.4xlarge nodes with various numbers of nodes. We used RA3.4xlarge because it gives us more flexibility to fine-tune how many nodes we need compared to the RA3.16xlarge instance (1 x RA3.16xlarge node is equivalent to 4 x RA3.4xlarge nodes in terms of CPU and memory). With this in mind, we tested the following cluster configurations and used the replay tool to take measurements of the performance of each cluster.

		18 x DC2 (Reference)	18 x RA3 (Before Classic Resize)	18 x RA3	6 x RA3
Queries	Number	1560	1560	1560	1560
	Timeouts	–	25	66	127
Duration/s	Mean	1.214	1.037	1.167	1.921
	Std.	2.268	2.026	2.525	3.488
	Min.	0.003	0.000	0.002	0.002
	Q 25%	0.005	0.004	0.004	0.004
	Q 50%	0.344	0.163	0.118	0.183
	Q 75%	1.040	0.746	1.076	2.566
	Max.	25.411	15.492	19.770	19.132

These results show how the DC2 cluster compares with the RA3 configurations. The faster half of the queries (up to the 50% quantile) ran faster on every RA3 configuration than on DC2. Regarding the number of RA3 nodes, the six-node cluster was clearly slower, which is most noticeable at the 75% quantile of query durations.
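
For readers who want to produce the same kind of summary from their own replay timings, the following Python sketch computes the statistics used in the table with the standard library. The summarize helper and the duration values are made up for illustration, and the quantile method ("inclusive") is an assumption; the original measurements may have used a different interpolation.

```python
import statistics


def summarize(durations):
    """Summary statistics for a list of query durations in seconds."""
    q25, q50, q75 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "number": len(durations),
        "mean": statistics.mean(durations),
        "std": statistics.stdev(durations),
        "min": min(durations),
        "q25": q25,
        "q50": q50,
        "q75": q75,
        "max": max(durations),
    }


# Illustrative durations only, not the measurements from this post.
durations = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(durations)
for name, value in summary.items():
    print(f"{name}: {value}")
```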

We used the following steps to deploy different clusters:

Use 18 x DC2.8xlarge, restored from the original snapshot (18 x DC2.8xlarge).
Take measurements 18 x DC2.
Use 18 x RA3.4xlarge, restored from the original snapshot (18 x DC2.8xlarge).
Take measurements 18 x RA3 (before classic resize).
Use 6 x RA3.4xlarge, classic resize from 18 x RA3.4xlarge.
Take snapshot from 6 x RA3.4xlarge.
Take measurements 6 x RA3.
Use 6 x RA3.4xlarge, restored from 6 x RA3.4xlarge snapshot.
Use 18 x RA3.4xlarge, elastic resize from 6 x RA3.4xlarge.
Take measurements 18 x RA3.

Although these are promising results, there were some limitations in the test environment setup. We were concerned that we weren’t stressing the clusters enough: queries ran only in sequence from a single client, and using only SELECT and FETCH query types moved us away from a realistic workload. Therefore, we proceeded to the second stage of our tests.

Concurrency stress test

To stress the clusters, we changed our replay tool to run multiple queries in parallel. Queries extracted from the log files were queued with the same frequency as they were originally run in the reference cluster. Up to 50 clients take queries from the queue and send them to Amazon Redshift. The timing of all queries is recorded for comparison with the reference cluster.

The cluster performance is evaluated by measuring the temporal course of the query concurrency. If a cluster is as performant as the reference cluster, its concurrency will closely follow that of the reference cluster: queries pushed to the query queue are immediately picked up by a client and sent to the cluster. If the cluster isn’t capable of handling the queries as fast as the reference cluster, the number of concurrently running queries will increase compared to the reference cluster. We also decided to keep concurrency scaling disabled during this test because we wanted to focus on node types instead of cluster features.
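
The concurrency measurement can be sketched as follows: given the recorded start and end time of each replayed query, count how many queries are running at regular sampling instants and summarize the resulting series. This is a simplified Python illustration, not the replay tool itself; the concurrency_samples helper, the one-second sampling step, and the intervals are assumptions.

```python
import statistics


def concurrency_samples(intervals, step=1.0):
    """Count the queries running at each sampling instant.

    intervals: list of (start, end) times in seconds, one pair per query.
    A query counts as running at time t when start <= t < end.
    """
    if not intervals:
        return []
    t = min(start for start, _ in intervals)
    end_time = max(end for _, end in intervals)
    samples = []
    while t < end_time:
        samples.append(sum(1 for start, end in intervals if start <= t < end))
        t += step
    return samples


# Made-up query timings for demonstration.
intervals = [(0, 3), (1, 2), (1, 5), (4, 6)]
samples = concurrency_samples(intervals)
print(samples, statistics.median(samples), max(samples))
```

A cluster that falls behind keeps queries open longer, so the same arrival pattern produces higher counts in this series.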

The following table shows the concurrent queries running on a DC2 and RA3 (both 18 nodes) with two different query test sets (3:00 AM and 1:00 PM). These were selected so we could test both our day and overnight workloads. 3:00 AM is when we have a peak of automated ETL jobs running, and 1:00 PM is when we have high user activity.

The median of running concurrent queries on the RA3 cluster is much higher than the DC2 one. This led us to conclude that a cluster of 18 RA3.4xlarge might not be enough to handle this workload reliably.

Concurrency	18 x DC2.8xlarge		18 x RA3.4xlarge
Starting	3:00 AM	1:00 PM	3:00 AM	1:00 PM
Mean	5	7	10	5
STD	11	13	7	4
25%	1	1	5	2
50%	2	2	8	4
75%	4	4	13	7
Max	50	50	50	27

RA3.16xlarge

Initially, we chose the RA3.4xlarge node type for more granular control in fine-tuning the number of nodes. However, we overlooked one important detail: the same instance type is used for worker and leader nodes. A leader node needs to manage all the parallel processing happening in the cluster, and a single RA3.4xlarge wasn’t enough to do so.

With this in mind, we tested two more cluster configurations: 6 x RA3.16xlarge and 8 x RA3.16xlarge, and once again measured concurrency. This time the results were much better; RA3.16xlarge was able to keep up with the reference concurrency, and the sweet spot seemed to be between 6–8 nodes.

Concurrency	18 x DC2.8xlarge		18 x RA3.4xlarge		6 x RA3.16xlarge	8 x RA3.16xlarge
Starting	3:00 AM	1:00 PM	3:00 AM	1:00 PM	3:00 AM	3:00 AM
Mean	5	7	10	5	3	1
STD	11	13	7	4	4	1
25%	1	1	5	2	2	0
50%	2	2	8	4	3	1
75%	4	4	13	7	4	2
Max	50	50	50	27	38	9

Things were looking better and our target configuration was now a 7 x RA3.16xlarge cluster. We were now confident enough to proceed with the migration.

The migration

Regardless of how excited we were to proceed, we still wanted to do a calculated migration. It’s best practice to have a playbook for migrations: a step-by-step guide on what needs to be done, plus a contingency plan that includes a rollback plan. For simplicity, we list here only the relevant steps in case you are looking for inspiration.

Migration plan

The migration plan included the following key steps:

Remove the DNS from the current cluster, in our case in Amazon Route 53. No users should be able to query after this.
Check whether any sessions are still running a query, and decide whether to wait for them or stop them. Any sessions that remain strongly indicate that those users connect through the direct cluster URL.
1. To check running sessions, use SELECT * FROM STV_SESSIONS.
2. To stop a session, use SELECT PG_TERMINATE_BACKEND(xxxxx);, where xxxxx is the process ID of the session.
Create a snapshot of the DC2 cluster.
Pause the DC2 cluster.
Create an RA3 cluster from the snapshot with the following configuration:
1. Node type – RA3.16xlarge
2. Number of nodes – 7
3. Database name – Same as the DC2
4. Associated IAM roles – Same as the DC2
5. VPC – Same as the DC2
6. VPC security groups – Same as the DC2
7. Parameter groups – Same as the DC2
Wait for SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS to return 0. This is related to the hydration process of the cluster.
Point the DNS to the RA3 cluster.
Users can now query the cluster again.
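
The hydration wait in the plan above lends itself to a small polling loop. The following Python sketch assumes a caller-supplied function that runs SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS and returns the count; the wait_for_hydration helper, the polling interval, and the limit on attempts are hypothetical choices, and the database call is replaced by a stand-in so the example runs on its own.

```python
import time


def wait_for_hydration(count_underrepped_blocks, poll_seconds=60,
                       max_polls=120, sleep=time.sleep):
    """Poll until the under-replicated block count reaches 0.

    Returns the number of checks performed; raises TimeoutError if the
    count is still non-zero after max_polls checks.
    """
    for attempt in range(1, max_polls + 1):
        if count_underrepped_blocks() == 0:
            return attempt
        sleep(poll_seconds)
    raise TimeoutError("cluster still hydrating after max_polls checks")


# Stand-in for the real query: pretend the count drops to 0 on the third check.
fake_counts = iter([42, 7, 0])
polls = wait_for_hydration(lambda: next(fake_counts), sleep=lambda _: None)
print(polls)
```

Only after this returns would the DNS be pointed at the RA3 cluster.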

Contingency plan

In case the performance of hourly and daily ETL is not acceptable, the contingency plan is triggered:

Add one more node to deal with the unexpected workload.
Increase the limit of concurrency scaling hours.
Reassess the parameter group.

Following this plan, we migrated from DC2 to RA3 nodes in roughly 3.5 hours, from stopping the old cluster to booting the new one and letting our processes fully synchronize. We then proceeded to monitor performance for a couple of hours. Storage capacity was looking great and everything was running smoothly, but we were curious to see how the overnight processes would perform.

The next morning, we woke up to what we dreaded: a slow cluster. We triggered our contingency plan and in the following few days we ended up implementing all three actions we had in the contingency plan.

Adding one extra node by itself didn’t help much; however, users did experience good performance during the hours when concurrency scaling was on. The concurrency scaling feature allows Amazon Redshift to temporarily increase cluster capacity whenever the workload requires it. We configured it to allow a maximum of 4 hours per day (1 hour free and 3 hours paid). We chose this particular value because price-wise it is equivalent to adding one more node (taking us to nine nodes), with the added advantage of only using and paying for it when the workload requires it.
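
One way to read that equivalence, sketched in Python: assuming concurrency scaling is billed at the same per-node hourly rate as the main cluster and adds capacity of the same size, 3 paid hours on an 8-node cluster (the 7 planned nodes plus the contingency node) cost as many node-hours as one additional node running all day. Both the billing assumption and the paid_node_hours_per_day helper are ours, not confirmed by the post.

```python
def paid_node_hours_per_day(cluster_nodes, scaling_hours, free_hours=1):
    """Node-hours billed per day for concurrency scaling (assumed model)."""
    paid_hours = max(scaling_hours - free_hours, 0)
    return cluster_nodes * paid_hours


nodes = 8  # 7 planned nodes plus the node added by the contingency plan
scaling = paid_node_hours_per_day(nodes, scaling_hours=4)
extra_node = 24  # node-hours of one additional node running all day
print(scaling, extra_node)
```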

The last action we took was related to the parameter group, in particular, the WLM. As initially stated, we had a manually fine-tuned WLM, but it proved to be inefficient for this new RA3 cluster. Therefore, we decided to try auto WLM with the following configuration.

Manual WLM before introducing auto WLM	Queue 1	Data Team ETL queue (daily and hourly), admin, monitoring, data quality queries
Manual WLM before introducing auto WLM	Queue 2	Users queue (for both their ETL and ad hoc queries)
Auto WLM	Queue 1: Priority highest	Daily Data Team ETL queue
	Queue 2: Priority high	Admin queries
	Queue 3: Priority normal	User queries and hourly Data Team ETL
	Queue 4: Priority low	Monitoring, data quality queries

Manual WLM requires you to manually allocate a percentage of resources and define a number of slots per queue. Although this gives you resource segregation, it also means resources are constantly allocated and can go to waste if they’re not used. Auto WLM dynamically sets these variables depending on each queue’s priority and workload. This means that a query in the highest priority queue will get all the resources allocated to it, while lower priority queues will need to wait for available resources. With this in mind, we split our ETL depending on its priority: daily ETL to highest, hourly ETL to normal (to give a fair chance for user queries to compete for resources), and monitoring and data quality to low.

After applying concurrency scaling and auto WLM, we achieved stable performance for a whole week, and considered the migration a success.

Status quo after migration

Almost a year has passed since we migrated to RA3 nodes, and we couldn’t be more satisfied. Thanks to Redshift Managed Storage (RMS), our disk space issues are a thing of the past, and performance has been generally great compared to our previous DC2 cluster. We are now at 300 monthly active users. Cluster costs did increase due to the new node type and concurrency scaling, but we now feel prepared for the future and don’t expect any cluster resizing anytime soon.

Looking back, we wanted to have a carefully planned and prepared migration, and we were able to learn more about RA3 with our test environment. However, our experience also shows that test environments aren’t always bulletproof, and some details may be overlooked. In the end, these are our main takeaways from the migration to RA3 nodes:

Pick the right node type according to your workload. An RA3.16xlarge cluster provides more powerful leader and worker nodes.
Use concurrency scaling to provision more resources when the workload demands it. Adding a new node is not always the most cost-efficient solution.
Manual WLM requires a lot of adjustments; using auto WLM allows for a better and fairer distribution of cluster resources.

Conclusion

In this post, we covered how OLX Group modernized our Amazon Redshift data warehouse by migrating to RA3 nodes. We detailed how we tested before migration, the migration itself, and the outcome. We are now starting to explore the possibilities provided by the RA3 nodes. In particular, the data sharing capabilities together with Redshift Serverless open the door for exciting architecture setups that we are looking forward to.

If you are going through the same storage issues we used to face with your Amazon Redshift cluster, we highly recommend migrating to RA3 nodes. Their RMS feature decouples the scalability of compute and storage, providing a more cost-efficient solution.

Thanks for reading this post and hopefully you found it useful. If you’re going through the same scenario and have any questions, feel free to reach out.

About the author

Miguel Chin is a Data Engineering Manager at OLX Group, one of the world’s fastest-growing networks of trading platforms. He is responsible for managing a domain-oriented team of data engineers that helps shape the company’s data ecosystem by evangelizing cutting-edge data concepts like data mesh.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling businesses to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.

Synchronize your Salesforce and Snowflake data to speed up your time to insight with Amazon AppFlow

2023-02-09 Ramesh Ranganathan

Post Syndicated from Ramesh Ranganathan original https://aws.amazon.com/blogs/big-data/synchronize-your-salesforce-and-snowflake-data-to-speed-up-your-time-to-insight-with-amazon-appflow/

This post was co-written with Amit Shah, Principal Consultant at Atos.

Customers across industries seek meaningful insights from the data captured in their Customer Relationship Management (CRM) systems. To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service (SaaS) applications. One widely used approach is getting the CRM data into your data warehouse and keeping it up to date through frequent data synchronization.

Integrating third-party SaaS applications is often complicated and requires significant effort and development. Developers need to understand the application APIs, write implementation and test code, and maintain the code for future API changes. Amazon AppFlow, which is a low-code/no-code AWS service, addresses this challenge.

Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications, like Salesforce, SAP, Zendesk, Slack, and ServiceNow, and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift in just a few clicks. With Amazon AppFlow, you can run data flows at enterprise scale at the frequency you choose—on a schedule, in response to a business event, or on demand.

In this post, we focus on synchronizing your data from Salesforce to Snowflake (on AWS) without writing code. This post walks you through the steps to set up a data flow to address full and incremental data load using an example use case.

Solution overview

Our use case involves the synchronization of the Account object from Salesforce into Snowflake. In this architecture, you use Amazon AppFlow to filter and transfer the data to your Snowflake data warehouse.

You can configure Amazon AppFlow to run your data ingestion in three different ways:

On-demand – You can manually run the flow through the AWS Management Console, API, or SDK call.
Event-driven – Amazon AppFlow can subscribe and listen to change data capture (CDC) events from the source SaaS application.
Scheduled – Amazon AppFlow can run schedule-triggered flows based on a pre-defined schedule rule. With scheduled flows, you can choose either full or incremental data transfer:
- With full transfer, Amazon AppFlow transfers a snapshot of all records at the time of the flow run from the source to the destination.
- With incremental transfer, Amazon AppFlow transfers only the records that have been added or changed since the last successful flow run. To determine the incremental delta of your data, AppFlow requires you to specify a source timestamp field to instruct how Amazon AppFlow identifies new or updated records.

We use the on-demand trigger for the initial load of data from Salesforce to Snowflake, because it pulls all the records, regardless of when they were created. To then synchronize data periodically with Snowflake, after we run the on-demand trigger, we configure a scheduled trigger with incremental transfer. With this approach, Amazon AppFlow periodically pulls records from the Salesforce Account object based on a chosen timestamp field, at the time interval specified in the flow.
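The difference between full and incremental transfer can be sketched in a few lines of Python. This is only an illustration of the idea, not how Amazon AppFlow is implemented; the function name `select_records` and the sample records are hypothetical.

```python
from datetime import datetime, timezone


def select_records(records, last_run=None, timestamp_field="LastModifiedDate"):
    """Return every record (full transfer) or only those changed after last_run."""
    if last_run is None:
        # Full transfer: a snapshot of all records at the time of the run.
        return list(records)
    # Incremental transfer: only records whose timestamp is newer than the last run.
    return [r for r in records if r[timestamp_field] > last_run]


# Hypothetical sample data standing in for Salesforce Account records.
accounts = [
    {"AccountNumber": "A-1", "LastModifiedDate": datetime(2023, 2, 1, 9, 0, tzinfo=timezone.utc)},
    {"AccountNumber": "A-2", "LastModifiedDate": datetime(2023, 2, 1, 9, 7, tzinfo=timezone.utc)},
]

full_load = select_records(accounts)
delta = select_records(accounts, last_run=datetime(2023, 2, 1, 9, 5, tzinfo=timezone.utc))
print(len(full_load), [r["AccountNumber"] for r in delta])
```

The initial on-demand run corresponds to calling the function without `last_run`; every later scheduled run passes the time of the previous successful run.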

The Account_Staging table is created in Snowflake to act as temporary storage that can be used to identify the data change events. Then the permanent table (Account) is updated from the staging table by running a SQL stored procedure that contains the incremental update logic. The following figure depicts the various components of the architecture and the data flow from the source to the target.

The data flow contains the following steps:

First, the flow is run with on-demand and full transfer mode to load the full data into Snowflake.
The Amazon AppFlow Salesforce connector pulls the data from Salesforce and stores it in the Account Data S3 bucket in CSV format.
The Amazon AppFlow Snowflake connector loads the data into the Account_Staging table.
A scheduled task, running at regular intervals in Snowflake, triggers a stored procedure.
The stored procedure starts an atomic transaction that loads the data into the Account table and then deletes the data from the Account_Staging table.
After the initial data is loaded, you update the flow to capture incremental updates from Salesforce. The flow trigger configuration is changed to scheduled, to capture data changes in Salesforce. This enables Snowflake to get all updates, deletes, and inserts in Salesforce at configured intervals.
The flow uses the configured LastModifiedDate field to determine incremental changes.
Steps 3, 4, and 5 are run again to load the incremental updates into the Snowflake Account table.

Prerequisites

To get started, you need the following prerequisites:

A Salesforce user account with sufficient privileges to install connected apps. Amazon AppFlow uses a connected app to communicate with Salesforce APIs. If you don’t have a Salesforce account, you can sign up for a developer account.
A Snowflake account with sufficient permissions to create and configure the integration, external stage, table, stored procedures, and tasks.
An AWS account with access to AWS Identity and Access Management (IAM), Amazon AppFlow, and Amazon S3.

Set up Snowflake configuration and Amazon S3 data

Complete the following steps to configure Snowflake and set up your data in Amazon S3:

Create two S3 buckets in your AWS account: one for holding the data coming from Salesforce, and another for holding error records.

A best practice when creating your S3 bucket is to make sure you block public access to the bucket to ensure your data is not accessible by unauthorized users.

Create an IAM policy named snowflake-access that allows listing the bucket contents and reading S3 objects inside the bucket.

Follow the instructions for steps 1 and 2 in Configuring a Snowflake Storage Integration to Access Amazon S3 to create an IAM policy and role. Replace the placeholders with your S3 bucket names.

Log in to your Snowflake account and create a new warehouse called SALESFORCE and database called SALESTEST.
Specify the format in which data will be available in Amazon S3 for Snowflake to load (for this post, CSV):

USE DATABASE SALESTEST;
CREATE or REPLACE file format my_csv_format
type = csv
field_delimiter = ','
skip_header = 1
null_if = ('NULL', 'null')
empty_field_as_null = true
compression = gzip;

Amazon AppFlow uses the Snowflake COPY command to move data using an S3 bucket. To configure this integration, follow steps 3–6 in Configuring a Snowflake Storage Integration to Access Amazon S3.

These steps create a storage integration with your S3 bucket, update IAM roles with Snowflake account and user details, and create an external stage.

This completes the setup in Snowflake. In the next section, you create the required objects in Snowflake.
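As a rough sketch of the snowflake-access policy described in the steps above, the following Python builds a policy document that allows listing a bucket and reading its objects. The bucket name is a placeholder, and the Snowflake instructions referenced earlier remain the authoritative source; they may list additional actions.

```python
import json


def build_snowflake_access_policy(bucket_name):
    """Build an IAM policy document allowing list and read access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


# "my-account-data-bucket" is a hypothetical name; replace it with your data bucket.
policy = build_snowflake_access_policy("my-account-data-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor when you create the policy.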

Create schemas and procedures in Snowflake

In your Snowflake account, complete the following steps to create the tables, stored procedures, and tasks for implementing the use case:

In your Snowflake account, open a worksheet and run the following DDL scripts to create the Account and Account_staging tables:

CREATE or REPLACE TABLE ACCOUNT_STAGING (
ACCOUNT_NUMBER STRING NOT NULL,
ACCOUNT_NAME STRING,
ACCOUNT_TYPE STRING,
ANNUAL_REVENUE NUMBER,
ACTIVE BOOLEAN NOT NULL,
DELETED BOOLEAN,
LAST_MODIFIED_DATE STRING,
primary key (ACCOUNT_NUMBER)
);

CREATE or REPLACE TABLE ACCOUNT (
ACCOUNT_NUMBER STRING NOT NULL,
ACCOUNT_NAME STRING,
ACCOUNT_TYPE STRING,
ANNUAL_REVENUE NUMBER,
ACTIVE BOOLEAN NOT NULL,
LAST_MODIFIED_DATE STRING,
primary key (ACCOUNT_NUMBER)
);

Create a stored procedure in Snowflake to load data from staging to the Account table:

CREATE or REPLACE procedure sp_account_load( )
returns varchar not null
language sql
as
$$
begin
Begin transaction;
merge into ACCOUNT using ACCOUNT_STAGING
on ACCOUNT.ACCOUNT_NUMBER = ACCOUNT_STAGING.ACCOUNT_NUMBER
when matched AND ACCOUNT_STAGING.DELETED=TRUE then delete
when matched then UPDATE SET
ACCOUNT.ACCOUNT_NAME = ACCOUNT_STAGING.ACCOUNT_NAME,
ACCOUNT.ACCOUNT_TYPE = ACCOUNT_STAGING.ACCOUNT_TYPE,
ACCOUNT.ANNUAL_REVENUE = ACCOUNT_STAGING.ANNUAL_REVENUE,
ACCOUNT.ACTIVE = ACCOUNT_STAGING.ACTIVE,
ACCOUNT.LAST_MODIFIED_DATE = ACCOUNT_STAGING.LAST_MODIFIED_DATE
when NOT matched then
INSERT (
ACCOUNT.ACCOUNT_NUMBER,
ACCOUNT.ACCOUNT_NAME,
ACCOUNT.ACCOUNT_TYPE,
ACCOUNT.ANNUAL_REVENUE,
ACCOUNT.ACTIVE,
ACCOUNT.LAST_MODIFIED_DATE
)
values(
ACCOUNT_STAGING.ACCOUNT_NUMBER,
ACCOUNT_STAGING.ACCOUNT_NAME,
ACCOUNT_STAGING.ACCOUNT_TYPE,
ACCOUNT_STAGING.ANNUAL_REVENUE,
ACCOUNT_STAGING.ACTIVE,
ACCOUNT_STAGING.LAST_MODIFIED_DATE
) ;

Delete from ACCOUNT_STAGING;
Commit;
end;
$$
;

This stored procedure determines whether the data contains new records that need to be inserted or existing records that need to be updated or deleted. After a successful run, the stored procedure clears any data from your staging table.
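To make the merge rules easier to follow, here is a small Python model of what the stored procedure does. It is an illustration only, assuming each table is represented as plain Python data; the function name `merge_staging` and the sample rows are hypothetical.

```python
def merge_staging(account, staging):
    """Apply staged rows to the account table and return (new_account, empty_staging).

    account is a dict keyed by ACCOUNT_NUMBER; staging is a list of row dicts.
    New objects are returned instead of mutating the inputs, so a caller can
    swap both in together, loosely mirroring the single transaction in SQL.
    """
    result = dict(account)
    for row in staging:
        key = row["ACCOUNT_NUMBER"]
        # The ACCOUNT table has no DELETED column, so drop it from the target row.
        target_row = {k: v for k, v in row.items() if k != "DELETED"}
        if key in result and row.get("DELETED") is True:
            del result[key]            # when matched AND DELETED = TRUE then delete
        else:
            result[key] = target_row   # when matched then update, otherwise insert
    return result, []                  # staging is emptied after the merge


account = {
    "A-1": {"ACCOUNT_NUMBER": "A-1", "ACCOUNT_NAME": "Alpha", "ACTIVE": True},
    "A-2": {"ACCOUNT_NUMBER": "A-2", "ACCOUNT_NAME": "Beta", "ACTIVE": True},
}
staging = [
    {"ACCOUNT_NUMBER": "A-1", "ACCOUNT_NAME": "Alpha", "ACTIVE": True, "DELETED": True},
    {"ACCOUNT_NUMBER": "A-2", "ACCOUNT_NAME": "Beta Ltd", "ACTIVE": True, "DELETED": False},
    {"ACCOUNT_NUMBER": "A-3", "ACCOUNT_NAME": "Gamma", "ACTIVE": False, "DELETED": False},
]

new_account, new_staging = merge_staging(account, staging)
print(sorted(new_account), new_staging)
```

As in the SQL, a staged row that does not match an existing account is inserted even when its DELETED flag is set, because the not-matched branch does not check the flag.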

Create a task in Snowflake to trigger the stored procedure. Make sure that the time interval for this task is more than the time interval configured in Amazon AppFlow for pulling the incremental changes from Salesforce. The time interval should be sufficient for data to be processed.

CREATE OR REPLACE TASK TASK_ACCOUNT_LOAD
WAREHOUSE = SALESFORCE
SCHEDULE = 'USING CRON 5 * * * * America/Los_Angeles'
AS
call sp_account_load();
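Note that the cron expression 5 * * * * fires at minute 5 of every hour, not every 5 minutes. The following Python sketch computes the next run time for such a schedule; it is a simplification that ignores time zones and daylight saving time, and the function name `next_hourly_run` is hypothetical.

```python
from datetime import datetime, timedelta


def next_hourly_run(now, minute=5):
    """Return the next time strictly after now whose minute equals the given value."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed, so move to the next hour.
        candidate += timedelta(hours=1)
    return candidate


print(next_hourly_run(datetime(2023, 2, 9, 10, 0)))
print(next_hourly_run(datetime(2023, 2, 9, 10, 7)))
```

With an hourly task and a flow that runs every few minutes, several flow runs can accumulate in the staging table before the task merges them.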

Provide the required permissions to run the task and resume the task:

show tasks;

A newly created task is in a suspended state, so you need to resume it manually the first time:

ALTER TASK TASK_ACCOUNT_LOAD RESUME;

If the role assigned to you doesn’t have sufficient access to resume or run the task, grant the execute task privilege to that role:

GRANT EXECUTE TASK, EXECUTE MANAGED TASK ON ACCOUNT TO ROLE SYSADMIN;

This completes the Snowflake part of configuration and setup.

Create a Salesforce connection

First, let’s create a Salesforce connection that can be used by AppFlow to authenticate and pull records from your Salesforce instance. On the AWS console, make sure you are in the same Region where your Snowflake instance is running.

On the Amazon AppFlow console, choose Connections in the navigation pane.
From the list of connectors, select Salesforce.
Choose Create connection.
For Connection name, enter a name of your choice (for example, Salesforce-blog).
Leave the rest of the fields as default and choose Continue.
You’re redirected to a sign-in page, where you need to log in to your Salesforce instance.
After you allow Amazon AppFlow access to your Salesforce account, your connection is successfully created.

Create a Snowflake connection

Complete the following steps to create your Snowflake connection:

On the Connections menu, choose Snowflake.
Choose Create connection.
Provide information for the Warehouse, Stage name, and Bucket details fields.
Enter your credential details.

For Region, choose the same Region where Snowflake is running.
For Connection name, name your connection Snowflake-blog.
Leave the rest of the fields as default and choose Connect.

Create a flow in Amazon AppFlow

Now you create a flow in Amazon AppFlow to load the data from Salesforce to Snowflake. Complete the following steps:

On the Amazon AppFlow console, choose Flows in the navigation pane.
Choose Create flow.
On the Specify flow details page, enter a name for the flow (for example, AccountData-SalesforceToSnowflake).
Optionally, provide a description for the flow and tags.
Choose Next.

On the Configure flow page, for Source name, choose Salesforce.
Choose the Salesforce connection we created in the previous step (Salesforce-blog).
For Choose Salesforce object, choose Account.
For Destination name, choose Snowflake.
Choose the newly created Snowflake connection.
For Choose Snowflake object, choose the staging table you created earlier (SALESTEST.PUBLIC.ACCOUNT_STAGING).

In the Error handling section, provide your error S3 bucket.
For Choose how to trigger the flow, select Run on demand.
Choose Next.

Select Manually map fields to map the fields between your source and destination.
Choose the fields Account Number, Account Name, Account Type, Annual Revenue, Active, Deleted, and Last Modified Date.

Map each source field to its corresponding destination field.
Under Additional settings, leave the Import deleted records unchecked (default setting).

In the Validations section, add validations for the data you’re pulling from Salesforce.

Because the schema for the Account_Staging table in Snowflake database has a NOT NULL constraint for the fields Account_Number and Active, records containing a null value for these fields should be ignored.

Choose Add Validation to configure validations for these fields.
Choose Next.

Leave everything else as default, proceed to the final page, and choose Create Flow.
After the flow is created, choose Run flow.

When the flow run completes successfully, it will bring all records into your Snowflake staging table.
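The effect of the validations configured for the flow can be pictured with a short Python sketch. It only illustrates the rule that records with a null value in a required field are ignored; the field names and the function `apply_validations` are hypothetical and do not reflect how Amazon AppFlow evaluates validations internally.

```python
def apply_validations(records, required_fields=("AccountNumber", "Active")):
    """Split records into those kept and those ignored because a required field is null."""
    kept, ignored = [], []
    for record in records:
        if any(record.get(field) is None for field in required_fields):
            ignored.append(record)
        else:
            kept.append(record)
    return kept, ignored


# Hypothetical sample records; False is a valid value, only None counts as null.
records = [
    {"AccountNumber": "A-1", "Active": True},
    {"AccountNumber": None, "Active": True},
    {"AccountNumber": "A-3", "Active": False},
    {"AccountNumber": "A-4", "Active": None},
]

kept, ignored = apply_validations(records)
print([r["AccountNumber"] for r in kept], len(ignored))
```

Only the kept records would reach the staging table, which keeps the NOT NULL constraints on Account_Number and Active from rejecting the load.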

Verify data in Snowflake

The data will be loaded into the Account_staging table. To verify that data is loaded in Snowflake, complete the following steps:

Validate the number of records by querying the ACCOUNT_STAGING table in Snowflake.
Wait for your Snowflake task to run based on the configured schedule.
Verify that all the data is transferred to the ACCOUNT table and the ACCOUNT_STAGING table is truncated.

Configure an incremental data load from Salesforce

Now let’s configure an incremental data load from Salesforce:

On the Amazon AppFlow console, select your flow, and choose Edit.
Go to the Edit configuration step and change to Run flow on schedule.
Set the flow to run every 5 minutes, and provide a start date of Today, with a start time in the future.
Choose Incremental transfer and choose the LastModifiedDate field.
Choose Next.
In the Additional settings section, select Import deleted records.

This ensures that deleted records from the source are also ingested.

Choose Save and then choose Activate flow.

Now your flow is configured to capture all incremental changes.

Test the solution

After you change a record in the Salesforce Account object, the scheduled flow picks up the change within 5 minutes and writes the changed record into your Snowflake staging table, from where the scheduled Snowflake task synchronizes it into the Account table.

You can see the details of the run, including number of records transferred, on the Run History tab of your flow.

Clean up

Clean up the resources in your AWS account by completing the following steps:

On the Amazon AppFlow console, choose Flows in the navigation pane.
From the list of flows, select the flow AccountData-SalesforceToSnowflake and delete it.
Enter delete to delete the flow.
Choose Connections in the navigation pane.
Choose Salesforce from the list of connectors, select Salesforce-blog, and delete it.
Enter delete to delete the connector.
On the Connections page, choose Snowflake from the list of connectors, select Snowflake-blog, and delete it.
Enter delete to delete the connector.
On the IAM console, choose Roles in the navigation pane, then select the role you created for Snowflake and delete it.
Choose Policies in the navigation pane, select the policy you created for Snowflake, and delete it.
On the Amazon S3 console, search for the data bucket you created, choose Empty to delete the objects, then delete the bucket.
Search for the error bucket you created, choose Empty to delete the objects, then delete the bucket.
Clean up resources in your Snowflake account:

Delete the task TASK_ACCOUNT_LOAD:

ALTER TASK TASK_ACCOUNT_LOAD SUSPEND;
DROP TASK TASK_ACCOUNT_LOAD;

Delete the stored procedure sp_account_load:

DROP procedure sp_account_load();

Delete the tables ACCOUNT_STAGING and ACCOUNT:

DROP TABLE ACCOUNT_STAGING;
DROP TABLE ACCOUNT;

Conclusion

In this post, we walked you through how to integrate and synchronize your data from Salesforce to Snowflake using Amazon AppFlow. This demonstrates how you can set up your ETL jobs without having to learn new programming languages by using Amazon AppFlow and your familiar SQL language. This is a proof of concept, but you can try to handle edge cases like failure of Snowflake tasks or understand how incremental transfer works by making multiple changes to a Salesforce record within the scheduled time interval.

For more information on Amazon AppFlow, visit Amazon AppFlow.

About the authors

Ramesh Ranganathan is a Senior Partner Solution Architect at AWS. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, application modernization and cloud native development. He is passionate about technology and enjoys experimenting with AWS Serverless services.

Kamen Sharlandjiev is an Analytics Specialist Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.

Amit Shah is a cloud-based modern data architecture expert who currently leads the AWS Data Analytics practice at Atos. Based in Pune, India, he has 20+ years of experience in data strategy, architecture, design, and development. He is on a mission to help organizations become data-driven.

Using GitHub Actions with Amazon CodeCatalyst

2023-02-08 Dr. Rahul Sharad Gaikwad

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/using-github-actions-with-amazon-codecatalyst/

An Amazon CodeCatalyst workflow is an automated procedure that describes how to build, test, and deploy your code as part of a continuous integration and continuous delivery (CI/CD) system. You can use GitHub Actions alongside native CodeCatalyst actions in a CodeCatalyst workflow.

Introduction:

In a prior post in this series, Using Workflows to Build, Test, and Deploy with Amazon CodeCatalyst, I discussed creating CI/CD pipelines in CodeCatalyst and how that relates to The Unicorn Project’s main protagonist, Maxine. CodeCatalyst workflows help you reliably deliver high-quality application updates frequently, quickly, and securely. CodeCatalyst allows you to quickly assemble and configure actions to compose workflows that automate your CI/CD pipeline, test reporting, and other manual processes. Workflows use provisioned compute, Lambda compute, custom container images, and a managed build infrastructure to scale execution easily without sacrificing flexibility. In this post, I will return to workflows and discuss running GitHub Actions alongside native CodeCatalyst actions.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Have an AWS Builder ID for signing in to CodeCatalyst.
Belong to a space and have the space administrator role assigned to you in that space. For more information, see Creating a space in CodeCatalyst, Managing members of your space, and Space administrator role.
Have an AWS account associated with your space and have the IAM role in that account. For more information about the role and role policy, see Creating a CodeCatalyst service role.

Walkthrough

As with the previous posts in the CodeCatalyst series, I am going to use the Modern Three-tier Web Application blueprint. Blueprints provide sample code and CI/CD workflows to help you get started easily across different combinations of programming languages and architectures. To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the Three-tier blueprint.

As the team has grown, I have noticed that code quality has decreased. Therefore, I would like to add a few additional tools to validate code quality when a new pull request is submitted. In addition, I would like to create a Software Bill of Materials (SBOM) for each pull request so I know what components are used by the code. In the previous post on workflows, I focused on the deployment workflow. In this post, I will focus on the OnPullRequest workflow. You can view the OnPullRequest pipeline by expanding CI/CD from the left navigation, and choosing Workflows. Next, choose OnPullRequest and you will be presented with the workflow shown in the following screenshot. This workflow runs when a new pull request is submitted and currently uses Amazon CodeGuru to perform an automated code review.

Figure 1. OnPullRequest Workflow with CodeGuru code review

While CodeGuru provides intelligent recommendations to improve code quality, it does not check style. I would like to add a linter to ensure developers follow our coding standards. While CodeCatalyst supports a rich collection of native actions, this does not currently include a linter. Fortunately, CodeCatalyst also supports GitHub Actions. Let’s use a GitHub Action to add a linter to the workflow.

Select Edit in the top right corner of the Workflow screen. If the editor opens in YAML mode, switch to Visual mode using the toggle above the code. Next, select “+ Actions” to show the list of actions. Then, change from Amazon CodeCatalyst to GitHub using the dropdown. At the time this blog was published, CodeCatalyst included about a dozen curated GitHub Actions. Note that you are not limited to the list of curated actions. I’ll show you how to add GitHub Actions that are not on the list later in this post. For now, I am going to use Super-Linter to check coding style in pull requests. Find Super-Linter in the curated list and click the plus icon to add it to the workflow.

Figure 2. Super-Linter action with add icon

This will add a new action to the workflow and open the configuration dialog box. There is no further configuration needed, so you can simply close the configuration dialog box. The workflow should now look like this.

Figure 3. Workflow with the new Super-Linter action

Notice that the actions are configured to run in parallel. In the previous post, when I discussed the deployment workflow, the steps were sequential. This made sense since each step built on the previous step. For the pull request workflow, the actions are independent, and I will allow them to run in parallel so they complete faster. I select Validate, and assuming there are no issues, I select Commit to save my changes to the repository.

While CodeCatalyst will start the workflow when a pull request is submitted, I do not have a pull request to submit. Therefore, I select Run to test the workflow. A notification at the top of the screen includes a link to view the run. As expected, Super-Linter fails because it has found issues in the application code. I click on the Super-Linter action and review the logs. Here are a few issues that Super-Linter reported regarding app.py used by the backend application. Note that the log has been modified slightly to fit on a single line.

/app.py:2:1: F401 'os' imported but unused
/app.py:2:1: F401 'time' imported but unused
/app.py:2:1: F401 'json' imported but unused
/app.py:2:10: E401 multiple imports on one line
/app.py:4:1: F401 'boto3' imported but unused
/app.py:6:9: E225 missing whitespace around operator
/app.py:8:1: E402 module level import not at top of file
/app.py:10:1: E402 module level import not at top of file
/app.py:15:35: W291 trailing whitespace
/app.py:16:5: E128 continuation line under-indented for visual indent
/app.py:17:5: E128 continuation line under-indented for visual indent
/app.py:25:5: E128 continuation line under-indented for visual indent
/app.py:26:5: E128 continuation line under-indented for visual indent
/app.py:33:12: W292 no newline at end of file
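When a log like this gets long, it can help to summarize it by rule code. The following Python sketch parses lines in the path:line:column: code message layout shown above and counts findings per code. The regular expression is an assumption based on that layout, and the sample is a subset of the log.

```python
import re
from collections import Counter

# Pattern assumed from the log layout above: path:line:col: CODE message
LINE_RE = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+):(?P<col>\d+): (?P<code>[A-Z]\d+) (?P<message>.+)$"
)

SAMPLE_LOG = """\
/app.py:2:1: F401 'os' imported but unused
/app.py:2:10: E401 multiple imports on one line
/app.py:4:1: F401 'boto3' imported but unused
/app.py:33:12: W292 no newline at end of file
"""


def parse_findings(text):
    """Return one dict per log line that matches the expected layout."""
    findings = []
    for raw_line in text.splitlines():
        match = LINE_RE.match(raw_line.strip())
        if match:
            findings.append(match.groupdict())
    return findings


findings = parse_findings(SAMPLE_LOG)
counts = Counter(finding["code"] for finding in findings)
print(counts.most_common())
```

A summary like this shows quickly which rules account for most of the findings, for example unused imports (F401) in this sample.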

With Super-Linter working, I turn my attention to creating a Software Bill of Materials (SBOM). I am going to use OWASP CycloneDX to create the SBOM. While there is a GitHub Action for CycloneDX, at the time I am writing this post, it is not available from the list of curated GitHub Actions in CodeCatalyst. Fortunately, CodeCatalyst is not limited to the curated list. I can use almost any GitHub Action in CodeCatalyst. To add a GitHub Action that is not in the curated list, I return to edit mode, find GitHub Actions in the list of curated actions, and click the plus icon to add it to the workflow.

Figure 4. GitHub Action with add icon

CodeCatalyst will add a new action to the workflow and open the configuration dialog box. I choose the Configuration tab and use the pencil icon to change the Action Name to Software-Bill-of-Materials. Then, I scroll down to the configuration section, and change the GitHub Action YAML. Note that you can copy the YAML from the GitHub Actions Marketplace, including the latest version number. In addition, the CycloneDX action expects you to pass the path to the Python requirements file as an input parameter.

Figure 5. GitHub Action YAML configuration

Since I am using the generic GitHub Action, I must tell CodeCatalyst which artifacts are produced by the action and should be collected after execution. CycloneDX creates an XML file called bom.xml which I configure as an artifact. Note that a CodeCatalyst artifact is the output of a workflow action, and typically consists of a folder or archive of files. You can share artifacts with subsequent actions.

Figure 6. Artifact configuration with the path to bom.xml

Once again, I select Validate, and assuming there are no issues, I select Commit to save my changes to the repository. I now have three actions that run in parallel when a pull request is submitted: CodeGuru, Super-Linter, and Software Bill of Materials.

Figure 7. Workflow including the software bill of materials

As before, I select Run to test my workflow and click the view link in the notification. As expected, the workflow fails because Super-Linter is still reporting issues. However, the new Software Bill of Materials has completed successfully. From the artifacts tab I can download the SBOM.

Figure 8. Artifacts tab listing code review and SBOM

The artifact is a zip archive that includes the bom.xml created by CycloneDX. This includes, among other information, a list of components used in the backend application.

    <components>
        <component type="library" bom-ref="7474f0f6-8aa2-46db-bebf-a7648cff84e1">
            <name>Jinja2</name>
            <version>3.1.2</version>
            <purl>pkg:pypi/Jinja2@3.1.2</purl>
        </component>
        <component type="library" bom-ref="fad0708b-d007-4f98-a80c-056b136015df">
            <name>aws-cdk-lib</name>
            <version>2.43.0</version>
            <purl>pkg:pypi/aws-cdk-lib@2.43.0</purl>
        </component>
        <component type="library" bom-ref="23e3aaae-b4e1-4f3b-b026-fcd298c9cb9b">
            <name>aws-cdk.aws-apigatewayv2-alpha</name>
            <version>2.43.0a0</version>
            <purl>pkg:pypi/aws-cdk.aws-apigatewayv2-alpha@2.43.0a0</purl>
        </component>
        <component type="library" bom-ref="d283cf17-9125-422c-b55c-cabb64d18f79">
            <name>aws-cdk.aws-apigatewayv2-integrations-alpha</name>
            <version>2.43.0a0</version>
            <purl>pkg:pypi/aws-cdk.aws-apigatewayv2-integrations-alpha@2.43.0a0</purl>
        </component>
        <component type="library" bom-ref="0f095c84-c9e9-4d6c-a4ed-c4a6c7605426">
            <name>aws-cdk.aws-lambda-python-alpha</name>
            <version>2.43.0a0</version>
            <purl>pkg:pypi/aws-cdk.aws-lambda-python-alpha@2.43.0a0</purl>
        </component>
        <component type="library" bom-ref="b248b85b-ba27-4796-bcdf-6bd82ad47295">
            <name>constructs</name>
            <version>&gt;=10.0.0,&lt;11.0.0</version>
            <purl>pkg:pypi/constructs@%3E%3D10.0.0%2C%3C11.0.0</purl>
        </component>
        <component type="library" bom-ref="72b1da33-19c2-4b5c-bd58-7f719dafc28a">
            <name>simplejson</name>
            <version>3.17.6</version>
            <purl>pkg:pypi/simplejson@3.17.6</purl>
        </component>
    </components>
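If you want to inspect the SBOM programmatically, a few lines of Python are enough to list the components. This is a minimal sketch that assumes the bom.xml content is available as a string; the sample is a shortened excerpt of the listing above, and the helper names are hypothetical. Because a complete CycloneDX file may declare an XML namespace, the code compares local tag names only.

```python
import xml.etree.ElementTree as ET

SAMPLE_BOM = """\
<components>
    <component type="library" bom-ref="72b1da33-19c2-4b5c-bd58-7f719dafc28a">
        <name>simplejson</name>
        <version>3.17.6</version>
    </component>
    <component type="library" bom-ref="b248b85b-ba27-4796-bcdf-6bd82ad47295">
        <name>constructs</name>
        <version>&gt;=10.0.0,&lt;11.0.0</version>
    </component>
</components>
"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def list_components(xml_text):
    """Return (name, version) tuples for every component element in the document."""
    root = ET.fromstring(xml_text)
    components = []
    for element in root.iter():
        if local_name(element.tag) == "component":
            fields = {local_name(child.tag): (child.text or "") for child in element}
            components.append((fields.get("name"), fields.get("version")))
    return components


components = list_components(SAMPLE_BOM)
# Versions that contain a comparison operator are ranges rather than pinned versions.
unpinned = [name for name, version in components if any(ch in version for ch in "<>")]
print(components, unpinned)
```

Listing unpinned dependencies this way is one simple check you could run on the artifact before handing it to a dedicated analysis tool.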

The workflow is now enforcing code quality and generating an SBOM like I wanted. Note that while this is a great start, there is still room for improvement. First, I could collect reports generated by the actions in my workflow, and define success criteria for code quality. Second, I could scan the SBOM for known security vulnerabilities using a Software Composition Analysis (SCA) solution. I will be covering this in a future post in this series.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Conclusion

In this post, you learned how to add GitHub Actions to a CodeCatalyst workflow. I used GitHub Actions alongside native CodeCatalyst actions in my workflow. I also discussed adding actions from both the curated list of actions and others not in the curated list. Read the documentation to learn more about using GitHub Actions in CodeCatalyst.


Automate schema evolution at scale with Apache Hudi in AWS Glue

2023-02-07 Subhro Bose

Post Syndicated from Subhro Bose original https://aws.amazon.com/blogs/big-data/automate-schema-evolution-at-scale-with-apache-hudi-in-aws-glue/

In the data analytics space, organizations often deal with many tables in different databases and file formats that hold data for different business functions. Business needs often drive changes to table structure, known as schema evolution (the addition of new columns, removal of existing columns, update of column names, and so on). When the tables of one business function change, other business functions often need to replicate the same changes. This post focuses on such schema changes in file-based tables and shows how to automatically replicate the schema evolution of structured data from tables in databases to tables stored as files in a cost-effective way.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. In this post, we show how to use Apache Hudi, a self-managing database layer on file-based data lakes, in AWS Glue to represent data in relational form and manage its schema evolution at scale. The solution uses Amazon Simple Storage Service (Amazon S3), AWS Database Migration Service (AWS DMS), AWS Lambda, AWS Glue, Amazon DynamoDB, Amazon Aurora, and Amazon Athena to automatically identify schema evolution and apply it to manage data loads at petabyte scale.

Apache Hudi supports ACID transactions and CRUD operations on a data lake. This lays the foundation of a data lake architecture by enabling transaction support and schema evolution and management, decoupling storage from compute, and ensuring support for accessibility through business intelligence (BI) tools. In this post, we implement an architecture for a transactional data lake built on the aforementioned Hudi features.

Solution overview

This post assumes a scenario where multiple tables are present in a source database, and we want to replicate any schema changes in any of those tables in Apache Hudi tables in the data lake. It uses the native support for Apache Hudi on AWS Glue for Apache Spark.

In this post, the schema evolution of source tables in the Aurora database is captured via the AWS DMS incremental load or change data capture (CDC) mechanism, and the same schema evolution is replicated in Apache Hudi tables stored in Amazon S3. Apache Hudi tables are discovered by the AWS Glue Data Catalog and queried by Athena. An AWS Glue job, supported by an orchestration pipeline using Lambda and a DynamoDB table, takes care of the automated replication of schema evolution in the Apache Hudi tables.
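To make the idea of detecting schema evolution concrete, the following Python sketch compares two versions of a table schema and reports added, removed, and retyped columns. It is a simplified illustration, not the logic of the AWS Glue job in this solution; the function name `diff_schema` and the column lists are hypothetical.

```python
def diff_schema(old_columns, new_columns):
    """Compare two schemas given as lists of (column_name, type) pairs."""
    old, new = dict(old_columns), dict(new_columns)
    return {
        "added": [name for name in new if name not in old],
        "removed": [name for name in old if name not in new],
        "retyped": [name for name in new if name in old and old[name] != new[name]],
    }


# Hypothetical schemas before and after a change in the source table.
before = [
    ("object_name", "varchar(255)"),
    ("object_description", "varchar(255)"),
    ("object_id", "int"),
]
after = [
    ("object_name", "varchar(255)"),
    ("object_id", "bigint"),
    ("new_field_1", "varchar(255)"),
]

changes = diff_schema(before, after)
print(changes)
```

A difference like this is the kind of information a replication job needs in order to decide which columns to add to, or stop populating in, the target table.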

We use Aurora as a sample data source, but any data source that supports Create, Read, Update, and Delete (CRUD) operations can replace Aurora in your use case.

The following diagram illustrates our solution architecture.

The flow of the solution is as follows:

Aurora, as a sample data source, contains an RDBMS table with multiple rows, and AWS DMS does the full load of that data to an S3 bucket (which we call the raw bucket). We expect that you may have multiple source tables, but for demonstration purposes, we only use one source table in this post.
We trigger a Lambda function with the source table name as an event so that the corresponding parameters of the source table are read from DynamoDB. To schedule this operation for specific time intervals, we schedule Amazon EventBridge to trigger the Lambda with the table name as a parameter.
There are many tables in the source database, and we want to run one AWS Glue job for each source table for simplicity in operations. Because we use each AWS Glue job to update each Apache Hudi table, this post uses a DynamoDB table to hold the configuration parameters used by each AWS Glue job for each Apache Hudi table. The DynamoDB table contains each Apache Hudi table name, corresponding AWS Glue job name, AWS Glue job status, load status (full or delta), partition key, record key, and schema to pass to the corresponding table’s AWS Glue Job. The values in the DynamoDB table are static values.
Lambda passes the Apache Hudi table-specific parameters read from DynamoDB to each AWS Glue job (10 G.1X DPUs) and triggers the jobs in parallel. Each job runs Apache Hudi-specific code to insert data into the corresponding Hudi table. The source data comes from tables in the Aurora source database via AWS DMS with full load and incremental load or CDC.
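
The orchestration step above can be sketched as follows. This is a minimal illustration and not the code deployed by the stack: the attribute names (glue_job_name, hudi_table_name, and so on) and the "--" argument keys are hypothetical. The sketch only builds the request that a Lambda function could pass to the AWS Glue StartJobRun API.

```python
from typing import Dict


def build_glue_job_request(config_item: Dict[str, str]) -> Dict[str, object]:
    """Turn one (hypothetical) DynamoDB configuration item into StartJobRun parameters."""
    required = ["glue_job_name", "hudi_table_name", "load_status",
                "partition_key", "record_key"]
    missing = [key for key in required if key not in config_item]
    if missing:
        raise KeyError(f"configuration item is missing: {missing}")

    # AWS Glue job arguments are strings whose names start with "--".
    arguments = {
        "--hudi_table_name": config_item["hudi_table_name"],
        "--load_status": config_item["load_status"],  # "full" or "delta"
        "--partition_key": config_item["partition_key"],
        "--record_key": config_item["record_key"],
    }
    return {"JobName": config_item["glue_job_name"], "Arguments": arguments}


# Hypothetical configuration item for the sample table.
item = {
    "glue_job_name": "hudi-object-job",
    "hudi_table_name": "object",
    "load_status": "full",
    "partition_key": "object_name",
    "record_key": "object_id",
}
request = build_glue_job_request(item)
# Inside Lambda, the request could be sent with boto3:
#   boto3.client("glue").start_job_run(**request)
print(request)
```

Keeping the per-table parameters in one configuration item means a new source table needs a new item rather than a new code path.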

Create resources with AWS CloudFormation

We provide an AWS CloudFormation template to create the following resources:

Lambda and DynamoDB as the data load management orchestrators
S3 buckets for the raw, refined zone, and assets for holding code for schema evolution
An AWS Glue job to update the Hudi tables and perform schema evolution, both forward- and backward-compatible

The Aurora table and AWS DMS replication instance are not provisioned via this stack. For instructions to set up Aurora, refer to Creating an Amazon Aurora DB cluster.

Launch the following stack and provide your stack name.

eu-west-1

Schema evolution

To access your Aurora database, refer to How do I connect to my Amazon RDS for MySQL instance by using MySQL Workbench. Then complete the following steps:

Create a table named object in the Aurora database using the following queries. You then change its schema so that you can see the schema evolution reflected at the data lake level:

create database db;

create table db.object ( 
object_name varchar(255),
object_description varchar(255),
new_column_add varchar(255), 
new_field_1 varchar(255), 
object_id int);

insert into db.object 
values("obj1","Object-1","","",1);

After you create the stacks, some manual steps are needed to prepare the solution end to end.

Create an AWS DMS instance, AWS DMS endpoints, and AWS DMS task with the following configurations:
- Add dataFormat as Parquet in the target endpoint.
- Point the target endpoint of AWS DMS to the raw bucket, which is formatted as raw-bucket-<account_number>-<region_name>, and the folder name should be POC.
Start the AWS DMS task.
Create a test event in the HudiLambda Lambda function with the content of the event JSON as POC.db and save it.
Run the Lambda function.
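
As a rough sketch of the target endpoint configuration described above, the following builds the bucket name and the S3 settings as a plain dictionary. The function names are hypothetical and the account number and Region are placeholders. The keys BucketName, BucketFolder, and DataFormat follow the S3Settings structure of the AWS DMS CreateEndpoint API. Other settings a real endpoint needs, such as the service access role, are left out.

```python
from typing import Dict


def raw_bucket_name(account_number: str, region_name: str) -> str:
    """Build the raw bucket name in the raw-bucket-<account_number>-<region_name> format."""
    return f"raw-bucket-{account_number}-{region_name}"


def s3_target_settings(account_number: str, region_name: str,
                       folder: str = "POC") -> Dict[str, str]:
    """Return the S3 settings described in the steps above (Parquet output, POC folder)."""
    return {
        "BucketName": raw_bucket_name(account_number, region_name),
        "BucketFolder": folder,
        "DataFormat": "parquet",
    }


# Placeholder account number and Region.
settings = s3_target_settings("111122223333", "eu-west-1")
print(settings)
```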

In this post, the schema evolution is reflected through Hudi Hive sync in AWS Glue. You don’t alter queries separately in the data lake.

Now we complete the following steps to change the schema at the source. Trigger the Lambda function after each step to generate a file in the POC/db/object folder within the raw bucket. AWS DMS almost instantly picks up the schema changes and reports to the raw bucket.

Add a column called test_column to the source table object in your Aurora database:

alter table db.object add column test_column int after object_name;

insert into db.object 
values("obj2",22,"test-2","","",2);

Rename the column new_field_1 to new_field_2 in the source table object:

alter table db.object change new_field_1 new_field_2 varchar(10);

insert into db.object 
values("obj3",33,"test-3","","new-3",3);

The column new_field_1 is expected to stay in the Hudi table but without any new values being populated to it anymore.

Delete the column new_field_2 from the source table object:

alter table db.object drop column new_field_2;

insert into db.object 
values("obj4",44,"test-4","",4);

Similar to the previous operation, the column new_field_2 is expected to stay in the Hudi table but without any new values being populated to it anymore.
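
The behavior described in these steps (new columns appear, renamed or dropped columns stay but stop receiving values) can be modeled with a small toy example. This is only an illustration of the expected outcome, not how Apache Hudi implements schema evolution. The function names and the resulting column order are assumptions.

```python
from typing import Dict, List, Optional


def evolve_schema(table_columns: List[str], incoming_columns: List[str]) -> List[str]:
    """Keep every existing column and append columns seen for the first time."""
    evolved = list(table_columns)
    for column in incoming_columns:
        if column not in evolved:
            evolved.append(column)
    return evolved


def align_record(record: Dict[str, object], schema: List[str]) -> Dict[str, Optional[object]]:
    """Fill columns that the source no longer provides with None."""
    return {column: record.get(column) for column in schema}


schema = ["object_name", "object_description", "new_column_add", "new_field_1", "object_id"]

# Step 1: test_column is added at the source.
schema = evolve_schema(schema, ["object_name", "test_column", "object_description",
                                "new_column_add", "new_field_1", "object_id"])
# Step 2: new_field_1 is renamed to new_field_2.
schema = evolve_schema(schema, ["object_name", "test_column", "object_description",
                                "new_column_add", "new_field_2", "object_id"])
# Step 3: new_field_2 is dropped.
schema = evolve_schema(schema, ["object_name", "test_column", "object_description",
                                "new_column_add", "object_id"])

row = align_record({"object_name": "obj4", "test_column": 44,
                    "object_description": "test-4", "new_column_add": "",
                    "object_id": 4}, schema)
print(schema)
print(row)
```

In this model, both new_field_1 and new_field_2 remain in the schema after the last step, and the record inserted after the drop carries no value for either of them.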

If you already have AWS Lake Formation data permissions set up in your account, you may encounter permission issues. In that case, grant full permission (Super) to the default database (before triggering the Lambda function) and all tables in the POC.db database (after the load is complete).

Review the results

When the aforementioned run happens after schema changes, the following results are generated in the refined bucket. We can view the Apache Hudi tables with their contents in Athena. To set up Athena, refer to Getting started.

The table and the database are available in the AWS Glue Data Catalog and ready for browsing the schema.

Before the schema change, the Athena results look like the following screenshot.

After you add the column test_column and insert a value in the test_column field in the object table in the Aurora database, the new column (test_column) is reflected in its corresponding Apache Hudi table in the data lake.

The following screenshot shows the results in Athena.

After you rename the column new_field_1 to new_field_2 and insert a value in the new_field_2 field in the object table, the renamed column (new_field_2) is reflected in its corresponding Apache Hudi table in the data lake, and new_field_1 remains in the schema, having no new value populated to the column.

The following screenshot shows the results in Athena.

After you delete the column new_field_2 in the object table and insert or update any values under any columns in the object table, the deleted column (new_field_2) remains in the corresponding Apache Hudi table schema, having no new value populated to the column.

The following screenshot shows the results in Athena.

Clean up

When you’re done with this solution, delete the sample data in the raw and refined S3 buckets and delete the buckets.

Also, delete the CloudFormation stack to remove all the service resources used in this solution.

Conclusion

This post showed how to implement schema evolution with an open-source solution using Apache Hudi in an AWS environment with an orchestration pipeline.

You can explore the different configurations of AWS Glue to change the AWS Glue job structures and implement it for your data analytics and other use cases.

About the Authors

Subhro Bose is a Senior Data Architect in Emergent Technologies and Intelligence Platform in Amazon. He loves solving science problems with emergent technologies such as AI/ML, big data, quantum, and more to help businesses across different industry verticals succeed within their innovation journey. In his spare time, he enjoys playing table tennis, learning theories of environmental economics, and exploring the best muffins across the city.

Ketan Karalkar is a Big Data Solutions Consultant at AWS. He has nearly two decades of experience helping customers design and build data analytics and database solutions. He believes in using technology as an enabler to solve real-life business problems.

Eva Fang is a Data Scientist within Professional Services in AWS. She is passionate about using technology to provide value to customers and achieve business outcomes. She is based in London. In her spare time, she likes to watch movies and musicals.

Migrate your indexes to Amazon OpenSearch Serverless with Logstash

2023-01-31 Prashant Agrawal

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/migrate-your-indexes-to-amazon-opensearch-serverless-with-logstash/

We recently announced the general availability of Amazon OpenSearch Serverless, a new option for Amazon OpenSearch Service that makes it easy to run large-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. With OpenSearch Serverless, you get the same interactive millisecond response times as OpenSearch Service with the simplicity of a serverless environment.

In this post, you’ll learn how to migrate your existing indices from an OpenSearch Service managed cluster domain to a serverless collection using Logstash.

With OpenSearch domains, you get dedicated, secure clusters configured and optimized for your workloads in minutes. You have full control over the configuration of compute, memory, and storage resources in clusters to optimize cost and performance for your applications. OpenSearch Serverless provides an even simpler way to run search and analytics workloads—without ever having to think about clusters. You simply create a collection and a group of indexes, and can start ingesting and querying the data.

Solution overview

Logstash is open-source software that provides ETL (extract, transform, and load) for your data. You can configure Logstash to connect to a source and a destination via input and output plugins. In between, you configure filters that can transform your data. This post walks you through the steps you need to set up Logstash to connect an OpenSearch Service domain (input) to an OpenSearch Serverless collection (output).

You set the source and destination plugins in Logstash’s config file. The config file has sections for Input, Filter, and Output. Once configured, Logstash will send a request to the OpenSearch Service domain and read the data according to the query you put in the input section. After data is read from OpenSearch Service, you can optionally send it to the next stage Filter for transformations such as adding or removing a field from the input data or updating a field with different values. In this example, you won’t use the Filter plugin. Next is the Output plugin. The open-source version of Logstash (Logstash OSS) provides a convenient way to use the bulk API to upload data to your collections. OpenSearch Serverless supports the logstash-output-opensearch output plugin, which supports AWS Identity and Access Management (IAM) credentials for data access control.

The following diagram illustrates our solution workflow.

Prerequisites

Before getting started, make sure you have completed the following prerequisites:

Note down your OpenSearch Service domain’s ARN, user name, and password.
Create an OpenSearch Serverless collection. If you’re new to OpenSearch Serverless, refer to Log analytics the easy way with Amazon OpenSearch Serverless for details on how to set up your collection.

Set up Logstash and the input and output plugins for OpenSearch

Complete the following steps to set up Logstash and your plugins:

Download logstash-oss-with-opensearch-output-plugin. (This example uses the distro for macos-x64. For other distros, refer to the artifacts.)
```
wget https://artifacts.opensearch.org/logstash/logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz
```

Extract the downloaded tarball:

tar -zxvf logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz
cd logstash-8.4.0/

Update the logstash-output-opensearch plugin to the latest version:

<path/to/your/logstash/root/directory>/bin/logstash-plugin update logstash-output-opensearch

Install the logstash-input-opensearch plugin:

<path/to/your/logstash/root/directory>/bin/logstash-plugin install logstash-input-opensearch

Test the plugin

Let’s get into action and see how the plugin works. The following config file retrieves data from the movies index in your OpenSearch Service domain and indexes that data in your OpenSearch Serverless collection with the same index name, movies.

Create a new file and add the following content, then save the file as opensearch-serverless-migration.conf. Provide the values for the OpenSearch Service domain endpoint under HOST, USERNAME, and PASSWORD in the input section, and the OpenSearch Serverless collection endpoint details under HOST along with REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY in the output section.

input {
    opensearch {
        hosts =>  ["https://<HOST>:443"]
        user  =>  "<USERNAME>"
        password  =>  "<PASSWORD>"
        index =>  "movies"
        query =>  '{ "query": { "match_all": {}} }'
    }
}
output {
    opensearch {
        ecs_compatibility => disabled
        index => "movies"
        hosts => "<HOST>:443"
        auth_type => {
            type => 'aws_iam'
            aws_access_key_id => '<AWS_ACCESS_KEY_ID>'
            aws_secret_access_key => '<AWS_SECRET_ACCESS_KEY>'
            region => '<REGION>'
            service_name => 'aoss'
            }
        legacy_template => false
        default_server_major_version => 2
    }
}

You can specify a query in the input section of the preceding config. The match_all query matches all data in the movies index. You can change the query if you want to select a subset of the data. You can also use the query to parallelize the data transfer by running multiple Logstash processes with configs that specify different data slices. You can also parallelize by running Logstash processes against multiple indexes if you have them.
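
One way to produce the data slices mentioned above is to split a numeric field into contiguous ranges and give each Logstash process its own range query. The following is a minimal sketch under assumptions: the field name year is hypothetical (use a numeric or date field that exists in your index), and each generated string would replace the match_all query in one copy of the config file.

```python
import json
from typing import List


def slice_queries(field: str, start: int, end: int, parts: int) -> List[str]:
    """Split [start, end) into up to `parts` contiguous ranges, one query string per range."""
    if parts < 1 or end <= start:
        raise ValueError("need parts >= 1 and end > start")
    step = -(-(end - start) // parts)  # ceiling division
    queries = []
    for lower in range(start, end, step):
        upper = min(lower + step, end)
        body = {"query": {"range": {field: {"gte": lower, "lt": upper}}}}
        queries.append(json.dumps(body))
    return queries


# Hypothetical example: four slices over a "year" field.
queries = slice_queries("year", 1980, 2020, 4)
for query in queries:
    print(query)
```

Because the ranges are half-open (gte and lt), no document falls into two slices. Documents that lack the field are not matched by any slice and would need a separate query.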

Start Logstash

Use the following command to start Logstash:

<path/to/your/logstash/root/directory>/bin/logstash -f <path/to/your/config/file>

After you run the command, Logstash will retrieve the data from the source index from your OpenSearch Service domain, and write to the destination index in your OpenSearch Serverless collection. When the data transfer is complete, Logstash shuts down. See the following code:

[2023-01-24T20:14:28,965][INFO][logstash.agent] Successfully started Logstash API endpoint {:port=>9600, :ssl_enabled=>false}
…
…
[2023-01-24T20:14:38,852][INFO][logstash.javapipeline][main] Pipeline terminated {"pipeline.id"=>"main"}
[2023-01-24T20:14:39,374][INFO][logstash.pipelinesregistry] Removed pipeline from registry successfully {:pipeline_id=>:main}
[2023-01-24T20:14:39,399][INFO][logstash.runner] Logstash shut down.

Verify the data in OpenSearch Serverless

You can verify that Logstash copied all your data by comparing the document count in your domain and your collection. Run the following query either from the Dev tools tab, or with curl, Postman, or a similar HTTP client. The query searches all documents in the movies index and returns the top documents along with the count. By default, OpenSearch returns a document count up to a maximum of 10,000. Adding the track_total_hits flag returns the exact count when the number of documents exceeds 10,000.

GET movies/_search
{
  "query": {
    "match_all": {}
  },
  "track_total_hits" : true
}
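
If you run this query against both the domain and the collection, the comparison can be automated. The following sketch assumes you already have the two JSON responses parsed into dictionaries. The function names are hypothetical, and the sample responses are made up for illustration.

```python
from typing import Any, Dict


def total_hits(search_response: Dict[str, Any]) -> int:
    """Return the exact hit count, refusing counts that are only a lower bound."""
    total = search_response["hits"]["total"]
    if total["relation"] != "eq":
        raise ValueError("count is a lower bound; set track_total_hits to true")
    return total["value"]


def counts_match(source_response: Dict[str, Any],
                 destination_response: Dict[str, Any]) -> bool:
    """Compare the document counts reported by the domain and the collection."""
    return total_hits(source_response) == total_hits(destination_response)


# Made-up responses, trimmed to the part that matters here.
domain_response = {"hits": {"total": {"value": 12500, "relation": "eq"}}}
collection_response = {"hits": {"total": {"value": 12500, "relation": "eq"}}}
print(counts_match(domain_response, collection_response))
```

The relation check guards against comparing two capped counts, which would both report 10,000 even if the indexes differ.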

Conclusion

In this post, you migrated data from your OpenSearch Service domain to your OpenSearch Serverless collection using Logstash’s OpenSearch input and output plugins.

Stay tuned for a series of posts focusing on the various options available for you to build effective log analytics and search solutions using OpenSearch Serverless. You can also refer to the Getting started with Amazon OpenSearch Serverless workshop to learn more about OpenSearch Serverless.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

About the authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Jon Handler (@_searchgeek) is a Sr. Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Serverless logging with Amazon OpenSearch Service and Amazon Kinesis Data Firehose

2023-01-31 Jon Handler

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/serverless-logging-with-amazon-opensearch-service-and-amazon-kinesis-data-firehose/

In this post, you will learn how you can use Amazon Kinesis Data Firehose to build a log ingestion pipeline to send VPC flow logs to Amazon OpenSearch Serverless. First, you create the OpenSearch Serverless collection you use to store VPC flow logs, then you create a Kinesis Data Firehose delivery pipeline that forwards the flow logs to OpenSearch Serverless. Finally, you enable delivery of VPC flow logs to your Firehose delivery stream. The following diagram illustrates the solution workflow.

OpenSearch Serverless is a new serverless option offered by Amazon OpenSearch Service. OpenSearch Serverless makes it simple to run petabyte-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. OpenSearch Serverless automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads.

Kinesis Data Firehose is a popular service that delivers streaming data from over 20 AWS services to over 15 analytical and observability tools such as OpenSearch Serverless. It offers a fast and easy way to send your VPC flow logs data to your OpenSearch Serverless collection in minutes, without writing a single line of code and without building or managing your own data ingestion and delivery infrastructure.

VPC flow logs capture information about the traffic going to and from the network interfaces in your VPC. With the launch of Kinesis Data Firehose support for OpenSearch Serverless, you can set up analysis of your VPC flow logs with just a few clicks. Kinesis Data Firehose provides a true end-to-end serverless mechanism to deliver your flow logs to OpenSearch Serverless, where you can use OpenSearch Dashboards to search through those logs, create dashboards, detect anomalies, and send alerts. VPC flow logs help you answer questions like:

What percentage of your traffic is getting dropped?
How much traffic is getting generated for specific sources and destinations?

Create your OpenSearch Serverless collection

To get started, you first create a collection. An OpenSearch Serverless collection is a logical grouping of one or more indexes that represent an analytics workload. Complete the following steps:

On the OpenSearch Service console, choose Collections under Serverless in the navigation pane.
Choose Create a collection.
For Collection name, enter a name (for example, vpc-flow-logs).
For Collection type, choose Time series.
For Encryption, choose your preferred encryption setting:
1. Choose Use AWS owned key to use an AWS managed key.
2. Choose a different AWS KMS key to use your own AWS Key Management Service (AWS KMS) key.
For Network access settings, choose your preferred setting:
1. Choose VPC to use a VPC endpoint.
2. Choose Public to use a public endpoint.

AWS recommends that you use a VPC endpoint for all production workloads. For this walkthrough, select Public.

Choose Create.

It should take a couple of minutes to create the collection.

The following graphic gives a quick demonstration of creating the OpenSearch Serverless collection via the preceding steps.

At this point, you have successfully created a collection for OpenSearch Serverless. Next, you create a delivery pipeline for Kinesis Data Firehose.

Create a Kinesis Data Firehose delivery stream

To set up a delivery stream for Kinesis Data Firehose, complete the following steps:

On the Kinesis Data Firehose console, choose Create delivery stream.
For Source, specify Direct PUT.

Check out Source, Destination, and Name to learn more about different sources supported by Kinesis Data Firehose.

For Destination, choose Amazon OpenSearch Serverless.
For Delivery stream name, enter a name (for example, vpc-flow-logs).
Under Destination settings, in the OpenSearch Serverless collection settings, choose Browse.
Select vpc-flow-logs.
Choose Choose.

If your collection is still creating, wait a few minutes and try again.

For Index, specify vpc-flow-logs.
In the Backup settings section, select Failed data only for the Source record backup in Amazon S3.

Kinesis Data Firehose uses Amazon Simple Storage Service (Amazon S3) to back up failed data that it attempts to deliver to your chosen destination. If you want to keep all data, select All data.

For S3 Backup Bucket, choose Browse to select an existing S3 bucket, or choose Create to create a new bucket.
Choose Create delivery stream.

The following graphic gives a quick demonstration of creating the Kinesis Data Firehose delivery stream via the preceding steps.

At this point, you have successfully created a delivery stream for Kinesis Data Firehose, which you will use to stream data from your VPC flow logs and send it to your OpenSearch Serverless collection.

Set up the data access policy for your OpenSearch Serverless collection

Before you send any logs to OpenSearch Serverless, you need to create a data access policy within OpenSearch Serverless that allows Kinesis Data Firehose to write to the vpc-flow-logs index in your collection. Complete the following steps:

On the Kinesis Data Firehose console, choose the Configuration tab on the details page for the vpc-flow-logs delivery stream you just created.
In the Permissions section, note down the AWS Identity and Access Management (IAM) role.
Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
Under Data access, choose Manage data access.
Choose Create access policy.
In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.

Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Provide the IAM role name that you got from the permissions page of the Firehose delivery stream, and the account ID for your AWS account.

[
  {
    "Rules": [
      {
        "ResourceType": "index",
        "Resource": [
          "index/<collection-name>/<index-name>"
        ],
        "Permission": [
          "aoss:WriteDocument",
          "aoss:CreateIndex",
          "aoss:UpdateIndex"
        ]
      }
    ],
    "Principal": [
      "arn:aws:sts::<aws-account-id>:assumed-role/<IAM-role-name>/*"
    ]
  }
]

Choose Create.
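
If you prefer to generate the policy document rather than edit the placeholders by hand, the following sketch fills in the same structure from parameters. The function name is hypothetical, and the account ID and role name shown are placeholders.

```python
import json
from typing import Any, Dict, List


def firehose_write_policy(collection: str, index: str,
                          account_id: str, role_name: str) -> List[Dict[str, Any]]:
    """Build the data access policy shown above for a given collection, index, and role."""
    return [
        {
            "Rules": [
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection}/{index}"],
                    "Permission": [
                        "aoss:WriteDocument",
                        "aoss:CreateIndex",
                        "aoss:UpdateIndex",
                    ],
                }
            ],
            "Principal": [
                f"arn:aws:sts::{account_id}:assumed-role/{role_name}/*"
            ],
        }
    ]


# Placeholder account ID and role name.
policy = firehose_write_policy("vpc-flow-logs", "vpc-flow-logs",
                               "111122223333", "firehose-delivery-role")
print(json.dumps(policy, indent=2))
```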

The following graphic gives a quick demonstration of creating the data access policy via the preceding steps.

Set up VPC flow logs

In the final step of this post, you enable flow logs for your VPC with the destination as Kinesis Data Firehose, which sends the data to OpenSearch Serverless.

Navigate to the AWS Management Console.
Search for “VPC” and then choose Your VPCs in the search result (hover over the VPC rectangle to reveal the link).
Choose the VPC ID link for one of your VPCs.
On the Flow Logs tab, choose Create flow log.
For Name, enter a name.
Leave the Filter set to All. You can limit the traffic by selecting Accept or Reject.
Under Destination, select Send to Kinesis Firehose in the same account.
For Kinesis Firehose delivery stream name, choose vpc-flow-logs.
Choose Create flow log.
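
The same flow log can also be created programmatically. The following sketch only builds the parameters for the Amazon EC2 CreateFlowLogs API. It assumes the standard aws partition in the delivery stream ARN, and the VPC ID, Region, and account ID are placeholders.

```python
from typing import Any, Dict


def flow_log_request(vpc_id: str, region: str, account_id: str,
                     stream_name: str, traffic_type: str = "ALL") -> Dict[str, Any]:
    """Build CreateFlowLogs parameters that send VPC flow logs to a Firehose delivery stream."""
    if traffic_type not in {"ALL", "ACCEPT", "REJECT"}:
        raise ValueError("traffic_type must be ALL, ACCEPT, or REJECT")
    return {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": traffic_type,
        "LogDestinationType": "kinesis-data-firehose",
        "LogDestination": (
            f"arn:aws:firehose:{region}:{account_id}:deliverystream/{stream_name}"
        ),
    }


# Placeholder identifiers.
params = flow_log_request("vpc-0123456789abcdef0", "us-east-1",
                          "111122223333", "vpc-flow-logs")
# With boto3 this could be sent as:
#   boto3.client("ec2").create_flow_logs(**params)
print(params)
```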

The following graphic gives a quick demonstration of creating a flow log for your VPC following the preceding steps.

Examine the VPC flow logs data in your collection using OpenSearch Dashboards

You won’t be able to access your collection data until you configure data access. Data access policies allow users to access the actual data within a collection.

To create a data access policy for OpenSearch Dashboards, complete the following steps:

Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
Under Data access, choose Manage data access.
Choose Create access policy.
In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.
Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Additionally, provide the IAM user and the account ID for your AWS account. You need to make sure that you have the AWS access and secret keys for the principal that you specified as an IAM user.
```
[
  {
    "Rules": [
      {
        "Resource": [
          "index/<collection-name>/<index-name>"
        ],
        "Permission": [
          "aoss:ReadDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::<aws-account-id>:user/<IAM-user-name>"
    ]
  }
]
```
Choose Create.
Navigate to OpenSearch Serverless and choose the collection you created (vpc-flow-logs).
Choose the OpenSearch Dashboards URL and log in with your IAM access key and secret key for the user you specified under Principal.
Navigate to dev tools within OpenSearch Dashboards and run the following query to retrieve the VPC flow logs for your VPC:
```
GET <index-name>/_search
{
  "query": {
    "match_all": {}
  }
}
```

The query returns the data as shown in the following screenshot, which contains information such as account ID, interface ID, source IP address, destination IP address, and more.
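
Once the records are available, a question such as "what percentage of traffic is getting dropped?" can be answered from the action field. The following sketch works on records already retrieved as dictionaries. It assumes the documents carry an action field with the values ACCEPT or REJECT, as in the default VPC flow log format. Check the field names in your own index before relying on it.

```python
from typing import Dict, Iterable


def rejected_percentage(records: Iterable[Dict[str, object]]) -> float:
    """Return the share of flow log records whose action is REJECT, in percent."""
    counted = [r for r in records if r.get("action") in ("ACCEPT", "REJECT")]
    if not counted:
        return 0.0
    rejected = sum(1 for r in counted if r["action"] == "REJECT")
    return 100.0 * rejected / len(counted)


# Made-up sample records with only the fields used here.
sample = [
    {"srcaddr": "10.0.0.5", "dstaddr": "10.0.1.9", "action": "ACCEPT"},
    {"srcaddr": "10.0.0.5", "dstaddr": "10.0.1.9", "action": "ACCEPT"},
    {"srcaddr": "203.0.113.7", "dstaddr": "10.0.1.9", "action": "REJECT"},
    {"srcaddr": "10.0.0.8", "dstaddr": "10.0.2.4", "action": "ACCEPT"},
]
share = rejected_percentage(sample)
print(share)
```

This counts records rather than bytes or packets. For large indexes, an aggregation query in OpenSearch would be the more practical route.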

Create dashboards

After the data is flowing into OpenSearch Serverless, you can easily create dashboards to monitor the activity in your VPC. The following example dashboard shows overall traffic, accepted and rejected traffic, bytes transmitted, and some charts with the top sources and destinations.

Clean up

If you don’t want to continue using the solution, be sure to delete the resources you created:

Return to the AWS console and in the VPCs section, disable the flow logs for your VPC.
In the OpenSearch Serverless dashboard, delete your vpc-flow-logs collection.
On the Kinesis Data Firehose console, delete your vpc-flow-logs delivery stream.

Conclusion

In this post, you created an end-to-end serverless pipeline to deliver your VPC flow logs to OpenSearch Serverless using Kinesis Data Firehose. In this example, you built a delivery pipeline for your VPC flow logs, but you can also use Kinesis Data Firehose to send logs from Amazon Kinesis Data Streams and Amazon CloudWatch, which in turn can be sent to OpenSearch Serverless collections for running analytics on those logs. With serverless solutions on AWS, you can focus on your application development rather than worrying about the ingestion pipeline and tools to visualize your logs.

Get hands-on with OpenSearch Serverless by taking the Getting Started with Amazon OpenSearch Serverless workshop and check out other pipelines for analyzing your logs.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

About the authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

How to set up ongoing replication from your third-party secrets manager to AWS Secrets Manager

2023-01-31 Laurens Brinker

Post Syndicated from Laurens Brinker original https://aws.amazon.com/blogs/security/how-to-set-up-ongoing-replication-from-your-third-party-secrets-manager-to-aws-secrets-manager/

Secrets managers are a great tool to securely store your secrets and provide access to secret material to a set of individuals, applications, or systems that you trust. Across your environments, you might have multiple secrets managers hosted on different providers, which can increase the complexity of maintaining a consistent operating model for your secrets. In these situations, centralizing your secrets in a single source of truth, and replicating subsets of secrets across your other secrets managers, can simplify your operating model.

This blog post explains how you can use your third-party secrets manager as the source of truth for your secrets, while replicating a subset of these secrets to AWS Secrets Manager. By doing this, you will be able to use secrets that originate and are managed from your third-party secrets manager in Amazon Web Services (AWS) applications or in AWS services that use Secrets Manager secrets.

I’ll demonstrate this approach in this post by setting up a sample open-source HashiCorp Vault to create and maintain secrets and create a replication mechanism that enables you to use these secrets in AWS by using AWS Secrets Manager. Although this post uses HashiCorp Vault as an example, you can also modify the replication mechanism to use secrets managers from other providers.

Important: This blog post is intended to provide guidance that you can use when planning and implementing a secrets replication mechanism. The examples in this post are not intended to be run directly in production, and you will need to take security hardening requirements into consideration before deploying this solution. As an example, HashiCorp provides tutorials on hardening production vaults.

You can use these links to navigate through this post:

Why and when to consider replicating secrets
Two approaches to secrets replication
Replicate secrets to AWS Secrets Manager with the pull model
Solution overview
Set up the solution
Step 1: Deploy the solution by using the AWS CDK toolkit
Step 2: Initialize the HashiCorp Vault
Step 3: Update the Vault connection secret
Step 4: (Optional) Set up email notifications for replication failures
Test your secret replication
Update a secret
Secret replication logic
Use your secret
Manage permissions
Options for customizing the sample solution

Why and when to consider replicating secrets

The primary use case for this post is for customers who are running applications on AWS and are currently using a third-party secrets manager to manage their secrets, hosted on-premises, in the AWS Cloud, or with a third-party provider. These customers typically have existing secrets vending processes, deployment pipelines, and procedures and processes around the management of these secrets. Customers with such a setup might want to keep their existing third-party secrets manager and have a set of secrets that are accessible to workloads running outside of AWS, as well as workloads running within AWS, by using AWS Secrets Manager.

Another use case is for customers who are in the process of migrating workloads to the AWS Cloud and want to maintain a (temporary) hybrid form of secrets management. By replicating secrets from an existing third-party secrets manager, customers can migrate their secrets to the AWS Cloud one-by-one, test that they work, integrate the secrets with the intended applications and systems, and once the migration is complete, remove the third-party secrets manager.

Additionally, some AWS services, such as Amazon Relational Database Service (Amazon RDS) Proxy, AWS Direct Connect MACsec, and AD Connector seamless join (Linux), only support secrets from AWS Secrets Manager. Customers can use secret replication if they have a third-party secrets manager and want to be able to use third-party secrets in services that require integration with AWS Secrets Manager. That way, customers don’t have to manage secrets in two places.

Two approaches to secrets replication

In this post, I’ll discuss two main models to replicate secrets from an external third-party secrets manager to AWS Secrets Manager: a pull model and a push model.

Pull model
In a pull model, you can use AWS services such as Amazon EventBridge and AWS Lambda to periodically call your external secrets manager to fetch secrets and updates to those secrets. The main benefit of this model is that it doesn’t require any major configuration to your third-party secrets manager. The AWS resources and mechanism used for pulling secrets must have appropriate permissions and network access to those secrets. However, there could be a delay between the time a secret is created or updated and when it’s picked up for replication, depending on the time interval configured between pulls from AWS to the external secrets manager.

Push model
In this model, rather than periodically polling for updates, the external secrets manager pushes updates to AWS Secrets Manager as soon as a secret is added or changed. The main benefit of this is that there is minimal delay between secret creation, or secret updating, and when that data is available in AWS Secrets Manager. The push model also minimizes the network traffic required for replication since it’s a unidirectional flow. However, this model adds a layer of complexity to the replication, because it requires additional configuration in the third-party secrets manager. More specifically, the push model is dependent on the third-party secrets manager’s ability to run event-based push integrations with AWS resources. This will require a custom integration to be developed and managed on the third-party secrets manager’s side.

This blog post focuses on the pull model to provide an example integration that requires no additional configuration on the third-party secrets manager.

Replicate secrets to AWS Secrets Manager with the pull model

In this section, I’ll walk through an example of how to use the pull model to replicate your secrets from an external secrets manager to AWS Secrets Manager.

Solution overview

Figure 1: Secret replication architecture diagram

The architecture shown in Figure 1 consists of the following main steps, numbered in the diagram:

A cron expression in Amazon EventBridge invokes an AWS Lambda function every 30 minutes.
To connect to the third-party secrets manager, the Lambda function, written in Node.js, fetches a set of user-defined API keys belonging to the secrets manager from AWS Secrets Manager. These API keys have been scoped down to give read-only access to secrets that should be replicated, to adhere to the principle of least privilege. There is more information on this in Step 3: Update the Vault connection secret.
The third step has two variants depending on where your third-party secrets manager is hosted:
a. The Lambda function is configured to fetch secrets from a third-party secrets manager that is hosted outside AWS. This requires sufficient networking and routing to allow communication from the Lambda function.
  
  Note: Depending on the location of your third-party secrets manager, you might have to consider different networking topologies. For example, you might need to set up hybrid connectivity between your external environment and the AWS Cloud by using AWS Site-to-Site VPN or AWS Direct Connect, or both.
b. The Lambda function is configured to fetch secrets from a third-party secrets manager running on Amazon Elastic Compute Cloud (Amazon EC2).
Important: To simplify the deployment of this example integration, I’ll use a secrets manager hosted on a publicly available Amazon EC2 instance within the same VPC as the Lambda function (3b). This minimizes the additional networking components required to interact with the secrets manager. More specifically, the EC2 instance runs an open-source HashiCorp Vault. In the rest of this post, I’ll refer to the HashiCorp Vault’s API keys as Vault tokens.
The Lambda function compares the version of the secret that it just fetched from the third-party secrets manager against the version of the secret that it has in AWS Secrets Manager (by tag). The function will create a new secret in AWS Secrets Manager if the secret does not exist yet, and will update it if there is a new version. The Lambda function will only consider secrets from the third-party secrets manager for replication if they match a specified prefix, for example, hybrid-aws-secrets/.
If there is an error synchronizing the secret, an email notification is sent to the email addresses that are subscribed to the Amazon Simple Notification Service (Amazon SNS) topic deployed with the solution. This sample application uses email notifications with Amazon SNS as an example, but you could also integrate with services like ServiceNow, Jira, Slack, or PagerDuty. Learn more about how to use webhooks to publish Amazon SNS messages to external services.
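The prefix check that the Lambda function applies before it considers a secret for replication can be sketched as a small filter. This is a minimal illustration and not the code from the sample repository: the function name `selectSecretsForReplication` and the name of the secret under super-secret-engine are hypothetical, and only the hybrid-aws-secrets prefix comes from this post.

```typescript
// Hypothetical helper: keep only the secret names that fall under the
// replication prefix (for example, "hybrid-aws-secrets/").
function selectSecretsForReplication(names: string[], prefix: string): string[] {
  // Normalize the prefix so "hybrid-aws-secrets" and "hybrid-aws-secrets/"
  // behave the same, and so "hybrid-aws-secrets-other/x" is not matched.
  const normalized = prefix.endsWith("/") ? prefix : `${prefix}/`;
  return names.filter((name) => name.startsWith(normalized));
}

const candidates = [
  "hybrid-aws-secrets/first-secret-for-replication",
  "hybrid-aws-secrets/second-secret-for-replication",
  "super-secret-engine/do-not-replicate", // hypothetical secret name
];

// Only the two secrets under the replication prefix remain.
const selected = selectSecretsForReplication(candidates, "hybrid-aws-secrets");
```

Normalizing the trailing slash is a design choice in this sketch: it avoids accidentally matching a differently named engine that merely starts with the same characters.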

Set up the solution

In this section, I walk through deploying the pull model solution displayed in Figure 1 using the following steps:
Step 1: Deploy the solution by using the AWS CDK toolkit
Step 2: Initialize the HashiCorp Vault
Step 3: Update the Vault connection secret
Step 4: (Optional) Set up email notifications for replication failures

Step 1: Deploy the solution by using the AWS CDK toolkit

For this blog post, I’ve created an AWS Cloud Development Kit (AWS CDK) script, which can be found in this AWS GitHub repository. Using the AWS CDK, I’ve defined the infrastructure depicted in Figure 1 as Infrastructure as Code (IaC), written in TypeScript, ready for you to deploy and try out. The AWS CDK is an open-source software development framework that allows you to write your cloud application infrastructure as code using common programming languages such as TypeScript, Python, Java, Go, and so on.

Prerequisites:

To deploy the solution, the following should be in place on your system:

Git
Node (version 16 or higher)
jq
AWS CDK Toolkit. Install using npm (included in Node setup) by running npm install -g aws-cdk in a local terminal.
An AWS access key ID and secret access key configured, because this setup will interact with your AWS account. See Configuration basics in the AWS Command Line Interface User Guide for more details.
Docker installed and running on your machine

To deploy the solution

Clone the CDK script for secret replication.
git clone https://github.com/aws-samples/aws-secrets-manager-hybrid-secret-replication-from-hashicorp-vault.git SecretReplication
Use the cloned project as the working directory.
cd SecretReplication
Install the required dependencies to deploy the application.
npm install
Adjust any configuration values for your setup in the cdk.json file. For example, you can adjust the secretsPrefix value to change which prefix is used by the Lambda function to determine the subset of secrets that should be replicated from the third-party secrets manager.
Bootstrap your AWS environments with some resources that are required to deploy the solution. With correctly configured AWS credentials, run the following command.
cdk bootstrap
The core resources created by bootstrapping are an Amazon Elastic Container Registry (Amazon ECR) repository for the AWS Lambda Docker image, an Amazon Simple Storage Service (Amazon S3) bucket for static assets, and AWS Identity and Access Management (IAM) roles with corresponding IAM policies. You can find a full list of the resources by going to the CDKToolkit stack in AWS CloudFormation after the command has finished.
Deploy the infrastructure.
cdk deploy
This command deploys the infrastructure shown in Figure 1 for you by using AWS CloudFormation. For a full list of resources, you can view the SecretsManagerReplicationStack in AWS CloudFormation after the deployment has completed.

Note: If your local environment does not have a terminal that allows you to run these commands, consider using AWS Cloud9 or AWS CloudShell.

After the deployment has finished, you should see an output in your terminal that looks like the one shown in Figure 2. If successful, the output provides the IP address of the sample HashiCorp Vault and its web interface.

Figure 2: AWS CDK deployment output

Step 2: Initialize the HashiCorp Vault

As part of the output of the deployment script, you will be given a URL to access the user interface of the open-source HashiCorp Vault. To simplify accessibility, the URL points to a publicly available Amazon EC2 instance running the HashiCorp Vault user interface as shown in step 3b in Figure 1.

Let’s look at the HashiCorp Vault that was just created. Go to the URL in your browser, and you should see the Raft Storage initialize page, as shown in Figure 3.

Figure 3: HashiCorp Vault Raft Storage initialize page

The vault requires an initial configuration to set up storage and get the initial set of root keys. You can go through the steps manually in the HashiCorp Vault’s user interface, but I recommend that you use the initialise_vault.sh script that is included as part of the SecretsManagerReplication project instead.

Using the HashiCorp Vault API, the initialization script will automatically do the following:

Initialize the Raft storage to allow the Vault to store secrets locally on the instance.
Create an initial set of unseal keys for the Vault. Importantly, for demo purposes, the script uses a single key share. For production environments, it’s recommended to use multiple key shares so that multiple shares are needed to reconstruct the root key, in case of an emergency.
Store the unseal keys in init/vault_init_output.json in your project.
Unseal the HashiCorp Vault by using the unseal keys generated earlier.
Enable two key-value secrets engines:
1. An engine named after the prefix that you’re using for replication, defined in the cdk.json file. In this example, this is hybrid-aws-secrets. We’re going to use the secrets in this engine for replication to AWS Secrets Manager.
2. An engine called super-secret-engine, which you’re going to use to show that your replication mechanism does not have access to secrets outside the engine used for replication.
Create three example secrets, two in hybrid-aws-secrets, and one in super-secret-engine.
Create a read-only policy, which you can see in the init/replication-policy-payload.json file after the script has finished running, that allows read-only access to only the secrets that should be replicated.
Create a new Vault token that has the read-only policy attached so that it can be used by the AWS Lambda function later on to fetch secrets for replication.
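To make the difference between a single key share and multiple key shares concrete, the following sketch builds the request body that Vault’s sys/init API endpoint accepts (the `secret_shares` and `secret_threshold` fields). The helper name `buildInitPayload` is hypothetical, the values 5 and 3 are only illustrative, and this is not necessarily how the initialization script in the sample project constructs its request.

```typescript
// Body shape for Vault's sys/init endpoint: how many key shares to create and
// how many of them are needed to reconstruct the root key.
interface VaultInitPayload {
  secret_shares: number;
  secret_threshold: number;
}

// Hypothetical helper that performs basic sanity checks before building the body.
function buildInitPayload(shares: number, threshold: number): VaultInitPayload {
  if (!Number.isInteger(shares) || !Number.isInteger(threshold)) {
    throw new Error("shares and threshold must be integers");
  }
  if (shares < 1 || threshold < 1 || threshold > shares) {
    throw new Error("threshold must be between 1 and the number of shares");
  }
  return { secret_shares: shares, secret_threshold: threshold };
}

// Demo setup as described in this post: one share, so one key unseals the Vault.
const demoInit = buildInitPayload(1, 1);

// Illustrative production-style setup: any 3 of 5 shares are required.
const productionStyleInit = buildInitPayload(5, 3);
```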

To run the initialization script, go back to your terminal, and run the following command.
./initialise_vault.sh

The script will then ask you for the IP address of your HashiCorp Vault. Provide the IP address (excluding the port) and choose Enter. Input y so that the script creates a couple of sample secrets.

If everything is successful, you should see an output that includes tokens to access your HashiCorp Vault, similar to that shown in Figure 4.

Figure 4: Initialize HashiCorp Vault bash script output

The setup script has outputted two tokens: one root token that you will use for administrator tasks, and a read-only token that will be used to read secret information for replication. Make sure that you can access these tokens while you’re following the rest of the steps in this post.

Note: The root token is only used for demonstration purposes in this post. In your production environments, you should not use root tokens for regular administrator actions. Instead, you should use scoped down roles depending on your organizational needs. In this case, the root token is used to highlight that there are secrets under super-secret-engine/ which are not meant for replication. These secrets cannot be seen, or accessed, by the read-only token.

Go back to your browser and refresh your HashiCorp Vault UI. You should now see the Sign in to Vault page. Sign in using the Token method, and use the root token. If you don’t have the root token in your terminal anymore, you can find it in the init/vault_init_output.json file.

After you sign in, you should see the overview page with three secrets engines enabled for you, as shown in Figure 5.

Figure 5: HashiCorp Vault secrets engines overview

If you explore hybrid-aws-secrets and super-secret-engine, you can see the secrets that were automatically created by the initialization script. For example, first-secret-for-replication, which contains a sample key-value secret with the key secrets and value manager.

If you navigate to Policies in the top navigation bar, you can also see the aws-replication-read-only policy, as shown in Figure 6. This policy provides read-only access to only the hybrid-aws-secrets path.

Figure 6: Read-only HashiCorp Vault token policy

The read-only policy is attached to the read-only token that we’re going to use in the secret replication Lambda function. This policy is important because it scopes down the access that the Lambda function obtains by using the token to a specific prefix meant for replication. For secret replication we only need to perform read operations. This policy ensures that we can read, but cannot add, alter, or delete any secrets in HashiCorp Vault using the token.
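As a rough sketch of what such a policy looks like, the following helper produces a Vault ACL policy that grants only the read and list capabilities under one path, wrapped in the `policy` field that Vault’s ACL policy API expects. The helper name `buildReadOnlyPolicy` is hypothetical, and the actual contents of init/replication-policy-payload.json in the sample project may differ.

```typescript
// Hypothetical helper: build an ACL policy (in HCL) that allows reading and
// listing everything under one secrets engine path, and nothing else.
function buildReadOnlyPolicy(prefix: string): string {
  return [
    `path "${prefix}/*" {`,
    `  capabilities = ["read", "list"]`,
    `}`,
  ].join("\n");
}

// Payload shape for creating or updating an ACL policy through the Vault API.
const policyPayload = { policy: buildReadOnlyPolicy("hybrid-aws-secrets") };
```

Because the create, update, and delete capabilities are never listed, a token that carries only this policy cannot change secrets, and paths such as super-secret-engine are not covered at all.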

You can verify the read-only token permissions by signing into the HashiCorp Vault user interface using the read-only token rather than the root token. Now, you should only see hybrid-aws-secrets. You no longer have access to super-secret-engine, which you saw in Figure 5. If you try to create or update a secret, you will get a permission denied error.

Great! Your HashiCorp Vault is now ready to have its secrets replicated from hybrid-aws-secrets to AWS Secrets Manager. The next section describes a final configuration that you need to do to allow access to the secrets in HashiCorp Vault by the replication mechanism in AWS.

Step 3: Update the Vault connection secret

To allow secret replication, you must give the AWS Lambda function access to the HashiCorp Vault read-only token that was created by the initialization script. To do that, you need to update the vault-connection-secret that was initialized in AWS Secrets Manager as part of your AWS CDK deployment.

For demonstration purposes, I’ll show you how to do that by using the AWS Management Console, but you can also do it programmatically by using the AWS Command Line Interface (AWS CLI) or AWS SDK with the update-secret command.
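If you take the programmatic route, the new secret value has to be a JSON string in which only the vaultToken key changes. The following sketch shows that transformation in isolation; the resulting string is what you would pass as the secret string to update-secret. The helper name `withVaultToken`, the `vaultAddress` key, and the token value are hypothetical placeholders, and the actual AWS SDK or AWS CLI call is intentionally left out.

```typescript
// Hypothetical helper: return a new secret string in which vaultToken is
// replaced and every other key in the existing secret is preserved.
function withVaultToken(currentSecretString: string, readOnlyToken: string): string {
  const parsed: unknown = JSON.parse(currentSecretString);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Expected the secret value to be a JSON object");
  }
  return JSON.stringify({
    ...(parsed as Record<string, unknown>),
    vaultToken: readOnlyToken,
  });
}

// Placeholder values for illustration only; vaultAddress is an assumed key.
const currentValue = JSON.stringify({ vaultAddress: "203.0.113.10", vaultToken: "placeholder" });
const updatedValue = withVaultToken(currentValue, "EXAMPLE-READ-ONLY-TOKEN");
```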

To update the Vault connection secret (console)

In the AWS Management Console, go to AWS Secrets Manager > Secrets > hybrid-aws-secrets/vault-connection-secret.
Under Secret Value, choose Retrieve Secret Value, and then choose Edit.
Update the vaultToken value to contain the read-only token that was generated by the initialization script.

Figure 7: AWS Secrets Manager – Vault connection secret page

Step 4: (Optional) Set up email notifications for replication failures

As highlighted in Figure 1, the Lambda function will send an email by using Amazon SNS to a designated email address whenever one or more secrets fails to be replicated. You will need to configure the solution to use the correct email address. To do this, go to the cdk.json file at the root of the SecretReplication folder and adjust the notificationEmail parameter to an email address that you own. Once done, deploy the changes using the cdk deploy command. Within a few minutes, you’ll get an email requesting you to confirm the subscription. Going forward, you will receive an email notification if one or more secrets fails to replicate.

Test your secret replication

You can either wait up to 30 minutes for the Lambda function to be invoked automatically to replicate the secrets, or you can manually invoke the function.

To test your secret replication

Open the AWS Lambda console and find the Secret Replication function (the name starts with SecretsManagerReplication-SecretReplication).
Navigate to the Test tab.
For the Test event action, select Create new event, create an event using the default parameters, and then choose the Test button on the right-hand side, as shown in Figure 8.

Figure 8: AWS Lambda – Test page to manually invoke the function

This will run the function. You should see a success message, as shown in Figure 9. If this is the first time the Lambda function has been invoked, you will see in the results that two secrets have been created.

Figure 9: AWS Lambda function output

You can find the corresponding logs for the Lambda function invocation in a log group in Amazon CloudWatch matching the name /aws/lambda/SecretsManagerReplication-SecretReplicationLambdaF-XXXX.

To verify that the secrets were added, navigate to AWS Secrets Manager in the console, and in addition to the vault-connection-secret that you edited before, you should now also see the two new secrets with the same hybrid-aws-secrets prefix, as shown in Figure 10.

Figure 10: AWS Secrets Manager overview – New replicated secrets

For example, if you look at first-secret-for-replication, you can see the first version of the secret, with the secret key secrets and secret value manager, as shown in Figure 11.

Figure 11: AWS Secrets Manager – New secret overview showing values and version number

Success! You now have access to the secret values that originate from HashiCorp Vault in AWS Secrets Manager. Also, notice how there is a version tag attached to the secret. This is something that is necessary to update the secret, which you will learn more about in the next two sections.

Update a secret

It’s a recommended security practice to rotate secrets frequently. The Lambda function in this solution not only replicates secrets when they are created — it also periodically checks if existing secrets in AWS Secrets Manager should be updated when the third-party secrets manager (HashiCorp Vault in this case) has a new version of the secret. To validate that this works, you can manually update a secret in your HashiCorp Vault and observe its replication in AWS Secrets Manager in the same way as described in the previous section. You will notice that the version tag of your secret gets updated automatically when there is a new secret replication from the third-party secrets manager to AWS Secrets Manager.

Secret replication logic

This section will explain in more detail the logic behind the secret replication. Consider the following sequence diagram, which explains the overall logic implemented in the Lambda function.

Figure 12: State diagram for secret replication logic

This diagram highlights that the Lambda function will first fetch a list of secret names from the HashiCorp Vault. Then, the function will get a list of secrets from AWS Secrets Manager, matching the prefix that was configured for replication. AWS Secrets Manager will return a list of the secrets that match this prefix and will also return their metadata and tags. Note that the function has not fetched any secret material yet.

Next, the function will loop through each secret name that HashiCorp Vault gave and will check if the secret exists in AWS Secrets Manager:

If there is no secret that matches that name, the function will fetch the secret material from HashiCorp Vault, including the version number, and create a new secret in AWS Secrets Manager. It will also add a version tag to the secret to match the version.
If there is a secret matching that name in AWS Secrets Manager already, the Lambda function will first fetch the metadata from that secret in HashiCorp Vault. This is required to get the version number of the secret, because the version number was not exposed when the function got the list of secrets from HashiCorp Vault initially. If the secret version from HashiCorp Vault does not match the version value of the secret in AWS Secrets Manager (for example, the version in HashiCorp Vault is 2, and the version in AWS Secrets Manager is 1), an update is required to get the values synchronized again. Only now will the Lambda function fetch the actual secret material from HashiCorp Vault and update the secret in AWS Secrets Manager, including the version number in the tag.

The Lambda function fetches metadata about the secrets, rather than just fetching the secret material from HashiCorp Vault straight away. Typically, secrets don’t update very often. If this Lambda function is called every 30 minutes, then it should not have to add or update any secrets in the majority of invocations. By using metadata to determine whether you need the secret material to create or update secrets, you minimize the number of times secret material is fetched both from HashiCorp Vault and AWS Secrets Manager.
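The decision logic described above can be sketched as a pure planning function that never touches secret material. This is a simplified, synchronous illustration and not the Lambda code from the sample repository: `planReplication`, `PlannedStep`, and the third secret name are hypothetical, and `lookUpVaultVersion` stands in for the metadata call to HashiCorp Vault.

```typescript
type ReplicationAction = "create" | "update" | "skip";

interface PlannedStep {
  name: string;
  action: ReplicationAction;
}

// awsVersionTags maps a secret name to the value of its version tag in
// AWS Secrets Manager. lookUpVaultVersion is only called for secrets that
// already exist in AWS, mirroring the metadata-first approach.
function planReplication(
  vaultSecretNames: string[],
  awsVersionTags: Map<string, string>,
  lookUpVaultVersion: (name: string) => number,
): PlannedStep[] {
  return vaultSecretNames.map((name): PlannedStep => {
    const awsVersion = awsVersionTags.get(name);
    if (awsVersion === undefined) {
      // Not in AWS Secrets Manager yet: fetch the material and create it.
      return { name, action: "create" };
    }
    // Already in AWS: compare versions before fetching any secret material.
    const vaultVersion = lookUpVaultVersion(name);
    return { name, action: String(vaultVersion) === awsVersion ? "skip" : "update" };
  });
}

const vaultVersions: Record<string, number> = {
  "hybrid-aws-secrets/first-secret-for-replication": 2,
  "hybrid-aws-secrets/second-secret-for-replication": 1,
  "hybrid-aws-secrets/third-secret": 1, // hypothetical new secret
};

const plan = planReplication(
  Object.keys(vaultVersions),
  new Map([
    ["hybrid-aws-secrets/first-secret-for-replication", "1"],
    ["hybrid-aws-secrets/second-secret-for-replication", "1"],
  ]),
  (name) => vaultVersions[name],
);
// plan: first secret -> "update", second -> "skip", third -> "create"
```

Only the steps marked create or update would go on to fetch secret material, which is what keeps most invocations free of secret reads.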

Note: The AWS Lambda function has permissions to pull certain secrets from HashiCorp Vault. It is important to thoroughly review the Lambda code and any subsequent changes to it to prevent leakage of secrets. For example, you should ensure that the Lambda function does not get updated with code that unintentionally logs secret material outside the Lambda function.

Use your secret

Now that you have created and replicated your secrets, you can use them in your AWS applications or AWS services that are integrated with Secrets Manager. For example, you can use the secrets when you set up connectivity for a proxy in Amazon RDS, as follows.

To use a secret when creating a proxy in Amazon RDS

Go to the Amazon RDS service in the console.
In the left navigation pane, choose Proxies, and then choose Create Proxy.
On the Connectivity tab, you can now select first-secret-for-replication or second-secret-for-replication, which were created by the Lambda function after replicating them from the HashiCorp Vault.

Figure 13: Amazon RDS Proxy – Example of using replicated AWS Secrets Manager secrets

It is important to remember that the consumers of the replicated secrets in AWS Secrets Manager will require scoped-down IAM permissions to use the secrets and AWS Key Management Service (AWS KMS) keys that were used to encrypt the secrets. For example, see Step 3: Create IAM role and policy on the Set up shared database connections with Amazon RDS Proxy page.

Manage permissions

Due to the sensitive nature of the secrets, it is important that you scope down the permissions to the least amount required to prevent inadvertent access to your secrets. The setup adopts a least-privilege permission strategy, where only the necessary actions are explicitly allowed on the resources that are required for replication. However, the permissions should be reviewed in accordance with your security standards.

In the architecture of this solution, there are two main places where you control access to the management of your secrets in Secrets Manager.

Lambda execution IAM role: The IAM role assumed by the Lambda function during execution contains the appropriate permissions for secret replication. There is an additional safety measure, which explicitly denies any action to a resource that is not required for the replication. For example, the Lambda function only has permission to publish to the Amazon SNS topic that is created for the failed replications, and will explicitly deny a publish action to any other topic. Even if someone accidentally adds an allow to the policy for a different topic, the explicit deny will still block this action.
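One way to express an allow combined with an explicit deny is a pair of IAM statements in which the deny uses NotResource. The following sketch builds such a pair for the Amazon SNS example; it illustrates the pattern and is not the policy generated by the sample project. The helper name `buildSnsPublishStatements`, the account ID, the Region, and the topic name are placeholders.

```typescript
interface IamStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource?: string[];
  NotResource?: string[];
}

// Hypothetical helper: allow publishing to one topic and explicitly deny
// publishing to every other topic.
function buildSnsPublishStatements(topicArn: string): IamStatement[] {
  return [
    { Effect: "Allow", Action: ["sns:Publish"], Resource: [topicArn] },
    // An explicit deny overrides any allow, so an allow that is added later
    // for a different topic would still be blocked by this statement.
    { Effect: "Deny", Action: ["sns:Publish"], NotResource: [topicArn] },
  ];
}

const snsPolicyDocument = {
  Version: "2012-10-17",
  Statement: buildSnsPublishStatements(
    "arn:aws:sns:eu-west-1:111122223333:replication-failures", // placeholder ARN
  ),
};
```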

AWS KMS key policy: When other services need to access the replicated secret in AWS Secrets Manager, they need permission to use the hybrid-aws-secrets-encryption-key AWS KMS key. You need to grant the principal these permissions through the AWS KMS key policy. Additionally, you can manage permissions to the AWS KMS key for the principal through an identity policy. For example, this is required when accessing AWS KMS keys across AWS accounts. See Permissions for AWS services in key policies and Specifying KMS keys in IAM policy statements in the AWS KMS Developer Guide.
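As an illustration of the key policy side, the following sketch builds a single key policy statement that lets one consuming role decrypt with the key. The helper name `allowSecretConsumer`, the Sid, the role ARN, and the exact set of actions are assumptions made for this example; review them against your own requirements rather than treating them as the policy used by the sample.

```typescript
interface KeyPolicyStatement {
  Sid: string;
  Effect: "Allow";
  Principal: { AWS: string };
  Action: string[];
  Resource: string;
}

// Hypothetical helper: grant one role the KMS actions it needs to read
// secrets that are encrypted with this key.
function allowSecretConsumer(roleArn: string): KeyPolicyStatement {
  return {
    Sid: "AllowSecretConsumerToDecrypt",
    Effect: "Allow",
    Principal: { AWS: roleArn },
    Action: ["kms:Decrypt", "kms:DescribeKey"],
    // In a key policy, "*" refers to the key that the policy is attached to.
    Resource: "*",
  };
}

const consumerStatement = allowSecretConsumer(
  "arn:aws:iam::111122223333:role/example-secret-consumer", // placeholder role
);
```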

Options for customizing the sample solution

The solution that was covered in this post provides an example for replication of secrets from HashiCorp Vault to AWS Secrets Manager using the pull model. This section contains additional customization options that you can consider when setting up the solution, or your own variation of it.

Depending on the solution that you’re using, you might have access to different metadata attached to the secrets, which you can use to determine if a secret should be updated. For example, if you have access to data that represents a last_updated_datetime property, you could use this to infer whether or not a secret ought to be updated.
It is a recommended practice to not use long-lived tokens wherever possible. In this sample, I used a static vault token to give the Lambda function access to the HashiCorp Vault. Depending on the solution that you’re using, you might be able to implement better authentication and authorization mechanisms. For example, HashiCorp Vault allows you to use IAM auth by using AWS IAM, rather than a static token.
This post addressed the creation of secrets and updating of secrets, but for your production setup, you should also consider deletion of secrets. Depending on your requirements, you can choose to implement a strategy that works best for you to handle secrets in AWS Secrets Manager once the original secret in HashiCorp Vault has been deleted. In the pull model, you could consider removing a secret in AWS Secrets Manager if the corresponding secret in your external secrets manager is no longer present.
In the sample setup, the same AWS KMS key is used to encrypt both the environment variables of the Lambda function, and the secrets in AWS Secrets Manager. You could choose to add an additional AWS KMS key (which would incur additional cost), to have two separate keys for these tasks. This would allow you to apply more granular permissions for the two keys in the corresponding KMS key policies or IAM identity policies that use the keys.
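For the deletion option mentioned in this list, a pull-based approach could compare the names on both sides and flag the secrets that no longer exist in the external secrets manager. The following is a minimal sketch under that assumption; `findOrphanedSecrets` is a hypothetical helper and the sample solution does not implement deletion.

```typescript
// Hypothetical helper: list secrets that exist in AWS Secrets Manager but are
// no longer present in the external secrets manager.
function findOrphanedSecrets(
  awsSecretNames: string[],
  vaultSecretNames: string[],
  protectedNames: string[] = [],
): string[] {
  const stillInVault = new Set(vaultSecretNames);
  const neverDelete = new Set(protectedNames);
  return awsSecretNames.filter(
    (name) => !stillInVault.has(name) && !neverDelete.has(name),
  );
}

const orphans = findOrphanedSecrets(
  [
    "hybrid-aws-secrets/vault-connection-secret",
    "hybrid-aws-secrets/first-secret-for-replication",
    "hybrid-aws-secrets/second-secret-for-replication",
  ],
  ["hybrid-aws-secrets/first-secret-for-replication"],
  ["hybrid-aws-secrets/vault-connection-secret"],
);
// orphans contains only the second secret.
```

The protected list matters in this setup because the vault-connection-secret shares the hybrid-aws-secrets prefix in AWS Secrets Manager but does not originate from HashiCorp Vault, so a plain comparison would flag it for removal.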

Conclusion

In this blog post, you’ve seen how you can approach replicating your secrets from an external secrets manager to AWS Secrets Manager. This post focused on a pull model, where the solution periodically fetched secrets from an external HashiCorp Vault and automatically created or updated the corresponding secret in AWS Secrets Manager. By using this model, you can now use your external secrets in your AWS Cloud applications or services that have an integration with AWS Secrets Manager.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.


Reduce risk by implementing HttpOnly cookie authentication in Amazon API Gateway

2023-01-30 Marc Borntraeger

Post Syndicated from Marc Borntraeger original https://aws.amazon.com/blogs/security/reduce-risk-by-implementing-httponly-cookie-authentication-in-amazon-api-gateway/

Some web applications need to protect their authentication tokens or session IDs from cross-site scripting (XSS). It’s an Open Web Application Security Project (OWASP) best practice for session management to store secrets in the browsers’ cookie store with the HttpOnly attribute enabled. When cookies have the HttpOnly attribute set, the browser will prevent client-side JavaScript code from accessing the value. This reduces the risk of secrets being compromised.

In this blog post, you’ll learn how to store access tokens and authenticate with HttpOnly cookies in your own workloads when using Amazon API Gateway as the client-facing endpoint. The tutorial in this post will show you a solution to store OAuth2 access tokens in the browser cookie store, and verify user authentication through Amazon API Gateway. This post describes how to use Amazon Cognito to issue OAuth2 access tokens, but the solution is not limited to OAuth2. You can use other kinds of tokens or session IDs.

The solution consists of two decoupled parts:

OAuth2 flow
Authentication check

Note: This tutorial takes you through detailed step-by-step instructions to deploy an example solution. If you prefer to deploy the solution with a script, see the api-gw-http-only-cookie-auth GitHub repository.

Prerequisites

You should have an AWS account.
You should have level 200-300 knowledge about the OAuth2 protocol.
You should have the AWS Command Line Interface (AWS CLI) installed.
You should have the AWS Toolkit for Visual Studio Code installed, so you can simply upload your code.
You should have Node.js installed on your local machine.
This solution uses the following services: Amazon Cognito, Amazon API Gateway, and AWS Lambda.

You should not incur costs when you deploy the application from this tutorial because the services you’re going to use are included in the AWS Free Tier. However, be aware that small charges may apply if you have other workloads running in your AWS account and exceed the free tier. Make sure to clean up your resources from this tutorial after deployment.

Solution architecture

This solution uses Amazon Cognito, Amazon API Gateway, and AWS Lambda to build a solution that persists OAuth2 access tokens in the browser cookie store. Figure 1 illustrates the solution architecture for the OAuth2 flow.

Figure 1: OAuth2 flow solution architecture

A user authenticates by using Amazon Cognito.
Amazon Cognito has an OAuth2 redirect URI pointing to your API Gateway endpoint and invokes the integrated Lambda function oAuth2Callback.
The oAuth2Callback Lambda function makes a request to the Amazon Cognito token endpoint with the OAuth2 authorization code to get the access token.
The Lambda function returns a response with the Set-Cookie header, instructing the web browser to persist the access token as an HttpOnly cookie. The browser will automatically interpret the Set-Cookie header, because it’s a web standard. HttpOnly cookies can’t be accessed through JavaScript—they can only be set through the Set-Cookie header.
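To illustrate the last step, the following sketch builds a Set-Cookie header value with the HttpOnly attribute and wraps it in a redirect response. It is a simplified example rather than the oAuth2Callback code from this tutorial: the cookie name `accessToken`, the redirect target, the additional cookie attributes, and the one-hour lifetime are assumptions.

```typescript
// Hypothetical helper: build the Set-Cookie value for an access token.
function buildAccessTokenCookie(accessToken: string, maxAgeSeconds: number): string {
  return [
    `accessToken=${encodeURIComponent(accessToken)}`,
    "HttpOnly", // not readable from client-side JavaScript
    "Secure", // only sent over HTTPS
    "SameSite=Lax",
    "Path=/",
    `Max-Age=${maxAgeSeconds}`,
  ].join("; ");
}

// Hypothetical response shape: redirect the browser and set the cookie.
function buildCallbackResponse(accessToken: string) {
  return {
    statusCode: 302,
    headers: {
      Location: "/",
      "Set-Cookie": buildAccessTokenCookie(accessToken, 3600),
    },
  };
}

const callbackResponse = buildCallbackResponse("abc.def.ghi"); // placeholder token
```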

After the OAuth2 flow, you are set up to issue and store access tokens. Next, you need to verify that users are authenticated before they are allowed to access your protected backend. Figure 2 illustrates how the authentication check is handled.

Figure 2: Authentication check solution architecture

A user requests a protected backend resource. The browser automatically attaches HttpOnly cookies to every request, as defined in the web standard.
The Lambda function oAuth2Authorizer acts as the Lambda authorizer for HTTP APIs. It validates whether requests are authenticated. If requests include the proper access token in the request cookie header, then it allows the request.
API Gateway only passes through requests that are authenticated.

Amazon Cognito is not involved in the authentication check, because the Lambda function can validate the OAuth2 access tokens by using a JSON Web Token (JWT) validation check.
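The authentication check can be sketched as two small steps: read the token from the request’s cookie header, then hand it to a verifier. In this illustration, `readCookie` and `authorize` are hypothetical helpers, the cookie name `accessToken` is an assumption, and `verifyToken` stands in for the real JWT validation, which is not shown. The returned object follows the simple response format for HTTP API Lambda authorizers.

```typescript
// Hypothetical helper: extract one cookie value from a Cookie header string.
function readCookie(cookieHeader: string | undefined, name: string): string | undefined {
  if (!cookieHeader) return undefined;
  for (const part of cookieHeader.split(";")) {
    const separator = part.indexOf("=");
    if (separator === -1) continue;
    if (part.slice(0, separator).trim() !== name) continue;
    try {
      return decodeURIComponent(part.slice(separator + 1).trim());
    } catch (_error) {
      return undefined; // malformed percent-encoding: treat as missing
    }
  }
  return undefined;
}

// verifyToken is a stand-in for JWT validation (signature, expiry, claims).
function authorize(
  cookieHeader: string | undefined,
  verifyToken: (token: string) => boolean,
): { isAuthorized: boolean } {
  const token = readCookie(cookieHeader, "accessToken");
  return { isAuthorized: token !== undefined && verifyToken(token) };
}

// Usage with a fake verifier that accepts exactly one placeholder token.
const decision = authorize(
  "theme=dark; accessToken=abc.def.ghi",
  (token) => token === "abc.def.ghi",
);
```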

1. Deploying the OAuth2 flow

In this section, you’ll deploy the first part of the solution, which is the OAuth2 flow. The OAuth2 flow is responsible for issuing and persisting OAuth2 access tokens in the browser’s cookie store.

1.1. Create a mock protected backend

As shown in Figure 2, you need to protect a backend. For the purposes of this post, you create a mock backend by creating a simple Lambda function with a default response.

To create the Lambda function

In the Lambda console, choose Create function.

Note: Make sure to select your desired AWS Region.
Choose Author from scratch as the option to create the function.
In the Basic information section, as shown in Figure 3, enter or select the following values:
- For Function name, enter getProtectedResource.
- For Runtime, select Node.js 16.x.
- For Architecture, select arm64, because arm64 is designed to be faster and cheaper.
Choose Create function.

Figure 3: Configuring the getProtectedResource Lambda function

The default Lambda function code returns a simple Hello from Lambda message, which is sufficient to demonstrate the concept of this solution.

1.2. Create an HTTP API in Amazon API Gateway

Next, you create an HTTP API by using API Gateway. Either an HTTP API or a REST API will work. In this example, choose HTTP API because it’s offered at a lower price point (for this tutorial you will stay within the free tier).

To create the API Gateway API

In the API Gateway console, under HTTP API, choose Build.
On the Create and configure integrations page, as shown in Figure 4, choose Add integration, then enter or select the following values:
- Select Lambda.
- For Lambda function, select the getProtectedResource Lambda function that you created in the previous section.
- For API name, enter a name. In this example, I used MyApp.
- Choose Next.
Figure 4: Configuring API Gateway integrations and API name
On the Configure routes page, as shown in Figure 5, enter or select the following values:
- For Method, select GET.
- For Resource path, enter / (a single forward slash).
- For Integration target, select the getProtectedResource Lambda function.
- Choose Next.
Figure 5: Configuring API Gateway routes
On the Configure stages page, keep all the default options, and choose Next.
On the Review and create page, choose Create.
Note down the value of Invoke URL, as shown in Figure 6.

Figure 6: Note down the invoke URL

Now it’s time to test your API Gateway API. Paste the value of Invoke URL into your browser. You’ll see the following message from your Lambda function: Hello from Lambda.

1.3. Use Amazon Cognito

You’ll use Amazon Cognito user pools to create and maintain a user directory, and add sign-up and sign-in to your web application.

To create an Amazon Cognito user pool

In the Amazon Cognito console, choose Create user pool.
On the Authentication providers page, as shown in Figure 7, for Cognito user pool sign-in options, select Email, then choose Next.

Figure 7: Configuring authentication providers
In the Multi-factor authentication pane of the Configure Security requirements page, as shown in Figure 8, choose your MFA enforcement. For this example, choose No MFA to make it simpler for you to test your solution. However, for data-sensitive production workloads, you should choose Require MFA – Recommended. Choose Next.

Figure 8: Configuring MFA
On the Configure sign-up experience page, keep all the default options and choose Next.
On the Configure message delivery page, as shown in Figure 9, choose your email provider. For this example, choose Send email with Cognito to make it simple to test your solution. In production workloads, you should choose Send email with Amazon SES – Recommended. Choose Next.

Figure 9: Configuring email
In the User pool name section of the Integrate your app page, as shown in Figure 10, enter or select the following values:
1. For User pool name, enter a name. In this example, I used MyUserPool.
  
  Figure 10: Configuring user pool name
2. In the Hosted authentication pages section, as shown in Figure 11, select Use the Cognito Hosted UI.
  
  Figure 11: Configuring hosted authentication pages
3. In the Domain section, as shown in Figure 12, for Domain type, choose Use a Cognito domain. For Cognito domain, enter a domain name. Note that domains in Cognito must be unique. Make sure to enter a unique name, for example by appending random numbers at the end of your domain name. For this example, I used https://http-only-cookie-secured-app.
  
  Figure 12: Configuring an Amazon Cognito domain
4. In the Initial app client section, as shown in Figure 13, enter or select the following values:
  - For App type, keep the default setting Public client.
  - For App client name, enter a friendly name. In this example, I used MyAppClient.
  - For Client secret, keep the default setting Don’t generate a client secret.
  - For Allowed callback URLs, enter <API_GW_INVOKE_URL>/oauth2/callback, replacing <API_GW_INVOKE_URL> with the invoke URL you noted down from API Gateway in the previous section.
    
    Figure 13: Configuring the initial app client
5. Choose Next.
Choose Create user pool.

Next, you need to retrieve some Amazon Cognito information for later use.

To note down Amazon Cognito information

In the Amazon Cognito console, choose the user pool you created in the previous steps.
Under User pool overview, make note of the User pool ID value.
On the App integration tab, under Cognito Domain, make note of the Domain value.
Under App client list, make note of the Client ID value.
Under App client list, choose the app client name you created in the previous steps.
Under Hosted UI, make note of the Allowed callback URLs value.

Next, create the user that you will use in a later section of this post to run your test.

To create a user

In the Amazon Cognito console, choose the user pool you created in the previous steps.
Under Users, choose Create user.
For Email address, enter [email protected]. For this tutorial, you don’t need to send out actual emails, so the email address does not need to actually exist.
Choose Mark email address as verified.
For password, enter a password you can remember (or even better: use a password generator).
Remember the email and password for later use.
Choose Create user.

1.4. Create the Lambda function oAuth2Callback

Next, you create the Lambda function oAuth2Callback, which is responsible for issuing and persisting the OAuth2 access tokens.

To create the Lambda function oAuth2Callback

In the Lambda console, choose Create function.

Note: Make sure to select your desired Region.
For Function name, enter oAuth2Callback.
For Runtime, select Node.js 16.x.
For Architecture, select arm64.
Choose Create function.

After you create the Lambda function, you need to add the code. Create a new folder on your local machine and open it with your preferred integrated development environment (IDE). Add the package.json and index.js files, as shown in the following examples.

package.json

{
  "name": "oAuth2Callback",
  "version": "0.0.1",
  "dependencies": {
    "axios": "^0.27.2",
    "qs": "^6.11.0"
  }
}

In a terminal at the root of your created folder, run the following command.

$ npm install

In the index.js example code that follows, be sure to replace the placeholders with your values.

index.js

const qs = require("qs");
const axios = require("axios").default;
exports.handler = async function (event) {
  const code = event.queryStringParameters?.code;
  if (code == null) {
    return {
      statusCode: 400,
      body: "code query param required",
    };
  }
  const data = {
    grant_type: "authorization_code",
    client_id: "<your client ID from Cognito>",
    // The redirect has already happened, but you still need to pass the URI for validation, so a valid oAuth2 access token can be generated
    redirect_uri: encodeURI("<your callback URL from Cognito>"),
    code: code,
  };
  // Every Cognito instance has its own token endpoints. For more information check the documentation: https://docs.aws.amazon.com/cognito/latest/developerguide/token-endpoint.html
  const res = await axios.post(
    "<your App Client Cognito domain>/oauth2/token",
    qs.stringify(data),
    {
      headers: {
        "Content-Type": "application/x-www-form-urlencoded",
      },
    }
  );
  return {
    statusCode: 302,
    // These headers are returned as part of the response to the browser.
    headers: {
      // The Location header tells the browser it should redirect to the root of the URL
      Location: "/",
      // The Set-Cookie header tells the browser to persist the access token in the cookie store
      "Set-Cookie": `accessToken=${res.data.access_token}; Secure; HttpOnly; SameSite=Lax; Path=/`,
    },
  };
};

Along with the HttpOnly attribute, you pass along two additional cookie attributes:

Secure – Indicates that cookies are only sent by the browser to the server when a request is made with the https: scheme.
SameSite – Controls whether or not a cookie is sent with cross-site requests, providing protection against cross-site request forgery attacks. You set the value to Lax because you want the cookie to be set when the user is forwarded from Amazon Cognito to your web application (which runs under a different URL).

For more information, see Using HTTP cookies on the MDN Web Docs site.
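
To see how these attributes end up in the header value, the following minimal sketch assembles an equivalent Set-Cookie string with Python's standard http.cookies module rather than the Node.js template string used in the Lambda function. The function name build_access_token_cookie and the sample token are assumptions for illustration. The standard library emits the attributes in its own order, which does not change their meaning.

```python
from http.cookies import SimpleCookie


def build_access_token_cookie(access_token: str) -> str:
    """Return a Set-Cookie header value for the access token (illustrative helper)."""
    cookie = SimpleCookie()
    cookie["accessToken"] = access_token
    morsel = cookie["accessToken"]
    morsel["secure"] = True      # only sent with https: requests
    morsel["httponly"] = True    # not readable from client-side scripts
    morsel["samesite"] = "Lax"   # still sent when the user is forwarded from another site
    morsel["path"] = "/"
    return morsel.OutputString()


if __name__ == "__main__":
    # Placeholder token; a real access token comes from the Cognito token endpoint.
    print(build_access_token_cookie("header.payload.signature"))
```

The point of the sketch is only that HttpOnly, Secure, SameSite, and Path are plain attributes appended to the cookie value; the Lambda function in this post produces the same kind of string directly.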

Afterwards, upload the code to the oAuth2Callback Lambda function as described in Upload a Lambda Function in the AWS Toolkit for VS Code User Guide.

1.5. Configure an OAuth2 callback route in API Gateway

Now, you configure API Gateway to use your new Lambda function through a Lambda proxy integration.

To configure API Gateway to use your Lambda function

In the API Gateway console, under APIs, choose your API name. For me, the name is MyApp.
Under Develop, choose Routes.
Choose Create.
Enter or select the following values:
- For method, select GET.
- For path, enter /oauth2/callback.
Choose Create.
Choose GET under /oauth2/callback, and then choose Attach integration.
Choose Create and attach an integration.
- For Integration type, choose Lambda function.
- For Lambda function, choose oAuth2Callback from the last step.
Choose Create.

Your route configuration in API Gateway should now look like Figure 14.

Figure 14: Routes for API Gateway

2. Testing the OAuth2 flow

Now that you have the components in place, you can test your OAuth2 flow. You test the OAuth2 flow by invoking the login on your browser.

To test the OAuth2 flow

In the Amazon Cognito console, choose your user pool name. For me, the name is MyUserPool.
Under the navigation tabs, choose App integration.
Under App client list, choose your app client name. For me, the name is MyAppClient.
Choose View Hosted UI.
In the newly opened browser tab, open your developer tools, so you can inspect the network requests.
Log in with the email address and password you set in the previous section. Change your password, if you’re asked to do so. You can also choose the same password as you set in the previous section.
You should see your Hello from Lambda message.

To test that the cookie was set correctly

Check your browser network tab in the browser developer settings. You’ll see the /oauth2/callback request, as shown in Figure 15.

Figure 15: Callback network request

The response headers should include a set-cookie header, as you specified in your Lambda function. With the set-cookie header, your OAuth2 access token is set as an HttpOnly cookie in the browser, and access is prohibited from any client-side code.
Alternatively, you can inspect the cookie in the browser cookie storage, as shown in Figure 16.

Figure 16: Cookie storage
If you want to retry the authentication, navigate in your browser to your Amazon Cognito domain that you chose in the previous section and clear all site data in the browser developer tools. Do the same with your API Gateway invoke URL. Now you can restart the test with a clean state.

3. Deploying the authentication check

In this section, you’ll deploy the second part of your application: the authentication check. The authentication check ensures that only authenticated users can access your protected backend. It works with the HttpOnly cookie, which is stored in the user’s cookie store.

3.1. Create the Lambda function oAuth2Authorizer

This Lambda function checks that requests are authenticated.

To create the Lambda function

In the Lambda console, choose Create function.

Note: Make sure to select your desired Region.
For Function name, enter oAuth2Authorizer.
For Runtime, select Node.js 16.x.
For Architecture, select arm64.
Choose Create function.

After you create the Lambda function, you need to add the code. Create a new folder on your local machine and open it with your preferred IDE. Add the package.json and index.js files as shown in the following examples.

package.json

{
  "name": "oAuth2Authorizer",
  "version": "0.0.1",
  "dependencies": {
    "aws-jwt-verify": "^3.1.0"
  }
}

In a terminal at the root of your created folder, run the following command.

$ npm install

In the index.js example code, be sure to replace the placeholders with your values.

index.js

const { CognitoJwtVerifier } = require("aws-jwt-verify");
function getAccessTokenFromCookies(cookiesArray) {
  // cookieStr contains the full cookie definition string: "accessToken=abc"
  for (const cookieStr of cookiesArray) {
    const cookieArr = cookieStr.split("accessToken=");
    // After splitting you should get an array with 2 entries: ["", "abc"] - Or only 1 entry in case it was a different cookie string: ["test=test"]
    if (cookieArr[1] != null) {
      return cookieArr[1]; // Returning only the value of the access token without cookie name
    }
  }
  return null;
}
// Create the verifier outside the Lambda handler (= during cold start),
// so the cache can be reused for subsequent invocations. Then, only during the
// first invocation, will the verifier actually need to fetch the JWKS.
const verifier = CognitoJwtVerifier.create({
  userPoolId: "<your user pool ID from Cognito>",
  tokenUse: "access",
  clientId: "<your client ID from Cognito>",
});
exports.handler = async (event) => {
  if (event.cookies == null) {
    console.log("No cookies found");
    return {
      isAuthorized: false,
    };
  }
  // Cookies array looks something like this: ["accessToken=abc", "otherCookie=Random Value"]
  const accessToken = getAccessTokenFromCookies(event.cookies);
  if (accessToken == null) {
    console.log("Access token not found in cookies");
    return {
      isAuthorized: false,
    };
  }
  try {
    await verifier.verify(accessToken);
    return {
      isAuthorized: true,
    };
  } catch (e) {
    console.error(e);
    return {
      isAuthorized: false,
    };
  }
};

After you add the package.json and index.js files, upload the code to the oAuth2Authorizer Lambda function as described in Upload a Lambda Function in the AWS Toolkit for VS Code User Guide.
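
One caveat with splitting on "accessToken=" is that a cookie whose name merely ends in accessToken (for example, a hypothetical xaccessToken cookie) would also match. A stricter lookup compares the full cookie name. The following sketch shows that logic in Python for readability; it is not part of the original solution, the function name is an assumption, and the same comparison would need to be ported to the Node.js authorizer.

```python
from typing import Iterable, Optional


def get_access_token_from_cookies(
    cookies: Iterable[str], name: str = "accessToken"
) -> Optional[str]:
    """Return the value of the cookie whose name matches exactly, or None."""
    for cookie_str in cookies:
        # Split only on the first "=", because the value may itself contain "=".
        cookie_name, sep, value = cookie_str.strip().partition("=")
        if sep and cookie_name == name:
            return value
    return None


if __name__ == "__main__":
    # Same shape as the cookies array described in the authorizer code.
    print(get_access_token_from_cookies(["accessToken=abc", "otherCookie=Random Value"]))
```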

3.2. Configure the Lambda authorizer in API Gateway

Next, you configure your authorizer Lambda function to protect your backend. This way you control access to your HTTP API.

To configure the authorizer Lambda function

In the API Gateway console, under APIs, choose your API name. For me, the name is MyApp.
Under Develop, choose Routes.
Under / (a single forward slash) GET, choose Attach authorization.
Choose Create and attach an authorizer.
Choose Lambda.
Enter or select the following values:
- For Name, enter oAuth2Authorizer.
- For Lambda function, choose oAuth2Authorizer.
- Clear Authorizer caching. For this tutorial, you disable authorizer caching to make testing simpler. See the section Bonus: Enabling authorizer caching for more information about enabling caching to increase performance.
- Under Identity sources, choose Remove.
  
  Note: Identity sources are ignored for your Lambda authorizer. These are only used for caching.
- Choose Create and attach.
Under Develop, choose Routes to inspect all routes.

Now your API Gateway route /oauth2/callback should be configured as shown in Figure 17.

Figure 17: API Gateway route configuration

4. Testing the OAuth2 authorizer

You did it! From your last test, you should still be authenticated. So, if you open the API Gateway Invoke URL in your browser, you’ll be greeted by your protected backend.

In case you are not authenticated anymore, you’ll have to follow the steps again from the section Testing the OAuth2 flow to authenticate.

When you inspect the HTTP request that your browser makes in the developer tools as shown in Figure 18, you can see that authentication works because the HttpOnly cookie is automatically attached to every request.

Figure 18: Browser requests include HttpOnly cookies

To verify that your authorizer Lambda function works correctly, paste the same Invoke URL you noted previously in an incognito window. Incognito windows do not share the cookie store with your browser session, so you see a {"message":"Forbidden"} error message with HTTP response code 403 – Forbidden.

Cleanup

Delete all unwanted resources to avoid incurring costs.

To delete the Amazon Cognito domain and user pool

In the Amazon Cognito console, choose your user pool name. For me, the name is MyUserPool.
Under the navigation tabs, choose App integration.
Under Domain, choose Actions, then choose Delete Cognito domain.
Confirm by entering your custom Amazon Cognito domain, and choose Delete.
Choose Delete user pool.
Confirm by entering your user pool name (in my case, MyUserPool), and then choose Delete.

To delete your API Gateway resource

In the API Gateway console, select your API name. For me, the name is MyApp.
Under Actions, choose Delete and confirm your deletion.

To delete the AWS Lambda functions

In the Lambda console, select all three of the Lambda functions you created.
Under Actions, choose Delete and confirm your deletion.

Bonus: Enabling authorizer caching

As mentioned earlier, you can enable authorizer caching to help improve your performance. When caching is enabled for an authorizer, API Gateway uses the authorizer’s identity sources as the cache key. If a client specifies the same parameters in identity sources within the configured Time to Live (TTL), then API Gateway uses the cached authorizer result, rather than invoking your Lambda function.

To enable caching, your authorizer must have at least one identity source. To cache by the cookie request header, you specify $request.header.cookie as the identity source. Be aware that caching will be affected if you pass along additional HttpOnly cookies apart from the access token.
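
The effect of extra cookies on the cache can be pictured with a toy model. The sketch below is not how API Gateway is implemented; it only assumes, as described above, that the whole cookie header acts as the cache key and that entries expire after the TTL. The class name, the 300-second TTL, and the cookie values are made up for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class AuthorizerCacheModel:
    """Toy model: cache authorizer results keyed by the full cookie header."""

    def __init__(self, authorize: Callable[[str], bool], ttl_seconds: float = 300.0):
        self._authorize = authorize
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, bool]] = {}
        self.invocations = 0  # how often the stand-in authorizer was actually called

    def check(self, cookie_header: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(cookie_header)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: the authorizer is not invoked
        self.invocations += 1
        result = self._authorize(cookie_header)
        self._entries[cookie_header] = (now, result)
        return result


if __name__ == "__main__":
    cache = AuthorizerCacheModel(lambda header: "accessToken=abc" in header)
    cache.check("accessToken=abc", now=0.0)
    cache.check("accessToken=abc", now=10.0)           # same header within the TTL
    cache.check("accessToken=abc; visit=2", now=20.0)  # header changed by another cookie
    print("authorizer invocations:", cache.invocations)
```

In this model, a second cookie whose value changes between requests produces a new cache key each time, so the authorizer runs again even though the access token is unchanged.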

For more information, see Working with AWS Lambda authorizers for HTTP APIs in the Amazon API Gateway Developer Guide.

Conclusion

In this blog post, you learned how to implement authentication by using HttpOnly cookies. You used Amazon API Gateway and AWS Lambda to persist and validate the HttpOnly cookies, and you used Amazon Cognito to issue OAuth2 access tokens. If you want to try an automated deployment of this solution with a script, see the api-gw-http-only-cookie-auth GitHub repository.

The application of this solution to protect your secrets from potential cross-site scripting (XSS) attacks is not limited to OAuth2. You can protect other kinds of tokens, sessions, or tracking IDs with HttpOnly cookies.

In this solution, you used Node.js for your Lambda functions to implement authentication. But HttpOnly cookies are widely supported by many programming frameworks. You can find more implementation options on the OWASP Secure Cookie Attribute page.

Although this blog post gives you a tutorial on how to implement HttpOnly cookie authentication in API Gateway, it may not meet all your security and functional requirements. Make sure to check your business requirements and talk to your stakeholders before you adopt techniques from this blog post.

Furthermore, it’s a good idea to continuously test your web application, so that cookies are only set with your approved security attributes. For more information, see the OWASP Testing for Cookies Attributes page.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon API Gateway re:Post or contact AWS Support.

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

2023-01-30 Praveen Allam

Post Syndicated from Praveen Allam original https://aws.amazon.com/blogs/big-data/handle-upsert-data-operations-using-open-source-delta-lake-and-aws-glue/

Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. The Delta Lake framework provides these two capabilities. In this post, we discuss how to handle UPSERTs (updates and inserts) of the operational data using natively integrated Delta Lake with AWS Glue, and query the Delta Lake using Amazon Athena.

We examine a hypothetical insurance organization that issues commercial policies to small- and medium-scale businesses. The insurance prices vary based on several criteria, such as where the business is located, business type, earthquake or flood coverage, and so on. This organization is planning to build a data analytical platform, and the insurance policy data is one of the inputs to this platform. Because the business is growing, hundreds and thousands of new insurance policies are being enrolled and renewed every month. Therefore, all this operational data needs to be sent to Delta Lake in near-real time so that the organization can perform various analytics, and build machine learning (ML) models to serve their customers in a more efficient and cost-effective way.

Solution overview

The data can originate from any source, but typically customers want to bring operational data to data lakes to perform data analytics. One of the solutions is to bring the relational data by using AWS Database Migration Service (AWS DMS). AWS DMS tasks can be configured to copy the full load as well as ongoing changes (CDC). The full load and CDC load can be brought into the raw and curated (Delta Lake) storage layers in the data lake. To keep it simple, in this post we opt out of the data sources and ingestion layer; the assumption is that the data is already copied to the raw bucket in the form of CSV files. An AWS Glue ETL job does the necessary transformation and copies the data to the Delta Lake layer. The Delta Lake layer ensures ACID compliance of the source data.

The following diagram illustrates the solution architecture.
Architecture diagram

The use case we use in this post is about a commercial insurance company. We use a simple dataset that contains the following columns:

Policy – Policy number, entered as text
Expiry – Date that policy expires
Location – Location type (Urban or Rural)
State – Name of state where property is located
Region – Geographic region where property is located
Insured Value – Property value
Business Type – Business use type for property, such as Farming or Retail
Earthquake – Is earthquake coverage included (Y or N)
Flood – Is flood coverage included (Y or N)

The dataset contains a sample of 25 insurance policies. In the case of a production dataset, it may contain millions of records.

policy_id,expiry_date,location_name,state_code,region_name,insured_value,business_type,earthquake,flood
200242,2023-01-02,Urban,NY,East,1617630,Retail,N,N
200314,2023-01-02,Urban,NY,East,8678500,Apartment,Y,Y
200359,2023-01-02,Rural,WI,Midwest,2052660,Farming,N,N
200315,2023-01-02,Urban,NY,East,17580000,Apartment,Y,Y
200385,2023-01-02,Urban,NY,East,1925000,Hospitality,N,N
200388,2023-01-04,Urban,IL,Midwest,12934500,Apartment,Y,Y
200358,2023-01-05,Urban,WI,Midwest,928300,Office Bldg,N,N
200264,2023-01-07,Rural,NY,East,2219900,Farming,N,N
200265,2023-01-07,Urban,NY,East,14100000,Apartment,Y,Y
100582,2023-03-25,Urban,NJ,East,4651680,Apartment,Y,Y
100487,2023-03-25,Urban,NY,East,5990067,Apartment,N,N
100519,2023-03-25,Rural,NY,East,4102500,Farming,N,N
100462,2023-03-25,Urban,NY,East,3400000,Construction,Y,Y
100486,2023-03-26,Urban,NY,East,9973900,Apartment,Y,Y
100463,2023-03-27,Urban,NY,East,15480000,Office Bldg,Y,Y
100595,2023-03-27,Rural,NY,East,2446600,Farming,N,N
100617,2023-03-27,Urban,VT,Northeast,8861500,Office Bldg,N,N
100580,2023-03-30,Urban,NH,Northeast,97920,Office Bldg,Y,Y
100581,2023-03-30,Urban,NY,East,5150000,Apartment,Y,Y
100475,2023-03-31,Rural,WI,Midwest,1451662,Farming,N,N
100503,2023-03-31,Urban,NJ,East,1761960,Office Bldg,N,N
100504,2023-03-31,Rural,NY,East,1649105,Farming,N,N
100616,2023-03-31,Urban,NY,East,2329500,Apartment,N,N
100611,2023-04-25,Urban,NJ,East,1595500,Office Bldg,Y,Y
100621,2023-04-25,Urban,MI,Central,394220,Retail,N,N

In the following sections, we walk through the steps to perform the Delta Lake UPSERT operations. We use the AWS Management Console to perform all the steps. However, you can also automate these steps using tools like AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), Terraform, and so on.

Prerequisites

This post is intended for architects, engineers, developers, and data scientists who design and build analytical solutions on AWS. We expect a basic understanding of the console, AWS Glue, Amazon Simple Storage Service (Amazon S3), and Athena. Additionally, we assume that you can create AWS Identity and Access Management (IAM) policies and roles, create and run AWS Glue jobs and crawlers, and work with the Athena query editor.

Use Athena query engine version 3 to query the Delta Lake tables later, in the section “Query the full load using Athena.”

Athena QE V3

Set up an S3 bucket for full and CDC load data feeds

To set up your S3 bucket, complete the following steps:

Log in to your AWS account and choose a Region nearest to you.
On the Amazon S3 console, create a new bucket. Make sure the name is unique (for example, delta-lake-cdc-blog-<some random number>).
Create the following folders:
1. $bucket_name/fullload – This folder is used for a one-time full load from the upstream data source
2. $bucket_name/cdcload – This folder is used for copying the upstream data changes
3. $bucket_name/delta – This folder holds the Delta Lake data files
Copy the sample dataset and save it on your local machine in a file called full-load.csv.
Upload the file using the Amazon S3 console into the folder $bucket_name/fullload.

s3 folders
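
A quick local check can confirm that full-load.csv has the expected header and the 25 data rows of the sample. The following sketch uses only the Python standard library; the function name and the choice of checks are assumptions, not part of the original walkthrough.

```python
import csv
import io
from typing import List

# Header of the sample dataset shown earlier in this post.
EXPECTED_COLUMNS = [
    "policy_id", "expiry_date", "location_name", "state_code", "region_name",
    "insured_value", "business_type", "earthquake", "flood",
]


def check_policy_csv(text: str) -> List[dict]:
    """Parse CSV text, verify the header, and return the rows as dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    for row in rows:
        # Both columns are declared as integers in the AWS Glue job schema.
        int(row["policy_id"])
        int(row["insured_value"])
    return rows
```

To use it, read the file contents (for example, with open("full-load.csv", newline="").read()), pass the text to check_policy_csv, and compare the length of the returned list with the expected 25 policies.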

Set up an IAM policy and role

In this section, we create an IAM policy for the S3 bucket access and a role for AWS Glue jobs to run, and also use the same role for querying the Delta Lake using Athena.

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Select the JSON tab and paste the following policy code. Replace {bucket_name} with the name of the bucket you created in the earlier step.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListingOfFolders",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::{bucket_name}"
            ]
        },
        {
            "Sid": "ObjectAccessInBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::{bucket_name}/*"
        }
    ]
}

Name the policy delta-lake-cdc-blog-policy and select Create policy.
On the IAM console, choose Roles in the navigation pane.
Choose Create role.
Select AWS Glue as your trusted entity and choose Next.
Select the policy you just created, along with two additional AWS managed policies:
1. delta-lake-cdc-blog-policy
2. AWSGlueServiceRole
3. CloudWatchFullAccess

Choose Next.
Give the role a name (for example, delta-lake-cdc-blog-role).

IAM role
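
If you prefer not to edit the policy JSON by hand, the placeholder can be filled in programmatically. The following is a minimal sketch that builds the same policy document for a given bucket name; the function name is an assumption, and the example bucket name is the one used in the crawler path later in this post.

```python
import json


def build_bucket_policy(bucket_name: str) -> str:
    """Return the S3 access policy from this section as a JSON string."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListingOfFolders",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Effect": "Allow",
                "Resource": [bucket_arn],
            },
            {
                "Sid": "ObjectAccessInBucket",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }
    return json.dumps(policy, indent=4)


if __name__ == "__main__":
    # Example bucket name; replace it with your own.
    print(build_bucket_policy("delta-lake-cdc-blog-123456789"))
```

The printed JSON can be pasted into the JSON tab of the policy editor.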

Set up AWS Glue jobs

In this section, we set up two AWS Glue jobs: one for full load and one for the CDC load. Let’s start with the full load job.

On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs. AWS Glue Studio opens in a new tab.
Select Spark script editor and choose Create.

Glue Studio Editor

In the script editor, replace the code with the following code snippet:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','s3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

#Define the table schema
schema = StructType() \
      .add("policy_id",IntegerType(),True) \
      .add("expiry_date",DateType(),True) \
      .add("location_name",StringType(),True) \
      .add("state_code",StringType(),True) \
      .add("region_name",StringType(),True) \
      .add("insured_value",IntegerType(),True) \
      .add("business_type",StringType(),True) \
      .add("earthquake_coverage",StringType(),True) \
      .add("flood_coverage",StringType(),True) 

# Read the full load
sdf = spark.read.format("csv").option("header",True).schema(schema).load("s3://"+ args['s3_bucket']+"/fullload/")
sdf.printSchema()

# Write data as DELTA TABLE
sdf.write.format("delta").mode("overwrite").save("s3://"+ args['s3_bucket']+"/delta/insurance/")

Navigate to the Job details tab.
Provide a name for the job (for example, Full-Load-Job).
For IAM Role, choose the role delta-lake-cdc-blog-role that you created earlier.
For Worker type, choose G 2X.
For Job bookmark, choose Disable.
Set Number of retries to 0.
Under Advanced properties, keep the default values, but provide the delta core JAR file path for Python library path and Dependent JARs path.
Under Job parameters:
1. Add the key --s3_bucket with the bucket name you created earlier as the value.
2. Add the key --datalake-formats with the value delta.
Keep the remaining default values and choose Save.

Job details

Now let’s create the CDC load job.

Create a second job called CDC-Load-Job.
Follow the steps on the Job details tab as with the previous job.
Alternatively, you can choose the Clone job option from Full-Load-Job, which carries over all the job details from the full load job.
In the script editor, enter the following code snippet for the CDC logic:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import expr

## For Delta lake
from delta.tables import DeltaTable


## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','s3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

# Read the CDC load
cdc_df = spark.read.csv("s3://"+ args['s3_bucket']+"/cdcload")
cdc_df.show(5,True)

# now read the full load (latest data) as delta table
delta_df = DeltaTable.forPath(spark, "s3://"+ args['s3_bucket']+"/delta/insurance/")
delta_df.toDF().show(5,True)

# UPSERT process: if a row matches the condition, update it; otherwise, insert it.
# If there is no keyword, create a dataset with Insert, Update, and Delete flags and handle each separately.
# Deletes have to run in a loop with a delete condition; this script does not handle deletes.
    
final_df = delta_df.alias("prev_df").merge( \
source = cdc_df.alias("append_df"), \
#matching on primarykey
condition = expr("prev_df.policy_id = append_df._c1"))\
.whenMatchedUpdate(set= {
    "prev_df.expiry_date"           : col("append_df._c2"), 
    "prev_df.location_name"         : col("append_df._c3"),
    "prev_df.state_code"            : col("append_df._c4"),
    "prev_df.region_name"           : col("append_df._c5"), 
    "prev_df.insured_value"         : col("append_df._c6"),
    "prev_df.business_type"         : col("append_df._c7"),
    "prev_df.earthquake_coverage"   : col("append_df._c8"), 
    "prev_df.flood_coverage"        : col("append_df._c9")} )\
.whenNotMatchedInsert(values =
#inserting a new row to Delta table
{   "prev_df.policy_id"             : col("append_df._c1"),
    "prev_df.expiry_date"           : col("append_df._c2"), 
    "prev_df.location_name"         : col("append_df._c3"),
    "prev_df.state_code"            : col("append_df._c4"),
    "prev_df.region_name"           : col("append_df._c5"), 
    "prev_df.insured_value"         : col("append_df._c6"),
    "prev_df.business_type"         : col("append_df._c7"),
    "prev_df.earthquake_coverage"   : col("append_df._c8"), 
    "prev_df.flood_coverage"        : col("append_df._c9")
})\
.execute()
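
The merge above boils down to "update the row if the policy_id already exists, otherwise insert it." The following sketch mimics that behavior with a plain Python dictionary so you can reason about the outcome without a Spark session. It is an illustration only, not the Delta Lake implementation; the function name is an assumption, and the 110001 row with its I flag is a made-up insert record. As in the CDC feed, the first field is the operation flag (_c0) and the second is the policy ID (_c1).

```python
from typing import Dict, Iterable, List

COLUMNS = [
    "policy_id", "expiry_date", "location_name", "state_code", "region_name",
    "insured_value", "business_type", "earthquake_coverage", "flood_coverage",
]


def apply_upserts(table: Dict[str, dict], cdc_lines: Iterable[str]) -> Dict[str, dict]:
    """Update rows whose policy_id exists in the table; insert the others."""
    for line in cdc_lines:
        fields: List[str] = line.strip().split(",")
        # fields[0] is the operation flag; the remaining fields map to the table columns.
        record = dict(zip(COLUMNS, fields[1:]))
        table[record["policy_id"]] = record  # existing key: update; new key: insert
    return table


if __name__ == "__main__":
    # One row from the full load, keyed by policy_id.
    full_load_row = "100462,2023-03-25,Urban,NY,East,3400000,Construction,Y,Y"
    table = {"100462": dict(zip(COLUMNS, full_load_row.split(",")))}
    cdc = [
        "U,100462,2024-12-31,Urban,NY,East,3400000,Construction,Y,Y",
        "I,110001,2024-03-31,Urban,NY,East,210000,Office Bldg,N,N",  # hypothetical row
    ]
    apply_upserts(table, cdc)
    print(table["100462"]["expiry_date"], len(table))
```

Like the AWS Glue script, this sketch ignores the operation flag and decides between update and insert purely from the key match, so it does not handle deletes either.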

Run the full load job

On the AWS Glue console, open Full-Load-Job and choose Run. The job takes about 2 minutes to complete, and the job run status changes to Succeeded. Go to $bucket_name and open the delta folder, which contains the insurance folder. You can see the Delta Lake files in it.

Delta location on S3

Create and run the AWS Glue crawler

In this step, we create an AWS Glue crawler with Delta Lake as the data source type. After successfully running the crawler, we inspect the data using Athena.

On the AWS Glue console, choose Crawlers in the navigation pane.
Choose Create crawler.
Provide a name (for example, delta-lake-crawler) and choose Next.
Choose Add a data source and choose Delta Lake as your data source.
Copy your delta folder URI (for example, s3://delta-lake-cdc-blog-123456789/delta/insurance) and enter the Delta Lake table path location.
Keep the default selection Create Native tables, and choose Add a Delta Lake data source.
Choose Next.
Choose the IAM role you created earlier, then choose Next.
Select the default target database, and provide delta_ for the table name prefix. If no default database exists, you can create one.
Choose Next.
Choose Create crawler.
Run the newly created crawler. After the crawler is complete, the delta_insurance table is available under Databases/Tables.
Open the table to check the table overview.

You can observe nine columns and their data types.

Glue table

Query the full load using Athena

In the earlier step, we created the delta_insurance table by running a crawler against the Delta Lake location. In this section, we query the delta_insurance table using Athena. Note that if you’re using Athena for the first time, set the query output folder to store the Athena query results (for example, s3://<your-s3-bucket>/query-output/).

On the Athena console, open the query editor.
Keep the default selections for Data source and Database.
Run the query SELECT * FROM delta_insurance;. This query returns a total of 25 rows, the same as what was in the full load data feed.
For the CDC comparison, run the following query and store the results in a location where you can compare these results later:

SELECT * FROM delta_insurance
WHERE policy_id IN (100462,100463,100475,110001,110002)
order by policy_id;

The following screenshot shows the Athena query result.

Query results from full load

Upload the CDC data feed and run the CDC job

In this section, we update three insurance policies and insert two new policies.

Copy the following insurance policy data and save it locally as cdc-load.csv:

U,100462,2024-12-31,Urban,NY,East,3400000,Construction,Y,Y
U,100463,2023-03-27,Urban,NY,East,1000000,Office Bldg,Y,Y
U,100475,2023-03-31,Rural,WI,Midwest,1451662,Farming,N,Y
I,110001,2024-03-31,Urban,CA,WEST,210000,Office Bldg,N,N
I,110002,2024-03-31,Rural,FL,East,975000,Retail,N,Y

The first column in the CDC feed describes the UPSERT operations. U is for updating an existing record, and I is for inserting a new record.

Upload the cdc-load.csv file to the $bucket_name/cdcload/ folder.
On the AWS Glue console, run CDC-Load-Job. This job takes care of updating the Delta Lake accordingly.

The change details are as follows:

100462 – Expiry date changes to 12/31/2024
100463 – Insured value changes to 1 million
100475 – This policy is now under a new flood zone
110001 and 110002 – New policies added to the table
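
The effect of the CDC job on the table can be illustrated without Spark. The following is a minimal sketch in plain Python, not the AWS Glue job itself: it keys rows by policy_id and overwrites or inserts them, which is roughly what the whenMatchedUpdate and whenNotMatchedInsert clauses do. The apply_cdc helper and the pre-existing row are hypothetical and used only for illustration.

```python
import csv
import io

# Column order assumed from the CDC feed: an operation code, then the nine table columns.
COLUMNS = [
    "policy_id", "expiry_date", "location_name", "state_code", "region_name",
    "insured_value", "business_type", "earthquake_coverage", "flood_coverage",
]


def apply_cdc(table, cdc_text):
    """Apply U (update) and I (insert) rows to a dict keyed by policy_id."""
    for row in csv.reader(io.StringIO(cdc_text)):
        if not row:
            continue
        op, values = row[0], row[1:]
        if op not in ("U", "I"):
            raise ValueError(f"unsupported CDC operation: {op}")
        record = dict(zip(COLUMNS, values))
        # A matching key is overwritten and a new key is inserted, so both
        # operation codes reduce to the same assignment in this sketch.
        table[record["policy_id"]] = record
    return table


cdc_feed = """U,100462,2024-12-31,Urban,NY,East,3400000,Construction,Y,Y
I,110001,2024-03-31,Urban,CA,WEST,210000,Office Bldg,N,N
"""

# Hypothetical existing row standing in for the full load (the old expiry date is made up).
table = {
    "100462": dict(zip(COLUMNS, ["100462", "2023-01-01", "Urban", "NY", "East",
                                 "3400000", "Construction", "Y", "Y"]))
}
apply_cdc(table, cdc_feed)
```

After the call, policy 100462 carries the new expiry date and policy 110001 exists as a new row, which is the same outcome the Athena query is expected to show for the real table.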

Run the query again:

SELECT * FROM delta_insurance
WHERE policy_id IN (100462, 100463,100475,110001,110002)
order by policy_id;

As shown in the following screenshot, the changes in the CDC data feed are reflected in the Athena query results.

Clean up

In this solution, we used only managed services, and there is no cost if AWS Glue jobs aren’t running. However, if you want to clean up the resources, you can delete the two AWS Glue jobs, the AWS Glue table, and the S3 bucket.

Conclusion

Organizations are continuously looking for high-performance, cost-effective, and scalable analytical solutions to extract value from their operational data sources in near-real time. The analytical platform should be ready to receive changes in the operational data as soon as they occur. Typical data lake solutions struggle to handle changes in source data; the Delta Lake framework can close this gap. This post demonstrated how to build data lakes for UPSERT operations using AWS Glue and native Delta Lake tables, and how to query AWS Glue tables from Athena. You can implement large-scale UPSERT data operations using AWS Glue and Delta Lake, and perform analytics using Amazon Athena.

About the Authors

Praveen Allam is a Solutions Architect at AWS. He helps customers design scalable, cost-performant, enterprise-grade applications using the AWS Cloud. He builds solutions to help organizations make data-driven decisions.

Vivek Singh is a Senior Solutions Architect with the AWS Data Lab team. He helps customers unblock their data journey on the AWS ecosystem. His interest areas are data pipeline automation, data quality and data governance, data lakes, and lake house architectures.

Deliver Operational Insights to Atlassian Opsgenie using DevOps Guru

2023-01-27 Brendan Jenkins

Post Syndicated from Brendan Jenkins original https://aws.amazon.com/blogs/devops/deliver-operational-insights-to-atlassian-opsgenie-using-devops-guru/

As organizations continue to grow and scale their applications, the need for teams to be able to quickly and autonomously detect anomalous operational behaviors becomes increasingly important. Amazon DevOps Guru offers a fully managed AIOps service that enables you to improve application availability and resolve operational issues quickly. DevOps Guru helps ease this process by leveraging machine learning (ML) powered recommendations to detect operational insights, identify the exhaustion of resources, and provide suggestions to remediate issues. Many organizations running business critical applications use different tools to be notified about anomalous events in real-time for the remediation of critical issues. Atlassian is a modern team collaboration and productivity software suite that helps teams organize, discuss, and complete shared work. You can deliver these insights in near-real time to DevOps teams by integrating DevOps Guru with Atlassian Opsgenie. Opsgenie is a modern incident management platform that receives alerts from your monitoring systems and custom applications and categorizes each alert based on importance and timing.

This blog post walks you through how to integrate Amazon DevOps Guru with Atlassian Opsgenie to receive notifications for new operational insights detected by DevOps Guru with more flexibility and customization using Amazon EventBridge and AWS Lambda. The Lambda function will be used to demonstrate how to customize insights sent to Opsgenie.

Solution overview

Figure 1: Amazon EventBridge Integration with Opsgenie using AWS Lambda

Amazon DevOps Guru directly integrates with Amazon EventBridge to notify you of events relating to generated insights and updates to insights. To begin routing these notifications to Opsgenie, you can configure routing rules to determine where to send notifications. As outlined below, you can also use predefined DevOps Guru patterns to only send notifications or trigger actions that match that pattern. You can select any of the following predefined patterns supported by DevOps Guru to filter events and trigger actions in a supported AWS resource:

DevOps Guru New Insight Open
DevOps Guru New Anomaly Association
DevOps Guru Insight Severity Upgraded
DevOps Guru New Recommendation Created
DevOps Guru Insight Closed

By default, the patterns referenced above are enabled, so we will leave all patterns operational in this implementation. However, you have the flexibility to choose which of these patterns to send to Opsgenie. When EventBridge receives an event, the EventBridge rule matches it and sends it to a target, such as AWS Lambda, which processes the insight and sends it to Opsgenie.

Prerequisites

The following prerequisites are required for this walkthrough:

An AWS Account
An Opsgenie Account
Maven
AWS Command Line Interface (CLI)
AWS Serverless Application Model (SAM) CLI
Create a team and add members within your Opsgenie Account
AWS Cloud9 is recommended to create an environment to get access to the AWS Serverless Application Model (SAM) CLI or AWS Command Line Interface (CLI) from a bash terminal.

Push Insights using Amazon EventBridge & AWS Lambda

In this tutorial, you will perform the following steps:

Create an Opsgenie integration
Launch the SAM template to deploy the solution
Test the solution

Create an Opsgenie integration

In this step, you will navigate to Opsgenie to create the integration with DevOps Guru and to obtain the API key and team name within your account. These parameters will be used as inputs in a later section of this blog.

Navigate to Teams, and take note of the team name you have as shown below, as you will need this parameter in a later section.

Figure 2: Opsgenie team names

Click on the team to proceed and navigate to Integrations on the left-hand pane. Click on Add Integration and select the Amazon DevOps Guru option.

Figure 3: Integration option for DevOps Guru

Now, scroll down and take note of the API Key for this integration and copy it to your notes as it will be needed in a later section. Click Save Integration at the bottom of the page to proceed.

Figure 4: API Key for DevOps Guru Integration

Now, the Opsgenie integration has been created and we’ve obtained the API key and team name. The email of any team member will be used in the next section as well.

Review & launch the AWS SAM template to deploy the solution

In this step, you will review and launch the SAM template. The template will deploy an AWS Lambda function that is triggered by an Amazon EventBridge rule when Amazon DevOps Guru generates a new event. The Lambda function will retrieve the parameters obtained from the deployment and push the events to Opsgenie via an API.

Reviewing the template

Below is the SAM template that will be deployed in the next step. This template launches a few key components specified earlier in the blog. The Transform section of the template takes an entire template written in the AWS Serverless Application Model (AWS SAM) syntax and transforms and expands it into a compliant CloudFormation template. Under the Resources section, this solution will deploy an AWS Lambda function using the Java runtime as well as an Amazon EventBridge Rule/Pattern. Another key aspect of the template is the Parameters section. As shown below, ApiKey, Email, and TeamName are the parameters we will use for this CloudFormation template. They are then used as environment variables for our Lambda function to pass to Opsgenie.

Figure 5: Review of SAM Template
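
To make the data flow concrete, the following is a rough sketch in Python of what such a function does; the solution itself uses the Java runtime, so this is an illustration under stated assumptions rather than the deployed code. The environment variable name TEAM_NAME, the helper build_opsgenie_alert, and the choice of which event fields go into the alert are all hypothetical. The message, description, responders, and source fields follow the general shape of an Opsgenie alert request as we understand it.

```python
import json
import os


def build_opsgenie_alert(event, team_name):
    """Build an alert request body from a DevOps Guru EventBridge event."""
    detail = event.get("detail", {})
    return {
        # The detail-type (for example, "DevOps Guru New Insight Open") becomes the title.
        "message": event.get("detail-type", "DevOps Guru event"),
        # The full detail section is passed along as the description.
        "description": json.dumps(detail),
        # Route the alert to the team supplied at deployment time.
        "responders": [{"name": team_name, "type": "team"}],
        "source": event.get("source", "aws.devops-guru"),
    }


def handler(event, context):
    # TEAM_NAME is an assumed environment variable name for this sketch.
    team_name = os.environ.get("TEAM_NAME", "AIOps_Team")
    body = build_opsgenie_alert(event, team_name)
    # A real function would now send `body` to Opsgenie, authenticating with
    # the API key from the integration; the network call is omitted here.
    return body


sample_event = {
    "source": "aws.devops-guru",
    "detail-type": "DevOps Guru New Insight Open",
    "detail": {"insightSeverity": "high"},
}
alert = build_opsgenie_alert(sample_event, "AIOps_Team")
```

Keeping the payload construction in a pure function separate from the handler makes it easy to test the mapping without calling Opsgenie.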

Launching the Template

Navigate to the directory of choice within a terminal and clone the GitHub repository with the following command:

git clone https://github.com/aws-samples/amazon-devops-guru-connector-opsgenie.git

Change directories with the command below to navigate to the directory of the SAM template.

cd amazon-devops-guru-connector-opsgenie/OpsGenieServerlessTemplate

From the CLI, use AWS SAM to build and process your AWS SAM template file, application code, and any applicable language-specific files and dependencies.

sam build

From the CLI, use AWS SAM to deploy the AWS resources for the pattern as specified in the template.yml file.

sam deploy --guided

You will now be prompted to enter the following information. Use the information obtained from the previous section to enter the Parameter ApiKey, Parameter Email, and Parameter TeamName fields.

Stack Name
AWS Region
Parameter ApiKey
Parameter Email
Parameter TeamName
Allow SAM CLI IAM Role Creation

Test the solution

Follow this blog to enable DevOps Guru and generate an operational insight.
When DevOps Guru detects a new insight, it will generate an event in EventBridge. EventBridge then triggers Lambda and sends the event to Opsgenie as shown below.

Figure 6: Event Published to Opsgenie with details such as the source, alert type, insight type, and a URL to the insight in the AWS console.

Cleaning up

To avoid incurring future charges, delete the resources.

Delete resources deployed from this blog.
From the command line, use AWS SAM to delete the serverless application along with its dependencies.

sam delete

Customizing Insights published using Amazon EventBridge & AWS Lambda

The foundation of the DevOps Guru and Opsgenie integration is based on Amazon EventBridge and AWS Lambda which allows you the flexibility to implement several customizations. An example of this would be the ability to generate an Opsgenie alert when a DevOps Guru insight severity is high. Another example would be the ability to forward appropriate notifications to the AIOps team when there is a serverless-related resource issue or forwarding a database-related resource issue to your DBA team. This section will walk you through how these customizations can be done.

EventBridge customization

EventBridge rules can be used to select specific events by using event patterns. As detailed below, you can trigger the Lambda function only if a new insight is opened and the severity is high. The advantage of this kind of customization is that the Lambda function will only be invoked when needed.

{
  "source": [
    "aws.devops-guru"
  ],
  "detail-type": [
    "DevOps Guru New Insight Open"
  ],
  "detail": {
    "insightSeverity": [
      "high"
    ]
  }
}
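
To see why this pattern only lets high-severity new insights through, the following simplified matcher in Python may help. It is a sketch that covers only the exact-value matching used in the pattern above (each pattern field lists the allowed values, and nested objects are matched recursively). It is not the EventBridge implementation, and the matches function and sample events are hypothetical.

```python
def matches(pattern, event):
    """Return True if every field named in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object that also matches.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values for this field.
            return False
    return True


pattern = {
    "source": ["aws.devops-guru"],
    "detail-type": ["DevOps Guru New Insight Open"],
    "detail": {"insightSeverity": ["high"]},
}

high_event = {
    "source": "aws.devops-guru",
    "detail-type": "DevOps Guru New Insight Open",
    "detail": {"insightSeverity": "high", "insightId": "example-id"},
}
medium_event = {
    "source": "aws.devops-guru",
    "detail-type": "DevOps Guru New Insight Open",
    "detail": {"insightSeverity": "medium"},
}

high_matches = matches(pattern, high_event)
medium_matches = matches(pattern, medium_event)
```

Fields that the pattern does not mention, such as the made-up insightId here, are ignored, so the event can carry extra data without affecting the match.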

Applying EventBridge customization

Open the file template.yaml reviewed in the previous section and implement the changes as highlighted below under the Events section within Resources (original file on the left, changes on the right-hand side).

Figure 7: CloudFormation template file changed so that the EventBridge rule is only triggered when the alert type is “DevOps Guru New Insight Open” and insightSeverity is “high”.

Save the changes and use the following command to apply them:

sam deploy --template-file template.yaml

Accept the changeset deployment

Determining the Ops team based on the resource type

Another customization would be to change the Lambda code to route and control how alerts will be managed. Let’s say you want to get your DBA team involved whenever DevOps Guru raises an insight related to an Amazon RDS resource. You can do this by changing the AlertType Java class.

To begin this customization of the Lambda code, the following changes need to be made within the AlertType.java file:

At the beginning of the file, the standard java.util.List and java.util.ArrayList packages were imported
Line 60: created a list of CloudWatch metrics namespaces
Line 74: Assigned the dataIdentifiers JsonNode to the variable dataIdentifiersNode
Line 75: Assigned the namespace JsonNode to a variable namespaceNode
Line 77: Added the namespace to the list for each DevOps Insight which is always raised as an EventBridge event with the structure detail►anomalies►0►sourceDetails►0►dataIdentifiers►namespace
Line 88: Assigned the default responder team to the variable defaultResponderTeam
Line 89: Created the list of responders and assigned it to the variable respondersTeam
Line 92: Check if there is at least one AWS/RDS namespace
Line 93: Assigned the DBAOps_Team to the variable dbaopsTeam
Line 93: Included the DBAOps_Team team as part of the responders list
Line 97: Set the OpsGenie request teams to be the responders list

Figure 8: java.util.List and java.util.ArrayList packages were imported

Figure 9: AlertType Java class customized to include DBAOps_Team for RDS-related DevOps Guru insights.
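
The routing logic described in the list above can be sketched as follows. This is a Python illustration of the idea, not the Java code from the repository: the function names are hypothetical, and the sample event only includes the fields along the detail, anomalies, sourceDetails, dataIdentifiers, namespace path mentioned above.

```python
def collect_namespaces(event):
    """Gather the CloudWatch metric namespaces referenced by an insight event."""
    namespaces = []
    for anomaly in event.get("detail", {}).get("anomalies", []):
        for source in anomaly.get("sourceDetails", []):
            namespace = source.get("dataIdentifiers", {}).get("namespace")
            if namespace:
                namespaces.append(namespace)
    return namespaces


def responders_for(event, default_team="AIOps_Team", dba_team="DBAOps_Team"):
    """Always include the default team; add the DBA team for RDS-related insights."""
    responders = [default_team]
    if "AWS/RDS" in collect_namespaces(event):
        responders.append(dba_team)
    return responders


rds_event = {
    "detail": {
        "anomalies": [
            {"sourceDetails": [{"dataIdentifiers": {"namespace": "AWS/RDS"}}]}
        ]
    }
}
lambda_event = {
    "detail": {
        "anomalies": [
            {"sourceDetails": [{"dataIdentifiers": {"namespace": "AWS/Lambda"}}]}
        ]
    }
}

rds_responders = responders_for(rds_event)
lambda_responders = responders_for(lambda_event)
```

The same structure extends to other teams by mapping additional namespaces to additional responder names.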

You then need to generate the jar file by using the mvn clean package command.

The function needs to be updated with the following commands:

FUNCTION_NAME=$(aws lambda list-functions --query 'Functions[?contains(FunctionName, `DevOps-Guru`) == `true`].FunctionName' --output text)

aws lambda update-function-code --region us-east-1 --function-name $FUNCTION_NAME --zip-file fileb://target/Functions-1.0.jar

As a result, the DBAOps_Team will be assigned to the Opsgenie alert when a DevOps Guru insight is related to RDS.

Figure 10: Opsgenie alert assigned to both DBAOps_Team and AIOps_Team.

Conclusion

In this post, you learned how Amazon DevOps Guru integrates with Amazon EventBridge and publishes insights to Opsgenie using AWS Lambda. By creating an Opsgenie integration with DevOps Guru, you can now leverage Opsgenie’s strengths in incident management, team communication, and collaboration when responding to an insight. All of the insight data can be viewed and addressed in Opsgenie’s Incident Command Center (ICC). By customizing the data sent to Opsgenie via Lambda, you can empower your organization even more by fine-tuning and displaying the most relevant data, thus decreasing the MTTR (mean time to resolve) of the responding operations team.

About the authors:

Super-charged pivot tables in Amazon QuickSight

2023-01-25 Bhupinder Chadha

Post Syndicated from Bhupinder Chadha original https://aws.amazon.com/blogs/big-data/super-charged-pivot-tables-in-amazon-quicksight/

Amazon QuickSight is a fast and cloud-powered business intelligence (BI) service that makes it easy to create and deliver insights to everyone in your organization without any servers or infrastructure. QuickSight dashboards can also be embedded into applications and portals to deliver insights to external stakeholders. Additionally, with Amazon QuickSight Q, end-users can simply ask questions in natural language to get machine learning (ML)-powered visual responses to their questions.

Recently, Amazon FinTech migrated all their financial reporting to QuickSight. This involved migrating complex tables and pivot tables, helping them slice and dice large datasets and deliver pixel-perfect views of their data to their stakeholders. Amazon FinTech, like all QuickSight customers, needs fast performance on very large pivot tables in order to drive adoption of their dashboards. We have specifically launched two new features focused on scaling our pivot tables with the following improvements:

Faster loading of pivot tables during expand and collapse operations
Increased field limits for rows, columns, and values

In this post, we discuss these improvements to pivot tables in QuickSight.

Blazing fast pivot tables during expand and collapse operations

Today, QuickSight pivot tables work as an infinite load. As users scroll vertically or horizontally on the visual, new queries are run to fetch additional rows and columns of data with fixed row and column configurations for every query request.

For example, in the following table, we would load all carrier/city combinations nested under Dec 7, 2014 before we can continue querying the next date. Let’s say we have more than 500 carrier/city rows for a specific date; this will take more than a single query to get to the next date. The count of queries run depends on the cardinality of the dimension used in the pivot table.

In the following example of a collapsed pivot table, since the reader doesn’t see anything beyond the flight dates, having all carrier/city rows doesn’t change what is actively displayed on the pivot table. Even though individual SQL queries can be fast, users can perceive this table to load slowly due to the sheer number of queries being fired to load the hidden (collapsed) data. Therefore, loading every single row up to the Destination City field isn’t very useful when the pivot table is in the collapsed state.

To make our pivot tables load faster, we now only fetch the data for visible fields (expanded fields) along with a small subset of values under the collapsed field. This makes sure that data fetched in every new query is used to render new values that can be displayed immediately. We have seen customers’ load times improve by 2–10 times, depending on the complexity of their dataset.

This new behavior is automatically enabled, without requiring users to do anything on their side. Please note that while we plan to support all kinds of pivot tables to use this optimization, our current rollout only includes pivot tables with only row or only column fields not sorted by any metric.

Increased field limits for pivot tables

With the ever-growing depth and granularity of data being collected, our customers asked us to increase the number of fields and data points they can display in their visuals. We have been actively listening to your needs, and just like supporting more data points in line charts, we now are increasing our field limits for pivot tables.

The value field well limits have been increased from 20 to 40, and rows and columns have been increased from 20 each to a combined limit of 40. For example, if the user has 34 fields in rows, then they can add up to 6 fields to the column field well.
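
The arithmetic behind the new limits can be written down as a small check. This is only an illustration of the numbers stated above, in Python; the constant and function names are hypothetical and are not part of any QuickSight API.

```python
VALUE_FIELD_LIMIT = 40          # limit for the value field well
ROW_COLUMN_COMBINED_LIMIT = 40  # rows and columns share this limit


def remaining_column_fields(row_fields):
    """Return how many column fields can still be added for a given row count."""
    if not 0 <= row_fields <= ROW_COLUMN_COMBINED_LIMIT:
        raise ValueError("row field count is outside the combined limit")
    return ROW_COLUMN_COMBINED_LIMIT - row_fields


def within_limits(rows, columns, values):
    """Check a pivot table configuration against both limits."""
    return (rows + columns <= ROW_COLUMN_COMBINED_LIMIT
            and values <= VALUE_FIELD_LIMIT)


columns_left = remaining_column_fields(34)
```

With 34 row fields, columns_left is 6, matching the example in the text. A configuration such as 34 rows and 7 columns would exceed the combined limit.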

This will help unblock use cases requiring increased limits such as:

Metrics reporting – Monthly and weekly business reporting often requires having dozens of metrics presented in tabular formats. With the updated limits, you can display detailed, robust financial reports in a single pivot table rather than having to split it across multiple pivot tables.
Migration from legacy BI and reporting tools – Existing reports in these legacy systems require displaying and slicing across a large number of row hierarchies, for example a cost center expense analysis.
Custom use cases – These are specific industry and organization use cases where you can add dozens of values and row fields to display additional attributes. For example, a customer 360 report sliced by different regions.

As soon as you hit the limit, you receive an error message to indicate that the limit has been reached for that field well. For more details, refer to here.

Get started and stay updated!

Learn more about our new features in our newly launched QuickSight community’s Announcement section and supercharge your dashboards with the latest features from QuickSight!

About the authors

Bhupinder Chadha is a senior product manager for Amazon QuickSight focused on visualization and front-end experiences. He is passionate about BI, data visualization, and low-code/no-code experiences. Prior to QuickSight, he was the lead product manager for Inforiver, responsible for building an enterprise BI product from the ground up. Bhupinder started his career in presales, followed by a small gig in consulting and then PM for xViz, an add-on visualization product.

Igal Mizrahi is a Senior Software Engineer on the AWS QuickSight Charting team. He has been part of the team for the past 3 years, and previously worked on Amazon’s mobile shopping application for 4 years.

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

2023-01-24 Vivek Shrivastava

Post Syndicated from Vivek Shrivastava original https://aws.amazon.com/blogs/big-data/build-a-multi-region-and-highly-resilient-modern-data-architecture-using-aws-glue-and-aws-lake-formation/

AWS Lake Formation helps with enterprise data governance and is important for a data mesh architecture. It works with the AWS Glue Data Catalog to enforce data access and governance. Both services provide reliable data storage, but some customers want replicated storage, catalog, and permissions for compliance purposes.

This post explains how to create a design that automatically backs up Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and Lake Formation permissions in different Regions and provides backup and restore options for disaster recovery. These mechanisms can be customized for your organization’s processes. The utility for cloning and experimentation is available in the open-sourced GitHub repository.

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process. This ensures that the data lake will still be functional in another Region if Lake Formation has an availability issue. The Data Catalog setup (tables, databases, resource links) and Lake Formation setup (permissions, settings) must also be replicated in the backup Region.

Solution overview

This post shows how to create a backup of the Lake Formation permissions and AWS Glue Data Catalog from one Region to another in the same account. The solution doesn’t create or modify AWS Identity and Access Management (IAM) roles, which are available in all Regions. There are three steps to creating a multi-Region data lake:

Migrate Lake Formation data permissions.
Migrate AWS Glue databases and tables.
Migrate Amazon S3 data.

In the following sections, we look at each migration step in more detail.

Lake Formation permissions

In Lake Formation, there are two types of permissions: metadata access and data access.

Metadata access permissions allow users to create, read, update, and delete metadata databases and tables in the Data Catalog.

Data access permissions allow users to read and write data to specific locations in Amazon S3. Data access permissions are managed using data location permissions, which allow users to create and alter metadata databases and tables that point to specific Amazon S3 locations.

When data is migrated from one Region to another, only the metadata access permissions are replicated. This means that if data is moved from a bucket in the source Region to another bucket in the target Region, the data access permissions need to be reapplied in the target Region.

AWS Glue Data Catalog

The AWS Glue Data Catalog is a central repository of metadata about data stored in your data lake. It contains references to data that is used as sources and targets in AWS Glue ETL (extract, transform, and load) jobs, and stores information about the location, schema, and runtime metrics of your data. The Data Catalog organizes this information in the form of metadata tables and databases. A table in the Data Catalog is a metadata definition that represents the data in a data lake, and databases are used to organize these metadata tables.

Lake Formation permissions can only be applied to objects that already exist in the Data Catalog in the target Region. Therefore, in order to apply these permissions, the underlying Data Catalog databases and tables must already exist in the target Region. To meet this requirement, this utility migrates both the AWS Glue databases and tables from the source Region to the target Region.

Amazon S3 data

The data that underlies an AWS Glue table can be stored in an S3 bucket in any Region, so replication of the data itself isn’t necessary. However, if the data has already been replicated to the target Region, this utility has the option to update the table’s location to point to the replicated data in the target Region. If the location of the data is changed, the utility updates the S3 bucket name and keeps the rest of the prefix hierarchy unchanged.
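
The location update described above (swap the bucket name, keep the prefix) can be sketched in a few lines of Python. This is an illustration of the idea rather than the utility's actual code; the relocate function and the bucket names are hypothetical.

```python
from urllib.parse import urlparse


def relocate(location, target_bucket):
    """Replace the bucket in an S3 URI and keep the rest of the prefix unchanged."""
    parsed = urlparse(location)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 location: {location}")
    # parsed.path holds everything after the bucket name, including the leading slash.
    return f"s3://{target_bucket}{parsed.path}"


new_location = relocate("s3://source-bucket/sales/year=2023/", "target-bucket")
```

Here new_location becomes s3://target-bucket/sales/year=2023/, so partition prefixes under the table location stay valid as long as the replicated bucket mirrors the source layout.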

This utility doesn’t include the migration of data from the source Region to the target Region. Data migration must be performed separately using methods such as S3 replication, S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication.

This utility has two modes for replicating Lake Formation and Data Catalog metadata: on-demand and real-time. The on-demand mode is a batch replication that takes a snapshot of the metadata at a specific point in time and uses it to synchronize the metadata. The real-time mode replicates changes made to the Lake Formation permissions or Data Catalog in near-real time.

The on-demand mode of this utility is recommended for replicating existing Lake Formation permissions and Data Catalogs because it replicates a snapshot of the metadata. After the Lake Formation permissions and Data Catalogs are synchronized, you can use real-time mode to replicate any ongoing changes. This creates a mirror image of the source Region in the target Region and keeps it up to date as changes are made in the source Region. These two modes can be used independently of each other, and the operations are idempotent.

The code for the on-demand and real-time modes is available in the GitHub repository. Let’s look at each mode in more detail.

On-demand mode

On-demand mode is used to copy the Lake Formation permissions and Data Catalog at a specific point in time. The code is deployed using the AWS Cloud Development Kit (AWS CDK). The following diagram shows the solution architecture for this mode.

The AWS CDK deploys an AWS Glue job to perform the replication. The job retrieves configuration information from a file stored in an S3 bucket. This file includes details such as the source and target Regions, an optional list of databases to replicate, and options for moving data to a different S3 bucket. More information about these options and deployment instructions is available in the GitHub repository.

The AWS Glue job retrieves the Lake Formation permissions and Data Catalog object metadata from the source Region and stores it in a JSON file in an S3 bucket. The same job then uses this file to create the Lake Formation permissions and Data Catalog databases and tables in the target Region.

This tool can be run on demand by running the AWS Glue job. It copies the Lake Formation permissions and Data Catalog object metadata from the source Region to the target Region. If you run the tool again after making changes to the target Region, the changes are replaced with the latest Lake Formation permissions and Data Catalog from the source Region.

This utility can detect any changes made to the Data Catalog metadata, databases, tables, and columns while replicating the Data Catalog from the source to the target Region. If a change is detected in the source Region, the latest version of the AWS Glue object is applied to the target Region. The utility reports the number of objects modified during its run.

The Lake Formation permissions are copied from the source to the target Region, so any new permissions are replicated in the target Region. If a permission is removed from the source Region, it is not removed from the target Region.

Real-time mode

Real-time mode replicates the Lake Formation permissions and Data Catalog at a regular interval. The default interval is 1 minute, but it can be modified during deployment. The code is deployed using the AWS CDK. The following diagram shows the solution architecture for this mode.

The AWS CDK deploys two AWS Lambda jobs and creates an Amazon DynamoDB table to store AWS CloudTrail events and an Amazon EventBridge rule to run the replication at a regular interval. The Lambda jobs retrieve the configuration information from a file stored in an S3 bucket. This file includes details such as the source and target Regions, options for moving data to a different S3 bucket, and the lookback period for CloudTrail in hours. More information about these options and deployment instructions is available in the GitHub repository.

The EventBridge rule triggers a Lambda job at a fixed interval. This job retrieves the configuration information and queries CloudTrail events related to the Data Catalog and Lake Formation that occurred in the past hour (the duration is configurable). All relevant events are then stored in a DynamoDB table.

After the event information is inserted into the DynamoDB table, another Lambda job is triggered. This job retrieves the configuration information and queries the DynamoDB table. It then applies all the changes to the target Region. If the tool is run again after making changes to the target Region, the changes are replaced with the latest Lake Formation permissions and Data Catalog from the source Region. Unlike on-demand mode, real-time mode also removes from the target Region any Lake Formation permissions that were removed from the source Region.
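
The difference between the two modes with respect to removed permissions can be modeled with sets. The following Python sketch is a simplified mental model under our own assumptions, not the utility's code: permissions are represented as hypothetical (principal, resource, privilege) tuples, and real-time changes are represented as a list of grant and revoke events.

```python
def on_demand_sync(source_snapshot, target):
    """Additive sync: everything in the snapshot is granted, nothing is removed."""
    return set(target) | set(source_snapshot)


def real_time_sync(events, target):
    """Replay grant and revoke events from the source Region onto the target."""
    result = set(target)
    for action, permission in events:
        if action == "grant":
            result.add(permission)
        elif action == "revoke":
            # discard() keeps the replay idempotent if the permission is already gone.
            result.discard(permission)
        else:
            raise ValueError(f"unknown action: {action}")
    return result


select_perm = ("analyst-role", "sales_db.orders", "SELECT")
alter_perm = ("etl-role", "sales_db.orders", "ALTER")

# The source Region no longer has alter_perm, but the target still does.
after_on_demand = on_demand_sync({select_perm}, {alter_perm})
after_real_time = real_time_sync(
    [("grant", select_perm), ("revoke", alter_perm)], {alter_perm}
)
```

In this model, after_on_demand still contains the removed ALTER permission, while after_real_time contains only the SELECT permission. Running either function a second time with the same inputs gives the same result, which reflects the idempotent behavior described earlier.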

Limitations

This utility is designed to replicate permissions within a single account only. The on-demand mode replicates a snapshot and doesn’t remove existing permissions, so it doesn’t perform delete operations. The API currently doesn’t support replicating changes to row and column permissions.

Conclusion

In this post, we showed how you can use this utility to migrate the AWS Glue Data Catalog and Lake Formation permissions from one Region to another. It can also keep the source and target Regions synchronized if any changes are made to the Data Catalog or the Lake Formation permissions. Implementing it across Regions (multi-Region) is a good option if you are looking for the most separation and complete independence of your globally diverse data workloads. Also consider the trade-offs. Implementing and operating this strategy, particularly using multi-Region, can be more complicated and more expensive than other DR strategies.

To get started, check out the GitHub repo. For more resources, refer to the following:

About the authors

Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 13 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finding areas for home automation.

Raza Hafeez is a Senior Data Architect within the Shared Delivery Practice of AWS Professional Services. He has over 12 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Nivas Shankar is a Principal Product Manager for AWS Lake Formation. He works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data lakes. He also leads several data and analytics initiatives within AWS, including support for Data Mesh.

Build a serverless analytics application with Amazon Redshift and Amazon API Gateway

2023-01-24 David Zhang

Post Syndicated from David Zhang original https://aws.amazon.com/blogs/big-data/build-a-serverless-analytics-application-with-amazon-redshift-and-amazon-api-gateway/

Serverless applications are a modernized way to perform analytics among business departments and engineering teams. Business teams can gain meaningful insights by simplifying their reporting through web applications and distributing it to a broader audience.

Use cases can include the following:

Dashboarding – A webpage consisting of tables and charts where each component can offer insights to a specific business department.
Reporting and analysis – An application where you can trigger large analytical queries with dynamic inputs and then view or download the results.
Management systems – An application that provides a holistic view of the internal company resources and systems.
ETL workflows – A webpage where internal company individuals can trigger specific extract, transform, and load (ETL) workloads in a user-friendly environment with dynamic inputs.
Data abstraction – Decouple and refactor underlying data structure and infrastructure.
Ease of use – An application where you want to give a large set of users controlled access to analytics without having to onboard each user to a technical platform. Query updates can be completed in an organized manner and maintenance has minimal overhead.

In this post, you will learn how to build a serverless analytics application using Amazon Redshift Data API and Amazon API Gateway WebSocket and REST APIs.

Amazon Redshift is fully managed by AWS, so you no longer need to worry about data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, monitoring nodes and drives to recover from failures, or backups. The Data API simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later.

API Gateway is a fully managed service that makes it easy for developers to publish, maintain, monitor, and secure APIs at any scale. With API Gateway, you can create RESTful APIs and WebSocket APIs that enable real-time two-way communication applications. API Gateway supports containerized and serverless workloads, as well as web applications. API Gateway acts as a reverse proxy to many of the compute resources that AWS offers.

Event-driven model

Event-driven applications are increasingly popular among customers. Analytical reporting web applications can be implemented through an event-driven model. The applications run in response to events such as user actions and unpredictable query events. Decoupling the producer and consumer processes allows greater flexibility in application design. This design can be achieved with the Data API and API Gateway WebSocket and REST APIs.

Both REST API calls and WebSocket establish communication between the client and the backend. Due to the popularity of REST, you may wonder why WebSockets are present and how they contribute to an event-driven design.

What are WebSockets and why do we need them?

Unidirectional communication is customary when building analytical web solutions. In traditional environments, the client initiates a REST API call to run a query on the backend and either synchronously or asynchronously waits for the query to complete. The “wait” aspect is typically implemented with polling. Polling in this context is used when the client doesn’t know when a backend process will complete. Therefore, the client repeatedly sends requests to the backend to check whether the process has finished.

What is the problem with polling? Main challenges include the following:

Increased network traffic – A large number of users performing empty checks impacts your backend resources and doesn’t scale well.
Cost usage – Empty requests don’t deliver any value to the business. You pay for the unnecessary cost of resources.
Delayed response – Polling is scheduled in time intervals. If the query is complete in-between these intervals, the user can only see the results after the next check. This delay impacts the user experience and, in some cases, may result in UI deadlocks.

For more information on polling, check out From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets.
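To put rough numbers on the polling challenges above, the following sketch counts the empty checks and the result delay for a single query. It is a simplified model with assumed names: the client is taken to poll at fixed intervals starting one interval after it submits the query.

```python
import math


def polling_cost(query_seconds, poll_interval_seconds):
    """Return (empty_polls, result_delay_seconds) for one query.

    The client polls at t = interval, 2 * interval, ... and stops at the
    first poll made at or after the moment the query completes.
    """
    polls_needed = max(1, math.ceil(query_seconds / poll_interval_seconds))
    empty_polls = polls_needed - 1
    delay = polls_needed * poll_interval_seconds - query_seconds
    return empty_polls, delay


# A 42-second query polled every 5 seconds: 8 empty checks, 3 seconds of delay.
print(polling_cost(42, 5))
```

With a push model, both numbers drop to roughly zero per query, because the backend notifies the client once when the result is ready.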

WebSockets is another approach compared to REST when establishing communication between the front end and backend. WebSockets enable you to create a full duplex communication channel between the client and the server. In this bidirectional scenario, the client can make a request to the server and is notified when the process is complete. The connection remains open, with minimal network overhead, until the response is received.

You may wonder why REST is present, since you can transfer response data with WebSockets. WebSocket is a lightweight protocol designed for real-time messaging between systems. The protocol is not designed for handling large analytical query data, and in API Gateway, each frame’s payload can only hold up to 32 KB. Therefore, the REST API performs large data retrieval.
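The size constraint can be illustrated with a small check on the encoded message. This is only a sketch: the sample application always retrieves results over REST, and the message fields (`statementId`, `topic`, `rows`, `fetch`) are hypothetical names chosen for illustration.

```python
import json

# 32 KB frame limit quoted in the text above.
MAX_FRAME_BYTES = 32 * 1024


def build_notification(statement_id, topic, rows=None):
    """Build a WebSocket message; embed rows only when they fit in one frame."""
    message = {"statementId": statement_id, "topic": topic, "rows": rows}
    encoded = json.dumps(message).encode("utf-8")
    if rows is not None and len(encoded) <= MAX_FRAME_BYTES:
        return encoded
    # Too large (or no rows given): send a small pointer and let the client
    # fetch the result set with a REST call instead.
    pointer = {"statementId": statement_id, "topic": topic, "fetch": "rest"}
    return json.dumps(pointer).encode("utf-8")
```

Keeping the WebSocket message to a small pointer means the notification path never depends on the size of the query result.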

By using the Data API and API Gateway, you can build decoupled event-driven web applications for your data analytical needs. You can create WebSocket APIs with API Gateway and establish a connection between the client and your backend services. You can then initiate requests to perform analytical queries with the Data API. Due to the Data API’s asynchronous nature, the query completion generates an event to notify the client through the WebSocket channel. The client can decide to either retrieve the query results through a REST API call or perform other follow-up actions. The event-driven architecture enables bidirectional interoperable messages and data while keeping your system components agnostic.

Solution overview

In this post, we show how to create a serverless event-driven web application by querying with the Data API in the backend, establishing a bidirectional communication channel between the user and the backend with the WebSocket feature in API Gateway, and retrieving the results using its REST API feature. Instead of designing an application with long-running API calls, you can use the Data API. The Data API allows you to run SQL queries asynchronously, removing the need to hold long, persistent database connections.

The web application is protected using Amazon Cognito, which is used to authenticate the users before they can utilize the web app and also authorize the REST API calls when made from the application.

Other relevant AWS services in this solution include AWS Lambda and Amazon EventBridge. Lambda is a serverless, event-driven compute resource that enables you to run code without provisioning or managing servers. EventBridge is a serverless event bus allowing you to build event-driven applications.

The solution creates a lightweight WebSocket connection between the browser and the backend. When a user submits a request using WebSockets to the backend, a query is submitted to the Data API. When the query is complete, the Data API sends an event notification to EventBridge. EventBridge signals the system that the data is available and notifies the client. Afterwards, a REST API call is performed to retrieve the query results for the client to view.
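The query submission in this flow can be sketched with the Data API `execute_statement` call, whose `WithEvent` flag requests an EventBridge event when the statement finishes. This is a minimal sketch and not the code from the repository: the function names, the DynamoDB attribute names, and the use of the topic as the statement name are assumptions, and `redshift_data` stands for a Boto3 `redshift-data` client.

```python
def submit_query(redshift_data, cluster_id, database, db_user, sql, topic):
    """Submit `sql` asynchronously and return the statement ID."""
    response = redshift_data.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
        StatementName=topic,  # assumption: reuse the topic as the statement name
        WithEvent=True,  # emit an EventBridge event on completion
    )
    return response["Id"]


def connection_item(connection_id, statement_id, topic):
    """Build a DynamoDB item linking a WebSocket connection to a query.

    The attribute names are hypothetical.
    """
    return {
        "connectionId": {"S": connection_id},
        "statementId": {"S": statement_id},
        "topicName": {"S": topic},
    }
```

The call returns immediately with a statement ID, so the Lambda function does not hold a database connection while the query runs.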

We have published this solution on the AWS Samples GitHub repository and will be referencing it during the rest of this post.

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with AWS CloudFormation templates run as part of the shell script with some parameter variables.

The application performs the following steps (note the corresponding numbered steps in the process flow):

A web application is provisioned on AWS Amplify; the user needs to sign up first by providing their email and a password to access the site.
The user verifies their credentials using a pin sent to their email. This step is mandatory for the user to then log in to the application and continue access to the other features of the application.
After the user is signed up and verified, they can sign in to the application and request data through their web or mobile clients with input parameters. This initiates a WebSocket connection in API Gateway. (Flow 1, 2)
The connection request is handled by a Lambda function, OnConnect, which initiates an asynchronous database query in Amazon Redshift using the Data API. The SQL query is taken from a SQL script in Amazon Simple Storage Service (Amazon S3) with dynamic input from the client. (Flow 3, 4, 6, 7)
In addition, the OnConnect Lambda function stores the connection, statement identifier, and topic name in an Amazon DynamoDB database. The topic name is an extra parameter that can be used if users want to implement multiple reports on the same webpage. This allows the front end to map responses to the correct report. (Flow 3, 4, 5)
The Data API runs the query submitted by the OnConnect function. When the operation is complete, an event notification is sent to EventBridge. (Flow 8)
EventBridge activates an event rule to redirect that event to another Lambda function, SendMessage. (Flow 9)
The SendMessage function notifies the client that the SQL query is complete via API Gateway. (Flow 10, 11, 12)
After the notification is received, the client performs a REST API call (GET) to fetch the results. (Flow 13, 14, 15, 16)
The GetResult function is triggered, which retrieves the SQL query result and returns it to the client.
The user is now able to view the results on the webpage.
When clients disconnect from their browser, API Gateway automatically deletes the connection information from the DynamoDB table using the onDisconnect function. (Flow 17, 18, 19)
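The notification and retrieval steps above can be sketched as two small functions. This is an illustration under stated assumptions, not the repository's SendMessage or GetResult code: the event fields `statementId` and `state`, the record keys, and the function names are assumed, `api_client` stands for a Boto3 `apigatewaymanagementapi` client, and `redshift_data` for a Boto3 `redshift-data` client.

```python
import json


def handle_statement_event(event, lookup_connection, api_client):
    """Notify the waiting client when a statement finishes.

    `lookup_connection` maps a statement ID to a stored connection record
    (for example, a DynamoDB lookup) or returns None.
    """
    detail = event.get("detail", {})
    statement_id = detail.get("statementId")
    record = lookup_connection(statement_id)
    if record is None:
        return False  # The client disconnected before the query finished.
    message = {
        "statementId": statement_id,
        "state": detail.get("state"),
        "topicName": record["topicName"],
    }
    api_client.post_to_connection(
        ConnectionId=record["connectionId"],
        Data=json.dumps(message).encode("utf-8"),
    )
    return True


def fetch_all_records(redshift_data, statement_id):
    """Page through get_statement_result and return every record."""
    records, kwargs = [], {"Id": statement_id}
    while True:
        page = redshift_data.get_statement_result(**kwargs)
        records.extend(page.get("Records", []))
        token = page.get("NextToken")
        if not token:
            return records
        kwargs["NextToken"] = token
```

Passing the clients and the lookup function as arguments keeps both functions easy to test without AWS credentials.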

Prerequisites

Prior to deploying your event-driven web application, ensure you have the following:

An Amazon Redshift cluster in your AWS environment – This is your backend data warehousing solution to run your analytical queries. For instructions to create your Amazon Redshift cluster, refer to Getting started with Amazon Redshift.
- After you create your Amazon Redshift cluster, add the AmazonS3ReadOnlyAccess permission to the associated cluster AWS Identity and Access Management (IAM) role.
An S3 bucket that you have access to – The S3 bucket will be your object storage solution where you can store your SQL scripts. To create your S3 bucket, refer to Create your first S3 bucket.

Deploy CloudFormation templates

The code associated with the design is available in the following GitHub repository. You can clone the repository inside an AWS Cloud9 environment in your AWS account. The AWS Cloud9 environment comes with AWS Command Line Interface (AWS CLI) installed, which is used to run the CloudFormation templates to set up the AWS infrastructure. Make sure that the jq command-line JSON processor is installed; the script uses it to parse JSON output as it runs.

The complete architecture is set up using three CloudFormation templates:

cognito-setup.yaml – Creates the Amazon Cognito user pool and web app client, which are used for authentication and protecting the REST API
backend-setup.yaml – Creates all the required Lambda functions and the WebSocket and REST APIs, and configures them on API Gateway
webapp-setup.yaml – Creates the web application hosting using Amplify to connect and communicate with the WebSocket and REST APIs

These CloudFormation templates are run using the script.sh shell script, which takes care of all the dependencies as required.

A generic template is provided for you to customize your own DDL SQL scripts as well as your own query SQL scripts. We have created sample scripts for you to follow along.

Download the sample DDL script and upload it to an existing S3 bucket.
Change the IAM role value to your Amazon Redshift cluster’s IAM role with permissions to AmazonS3ReadOnlyAccess.

For this post, we copy the New York Taxi Data 2015 dataset from a public S3 bucket.

Download the sample query script and upload it to an existing S3 bucket.
Upload the modified sample DDL script and the sample query script into a preexisting S3 bucket that you own, and note down the S3 URI path.

If you want to run your own customized version, modify the DDL and query script to fit your scenario.

Edit the script.sh file before you run it and set the values for the following parameters:

- RedshiftClusterEndpoint (aws_redshift_cluster_ep) – Your Amazon Redshift cluster endpoint available on the AWS Management Console
- DBUsername (aws_dbuser_name) – Your Amazon Redshift database user name
- DDBTableName (aws_ddbtable_name) – The name of the DynamoDB table that will be created
- WebsocketEndpointSSMParameterName (aws_wsep_param_name) – The parameter name that stores the WebSocket endpoint in AWS Systems Manager Parameter Store.
- RestApiEndpointSSMParameterName (aws_rapiep_param_name) – The parameter name that stores the REST API endpoint in Parameter Store.
- DDLScriptS3Path (aws_ddl_script_path) – The S3 URI to the DDL script that you uploaded.
- QueryScriptS3Path (aws_query_script_path) – The S3 URI to the query script that you uploaded.
- AWSRegion (aws_region) – The Region where the AWS infrastructure is being set up.
- CognitoPoolName (aws_user_pool_name) – The name you want to give to your Amazon Cognito user pool
- ClientAppName (aws_client_app_name) – The name of the client app to be configured for the web app to handle user authentication

The default acceptable values are already provided as part of the downloaded code.

Run the script using the following command:

./script.sh

During deployment, AWS CloudFormation creates and triggers the Lambda function SetupRedshiftLambdaFunction, which sets up an Amazon Redshift database table and populates data into the table. The following diagram illustrates this process.

Use the demo app

When the shell script is complete, you can start interacting with the demo web app:

On the Amplify console, under All apps in the navigation pane, choose DemoApp.
Choose Run build.

The DemoApp web application goes through a phase of Provision, Build, Deploy.

When it’s complete, use the URL provided to access the web application.

The following screenshot shows the web application page. It has minimal functionality: you can sign in, sign up, or verify a user.

Choose Sign Up.

For Email ID, enter an email.
For Password, enter a password that is at least eight characters long, has at least one uppercase and lowercase letter, at least one number, and at least one special character.
Choose Let’s Enroll.

The Verify your Login to Demo App page opens.

Enter your email and the verification code sent to the email you specified.
Choose Verify.

You’re redirected to a login page.

After you sign in, you’re redirected to the demoPage.html website.

Choose Open Connection.

You now have an active WebSocket connection between your browser and your backend AWS environment.

For Trip Month, specify a month (for this example, December) and choose Submit.

You have now defined the month and year for which you want to query your data. After a few seconds, you can see the output delivered from the WebSocket.

You may continue using the active WebSocket connection for additional queries—just choose a different month and choose Submit again.

When you’re done, choose Close Connection to close the WebSocket connection.

For exploratory purposes, while your WebSocket connection is active, you can navigate to your DynamoDB table on the DynamoDB console to view the items that are currently stored. After the WebSocket connection is closed, the items stored in DynamoDB are deleted.

Clean up

To clean up your resources, complete the following steps:

On the Amazon S3 console, navigate to the S3 bucket containing the sample DDL script and query script and delete them from the bucket.
On the Amazon Redshift console, navigate to your Amazon Redshift cluster and delete the data you copied over from the sample DDL script.
1. Run truncate nyc_yellow_taxi;
2. Run drop table nyc_yellow_taxi;
On the AWS CloudFormation console, navigate to the CloudFormation stacks and choose Delete. Delete the stacks in the following order:
1. WebappSetup
2. BackendSetup
3. CognitoSetup

All resources created in this solution will be deleted.

Monitoring

You can monitor your event-driven web application events, user activity, and API usage with Amazon CloudWatch and AWS CloudTrail. Most areas of this solution already have logging enabled. To view your API Gateway logs, you can turn on CloudWatch Logs. Lambda comes with default logging and monitoring and can be accessed with CloudWatch.

Security

You can secure access to the application using Amazon Cognito, which is a developer-centric and cost-effective customer authentication, authorization, and user management solution. It provides both identity store and federation options that can scale easily. Amazon Cognito supports logins with social identity providers and SAML or OIDC-based identity providers, and supports various compliance standards. It operates on open identity standards (OAuth2.0, SAML 2.0, and OpenID Connect). You can also integrate it with API Gateway to authenticate and authorize the REST API calls either using the Amazon Cognito client app or a Lambda function.

Considerations

The nature of this application includes a front-end client initializing SQL queries to Amazon Redshift. An important aspect to consider is potential malicious activity that the client can perform, such as SQL injection. With the current implementation, that is not possible. In this solution, the SQL queries preexist in your AWS environment and are DQL statements (they don’t alter the data or structure). However, as you develop this application to fit your business, you should evaluate these areas of risk.
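If you extend the application to accept more client input, one way to limit this risk is to validate the input against an allow list and pass it to the Data API as a named parameter instead of concatenating it into the SQL text. The following is a minimal sketch under assumptions: the query text, the `trip_month` column and parameter name, and the numeric month format are hypothetical and are not taken from the sample scripts.

```python
# Hypothetical query; the Data API substitutes :trip_month from Parameters.
QUERY = "SELECT COUNT(*) FROM nyc_yellow_taxi WHERE trip_month = :trip_month"

# Allow list of accepted values, "1" through "12".
VALID_MONTHS = {str(month) for month in range(1, 13)}


def build_parameters(raw_month):
    """Validate client input and return Data API named parameters."""
    month = str(raw_month).strip()
    if month not in VALID_MONTHS:
        raise ValueError(f"unsupported trip month: {raw_month!r}")
    return [{"name": "trip_month", "value": month}]
```

The returned list can be passed as the `Parameters` argument of `execute_statement` together with `Sql=QUERY`, so the client never supplies SQL text directly.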

AWS offers a variety of security services to help you secure your workloads and applications in the cloud, including AWS Shield, AWS Network Firewall, AWS Web Application Firewall, and more. For more information and a full list, refer to Security, Identity, and Compliance on AWS.

Cost optimization

The AWS services that the CloudFormation templates provision in this solution are all serverless. In terms of cost optimization, you only pay for what you use. This model also allows you to scale without manual intervention. Review the following pages to determine the associated pricing for each service:

Conclusion

In this post, we showed you how to create an event-driven application using the Amazon Redshift Data API and API Gateway WebSocket and REST APIs. The solution helps you build data analytical web applications in an event-driven architecture, decouple your application, optimize long-running database query processes, and avoid unnecessary polling requests between the client and the backend.

You also used serverless technologies such as API Gateway, Lambda, DynamoDB, and EventBridge. You didn’t have to manage or provision any servers throughout this process.

This event-driven, serverless architecture offers greater extensibility and simplicity, making it easier to maintain and release new features. Adding new components or third-party products is also simplified.

With the instructions in this post and the generic CloudFormation templates we provided, you can customize your own event-driven application tailored to your business. For feedback or contributions, we welcome you to contact us through the AWS Samples GitHub Repository by creating an issue.

About the Authors

David Zhang is an AWS Data Architect in Global Financial Services. He specializes in designing and implementing serverless analytics infrastructure, data management, ETL, and big data systems. He helps customers modernize their data platforms on AWS. David is also an active speaker and contributor to AWS conferences, technical content, and open-source initiatives. During his free time, he enjoys playing volleyball, tennis, and weightlifting. Feel free to connect with him on LinkedIn.

Manash Deb is a Software Development Manager in the AWS Directory Service team. With over 18 years of software dev experience, his passion is designing and delivering highly scalable, secure, zero-maintenance applications in the AWS identity and data analytics space. He loves mentoring and coaching others, acting as a catalyst and force multiplier, leading highly motivated engineering teams, and building large-scale distributed systems.

Pavan Kumar Vadupu Lakshman Manikya is an AWS Solutions Architect who helps customers design robust, scalable solutions across multiple industries. With a background in enterprise architecture and software development, Pavan has contributed to creating solutions to handle API security, API management, microservices, and geospatial information system use cases for his customers. He is passionate about learning new technologies and solving, automating, and simplifying customer problems using these solutions.

Managing Dev Environments with Amazon CodeCatalyst

2023-01-23 Ryan Bachman

Post Syndicated from Ryan Bachman original https://aws.amazon.com/blogs/devops/managing-dev-environments-with-amazon-codecatalyst/

An Amazon CodeCatalyst Dev Environment is a cloud-based development environment that you can use in CodeCatalyst to quickly work on the code stored in the source repositories of your project. The project tools and application libraries included in your Dev Environment are defined by a devfile in the source repository of your project.

Introduction

In the previous CodeCatalyst post, Team Collaboration with Amazon CodeCatalyst, I focused on CodeCatalyst’s collaboration capabilities and how they related to The Unicorn Project’s main protagonist. At the beginning of Chapter 2, Maxine is struggling to configure her development environment. She is two days into her new job and still cannot build the application code. She has identified over 100 dependencies she is missing. The documentation is out of date and nobody seems to know where the dependencies are stored. I can sympathize with Maxine. In this post, I will focus on managing development environments to show how CodeCatalyst removes the burden of managing workload-specific configurations and produces reliable on-demand development environments.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Have an AWS Builder ID for signing in to CodeCatalyst.

Belong to a space and have the space administrator role assigned to you in that space. For more information, see Creating a space in CodeCatalyst, Managing members of your space, and Space administrator role.

Have an AWS account associated with your space and have the IAM role in that account. For more information about the role and role policy, see Creating a CodeCatalyst service role.

Walkthrough

As with the previous posts in our CodeCatalyst series, I am going to use the Modern Three-tier Web Application blueprint. Blueprints provide sample code and CI/CD workflows to help make getting started easier across different combinations of programming languages and architectures. To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the blueprint.

One of the most difficult aspects of my time spent as a developer was finding ways to quickly contribute to a new project. Whenever I found myself working on a new project, getting to the point where I could meaningfully contribute to a project’s code base was always more difficult than writing the actual code. A major contributor to this inefficiency was the lack of a process for managing my local development environment. I will be exploring how CodeCatalyst can help solve this challenge. For this walkthrough, I want to add a new test that will allow local testing of Amazon DynamoDB. To achieve this, I will use a CodeCatalyst dev environment.

CodeCatalyst Dev Environments are managed cloud-based development environments that you can use to access and modify code stored in a source repository. You can launch a project-specific dev environment that automates check-out of your project’s repo, or you can launch an empty environment to use for accessing third-party source providers. You can learn more about CodeCatalyst Dev Environments in the CodeCatalyst User Guide.

CodeCatalyst user interface showing Create Dev Environment

Figure 1. Creating a new Dev Environment

To begin, I navigate to the Dev Environments page under the Code section of the navigation menu. I then use the Create Dev Environment button to launch my environment. For this post, I am using the AWS Cloud9 IDE, but you can follow along with the IDE you are most comfortable using. In the next screen, I select Work in New Branch and assign local_testing for the new branch name, and I am branching from main. I leave the remaining default options and choose Create.

Create Dev Environment user interface with work in a new branch selected

Figure 2. Dev Environment Create Options

After waiting less than a minute, my IDE is ready in a new tab and I am ready to begin work. The first thing I see in my dev environment is an information window asking me if I want to navigate to the Dev Environment Settings. Because I need to enable local testing of DynamoDB, not only for myself but also for other developers who will collaborate on this project, I need to update the project’s devfile. I choose to navigate to the settings tab because I know that it contains information on the project’s devfile and allows me to access the file to edit it.

AWS Toolkit prompting to Open Dev Environment Settings.

Figure 3. Toolkit Welcome Banner

Devfiles allow you to model a Dev Environment’s configuration and dependencies so that you can reproduce consistent Dev Environments and reduce the manual effort in setting up future environments. The tools and application libraries included in your Dev Environment are defined by the devfile in the source repository of your project. Since this project was created from a blueprint, there is one provided. For blank projects, a default CodeCatalyst devfile is created when you first launch an environment. To learn more about the devfile, see https://devfile.io.

In the settings tab, I find a link to the devfile that is configured. When I click the edit button, a new file tab launches and I can now make changes. I first add an env section to the container that hosts our dev environment. By adding an environment variable and value, anytime a new dev environment is created from this project’s repository, that value will be included. Next, I add a second container to the dev environment that will run DynamoDB locally. I can do this by adding a new container component. I use Amazon’s verified DynamoDB Docker image for my environment. Attaching additional images allows you to extend the dev environment and include tools or services that can be made available locally. My updates are highlighted in the green sections below.

Devfile.yaml with environment variable and DynamoDB container added

Figure 4. Example Devfile
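For readers without access to the figure, the shape of such a devfile can be sketched as a Python dictionary that mirrors the devfile structure of components, containers, and environment variables. This is an approximation and not the file shown in the figure: the component names, the metadata name, the `"true"` value for `IS_LOCAL`, and both image references are assumptions.

```python
import json

devfile = {
    "schemaVersion": "2.0.0",
    "metadata": {"name": "local-testing-sketch"},  # hypothetical name
    "components": [
        {
            "name": "dev",  # hypothetical name for the main dev container
            "container": {
                # Assumption: a general-purpose dev environment image.
                "image": "public.ecr.aws/aws-mde/universal-image:latest",
                "mountSources": True,
                # Environment variable that the test reads later.
                "env": [{"name": "IS_LOCAL", "value": "true"}],
            },
        },
        {
            "name": "dynamodb-local",  # hypothetical name for the second container
            "container": {
                # Assumption: the DynamoDB Local image published by Amazon.
                "image": "amazon/dynamodb-local",
                "mountSources": False,
            },
        },
    ],
}

# Print the structure so it can be compared with the YAML devfile.
print(json.dumps(devfile, indent=2))
```

The same keys appear in the YAML devfile: an `env` list on the first container component and a second container component for the local database.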

I save my changes and navigate back to the Dev Environment Settings tab. I notice that my changes were automatically detected and I am prompted to restart my development environment for the changes to take effect. Modifications to the devfile require a restart. You can restart a dev environment using the toolkit, or from the CodeCatalyst UI.

AWS Toolkit prompt asking to restart the dev environment

Figure 5. Dev Environment Settings

After waiting a few seconds for my dev environment to restart, I am ready to write my test. I use the IDE’s file explorer, expand the repo’s ./tests/unit folder, and create a new file named test_dynamodb.py. Using the IS_LOCAL environment variable I configured in the devfile, I can include a conditional in my test that sets the endpoint that Amazon’s Python SDK (Boto3) will use to connect to the DynamoDB service. This way, I can run tests locally before pushing my changes and still have tests complete successfully in my project’s workflow. My full test file is included below.

Figure 6. DynamoDB test file
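The conditional described above might look like the following sketch. It is not the test file from the figure: the local endpoint address, the `"true"` value check, and the function names are assumptions, and only the `IS_LOCAL` variable name comes from the walkthrough.

```python
import os

# Hypothetical address of the DynamoDB Local container; 8000 is the port
# that DynamoDB Local listens on by default.
LOCAL_ENDPOINT = "http://localhost:8000"


def dynamodb_client_kwargs(environ=None):
    """Return keyword arguments for boto3.client("dynamodb", ...)."""
    environ = os.environ if environ is None else environ
    if environ.get("IS_LOCAL", "").lower() == "true":
        # Point Boto3 at the local container instead of the AWS endpoint.
        return {"endpoint_url": LOCAL_ENDPOINT}
    return {}


def test_endpoint_selection():
    local = dynamodb_client_kwargs({"IS_LOCAL": "true"})
    assert local == {"endpoint_url": LOCAL_ENDPOINT}
    assert dynamodb_client_kwargs({}) == {}
```

A test that needs a client could then call `boto3.client("dynamodb", **dynamodb_client_kwargs())`, which uses the local endpoint only when `IS_LOCAL` is set.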

Now that I have completed my changes to the dev environment using the devfile and added a test, I am ready to run my test locally to verify. I will use pytest to ensure the tests are passing before pushing any changes. From the repo’s root folder, I run the command pip install -r requirements-dev.txt. Once my dependencies are installed, I then issue the command pytest -k unit. All tests pass as I expect.

Result of the pytest shown at the command line

Figure 7. Pytest test results

Rather than manually installing my development dependencies in each environment, I could also use the devfile to include commands and automate the execution of those commands during the dev environment lifecycle events. You can refer to the links for commands and events for more information.

Finally, I am ready to push my changes back to my CodeCatalyst source repository. I use the git extension of Cloud9 to review my changes. After confirming that my changes are what I expect, I use the git extension to stage, commit, and push the new test file and the modified devfile so other collaborators can adopt the improvements I made.

Figure 8. Changes reviewed in CodeCatalyst Cloud9 git extension.

Cleanup

Conclusion

In this post, you learned how CodeCatalyst provides configurable on-demand dev environments. You also learned how devfiles help you define a consistent experience for developing within a CodeCatalyst project. Please follow our DevOps blog channel as I continue to explore how CodeCatalyst solves Maxine’s and other builders’ challenges.

About the author:

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

2023-01-18 Purna Sanyal

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-2-progressive-delivery-on-amazon-eks-with-flagger-and-gloo-edge-ingress-controller/

In the last post, OfferUp modernized its DevOps platform with Amazon EKS and Flagger to accelerate time to market, we talked about OfferUp’s hypergrowth and the technical challenges it encountered in its existing DevOps platform. As a reminder, we presented how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger to increase developer velocity, automate faster deployments, and lower the cost of ownership.

In this post, we discuss the technical steps to build a DevOps platform that enables the progressive deployment of microservices on Amazon EKS. Progressive delivery exposes a new version of the software incrementally to ingress traffic and continuously measures the success rate of the metrics before shifting all traffic to the newer version. Flagger is a graduated project of the Cloud Native Computing Foundation (CNCF) that enables progressive canary delivery, along with blue/green and A/B testing, while measuring metrics like HTTP/gRPC request success rate and latency. Flagger shifts and routes traffic between app versions using a service mesh or an Ingress controller.

We leverage Gloo Ingress Controller for traffic routing; Prometheus, Datadog, and Amazon CloudWatch for application metrics analysis; and Slack for notifications. Flagger posts messages to Slack when a deployment has been initialized, when a new revision has been detected, and when the canary analysis has failed or succeeded.

Prerequisite steps to build the modern DevOps platform

You need an AWS account and an AWS Identity and Access Management (IAM) user to build the DevOps platform. If you don’t have an AWS account with Administrator access, then create one now by clicking here. Create an IAM user and assign the admin role. You can build this platform in any AWS Region; however, I use the us-west-1 Region throughout this post. You can use a laptop (Mac or Windows) or an Amazon Elastic Compute Cloud (Amazon EC2) instance as a client machine to install all of the necessary software to build the GitOps platform. For this post, I launched an Amazon EC2 instance (with the Amazon Linux 2 AMI) as the client and installed all of the prerequisite software on it. You need the awscli, git, eksctl, kubectl, and helm applications to build the GitOps platform. Here are the prerequisite steps:

Create a named profile (eks-devops) with the config and credentials file:

aws configure --profile eks-devops

AWS Access Key ID [None]: xxxxxxxxxxxxxxxxxxxxxx

AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxx

Default region name [None]: us-west-1

Default output format [None]:

View and verify your current IAM profile:

export AWS_PROFILE=eks-devops

aws sts get-caller-identity

If the Amazon EC2 instance doesn’t have git preinstalled, then install git in your Amazon EC2 instance:

sudo yum update -y

sudo yum install git -y

Check git version

git version

Clone the repo into the home directory:

git clone https://github.com/aws-samples/aws-gloo-flux.git

Download all of the prerequisite software from install.sh which includes awscli, eksctl, kubectl, helm, and docker:

cd aws-gloo-flux/eks-flagger/

ls -lt

chmod 700 install.sh ecr-setup.sh

. install.sh

Check the version of the software installed:

aws --version

eksctl version

kubectl version -o json

helm version

docker --version

docker info

If docker info shows an error like “permission denied”, then reboot the Amazon EC2 instance or log out and log in to the instance again.

Create an Amazon Elastic Container Registry (Amazon ECR) repository and push application images.

Amazon ECR is a fully managed container registry that makes it easy for developers to share and deploy container images and artifacts. The ecr-setup.sh script creates a new Amazon ECR repository and pushes the podinfo images (6.0.0, 6.0.1, 6.0.2, 6.1.0, 6.1.5, and 6.1.6) to it. Run the ecr-setup.sh script with two parameters: the ECR repository name (e.g. ps-flagger-repository) and the Region (e.g. us-west-1):

./ecr-setup.sh <ps-flagger-repository> <us-west-1>

You’ll see output like the following (truncated).

###########################################################

Successfully created ECR repository and pushed podinfo images to ECR #

Please note down the ECR repository URI

xxxxxx.dkr.ecr.us-west-1.amazonaws.com/ps-flagger-repository

Technical steps to build the modern DevOps platform

This post shows you how to use the Gloo Edge ingress controller and Flagger to automate canary releases for progressive deployment on the Amazon EKS cluster. Flagger requires a Kubernetes cluster v1.16 or newer and Gloo Edge ingress 1.6.0 or newer. This post provides a step-by-step approach to install the Amazon EKS cluster with a managed node group, the Gloo Edge ingress controller, and Flagger for Gloo in the Amazon EKS cluster. Once the cluster, metrics infrastructure, and Flagger are installed, we install the sample application itself. We use the standard Podinfo application used in the Flagger project and the accompanying loadtester tool. The Flagger “podinfo” backend service is called by Gloo’s “VirtualService”, which is the root routing object for the Gloo Gateway. A virtual service describes the set of routes to match for a set of domains. We automate the canary promotion, with the new image of the “podinfo” service, from version 6.0.0 to version 6.0.1. We also create a scenario that injects errors to trigger an automated canary rollback while deploying version 6.0.2.

Use myeks-cluster.yaml to create your Amazon EKS cluster with a managed node group. The myeks-cluster.yaml deployment file sets the cluster name to ps-eks-66, the Region to us-west-1, availabilityZones to [us-west-1a, us-west-1b], the Kubernetes version to 1.24, and the node group Amazon EC2 instance type to m5.2xlarge. You can change these values if you want to build the cluster in a different Region or Availability Zone.

eksctl create cluster -f myeks-cluster.yaml

Check the Amazon EKS Cluster details:

kubectl cluster-info

kubectl version -o json

kubectl get nodes -o wide

kubectl get pods -A -o wide

Deploy the Metrics Server:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

kubectl get deployment metrics-server -n kube-system

Update the kubeconfig file to interact with your cluster:

# aws eks update-kubeconfig --name <ekscluster-name> --region <AWS_REGION>

kubectl config view

cat $HOME/.kube/config

Create a namespace “gloo-system” and install Gloo with the Helm chart. Gloo Edge is an Envoy-based, Kubernetes-native ingress controller that facilitates and secures application traffic.

helm repo add gloo https://storage.googleapis.com/solo-public-helm

kubectl create ns gloo-system

helm upgrade -i gloo gloo/gloo --namespace gloo-system

Install Flagger and the Prometheus add-on in the same gloo-system namespace. Flagger is a Cloud Native Computing Foundation project and part of the Flux family of GitOps tools.

helm repo add flagger https://flagger.app

helm upgrade -i flagger flagger/flagger \

--namespace gloo-system \

--set prometheus.install=true \

--set meshProvider=gloo

[Optional] If you’re using Datadog as a monitoring tool, then deploy Datadog agents as a DaemonSet using the Datadog Helm chart. Replace RELEASE_NAME and DATADOG_API_KEY accordingly. If you aren’t using Datadog, then skip this step. For this post, we leverage the Prometheus open-source monitoring tool.

helm repo add datadog https://helm.datadoghq.com

helm repo update

helm install <RELEASE_NAME> \

--set datadog.apiKey=<DATADOG_API_KEY> datadog/datadog

Integrate Amazon EKS/ K8s Cluster with the Datadog Dashboard – go to the Datadog Console and add the Kubernetes integration.

[Optional] If you’re using Slack and have admin access, then you can configure Flagger to send alerts to a Slack channel through a webhook URL. If you don’t have admin access in Slack, then skip this step.

helm upgrade -i flagger flagger/flagger \

--set slack.url=https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK \

--set slack.channel=general \

--set slack.user=flagger \

--set clusterName=<my-cluster>

Create a namespace “apps”. The applications and the load testing service are deployed into this namespace.

kubectl create ns apps

Create a deployment and a horizontal pod autoscaler for your custom application or service for which canary deployment will be done.

kubectl -n apps apply -k app

kubectl get deployment -A

kubectl get hpa -n apps

Deploy the load testing service to generate traffic during the canary analysis.

kubectl -n apps apply -k tester

kubectl get deployment -A

kubectl get svc -n apps

Use apps-vs.yaml to create a Gloo virtual service definition that references a route table that will be generated by Flagger.

kubectl apply -f ./apps-vs.yaml

kubectl get vs -n apps

[Optional] If you have your own domain name, then open apps-vs.yaml in vi editor and replace podinfo.example.com with your own domain name to run the app in that domain.

Use canary.yaml to create a canary custom resource. Review the service, analysis, and metrics sections of the canary.yaml file.

kubectl apply -f ./canary.yaml

After a couple of seconds, Flagger will create the canary objects. When the bootstrap finishes, Flagger will set the canary status to “Initialized”.

kubectl -n apps get canary podinfo

NAME STATUS WEIGHT LASTTRANSITIONTIME

podinfo Initialized 0 2023-xx-xxTxx:xx:xxZ

Gloo automatically creates an ELB. Once the load balancer is provisioned and health checks pass, we can find the sample application at the load balancer’s public address. Note down the ELB’s Public address:

kubectl get svc -n gloo-system --field-selector 'metadata.name==gateway-proxy' -o=jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}{"\n"}'

Validate that your application is running. The output shows version 6.0.0.

curl <load balancer’s public address> -H "Host:podinfo.example.com"
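The same check can be scripted. The following is a minimal sketch using only the Python standard library; the function names `build_podinfo_request` and `fetch_podinfo` are hypothetical, and the load balancer address is whatever the previous kubectl command returned. It assumes the load balancer serves plain HTTP, as the curl command above does.

```python
import urllib.request


def build_podinfo_request(lb_address, host="podinfo.example.com"):
    """Build a GET request to the load balancer that carries the virtual service's Host header."""
    url = lb_address if lb_address.startswith("http") else f"http://{lb_address}"
    return urllib.request.Request(url, headers={"Host": host})


def fetch_podinfo(lb_address, timeout=5):
    """Send the request and return the response body as text (needs network access)."""
    request = build_podinfo_request(lb_address)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The Host header matters because Gloo routes on the virtual service’s domain, not on the load balancer’s own DNS name.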

Trigger progressive deployments and monitor the status

You can trigger a canary deployment by updating the application container image from 6.0.0 to 6.0.1.

kubectl -n apps set image deployment/podinfo podinfod=<ECR URI>:6.0.1

Flagger detects that the deployment revision changed and starts a new rollout.

kubectl -n apps describe canary/podinfo

Monitor all canaries. The canary status can be one of the following: Initialized, Waiting, Progressing, Promoting, Finalizing, Succeeded, and Failed.

watch kubectl get canaries --all-namespaces

curl < load balancer’s public address> -H "Host:podinfo.example.com"

Once the canary analysis completes, validate your application. You can see that the version of the application has changed from 6.0.0 to 6.0.1.

{

"hostname": "podinfo-primary-658c9f9695-4pqbl",

"version": "6.0.1",

"revision": "",

"color": "#34577c",

"logo": "https://raw.githubusercontent.com/stefanprodan/podinfo/gh-pages/cuddle_clap.gif",

"message": "greetings from podinfo v6.0.1",

}

[Optional] Open podinfo application from the laptop browser

Find out both of the IP addresses associated with load balancer.

dig < load balancer’s public address >

Open the /etc/hosts file on the laptop and add both of the load balancer’s IP addresses to it.

sudo vi /etc/hosts

<Public IP address of LB Target node> podinfo.example.com

e.g.

xx.xx.xxx.xxx podinfo.example.com

Type “podinfo.example.com” in your browser and you’ll find the application in form similar to this:

Figure 1: Greetings from podinfo v6.0.1

Automated rollback

While the canary analysis is running, you’ll generate HTTP 500 errors and high latency to check whether Flagger pauses and rolls back the faulty version. Flagger performs an automatic rollback in the case of failure.

Introduce another canary deployment with podinfo image version 6.0.2 and monitor the status of the canary.

kubectl -n apps set image deployment/podinfo podinfod=<ECR URI>:6.0.2

Generate HTTP 500 errors or high latency from a separate terminal window.

Generate HTTP 500 errors:

watch curl -H 'Host:podinfo.example.com' <load balancer’s public address>/status/500

Generate high latency:

watch curl -H 'Host:podinfo.example.com' < load balancer’s public address >/delay/2

When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero, and the rollout is marked as failed.

kubectl get canaries --all-namespaces

kubectl -n apps describe canary/podinfo
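To make the promotion and rollback behavior concrete, here is a simplified model of a canary analysis loop. It is a sketch for intuition only, not Flagger’s implementation: the function name `simulate_canary` is hypothetical, and the `step_weight`, `max_weight`, and `threshold` defaults are illustrative values rather than the ones in canary.yaml.

```python
def simulate_canary(check_results, step_weight=5, max_weight=50, threshold=5):
    """Walk through one rollout and return (final_status, weight_history).

    check_results is an iterable of booleans, one per analysis interval:
    True means the metric checks passed, False means they failed.
    """
    weight = 0
    failed_checks = 0
    history = []
    for passed in check_results:
        if passed:
            # A passing interval shifts more traffic to the canary.
            weight = min(weight + step_weight, max_weight)
        else:
            # A failing interval keeps the weight and counts toward the threshold.
            failed_checks += 1
        history.append(weight)
        if failed_checks >= threshold:
            # Traffic returns to the primary and the canary is scaled to zero.
            history.append(0)
            return "Failed", history
        if weight >= max_weight:
            return "Succeeded", history
    return "Progressing", history
```

In this model a single failed check does not abort the rollout; only reaching the threshold does, which is why short error bursts may slow a promotion without rolling it back.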

Cleanup

When you’re done experimenting, you can delete all of the resources created during this series to avoid any additional charges. Let’s walk through deleting all of the resources used.

Delete Flagger resources and apps namespace
kubectl delete canary podinfo -n apps

kubectl delete HorizontalPodAutoscaler podinfo -n apps

kubectl delete deployment podinfo -n apps

helm -n gloo-system delete flagger

helm -n gloo-system delete gloo

kubectl delete namespace apps

Delete Amazon EKS Cluster
After you’ve finished with the cluster and nodes that you created for this tutorial, you should clean up by deleting the cluster and nodes with the following command:

eksctl delete cluster --name <cluster name> --region <region code>

Delete Amazon ECR

aws ecr delete-repository --repository-name ps-flagger-repository --force

Conclusion

This post explained the process for setting up an Amazon EKS cluster and how to leverage Flagger for progressive deployments along with Prometheus and Gloo Ingress Controller. You can enhance the deployments by integrating Flagger with Slack, Datadog, and webhook notifications for progressive deployments. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities such as new Kubernetes version deployments.

We encourage you to look into modernizing your DevOps platform from monolithic architecture to microservice-based architecture with Amazon EKS, and leverage Flagger with the right Ingress controller for secured and automated service releases.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

About the authors:

Automate data lineage on Amazon MWAA with OpenLineage

2023-01-18 Stephen Said

Post Syndicated from Stephen Said original https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/

In modern data architectures, datasets are combined across an organization using a variety of purpose-built services to unlock insights. As a result, data governance becomes a key component for data consumers and producers to know that their data-driven decisions are based on trusted and accurate datasets. One aspect of data governance is data lineage, which captures the flow of data as it goes through various systems and allows consumers to understand how a dataset was derived.

In order to capture data lineage consistently across various analytical services, you need to use a common lineage model and a robust job orchestration that is able to tie together diverse data flows. One possible solution is the open-source OpenLineage project. It provides a technology-agnostic metadata model for capturing data lineage and integrates with widely used tools. For job orchestration, it integrates with Apache Airflow, which you can run on AWS conveniently through the managed service Amazon Managed Workflows for Apache Airflow (Amazon MWAA). OpenLineage provides a plugin for Apache Airflow that extracts data lineage from Directed Acyclic Graphs (DAGs).

In this post, we show how to get started with data lineage on AWS using OpenLineage. We provide a step-by-step configuration guide for the openlineage-airflow plugin on Amazon MWAA. Additionally, we share an AWS Cloud Development Kit (AWS CDK) project that deploys a pre-configured demo environment for evaluating and experiencing OpenLineage first-hand.

OpenLineage on Apache Airflow

In the following example, Airflow turns OLTP data into a star schema on Amazon Redshift Serverless.

After staging and preparing source data from Amazon Simple Storage Service (Amazon S3), fact and dimension tables are eventually created. For this, Airflow orchestrates the execution of SQL statements that create and populate tables on Redshift Serverless.

Overview on DAGs in Amazon MWAA

The openlineage-airflow plugin collects metadata about creation of datasets and dependencies between them. This allows us to move from a jobs-centric approach of Airflow to a datasets-centric approach, improving the observability of workflows.

The following screenshot shows parts of the captured lineage for the previous example. It’s displayed in Marquez, an open-source metadata service for collection and visualization of data lineage with support for the OpenLineage standard. In Marquez, you can analyze the upstream datasets and transformations that eventually create the user dimension table on the right.

Data lineage graph in Marquez
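The shift from a jobs-centric to a datasets-centric view can be illustrated with a small sketch. The job and dataset names below are hypothetical and only loosely modeled on a staging-to-star-schema pipeline; the traversal is a plain graph walk, not how Marquez or OpenLineage store or query lineage.

```python
# Hypothetical jobs, each declaring the datasets it reads and writes.
JOBS = {
    "load_users_staging": {"inputs": ["s3.users_raw"], "outputs": ["staging.users"]},
    "load_sales_staging": {"inputs": ["s3.sales_raw"], "outputs": ["staging.sales"]},
    "build_dim_user": {"inputs": ["staging.users"], "outputs": ["dm.dim_user"]},
    "build_fact_sales": {
        "inputs": ["staging.sales", "dm.dim_user"],
        "outputs": ["dm.fact_sales"],
    },
}


def upstream_datasets(dataset, jobs=JOBS):
    """Return every dataset that the given dataset was derived from, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for job in jobs.values():
            if current in job["outputs"]:
                for source in job["inputs"]:
                    if source not in seen:
                        seen.add(source)
                        stack.append(source)
    return seen
```

Asking "what feeds dm.dim_user?" is a question about datasets, and it can be answered without knowing which DAG or task produced each table.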

The example in this post is based on SQL and Amazon Redshift. OpenLineage also supports other transformation engines and data stores such as Apache Spark and dbt.

Solution overview

The following diagram shows the AWS setup required to capture data lineage using OpenLineage.

Solution overview

The workflow includes the following components:

The openlineage-airflow plugin is configured on Airflow as a lineage backend. Metadata about the DAG runs is passed by Airflow core to the plugin, which converts it into OpenLineage format and sends it to an external metadata store. In our demo setup, we use Marquez as the metadata store.
The openlineage-airflow plugin receives its configuration from environment variables. To populate these variables on Amazon MWAA, a custom Airflow plugin is used. First, the plugin reads source values from AWS Secrets Manager. Then, it creates environment variables.
Secrets Manager is configured as a secrets backend. Typically, this type of configuration is stored in Airflow’s native metadata database. However, this approach has limitations. For instance, in case of multiple Airflow environments, you need to track and store credentials across multiple environments, and updating credentials requires you to update all the environments. With a secrets backend, you can centralize configuration.
For demo purposes, we collect data lineage from a data pipeline, which creates a star schema in Redshift Serverless.

In the following sections, we walk you through the steps for end-to-end configuration.

Install the openlineage-airflow plugin

Specify the following dependency in the requirements.txt file of the Amazon MWAA environment. Note that the latest Airflow version currently available on Amazon MWAA is 2.4.3; for this post, use the compatible version 0.19.2 of the plugin:

openlineage-airflow==0.19.2

For more details on installing Python dependencies on Amazon MWAA, refer to Installing Python dependencies.

For Airflow < 2.3, configure the plugin’s lineage backend through the following configuration overrides on the Amazon MWAA environment and load it immediately at Airflow start by disabling lazy load of plugins:

AirflowConfigurationOptions:
    core.lazy_load_plugins: False
    lineage.backend: openlineage.lineage_backend.OpenLineageBackend

For more information on configuration overrides, refer to Configuration options overview.

Configure the Secrets Manager backend with Amazon MWAA

Using Secrets Manager as a secrets backend for Amazon MWAA is straightforward. First, provide the execution role of Amazon MWAA with read permission to Secrets Manager. You can use the following policy template as a starting point:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": "arn:aws:secretsmanager:<AWS_REGION>:<YOUR_ACCOUNT_ID>:secret:*"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:ListSecrets",
      "Resource": "*"
    }
  ]
}
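If you prefer to generate the policy document rather than edit the template by hand, a sketch like the following may help. The function name `secrets_read_policy` is hypothetical, and the trailing wildcard on the secret ARN is an assumption about how broadly you want to scope read access; narrow it to a prefix such as airflow/ if that suits your setup.

```python
import json


def secrets_read_policy(region, account_id):
    """Build a read-only Secrets Manager policy for the given Region and account."""
    # Assumption: allow reading every secret in the account and Region.
    secret_arn = f"arn:aws:secretsmanager:{region}:{account_id}:secret:*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetResourcePolicy",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:ListSecretVersionIds",
                ],
                "Resource": secret_arn,
            },
            {
                # ListSecrets cannot be scoped to individual secrets, hence "*".
                "Effect": "Allow",
                "Action": "secretsmanager:ListSecrets",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder Region and account ID; replace with your own values.
    print(json.dumps(secrets_read_policy("us-east-1", "123456789012"), indent=2))
```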

Second, configure Secrets Manager as a backend in Amazon MWAA through the following configuration overrides:

AirflowConfigurationOptions:
    secrets.backend: airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend
    secrets.backend_kwargs: '{"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}'

For more information on configuring a secrets backend in Amazon MWAA, refer to Configuring an Apache Airflow connection using a Secrets Manager secret and Move your Apache Airflow connections and variables to AWS Secrets Manager.

Deploy a custom envvar plugin to Amazon MWAA

Apache Airflow has a built-in plugin manager through which it can be extended with custom functionality. In our case, this functionality is to populate OpenLineage-specific environment variables based on values in Secrets Manager. Natively, Amazon MWAA allows environment variables with the prefix AIRFLOW__, but the openlineage-airflow plugin expects variables with the prefix OPENLINEAGE_, such as OPENLINEAGE_URL.

The following Python code is used in the plugin. We assume the file is called envvar_plugin.py:

from airflow.plugins_manager import AirflowPlugin
from airflow.models import Variable
import os

os.environ["OPENLINEAGE_URL"] = Variable.get('OPENLINEAGE_URL', default_var='')

class EnvVarPlugin(AirflowPlugin):
  name = "env_var_plugin"

Amazon MWAA has a mechanism to install a plugin through a zip archive. You zip your code, upload the archive to an S3 bucket, and pass the URL to the file to Amazon MWAA:

zip plugins.zip envvar_plugin.py

Upload plugins.zip to an S3 bucket and configure the URL in Amazon MWAA. The following screenshot shows the configuration via the Amazon MWAA console.

Configuration of a custom plugin in Amazon MWAA

For more information on installing custom plugins on Amazon MWAA, refer to Creating a custom plugin that generates runtime environment variables.

Configure connectivity between the openlineage-airflow plugin and Marquez

As a last step, store the URL to Marquez in Secrets Manager. For this, create a secret called airflow/variables/OPENLINEAGE_URL with value <protocol>://<hostname/ip>:<port> (for example, https://marquez.mysite.com:5000).

Configuration of OPENLINEAGE_URL as secret
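The naming convention ties the secret to the variables_prefix configured earlier. The following sketch shows how the secret name is derived and adds a basic shape check for the URL value; both helper names are hypothetical, and the check is an illustrative convenience rather than anything the plugin performs.

```python
from urllib.parse import urlparse

# Matches the "variables_prefix" configured in secrets.backend_kwargs.
VARIABLES_PREFIX = "airflow/variables"


def secret_name_for_variable(key, prefix=VARIABLES_PREFIX, sep="/"):
    """Return the Secrets Manager secret name that backs an Airflow variable."""
    return f"{prefix}{sep}{key}"


def is_valid_openlineage_url(url):
    """Check that the value has the <protocol>://<hostname/ip>:<port> shape."""
    parsed = urlparse(url)
    try:
        port = parsed.port
    except ValueError:  # raised for a non-numeric or out-of-range port
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.hostname) and port is not None
```

With this convention, `Variable.get('OPENLINEAGE_URL')` in the plugin resolves to the secret named airflow/variables/OPENLINEAGE_URL.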

In case you need to spin up Marquez on AWS, you have multiple options to host, including running it on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Compute Cloud (Amazon EC2). Refer to Running Marquez on AWS or check out our infrastructure template in the next section to deploy Marquez on AWS.

Deploy with an AWS CDK-based solution template

Assuming you want to set up a demo infrastructure for all of the above in one step, you can use the following template based on the AWS CDK.

The template has the following prerequisites:

An AWS account.
Amazon Linux 2 with AWS CDK, Docker CLI, and Python3 installed. Alternatively, setting up an AWS Cloud9 environment will satisfy this requirement.

Complete the following steps to deploy the template:

Clone GitHub repository and install Python dependencies. Bootstrap the AWS CDK if required.

git clone https://github.com/aws-samples/aws-mwaa-openlineage
cd aws-mwaa-openlineage
python3 -m venv .env && source .env/bin/activate
python3 -m pip install -r requirements.txt
cdk bootstrap

Update the value for the variable EXTERNAL_IP in constants.py to your outbound IP for connecting to the internet:
```
# Set variable to outbound IP for connecting to the internet.
EXTERNAL_IP = "255.255.255.255"
```
This configures security groups so that you can access Marquez but block other clients. constants.py is found in the root folder of the cloned repository.
Deploy the VPC_S3 stack to provision a new VPC dedicated for this solution as well as the security groups that are used by the different components:
```
cdk deploy vpc-s3
```
It creates a new S3 bucket and uploads the source raw data based on the TICKIT sample database. This serves as the landing area from the OLTP database. We then need to parse the metadata of these files through an AWS Glue crawler, which facilitates the native integration between Amazon Redshift and the S3 data lake.
Deploy the lineage stack to create an EC2 instance that hosts Marquez:
```
cdk deploy marquez
```
Access the Marquez web UI through https://{ec2.public_dns_name}:3000/. This URL is also available as part of the AWS CDK outputs for the lineage stack.
Deploy the Amazon Redshift stack to create a Redshift Serverless endpoint:
```
cdk deploy redshift
```
Deploy the Amazon MWAA stack to create an Amazon MWAA environment:
```
cdk deploy mwaa
```
You can access the Amazon MWAA UI through the URL provided in the AWS CDK output.

Test a sample data pipeline

On Amazon MWAA, you can see an example data pipeline deployed that consists of two DAGs. It builds a star schema on top of the TICKIT sample database. One DAG is responsible for loading data from the S3 data lake into an Amazon Redshift staging layer; the second DAG loads data from the staging layer to the dimensional model.

Datamodel of star schema

Open the Amazon MWAA UI through the URL obtained in the deployment steps and launch the following DAGs: rs_source_to_staging and rs_staging_to_dm. As part of the run, the lineage metadata is sent to Marquez.

After the DAGs have run, open the Marquez URL obtained in the deployment steps. In Marquez, you can find the lineage metadata for the computed star schema and related data assets on Amazon Redshift.

Clean up

Delete the AWS CDK stacks to avoid ongoing charges for the resources that you created. Run the following command in the aws-mwaa-openlineage project directory so that all resources are undeployed:

cdk destroy --all

Summary

In this post, we showed you how to automate data lineage with OpenLineage on Amazon MWAA. As part of this, we covered how to install and configure the openlineage-airflow plugin on Amazon MWAA. Additionally, we provided a ready-to-use infrastructure template for a complete demo environment.

We encourage you to explore what else can be achieved with OpenLineage. A job orchestrator like Apache Airflow is only one piece of a data platform and not all possible data lineage can be captured on it. We recommend exploring OpenLineage’s integration with other platforms like Apache Spark or dbt. For more information, refer to Integrations.

Additionally, we recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon MWAA and data governance on AWS.

About the Authors

Stephen Said is a Senior Solutions Architect and works with Digital Native Businesses. His areas of interest are data analytics, data platforms and cloud-native software engineering.

Vishwanatha Nayak is a Senior Solutions Architect at AWS. He works with large enterprise customers helping them design and build secure, cost-effective, and reliable modern data platforms using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.

Paul Villena is an Analytics Solutions Architect with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure-as-code, serverless technologies and coding in Python.

Manually Approving Security Changes in CDK Pipeline

2023-01-18

Post Syndicated from original https://aws.amazon.com/blogs/devops/manually-approving-security-changes-in-cdk-pipeline/

In this post I will show you how to add a manual approval to AWS Cloud Development Kit (CDK) Pipelines to confirm security changes before deployment. With this solution, when a developer commits a change, CDK pipeline identifies an IAM permissions change, pauses execution, and sends a notification to a security engineer to manually approve or reject the change before it is deployed.

Introduction

In my role I talk to a lot of customers that are excited about the AWS Cloud Development Kit (CDK). One of the things they like is that L2 constructs often generate IAM and other security policies. This can save a lot of time and effort over hand coding those policies. Most customers also tell me that the policies generated by CDK are more secure than the policies they generate by hand.

However, these same customers are concerned that their security engineering team does not know what is in the policies CDK generates. In the past, these customers spent a lot of time crafting a handful of IAM policies that developers can use in their apps. These policies were well understood, but overly permissive because they were often reused across many applications.

Customers want more visibility into the policies CDK generates. Luckily CDK provides a mechanism to approve security changes. If you are using CDK, you have probably been prompted to approve security changes when you run cdk deploy at the command line. That works great on a developer’s machine, but customers want to build the same confirmation into their continuous delivery pipeline. CDK provides a mechanism for this with the ConfirmPermissionsBroadening action. Note that ConfirmPermissionsBroadening is only supported by the AWS CodePipeline deployment engine.

Background

Before I talk about ConfirmPermissionsBroadening, let me review how CDK creates IAM policies. Consider the “Hello, CDK” application created in AWS CDK Workshop. At the end of this module, I have an AWS Lambda function and an Amazon API Gateway defined by the following CDK code.

// defines an AWS Lambda resource
const hello = new lambda.Function(this, 'HelloHandler', {
  runtime: lambda.Runtime.NODEJS_14_X,    // execution environment
  code: lambda.Code.fromAsset('lambda'),  // code loaded from "lambda" directory
  handler: 'hello.handler'                // file is "hello", function is "handler"
});

// defines an API Gateway REST API resource backed by our "hello" function.
new apigw.LambdaRestApi(this, 'Endpoint', {
  handler: hello
});

Note that I did not need to define the IAM Role or Lambda Permissions. I simply passed a reference to the Lambda function to the API Gateway (line 10 above). CDK understood what I was doing and generated the permissions for me. For example, CDK generated the following Lambda Permission, among others.

{
  "Effect": "Allow",
  "Principal": {
    "Service": "apigateway.amazonaws.com"
  },
  "Action": "lambda:InvokeFunction",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloHandler2E4FBA4D",
  "Condition": {
    "ArnLike": {
      "AWS:SourceArn": "arn:aws:execute-api:us-east-1:123456789012:9y6ioaohv0/prod/*/"
    }
  }
}

Notice that CDK generated a narrowly scoped policy that allows a specific API (line 10 above) to call a specific Lambda function (line 7 above). This policy cannot be reused elsewhere. Later in the same workshop, I created a Hit Counter Construct using a Lambda function and an Amazon DynamoDB table. Again, I associated them using a single line of CDK code.

table.grantReadWriteData(this.handler);

As in the prior example, CDK generated a narrowly scoped IAM policy. This policy allows the Lambda function to perform certain actions (lines 4-11) on a specific table (line 14 below).

{
  "Effect": "Allow",
  "Action": [
    "dynamodb:BatchGetItem",
    "dynamodb:ConditionCheckItem",
    "dynamodb:DescribeTable",
    "dynamodb:GetItem",
    "dynamodb:GetRecords",
    "dynamodb:GetShardIterator",
    "dynamodb:Query",
    "dynamodb:Scan"
  ],
  "Resource": [
    "arn:aws:dynamodb:us-east-1:123456789012:table/HelloHitCounterHits"
  ]
}

As you can see, CDK is doing a lot of work for me. In addition, CDK is creating narrowly scoped policies for each resource, rather than sharing a broadly scoped policy in multiple places.

CDK Pipelines Permissions Checks

Now that I have reviewed how CDK generates policies, let’s discuss how I can use this in a Continuous Deployment pipeline. Specifically, I want to allow CDK to generate policies, but I want a security engineer to review any changes using a manual approval step in the pipeline. Of course, I don’t want security to be a bottleneck, so I will only require approval when security statements or traffic rules are added. The pipeline should skip the manual approval if there are no new security rules added.

Let’s continue to use CDK Workshop as an example. In the CDK Pipelines module, I used CDK to configure AWS CodePipeline to deploy the “Hello, CDK” application I discussed above. One of the last things I do in the workshop is add a validation test using a post-deployment step. Adding a permission check is similar, but I will use a pre-deployment step to ensure the permission check happens before deployment.

First, I will import ConfirmPermissionsBroadening from the pipelines package.

import {ConfirmPermissionsBroadening} from "aws-cdk-lib/pipelines";

Then, I can simply add ConfirmPermissionsBroadening to the deployStage using the addPre method as follows.

const deploy = new WorkshopPipelineStage(this, 'Deploy');
const deployStage = pipeline.addStage(deploy);

deployStage.addPre(
  new ConfirmPermissionsBroadening("PermissionCheck", {
    stage: deploy
  })
);

deployStage.addPost(
    // Post Deployment Test Code Omitted
)

Once I commit and push this change, a new manual approval step called PermissionCheck.Confirm is added to the Deploy stage of the pipeline. In the future, if I push a change that adds additional rules, the pipeline will pause here and await manual approval as shown in the screenshot below.

Figure 1. Pipeline waiting for manual review

When the security engineer clicks the review button, she is presented with the following dialog. From here, she can click the URL to see a summary of the change I am requesting which was captured in the build logs. She can also choose to approve or reject the change and add comments if needed.

Figure 2. Manual review dialog with a link to the build logs

When the security engineer clicks the review URL, she is presented with the following summary of security changes.

Figure 3. Summary of security changes in the build logs

The final feature I want to add is an email notification so the security engineer knows when there is something to approve. To accomplish this, I create a new Amazon Simple Notification Service (SNS) topic and subscription and associate it with the ConfirmPermissionsBroadening Check.

// Create an SNS topic and subscription for security approvals
const topic = new sns.Topic(this, 'SecurityApproval');
topic.addSubscription(new subscriptions.EmailSubscription('[email protected]'));

deployStage.addPre(
  new ConfirmPermissionsBroadening("PermissionCheck", {
    stage: deploy,
    notificationTopic: topic
  })
);

With the notification configured, the security engineer will receive an email when an approval is needed. She will have an opportunity to review the security change I made and assess the impact. This gives the security engineering team the visibility they want into the policies CDK is generating. In addition, the approval step is skipped if a change does not add security rules, so the security engineer does not become a bottleneck in the deployment process.
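The "approve only when rules are added" behavior can be pictured as a set comparison between the statements already deployed and the statements in the proposed change. The following Python sketch is only an illustration of that idea. It is not how CDK implements ConfirmPermissionsBroadening, and the function names (statement_key, needs_approval) and the sample statements are assumptions made for this example.

```python
import json


def statement_key(statement):
    """Reduce a policy statement to a canonical, hashable string."""
    return json.dumps(statement, sort_keys=True)


def needs_approval(deployed_statements, proposed_statements):
    """Return True only when the proposal adds statements not already deployed."""
    deployed = {statement_key(s) for s in deployed_statements}
    proposed = {statement_key(s) for s in proposed_statements}
    return bool(proposed - deployed)


# Hypothetical, simplified statements used only for this illustration.
invoke = {
    "Effect": "Allow",
    "Principal": {"Service": "apigateway.amazonaws.com"},
    "Action": "lambda:InvokeFunction",
}
read_table = {"Effect": "Allow", "Action": ["dynamodb:GetItem", "dynamodb:Query"]}

unchanged = needs_approval([invoke], [invoke])              # nothing new
broadened = needs_approval([invoke], [invoke, read_table])  # a statement was added
narrowed = needs_approval([invoke, read_table], [invoke])   # a statement was removed
print(unchanged, broadened, narrowed)
```

In this simplified model, only the second change would pause for review. Removing a statement or leaving the policies untouched would pass straight through.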

Conclusion

AWS Cloud Development Kit (CDK) automates the generation of IAM and other security policies. This can save a lot of time and effort, but security engineering teams want visibility into the policies CDK generates. To address this, CDK Pipelines provides the ConfirmPermissionsBroadening action. When you add ConfirmPermissionsBroadening to your CI/CD pipeline, CDK will wait for manual approval before deploying a change that includes new security rules.

Build near real-time logistics dashboards using Amazon Redshift and Amazon Managed Grafana for better operational intelligence

2023-01-17 Paul Villena

Post Syndicated from Paul Villena original https://aws.amazon.com/blogs/big-data/build-near-real-time-logistics-dashboards-using-amazon-redshift-and-amazon-managed-grafana-for-better-operational-intelligence/

Amazon Redshift is a fully managed data warehousing service that is currently helping tens of thousands of customers manage analytics at scale. It continues to lead price-performance benchmarks, and separates compute and storage so each can be scaled independently and you only pay for what you need. It also eliminates data silos by simplifying access to your operational databases, data warehouse, and data lake with consistent security and governance policies.

With the Amazon Redshift streaming ingestion feature, it’s easier than ever to access and analyze data coming from real-time data sources. It simplifies the streaming architecture by providing native integration between Amazon Redshift and the streaming engines in AWS, which are Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming data sources like system logs, social media feeds, and IoT streams can continue to push events to the streaming engines, and Amazon Redshift becomes just another consumer. Before Amazon Redshift streaming ingestion was available, we had to stage the streaming data first in Amazon Simple Storage Service (Amazon S3) and then run the COPY command to load it into Amazon Redshift. Eliminating the need to stage data in Amazon S3 results in faster performance and improved latency. With this feature, we can ingest hundreds of megabytes of data per second and have a latency of just a few seconds.

Another common challenge for our customers is the additional skill required when using streaming data. In Amazon Redshift streaming ingestion, only SQL is required. We use SQL to do the following:

Define the integration between Amazon Redshift and our streaming engines with the creation of external schema
Create the different streaming database objects that are actually materialized views
Query and analyze the streaming data
Generate new features that are used to predict delays using machine learning (ML)
Perform inferencing natively using Amazon Redshift ML

In this post, we build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. Our example is an operational intelligence dashboard for a logistics company that provides situational awareness and augmented intelligence for their operations team. From this dashboard, the team can see the current state of their consignments and their logistics fleet based on events that happened only a few seconds ago. It also shows the consignment delay predictions of an Amazon Redshift ML model that helps them proactively respond to disruptions before they even happen.

Solution overview

This solution is composed of the following components, and the provisioning of resources is automated using the AWS Cloud Development Kit (AWS CDK):

Multiple streaming data sources are simulated through Python code running in our serverless compute service, AWS Lambda
The streaming events are captured by Amazon Kinesis Data Streams, which is a highly scalable serverless streaming data service
We use the Amazon Redshift streaming ingestion feature to process and store the streaming data and Amazon Redshift ML to predict the likelihood of a consignment getting delayed
We use AWS Step Functions for serverless workflow orchestration
The solution includes a consumption layer built on Amazon Managed Grafana where we can visualize the insights and even generate alerts through Amazon Simple Notification Service (Amazon SNS) for our operations team

The following diagram illustrates our solution architecture.

Prerequisites

The project has the following prerequisites:

An AWS account.
Amazon Linux 2 with AWS CDK, Docker CLI, and Python3 installed. Alternatively, setting up an AWS Cloud9 environment will satisfy this requirement.
To run this code, you need elevated privileges into the AWS account you are using.

Sample deployment using the AWS CDK

The AWS CDK is an open-source project that allows you to define your cloud infrastructure using familiar programming languages. It uses high-level constructs to represent AWS components to simplify the build process. In this post, we use Python to define the cloud infrastructure due to its familiarity to many data and analytics professionals.

Clone the GitHub repository and install the Python dependencies:

git clone https://github.com/aws-samples/amazon-redshift-streaming-workshop
cd amazon-redshift-streaming-workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Next, bootstrap the AWS CDK. This sets up the resources required by the AWS CDK to deploy into the AWS account. This step is only required if you haven’t used the AWS CDK in the deployment account and Region.

cdk bootstrap

Deploy all stacks:

cdk deploy IngestionStack 
cdk deploy RedshiftStack 
cdk deploy StepFunctionStack

The entire deployment takes 10–15 minutes.

Access streaming data using Amazon Redshift streaming ingestion

The AWS CDK deployment provisions an Amazon Redshift cluster with the appropriate default IAM role to access the Kinesis data stream. We can create an external schema to establish a connection between the Amazon Redshift cluster and the Kinesis data stream:

CREATE EXTERNAL SCHEMA ext_kinesis FROM KINESIS
IAM_ROLE default;

For instructions on how to connect to the cluster, refer to Connecting to the Redshift Cluster.

We use a materialized view to parse data in the Kinesis data stream. In this case, the whole payload is ingested as is and stored using the SUPER data type in Amazon Redshift. Data stored in streaming engines is usually in semi-structured format, and the SUPER data type provides a fast and efficient way to analyze semi-structured data within Amazon Redshift.

See the following code:

CREATE MATERIALIZED VIEW consignment_stream AS
SELECT approximate_arrival_timestamp,
JSON_PARSE(from_varbyte(kinesis_data, 'utf-8')) as consignment_data
FROM ext_kinesis.consignment_stream
WHERE is_utf8(kinesis_data)
AND is_valid_json(from_varbyte(kinesis_data, 'utf-8'));
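To make the filtering in this view easier to reason about, the following Python sketch mimics it outside of Amazon Redshift: a record is kept only if its bytes decode as UTF-8 and the decoded text parses as JSON. This is an approximation for illustration only. Python's json module may accept or reject some inputs differently from Redshift's is_valid_json, and the parse_record helper and sample payloads are assumptions made for this example.

```python
import json


def parse_record(kinesis_data):
    """Return the parsed JSON payload, or None if the record would be filtered out."""
    try:
        text = kinesis_data.decode("utf-8")  # roughly is_utf8 + from_varbyte
        return json.loads(text)              # roughly is_valid_json + JSON_PARSE
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None


# Hypothetical payloads: one good record and two that the WHERE clause would drop.
records = [
    b'{"consignmentid": 1, "origin_state": "NSW"}',  # valid UTF-8, valid JSON
    b"\xff\xfe not utf-8",                           # invalid encoding
    b"{not json}",                                   # valid UTF-8, invalid JSON
]

parsed = [p for p in (parse_record(r) for r in records) if p is not None]
print(parsed)
```

Filtering malformed records at this stage means a single bad event does not cause the refresh to fail.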

Refreshing the materialized view invokes Amazon Redshift to read data directly from the Kinesis data stream and load it into the materialized view. This refresh can be done automatically by adding the AUTO REFRESH clause in the materialized view definition. However, in this example, we are orchestrating the end-to-end data pipeline using AWS Step Functions.

REFRESH MATERIALIZED VIEW consignment_stream;

Now we can start running queries against our streaming data and unify it with other datasets like logistics fleet data. If we want to know the distribution of our consignments across different states, we can easily unpack the contents of the JSON payload using the PartiQL syntax.

SELECT cs.consignment_data.origin_state::VARCHAR,
COUNT(1) number_of_consignments,
AVG(on_the_move) running_fleet,
AVG(scheduled_maintenance + unscheduled_maintenance) under_maintenance
FROM consignment_stream cs
INNER JOIN fleet_summary fs
on TRIM(cs.consignment_data.origin_state::VARCHAR) = fs.vehicle_location
GROUP BY 1

Generate features using Amazon Redshift SQL functions

The next step is to transform and enrich the streaming data using Amazon Redshift SQL to generate additional features that will be used by Amazon Redshift ML for its predictions. We use date and time functions to identify the day of the week, and calculate the number of days between the order date and target delivery date.

We also use geospatial functions, specifically ST_DistanceSphere, to calculate the distance between origin and destination locations. The GEOMETRY data type within Amazon Redshift provides a cost-effective way to analyze geospatial data such as longitude and latitude at scale. In this example, the addresses have already been converted to longitude and latitude. However, if you need to perform geocoding, you can integrate Amazon Location Service with Amazon Redshift using user-defined functions (UDFs). On top of geocoding, Amazon Location Service also allows you to more accurately calculate route distance between origin and destination, and even specify waypoints along the way.

We use another materialized view to persist these transformations. A materialized view provides a simple yet efficient way to create data pipelines using its incremental refresh capability. Amazon Redshift identifies the incremental changes from the last refresh and only updates the target materialized view based on these changes. In this materialized view, all our transformations are deterministic, so we expect our data to be consistent when going through a full refresh or an incremental refresh.

See the following code:

CREATE MATERIALIZED VIEW consignment_transformed AS
SELECT
consignment_data.consignmentid::INT consignment_id,
consignment_data.consignment_date::TIMESTAMP consignment_date,
consignment_data.delivery_date::TIMESTAMP delivery_date,
consignment_data.origin_state::VARCHAR origin_state,
consignment_data.destination_state::VARCHAR destination_state,
consignment_data.revenue::FLOAT revenue,
consignment_data.cost::FLOAT cost,
DATE_PART(dayofweek, consignment_data.consignment_date::TIMESTAMP)::INT day_of_week,
DATE_PART(hour, consignment_data.consignment_date::TIMESTAMP)::INT "hour",
DATEDIFF(days,
consignment_data.consignment_date::TIMESTAMP,
consignment_data.delivery_date::TIMESTAMP
)::INT days_to_deliver,
(ST_DistanceSphere(
-- ST_Point takes (x, y), that is (longitude, latitude)
ST_Point(consignment_data.origin_long::FLOAT, consignment_data.origin_lat::FLOAT),
ST_Point(consignment_data.destination_long::FLOAT, consignment_data.destination_lat::FLOAT)
) / 1000 --convert to km
) delivery_distance
FROM consignment_stream;
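For readers who want to sanity-check these features outside the database, the following Python sketch computes the same four model inputs for a single consignment. It is a rough approximation rather than a reimplementation of the Redshift functions. The haversine formula and the 6371 km Earth radius are assumptions, so distances may differ slightly from ST_DistanceSphere, and the build_features helper and sample coordinates are made up for this example.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius; Redshift's default may differ slightly


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_features(consignment_date, delivery_date, origin, destination):
    """Derive the four model inputs for one consignment (illustrative only)."""
    return {
        # Redshift numbers days 0-6 starting on Sunday; Python's weekday() starts on Monday.
        "day_of_week": (consignment_date.weekday() + 1) % 7,
        "hour": consignment_date.hour,
        # DATEDIFF(days, ...) counts calendar-day boundaries, so compare dates only.
        "days_to_deliver": (delivery_date.date() - consignment_date.date()).days,
        "delivery_distance": haversine_km(*origin, *destination),
    }


features = build_features(
    datetime(2023, 1, 15, 22, 30),  # a Sunday evening
    datetime(2023, 1, 18, 9, 0),
    (-33.8688, 151.2093),           # Sydney (lat, long), sample values
    (-37.8136, 144.9631),           # Melbourne (lat, long), sample values
)
print(features)
```

Because every step is a pure function of the input row, running it twice on the same consignment gives the same result, which is the property that makes incremental refresh safe for this view.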

Predict delays using Amazon Redshift ML

We can use this enriched data to make predictions on the delay probability of a consignment. Amazon Redshift ML is a feature of Amazon Redshift that allows you to use the power of Amazon Redshift to build, train, and deploy ML models directly within your data warehouse.

The training of a new Amazon Redshift ML model has been initiated as part of the AWS CDK deployment using the CREATE MODEL statement. The training dataset is defined in the FROM clause, and TARGET defines which column the model is trying to predict. The FUNCTION clause defines the name of the function that is used to make predictions.

CREATE MODEL ml_delay_prediction -- already executed by CDK
FROM (SELECT * FROM ext_s3.consignment_train)
TARGET probability
FUNCTION fnc_delay_probability
IAM_ROLE default
SETTINGS (
MAX_RUNTIME 1800, --seconds
S3_BUCKET '<ingestionstack-s3bucketname>' --replace S3 bucket name
)

This simplified model is trained using historical observations, and the training process takes around 30 minutes to complete. You can check the status of the training job by running the SHOW MODEL statement:

SHOW MODEL ml_delay_prediction;

When the model is ready, we can start making predictions on new data that is streamed into Amazon Redshift. Predictions are generated using the Amazon Redshift ML function that was defined during the training process. We pass the calculated features from the transformed materialized view into this function, and the prediction results populate the delay_probability column.

This final output is persisted into the consignment_predictions table, and Step Functions orchestrates the ongoing incremental data load into this target table. We use a table for the final output, instead of a materialized view, because ML predictions involve randomness and may give us non-deterministic results. Using a table gives us more control over how data is loaded.

See the following code:

CREATE TABLE consignment_predictions AS
SELECT *, fnc_delay_probability(
day_of_week, "hour", days_to_deliver, delivery_distance) delay_probability
FROM consignment_transformed;

Create an Amazon Managed Grafana dashboard

We use Amazon Managed Grafana to create a near real-time logistics dashboard. Amazon Managed Grafana is a fully managed service that makes it easy to create, configure, and share interactive dashboards and charts for monitoring your data. We can also use Grafana to set up alerts and notifications based on specific conditions or thresholds, allowing us to quickly identify and respond to issues.

The high-level steps in setting up the dashboard are as follows:

Create a Grafana workspace.
Set up Grafana authentication using AWS IAM Identity Center (successor to AWS Single Sign-On) or using direct SAML integration.
Configure Amazon Redshift as a Grafana data source.
Import the JSON file for the near real-time logistics dashboard.

A more detailed set of instructions is available in the GitHub repository for your reference.

Clean up

To avoid ongoing charges, delete the resources deployed. Access the Amazon Linux 2 environment and run the AWS CDK destroy command. Delete the Grafana objects related to this deployment.

cd amazon-redshift-streaming-workshop
source .venv/bin/activate
cdk destroy --all

Conclusion

In this post, we showed how easy it is to build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. We created an end-to-end modern data pipeline using only SQL. This shows how Amazon Redshift is a powerful platform for democratizing your data—it enables a wide range of users, including business analysts, data scientists, and others, to work with and analyze data without requiring specialized technical skills or expertise.

We encourage you to explore what else can be achieved with Amazon Redshift and Amazon Managed Grafana. We also recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon Redshift.

About the Author

Paul Villena is an Analytics Solutions Architect in AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure-as-code, serverless technologies and coding in Python.

How to revoke federated users’ active AWS sessions

2023-01-16 Matt Howard

Post Syndicated from Matt Howard original https://aws.amazon.com/blogs/security/how-to-revoke-federated-users-active-aws-sessions/

When you use a centralized identity provider (IdP) for human user access, changes that an identity administrator makes to a user within the IdP won’t invalidate the user’s existing active Amazon Web Services (AWS) sessions. This is due to the nature of session durations that are configured on assumed roles. This situation presents a challenge for identity administrators.

In this post, you’ll learn how to revoke access to specific users’ sessions on AWS assumed roles through the use of AWS Identity and Access Management (IAM) policies and service control policies (SCPs) via AWS Organizations.

Session duration overview

When you configure IAM roles, you have the option of configuring a maximum session duration that specifies how long a session is valid. By default, the temporary credentials provided to the user will last for one hour, but you can change this to a value of up to 12 hours.

When a user assumes a role in AWS by using their IdP credentials, that role’s credentials will remain valid for the length of their session duration. It’s convenient for end users to have a maximum session duration set to 12 hours, because this prevents their sessions from frequently timing out and then requiring re-login. However, a longer session duration also poses a challenge if you, as an identity administrator, attempt to revoke or modify a user’s access to AWS from your IdP.

For example, user John Doe is leaving the company and you want to verify that John has his privileges within AWS revoked. If John has access to IAM roles with long session durations, then he might have residual access to AWS despite having his session revoked or his user identity deleted within the IdP. Perhaps John assumed a role for his daily work at 8 AM and then you revoked his credentials within the IdP at 9 AM. Because John had already assumed an AWS role, he would still have access to AWS through that role until the session expired, which would be 8 PM if the session was configured for 12 hours. Therefore, as a security best practice, AWS recommends that you do not set the session duration length longer than is needed. This example is displayed in Figure 1.

Figure 1: Session duration overview
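The arithmetic behind this example is simple, but it helps to see how much the residual access window shrinks with a shorter session. The following Python sketch is illustrative only. The residual_access helper is a hypothetical name and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def residual_access(assumed_at, revoked_at, session_duration):
    """Return the session expiry and how long the session outlives the IdP revocation."""
    expires_at = assumed_at + session_duration
    return expires_at, max(expires_at - revoked_at, timedelta(0))


assumed = datetime(2023, 1, 16, 8, 0)  # role assumed at 8 AM (date is arbitrary)
revoked = datetime(2023, 1, 16, 9, 0)  # IdP access revoked at 9 AM

expires_12h, residual_12h = residual_access(assumed, revoked, timedelta(hours=12))
expires_1h, residual_1h = residual_access(assumed, revoked, timedelta(hours=1))

print(expires_12h.time(), residual_12h)  # 20:00:00 11:00:00
print(expires_1h.time(), residual_1h)    # 09:00:00 0:00:00
```

With a 12-hour session, the credentials stay usable for 11 hours after the IdP revocation. With the default one-hour session, they expire at the moment of revocation.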

In order to restrict access despite the session duration being active, you could update the roles that are assumable from an IdP with a deny-all policy or delete the role entirely. However, this is a disruptive action for the users that have access to this role. If the role was deleted or the policy was updated to deny all, then users would no longer be able to assume the role or access their AWS environment. Instead, the recommended approach is to revoke access based on the specific user’s principalId or sourceIdentity values.

The principalId is the unique identifier for the entity that made the API call. When requests are made with temporary credentials, such as assumed roles through IdPs, this value also includes the session name, such as [email protected]. The sourceIdentity identifies the original user identity that is making the request, such as a user who is authenticated through SAML federation from an IdP. As a best practice, AWS recommends that you configure this value within the IdP, because this improves traceability for user sessions within AWS. You can find more information on this functionality in the blog post, How to integrate AWS STS SourceIdentity with your identity provider.

Identify the principalId and sourceIdentity by using CloudTrail

You can use AWS CloudTrail to review the actions taken by a user, role, or AWS service that are recorded as events. In the following procedure, you will use CloudTrail to identify the principalId and sourceIdentity contained in the CloudTrail record contents for your IdP assumed role.

To identify the principalId and sourceIdentity by using CloudTrail

Assume a role in AWS by signing in through your IdP.
Perform an action such as creating an S3 bucket.
Navigate to the CloudTrail service.
In the navigation pane, choose Event History.
For Lookup attributes, choose Event name. For Event name, enter CreateBucket.

Figure 2: Looking up the CreateBucket event in the CloudTrail event history

Select the corresponding event record and review the event details. An example showing the userIdentity element is as follows.


"userIdentity": {
	"type": "AssumedRole",
	"principalId": "AROATVGBKRLCHXEXAMPLE:[email protected]",
	"arn": "arn:aws:sts::111122223333:assumed-role/roleexample/[email protected]",
	"accountId": "111122223333",
	"accessKeyId": "ASIATVGBKRLCJEXAMPLE",
	"sessionContext": {
		"sessionIssuer": {
			"type": "Role",
			"principalId": "AROATVGBKRLCHXEXAMPLE",
			"arn": "arn:aws:iam::111122223333:role/roleexample",
			"accountId": "111122223333",
			"userName": "roleexample"
		},
		"webIdFederationData": {},
		"attributes": {
			"creationDate": "2022-07-05T15:48:28Z",
			"mfaAuthenticated": "false"
		},
		"sourceIdentity": "[email protected]"
	}
}

In this event record, you can see that principalId is “AROATVGBKRLCHXEXAMPLE:[email protected]” and sourceIdentity was specified as “[email protected]”. Now that you have these values, let’s explore how you can revoke access by using SCP and IAM policies.
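If you export event records rather than reading them in the console, the two values can be pulled out programmatically. The following Python sketch shows one way to do that, assuming a trimmed-down record with the same shape as the userIdentity element above. The extract_identifiers helper is a hypothetical name, and the email address is a placeholder.

```python
import json

# Hypothetical, trimmed-down record; the email address is a placeholder.
record = json.loads("""
{
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROATVGBKRLCHXEXAMPLE:jdoe@example.com",
    "sessionContext": {
      "sourceIdentity": "jdoe@example.com"
    }
  }
}
""")


def extract_identifiers(event):
    """Return (role_id, session_name, source_identity) from a CloudTrail event."""
    identity = event["userIdentity"]
    # For assumed roles, principalId has the form "<role-id>:<role-session-name>".
    role_id, _, session_name = identity["principalId"].partition(":")
    # sourceIdentity is only present when the IdP (or caller) set it.
    source_identity = identity.get("sessionContext", {}).get("sourceIdentity")
    return role_id, session_name, source_identity


role_id, session_name, source_identity = extract_identifiers(record)
print(role_id, session_name, source_identity)
```

The session name is the part that an aws:userid condition can match with a wildcard prefix. The source identity is matched separately through aws:SourceIdentity.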

Use an SCP to deny users based on IdP user name or revoke session token

First, you will create an SCP, a policy that can be applied to an organization to offer central control of the maximum available permissions across the accounts in the organization. More information on SCPs, including steps to create and apply them, can be found in the AWS Organizations User Guide.

The SCP will have a deny-all statement with a condition for aws:userid, which will evaluate the principalId field; and a condition for aws:SourceIdentity, which will evaluate the sourceIdentity field. In the following example SCP, the users John Doe and Mary Major are prevented from accessing AWS, in member accounts, regardless of their session duration, because each action will check against their aws:userid and aws:SourceIdentity values and be denied accordingly.

SCP to deny access based on IdP user name


{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Deny",
			"Action": "*",
			"Resource": "*",
			"Condition": {
				"StringLike": {
					"aws:userid": [
						"*:[email protected]",
						"*:[email protected]"
					]
				}
			}
		},
		{
			"Effect": "Deny",
			"Action": "*",
			"Resource": "*",
			"Condition": {
				"StringEquals": {
					"aws:SourceIdentity": [
						"[email protected]",
						"[email protected]"
					]
				}
			}
		}
	]
}
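When the list of users to block changes often, it can be convenient to generate this SCP rather than edit it by hand. The following Python sketch builds the same two-statement policy from a list of IdP user names and includes a rough local check of which sessions it would match. The helper names and email addresses are assumptions made for this example, and the matching logic is only an approximation of IAM policy evaluation. In particular, fnmatchcase also treats square brackets as special characters, which IAM's StringLike does not.

```python
import json
from fnmatch import fnmatchcase


def build_deny_scp(user_names):
    """Build the two-statement deny SCP for a list of IdP user names."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny", "Action": "*", "Resource": "*",
                "Condition": {"StringLike": {"aws:userid": [f"*:{u}" for u in user_names]}},
            },
            {
                "Effect": "Deny", "Action": "*", "Resource": "*",
                "Condition": {"StringEquals": {"aws:SourceIdentity": list(user_names)}},
            },
        ],
    }


def is_denied(scp, userid, source_identity=None):
    """Rough local approximation of how the two conditions match a request."""
    like, equals = (s["Condition"] for s in scp["Statement"])
    by_userid = any(fnmatchcase(userid, p) for p in like["StringLike"]["aws:userid"])
    by_source = source_identity in equals["StringEquals"]["aws:SourceIdentity"]
    return by_userid or by_source


# Placeholder user names, not real identities.
scp = build_deny_scp(["jdoe@example.com", "mmajor@example.com"])
print(json.dumps(scp, indent=2))

direct = is_denied(scp, "AROAEXAMPLEID:jdoe@example.com")
other = is_denied(scp, "AROAEXAMPLEID:someone@example.com")
print(direct, other)
```

The wildcard prefix in the aws:userid patterns is what lets one entry cover every role the user might have assumed, because the role ID portion of the value differs per role.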

Use an IAM policy to revoke access in the AWS Organizations management account

SCPs do not affect users or roles in the AWS Organizations management account and instead only affect the member accounts in the organization. Therefore, using an SCP alone to deny access may not be sufficient. However, identity administrators can revoke access in a similar way within their management account by using the following procedure.

To create an IAM policy in the management account

Sign in to the AWS Management Console by using your AWS Organizations management account credentials.
Follow these steps to use the JSON policy editor to create an IAM policy. Use the JSON of the SCP shown in the preceding section, SCP to deny access based on IdP user name, in the IAM JSON editor.
Follow these steps to add the IAM policy to roles that IdP users may assume within the account.

Revoke active sessions when role chaining

At this point, the user actions on the IdP assumable roles within the AWS organization have been blocked. However, there is still an edge case if the target users use role chaining (use an IdP assumedRole credential to assume a second role) that uses a different RoleSessionName than the one assigned by the IdP. In a role chaining situation, the users will still have access by using the cached credentials for the second role.

This is where the sourceIdentity field is valuable. After a source identity is set, it is present in requests for AWS actions that are taken during the role session. The value that is set persists when a role is used to assume another role (role chaining). The value that is set cannot be changed during the role session. Therefore, it’s recommended that you configure the sourceIdentity field within the IdP as explained previously. This concept is shown in Figure 3.

Figure 3: Role chaining with sourceIdentity configured

A user assumes an IAM role via their IdP (#1), and the CloudTrail record displays sourceIdentity: [email protected] (#2). When the user assumes a new role within AWS (#3), that CloudTrail record continues to display sourceIdentity: [email protected] despite the principalId changing (#4).
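The behavior in Figure 3 can be modeled in a few lines. The following Python sketch is a toy model, not an AWS API: the Session class, role IDs, and session names are all hypothetical. It only shows that the value seen by aws:userid changes when a role is chained, while the source identity carries over.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    role_id: str
    session_name: str
    source_identity: Optional[str] = None

    @property
    def userid(self):
        # Value seen by the aws:userid condition key for an assumed role.
        return f"{self.role_id}:{self.session_name}"

    def assume_role(self, role_id, session_name):
        # Role chaining: new role ID and session name, same source identity.
        return Session(role_id, session_name, self.source_identity)


# Placeholder identifiers for illustration.
first = Session("AROAFIRSTEXAMPLE", "jdoe@example.com", "jdoe@example.com")
second = first.assume_role("AROASECONDEXAMPLE", "batch-job")

print(first.userid, first.source_identity)
print(second.userid, second.source_identity)
```

In this model, a pattern such as "*:jdoe@example.com" no longer matches the second session's userid, but a condition on the source identity still does.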

However, if a second role is assumed in the account through role chaining and the sourceIdentity is not set, then it’s recommended that you revoke the issued session tokens for the second role. In order to do this, you can use the SCP at the end of this section, SCP to revoke active sessions for assumed roles. When you use this policy, the issued credentials related to the roles specified will be revoked for the users currently using them, and only users who were not denied through the previous SCP or IAM policies restricting their aws:userid will be able to reassume the target roles to obtain a new temporary credential.

If you take this approach, you will need to use an SCP to apply across the organization’s member accounts. The SCP must have the human-assumable roles for role chaining listed and a token issue time set to a specific time when you want users’ access revoked. (Normally, this time window would be set to the present time to immediately revoke access, but there might be circumstances in which you wish to revoke the access at a future date, such as when a user moves to a new project or team and therefore requires different access levels.) In addition, you will need to follow the same procedures in your management account by creating a customer-managed policy by using the same JSON with the condition statement for aws:PrincipalArn removed. Then attach the customer managed policy to the individual roles that are human-assumable through role chaining.

SCP to revoke active sessions for assumed roles


{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "RevokeActiveSessions",
			"Effect": "Deny",
			"Action": [
				"*"
			],
			"Resource": [
				"*"
			],
			"Condition": {
				"StringEquals": {
					"aws:PrincipalArn": [
						"arn:aws:iam::<account-id>:role/<role-name>",
						"arn:aws:iam::<account-id>:role/<role-name>"
					]
				},
				"DateLessThan": {
					"aws:TokenIssueTime": "2022-06-01T00:00:00Z"
				}
			}
		}
	]
}
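Because the cutoff time is usually "now", it can help to generate this policy at the moment you need it. The following Python sketch builds the same SCP for a list of role ARNs and a cutoff timestamp, and includes a simplified check that mirrors the DateLessThan comparison. The helper names are hypothetical, the role ARN reuses the example account and role from earlier in this post, and the check ignores every other aspect of policy evaluation.

```python
from datetime import datetime, timedelta, timezone


def build_revocation_scp(role_arns, cutoff):
    """Deny requests from the listed roles whose credentials predate the cutoff."""
    cutoff_utc = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "RevokeActiveSessions",
            "Effect": "Deny",
            "Action": ["*"],
            "Resource": ["*"],
            "Condition": {
                "StringEquals": {"aws:PrincipalArn": list(role_arns)},
                "DateLessThan": {"aws:TokenIssueTime": cutoff_utc},
            },
        }],
    }


def is_revoked(token_issued_at, cutoff):
    """DateLessThan: the deny applies when the token was issued before the cutoff."""
    return token_issued_at < cutoff


cutoff = datetime(2022, 6, 1, tzinfo=timezone.utc)
scp = build_revocation_scp(["arn:aws:iam::111122223333:role/roleexample"], cutoff)

old_session = is_revoked(cutoff - timedelta(hours=3), cutoff)     # issued before the cutoff
new_session = is_revoked(cutoff + timedelta(minutes=5), cutoff)   # issued after the cutoff
print(old_session, new_session)
```

Credentials issued after the cutoff are unaffected, which is why users who are still permitted to assume the role can simply reassume it to continue working.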

Conclusion and final recommendations

In this blog post, I demonstrated how you can revoke a federated user’s active AWS sessions by using SCPs and IAM policies that restrict the use of the aws:userid and aws:SourceIdentity condition keys. I also shared how you can handle a role chaining situation with the aws:TokenIssueTime condition key.

This exercise demonstrates the importance of configuring the session duration parameter on IdP assumed roles. As a security best practice, you should set the session duration to no longer than what is needed to perform the role. In some situations, that could mean an hour or less in a production environment and a longer session in a development environment. Regardless, it’s important to understand the impact of configuring the maximum session duration in the user’s environment and also to have proper procedures in place for revoking a federated user’s access.

This post also covered the recommendation to set the sourceIdentity for assumed roles through the IdP. This value cannot be changed during role sessions and therefore persists when a user conducts role chaining. Following this recommendation minimizes the risk that a user might have assumed another role with a different session name than the one assigned by the IdP, and helps you avoid the edge case of having to revoke active sessions based on TokenIssueTime.

You should also consider other security best practices, described in the Security Pillar of the AWS Well-Architected Framework, when you revoke users’ AWS access. For example, rotate credentials such as IAM access keys in situations in which IAM access keys are regularly used and shared among users. The example solutions in this post would not have prevented a user from performing AWS actions if that user had IAM access keys configured for a separate IAM user in the environment. Organizations should limit long-lived security credentials such as IAM keys and instead rotate them regularly or avoid their use altogether. Also, the concept of least privilege is highly important to limit the access that users have and scope it solely to the requirements that are needed to perform their job functions. Lastly, you should adopt a centralized identity provider coupled with the AWS IAM Identity Center (successor to AWS Single Sign-On) service in order to centralize identity management and avoid the need for multiple credentials for users.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Identity and Access Management re:Post or contact AWS Support.