Tag Archives: Advanced (300)

Stream VPC flow logs to Amazon OpenSearch Service via Amazon Kinesis Data Firehose

2022-12-20 Chaitanya Shah

Post Syndicated from Chaitanya Shah original https://aws.amazon.com/blogs/big-data/stream-vpc-flow-logs-to-amazon-opensearch-service-via-amazon-kinesis-data-firehose/

Amazon Virtual Private Cloud (Amazon VPC) flow logs enable you to track the IP traffic going to and from the network interfaces in your VPC for your workloads. Analyzing VPC logs helps you understand how your applications are communicating over your VPC network with log records and acts as a main source of information to the network in your VPC. After collecting the flow logs, the next step is performing log analysis to understand user or application behavior and patterns to make informed decisions. You can analyze logs using log analytics tools such as Amazon OpenSearch Service.

Amazon Kinesis Data Firehose is a fully managed service for delivering near real-time streaming data to various destinations for storage and performing near real-time analytics. With its extensible data transformation capabilities, you can also streamline log processing and log delivery pipelines into a single Firehose delivery stream.

Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. Amazon OpenSearch is an open source, distributed search and analytics suite. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), as well as visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). Amazon OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management processing trillions of requests per month.

In this post, you will learn how to ingest VPC flow logs with Kinesis Data Firehose and deliver them to an Amazon OpenSearch Service for analysis using OpenSearch Service Dashboards.

Overview of solution

This solution uses native integration of VPC flow logs streaming to Kinesis Data Firehose. We use a Firehose delivery stream to buffer the streamed VPC flow logs, and deliver those to an OpenSearch Service destination endpoint. We use Amazon OpenSearch Service Dashboards to create an index pattern for the VPC flow logs to analyze and visualize the logs in a near-real time. The following diagram illustrates this architecture.

We walk you through the following high-level steps:

Create an OpenSearch Service domain for storing and analyzing the VPC flow logs.
Create a Firehose delivery stream to deliver the flow logs to the OpenSearch Service domain.
Create a VPC flow log subscription to the delivery stream.
Explore VPC flow logs in OpenSearch Service Dashboards
- Create role mapping with an OpenSearch Service user to the Kinesis Data Firehose service role. Because we’re using a public access domain for OpenSearch Service, we have to map the delivery stream AWS Identity and Access Management (IAM) role to the OpenSearch Service primary user to deliver logs in bulk to the OpenSearch Service domain.
- Create an index pattern in OpenSearch Service Dashboards to enable analysis and visualization of VPC logs.

Prerequisites

As a prerequisite, you need to create an Amazon Simple Storage Service (Amazon S3) bucket to store the Firehose delivery stream backups and failed logs.

Create an Amazon OpenSearch Service domain

For demonstration purposes, and to limit the costs, we create an OpenSearch Service domain with the Development and testing deployment type and public access to the dashboard. For instructions, refer to Create an Amazon OpenSearch Service domain. Note that we select Public access only for demo purposes. For production, we recommend using VPC access for security reasons.

When it’s complete, the OpenSearch Service domain shows as Active.

Create a Kinesis Data Firehose delivery stream

Now that your Amazon OpenSearch Service domain is active, you can create a Firehose delivery stream where VPC flow logs are streamed.

On the Amazon Kinesis console, choose Kinesis Data Firehose in the navigation pane, then choose Create delivery stream.
Choose Direct PUT as the source and set the destination as Amazon OpenSearch Service.
For Delivery stream name, enter PUT-OPENSEARCH-STREAM-DEMO.
In the Destination settings section, choose Browse and choose the previously created Amazon OpenSearch Service domain.
For Index name, enter vpcflowlogs.
For Index rotation, choose Every day.
For this post, we set Buffer size to 5 and Buffer interval to 900.You can modify these settings to optimize ingestion throughput and near-real-time behavior.

In the Backup settings section, for Source record backup in Amazon S3, select Failed events only so you only save the data that fails to deliver to Amazon OpenSearch Service.
For S3 bucket, choose Browse and choose the S3 bucket you created to store failed logs and backups.
Optionally, you can input a prefix for backup files and error files.
Select GZIP for Compression for data records.
For Encryption for data records, select Disabled.
Expand Advanced settings, and for Amazon CloudWatch error logging, select Enabled.
Choose Create delivery stream.

When the delivery stream is active, proceed to the next step.

Create a VPC flow logs subscription

Now you create a VPC flow logs subscription for the Firehose delivery stream you created in the previous step.

On the Amazon VPC console, choose Your VPCs.
Select the VPC for which to create the flow log.
On the Actions menu, choose Create flow log.
Select All to send all flow log records to Amazon OpenSearch Service.

If you want to filter the flow logs, you can select either Accept or Reject.

For Maximum aggregation interval, select 10 minutes or the minimum setting of 1 minute if you need the flow log data to be available for near-real-time analysis in Amazon OpenSearch Service.
For Destination, select Send to Kinesis Firehose in the same account if the delivery stream is set up on the same account where you create the VPC flow logs.
For Log record format, if you leave it at AWS default format, the flow logs are sent as version 2 format.

Alternatively, you can specify which fields you need the flow logs to capture and send to an Amazon OpenSearch Service. For more information on log format and available fields, refer to Flow log records.

Choose Create flow log.

Now let’s explore the VPC flow logs in Amazon OpenSearch Service.

Explore VPC flow logs in Amazon OpenSearch Service Dashboards

In the final step, we set up OpenSearch Service Dashboards to explore the VPC flow logs.

On the OpenSearch Service console, choose Domains in the navigation pane.
Choose the domain you created.
Under OpenSearch Dashboards URL, choose the link to open a new tab.
Log in with the user you created during OpenSearch Service domain setup.
Select Private for Select your tenant, then choose Confirm.

Because we used a public access domain for OpenSearch Service, you need to map the role created for the Firehose delivery stream to the OpenSearch Service Dashboards user, so that the delivery stream can deliver logs in bulk to the OpenSearch Service domain.

On the menu icon, choose Security.
Choose Roles.
Choose the all_access role.
On the Mapped users tab, choose Manage mapping.
For Backend roles, enter the IAM role ARN created for the Firehose delivery stream.
Choose Map.
Now that mapping is complete, choose the menu icon, then choose Stack management.
Choose Index Patterns, then choose Create index pattern.
For Index pattern name, enter vpcflowlogs*.
Choose Next step.
Navigate to the Discover menu option.You can see the VPC flow logs from your VPC in this dashboard. Now you can search and visualize the flow logs that are being streamed in near-real time to the OpenSearch Service domain.

Clean up

After you test out this solution, remember to delete all the resources you created to avoid incurring future charges:

Delete your Amazon OpenSearch Service domain.
Delete the VPC flow logs subscription.
Delete the Firehose delivery stream.
Delete the S3 bucket for the VPC flow logs backup and failed logs.
If you created a new VPC and new resources in the VPC, delete the resources and VPC.

Conclusion

In this post, we walked through a solution of how integrate VPC flow logs with a Kinesis Data Firehose delivery stream and deliver it to an Amazon OpenSearch Service destination with no code and visualize it in OpenSearch Service Dashboards.

Try this new quick and hassle-free way of sending your VPC flow logs to an Amazon OpenSearch Service using Kinesis Data Firehose.

About the Author

Chaitanya Shah is a Sr. Technical Account Manager with AWS, based out of New York. He has over 22 years of experience working with enterprise customers. He loves to code and actively contributes to the AWS solutions labs to help customers solve complex problems. He provides guidance to AWS customers on best practices for their AWS Cloud migrations. He is also specialized in AWS data transfer and the data and analytics domain.

How to get best price performance from your Amazon Redshift Data Sharing deployment

2022-12-20 BP Yau

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/how-to-get-best-price-performance-from-your-amazon-redshift-data-sharing-deployment/

Amazon Redshift is a fast, scalable, secure, and fully-managed data warehouse that enables you to analyze all of your data using standard SQL easily and cost-effectively. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and regions without needing to copy or move data from one cluster to another.

Amazon Redshift Data Sharing was initially launched in March 2021, and added support for cross-account data sharing was added in August 2021. The cross-region support became generally available in February 2022. This provides full flexibility and agility to share data across Redshift clusters in the same AWS account, different accounts, or different regions.

Amazon Redshift Data Sharing is used to fundamentally redefine Amazon Redshift deployment architectures into a hub-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and most importantly do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?”, and “How do I get the best price performance for workload isolation?”. As workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, a continuous strategy to size both consumer and producer clusters to maximize the performance and minimize cost should be implemented. In this post, we provide a step-by-step approach to help you determine your producer and consumer clusters sizes for the best price performance based on your specific workload.

Generic consumer sizing guidance

The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify accordingly to cater your specific use case scenario.

Size your producer cluster

You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started, and you should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Size and setup initial consumer cluster

You should always size your consumer cluster based on your compute needs. One way to get started is to follow the generic cluster sizing guide similar to the producer cluster above.

Setup Amazon Redshift data sharing

Setup data sharing from producer to consumer once you have both the producer and consumer cluster setup. Refer to this post for guidance on how to setup data sharing.

Test consumer only workload on initial consumer cluster

Test consumer only workload on the new initial consumer cluster. This can be done by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.

Test consumer only workload on different consumer cluster configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.

On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.

As a rule of thumb, size up the consumer cluster by 2x the initial cluster configuration incrementally until it meets your workload requirements.

Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.

Test producer only workload on different producer cluster configurations

Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.

To achieve this, you can rerun the producer only workload on 1/2x of the original producer size and evaluate the workload performance. Resizing the cluster up and down accordingly depends on the result, and then you select the minimum producer configuration that meets your workload performance requirements.

Re-evaluate after a full workload run over time

As Amazon Redshift continues evolving, and there are continuous performance and scalability improvement releases, data sharing performance will continue improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:

Ingestion rate and amount of data change
Query pattern and characteristic
Workload changes
Concurrency
Maintenance activities, for example vacuum, analyze, and ATO

This is why you must re-evaluate the producer and consumer cluster sizing using the strategy above on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Automated sizing solutions

If your environment involved more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not feasible to use the manual method from the generic guidance above. Instead, you can leverage solutions in this section to automatically replay the workload from your production cluster on the test consumer and producer clusters to evaluate the performance.

Simple Replay utility will be leveraged as the automated solution to guide you through the process of getting the right producer and consumer clusters size for the best price performance.

Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.

Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between these Amazon Redshift clusters seamlessly and select the clusters configuration that best meet your price performance target.

The following diagram shows the solution architecture.

Architecutre for testing simple replay

Solution walkthrough

Follow these steps to go through the solution to size your consumer and producer clusters.

Size your production cluster

You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation on the production cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Identify the workload to be isolated

You might have different workloads running on your original cluster, but the first step is to identify the most critical workload to the business that we want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload can be isolated.

Setup Simple Replay

Once you know your critical workload, you must enable audit logging in your production cluster where the critical workload identified above is running to capture query activities and store in Amazon Simple Storage Service (Amazon S3). Note that it may take up to three hours for the audit logs to be delivered to Amazon S3. Once the audit log is available, proceed to setup Simple Replay and then extract the critical workload from the audit log. Note that start_time and end_time could be used as parameters to filter out the critical workload if those workloads run in certain time periods, for example 9am to 11am. Otherwise it will extract all of the logged activities.

Baseline workload

Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance with an isolated environment.

Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.

Setup initial producer and consumer test clusters

Create a producer cluster with the same production cluster configuration by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Furthermore, setup data sharing between the producer and consumer.

Replay workload on initial producer and consumer

Replay the producer only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Replay the consumer only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Evaluate the performance of these replays against the baseline and workload performance requirements.

Replay consumer workload on different configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or you can follow these steps to test on smaller configurations to see if you can further reduce costs and still get the performance that you need.

Compare initial consumer performance results against your workload requirements:

If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down accordingly based on the result until it meets your workload requirements. The purpose is to get a sweet spot where you’re comfortable with the performance requirements and get the lowest price possible.
If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements.

Replay producer workload on different configurations

Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced and you should evaluate your producer cluster’s workload performance to seek the opportunity to downsize to save on costs.

The steps are similar to consumer replay. Elastic resize the producer cluster incrementally starting with 1/2x the original size, replay the producer only workload and evaluate the performance, and then further resize up or down until it meets your workload performance requirements. The purpose is to get a sweet spot where you’re comfortable with the workload performance requirements and get the lowest price possible. Once you have the desired producer cluster configuration, retry replay consumer workloads on the consumer cluster to make sure that the performance wasn’t impacted by producer cluster configuration changes. Finally, you should replay both producer and consumer workloads concurrently to make sure that the performance is achieved in a full workload scenario.

Re-evaluate after a full workload run over time

Similar to the generic guidance, you should re-evaluate the producer and consumer clusters sizing using the previous strategy on occasion, especially after full workload deployment to gain the new best price performance from your cluster’s configuration.

Clean up

Running these sizing tests in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save on costs. We also recommend pausing your clusters when they’re not in use.

Applying Amazon Redshift and data sharing best practices

Proper sizing of both your producer and consumer clusters will give you a good start to get the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance. In this case, understanding and following best practices are equally important.

General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.

There numerous data sharing specific best practices that you should follow to make sure that you maximize the performance. Refer to this post for more details.

Summary

There is no one-size-fits-all recommendation on producer and consumer cluster sizes. It varies by workloads and your performance SLA. The purpose of this post is to provide you with guidance for how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes to get the best price performance. Consider testing your workloads on producer and consumer using simple replay before adopting it in production to get the best price performance.

About the Authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance and operational excellence at scale in their cloud journey. He has a keen interest in Data Analytics as well.

Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT)

2022-12-14 Jagadish Kumar

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/migrate-google-bigquery-to-amazon-redshift-using-aws-schema-conversion-tool-sct/

Amazon Redshift is a fast, fully-managed, petabyte scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. Using Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers. The following are just some of the notable capabilities:

Amazon Redshift seamlessly integrates with broader analytics services on AWS. This enables you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. Amazon Redshift lets you build lake house architectures and then perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning (ML), and more.
You don’t need to worry about workloads, such as ETL, dashboards, ad-hoc queries, and so on, interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get one hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
Amazon Redshift is easy-to-use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
Fault Tolerance is inbuilt. All of the data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, and serverless web services-based or event-driven applications and so on.
Redshift ML makes it easy for data scientists to create, train, and deploy ML models using familiar SQL. They can also run predictions using SQL.
Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row and column level security controls on sensitive data, and so on.
Amazon Redshift integrates seamlessly with other AWS services and third-party tools. You can move, transform, load, and query large datasets quickly and reliably.

In this post, we provide a walkthrough for migrating a data warehouse from Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. Furthermore, AWS SCT can scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your BigQuery project. First, we create an Amazon Redshift database into which BigQuery data is migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert BigQuery schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from BigQuery, upload it into the S3 bucket, and then copy to Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

A workstation with AWS SCT, Amazon Corretto 11, and Amazon Redshift drivers.
1. You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local desktop as a workstation. In this walkthrough, we’re using Amazon EC2 Windows instance. To create it, use this guide.
2. To download and install AWS SCT on the EC2 instance that you previously created, use this guide.
3. Download the Amazon Redshift JDBC driver from this location.
4. Download and install Amazon Corretto 11.
A GCP service account that AWS SCT can use to connect to your source BigQuery project.
1. Grant BigQuery Admin and Storage Admin roles to the service account.
2. Copy the Service account key file, which was created in the Google cloud management console, to the EC2 instance that has AWS SCT.
3. Create a Cloud Storage bucket in GCP to store your source data during migration.

This walkthrough covers the following steps:

Create an Amazon Redshift Serverless Workgroup and Namespace
Create the AWS S3 Bucket and Folder
Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT
- Connecting to the Google BigQuery Source
- Connect to the Amazon Redshift Target
- Convert BigQuery schema to an Amazon Redshift
- Analyze the assessment report and address the action items
- Apply converted schema to target Amazon Redshift
Migrate data using AWS SCT data extraction agents
- Generating Trust and Key Stores (Optional)
- Install and start data extraction agent
- Register data extraction agent
- Add virtual partitions for large tables (Optional)
- Create a local migration task
- Start the Local Data Migration Task
View Data in Amazon Redshift

Create an Amazon Redshift Serverless Workgroup and Namespace

In this step, we create an Amazon Redshift Serverless workgroup and namespace. Workgroup is a collection of compute resources and namespace is a collection of database objects and users. To isolate workloads and manage different resources in Amazon Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create Amazon Redshift Serverless workgroup and namespace:

Navigate to the Amazon Redshift console.
In the upper right, choose the AWS Region that you want to use.
Expand the Amazon Redshift pane on the left and choose Redshift Serverless.
Choose Create Workgroup.
For Workgroup name, enter a name that describes the compute resources.
Verify that the VPC is the same as the VPC as the EC2 instance with AWS SCT.
Choose Next.

For Namespace name, enter a name that describes your dataset.
In Database name and password section, select the checkbox Customize admin user credentials.
- For Admin user name, enter a username of your choice, for example awsuser.
- For Admin user password: enter a password of your choice, for example MyRedShiftPW2022.

Choose Next. Note that data in Amazon Redshift Serverless namespace is encrypted by default.
In the Review and Create page, choose Create.
Create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace, as described in the following. Note that there can only be one default IAM role.
- Navigate to the Amazon Redshift Serverless Dashboard.
- Under Namespaces / Workgroups, choose the namespace that you just created.
- Navigate toSecurity and encryption.
- Under Permissions, choose Manage IAM roles.
- Navigate to Manage IAM roles. Then, choose the Manage IAM roles drop-down and choose Create IAM role.
- Under Specify an Amazon S3 bucket for the IAM role to access, choose one of the following methods:
  - Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with a name starting with redshift.
  - Choose Any Amazon S3 bucket to allow the created IAM role to access all of the S3 buckets.
  - Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
- Choose Create IAM role as default. Amazon Redshift automatically creates and sets the IAM role as default.
Capture the Endpoint for the Amazon Redshift Serverless workgroup that you just created.
- Navigate to the Amazon Redshift Serverless Dashboard.
- Under Namespaces / Workgroups, choose the workgroup that you just created.
- Under General information, copy the Endpoint.

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create the S3 bucket:

Navigate to the Amazon S3 console
Choose Create bucket. The Create bucket wizard opens.
For Bucket name, enter a unique DNS-compliant name for your bucket (e.g., uniquename-bq-rs). See rules for bucket naming when choosing a name.
For AWS Region, choose the region in which you created the Amazon Redshift Serverless workgroup.
Select Create Bucket.
In the Amazon S3 console, navigate to the S3 bucket that you just created (e.g., uniquename-bq-rs).
Choose “Create folder” to create a new folder.
For Folder name, enter incoming and choose Create Folder.

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

To convert BigQuery schema to the Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that we created previously, and then launch AWS SCT.

Follow these steps using AWS SCT:

Connect to the BigQuery Source

From the File Menu choose Create New Project.
Choose a location to store your project files and data.
Provide a meaningful but memorable name for your project, such as BigQuery to Amazon Redshift.
To connect to the BigQuery source data warehouse, choose Add source from the main menu.
Choose BigQuery and choose Next. The Add source dialog box appears.
For Connection name, enter a name to describe BigQuery connection. AWS SCT displays this name in the tree in the left panel.
For Key path, provide the path of the service account key file that was previously created in the Google cloud management console.
Choose Test Connection to verify that AWS SCT can connect to your source BigQuery project.
Once the connection is successfully validated, choose Connect.

Connect to the Amazon Redshift Target

Follow these steps to connect to Amazon Redshift:

In AWS SCT, choose Add Target from the main menu.
Choose Amazon Redshift, then choose Next. The Add Target dialog box appears.
For Connection name, enter a name to describe the Amazon Redshift connection. AWS SCT displays this name in the tree in the right panel.
For Server name, enter the Amazon Redshift Serverless workgroup endpoint captured previously.
For Server port, enter 5439.
For Database, enter dev.
For User name, enter the username chosen when creating the Amazon Redshift Serverless workgroup.
For Password, enter the password chosen when creating Amazon Redshift Serverless workgroup.
Uncheck the “Use AWS Glue” box.
Choose Test Connection to verify that AWS SCT can connect to your target Amazon Redshift workgroup.
Choose Connect to connect to the Amazon Redshift target.

Note that alternatively you can use connection values that are stored in AWS Secrets Manager.

Convert BigQuery schema to an Amazon Redshift

After the source and target connections are successfully made, you see the source BigQuery object tree on the left pane and target Amazon Redshift object tree on the right pane.

Follow these steps to convert BigQuery schema to the Amazon Redshift format:

On the left pane, right-click on the schema that you want to convert.
Choose Convert Schema.
A dialog box appears with a question, The objects might already exist in the target database. Replace?. Choose Yes.

Once the conversion is complete, you see a new schema created on the Amazon Redshift pane (right pane) with the same name as your BigQuery schema.

The sample schema that we used has 16 tables, 3 views, and 3 procedures. You can see these objects in the Amazon Redshift format in the right pane. AWS SCT converts all of the BigQuery code and data objects to the Amazon Redshift format. Furthermore, you can use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects. However, some of the objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from the Main view to the Assessment Report view as follows:

The Summary tab shows objects that were converted automatically, and objects that weren’t converted automatically. Green represents automatically converted or with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action Items tab shows the recommended actions for each conversion issue. If you select an action item from the list, AWS SCT highlights the object to which the action item applies.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database/schema show you the effort required to design and implement the recommendations for converting Action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. Amazon Redshift takes some actions automatically while converting the schema to Amazon Redshift. Objects with these actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the RECORD and JSON datatype columns in BigQuery table ncaaf_referee_data to the SUPER datatype in Amazon Redshift. The partition key in the ncaaf_referee_data table is converted to the distribution key and sort key in Amazon Redshift.

Apply converted schema to target Amazon Redshift

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and then choose Apply to database.

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this walkthrough, we show how to configure AWS SCT extraction agents to extract data from BigQuery and migrate to Amazon Redshift.

First, install AWS SCT extraction agent on the same Windows instance that has AWS SCT installed. For better performance, we recommend that you use a separate Linux instance to install extraction agents if possible. For big datasets, you can use several data extraction agents to increase the data migration speed.

Generating trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all of the data passed between the applications remains private and integral. To use SSL communication, you must generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

In AWS SCT, navigate to Settings → Global Settings → Security.
Choose Generate trust and key store.
Enter the name and password for trust and key stores and choose a location where you would like to store them.
Choose Generate.

Install and configure Data Extraction Agent

In the installation package for AWS SCT, you find a sub-folder agent (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

For Listening port, enter the port number on which the agent listens. It is 8192 by default.
For Add a source vendor, enter no, as you don’t need drivers to connect to BigQuery.
For Add the Amazon Redshift driver, enter YES.
For Enter Redshift JDBC driver file or files, enter the location where you downloaded Amazon Redshift JDBC drivers.
For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.
For Enable SSL communication, enter yes. Choose No here if you don’t want to use SSL.
For Key store, enter the storage location chosen when creating the trust and key store.
For Key store password, enter the password for the key store.
For Enable client SSL authentication, enter yes.
For Trust store, enter the storage location chosen when creating the trust and key store.
For Trust store password, enter the password for the trust store.

*************************************************
*                                               *
*     AWS SCT Data Extractor Configuration      *
*              Version 2.0.1.666                *
*                                               *
*************************************************
User name: Administrator
User home: C:\Windows\system32\config\systemprofile
*************************************************
Listening port [8192]: 8192
Add a source vendor [YES/no]: no
No one source data warehouse vendors configured. AWS SCT Data Extractor cannot process data extraction requests.
Add the Amazon Redshift driver [YES/no]: YES
Enter Redshift JDBC driver file or files: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\redshift-jdbc42-2.1.0.9.jar
Working folder [C:\Windows\system32\config\systemprofile]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject
Enable SSL communication [YES/no]: YES
Setting up a secure environment at "C:\Windows\system32\config\systemprofile". This process will take a few seconds...
Key store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftKeyStore
Key store password:
Re-enter the key store password:
Enable client SSL authentication [YES/no]: YES
Trust store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftTrustStore
Trust store password:
Re-enter the trust store password:

Starting Data Extraction Agent(s)

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, double-click C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

On the computer that has the extraction agent installed, from a command prompt or terminal window, run the command listed following your operating system.
To check the status of the agent, run the same command but replace start with status.
To stop an agent, run the same command but replace start with stop.
To restart an agent, run the same RestartAgent.bat file.

Register the Data Extraction Agent

Follow these steps to register the Data Extraction Agent:

In AWS SCT, change the view to Data Migration view (other) and choose + Register.
In the connection tab:
1. For Description, enter a name to identify the Data Extraction Agent.
2. For Host name, if you installed the Data Extraction Agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT Data Extraction Agent is installed. It’s recommended to install the Data Extraction Agents on Linux for better performance.
3. For Port, enter the number entered for the Listening Port when installing the AWS SCT Data Extraction Agent.
4. Select the checkbox to use SSL (if using SSL) to encrypt the AWS SCT connection to the Data Extraction Agent.
If you’re using SSL, then in the SSL Tab:
1. For Trust store, choose the trust store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
2. For Key Store, choose the key store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
Choose Test Connection.
Once the connection is validated successfully, choose Register.

Add virtual partitions for large tables (optional)

You can use AWS SCT to create virtual partitions to optimize migration performance. When virtual partitions are created, AWS SCT extracts the data in parallel for partitions. We recommend creating virtual partitions for large tables.

Follow these steps to create virtual partitions:

Deselect all objects on the source database view in AWS SCT.
Choose the table for which you would like to add virtual partitioning.
Right-click on the table, and choose Add Virtual Partitioning.
You can use List, Range, or Auto Split partitions. To learn more about virtual partitioning, refer to Use virtual partitioning in AWS SCT. In this example, we use Auto split partitioning, which generates range partitions automatically. You would specify the start value, end value, and how big the partition should be. AWS SCT determines the partitions automatically. For a demonstration, on the Lineorder table:
1. For Start Value, enter 1000000.
2. For End Value, enter 3000000.
3. For Interval, enter 1000000 to indicate partition size.
4. Choose Ok.

You can see the partitions automatically generated under the Virtual Partitions tab. In this example, AWS SCT automatically created the following five partitions for the field:

1. <1000000
2. >=1000000 and <=2000000
3. >2000000 and <=3000000
4. >3000000
5. IS NULL

Create a local migration task

To migrate data from BigQuery to Amazon Redshift, create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local migration task:

In AWS SCT, under the schema name in the left pane, right-click on Standard tables.
Choose Create Local task.
There are three migration modes from which you can choose:
1. Extract source data and store it on a local pc/virtual machine (VM) where the agent runs.
2. Extract data and upload it on an S3 bucket.
3. Choose Extract upload and copy, which extracts data to an S3 bucket and then copies to Amazon Redshift.
In the Advanced tab, for Google CS bucket folder enter the Google Cloud Storage bucket/folder that you created earlier in the GCP Management Console. AWS SCT stores the extracted data in this location.
In the Amazon S3 Settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket that you created earlier. The AWS SCT data extraction agent uploads the data into the S3 bucket/folder before copying to Amazon Redshift.
Choose Test Task.
Once the task is successfully validated, choose Create.

Start the Local Data Migration Task

To start the task, choose the Start button in the Tasks tab.

First, the Data Extraction Agent extracts data from BigQuery into the GCP storage bucket.
Then, the agent uploads data to Amazon S3 and launches a copy command to move the data to Amazon Redshift.
At this point, AWS SCT has successfully migrated data from the source BigQuery table to the Amazon Redshift table.

View data in Amazon Redshift

After the data migration task executes successfully, you can connect to Amazon Redshift and validate the data.

Follow these steps to validate the data in Amazon Redshift:

Navigate to the Amazon Redshift QueryEditor V2.
Double-click on the Amazon Redshift Serverless workgroup name that you created.
Choose the Federated User option under Authentication.
Choose Create Connection.
Create a new editor by choosing the + icon.
In the editor, write a query to select from the schema name and table name/view name you would like to verify. Explore the data, run ad-hoc queries, and make visualizations and charts and views.

The following is a side-by-side comparison between source BigQuery and target Amazon Redshift for the sports data-set that we used in this walkthrough.

Clean up up any AWS resources that you created for this exercise

Follow these steps to terminate the EC2 instance:

Navigate to the Amazon EC2 console.
In the navigation pane, choose Instances.
Select the check-box for the EC2 instance that you created.
Choose Instance state, and then Terminate instance.
Choose Terminate when prompted for confirmation.

Follow these steps to delete Amazon Redshift Serverless workgroup and namespace

Navigate to Amazon Redshift Serverless Dashboard.
Under Namespaces / Workgroups, choose the workspace that you created.
Under Actions, choose Delete workgroup.
Select the checkbox Delete the associated namespace.
Uncheck Create final snapshot.
Enter delete in the delete confirmation text box and choose Delete.

Follow these steps to delete the S3 bucket

Navigate to Amazon S3 console.
Choose the bucket that you created.
Choose Delete.
To confirm deletion, enter the name of the bucket in the text input field.
Choose Delete bucket.

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. Following this walkthrough, you can understand how a data migration task extracts, downloads, and then migrates data from BigQuery to Amazon Redshift. The solution that we presented in this post performs a one-time migration of database objects and data. Data changes made in BigQuery when the migration is in progress won’t be reflected in Amazon Redshift. When data migration is in progress, put your ETL jobs to BigQuery on hold or replay the ETLs by pointing to Amazon Redshift after the migration. Consider using the best practices for AWS SCT.

AWS SCT has some limitations when using BigQuery as a source. For example, AWS SCT can’t convert sub queries in analytic functions, geography functions, statistical aggregate functions, and so on. Find the full list of limitations in the AWS SCT user guide. We plan to address these limitations in future releases. Despite these limitations, you can use AWS SCT to automatically convert most of your BigQuery code and storage objects.

Download and install AWS SCT, sign in to the AWS Console, checkout Amazon Redshift Serverless, and start migrating!

About the authors

Cedrick Hoodye is a Solutions Architect with a focus on database migrations using the AWS Database Migration Service (DMS) and the AWS Schema Conversion Tool (SCT) at AWS. He works on DB migrations related challenges. He works closely with EdTech, Energy, and ISV business sector customers to help them realize the true potential of DMS service. He has helped migrate 100s of databases into the AWS cloud using DMS and SCT.

Amit Arora is a Solutions Architect with a focus on Database and Analytics at AWS. He works with our Financial Technology and Global Energy customers and AWS certified partners to provide technical assistance and design customer solutions on cloud migration projects, helping customers migrate and modernize their existing databases to the AWS Cloud.

Jagadish Kumar is an Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers achieve success with their large-scale data projects.

Using Workflows to Build, Test, and Deploy with Amazon CodeCatalyst

2022-12-13 Kumar Karra

Post Syndicated from Kumar Karra original https://aws.amazon.com/blogs/devops/using-workflows-to-build-test-and-deploy-with-amazon-codecatalyst/

Amazon CodeCatalyst workflows are continuous integration and continuous delivery (CI/CD) pipelines that enable you to easily build, test and deploy applications. CodeCatalyst was announced at re:Invent 2022 and is currently in preview.

Introduction:

I recently read The Unicorn Project, the follow-up to the bestselling title The Phoenix Project from Gene Kim. After a few years at Amazon, I had forgotten how some companies write software, but it all came back to me as I read. In the book, the main character, Maxine, struggles with a complicated software development lifecycle (SLDC) after joining a new team. Some of the challenges she encounters include:

Continually delivering high-quality updates is complicated and slow
Collaborating efficiently with others is challenging
Managing application environments is increasingly complex
Setting up a new project is a time consuming chore

Amazon CodeCatalyst can help address all of these issues. CodeCatalyst is an integrated DevOps service that makes it easy for development teams to quickly build and deliver applications on AWS. Over the next few weeks, my colleagues and I will release a series of blog posts describing the individual features of CodeCatalyst and how they will help you overcome the challenges that Maxine encountered in The Unicorn Project. In this first post, I focus on Workflows and address the first bullet above, “continually delivering high-quality updates is complicated and slow”.

CodeCatalyst Workflows help you reliably deliver high-quality application updates frequently, quickly and securely. CodeCatalyst uses a visual editor — or if you prefer YAML — to quickly assemble and configure actions to compose workflows that automate your CI/CD pipeline, test reporting and other manual processes. Workflows use provisioned compute, lambda compute, custom container images and a managed build infrastructure to scale execution easily without sacrificing flexibility

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Have an AWS Builder ID for signing in to CodeCatalyst.
Belong to a space and have the space administrator role assigned to you in that space. For more information, see Creating a space in CodeCatalyst, Managing members of your space, and Space administrator role.
Have an AWS account associated with your space and have the IAM role in that account. For more information about the role and role policy, see Creating a CodeCatalyst service role.

Walkthrough

For this walkthrough, I am going use the Modern Three-tier Web Application blueprint. A CodeCatalyst blueprint provides a template for a new project. If you would like to follow along, you can launch the blueprint as described in Creating a project in Amazon CodeCatalyst. This will deploy the architecture shown below.

Figure 1. Modern Three-tier Web Application architecture including a presentation, application and data layer

Once the new project is launched, navigate to CI/CD > Workflows. You will see two workflows listed. Click on ApplicationDeploymentPipeline and you will be presented with the workflow pictured below. The workflow consists of six actions: 1) ensures that CDK is configured in the account; 2) builds the backend, written in Python, including unit tests; 3) deploys the backend to either AWS Lambda or AWS Fargate depending on which you selected when you launched the project; 4) runs a series of integration tests on the deployed backend; 5) builds the frontend, written with Vue, including unit tests; and finally, 6) deploys the frontend to Amazon Simple Storage Service (Amazon S3) and Amazon CloudFront.

Figure 2. Six step Workflow described in the prior paragraph

Let’s look at a few of these actions. If you click on each action you will see details about the workflow execution. For example, I clicked on build_backend. On the logs tab, I can see the build action executes a series of steps. In this example, pip installs requirements and then pytest and coverage run a series of unit test. If this had been a compiled language — like Java or .NET — there would have been a build step as well.

Figure 3. Logs from the build action including pip, pytest, and coverage

If I switch to the Reports tab, I see the result of the unit tests as well as code and branch coverage. In each case the test has exceeded the pass rate, indicated by the black bar on the graph. If they had not, the build would have failed.

Figure 4. Results of the unit tests including code and branch coverage

Next, let’s examine how the workflow is defined by clicking on the Edit button in the top right corner of the screen. If the editor opens in YAML mode, switch to Visual mode using the toggle above the code. If I click on WorkflowSource, I see that the Workflow is triggered by a push to the main branch. I could add additional triggers. CodeCatalyst supports triggering on Push or Pull Request. In addition, I can trigger off multiple branches, including wildcards (e.g. “release-.*”). Finally, I can trigger branches when only some files in a repository change (e.g. "src/.*")

Figure 5. Trigger configuration showing various options

Now, let’s look at the build_frontend action. This is a build action, similar to the build_backend action you looked at earlier. On the Configure tab I can see the Shell commands that will be executed during the build. Remember that the frontend is written using Vue. Here I can see npm install used to install dependencies, npm run test:unit used to run tests, and finally npm run build-only to build the Single Page App (SPA). The resulting artifacts are passed to subsequent actions in the Workflow.

Figure 6. Shell commands run in the build action

Next, let’s look at the integration_test action. A managed test action is very similar to a build action, defining a series of commands to execute. On the configuration tab (not shown), I can see that this action is again running pytest. Switching to the Outputs tab, I see that CodeCatalyst is configured to automatically discover the test reports generated by pytest and other test frameworks. In addition, I have defined a minimum pass rate of 100%. This means that the workflow should fail if any of the integration tests fail.

Figure 7. Test report configuration dialog including success criteria

Finally, let’s examine the deploy_frontend action. Note that all of the actions you have looked at so far include a series of commands to run in their configuration. While these actions are highly flexible, CodeCatalyst also supports purpose built actions. The cdk-deploy action is an example of this. As the name implies, this action deploys AWS Cloud Development Kit (CDK) resources. I could have called cdk deploy from the shell commands in a build action. However, using the purpose built action is easier. CodeCatalyst supports many purpose build actions developed by AWS as well as third parties. Click on the + sign in the top left corner of the screen to see a few examples. In addition, CodeCatalyst supports GitHub actions, but that is a topic for another post.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges (See pricing page for more details). First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and clicking the Delete project button.

Conclusion

In this post, you learned how CodeCatalyst can help you rapidly assemble automation workflows by configuring composable, pre-built actions into CI/CD pipelines. I examined actions to build, test and deploy both frontend and backend applications. In future posts, I will discuss how CodeCatalyst can address the rest of the challenges Maxine encountered in The Unicorn Project.

About the authors:

Simplify private network access for solutions using Amazon OpenSearch Service managed VPC endpoints

2022-12-13 Aish Gunasekar

Post Syndicated from Aish Gunasekar original https://aws.amazon.com/blogs/big-data/simplify-private-network-access-for-solutions-using-amazon-opensearch-service-managed-vpc-endpoints/

To meet the needs of customers who want simplicity in their network setup with the Amazon OpenSearch Service, you can now use Amazon OpenSearch Service-managed virtual private cloud (VPC) endpoints (powered by AWS PrivateLink) to connect to your applications using Amazon OpenSearch Service domains launched in Amazon Virtual Private Cloud (VPC). With Amazon OpenSearch Service-managed VPC endpoints, you can privately access your Amazon OpenSearch Service domain from multiple VPCs in your account or other AWS accounts based on your application needs without configuring other services features such as VPC peering, AWS Transit Gateway (TGW), or other more complex network routing strategies that place operational burden on your support and engineering teams.

The feature is built using AWS PrivateLink. AWS PrivateLink provides private connectivity between VPCs, supported AWS services, and your on-premises networks without exposing your traffic to the public internet. It provides you with the means to connect multiple application deployments effortlessly to your Amazon OpenSearch Service domains.

This post introduces Amazon OpenSearch Service-managed VPC endpoints that build on top of AWS PrivateLink and shows how you can access a private Amazon OpenSearch Service from one or more VPCs hosted in the same account, or even VPCs hosted in other AWS accounts using AWS PrivateLink managed by Amazon OpenSearch Service.

Amazon OpenSearch Service managed VPC endpoints

Before the launch of Amazon OpenSearch Service managed VPC endpoints, if you needed to gain access to your domain outside of your VPC, you had three options:

Use VPC peering to connect your VPC with other VPCs
Use AWS Transit Gateway to connect your VPC with other VPCs
Create your own implementation of an AWS PrivateLink setup

The first two options require you to setup your VPCs so that the Classless Inter-Domain Routing (CIDR) block ranges don’t overlap. If they did, then your options are more complicated. The third option, create your own implementation of AWS PrivateLink, involve configuring a network load balancer (NLB) and associating a target group with the NLB as one of the steps in the setup. The architecture discussed in this post, demonstrates these additional layers of complexity.

With Amazon OpenSearch Service managed VPC endpoints (i.e., powered by AWS PrivateLink), these complex setups and processes are no longer needed!

You can access your Amazon OpenSearch Service private domain as if it were deployed in all the VPCs that you want to connect to your domain. If you need private connectivity from your on-premises hybrid deployments, then AWS PrivateLink helps you bring access from your Amazon OpenSearch Service domain to your data centers with minimal effort.

By using AWS PrivateLink with Amazon OpenSearch Service, you can realize the following benefits:

You simplify your network architecture between hybrid, multi-VPC, and multi account solutions
You address a multitude of compliance concerns by better controlling the traffic that moves between your solutions and Amazon OpenSearch Service domains

Shared search cluster for multiple development teams

Imagine that your company hosts a service as a software (SaaS) application that provides a search application programming interface (API) for the healthcare industry. Each team works on a different function of the API. The development teams API team 1 and API team 2 are in two different AWS accounts and each has their own VPCs within these accounts. Another team (data refinement team) works on the ingestion and data refinement to populate the Amazon OpenSearch Service domain hosted in the same account as API team 2 but in different VPC. Each team shares the domain during the development cycles to save costs and foster collaboration on the data modeling.

Solution overview

Self-managed AWS PrivateLink architecture to connect different VPCs

In this scenario prior to Amazon OpenSearch Service manage VPC endpoints (i.e., powered by AWS PrivateLink), you would have to create the following items:

Deploy an NLB in your VPC
Create a target group that points to the IP addresses of the Elastic Network Interfaces (ENIs), which the Amazon OpenSearch Service creates in your VPC and is used to launch the Amazon OpenSearch Service
Create an AWS PrivateLink deployment and reference your newly created NLB

When you implement the NLB, a target group can only reference IP addresses, an Amazon EC2 instance, or an Application Load Balancer (ALB). If you referenced the IP addresses as targets, then you had to build a process that detected the changes in the IP address if the domain changed due to service initiated or self-initiated blue/green deployments. You must maintain yet another complex process to ensure that you always have active ENIs with which to point your target groups or you lose connectivity.

Typically, customers use an AWS Lambda with scheduled events in Amazon CloudWatch. This means that you use the AWS Lambda to detect the current state where the ENIs that provided the IP addresses were marked as active for the description that matched the ENIs your domain creates. You schedule AWS Lambda to wake up within the time to live (TTL) of the Domain Name Service (DNS) settings (typically 60 seconds) and compare the existing IP addresses in the target group with any new ones found when you query all ENIs with a description referencing your domain in the VPC. You then build a new target group with the deltas and you swap the target groups and drop the old one. It’s tricky, it’s complex, and you have to maintain the solution!

With the new simplified networking architecture, your teams go through the following steps.

OpenSearch Service managed VPC endpoints architecture (powered by AWS PrivateLink)

Since the Amazon OpenSearch Service takes care of the infrastructure described previously — but not necessarily on the same implementation — all you really need to concern yourself with is creating the connections using the instructions in our service documentation.

Once you complete the steps in the instructions and remove your own implementation, your architecture is then simplified as seen in the following diagram.

At this point, the development teams (API team 1 and API team 2) can access the Amazon OpenSearch cluster via Amazon OpenSearch Service Managed VPC Endpoint. This option is highly scalable with a simplified network architecture in which you don’t have to worry about managing a NLB, or setting up target groups and the additional resources. If the number of development teams and VPCs grow in the future, you associate the domain with the associated interface VPC endpoint. You can access services in VPCs in same or different accounts, even if there are overlapping CIDR Block IP ranges.

Conclusion

In this post, we walked through the architectural design of accessing Amazon OpenSearch cluster from different VPCs across different accounts using OpenSearch Service-managed VPC endpoint (AWS PrivateLink). Using Transit Gateway, self-managed AWS PrivateLink or VPC peering required complex networking strategies that increased operation burden. With the introduction of VPC endpoints for Amazon OpenSearch Service, the complexity of your solutions is greatly simplified and what’s even better, it’s managed for you!

About the authors

Aish Gunasekar is a Specialist Solutions architect with a focus on Amazon OpenSearch Service. Her passion at AWS is to help customers design highly scalable architectures and help them in their cloud adoption journey. Outside of work, she enjoys hiking and baking.

Kevin Fallis (@AWSCodeWarrior) is an AWS specialist search solutions architect. His passion at AWS is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.

Scale read and write workloads with Amazon Redshift

2022-12-13 Harsha Tadiparthi

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/scale-read-and-write-workloads-with-amazon-redshift/

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehouse that enables you to analyze large datasets using standard SQL. The concurrency scaling feature in Amazon Redshift automatically adds and removes capacity by adding concurrency scaling to handle demands from thousands of concurrent users, thereby providing consistent SLAs for unpredictable and spiky workloads such as BI reports, dashboards, and other analytics workloads.

Until now, concurrency scaling only supported auto scaling for read queries; write queries had to run on the main cluster. Now, we are extending concurrency scaling to support auto scaling for common write queries including COPY, INSERT, UPDATE, and DELETE. This is available on Amazon Redshift RA3 provisioned instance types in the Regions where concurrency scaling is available. Amazon Redshift serverless comes with built in dynamic auto scaling capability for read workload scaling.

In this post, we discuss how to enable concurrency scaling to offer consistent SLAs for concurrent workloads such as data loads, ETL (extract, transform, and load), and data processing with reduced queue times.

Concurrency scaling overview

With concurrency scaling, Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency scaling resources are added to your Amazon Redshift cluster transparently in seconds, as concurrency increases, to serve sudden spikes in concurrent requests with fast performance without wait time. When the workload demand subsides, Amazon Redshift automatically shuts down concurrency scaling resources to save you cost.

The following diagram shows how concurrency scaling works at a high level.

The workflow contains the following steps:

All queries go to the main cluster.
When queries in the designated workload management (WLM) queue begin queuing, Amazon Redshift automatically routes eligible queries to the new clusters, enabling concurrency scaling.
Amazon Redshift automatically spins up a new cluster, processes waiting queries, and shuts down the concurrency scaling cluster when no longer needed.

Enable Amazon Redshift concurrency scaling

You can manage concurrency scaling at the WLM queue level, where you set concurrency scaling policies for specific queues. When concurrency scaling is enabled for a queue, eligible write and read queries are sent to concurrency scaling clusters without having to wait for resources to free up on the main Amazon Redshift cluster. Amazon Redshift handles spinning up concurrency scaling clusters, routing of the queries to the transient clusters, and relinquishing the concurrency clusters.

You can enable concurrency scaling on both automatic and manual WLM.

You first need to determine which parameter group your cluster is. To do so, complete the following steps:

On the Amazon Redshift console, choose Clusters in the navigation pane.
Choose your cluster.
On the Properties tab, note the parameter group associated to the cluster.
Now you can configure your WLM parameters.
Under Configurations in the navigation pane, choose Workload management.
Choose the parameter group associated to the cluster.If you’re using the default parameter group default.redshift-1.0, you need to create a custom parameter group and assign that to the cluster. The default parameter group has preset values for each of its parameters, and it can’t be modified.
On the Parameters tab, you can choose between 1–10 max_concurrency_scaling_clusters.This is the max number of concurrent Amazon Redshift clusters you can have running at the same time. Ten is the soft limit; this limit can be increased by submitting a service limit increase request with a support case.
On the Workload management tab, choose auto mode for the concurrency scaling cluster.

Example use cases

In this section, we use three use cases to help you understand how concurrency scaling for read and write heavy workloads can seamlessly scale to improve workload performance SLAs.

We used a 3 TB Cloud DW benchmark dataset. The test included a total of 103 concurrent queries, with each run using a separate database connection. The 103 queries constituted 60 queries from the 99 TPC-DS queries and 43 write queries, with a mix of copy, insert, update and delete statements. We used RA3.4xlarge 5 compute nodes.

The following scenarios showcase how concurrency scaling for reads and writes can seamlessly auto scale and positively impact a heavy concurrent mixed workload:

All queries triggered concurrently with concurrency scaling turned off
All queries triggered concurrently with concurrency scaling cluster limit set to 5 clusters
All queries triggered concurrently with concurrency scaling cluster limit set to 10 clusters

Scenario 1: All queries triggered concurrently with concurrency scaling turned off

In this benchmark test, all queries completed in 299 minutes. The following are the test details.

The Amazon Redshift query optimizer turned the 103 queries into 257 sub-queries for better performance in this run. Amazon Redshift continuous to learn from operational statistics to optimize your workload.

The following screenshot shows how Amazon Redshift auto WLM mode chose to run 16 queries concurrently while queuing the rest. Because concurrency scaling is turned off, no additional clusters are spun up and the queries continue to wait for running queries to complete before they can be processed. Notice the number of queries queued stayed at a higher number for a long period of time and eventually lowered as only a few queries could concurrently run.

No additional concurrent clusters spun up during the window of the workload, as seen in the following screenshot, requiring the primary cluster to process all the queries.

Scenario 2: All queries triggered concurrently with concurrency scaling cluster max limit set to 5 clusters

In this test, all queries completed in 49 minutes.

The following screenshot depicts significant queuing. Within seconds, five additional Amazon Redshift clusters are spun up into ready state, allowing 53 queries to run simultaneously. This number can change in your cluster based on the query types. Notice the number of queries queued starts lowering as more queries are completed using the five additional clusters.

Over time, the concurrency scaling clusters start to wind down progressively to 0 as the queries no longer waited.

Scenario 3: All queries triggered concurrently with concurrency scaling cluster limit set to 10 clusters

In this test, all queries completed in 28 minutes.

The following screenshot depicts significant queuing. Within seconds, 10 additional Amazon Redshift clusters are spun up into ready state, allowing multiple queries to run simultaneously. This number can change in your cluster based on the query types. Notice the number of queries queued starts lowering as more queries are completed using the five additional clusters.

Over time, the concurrency scaling clusters start to wind down progressively to 0 as the queries no longer waited.

Test results review

The following table summarizes our test results.

.	Test Scenario 1	Test Scenario 2	Test Scenario 3
Total Workload Completion Time	299 Minutes	49 Minutes	28 Minutes

The test results reveal how concurrency scaling for a mixed workload of reads and writes lowered the total workload completion time from 299 minutes to 28 minutes, which is more than 10 times an improvement in SLAs while being cost effective by only paying for the additional clusters when scaling is necessary.

Monitor concurrency scaling

One method to monitor concurrency scaling is via system views. To monitor which queries benefitted from concurrency scaling, you can use concurrency_scaling_status from stl_query. Concurrency scaling of 1 indicates that the query ran on a concurrency scaling cluster. To monitor concurrency scaling usage, you can use the SVCS_CONCURRENCY_SCALING_USAGE system view.

The Amazon CloudWatch metrics ConcurrencyScalingActiveClusters and ConcurrencyScalingSeconds enable you to set up monitoring of concurrency scaling usage. For more information, refer to Monitoring Amazon Redshift using CloudWatch metrics.

Configure usage limit

With every 24 hours used of the main Amazon Redshift cluster, you accrue 1 hour of concurrency scaling credit. This free credit can be used by both read and write queries. For any usage that exceeds the accrued free usage credits, you’re billed on a per-second basis based on the on-demand rate of your Amazon Redshift cluster. You can apply cost controls for concurrency scaling at the cluster level. You can choose to create multiple queues for ETL, Dashboard, and adhoc workload. With this you can choose to turn on concurrency scaling for selective queues.

As shown in the following screenshot, you can choose a time period (daily, weekly, or monthly) and specify the desired usage limit. You can then choose an action option (Alert, Log to system table, or Disable feature). For more details on how to set cost controls for concurrency scaling, refer to Manage and control your cost with Amazon Redshift Concurrency Scaling and Spectrum.

Summary

In this post, we showed how you can enable concurrency scaling to help you meet the SLAs for both read and write workloads by seamlessly scaling out to the maximum number of clusters you configured, thereby increasing your cluster throughput while controlling your costs. Concurrency scaling with read and write capability can enable you to handle a number of scenarios, such as sudden increases in the volume of data in your data pipeline, backfill operations, ad hoc reporting, and month end processing. It’s now time to put this learning into action and begin optimizing your Redshift cluster(s) for both read and write throughput!

About the Authors

Harsha Tadiparthi is a specialist Principal Solutions Architect, Analytics at AWS. He enjoys solving complex customer problems in databases and analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

Harshida Patel is a Specialist Principal Solutions Architect, Analytics with AWS.

Ramu Ponugumati is a Sr. Technical Account Manager, specialist in Analytics and AI/ML at AWS. He works with enterprise customers to modernize and cost optimize workloads, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, playing tennis, and gardening.

Build an AWS Lake Formation permissions inventory dashboard using AWS Glue and Amazon QuickSight

2022-12-12 Srividya Parthasarathy

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/build-an-aws-lake-formation-permissions-inventory-dashboard-using-aws-glue-and-amazon-quicksight/

AWS Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning (ML). Lake Formation provides a single place to define fine-grained access control on catalog resources. These permissions are granted to the principals by a data lake admin, and integrated engines like Amazon Athena, AWS Glue, Amazon EMR, and Amazon Redshift Spectrum enforce the access controls defined in Lake Formation. It also allows principals to securely share data catalog resources across multiple AWS accounts and AWS organizations through a centralized approach.

As organizations are adopting Lake Formation for scaling their permissions, there is steady increase in the access policies established and managed within the enterprise. However, it becomes more difficult to analyze and understand the permissions for auditing. Therefore, customers are looking for a simple way to collect, analyze, and visualize permissions data so that they can inspect and validate the policies. It also enables organizations to take actions that help them with compliance requirements.

This solution offers the ability to consolidate and create a central inventory of Lake Formation permissions that are registered in the given AWS account and Region. It provides a high-level view of various permissions that Lake Formation manages and aims at answering questions like:

Who has select access on given table
Which tables have delete permission granted
Which databases or tables does the given principal have select access to

In this post, we walk through how to set up and collect the permissions granted on resources in a given account using the Lake Formation API. AWS Glue makes it straightforward to set up and run jobs for collecting the permission data and creating an external table on the collected data. We use Amazon QuickSight to create a permissions dashboard using an Athena data source and dataset.

Overview of solution

The following diagram illustrates the architecture of this solution.

In this solution, we walk through the following tasks:

Create an AWS Glue job to collect and store permissions data, and create external tables using Boto3.
Verify the external tables created using Athena.
Sign up for a QuickSight Enterprise account and enable Athena access.
Create a dataset using an Athena data source.
Use the datasets for analysis.
Publish the analyses as a QuickSight dashboard.

The collected JSON data is flattened and written into an Amazon Simple Storage Service (Amazon S3) bucket as Parquet files partitioned by account ID, date, and resource type. After the data is stored in Amazon S3, external tables are created on them and filters are added for different types of resource permissions. These datasets can be imported into SPICE, an in-memory query engine that is part of QuickSight, or queried directly from QuickSight to create analyses. Later, you can publish these analyses as a dashboard and share it with other users.

Dashboards are created for the following use cases:

Database permissions
Table permissions
Principal permissions

Prerequisites

You should have the following prerequisites:

An S3 bucket to store the permissions inventory data
An AWS Glue database for permissions inventory metadata
An AWS Identity and Access Management (IAM) role for the AWS Glue job with access to the inventory AWS Glue database and S3 bucket and added as a data lake admin
A QuickSight account with access to Athena
An IAM role for QuickSight with access to the inventory AWS Glue database and S3 bucket

Set up and run the AWS Glue job

We create an AWS Glue job to collect Lake Formation permissions data for the given account and Region that is provided as job parameters, and the collected data is flattened before storage. Data is partitioned by account ID, date, and permissions type, and is stored as Parquet in an S3 bucket using Boto3. We create external tables on the data and add filters for different types of resource permissions.

To create the AWS Glue job, complete the following steps:

Download the Python script file to local.
On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs.
Under Create job, select Python Shell script editor.
For Options, select Upload and edit an existing script.
For File upload, choose Choose file.
Choose the downloaded file (lf-permissions-inventory.py).
Choose Create.

GlueJob

After the job is created, enter a name for the job (for this post, lf-inventory-builder) and choose Save.

Glue Job save

Choose the Job details tab.
For Name, enter a name for the job.
For IAM Role¸ choose an IAM role that has access to the inventory S3 bucket and inventory schema and registered as data lake admin.
For Type, choose Python Shell.
For Python version, choose Python 3.9.

Glue Job Details

You can leave other property values at their default.
Under Advanced properties¸ configure the following job parameters and values:
1. catalog-id: with the value as the current AWS account ID whose permissions data are collected.
2. databasename: with the value as the AWS Glue database where the inventory-related schema objects are created.
3. region: with the value as the current Region where the job is configured and whose permissions data is collected.
4. s3bucket: with the value as the S3 bucket where the collected permissions data is written.
5. createtable: with the value yes, which enables external table creation on the data.

Job Parameters

Choose Save to save the job settings.

Glue Job Save

Choose Run to start the job.

When the job is complete, the run status changes to Succeeded. You can view the log messages in Amazon CloudWatch Logs.

Job Run

Permissions data is collected and stored in the S3 bucket (under lfpermissions-data) that you provided in the job parameters.

S3 Structure

The following external tables are created on the permissions data and can be queried using Athena:

lfpermissions – A summary of resource permissions
lfpermissionswithgrant – A summary of grantable resource permissions

For both tables, the schema structure is the same and the lftype column indicates what type of permissions the row applies to.

Athena Table Schema

Verify the tables using Athena

You can use Athena to verify the data using the following queries.

For more information, refer to Running SQL queries using Amazon Athena

List the database permissions:

Select * from lfpermissions where lftype=’DATABASE’

List the table permissions:

Select * from lfpermissions where lftype= ‘TABLE’

List the data lake permissions:

Select * from lfpermissions where lftype= ‘DATA_LOCATION’

List the grantable database permissions:

Select * from lfpermissionswithgrant where lftype=’DATABASE’

List the grantable table permissions:

Select * from lfpermissionswithgrant where lftype= ‘TABLE’

List grantable data lake permissions:

Select * from lfpermissionswithgrant where lftype= ‘DATA_LOCATION’

As the next step, we create a QuickSight dashboard with three sheets, each focused on different sets of permissions (database, table, principal) to slice and dice the data.

Sign up for a QuickSight account

If you haven’t signed up for QuickSight, complete the following steps:

QuickSight signup

For Edition, select Enterprise.
Choose Continue.
For Authentication method, select Use IAM federated identities & QuickSight-managed users.
Under QuickSight Region, choose the same Region as your inventory S3 bucket.
Under Account info, enter a QuickSight account name and email address for notification.

QuickSight Form

In the Quick access to AWS services section, for IAM Role, select Use QuickSight-managed role (default).
Allow access to IAM, Athena, and Amazon S3.
Specify the S3 bucket that contains the permissions data.
Choose Finish to complete the signup process.

QuickSight configuration

Note: If the inventory bucket and database is managed by Lake Formation, grant database and table access to the created QuickSight IAM role. For instructions, refer to Granting and revoking permissions on Data Catalog resources.

Configure your dataset in QuickSight

QuickSight is configured with an Athena data source the same Region as the S3 bucket. To set up your dataset, complete the following steps:

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.

Quicksight DataSet

Choose Athena as your data source.

QuickSight Datasource

Enter LF_DASHBOARD_DS as the name of your data source.
Choose Create data source.
For Catalog, leave it as AwsDataCatalog.
For Database, choose database name provided as parameter to the Job.
For Tables, select lfpermissions.
Choose Select.

QuickSight Catalog Info

Select Directly query your data and choose Visualize to take you to the analysis.

Quicksight data mode

Create analyses

We create three sheets for our dashboard to view different levels of permissions.

Sheet 1: Database permission view

To view database permissions, complete the following steps:

On the QuickSight console, choose the plus sign to create a new sheet.
Choose Add, then choose Add title.

QuickSight Title

Name the sheet Database Permissions.
Repeat steps (5-7) to add the following parameters:
- catalogid
- databasename
- permission
- tablename
On the Add menu, choose Add parameter.
Enter a name for the parameter.
Leave the other values as default and choose Create.
Choose Insights in the navigation pane, then choose Add control.

QuickSight Control

Add a control for each parameter:
1. For each parameter, for Style¸ choose List, and for Values, select Link to a dataset field.
2. Provide additional information for each parameter according to the following table.

Parameter	Display Name	Dataset	Field
catalogid	AccountID	lfpermissions	catalog_id
databasename	DatabaseName	lfpermissions	databasename
permission	Permission	lfpermissions	permission

Add a control dependency and for Database, choose the options menu and choose Edit.

QuickSight Dependency

Under Format control, choose Control options.
Change the relevant values, choose AccountID, and choose Update.
Similarly, under Permission control, choose Control options.
Change the relevant values, choose AccountID, and choose Update.

We create two visuals for this view.

For the first visual, choose Visualize and choose pivot table as the visual type.
Drag and drop catalog_id and databasename into Rows.
Drag and drop permission into Column.
Drag and drop principal into Values and change the aggregation to Count distinct.

QuickSight Database View1

Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. DATABASE as the value.
Add a filter on catalog_id the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.
Add a filter on databasename with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose databasename.
Add a filter on permission with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose permission.
Choose Actions in the navigation pane.
Define a new action with the following parameters:
1. For Activation, select Select.
2. For Filter action, select All fields.
3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

Add a second visual and choose the table visual type.
Drag and drop principal to Group by.
Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. DATABASE as the value.
Add a filter on catalog_id the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.
Add a filter on databasename the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose databasename.
Add a filter on permission with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose permission.

Now the Database and Permission drop-down menus are populated based on the relevant attributes and changes dynamically.

QuickSight Database View2

Sheet 2: Table permission view

Now that we have created the database permissions sheet, we can add a table permissions sheet.

Choose the plus sign to add a new sheet.
On the QuickSight console, choose Add, then choose Add title.
Name the sheet Table Permissions.
Choose Insights in the navigation pane, then choose Add control.
Add a control for each parameter:
1. For each parameter, for Style¸ choose List, and for Values, select Link to a dataset field.
2. Provide the additional information for each parameter according to the following table.

Parameter	Display Name	Dataset	Field
catalogid	AccountID	lfpermissions	catalog_id
databasename	DatabaseName	lfpermissions	databasename
permission	Permission	lfpermissions	permission
tablename	TableName	lfpermissions	tablename

We create two visuals for this view.

For the first visual, choose Visualize and choose pivot table as the visual type.
Drag and drop catalog_id, databasename, and tablename into Rows.
Drag and drop permission into Column.
Drag and drop principal into Values and change the aggregation to Count distinct.
Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. TABLE as the value.
Add a filter on catalog_id the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.
Add a filter on the databasename with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose databasename.
Add a filter on tablename with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose tablename.
Add a filter on permission with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose permission.
Choose Actions in the navigation pane.
Define a new action with the following parameters:
1. For Activation, select Select.
2. For Filter action, select All fields.
3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

Add a second visual and choose the table visual type.
Drag and drop principal to Group by.
Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. TABLE as the value.
Add a filter on catalog_id the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.
Add a filter on the databasename with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose databasename.
Add a filter on tablename with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose tablename.
Add a filter on permission with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose permission.

Now the Databasename, Tablename, and Permission drop-down menus are populated based on the relevant attributes.

QuickSight Table Permissions

Sheet 3: Principal permission view

Now we add a third sheet for principal permissions.

Choose the plus sign to add a new sheet.
On the QuickSight console, choose Add, then choose Add title.
Name the sheet Principal Permissions.
Choose Insights in the navigation pane, then choose Add control.
Add a control for the catalogid parameter:
1. For Style¸ choose List, and for Values, select Link to a dataset field.
2. Provide the additional information for the parameter according to the following table.

Parameter	Display Name	Dataset	Field
catalogid	AccountID	lfpermissions	catalog_id

We create four visuals for this view.

For the first visual, choose Visualize and choose pivot table as the visual type.
Drag and drop catalog_id and principal into Rows.
Drag and drop permission into Column.
Drag and drop databasename into Values and change the aggregation to Count distinct.
Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. DATABASE as the value.
Add a filter on the catalog_id field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.
Choose Actions in the navigation pane.
Define a new action with the following parameters:
1. For Activation, select Select.
2. For Filter action, select All fields.
3. For Target visuals, select Select visuals and Check Databasename.
For the second visual, choose Visualize and choose table as the visual type.
Drag and drop databasename into Group by.
Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. DATABASE as the value.
Add a filter on the catalog_id field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.
For the third visual, choose Visualize and choose pivot table as the visual type.
Drag and drop catalog_id and principal into Rows.
Drag and drop permission into Column.
Drag and drop tablename into Values and change the aggregation to Count distinct.
Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. TABLE as the value.
Add a filter on the catalog_id field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.
Choose Actions in the navigation pane.
Define a new action with the following parameters:
1. For Activation, select Select.
2. For Filter action, select All fields.
3. For Target visuals, select Select visuals and Check Tablename.
For the final visual, choose Visualize and choose table as the visual type.
Drag and drop tablename into Group by.
Add a filter on the lftype field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. TABLE as the value.
Add a filter on the catalog_id field with the following options:
1. Custom filter as the filter type.
2. Equals as the filter condition.
3. Select Use parameters and choose catalogid.

The following screenshot shows our sheet.

QuickSight Prinicipal View

Create a dashboard

Now that the analysis is ready, you can publish it as a dashboard and share it with other users. For instructions, refer to the tutorial Create an Amazon QuickSight dashboard.

Clean up

To clean up the resources created in this post, complete the following steps:

Delete the AWS Glue job lf-inventory-builder.
Delete the data stored under the bucket provided as the value of the s3bucket job parameter.
Drop the external tables created under the schema provided as the value of the databasename job parameter.
If you signed up for QuickSight to follow along with this post, you can delete the account.
For an existing QuickSight account, delete the following resources:
1. lfpermissions dataset
2. lfpermissions analysis
3. lfpermissions dashboard

Conclusion

In this post, we provided a design and implementation steps for a solution to collect Lake Formation permissions in a given Region of an account and consolidate them for analysis. We also walked through the steps to create a dashboard using Amazon QuickSight. You can utilize other QuickSight visuals to create more sophisticated dashboards based on your requirements.

You can also expand this solution to consolidate permissions for a multi-account setup. You can use a shared bucket across organizations and accounts and configure an AWS Glue job in each account or organization to write their permission data. With this solution, you can maintain a unified dashboard view of all the Lake Formation permissions within your organization, thereby providing a central audit mechanism to comply with business requirements.

Thanks for reading this post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.

About the Author

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

Enable Multi-AZ deployments for your Amazon Redshift data warehouse

2022-12-09 Ranjan Burman

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/enable-multi-az-deployments-for-your-amazon-redshift-data-warehouse/

November 2023: This post was reviewed and updated with the general availability of Multi-AZ deployments for provisioned RA3 clusters.
Originally published on December 9th, 2022.

Amazon Redshift is a fully managed, petabyte scale cloud data warehouse that enables you to analyze large datasets using standard SQL. Data warehouse workloads are increasingly being used with mission-critical analytics applications that require the highest levels of resilience and availability. Amazon Redshift is a cloud-based data warehouse that supports many recovery capabilities to address unforeseen outages and minimize downtime. Amazon Redshift RA3 instance types store their data in Redshift Managed Storage (RMS), which is backed by Amazon Simple Storage Service (Amazon S3) making it highly available and durable by default. Amazon Redshift also supports automatic backups that can recover a data warehouse, automatically remediate failures and relocates clusters to different AZs without changes to applications. Although many customers benefit from these features, enterprise data warehouse customers require a low Recovery Time Objective (RTO) and higher availability to support their business continuity with minimal impact to applications.

Amazon Redshift just announced the general availability of Multi-AZ deployments for provisioned RA3 clusters that support running your data warehouse in two Availability Zones simultaneously and can continue operating in unforeseen failure scenarios. A Multi-AZ deployment is intended for customers with mission-critical analytics applications that require the highest levels of resilience and availability.

A Redshift Multi-AZ deployment leverages compute resources in two AZs to scale data warehouse workload processing. In situations where there is a high level or concurrency Redshift will automatically leverage the resources in both AZs to scale the workload for both read and write requests.

Our pre-launch tests found that Amazon Redshift Multi-AZ deployments reduce recovery time to under 60 seconds or less in the unlikely case of an AZ failure.

Single-AZ vs. Multi-AZ deployment

Amazon Redshift requires a cluster subnet group to create a cluster in your VPC. The cluster subnet group includes information about the VPC ID and a list of subnets in your VPC. When you launch a cluster, Amazon Redshift either creates a default cluster subnet group automatically or you choose a cluster subnet group of your choice so that Amazon Redshift can provision your cluster in one of the subnets in the VPC. You can configure your cluster subnet group to add subnets from different Availability Zones that you want Amazon Redshift to use for cluster deployment.

All Amazon Redshift clusters today are created and situated in a particular Availability Zone within an AWS Region and thus called Single-AZ deployments. For a Single-AZ deployment, Amazon Redshift selects the subnet from one of the Availability Zones within a Region and deploys the cluster there. You can choose an Availability Zone for deployment, and Amazon Redshift will deploy your cluster in the chosen Availability Zone based on the subnets provided.

On the other hand, a multi-AZ deployment is provisioned in two Availability Zones simultaneously. For a Multi-AZ deployment, Amazon Redshift automatically selects two subnets from two different Availability Zones and deploys an equal number of compute nodes in each Availability Zone. All these compute nodes are utilized via a single endpoint as compute nodes from both Availability Zones are used for workload processing.

As shown in the following diagrams, Amazon Redshift deploys a cluster in a single Availability Zone for Single-AZ deployment, and two Availability Zones for Multi-AZ deployment.

Auto recovery of multi-AZ deployment

In the unlikely event of an Availability Zone failure, Amazon Redshift Multi-AZ deployments continue to serve your workloads by automatically using resources in the other Availability Zone. You are not required to make any application changes to maintain business continuity during unforeseen outages since a multi-AZ deployment is accessed as a single data warehouse with one endpoint. Amazon Redshift Multi-AZ deployments are designed to ensure there is no data loss, and you can query all data committed up until the point of failure.

As shown in the below diagram, if there is an unlikely event that causes compute nodes in AZ1 to fail, then a multi-AZ deployment automatically recovers to use compute resources in AZ2. Amazon Redshift will also automatically provision identical compute nodes in another availability zone (AZ3) to continue operating simultaneously in two Availability zones (AZ2 and AZ3).

Amazon Redshift Multi-AZ deployment is not only used for protection against the possibility of Availability Zone failures, but it can also maximize your data warehouse performance by automatically distributing workload processing across two Availability Zones. A Multi-AZ deployment will always process an individual query using compute resources only from one Availability Zone, but it can automatically distribute processing of multiple simultaneous queries to both Availability Zones to increase overall performance for high concurrency workloads.

It’s a good practice to set up automatic retries in your extract, transform, and load (ETL) processes and dashboards so that they can be reissued and served by the cluster in the secondary Availability Zone when an unlikely failure happens in the primary Availability Zone. If a connection is dropped, it can then be retried or reestablished immediately. In addition, queries and loads that were running in the failed Availability Zone will be aborted. New queries issued at or after a failure occurs may experience run delays while the multi-AZ data warehouse is being recovered to a two AZ setup.

Overview of solution

In this post, we provide a walkthrough of how to create and manage a Multi-AZ deployment for Amazon Redshift using the AWS Management Console. We also test the fault tolerance of an Amazon Redshift Multi-AZ data warehouse and monitor queries in your Multi-AZ deployment.

Create a new Multi-AZ deployment from the console

You can easily create a new multi-AZ deployments through Amazon Redshift console. Amazon Redshift will deploy the same number of nodes in each of the two Availability Zones for a Multi-AZ deployment. All nodes of a multi-AZ deployment can perform read and write workload processing during normal operation. A Multi-AZ deployment is supported only for provisioned RA3 clusters.

Follow these steps to create an Amazon Redshift provisioned cluster in two Availability Zones:

On the Amazon Redshift console, in the navigation pane, choose Clusters.
Click on Create cluster.

For general information about creating clusters, see Creating a cluster.

Choose one of the RA3 node types on the Node type drop-down menu. The Multi-AZ deployment option only becomes available when you choose an RA3 node type.
For Multi-AZ deployment, select Multi-AZ option.
For Number of nodes per AZ, enter the number of nodes that you need for your cluster.

Under the Database configurations, choose Admin user name and Admin user password.
Turn Use defaults on next to Additional configurations to modify the default settings.
Under Network and security, specify the following:
1. For Virtual private cloud (VPC), choose the VPC you want to deploy the cluster in.
2. For VPC security groups, either leave as default or add the security groups of your choice.
3. For Cluster subnet group, either leave as default or add a cluster subnet group of your choice. For a Multi-AZ deployment, a cluster subnet group must include one subnet each from at least three or more different Availability Zones.

For general information about managing cluster subnet groups, see Cluster subnet groups

Under Database configuration, for Database port, you either use the default value 5439 or choose a value from the range of 5431–5455 and 8191–8215.
Under Database configuration, in the Database encryption section, to use a custom AWS Key Management Service (AWS KMS) key other than the default KMS key, choose Customize encryption settings. This option is deselected by default.
Under Choose an AWS KMS key, you can either choose an existing KMS key, or choose Create an AWS KMS key to create a new KMS key.

For more information to create key using KMS, refer to Creating keys.

Choose Create cluster.

When the cluster creation succeeds, you can view the details on the cluster details page.

Under General information, you can see Multi-AZ as Yes.

On the Properties tab, under Network and security settings, you can find the details on the primary and secondary Availability Zone.

Create a new Multi-AZ deployment from the CLI

The following create-cluster AWS CLI command shows how to create a Multi-AZ cluster

aws redshift create-cluster 
--port 5439 
    --master-username master
    --master-user-password ######
    --node-type ra3.4xlarge
    --number-of-nodes 2
    --profile maz-test
    --endpoint-url https://redshift.us-east-1.amazonaws.com
    --region eu-west-1
    --cluster-identifier redshift-cluster-1
    --multi-az 
    --maintenance-track-name CURRENT
    --encrypted

Convert a Single-AZ deployment to Multi-AZ deployment

To convert an existing Single-AZ deployment to a Multi-AZ deployment, you can go to the Redshift console and select your Redshift cluster that currently is Single-AZ setup and navigate to Actions and select Activate Multi-AZ. Your Single-AZ cluster must be encrypted for a successful conversion to Multi-AZ. During conversion to Multi-AZ, Redshift will double the total number of nodes distributing them equally in each AZ. Redshift will not allow you to split existing number of nodes while converting to Multi-AZ to maintain consistent query performance.

Complete the following steps to create a Multi-AZ deployment restored from a snapshot:

On the Amazon Redshift console, in the navigation pane, choose Clusters.
Select your cluster and navigate to the cluster details page.
On the Actions menu, choose Activate Multi-AZ.

Review the modification summary and confirm by choosing Activate Multi-AZ.

Using the below AWS CLI command you can convert a single AZ Redshift data warehouse to Multi-AZ.

aws redshift modify-cluster 
    --profile maz-test
    --endpoint-url https://redshift.eu-west-1.amazonaws.com
    --region eu-west-1
    --cluster-identifier redshift-cluster-1
    --multi-az

Convert a Multi-AZ deployment to Single-AZ deployment

Redshift also supports conversion of a Multi-AZ deployment into Single-AZ. This option provides customers with the flexibility to switch between different deployments with few easy steps as follows:

On the Amazon Redshift console, in the navigation pane, choose Clusters.
Select your cluster and navigate to the cluster details page.
On the Actions menu, choose Deactivate Multi-AZ.
Review the modification summary and confirm by choosing Deactivate Multi-AZ.

Creating a Multi-AZ data warehouse restored from a snapshot

Existing customers can also create a Multi-AZ deployment by restoring a snapshot from an existing Single-AZ deployment. See the required steps as below.

On the Amazon Redshift console, in the navigation pane, choose Clusters.
Select the cluster and navigate to the cluster details page.
Choose the Maintenance
Select a snapshot and choose Restore snapshot, Restore to provisioned cluster.
Review the Cluster configuration and Cluster details values of the new cluster to be created using the snapshot information.
Select Multi-AZ option and update the properties of the new cluster, then choose Restore cluster from snapshot at the bottom of the page.

Resizing a Multi-AZ data warehouse

Redshift Multi-AZ feature also supports resizing Multi-AZ Redshift cluster deployments to change the cluster configuration based on scaling needs. You can change both number and type of nodes as per needs.

On the Amazon Redshift console, in the navigation pane, choose Clusters.
Select your cluster and navigate to the cluster details page.
On the Actions menu, choose
Once selected it will bring into another screen to show cluster resize screen where you can choose type and number of nodes and click on Resize cluster.

Failing over Multi-AZ deployment

In addition to the automatic recovery process, you can also trigger this process manually for your data warehouse using the Failover primary compute option. This approach can be used to manage operational maintenance and other planned operational procedures as per the needs of the respective environment. When the cluster successfully recovers, Multi-AZ deployment becomes available. Your Multi-AZ deployment also automatically provisions new compute nodes in another Availability Zone as soon as it is available.

Let’s manually trigger the Failover of your Redshift Multi-AZ deployment.

On the Amazon Redshift console, choose Clusters in the navigation pane.
Navigate to the cluster detail page
From Actions, choose Failover primary compute.
When prompted, choose Confirm.

After the cluster is back to Available status, you can observe that the primary and secondary Availability Zones have changed.

The following screenshot shows the status before injecting failure.

The following screenshot shows the status after injecting failure.

Restore a table from snapshot

You can restore a single table from a snapshot from your Multi-AZ cluster. When you restore a single table from a snapshot, you specify the source snapshot, database, schema, and table name, and the target database, schema, and a new table name for the restored table.

To restore a table from a snapshot:

On the Amazon Redshift console, in the navigation pane, choose Clusters.
Select your cluster and navigate to the cluster details page.
On the Actions menu, choose Restore table.
Enter the information about which snapshot, source table, and target table to use, and then choose Restore table.

Enable public connections for your Multi-AZ data warehouse

From the navigation menu, choose CLUSTERS.
Choose the Multi-AZ cluster that you want to modify.
Choose Actions.
Choose Turn on Publicly accessible.
Choose Elastic IP address, if you do not choose one, an address will be randomly assigned to you.
Choose Save changes.

Monitor queries for Multi-AZ deployments

A Multi-AZ deployment uses compute resources that are deployed in both Availability Zones and can continue operating in the event that the resources in a given Availability Zone are not available. All the compute resources are used at all times, which allows full operation across two Availability Zones in both read and write operations.

You can query SYS_views in the pg_catalog schema to monitor Multi-AZ query runs. The SYS_views cover query run activities and stats from primary and secondary clusters.

The following are the system tables in the SYS_view list:

Follow these steps to monitor the query run on Multi-AZ deployment from the Amazon Redshift Console:

On the Amazon Redshift console, connect to the database in your Multi-AZ deployment and run queries through the query editor.
Run any sample query on the Multi-AZ Redshift deployment.
For a Multi-AZ deployment, you can identify a query and the Availability Zone where it is being run (running on the primary or secondary availability zone) by using the compute_type column in the SYS_QUERY_HISTORY table. The valid values for the compute type column are as follows:
1. primary – When run on primary availability zone in the Multi-AZ deployment.
2. secondary – When run on secondary availability zone in the Multi-AZ deployment.

The following is a sample query using the compute_type column to monitor a query:

dev=# select (compute_type) as compute_type, left(query_text, 50) query_text from sys_query_history order by start_time desc;

 compute_type | query_text
--------------+----------------------------------------------------
 secondary    | select count(*) from t1;
 primary 	   select count(*) from t2;

You can also access the query history from the console to analyze your query diagnostics.

On the Query monitoring tab, choose Connect to database.

For Connection, choose Create a new connection
For Authentication, choose Temporary credentials
For Database name, enter the database name (for example, dev).
For Database user, enter the database user name (for example, awsuser).
Choose Connect.

After you’re connected, under Query Monitoring, on the Query history tab, you can view all the queries and loads, as shown in the following screenshot.

Under Metric filters, you can use the various filters in the Additional filtering options section to view query history based on Time interval, Users, Databases, or SQL commands.

There are a few limitations when working with Amazon Redshift Multi-AZ in preview mode, refer here for the limitations.

Customer feedback

Janssen Pharmaceuticals, a subsidiary of Johnson & Johnson, researches and manufactures medicines with a focus on the changing needs of patients and the healthcare industry.

“Janssen Pharmaceutical uses Amazon Redshift to enable critical insights that drive important business decisions for our data scientists, data stewards, business users, and external stakeholders. With Amazon Redshift Multi-AZ, we can be confident that our data warehouse will always be available without any disruptions that might delay impact our ability to make critical business decisions.”

– Shyam Mohapatra, Director of Information Technology – Janssen Pharmaceutical Companies of Johnson & Johnson

Stripe is a technology company that builds economic infrastructure for the internet. Stripe’s products power payments for online and in-person retailers, subscriptions businesses, software platforms and marketplaces, and everything in between.

“Millions of companies use Stripe’s software and APIs to accept payments, send payouts, and manage their businesses online. Access to their Stripe data via leading data warehouses like Amazon Redshift has been a top request from our customers. Our customers needed highly available, secure, fast, and integrated analytics at scale without building complex data pipelines or moving and copying data around. With Stripe Data Pipeline for Amazon Redshift, we’re helping our customers set up a direct and reliable data pipeline in a few clicks.

Stripe Data Pipeline enables our customers to automatically share their complete, up-to-date Stripe data with their Amazon Redshift data warehouse, and take their business analytics and reporting to the next level.”

– Brian Brunner, Senior Manager, Engineering at Stripe

Conclusion

This post demonstrated how to configure an Amazon Redshift Multi-AZ deployment in two Availability Zones and test the fault tolerance of your workloads during an unlikely failure of an Availability Zone. Amazon Redshift Multi-AZ deployment also helps improve overall performance of your data warehouse because compute nodes in both Availability Zones are used for read and write operations. Amazon Redshift Multi-AZ data warehouse helps meet the demands of customers with mission critical analytics applications that require the highest levels of availability and resiliency. For more details, refer Configuring Multi-AZ deployment.

About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 16 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.

Saurav Das is part of the Amazon Redshift Product Management team. He has more than 16 years of experience in working with relational databases technologies and data protection. He has a deep interest in solving customer challenges centered around high availability and disaster recovery.

Anusha Challa is a Senior Analytics Specialist Solutions Architect focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. She is passionate about data analytics and data science.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Suresh Patnam is a Principal BDM – GTM AI/ML Leader at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible by using data and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Query cross-account Amazon DynamoDB tables using Amazon Athena Federated Query

2022-12-08 Satyanarayana Adimula

Post Syndicated from Satyanarayana Adimula original https://aws.amazon.com/blogs/big-data/query-cross-account-amazon-dynamodb-tables-using-amazon-athena-federated-query/

Amazon DynamoDB is ideal for applications that need a flexible NoSQL database with low read and write latencies and the ability to scale storage and throughput up or down as needed without code changes or downtime. You can use DynamoDB for use cases including mobile apps, gaming, digital ad serving, live voting, audience interaction for live events, sensor networks, log ingestion, access control for web-based content, metadata storage for Amazon S3 objects, e-commerce shopping carts, and web session management.

What if you have the need to allow other AWS accounts to query your DynamoDB table? What if other accounts need to join data on your DynamoDB table with their data stored in data sources like Amazon CloudWatch, Amazon DocumentDB, Amazon Redshift, Amazon OpenSearch, MySQL, PostgreSQL connected with Athena data source connectors, and Amazon S3?

Amazon Athena cross-account federated query enables you to run SQL queries across data stored in relational, non-relational, object, and custom data sources where data source and its connector are in different AWS accounts from the user querying the data. There are no new charges for querying connectors in another account, but Athena’s standard rates for data scanned, Lambda usage, and other services apply.

This post will demonstrate Athena in an AWS account accessing a DynamoDB table of another AWS account by using the Athena cross-account federated query. It also explains deploying Amazon Athena DynamoDB connector using AWS Serverless Application Repository and setting up Athena cross-account federation between two accounts for the Demo.

Walkthrough

The solution has the following steps to demonstrate Athena cross-account federated query:

Set up Athena federation – To deploy a Lambda function for the data source connector and connect it to a data source.
Set up Athena cross-account federation – To set up IAM permissions for Athena cross-account federation.
Test Athena cross-account federated query – To show a demo of how an AWS account can share its DynamoDB table as an Athena data source with another AWS account.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Two AWS Accounts
AWS resources: Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon DynamoDB, AWS Lambda

Data source connectors

A data source connector is a piece of code that can translate between your target data source and Athena. Athena uses data source connectors that run on AWS Lambda to run federated queries. You can think of a connector as an extension of Athena’s query engine.

Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust.

Athena uses data source connectors that run on AWS Lambda to run federated queries. Since connectors are processed in Lambda, they can be used to access data from any data source on the cloud or on premises that is accessible from Lambda

To use a connector in your Athena queries, deploy it to your account using one of the following ways:

AWS Serverless Application Repository
Athena and Lambda consoles

This blog uses the AWS Serverless Application Repository to deploy the Amazon Athena DynamoDB connector.

After you deploy data source connectors, the connector is associated with a catalog that you can specify in SQL queries. You can combine SQL statements from multiple catalogs and span multiple data sources with a single query. When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

Architecture

AWS Account-A has a DynamoDB table called Music.
Account-A has an Athena data source connector to federate into DynamoDB.
AWS Account-B has Analysts who need to query the DynamoDB table.
Account-A is sharing the Athena data source with Account-B by using Athena cross-account federated query.

The following figure shows Amazon Athena cross-account federation for Account-B to access DynamoDB in Account-A.

To demonstrate the Athena cross-account federation, create a sample DynamoDB table called music in Account-A. Follow the instructions at Getting started with DynamoDB to create the table Music and load thesample data.

Set up Athena federation

Preparing to create federated queries is a two-part process: deploying a Lambda function for the data source connector and connecting the Lambda function to a data source. For more details, see Enabling cross-account federated queries.

Deploy AthenaDynamoDBConnector using AWS Serverless Application Repository

Sign in as an administrator to AWS Account-A.
Open the Serverless Application Repository.
In the navigation pane, choose Available applications.
Select the option Show apps that create custom IAM roles or resource policies.
In the search box, type the name of the connector AthenaDynamoDBConnector.
Choosing a connector opens the Lambda function’s Application details page in the AWS Lambda console.
On the right side of the details page, for Application settings, fill in the required information.
- Application name – Name of AWS CloudFormation Stack to deploy the connector: AthenaDynamoDBConnector.
- AthenaCatalogName – It is the catalog name to create in Athena. It is also the name of the Lambda function. Give it in lowercase: acct1dynamodb.
- SpillBucket – Specify an existing S3 bucket (spill-bucket) in your account to receive data from any large response payloads that exceed Lambda function response size limits.
Select I acknowledge that this app creates custom IAM roles and resource policies. For more information, choose the Info link.
At the bottom right of the Application settings section, choose Deploy.
Serverless Application Repository will create an AWS CloudFormation stack to deploy the connector.
When the deployment is complete, you will see the Lambda function in the Resources section of the AWS CloudFormation stack. Note down the Lambda function name.

Connect Athena to the data source

Go to Athena console in Account-A.
Choose Data sources. Click Create Data source.
In Choose data source, search for Amazon DynamoDB and select it.
In Data source details, give a Data source name acct1dynamodb
For Lambda function in the Connection details section, choose the name of the function acct1dynamodb from the dropdown.
On the Review and create page, review the data source details, and then choose Create data source.
You will see the data source acctdynamodb in the Data sources.
Go to Query editor. Choose the Data Source acct1dynamodb from the dropdown.
You will see all the tables in the shared data source.

Run the following SQL in Athena Query editor

SELECT songtitle, albumtitle, cast(awards as int) as awards 
FROM "acct1dynamodb"."default"."music" 
WHERE artist = 'Acme Band' 
limit 2;

Verify Athena federation works.

Set up Athena cross-account federation

In Account-A: Set up IAM permissions for cross-account

Sign in as an administrator to Account-A.
On the S3 spill bucket (of the Lambda function), grant GetObject and ListBucket permissions to the IAM user analyst of Account-B.

Note: Replace Account-B-id with your actual AWS cross-account id to which you want to share the DynamoDB table. Replace spill-bucket with your actual S3 bucket in Account-A.

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": ["arn:aws:iam::Account-B-id:user/analyst"]
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
             ],
            "Resource": [
                "arn:aws:s3::: spill-bucket",
                "arn:aws:s3::: spill-bucket/*"
            ]
        }
     ]
 }

Grant InvokeFunction on Lambda function acct1dynamodb to IAM user analyst of Account-B.

Note: Replace Account-A-id with your actual AWS account id where you have the DynamoDB table. Replace Account-B-id with your actual AWS cross-account id to which you want to share the DynamoDB table.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "CrossAccountInvocationStatement",
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::Account-B-id:user/analyst"]
      }, 
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:aws-region:Account-A-id:function:acct1dynamodb"
    }
  ]
}

Go to the Lambda function acct1dynamodb. Choose Configuration and Permissions.

Go to Resource-based policy and Add permissions.

When you save the above permissions, you can see them under Policy statements in Resource-based policy of the Lambda function.

In Account-B: Set up IAM permissions for cross-account

Sign in as an administrator to AWS Account-B.
Create IAM role called AthenaCrossAccountFederated-Account-A-id for Account-A to assume. Add the following inline policy to the role.

Note: Replace Account-B-id with your actual AWS cross-account id to which you want to share the DynamoDB table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "athena:CreateDataCatalog",
            "Resource": "arn:aws:athena:aws-region:Account-B-id:datacatalog/*"
        }
    ]
}

Grant permission to the IAM user analyst to invoke the Lambda function acct1dymanodb of Account-A

Note: Replace Account-A-id with your actual AWS account id where you have the DynamoDB table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "arn:aws:lambda:aws-region:Account-A-id:function:acct1dynamodb"
        }
    ]
}

Share the Athena Data source with Account-B

After the permissions are in place, you share a data connector in your account (Account-A) with another account (Account-B). Account-A retains full control and ownership of the connector. When Account-A makes configuration changes to the connector, the updated configuration applies to the shared connector in Account-B.

Sign in as an administrator to Account-A.
On Athena, go to Data sources, choose data source acct1dynamodb you want to share. Go to the Share option in the top right corner.

For Account ID, enter the Account-B-id to share your data source with Account-B and click Share.

Test Athena cross-account federated query: Access the shared data source from Account-B

Sign in as IAM user analyst to Account-B.
In Athena, go to Data sources. You will see the data source acct1dynamodb.

Go to Query editor. Choose the Data Source acct1dynamodb from the dropdown.

You will see all the tables in the shared data source.

Run the following SQL in Athena Query editor

SELECT songtitle, albumtitle, cast(awards as int) as awards 
FROM "acct1dynamodb"."default"."music" 
WHERE artist = 'Acme Band' 
limit 2;

Athena cross-account federated has worked! This validates that user analyst in Account-B can see the data of the DynamoDB table of Account-A.

Clean up

To avoid incurring future charges, delete the following resources that were provisioned for this demo:

S3 spill bucket used in AWS Lambda
Lambda function used for the data source connector
Sample DynamoDB table

Conclusion

In this post, we saw how you can access a cross-account DynamoDB table using Athena Federated Query to query the data in place. You can execute a single SQL query to join this data across data sources like Amazon CloudWatch, Amazon DocumentDB, Amazon Redshift, Amazon OpenSearch, MySQL, PostgreSQL, Oracle, SQL Server, HBase, Redis, BigQuery, Snowflake, Teradata with Athena data source connectors and Amazon S3.

About the author

Satya Adimula is a Senior Data Architect at AWS based in Boston. With extensive experience in data and analytics, Satya helps organizations derive their business insights from the data at scale.

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

2022-12-07 Jason Pedreza

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse.

Data ingestion is the process of getting data to Amazon Redshift. You can leverage one of the many zero-ETL integration methods to make data available in Amazon Redshift directly. However, if your data is in your Amazon S3 bucket, then you can simply load data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift using the COPY command. A COPY command is the most efficient way to load a table from S3 because it uses the Amazon Redshift’s massively parallel processing (MPP) architecture to read and load data in parallel.

Amazon Redshift launched auto-copy support to simplify data loading from Amazon S3 into Amazon Redshift. You can now setup continuous file ingestion rules to track your Amazon S3 paths and automatically load new files without the need for additional tools or custom solutions. This also enables end users to have the latest data available in Amazon Redshift shortly after the source data is available.

This post shows you how to build automatic file ingestion pipelines in Amazon Redshift when source files are located on Amazon S3 by using a simple SQL command. In addition, we show you how to enable auto-copy using auto-copy jobs, how to monitor jobs, considerations, and best practices.

Overview of the auto-copy feature in Amazon Redshift

The auto-copy feature in Amazon Redshift leverages the S3 event integration to automatically load data into Amazon Redshift and simplifies automatic data loading from Amazon S3 with a simple SQL command. You can enable Amazon Redshift auto-copy by creating auto-copy jobs. A auto-copy job is a database object that stores, automates, and reuses the COPY statement for newly created files that land in the S3 folder.

The following diagram illustrates this process.

S3 event integration and auto-copy jobs have the following benefits:

Users can now load data from Amazon S3 automatically without having to build a pipeline or using an external framework
auto-copy jobs offer automatic and incremental data ingestion from an Amazon S3 location without the need to implement a custom solution
This functionality comes at no additional cost
Existing COPY statements can be converted into auto-copy jobs by appending the JOB CREATE <job_name> parameter
It keeps track of loaded files and minimizes data duplication.
It can be quickly set up using a simple SQL statement using your choice of JDBC/ODBC clients.
It has automatic error handling of bad quality data files.
It has a mechanism to load-once for each file. This means that there is no need to generate explicit manifest files.

Prerequisites

To get started with auto-copy, you need the following prerequisites:

An AWS account
An encrypted Amazon Redshift provisioned cluster or Amazon redshift serverless workgroup
An Amazon S3 bucket

Add following to the Amazon S3 bucket policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Auto-Copy-Policy-01",
            "Effect": "Allow",
            "Principal": {
                "Service":"redshift.amazonaws.com"
                    
                
            },
            "Action": [
                "s3:GetBucketNotification",
                "s3:PutBucketNotification",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<<your-s3-bucket-name>>",
            "Condition": {
                "StringEquals": {
                    "aws:SourceArn": "arn:aws:redshift:<region-name>:<aws-account-id>:integration:*",
                    "aws:SourceAccount": "<aws-account-id>"
                }
            }
        }
    ]
}

Set up Amazon S3 event Integration

An Amazon S3 event integration facilitates seamless and automated data ingestion from S3 buckets into an Amazon Redshift data warehouse, streamlining the process of transferring and storing data for analytical purposes

Sign in to the AWS Management Console and Navigate to Amazon Redshift home page. Under Integrations section choose S3 event integrations
Choose Create S3 event integration
Enter Integration name and Description, choose Next
Choose Browse S3 buckets, a dialog box pops up, select the Amazon S3 bucket and choose Continue
Amazon S3 bucket is selected. Choose Next
Choose Browse Redshift data warehouse
Choose the Amazon Redshift data warehouse and choose Continue
Then Amazon Redshift resource policy needs access to S3 event integration. In case of Resource policy error, check Fix it for me and choose Next
Add Tags as required and choose Next
Review changes and choose Create S3 event integration
An S3 event integration is created. Wait until the status of S3 event integration is Active

Set up auto-copy jobs

In this section, we demonstrate how to automate data loading of files from Amazon S3 into Amazon Redshift. With the existing COPY syntax, we add the JOB CREATE parameter to perform a one-time setup for automatic file ingestion. See the following code:

COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name> [AUTO ON | OFF];

Auto ingestion is enabled by default on auto-copy jobs. Files already present at the S3 location will not be visible to the auto-copy job. Only files added after JOB creation are tracked by Amazon Redshift.

Automate ingestion from a single data source

With a auto-copy job, you can automate ingestion from a single data source by creating one job and specifying the path to the S3 objects that contain the data. The S3 object path can reference a set of folders that have the same key prefix.

In this example, we have multiple files that are being loaded on a daily basis containing the sales transactions across all the stores in the US. For this we can create a store_sales folder in the bucket.

The following code creates the store_sales table:

DROP TABLE IF EXISTS public.store_sales;
CREATE TABLE IF NOT EXISTS public.store_sales
(
  ss_sold_date_sk int4,            
  ss_sold_time_sk int4,     
  ss_item_sk int4 not null,      
  ss_customer_sk int4,           
  ss_cdemo_sk int4,              
  ss_hdemo_sk int4,         
  ss_addr_sk int4,               
  ss_store_sk int4,           
  ss_promo_sk int4,           
  ss_ticket_number int8 not null,        
  ss_quantity int4,           
  ss_wholesale_cost numeric(7,2),          
  ss_list_price numeric(7,2),              
  ss_sales_price numeric(7,2),
  ss_ext_discount_amt numeric(7,2),             
  ss_ext_sales_price numeric(7,2),              
  ss_ext_wholesale_cost numeric(7,2),           
  ss_ext_list_price numeric(7,2),               
  ss_ext_tax numeric(7,2),                 
  ss_coupon_amt numeric(7,2), 
  ss_net_paid numeric(7,2),   
  ss_net_paid_inc_tax numeric(7,2),             
  ss_net_profit numeric(7,2),
  primary key (ss_item_sk, ss_ticket_number)
 ) 
DISTKEY (ss_item_sk) 
SORTKEY (ss_sold_date_sk);

Next, we create the auto-copy job to automatically load the gzip-compressed files into the store_sales table:

COPY store_sales
FROM 's3://aws-redshift-s3-auto-copy-demo/store_sales'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
gzip delimiter '|' EMPTYASNULL
region 'us-east-1'
JOB CREATE job_store_sales AUTO ON;

Each day’s sales transactions are loaded to their own folder in Amazon S3.

Now upload the files for transaction sold on 2002-12-31. Each folder contains multiple gzip-compressed files.

Since the auto-copy job is already created, it automatically loads the gzip-compressed files located in the S3 object path specified in the COPY command to the store_sales table.

Let’s run a query to get the daily total of sales transactions across all the stores in the US:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

The output shown comes from the transactions sold on 2002-12-31.

The following day, incremental sales transactions data are loaded to a new folder in the same S3 object path.

As new files arrive to the same S3 object path, the auto-copy job automatically loads the unprocessed files to the store_sales table in an incremental fashion.

All new sales transactions for 2003-01-01 are automatically ingested, which can be verified by running the following query:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

Automate ingestion from multiple data sources

We can also load an Amazon Redshift table from multiple data sources. When using a pub/sub pattern where multiple S3 buckets populate data to an Amazon Redshift table, you have to maintain multiple data pipelines for each source/target combination. With new parameters in the COPY command, this can be automated to handle data loads efficiently.

In the following example, the Customer_1 folder has Green Cab Company sales data, and the Customer_2 folder has Red Cab Company sales data. We can use the COPY command with the JOB parameter to automate this ingestion process.

The following screenshot shows sample data stored in files. Each folder has similar data but for different customers.

The target for these files in this example is the Amazon Redshift table cab_sales_data.

Define the target table cab_sales_data:

DROP TABLE IF EXISTS cab_sales_data;
CREATE TABLE IF NOT EXISTS cab_sales_data
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
DISTSTYLE EVEN
SORTKEY (passenger_count,pickup_datetime);

You can define two auto-copy jobs as shown in the following code to handle and monitor the ingestion of sales data belonging to different customers, in our case Customer_1 and Customer_2. These jobs monitor the Customer_1 and Customer_2 folders and load new files that are added here.

COPY public.cab_sales_data
FROM 's3://aws-redshift-s3-auto-copy-demo/Customer_1'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_green_cab AUTO ON;

COPY public.cab_sales_data
FROM 's3:// aws-redshift-s3-auto-copy-demo/Customer_2'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_red_cab AUTO ON;

After setting up the two jobs, we can upload the relevant files into their respective folders. This will make sure that the data is loaded efficiently as soon as the files arrive. Each customer is assigned its own vendorid, as shown in the following output:

SELECT vendorid,
       sum(passenger_count) as total_passengers 
  FROM cab_sales_data
GROUP BY vendorid;

Manually run a auto-copy job

There might be scenarios wherein the auto-copy job needs to be paused, meaning it needs to stop looking for new files, for example, to fix a corrupted data pipeline at the data source.

In that case, either use the auto-copy job ALTER command to set AUTO to OFF or create a new auto-copy job with AUTO OFF. Once this is set, auto copy will no longer look for new files.

If necessary, users can manually invoke auto-copy job which will do the work and ingest if new files are found.

auto-copy job RUN <auto-copy job Name>

You can disable “AUTO ON” in the existing auto-copy job using the following command:

auto-copy job ALTER <auto-copy job Name> AUTO OFF

The following table compares the syntax and data duplication between a regular copy statement and the new auto-copy job

.	Copy	Auto-copy job
Syntax	`COPY <table-name> FROM 's3://<s3-object-path>' [COPY PARAMETERS...]`	`COPY <table-name> FROM 's3://<s3-object-path>' [COPY PARAMETERS...] JOB CREATE <job-name>;`
Data Duplication	If it is run multiple times against the same S3 folder, it will load the data again, resulting in data duplication.	It will not load the same file twice, preventing data duplication.

Error handling and monitoring for auto-copy jobs

auto-copy jobs continuously monitor the S3 folder specified during job creation and perform ingestion whenever new files are created. New files created under the S3 folder are loaded exactly once to avoid data duplication.

By default, if there are data or format issues with the specific files, the auto-copy job will fail to ingest the files with a load error and log details to the system tables. The auto-copy job will remain AUTO ON with new data files and will continue to ignore previously failed files.

Amazon Redshift provides the following system tables for users to monitor or troubleshoot auto-copy jobs as needed:

List auto-copy jobs – Use SYS_COPY_JOB to list the auto-copy jobs stored in the database:

SELECT * 
  FROM sys_copy_job;

Get a summary of a auto-copy job – Use the SYS_LOAD_HISTORY view to get the aggregate metrics of a auto-copy job operation by specifying the copy_job_id. It shows the aggregate metrics of the files that have been processed by a auto-copy job.

SELECT *
  FROM sys_load_history
 WHERE copy_job_id = 274978;

Get details of a auto-copy job – Use STL_LOAD_COMMITS to get the status and details of each file that was processed by a auto-copy job:

SELECT *
  FROM stl_load_commits
 WHERE copy_job_id = 274978
ORDER BY curtime ASC;

Get exception details of a auto-copy job – Use STL_LOAD_ERRORS to get the details of files that failed to ingest from a auto-copy job:

SELECT   query,
    slice,
    starttime , 
    filename,
    line_number,
    colname,
    type,
    err_code,
    err_reason,
    copy_job_id,
    raw_line,
    raw_field_value
  FROM stl_load_errors
 WHERE copy_job_id = 274978;

Auto-copy job best practices

In an auto-copy job, when a new file is detected and ingested (automatically or manually), Amazon Redshift stores the file name and doesn’t run this specific job when a new file is created with the same file name.

The following are the recommended best practices when working with files using the auto-copy job:

Use unique file names for each file in a auto-copy job (for example, 2022-10-15-batch-1.csv). However, you can use the same file name as long as it’s from different auto-copy jobs:
- job_customerA_sales – s3://redshift-blogs/sales/customerA/2022-10-15-sales.csv
- job_customerB_sales – s3://redshift-blogs/sales/customerB/2022-10-15-sales.csv
Do not update file contents. Do not overwrite existing files. Changes in existing files will not be reflected to the target table. The auto-copy job doesn’t pick up updated or overwritten files, so make sure they’re renamed as new file names for the auto-copy job to pick up.
Run regular COPY statements (not a job) if you need to ingest a file that was already processed by your auto-copy job. (COPY statement without a JOB CREATE syntax doesn’t track loaded files.) For example, this is helpful in scenarios where you don’t have control of the file name and the initial file received failed. The following figure shows a typical workflow in this case.

Delete and recreate your auto-copy job if you want to reset file tracking history and start over. You can drop auto-copy job using following command.
```
auto-copy job DROP <auto-copy job Name>
```

auto-copy job considerations

Here are the main things to consider when using auto-copy:

Existing files in Amazon S3 prefix are not loaded, use Copy command to catch up historical data
The following features are unsupported:

For additional details on other considerations for auto-copy, refer to the AWS documentation.

Customer feedback

GE Aerospace is a global provider of jet engines, components, and systems for commercial and military aircraft. The company has been designing, developing, and manufacturing jet engines since World War I.

“GE Aerospace uses AWS analytics and Amazon Redshift to enable critical business insights that drive important business decisions. With the support for auto-copy from Amazon S3, we can build simpler data pipelines to move data from Amazon S3 to Amazon Redshift. This accelerates our data product teams’ ability to access data and deliver insights to end users. We spend more time adding value through data and less time on integrations.”

– Alcuin Weidus Sr Principal Data Architect at GE Aerospace

Conclusion

This post demonstrated how to automate data ingestion from Amazon S3 to Amazon Redshift using the auto-copy feature. This new functionality helps make Amazon Redshift data ingestion easier than ever, and will allow SQL users to get access to the most recent data using a simple SQL command.

Users can begin ingesting data to Redshift from Amazon S3 with simple SQL commands and gain access to the most up-to-date data without the need for third-party tools or custom implementation.

About the authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer-facing products and is currently focusing on data lake and file ingestion topics in the Amazon Redshift team. He was the CEO and co-founder of DataRow, which was acquired by Amazon in 2020.

Eesha Kumar is an Analytics Solutions Architect with AWS. He works with customers to realize the business value of data by helping them build solutions using the AWS platform and tools.

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

Hangjian Yuan is a Software Development Engineer at Amazon Redshift. He’s passionate about analytical databases and focuses on delivering cutting-edge streaming experiences for customers.

Enable federation to Amazon QuickSight with automatic provisioning of users between AWS IAM Identity Center and Microsoft Azure AD

2022-12-06 Aditya Ravikumar

Post Syndicated from Aditya Ravikumar original https://aws.amazon.com/blogs/big-data/enable-federation-to-amazon-quicksight-with-automatic-provisioning-of-users-between-aws-iam-identity-center-and-microsoft-azure-ad/

Organizations are working towards centralizing their identity and access strategy across all their applications, including on-premises, third-party, and applications on AWS. Many organizations use identity providers (IdPs) based on OIDC or SAML-based protocols like Microsoft Azure Active Directory (Azure AD) and manage user authentication along with authorization centrally. This authorizes users to access Amazon QuickSight assets-analyses, dashboards, folders, and datasets-through centrally managed Azure AD and AWS IAM Identity Center (successor to AWS Single Sign-On).

IAM Identity Center is an authentication process that allows users to sign into multiple applications with a single set of usernames and passwords. IAM Identity Center makes it easy to centrally manage access to multiple AWS accounts and business applications. It provides your workforce with single sign-on (SSO) access to all assigned accounts and applications from one place.

In this post, we walk you through the steps required to configure federated SSO along with automated email sync between QuickSight and Azure AD via IAM Identity Center. We also demonstrate ways System for Cross-domain Identity Management (SCIM) keeps your IAM Identity Center identities in sync with identities from your IdP.

Solution overview

The following is the reference architecture for configuring IAM Identity Center with Azure AD for automated federation to QuickSight and the AWS Management Console.

The following are the steps involved to set up federated SSO from Azure to QuickSight:

Configure Azure as an IdP in IAM Identity Center.
Register an IAM Identity Center application in Azure AD.
Configure the application in Azure AD.
Enable automatic provisioning of users and groups.
Enable email syncing for federated users in QuickSight console.
Create a QuickSight application in IAM Identity Center.
Add the IAM Identity Center application as a SAML IdP.
Configure AWS Identity and Access Management (IAM) policies and roles.
Configure attribute mappings in IAM Identity Center.
Validate federation to QuickSight from IAM Identity Center.

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

An Azure AD subscription with Administrator permission.
QuickSight account subscription with Administrator permission.
IAM Administrator account.
IAM Identity Center Administrator account.

Configure Azure as IdP in IAM Identity Center

To configure Azure as an IdP, complete the following steps:

On the IAM Identity Center console, choose Enable.
Choose Choose your identity source.
Select External identity provider to manage all users and groups.
Choose Next.
In the Configure external identity provider section, download the service provider metadata file.
Save the AWS access portal sign-in URL, IAM Identity Center Assertion Consumer Service (ACS) URL, and IAM Identity Center issuer URL.
These are used later in this post.
Leave this tab open in your browser while proceeding to the next steps.

Register an IAM Identity Center application in Azure AD

To register an IAM Identity Center application in Azure AD, complete the following steps:

Sign in to your Azure portal using an administrator account.
Under Azure Services, choose Azure AD and under Manage, choose Enterprise applications.
Choose New application.
Choose Create your own application.
Enter a name for the application.
Select the option Integrate any other application you don’t find in the gallery (Non-gallery).
Choose Create.

Configure the application in Azure AD

To configure your application, complete the following steps:

Under Enterprise applications, choose All applications and select the application created in the previous step.
Under Manage, choose Single Sign-on.
Choose SAML.
Choose Single Sign-on to set up SSO with SAML.
Choose Upload metadata file, and upload the file you downloaded from IAM Identity Center.
Choose Edit to edit the Basic SAML Configuration section.

For Identifier (Entity ID), enter the IAM Identity Center issuer URL.
For Reply URL (Assertion Consumer Service URL), enter the IAM Identity Center ACS URL.

Under SAML Signing Certificate, choose Download next to Federation Metadata XML.

We use this XML document in later steps when setting up the SAML provider in IAM and in IAM Identity Center.

Leave this tab open in your browser while moving to the next steps.
Switch to the IAM Identity Center tab to complete its setup.
Under Identity provider metadata, choose IdP SAML metadata and upload the federation metadata XML file you downloaded.
Review and confirm the changes.

Enable automatic provisioning of users and groups

IAM Identity Center supports System for Cross-domain Identity Management (SCIM) v2.0 standard. SCIM keeps your IAM Identity Center identities in sync with external IdPs. This includes any provisioning, updates, and deprovisioning of users between IdP and IAM Identity Center. To enable SCIM, complete the following steps:

On the IAM Identity Center console, choose Settings in the navigation pane.
Next to Automatic provisioning, choose Enable.
Copy the SCIM endpoint and Access token.
Switch to the Azure AD tab.
On the Default Directory Overview page, under Manage, choose Users.
Choose New user and Create new user(s).
Make sure the user profile has valid information under First name, Last name, and Email attribute.
Under Enterprise applications, choose All applications and select the application you created earlier.
Under Manage, choose Users and groups.
Choose Add user/group, and select the users you created earlier.
Choose Assign.
Under Manage, choose Provisioning and Get started.
Choose Provisioning Mode as Automatic.
For Tenant URL, enter the SCIM endpoint.
For Secret Token, enter the Access token.
Choose Test Connection and Save.
Under Provisioning, choose Start provisioning.

Make sure the user profile has valid information under First name, Last name, and Email attribute. This is the key value for email sync with QuickSight.

On the IAM Identity Center console, under Users, you can now see all the users provisioned from Azure AD.

Enable email syncing for federated users in QuickSight console

Complete the following steps to enable email syncing for federated users:

Sign in as an admin user to the QuickSight console and choose Manage QuickSight from the user name menu.
Choose Single sign-on (SSO) in the navigation pane.
Under Email Syncing for Federated Users, select ON.

Create a QuickSight application in IAM Identity Center

Complete the following steps to create a custom SAML 2.0 application in IAM Identity Center.

On the IAM Identity Center console, choose Applications in the navigation pane.
Choose Add application.
Under Preintegrated applications, search for and choose Amazon QuickSight.
Choose Next.
For Display name, enter a name, such as Amazon QuickSight.
For Description, enter a description.
Download the IAM Identity Center SAML metadata file to use later in this post.
For Application start URL, leave as is.
For Relay state, enter https://quicksight.aws.amazon.com.
For Session duration, choose your session duration. The recommended value is 8 hours.
For Application ACS URL, enter https://signin.aws.amazon.com/saml.
For Application SAML audience, enter urn:amazon:webservices.
Choose Submit
After your settings are saved, your application configuration should look similar to the following screenshot.

You can now assign your users to this application, so that the application appears in their IAM Identity Center portal after login.

On the application page, under Assigned users, choose Assign Users.
Select your users.
Optionally, if you want to enable multiple users in your organization to use QuickSight, the fastest and easiest way is to use IAM Identity Center groups.
Choose Assign Users.

Add the IAM Identity Center application as a SAML IdP

Complete the following steps to configure IAM Identity Center as your SAML IdP:

Open a new tab in your browser.
Sign in to the IAM console in your AWS account with admin permissions.
Choose Identity providers in the navigation pane.
Choose Add provider.
Select SAML for Provider type.
For Provider name, enter IAM_Identity_Center.
Choose Choose File to upload the metadata document you downloaded earlier from the Amazon QuickSight application.
Choose Add Provider.
On the summary page, record the value for the provider ARN (arn:aws:iam::<AccountID>:saml-provider/IAM_Identity_Center).

You will use this ARN while configuring claims rules later in this post.

Configure IAM policies

In this step, you create three IAM policies for different role permissions in QuickSight:

QuickSight-Federated-Admin
QuickSight-Federated-Author
QuickSight-Federated-Reader

Use the following steps to set up QuickSight-Federated-Admin policy. This policy grants admin privileges in QuickSight to the federated user:

On the IAM console, choose Policies in the navigation pane
Choose Create policy.

Choose JSON and replace the existing text with the following code:

{
    "Statement": [
        {
            "Action": [
                "quicksight:CreateAdmin"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:quicksight::<yourAWSAccountID>:user/${aws:userid}"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Ignore the “Missing ARN Region: Add a Region to the quicksight resource ARN” error and continue. Optionally, you could also add a specific AWS region in the ARN.

Choose Review policy
For Name enter QuickSight-Federated-Admin.
Choose Create policy.

Repeat these steps to create the QuickSight-Federated-Author policy using the following JSON code to grant author privileges in QuickSight to the federated user:

{
    "Statement": [
        {
            "Action": [
                "quicksight:CreateUser"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:quicksight::<yourAWSAccountID>:user/${aws:userid}"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Ignore the “Missing ARN Region: Add a Region to the quicksight resource ARN” error and continue. Optionally, you could also add a specific AWS region in the ARN.

Repeat these steps to create the QuickSight-Federated-Reader policy using the following JSON code to grant reader privileges in QuickSight to the federated user:

{
    "Statement": [
        {
            "Action": [
                "quicksight:CreateReader"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:quicksight::<yourAWSAccountID>:user/${aws:userid}"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Ignore the “Missing ARN Region: Add a Region to the quicksight resource ARN” error and continue. Optionally, you could also add a specific AWS region in the ARN.

Configure IAM roles

Next, create roles that your Azure AD and IAM Identity Center users assume when federating into QuickSight. The following steps set up the admin role:

On the IAM console, choose Roles in the navigation pane.
Choose Create role.
For Select type of trusted entity, choose SAML 2.0 federation.
For SAML provider, choose the provider you created earlier (IAM_Identity_Center).
Select Allow programmatic and AWS Management Console access.
For Attribute, make sure SAML:aud is selected.
For Value, make sure https://signin.aws.amazon.com/saml is selected.
Choose Next.
Choose the QuickSight-Federated-Admin IAM policy you created earlier.
Choose Next: Tags.
Choose Next: Review.
For Role name, enter QuickSight-Admin-Role.
For Role description, enter a description.
Choose Create role.
On the IAM console, in the navigation pane, choose Roles.
Choose the QuickSight-Admin-Role role you created to open the role’s properties.
Record the role ARN to use later.
On the Trust relationships tab, choose Edit trust policy.

For the policy details, enter the following JSON:

{
    "Version": "2012-10-17",
     "Statement": [
 {
    "Effect": "Allow",
    "Principal": {
"Federated": "arn:aws:iam::<yourAWSAccountID>:saml-provider/IAM_Identity_Center"
                        },
            "Action": "sts:AssumeRoleWithSAML",
            "Condition": {
                "StringEquals": {
                    "SAML:aud": "https://signin.aws.amazon.com/saml"	
           }
            }
        },
        {	
            		"Effect": "Allow",
            		"Principal": {
                	 "Federated":"arn:aws:iam::<yourAWSAccountID>:saml-provider/IAM_Identity_Center"
            				},
            		  "Action": "sts:TagSession",
               "Condition": {
                	  "StringLike": {
                   "aws:RequestTag/Email": "*"
           }
            }
        }
    ]
}

Choose Update Policy.
Repeat these steps to create the roles QuickSight-Author-Role and QuickSight-Reader-Role. Attach the QuickSight-Federated-Author and QuickSight-Federated-Reader policies to their respectively roles.

Configure attribute mappings in IAM Identity Center

The final step is to configure the attribute mappings in IAM Identity Center. The attributes you map here become part of the SAML assertion that is sent to the QuickSight application. You can choose which user attributes in your application map to corresponding user attributes in your connected directory. For more information, refer to Attribute mappings.

On IAM Identity Center console, choose Applications in the navigation pane.
Select the Amazon QuickSight application you created earlier.
On the Actions menu, choose Edit attribute mappings.
Configure the following mappings:

User attribute in the application	Maps to this string value or user attribute in IAM Identity Center	Format
`Subject`	`${user:email}`	`emailAddress`
`https://aws.amazon.com/SAML/Attributes/Role`	`arn:aws:iam:: <YourAWSAccount ID>:saml-provider/IAM_Identity_Center`, `arn:aws:iam:: <YourAWSAccount ID>:role/QuickSight-Admin-Role`	`unspecified`
`https://aws.amazon.com/SAML/Attributes/RoleSessionName`	`${user:email}`	`unspecified`
`https://aws.amazon.com/SAML/Attributes/PrincipalTag:Email`	`${user:email}`	`url`

Choose Save changes.

Validate federation to QuickSight from IAM Identity Center

On the IAM Identity Center console, note down the user portal URL available on the Settings page. We suggest you log out of your AWS account first, or open an incognito browser window. Navigate to the user portal URL, sign in with the credentials of an AD user, and choose your QuickSight application.

You’re automatically redirected to the QuickSight console.

Summary

This post provided step-by-step instructions to configure federated SSO with Azure AD as IdP through IAM Identity Center. We also discussed how SCIM keeps your IAM Identity Center identities in sync with identities from your IdP. This includes any provisioning, updating, and deprovisioning of users between your IdP and IAM Identity Center.

If you have any questions or feedback, please leave a comment.

For additional discussions and help getting answers to your questions, check out the QuickSight Community.

About the author

Aditya Ravikumar is a Solutions Architect at Amazon Web Services. He is based in Seattle, USA. Aditya’s core interests include software development, databases, data analytics and machine learning. He works with AWS customers/partners to provide guidance and technical assistance to transform their business through innovative use of cloud technologies.

Srikanth Baheti is a Specialized World Wide Sr. Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high traffic web applications, highly scalable and maintainable data pipelines for reporting platforms using AWS services and Serverless computing.

Raji Sivasubramaniam is a Sr. Solutions Architect at AWS, focusing on Analytics. Raji is specialized in architecting end-to-end Enterprise Data Management, Business Intelligence and Analytics solutions for Fortune 500 and Fortune 100 companies across the globe. She has in-depth experience in integrated healthcare data and analytics with wide variety of healthcare datasets including managed market, physician targeting and patient analytics.

Perform multi-cloud analytics using Amazon QuickSight, Amazon Athena Federated Query, and Microsoft Azure Synapse

2022-12-05 Harish Rajagopalan

Post Syndicated from Harish Rajagopalan original https://aws.amazon.com/blogs/big-data/perform-multi-cloud-analytics-using-amazon-quicksight-amazon-athena-federated-query-and-microsoft-azure-synapse/

In this post, we show how to use Amazon QuickSight and Amazon Athena Federated Query to build dashboards and visualizations on data that is stored in Microsoft Azure Synapse databases.

Organizations today use data stores that are best suited for the applications they build. Additionally, they may also continue to use some of their legacy data stores as they modernize and migrate to the cloud. These disparate data stores might be spread across on-premises data centers and different cloud providers. This presents a challenge for analysts to be able to access, visualize, and derive insights from the disparate data stores.

QuickSight is a fast, cloud-powered business analytics service that enables employees within an organization to build visualizations, perform ad hoc analysis, and quickly get business insights from their data on their devices anytime. Amazon Athena is a serverless interactive query service that provides full ANSI SQL support to query a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet, that are stored on Amazon Simple Storage Service (Amazon S3). For data that isn’t stored on Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store it in Amazon S3.

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. You can think of a connector as an extension of Athena’s query engine. In this post, we use the Athena connector for Azure Synapse analytics that enables Athena to run SQL queries on your Azure Synapse databases using JDBC.

Solution overview

Consider the following reference architecture for visualizing data from Azure Synapse Analytics.

In order to explain this architecture, let’s walk through a sample use case of analyzing fitness data of users. Our sample dataset contains users’ fitness information like age, height, and weight, and daily run stats like miles, calories, average heart rate, and average speed, along with hours of sleep.

We run queries on this dataset to derive insights using visualizations in QuickSight. With QuickSight, you can create trends of daily miles run, keep track of the average heart rate over a period of time, and detect anomalies, if any. You can also track your daily sleep patterns and compare how rest impacts your daily activities. The out-of-the-box insights feature gives vital weekly insights that can help you be on top of your fitness goals. The following screenshot shows sample rows of our dataset stored in Azure Synapse.

Prerequisites

Make sure you have the following prerequisites:

An AWS account set up with QuickSight enabled. If you don’t have a QuickSight account, you can sign up for one. You can access the QuickSight free trial as part of the AWS Free Tier option.
An Azure account with data pre-loaded in Synapse. We use a sample fitness dataset in this post. We used a data generator to generate this data.
A virtual private connection (VPN) between AWS and Azure.

Note that the AWS resources for the steps in this post need to be in the same Region.

Configure a Lambda connector

To configure your Lambda connector, complete the following steps:

Load the data.
In the Azure account, the sample data for fitness devices is stored and accessed in an Azure Synapse Analytics workspace using a dedicated SQL pool table. The firewall settings for Synapse should allow for access to a VPC through a VPN. You can use your data or tables that you need to connect QuickSight to in this step.
On the Amazon S3 console, create a spillover bucket and note the name to use in a later step.
This bucket is used for storing the spillover data for the Synapse connector.
On the AWS Serverless Application Repository console, choose Available applications in the navigation pane.
On the Public applications tab, search for synapse and choose AthenaSynapseConnector with the AWS verified author tag.
Create the Lambda function with the following configuration:
1. For Name, enter AthenaSynapseConnector.
2. For SecretNamePrefix, enter AthenaJdbcFederation.
3. For SpillBucket, enter the name of the S3 bucket you created.
4. For DefaultConnectionString, enter the JDBC connection string from the Azure SQL pool connection strings property.
5. For LambdaFunctionName, enter a function name.
6. For SecurityGroupIds and SubnetIds, enter the security group and subnet for your VPC (this is needed for the template to run successfully).
7. Leave the remaining values as their default.
Choose Deploy.
After the function is deployed successfully, navigate to the athena_hybrid_azure function.
On the Configurations tab, choose Environment variables in the navigation pane.
Add the key azure_synapse_demo_connection_string with the same value as the default key (the JDBC connection string from the Azure SQL pool connection strings property).

For this post, we removed the VPC configuration.
Choose VPC in the navigation pane and choose None to remove the VPC configuration.
Now you’re ready to configure the data source.
On the Athena console, choose Data sources in the navigation pane.
Choose Create data source.
Choose Microsoft Azure Synapse as your data source.
Choose Next.
Create a data source with the following parameters:
1. For Data source name, enter azure_synapse_demo.
2. For Connection details, choose the Lambda function athena_hybrid_azure.
Choose Next.

Create a dataset on QuickSight to read the data from Azure Synapse

Now that the configuration on the Athena side is complete, let’s configure QuickSight.

On the QuickSight console, on the user name menu, choose Manage QuickSight.
Choose Security & permissions in the navigation pane.
Under QuickSight access to AWS services, choose Manage.
Choose Amazon Athena and in the pop-up permissions box, choose Next.
On the S3 Bucket tab, select the spill bucket you created earlier.
On the Lambda tab, select the athena_hybrid_azure function.
Choose Finish.
If the QuickSight access to AWS services window appears, choose Save.
Choose the QuickSight icon on the top left and choose New dataset.
Choose Athena as the data source.
For Data source name, enter a name.
Check the Athena workgroup settings where the Athena data source was created.
Choose Create data source.
Choose the catalog azure_synapse_demo and the database dbo.
Choose Edit/Preview data.
Change the query mode to SPICE.
Choose Publish & Visualize.
Create an analysis in QuickSight.
Publish a QuickSight dashboard.

If you’re new to QuickSight or looking to build stunning dashboards, this workshop provides step-by-step instructions to grow your dashboard building skills from basic to advanced level. The following screenshot is an example dashboard to give you some inspiration.

Clean up

To avoid ongoing charges, complete the following steps:

Delete the S3 bucket created for the Athena spill data.
Delete the Athena data source.
On the AWS CloudFormation console, select the stack you created for AthenaSynapseConnector and choose Delete.
This will delete the created resources, such as the Lambda function. Check the stack’s Events tab to track the progress of the deletion, and wait for the stack status to change to DELETE_COMPLETE.
Delete the QuickSight datasets.
Delete the QuickSight analysis.
Delete your QuickSight subscription and close the account.

Conclusion

In this post, we showed you how to overcome the challenges of connecting to and analyzing data in other clouds by using AWS analytics services to connect to Azure Synapse Analytics with Athena Federated Query and QuickSight. We also showed you how to visualize and derive insights from the fitness data using QuickSight. With QuickSight and Athena Federated Query, organizations can now access additional data sources beyond those already supported natively by QuickSight. If you have data in sources other than Amazon S3, you can use Athena Federated Query to analyze the data in place or build pipelines that extract and store data in Amazon S3.

For more information and resources for QuickSight and Athena, visit Analytics on AWS.

About the authors

Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.

Salim Khan is a Specialist Solutions Architect for Amazon QuickSight. Salim has over 16 years of experience implementing enterprise business intelligence (BI) solutions. Prior to AWS, Salim worked as a BI consultant catering to industry verticals like Automotive, Healthcare, Entertainment, Consumer, Publishing and Financial Services. He has delivered business intelligence, data warehousing, data integration and master data management solutions across enterprises.

Sriram Vasantha is a Senior Solutions Architect at AWS in Central US helping customers innovate on the cloud. Sriram focuses on application and data modernization, DevSecOps, and digital transformation. In his spare time, Sriram enjoys playing different musical instruments like Piano, Organ, and Guitar.

Adarsha Nagappasetty is a Senior Solutions Architect at Amazon Web Services. Adarsha works with enterprise customers in Central US and helps them with their cloud journey. In his spare time, Adarsha enjoys spending time outdoors with his family!

Explore your data lake using Amazon Athena for Apache Spark

2022-12-01 Pathik Shah

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/explore-your-data-lake-using-amazon-athena-for-apache-spark/

Amazon Athena now enables data analysts and data engineers to enjoy the easy-to-use, interactive, serverless experience of Athena with Apache Spark in addition to SQL. You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. For interactive Spark applications, you can spend less time waiting and be more productive because Athena instantly starts running applications in less than a second. And because Athena is serverless and fully managed, analysts can run their workloads without worrying about the underlying infrastructure.

Data lakes are a common mechanism to store and analyze data because they allow companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository. Apache Spark is a popular open-source, distributed processing system optimized for fast analytics workloads against data of any size. It’s often used to explore data lakes to derive insights. For performing interactive data explorations on the data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine in Athena. It enables you to be more productive and get started quickly, spending almost no time setting up infrastructure and Spark configurations.

In this post, we show how you can use Athena for Apache Spark to explore and derive insights from your data lake hosted on Amazon Simple Storage Service (Amazon S3).

Solution overview

We showcase reading and exploring CSV and Parquet datasets to perform interactive analysis using Athena for Apache Spark and the expressive power of Python. We also perform visual analysis using the pre-installed Python libraries. For running this analysis, we use the built-in notebook editor in Athena.

For the purpose of this post, we use the NOAA Global Surface Summary of Day public dataset from the Registry of Open Data on AWS, which consists of daily weather summaries from various NOAA weather stations. The dataset is primarily in plain text CSV format. We have transformed the entire and subsets of the CSV dataset into Parquet format for our demo.

Before running the demo, we want to introduce the following concepts related to Athena for Spark:

Sessions – When you open a notebook in Athena, a new session is started for it automatically. Sessions keep track of the variables and state of notebooks.
Calculations – Running a cell in a notebook means running a calculation in the current session. As long as a session is running, calculations use and modify the state that is maintained for the notebook.

For more details, refer to Session and Calculations.

Prerequisites

For this demo, you need the following prerequisites:

An AWS account with access to the AWS Management Console
Athena permissions on the workgroup DemoAthenaSparkWorkgroup, which you create as part of this demo
AWS Identity and Access Management (IAM) permissions to create, read, and update the IAM role and policies created as part of the demo
Amazon S3 permissions to create an S3 bucket and read the bucket location

The following policy grants these permissions. Attach it to the IAM role or user you use to sign in to the console. Make sure to provide your AWS account ID and the Region in which you’re running the demo.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "athena:*",
            "Resource": "arn:aws:athena:<REGION>:<ACCOUNT_ID>:workgroup/DemoAthenaSparkWorkgroup"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:GetRole",
                "iam:ListAttachedRolePolicies",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:PutRolePolicy",
                "iam:ListRolePolicies",
                "iam:GetRolePolicy",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<ACCOUNT_ID>:role/service-role/AWSAthenaSparkExecutionRole-*",
                "arn:aws:iam::<ACCOUNT_ID>:policy/service-role/AWSAthenaSparkRolePolicy-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<ACCOUNT_ID>-<REGION>-athena-results-bucket-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:ListPolicies",
                "iam:ListRoles",
                "athena:ListWorkGroups",
                "athena:ListEngineVersions"
            ],
            "Resource": "*"
        }
    ]
}

Create your Athena workgroup

We create a new Athena workgroup with Spark as the engine. Complete the following steps:

On the Athena console, choose Workgroups in the navigation pane.
Choose Create workgroup.
For Workgroup name, enter DemoAthenaSparkWorkgroup.
Make sure to enter the exact name because the preceding IAM permissions are scoped down for the workgroup with this name.
For Analytics engine, choose Apache Spark.
For Additional configurations, select Use defaults.
The defaults include the creation of an IAM role with the required permissions to run Spark calculations on Athena and an S3 bucket to store calculation results. It also sets the notebook (which we create later) encryption key management to an AWS Key Management Service (AWS KMS) key owned by Athena.
Optionally, add tags to your workgroup.
Choose Create workgroup.

Modify the IAM role

Creating the workgroup creates a new IAM role. Choose the newly created workgroup, then the value under Role ARN to be redirected to the IAM console.

Add the following permission as an inline policy to the IAM role created earlier. This allows the role to read the S3 datasets. For instructions, refer to the section To embed an inline policy for a user or role (console) in Adding IAM identity permissions (console).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/*",
                "arn:aws:s3:::noaa-gsod-pds/2022/*",
                "arn:aws:s3:::noaa-gsod-pds",
                "arn:aws:s3:::athena-examples-us-east-1"
            ]
        }
    ]
}

Set up your notebook

To run the analysis on Spark on Athena, we need a notebook. Complete the following steps to create one:

On the Athena console, choose Notebook Editor.
Choose the newly created workgroup DemoAthenaSparkWorkgroup on the drop-down menu.
Choose Create Notebook.
Provide a notebook name, for example AthenaSparkBlog.
Keep the default session parameters.
Choose Create.

Your notebook should now be loaded, which means you can start running Spark code. You should see the following screenshot.

Explore the dataset

Now that we have workgroup and notebook created, let’s start exploring the NOAA Global Surface Summary of Day dataset. The datasets used in this post are stored in the following locations:

CSV data for year 2022 – s3://noaa-gsod-pds/2022/
Parquet data for year 2021 – s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2021/
Parquet data for year 2020 – s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/
Entire dataset in Parquet format (until October 2022) – s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/

In the rest of this post, we show PySpark code snippets. Copy the code and enter it in the notebook’s cell. Press Shift+Enter to run the code as a calculation. Alternatively, you can choose Run. Add more cells to run subsequent code snippets.

We start by reading the CSV dataset for the year 2022 and print its schema to understand the columns contained in the dataset. Run the following code in the notebook cell:

year_22_csv = spark.read.option("header","true").csv(f"s3://noaa-gsod-pds/2022/")
year_22_csv.printSchema()

We get the following output.

We were able to submit the preceding code as a calculation instantly using the notebook.

Let’s continue exploring the dataset. Looking at the columns in the schema, we’re interested in previewing the data for the following attributes in 2022:

TEMP – Mean temperature
WDSP – Mean wind speed
GUST – Maximum wind gust
MAX – Maximum temperature
MIN – Minimum temperature
Name – Station name

Run the following code:

year_22_csv.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show()

We get the following output.

Now we have an idea of what the dataset looks like. Next, let’s perform some analysis to find the maximum recorded temperature for the Seattle-Tacoma Airport in 2022. Run the following code:

from pyspark.sql.functions import max

year_22_csv.filter("NAME == 'SEATTLE TACOMA AIRPORT, WA US'").agg(max("MAX").alias("max_temp_yr_2022")).show()

We get the following output.

Next, we want to find the maximum recorded temperature for each month of 2022. For this, we use the Spark SQL feature of Athena. First, we need to create a temporary view on the year_22_csv data frame. Run the following code:

year_22_csv.createOrReplaceTempView("y22view")

To run our Spark SQL query, we use %%sql magic:

%%sql
select month(to_date(date,'yyyy-MM-dd')) as month_yr_22,max(MAX) as max_temp 
from y22view where NAME == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1

We get the following output.

The output of the preceding query produces the month in numeric form. To make it more readable, let’s convert the month numbers into month names. For this, we use a user-defined function (UDF) and register it to use in the Spark SQL queries for the rest of the notebook session. Run the following code to create and register the UDF:

import calendar

# UDF to convert month number to month name
spark.udf.register("month_name_udf",lambda x: calendar.month_name[int(x)])

We rerun the query to find the maximum recorded temperature for each month of 2022 but with the month_name_udf UDF we just created. Also, this time we sort the results based on the maximum temperature value. See the following code:

%%sql
select month_name_udf(month(to_date(date,'yyyy-MM-dd'))) as month_yr_22,
max(MAX) as max_temp
from y22view where NAME == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1 order by 2 desc

The following output shows the month names.

Until now, we have run interactive explorations for the year 2022 of the NOAA Global Surface Summary of Day dataset. Let’s say we want to compare the temperature values with the previous 2 years. We compare the maximum temperature across 2020, 2021, and 2022. As a reminder, the dataset for 2022 is in CSV format and for 2020 and 2021, the datasets are in Parquet format.

To continue with the analysis, we read the 2020 and 2021 Parquet datasets into the data frame and create temporary views on the respective data frames. Run the following code:

#Read the dataset
year_20_pq = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/")
year_21_pq = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2021/")

#Create temporary views
year_20_pq.createOrReplaceTempView("y20view")
year_21_pq.createOrReplaceTempView("y21view")

#Preview the datasets
print('Preview for year 2020:')
year_20_pq.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show(1)
print('Preview for year 2021:')
year_21_pq.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show(1)

We get the following output.

To compare the recorded maximum temperature for each month in 2020, 2021, and 2022, we perform a join operation on the three views created so far from their respective data frames. Also, we reuse the month_name_udf UDF to convert month number to month name. Run the following code:

%%sql
select month_name_udf(month(to_date(y21.DATE,'yyyy-MM-dd'))) as month, max(y20.max) as max_temp_2020, 
max(y21.max) as max_temp_2021, max(y22.max) as max_temp_2022 \
from y20view y20 inner join y21view y21 inner join y22view y22 \
on month(to_date(y20.DATE,'yyyy-MM-dd')) = month(to_date(y21.DATE,'yyyy-MM-dd'))
and month(to_date(y21.DATE,'yyyy-MM-dd')) = month(to_date(y22.DATE,'yyyy-MM-dd')) \
where y20.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y21.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y22.NAME == 'SEATTLE TACOMA AIRPORT, WA US' \
group by 1

We get the following output.

So far, we’ve read CSV and Parquet datasets, run analysis on the individual datasets, and performed join and aggregation operations on them to derive insights instantly in an interactive mode. Next, we show how you can use the pre-installed libraries like Seaborn, Matplotlib, and Pandas for Spark on Athena to generate a visual analysis. For the full list of preinstalled Python libraries, refer to List of preinstalled Python libraries.

We plot a visual analysis to compare the recorded maximum temperature values for each month in 2020, 2021, and 2022. Run the following code, which creates a Spark data frame from the SQL query, converts it into a Pandas data frame, and uses Seaborn and Matplotlib for plotting:

import seaborn as sns
import matplotlib.pyplot as plt

y20_21_22=spark.sql("select month(to_date(y21.DATE,'yyyy-MM-dd')) as month, max(y20.max) as max_temp_yr_2020, \
max(y21.max) as max_temp_yr_2021, max(y22.max) as max_temp_yr_2022 \
from y20view y20 inner join y21view y21 inner join y22view y22 \
on month(to_date(y20.DATE,'yyyy-MM-dd')) = month(to_date(y21.DATE,'yyyy-MM-dd')) \
and month(to_date(y21.DATE,'yyyy-MM-dd')) = month(to_date(y22.DATE,'yyyy-MM-dd')) \
where y20.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y21.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y22.NAME == 'SEATTLE TACOMA AIRPORT, WA US' \
group by 1 order by 1")

#convert to pandas dataframe
y20_21_22=y20_21_22.toPandas()

#change datatypes to float for plotting
y20_21_22['max_temp_yr_2020']= y20_21_22['max_temp_yr_2020'].astype(float)
y20_21_22['max_temp_yr_2021']= y20_21_22['max_temp_yr_2021'].astype(float)
y20_21_22['max_temp_yr_2022']= y20_21_22['max_temp_yr_2022'].astype(float)

# Unpivot dataframe from wide to long format for plotting
y20_21_22=y20_21_22.melt('month',var_name='max_temperature', \
             value_name='temperature')

plt.clf()

sns.catplot(data=y20_21_22,x='month',y='temperature', hue='max_temperature', \
            sort=False, kind='point', height=4, aspect=1.5)
%matplot plt

The following graph shows our output.

Next, we plot a heatmap showing the maximum temperature trend for each month across all the years in the dataset. For this, we have converted the entire CSV dataset (until October 2022) into Parquet format and stored it in s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/.

Run the following code to plot the heatmap:

noaa = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/")
noaa.createOrReplaceTempView("noaaview")

#query to find maximum temperature for each month from year 1973 to 2022
year_hist=spark.sql("select month(to_date(date,'yyyy-MM-dd')) as month, \
year(to_date(date,'yyyy-MM-dd')) as year,  cast(max(temp) as float) as temp \
from noaaview where NAME == 'SEATTLE TACOMA AIRPORT, WA US' group by 1,2") 

# convert spark dataframe to pandas
year_hist=year_hist.toPandas()
year_hist=year_hist.pivot("month","year","temp")

plt.clf()
grid_kws = {"height_ratios": (0.9, .05), "hspace": .5}
f, (ax, cbar_ax) = plt.subplots(2, gridspec_kw=grid_kws)

sns.heatmap(year_hist, ax=ax, cbar_ax=cbar_ax, cmap="RdYlBu_r", \
            cbar_kws={"orientation": "horizontal"})
%matplot plt

We get the following output.

From the potting, we can see the trend has been almost similar across the years, where the temperature rises during summer months and lowers as winter approaches in the Seattle-Tacoma Airport area. You can continue exploring the datasets further, running more analyses and plotting more visuals to get the feel of the interactive and instant-on experience Athena for Apache Spark offers.

Clean up resources

When you’re done with the demo, make sure to delete the S3 bucket you created to store the workgroup calculations to avoid storage costs. Also, you can delete the workgroup, which deletes the notebook as well.

Conclusion

In this post, we saw how you can use the interactive and serverless experience of Athena for Spark as the engine to run calculations instantly. You just need to create a workgroup and notebook to start running the Spark code. We explored datasets stored in different formats in an S3 data lake and ran interactive analyses to derive various insights. Also, we ran visual analyses by plotting charts using the preinstalled libraries. To learn more about Spark on Athena, refer to Using Apache Spark in Amazon Athena.

About the Authors

Pathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Raj Devnath is a Sr. Product Manager at AWS working on Amazon Athena. He is passionate about building products customers love and helping customers extract value from their data. His background is in delivering solutions for multiple end markets, such as finance, retail, smart buildings, home automation, and data communication systems.

How to use Amazon Macie to preview sensitive data in S3 buckets

2022-12-01 Koulick Ghosh

Post Syndicated from Koulick Ghosh original https://aws.amazon.com/blogs/security/how-to-use-amazon-macie-to-preview-sensitive-data-in-s3-buckets/

Security teams use Amazon Macie to discover and protect sensitive data, such as names, payment card data, and AWS credentials, in Amazon Simple Storage Service (Amazon S3). When Macie discovers sensitive data, these teams will want to see examples of the actual sensitive data found. Reviewing a sampling of the discovered data helps them quickly confirm that the object is truly sensitive according to their data protection and privacy policies.

In this post, we walk you through how your data security teams are able to use a new capability in Amazon Macie to retrieve up to 10 examples of sensitive data found in your S3 objects, so that you are able to confirm the nature of the data at a glance. Additionally, we will discuss how you are able to control who is able to use this capability, so that only authorized personnel have permissions to view these examples.

The challenge customers face

After a Macie sensitive data discovery job is run, security teams start their work. The security team will review the Macie findings to investigate the discovered sensitive data and decide what actions to take to protect such data. The findings provide details that include the severity of the finding, information on the affected S3 object, and a summary of the type, location, and amount of sensitive data found. However, Macie findings only contain pointers to data that Macie found in the object. In order to complete their investigation, customers in the past had to do additional work to extract the contents of a sensitive object, such as navigating to a different AWS account where the object is located, downloading and manually searching for keywords in a file editor, or writing and refining SQL queries by using Amazon S3 Select. The investigations are further slowed down when the object type is one that is not easily readable without additional tooling, such as big-data file types like Avro and Parquet. By using the Macie capability to retrieve sensitive data samples, you are able to review the discovered data and make decisions concerning the finding remediation.

Prerequisites

To implement the ability to retrieve and reveal samples of sensitive data, you’ll need the following prerequisites:

Enable Amazon Macie in your AWS account. For instructions, see Getting started with Amazon Macie.
Set your account as the delegated Macie administrator account and enable Macie in at least one member account by using AWS Organizations. In this post, we will refer to the delegated administrator account as Account A and the member account as Account B.
Configure Macie detailed classification results in Account A.

Note: The detailed classification results contain a record for each Amazon S3 object that you configure the job to analyze, and include the location of up to 1,000 occurrences of each type of sensitive data that Macie found in an object. Macie uses the location information in the detailed classification results to retrieve the examples of sensitive data. The detailed classification results are stored in an S3 bucket of your choice. In this post, we will refer to this bucket as DOC-EXAMPLE-BUCKET1.
Create an S3 bucket that contains sensitive data in Account B. In this post, we will refer to this bucket as DOC-EXAMPLE-BUCKET2.

Note: You should enable server-side encryption on this bucket by using customer managed AWS Key Management Service (AWS KMS) keys (a type of encryption known as SSE-KMS).
(Optional) Add sensitive data to DOC-EXAMPLE-BUCKET2. This post uses a sample dataset that contains fake sensitive data. You are able to download this sample dataset, unarchive the .zip folder, and follow these steps to upload the objects to S3. This is a synthetic dataset generated by AWS that we will use for the examples in this post. All data in this blog post has been artificially created by AWS for demonstration purposes and has not been collected from any individual person. Similarly, such data does not relate back to any individual person, nor is it intended to.
Create and run a sensitive data discovery job from Account A to analyze the contents of DOC-EXAMPLE-BUCKET2.
(Optional) Set up the AWS Command Line Interface (AWS CLI).

Configure Macie to retrieve and reveal examples of sensitive data

In this section, we’ll describe how to configure Macie so that you are able to retrieve and view examples of sensitive data from Macie findings.

To configure Macie (console)

In the AWS Management Console, in the Macie delegated administrator account (Account A), follow these steps from the Amazon Macie User Guide.

To configure Macie (AWS CLI)

Confirm that you have Macie enabled.

	$ aws macie2 get-macie-session --query 'status'
	// The expected response is "ENABLED"

Confirm that you have configured the detailed classification results bucket.

	$ aws macie2 get-classification-export-configuration

	// The expected response is:
	{
   	 "configuration": {
   		 	    "s3Destination": {
        		    "bucketName": " DOC-EXAMPLE-BUCKET1 ",
           			"kmsKeyArn": "arn:aws:kms:<YOUR-REGION>:<YOUR-ACCOUNT-ID>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET1>"
     		  	 }
		}	
	}

Create a new KMS key to encrypt the retrieved examples of sensitive data. Make sure that the key is created in the same AWS Region where you are operating Macie.

$ aws kms create-key
{
    "KeyMetadata": {
        "Origin": "AWS_KMS",
        "KeyId": "<YOUR-KEY-ID>",
        "Description": "",
        "KeyManager": "CUSTOMER",
        "Enabled": true,
        "KeySpec": "SYMMETRIC_DEFAULT",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "Enabled",
        "CreationDate": 1502910355.475,
        "Arn": "arn:aws:kms: <YOUR-AWS-REGION>:<AWS-ACCOUNT-A>:key/<YOUR-KEY-ID>",
        "AWSAccountId": "<AWS-ACCOUNT-A>",
        "MultiRegion": false
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ],
    }
}

Give this key the alias REVEAL-KMS-KEY.

$ aws kms CreateAlias
{
   "AliasName": " <REVEAL-KMS-KEY> ",
   "TargetKeyId": "<YOUR-KEY-ID>"
}

Enable the feature in Macie and configure it to encrypt the data by using REVEAL-KMS-KEY. You do not specify a key policy for your new KMS key in this step. The key policy will be discussed later in the post.

$ aws macie2 update-reveal-configuration --configuration '{"status":"ENABLED","kmsKeyId":"alias/ <REVEAL-KMS-KEY> "}'

// The expected response is:
{
    "configuration": {
        "kmsKeyId": "arn:aws:kms:<YOUR-REGION>: <YOUR ACCOUNT ID>:key/<REVEAL-KMS-KEY>.",
        "status": "ENABLED"
    }
}

Control access to read sensitive data and protect data displayed in Macie

This new Macie capability uses the AWS Identity and Access Management (IAM) policies, S3 bucket policies, and AWS KMS key policies that you have defined in your accounts. This means that in order to see examples through the Macie console or by invoking the Macie API, the IAM principal needs to have read access to the S3 object and to decrypt the object if it is server-side encrypted. It’s important to note that Macie uses the IAM permissions of the AWS principal to locate, retrieve, and reveal the samples and does not use the Macie service-linked role to perform these tasks.

Using the setup discussed in the previous section, you will walk through how to control access to the ability to retrieve and reveal sensitive data examples. To recap, you created and ran a discovery job from the Amazon Macie delegated administrator account (Account A) to analyze the contents of DOC-EXAMPLE-BUCKET2 in a member account (Account B). You configured Macie to retrieve examples and to encrypt the examples of sensitive data with the REVEAL-KMS-KEY.

The next step is to create and use an IAM role that will be assumed by other users in Account A to retrieve and reveal examples of sensitive data discovered by Macie. In this post, we’ll refer to this role as MACIE-REVEAL-ROLE.

To apply the principle of least privilege and allow only authorized personnel to view the sensitive data samples, grant the following permissions so that Macie users who assume MACIE-REVEAL-ROLE will be able to successfully retrieve and reveal examples of sensitive data:

Step 1 – Update the IAM policy for MACIE-REVEAL-ROLE.
Step 2 – Update the KMS key policy for REVEAL-KMS-KEY.
Step 3 – Update the S3 bucket policy for DOC-EXAMPLE-BUCKET2 and the KMS key policy used for its server-side encryption in Account B.

After you grant these permissions, MACIE-REVEAL-ROLE is succcesfully able to retrieve and reveal examples of sensitive data in DOC-EXAMPLE-BUCKET2, as shown in Figure 1.

Figure 1: Macie runs the discovery job from the delegated administrator account in a member account, and MACIE-REVEAL-ROLE retrieves examples of sensitive data

Step 1: Update the IAM policy

Provide the following required permissions to MACIE-REVEAL-ROLE:

Allow GetObject from DOC-EXAMPLE-BUCKET2 in Account B.
Allow decryption of DOC-EXAMPLE-BUCKET2 if it is server-side encrypted with a customer managed key (SSE-KMS).
Allow GetObject from DOC-EXAMPLE-BUCKET1.
Allow decryption of the Macie discovery results.
Allow the necessary Macie actions to retrieve and reveal sensitive data examples.

To set up the required permissions

Use the following commands to provide the permissions. Make sure to replace the placeholders with your own data.

{
    "Version": "2012-10-17",
    "Statement": [
	{
            "Sid": "AllowGetFromCompanyDataBucket",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET2>/*"
        },
        {
            "Sid": "AllowKMSDecryptForCompanyDataBucket",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:<AWS-Region>:<AWS-Account-B>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET2>"
        },
        {
            "Sid": "AllowGetObjectfromMacieResultsBucket",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET1>/*"
        },
	{
            "Sid": "AllowKMSDecryptForMacieRoleDiscoveryBucket",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:<AWS-REGION>:<AWS-ACCOUNT-A>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET1>"
        },
	{
            "Sid": "AllowActionsRetrieveAndReveal",
            "Effect": "Allow",
            "Action": [
                "macie2:GetMacieSession",
                "macie2:GetFindings",
                "macie2:GetSensitiveDataOccurrencesAvailability",
                "macie2:GetSensitiveDataOccurrences",
                "macie2:ListFindingsFilters",
                "macie2:GetBucketStatistics",
                "macie2:ListMembers",
                "macie2:ListFindings",
                "macie2:GetFindingStatistics",
                "macie2:GetAdministratorAccount",
                "macie2:GetClassificationExportConfiguration",
                "macie2:GetRevealConfiguration",
                "macie2:DescribeBuckets"
            ],
            "Resource": "*” 
        }
    ]
}

Step 2: Update the KMS key policy

Next, update the KMS key policy that is used to encrypt sensitive data samples that you retrieve and reveal in your delegated administrator account.

To update the key policy

Allow the MACIE-REVEAL-ROLE access to the KMS key that you created for protecting the retrieved sensitive data, using the following commands. Make sure to replace the placeholders with your own data.

	{
            "Sid": "AllowMacieRoleDecrypt",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam:<AWS-REGION>:<AWS-ACCOUNT-A>:role/<MACIE-REVEAL-ROLE>"
            },
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:<AWS-REGION>:<AWS-ACCOUNT-A>:key/<REVEAL-KMS-KEY>"
        }

Step 3: Update the bucket policy of the S3 bucket

Finally, update the bucket policy of the S3 bucket in member accounts, and update the key policy of the key used for SSE-KMS.

To update the S3 bucket policy and KMS key policy

Use the following commands to update key policy for the KMS key used for server-side encryption of the DOC-EXAMPLE-BUCKET2 bucket in Account B.

	{
            "Sid": "AllowMacieRoleDecrypt”
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam:<AWS-REGION>:<AWS-ACCOUNT-A>:role/<MACIE-REVEAL-ROLE>"
            },
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:<AWS-REGION>:<AWS-ACCOUNT-B>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET2>"
  }

Use the following commands to update the bucket policy of DOC-EXAMPLE-BUCKET2 to allow cross-account access for MACIE-REVEAL-ROLE to get objects from this bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowMacieRoleGet",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS-ACCOUNT-A>:role/<MACIE-REVEAL-ROLE>"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET2>/*"
        }
    ]
}

Retrieve and reveal sensitive data samples

Now that you’ve put in place the necessary permissions, users who assume MACIE-REVEAL-ROLE will be able to conveniently retrieve and reveal sensitive data samples.

To retrieve and reveal sensitive data samples

In the Macie console, in the left navigation pane, choose Findings, and select a specific finding. Under Sensitive Data, choose Review.

Figure 2: The finding details panel
On the Reveal sensitive data page, choose Reveal samples.

Figure 3: The Reveal sensitive data page
Under Sensitive data, you will be able to view up to 10 examples of the sensitive data found by Amazon Macie.

Figure 4: Examples of sensitive data revealed in the Amazon Macie console

You are able to find additional information on setting up the Macie Reveal function in the Amazon Macie User Guide.

Conclusion

In this post, we showed how you are to retrieve and review examples of sensitive data that were found in Amazon S3 using Amazon Macie. This capability will make it easier for your data protection teams to review the sensitive contents found in S3 buckets across the accounts in your AWS environment. With this information, security teams are able to quickly take remediation actions, such as updating the configuration of sensitive buckets, quarantining files with sensitive information, or sending a notification to the owner of the account where the sensitive data resides. In certain cases, you are able to add the examples to an allow list in Macie if you don’t want Macie to report those as sensitive data (for example, corporate addresses or sample data that is used for testing).

The following are links to additional resources that you will be able to use to expand your knowledge of Amazon Macie capabilities and features:

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Amazon Macie re:Post.

Want more AWS Security news? Follow us on Twitter.

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

2022-11-30 Purna Sanyal

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-1-offerup-modernized-devops-platform-with-amazon-eks-and-flagger-to-accelerate-time-to-market/

In this two part series, we discuss the challenges faced by OfferUp, a Digital Native customer, to meet business growth and time-to-market. Their journey involved modernizing their existing DevOps platform, from the traditional monolith virtual machine (VM) based architecture to modern containerized architecture and running cloud-native applications for secured progressive delivery to accelerate time to market. This series will provide strategies, architecture patterns, and technical steps you can adopt to become more agile and innovative like OfferUp has.

OfferUp engineers were using the homegrown DevOps platform to build and release new services on the marketplace platform. In this first post, we discuss the key challenges encountered by OfferUp engineers with the existing DevOps platform, as well as how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger, automating production releases with progressive delivery techniques for faster time-to-market with new products and services. Amazon EKS is a managed container service to run and scale Kubernetes applications in the cloud or on-premises.

Previous DevOps architecture

OfferUp is a leading online and mobile customer to customer (C2C) marketplace where users can both buy and sell goods on the platform. Users can browse and purchase products from a broad range of categories, including furniture, clothing, sports equipment, toys, and many more. As a mobile-first company, OfferUp puts a great deal of emphasis on in-person communication between buyers and sellers.

OfferUp built a home grown, self-managed DevOps platform. This platform used a set of manual processes and third-party applications that allows both developers and operations engineers to build and deploy code to a production environment. The DevOps pipeline included topic areas such as source code control, continuous integration/continuous delivery (CI/CD), microservices, as well as development and test Methodologies. The following diagram depicts the previous architecture of OfferUp’s DevOps platform, which was self-managed on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1: Previous DevOps architecture of OfferUp

OfferUp used GitHub for code repositories. Once the source code was committed in the code repository, Jenkins pulled the source code from code repositories on a scheduled or on-demand basis and built Amazon Machine Images (AMI). The built image was deployed in production by a custom built deployment tool, Vanaheim, which supports one-box canary deployment and full roll-out deployment strategies. The DevOps engineers used to manually create a deployment job in the Vanaheim portal and then manually monitor the test success rate and service metrics to detect any impact from the deployment. Once the success rate was reached, a full production roll out was performed from the Vanaheim portal.

Key challenges with previous DevOps pipeline

In 2020, OfferUp experienced significant transaction volume growth on its Marketplace platform with the increase of its user base. With OfferUp’s acquisition of LetGo in 2020, there was a need to build a scalable DevOps platform to support future integration and organic growth. The previous DevOps platform, designed and deployed over seven years ago, had reached the limits of its scalability, and could no longer keep up with the platform’s growth. The previous architecture was expensive to run and had a complex infrastructure that made it difficult to upgrade and add new features.

The following key factors drove the push for modernization:

Manual verification was required to check if the code was correctly deployed in one of the servers in production, and if the deployment was right in one server, then it was rolled out to other production servers. Full Rollout to production wasn’t automated due to frequent failures requiring manual rollbacks.
The previous platform required a longer deployment time (1–2 hours) due to the authoritative batch process, which sometimes caused delays in releasing and testing of new features.
The self-managed nature of the Jenkins and Vanaheim clusters was consuming far too much engineering time. Most of the institutional knowledge of this legacy platform was lost over the years and it didn’t align with OfferUp’s philosophy of small DevOps engineering teams. Innovation had stalled partly due to the difficulty of simultaneously upgrading the DevOps platform and releasing new features.

DevOps platform automation with Flagger and Gloo Ingress Controller on Amazon EKS

A key requirement for the next-generation system was that the new architecture would reduce the operational burden on engineering teams, deployment lifecycle, and total cost of ownership. OfferUp evaluated multiple managed container orchestration platforms for its DevOps Platform. It finally selected Amazon EKS for high availability, reducing the average time to deploy a change to the stack from hours to just a few minutes and reducing the complexity in managing and upgrading the Kubernetes cluster. On the Amazon EKS platform, OfferUp uses Flagger, a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements several deployment strategies (Canary releases, A/B testing, and Blue/Green mirroring) using the Gloo Edge ingress controller for traffic routing. Datadog is used as an observability service for monitoring the health of the deployments and effectively managing the canary to progressive delivery. For release analysis, Flagger runs a query on Datadog logs and uses Slack for alerting and notifications. The cloud native technology components of the architecture are described as follows:

Kubernetes and Amazon EKS – Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Kubernetes is a graduate project in the CNCF. Amazon EKS is a fully-managed, certified Kubernetes conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on AWS. Amazon EKS integrates with core AWS services, such as Amazon CloudWatch, Auto Scaling Groups, and AWS Identity and Access Management (IAM) to provide a seamless experience for monitoring, scaling, and load balancing your containerized applications.

Helm – Helm manage Kubernetes applications. Helm Charts define, install, and upgrade even the most complex Kubernetes application. Charts are easy to create, version, share, and publish. If Kubernetes were an operating system, then Helm would be the package manager. Helm is a graduate project in the CNCF and is maintained by the Helm community.

Flagger – Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as HTTP requests success rate, requests average duration, and pods health. Based on the set thresholds, a canary is either promoted or aborted and its analysis is pushed to a Slack channel. Flagger became a CNCF project – part of the Flux family of GitOps tools.

Gloo Edge – Gloo Edge is a feature-rich, Kubernetes-native ingress controller. Gloo Edge is exceptional in its function-level routing; its support for legacy apps, microservices, and serverless; its discovery capabilities; and its tight integration with leading open-source projects. Gloo Edge is uniquely designed to support hybrid applications, in which multiple technologies, architectures, protocols, and clouds can coexist.

Observability platform – Datadog’s integrations with Kubernetes, Docker, and AWS will let you track the full range of Amazon EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications with features like auto discovery to track services across containers, sophisticated graphing, and alerting options.

Modernized DevOps architecture

In the new architecture, OfferUp uses Github as a version control tool and Github actions as their CI/CD tool. On every Pull request, tests are run, artifacts are built and stored in the JFrog Artifactory, and docker Images are stored in the Amazon Elastic Container Registry (Amazon ECR). Separate deployment pipelines are triggered based on the environment (dev, staging, and production) of choice. Flagger detects any changes in the version of the application and gradually shifts production traffic to the canary. It measures the requests success rate and average response duration metrics from Datadog to decide full rollout in production. For an application deployment, a canary promotion can be defined using Flagger’s custom resource. Flagger rolls back the deployment when the success rate falls below the defined desired success rate metrics.

Figure 2: Modernized DevOps architecture of OfferUp

With the modernized DevOps platform, OfferUp moved from monolithic to microservice architecture where front-end applications and GraphQL runs on the Amazon EKS cluster. The production cluster runs 110 services and 650+ pods on 60 nodes. The cluster scales up to 100 nodes with Amazon Auto Scaling group based on the traffic pattern. On the networking front, the cluster has a private endpoint and uses both VPC CNI plugin, and the CoreDNS add-on. There are four Amazon EKS clusters, one each for the production, test, utility, and the staging environments. OfferUp has a plan to explore Karpenter open-source autoscaling project, and it will move new applications to the Amazon EKS cluster, allowing the total node counts to scale up to 200.

Benefits of modernized architecture

The new architecture helped OfferUp make automated decisions to deploy new releases and improve the time to market while reducing unplanned production downtime

Faster deployments and Quicker rollbacks – The new architecture reduces the Service Deployment time from one hour down to five minutes, and automates rollback time to five minutes from the manual rollback time of one hour.
Automate deployment of new releases – The lack of canary deployment processes in the previous architecture required OfferUp engineers to manually intervene to validate the deployment status, which led to administrative overhead and production outages. The canary deployments take care of the traffic shifting by automatically measuring the requests’ success rate and latency metrics from Datadog and subsequently release the service to production. Deployments are automatically rolled back when the success rate falls below the defined success rate metric thresholds.
Simplified Configuration – Configuration has been simplified drastically and integrated within the CI/CD pipeline in the new architecture, thereby reducing configuration complexity, eliminating manual processes, and saving Developers time.
More time to Focus on Innovation – With fully automated progressive delivery, the developers no longer need to spend time testing and releasing source code in production. Similarly, migrating from a Self-managed DevOps platform to the Managed Amazon EKS services lowered the DevOps platform’s infrastructure management burden on the engineering team. This helps developers spend more time focusing on building and testing new features and innovations.
Cost reduction – Moving from self-managed Amazon EC2-based architecture to the Amazon EKS cluster reduced the cost of operations through shared nodes and improved pod density. The previous architecture was using 200 nodes of Amazon EC2 instances. The same workload was moved to a 50 nodes Amazon EKS cluster. Furthermore, custom applications (Vanaheim and Jenkins) were retired, further reducing the costs.

Conclusion

In this post, you see how OfferUp embarked on the journey to modernize its DevOps platform to support its growth and developers’ velocity. The key factors that drove the modernization decisions were the ability to scale the platform to support the automated testing of features in production, the faster release of new features, cost reduction, and to facilitate future innovation. The modernized DevOps platform on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design opens up a lot of headroom for growth.

We encourage you to look into modernizing your existing CI/CD pipeline on Amazon EKS with the Flagger progressive delivery mechanism. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities, such as new Kubernetes version deployments.

In the next part of the series, you’ll discover how to implement Flagger and Gloo Edge Ingress Controller on Amazon EKS to automate the release process for applications running on Kubernetes.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

About the authors:

Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation

2022-11-30 Srividya Parthasarathy

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/

Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Amazon Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another.

Some customers share their data with 50–100 data warehouses in different accounts and do a lot of cross-sharing, making it difficult to track who is accessing what data. They have to navigate to an individual account’s Amazon Redshift console to retrieve the access information. Also, many customers have their data lake on Amazon Simple Storage Service (Amazon S3), which is shared within and across various business units. As the organization grows and democratizes the data, administrators want the ability to manage the datashare centrally for governance and auditing, and to enforce fine-grained access control.

Working backward from customer ask, we are announcing the preview of the following new feature: Amazon Redshift data sharing integration with AWS Lake Formation, which enables Amazon Redshift customers to centrally manage access to their Amazon Redshift datashares using Lake Formation.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, with Lake Formation support for Amazon Redshift data sharing, it opens up new design patterns, and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles.

Customers are using the data mesh approach, which provides a mechanism to share data across business units. Customers are also using a modern data architecture to share data from data lake stores and Amazon Redshift purpose-built data stores across business units. Lake Formation provides the ability to enforce data governance within and across business units, which enables secure data access and sharing, easy data discovery, and centralized audit for data access.

United Airlines is in the business of connecting people and uniting the world.

“As a data-driven enterprise, United is trying to create a unified data and analytics experience for our analytics community that will innovate and build modern data-driven applications. We believe we can achieve this by building a purpose-built data mesh architecture using a variety of AWS services like Athena, Aurora, Amazon Redshift, and Lake Formation to simplify management and governance around granular data access and collaboration.”

-Ashok Srinivas, Director of ML Engineering and Sarang Bapat, Director of Data Engineering.

In this post, we show how to centrally manage access and permissions for Amazon Redshift data sharing with Lake Formation.

Solution overview

In this solution, we demonstrate how integration of Amazon Redshift data sharing with Lake Formation for data governance can help you build data domains, and how you can use the data mesh approach to bring data domains together to enable data sharing and federation across business units. The following diagram illustrates our solution architecture.

solution architecture

The data mesh is a decentralized, domain-oriented architecture that emphasizes separating data producers from data consumers via a centralized, federated Data Catalog. Typically, the producers and consumers run within their own account. The details of these data mesh characteristics are as follows:

Data producers – Data producers own their data products and are responsible for building their data, maintaining its accuracy, and keeping their data product up to date. They determine what datasets can be published for consumption and share their datasets by registering them with the centralized data catalog in a central governance account. You might have a producer steward or administrator persona for managing the data products with the central governance steward or administrators team.
Central governance account – Lake Formation enables fine-grained access management on the shared dataset. The centralized Data Catalog offers consumers the ability to quickly find shared datasets, allows administrators to centrally manage access permissions on shared datasets, and provides security teams the ability to audit and track data product usage across business units.
Data consumers – The data consumer obtains access to shared resources from the central governance account. These resources are available inside the consumer’s AWS Glue Data Catalog, allowing fine-grained access on the database and table that can be managed by the consumer’s data stewards and administrators.

The following steps provide an overview of how Amazon Redshift data sharing can be governed and managed by Lake Formation in the central governance pattern of a data mesh architecture:

In the producer account, data objects are created and maintained in the Amazon Redshift producer cluster. A data warehouse admin creates the Amazon Redshift datashare and adds datasets (tables, views) to the share.
The data warehouse admin grants and authorizes access on the datashare to the central governance account’s Data Catalog.
In the central governance account, the data lake admin accepts the datashare and creates the AWS Glue database that points to the Amazon Redshift datashare so that Lake Formation can manage it.
The data lake admin shares the AWS Glue database and tables to the consumer account using Lake Formation cross-account sharing.
In the consumer account, the data lake admin accepts the resource share invitation via AWS Resource Access Manager (AWS RAM) and can view the database listed in the account.
The data lake admin defines the fine-grained access control and grants permissions on databases and tables to IAM users (for this post, consumer1 and consumer2) in the account.
In the Amazon Redshift cluster, the data warehouse admin creates an Amazon Redshift database that points to the Glue database and authorizes usage on the created Amazon Redshift database to the IAM users.
The data analyst as an IAM user can now use their preferred tools like the Amazon Redshift query editor to access the dataset based on the Lake Formation fine-grained permissions.

We use the following account setup for our example in this post:

Producer account – 123456789012
Central account – 112233445566
Consumer account – 665544332211

Prerequisites

Create the Amazon Redshift data share and add datasets

In the data producer account, create an Amazon Redshift cluster using the RA3 node type with encryption enabled. Complete the following steps:

On the Amazon Redshift console, create a cluster subnet group.

For more information, refer to Managing cluster subnet groups using the console.

Choose Create cluster.
For Cluster identifier, provide the cluster name of your choice.
For Preview track, choose preview_2022.

For Node type, choose one of the RA3 node types.

This feature is only supported on the RA3 node type.

For Number of nodes, enter the number of nodes that you need for your cluster.
Under Database configurations, choose the admin user name and admin user password.
Under Cluster permissions, you can select the IAM role and set it as the default.

For more information about the default IAM role, refer to Creating an IAM role as default for Amazon Redshift.

cluster permissions

Turn on the Use defaults option next to Additional configurations to modify the default settings.
Under Network and security, specify the following:
1. For Virtual private cloud (VPC), choose the VPC you would like to deploy the cluster in.
2. For VPC security groups, either leave as default or add the security groups of your choice.
3. For Cluster subnet group, choose the cluster subnet group you created.

additional configurations

Under Database configuration, in the Encryption section, select Use AWS Key Management Service (AWS KMS) or Use a hardware security module (HSM).

Encryption is disabled by default.

For Choose an AWS KMS key, you can either choose an existing AWS Key Management Service (AWS KMS) key, or choose Create an AWS KMS key to create a new key.

For more information, refer to Creating keys.

database configurations

Choose Create cluster.
For this post, create tables and load data into the producer Amazon Redshift cluster using the following script.

Authorize the datashare

Install or update the latest AWS Command Line Interface (AWS CLI) version to run the AWS CLI to authorize the datashare. For instructions, refer to Installing or updating the latest version of the AWS CLI.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps in the central governance account to update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

Sign in to the Lake Formation console as admin.
In the navigation pane, under Data catalog, choose Settings.
Deselect Use only IAM access control for new databases.
Deselect Use only IAM access control for new tables in new databases.
Choose Version 2 for Cross account version settings.
Choose Save.

data catalog settings

Set up an IAM user as a data lake administrator

If you’re using an existing data lake administrator user or role add the following managed policies, if not attached and skip the below setup steps:

AWSGlueServiceRole
AmazonRedshiftFullAccess

Otherwise, to set up an IAM user as a data lake administrator, complete the following steps:

On the IAM console, choose Users in the navigation pane.
Select the IAM user who you want to designate as the data lake administrator.
Choose Add an inline policy on the Permissions tab.
Replace <AccountID> with your own account ID and add the following policy:

{
    "Version": "2012-10-17",
    "Statement": [ {
        "Condition": {"StringEquals": {
            "iam:AWSServiceName":"lakeformation.amazonaws.com"}},
            "Action":"iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Effect": "Allow"},
            {"Action": ["iam:PutRolePolicy"],
            "Resource": "arn:aws:iam::<AccountID>:role/aws-service role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess",
            "Effect": "Allow"
     },{
                  "Effect": "Allow",
                  "Action": [
                    "ram:AcceptResourceShareInvitation",
                    "ram:RejectResourceShareInvitation",
                    "ec2:DescribeAvailabilityZones",
                    "ram:EnableSharingWithAwsOrganization"
                  ],
                  "Resource": "*"
                }]
}

Provide a policy name.
Review and save your settings.
Choose Add permissions, and choose Attach existing policies directly.
Add the following policies:
1. AWSLakeFormationCrossAccountManager
2. AWSGlueConsoleFullAccess
3. AWSGlueServiceRole
4. AWSLakeFormationDataAdmin
5. AWSCloudShellFullAccess
6. AmazonRedshiftFullAccess
Choose Next: Review and add permissions.

Data consumer account setup

In the consumer account, follow the steps mentioned previously in the central governance account to set up Lake Formation and a data lake administrator.

In the data consumer account, create an Amazon Redshift cluster using the RA3 node type with encryption (refer to the steps demonstrated to create an Amazon Redshift cluster in the producer account).
Choose Launch stack to deploy an AWS CloudFormation template to create two IAM users with policies.

The stack creates the following users under the data analyst persona:

consumer1
consumer2

After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
Capture the ConsoleIAMLoginURL and LFUsersCredentials values.

createiamusers

Choose the LFUsersCredentials value to navigate to the AWS Secrets Manager console.
In the Secret value section, choose Retrieve secret value.

secret value

Capture the secret value for the password.

Both consumer1 and consumer2 need to use this same password to log in to the AWS Management Console.

secret value

Configure an Amazon Redshift datashare using Lake Formation

Producer account

Create a datashare using the console

Complete the following steps to create an Amazon Redshift datashare in the data producer account and share it with Lake Formation in the central account:

On the Amazon Redshift console, choose the cluster to create the datashare.
On the cluster details page, navigate to the Datashares tab.
Under Datashares created in my namespace, choose Connect to database.

connect to database

Choose Create datashare.

create datashare

For Datashare type, choose Datashare.
For Datashare name, enter the name (for this post, demotahoeds).
For Database name, choose the database from where to add datashare objects (for this post, dev).
For Publicly accessible, choose Turn off (or choose Turn on to share the datashare with clusters that are publicly accessible).

datashare information

Under DataShare objects, choose Add to add the schema to the datashare (in this post, the public schema).
Under Tables and views, choose Add to add the tables and views to the datashare (for this post, we add the table customer and view customer_view).

datashare objects

Under Data consumers, choose Publish to AWS Data Catalog.
For Publish to the following accounts, choose Other AWS accounts.
Provide the AWS account ID of the consumer account. For this post, we provide the AWS account ID of the Lake Formation central governance account.
To share within the same account, choose Local account.
Choose Create datashare.

data consumers

After the datashare is created, you can verify by going back to the Datashares tab and entering the datashare name in the search bar under Datashares created in my namespace.
Choose the datashare name to view its details.
Under Data consumers, you will see the consumer status of the consumer data catalog account as Pending Authorization.

data consumers

Choose the checkbox against the consumer data catalog which will enable the Authorize option.

authorize

Click Authorize to authorize the datashare access to the consumer account data catalog, consumer status will change to Authorized.

authorized

Create a datashare using a SQL command

Complete the following steps to create a datashare in data producer account 1 and share it with Lake Formation in the central account:

On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
Choose (right-click) the cluster name and choose Edit connection or Create Connection.
For Authentication, choose Temporary credentials.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

For Database, enter a database name (for this post, dev).
For Database user, enter the user authorized to access the database (for this post, awsuser).
Choose Save to connect to the database.

Connecting to an Amazon Redshift database

Run the following SQL commands to create the datashare and add the data objects to be shared:

create datashare demotahoeds;
ALTER DATASHARE demotahoeds ADD SCHEMA PUBLIC;
ALTER DATASHARE demotahoeds ADD TABLE customer;
ALTER DATASHARE demotahoeds ADD TABLE customer_view;

Run the following SQL command to share the producer datashare to the central governance account:

GRANT USAGE ON DATASHARE demotahoeds TO ACCOUNT '<central-aws-account-id>' via DATA CATALOG

Run the following SQL command

You can verify the datashare created and objects shared by running the following SQL command:

DESC DATASHARE demotahoeds

DESC DATASHARE demotahoeds

Run the following command using the AWS CLI to authorize the datashare to the central data catalog so that Lake Formation can manage them:

aws redshift authorize-data-share \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-identifier DataCatalog/<central-aws-account-id>

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/XXXXXXXXXXXX",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

You can verify the datashare status on the console by following the steps outlined in the previous section.

Central catalog account

The data lake admin accepts and registers the datashare with Lake Formation in the central governance account and creates a database for the same. Complete the following steps:

Sign in to the console as the data lake administrator IAM user or role.
If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
Under Data catalog in the navigation pane, choose Data sharing and view the Amazon Redshift datashare invitations on the Configuration tab.
Select the datashare and choose Review Invitation.

AWS Lake Formation data sharing

A window pops up with the details of the invitation.

Choose Accept to register the Amazon Redshift datashare to the AWS Glue Data Catalog.

accept reject invitation

Provide a name for the AWS Glue database and choose Skip to Review and create.

Skip to Review and create

Review the content and choose Create database.

create database

After the AWS Glue database is created on the Amazon Redshift datashare, you can view them under Shared Databases.

Shared Databases.

You can also use the AWS CLI to register the datashare and create the database. Use the following commands:

Describe the Amazon Redshift datashare that is shared with the central account:

aws redshift describe-data-shares

Accept and associate the Amazon Redshift datashare to Data Catalog:

aws redshift associate-data-share-consumer \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-arn arn:aws:glue:us-east-1:<central-aws-account-id>:catalog

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:123456789012:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:123456789012:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [
        {
            "ConsumerIdentifier": "arn:aws:glue:us-east-1:112233445566:catalog",
            "Status": "ACTIVE",
            "ConsumerRegion": "us-east-1",
            "CreatedDate": "2022-11-09T23:25:22.378000+00:00",
            "StatusChangeDate": "2022-11-09T23:25:22.378000+00:00"
        }
    ]
}

aws lakeformation register-resource \
--resource-arn arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds

Create the AWS Glue database that points to the accepted Amazon Redshift datashare:

aws glue create-database --region <central-catalog-region> --cli-input-json '{
    "CatalogId": "<central-aws-account-id>",
    "DatabaseInput": {
        "Name": "demotahoedb",
        "FederatedDatabase": {
            "Identifier": "arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds",
            "ConnectionName": "aws:redshift"
        }
    }
}'

Now the data lake administrator of the central governance account can view and share access on both the database and tables to the data consumer account using the Lake Formation cross-account sharing feature.

Grant datashare access to the data consumer

To grant the data consumer account permissions on the shared AWS Glue database, complete the following steps:

On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
Choose Grant.
Under Principals, select External accounts.
Provide the data consumer account ID (for this post, 665544332211).
Under LF_Tags or catalog resources, select Named data catalog resources.
For Databases, choose the database demotahoedb.
Select Describe for both Database permissions and Grantable permissions.
Choose Grant to apply the permissions.

grant data permissions

To grant the data consumer account permissions on tables, complete the following steps:

On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
Choose Grant.
Under Principals, select External accounts.
Provide the consumer account (for this post, we use 665544332211).
Under LF-Tags or catalog resources, select Named data catalog resources.
For Databases, choose the database demotahoedb.
For Tables, choose All tables.
Select Describe and Select for both Table permissions and Grantable permissions.
Choose Grant to apply the changes.

grant the data consumer account permissions on tables

Consumer account

The consumer admin will receive the shared resources from the central governance account and delegate access to other users in the consumer account as shown in the following table.

IAM User	Object Access	Object Type	Access Level
`consumer1`	`public.customer`	Table	All
`consumer2`	`public.customer_view`	View	specific columns: `c_customer_id`, `c_birth_country`, `cd_gender`, `cd_marital_status`, `cd_education_status`

In the data consumer account, follow these steps to accept the resources shared with the account:

Sign in to the console as the data lake administrator IAM user or role.
If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
Sign in to the AWS RAM console.
In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations. You will receive 2 invitations.

Choose the pending invitations and accept the resource share.

Choose the pending invitation and accept the resource share

On the Lake formation console, under Data catalog in the navigation pane, choose Databases to view the cross-account shared database.

choose Databases to view the cross-account shared database

Grant access to the data analyst and IAM users using Lake Formation

Now the data lake admin in the data consumer account can delegate permissions on the shared database and tables to users in the consumer account.

Grant database permissions to consumer1 and consumer2

To grant the IAM users consumer1 and consumer2 database permissions, follow these steps:

On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
Select the database demotahoedb and on the Actions menu, choose Grant.

choose Grant database

Under Principals, select IAM users and roles.
Choose the IAM users consumer1 and consumer2.
Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
Select Describe for Database permissions.
Choose Grant to apply the permissions.

Choose Grant to apply the permissions

Grant table permissions to consumer1

To grant the IAM user consumer1 permissions on table public.customer, follow these steps:

Under Data catalog in the navigation pane, choose Databases.
Select the database demotahoedb and on the Actions menu, choose Grant.
Under Principals, select IAM users and roles.
Choose IAM user consumer1.
Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
For Tables, choose public.customer.
Select Describe and Select for Table permissions.
Choose Grant to apply the permissions.

Grant table permissions to consumer1

Grant column permissions to consumer2

To grant the IAM user consumer2 permissions on non-sensitive columns in public.customer_view, follow these steps:

Under Data catalog in the navigation pane, choose Databases.
Select the database demotahoedb and on the Actions menu, choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM user consumer2.
Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
For Tables, choose public.customer_view.

Grant column permissions to consumer2

Select Select for Table permissions.
Under Data Permissions, select Column-based access.
Select Include columns and choose the non-sensitive columns (c_customer_id, c_birth_country, cd_gender, cd_marital_status, and cd_education_status).
Choose Grant to apply the permissions.

table permissions

Consume the datashare from the data consumer account in the Amazon Redshift cluster

In the Amazon Redshift consumer data warehouse, log in as the admin user using Query Editor V2 and complete the following steps:

Create the Amazon Redshift database from the shared catalog database using the following SQL command:

CREATE DATABASE demotahoedb FROM ARN 'arn:aws:glue:<producer-region>:<producer-aws-account-id>:database/demotahoedb' WITH DATA CATALOG SCHEMA demotahoedb ;

Run the following SQL commands to create and grant usage on the Amazon Redshift database to the IAM users consumer1 and consumer2:

CREATE USER IAM:consumer1 password disable;
CREATE USER IAM:consumer2  password disable;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer1;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer2;

In order to use a federated identity to enforce Lake Formation permissions, follow the next steps to configure Query Editor v2.

Choose the settings icon in the bottom left corner of the Query Editor v2, then choose Account settings.

identity to enforce Lake Formation permissions

Under Connection settings, select Authenticate with IAM credentials.
Choose Save.

Authenticate with IAM credentials

Query the shared datasets as a consumer user

To validate that the IAM user consumer1 has datashare access from Amazon Redshift, perform the following steps:

Sign in to the console as IAM user consumer1.
On the Amazon Redshift console, choose Query Editor V2 in the navigation pane.
To connect to the consumer cluster, choose the consumer cluster in the tree-view pane.
When prompted, for Authentication, select Temporary credentials using your IAM identity.
For Database, enter the database name (for this post, dev).
The user name will be mapped to your current IAM identity (for this post, consumer1).
Choose Save.

edit connection for redshift

Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:

select current_user;

To find the federated databases created on the consumer account, run the following SQL command:

SHOW DATABASES FROM DATA CATALOG [ACCOUNT '<id1>', '<id2>'] [LIKE 'expression'];

federated databases created on the consumer account

To validate permissions for consumer1, run the following SQL command:

select * from demotahoedb.public.customer limit 10;

As shown in the following screenshot, consumer1 is able to successfully access the datashare customer object.

Now let’s validate that consumer2 doesn’t have access to the datashare tables “public.customer” on the same consumer cluster.

Log out of the console and sign in as IAM user consumer2.
Follow the same steps to connect to the database using the query editor.
Once connected, run the same query:

select * from demotahoedb.public.customer limit 10;

The user consumer2 should get a permission denied error, as in the following screenshot.

should get a permission denied error

Let’s validate the column-level access permissions of consumer2 on public.customer_view view.

Connect to Query Editor v2 as consumer2 and run the following SQL command:

select c_customer_id,c_birth_country,cd_gender,cd_marital_status from demotahoedb.public.customer_view limit 10;

In the following screenshot, you can see consumer2 is only able to access columns as granted by Lake Formation.

access columns as granted by Lake Formation

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. Using Amazon Redshift data sharing with Lake Formation for data governance helps build the data mesh architecture, enabling data sharing and federation across business units with fine-grained access control.

Special thanks to everyone who contributed to launch Amazon Redshift data sharing with AWS Lake Formation:

Debu Panda, Michael Chess, Vlad Ponomarenko, Ting Yan, Erol Murtezaoglu, Sharda Khubchandani, Rui Bi

References

About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Ranjan Burman is a Analytics Specialist Solutions Architect, with AWS.

Vikram Sahadevan is a Senior Resident Architect on the AWS Data Lab team. He enjoys efforts that focus around providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks with joint engineering engagements between customers and AWS technical resources that accelerate data, analytics, artificial intelligence, and machine learning initiatives.

Steve Mitchell is a Senior Solution Architect with a passion for analytics and data mesh. He enjoys working closely with customers as they transition to a modern data architecture.

Log analytics the easy way with Amazon OpenSearch Serverless

2022-11-30 Prashant Agrawal

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/log-analytics-the-easy-way-with-amazon-opensearch-serverless/

We recently announced the preview release of Amazon OpenSearch Serverless, a new serverless option for Amazon OpenSearch Service, which makes it easy for you to run large-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads.

OpenSearch Serverless supports two primary use cases:

Log analytics that focuses on analyzing large volumes of semi-structured, machine-generated time series data for operational, security, and user behavior insights
Full-text search that powers customer applications in their internal networks (content management systems, legal documents) and internet-facing applications such as ecommerce website catalog search and content search

This post focuses on building a simple log analytics pipeline with OpenSearch Serverless.

Solution overview

In the following sections, we walk through the steps to create and access a collection in OpenSearch Serverless, and demonstrate how to configure two different data ingestion pipelines to index data into the collection.

Create a collection

To get started with OpenSearch Serverless, you first create a collection. A collection in OpenSearch Serverless is a logical grouping of one or more indexes that represent an analytics workload.

The following graphic gives a quick navigation for creating a collection. Alternatively, refer to this blog post to learn more about how to create and configure a collection in OpenSearch Serverless.

Access the collection

You can use the AWS Identity and Access Management (IAM) credentials with a secret key and access key ID for your IAM users and roles to access your collection programmatically. Alternatively, you can set up SAML authentication for accessing the OpenSearch Dashboards. Note that SAML authentication is only available to access OpenSearch Dashboards; you require IAM credentials to perform any operations using the AWS Command Line Interface (AWS CLI), API, and OpenSearch clients for indexing and searching data. In this post, we use IAM credentials to access the collections.

Create a data ingestion pipeline

OpenSearch Serverless supports the same ingestion pipelines as the open-source OpenSearch and managed clusters. These clients include applications like Logstash and Amazon Kinesis Data Firehose, and language clients like Java Script, Python, Go, Java, and more. For more details on all the ingestion pipelines and supported clients, refer to ingesting data into OpenSearch Serverless collections.

Using Logstash

The open-source version of Logstash (Logstash OSS) provides a convenient way to use the bulk API to upload data into your collections. OpenSearch Serverless supports the logstash-output-opensearch output plugin, which supports IAM credentials for data access control. In this post, we show how to use the file input plugin to send data from your command line console to an OpenSearch Serverless collection. Complete the following steps:

Download the logstash-oss-with-opensearch-output-plugin file (this example uses the distro for macos-x64; for other distros, refer to the artifacts):
```
wget https://artifacts.opensearch.org/logstash/logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz
```

Extract the downloaded tarball:

tar -zxvf logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz
cd logstash-8.4.0/

Update the logstash-output-opensearch plugin to the latest version:
```
./bin/logstash-plugin update logstash-output-opensearch
```
The OpenSearch output plugin for OpenSearch Serverless uses IAM credentials to authenticate. In this example, we show how to use the file input plugin to read data from a file and ingest into an OpenSearch Serverless collection.

Create a log file with the following sample data and name it sample.log:

{"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:25.855Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}
{"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.155Z","traffic":"heavy","weather_category":"Cloudy","weather":"Heavy Fog"}
{"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.255Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}
{"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.556Z","traffic":"heavy","weather_category":"Cloudy","weather":"Heavy Fog"}
{"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.756Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}

Create a new file and add the following content, and save the file as logstash-output-opensearch.conf after providing the information about your file path, host, Region, access key, and secret access key:

input {
   file {
     path => "<path/to/your/sample.log>"
     start_position => "beginning"
   }
}
output {
    opensearch {
        ecs_compatibility => disabled
        index => "logstash-sample"
        hosts => "<HOST>:443"
        auth_type => {
            type => 'aws_iam'
            aws_access_key_id => '<AWS_ACCESS_KEY_ID>'
            aws_secret_access_key => '<AWS_SECRET_ACCESS_KEY>'
            region => '<REGION>'
            service_name => 'aoss'
            }
        legacy_template => false
        default_server_major_version => 2
    }
}

Use the following command to start Logstash with the config file created in the previous step. This creates an index called logstash-sample and ingests the document added under the sample.log file:
```
./bin/logstash -f <path/to/your/config/file>
```

Search using OpenSearch Dashboards by running the following query:

GET logstash-sample/_search
{
  "query": {
    "match_all": {}
  },
  "track_total_hits" : true
}

In this step, you used a file input plugin from Logstash to send data to OpenSearch Serverless. You can replace the input plugin with any other plugin supported by Logstash, such as Amazon Simple Storage Service (Amazon S3), stdin, tcp, or others, to send data to the OpenSearch Serverless collection.

Using a Python client

OpenSearch provides high-level clients for several popular programming languages, which you can use to integrate with your application. With OpenSearch Serverless, you can continue to use your existing OpenSearch client to load and query your data in collections.

In this section, we show how to use the opensearch-py client for Python to establish a secure connection with your OpenSearch Serverless collection, create an index, send sample logs, and analyze those log data using OpenSearch Dashboards. In this example, we use a sample event generated from fleets carrying goods and packages. This data contains pertinent fields such as source, destination, weather, speed, and traffic. The following is a sample record:

"_source" : {
    "deviceId" : 2823605996,
    "fleetRegNo" : "IRV82MBYQ1",
    "carrier" : "AOS Van Lines",
    "temperature" : 14,
    "tripId" : 6741375582,
    "originODC" : "ODC Las Vegas",
    "originCountry" : "United States",
    "originCity" : "Las Vegas",
    "destinationCity" : "San Jose",
    "@timestamp" : "2022-11-17T17:11:25.855Z",
    "traffic" : "heavy",
    "weather" : "Cloudy"
    ...
    ...
}

To set up the Python client for OpenSearch, you must have the following prerequisites:

Python3 installed on your local machine or the server from where you are running this code
Package Installer for Python (PIP) installed
The AWS CLI configured; we use it to store the secret key and access key for credentials

Complete the following steps to set up the Python client:

Add the OpenSearch Python client to your project and use Python’s virtual environment to set up the required packages:

mkdir python-sample
cd python-sample
python3 -m venv .env
source .env/bin/activate
.env/bin/python3 -m pip install opensearch-py
.env/bin/python3 -m pip install requests_aws4auth
.env/bin/python3 -m pip install boto3
.env/bin/python3 -m pip install geopy

Save your frequently used configuration settings and credentials in files that are maintained by the AWS CLI (see Quick configuration with aws configure) by using the following commands and providing your access key, secret key, and Region:
```
aws configure
```

The following sample code uses the opensearch-py client for Python to establish a secure connection to the specified OpenSearch Serverless collection and index a sample document to index time series. You must provide values for region and host. Note that you must use aoss as the service name for OpenSearch Service. Copy the code and save in a file as sample_python.py:

from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
import boto3

host = '<host>' # OpenSearch Serverless collection endpoint
region = '<region>' # e.g. us-west-2

service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service,
session_token=credentials.token)

# Create an OpenSearch client
client = OpenSearch(
    hosts = [{'host': host, 'port': 443}],
    http_auth = awsauth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)
# Specify index name
index_name = 'octank-iot-logs-2022-11-19'

# Prepare a document to index 
document = {
    "deviceId" : 2823605996,
    "fleetRegNo" : "IRV82MBYQ1",
    "carrier" : "AOS Van Lines",
    "temperature" : 14,
    "tripId" : 6741375582,
    "originODC" : "ODC Las Vegas",
    "originCountry" : "United States",
    "originCity" : "Las Vegas",
    "destinationCity" : "San Jose",
    "@timestamp" : "2022-11-19T17:11:25.855Z",
    "traffic" : "heavy",
    "weather" : "Cloudy"
}

# Index Documents
response = client.index(
    index = index_name,
    body = document
)

print('\n Document indexed with response:')
print(response)


# Search for the Documents
q = 'heavy'
query = {
    'size': 5,
        'query': {
        'multi_match': {
        'query': q,
        'fields': ['traffic']
        }
    }
}

response = client.search(
body = query,
index = index_name
)
print('\nSearch results:')
print(response)

Run the sample code:
```
python3 sample_python.py
```
On the OpenSearch Service console, select your collection.
On OpenSearch Dashboards, choose Dev Tools.

Run the following search query to retrieve documents:

GET octank-iot-logs-*/_search
{
  "query": {
    "match_all": {}
  }
}

After you have ingested the data, you can use OpenSearch Dashboards to visualize your data. In the following example, we analyze data visually to gain insights on various dimensions such as average fuel consumed by a specific fleet, traffic conditions, distance traveled, and average mileage by the fleet.

Conclusion

In this post, you created a log analytics pipeline using OpenSearch Serverless, a new serverless option for OpenSearch Service. With OpenSearch Serverless, you can focus on building your application without having to worry about provisioning, tuning, and scaling the underlying infrastructure. OpenSearch Serverless supports the same ingestion pipelines and high-level clients as the open-source OpenSearch project. You can easily get started using the familiar OpenSearch indexing and query APIs to load and search your data and use OpenSearch Dashboards to visualize that data.

Stay tuned for a series of posts focusing on the various options available for you to build effective log analytics and search applications. Get hands-on with OpenSearch Serverless by taking the Getting Started with Amazon OpenSearch Serverless workshop and build a similar log analytics pipeline that was discussed in this post.

About the authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Pavani Baddepudi is a senior product manager working in search services at AWS. Her interests include distributed systems, networking, and security.

Automate your Amazon QuickSight deployment with the new API-based account creation and deletion

2022-11-21 Srikanth Baheti

Post Syndicated from Srikanth Baheti original https://aws.amazon.com/blogs/big-data/automate-your-amazon-quicksight-deployment-with-the-new-api-based-account-creation-and-deletion/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data, create interactive dashboards, and share these with tens of thousands of users, either within the QuickSight interface, or embedded in software as a service (SaaS) applications or web portals.

We’re excited to announce the availability of QuickSight APIs that enable administrators and developers to programmatically create and delete accounts with QuickSight Enterprise and Enterprise + Q editions. This allows developers and administrators to automate the deployment and teardown of QuickSight accounts in their organization at scale.

Feature overview

To create and delete QuickSight accounts programmatically, you can use the following APIs:

You need to have the correct AWS Identity and Access Management (IAM) permissions to sign up for QuickSight. For more information, see IAM policy examples for Amazon QuickSight. If your IAM policy includes both the Subscribe and CreateAccountSubscription actions, make sure that both actions are set to Allow. If either action is set to Deny, the Deny action prevails and your API call fails.

Create a QuickSight account

Use the CreateAccountSubscription API to create a QuickSight account:

POST /account/AwsAccountId HTTP/1.1 
Content-type: application/json 
{ 
    "AccountName": "string", 
    "ActiveDirectoryName": "string", 
    "AdminGroup": [ "string" ], 
    "AuthenticationMethod": "string", 
    "AuthorGroup": [ "string" ], 
    "ContactNumber": "string", 
    "DirectoryId": "string", 
    "Edition": "string", 
    "EmailAddress": "string", 
    "FirstName": "string", 
    "LastName": "string", 
    "NotificationEmail": "string", 
    "ReaderGroup": [ "string" ], 
    "Realm": "string" 
}

You can choose to subscribe to Enterprise or Enterprise + Q edition of QuickSight. You can choose the user authentication method: IAM only, IAM and QuickSight, or Active Directory. If you choose Active Directory, you need to provide an Active Directory name and an admin group associated with your Active Directory. Provide the notification email address that you want QuickSight to send notifications to regarding your QuickSight account or QuickSight subscription.

The following code shows the response for CreateAccountSubscription:

HTTP/1.1 Status
Content-type: application/json
{
   "RequestId": "string",
   "SignupResponse": { 
      "accountName": "string",
      "directoryType": "string",
      "IAMUser": boolean,
      "userLoginName": "string"
   }
}

Describe a QuickSight account

Use the DescribeAccountSubscription API to receive a description of a QuickSight account’s subscription:

GET /account/AwsAccountId HTTP/1.1

A successful API call returns an AccountInfo object that includes an account’s name, subscription status, authentication type, edition, and notification email address:

HTTP/1.1 Status 
Content-type: application/json 
{ 
    "AccountInfo": { 
        "AccountName": "string", 
        "AccountSubscriptionStatus": "string", 
        "AuthenticationType": "string", 
        "Edition": "string", 
        "NotificationEmail": "string" 
    },
    "RequestId": "string" 
}

Update and delete a QuickSight account

To avoid accidental deletion, a termination protection flag has been introduced. When a new QuickSight account is provisioned through the CreateAccountSubscription API or user interface, the termination protection flag is set to True by default. You can use the existing DescribeAccountSettings API to inquire the current value of the termination protection flag:

GET /account/AwsAccountId HTTP/1.1

The following code shows the response for DescribeAccountSettings:

HTTP/1.1 Status 
Content-type: application/json
{ 
    "AccountSettings": { 
        "AccountName": "string", 
        "Edition": "string", 
        "DefaultNamespace": "string", 
        "PublicSharingEnabled": "boolean", 
        "NotificationEmail": "string",
        "TerminationProtectionEnabled": "boolean"
    },
    "RequestId": "string" 
}

Use the UpdateAccountSettings API to disable the termination protection flag before proceeding with deleting the account. Optionally, this API can be used to update notification email:

PUT /accounts/{AwsAccountId}/settings HTTP/1.1 
Content-type: application/json 
{ 
    "AwsAccountId": "string", 
    "DefaultNamespace": "string", //Currently only supports default namespace: default 
    "TerminationProtectionEnabled": "boolean",
     "NotificationEmail": "string"
}

After the termination protection has been set to False, use the DeleteAccountSubscription API to delete the QuickSight account:

DELETE /account/{AwsAccountId} HTTP/1.1 
Content-type: application/json

Sample use case

Let’s consider a fictional company, AnyCompany, which owns healthcare facilities across the country. The central IT team of AnyCompany is responsible for setting up and maintaining the IT infrastructure and services for all the facilities in each state. Because AnyCompany is scaling their business, they want to automate the onboarding and maintenance of the IT infrastructure as much as possible. QuickSight is one of the services used by AnyCompany sites, and central IT needs to be able to set up and tear down QuickSight accounts automatically.

The following code shows their sample CreateAccountSubscription API call for setting up a QuickSight Enterprise account:

aws quicksight create-account-subscription --edition ENTERPRISE 
--authentication-method IAM_AND_QUICKSIGHT --aws-account-id XXXXXXXXXXXX 
--account-name quicksight-enterprise-reporting --notification-email XXXXXXXXXX 
--region us-west-2

The following screenshot shows the response.

They use the DescribeAccountSubscription API to monitor the status of new QuickSight account’s subscription:

aws quicksight describe-account-subscription --aws-account-id XXXXXXXXXXX

The following screenshot shows the response.

In case of a site closing event, the following code turns off the account protection using the UpdateAccountSettings API call:

 aws quicksight update-account-settings --aws-account-id XXXXXXXXXXXX 
 --default-namespace default --no-termination-protection-enabled 
 --region us-west-2

The following screenshot shows the response.

The following code deletes the QuickSight account using the DeleteAccountSubscription API call:

aws quicksight delete-account-subscription --aws-account-id XXXXXXXXXXXX 
--region us-west-2

The following screenshot shows the response.

Conclusion

Provisioning and deleting QuickSight accounts enables IT teams to automate their deployments of QuickSight instances across their organization. You can scale and keep up with the rapidly growing business, and minimize human error such as choosing the wrong authentication method. For more information, refer to Amazon QuickSight and What’s New in the Amazon QuickSight User Guide.

Try out QuickSight account provisioning and deletion APIs to automate your QuickSight deployments, and share your feedback and questions in the comments.

About the authors

Mayank Agarwal is a product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. He focuses on account administration, governance and developer experience. He started his career as an embedded software engineer developing handheld devices. Prior to QuickSight he was leading engineering teams at Credence ID, developing custom mobile embedded device and web solutions using AWS services that make biometric enrollment and identification fast, intuitive, and cost-effective for Government sector, healthcare and transaction security applications.

How GoDaddy built a data mesh to decentralize data ownership using AWS Lake Formation

2022-11-21 Ankit Jhalaria

Post Syndicated from Ankit Jhalaria original https://aws.amazon.com/blogs/big-data/how-godaddy-built-a-data-mesh-to-decentralize-data-ownership-using-aws-lake-formation/

This is a guest post co-written with Ankit Jhalaria from GoDaddy.

GoDaddy is empowering everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their idea, build a professional website, attract customers, and manage their work.

GoDaddy is a data-driven company, and getting meaningful insights from data helps them drive business decisions to delight their customers. In 2018, GoDaddy began a large infrastructure revamp and partnered with AWS to innovate faster than ever before to meet the needs of its customer growth around the world. As part of this revamp, the GoDaddy Data Platform team wanted to set the company up for long-term success by creating a well-defined data strategy and setting goals to decentralize the ownership and processing of data.

In this post, we discuss how GoDaddy uses AWS Lake Formation to simplify security management and data governance at scale, and enable data as a service (DaaS) supporting organization-wide data accessibility with cross-account data sharing using a data mesh architecture.

The challenge

In the vast ocean of data, deriving useful insights is an art. Prior to the AWS partnership, GoDaddy had a shared Hadoop cluster on premises that various teams used to create and share datasets with other analysts for collaboration. As the teams grew, copies of data started to grow in the Hadoop Distributed File System (HDFS). Several teams started to build tooling to manage this challenge independently, duplicating efforts. Managing permissions on these data assets became harder. Making data discoverable across a growing number of data catalogs and systems is something that had started to become a big challenge. Although the cost of storage these days is relatively inexpensive, when there are several copies of the same data asset available, it makes it harder for analysts to efficiently and reliably use the data available to them. Business analysts need robust pipelines on key datasets that they rely upon to make business decisions.

Solution overview

In GoDaddy’s data mesh hub and spoke model, a central data catalog contains information about all the data products that exist in the company. In AWS terminology, this is the AWS Glue Data Catalog. The data platform team provides APIs, SDKs, and Airflow Operators as components that different teams use to interact with the catalog. Activities such as updating the metastore to reflect a new partition for a given data product, and occasionally running MSCK repair operations, are all handled in the central governance account, and Lake Formation is used to secure access to the Data Catalog.

The data platform team introduced a layer of data governance that ensures best practices for building data products are followed throughout the company. We provide the tooling to support data engineers and business analysts while leaving the domain experts to run their data pipelines. With this approach, we have well-curated data products that are intuitive and easy to understand for our business analysts.

A data product refers to an entity that powers insights for analytical purposes. In simple terms, this could refer to an actual dataset pointing to a location in Amazon Simple Storage Service (Amazon S3). Data producers are responsible for the processing of data and creating new snapshots or partitions depending on the business needs. In some cases, data is refreshed every 24 hours, and other cases, every hour. Data consumers come to the data mesh to consume data, and permissions are managed in the central governance account through Lake Formation. Lake Formation uses AWS Resource Access Manager (AWS RAM) to send resource shares to different consumer accounts to be able to access the data from the central governance account. We go into details about this functionality later in the post.

The following diagram illustrates the solution architecture.

Defining metadata with the central schema repository

Data is only useful if end-users can derive meaningful insights from it—otherwise, it’s just noise. As part of onboarding with the data platform, a data producer registers their schema with the data platform along with relevant metadata. This is reviewed by the data governance team that ensures best practices for creating datasets are followed. We have automated some of the most common data governance review items. This is also the place where producers define a contract about reliable data deliveries, often referred to as Service Level Objective (SLO). After a contract is in place, the data platform team’s background processes monitor and send out alerts when data producers fail to meet their contract or SLO.

When managing permissions with Lake Formation, you register the Amazon S3 location of different S3 buckets. Lake Formation uses AWS RAM to share the named resource.

When managing resources with AWS RAM, the central governance account creates AWS RAM shares. The data platform provides a custom AWS Service Catalog product to accept AWS RAM shares in consumer accounts.

Having consistent schemas with meaningful names and descriptions makes the discovery of datasets easy. Every data producer who is a domain expert is responsible for creating well-defined schemas that business users use to generate insights to make key business decisions. Data producers register their schemas along with additional metadata with the data lake repository. Metadata includes information about the team responsible for the dataset, such as their SLO contract, description, and contact information. This information gets checked into a Git repository where automation kicks in and validates the request to make sure it conforms to standards and best practices. We use AWS CloudFormation templates to provision resources. The following code is a sample of what the registration metadata looks like.

As part of the registration process, automation steps run in the background to take care of the following on behalf of the data producer:

Register the producer’s Amazon S3 location of the data with Lake Formation – This allows us to use Lake Formation for fine-grained access to control the table in the AWS Glue Data Catalog that refers to this location as well as to the underlying data.
Create the underlying AWS Glue database and table – Based on the schema specified by the data producer along with the metadata, we create the underlying AWS Glue database and table in the central governance account. As part of this, we also use table properties of AWS Glue to store additional metadata to use later for analysis.
Define the SLO contract – Any business-critical dataset needs to have a well-defined SLO contract. As part of dataset registration, the data producer defines a contract with a cron expression that gets used by the data platform to create an event rule in Amazon EventBridge. This rule triggers an AWS Lambda function to watch for deliveries of the data and triggers an alert to the data producer’s Slack channel if they breach the contract.

Consuming data from the data mesh catalog

When a data consumer belonging to a given line of business (LOB) identifies the data product that they’re interested in, they submit a request to the central governance team containing their AWS account ID that they use to query the data. The data platform provides a portal to discover datasets across the company. After the request is approved, automation runs to create an AWS RAM share with the consumer account covering the AWS Glue database and tables mapped to the data product registered in the AWS Glue Data Catalog of the central governance account.

The following screenshot shows an example of a resource share.

The consumer data lake admin needs to accept the AWS RAM share and create a resource link in Lake Formation to start querying the shared dataset within their account. We automated this process by building an AWS Service Catalog product that runs in the consumer’s account as a Lambda function that accepts shares on behalf of consumers.

When the resource linked datasets are available in the consumer account, the consumer data lake admin provides grants to IAM users and roles mapping to data consumers within the account. These consumers (application or user persona) can now query the datasets using AWS analytics services of their choice like Amazon Athena and Amazon EMR based on the access privileges granted by the consumer data lake admin.

Day-to-day operations and metrics

Managing permissions using Lake Formation is one part of the overall ecosystem. After permissions have been granted, data producers create new snapshots of the data at a certain cadence that can vary from every 15 minutes to a day. Data producers are integrated with the data platform APIs that informs the platform about any new refreshes of the data. The data platform automatically writes a 0-byte _SUCCESS file for every dataset that gets refreshed, and notifies the subscribed consumer account via an Amazon Simple Notification Service (Amazon SNS) topic in the central governance account. Consumers use this as a signal to trigger their data pipelines and processes to start processing newer version of the data utilizing an event-driven approach.

There are over 2,000 data products built on the GoDaddy data mesh on AWS. Every day, there are thousands of updates to the AWS Glue metastore in the central data governance account. There are hundreds of data producers generating data every hour in a wide array of S3 buckets, and thousands of data consumers consuming data across a wide array of tools, including Athena, Amazon EMR, and Tableau from different AWS accounts.

Business outcomes

With the move to AWS, GoDaddy’s Data Platform team laid the foundations to build a modern data platform that has increased our velocity of building data products and delighting our customers. The data platform has successfully transitioned from a monolithic platform to a model where ownership of data has been decentralized. We accelerated the data platform adoption to over 10 lines of business and over 300 teams globally, and are successfully managing multiple petabytes of data spread across hundreds of accounts to help our business derive insights faster.

Conclusion

GoDaddy’s hub and spoke data mesh architecture built using Lake Formation simplifies security management and data governance at scale, to deliver data as a service supporting company-wide data accessibility. Our data mesh manages multiple petabytes of data across hundreds of accounts, enabling decentralized ownership of well-defined datasets with automation in place, which helps the business discover data assets quicker and derive business insights faster.

This post illustrates the use of Lake Formation to build a data mesh architecture that enables a DaaS model for a modernized enterprise data platform. For more information, see Design a data mesh architecture using AWS Lake Formation and AWS Glue.

About the Authors

Ankit Jhalaria is the Director Of Engineering on the Data Platform at GoDaddy. He has over 10 years of experience working in big data technologies. Outside of work, Ankit loves hiking, playing board games, building IoT projects, and contributing to open-source projects.

Harsh Vardhan is an AWS Solutions Architect, specializing in Analytics. He has over 6 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Kyle Tedeschi is a Principal Solutions Architect at AWS. He enjoys helping customers innovate, transform, and become leaders in their respective domains. Outside of work, Kyle is an avid snowboarder, car enthusiast, and traveler.

Use Karpenter to speed up Amazon EMR on EKS autoscaling

2022-11-17 Changbin Gong

Post Syndicated from Changbin Gong original https://aws.amazon.com/blogs/big-data/use-karpenter-to-speed-up-amazon-emr-on-eks-autoscaling/

Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows organizations to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, the Spark jobs run on the Amazon EMR runtime for Apache Spark. This increases the performance of your Spark jobs so that they run faster and cost less than open source Apache Spark. Also, you can run Amazon EMR-based Apache Spark applications with other types of applications on the same EKS cluster to improve resource utilization and simplify infrastructure management.

Karpenter was introduced at AWS re:Invent 2021 to provide a dynamic, high performance, open-source cluster auto scaling solution for Kubernetes. It automatically provisions new nodes in response to unschedulable pods. It observes the aggregate resource requests of unscheduled pods and makes decisions to launch new nodes and terminate stop them to reduce scheduling latencies as well as infrastructure costs.

To configure Karpenter, you create provisioners that define how Karpenter manages the pods that are pending and expires nodes. Although most use cases are addressed with a single provisioner, multiple provisioners are useful in multi-tenant use cases such as isolating nodes for billing, using different node constraints (such as no GPUs for a team), or using different deprovisioning settings. Karpenter launches nodes with minimal compute resources to fit un-schedulable pods for efficient binpacking. It works in tandem with the Kubernetes scheduler to bind un-schedulable pods to the new nodes that are provisioned. The following diagram illustrates how it works.

This post shows how to integrate Karpenter into your EMR on EKS architecture to achieve faster and capacity-aware auto scaling capabilities to speed up your big data and machine learning (ML) workloads while reducing costs. We run the same workload using both Cluster Autoscaler and Karpenter, to see some of the improvements we discuss in the next section.

Improvements compared to Cluster Autoscaler

Like Karpenter, Kubernetes Cluster Autoscaler (CAS) is designed to add nodes when requests come in to run pods that can’t be met by current capacity. Cluster Autoscaler is part of the Kubernetes project, with implementations by major Kubernetes cloud providers. By taking a fresh look at provisioning, Karpenter offers the following improvements:

No node group management overhead – Because you have different resource requirements for different Spark workloads along with other workloads in your EKS cluster, you need to create separate node groups that can meet your requirements, like instance sizes, Availability Zones, and purchase options. This can quickly grow to tens and hundreds of node groups, which adds additional management overhead. Karpenter manages each instance directly, without the use of additional orchestration mechanisms like node groups, taking a group-less approach by calling the EC2 Fleet API directly to provision nodes. This allows Karpenter to use diverse instance types, Availability Zones, and purchase options by simply creating a single provisioner, as shown in the following figure.
Quick retries – If the Amazon Elastic Compute Cloud (Amazon EC2) capacity isn’t available, Karpenter can retry in milliseconds instead of minutes. This is can be a really useful if you’re using EC2 Spot Instances and you’re unable to get capacity to specific instance types.
Designed to handle full flexibility of the cloud – Karpenter has the ability to efficiently address the full range of instance types available through AWS. Cluster Autoscaler wasn’t originally built with the flexibility to handle hundreds of instance types, Availability Zones, and purchase options. We recommend being as flexible as you can be to enable Karpenter get the just-in-time capacity you need.
Improves the overall node utilization by binpacking – Karpenter batches pending pods and then binpacks them based on CPU, memory, and GPUs required, taking into account node overhead (for example, daemon set resources required). After the pods are binpacked on the most efficient instance type, Karpenter takes other instance types that are similar or larger than the most efficient packing, and passes the instance type options to an API called EC2 Fleet, following some of the best practices of instance diversification to improve the chances of getting the request capacity.

Best practices using Karpenter with EMR on EKS

For general best practices with Karpenter, refer to Karpenter Best Practices. The following are additional things to consider with EMR on EKS:

Avoid inter-AZ data transfer cost by either configuring the Karpenter provisioner to launch in a single Availability Zone or use node selector or affinity and anti-affinity to schedule the driver and the executors of the same job to a single Availability Zone. See the following code:
```
nodeSelector:
  topology.kubernetes.io/zone: us-east-1a
```
Cost optimize Spark workloads using EC2 Spot Instances for executors and On-Demand Instances for the driver by using the node selector with the label karpenter.sh/capacity-type in the pod templates. We recommend using pod templates to specify driver pods to run on On-Demand Instances and executor pods to run on Spot Instances. This allows you to consolidate provisioner specs because you don’t need two specs per job type. It also follows the best practice of using customization defined on workload types and to keep provisioner specs to support a broader number of use cases.
When using EC2 Spot Instances, maximize the instance diversification in the provisioner configuration to adhere to the best practices. To select suitable instance types, you can use the ec2-instance-selector, a CLI tool and go library that recommends instance types based on resource criteria like vCPUs and memory.

Solution overview

This post provides an example of how to set up both Cluster Autoscaler and Karpenter in an EKS cluster and compare the auto scaling improvements by running a sample EMR on EKS workload.

The following diagram illustrates the architecture of this solution.

We use the Transaction Processing Performance Council-Decision Support (TPC-DS), a decision support benchmark to sequentially run three Spark SQL queries (q70-v2.4, q82-v2.4, q64-v2.4) with a fixed number of 50 executors, against 17.7 billion records, approximately 924 GB compressed data in Parquet file format. For more details on TPC-DS, refer to the eks-spark-benchmark GitHub repo.

We submit the same job with different Spark driver and executor specs to mimic different jobs solely to observe the auto scaling behavior and binpacking. We recommend you right-size your Spark executors based on the workload characteristics for production workloads.

The following code is an example Spark configuration that results in pod spec requests of 4 vCPU and 15 GB:

--conf spark.executor.instances=50 --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.driver.memoryOverhead=5g --conf spark.executor.cores=4 --conf spark.executor.memory=10g  --conf spark.executor.memoryOverhead=5g

We use pod templates to schedule Spark drivers on On-Demand Instances and executors on EC2 Spot Instances (which can save up to 90% over On-Demand Instance prices). Spark’s inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions. See the following code:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
  - name: spark-kubernetes-executor


apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
  containers:
  - name: spark-kubernetes-driver

Prerequisites

We use an AWS Cloud9 IDE to run all the instructions throughout this post.

To create your IDE, run the following commands in AWS CloudShell. The default Region is us-east-1, but you can change it if needed.

# clone the repo
git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
cd karpenter-for-emr-on-eks
./setup/create-cloud9-ide.sh

Navigate to the AWS Cloud9 IDE using the URL from the output of the script.

Install tools on the AWS Cloud9 IDE

Install the following tools required on the AWS Cloud9 environment by the running a script:

Run the following instructions in your AWS Cloud9 environment and not CloudShell.

Clone the GitHub repository:

cd ~/environment
git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
cd ~/environment/karpenter-for-emr-on-eks

Set up the required environment variables. Feel free to adjust the following code according to your needs:

# Install envsubst (from GNU gettext utilities) and bash-completion
sudo yum -y install jq gettext bash-completion moreutils

# Setup env variables required
export EKSCLUSTER_NAME=aws-blog
export EKS_VERSION="1.23"
# get the link to the same version as EKS from here https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html
export KUBECTL_URL="https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.7/2022-06-29/bin/linux/amd64/kubectl"
export HELM_VERSION="v3.9.4"
export KARPENTER_VERSION="v0.18.1"
# get the most recent matching version of the Cluster Autoscaler from here https://github.com/kubernetes/autoscaler/releases
export CAS_VERSION="v1.23.1"

Install the AWS Cloud9 CLI tools:

cd ~/environment/karpenter-for-emr-on-eks
./setup/c9-install-tools.sh

Provision the infrastructure

We set up the following resources using the provision infrastructure script:

A new EKS cluster
EMR on EKS enabler
The required AWS Identity and Access Management (IAM) roles
An Amazon Simple Storage Service (Amazon S3) bucket to store results and pod templates
Karpenter set up to scale in Availability Zone {AWS_REGION}a
Optionally, Cluster Autoscaler set up to scale multiple node groups in Availability Zone {AWS_REGION}b
Prometheus to ship metrics to Amazon Managed Service for Prometheus

Create the EMR on EKS and Karpenter infrastructure:

cd ~/environment/karpenter-for-emr-on-eks
./setup/create-eks-emr-infra.sh

Validate the setup:

# Should have results that are running
kubectl get nodes
kubectl get pods -n karpenter
kubectl get po -n kube-system -l app.kubernetes.io/instance=cluster-autoscaler
kubectl get po -n prometheus

Understanding Karpenter configurations

Because the sample workload has driver and executor specs that are of different sizes, we have identified the instances from c5, c5a, c5d, c5ad, c6a, m4, m5, m5a, m5d, m5ad, and m6a families of sizes 2xlarge, 4xlarge, 8xlarge, and 9xlarge for our workload using the amazon-ec2-instance-selector CLI. With CAS, we need to create a total of 12 node groups, as shown in eksctl-config.yaml, but can define the same constraints in Karpenter with a single provisioner, as shown in the following code:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  provider:
    launchTemplate: {EKSCLUSTER_NAME}-karpenter-launchtemplate
    subnetSelector:
      karpenter.sh/discovery: {EKSCLUSTER_NAME}
  labels:
    app: kspark
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand","spot"]
    - key: "kubernetes.io/arch" 
      operator: In
      values: ["amd64"]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [c5, c5a, c5d, c5ad, m5, c6a]
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: [2xlarge, 4xlarge, 8xlarge, 9xlarge]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["{AWS_REGION}a"]

  limits:
    resources:
      cpu: "2000"

  ttlSecondsAfterEmpty: 30

We have set up both auto scalers to scale down nodes that are empty for 30 seconds using ttlSecondsAfterEmpty in Karpenter and --scale-down-unneeded-time in CAS.

Karpenter by design will try to achieve the most efficient packing of the pods on a node based on CPU, memory, and GPUs required.

Run a sample workload

To run a sample workload, complete the following steps:

Lets review the AWS Command Line Interface (AWS CLI) command to submit a sample job:

aws emr-containers start-job-run \
  --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
  --name karpenter-benchmark-${CORES}vcpu-${MEMORY}gb  \
  --execution-role-arn $EMR_ROLE_ARN \
  --release-label emr-6.5.0-latest \
  --job-driver '{
  "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar",
      "entryPointArguments":["s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://'$S3BUCKET'/EMRONEKS_TPCDS-TEST-3T-RESULT-KA","/opt/tpcds-kit/tools","parquet","3000","1","false","q70-v2.4,q82-v2.4,q64-v2.4","true"],
      "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --conf spark.executor.instances=50 --conf spark.driver.cores='$CORES' --conf spark.driver.memory='$EXEC_MEMORY'g --conf spark.executor.cores='$CORES' --conf spark.executor.memory='$EXEC_MEMORY'g"}}' \
  --configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.kubernetes.node.selector.app": "kspark",
          "spark.kubernetes.node.selector.topology.kubernetes.io/zone": "'${AWS_REGION}'a",

          "spark.kubernetes.container.image":  "'$ECR_URL'/eks-spark-benchmark:emr6.5",
          "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-driver-pod-template.yaml",
          "spark.kubernetes.executor.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-executor-pod-template.yaml",
          "spark.network.timeout": "2000s",
          "spark.executor.heartbeatInterval": "300s",
          "spark.kubernetes.executor.limit.cores": "'$CORES'",
          "spark.executor.memoryOverhead": "'$MEMORY_OVERHEAD'G",
          "spark.driver.memoryOverhead": "'$MEMORY_OVERHEAD'G",
          "spark.kubernetes.executor.podNamePrefix": "karpenter-'$CORES'vcpu-'$MEMORY'gb",
          "spark.executor.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",
          "spark.driver.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",

          "spark.ui.prometheus.enabled":"true",
          "spark.executor.processTreeMetrics.enabled":"true",
          "spark.kubernetes.driver.annotation.prometheus.io/scrape":"true",
          "spark.kubernetes.driver.annotation.prometheus.io/path":"/metrics/executors/prometheus/",
          "spark.kubernetes.driver.annotation.prometheus.io/port":"4040",
          "spark.kubernetes.driver.service.annotation.prometheus.io/scrape":"true",
          "spark.kubernetes.driver.service.annotation.prometheus.io/path":"/metrics/driver/prometheus/",
          "spark.kubernetes.driver.service.annotation.prometheus.io/port":"4040",
          "spark.metrics.conf.*.sink.prometheusServlet.class":"org.apache.spark.metrics.sink.PrometheusServlet",
          "spark.metrics.conf.*.sink.prometheusServlet.path":"/metrics/driver/prometheus/",
          "spark.metrics.conf.master.sink.prometheusServlet.path":"/metrics/master/prometheus/",
          "spark.metrics.conf.applications.sink.prometheusServlet.path":"/metrics/applications/prometheus/"
         }}
    ]}'

Submit four jobs with different driver and executor vCPUs and memory sizes on Karpenter:

# the arguments are vcpus and memory
export EMRCLUSTER_NAME=${EKSCLUSTER_NAME}-emr
./sample-workloads/emr6.5-tpcds-karpenter.sh 4 7
./sample-workloads/emr6.5-tpcds-karpenter.sh 8 15
./sample-workloads/emr6.5-tpcds-karpenter.sh 4 15
./sample-workloads/emr6.5-tpcds-karpenter.sh 8 31

To monitor the pods’s autoscaling status in real time, open a new terminal in Cloud9 IDE and run the following command (nothing is returned at the start):
```
watch -n1 "kubectl get pod -n emr-karpenter"
```
Observe the EC2 instance and node auto scaling status in a second terminal tab by running the following command (by design, Karpenter schedules in Availability Zone a):
```
watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,karpenter.sh/capacity-type,topology.kubernetes.io/zone,app -l app=kspark"
```

Compare with Cluster Autoscaler (Optional)

We have set up Cluster Autoscaler during the infrastructure setup step with the following configuration:

Launch EC2 nodes in Availability Zone b
Contain 12 node groups (6 each for On-Demand and Spot)
Scale down unneeded nodes after 30 seconds with --scale-down-unneeded-time
Use the least-waste expander on CAS, which can select the node group that will have the least idle CPU for binpacking efficiency

Submit four jobs with different driver and executor vCPUs and memory sizes on CAS:

# the arguments are vcpus and memory
./sample-workloads/emr6.5-tpcds-ca.sh 4 7
./sample-workloads/emr6.5-tpcds-ca.sh 8 15
./sample-workloads/emr6.5-tpcds-ca.sh 4 15
./sample-workloads/emr6.5-tpcds-ca.sh 8 31

To monitor the pods’s autoscaling status in real time, open a new terminal in Cloud9 IDE and run the following command (nothing is returned at the start):
```
watch -n1 "kubectl get pod -n emr-ca"
```
Observe the EC2 instance and node auto scaling status in a second terminal tab by running the following command (by design, CAS schedules in Availability Zone b):
```
watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone,app -l app=caspark"
```

Observations

The time from pod creation to being scheduled on average is less with Karpenter than CAS, as shown in the following figure; you can see a noticeable difference when you run large scale workloads.

As shown in the following figures, as the jobs were completed, Karpenter was able to scale down the nodes that aren’t needed within seconds. In contrast, CAS takes minutes, because it sends a signal to the node groups, adding additional latency. This in turn helps reduce overall costs by reducing the number of seconds unneeded EC2 instances are running.

Clean up

To clean up your environment, delete all the resources created in reverse order by running the cleanup script:

export EKSCLUSTER_NAME=aws-blog
cd ~/environment/karpenter-for-emr-on-eks
./setup/cleanup.sh

Conclusion

In this post, we showed you how to use Karpenter to simplify EKS node provisioning, and speed up auto scaling of EMR on EKS workloads. We encourage you to try Karpenter and provide any feedback by creating a GitHub issue.

Overview of solution

Prerequisites

Create an Amazon OpenSearch Service domain

Create a Kinesis Data Firehose delivery stream

Create a VPC flow logs subscription

Explore VPC flow logs in Amazon OpenSearch Service Dashboards

Clean up

Conclusion

About the Author

Generic consumer sizing guidance

Size your producer cluster

Size and setup initial consumer cluster

Setup Amazon Redshift data sharing

Test consumer only workload on initial consumer cluster

Test consumer only workload on different consumer cluster configurations

Test producer only workload on different producer cluster configurations

Re-evaluate after a full workload run over time

Automated sizing solutions

Solution walkthrough

Size your production cluster

Identify the workload to be isolated

Setup Simple Replay

Baseline workload

Setup initial producer and consumer test clusters

Replay workload on initial producer and consumer

Replay consumer workload on different configurations

Replay producer workload on different configurations

Re-evaluate after a full workload run over time

Clean up

Applying Amazon Redshift and data sharing best practices

Summary

About the Authors

Solution overview

Prerequisites

Create an Amazon Redshift Serverless Workgroup and Namespace

Create the S3 bucket and folder

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

Connect to the BigQuery Source

Connect to the Amazon Redshift Target

Convert BigQuery schema to an Amazon Redshift

Analyze the assessment report and address the action items

Apply converted schema to target Amazon Redshift

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

Generating trust and key stores (optional)

Install and configure Data Extraction Agent

Starting Data Extraction Agent(s)

Register the Data Extraction Agent

Add virtual partitions for large tables (optional)

Create a local migration task

Start the Local Data Migration Task

View data in Amazon Redshift

Clean up up any AWS resources that you created for this exercise

Conclusion

About the authors

Introduction:

Prerequisites

Walkthrough

Cleanup

Conclusion

­­­­Amazon OpenSearch Service managed VPC endpoints

Shared search cluster for multiple development teams

Solution overview

Self-managed AWS PrivateLink architecture to connect different VPCs

OpenSearch Service managed VPC endpoints architecture (powered by AWS PrivateLink)

Conclusion

About the authors

Concurrency scaling overview

Enable Amazon Redshift concurrency scaling

Example use cases

Scenario 1: All queries triggered concurrently with concurrency scaling turned off

Scenario 2: All queries triggered concurrently with concurrency scaling cluster max limit set to 5 clusters

Scenario 3: All queries triggered concurrently with concurrency scaling cluster limit set to 10 clusters

Test results review

Monitor concurrency scaling

Configure usage limit

Summary

About the Authors

Overview of solution

Prerequisites

Set up and run the AWS Glue job

Amazon OpenSearch Service managed VPC endpoints